Building Reusable Claude Skills for Platform Engineering and Kubernetes Troubleshooting Workflows

When the same context goes into Claude six times before lunch

A platform engineer at a Series B payments startup got tired of one specific ritual. Every morning she opened Claude and pasted the same Kubernetes troubleshooting setup: stack traces, the kubectl flags her team leaned on, a rough sketch of the cluster topology. Then she asked the actual question. Same paragraph of scaffolding, six times a day, before any real work happened.

So she stopped. She pulled down a GitHub repo of community skill files, dropped in three custom ones, and wired them to her team's existing runbooks. Two weeks later, debugging a CrashLoopBackOff took one sentence instead of a paragraph of warm-up.

Here's what most teams get wrong with Claude skills. They treat them like clever prompts you paste at the top of a chat, a fancier system message. Skills are not that. They are reusable context the model pulls on demand: tool definitions, file paths, decision trees, pointers to runbooks. Treat them like internal libraries, not magic incantations. The mental shift is the whole game. Once you see a skill as a small, versioned, single-purpose module, the rest of the patterns fall out naturally.

1. One skill per workflow, not one per project

The tempting move is a single kubernetes-helper skill that covers deployments, debugging, RBAC, networking, and cost analysis. Feels efficient. It isn't. The model loads the whole thing for every question, and the relevant 5% gets buried under the other 95%. You pay tokens for context that does nothing, and the signal-to-noise ratio drops.

Split by workflow instead. Each one stays small and tightly described:

skills/
  k8s-crashloop-triage/
    SKILL.md          # < 200 lines
  k8s-pdb-review/
    SKILL.md
  k8s-rbac-audit/
    SKILL.md
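
The size budget in the tree above can be checked mechanically rather than by eye. This is a minimal sketch, assuming the layout shown (one directory per skill, each holding a SKILL.md) and treating the 200-line figure as the team's own threshold. The function name `oversized_skills` is hypothetical.

```python
from pathlib import Path

MAX_LINES = 200  # assumed budget, taken from the "< 200 lines" comment in the tree


def oversized_skills(root: str, max_lines: int = MAX_LINES) -> list[tuple[str, int]]:
    """Return (skill_name, line_count) for every SKILL.md at or over the budget."""
    offenders = []
    for skill_file in sorted(Path(root).glob("*/SKILL.md")):
        line_count = len(skill_file.read_text(encoding="utf-8").splitlines())
        if line_count >= max_lines:
            offenders.append((skill_file.parent.name, line_count))
    return offenders
```

Calling `oversized_skills("skills")` in CI and failing the build on a non-empty result is one way to keep a skill from quietly growing back into a monolith.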

The description matters as much as the body. That is what the loader reads to decide whether to pull the skill at all:

---
name: k8s-crashloop-triage
description: Use when a pod is in CrashLoopBackOff or restart-looping.
  Covers exit codes, OOMKilled checks, readiness/liveness probe misconfig,
  and image pull failures.
---

A vague description like "helps with Kubernetes" gets loaded for every question or none of them. A tight one gets loaded exactly when it should.
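
A description check like this can run in review before a skill merges. The sketch below is illustrative only: it uses a deliberately tiny frontmatter parser instead of a YAML library, and the rules (lead with "Use when", a minimum word count, a short list of vague words) are assumptions standing in for whatever your team agrees on.

```python
import re

VAGUE_WORDS = {"helps", "helper", "general", "various", "stuff"}  # hypothetical list


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse simple `key: value` frontmatter; indented lines continue the previous value."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    current = None
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if line[:1].isspace() and current:
            fields[current] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    return fields


def lint_description(description: str) -> list[str]:
    """Return a list of complaints; an empty list means the description passes."""
    problems = []
    words = re.findall(r"[a-z]+", description.lower())
    if not description.lower().startswith("use when"):
        problems.append("state the trigger first, e.g. 'Use when ...'")
    if len(words) < 8:
        problems.append("too short to name concrete symptoms")
    if VAGUE_WORDS & set(words):
        problems.append("contains vague wording")
    return problems
```

Run against the k8s-crashloop-triage frontmatter above, `lint_description` returns no complaints. Run against "helps with Kubernetes", it trips all three rules.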

2. Point at runbooks, do not duplicate them

The second failure is copy-pasting your runbook prose into the skill. Now you have two copies that drift apart. The runbook gets updated after an incident, the skill does not, and three weeks later the model is confidently citing a rollback procedure you abandoned.

Reference the source of truth instead. Keep the skill thin and let it point at files that already exist:

## Triage steps

1. Check exit code: `kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'`
2. Exit 137 is SIGKILL, most often OOMKilled. Confirm with `.lastState.terminated.reason`, then see runbooks/oom-remediation.md
3. Exit 1 with no logs usually means a bad command/args. See runbooks/entrypoint-debug.md
4. Probe failures: cross-reference runbooks/probe-tuning.md

When the runbook changes, the skill keeps working because it never held a stale copy. The skill is the index. The runbook is the content.
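
The index-and-content split only holds if the pointers stay valid. Here is a minimal sketch of a link check, assuming runbooks are referenced by relative paths of the form `runbooks/<name>.md` as in the triage steps above. The helper name `missing_runbooks` and the regex are assumptions, not part of any skill tooling.

```python
import re
from pathlib import Path

# Matches relative references such as runbooks/oom-remediation.md
RUNBOOK_REF = re.compile(r"\brunbooks/[\w./-]+\.md\b")


def missing_runbooks(skill_text: str, repo_root: str) -> list[str]:
    """Return referenced runbook paths that do not exist under repo_root."""
    referenced = sorted(set(RUNBOOK_REF.findall(skill_text)))
    return [ref for ref in referenced if not (Path(repo_root) / ref).is_file()]
```

If someone renames or deletes a runbook, this reports the dangling reference at review time, not during a page.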

3. Keep decision trees, not paragraphs

Skills are read by a model under token pressure, not by a human curling up with documentation. Dense prose is expensive and easy to skim past. Structure beats narrative. A short decision tree gives the model branches it can actually follow:

## Restart loop, where to look first

- Restarts climbing, no logs at all
  -> entrypoint or args wrong. Check command/args in the spec.
- Restarts climbing, logs cut off mid-startup
  -> liveness probe firing too early. Raise initialDelaySeconds.
- Exit 137
  -> likely OOMKilled; confirm the terminated reason. Check container_memory_working_set_bytes vs limits.
- ImagePullBackOff
  -> registry auth or tag typo. Describe the pod, read the events.

Every branch maps to a concrete next action. No reading comprehension tax. The model lands on the right path in one hop instead of inferring intent from a wall of text.
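
The "every branch maps to an action" rule is easy to verify automatically. This sketch assumes the two-line convention used in the tree above (a `- symptom` line followed by a `-> action` line). The parser is hypothetical and handles only that flat format.

```python
def parse_decision_tree(text: str) -> dict[str, str]:
    """Map each `- symptom` line to the `-> action` line that follows it."""
    branches: dict[str, str] = {}
    symptom = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("->"):
            if symptom is None:
                raise ValueError(f"action without a symptom: {line!r}")
            branches[symptom] = line[2:].strip()
            symptom = None
        elif line.startswith("- "):
            if symptom is not None:
                raise ValueError(f"symptom without an action: {symptom!r}")
            symptom = line[2:].strip()
        # Headings and blank lines are ignored.
    if symptom is not None:
        raise ValueError(f"symptom without an action: {symptom!r}")
    return branches
```

A symptom with no action raises an error, so a half-finished branch cannot reach the main branch unnoticed.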

4. Version skills like code, because they are

Skills live in git. Review them in pull requests. Test them against real incidents from last quarter and see if the model lands on the answer your on-call engineer actually used. When a skill sends the model down the wrong path during a real page, that is a bug report, not a vibe. Fix the decision tree, commit, move on.

This is also how skills stay honest across a team. One person writes k8s-rbac-audit, someone else hits an edge case it missed, they open a PR with the fix. The skill gets better the same way any shared library does, through use and correction, not through one heroic author keeping it all in their head.
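
Replaying past incidents can be as small as a table of cases and one loop. In this sketch, `ask` is a hypothetical stand-in for however your team sends a question to the model with skills enabled; it is not a real client API. The pass criterion is a deliberately crude assumption: the answer must mention the runbook that resolved the incident.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class IncidentCase:
    incident_id: str       # hypothetical ticket identifier
    question: str          # the one-sentence prompt the on-call engineer would type
    expected_runbook: str  # the runbook that actually resolved the incident


def replay(cases: list[IncidentCase], ask: Callable[[str], str]) -> list[str]:
    """Return the incident ids where the answer never mentions the expected runbook."""
    return [c.incident_id for c in cases if c.expected_runbook not in ask(c.question)]
```

A non-empty result is the bug report: each id names an incident where the skill sent the model somewhere other than where the on-call engineer ended up.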

Why this matters in production

The cost is not the typing. It is the 3 a.m. version of you. During an incident, working memory is the scarce resource, and re-establishing context for the model every single time burns it. A tight, well-scoped skill means the on-call engineer types one sentence and gets a relevant answer grounded in the team's own runbooks, not a generic Kubernetes lecture pulled from the model's training data.

Broad, bloated skills do the opposite. They quietly degrade answer quality, because the relevant detail is competing with everything else you crammed in. You will not notice on the easy questions. You will notice on the gnarly 2 a.m. one, which is exactly when you can least afford to.

Done right

Skills are internal libraries for your model. Scope them to one workflow, point them at runbooks instead of duplicating them, write decision trees instead of essays, and review them in git like any other code. Do that and the morning ritual of pasting the same context six times disappears, replaced by a one-sentence question that lands.

Want hands-on labs for building and testing Claude skills against real Kubernetes failure modes? See tekanaid.com/courses