4  Skills, Prompts, and Agent Specialization

Ad hoc prompting is how you start with a coding agent; named, scoped specialists with persistent instructions are how you scale.

4.1 From Ad Hoc Prompting to Operating Leverage

Skills, prompts, and subagents configure agent behavior; the harness from Chapter 2 is the layer that loads, scopes, and enforces those configurations. When this chapter mentions context budget, it uses the term in the sense Chapter 5 defines formally, alongside its two companions in that vocabulary: context window and context engineering.

The ceiling on a single general-purpose agent is real and arrives faster than you expect. Chapter 3 established the baseline: shared rules files give every agent the same team conventions. This chapter builds the next layer on top of that baseline: skills package repeatable procedures, and specialists apply narrower roles, tools, and output contracts to work that the general rules file should not carry forever. Three constraints converge: the codebase overwhelms the context window, a jack-of-all-trades agent has no specialization, and there is no way to coordinate parallel work [61].

The escape hatch is not a smarter model — it is structure. You externalize the standing rules into convention files, package recurring procedures as named skills any compatible agent can install, and define role-specialized subagents your orchestrator can dispatch by name. These specialists are still coding agents in the sense established in Chapter 1: model plus harness plus workflow surface. The difference is scope. A named specialist carries narrower instructions, narrower tools, and a clearer output contract than the general agent. Each layer covers a different failure mode of free-form prompting: convention drift, repeated context-setting, and tool misuse from over-broad mandates.

You stop being the conductor of a single instrument and become the orchestrator of an ensemble. That requires different work: clear specifications, work decomposition, output verification, and codified, reusable units of agent behavior. The leverage compounds because each unit can be tuned, reviewed, and shared without re-prompting from scratch every session. Three focused agents reliably out-produce one generalist running three times as long because parallelism, specialization, and isolation multiply rather than add [61], and a 16-agent team can sustain progress on a 100,000-line C compiler at a cost-benefit profile no human team can match [62].

Three types of artifact carry the load, and they are not interchangeable. Convention files are ambient — always-on standing orders that establish the baseline (type hints, library choices, banned patterns). Skills are triggered — packaged procedures the agent loads when a relevant task appears (take annotated screenshots, audit SEO, run a release). Subagents are delegated — named specialists with their own tool scopes and prompts that an orchestrator dispatches when work fits their mandate [63], [64]. Saved prompts and slash commands are the on-ramp into this stack, not a fourth peer layer: they are how a one-off prompt earns its way toward becoming a triggered skill or a delegated role once it has proven its shape. The three layers stack in dependency order: conventions are the baseline, and skills and subagents inherit from that baseline rather than replace it. Mixing the layers — putting screenshot procedures into the global rules file, or stuffing reviewer-role behavior into ambient context — is one of the most common ways teams turn a working setup into an unreliable one.

4.2 The Convention Baseline Every Specialist Inherits

The inheritance rule is the only thing this chapter needs from the rules layer: convention files define ambient project law; skills and roles stay task-local. Library choice, type policy, banned patterns, repo layout — those go in the convention file, where every skill and every subagent inherits them for free. What this procedure does and what this role decides — those go in the skill or the role. When the two collide, push the fact down into the convention file and shorten the role. See Chapter 3 for the rule-surface detail: which file each tool reads, character ceilings, loader mechanics, and the discipline for keeping the file small.

The cost of skipping this inheritance shows up immediately in a reviewer role. Without a convention file, the role has to restate project law every time:

# reviewer (without ambient conventions)
You are a Python code reviewer. The project targets Python 3.12. All
public functions must have type hints. We use httpx, not requests. We
use pytest, not unittest. We reject bare except clauses and direct
os.environ access outside config.py. Run `pyright` and `ruff check`
before reporting. Group findings as blocker / major / minor.
Review the diff and...

With those facts in the convention file, the same role collapses to:

# reviewer (with ambient conventions)
You are a Python code reviewer. Run the project's lint and type checks.
Group findings as blocker / major / minor. Review the diff and...

The reviewer is shorter, focused on the review mandate, and stays in sync automatically when a project rule changes [64]. Treat any drift between a role prompt and the convention file as a bug in the role, not the file.
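
With the inheritance working, the ambient file itself stays short. A sketch of the entries the reviewer stopped restating, built only from the facts in the long version above (your tool may read AGENTS.md, CLAUDE.md, or another filename; Chapter 3 has the mapping):

# AGENTS.md (excerpt)
- Target Python 3.12. All public functions carry type hints.
- HTTP client: httpx, never requests. Test framework: pytest, never unittest.
- Banned: bare except clauses; os.environ access outside config.py.
- Before reporting or committing, run `pyright` and `ruff check`.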

4.3 Reusable Prompts: The First Operating Layer

Before a workflow becomes a skill, it usually becomes a reusable prompt or command. The packaging ladder is: one-off prompt → saved prompt or slash command → skill → subagent. Each rung exists for a reason; promotion is not automatic.

A reusable prompt is the simplest artifact in this chapter, and the most underrated. Cody’s Prompt Library is the clearest reference implementation: prompts are stored, named, shared across teams, and parameterized with dynamic context mentions like @selection, @file, @repository, and @directory, so the same template adapts at runtime to whatever the developer has open [65]. The transferable lesson behind that surface — and the one Windsurf’s manual-mode rules and Claude Code’s slash commands also encode — is that a saved prompt is a stable interface only when three contracts are explicit: an argument contract (what the caller passes in), a context contract (what the agent should and should not load, including the workspace-aware mentions Cody exposes), and an output contract (the shape of the answer) [65], [43]. Leave any of the three implicit and you are back to free-form prompting.

A worked example. You find yourself typing some variant of “review this PR for auth regressions” three times in a week. Promote it to /review-auth-pr:

# /review-auth-pr
Arguments: $PR_NUMBER (required, integer)

Context contract:
  - Load only the diff for $PR_NUMBER and the auth-related files it touches
    (auth/, middleware/auth*, tests/auth/).
  - Do NOT load the full repository history or unrelated services.

Task:
  Check for regressions in: session fixation, token refresh, role/permission
  checks, and CSRF handling. Use the project's standing conventions for type
  safety and library choice (already in AGENTS.md).

Output contract:
  Findings grouped as Blocker / Major / Minor.
  Each finding: file:line, one-sentence problem, one-sentence suggested fix.
  No prose preamble. No summary at the end.

Saved prompts have their own maintenance discipline, separate from skills. Store them where the rest of the project lives — under .claude/commands/, .codex/prompts/, or whatever your tool’s slash-command directory is — and version them with the repo so changes show up in code review like any other artifact. Validate the three contracts before promoting anything: feed the command a known argument and confirm the agent loads only the files the context contract names, produces output in exactly the shape the output contract specifies, and refuses or fails cleanly when the argument is missing or malformed. A prompt that silently expands its context, drops the output structure, or drifts on its argument shape is not a stable interface — it is a free-form prompt with a slash in front of it. Tighten the contracts, re-run the validation, and only then consider it ready to share.
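
That validation can be a two-command smoke test. A sketch, assuming your tool can run a saved command non-interactively (Claude Code’s -p print mode is one example) and using a hypothetical PR number:

claude -p "/review-auth-pr 4242"
# expect: findings grouped Blocker / Major / Minor, no prose preamble

claude -p "/review-auth-pr"
# expect: a clean refusal naming the missing $PR_NUMBER, not an improvised review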

Keep this a command — not a skill, not a subagent — while three things stay true: the body fits on one screen, it does not need bundled examples or assets, and it can run in the orchestrator’s main context without polluting it. Promote it to a skill when the command starts carrying examples of good and bad findings, severity rubrics, project-specific test recipes, or tool setup that would bloat the command body [66]. Promote it to a subagent when the work needs a separate context window, a role-specific mandate, or a narrower tool scope than the orchestrator runs with.

4.4 Skills: Packaging Recurring Procedures

A skill is a markdown file with instructions for a coding agent — it describes how to perform a specific task and references the CLI tools, APIs, or other resources the agent needs to do it [63]. The point is that a general-purpose agent with the right skill becomes a specialist without code changes. Unlike a convention file, which always loads, skill content loads only when invoked, which makes long reference material essentially free in context cost — and skill edits take effect immediately within the current session, enabling rapid iteration [66]. A skill rides on top of the convention file rather than restating it: the standing rules tell the agent what the project’s law is; the skill tells it how to execute one procedure inside that law. A working skill has five things in it, and the absence of any one is a defect (a minimal skeleton follows the list):

  1. Trigger condition. A description sharp enough that the agent (or its router) knows when to invoke this skill instead of free-form reasoning. “Generates annotated screenshots of a deployed web app” is sharp; “helps with documentation” is not.
  2. Required tools. Explicit references to the CLIs, MCP servers, or APIs the skill expects. If the skill calls agent-browser, the skill says so.
  3. Bundled examples or assets. Sample inputs, sample outputs, fixture files, or screenshots that demonstrate the expected shape. Skills without examples regress to whatever the model guesses.
  4. Output contract. The shape of the result — file paths written, sections of a markdown report, return values — stated explicitly.
  5. Verification step. How the agent (or a downstream check) confirms the skill produced something usable: run a script, diff against a fixture, check for required headings.
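
Put together, a minimal skill file exhibiting all five elements might look like the sketch below. It follows Claude Code’s SKILL.md layout, where the description field carries the trigger condition; the skill name, script path, and example path are hypothetical:

---
name: release-notes
description: Drafts release notes from merged PRs since the last git tag. Invoke when asked to draft or update release notes.
---
## Required tools
The `gh` CLI, authenticated with read access to the repository.

## Procedure
1. List merged PRs since the last tag with `gh pr list --state merged`.
2. Group the changes under Added / Changed / Fixed.

## Output contract
Write the draft to CHANGELOG-draft.md with exactly those three headings.

## Example
A sample draft is bundled at examples/changelog-sample.md; match its shape.

## Verification
Run scripts/check_changelog.py; it exits nonzero if a required heading is missing.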

The concrete shape works like this: an app-screenshots skill tells the agent how to call agent-browser to navigate a site, take screenshots, auto-discover navigation, and output an annotated markdown guide. The same procedure file is portable across Claude Code, Codex, Cursor, Windsurf, and Gemini CLI because the format is plain markdown with explicit tool references [63], [67]. The nuance worth holding onto: what travels is the markdown procedure and its bundled assets, not native first-class skill support — installation paths, discovery rules, and invocation mechanics differ by harness. A unified installer like gh skill is what reconciles those differences in practice, dropping the same procedure into the directory each tool expects and resolving how it gets loaded [68], [67].

Path-specific activation via glob patterns and nested .claude/skills/ directories lets monorepos scope skills to the parts of the tree where they apply, so irrelevant skills do not consume context [66]. An SEO audit skill follows the same shape: a Puppeteer MCP server gives the agent a real browser, and the skill instructs it to crawl the site, generate a structured todo list, and report issues — turning a multi-hour manual audit into a few-minute autonomous run [69]. Role-playing prompts (“You are an SEO expert named John Wick”) combined with explicit output expectations and tool affordances produce focused, reliable specialist behavior [69].

Promote a workflow to a skill when it has repeated three times with consistent shape — when you find yourself prompting the same procedure across sessions, or copy-pasting the same context block at the start of every chat. Keep one-off explorations as prompts. Do not promote anything to a skill that is genuinely better as ambient context: facts that must always be true belong in the convention file, even if a skill the agent might invoke could carry them [64]. Models produce wrong output on new APIs that postdate their training cutoff; skills, MCP pointers, and code examples all close the gap, but the durable insight is that getting context in matters more than the medium [70].

Treat skills you install from outside your repo like dependencies, not text snippets. The gh skill command brings package-manager guarantees: content-addressed change detection via git tree SHAs, version pinning to tags or commits, immutable releases, and portable provenance metadata so a skill copied between projects retains its update lineage [68]. Skills are executable instructions that shape agent behavior — a silent update is a supply chain risk on par with a silently mutating npm package, and the agent-driven migration story makes this strategic, not just hygienic. Cloudflare published a “migrate to vinext” skill alongside its AI-rewritten Next.js fork, packaging customer acquisition as a single command an agent runs against your codebase [71]. Pin production-critical skills, audit changes deliberately, and treat any skill that mutates source code as a code change subject to the same review gates as any other dependency.
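
In practice the pinning discipline is one flag at install time. A sketch using the gh skill installer described above, with a hypothetical publisher, skill name, and tag (exact flag syntax may vary by installer version):

gh skill install acme/seo-audit --pin v1.4.0
# pinned: the skill changes only when you move the pin, in a reviewable diff

gh skill install acme/seo-audit
# floating: a silent upstream change reaches your agent on the next update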

4.5 Role-Specialized Subagents

A subagent is a named specialist — a security scanner, a research assistant, a test runner — defined by a description, a constrained tool set, and a behavioral prompt; the orchestrator dispatches it by slug when work fits the role [61], [72]. The mechanism that makes this real in Claude Code is the .claude/agents/ directory: each file defines an agent with fields like description, prompt, tools, disallowedTools, model, permissionMode, maxTurns, mcpServers, skills, memory, background, and effort. The orchestrator resolves the slug at delegation time, loads the definition, and runs the work in an isolated context window. The skills field is the crucial join between this chapter’s two core subjects: a documentation-writer subagent with skills: [screenshots, api-docs] inherits those reusable procedures without any extra prompt engineering — and inherits the convention-file baseline underneath both, so neither the role nor its skills have to restate which library to use or which lint command to run.
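
The join is easiest to see in a definition. A sketch of that documentation-writer, in the same file format as the definitions below; the skill names are the hypothetical ones above, and field support varies by harness version:

# .claude/agents/documentation-writer.md
---
name: documentation-writer
description: Writes and updates user-facing docs; produces annotated screenshots for pages with a UI.
tools: [Read, Grep, Glob, Write]
skills: [screenshots, api-docs]
model: claude-sonnet
---
You are a documentation writer. Use the screenshots skill for UI-facing
pages and the api-docs skill for endpoint references. Project conventions
for tone, structure, and tooling are ambient; do not restate them.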

The reason to bother is that general-purpose agents dilute focus. Given a broad mandate and full tool access, they pick wrong tools, produce inconsistent output, and resist tuning because every tweak regresses some other task type. Two minimal definitions make the contrast concrete:

# .claude/agents/research-assistant.md
---
name: research-assistant
description: Investigates architecture questions across docs and the web; cites sources but never edits code.
tools: [Read, Grep, Glob, WebFetch]
model: claude-sonnet
maxTurns: 20
---
You are a research assistant. Investigate the question by reading the
repository and fetching authoritative external sources. Cite each claim
by file path or URL. Return a concise written answer. Do not propose
or apply code changes.

# .claude/agents/test-runner.md
---
name: test-runner
description: Runs the project test suite, parses failures, and applies minimal fixes to make tests pass.
tools: [Bash, Read, Edit]
disallowedTools: [WebFetch]
model: claude-sonnet
permissionMode: acceptEdits
maxTurns: 15
---
You are a test runner. Execute `pytest -q`, identify the smallest
change that makes tests pass without altering intent, and apply it.
If a failure reveals a real bug rather than a stale test, stop and
report instead of papering over it.

The two roles are not differentiated by their prompts alone; they are differentiated by what they can do. The research-assistant cannot run shell, edit files, or fetch outside its read-only mandate; the test-runner cannot reach the web but can execute and patch. Tool restriction is what makes specialization mean something instead of being a polite request the agent might ignore. Notice what is not in either prompt: the test runner does not name a Python version, a lint tool, or a banned pattern. Those facts live in the convention file, where they belong; the role only describes the mandate that is task-local to this specialist.

Dispatch happens one of two ways. Automatic routing matches the user’s task against subagent description fields — if the description is sharp (“reviews Python code for type-safety violations and async-context bugs”), the orchestrator picks correctly. If the description is vague (“helps with code”), the orchestrator misses or misfires, and you wonder why your specialist never runs [73]. Rely on automatic dispatch in well-scoped pipelines where descriptions are tight; invoke explicitly when you are debugging routing or running a specific role on demand.

Specialization also has a context-isolation payoff: a subagent’s reading and reasoning consume its own context budget, not the orchestrator’s, so a research lookup or directory scan can return a small answer instead of bloating the main session [74], [75]. Chapter 5 owns the broader playbook for forking, compacting, and branching sessions; the rule for specialization here is just that narrowing the role is what makes the isolation pay off.

The tool-scope rule that survives into multi-agent coordination is short: shared mutable state wants read-only delegation; isolated state can tolerate edit-capable delegation. Three reviewer subagents with tools: [Read, Grep, Glob] examining the same diff cannot collide because none of them writes — this is the discipline Cognition follows when it restricts subagents to read-only investigation to avoid conflicting implicit decisions [76]. Chapter 15 owns the full coordination playbook (task lists, peer messaging, dependency graphs, branch and worktree strategies, merge sequencing); the role-level rule for this chapter is just the tool-scope half: name the role, bound the tools, and let the state model decide whether the role can write.

The other failure modes are predictable enough to plan around:

  • Vague descriptions cause automatic dispatch to miss. Audit by asking: if I read only the description, would I know exactly when to call this agent?
  • Drift from platform tool availability produces silent capability gaps when a tool is renamed or removed. Treat subagent definitions like code, version-controlled and reviewed.
  • Too-broad role boundaries produce overlap (two agents fight over the same task) or gaps (work falls back to the default agent). Draw the lines so each role owns a coherent slice.
  • Static when dynamic is needed. A security-reviewer factory that picks strict vs balanced mode based on the file’s risk classification beats a single fixed prompt. The Claude Code SDK supports dynamic configuration where runtime conditions shape both model choice and prompt strictness.

Choose filesystem definitions for CLI workflows where humans review and version the files; prefer programmatic SDK definitions when the role itself depends on runtime state, like a security level that changes per file or a model choice that depends on token budget remaining. The same harness pattern travels across providers: a memory-and-skills harness like OpenClaw lets a practitioner swap from Claude to Codex CLI as a configuration change rather than a total reset, because the role definitions and skill files are platform-agnostic [77], [78].
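
A sketch of the dynamic case using the Python Agent SDK, whose query options accept programmatic agent definitions; the field names follow the SDK’s AgentDefinition type, and the risk rule is a hypothetical stand-in for your own classifier:

from claude_agent_sdk import AgentDefinition, ClaudeAgentOptions, query


def security_reviewer(path: str) -> AgentDefinition:
    # Hypothetical risk rule: auth and payments code gets the strict mandate.
    strict = path.startswith(("auth/", "payments/"))
    return AgentDefinition(
        description="Reviews a diff for security regressions.",
        prompt=(
            "You are a strict security reviewer. Flag anything doubtful as a blocker."
            if strict
            else "You are a balanced security reviewer. Flag only concrete, exploitable issues."
        ),
        tools=["Read", "Grep", "Glob"],  # read-only: this role investigates, never edits
        model="opus" if strict else "sonnet",  # spend more on high-risk files
    )


async def review(path: str, diff_ref: str) -> None:
    # The role is constructed per file at runtime, not read from disk.
    options = ClaudeAgentOptions(agents={"security-reviewer": security_reviewer(path)})
    async for message in query(
        prompt=f"Use the security-reviewer agent on {diff_ref}", options=options
    ):
        print(message)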

One discipline travels with every reviewer-style role you define: the verification it provides is real only if its tools are real verifiers. A reviewer subagent that says “I reviewed the code and it looks fine” is not closing the loop. A reviewer that runs pytest, parses the output, reports failures, and fails the phase if the type checker disagrees with the implementer is. Specialization is not a substitute for verification; a reviewer role with no test runner, type checker, or linter wired into its tool set is theater [79].
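
The wiring is visible in the role definition itself. A sketch in the same format as the earlier definitions, with the verifiers as tools the role must run and edits disallowed so the reviewer stays a reviewer:

# .claude/agents/reviewer.md
---
name: reviewer
description: Reviews Python diffs; runs tests and the type checker, and fails the review if either disagrees with the implementation.
tools: [Read, Grep, Glob, Bash]
disallowedTools: [Edit, Write, WebFetch]
maxTurns: 20
---
You are a code reviewer. Run `pytest -q` and `pyright` on the touched
files before writing any finding, and quote failures verbatim with
file:line. If either check fails, fail the review; do not soften a
failing check into a minor finding.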

4.6 When Skills and Roles Compose Into Overlays

The composition principle is short: rules, skills, and roles stack without merging. Conventions set the ambient baseline, skills package procedures, subagents package delegated roles with their own tool scopes, and each layer inherits from the one below without absorbing it. A planner-implementer-reviewer trio is the smallest example — three subagent definitions that all read the same convention file, call the same project skills, and hand off through artifacts on disk:

specs/feature-x/
  plan.md              # written by planner; read by implementer
  implementation.md    # written by implementer; read by reviewer
  review.md            # written by reviewer; gates merge

What a real overlay standardizes, beyond just naming the roles, is the transition machinery: the phase sequence (planner runs before implementer; reviewer cannot merge until both prior artifacts exist), the artifact contracts that make each handoff inspectable on disk, and the gate conditions that pause or escalate the pipeline when an artifact fails its check [72], [62]. Artifact-based handoffs are also what makes an overlay portable across harnesses — the agents change platforms, the files do not [72]. Chapter 9 owns the fuller artifact-flow design and comparative evaluation of complete overlays (BMAD, SuperClaude, Spec Kit, OpenSpec). Chapter 15 owns orchestration portability and the multi-agent handoff machinery. This chapter stops at the primitives, because knowing how the primitives stack — baseline, skill, role, sequence, artifact, gate — is what lets you read those overlays critically rather than cargo-cult them.
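
The gate conditions themselves can be plain scripts that run before each dispatch. A minimal sketch in Python, assuming the specs/feature-x layout above; the file names and phase labels are the hypothetical ones from that tree:

from pathlib import Path
import sys

# Artifacts that must exist on disk before each phase may be dispatched.
PHASE_INPUTS = {
    "implementer": ["plan.md"],                    # planner runs first
    "reviewer": ["plan.md", "implementation.md"],  # both artifacts gate review
}


def gate(phase: str, spec_dir: str = "specs/feature-x") -> None:
    missing = [f for f in PHASE_INPUTS[phase] if not (Path(spec_dir) / f).exists()]
    if missing:
        # Pause and escalate instead of dispatching the next agent.
        sys.exit(f"gate failed for {phase}: missing {', '.join(missing)}")


if __name__ == "__main__":
    gate(sys.argv[1])  # e.g. `python gate.py reviewer` before the review phase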

4.7 Putting It Together: A Buildable Setup

The minimum viable specialization stack is three files and one directory, in dependency order. Start with a convention file (AGENTS.md, CLAUDE.md, CONVENTIONS.md, or your tool’s equivalent) at the repo root, kept under 150 instructions, listing your conventions, library preferences, and standing rules — every subsequent layer assumes this file is loaded [38]. Add a .claude/skills/ (or your tool’s equivalent) directory with two or three skills for the procedures you find yourself repeating: release notes, screenshot generation, SEO audit, whatever recurs in your work. Add a .claude/agents/ directory with two or three subagent definitions for the roles you actually delegate to: a reviewer with read-only tools, a researcher with read and web-fetch tools, an implementer with the broader edit and bash permissions [63]. Pin any third-party skills you install with gh skill install ... --pin so a silent update cannot change your agent’s behavior overnight [68].
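
On disk, the whole stack is small. A sketch using Claude Code’s directory conventions, with the artifacts this chapter has named (other tools use equivalent paths):

AGENTS.md                        # ambient conventions; everything below inherits it
.claude/
  commands/
    review-auth-pr.md            # saved prompt from Section 4.3
  skills/
    release-notes/SKILL.md       # triggered procedures
    app-screenshots/SKILL.md
  agents/
    reviewer.md                  # delegated roles with bounded tools
    researcher.md
    implementer.md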

Do this incrementally. The temptation when the structure clicks is to write 22 agents and 26 skills before lunch. Resist it. Add a role only when the general-purpose agent is repeatedly straying outside the scope you want for that task class. Promote a prompt to a skill only after the third repetition. Add a line to the convention file only when the absence of that line has caused the agent to do the wrong thing more than once. Each artifact should earn its place by removing a real, observed failure — not by anticipating one. The art is restraint; the goal of this whole chapter is leverage that compounds rather than ceremony that ossifies.

Wire verification into the stack from day one. Replit’s Agent 3 runs autonomously for 200+ minutes with built-in browser testing as a self-decided loop step, but the harness is what verifies against external signals — not the agent’s own judgment of correctness [80]. Spotify’s background coding agent calls an MCP-exposed verifier tool whose internal mechanics it does not understand; an LLM judge then compares the final diff against the original prompt to catch scope creep [79]. Translate that to your own setup: each subagent role needs a real verifier behind it, each phase transition in your overlay needs an artifact some external check can validate, and the runtime around the model — not the model’s self-assessment — is what carries the verification weight (Chapter 2 established this discipline at the harness level).

Judgment is what survives this stack [81]. Time is solved — agents will produce more code than you can review. Attention is solvable through composition like the patterns above. What remains binding is judgment about what should exist, how it should be structured, and whether the result is maintainable, secure, and reliable enough to ship. Skills, prompts, and specialization are how you encode and scale judgment that already exists in your team; they do not replace it. The teams getting the most out of this stack are not the ones with the most skills installed — they are the ones whose skills, conventions, and roles encode hard-won judgment in a form their agents will actually use. The single rule that closes the chapter is the one that runs through every section above: define the mandate, bound the tools, inherit the baseline, and wire a real verifier behind every role [82].

4.8 Takeaways

  • Keep convention files ambient, skills triggered, and subagents delegated — never mix them by putting procedure logic into the global rules file or embedding reviewer-role behavior into ambient context.
  • Before sharing a reusable prompt or slash command, make three contracts explicit: an argument contract (what the caller passes in), a context contract (what the agent should and should not load), and an output contract (the shape of the answer).
  • A working skill requires all five elements — trigger condition, required tools, bundled examples or assets, output contract, and verification step; treat the absence of any one as a defect, not an acceptable shortcut.
  • Make subagent specialization enforceable by bounding each role’s tools and disallowedTools to its mandate; if a role is supposed to investigate, keep it read-only instead of relying on prompt wording to restrain it.
  • Wire a real external verifier — a test runner, type checker, or linter — into every reviewer subagent’s tool set; a role that returns a prose verdict without invoking an actual tool is not verification, it is theater.
  • Add a subagent role only when the general-purpose agent repeatedly strays out of scope, promote a prompt to a skill only after the third repetition, and add a line to the convention file only when its absence has caused the agent to do the wrong thing more than once.