7  Specification-Driven Development

The agent that fails on your feature request doesn’t need a better model — it needs a better spec.

7.1 Specs, Not Prompts, Are the Unit of Work

The most productive agent users don’t type faster — they plan longer. The tool generates code in seconds, so the natural impulse is to start prompting immediately and iterate. But practitioners who’ve used agents intensively for months converge on the opposite approach: the specification, not the prompt, is the unit of work [113], [114], [115].

The METR study found experienced developers using AI tools were 19% slower than working without AI at all, while believing they were 20% faster [116], [117]. Read that result narrowly: it measured experienced developers, on familiar repositories, with current tools, in one operating regime. Treat it as an operating-regime warning, not an anti-agent verdict: slowdown appears when fast generation meets weak specs, oversized task slices, and thin verification. Practitioner reports supply the mechanism: when requests are underspecified, generation speed is quickly eaten by debugging loops, duplicated work, and corrections to plausible-looking code that does not quite fit the job [113], [114]. Front-load the spec or pay it back later.

That does not mean every task deserves the same ceremony. A one-off script, a risky production migration, and a multi-team API change all benefit from treating the spec as the unit of work, but the spec can range from a five-bullet acceptance checklist to a reviewed requirements/design/tasks chain. Depth should scale with ambiguity, blast radius, and how much unattended execution you plan to allow; Chapter 9 returns to that ceremony trade-off explicitly.

Why does planning pay such outsized returns? A specification is context — the most concentrated, highest-signal context you can provide. When you write a spec before prompting, you do two things at once: clarify your own thinking, which surfaces edge cases and design choices early, and assemble exactly the context the agent needs to succeed on the first pass. Developers who skip this step and try “huge swaths” end up with duplicated logic and inconsistent patterns [113]. Chapter 5 explains the broader context discipline behind that pattern.

Tests are the strongest form of spec because they make “done” mechanical. When you write a test before asking the agent to implement, you’ve made the contract executable: write code, run test, see red or green, iterate. A 3,000-line HTML5 parser was ported from Python to JavaScript in four hours using test-driven agentic loops because the existing test suite served as both specification and verification harness [118]. A complete JSONata rewrite from JavaScript to Go finished in seven hours of agent time at $400 in token spend because the test suite acted as the spec, letting the agent run cheaply against a tightly bounded contract [119]. The operator rule that follows is concrete: when a step is bounded by an executable spec — a failing test, a snapshot harness, a typed contract — route it to a cheaper or faster model and run it with lower autonomy ceremony. Adaptive routers like Windsurf’s already lean on this signal: well-scoped, narrow prompts are routed to faster, cheaper models, while ambiguous requests draw the heavier model [120]. Tightly specified work earns cheaper execution; vague work pays the premium.

The practical implication: before you open your agent, open a text file. Write down what you want, how you’ll know it works, and what the agent should not do. Five minutes of specification saves thirty minutes of debugging AI-generated code that looked right but wasn’t.

7.2 What a Good Agent Spec Contains

A good agent-oriented spec does four jobs at once: it gives direction, defines completion, constrains behavior, and narrows the search space. The vision tells the agent what problem it is solving and why this feature exists. Acceptance criteria define done in testable terms — not “make login work better,” but “users can log in with email and password, sessions expire after 24 hours, and every API endpoint enforces authentication” [121]. Boundaries encode where the agent must not improvise: technology mandates, approval points, compliance rules, or “don’t get clever here” instructions. Explicit constraints narrow the solution space further with falsifiable limits such as latency targets, migration conventions, dependency rules, and exclusion zones. Missing any one of these forces the agent to guess, and agent guesses compound into rework.

Keep one distinction sharp: the spec is the human-owned contract — vision, acceptance criteria, boundaries, constraints. The plan is what the agent produces when it reads the spec and proposes a decomposition into steps, files, and tests. The plan refines the spec; it does not replace it. When you approve a plan, you are approving an interpretation of an already-approved spec, not authoring the contract for the first time. If the plan reveals that the spec was wrong, edit the spec first and let the plan be regenerated against it.

Ceremony is a choice, not a requirement. A one-page spec.md plus an approval checkpoint is enough for most feature work. At the lightweight end, a task card with four fields — what you want, what done looks like, what the agent cannot do, and any open questions — is still a spec. GSD-style task-driven workflows use exactly that pattern with almost no document overhead: the card may live in the prompt instead of a file, but that should be treated as a bounded exception for short, disposable work, not as the default for features that may survive resets, handoffs, or review [122]. Heavier systems add more artifacts, more approval gates, and more explicit role handoffs, with frameworks like BMAD treating versioned PRDs, architecture sketches, and stories as a “control manifest” that constrains generation before it begins [123], [124]; Chapter 9 compares those tradeoffs directly.
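
A four-field card can be as small as this (the task and its details are hypothetical, shown only to make the shape concrete):

```markdown
## Task: add rate limiting to the login endpoint
- Want: reject repeated failed logins per account
- Done: 6th failed attempt within a minute returns HTTP 429, covered by a test
- Never: modify the existing auth middleware or add new dependencies
- Open: should limits be per-account, per-IP, or both?
```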

The most common failure mode is the implementation trap: jumping straight from requirements to code, which buries design decisions invisibly inside generated output. When an agent produces code before you have agreed on capabilities, components, data flow, and interfaces, review becomes cognitively expensive because you are evaluating design and implementation simultaneously [125]. The antidote is straightforward: first ask for the migration shape, route structure, component boundaries, or task plan. Only then ask for code. Chapter 6 owns the broader plan/act mode-switching model; this chapter only needs the rule that a visible plan must exist before execution starts.

Good specs are also modular. Agents degrade when they must simultaneously satisfy too many requirements [121]. Instead of one monolithic document with every rule in the system, keep feature-level intent in spec.md and reference durable team-wide constraints from your rules files or standing orders. Pairing a structured PRD as the “what” with a CLAUDE.md as the “how” eliminates requirement drift while keeping each artifact lean [126]. Chapter 3 connects directly here: standing orders carry the background law of the codebase so the feature spec can concentrate on the job at hand.
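
One plausible on-disk layout for that split (directory names here are assumptions, though the specs/notifications/ path convention reappears in Section 7.5):

```
repo/
  CLAUDE.md                        # standing orders: durable team-wide constraints
  specs/
    notifications/
      spec.md                      # feature contract: vision, criteria, boundaries
      implementation_plan.md       # agent-derived plan, human-approved
```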

7.3 From Vague Request to Agent-Ready Spec

Consider a common scenario. A product manager asks: “Can we add notifications to the app?” That is too vague for an agent. The question is not whether the agent can generate code from that prompt; it can. The question is whether it will generate the code you actually wanted. Without a spec, the agent will invent architecture, pick defaults you never approved, and optimize for local plausibility rather than project fit.

Before (freeform request)

Can we add notifications to the app?

After (agent-ready spec)

# Feature: In-App Notification Center

## Vision
Add a notification bell to the top nav that shows unread count
and a dropdown of recent notifications. Backend publishes events;
frontend polls every 30 seconds (no WebSockets for v1).

## Acceptance Criteria
- Bell icon shows unread count (0 = hidden badge)
- Dropdown lists last 20 notifications, newest first
- Clicking a notification marks it read and navigates to source
- "Mark all read" clears the badge
- API: GET /notifications, PATCH /notifications/:id/read

## Boundaries
- Always: use existing auth middleware for the new endpoints
- Never: add a third-party notification service
- Ask: before adding any new database tables beyond `notifications`

## Constraints
- PostgreSQL, follow existing migration conventions
- < 100 ms p95 for GET /notifications (indexed on user_id + read)
- No new npm dependencies without approval

The spec is an artifact you own and approve, not a transient chat bubble. That is the real difference between specification-driven development and “better prompting.” The lifecycle is short and concrete: draft the contract, review or edit it, approve it, execute against it, and update the document if implementation discovers a real new constraint.

The named planning surfaces in coding agents differ in how durable the plan is and where the checkpoint sits. Choose by trigger condition, not brand:

  • Claude Code Plan Mode (Shift+Tab) is an in-session, read-only reasoning turn. Extended thinking is on by default; Plan Mode further restricts the agent to read-only operations and AskUserQuestion, so it can explore the codebase and propose changes without touching files [127]. Use it when the plan is disposable and the task is bounded inside one session.
  • Cline /deep-planning writes the plan to implementation_plan.md as a named, editable execution contract on disk [128]. Use it when the plan must survive a reset, a team handoff, or multi-session work, or when the same plan will be reviewed by someone other than the planner [129].
  • Aider /architect pairs an architect model that proposes the change with an editor model that applies it, splitting plan and act across two model passes for the same prompt [130]. Use it when your strongest reasoning model is weak at producing exact diffs, or when you want a second pass at the problem without rewriting the prompt.
  • Roo Code’s auto-approve panel is a second-layer gate, not a planning surface. It governs which categories of action (file writes, shell, MCP) execute without confirmation once a plan already exists, with workspace boundary protection and write-delay diagnostics as safety rails [106]. Use it after the contract exists, when the primary risk is unauthorized execution rather than planning quality.

The same discipline works without any planning mode at all. With Codex CLI or a plain CLI agent, prompt structure does the gating. The first turn asks only for an approved step list and file plan, with the spec embedded directly so the plan is self-contained — the pattern that keeps a Codex session productive for hours from a single prompt [131]:

“Read spec.md for the notification center. Produce a numbered step list and, for each step, list the files it will touch and the test that proves it done. Do not edit any files. Stop and wait for my approval.”

After you have edited that response into a plan you approve — saved as implementation_plan.md — the second turn references it and executes only Step 1:

“Here is the approved plan (paste of implementation_plan.md). Execute Step 1 only: create the migration for the notifications table. Out of scope: routes, handlers, frontend. Stop after the migration test passes.”

There is no plan mode and no permission panel involved; the contract is enforced entirely by what each prompt asks for and refuses. The native surfaces above are conveniences that automate this discipline, not replacements for it.

Use the contract the same way every time:

  1. Draft the spec in a form you can edit.
  2. Run a plan-only turn and make the agent expose its decomposition before any code changes.
  3. Edit and approve the plan; if it is wrong because the spec was wrong, fix the spec first.
  4. Execute one named step at a time against the approved plan.
  5. If implementation discovers a real new constraint, update the spec, regenerate the affected plan steps, then continue.

A short worked example for step 5: midway through Step 2 of the notification center, the agent reports that the existing auth middleware does not expose a stable user_id to non-session API routes, only a session token. That is a real new constraint, not an excuse to improvise. Stop. Open spec.md and add under Constraints: “GET /notifications and PATCH /notifications/:id/read must accept the existing session cookie; do not introduce a new auth path.” Regenerate the plan for the affected step, re-approve, then resume. The discipline is not “no surprises”; it is “every surprise becomes a spec edit before it becomes code.”

For the same notification feature, a clean prompt sequence has one job per turn:

  1. Plan prompt. “Read spec.md. Propose the schema migration, API routes, and any open questions. Do not write code yet.”
  2. Implementation prompt. “Implement only the migration and GET /notifications endpoint from the approved plan. Stay within the existing auth middleware and migration conventions. Do not touch the frontend yet.”
  3. Verification prompt. “Run the relevant integration tests, report whether each acceptance criterion now passes or still fails, and list any gaps before making more code changes.”

That sequence keeps planning, implementation, and verification from collapsing into one mushy request. Each prompt narrows the search space and gives you a cheap checkpoint before the next step. Without that contract, the agent does not merely move faster in the wrong direction; it invents specifics you never approved: WebSockets instead of polling, a third-party Pusher dependency, or a new pub/sub table the spec explicitly excluded.

7.4 Turn the Spec Into Reviewable Units

Decomposition only works if each unit you hand off still carries its own spec contract — scope, boundary, and done criteria. Without that, splitting a task into pieces just distributes the vagueness instead of eliminating it. A two-table database feature that cost $100 and two days across two failed attempts succeeded in under ten minutes when split into two separate pull requests [132]. Agents achieve much higher success on isolated tasks than on multi-step integrations — a measurable complexity cliff that no amount of prompt polish eliminates [113].

The narrower rule here is to carve the spec into the next unit of work that can still be fully specified and reviewed. If a task still requires the agent to make hidden architectural decisions across layers, the spec has not yet become a usable handoff unit. Chapter 8 owns the general decomposition heuristics; this chapter only insists that every decomposed unit must inherit a reviewable contract from the parent spec.

For a feature like the notification center from Section 7.3, one reasonable decomposition is:

  1. Database migration. Create the notifications table and indexes. Boundary: migration files only. Done criteria: migration applies cleanly and schema matches the spec.
  2. API endpoints. Implement GET /notifications and PATCH /notifications/:id/read. Boundary: routes, handlers, and their tests. Done criteria: integration tests pass.

Each step still has a spec, a boundary, and a clear definition of done. If the approved plan says “migration first, then API,” make that explicit in the implementation prompt: “Implement Step 1 from implementation_plan.md: the migration only. Do not touch routes or frontend files.” Without that pointer, the agent will drift from plan to improvisation as soon as the first obstacle appears.
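
Step 1's artifact might look like the raw-SQL migration below. Column names and types beyond the spec are assumptions (the spec fixes only the notifications table and the user_id + read index; the users reference is hypothetical).

```typescript
// Hypothetical Step 1 migration. The composite index matches the spec's
// latency constraint: GET /notifications filters on user_id and read status.
const up = `
  CREATE TABLE notifications (
    id         BIGSERIAL PRIMARY KEY,
    user_id    BIGINT      NOT NULL REFERENCES users (id),
    body       TEXT        NOT NULL,
    source_url TEXT        NOT NULL,
    read       BOOLEAN     NOT NULL DEFAULT FALSE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
  );
  CREATE INDEX notifications_user_unread
    ON notifications (user_id, read);
`;

const down = `DROP TABLE notifications;`;
```

Note how little the file touches: migration only, no routes, no frontend, exactly the boundary the implementation prompt names.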

7.5 The Approved Spec as a Portable Artifact

spec.md and implementation_plan.md matter as files because the contract has to outlive any single session. A planner session that produces an approved implementation_plan.md can be closed entirely; a fresh implementer session opened later only needs one prompt: “Read specs/notifications/spec.md and specs/notifications/implementation_plan.md. Implement Step 1 only. Stop after its test passes.” The new session does not reconstruct the planner’s reasoning, because the contract is on disk. Practitioners running 14-task subagent workflows report that the persistent spec doc is what lets them recover from a polluted session in seconds: open the spec in a fresh chat and the agent picks up where the last one drifted [129]. Long-running harnesses make the same bet — Anthropic’s empirical study of multi-session agents pairs an initializer-written feature list with a progress file that every new session reads first [133]. Session-level mechanics for compaction and resets belong to Chapter 5; the narrower point here is that the file has to exist before any of those disciplines has anything to operate on.
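
A small gate makes that bet explicit: refuse to start an implementer session unless the contract files exist on disk. This is a sketch; the function and path convention follow the specs/notifications/ example above and are not from any particular tool.

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// The portable contract: both files must exist before any implementer
// session starts. No files on disk, no execution.
function loadContract(featureDir: string): { spec: string; plan: string } {
  const specPath = join(featureDir, "spec.md");
  const planPath = join(featureDir, "implementation_plan.md");
  if (!existsSync(specPath) || !existsSync(planPath)) {
    throw new Error(
      `no contract in ${featureDir}: expected spec.md and implementation_plan.md`
    );
  }
  return {
    spec: readFileSync(specPath, "utf8"),
    plan: readFileSync(planPath, "utf8"),
  };
}
```

A fresh session then begins by pasting both files, not by reconstructing the planner's reasoning.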

7.6 Failure Modes the Contract Has to Defeat

Three failure modes show up repeatedly even when the team claims to be “working from a spec”:

  • Plan drift. The agent improvises past the approved plan because nobody points later prompts back to the written contract.
  • Shallow approval. A human says “looks good” to a plan they did not really read, so the checkpoint exists only theatrically.
  • Over-specification. The document becomes so rigid that the first real implementation constraint forces exception handling and side-channel prompts instead of a clean spec revision.

Each is a different way of letting unreviewed decisions slip past the handoff: drift hides them in the agent’s improvisation, shallow approval hides them in the human’s signature, and over-specification hides them in the side-channel prompts that paper over a brittle plan. The point of the contract is not ceremony; it is to make change explicit and reviewable before it turns into hidden implementation decisions.

The enduring rule is simple: treat the specification as the contract that makes agent work reviewable. If the contract is approved and narrow enough to verify mechanically, the agent can move fast without dragging you into cleanup. If the contract is vague, hidden in chat, or never reintroduced at the handoff boundary, the speed turns back into rework. Reliability starts with the spec you make the agent obey, and that spec has to be something the next session, the next agent, and the next branch can all open and trust.

7.7 Takeaways

  • Before opening your agent, open a text file and write down what you want, how you’ll know it works, and what the agent should not do — five minutes of specification prevents thirty minutes of debugging plausible-but-wrong code.
  • Include explicit non-negotiables and hard constraints in every spec — technology mandates, approval points, compliance rules, latency targets, dependency rules, and exclusion zones — because missing any one of these forces the agent to guess, and agent guesses compound into rework.
  • Keep the spec and the plan separate: treat the spec as the human-owned contract (vision, acceptance criteria, boundaries, constraints) and the plan as the agent’s derived interpretation — if the plan reveals the spec is wrong, edit the spec first and regenerate the plan against it.
  • Run a plan-only turn before any code changes: ask the agent to produce a numbered step list, the files each step will touch, and the test that proves it done — then edit and approve the plan before issuing any implementation prompt.
  • When a task step is bounded by an executable spec — a failing test, a snapshot harness, or a typed contract — route it to a cheaper or faster model and run it with lower autonomy ceremony to reduce token cost without sacrificing quality.
  • Decompose the spec only into units that each still carry their own scope, boundary, and done criteria — splitting a task without those three elements per unit distributes the vagueness rather than eliminating it.
  • Save the spec and implementation plan as named files on disk so any fresh session can open them and resume without reconstructing the planner’s reasoning — pass only the file reference in the implementation prompt.