8 Planning and Task Decomposition
Most agent failures are decomposition failures: too much asked at once, no plan to review, no checkpoint between intent and execution.
This chapter starts after the intent artifact exists. Chapter 7 owns the spec contract; this chapter turns that accepted intent into bounded chunks, ownership boundaries, loop assignments, and re-plan triggers.
The complexity cliff is real and measurable. Models hit roughly 40% success on isolated React tasks but drop to 25% on multi-step integrations, and SWE-Bench Pro scores fall to 20-43% versus 70%+ on Verified [134]. Experienced developers were slower with frontier tools when review and refactoring swallowed the gains from generated code [135]. The fix is not merely a better prompt or model. It is structuring work so the agent never operates above the complexity ceiling you can inspect.
Three moves compound: accept the spec, decompose the work, then gate execution. Decomposition is the middle move: it creates the executable plan without letting that plan become a second source of truth.
8.1 Decompose the Spec Into an Implementation Plan
The plan is the child artifact that makes execution inspectable. It says which slice runs next, who owns it, which files are in and out, which checks prove it worked, and what event forces a re-plan.
Pick the lightest planning surface that matches the blast radius:
| Work shape | Planning surface | Good enough when | Escalate when |
|---|---|---|---|
| Tiny local edit | Short checklist in the chat or issue | One file, one obvious check, easy revert | The agent names extra files or assumptions |
| Multi-file feature slice | Saved implementation_plan.md or native Plan Mode output | Several files, one subsystem, clear acceptance tests | Scope crosses subsystem, schema, API, or ownership boundary |
| Outer-loop change | Review packet with plan, risk, rollback, and verification | CI, infra, release, migration, or shared-state work | Verification requires another team or production-like environment |
A weak plan says only “refactor auth” and “update tests.” A useful plan pins the next slice: parent spec, exact files, files not touched, ordered steps, assumptions, verification commands, and stop conditions. For example:
```
## Implementation plan: migrate auth from sessions to JWT
Parent spec: specs/auth-jwt.md
Next slice: middleware token validation only
In scope: src/middleware/auth.ts, src/auth/jwt.service.ts, auth.middleware.test.ts
Out of scope: src/users/*, db/migrations/*, refresh-token behavior
Checks: npm test -- auth.middleware.test.ts
Stop and ask if: existing session reads remain after the middleware change
```

The chunk-sizing rule is practical: a chunk should be small enough that a reviewer can name the invariant it changes without reconstructing the whole feature. If the plan requires the agent to touch unrelated folders, invent architecture, or defer verification until the end, it is not a chunk; it is a project wearing a checklist.
The Research -> Review -> Rebuild workflow shows the same pattern at legacy scale. The agent researches a target component, produces a structured analysis, and waits for a domain expert to validate that analysis before rebuild starts [136]. With that gate, a Bahmni display-control migration that normally took 3-6 days finished in under an hour at under $2 in tokens [136]. The reusable lesson is not the exact toolchain; it is that review happens while correction is still cheap.
8.2 Task Decomposition
Granularity is the central skill. Tasks that are too large produce unusable results; tasks that are too small make agent overhead outweigh the payoff [137]. The 10x–100x velocity gains practitioners report happen on well-defined chunks where the developer already knows what needs to happen and how. The same agents struggle on high-level product requirements because the decomposition work has not been done yet — the agent is being asked to invent the spec and execute it in one shot.
Match the chunk to the agent’s strengths. Agents excel at logic and implementing explicit requirements; they fail at aesthetic judgment, design taste, and decisions that require organizational context [134]. Long-running agentic work amplifies this: agents are reliably good at well-scoped tactical execution, but lose the thread on strategic decisions across hours of work [138]. The decomposition heuristic that follows: keep strategic decisions (sequencing, architecture, taste) on the human side; hand tactical chunks (implement this function in this file with these inputs) to the agent.
For multi-file repetitive work, hierarchical task structures stop the agent from drifting. The pattern that worked for a 12-hour, 315-file frontend refactor was one epic per directory, one bead per file — explicit closure of each subtask, persisted to a SQLite store, so context compaction cycles didn’t lose the thread [138]. Without this, agents go lazy: they skip files, mark work complete that isn’t, or invent shortcuts to declare success. With it, completion becomes binary and auditable across context resets that would otherwise force a full refresh from the developer.
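A minimal sketch of that bookkeeping, assuming a local SQLite table and illustrative epic/bead names; the reported setup used its own store, but any durable table with explicit closure gives the same property:

```python
import sqlite3

def open_store(path=":memory:"):
    """One row per bead (file), grouped by epic (directory). Completion is a
    stored fact, not a claim in the chat transcript."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS beads (
        epic TEXT, bead TEXT, done INTEGER DEFAULT 0,
        PRIMARY KEY (epic, bead))""")
    return db

def add_bead(db, epic, bead):
    db.execute("INSERT OR IGNORE INTO beads (epic, bead) VALUES (?, ?)",
               (epic, bead))
    db.commit()

def close_bead(db, epic, bead):
    """Explicit closure: the agent must name exactly which file it finished."""
    db.execute("UPDATE beads SET done = 1 WHERE epic = ? AND bead = ?",
               (epic, bead))
    db.commit()

def remaining(db, epic):
    """Survives context compaction: recompute what is left from the store,
    not from the agent's memory of the conversation."""
    rows = db.execute("SELECT bead FROM beads WHERE epic = ? AND done = 0",
                      (epic,))
    return [r[0] for r in rows]
```

After a compaction cycle, a fresh session calls `remaining()` to pick up where the last one stopped; "complete" is binary because it is a row, not an assertion.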
Read tasks parallelize; write tasks don’t. Multi-agent systems consume roughly 15x the tokens of standard chat — about 4x for single-agent — and the cost is only worth it when the work decomposes cleanly into independent units [139]. Research, analysis, information gathering, and exploration favor multi-agent fan-out because each subagent can return a summary without coordinating with peers. Code generation, content creation, and file editing parallelize poorly because the work has implicit dependencies that show up as merge conflicts or contradictory edits when split across agents [139].
Concrete parallelization surfaces sit on a spectrum from fire-and-forget to dependency-aware. At the loose end, parallel-fleet commands run multiple agents concurrently on independent tasks and reduce wall-clock time when the decomposition is genuinely parallel [140]. At the tight end, agent-team patterns coordinate teammates directly via a shared task list and inter-agent messaging, executing in dependency-aware waves rather than fire-and-forget [141]. The cost dimension matters: each teammate in a tight-coordination setup is a full session with its own context window, not a lightweight worker, so the standard pattern is to run the lead on a stronger model and teammates on a cheaper one [141]. Pick the loose end when subtasks are truly independent reads; pick the tight end when subtasks share state or order matters.
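The read/write asymmetry can be sketched directly; `research` and `apply_edit` below are hypothetical callables standing in for subagent invocations, not any tool's API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_plan(read_tasks, write_tasks, research, apply_edit):
    """Fan out independent reads; keep writes serialized behind one owner."""
    # Reads return summaries and share no state, so they can run concurrently.
    with ThreadPoolExecutor() as pool:
        reports = list(pool.map(research, read_tasks))
    # Writes carry implicit dependencies, so a single sequence applies them
    # in plan order, each one seeing the gathered reports.
    results = [apply_edit(task, reports) for task in write_tasks]
    return reports, results
```

The design choice is the point: parallelism lives only on the side of the split where a merge conflict is impossible by construction.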
Aider’s Architect/Editor pattern shows decomposition can split within a single task, not just across them. A reasoning model (the Architect, e.g., o1-preview) handles solution design; a traditional LLM (the Editor, e.g., DeepSeek or o1-mini) handles format-compliant code edits. The pairing scored 85% on aider’s code editing benchmark — the highest reported — and most models scored higher when paired with themselves as Architect/Editor than when used solo [142]. The transferable lesson: when one model is asked to both reason about the problem and produce strictly formatted edits, attention splits and quality drops. Decomposing the task into “think” and “edit” steps recovers that lost capacity. The same split scales up to the workflow level — a reasoning-strong planner with a faster executor turns the per-task gate into a per-pipeline gate.
8.3 Inner/Outer Loop Workflow Decomposition
The workflow boundary that matters most is simple: decide what belongs inside the fast supervised editing loop and what must move to a slower checkpointed outer loop before the agent starts acting.
The inner loop is file-level edits, compile and unit-test feedback, and single-session iteration. The outer loop is CI, integration tests, deployment, shared-branch mutation, schema changes, and anything where the cost of an unchecked action exceeds the cost of a checkpoint interruption. Choosing that boundary is decomposition work. Enforcing it with permission modes, sandboxes, and approvals belongs to Chapter 6 and Chapter 16.
A workable heuristic: keep a chunk in the inner loop when a wrong edit is local, quickly testable, and cheap to revert. Move it to the outer loop when it touches persistence, infrastructure, public API contracts, production credentials, cross-team ownership, or shared release state. This is why src/auth/jwt.service.ts plus unit tests can be one inner-loop chunk, while db/migrations/*, .github/workflows/*, git push, kubectl, and release notes become outer-loop checkpoints.
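One way to make the heuristic mechanical is a path classifier in the plan tooling; the patterns below are illustrative and should be tuned to your repository:

```python
from fnmatch import fnmatch

# Illustrative blast-radius patterns: anything matching these leaves the fast
# inner loop and becomes an outer-loop checkpoint.
OUTER_LOOP_PATTERNS = [
    "db/migrations/*",
    ".github/workflows/*",
    "infra/*",
    "helm/*",
    "*.sql",
    "Dockerfile",
]

def loop_for(path: str) -> str:
    """Label a touched path 'inner' or 'outer' per the blast-radius heuristic."""
    if any(fnmatch(path, pattern) for pattern in OUTER_LOOP_PATTERNS):
        return "outer"
    return "inner"
```

Running every file in a proposed diff through a check like this turns "did this chunk stay in its loop?" into a yes/no question rather than a judgment call made after the fact.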
Make the boundary visible in the plan itself. Each chunk should carry a loop label, owner, acceptance signal, and re-plan trigger:
| Chunk | Loop | Owner | Acceptance signal | Re-plan trigger |
|---|---|---|---|---|
| Add JWT verifier helper | Inner | Agent | Unit tests pass | Existing session reads still referenced |
| Replace middleware session lookup | Inner | Agent with human diff review | Auth middleware tests pass | More than the named middleware file changes |
| Add migration or release toggle | Outer | Human-led checkpoint | Migration dry run and rollback note | Schema or rollout assumption changes |
Outer-loop work also changes decomposition strategy. Research, CI triage, and cross-file analysis can fan out to parallel agents because they return reports. Writes to shared code, schemas, or release state should stay serialized behind a named owner because the merge and rollback cost is not parallel. If a parallel agent uncovers a dependency, the plan changes before more work starts; it does not become an after-the-fact explanation for drift.
8.4 Planning and Review Cycles
Plan review is where decomposition becomes enforceable. Use a formal gate when a task crosses three to five files, crosses a subsystem boundary, changes schema or public API, or resembles a previous agent change you had to undo. Use a lightweight checklist for a tight single-file edit-test loop where the intended change is explicit and the wrong move is cheap to revert.
Three concrete gate shapes cover most teams.
UI-native plan/act flow. Enter the tool’s Plan Mode or /deep-planning surface. The agent may read and propose but not mutate. Review the plan for files, order, assumptions, checks, and stop conditions. Only then click the equivalent of approve, switch to Act Mode, or accept the plan. If the tool starts editing before that release step, the gate failed.
Prompt-enforced no-mutation flow. When the tool lacks a native mode, start with a hard no-write instruction:
```
Before changing files, write an implementation plan only. Include parent spec,
in-scope files, out-of-scope files, ordered steps, assumptions, verification
commands, and stop conditions. Do not edit files yet.
```
The release prompt is separate and narrower: “Approved plan v2. Implement step 1 only, run the named check, report the diff and result, then stop.” If the agent implements step 2, touches an out-of-scope file, or changes the plan without asking, stop and re-plan.
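The stop condition can be checked mechanically after each step: compare the files the agent touched against the approved plan's scope lists. This is a sketch; the glob-pattern scope lists are an assumption, not any tool's format:

```python
from fnmatch import fnmatch

def scope_violations(touched, in_scope, out_of_scope):
    """Return touched files the approved plan did not authorize.

    A file violates scope if it matches an out-of-scope pattern, or if it
    matches no in-scope pattern at all."""
    violations = []
    for path in touched:
        forbidden = any(fnmatch(path, pat) for pat in out_of_scope)
        allowed = any(fnmatch(path, pat) for pat in in_scope)
        if forbidden or not allowed:
            violations.append(path)
    return violations
```

A non-empty result means stop and re-plan before any further edits, which is exactly the release prompt's contract made executable.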
Programmatic or permission-gated flow. In an SDK or harness, start the session in a read-only or plan permission mode, then flip to the write-enabled mode only after a human approval record exists: setSessionMode('plan') -> review saved plan -> setSessionMode('act'), or deny Edit/Write until approved_plan_id is present. The failure mode is broad auto-approval or bypass: if a setting lets writes proceed before the approval record, the gate is decorative.
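A minimal sketch of that gate, with a toy `Session` standing in for whatever harness API you actually use; the point is that the write path checks for the approval record itself, not a mode flag that a setting can flip:

```python
class Session:
    """Read-only until a human approval record exists; then writes unlock."""

    def __init__(self):
        self.mode = "plan"
        self.approved_plan_id = None

    def approve(self, plan_id: str):
        """The human review step: record which plan was approved, then act."""
        self.approved_plan_id = plan_id
        self.mode = "act"

    def write_file(self, path: str, content: str):
        # Both conditions must hold; a mode switch without an approval
        # record (the broad auto-approve failure mode) still denies writes.
        if self.mode != "act" or self.approved_plan_id is None:
            raise PermissionError(f"write to {path} denied: no approved plan")
        return ("wrote", path)  # stand-in for the real mutation
```

If a bypass can set `mode = "act"` without creating `approved_plan_id`, this gate still holds, which is the property the decorative version lacks.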
The review pass asks five questions:
- What is the next bounded chunk? Name the files, functions, or behaviors in scope, and reject compound steps that hide decisions.
- What must happen first? Order chunks by dependency, not by convenience. Schema, contract, or interface changes precede callers; risky shared-state changes wait for the outer loop.
- Who owns the chunk? A human owns strategy, taste, and cross-team decisions; an agent can own tactical edits with explicit acceptance signals. Read-only research can fan out; write work needs tighter ownership.
- What context bundle travels with it? Include the parent spec link, relevant files, constraints, assumptions, non-goals, accepted plan steps, and verification command. That bundle is what keeps a fresh session, compacted session, or subagent handoff from losing the thread.
- What triggers re-planning? Any touched file outside the plan, failed assumption, new dependency, ambiguous requirement, or verification failure stops execution and updates the plan before more code changes.
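The context bundle and its completeness check can be made concrete; the field names here are illustrative, not any tool's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """The context bundle that travels with each bounded chunk, so a fresh
    session, compacted session, or subagent handoff does not lose the thread."""
    parent_spec: str                 # link back to the accepted intent
    in_scope: list                   # exact files this chunk may touch
    out_of_scope: list               # files explicitly not touched
    steps: list                      # ordered steps from the approved plan
    verification: str                # command the agent can run, e.g. a test
    replan_triggers: list = field(default_factory=list)

    def is_complete(self) -> bool:
        """A plan that lists steps but names no verification is incomplete."""
        return bool(self.parent_spec and self.in_scope
                    and self.steps and self.verification)
```

Rejecting any chunk where `is_complete()` is false enforces the last review question automatically instead of relying on reviewer discipline.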
Two habits keep the gate from becoming theater. First, reject vague plans instead of approving them under time pressure. Second, compare execution back to the approved plan: did the diff touch only planned files, did every acceptance signal run, and did the agent add, skip, or reinterpret a step without a plan update?
Verification is the other side of the loop. Replit’s Agent 3 embeds a separate testing subagent that uses sandboxed browser automation to catch Potemkin interfaces — UI that renders but lacks functional implementation — and reports large reductions in debugging time and missed defects [143]. The transferable lesson is modest and useful: every bounded chunk should pair with a check the agent can run. A plan that lists steps but no verification signals is incomplete.
The full loop is therefore: decompose accepted intent into chunks, carry a context bundle into each chunk, review before mutation, verify before the next chunk, and re-plan when reality diverges. Run the loop well and the limiting factor becomes decomposition skill, not model quality [137].
8.5 Takeaways
- Write an implementation plan that pins the next slice — parent spec, exact in-scope files, files not touched, ordered steps, assumptions, verification commands, and stop conditions — before the agent starts editing.
- Size each chunk so a reviewer can name the invariant it changes without reconstructing the whole feature; if the plan requires touching unrelated folders, inventing architecture, or deferring verification to the end, split the work further.
- Keep strategic decisions — sequencing, architecture, and taste — on the human side; hand the agent only tactical chunks specifying the function to implement, the file it lives in, and its inputs.
- For long-running repetitive multi-file work, track progress in a hierarchical task list outside the context window and close each subtask explicitly; a directory-level epic with file-level subtasks is one workable pattern when compaction would otherwise lose the thread.
- Fan out independent read, research, and analysis tasks to parallel agents, but keep shared-state write work serialized behind a named owner or dependency-aware plan; if write scopes are not truly independent, parallelism turns into merge conflicts and contradictory edits.
- Label each chunk inner-loop or outer-loop in the plan itself: keep edits that are local, quickly testable, and cheap to revert in the inner loop; move anything touching persistence, infrastructure, public API contracts, production credentials, or shared release state to an outer-loop checkpoint.
- Gate execution behind a plan-only stage before allowing file mutation: use the tool’s Plan Mode, a hard no-write instruction, or read-only session permissions — then release to Act Mode only after reviewing the plan’s files, order, assumptions, and stop conditions.
- Pair every bounded chunk with a verification command the agent can run; reject any plan that lists steps but names no verification signals.