2 How Coding Agent Harnesses Work
The model is the engine; the harness is everything else — and the harness is where reliability lives or dies.
2.1 Model Plus Harness
Chapter 1 planted the model/harness/agent triad; this chapter makes it formal.
Vocabulary: model, harness, agent. A model is the language model. A coding-agent harness is the infrastructure layer that assembles context, exposes tools, enforces permissions, persists sessions, and recovers from failure. After this point, harness is the short form for coding-agent harness. A coding agent is the complete system as the practitioner experiences it: model plus harness plus UI, workflow, and policy.
When later chapters describe what an agent “does,” they are talking about the complete system. When they describe context assembly, tool dispatch, permissions, hooks, connectors, session state, routing, or sandboxing, they are talking about harness behavior even when a product’s documentation casually says “agent.” Building a production-ready coding agent is roughly 10% model and 90% infrastructure [15]. The model is the language engine. The harness is everything around it: prompt assembly, tool dispatch, permission gates, context compaction, session persistence, recovery paths. Practitioners who track model leaderboards and conclude “this one is better than that one” have largely picked the wrong axis. What separates a usable coding agent from a frustrating one is rarely the underlying model — it is the harness that decides when the model sees what, what tools it can call, when it pauses for human input, and how it recovers when something goes wrong.
The reason this matters is durability. Static benchmarks measure a model’s ability to solve a puzzle in one or two turns. Real coding work runs for 50+ tool calls and hours of execution, and reliability emerges only over those long runs [16]. A model that aces a coding benchmark can still fail to follow its initial instructions after a long session if the harness lets context rot, drops permission state, or floods the prompt with stale tool output. A raw LLM is stateless: without a harness it hallucinates tool calls, experiences context rot, and loses progress on restarts [15]. The harness is what turns a stateless completion engine into something you can hand a multi-hour task and trust to finish.
This is also why coding CLIs feel categorically different from a chat website. Claude Code’s key innovation is not a better model — it is running as a localhost agent with access to your private environment, data, and context [17]. The harness lives next to your codebase. It can read your files, run your tests, edit on disk, and persist state across sessions. The shift from “website you visit” to “spirit that lives on your computer” is a harness shift, not a model shift. Coding CLIs are specialized harnesses, not generic frameworks: they ship batteries-included infrastructure for prompts, tool handling, lifecycle hooks, and sub-agent management [16].
The harness also explains why teams keep rebuilding. Manus refactored their harness five times in six months, LangChain re-architected three times in a year, and Vercel removed 80% of their agent tools after each model release shifted the optimal structure [16]. Each frontier model has different tool-use idioms, different context discipline, different reasoning depth. A harness perfectly tuned for the previous generation can produce visibly worse results on the next. The takeaway: do not over-invest in elaborate harness machinery you cannot tear down. The valuable durable artifact is not the harness code — it is the agent logs, which become the training and evaluation feedback loop [16].
If you are choosing or configuring an agent, the harness questions matter more than the model questions. What tools does it expose, and at what granularity? How does permission evaluation work? Where does it store session state, and how does resume work? What does it do when context fills up? When does it pause for you, and when can it bypass you? Those are the questions that determine whether you end the day shipping or debugging.
Two evaluation terms recur in this chapter and in the closing checklist. Semantic tool vocabulary means the named actions the harness exposes to the model — Read, Edit, Bash, AskUserQuestion, mcp__github__merge_pull_request — and the meaning each name carries. A harness with clear verbs gives the model smaller, safer choices; a harness with one generic shell or browser tool pushes policy into prose. Permission evaluation order means the deterministic sequence the host runs before a tool call executes: hooks, denies, mode checks, allow-lists, callbacks, and sandbox boundaries. If that order is vague, you cannot reason about what will actually stop a bad action.
2.2 Tool Dispatch, State, and Recovery
A harness is easier to reason about if you read it as a control sequence: dispatch -> govern -> pause -> persist -> shape context -> recover. Dispatch defines the action vocabulary. Governance decides whether a selected action may run. Pause points turn ambiguity or risk into a decision. Persistence determines what survives the turn, session, process, and working-directory boundary. Context shaping decides what the model sees before the next action. Recovery gives the operator a way back when the loop drifts.
2.3 Dispatch: Semantic Tools as Control Surfaces
Modern coding harnesses expose named primitives rather than one generic “do computer stuff” interface. The names are operational, not decorative:
| Tool class | Typical primitives | What the harness can do with the name |
|---|---|---|
| Read/search | Read, Glob, Grep | allow read-only agents, parallelize safe calls, audit evidence gathering |
| Mutation | Edit, Write, patch tools | require review, trigger formatters, block protected paths |
| Execution | Bash, task runners | scope by command prefix such as Bash(npm *), separate tests from arbitrary shell |
| Interaction | AskUserQuestion, approval callbacks | pause on ambiguity, surface structured questions |
| Delegation | Agent, Skill, named subagents | isolate context and tool scope for specialist work |
That vocabulary produces concrete payoffs. A read-only reviewer can be configured with allowedTools: ["Read", "Glob", "Grep"]. A hook matcher such as Write|Edit can block .env mutations without parsing model prose [18]. Read-only searches can run concurrently because the harness knows they do not mutate state. Bash command-prefix scoping lets you approve Bash(npm test:*) without approving arbitrary shell [19]. Note that an omitted tool is different from a disallowed tool: omitted tools are removed from the model's action space, while disallowed tools may still be attempted and then blocked. In bypass-style modes, allowlists may stop mattering; hard deny rules and hooks are the safer boundary [20].
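Expressed against the TypeScript Agent SDK surface this chapter keeps citing, the reviewer configuration is only a few lines. This is a hedged sketch: check the current SDK documentation for exact option and message shapes before relying on it.

```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

// Read-only reviewer: allowedTools pre-approves the evidence-gathering
// verbs, and disallowedTools hard-blocks mutation tools as a second layer.
// Omitting a tool entirely would go further still: the model would never
// see it in its action space at all.
const review = query({
  prompt: "Review src/scheduler.ts for race conditions",
  options: {
    allowedTools: ["Read", "Glob", "Grep"],
    disallowedTools: ["Write", "Edit", "Bash"],
  },
});

for await (const message of review) {
  if (message.type === "result") console.log(message);
}
```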
2.4 Govern: The Runtime Evaluation Order
A prompt that says “do not touch production” is advisory. A deny rule, hook, permission callback, or sandbox is executable policy. The transferable architecture is a fixed evaluation sequence the harness runs every time the model proposes a tool call:
1. Model proposes a tool call with arguments.
2. PreToolUse hooks fire first. They can allow, deny, or modify the call before any rule is checked. This is the earliest programmatic intercept.
3. Deny rules fire next. disallowedTools and settings.json deny entries unconditionally block, and they hold even in bypassPermissions mode.
4. Permission mode applies. The session-wide posture (default, dontAsk, acceptEdits, bypassPermissions, plan) decides what happens to anything not yet resolved.
5. Allow rules pre-approve. allowedTools and settings.json allow entries auto-approve listed tools.
6. canUseTool callback handles the rest. Anything still unresolved hits the host's interactive approval callback. If the host never wires it, this step is absent.
7. Execution. The call runs, then PostToolUse hooks fire for audit and side-effects (see the sketch after this list).
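Rendered as code, the sequence reads like the sketch below. Every name in it (Decision, Session, evaluate, the toy rule matcher) is invented for illustration; real harnesses inline this logic, and only their own documentation is authoritative for the exact ordering.

```typescript
type Decision = "allow" | "deny" | "ask";
type Mode = "default" | "dontAsk" | "acceptEdits" | "bypassPermissions" | "plan";

interface ToolCall { name: string; args: Record<string, unknown>; }

interface Session {
  mode: Mode;
  denyRules: string[];   // e.g. ["Write(./.env)"] -- survive bypassPermissions
  allowRules: string[];  // e.g. ["Bash(npm test:*)"]
  preToolUseHooks: Array<(call: ToolCall) => Decision | "pass">;
  canUseTool?: (call: ToolCall) => Promise<Decision>; // absent if never wired
}

// Toy matcher: treats a rule as "Tool(prefix*)", "Tool(path)", or a bare name.
function matches(rules: string[], call: ToolCall): boolean {
  const arg = String(call.args["command"] ?? call.args["path"] ?? "");
  return rules.some((rule) => {
    const m = /^(\w+)\((.*?)\*?\)$/.exec(rule);
    if (!m) return rule === call.name;
    return m[1] === call.name && arg.startsWith(m[2]);
  });
}

const MUTATING = new Set(["Write", "Edit", "Bash"]);

async function evaluate(call: ToolCall, s: Session): Promise<Decision> {
  for (const hook of s.preToolUseHooks) {              // step 2: earliest intercept
    const d = hook(call);
    if (d !== "pass") return d;
  }
  if (matches(s.denyRules, call)) return "deny";       // step 3: unconditional block
  if (s.mode === "bypassPermissions") return "allow";  // step 4: session posture
  if (s.mode === "plan" && MUTATING.has(call.name)) return "deny";
  if (matches(s.allowRules, call)) return "allow";     // step 5: allow-lists pre-approve
  if (s.mode === "acceptEdits" && (call.name === "Edit" || call.name === "Write")) return "allow";
  if (s.canUseTool) return s.canUseTool(call);         // step 6: host callback, if wired
  return "ask"; // interactive default; dontAsk semantics deliberately omitted here
}
```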
Claude Code is the clearest public reference implementation of this ordered model, including the documented warning that allowedTools does not constrain bypassPermissions and that only disallowedTools and deny rules survive bypass [20], [18]. The exact ordering is product-specific, but the architecture is stable: hard blocks before broad autonomy, explicit session posture, and an intercept layer for actions the harness cannot safely auto-decide. Two consequences are worth absorbing. First, anything you want hard-blocked must live in the deny path. Second, if a host program never wires canUseTool, the human gate is not latent — it is absent, and unlisted tools fall through to the default mode without review.
Tool approval and sandbox scope are different boundaries. allowedTools: ["Bash(npm test:*)"] answers “may this tool call run?” A filesystem or network sandbox answers “what can this process reach after it runs?” Bash inside a workspace-confined, no-network sandbox can run tests without reading $HOME/.ssh or curling production; Bash with broad host access can exfiltrate secrets or mutate unrelated directories once approved. The harness should evaluate both: named-tool approval decides whether the call is allowed, and host confinement decides the blast radius of an allowed call. Runtime enforcement matters because it happens before model intent can rescue you.
2.5 Settings as Layered Governance State
Permission rules do not live in one place. Treat settings as a layered control plane. Claude Code documents this precedence explicitly: managed scope cannot be overridden by user, project, or local settings, and the documented order is Managed > Command-line > Local > Project > User [19].
| Layer | Owner | Typical use | Override behavior |
|---|---|---|---|
| Managed/system policy | organization or device manager | immutable security floor, MCP enablement, fleet defaults | wins over user/project/local [19] |
| User | individual developer | personal defaults, preferred tools, harmless ergonomics | below managed policy |
| Project | repository | team conventions, repo-specific allowed/denied tools | committed and reviewable |
| Local | one machine | temporary overrides, experiments, secrets-free local paths | should not become team policy |
The same four-layer model surfaces with different names across tools. In Claude Code, the operator surface is settingSources: passing ['user', 'project', 'local'] composes all three filesystem layers, passing only ['project'] loads just the committed repo policy, and [] is an explicit clean slate; managed policy settings and ~/.claude.json always load regardless of settingSources, by design [18]. That is the security floor; it cannot be opted out of, only overridden by a tighter system policy. Gemini CLI expresses the same layering through a different file layout: ~/.gemini/settings.json for the user layer, <project>/.gemini/settings.json for the project layer, and a system layer that an MDM tool can deploy to OS-conventional paths (with GEMINI_CLI_SYSTEM_DEFAULTS_PATH available for fleet deployment of system defaults), with system overrides taking final precedence on single-value settings [21], [22]. The names differ; the architecture is the same: an immutable system layer for policy, a user layer for convenience, a project layer for team conventions, and a local layer for ephemeral overrides.
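To make the project layer concrete: a committed Claude Code settings file in the documented permissions shape might look like the sketch below, using the Tool(specifier) rule strings this chapter has already shown. Verify exact rule syntax against the current docs before committing it as team policy.

```json
{
  "permissions": {
    "allow": ["Read", "Glob", "Grep", "Bash(npm test:*)"],
    "deny": ["Read(./.env)", "Bash(curl:*)"]
  }
}
```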
Clean slate is useful for CI, but it is not multi-tenant isolation by itself. The Claude Code SDK documentation is explicit that settingSources: [] only opts out of filesystem layers — managed policy and ~/.claude.json still load [18]. The safe multi-tenant recipe names four concrete levers, not a hand-wavy “disable everything else”: pass settingSources: [] to drop user, project, and local config; disable auto-loaded memory (in Claude Code, CLAUDE_CODE_DISABLE_AUTO_MEMORY=1) so per-user memory does not silently rejoin the prompt; mount each tenant on its own working directory or filesystem sandbox so file discovery and session storage cannot cross tenants; and accept that managed/system policy is immutable regardless of any of those choices [18]. If any one of those four levers is missing, the run is not isolated; it is leaking through whichever lever you forgot.
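As a sketch, the four levers map onto SDK options roughly like this, assuming the TypeScript Agent SDK option names cited above (settingSources, env, cwd); the tenant paths are hypothetical.

```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

// Hedged sketch of the four-lever isolation recipe.
async function runTenantJob(tenantId: string, prompt: string) {
  const run = query({
    prompt,
    options: {
      settingSources: [],                             // lever 1: drop user/project/local config
      env: { CLAUDE_CODE_DISABLE_AUTO_MEMORY: "1" },  // lever 2: no auto-loaded memory
      cwd: `/srv/tenants/${tenantId}/workspace`,      // lever 3: per-tenant working directory
      // Lever 4 cannot be set here: managed/system policy still loads,
      // by design, no matter what this options object says.
    },
  });
  for await (const message of run) {
    // route per-tenant output; session files stay keyed to the tenant cwd
  }
}
```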
2.6 Pause: Ambiguity and Risk Need Different Gates
Pause points are where the observe -> plan -> execute loop stops before a bad action, and they slot into the evaluation sequence above at different stages. PreToolUse hooks and deny rules are runtime-enforced at steps 2–3: the harness blocks the call before the model knows it tried. canUseTool sits at step 6: the host intercepts a specific action such as Bash, Write, or mcp__github__merge_pull_request and asks the operator. AskUserQuestion is different in kind, because it is model-initiated rather than host-enforced: the agent calls a built-in tool because requirements fork, and the call surfaces a structured question [23]. Plan Mode is session-level: the agent can reason and decompose but cannot mutate until released; once the plan is approved, set_permission_mode('default') resumes execution. These are not interchangeable. A safe headless run cannot rely on a human callback that does not exist; it should fail closed with a structured "needs-human-input" result, leave the session in plan mode, or rely on hooks and deny rules that do not require an operator at all.
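A hedged sketch of the fail-closed posture for a headless host, assuming the SDK's documented canUseTool result shape ({ behavior: "allow" | "deny" }); escalate is a hypothetical helper standing in for whatever review queue you operate.

```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

// Hypothetical escalation path: record the blocked action for a human.
async function escalate(tool: string, input: Record<string, unknown>) {
  console.error(`needs-human-input: ${tool}`, JSON.stringify(input));
}

const run = query({
  prompt: "Fix the failing integration test and open a PR",
  options: {
    canUseTool: async (toolName, input) => {
      // No operator exists in this run: high-risk actions fail closed
      // with a structured result instead of silently falling through.
      if (toolName.startsWith("mcp__github__")) {
        await escalate(toolName, input);
        return { behavior: "deny", message: "needs-human-input" };
      }
      return { behavior: "allow", updatedInput: input };
    },
  },
});
```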
2.7 Persist: Session State Is Not Filesystem State
Stateless, in-memory, disk-persisted, resumed, and forked sessions solve different problems, and the most common operator surprise is conflating two independent concerns. A one-shot query() is stateless from the conversation perspective but can still write files during the call. Session state (conversation history, prior tool outputs, accumulated reasoning) and filesystem state (files mutated on disk) are independent: stateless does not mean side-effect free, and "I started a fresh session" does not undo what the previous one wrote [24]. A stateful client (ClaudeSDKClient in Python, or query() with continue: true in TypeScript) keeps conversation history across turns; one-shot query() does not, which is why query() does not support interrupts or continuing the chat while the stateful client does [23].
The Claude Code session docs make the disk model explicit and warn that session files are machine-local and cannot reliably be moved cross-host unless the working directory path is identical [24]. The practical consequence is that resuming from a different cwd does not restore the original history — the harness cannot find what the path-keyed index does not contain. Forking is a different operation: it creates a new session with a copy of the original’s conversation history, but any file modifications made by a forked agent remain visible to other sessions working in the same directory [24]. An in-memory or non-persisted run is the right choice when conversation state must not survive process exit: ephemeral CI workers, multi-tenant servers where session bleed is a confidentiality concern, and one-shot automations that should not leave a session trail. The operator question is not “can this agent edit files?” but “what conversation state will the next turn inherit, and where does it live?”
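A sketch of resume versus fork, assuming the SDK's resume and forkSession options as documented in the session material cited above; the session ID would come from an earlier run.

```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

const sessionId = process.env.SAVED_SESSION_ID!; // captured from a prior run

// Resume: same session, same history. The next turn inherits everything,
// and it only works from the same machine and working-directory path.
const resumed = query({
  prompt: "Continue the migration from where we stopped",
  options: { resume: sessionId },
});

// Fork: a new session seeded with a copy of the history. Conversation
// state diverges, but file mutations still land in the shared cwd and
// remain visible to every other session working there.
const forked = query({
  prompt: "Try the alternative index strategy instead",
  options: { resume: sessionId, forkSession: true },
});
```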
2.8 Shape Context: Retrieval, Targeting, and Exclusion
Long sessions drift because context fills with stale output and old decisions lose salience [25]. The harness rule worth carrying with you is short: auto-retrieval is a proposal, explicit targeting (@file, @#symbol, @path:1-50) overrides it, and exclusion boundaries — .gitignore, admin context filters, tenant roots, managed context filters — still win over both. Cody Context Filters are the canonical example of the exclusion side: when an admin defines exclude rules at the Sourcegraph instance level, the filter is enforced across all clients and narrows the allowed set even for explicit context references [26]. That asymmetry is the safety property: you cannot accidentally route around an exclusion control by being more specific. The full operator playbook for compaction symptoms, externalized state files, resets, and forks belongs in Chapter 5; treat the four-line rule above as the harness contract.
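The asymmetry compresses to a few lines. This toy sketch (all names invented) shows why specificity cannot defeat exclusion: explicit targets replace the auto-retrieved proposal, but the exclusion filter runs last over whichever set won.

```typescript
// Toy precedence sketch of the four-line rule above.
function resolveContext(
  autoRetrieved: string[],               // the harness's proposal
  explicitTargets: string[],             // @file / @path:1-50 style references
  isExcluded: (path: string) => boolean, // .gitignore, admin filters, tenant roots
): string[] {
  const candidates = explicitTargets.length > 0 ? explicitTargets : autoRetrieved;
  return candidates.filter((path) => !isExcluded(path)); // exclusion always wins
}
```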
2.9 Recover: Assume the Loop Will Drift
Failure recovery is harness work, not model work. Session forks, externalized state files, post-compact cleanup, and side quests all exist because a wrong path pollutes future turns if it stays in history. Pi’s side-quest pattern makes the idea concrete: branch from the current session to repair a tool, summarize what changed, then return without dragging the repair context through the main task [27]. Quick Codex shows the same idea from the state-externalization side, layering status files, plans, and verification records on top of Codex CLI so multi-turn work stays resumable when memory drifts [28]. The mental model is simple: design session boundaries so rollback is cheap. The full operating discipline for resets and forks belongs in Chapter 5.
2.10 Public Implementations and Reverse Engineering
The useful lesson from public implementations is not the catalog of projects. It is three recurring harness design choices: keep the core small, externalize state so recovery is inspectable, and treat compaction as a first-class runtime path rather than an afterthought. The examples below matter only insofar as they show those choices in working systems.
The fastest way to develop intuition for harness design is to read one. Two sources are unusually instructive: Pi, the minimal agent inside the OpenClaw codebase, and reverse-engineering analyses of Claude Code internals. Pi is a deliberately minimal core: a short system prompt and four tools — Read, Write, Edit, Bash — paired with an extension system that lets the agent extend itself [27]. The argument is counterintuitive but specific: minimal cores beat feature-rich agents because they avoid the cache invalidation and context bloat of static tool loading. Loading tools into the system context at session start (as MCP servers typically do) makes it very hard to reload or modify tools without trashing the prompt cache. Persisting extension state to disk alongside session messages, as custom message types, lets extensions maintain state across reloads without confusing the model about prior tool invocations. Session trees enable side-quests: branch off to fix a broken agent tool, do the fix, summarize what happened, and rewind without polluting the main session’s context [27]. The takeaway is not “use Pi” — it is to recognize that an agent is most productive when treated as a malleable tool that builds its own functionality, not a passive consumer of an external extension marketplace.
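Not Pi's actual source, but the shape compresses into a short sketch: four tools keyed by name, a plain loop, and nothing else. callModel is a placeholder for whatever tool-calling LLM API the harness speaks to.

```typescript
import { readFileSync, writeFileSync } from "node:fs";
import { execSync } from "node:child_process";

// Four tools keyed by name -- the entire dispatch table.
const tools: Record<string, (args: any) => string> = {
  Read: ({ path }) => readFileSync(path, "utf8"),
  Write: ({ path, content }) => { writeFileSync(path, content); return "ok"; },
  Edit: ({ path, find, replace }) => {
    const src = readFileSync(path, "utf8");
    if (!src.includes(find)) throw new Error("find-string not present");
    writeFileSync(path, src.replace(find, replace));
    return "ok";
  },
  Bash: ({ command }) => execSync(command, { encoding: "utf8" }),
};

type Step = { type: "text"; text: string } | { type: "tool"; name: string; args: any };

// Placeholder: wire this to a real LLM API that supports tool calls.
async function callModel(history: unknown[]): Promise<Step> {
  throw new Error("connect an LLM API here");
}

async function agentLoop(task: string): Promise<string> {
  const history: unknown[] = [{ role: "user", content: task }];
  while (true) {
    const step = await callModel(history);
    if (step.type === "text") return step.text;    // model is done
    let output: string;
    try { output = tools[step.name](step.args); }  // dispatch by tool name
    catch (err) { output = `error: ${err}`; }      // failures go back as context
    history.push({ role: "assistant", content: step });
    history.push({ role: "tool", name: step.name, content: output });
  }
}
```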
The Claude Code internals deep-dive is a reverse-engineering writeup, not vendor documentation, and the prose below reflects that — the patterns it describes are inferred from observed behavior and source analysis rather than promised contract. With that caveat, the analysis reports a layered compaction pipeline (tool-result truncation, mid-session summarization, autocompact at threshold), an agentic search tool that returns line numbers and file offsets so the model can navigate large codebases without re-reading files, and a permission layer that gates every tool call before execution [20]. The operator takeaway matters more than the mechanism list. When the agent suddenly “forgets” an earlier decision, repeats work it already finished, or its replies become noticeably more abstract and less file-specific, compaction has likely fired. Treat compaction as a runtime path you operate around, not a feature you tune; the response patterns belong in Chapter 5.
The lesson from both is that the harness does not need to be elaborate to be effective — it needs to be the right shape for the workload. Most successful agent implementations across dozens of teams use simple, composable patterns rather than complex frameworks [29]. Workflows (predefined paths) are architecturally distinct from agents (dynamic LLM-directed processes), and agents trade latency and cost for better task performance — a trade you should only make when simpler workflows demonstrably fail. Start with direct LLM API calls; frameworks obscure underlying prompts and responses, making debugging harder [29]. The same advice runs through Pi’s minimalism, Anthropic’s pattern catalog, and the gut-and-rebuild discipline of teams that ship at frontier-model pace [16]: keep the harness small enough that you can throw it away when the next model lands. Even at production scale, codebases that lean into AI-assisted development — Puzzmo’s roughly 70% Claude-Code-authored codebase being a representative example — succeed because they refactor toward explicit boundaries and scoped contexts the harness can actually load, not because the harness is elaborate [30].
Extensibility is the harness dimension practitioners reach for last and benefit from most, but the decision rule for evaluating it is simple: inspect hooks, permissions, and session persistence first — those determine whether the harness is safely operable at all — and only then weigh plugins, skills, reusable agents, and SDK paths, which determine how much leverage you compound on top. Plugin systems make the leverage payoff concrete: package commands, skills, agents, and hooks once, then reuse them across projects instead of re-scaffolding .claude/ directories per repo [31]. Cline’s Plan/Act split, MCP integration, and @-mention surfaces show the same pattern in a different harness shape — extensibility expressed through control surfaces rather than a marketplace [9]. The full design playbooks for skills, subagent composition, and reusable workflow overlays belong in Chapter 4 and the process-frameworks chapter; the codebase-side moves that make subagents pay off belong in Chapter 5; and the operator workflow patterns that exercise these surfaces belong in the workflow chapters [32].
The convergence point worth naming explicitly: harness engineering — not prompt engineering — is becoming the primary discipline for managing AI systems [15]. Prompt engineering told you what to type. Context engineering told you what to feed the model. Harness engineering tells you what to build and configure around the model so prompts and context survive 50-step workflows. If you have been treating harness questions as configuration trivia — which permission mode, which tools, which settings layer, when to fork a session — you are leaving most of the available leverage on the table. The chapters that follow operationalize this: how to manage context deliberately, how to choose autonomy levels safely, how to compose subagents and workflows, and how to recover when the harness inevitably surprises you.
When evaluating a harness, inspect five things first: the semantic tool vocabulary, the permission evaluation order, the session persistence contract (including where state lives and how cwd affects resume), the recovery primitives, and the extension surfaces. Model quality matters, but those five surfaces determine whether the model can be safely applied to real software work.
2.11 Takeaways
- Evaluate coding agents on a real long-running workflow — 50+ tool calls, hours of execution — rather than on static benchmark scores, because reliability over those runs is a harness property the benchmark cannot measure.
- Before choosing or configuring any coding agent, inspect five harness surfaces in order: semantic tool vocabulary, permission evaluation order, session persistence contract (including how cwd affects resume), recovery primitives, and extension surfaces.
- Use named tool primitives as control surfaces: give reviewer agents only Read, Glob, and Grep, scope Bash by command prefix such as Bash(npm test:*), and attach Write|Edit hooks when you need path-specific enforcement.
- Put non-negotiable blocks in executable policy, not prompt prose: use deny rules for anything that must survive bypassPermissions, and use PreToolUse hooks when you need an earlier programmatic intercept.
- If you host agents for multiple users, clear every isolation layer together: start from a clean config slate, disable auto-loaded memory, give each tenant a separate working directory or filesystem sandbox, and assume system policy still applies unless you explicitly control it.
- Treat conversation history and disk state as separate boundaries: a fresh or stateless session can still leave file mutations behind, so pair session resets with explicit rollback or workspace checks when you need a clean restart.
- For headless runs with no human callback, do not depend on an approval step that is not wired: either keep the session in plan mode until a human releases it, or enforce gates with PreToolUse hooks and deny rules, and fail closed with a structured needs-human-input result when neither path is available.