18  Models, Caching, and Cost Control

The cheapest workflow that delivers a verified result is the right one — model tier, cache structure, and context budget are the three knobs you actually turn, and cost attribution is the lens that lets you see which knob is jammed.

18.1 Model Choice and Economic Tradeoffs

Not all tasks deserve the same model. Running a frontier reasoning model for every lint fix burns money; sending architectural trade-offs to the cheapest model burns reviewer time. The durable principle is to route by task class, failure cost, and verification burden — and to make that routing a configuration concern rather than a habit hidden in your fingers.

Tier prices spread by an order of magnitude or more. Providers often quote model prices per million tokens, abbreviated MTok; Anthropic’s published pricing puts Opus input at $5/MTok against Haiku at $1/MTok, a 5x gap for what is roughly a 3-5x capability difference depending on task [315]. Haiku 4.5 reaches about 90% of Sonnet 4.5’s score on agentic coding evaluations at one-third the cost and roughly 4-5x the speed [316], while Sonnet 4.5 is the model you reach for when you need 30+ hours of coherent autonomous work [317]. Cross-vendor, the spread is larger: R1 paired with Sonnet as editor reached 64.0% on aider’s polyglot benchmark for $13.29 against o1 solo’s 61.7% for $186.50 — a 14x cost gap on a near-tied result [318]. Over a session of hundreds of tool calls, tier choice compounds.

Modern coding agents expose tier routing as a first-class surface, and the concrete pattern is stronger model for plan-mode reasoning, cheaper model for mechanical tool calls. In Aider, --model claude-haiku-4.5 flips the active model per session, and a project-level .aider.model.settings.yml pins defaults so a team gets reproducible cost behavior [319], [13]. OpenCode lets you set provider and model keys per workflow — useful in opencode.yml for a CI run that should never touch a frontier model [320], [321]. Cursor Max Mode opens 1M-token context on supported frontier models while keeping 200k as the default — Max Mode is a deliberate spend, not a free upgrade [322]. Cursor’s Plan Mode and Debug Mode are themselves natural model-swap checkpoints: scope the plan with a strong model, hand the mechanical edit pass to a cheaper one [322]. Copilot CLI deliberately routes different request types to different model families so you can ask for a “second opinion” and get genuinely different code rather than minor variations [323]. The pattern is the same across tools: model selection belongs in config, not in your head.
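
A minimal sketch of config-level routing, assuming an OpenCode-style YAML file; the key names are illustrative and should be checked against your tool version's schema:

    # opencode.yml: hypothetical key names; verify against your OpenCode schema
    provider: anthropic
    model: claude-haiku-4-5          # default tier: mechanical tool calls, CI runs
    agents:
      plan:
        model: claude-sonnet-4-5     # stronger tier reserved for plan-mode reasoning

Committing a file like this gives the whole team, and CI, the same cost behavior without relying on anyone remembering to switch models.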

The most useful decomposition is the architect/editor split. A stronger reasoning model proposes; a cheaper instruction-following model produces the diff. R1+Sonnet is the canonical example, but it is a general pattern — QwQ as architect with Qwen 2.5 Coder 32B as editor jumps from 42.1% solo to 73.6% paired, even though the pairing still trails Sonnet alone [324]. Reasoning models often plan well but fail edit-format compliance — o1-preview hits 79.7% with whole edits but only 75.2% with diff format, and OpenCode’s own model docs note that very few models are good at both code generation and tool calling [325], [320]. The durable lesson is not which vendor won this quarter’s leaderboard. It is that any single-model policy leaves money and reliability on the table when the workflow itself can be decomposed.
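
Aider wires this split in directly through its architect mode; the sketch below uses the documented --architect, --model, and --editor-model options in their .aider.conf.yml form, with illustrative model choices:

    # .aider.conf.yml: architect/editor split (model names illustrative)
    architect: true                            # stronger model proposes the plan
    model: deepseek/deepseek-reasoner          # R1 as architect
    editor-model: anthropic/claude-sonnet-4-5  # instruction-follower emits the diff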

Reasoning depth is a separate dial, and it is billed as output tokens. Modern providers expose model variants by reasoning budget — low/medium/high/max in OpenCode, the budget_tokens parameter in Aider’s .aider.model.settings.yml, the think/think hard/ultrathink trigger phrases in Claude Code [320], [319]. These are real cost levers. GPT-5 Nano with maximum reasoning can beat full GPT-5 with minimal reasoning on specific tasks despite higher token consumption; Grok 4 with thinking enabled effectively costs 3x Sonnet, and 15x without [326]. Crank the dial up when the problem has more than one plausible path or a previous attempt failed for non-obvious reasons. Leave it at default for routine edits. Do not run ultrathink reflexively — it burns tokens and masks spec ambiguity. Reasoning will not save a vague prompt.
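
A sketch of that dial in Aider's model settings, following the extra_params passthrough pattern from aider's documentation; the model name and budget value are illustrative:

    # .aider.model.settings.yml: per-model reasoning budget (values illustrative)
    - name: anthropic/claude-sonnet-4-5
      extra_params:
        thinking:
          type: enabled
          budget_tokens: 16000     # billed as output tokens; raise only deliberately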

The paradox that makes naive token tracking misleading is cheaper tokens, pricier workflows. Per-token prices have fallen since late 2023, but per-task token consumption has multiplied 10x to 100x as agent chains add planning, tool use, retrieval, memory, and revision [326]. A single OpenCode or Cline request can fan out into dozens of model calls as the agent scans the repo, proposes edits, runs commands, and revises [321]. And caching distorts what you see in two directions at once: cached sessions look deceptively cheap because read tokens are billed at a fraction of write tokens, while uncached agent chains look mysteriously expensive because nobody is counting tool-call round trips. If you compare raw token counts across sessions, an expensive uncached session and a cheap cached session can swap positions in your dashboard for reasons that have nothing to do with the work that got done.

This is why cost attribution is the observability layer underneath every other lever in this chapter, and three concrete attribution surfaces are worth knowing by name. Cline’s per-task cost display shows running spend as the task executes, and its “When You Pay” documentation ties each tool call and file read to a billing event, so the cost model becomes legible during the session rather than at the end of the month; the same docs frame the “New Task vs. Continue” decision as a cost call, not just a tidiness call. Cloudflare AI Gateway sits between the agent and the provider, recording every request with cost metadata and exposing custom reporting that groups by user, tag, model, provider, and time, with cached input tokens and cache-creation tokens tracked separately so the caching dividend is visible instead of hidden [327]. BYOK provider dashboards are the structural attribution path for OpenCode, Aider, and Cline — when you bring your own API key, the spend is yours to slice along whatever dimensions you tag [321]. Pick one layer per environment so you don’t double-count, and make sure that layer counts cached and write tokens separately. Without that, a “cheap” workflow may just be a workflow that hit a warm cache yesterday.

The unit that matters is cost per verified task. Raw token cost is the numerator; the denominator is whether the output ships without extensive rework. Imagine the same migration two ways. Path A uses a cheap model repeatedly: $6 in tokens across many retries, plus 45 minutes of careful repair because the diff keeps missing edge cases. Path B uses a stronger model once: $18 in tokens, 10 minutes of review, design and tests aligned with the spec. If reviewer time is the real bottleneck, Path B wins even though the model bill is higher. Estimate verification cost up front rather than discovering it after the agent has produced a large diff [317], [328]. And before you escalate tiers, check whether the harness — tool surface, compaction policy, subagent layout — is the actual bottleneck. LangChain pushed Terminal Bench 2.0 from 52.8% to 66.5% by changing only the harness; “the model” is almost never the right level of abstraction to debug at [329], [330].
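
The arithmetic is worth making explicit. A minimal sketch, assuming a hypothetical $120/hour reviewer rate, which is not a number from this chapter:

    # Cost per verified task for the two migration paths above.
    # The $120/hour reviewer rate is an assumption, not a number from the text.
    REVIEW_RATE_PER_MIN = 120 / 60   # $2 per minute, hypothetical

    def total_cost(token_cost: float, review_minutes: float) -> float:
        """Token spend plus the human verification time it demands."""
        return token_cost + review_minutes * REVIEW_RATE_PER_MIN

    path_a = total_cost(6.0, 45)    # cheap model, many retries, heavy repair
    path_b = total_cost(18.0, 10)   # stronger model once, quick review
    print(f"Path A: ${path_a:.0f}  Path B: ${path_b:.0f}")   # Path A: $96  Path B: $38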

Local and self-hosted models sit on this same axis. Air-gap topology, data-handling policy, and compliance scope belong to Chapter 16 and the security material. The cost decision is narrower: smaller open-source models are economical on high-volume, low-complexity work — code completion, test scaffolding, commit messages — but inference infrastructure matters as much as the model. Qwen3-235B reaches 65.3% on aider’s polyglot benchmark when self-hosted with vLLM at bfloat16, and 61.8% via the official API — a 3.5-point gap from serving config alone, with whole edit format producing 100% well-formed responses against 94.7% for diff format on the same model [331]. Local Ollama silently truncates context past the default 2k window, which is far too small for any serious agent workflow — Aider works around this by overriding num_ctx automatically, but the underlying hazard is real: silent context drops look like model failures [13], [319]. Verify performance against your own tasks before standardizing on a self-hosted tier. The most durable investment is a model-agnostic harness — when the model is a pluggable dependency, switching tiers becomes a config change rather than a workflow rebuild [332].

18.2 Prompt Caching and Token Reuse

Prompt caching is the highest-leverage cost lever in any sustained agent loop. The mechanism is simple: stable prefixes — system prompts, rules files, tool definitions, large reference docs — get hashed and reused across calls instead of being re-prefilled every turn. The economics are dramatic. Anthropic charges 10% of base input price for cache reads and a 25% premium for cache writes, with claimed cost reductions up to 90% and latency reductions up to 85% on long prompts [333]. OpenAI prices cached input at exactly 50% of uncached and reports up to 80% latency reduction; Redis benchmarks show up to 79% time-to-first-token improvement on 100k-token cached prefixes [334], [335]. For a coding agent that loads the same CLAUDE.md, the same MCP tool inventory, and the same project context on every tool-call round trip, this is the difference between routine multi-turn work and a runaway bill.

The two providers expose caching differently and the difference matters for how you structure prompts. OpenAI’s caching is automatic for any prompt at or above 1024 tokens, the cache stores the longest prefix and grows in 128-token increments on reuse, and you confirm hits via the cached_tokens field in the usage response; caches expire after 5-10 minutes of inactivity with a one-hour hard cap [334]. Anthropic’s caching is explicit — you annotate message blocks with cache_control: {type: ephemeral} to mark cache breakpoints, with a 5-minute default TTL and a longer persistent option [333]. Explicit cache management is the harder model to wield but the more controllable one — you decide where the boundary sits, you can see the cost of every change, and you can keep multi-turn conversations predictable in a way implicit caching cannot [330]. The breakeven math is friendly either way: with Anthropic’s 1.25x write and 0.1x read, a single cache hit within the 5-minute TTL more than pays back the write premium [315].
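
A minimal sketch of the explicit breakpoint, using the Anthropic Messages API's cache_control block; the model name is illustrative and error handling is omitted:

    # Explicit cache breakpoint via the Anthropic Messages API.
    import anthropic

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": open("CLAUDE.md").read(),        # stable rules file at the head
                "cache_control": {"type": "ephemeral"},   # breakpoint, 5-minute TTL
            },
        ],
        messages=[{"role": "user", "content": "Refactor the session-reset logic."}],
    )
    # First call warms the cache at 1.25x base input price; every hit inside the
    # TTL bills 0.1x instead of 1.0x, so one hit saves 0.9x against a 0.25x premium.
    print(response.usage.cache_creation_input_tokens,
          response.usage.cache_read_input_tokens)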

The single structural rule is head-stable, tail-volatile. Cache hits depend on exact prefix matches — a single token change anywhere in the prefix breaks the match from that point forward [335]. The order of operations is therefore non-negotiable: stable system prompt first, then the rules file (your CLAUDE.md or AGENTS.md), then tool definitions, then any pinned reference docs, then the per-turn user query and tool outputs at the tail. A bloated CLAUDE.md that mixes timestamps, session IDs, or rotating examples into the prefix systematically misses the cache. A lean CLAUDE.md placed at the head of the prompt and kept stable across turns hits the cache on every subsequent request. This is why disabling unused MCP servers in Claude Code matters cost-wise as well as quality-wise: every active MCP injects tool definitions into the system prompt, expands the cached prefix, and consumes context budget you cannot reclaim until you trim the manifest [336]. Use the /context command to see what’s actually in the prefix before you trust your mental model of it.

Cache-hit visibility is your feedback loop. The cached_tokens field in OpenAI usage responses and Anthropic’s separate cache-creation versus cache-read counters tell you whether your prefix is actually being reused [334], [333]. Cloudflare AI Gateway tracks both numbers separately in its custom reporting so you can chart hit rate over a week and tie it to cost [327]. If your hit rate is low, check three things in order. First, prefix mutation: a tool definition that was reordered, a rule file with a dynamic timestamp, a system prompt template that interpolates the current date. Second, TTL expiry: sessions that pause for more than five minutes against the default Anthropic ephemeral cache miss completely on resume. Third, request volume: OpenAI documents that sustained traffic above roughly 15 requests per minute against the same prefix and cache-key combination can overflow per-prefix capacity, cutting the effective hit rate when concurrency runs hot [337].
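
A sketch of the feedback loop on the OpenAI side, reading the documented cached_tokens field; the model choice is illustrative:

    # Confirm the stable prefix is actually being reused.
    from openai import OpenAI

    STABLE_PREFIX = open("AGENTS.md").read()   # assumed byte-identical across turns

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",                        # illustrative
        messages=[
            {"role": "system", "content": STABLE_PREFIX},
            {"role": "user", "content": "Run the next step."},
        ],
    )
    cached = resp.usage.prompt_tokens_details.cached_tokens
    print(f"{cached} of {resp.usage.prompt_tokens} prompt tokens read from cache")
    # Near-zero on turn two or later points at a mutated prefix, an expired TTL,
    # or per-prefix overflow under hot concurrency.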

Mid-session model switches are the silent cache killer. Every prompt cache is keyed to a specific model; flipping from Sonnet to Haiku mid-task forces a full prefix rebuild and re-bills the entire context at write rates. Tier-routing scripts that swap models within a session deserve an explicit pause-point that resets context cleanly; otherwise the savings from dropping to a cheaper tier are wiped out by the cache rebuild on the way down. The same hazard hits autocompaction loops — Claude Code burned roughly 250,000 API calls per day globally on consecutive autocompaction failures before the team capped failures at three per session, a real-world reminder that any feature that rewrites the prefix mid-loop is a cost incident waiting to happen [338]. When you must switch models, switch at task boundaries, not in the middle of a tool-call chain.

Caching also interacts with reasoning budget in ways that surprise practitioners. A large cached prefix is cheap to read but it still occupies the context window — and it still narrows the budget left for dynamic content, including the model’s own reasoning tokens. If you cache a 50k-token system prompt and then enable a high reasoning budget via budget_tokens in .aider.model.settings.yml, you can hit the context ceiling on output before the agent finishes its plan. Claude Code’s leaked source shows the production response: an explicit SYSTEM_PROMPT_DYNAMIC_BOUNDARY that splits the prompt into a cached, organization-wide stable section and a session-specific dynamic tail, with DANGEROUS_uncachedSystemPromptSection markers that warn engineers when a change will move content out of the cached region [339], [340]. The naming is intentional: cache boundary placement is a load-bearing decision that benefits from being annotated where it lives.
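
A back-of-envelope check makes the interaction concrete; every number below is illustrative:

    # Cached prefix is cheap to read but still occupies the window.
    CONTEXT_WINDOW = 200_000
    cached_prefix = 50_000      # system prompt + rules file + tool definitions
    history = 80_000            # conversation carried forward this turn
    budget_tokens = 60_000      # extended-thinking allowance (billed as output)
    max_output = 16_000

    remaining = CONTEXT_WINDOW - cached_prefix - history
    overrun = budget_tokens + max_output - remaining
    if overrun > 0:
        print(f"over budget by {overrun:,} tokens: trim the prefix or lower "
              "budget_tokens before the plan truncates")   # over budget by 6,000 tokens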

Subagents are the place where caching pays off most. In Claude Code’s architecture, spawned subagents inherit the parent context as byte-identical copies via prompt caching, which makes parallel subagent execution barely more expensive than running a single agent — five subagents can run simultaneously at roughly the cost of one because the shared prefix is read-priced rather than write-priced [339]. This inverts a common cost intuition. If you suspect parallel agents are expensive, the question is not whether to fan out but whether the harness shares cached context across the fan-out. Without that sharing, a multi-agent system is the worst case: every subagent re-bills the full prefix, every coordination round-trip re-bills the conversation history, and the bill grows quadratically. The 10x token cost reported for parallel agent systems is a real pattern when the harness isn’t sharing context — but it is the harness’s failure to cache, not parallelism itself, that drives the multiplier [341].
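
The pricing asymmetry is easy to quantify. A sketch using Anthropic's published multipliers (1.25x write, 0.1x read) and an illustrative base price:

    # One cached write plus N reads versus N cold writes.
    PREFIX_TOKENS = 100_000
    BASE_PRICE = 3.00 / 1_000_000   # $/token, illustrative input price
    N = 5                           # parallel subagents

    shared   = PREFIX_TOKENS * BASE_PRICE * (1.25 + 0.1 * N)   # write once, N reads
    unshared = PREFIX_TOKENS * BASE_PRICE * 1.25 * N           # every agent rebuilds

    print(f"shared: ${shared:.3f}  unshared: ${unshared:.3f}")
    # shared: $0.525  unshared: $1.875; and the unshared gap widens every turn,
    # because each agent also re-bills its own copy of the conversation history.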

Tool definitions are the other quiet expense. Anthropic’s pricing notes that bash tool use alone adds 245 fixed input tokens per request, with system prompt overhead of 313-346 tokens for Opus and Sonnet just from enabling tool use, plus the tokens for each tool definition and every tool_use/tool_result block [315]. A coding agent with twelve MCP tools loaded but only two in active use is paying for ten tool definitions on every turn. Vercel’s d0 team rebuilt their agent from a 15-tool architecture down to a single bash-execution tool and improved success rate from 80% to 100% while reducing token count and latency, because modern models handle raw context better than they handle pre-curated retrieval layers [342]. Two lessons land together: trim what’s in the cached prefix, and prefer fewer, more powerful tools over many narrow ones.

18.3 KV Cache and Context Economics

Prompt caching is what providers expose. The KV cache is what makes it work. Under the hood, every transformer inference computes key and value matrices for each token in the prompt; reusing those matrices across requests is what skips the prefill cost. OpenAI’s documented cache routing uses a hash of approximately the first 256 tokens to assign requests to machines with matching prefixes, which is why prefix stability matters at the byte level and why the order of your CLAUDE.md headers can change your bill [337]. KV cache reuse is also why output tokens stay full-priced even on a cache hit — you are skipping prefill computation, not generation [335]. The transferable lesson for practitioners is that long-context economics are dominated by prefill, and prefill is dominated by what’s in the prefix.

This is why context window discipline is a cost lever, not just a quality lever. Every token sent to the model competes for the model’s attention and gets billed as input on every subsequent turn. Each turn of an agent loop carries the full conversation history forward — a 200k-token context filled with stale tool outputs, abandoned exploration paths, and irrelevant files costs the same to send whether or not those bytes contribute to the answer [343]. Active pruning routinely cuts 30-70% off per-turn spend on large codebases, and it carries a caching dividend: a stable, predictable prefix enables higher cache hit rates, so context discipline pays twice. Production agents implement compaction as a first-class internal mechanism — Claude Code’s three-layer memory system keeps an Index always loaded, Topic Files loaded on demand, and Transcripts only grepped, never injected directly — precisely because unmanaged growth degrades both cost and output quality [339], [344].
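
A toy loop shows why unpruned history compounds; the per-turn numbers are illustrative:

    # Each turn re-sends the whole history, so cumulative billed input grows
    # roughly quadratically with turn count. All numbers illustrative.
    prefix = 20_000        # stable cached head: system prompt, rules, tools
    per_turn = 4_000       # new tool output + user content each turn

    unpruned = 0
    history = 0
    for turn in range(30):
        unpruned += prefix + history + per_turn   # billed input this turn
        history += per_turn                       # carried into the next turn

    pruned = 30 * (prefix + 5 * per_turn + per_turn)  # history capped at ~5 turns

    print(f"30 turns, no pruning:  {unpruned:,} input tokens")   # 2,460,000
    print(f"30 turns, capped at 5: {pruned:,} input tokens")     # 1,320,000
    # A ~46% cut, squarely inside the 30-70% range quoted above.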

The practitioner-facing controls for context budget are concrete. Aider exposes /add and /drop commands that force you to reason about what’s in context every session — the explicit add/drop model is the most disciplined surface in the open-source ecosystem and it pairs with /read-only to prevent unintended refactoring of files that should only be referenced [319]. Cline ships a context-window indicator and documents /newtask as the explicit reset point, framing the decision as “New Task vs. Continue” with cost-aware guidance. OpenCode pairs --fork to branch a session at a checkpoint with --continue to resume an existing session without restart, giving you a complete session lifecycle model [321], [345]. Server-side, agent-friendly content negotiation — having servers return clean markdown when the agent requests it via Accept headers instead of full HTML with navigation, scripts, and tracking — cuts the token cost of a single page fetch dramatically [346]. All these surfaces share the same idea: treat context as a finite, actively managed budget rather than a passive accumulation of everything the agent has touched.
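
A sketch of the client side of that negotiation; the endpoint is hypothetical, and the server must actually honor the Accept header:

    # Ask the server for markdown instead of full HTML.
    import requests

    resp = requests.get(
        "https://docs.example.com/guide",          # hypothetical endpoint
        headers={"Accept": "text/markdown"},
    )
    print(resp.headers.get("Content-Type"), f"{len(resp.text):,} chars")
    # A markdown body omits navigation, scripts, and tracking, so the token
    # cost of injecting the page into context shrinks to roughly the content.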

Session lifecycle is the other side of the same coin. Long sessions accumulate context that is billed on every subsequent turn, so per-turn cost climbs even when the agent is doing the same kind of work. The escalation curve is steep enough that /newtask-style resets are themselves a cost lever, not a tidiness preference — Cline’s documentation makes this explicit with “When You Pay” sections that connect each tool call and file read to a billing event, so the decision to reset becomes a decision about future spend. Fork before risky exploration so you have a checkpoint to return to without replaying. Continue when you’ve built up precise understanding the next step needs. Reset when context noise is steering the agent toward repeated mistakes or when the next subtask shares little with what’s accumulated. The trigger signals are observable: per-turn token count climbing across successive turns, the agent re-reading files it already processed, hallucinated details that weren’t in the original spec [344], [347]. Without persistent task management between sessions, parallel agents lose context on restart and rebuild state from scratch, generating roughly 10x token cost for the same unit of work [341].

Multi-agent burn rate is the scaling trap that lives at this junction. Five agents in parallel can multiply per-developer model spend by an order of magnitude, but the more dangerous failure is outrunning your team’s ability to verify what was produced [341]. Without shared cached context across subagents, each agent re-bills the full prefix on every turn. Without strict guardrails, autonomous agents will merge PRs without human review — that is a cost incident as well as a safety one. Cap parallelism by review capacity and blast radius, not by how many agent windows your tooling can open. Long-horizon autonomous models like GLM-5.1 are designed for sustained closed-loop iteration — measure, identify bottleneck, optimize, repeat — but the operator’s job is to bound the loop, not to admire its endurance [348].

Enterprise attribution closes the loop on team-scale usage. The detailed governance framing — access control, MCP tool scoping, privilege containment, audit trails — belongs to Chapter 16 and the security chapters, but the cost-attribution slice lives here, and the mechanics are practical enough to write down. Per-team API keys keep spend separable at the provider so finance can see “team X spent $Y this month” without any in-app instrumentation. Proxy-injected attribution headers tag requests with team, project, or feature dimensions before they reach the model, letting a single shared key still produce per-project breakdowns. Per-session token tracking in OpenCode and per-task cost displays in Cline give individual practitioners the same feedback at the leaf level so they can self-correct mid-task. Per-team budget caps with hard-stop or soft-warn behavior turn attribution into enforcement — a soft warn at 75% of monthly budget pairs well with a hard stop at 100% for non-critical projects. Exporting token usage to the org’s cost-allocation system is the last mile: gateways like Cloudflare AI Gateway record costs per request and group them on any tagged dimension without modifying agent code, and the same data feeds the chargeback ledger [327]. Make spend visible at the level of a real decision unit — a task, a feature, a sprint, a team — and AI engineering stops being a black-box line item.
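
A sketch of proxy-injected attribution through a gateway, following Cloudflare AI Gateway's custom-metadata header convention; the header name, account path, and tag values should all be verified against current docs:

    # Tag requests with attribution dimensions at the gateway layer.
    import json
    import anthropic

    client = anthropic.Anthropic(
        base_url="https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/my-gateway/anthropic",
        default_headers={
            "cf-aig-metadata": json.dumps({"team": "payments", "project": "migration-q3"}),
        },
    )
    # Every request now carries team/project tags; the gateway's custom reporting
    # can group spend on those dimensions without touching agent code.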

A simple weekly spend table is enough to make the data useful. One row per workflow, six columns: request tag, model or routing policy, total cost, average tokens per task, review minutes, and shipped/not-shipped. If feature-implementation is cheap in tokens but expensive in review minutes, the fix is usually better specs or a stronger model. If architecture-debugging is expensive in both tokens and rework, narrow the task or keep more of it human-led. If mechanical-fixes shows mid-tier model usage, route those to Haiku and recover the spend immediately. The decision rule is not accounting trivia; it is having enough visibility to connect spend to verified outcomes and to choose the cheapest workflow that minimizes token cost + review time + retry churn for a verified result. Pricing models are also shifting under you — pay-per-token discourages iteration because users ration queries to avoid surprise bills, while outcome-based pricing aligns vendor incentives with shipped work [349]. Whichever pricing model your tools use, attribution is the only honest way to know whether the spend is buying you leverage or just buying you tokens.
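
Filled in with hypothetical numbers, such a table might read:

    request tag         model/policy        cost   avg tok/task  review min  shipped
    feature-impl        sonnet-4.5          $42    85k           190         7/9
    mechanical-fixes    sonnet-4.5          $11    12k           15          14/14
    arch-debugging      opus + ultrathink   $38    140k          220         1/3

Here the mechanical-fixes row is the immediate win: a 14/14 ship rate on trivial work does not need a mid-tier model, so rerouting it to Haiku recovers most of that spend with no review risk, while the 1/3 ship rate on arch-debugging argues for narrowing the task rather than raising the tier.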

Cost is only one reason to leave vendor defaults. When the attribution table says cloud tokens are the problem, optimize routing and caching first. When privacy, compliance, air-gap topology, data residency, or provider lock-in is the problem, the answer moves into BYOK, local, or self-hosted deployment. Chapter 19 picks up at that boundary.

18.4 Takeaways

  • Use the architect/editor split: route the planning or scoping step to a stronger reasoning model and the diff-production step to a cheaper instruction-following model, because no single-model policy optimizes both plan quality and cost simultaneously.
  • Raise the reasoning budget (think hard, ultrathink, or budget_tokens) only when a problem has more than one plausible path or a previous attempt failed for non-obvious reasons; leave it at default for routine edits and never run it to compensate for a vague prompt.
  • Before escalating to a more expensive model tier, audit the harness — tool surface, compaction policy, subagent layout — because harness changes routinely outperform model upgrades; only after ruling out harness bottlenecks should you treat the model as the variable.
  • Choose the workflow that minimizes token cost plus review time plus retry churn for a verified result; raw token spend is only useful when tied to shipped, reviewed outcomes.
  • Structure prompts with the stable system prompt, rules file, tool definitions, and pinned reference docs at the head and per-turn queries at the tail — and keep the head byte-identical across turns — because a single token change anywhere in the prefix breaks the cache match from that point forward.
  • Switch models only at task boundaries, because prompt caches are model-specific and a mid-task model swap rebuilds the full prefix at write rates.
  • Before scaling to multiple parallel agents, verify that the harness shares cached context across subagents; otherwise each agent re-bills the full prefix and coordination history, turning parallelism into a quadratic cost spike.
  • Treat session resets (/newtask), forks, and active context pruning (/drop, /add) as explicit cost levers rather than housekeeping: reset when per-turn token count climbs across successive turns, when the agent re-reads files it already processed, or when hallucinated details appear that weren’t in the original spec.