17 Diagnosing Agent Workflow Failures
When a coding-agent run produces a broken change, the question isn’t “did the model fail?” — it is which layer failed, because the fix for context rot — the stale, partial, or misleading working context defined in Chapter 5 — is nothing like the fix for a missing harness approval gate.
17.1 Common Failure Modes
The most expensive agent failures don’t look like failures. They look like polished pull requests with green CI that ship a query scanning every row, retry logic that triggers thundering herds, or a cache with no TTL that crashes Redis the first time real traffic hits it [299]. The polish itself is the trap — output reads more flawless than human code while staying blind to production realities the agent never saw, and the data backs up the cost. Sixty-six percent of developers cite “AI solutions that are almost right, but not quite” as their top frustration, and code churn — code added then deleted within two weeks — has roughly doubled since 2021 [300], [301]. If you have to re-read your own PR to explain how it touches production, the workflow has already failed [299].
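The thundering-herd retry is a concrete example of how polished-looking code hides the problem. The sketch below is illustrative only, not drawn from any cited incident: the first function is the kind of clean, green-CI code an agent ships, and the second adds the jittered exponential backoff that keeps a fleet of clients from hammering a recovering dependency in lockstep.
import random
import time

def fetch_with_retry_naive(call, attempts=5):
    """Reads as polished code and passes review, but every client retries on
    the same fixed schedule, so an outage ends in a synchronized stampede."""
    for _ in range(attempts):
        try:
            return call()
        except ConnectionError:
            time.sleep(1)  # identical delay across every caller
    raise RuntimeError("service unavailable after retries")

def fetch_with_retry_jittered(call, attempts=5, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: delays spread out across callers,
    so a recovering service sees a trickle instead of a thundering herd."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError("service unavailable after retries")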
Plausibility without correctness traps you next. LLMs are inferrers, not compilers — they generate text that sounds right because it pattern-matches the training distribution, with no guarantee of being correct [302], [301]. The same statistical mechanism that produces creative solutions produces hallucinations, and no amount of context engineering eliminates that baseline risk [301]. In practice this surfaces as deprecated imports the agent confidently chooses (javax.persistence instead of jakarta.persistence), unrequested features that quietly expand scope, and “all tests pass!” declarations made while tests are actually failing [303].
Silent proceed through ambiguity is the failure class where the agent moves forward confidently through an underspecified or contradictory task instead of asking. This happens when the harness has no reachable pause tool, when an approval callback has been stripped out, or when the agent simply hasn’t been taught to escalate. The behavioral signature is blunt: agents “habitually generate unrequested features and fill requirement gaps with assumptions,” producing cascading failures that look like whac-a-mole [303]. The fix is architectural, not motivational. As Chapter 2 established, pause points have to be reachable; if your permission mode silently disables them, the human-in-the-loop guarantee disappears without warning, and the policy question of which decisions deserve a pause belongs to Chapter 16.
The fourth class is environment and security: the agent does something it shouldn’t have been allowed to do, because the sandbox or scope let it. LLMs are trained to generate code that runs, not code that’s safe — and the same anti-patterns appear across models, frameworks, and prompts. v0 alone blocked over 17,000 deployments in a single month from secrets exposed via NEXT_PUBLIC_ variables, with another thousand-plus developers per month nearly leaking Supabase or model API keys before guardrails caught them [304]. YOLO mode (--dangerously-skip-permissions) compounds the risk: if the agent has access to private data and is exposed to untrusted content from files or web pages, prompt injection lets attackers exfiltrate through tool calls or even rendered markdown image tags [305], [306]. The GitLab Duo incident — where an agent rendered an attacker-supplied markdown image whose URL embedded private data — is the canonical case of “the model didn’t make a network request, but the browser did” [306].
Long-running workflows fail differently — through context drift and runaway cost. Agents become increasingly unreliable as session length grows; multi-agent subtasking with separate context windows outperforms single long sessions even on the same task [303]. The diagnosable signs of context drift are concrete: the agent re-reads files it already processed, ignores constraints it acknowledged earlier in the session, or proposes solutions that contradict decisions made twenty turns ago. When each iteration takes 20+ minutes, you start batching changes — and then you can’t tell which modification moved the needle [307].
17.2 Separating Failure Classes
Before you blame the model, run four questions in order: did the agent have the right context, was a pause point reachable, did the surrounding workflow gate the output, and was the environment scoped? This sequence matches Fowler’s three-factor risk model — probability of error, impact if wrong, and detectability through tests and types — and it maps remediation to the layer that actually broke [308], [301]. The diagnostic move is sequential, not parallel; only after all four come back clean do you reach for a different model.
Before going further, draw a bright line between this loop and ordinary software debugging. If the bug reproduces in a fresh session with a clear, well-scoped spec and the harness gates intact (the agent had the right context, pause points were reachable, the workflow gates ran, the environment was scoped), and the logs, stack traces, or failing tests point at application logic rather than at any of those four layers, leave workflow diagnosis behind. You’re looking at an ordinary defect: read the trace, write the failing test, fix the code. Reach for the four questions only when the failure won’t reproduce cleanly under those conditions, or when the symptom is shaped by how the agent worked rather than by what the code does.
Most agent failures answer the first question wrong. Context failures, not model failures, drive the bulk of remediation work [83]. That’s the single most useful diagnostic frame you can adopt, because it inverts where you spend effort. When an agent produces a generic, unhelpful, or wrong answer, the temptation is to swap the model or rewrite the prompt. But the difference between a “cheap demo” agent and a “magical” one is usually the quality of context delivered to it — the right instructions, conversation state, retrieved data, tools, and output schemas at the moment of execution [83]. Context problems split into two diagnosable shapes. Missing or noisy context means the agent never saw the right files, or it saw too many irrelevant ones — Ghostwriter Chat’s team learned this the hard way and ended up explicitly filtering virtual environments, CI configs, and build artifacts out of the prompt while prioritizing recently-opened files [309]. Context drift means the right information was there once, but accumulated turns have buried or contradicted it; the signature is behavioral — re-reading already-processed files, ignoring acknowledged constraints, contradictory solutions across turns. Drift calls for a session reset; missing context calls for a context refresh, not a restart.
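The missing-or-noisy-context fix is easiest to picture as a filter over candidate files. The sketch below is a generic illustration of that idea, not Ghostwriter’s actual implementation: it drops virtual environments, CI configs, and build artifacts, then orders what survives by recency as a cheap proxy for recently-opened files.
from pathlib import Path

# Path fragments that add noise rather than signal; extend per project.
NOISE_PARTS = {".venv", "venv", "node_modules", "dist", "build",
               "__pycache__", ".github", ".gitlab-ci.yml"}

def candidate_context_files(repo_root: str, limit: int = 20) -> list[Path]:
    """Filter out noise directories and artifacts, then prioritize
    recently modified files for inclusion in the agent's context."""
    keep = [p for p in Path(repo_root).rglob("*")
            if p.is_file() and not (NOISE_PARTS & set(p.parts))]
    keep.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return keep[:limit]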
When the second question — was a pause point reachable — comes back wrong, that’s the silent-proceed pattern from the previous section, and it almost always traces to the harness rather than the model. The third question reveals workflow failures: the run declared success while tests failed because nothing in the loop verified test results [303]; the PR shipped a thundering-herd retry because review was rubber-stamped [299]; confirmation bias took over because the developer only re-tested the failing scenario locally without a baseline, so they couldn’t tell whether the fix moved the needle or hid the problem [307]. Workflow failures are diagnosed by tracing the gates the change passed through: was there an eval, a trajectory review, a pre-established success criterion, a second agent or human reviewing the diff? A useful diagnostic review move is to ask the reviewer (human or agent) to restate the change’s production impact in their own words before approving — if they can’t, the review gate didn’t run, it just signed off. If the answer to any gate is no, the fix isn’t more prompting — it’s adding the missing gate.
The fourth question catches the easiest class to misdiagnose. The run “broke production” because its harness allowed a destructive migration against a database it should never have been able to reach. Replit’s response was platform-level: dev/prod database separation is now enforced automatically, because agents cannot be trusted to recommend their own safety features unprompted [310]. Tools must also be scoped to the authority of the caller — if the model can set a tenantId parameter, it can read across tenants; closure-bind that parameter at tool creation instead [306]. When you see a run doing something it should not have been allowed to do, the failure is in the sandbox, the permission scope, or the tool surface — not in the model’s judgment.
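Closure-binding is easiest to see in code. The sketch below is a generic illustration rather than any specific framework’s tool API, and db_query stands in for your data-access layer: the unsafe tool exposes tenant_id as a parameter the model can fill in, while the safe factory captures the caller’s tenant at tool-creation time so the model never sees that knob.
def db_query(table: str, **filters) -> list[dict]:
    """Placeholder for your data-access layer."""
    ...

def query_orders_unsafe(tenant_id: str, status: str) -> list[dict]:
    """Anti-pattern: tenant_id is model-settable, so a confused or
    prompt-injected agent can read across tenants."""
    return db_query("orders", tenant=tenant_id, status=status)

def make_query_orders_tool(caller_tenant_id: str):
    """Closure-bind the tenant: the returned tool exposes only `status`
    to the model; the tenant comes from the authenticated caller."""
    def query_orders(status: str) -> list[dict]:
        return db_query("orders", tenant=caller_tenant_id, status=status)
    return query_orders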
When environment or security is the suspected layer, run a triage checklist before you re-prompt or re-run anything:
- Target environment. Was the agent pointed at dev, staging, or production? Confirm from configuration, not from what the agent said it was doing.
- Network and file scope. Which hosts, ports, and filesystem paths did the sandbox actually allow this run? Compare against what the change required.
- Secret exposure surfaces. Were API keys, tokens, or NEXT_PUBLIC_*-style env vars reachable from the prompt, the repo, or the rendered output? Check both inputs and artifacts.
- Tenant binding. Are tenant, user, or org IDs closure-bound at tool creation, or are they parameters the model can set? Model-settable identifiers are the cross-tenant failure shape [306].
- Sandbox and approval mode. Is bypass mode (or the equivalent YOLO flag) on for this session, and did it propagate to any spawned subagents?
- Recent tool-call history. Which destructive, network-egress, or cross-boundary calls preceded the incident, and did any of them touch attacker-influenced content (issues, web pages, untrusted files)?
The operator moves follow directly from whichever line the checklist lights up: revoke or rotate the exposed secret, narrow the sandbox or filesystem scope, re-bind the tool to a server-controlled identifier, switch the session out of bypass mode, or roll back the environment to the last good checkpoint before re-engaging the agent.
Model problems are the residual category, and they’re rarer than they feel. A real model problem looks like: you’ve supplied concrete code examples (because LLMs select deprecated libraries without them [303]), you’ve isolated the task in a fresh context window, you’ve validated the spec is unambiguous, and the output still fails in the same shape across multiple runs. At that point you’re hitting an inferrer’s irreducible non-determinism, and the answer is either to constrain the task further, decompose it into smaller subagent calls with pristine context, or accept that this category of work needs more human authoring. Subagent decomposition is a recovery tactic here; the broader question of how to structure multi-agent work belongs to Chapter 15.
17.3 Observability and Recovery Signals
Before you can diagnose, you have to see. The single most underused observability surface is the agent’s trajectory — what it actually ran, in order, with output. Devin’s shell command history gives you every executed command with its output preview in one place, so you can jump to where the agent’s reasoning diverged from yours [311]. Most harnesses ship some equivalent today: Claude Code persists session transcripts on disk that you can scrub through, IDE-integrated agents log task history per workspace, and CLI tools usually emit a session ID you can replay. What to look for in any of them is the same — the first turn where the agent stopped quoting your spec verbatim and started paraphrasing it, the first tool call that returned an error the agent then ignored, and the moment the file list it was working from drifted from the file list you cared about. Replit’s live-build sessions show the discipline at a smaller scale: when something breaks, examine console logs, query the database directly, and write a clear problem description before re-prompting — not a vague “this isn’t working” [312]. If your tooling doesn’t expose a step-by-step trajectory you can scrub through, you’re flying blind, and adding that surface is a higher-leverage fix than tuning prompts.
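When the harness writes a structured transcript, that scan can be mechanical. The sketch below assumes a hypothetical JSONL transcript in which each turn carries a role, an optional tool_result with an is_error flag, and a files_read list; swap in whatever field names your tool actually emits.
import json
from pathlib import Path

def load_turns(transcript_path: str) -> list[dict]:
    """Hypothetical JSONL transcript: one JSON object per turn."""
    lines = Path(transcript_path).read_text().splitlines()
    return [json.loads(line) for line in lines if line.strip()]

def first_ignored_tool_error(turns: list[dict]):
    """First turn where a tool call errored and the next assistant turn
    carried on without acknowledging it (field names are assumptions)."""
    for i, turn in enumerate(turns):
        result = turn.get("tool_result")
        if result and result.get("is_error"):
            reply = next((t for t in turns[i + 1:] if t.get("role") == "assistant"), None)
            if reply and "error" not in reply.get("text", "").lower():
                return i, str(result.get("content", ""))[:200]
    return None

def file_list_drift(turns: list[dict], intended_files: set[str]) -> set[str]:
    """Files the agent actually read that were never in the spec's working set."""
    touched = {f for t in turns for f in t.get("files_read", [])}
    return touched - intended_files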
For production agents, trajectory observability has to graduate into structured evals. AWS DevOps Agent’s team treats every fix as a test-driven change: Given-When-Then test cases with LLM-Judge rubrics that require both correct output and correct reasoning, baseline metrics established before any change, and pre-selected scenarios so the team can tell whether a tweak helped overall or just papered over the loud failure [307]. Husain’s evals-skills toolkit pushes the same idea down to individual practitioners: an eval-audit skill inspects your current eval setup, runs diagnostic checks across six areas, and produces a prioritized list of problems — and major eval vendors now expose MCP servers so the agent itself can query traces and run experiments [313]. Generic “hallucination scores” hide the distinction between a factual error and a fabricated action; categorize failure types or you’ll fix the wrong one [313].
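The shape of such an eval case can be as small as a record type. The sketch below is a generic illustration of the Given-When-Then-plus-failure-category idea, not the AWS team’s or any vendor’s schema; the category names are examples you would replace with your own taxonomy.
from dataclasses import dataclass, field
from enum import Enum

class FailureType(Enum):
    FACTUAL_ERROR = "factual_error"          # wrong claim about the codebase or world
    FABRICATED_ACTION = "fabricated_action"  # claimed a step it never ran
    MISSED_CONSTRAINT = "missed_constraint"
    SCOPE_CREEP = "scope_creep"

@dataclass
class EvalCase:
    given: str                 # fixture: repo state, config, the failing scenario
    when: str                  # the task handed to the agent
    then: str                  # expected outcome, written before the run
    judge_rubric: str          # what the LLM judge checks: correct output AND reasoning
    baseline_score: float | None = None      # established before any change
    observed_failures: list[FailureType] = field(default_factory=list)

# Categorizing failures per case is what tells you whether a tweak helped
# overall or only silenced the loudest reported failure.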
The recovery signal that matters most for long sessions is the session boundary itself. Treat sessions as bounded units with explicit lifecycle transitions: reset to bound cost and clear noise, fork to preserve a checkpoint before risky exploration, resume to re-attach to an established context. Captured session IDs and forked checkpoints are also what make diagnosis measurable: if you can re-load the same failing trajectory from a saved session and run a reset variant against a forked baseline, you can tell whether the change actually moved the needle or just shuffled the noise. Without those handles, every intervention is a confounded experiment. Session-control surfaces vary by tool — there is no portable CLI for this — so the snippet below is pseudocode meant to show the shape of the moves, not a command you can paste:
# Pseudocode — session-control surfaces vary by tool.
# Documented exemplars include OpenCode's --fork and --continue flags
# and Cline's /newtask command for a fresh-context restart inside an
# existing project. Claude Code persists session transcripts on disk
# that you can replay by ID. Check your harness's docs for the real
# invocations.
# 1. Capture the session identifier when you start, however your tool exposes it.
start "begin migration plan" # note the session id from the output
# 2. Fork before a speculative experiment you might want to throw away.
fork <session-id> "try the streaming variant"
# 3. Continue / resume the parent branch if the experiment turns ugly.
continue <session-id>
Treat a fresh-context restart command (Cline’s /newtask, or whatever your CLI calls the equivalent) as a cost-management decision, not just a UX gesture: triggering it deliberately is how you cap accumulated tokens before they degrade output. Match the action to the diagnosis: context drift means reset (the accumulated context is causing bad behavior), while a genuine progress interruption means resume (the session has valuable context worth restoring). The prerequisite for the resume path is operational — capture the session ID from your agent’s process output the moment it starts, or the option doesn’t exist when you need it.
Trigger signals for these transitions are concrete enough to put in a checklist. Reset when per-turn token count is visibly growing across successive turns, when the cost display crosses a threshold and the primary task isn’t yet complete, when the agent starts re-reading files it already processed, or when you’re pivoting to a subtask that shares little context with what’s accumulated. Fork when you’re about to attempt a speculative change you might want to throw away — but externalize key findings into files first, because what’s only in the parent session’s implicit context won’t make it across the branch. Don’t reset reflexively: tearing down a session that has built up precise understanding of a complex system, only to have the agent re-explore the same territory, is its own waste.
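If your harness exposes per-turn token counts and running cost, those triggers reduce to a small heuristic; the thresholds and field names below are assumptions, not any specific tool’s API.
def should_reset(turn_tokens: list[int], running_cost_usd: float,
                 cost_threshold_usd: float, files_reread: int) -> bool:
    """Heuristic reset trigger: visibly growing per-turn tokens, a breached
    cost threshold, or repeated re-reads of already-processed files."""
    growing = len(turn_tokens) >= 3 and turn_tokens[-1] > 1.5 * turn_tokens[-3]
    over_budget = running_cost_usd > cost_threshold_usd
    thrashing = files_reread >= 2
    return growing or over_budget or thrashing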
Pause points are the recovery surface for the next failure, not the current one — and they only protect you if they’re actually reachable. The diagnostic check for an incident in hand is minimal: confirm the agent has a pause/ask tool available in its tool list, confirm an approval callback or equivalent runtime gate is registered, and confirm the session is not running in a bypass mode that propagates silently to subagents. If any of those three is missing, the human-in-the-loop guarantee was gone before the run started, regardless of what your prompt said. The architecture of approval modes, escalation rules, and autonomy budgets — which decision classes deserve a pause point in the first place, and how to wire the SDK-level callbacks — is policy design, not failure diagnosis, and lives in Chapter 16.
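Those three checks are cheap enough to run before every autonomous launch. The sketch below is a hypothetical preflight helper; the tool names, the callback flag, and the bypass and propagation flags stand in for whatever your harness actually exposes.
def preflight_pause_points(tool_names: set[str],
                           approval_callback_registered: bool,
                           bypass_mode: bool,
                           subagents_inherit_mode: bool) -> list[str]:
    """Return the missing human-in-the-loop guarantees; empty means safe to launch."""
    problems = []
    if not ({"pause", "ask_user", "request_approval"} & tool_names):
        problems.append("no reachable pause/ask tool in the agent's tool list")
    if not approval_callback_registered:
        problems.append("no approval callback or runtime gate registered")
    if bypass_mode:
        problems.append("session is running in bypass mode")
        if subagents_inherit_mode:
            problems.append("bypass mode will propagate to spawned subagents")
    return problems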
For autonomous and semi-autonomous workflows, observability has to extend into the runtime environment. Replit’s checkpoint-and-rollback system captures complete environment state — code, workspace, conversation context, and database — so recovery from a destructive agent action is a single operation rather than a forensic exercise [310]. At scale, the only viable defense is infrastructure-level guardrails such as canary deployments with automatic rollback and continuous validation, because human review can’t catch every thundering-herd retry or TTL-less cache [299]. Production samples remain irreplaceable — they reveal failure modes evals haven’t yet covered, and early-adopter partnerships compress the loop between “the agent shipped this” and “we know whether it’s actually working.”
The escalation path is the last piece. Before restarting, escalating, or changing strategy, run the diagnostic in order: inspect the trajectory (what did the agent actually do?), classify the failure (context, model, workflow, environment?), check for context drift signals (reset) versus progress interruptions (resume), and verify pause points are reachable. The “10x employee” framing — one operator orchestrating agents to deliver what a five-person team would [314] — only works if that operator can diagnose at this resolution. Without it, you get the hangover instead: doubled code churn, declining stability, and a 7.2% drop in delivery reliability across organizations that mistook generation speed for engineering throughput [300]. The litmus test is the on-call one — would you be comfortable owning the production incident tied to this change? If the answer requires re-reading your own PR to figure out what it does, the diagnostic loop hasn’t run yet [299], [301].
17.4 Takeaways
- Before blaming the model, run four diagnostic questions in sequence — did the agent have the right context, was a pause point reachable, did a workflow gate the output, and was the environment scoped — and reach for a different model only after all four come back clean.
- When context is the diagnosed failure layer, distinguish drift from missing context before acting: context drift — re-reading already-processed files, ignoring acknowledged constraints — calls for a session reset; missing or noisy context calls for a context refresh, not a restart.
- Capture the session ID the moment an agent run starts so that forking for speculative experiments and reloading a failing trajectory for reproducible diagnosis remain available options — without it, every intervention is a confounded experiment.
- When an agent run produces unexpected output, inspect the trajectory in order: find the first turn where the agent stopped quoting your spec verbatim, the first tool call whose error it ignored, and the moment its working file list drifted from yours — before re-prompting.
- After any environment or security incident, run a triage checklist before re-prompting or re-running: confirm the target environment from configuration (not from what the agent claimed), audit actual sandbox network and filesystem scope, check for exposed secrets or NEXT_PUBLIC_*-style env vars, verify tenant IDs are closure-bound at tool creation rather than model-settable, and confirm bypass mode was off — including for any spawned subagents.
- Add a workflow gate check that asks the reviewer — human or agent — to restate the change’s production impact in their own words before approving; if they cannot, treat the review gate as not having run.
- Before trusting an autonomous run, verify three things: the agent’s tool list includes a reachable pause/ask tool, an approval callback or equivalent runtime gate is registered, and no bypass mode is active that could propagate silently to spawned subagents.
- For any production agent, establish baseline metrics and a pre-selected scenario set before making changes, and categorize failure types rather than tracking generic hallucination scores — so you can tell whether a fix helped overall or just papered over the loudest reported failure.