11 Reviewing AI-Generated Changes
When writing code costs nothing, reading it becomes everything.
11.1 Review as the New Bottleneck
Writing code used to be slower than reviewing it. That relationship has inverted. Teams with high AI adoption merge 98% more pull requests, but review time increases 91%, bugs increase 9%, and DORA metrics — deployment frequency, lead time for changes, mean time to recovery, and change failure rate — remain flat at the organizational level [195]. Those four measures matter here because they capture delivery-system health rather than local activity, and AI-generated volume distorts them when review capacity does not scale. More code, bigger PRs, slower reviews, more bugs, no throughput improvement.
This is a queuing theory problem, not a tooling problem. When input grows faster than throughput, you get accumulating failure. OpenClaw — the same AI-first project whose Pi harness appears in Chapter 2 — had over 2,500 open pull requests at one point [196]. That’s not a big backlog; it’s a system that has lost the ability to process its own output. Without active throttling, teams experience what Ronacher calls total loss of codebase coherence [196] — engineers stop knowing what code exists in their own project. This is review theater: the process still has approval rituals, but the approvals no longer prove comprehension. Chapter 12 names the testing analogue test theater; both are the same failure shape, a reassuring gate that verifies too little. For every 25 percentage point increase in AI adoption, delivery throughput dropped 1.5% and delivery stability dropped 7.2% [195]. The bottleneck was never typing speed — writing and testing code accounts for only 25–35% of total SDLC time [195]. AI optimized the part that wasn’t the constraint, then made the actual constraint worse.
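The arithmetic behind that failure shape fits in a few lines. The sketch below uses invented rates chosen purely for illustration, not figures from the studies cited above: when PRs arrive faster than reviewers clear them, the backlog grows linearly and the wait per PR grows with it.

# backlog_arithmetic.py (illustrative; all rates are hypothetical)
# When PRs arrive faster than reviewers clear them, the backlog grows linearly
# and review latency grows with it.

def project_backlog(arrivals_per_day: float, reviews_per_day: float,
                    days: int, initial_backlog: float = 0.0) -> float:
    backlog = initial_backlog
    for _ in range(days):
        backlog = max(0.0, backlog + arrivals_per_day - reviews_per_day)
    return backlog

if __name__ == "__main__":
    # Hypothetical team: agents open 12 PRs/day, reviewers clear 8/day.
    backlog = project_backlog(arrivals_per_day=12, reviews_per_day=8, days=90)
    print(f"Open PRs after 90 working days: {backlog:.0f}")   # 360
    print(f"Approximate wait per PR: {backlog / 8:.0f} days")  # backlog / throughput

No review-side optimization closes that gap; only throttling generation or adding review capacity does.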
Before we get to what to do about it, notice that review is not one event. It is a stack of surfaces the code crosses on its way to production: the IDE diff gate at the moment the agent proposes edits; the deterministic CI gates that run on every push; the automated PR review bot that reads the diff before a human does; the human reviewer on the PR; and, once autonomy rises, the session transcript that makes an agent run reconstructable after the fact. Each layer has a different cost, catches a different class of mistake, and fails differently when you ignore it. The rest of this chapter focuses on the review surfaces that matter most in practice and the policy choices that keep them from collapsing under AI-assisted throughput.
This book recommends one local operating metric for that failure mode: restatement-failure rate, the share of medium- or high-risk AI-assisted reviews where the reviewer cannot accurately restate the changed invariant, failure mode, or blast radius from the diff and must ask for clarification before approving. Treat it as a house-policy signal derived from the review-failure patterns in this chapter, not as an externally validated industry benchmark.
11.2 What to Look For in AI-Generated Code
AI-generated code fails differently from human-written code, and the old heuristics — scanning for obvious bugs, checking naming conventions, verifying coverage numbers — miss exactly the failures that matter. The recurring failure modes group into three reviewer heuristics, each best caught at a specific review surface:
- Plausible-but-wrong reasoning — the code looks right but is logically wrong. Caught by humans, sometimes by LLM bots; rarely by deterministic checks.
- Mechanical fabrication — APIs, helpers, and dependencies that don’t exist or don’t mean what the agent thinks. Caught at the IDE diff gate and by deterministic CI.
- Silent scope and boundary drift — edits and additions outside what was asked for. Caught at the IDE diff gate first, on the PR second.
Plausible-but-wrong logic. The code compiles, the happy path works, the variable names are reasonable. But the logic is subtly wrong in ways that only surface under edge cases or production load. AI-generated code contains 75% more logic errors than human-written code and XSS vulnerabilities at 2.74× higher frequency [197], [198]. A flawed agent-generated PR can ship inefficient queries, retry logic that causes cascading failures, or caching without expiration — all while looking correct on code review. Vibe-coded changes are especially prone to happy-path success and unexpected-input failure [199].
Suspicious tests belong in the same heuristic. When reviewing AI-generated tests alongside code, watch for three warning signs: tests that mirror implementation logic rather than validating requirements, tests that cover only the happy path while ignoring error handling and edge cases, and tests that assert internal state rather than observable behavior. Any of these suggests the agent wrote tests to confirm what its code already does, not to verify what the code should do. Specification-first testing strategies are covered in the testing and verification chapter.
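A deliberately small sketch makes the first warning sign concrete; the discount function, the 10% rule, and the boundary bug are all invented for illustration. The first test re-derives its expectation from the same logic as the implementation and passes; the second states the requirement independently and catches the bug.

# test_mirroring_vs_requirement.py (illustrative; function and business rule are invented)

def apply_discount(total: float, is_member: bool) -> float:
    # Buggy implementation: the requirement is 10% off member orders of $100
    # or more, but the strict comparison silently excludes exactly $100.
    if is_member and total > 100:
        return round(total * 0.9, 2)
    return total

def test_mirrors_implementation():
    # Mirrors the code: recomputes the expectation with the same logic,
    # so it passes even though the boundary behavior is wrong.
    total, is_member = 100.0, True
    expected = round(total * 0.9, 2) if is_member and total > 100 else total
    assert apply_discount(total, is_member) == expected

def test_validates_requirement():
    # States the business rule independently of the code, so it fails
    # on the off-by-one boundary and forces the fix.
    assert apply_discount(100.0, True) == 90.0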
Dependency hallucination and API drift. Agents hallucinate deprecated API signatures, model names, and authentication patterns. When API information is distributed across documentation, agents confidently assert structure that isn’t actually in the response [200]. You cannot trust import statements or dependency versions in AI-generated code without verification. This is mechanical fabrication: type checkers, linters, and a clean local run usually surface it; the IDE diff gate is the cheapest place to catch it before CI.
Unrequested features and assumption-filling. Agents generate features you never asked for, creating technical debt through unrequested review burden, testing obligations, and maintenance surface area. They fail to reuse existing helper functions, instead replicating logic in multiple places and often incompletely [200]. When agents encounter gaps in requirements, they fill them with assumptions rather than asking — and those cascading assumptions are why a flawed spec produces a flawed implementation that looks fine on the surface [199].
Silent architectural violations and codebase coupling drift. AI assistants produce more lines of code faster, but this correlates with a 48% rise in code cloning (8.3% to 12.3% of changed lines from 2021 to 2024) and a steep decline in refactoring-classified changes [201]. The agent’s path of least resistance is to copy, not to consolidate, and the diff makes that look like new work rather than duplication. On top of that, agents will exploit any tool available to make forward progress, including credentials and side-channels meant to be isolated [202] — so unexpected reads or writes outside the stated scope are a review signal, not a curiosity. Watch the diff for new helpers that duplicate existing ones, paths the change shouldn’t have touched, and “convenience” code that quietly widens the trust boundary.
The practical consequence: assume the deterministic checks catch fabrication, push scope drift detection forward to the IDE diff gate, and reserve human attention for plausible-but-wrong reasoning, where no automated layer reliably wins. Your review checklist follows from there: Does the logic handle edge cases and error paths, not just the happy path? Are the tests validating requirements or mirroring implementation? Are dependencies real and current? Did the agent add anything you didn’t ask for? Does the code follow existing patterns, or did it invent new ones?
11.3 Human Oversight as Non-Negotiable
LLMs are inferrers, not compilers — they produce non-deterministic outputs even with identical inputs, and no amount of context or tooling eliminates the non-negligible probability of failure. Hallucinations are not a bug to be fixed; they are a core property of statistical text generation. The baseline risk is irreducible.
The practical implication: trust but verify is not a slogan but a minimum operating standard [203]. Blind trust in AI-generated code is a vulnerability equivalent to zero-trust network security — nothing should be implicitly trusted without continuous verification. Review overhead can theoretically negate productivity gains if verification effort equals or exceeds generation savings [203]. That’s the trade-off you’re managing.
Use the IDE diff gate before anything else. The first surface where verification happens isn’t the pull request — it’s the diff panel the agent drops in front of you when it proposes edits. Cursor’s Composer, Copilot’s Edits / agent mode in VS Code, and Cline’s diff UI are all instances of the same pattern: the agent stages every file it wants to touch as a reviewable diff, and you keep, reject, or tweak each hunk before anything hits disk. Persistent inline diff widgets — a UX pattern Replit’s Ghostwriter team explicitly redesigned for — exist because popovers disrupt flow and lose context, while diff views let you make informed accept/reject decisions on each suggestion [204]. Treat that panel as non-optional. If the agent proposed edits across more than two files, scroll. If the diff touches a file you didn’t expect, reject that hunk and ask why. If the change summary doesn’t match the hunks you see, the summary is lying and the hunks are the truth. Never click “accept all” on a long diff without reading it. The same logic extends to agentic database operations: an agent making schema or data changes without a preview-before-commit surface is fundamentally unreliable, which is why version-controlled databases like Dolt expose diffs and explicit commits as the trust boundary [205].
The same trap operates after the diff gate. AI-generated code that looks syntactically polished actually lowers reviewer skepticism, increasing the likelihood that subtle bugs slip through. Polished outputs coincide with lower critical evaluation — users invest effort upfront to direct the AI’s work but then dramatically reduce scrutiny when outputs look clean. Fewer than half of developers consistently review AI-assisted code before committing, and a significant minority of those who do report that reviewing AI-generated logic requires more effort than reviewing human-written code [197]. That gap — most people skip review, those who do it find it harder — is where bugs enter production. Heavy reliance on AI without understanding produces a vicious cycle: poor code from inexperience, no experience gained from continued reliance [206].
So how do you maintain oversight without becoming the bottleneck? Assess every change across three dimensions: probability of error (problem complexity and context quality), impact if wrong (blast radius), and detectability (test coverage, type safety, observability). Low-probability, low-impact, high-detectability changes — like updating a string constant in a well-tested module — can be reviewed lightly. High-impact changes to authentication, data models, or API contracts demand line-by-line scrutiny regardless of who (or what) wrote them. The goal is not to review everything equally but to direct your limited review capacity where it has the highest return. Upfront specs reduce review cost because they let reviewers validate code against an explicit design instead of reverse-engineering intent from the implementation [207].
11.4 Automated Review Tools: A Decision Model
Automated review tools — CodeRabbit, Ellipsis, GitHub Copilot Code Review, Vercel Agent, Sourcegraph’s Sherlock — are not replacements for human review. They are force multipliers that handle specific categories of problems, freeing your attention for the problems only humans can spot. The operational question is: which checks run where, and which ones are still reserved for human judgment? The answer is a risk-tiered gate structure.
Start with the tiering rules. Risk tier is determined by three variables: blast radius if the change is wrong, reversibility if you need to roll it back, and detectability if your existing tests and telemetry will catch the failure quickly. Execution mode (manual edit, plan-then-act, autonomous loop, background queue) is one input that pushes a change up or down — an autonomous-loop run with a weak verifier crosses into a higher tier than the same diff produced by an attended edit — but it does not replace the radius/reversibility/detectability lens. See Chapter 6 for the full mode taxonomy. Low-risk work is easy to reverse and easy to detect. Medium-risk work is bounded but not trivial. High-risk work can corrupt data, break trust boundaries, or create failures that your automation will not reliably surface.
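One way to keep the tiering consistent across reviewers is to encode it as a small house-policy helper that the PR template or tooling can call. The sketch below is illustrative; the fields and thresholds are assumptions, not a standard.

# risk_tier.py (illustrative house policy; fields and thresholds are assumptions)
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

@dataclass
class ChangeProfile:
    blast_radius: str        # "module", "service", or "cross-service"
    reversible: bool         # a clean rollback path exists
    detectable: bool         # existing tests/telemetry would surface a failure quickly
    autonomous_run: bool     # produced by an unattended agent loop
    touches_sensitive: bool  # auth, payments, migrations, shared contracts

def classify(change: ChangeProfile) -> Tier:
    # Sensitive surfaces and hard-to-reverse cross-service changes are always high.
    if change.touches_sensitive or (
        change.blast_radius == "cross-service" and not change.reversible
    ):
        return Tier.HIGH
    # Bounded, reversible, detectable work is low, unless an unattended agent
    # loop produced it, which pushes the change up a tier.
    if change.reversible and change.detectable and change.blast_radius == "module":
        return Tier.MEDIUM if change.autonomous_run else Tier.LOW
    return Tier.MEDIUM

if __name__ == "__main__":
    pr = ChangeProfile(blast_radius="module", reversible=True, detectable=True,
                       autonomous_run=True, touches_sensitive=False)
    print(classify(pr).value)  # "medium": the same diff from an attended edit would be "low"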
Automated signals must be green before human review. Compilation, type checks, linting, the agreed test suites, and static analysis run on every push. The review-policy rule is binary: if any of these fail, the PR is not eligible for human review, full stop. Detailed test-gate design — what to run, where, with what coverage targets — belongs to the testing and verification chapter; the only thing that matters at this layer is that the signals exist and that no human is asked to read a diff while they’re red.
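The binary rule is simple enough to automate in whatever orchestrates reviewer assignment. A minimal sketch, assuming the CI results have already been exported as a name-to-status map; the signal names are placeholders for whatever your pipeline actually runs.

# review_eligibility.py (illustrative; signal names are placeholders)

REQUIRED_SIGNALS = ("compile", "typecheck", "lint", "tests", "static-analysis")

def eligible_for_human_review(results: dict[str, str]) -> bool:
    # Binary rule: every deterministic signal must be green before a human
    # is asked to read the diff. Missing signals count as red.
    return all(results.get(name) == "success" for name in REQUIRED_SIGNALS)

if __name__ == "__main__":
    results = {"compile": "success", "typecheck": "success", "lint": "failure",
               "tests": "success", "static-analysis": "success"}
    if not eligible_for_human_review(results):
        print("Not eligible for human review: deterministic gates are red.")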
Before human review: author attestations. A human, not the agent, confirms the PR description, the intended scope, and whether any file changes fall outside that stated scope. These belong in the author handoff rather than the machine-check bucket.
Medium-risk PRs: add LLM-based review. LLM review tools earn their cost on bounded changes where the blast radius is real but reversible: new features in existing modules, backward-compatible API changes, or migrations with a clear rollback path. Sourcegraph’s Sherlock, scanning over 400 PRs, surfaced 3 high-severity and 4 medium-severity issues plus 12 notable edge cases that traditional SAST missed, cutting manual triage by 30 minutes per security engineer per day [208]. Practitioner comparisons of GitHub Copilot, CodeRabbit, and Macroscope find each catches subtle bugs like mismatched IDs and null-check failures that the developer missed in their own review, and that explicit corrections (this codebase uses V4 of the API, not V3) stop being repeated in subsequent reviews [209]. On-demand triggers like Vercel Agent’s PR-comment review flow let you pull the bot in when you want a second pass without bot-spamming every trivial PR [210]. If the LLM flags issues, the author addresses them before requesting human review, reducing round-trips. Skip the bot for tiny one-line PRs, generated-code directories, and repos where deterministic linters already cover its nit surface — an un-tuned bot on every PR trains reviewers to dismiss all of its comments, including the real findings.
Treat bot review as a scoped review surface, not a background opinion stream. Trigger it when one of three conditions is true: the PR crosses a risk-tier boundary, the reviewer lacks fresh subsystem context, or the author wants a second pass before requesting human review. Scope it to the changed files plus the spec, ADR, or API contract it should grade against; do not ask it to “review the whole repo” unless the change is architectural by design. Require the bot output to land where the human reviewer will see it — a PR comment, check summary, or attached review artifact — and make the author either fix, reject with reason, or escalate each high-signal finding. That is the durable practice across tools: explicit trigger, bounded context, visible output, and accountable disposition.
Low-risk changes: deterministic checks suffice. String updates, dependency bumps with passing CI, well-typed refactors in well-tested modules — deterministic gates alone are sufficient. Adding LLM review here wastes budget without improving outcomes.
High-risk changes: human judgment required. Authentication, payment flows, irreversible data migrations, cross-service contract changes, and concurrency-heavy code all belong here. LLM tools cannot replace human judgment because they struggle with codebase-wide context, business history, and architectural intent [208]. Iterative refinement loops, where each AI pass surfaces new issues and sometimes contradicts its previous suggestions, can chase their own tail without human anchoring [209].
What the human reviewer owns. Even after automated and LLM gates pass, the human reviewer answers four questions: (1) Does this change align with the intended design, not just the specification? (2) Does it respect existing architectural boundaries and patterns? (3) Did the agent add anything that wasn’t requested? (4) Would I be comfortable owning a production incident caused by this code? These are judgment calls that require understanding business context, system history, and architectural intent [211] — exactly the areas where automated tools fall short.
A medium-risk change through every layer. Take a concrete example: an agent proposes adding retry logic to a payment-status polling endpoint. At the IDE diff gate, the developer notices the agent has also touched a rate-limiter config in an unrelated file and rejects that hunk before accepting the rest — a stray edit that would have shown up later as a scope violation if it had been waved through. CI fails on the first push because the agent imported a deprecated retry helper; the deterministic gates catch the mechanical defect and the developer redirects the agent to the current one. On the resubmit, the LLM review bot flags an unbounded exponential backoff that would amplify load on the downstream provider during an outage — the plausible-but-wrong reasoning bug the deterministic gates have no way to see. The human reviewer, looking at the corrected diff, asks the question no automated layer can: should this endpoint retry at all, given that the upstream caller already has its own retry policy and a fixed user-visible deadline? That last question — does the change fit the system, not just the spec — is what the layered model exists to surface.
A default review bot is a commodity; the leverage is in path-scoped rules. Once you’ve lived with a bot for a sprint and can tell its high-signal comments from its noise, commit a config — .coderabbit.yaml, ellipsis.yaml, or the .github/copilot-instructions.md pattern — that encodes real house standards. ADRs referenced from such a config transform architectural decisions into executable constraints [209]:
# .coderabbit.yaml (illustrative)
path_instructions:
  - path: "api/**"
    instructions: "Require a linked ADR for any new endpoint or contract change."
  - path: "db/migrations/**"
    instructions: "Verify forward + rollback migration and test fixtures before approval."

Every fixed standard becomes an executable gate that catches the next violation, so institutional knowledge stops evaporating when reviewers change teams. Don’t copy-paste someone else’s ruleset — you’ll inherit their noise without their signal — and don’t write rules for standards the team doesn’t actually apply; the bot will drift from reality and train reviewers to dismiss it. Effective checks focus on one concern per check, specify exactly what to look for and how to fix it, and leave diff-scoping to the system rather than the prompt. Combine feedforward controls (rules files, architectural constraints, ADRs that prevent bad output) with feedback controls (sensors that catch errors and enable self-correction) [212]. Neither alone is sufficient: feedback-only harnesses repeat mistakes; feedforward-only harnesses encode rules without verifying they worked.
The following checklist distills this decision model into a reviewable artifact you can embed in your PR template:
AI-Code Review Checklist (risk-tiered)
Attach to every PR. Check applicable items before requesting review.
All PRs:
- [ ] IDE diff gate cleared: every hunk reviewed, no stray file changes accepted
- [ ] Automated signals green (compilation, type checks, linters, agreed test suites)
- [ ] No file changes outside the stated PR scope
- [ ] PR description written by the author, not copied from agent output

Medium-risk PRs (add):
- [ ] LLM review tool run; flagged issues addressed
- [ ] Edge cases and error paths covered by tests
- [ ] Backward compatibility verified for API or schema changes

High-risk PRs (human reviewer verifies):
- [ ] Change aligns with intended design, not just the written spec
- [ ] Existing architectural boundaries and patterns respected
- [ ] No unrequested features, dependencies, or abstractions added
- [ ] Reviewer comfortable owning a production incident from this code
- [ ] Concurrency, security, and data-integrity paths manually traced
11.5 A Practical Review Playbook
The previous section’s decision model tells you what to check. This section addresses how to keep review operationally sustainable when agents generate PRs faster than humans can read them. The mechanics of how diffs are shaped — branch layout, stacked branches, merge queues, rebase conventions, release cadence — belong to the chapter on source control and release discipline. The rules below are the review-side rules only.
The reviewer bounces diffs that are too large, unstable, or out of scope. This is the core review-policy rule that everything else in this section serves. A pull request should contain one logical change and be reviewable in a single short sitting; sprawling diffs train reviewers to skim, and skimmed reviews are the ones that miss the silent architectural violations from the previous section. Likewise, if the base branch is still moving, if automated signals are red, or if the diff has crept outside the author’s stated scope, the right reviewer behavior is to send it back, not to absorb the cost. A useful upstream tell is the agent’s IDE diff panel: if the proposed edit doesn’t fit on one screen, the resulting PR won’t either, and that’s the moment to interrupt the session rather than push through.
Manage the review queue, not just individual PRs. When a single developer with a coding agent can produce more code in a day than their team can review in a week [213], review backlogs become the primary throughput constraint. Treat your review queue like an ops queue: set work-in-progress limits, make queue depth visible, and stop generating new PRs when the backlog exceeds your team’s daily review capacity. PRs are 18% larger with AI adoption and take 26% longer to review [197]. If your queue is growing, the correct response is to slow generation — not to review faster [214]. Code review is the longest phase of cycle time in most teams, which makes it the single highest-leverage place to invest [215]; aggregate cycle time will hide where work actually stalls unless you break it out by phase.
Set review SLAs by risk tier. Use risk-tier SLAs as operating heuristics tied to queue depth, team size, and downstream pressure. Many teams aim to clear low-risk reviews the same working day, medium-risk work within a day or two while author context is still fresh, and high-risk blocked work ahead of the ordinary queue. The exact target is less important than making it explicit, visible, and adjustable as your queue data changes.
Route reviews to the right person. Generalist reviewers handle low- and medium-risk PRs efficiently. High-risk changes — and any PR touching unfamiliar subsystems — should route to a domain specialist who holds the architectural context. Reviewers without sufficient codebase context slow down approval cycles and miss issues; cross-repository visibility directly improves both review speed and catch rate [215]. When in doubt, route up.
Timebox and rotate. Set dedicated review blocks — 30 to 60 minutes — rather than reviewing ad hoc throughout the day. Context-switching between writing and reviewing degrades both. Rotate review duty across the team to prevent bottlenecking on one or two senior engineers and to spread the architectural awareness that comes from reading code you didn’t write.
Watch for fake throughput. A queue that empties quickly is not the same as a queue that produces understood changes, and an AI-assisted pipeline will produce the first while pretending to be the second. Aggregate volume metrics — lines of code, PR count, tokens spent — are particularly misleading here: high-volume AI output is a process-failure signal, not a success metric, and qualitative results (shipped, tested, secure software) are far better proxies than LoC [216]. Track four signals instead, broken out by risk tier:
- Review latency by risk tier. Time from “ready for review” to merge for low-, medium-, and high-risk PRs. A healthy median in aggregate routinely hides high-risk work waiting days behind a wall of low-risk noise.
- Reopen, hotfix, and revert rate. Fraction of merged PRs that come back within 14 days for the same defect class. A rising rate against falling latency is the canonical “fast queue, no comprehension” signature.
- Scope-violation returns. PRs sent back to the author because the diff exceeded its stated scope or touched files outside the author’s intent summary. A drift up here means the IDE diff gate is being skipped and authors are accepting agent edits they didn’t ask for.
- Approval-without-substantive-review rate, paired with restatement-failure rate. Medium- or high-risk PRs approved without any substantive comment or restatement on the thread, and the matching clarification events produced by the house protocol in the next section. Together they are an early local warning that review may be turning into rubber-stamping.
The diagnostic pattern: latency falls while reopen rate, scope-violation returns, or approval-without-substantive-review climbs. That combination means the queue is clearing because reviewers are skimming, not because the work is well understood, and it is exactly the failure mode this chapter exists to prevent.
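A sketch of how those four signals, together with the restatement-failure rate from section 11.1, might be computed from exported PR metadata; the record fields are assumptions about what your tracker can export, not a standard schema.

# review_signals.py (illustrative; the record fields are assumed, not a standard schema)
from dataclasses import dataclass
from statistics import median

@dataclass
class PRRecord:
    risk_tier: str                 # "low", "medium", or "high"
    hours_to_merge: float          # "ready for review" to merge
    reopened_within_14d: bool      # hotfix, revert, or reopen for the same defect class
    scope_violation_return: bool   # bounced for exceeding its stated scope
    substantive_review: bool       # at least one substantive comment or restatement
    restatement_failed: bool       # reviewer had to ask for clarification before approving

def signals(prs: list[PRRecord]) -> dict[str, float]:
    # Assumes a non-empty reporting window with at least one PR in each tier of interest.
    high = [p for p in prs if p.risk_tier == "high"]
    med_high = [p for p in prs if p.risk_tier in ("medium", "high")]
    return {
        "high_tier_latency_hours": median(p.hours_to_merge for p in high),
        "reopen_rate": sum(p.reopened_within_14d for p in prs) / len(prs),
        "scope_violation_rate": sum(p.scope_violation_return for p in prs) / len(prs),
        "rubber_stamp_rate": sum(not p.substantive_review for p in med_high) / len(med_high),
        "restatement_failure_rate": sum(p.restatement_failed for p in med_high) / len(med_high),
    }

Compared across reporting windows, the diagnostic pattern above becomes mechanical: latency falling while any of the last three rates climbs is the skimming signature.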
11.6 Reviewer Ownership and the Comprehension Protocol
If the agent wrote it and you approved it, you own it. Ownership of AI-generated code is non-delegable: as long as engineers are accountable for what ships, they are the bottleneck, regardless of code generation speed [196]. The role has shifted from creation to verification — from “I wrote this” to “I understand this, I tested this, and I’ll answer the pager when it breaks.” Without that discipline, the asymmetry — minutes to generate, hours to review — quietly breaks maintainer bandwidth and normalizes uncritical acceptance [217].
Recommended comprehension protocol. Comprehension has to be made visible, not assumed. The following is a house policy this book recommends for teams adopting agents; it is synthesized from the review-failure patterns above, not presented as a settled external standard. Medium- and high-risk AI-assisted PRs carry two short artifacts that make approval auditable.
- Author-written intent summary. The author — not the agent — writes a short prose summary stating four things: the user-visible change, the invariant or contract that is now different (or newly enforced), the failure mode the change prevents or introduces, and the file or module boundaries the change is allowed to cross. Pasting the agent’s commit message does not satisfy this; if the author cannot produce the summary without re-reading the diff, the PR is not ready for review. Any file in the diff outside the stated boundaries is a scope-violation return, not a discussion.
- Reviewer restatement. Before approving a medium- or high-risk PR, the reviewer restates, in their own words on the PR thread, the changed invariant or failure mode and the blast radius — re-derived from the diff, not quoted from the author’s summary. A reviewer who cannot restate it has not understood it, and an approve click in that state is rubber-stamping. For low-risk PRs, restatement can be skipped, but the reviewer is still on the hook for the four ownership questions in the decision model.
A worked example makes the protocol concrete. A medium-risk PR adds idempotency keys to the payment refund endpoint:
Author intent summary: User-visible change: clients may now retry POST /refunds safely; duplicate calls within 24 hours with the same Idempotency-Key header return the original response instead of issuing a second refund. New invariant: at most one refund per (merchant_id, idempotency_key) within a 24-hour window. Failure mode prevented: double refunds from client retries on 5xx responses. Failure mode introduced: clients reusing keys across distinct refund intents will silently get the first response — documented in the API reference. Allowed scope: api/refunds/, db/migrations/2026_03_idempotency_keys.sql, and the refund handler tests.

Reviewer restatement: Reading the diff, the new invariant is one refund per merchant+key within 24 hours, enforced by a unique index added in the migration plus a lookup before the provider call. The blast radius is the refund path only; an unrelated mutation in the same handler would inherit this dedup. I want to confirm the 24-hour window is wall-clock — what happens around DST transitions?

Restatement failure (alternative path): I can see a new idempotency_keys table and a uniqueness check, but the diff also rewrites the refund response shape and I can’t tell from the code alone whether existing clients break. Can you point me to the contract test? — at which point the reviewer pauses, the author addresses the gap, and the event is logged as a restatement failure feeding the metrics from the previous section.
That last exchange is the one the policy is designed to surface. The reviewer did not approve; they asked. A spike in those events is an early local signal that scope or specs may be drifting, well before reopens and reverts confirm the damage. The metric is useful because it forces a team to count moments of non-comprehension instead of pretending the approve button already proves understanding.
Signal AI involvement on the commit. Annotate AI involvement in commit messages so reviewers know what they’re reviewing [203]. The annotation is not a disclaimer — it is a signal that tells the reviewer to apply the suspicious-pattern checks from earlier in the chapter. Treat AI output like junior-developer output: requiring extra scrutiny, not less [203], [199]. The exact format — an [ai-assisted] prefix, an AI-Tool: trailer — matters less than picking one and using it consistently. Conventions for how this fits with branches, squashed commits, and release notes belong to the chapter on source control and release discipline.
Keep the session, not just the diff, when the risk tier requires it. Suggest-mode edits and low-risk attended changes can usually be reviewed from the diff, author intent summary, and green checks. Once the agent has act-mode write access, runs unattended, touches high-risk code, or produces a diff whose intent cannot be reconstructed from the PR, the reviewable artifact becomes the PR plus the session transcript. Current pull request models on platforms like GitHub do not carry enough information to review AI-generated code effectively when reviewers cannot see the prompts, tool calls, failed attempts, and scope boundaries that shaped the diff [218]. Platforms that capture every agent turn — the Copilot coding agent’s control plane and Devin’s per-task session logs are two examples — give reviewers something to reconstruct from when the diff alone does not explain the agent’s choice.
When you do open a transcript, keep the review question narrow. Look for three failure shapes: scope drift, where the agent reads or edits files outside the stated task boundary; a repeated tool loop, where the same tool call cycles without converging; and late permission escalation, where the agent asks for or is granted broader permission near the end of a run and uses it to make changes the original constraints would have blocked. The reviewer does not need a general observability program here. They need enough session evidence to decide whether this PR stayed inside scope, whether the final diff emerged from a failing loop, and whether permission changes altered the review burden. Broader transcript retention, access control, audit, and fleet-level observability belong to the chapter on bounded autonomy and permission design.
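A sketch of that narrow transcript check, assuming the harness can export a session as a list of event dictionaries with a type, a file path for file operations, a tool name, and permission grants; the event schema is an assumption for illustration, not any particular product’s log format.

# transcript_checks.py (illustrative; the event schema is an assumed export format)
from collections import Counter

def scope_drift(events: list[dict], allowed_prefixes: tuple[str, ...]) -> list[str]:
    # Files read or edited outside the stated task boundary.
    touched = {e["path"] for e in events if e["type"] in ("read_file", "edit_file")}
    return sorted(p for p in touched if not p.startswith(allowed_prefixes))

def repeated_tool_loops(events: list[dict], threshold: int = 5) -> list[str]:
    # The same tool called with identical arguments many times: a sign the run
    # was cycling rather than converging before it produced the final diff.
    calls = Counter((e["tool"], e.get("args_hash"))
                    for e in events if e["type"] == "tool_call")
    return [f"{tool} x{count}" for (tool, _), count in calls.items() if count >= threshold]

def late_permission_escalation(events: list[dict]) -> bool:
    # A permission grant in the final quarter of the run, after the original
    # constraints had already shaped most of the work.
    grants = [i for i, e in enumerate(events) if e["type"] == "permission_grant"]
    return bool(grants) and grants[-1] >= int(len(events) * 0.75)

Each function answers one of the three questions above; anything broader belongs to the bounded-autonomy chapter, not to the review of a single PR.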
Close the loop on patterns that recur. When a reviewer catches a pattern the agent keeps getting wrong, the response is not limited to fixing this PR. Update the rules file, the ADRs, or the specification templates so the pattern does not recur [212]. Every review finding should feed back into your configuration, creating the feedback flywheel described in the chapter on process frameworks for coding agents. Without this loop, you are paying the review cost for the same class of mistake indefinitely.
11.7 The Operating Rule
The chapter reduces to one rule: only increase agent output when reviewer understanding and review surfaces scale with it. Throughput without comprehension is the failure mode every section here is designed to prevent — review theater on the PR, mechanical fabrication slipping past green CI, scope drift hidden in a long diff, and unresolved comprehension gaps rationalized into approvals. The IDE diff gate, risk-tiered automation, the queue and metrics playbook, and the comprehension protocol all serve that single rule. When the queue is clearing faster than reviewers can restate what they approved, slow generation. When the bot keeps catching the same class of mistake, encode it. When the diff alone can’t explain the agent’s choice, keep the transcript. The point is not to review more code; it is to ship code you actually understand.
11.8 Takeaways
- Use the IDE diff gate as a mandatory first review surface: scroll every hunk, reject stray file edits, and never click ‘accept all’ on a long diff without reading it.
- Route reviewer attention by failure type: let deterministic CI catch mechanical fabrication such as hallucinated APIs, use the IDE diff gate to catch scope drift, and reserve human scrutiny for plausible-but-wrong reasoning and tests that mirror implementation rather than validate requirements — the failure classes no automated layer reliably catches.
- Tier AI-assisted PRs by blast radius, reversibility, and detectability: deterministic gates suffice for low-risk work, add LLM review for medium-risk PRs, and reserve human line-by-line scrutiny for high-risk changes that can corrupt data, break trust boundaries, or evade your automation.
- For medium- and high-risk AI-assisted PRs, require the author to write the intent summary themselves: user-visible change, changed invariant or contract, failure mode, and allowed file or module boundaries.
- Before approving a medium- or high-risk AI-assisted PR, restate the changed invariant or failure mode and the blast radius in your own words on the PR thread; if you cannot do that from the diff, the PR is not ready to approve.
- Stop generating new PRs when review queue depth exceeds your team’s daily review capacity — treat the queue like an ops queue with explicit WIP limits and visible backlog depth.
- When an agent runs in act mode, operates unattended, or touches high-risk code, keep the session transcript and review it for three failure shapes: scope drift (edits outside task boundaries), repeated tool loops without convergence, and late permission escalation.