12 Testing and Verification
Agents are prolific test writers — and most of those tests are worthless.
12.1 The Test Theater Problem
In the book’s core loop, review and test are adjacent but not interchangeable. Review is the human comprehension and approval gate covered in Chapter 11; testing is systematic verification against the spec — does the artifact behave correctly under checks the agent cannot bluff past? The spec-driven approach from Chapter 7 is the foundation. Without a written spec, tests have no independent oracle and collapse into assertions about whatever the agent happened to build.
The most dangerous test is one that passes and verifies nothing. Agents write these constantly. When you ask an agent to “add tests for this function,” it generates code, then writes tests that assert the code does exactly what it already does — not what it should do. The test becomes a mirror of the implementation, not an independent check against a specification [219], [220]. Call this test theater — a gate that looks reassuring while proving too little.
This happens because of a structural limitation in how single-context agents work. The agent plans the implementation and the tests in the same context window, so the test-writing phase is contaminated by implementation knowledge. It writes tests around the code it already generated — or is about to generate — rather than tests that capture intended behavior. The result is high code coverage numbers with near-zero specification coverage [220]. Your CI pipeline glows green while corner cases, error handling, and security boundaries go completely untested.
The pattern is insidious because it looks right. Agent-generated tests are well-structured, properly named, and use appropriate testing frameworks. They import the right modules, call the right functions, and assert return values. A quick scan during code review won’t catch the problem. You need to ask a different question: “Does this test fail if the implementation is wrong in a way that matters?” For agent-generated tests, the answer is usually no [219], [221].
Ask an agent to implement a payment function and test it, and the failure mode becomes obvious. The agent generates a function that calculates totals, applies discounts, and returns a result. Then it writes tests that call the function with sample inputs and assert the exact outputs the function produces. If the discount logic is wrong — say it applies discounts before tax instead of after — the test still passes, because the test asserts the actual behavior, not the correct behavior. A performance of quality that protects nothing.
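A minimal sketch of that contrast, assuming a hypothetical applyDiscount helper and a Vitest-style suite, and using a flat discount amount so the ordering of tax and discount actually changes the total. The first test mirrors whatever the implementation currently returns; the second encodes the spec's ordering rule and fails until the bug is fixed.

```typescript
import { describe, it, expect } from "vitest";
// Hypothetical function and input shape, for illustration only.
import { applyDiscount } from "./pricing";

describe("applyDiscount", () => {
  // Test theater: the expected value (8_800) was read off the current
  // implementation, which applies the flat discount before tax.
  // The test passes forever, bug included.
  it("returns what the implementation currently returns", () => {
    expect(applyDiscount({ subtotalCents: 10_000, taxRate: 0.1, discountCents: 2_000 }))
      .toBe(8_800);
  });

  // Spec-derived test: the expected value is computed from the rule
  // "apply the discount after tax", independently of the code:
  // 10_000 * 1.1 = 11_000; 11_000 - 2_000 = 9_000.
  it("applies the discount after tax, per the spec", () => {
    expect(applyDiscount({ subtotalCents: 10_000, taxRate: 0.1, discountCents: 2_000 }))
      .toBe(9_000);
  });
});
```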
The problem compounds in autonomous loops. When an agent runs in a self-correcting cycle — generate code, run tests, fix failures, repeat — it optimizes for making tests pass rather than making code correct. If a test fails, the agent is equally likely to fix the test as fix the code [221]. Without an external oracle that defines correctness independently, the loop converges on internally consistent but potentially wrong code. As Chapter 16 covers in depth, self-correcting loops only work reliably when the verification signal comes from outside the agent’s own output.
12.2 Spec Coverage Over Code Coverage
The antidote to test theater is writing specifications before the agent writes code — and before it writes tests. When you define what “correct” means independently of the implementation, you create a verification signal the agent can’t game [220], [222].
As Chapter 7 established, specification quality matters more than prompt cleverness. This applies doubly to testing. A spec that says “the discount function applies percentage discounts after tax calculation, rounds to two decimal places, and rejects negative amounts” gives you three testable properties the agent must satisfy. Without this spec, you get whatever the agent’s training data suggests a discount function should do — which may or may not match your business rules. Conversational requirements work too — Zinus replaced formal specs with plain-English descriptions of data structure and behavior fed iteratively to the agent, and still cut a contracted three-month build to six weeks [223] — but only because someone owned the resulting acceptance criteria as a testable contract.
TDD is the strongest form of prompt engineering. When you write the test first — or at minimum define the acceptance criteria first — you anchor the agent to externally defined correctness. The test becomes a constraint on generation, not a rubber stamp on output [220], [222].
The challenge is that single-context TDD doesn’t work well with agents. When the same agent writes both tests and implementation in one session, context pollution defeats the purpose — the test-writer “knows” the implementation it’s about to write. One practitioner solved this with subagent isolation: separate agents for the red phase (test writing), green phase (implementation), and refactor phase, each seeing only the context it needs [220]. The test-writing agent sees only the spec and the existing code. The implementation agent sees the failing tests and nothing else. Phase gates prevent skipping steps. This is more ceremony than most tasks need, but it demonstrates the principle: isolating test creation from implementation creation is what makes agent-assisted TDD genuine rather than theatrical.
For everyday work, a lighter-weight approach works: write acceptance criteria in your spec or issue description, have the agent generate tests from those criteria first, review the tests yourself to confirm they capture the right behavior, and only then ask the agent to make them pass. This simple sequence — specify, test, implement — prevents the most common failure mode. If you skip the specification step, you’re back to test theater [221].
Make the handoff explicit enough that the agent cannot quietly collapse the phases. A minimal pattern is: start a separate test-writer session with only the spec, public interface, and relevant existing tests; ask it to produce the failing tests plus a short “oracle note” explaining what behavior each test proves; review that artifact yourself; then start the implementation session with the failing tests and a rule that test files are read-only unless you approve a test change. The artifact that crosses the boundary is not a chat transcript. It is a small handoff file or issue comment: accepted criteria, tests added, expected failures, and files the implementation agent may touch.
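One plausible shape for that handoff note; the fields, spec reference, and paths are illustrative, not a required format.

```
Handoff: discount ordering fix
Accepted criteria:
  - Discount is applied after tax (pricing spec, rule 2)
  - Amounts round to two decimal places
  - Negative discount amounts are rejected with a validation error
Tests added (expected to fail against current main):
  - pricing/applyDiscount.spec.ts: "applies the discount after tax"
  - pricing/applyDiscount.spec.ts: "rejects negative discount amounts"
Files the implementation agent may touch:
  - src/pricing/applyDiscount.ts
  - src/pricing/rounding.ts
Test files are read-only; propose any test change as a comment, not an edit.
```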
Snapshot tests from reference implementations offer another powerful pattern. Wherever you have a reference implementation or a well-defined specification, use it as your test oracle rather than letting the agent define its own success criteria. When the agent has an external target — a previous suite’s snapshot output, a sister implementation’s behavior, a contract document — every behavior gets graded against something the agent did not write [221].
AI-generated tests are coverage scaffolding, not oracles. A newer class of vendor and research tools will read a function and generate unit tests for it; the same instinct shows up inside coding agents that emit a default test alongside every diff [219]. The durable filter is to keep only generated tests that build, run, pass, and measurably raise coverage. That filter makes the output safe as coverage scaffolding. What it does not do is validate the assertion itself. A generated test that locks in a function’s current buggy output is worse than no test, because it turns the bug into the specification and any future fix looks like a regression. Reach for these tools on stable, well-specified code — pure functions with a docstring the generator can read as an oracle, or a narrow regression net around an agent-generated diff. Do not point them at untested legacy behavior you haven’t read yourself, and audit every assertion with the same question this chapter poses of hand-written tests: is the asserted value the intended behavior or the observed behavior?
12.3 What Naive Tests Miss
Agent-generated tests have systematic blind spots. Group them by what drifted, and the matching oracle becomes obvious: logic drift needs a differential check against a canonical implementation, boundary drift needs property-based or fault-injection coverage, runtime drift needs concurrency or crash-recovery stress, structural drift needs static analysis and architectural rules, requirement drift needs spec tracing, and cross-component drift needs integration and contract tests. Each class requires a verification strategy the agent’s default behavior doesn’t employ.
Logic duplication and divergent copies. When an agent replicates logic instead of reusing existing helper functions, each copy passes its own isolated assertions [221]. The naive test validates that copy A returns the right output, but never checks whether copy A and copy B agree — or whether either matches the canonical implementation. The first remedy is a differential or contract test against the canonical helper: write one test table of inputs and expected outputs, then assert that every call site — the canonical helper plus each suspected duplicate — produces identical results for every row. That single test forces divergence to surface as a failure rather than a silent drift. Mutation testing is an amplifier on top: once the differential test exists, mutating the canonical helper should break every consumer; if it does not, you have found another rogue copy.
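A sketch of such a differential table, assuming a hypothetical canonical computeShippingCost helper and a suspected agent-introduced duplicate, quoteShipping, in the checkout module.

```typescript
import { describe, it, expect } from "vitest";
// Canonical helper and a suspected duplicate call site (names hypothetical).
import { computeShippingCost } from "../shipping/cost";
import { quoteShipping } from "../checkout/quote";

// One table of inputs and spec-derived outputs drives every call site, so
// any divergence between the canonical helper and the copy fails a row.
const rows = [
  { weightKg: 0, zone: "domestic", expectedCents: 0 },
  { weightKg: 0.5, zone: "domestic", expectedCents: 499 },
  { weightKg: 2, zone: "domestic", expectedCents: 899 },
  { weightKg: 2, zone: "international", expectedCents: 2_499 },
  { weightKg: 30, zone: "international", expectedCents: 9_999 },
];

describe("shipping cost: canonical helper and duplicates agree", () => {
  it.each(rows)("weight $weightKg kg, zone $zone", (row) => {
    const canonical = computeShippingCost(row.weightKg, row.zone);
    expect(canonical).toBe(row.expectedCents);                      // matches the spec table
    expect(quoteShipping(row.weightKg, row.zone)).toBe(canonical);  // copy agrees with canonical
  });
});
```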
Edge cases in error paths. Agents optimize for the happy path because that is what most training examples demonstrate. A payment function may handle valid inputs correctly but silently truncate negative amounts, return nulls on overflow, or swallow exceptions in retry logic. The naive test calls the function with representative inputs and asserts the expected output — it never exercises the boundary. Property-based testing (generating random inputs within and outside valid ranges) and fault injection (simulating network failures, disk-full conditions, timeout races) catch what example-based tests miss [219].
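A sketch using fast-check; the function name and ranges are illustrative. Generated inputs sweep across and outside the valid range, and the assertions come from the spec's properties rather than sample outputs.

```typescript
import { describe, it, expect } from "vitest";
import fc from "fast-check";
// Hypothetical function under test.
import { applyDiscount } from "./pricing";

describe("applyDiscount properties", () => {
  it("never returns a negative total for valid inputs", () => {
    fc.assert(
      fc.property(
        fc.integer({ min: 0, max: 1_000_000 }), // subtotal in cents
        fc.integer({ min: 0, max: 1_000_000 }), // discount in cents
        (subtotalCents, discountCents) => {
          const total = applyDiscount({ subtotalCents, taxRate: 0.1, discountCents });
          expect(total).toBeGreaterThanOrEqual(0);
        },
      ),
    );
  });

  it("rejects negative discount amounts instead of silently truncating", () => {
    fc.assert(
      fc.property(fc.integer({ min: -1_000_000, max: -1 }), (discountCents) => {
        expect(() =>
          applyDiscount({ subtotalCents: 10_000, taxRate: 0.1, discountCents }),
        ).toThrow();
      }),
    );
  });
});
```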
Concurrency and ordering bugs. Agents can pass all tests while containing critical durability bugs — overwriting data at offset 0 before fsyncing dependent data, creating crash-corruption windows that no sequential test catches. Naive tests run single-threaded, single-connection, in-order. The practical harness is not a longer unit test; it is a stress loop that creates real interleavings. Dolt’s crash-recovery tests run SQL servers inside VMs, drive ordinary SQL writes in one parallel component, hard-reboot the VM from another, and then assert after restart that every acknowledged write is durable and the server is healthy [224]. A passing run still does not prove absence of durability bugs, which is why the harness has configurable runtime and jitter: repeated runs turn ordering bugs into reproducible traces. If your system handles concurrent writes, crash recovery, or timing-sensitive cache behavior, your test suite must exercise those interleavings rather than only the happy sequence.
Architectural drift and idiom violations. When the agent violates language idioms or introduces unnecessary complexity, the test suite stays green because it validates outputs, not design. Static analysis and architectural fitness functions — linter rules that enforce import boundaries, complexity limits, and pattern conformance — are the verification layer for structural quality. Functional tests cannot substitute for structural checks [219].
Requirements misassumptions. AI implements exactly what you ask, not what you need [219]. When the specification is ambiguous, the agent fills gaps with plausible defaults that may contradict business intent. No amount of testing catches a correctly-implemented-wrong-requirement — only specification review does. The verification method is tracing each test back to an explicit acceptance criterion. If a test cannot be linked to a requirement, it is either testing an assumption or testing nothing.
Cross-boundary behavior between components. Unit tests don’t catch the largest class of agent mistakes on real systems: changes that are individually correct but break a contract between an HTTP handler and its consumer, a producer and its queue, a service and the database schema it assumes, or a frontend and the API it calls. Agents are happy to refactor a serializer, rename a JSON field, or change an error code without touching the call sites that depend on it, because nothing in the unit suite forces those sides to agree. Integration tests answer “do these two real components still talk?” — exercising a real database, a real HTTP server, a real queue, with the agent’s diff in place. Contract tests answer the stronger question of “do they agree on the shape of the conversation?” — a single authoritative description of an API or message (an OpenAPI doc, a JSON Schema, a Pact file, a generated TypeScript client) that both sides are checked against, so a producer renaming a field fails a test before any consumer ever sees the change. The decision rule is simple. If a change crosses a process, network, schema, or service boundary, a unit test is not enough; add an integration test that drives the boundary and, if more than one side or more than one team owns the contract, a contract test for which the boundary description is the source of truth. A practical entry point on the API side is a request-level harness such as Playwright’s APIRequestContext, which drives real HTTP calls with shared base URLs, headers, and beforeAll/afterAll setup of server-side state — enough to assert that an agent’s diff still honors the response shape, status codes, and side effects the rest of the system relies on [225]. Boundary tests are also where the “external oracle” argument bites hardest: the contract is the spec, and an agent that broke the contract did not “almost get it right,” it changed the spec without permission.
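A minimal request-level sketch in that style, assuming a local server configured as Playwright's baseURL and a hypothetical orders endpoint; the asserted keys come from the API contract, not from the handler the agent just edited.

```typescript
import { test, expect } from "@playwright/test";

test.describe("orders API contract", () => {
  let orderId: string;

  test.beforeAll(async ({ request }) => {
    // Arrange server-side state through the real API, not a mock.
    const created = await request.post("/api/orders", {
      data: { sku: "SKU-123", quantity: 2 },
    });
    expect(created.status()).toBe(201);
    orderId = (await created.json()).id;
  });

  test("GET /api/orders/:id returns the agreed shape and status", async ({ request }) => {
    const res = await request.get(`/api/orders/${orderId}`);
    expect(res.status()).toBe(200);
    expect(res.headers()["content-type"]).toContain("application/json");

    const body = await res.json();
    // Contract-level assertions: shape and values the consumer relies on.
    expect(body).toMatchObject({ id: orderId, sku: "SKU-123", quantity: 2 });
    expect(typeof body.totalCents).toBe("number");
  });
});
```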
The common thread: each failure class requires a different verification method. Implementation-mirroring tests miss all of them equally. A verification strategy that combines differential tests for duplicated logic, property-based testing for logic boundaries, fault injection for error paths, stress testing for concurrency, integration and contract tests for cross-component agreement, static analysis for structure, and specification tracing for intent covers the gaps that naive agent-generated tests leave open. In practice that means reaching for concrete tools, not just categories: Hypothesis or fast-check when the input surface is large, a deliberate timeout or network-failure harness when retries and fallbacks matter, concurrency or crash-recovery stress scripts when ordering matters, real-database integration suites and OpenAPI/Pact contracts when components have to agree, and import-boundary or complexity rules when the agent starts cloning patterns instead of reusing them.
12.4 Evaluating Agent Output
Per-change tests verify a specific diff. Evals verify the agent workflow that produces those diffs. Use them when an agentic workflow will be reused: the same prompt, tool scaffold, model family, or background loop will be trusted on multiple future changes, and a regression in that workflow would create repeated bad patches. Skip them for one-off manual assistance where ordinary spec tests, integration checks, and review give faster signal. Evals are the regression layer above per-change testing, not a substitute for the tests attached to a PR [226], [227], [228].
A practical repository-level harness is small and concrete. Keep a directory of representative tasks: a failing issue reproduction, the acceptance checks that must pass, and a short policy file listing constraints such as “do not change public API signatures” or “touch only files under payments/”. Run the same generation prompt several times against that harness, then mark the workflow as usable only if the patch fixes the reproduction, keeps existing tests green, and respects the declared constraints [227]. That harness is worth maintaining when the same class of task recurs; if the work is unique, add ordinary tests to the change instead of building an eval suite around it.
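One plausible layout for such a harness; directory and file names are illustrative.

```
evals/
  refund-double-charge/
    task.md            # the failing issue reproduction, verbatim from the bug report
    repro.spec.ts      # acceptance check: fails on the buggy commit, must pass after the patch
    policy.txt         # "do not change public API signatures"; "touch only files under payments/"
  report-timeout/
    task.md
    repro.spec.ts
    policy.txt
  run.config.json      # trials per task, workflow under test, pass threshold
```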
Keep the oracle binary. In a coding workflow, the eval question should look like “did the patch fix the reproduction without breaking declared constraints?” not “how good was this answer?” Binary oracles make it possible to block a workflow change in CI. Rolled-up subjective scores can mislead, and aggregate metrics like ROC-AUC can hide confidence distributions that are useless in production [229]. For verification, a small set of unambiguous pass/fail checks is more useful than a broad score.
Run repeated trials only where stochastic variation affects shipping risk. One successful generation does not tell you much about a reusable agent workflow; run the same prompt several times and look for whether the workflow usually succeeds, usually fails, or only works when you get lucky [227]. For unattended loops, the tail matters more than the headline: a 70% per-attempt pass rate sounds useful until three consecutive successful attempts happen only 34% of the time — Pass^k, not Pass@k, predicts what downstream automation sees [230]. If humans are supervising every run closely, this can stay advisory. If the workflow is allowed to proceed unattended, repeated-run evidence becomes part of the verification gate.
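The arithmetic behind that example, assuming independent attempts:

```typescript
const p = 0.7; // per-attempt pass rate
const k = 3;   // consecutive attempts the unattended loop needs

const passPowK = Math.pow(p, k);        // all k attempts succeed: 0.7^3 ≈ 0.343
const passAtK = 1 - Math.pow(1 - p, k); // at least one attempt succeeds: ≈ 0.973

// Pass@k (≈0.97) is what a benchmark headline reports; Pass^k (≈0.34) is what a
// background loop that must get every step right actually experiences.
```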
Build evals from real failures, not synthetic cleverness. Bug reports, flaky patches, architectural drift, and acceptance cases the agent misunderstood produce the reusable regression corpus that tells you whether the workflow improved [231]. Mock tools and scaffolding details still matter, but they matter only because they can make the verification signal lie [227], [232]. The decision rule is simple: add evals when you are standardizing an agent workflow; rely on per-change tests, artifact checks, and review when you are verifying a single diff.
12.5 Verifying at the Artifact, Not the Code
Evals catch logic regressions. They cannot catch a compiling-but-broken artifact: an onClick handler wired to the wrong prop, an API endpoint returning the wrong shape, a CLI that exits zero after doing nothing, a migration that applies but corrupts a seed invariant, or a generated file that is nondeterministic. For agent-authored work, the verification signal has to reach the artifact the user, caller, or downstream system will actually touch. Browser UIs are the most visible case, but the same principle applies to HTTP responses, command output, database state, and generated assets.
First, separate three things that all wear the word “verification” but answer different questions. An artifact smoke check asks: does the thing render, wire up, and run without obvious failure? Deterministic replay asks: did this build behave the same way on a known input as a previous build did? Independent acceptance oracles ask: is the observed behavior the intended behavior? That signal comes from spec-based assertions, reference-suite oracles, or human sign-off. A passing screenshot tells you the artifact loaded, not that it is correct.
Supervised in-IDE browser verification. Use this when you finish a localized component or layout change in an agentic IDE and “it looks wrong” is the failure mode you can’t catch with a unit test. Skip it when you need deterministic regression coverage, when the bug is backend correctness, or when a flaky dev server will train the agent that “retry” is a fix. The pattern: a browser surface treated as a first-class agent surface alongside the editor and terminal, so the agent can open the running app, capture screenshots, and feed visual and console state back as context for the next turn. Google Antigravity makes this explicit by giving the agent dedicated platform space and treating screenshots, recordings, and task lists as artifacts the human reviews instead of scrolling through raw tool calls [10], and lighter-weight skills built around a context-frugal headless browser deliver the same loop inside Claude Code, Codex, Cursor, Windsurf, and Gemini CLI without a dedicated platform [63]. Either way it is a smoke check, not an oracle: Willison’s SwiftUI vibe-coding sessions show how persuasive this loop feels — the agent renders, you eyeball it, you ship — and how easily it papers over correctness bugs in domains the human doesn’t actually know [233].
Autonomous browser self-test. Use this when the artifact is browser-rendered and no human is in the loop on a per-step basis. Skip it when selectors are too unstable for deterministic verification, or when the change is purely backend. In autonomous and background agents — the territory of Chapter 16 — a browser is what stops the loop shipping “Potemkin” features: code that compiles and renders but whose handlers don’t fire. Replit’s Agent 3 pushes this furthest publicly, embedding browser-based testing as a core loop the agent self-decides when to invoke and reporting it as 3x faster and 10x more cost-effective than driving a general computer-use model, with unattended runs of 200+ minutes [80]. Treat those numbers as a vendor-reported direction rather than independently replicated evidence, and read the win carefully: the gain is in catching gross artifact-level breakage, not in confirming intent. An agent that built and tested the same artifact in the same context is still grading its own work, so pair self-test with spec-based assertions or human sign-off before treating an autonomous run as merge-ready. A browser inside an autonomous sandbox is also an attack surface; permissions belong to Chapter 16.
Record-and-replay regression capture. Use this when you have a living product with real user traffic and hand-maintained Playwright or Cypress suites are bit-rotting. Skip it for greenfield projects with no traffic, for backend services where the oracle is not visual, and for UIs under heavy intentional redesign — every deliberate change trips the replay and drowns signal in noise. The pattern lives in CI, not in the agent’s loop, and its evidence model is different: real user sessions captured in staging, replayed against each PR build, with visual or behavioral divergence flagged as a deterministic regression gate. The configuration cost is real — workflow runs on both main and PRs, baseline establishment, deployment-pattern decisions, and explicit approval of intentional visual differences before the check turns green [234]. The same maintenance cost shows up in lower-level visual regression practice: a Vue/Vitest browser workflow renders component stories, captures screenshot baselines, compares snapshots, then requires humans to accept, reject, fix, or update the baseline; the author explicitly treats manual review as part of the current workflow and points teams with advanced needs toward more mature visual-testing tools [235]. That still does not give independent ROI or false-positive rates for product-scale replay. Treat this as pilot-stage guidance: pick one stable flow, record baseline noise, count intentional-difference approvals versus real regressions, define a PII policy before capturing sessions, and standardize only if the signal-to-noise stays acceptable.
API and CLI artifact verification. Use this when the artifact is a protocol response, command-line behavior, or generated output that downstream callers consume. Skip it when the change is purely internal and already covered by narrower tests. The pattern is to run the built artifact and assert the contract at the boundary: HTTP status, response schema, headers, idempotency behavior, side effects, stdout/stderr, exit code, and generated file bytes. A unit test can prove that a handler function returned the right object; an artifact test proves that the running service, router, middleware, serializer, auth layer, and client-visible response still agree. For CLIs, keep golden invocations with fixed inputs and compare both exit status and output shape. For APIs, drive real requests against a local or staging instance and validate against the OpenAPI or schema contract rather than the implementation object the agent just edited.
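A golden-invocation sketch for a hypothetical report-gen CLI, run from a Vitest test; the binary path, flags, and fixture files are assumptions.

```typescript
import { spawnSync } from "node:child_process";
import { readFileSync } from "node:fs";
import { expect, test } from "vitest";

test("report-gen produces the golden CSV and exits zero", () => {
  // Fixed input in, exit status and output bytes compared against a
  // checked-in golden file: "exits zero after doing nothing" fails here.
  const run = spawnSync(
    "node",
    ["dist/report-gen.js", "--input", "fixtures/orders.json", "--format", "csv"],
    { encoding: "utf8" },
  );

  expect(run.status, run.stderr).toBe(0);
  const golden = readFileSync("fixtures/orders.golden.csv", "utf8");
  expect(run.stdout).toBe(golden);
});
```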
Database, migration, and generated-file verification. Use this when an agent changes schema, data migration logic, code generation, lockfiles, reports, or any artifact whose correctness is visible only after a build or apply step. Skip it for throwaway scaffolding where no downstream system consumes the output. The pattern is to apply the migration or generation step in a disposable environment, then assert invariants: schema diff is expected, row counts and foreign-key relationships survive, generated files are deterministic across two runs, and rollback or re-run behavior is defined. This is the artifact-level cousin of contract testing: the oracle is not “the script ran”; it is “the persisted artifact still satisfies the contract downstream systems rely on.”
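A determinism check is the smallest useful version of this, sketched here for a hypothetical codegen step; the generator command and output file are assumptions.

```typescript
import { execFileSync } from "node:child_process";
import { createHash } from "node:crypto";
import { readFileSync, mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { expect, test } from "vitest";

// Run the generator into a disposable directory and hash its main output.
function generateInto(dir: string): string {
  execFileSync("node", ["dist/codegen.js", "--out", dir], { stdio: "inherit" });
  return createHash("sha256").update(readFileSync(join(dir, "client.ts"))).digest("hex");
}

test("code generation is deterministic across runs", () => {
  // Two runs, two temp dirs: nondeterminism shows up as a failing diff now,
  // not as a surprise churned lockfile or regenerated client in the next PR.
  const first = generateInto(mkdtempSync(join(tmpdir(), "gen-")));
  const second = generateInto(mkdtempSync(join(tmpdir(), "gen-")));
  expect(second).toBe(first);
});
```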
The verification loop becomes concrete when you run one change through it end to end. Suppose an agent adds idempotency-key support to a refund API. The spec names the externally visible contract: duplicate requests with the same key must return the original refund result, not issue a second refund. A test-writer session sees only that spec and the public API shape, then writes failing request-level tests plus boundary cases for missing, reused, and mismatched keys. The implementation session is given those tests and a read-only rule for test files. Structural checks catch forbidden imports or unsafe transaction boundaries. Artifact verification runs the service and drives real HTTP calls against the local endpoint, validating status codes, response schema, database side effects, and idempotent replay behavior. If this failure class has appeared before, an eval task captures the original bug report and checks that the agent workflow still solves it across repeated runs. The merge policy then requires the spec-linked tests, integration contract, artifact check, and human review evidence before release. That is the difference between “the tests pass” and “the change survived independent verification.”
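The artifact-level piece of that walkthrough might look like the following request-level test, written from the spec alone; the endpoint, header, and field names are assumptions.

```typescript
import { test, expect } from "@playwright/test";

test("duplicate refund requests with the same key are idempotent", async ({ request }) => {
  const key = `idem-${Date.now()}`;
  const body = { orderId: "ord_123", amountCents: 2_500 };

  const first = await request.post("/api/refunds", {
    data: body,
    headers: { "Idempotency-Key": key },
  });
  expect(first.status()).toBe(201);
  const original = await first.json();

  const replay = await request.post("/api/refunds", {
    data: body,
    headers: { "Idempotency-Key": key },
  });
  // Same refund, not a new one: the id and amount must match the original.
  expect([200, 201]).toContain(replay.status());
  expect(await replay.json()).toMatchObject({ id: original.id, amountCents: 2_500 });

  // Side-effect check against the running artifact, not the handler object:
  // exactly one refund exists for this order after both calls.
  const list = await request.get("/api/orders/ord_123/refunds");
  expect((await list.json()).length).toBe(1);
});
```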
12.6 Feedback Loops That Strengthen Verification
From a verification standpoint, what matters most is how quickly and clearly tooling tells the agent — and you — that something is wrong. The single highest-leverage move is giving the agent verification criteria it can check independently — tests, type errors, lint output, screenshots, expected outputs — so it can self-correct against an external signal instead of guessing at intent [236].
Type systems as fast failure signals. Strong typing constrains what the agent can generate. When the compiler rejects invalid code immediately, the agent gets fast, unambiguous feedback and self-corrects. Languages with explicit type systems — Go, TypeScript with strict mode, Rust — produce better agent results than languages with implicit magic, and the same effect shows up at the tool boundary: when Anthropic’s SWE-bench harness validates that a string-replace edit matches exactly one location before applying it, the agent sees a precise error message and converges, where a silent partial edit would have sent the loop further off the rails [228]. Apply the same instinct to your code: ban loose type assertions like as any, turn on noUncheckedIndexedAccess, enforce complexity limits, and prefer schemas at the IO boundary so a misread shape fails on the first call instead of three layers deep.
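A sketch of the IO-boundary half of that advice, using zod as one option; the endpoint and response shape are hypothetical.

```typescript
import { z } from "zod";

// A schema at the IO boundary: if an agent renames or retypes a field in the
// refund response, the first call fails loudly here instead of surfacing as
// an undefined value three layers deeper.
const RefundResponse = z.object({
  id: z.string(),
  amountCents: z.number().int().nonnegative(),
  status: z.enum(["pending", "settled", "failed"]),
});
type RefundResponse = z.infer<typeof RefundResponse>;

export async function getRefund(refundId: string): Promise<RefundResponse> {
  const res = await fetch(`/api/refunds/${refundId}`);
  if (!res.ok) throw new Error(`GET /api/refunds/${refundId} -> ${res.status}`);
  return RefundResponse.parse(await res.json()); // throws on shape drift
}
```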
Linters as verification layers. Linters catch categories of error no test suite covers. Feature-boundary enforcement through import rules prevents agents from violating architectural boundaries — turn on path-boundary rules in ESLint, dependency-cruiser, or the equivalent import-restriction tooling for your stack so a bad cross-layer import fails immediately instead of leaking into review. A concrete example: when an agent “fixes” a circular dependency by importing a domain helper directly into a UI component, a no-restricted-imports rule keyed to your folder boundaries fails the lint step in seconds, and the agent’s next turn corrects the layering on its own. Without that rule, the same drift slips through type checking and the test suite, and you discover it weeks later as a tangled module graph.
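A minimal version of that rule in ESLint flat-config form, with illustrative folder boundaries; adapt the globs to your own layering.

```js
// eslint.config.mjs — UI code may not reach into domain internals.
export default [
  {
    files: ["src/ui/**/*.{ts,tsx}"],
    rules: {
      "no-restricted-imports": [
        "error",
        {
          patterns: [
            {
              group: ["**/domain/internal/**"],
              message: "Import from the domain package's public API, not its internals.",
            },
          ],
        },
      ],
    },
  },
];
```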
Test suites as external oracles. Comprehensive test suites remain the single strongest enabling factor for agent-assisted work. They provide the external oracle that prevents the agent from optimizing for internal consistency rather than correctness. Test suites built for human developers serve agents even better: humans can reason about untested edge cases, but agents cannot. If your codebase lacks good tests, invest in them before investing in agent workflows [221].
Put the pieces together as an operating sequence. Start with explicit acceptance criteria or a failing reproduction. Generate or review tests from the spec before accepting the implementation. Run structural checks such as types, linters, and import-boundary rules. For changes that cross a process, network, or schema boundary, require an integration or contract test that exercises the boundary. For UI changes, verify at the rendered artifact — supervised browser verification in the loop, autonomous self-test before merge, or record-and-replay regression once the feature is live. Add property, fault, or stress tests when the risk profile warrants them. Re-run small eval suites on known failures. Then treat green CI as evidence, not proof. The point is not to create one giant gate — it is to manage the tradeoff between speed and real confidence by stacking independent signals until the remaining risk is small enough to ship.
Which signals should block a merge, and which should stay advisory, depends on risk and autonomy. The following matrix is a default; tighten it for higher-blast-radius systems, loosen it for throwaway work.
| Context | Advisory (warn, don’t block) | Blocking (must pass to merge) |
|---|---|---|
| Exploratory or prototype work | Types, lint, formatter, generated-test coverage | Build succeeds; nothing else hard-gates |
| Scoped change in well-tested code | Coverage deltas, complexity warnings | Types, lint, existing test suite |
| New behavior or new module | Coverage, mutation score | Spec-first tests tied to acceptance criteria, plus the scoped-change gates |
| Boundary-crossing change (API, schema, queue, cross-service) | Differential / mutation score | Integration test at the real boundary; contract test against the authoritative API or message schema if more than one side or team consumes it |
| UI change | Visual diff thumbnails, console-warning counts | Artifact smoke check (renders, wires up, no console errors) and, where piloted successfully, record-and-replay regression; human sign-off on intent |
| Autonomous or background-agent run | Agent self-report, transcript summary | All of the above for the change class, plus independent verification evidence such as eval-suite pass, replay pass, or artifact-level contract check. |
Two rules tie the matrix together. First, the blocking set scales with autonomy: the less a human watched the change being made, the more the verification gate has to stand in for that attention, which is why autonomous runs require independent-oracle evidence rather than the agent’s own self-test. Second, advisory checks are not decorative — they are how you decide when to promote a check to blocking. If a lint rule, a contract test, or a coverage threshold is catching real defects in review week after week, move it left into the blocking column for that change class. If a blocking check is producing mostly noise, demote it before the team learns to override it. The goal is a verification gate that earns trust, not a wall that gets routed around. Review-comprehension gates belong in Chapter 11; rollback and release safeguards belong in Chapter 13.
12.7 Takeaways
- Audit every agent-generated test by asking one question: does this test fail if the implementation is wrong in a way that matters? If the answer is no, the test is mirroring the implementation, not verifying the spec — discard or rewrite it.
- To preserve an independent oracle, use a separate test-writer session — with only the spec, public interface, and relevant existing tests — to produce failing tests and a short oracle note before implementation starts; review those tests yourself, then start the implementation session with test files treated as read-only unless you explicitly approve a change.
- Match the verification method to the failure class: differential tests for duplicated logic, property-based testing for logic boundaries, fault injection for error paths, stress loops for concurrency and crash recovery, integration and contract tests for cross-component agreement, static analysis for structural drift, and specification tracing for requirement misalignment — each class requires a different oracle that naive agent-generated tests won’t supply.
- Build evals only when standardizing a reusable agent workflow — a harness of representative tasks, failing reproductions, and binary pass/fail acceptance checks run across repeated trials. For one-off changes, rely on spec-first tests, integration checks, and review instead.
- For UI changes, verify at the rendered artifact with an artifact smoke check (renders, wires up, no console errors) rather than trusting that passing unit tests mean the component works; for API and CLI changes, run real requests against the local or staging endpoint and validate status, response schema, and side effects against the contract.
- Use strict type and schema checks as a fast verification layer: enable options like TypeScript’s noUncheckedIndexedAccess, ban loose assertions like as any, and validate I/O with schemas so shape mistakes fail before review.
- Scale merge-blocking checks with the autonomy level of the change: for an autonomous or background-agent run, require independent verification evidence (eval-suite pass, replay pass, or artifact-level contract check) in addition to all per-change gates — the less a human watched the change being made, the more the gate must substitute for that attention.