22 Measuring Productivity and Impact
The numbers always go up — adoption, lines generated, PRs merged — but the outcomes stay flat, and the developers are getting slower.
22.1 The Productivity Paradox: Universal Adoption, Modest Measurable Gains
Ninety-three percent of developers use AI coding tools at least monthly. Seventy-five percent use them weekly. Adoption is essentially universal [400], [401]. And yet, when you look at what that adoption actually produces, the numbers tell a different story: industry-wide productivity gains have plateaued at roughly 10%, and they’ve been flat for quarters [402]. Developers report saving about four hours per week — the same number they reported two quarters ago. The time-saving boost leveled off and stayed there. Read this through the METR interpretation established in Chapter 7: different regimes expose different workflow bottlenecks. The 19% slowdown in experienced familiar-repo work, the 100K+ LOC decomposition wins in Chapter 8, and the aggregate plateau here are not contradictions; they are measurements of different parts of the same system. Local generation speed becomes system throughput only when specification, decomposition, verification, and review capacity all move with it. The sustainability warning from Chapter 20 belongs in the same measurement frame: a workflow that raises output while increasing compulsion, fatigue, or review avoidance is borrowing from the operator, not improving productivity [403].
Distribution matters more than the mean. The skilled-decomposition gains in Chapter 8 and the spec discipline from Chapter 7 are exactly what the aggregate plateau hides: strong operators can move one part of the distribution while broad adoption leaves the median workflow bottleneck unchanged.
This is the measurement-side version of the review bottleneck from Chapter 11. If teams merge far more AI-assisted pull requests while review time, bug rate, and architectural scrutiny move in the wrong direction, a plateau is exactly what you should expect: generation improves locally while the system bottleneck shifts to verification, integration, and comprehension [400].
The most uncomfortable data point comes from METR’s randomized controlled trial: sixteen experienced open-source developers, 246 real tasks on their own repositories — codebases they’d contributed to for years, averaging 22,000+ stars and over a million lines of code. When allowed to use AI tools (primarily Cursor Pro with Claude 3.5/3.7 Sonnet), these developers took 19% longer to complete their tasks [404], [405]. Not faster. Slower. The result held across different outcome measures, estimator methodologies, and data subsets.
The perception gap is the real headline. Before starting, developers predicted a 24% speedup. After finishing — having been objectively measured as 19% slower — they still believed AI had made them 20% faster [406], [407]. That’s a 39-percentage-point disconnect between perception and reality. And 69% of participants continued using AI after the study, despite the measured slowdown. AI makes work feel easier even when it takes longer [401].
Why does a tool that generates code instantly produce a net loss of time? Five factors converge: unclear requirements in real code that agents can’t infer, context window limitations that cause agents to forget task details mid-execution, agent brittleness on unfamiliar patterns, developer context-switching overhead while waiting for and reviewing AI output, and implicit quality requirements not captured by the initial task description [404], [405]. The METR finding is specific to experienced developers on familiar codebases — it doesn’t generalize to beginners or unfamiliar code. But that’s precisely the scenario most working engineers face daily. One developer with 50+ hours of Cursor experience did see a 38% speedup, suggesting a high experience threshold before the tools pay off [404]. The investment is real, the learning curve is steep, and the payoff is not guaranteed.
Controlled trials from other organizations show more favorable numbers — Google’s RCT found a 21% gain, a multi-company trial showed 26% average improvement — but these gains consistently favor junior developers (35–39% boost) over seniors (8–16%) [406]. The uncomfortable implication, at least in these studies, is that AI tools tend to help most on tasks and codebases the developer does not yet know well, and help least on the kinds of decisions experienced engineers spend most of their time on. That pattern is not universal — individual results vary widely with task type, tooling, and tenure — but for the experienced engineers making the hardest decisions on the most complex codebases, the productivity narrative rings hollow.
22.2 Lines of Code Are Back (and It’s Worse)
For forty years, the software industry agreed that lines of code is a counterproductive metric — from Dijkstra’s dismissal to DeMarco’s later retraction of his own measurement dogma [408]. The consensus was settled. Faros’s own analysis frames the problem the same way: percentage-of-AI-generated-code was already a discredited measure before AI arrived, and tool fragmentation plus the gap between accepted and shipped code made it less reliable, not more [409].
Then AI showed up, and executives resurrected it. Sundar Pichai told investors that 25% of Google’s new code was AI-generated; by mid-2025, the figure was past 30%. Satya Nadella claimed 20–30% of Microsoft’s code was machine-written. Dario Amodei predicted 90% of code would be AI-written within six months [408]. The numbers only go up, and they’re presented as achievements — on earnings calls, in press releases, at conferences. Nobody reports “percentage of bugs introduced by AI-generated code” or “percentage of AI code that survived review unchanged.”
The tooling reinforces the problem. GitHub Copilot’s dashboard shows “Total Lines Suggested” and “Total Lines Accepted” as primary metrics. Cursor tracks lines added per user. Vendor measurement frameworks lean toward PRs merged per engineer per day and percentage of code “written with assistance” — Anthropic’s published numbers (a 67% internal PR velocity increase, 70–90% of code written with Claude Code) are useful calibration data, but they are also volume-leaning by construction, and the framework explicitly notes attribution only counts when there is high confidence in agent involvement [410]. Social media amplifies the volume framing further — viral posts celebrate “100k lines in two weeks” without asking whether those lines were maintainable, reviewed, or even deployed [411]. Nobody reasonably reads, understands, and vets 10,000 lines of code per day. When someone brags about that volume, they’re advertising a process failure, not a success.
You will have a telemetry dashboard like this whether you want one or not. If your organization runs Copilot Enterprise, Tabnine, Amazon Q Developer, or any similar org-scope deployment, an admin surface already reports acceptance rate, active users, suggestion volume, and chat usage — usually with an export path into whatever BI stack finance reads [412]. Use that surface for what it is good for: sizing the next seat purchase, spotting teams that claim “we all use it” but can’t point to a number, and answering finance when the renewal conversation arrives with no usage baseline. That is legitimate work and the dashboard is built for it.
Do not use per-developer acceptance rate as a performance metric. The moment that number feeds into reviews or ratings, gaming starts — developers learn to accept suggestions they would otherwise reject, managers learn to chase the leaderboard, and the signal dies in the first quarter [411]. Activity telemetry shows adoption, not value; it is necessary but not sufficient. The failure modes stack: leadership optimizes for acceptance rate and teams game it; the usage chart becomes a productivity proxy despite every METR-style finding saying it shouldn’t; and surfacing per-developer telemetry too aggressively chills the honest adoption you needed the dashboard to measure in the first place. Pair telemetry with the outcome metrics in Section 22.4, where the signal actually lives.
The data on what happens when you measure lines directly confirms the damage. GitClear analyzed 211 million changed lines across private and public repositories from 2020 to 2024: AI-generated code showed 41% higher churn (revised within two weeks), an eightfold increase in duplicated code blocks, and 2024 was the first year copy-pasted lines exceeded moved lines — a historic reversal in how code gets written [406], [409]. Refactoring collapsed 60%, from 24.1% of lines moved or restructured down to 9.5%. Copy-pasted code in production repos rose from 8.3% to 12.3%. The code isn’t just proliferating — it’s getting worse.
Meta now connects performance ratings to lines of code generated by AI, which is “totally the wrong metric” by any established software engineering standard [413]. When you measure tokens spent or lines generated, Goodhart’s Law takes over: the metric becomes the target, and what you actually wanted — working, maintainable software — disappears from view [408]. As Chapter 16 discussed, agents will happily generate overcomplicated solutions if unchecked. Measuring their output by volume actively rewards that tendency.
22.3 DORA Metrics in the Agent Era
Chapter 11 is the canonical introduction for DORA in this book: deployment frequency, lead time for changes, change failure rate, and mean time to recovery. The narrower question here is whether those delivery-health measures still tell the truth when AI changes code volume faster than review capacity changes [414].
The honest answer is: only if you decompose them. Aggregate DORA metrics behave well when humans wrote and reviewed at roughly the same rate; they behave badly when generation outpaces review. Deployment frequency starts measuring how quickly you’re pushing unreviewed code; lead time shrinks because a machine typed faster, not because the process improved [415]. The Faros AI study across 10,000+ developers showed exactly that pattern in operating numbers: 21% more tasks and 98% more PRs merged, with review time up 91%, bugs up 9%, and organizational DORA metrics flat [400], [409].
The DORA 2024 report quantified the damage: for every 25 percentage point increase in AI adoption, delivery throughput dropped 1.5% and delivery stability dropped 7.2%. The 2025 report, at 90% adoption, was blunt: AI doesn’t fix a broken pipeline, and it doesn’t replace organizational discipline [416]. AI is an amplifier — organizations with strong deployment practices, good testing discipline, and fast code review cycles see returns. Organizations without those foundations see existing weaknesses magnified.
The practical move is to stop treating DORA as a scorecard and start using it as a diagnostic instrument. Three operating habits make that real. First, decompose lead time into phases — coding, review, deployment — so you can see where the time actually lives; as Chapter 11 detailed, code review is now the longest phase in most teams’ cycle times, and aggregate metrics mask exactly where the bottleneck sits. Second, watch ratios, not absolutes: a team whose deployment frequency doubles while change failure rate also doubles has not improved; it has accelerated into a wall. Third, supplement DORA with metrics that capture the new bottleneck directly — review time per PR, review thoroughness (comments per PR, requested changes rate), and defect escape rate (bugs found after merge versus before) [411]. DORA still works; it just stops working alone.
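As a sketch of what that decomposition can look like in practice, assuming you can export per-change timestamps (first commit, review start, approval, deploy) and a failure link from your code host and pipeline, the helpers below compute the phase medians and the ratio check; the field names are illustrative, not any vendor's API:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

@dataclass
class Change:
    first_commit: datetime
    review_started: datetime
    approved: datetime
    deployed: datetime
    caused_failure: bool  # linked to a rollback, hotfix, or incident

def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600

def phase_medians(changes: list[Change]) -> dict[str, float]:
    """Decompose lead time into coding, review, and deployment phases (hours)."""
    return {
        "coding": median(_hours(c.first_commit, c.review_started) for c in changes),
        "review": median(_hours(c.review_started, c.approved) for c in changes),
        "deploy": median(_hours(c.approved, c.deployed) for c in changes),
    }

def accelerating_into_a_wall(before: list[Change], after: list[Change]) -> bool:
    """Ratios, not absolutes: deployment frequency up while change failure
    rate rises at least as fast is not an improvement."""
    freq_ratio = len(after) / max(len(before), 1)
    def cfr(cs: list[Change]) -> float:
        return sum(c.caused_failure for c in cs) / max(len(cs), 1)
    cfr_ratio = cfr(after) / max(cfr(before), 1e-9)
    return freq_ratio > 1.0 and cfr_ratio >= freq_ratio
```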
22.4 What to Measure Instead
If lines of code and aggregate deployment frequency do not capture what matters, what does? Use a four-axis scorecard, not a single replacement metric.
| Axis | Question | Practical signal |
|---|---|---|
| Throughput | Are valuable changes reaching users faster end to end? | Time-to-value, decomposed into intake, design, implementation, review, deploy, and post-deploy verification. |
| Quality | Are AI-assisted changes surviving contact with production? | Defect origin rate, escaped defects, rollback/reopen rate, and rework after substantive review. |
| Cost | Is spend buying verified delivery rather than activity? | Cost per verified unit of delivery, joined to review-cleared PRs or features that reached users. |
| Comprehension | Do humans still understand what they accepted? | Reviewer explanation evidence, comprehension-debt probes, onboarding-to-first-fix time, and post-incident ownership clarity. |
That scorecard is the constructive answer to bad productivity narratives. DORA remains a delivery-health diagnostic. Vendor telemetry remains an adoption and spend surface. The book’s claim is narrower and stricter: AI improves engineering productivity only when throughput, quality, cost, and comprehension move together. The playbook below gives one lightweight way to compute those signals from artifacts most teams already have.
22.5 Making Cost a First-Class Outcome Metric
Cost belongs in the same conversation as throughput, quality, and comprehension — and the operational detail matters, because the easy framings (“token spend per dev per day,” “seats deployed”) repeat the same volume-as-value mistake the previous sections just dismantled.
Start with the right unit. Token prices have fallen, but per-task cost has risen, because agentic workflows consume 10x–100x more tokens per task than simple completions, and reasoning tokens are billed as output tokens — so dialing thinking budgets up directly inflates spend [326]. The cost surface you actually operate on is cost per verified unit of delivery: tokens or seat dollars in the numerator, PRs that cleared substantive review or features that hit time-to-value in the denominator. Watch the ratio over time; the absolute spend by itself answers no useful question.
The data already exists, but it lives in three different places, and you have to join them carefully:
- SDK or agent telemetry gives you per-call token counts and a `total_cost_usd` estimate. Treat that estimate as exactly that — a client-side estimate computed from a bundled price table that drifts when pricing changes or when the SDK doesn’t recognize a model. Use it for fast feedback and per-feature attribution; do not present it to finance as authoritative [417].
- Vendor billing exports (Anthropic console, OpenAI usage, Amazon Q Pro line items, Copilot enterprise billing) are the authoritative numbers for finance and renewal conversations [412].
- Operational benchmarks let you sanity-check your own numbers. Anthropic publishes Claude Code reference points: roughly $13 per developer per active day on average, $150–250 per developer per month, with 90% of users under $30 per active day, and agent-team patterns (parallel teammates each holding their own context) consuming around 7x the tokens of a standard session [418]. If your spend is a multiple of those numbers with no corresponding lift in verified throughput, that is the signal.
Two parallel-tool gotchas distort cost data before you even start interpreting it. First, when an agent uses multiple tools in a single turn, those messages share an ID with identical usage data, and naive aggregation double-counts every parallel call; deduplicate by message ID before summing [417]. Second, the Agent SDK does not provide a session-level cost total — each query() returns only its own cost — so session-level numbers must be accumulated explicitly [417]. Skip either step and your “rising spend” signal is a measurement bug.
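A minimal accumulation sketch under those two constraints follows; it assumes each telemetry record exposes an `id` and a cost field, which is a simplification of whatever message shape your SDK version actually emits:

```python
def session_cost(messages: list[dict]) -> float:
    """Accumulate cost for one query's messages, counting each message ID once.

    Parallel tool calls in a single turn share an ID and carry identical usage
    data, so naive summing double-counts them; skip IDs already seen.
    """
    seen: set[str] = set()
    total = 0.0
    for msg in messages:
        msg_id = msg.get("id")
        if msg_id in seen:
            continue  # duplicate record from a parallel tool call
        if msg_id is not None:
            seen.add(msg_id)
        total += msg.get("total_cost_usd", 0.0)
    return total

# Each query reports only its own cost, so a session total has to be
# accumulated explicitly across queries (illustrative):
# session_total = sum(session_cost(result_messages) for result_messages in all_query_messages)
```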
A short worked example. Picture a team that watches its monthly Claude Code spend rise from $180 to $260 per developer — well past the $150–250 band [418]. The naive read is “agent costs are out of control.” The right read is to join that number to verified outcomes from the metrics above: Did rework after substantive review fall? Did time-to-value shorten on the features that consumed most of the spend? Did defect origin rates on AI-authored changes drop? If the answer is yes — say, time-to-value down 30% on three high-value features that absorbed most of the increase — the spend is buying delivery. If the answer is no — spend up, rework flat, time-to-value flat — then the cost increase is most likely poor agent loop design (verbose tool manifests, repeated retrieval payloads, parallel teammates duplicating context) rather than model pricing [326]. The measurement rule is the important part here: rising cost is acceptable only when verified delivery rises with it. The concrete cost levers belong in Chapter 18; this chapter only insists that their effect be measured against outcomes, not activity.
22.6 The Lightweight Measurement Playbook
The point of these metrics is not to produce universal thresholds. It is to give teams a starter set of heuristics they can compute from artifacts they already have and then tune against their own baseline.
Time-to-value. Measure the elapsed time from accepted intent — ticket creation, approved spec, or the equivalent durable artifact in your tracker — to production evidence that users are exercising the change (deploy plus the first telemetry, usage, or completed user action that confirms value, not just a successful release). If the median remains stubbornly long even for small features, decompose the span into phases — intake, design, implementation, review, deploy, post-deploy verification — to see where the time actually lives. As a diagnostic subspan, also track the code-to-deploy interval (first commit to first production deploy) on the same features; when end-to-end time-to-value is long but code-to-deploy is short, the constraint is upstream of code generation, which is exactly where AI is least likely to help.
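A minimal sketch of that calculation, assuming you can assemble per-feature timestamps for each phase boundary from your tracker, code host, deploy pipeline, and product analytics (the phase names and dict shape are illustrative):

```python
from datetime import datetime
from statistics import median

# Phase boundaries, in order; populate one timestamp per phase per feature.
PHASES = ["intake", "design", "implementation", "review", "deploy", "verified"]

def time_to_value_days(feature: dict[str, datetime]) -> float:
    """Accepted intent (ticket/spec) to first evidence users exercised the change."""
    return (feature["verified"] - feature["intake"]).total_seconds() / 86400

def phase_breakdown(features: list[dict[str, datetime]]) -> dict[str, float]:
    """Median days spent in each phase, to show where the time actually lives."""
    spans: dict[str, float] = {}
    for start, end in zip(PHASES, PHASES[1:]):
        spans[f"{start}->{end}"] = median(
            (f[end] - f[start]).total_seconds() / 86400 for f in features
        )
    return spans
```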
Code half-life. Measure how long newly introduced lines survive before major modification or deletion. If agent-authored code is rewritten materially faster than human-authored code, treat that as a warning that your upstream specs or context are too weak. Do not treat any fixed number of days as a law; use the comparison against your own baseline as the signal.
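One hedged way to compute the comparison, assuming you can extract, per hunk, an origin tag, an introduction date, and the date of first major modification (if any) from your history tooling:

```python
from datetime import date
from statistics import median

def survival_days(introduced: date, modified: date | None, today: date) -> float:
    """Days a hunk lasted before major modification; still-intact hunks are
    right-censored at today, so their value is a floor on true survival."""
    end = modified or today
    return (end - introduced).days

def half_life_by_origin(
    hunks: list[tuple[str, date, date | None]], today: date
) -> dict[str, float]:
    """hunks: (origin, introduced, first_major_modification_or_None).
    Compare medians against your own baseline, not a fixed threshold."""
    by_origin: dict[str, list[float]] = {}
    for origin, introduced, modified in hunks:
        by_origin.setdefault(origin, []).append(
            survival_days(introduced, modified, today)
        )
    return {origin: median(days) for origin, days in by_origin.items()}
```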
Defect origin rate. During incident review, tag the root-cause change by origin and compare defect rates across AI-origin and human-origin work. If AI-authored changes are producing defects at a meaningfully higher rate, tighten review requirements or invest in stronger upstream specifications.
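The computation itself is a simple normalization; the sketch below assumes you already count changes and incident root causes per origin tag:

```python
def defect_origin_rate(
    changes_by_origin: dict[str, int], incidents_by_origin: dict[str, int]
) -> dict[str, float]:
    """Incidents whose root-cause change carries a given origin tag, per change
    of that origin. The origin tag is assigned during post-incident review."""
    return {
        origin: incidents_by_origin.get(origin, 0) / max(count, 1)
        for origin, count in changes_by_origin.items()
    }

# Example: 400 AI-authored and 300 human-only changes, 12 and 6 incident root
# causes respectively -> {"ai-authored": 0.03, "human-only": 0.02}
```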
Rework after substantive review. Count the share of AI-authored PRs that reach merge without further change after a review that cleared a depth bar. Define that bar before you read the numbers: minimum review time relative to diff size, required comment or question density on non-trivial changes, risk-tiered review, and deterministic gates — tests, type checks, lint, security and policy scans — that must pass independently of any reviewer’s approval. Pair the metric with defect escape rate, post-merge reopens, and scope violations found later; a rate that climbs while escapes also climb is evidence of shallow review, not improving quality. Require reviewer evidence (comments, questions, requested changes, or an explicit attestation that the diff was inspected) before a PR counts toward the numerator. Watch the trend over time as rules files, spec templates, and review guidance improve — and investigate any sudden jump as a possible gaming signal rather than a win.
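A sketch of that depth bar and the resulting rate, with placeholder thresholds you would tune against your own baseline rather than treat as recommended values:

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    diff_lines: int
    review_minutes: float
    review_comments: int
    gates_passed: bool          # tests, type checks, lint, security/policy scans
    reviewer_attested: bool     # explicit evidence the diff was inspected
    commits_after_review: int   # pushes after the first substantive review

def cleared_depth_bar(pr: PullRequest) -> bool:
    """Placeholder thresholds: define them before reading the numbers."""
    min_minutes = max(5.0, pr.diff_lines / 50)      # review time scaled to diff size
    min_comments = 1 if pr.diff_lines > 100 else 0  # non-trivial diffs need discussion
    return (pr.gates_passed and pr.reviewer_attested
            and pr.review_minutes >= min_minutes
            and pr.review_comments >= min_comments)

def no_rework_rate(prs: list[PullRequest]) -> float:
    """Share of PRs that cleared the depth bar and merged with no further change."""
    eligible = [pr for pr in prs if cleared_depth_bar(pr)]
    if not eligible:
        return 0.0
    return sum(pr.commits_after_review == 0 for pr in eligible) / len(eligible)
```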
Cost per verified unit of delivery. Pair token spend or seat spend with a verified outcome: PRs that cleared substantive review, deployed features that hit time-to-value, or incidents closed without regression. The useful question is not whether token spend is high in the abstract. It is whether spend is rising faster than verified throughput. If it is, the cost problem is usually not model pricing alone but weak review conversion or poor upstream specs [326].
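A minimal sketch of the ratio and the trend check, assuming monthly spend and verified-delivery counts are already joined:

```python
def cost_per_verified_delivery(spend_usd: float, verified_units: int) -> float:
    """Spend over PRs that cleared substantive review or features that hit
    time-to-value. Watch the trend, not the absolute number."""
    return spend_usd / max(verified_units, 1)

def spend_outrunning_delivery(months: list[tuple[float, int]]) -> bool:
    """months: chronological (spend_usd, verified_units) pairs. True when spend
    grew faster than verified throughput between the first and last month."""
    (spend_start, units_start), (spend_end, units_end) = months[0], months[-1]
    spend_growth = spend_end / max(spend_start, 1e-9)
    delivery_growth = max(units_end, 1) / max(units_start, 1)
    return spend_growth > delivery_growth
```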
A worked instrumentation vignette. None of the metrics above require a bespoke analytics platform; they require joining surfaces a typical engineering org already has. Picture a team with a code host, a CI system, a deploy pipeline behind feature flags, an incident tracker, and the standard finance export from their AI vendors. They wire one thin measurement flow on top:
- AI attribution comes from a PR label (`ai-assisted`, `ai-authored`, `human-only`) or a commit trailer (`Assisted-by: <agent>`); a CI check rejects PRs that carry neither. This becomes the join key for everything downstream [410].
- Review evidence is read from the code host’s API: review states (approved, changes-requested), comment counts, requested-changes events, and the timestamps of any commits pushed after the first substantive review.
- CI gate results — tests, type checks, lint, security and policy scans — are emitted as structured events keyed to the PR, so “passed deterministic gates” is a queryable fact rather than a screenshot.
- Deploy and usage telemetry links merged PRs to release IDs, feature-flag exposures, and product events (completed user actions, API calls) in the analytics warehouse.
- Incident and bug labels in the incident tracker carry a root-cause PR link and an origin tag, populated during post-incident review.
- Billing, seat, and token exports from Copilot, Cursor, Tabnine, Amazon Q, and the underlying model vendors land in the same warehouse on a daily cadence [412].
That gives a direct principle-to-surface mapping. Time-to-value is the span from ticket or spec creation to the first deploy event and then the first usage event for the flagged change. Defect origin is the join from incident root-cause PR to the AI-attribution label on that PR. Rework after substantive review is the count of post-review commits before merge on PRs that also cleared the deterministic CI gates and carry reviewer evidence. Cost per verified delivery is the billing or token export joined to the count of PRs that cleared substantive review and produced a usage event in the agreed window.
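A compressed sketch of that join, using plain dictionaries keyed by PR number; the field names and label values are illustrative, but the structure shows how each metric is one join away from the attribution label:

```python
def join_measurement_surfaces(prs, reviews, incidents, usage_events, monthly_spend_usd):
    """Minimal join keyed on PR number.

    prs:          {pr: {"origin": "ai-authored"/"ai-assisted"/"human-only",
                        "merged": bool, "gates_passed": bool}}
    reviews:      {pr: {"attested": bool, "post_review_commits": int}}
    incidents:    [{"root_cause_pr": pr}, ...]
    usage_events: {pr: True if a usage event was seen in the agreed window}
    """
    verified = [
        pr for pr, meta in prs.items()
        if meta["merged"] and meta["gates_passed"]
        and reviews.get(pr, {}).get("attested", False)
    ]
    ai_prs = {pr for pr, meta in prs.items() if meta["origin"] != "human-only"}
    ai_incidents = sum(1 for i in incidents if i["root_cause_pr"] in ai_prs)
    delivered = sum(1 for pr in verified if usage_events.get(pr, False))
    return {
        "rework_free_rate": sum(
            reviews[pr].get("post_review_commits", 0) == 0 for pr in verified
        ) / max(len(verified), 1),
        "defect_origin_rate_ai": ai_incidents / max(len(ai_prs), 1),
        "cost_per_verified_delivery": monthly_spend_usd / max(delivered, 1),
    }
```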
Then segment every dashboard. Aggregate numbers hide the one finding that has held up across studies: AI’s impact varies sharply by task type (greenfield feature vs. legacy refactor vs. infra change), by risk tier (data, security, or money-handling code vs. internal tooling), by developer experience and tenure, and by codebase familiarity. A team-wide average that mixes a senior engineer in a decade-old monorepo with a new hire on a fresh service produces a number that is true in aggregate and false for every concrete decision. Slice by those four dimensions before drawing any conclusion, and resist the urge to reinflate this stack into an analytics program of its own — the value is in the join, not in the dashboards.
The through-line is simple: treat these numbers as diagnostic starter heuristics, not industry-certified benchmarks. The right target is the one that helps your team calibrate decisions against shipped outcomes, not the one that sounds neat on a dashboard.
22.7 Comprehension Debt: The Metric Nobody Tracks
Comprehension debt is the growing gap between how much code exists in your system and how much of it any human being genuinely understands [402]. Unlike technical debt, comprehension debt breeds false confidence. The codebase looks clean. The tests are green. The reckoning arrives when something breaks and nobody can explain the system well enough to change it safely.
The empirical evidence is stark. In a randomized controlled trial with 52 engineers learning a new library, AI-assisted participants completed the task in roughly the same time as the control group but scored 17 percentage points lower on a follow-up comprehension quiz — 50% versus 67% [402]. Developers who used AI through passive delegation performed worst; developers who used it for conceptual inquiry did better [419]. The mode of interaction determines whether you are building understanding or eroding it.
Comprehension debt compounds through a speed asymmetry: AI generates code far faster than humans can evaluate it. As Chapter 11 explored, that converts quality gates into throughput problems. When an agent changes implementation behavior and updates hundreds of tests to match, reviewing whether those changes were necessary requires someone who understands the system’s intent.
That same speed asymmetry is also a cognitive-load problem. Developers bounce between composing prompts, waiting for output, reviewing unfamiliar diffs, and resuming the original task thread [403], [420]. Teams rarely measure that attention fragmentation directly, but they can still surface it: ask in sprint retrospectives how often developers felt they were reviewing code they could not explain, or whether agent-heavy days produced more context switching than feature progress. If the answer keeps trending up, the organization is accumulating comprehension debt even when throughput dashboards look fine.
The practical response is to track comprehension explicitly, even if the proxies are imperfect. Define a tight measurement contract: for any AI-heavy diff above a size threshold, a reviewer who is not the author must, before approval, write a one- or two-sentence summary of the change’s intent and at least one plausible failure mode it introduces or guards against. Track the share of AI-authored PRs where that contract was satisfied, and treat unexplained approvals as a defect class of their own. Pair it with two slower proxies: onboarding-to-first-fix time for a subsystem, and a question asked in every post-incident review — who on the team could have explained the failing module before production did it for them? None of these proxies is perfect, but together they expose whether your organization is still building understanding or merely accumulating generated code.
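A sketch of the contract as a checkable gate, with an illustrative PR shape and a placeholder size threshold:

```python
def satisfies_comprehension_contract(pr: dict) -> bool:
    """Approval-time check for AI-heavy diffs above a size threshold: a reviewer
    who is not the author must have written an intent summary and named at
    least one failure mode. Field names and threshold are illustrative.

    pr = {"origin": "ai-authored", "diff_lines": 420, "author": "alice",
          "reviewer_notes": [{"reviewer": "bob", "intent": "...",
                              "failure_mode": "..."}]}
    """
    SIZE_THRESHOLD = 200  # tune per team
    if pr["origin"] == "human-only" or pr["diff_lines"] < SIZE_THRESHOLD:
        return True  # contract applies only to AI-heavy diffs above the threshold
    return any(
        note["reviewer"] != pr["author"]
        and note.get("intent") and note.get("failure_mode")
        for note in pr.get("reviewer_notes", [])
    )

def contract_satisfaction_rate(ai_prs: list[dict]) -> float:
    """Share of AI-authored PRs where the contract was satisfied."""
    if not ai_prs:
        return 1.0
    return sum(satisfies_comprehension_contract(pr) for pr in ai_prs) / len(ai_prs)
```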
The uncomfortable truth is that the industry has not standardized a direct metric for comprehension debt. Until it does, teams have to be deliberate: choose interaction modes that build understanding, review AI output with the intent to learn rather than only approve, and treat every accepted diff you cannot explain as debt that will eventually come due.
22.8 The Executive Expectation Gap
Every metrics conversation in an AI-heavy organization eventually hits the same wall: leadership heard that AI makes developers 10x more productive and wants the dashboard to prove it [406]. The gap between executive expectation and practitioner reality is therefore a measurement problem, not merely a communication problem.
The expectation is predictable: more code, fewer engineers, instant ROI [413]. Those narratives come from vendor benchmarks on isolated tasks, from case studies that measure output volume instead of verified outcomes, and from a weak definition of productivity itself [414]. When a company reports that a large share of code is now AI-generated, the tempting interpretation is headcount substitution. What it often really means is that more generated code now flows into the same review and verification bottlenecks [400].
That is why the right response is to redirect the conversation toward verified output. The scorecard in Section 22.4 exists precisely for this purpose. When leadership asks whether AI is delivering 10x, the useful follow-up is: are throughput, quality, cost, and comprehension all improving, or did only the activity graph move?
The concrete move is simple. In the next leadership review, when someone asks for AI ROI, put one activity metric and one outcome metric side by side: accepted lines or token spend on the left, end-to-end time-to-value or defect origin rate on the right. Then ask whether the second moved enough to justify the first. Keep the dashboard tied to shipped outcomes. Separate activity measures from value measures. Treat volume increases without quality improvement as a warning, not a win [416]. If the organization does that consistently, the expectation gap narrows because the conversation moves from hype to operating truth.
22.9 Takeaways
- Measure AI impact with a four-axis scorecard — throughput, quality, cost, and comprehension — and treat gains on only one axis as incomplete until the others move with it.
- Do not use per-developer acceptance rate as a performance metric; use vendor telemetry dashboards for adoption and budget sizing, not as proof of individual productivity.
- Decompose lead time into coding, review, and deployment phases so DORA shows where AI is actually helping and where review has become the bottleneck.
- Track cost per verified unit of delivery — token or seat spend over review-cleared PRs or features that reached users — and judge spend by whether verified throughput rises with it, not by absolute token or seat cost.
- Define a comprehension contract for AI-heavy diffs: require any reviewer who is not the author to write a one- or two-sentence summary of the change’s intent and at least one plausible failure mode before approval, and track the share of AI-authored PRs where that contract was satisfied.
- Segment every productivity metric by task type, risk tier, developer experience, and codebase familiarity before drawing conclusions — a team-wide average that mixes a senior in a decade-old monorepo with a new hire on a fresh service is true in aggregate and false for every concrete decision.
- In the next leadership review, place one activity metric and one outcome metric side by side — accepted lines or token spend on the left, end-to-end time-to-value or defect origin rate on the right — and ask whether the second moved enough to justify the first.