14 Agents Across the SDLC
The biggest productivity gains from coding agents aren’t in code generation — they’re in everything else.
You can use the same agentic development loop outside implementation. Requirements, infrastructure, maintenance, security, and documentation still need a spec, a plan, generated work, review, verification, refinement, and an accountable handoff. The mistake is treating “SDLC work” as looser just because it is not a code diff. It needs the same gates, often stricter ones, because the output shapes the code that follows.
The argument of this chapter is not that every lifecycle phase is another place to sprinkle agents. It is that non-code artifacts become upstream inputs for other agents. When a requirements agent produces the spec an implementation agent will execute, or an infrastructure agent changes the environment a test agent depends on, the SDLC boundary has become a coordination boundary. That is why ownership lines must become team norms before Chapter 15 narrows them to concurrent file/path ownership.
The pattern is the same whether the artifact is a requirement, an infrastructure change, or a maintenance sweep: use the agent where the work can be structured, checked, and bounded. A meeting transcript can become user stories, acceptance criteria, open questions, and a first implementation plan. An existing service can be inspected for an infrastructure-as-code change; in Terraform workflows, terraform plan previews the infrastructure diff without applying it, giving you a mechanical check before anything reaches production. Dependency updates can be swept across dozens of files because tests and review gates bound the blast radius. The orchestrator’s job — as Chapter 1 established — is recognizing which parts of each lifecycle phase meet those criteria and keeping human judgment where they fail.
14.1 Requirements and Specification
The gap between a meeting and a specification is where projects lose weeks. Someone takes notes, someone else interprets them, a third person writes user stories, and by the time acceptance criteria exist, half the context has evaporated. Agents collapse this pipeline — but which pattern you choose depends on your project’s complexity and stakeholder dynamics.
Transcript-to-spec is the lightweight starting point: feed a meeting transcript directly to the agent along with codebase context, and it produces structured output immediately. The workflow is sequential and precise: meeting → AI transcript → PRD generation → ticket creation → plan.md → agent execution, with each step feeding the next [253]. ArgonDigital reports concrete time savings: AI identifies patterns in meeting transcripts to highlight actionable requirements, decisions, and stakeholder expectations, then organizes them into standard formats like user stories, use cases, and acceptance criteria [254]. You can also reverse-engineer requirements from existing solutions, having the agent analyze a competitor or legacy system and flag functional gaps [254]. This pattern is best when requirements are captured in meetings and need structured output — low risk, high mechanical leverage. The output still requires human validation and stakeholder sign-off, but the formatting drudgework disappears.
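A minimal sketch of that conversion step, assuming a hypothetical call_agent(prompt) wrapper around whatever model client the team already uses; the output fields mirror the structured artifacts described above, and the draft still goes to a human for validation:

```python
import json
from dataclasses import dataclass, field

@dataclass
class SpecDraft:
    user_stories: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    plan_steps: list[str] = field(default_factory=list)

PROMPT_TEMPLATE = """You are converting a meeting transcript into a spec draft.
Codebase context:
{context}

Transcript:
{transcript}

Return JSON with keys: user_stories, acceptance_criteria, open_questions, plan_steps.
Flag anything ambiguous as an open question instead of guessing."""

def transcript_to_spec(transcript: str, context: str, call_agent) -> SpecDraft:
    # call_agent is a placeholder for whatever model client the team uses;
    # it takes a prompt string and returns the model's text response.
    raw = call_agent(PROMPT_TEMPLATE.format(context=context, transcript=transcript))
    data = json.loads(raw)  # fails loudly if the agent ignored the output format
    return SpecDraft(**{k: data.get(k, []) for k in SpecDraft.__dataclass_fields__})
```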
Prototype-as-discovery replaces early requirements discussion with a working artifact; the prototype is a discovery tool, not the execution contract. When stakeholders can click an interface instead of imagining one, feedback loops shorten dramatically [255]. Code and Theory reduced web app development timelines by 50–75% with this pattern, using the same prompting interface across cross-functional teams [255]. Use it when stakeholder alignment is the bottleneck. Before planning or implementation starts, convert the prototype into a human-approved execution packet: user-visible behavior, acceptance criteria, constraints, open questions, non-goals, and sign-off. The prototype can discover intent; it should not silently become the spec an implementation agent executes.
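A minimal sketch of that packet as a checked artifact rather than a prose convention; the field names come directly from the list above, and the point is the sign-off gate:

```python
from dataclasses import dataclass

@dataclass
class ExecutionPacket:
    """The human-approved contract an implementation agent executes from."""
    behavior: str                     # user-visible behavior, in plain language
    acceptance_criteria: list[str]
    constraints: list[str]
    open_questions: list[str]
    non_goals: list[str]
    signed_off_by: str | None = None  # stays empty until a named human approves

    def ready_for_execution(self) -> bool:
        # The prototype can inform every field above, but nothing executes
        # until open questions are resolved and sign-off is recorded.
        return not self.open_questions and self.signed_off_by is not None
```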
Spec-driven planning is the heavyweight approach with the highest downstream reuse. AI-native engineers spend less time writing code and more time writing specifications and implementation plans, and those outputs have follow-on value for PRDs, documentation, deployment manifests, and sales materials [256]. A living prototype can feed downstream artifacts such as acceptance criteria, leadership decks, and marketing briefs as the product changes [257], but execution still needs the approved packet above. Choose this when downstream reuse justifies the extra discipline.
The progression runs lightweight to heavyweight: transcript-to-spec for fast structured capture, prototype-as-discovery when alignment matters most, spec-driven planning when execution and reuse need a durable packet. In all three, the agent eliminates mechanical translation; your job is ensuring the intent is right before another agent treats it as input. The same rule applies when the starting point is a tracker ticket: reproduction steps, acceptance criteria, and pointers to relevant files turn an issue into usable input rather than vague aspiration.
14.2 Architecture and Design Exploration
Architecture exploration is where the unifying model shows its limits most clearly. The work is neither structurally repetitive nor mechanically verifiable — it’s creative, exploratory, and the blast radius of a bad decision is the entire system. Agents are excellent exploration partners and terrible decision-makers. The distinction matters.
For codebase exploration — the repetitive, bounded part — agents are transformative. Ask them how your signed cookies are set and read, or how your application uses subprocesses and threads, or which aspects of your JSON API aren’t yet covered by your documentation — and modern reasoning models will follow codepaths through dozens of files to deliver a detailed, actionable answer in minutes [114]. These explorations are excellent candidates for parallel execution: fire off multiple agents to investigate different subsystems simultaneously, since the results don’t need integration [114]. Cross-repository exploration requires additional tooling — workspace-focused agents cannot automatically discover or search across enterprise repositories, making deterministic code search essential for questions like “where is this API called across all our services?” [258].
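A sketch of the parallel-exploration pattern, assuming a hypothetical run_agent(question) helper that runs one read-only exploration and returns the written answer:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(question: str) -> str:
    # Hypothetical helper: run one read-only exploration prompt against the
    # codebase and return the agent's written answer.
    raise NotImplementedError("wire this to your agent CLI or API")

QUESTIONS = [
    "How are signed cookies set and read in this application?",
    "Where do we spawn subprocesses or threads, and why?",
    "Which parts of the JSON API have no documentation coverage?",
]

def explore_in_parallel(questions: list[str]) -> dict[str, str]:
    # Safe to parallelize: each question is independent and nothing is edited,
    # so the answers never need integration.
    with ThreadPoolExecutor(max_workers=len(questions)) as pool:
        answers = pool.map(run_agent, questions)
    return dict(zip(questions, answers))
```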
Agents also serve well as architectural rubber ducks. During foundation building — system design, schema definition, architecture — you don’t want the agent writing code, but looping it in as a sounding board helps you see mistakes, even if you don’t need or trust the answers [259]. Agent-generated plans are useful starting points but should not be taken as truth — ordering and time estimates are often wrong [1].
14.3 Repository Archaeology: Separating Current Behavior from Historical Intent
A sharper variant of codebase exploration is repository archaeology: reconstructing why the code looks the way it does, not just what it currently does. Consider a team picking up an unfamiliar billing service whose apply_discount function has a strange early-return for negative balances. The static behavior is easy to read. The intent is buried.
A useful operator loop is to scope the agent at the file or function, then point it at the historical record explicitly: walk the git history for that file, read git blame for the suspicious lines, follow the merge commits back to their PR threads, pull any ADRs the PR descriptions reference, and surface incident notes that mention the same code path. The agent is good at this because it is pattern-matching across heterogeneous text — commit messages, review comments, postmortems — and stitching a timeline.
What you should ask for is a structured separation: a “current behavior” section grounded in the code as it stands today, and a “historical intent” section grounded in the artifacts. Keeping these distinct prevents the agent’s most common archaeology failure, which is letting an old comment or a stale ADR override what the code actually does now.
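One way to scope that loop mechanically is to gather the historical record yourself and hand it to the agent with the two-section instruction attached; a sketch using standard git commands, taking the file path and line range as inputs:

```python
import subprocess

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def gather_archaeology(path: str, start: int, end: int) -> dict[str, str]:
    # Collect the historical record for one suspicious span, to be handed to
    # the agent with an explicit request for two separate sections:
    # "current behavior" from the code as it stands, "historical intent"
    # grounded only in these artifacts.
    return {
        "current_code": run(["git", "show", f"HEAD:{path}"]),
        "file_history": run(["git", "log", "--follow", "--oneline", "--", path]),
        "blame": run(["git", "blame", "-L", f"{start},{end}", "--", path]),
        # Merge commits usually carry the PR number that leads to the review thread.
        "merge_history": run(["git", "log", "--merges", "--oneline", "--", path]),
    }
```

PR threads, ADRs, and incident notes still have to be pulled from their own systems; the point is that the agent receives the record explicitly rather than inferring it.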
The human’s job in this loop is to correct misleading historical signals. The agent might confidently report that the negative-balance branch exists “to support refund reversals,” because that’s what the original PR thread claimed and what the ADR codified. The maintainer who lived through the on-call rotation knows that the refund-reversal feature was rolled back six months later, the branch survived only because a downstream incident note attached new behavior to it, and the ADR was never updated. None of that contradiction is visible to the agent unless a human points at it. Treat the agent’s archaeology as a draft timeline; treat the experienced human as the editor who marks which entries are still true.
Where agents fail is abstraction discovery. Programming involves two deeply interwoven activities: discovering and stabilizing abstractions (creative, exploratory) and applying stable abstractions (mechanical, repeatable) [222]. During discovery, using agents in chat/interactive mode to generate alternatives for testing and comparison is more valuable than agent mode aiming for correct first-time generation [222]. The cautionary tale is concrete: the first AI-assisted prototype of syntaqlite was proof-of-concept-viable but architecturally incoherent, because cheap implementation tempted the developer to defer design decisions repeatedly. The second iteration, built with explicit upfront design and more human decision-making, produced a robust, maintainable library [260]. Agents make tactical implementation cheap, which tempts you to defer strategic decisions. Resist the deferral.
For preserving architectural decisions across sessions, lightweight feature documents outperform both formal ADRs and session memory while the work is still active. A 50-line feature document carries decision context that thousands of lines of implementation code cannot express, while consuming a fraction of the token budget [100]. ADRs solve a different problem: they are the durable archival record once a decision has stabilized. Feature documents serve the active session; once the decision hardens, the agent can turn that feature document into a durable ADR by scanning the final implementation against a template [58].
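A sketch of that hardening step, reusing the hypothetical call_agent wrapper from earlier and one common ADR template (Title, Status, Context, Decision, Consequences); the section names are an assumption, not a mandated format:

```python
ADR_PROMPT = """Convert the feature document below into an ADR draft using the
sections Title, Status, Context, Decision, Consequences. Ground every claim in
the feature document or the final diff; where they disagree, prefer the code
and flag the discrepancy instead of papering over it.

Feature document:
{feature_doc}

Summary of the final implementation diff:
{diff_summary}
"""

def draft_adr(feature_doc: str, diff_summary: str, call_agent) -> str:
    # The agent drafts; a human confirms the record before it is archived.
    return call_agent(ADR_PROMPT.format(feature_doc=feature_doc, diff_summary=diff_summary))
```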
14.4 Infrastructure and Deployment
Infrastructure-as-code is where agents deliver some of their most dramatic productivity gains — and the unifying model explains why. IaC is structurally repetitive (resource blocks follow patterns), mechanically verifiable (plans succeed or fail), and bounded in blast radius (when you review before applying). This combination makes it a near-ideal domain for agent-assisted work.
The effective workflow for Terraform follows a generate → validate → deploy pattern with explicit gates between each phase [248]. Provide your Terraform modules, state structure documentation, and provider configuration upfront — this context reduces agent generation errors and speeds up IaC scaffolding significantly [261]. Multi-stage validation is critical: combining terraform fmt, terraform validate, tflint, and duplicate resource detection catches both syntax and architectural errors before deployment [248]. This validation cascade matters more for AI-generated IaC than for human-written code because agents tend to generate plausible-but-subtly-wrong configurations — the same “almost right” problem discussed in the review chapter, but with production infrastructure at stake.
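A minimal sketch of that cascade as a pre-deployment gate, assuming terraform and tflint are on the PATH; duplicate resource detection is project-specific and omitted here:

```python
import subprocess
import sys

# Each stage must pass before the next runs; nothing here applies changes.
STAGES = [
    ["terraform", "init", "-backend=false"],   # needed so `validate` can resolve providers
    ["terraform", "fmt", "-check", "-recursive"],
    ["terraform", "validate"],
    ["tflint"],
]

def validate_module(module_dir: str) -> bool:
    for cmd in STAGES:
        if subprocess.run(cmd, cwd=module_dir).returncode != 0:
            print(f"validation failed at: {' '.join(cmd)}", file=sys.stderr)
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if validate_module(sys.argv[1] if len(sys.argv) > 1 else ".") else 1)
```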
For large-scale infrastructure maintenance, agents can run autonomously with impressive results. Automated Terraform module discovery — where an agent analyzes modules against AWS provider updates, identifies gaps, and generates PRs with fixes — keeps infrastructure-as-code current without maintainer intervention [262]. One such system discovered a validation bug, preventing user-facing issues before they occurred [262]. Runtime upgrade automation similarly transforms multi-hour manual work into agent-driven operations, supporting batch processing across hundreds of functions with human review before execution [263].
The infrastructure tooling ecosystem is converging on the same durable rule: prefer tools that expose a plan, diff, or validation surface before execution. MCP servers for documentation search and local template validation [264], drift-aware change sets that prevent unintended overwrites [265], and deployment platforms that let agents inherit user permissions automatically [266] are useful not because they are “AI-native,” but because they keep the agent inside a reviewable surface. That is also where issue-tracker-native agents fit: the issue becomes the planning packet, the diff or deployment plan becomes the verification packet, and the human approval stays at the boundary that matters. For how these tools connect to broader retrieval and context strategies, see the connectors and automation chapter; for the security implications of agent permissions in infrastructure workflows, see the bounded autonomy chapter. Without domain-specific skills, agents produce plausible code based on statistical patterns rather than engineering judgment — skills encode expert knowledge once and load on-demand, shifting the baseline from plausible-but-wrong to expert practices applied automatically [267].
14.5 Debugging and Root Cause Analysis
Debugging meets the unifying criteria unevenly. Stack trace analysis and log correlation are structurally repetitive and verifiable — agents compress these dramatically. Novel system-level issues involving subtle interactions between components are neither repetitive nor bounded, making agents useful assistants but not autonomous problem-solvers.
The concrete numbers tell the story. Anthropic’s Security Engineering team reduced production incident diagnosis time by 3x: from 10–15 minutes of manual stack trace scanning to approximately 3–5 minutes using Claude Code [268]. Kubernetes cluster troubleshooting via agents saved 20 minutes during a pod scheduling outage by reading dashboard screenshots and navigating the UI programmatically [268]. These aren’t revolutionary time savings per incident, but they compound across dozens of incidents per week.
Agents with access to platform infrastructure go further. Vercel Agent performs correlation analysis across multiple metric streams — deployment timing, traffic patterns, error rates — to identify root causes in seconds rather than hours of manual investigation [269]. Zero-configuration anomaly detection automatically monitors function duration, data transfer, and 5xx errors, and the agent can assess whether an issue has already self-resolved, allowing teams to deprioritize false alarms [269]. The agent-generated remediation recommendations are application-specific, tied to detected root cause and architecture — not generic playbook suggestions [269].
For code-level debugging, agents excel at generating fix hypotheses. Give them a failing test, a stack trace, and access to the codebase, and they can trace the execution path, identify likely failure points, and propose targeted fixes. The most reliable operator loop is short and explicit: provide the failing test or log slice, state the expected behavior, ask for a ranked hypothesis tree, have the agent instrument or isolate one path at a time, then verify each hypothesis independently before accepting a patch. The agent-as-debugger pattern works best when you can describe the expected behavior clearly and point the agent at the relevant code. It works worst when the bug is conceptual — a flawed assumption embedded in the architecture that no single code path reveals. The critical limitation: agents debugging their own output face inherent context constraints, bounded by the same assumptions that produced the bug — which is why independent verification through tests and human review remains essential (see the review chapter).
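A sketch of the verification half of that loop, assuming the agent's ranked hypotheses arrive as candidate unified diffs and the failing test is a pytest node id; each candidate is applied, checked, and rolled back so no hypothesis contaminates the next:

```python
import subprocess

def test_passes(test_id: str) -> bool:
    # The mechanical check: the failing test named in the prompt, nothing else.
    return subprocess.run(["pytest", test_id, "-x", "-q"]).returncode == 0

def verify_hypotheses(test_id: str, patches: list[str]) -> list[bool]:
    # `patches` are candidate unified diffs, one per ranked hypothesis.
    # Assumes each diff touches only tracked files, so checkout restores the tree.
    results = []
    for diff in patches:
        subprocess.run(["git", "apply", "-"], input=diff, text=True, check=True)
        results.append(test_passes(test_id))
        subprocess.run(["git", "checkout", "--", "."], check=True)  # discard the candidate
    return results
```

A passing candidate still goes to human review; the loop only tells you which hypotheses survive the mechanical check, not which fix is conceptually right.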
14.6 Security Triage and Remediation
Security work fits the lifecycle model best when it obeys the same governing rule as the rest of the chapter: repetitive detection, mechanically checkable validation, bounded rollout. The repeatable loop is: detect a suspicious pattern, prioritize it, draft the remediation plan, patch, and verify. Agents help at the repetitive steps. Humans still decide what risk is acceptable and whether the proposed fix actually matches the system’s constraints.
At the detection stage, agents can reason across many files and traces at once. They can follow data flow from input to output, cluster similar findings across repositories, and separate obvious false alarms from issues that deserve immediate attention [270]. That is where they outperform rule-only scanners: a pattern matcher can tell you that a sink exists, but an agent can trace whether tainted data actually reaches it through the code paths your system uses. The gain is not that the model “understands security.” It is that semantic tracing produces a better queue for human judgment than raw scanner volume.
Triage is where observability matters. Deployment timing, error-rate spikes, and log anomalies tell you which vulnerability or misconfiguration is actually active, which environments are affected, and what should be investigated first. In this chapter, observability serves only that narrowing function. The agent correlates the signals, summarizes likely impact, and hands a sharper incident packet to the human responder. The deeper root-cause investigation still belongs to the debugging workflow described earlier in this chapter.
14.7 A Worked Example: From SSRF Finding to Verified Patch
Consider an SSRF advisory landing against a fetch_remote_avatar helper in a user-profile service. The detect-to-verify loop runs as follows.
What the agent inspects. The agent pulls the scanner finding, reads fetch_remote_avatar and every caller, and traces the data flow from the inbound avatar_url request parameter through the URL parser into the outbound HTTP client. It cross-references the deployment timeline and the error logs to confirm the path is reachable in production and not behind a feature flag, and it checks adjacent helpers for the same shape of bug.
What the agent is allowed to draft. A ranked finding packet — exploitability, blast radius, affected environments — and a proposed remediation order: gate the outbound client behind an allowlist resolver, reject link-local and metadata-service ranges at the parser, and add regression tests that exercise the rejected ranges. The agent drafts the patch and the test cases against an existing security-helper module. It does not silently change network egress policy, rotate credentials, or land the change.
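A sketch of the kind of check the agent might draft here, using only the standard library; ALLOWED_HOSTS and the function name are hypothetical, and the reviewer in the next step still decides whether this is the right boundary:

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_HOSTS = {"avatars.example-cdn.com"}   # hypothetical allowlist for this service

def resolve_and_check(url: str) -> str:
    # Callers fetch only what this function returns; anything suspect raises.
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError("unsupported or malformed avatar URL")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError("host not on the avatar allowlist")
    for info in socket.getaddrinfo(parsed.hostname, None):
        addr = ipaddress.ip_address(info[4][0])
        # link-local covers the cloud metadata range (169.254.169.254);
        # private and loopback cover internal services.
        if addr.is_private or addr.is_link_local or addr.is_loopback:
            raise ValueError("resolved address is not publicly routable")
    return url
```

The regression tests drafted alongside it would assert that metadata-service, link-local, and private-range URLs raise. Note that the sketch resolves the host at check time but does not pin that address into the outbound client; whether that gap matters is exactly the boundary question the human reviewer owns below.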
What the human decides. The on-call security reviewer decides whether allowlisting at the resolver is the right boundary for this codebase, whether the rollout tolerates a synchronous DNS check on the hot path, and whether other callers with the same shape should be batched into the same fix or staged separately. The reviewer also confirms that the agent’s “historical intent” reading — that the helper was originally designed for an internal-only use case — does not justify weakening the new check.
How scanners, tests, and environment checks close the loop. Once the patch lands behind a flag, the SAST and dependency scanners are rerun against the diff, the new regression tests are added to the security suite, and an environment check confirms that the staging egress policy and the application-level allowlist agree. The fix is only declared complete when the scanner reports clean, the new tests pass on every supported runtime, and the staging environment shows no residual outbound calls to the previously vulnerable ranges.
From there the operator pattern generalizes: ask the agent to trace the vulnerable data flow, rank the findings by exploitability and blast radius, then draft the remediation order rather than the final patch in isolation. The human decides patch order, rollback tolerance, and whether the proposed fix respects the codebase’s security boundaries. Verification closes the loop: rerun scanners, tests, and environment checks before declaring the response complete.
The defensive framework for permissions, autonomy, and production access belongs in the bounded autonomy and permissions chapter. This section is narrower: security as an SDLC activity with a concrete detect → prioritize → patch → verify loop.
14.8 Batch Maintenance and Safe Sweep Changes
Large-scale maintenance work becomes high-leverage agent work when the target transformation is precisely stated and independently verifiable. The repeatable pattern is not just “let the agent edit many files.” It is stage the maintenance change, verify each stage, and keep the review surface small enough that humans can still judge the result.
The best-evidenced version of this pattern is batch, verifier-backed maintenance. When the same remediation must land in many places, agents can handle the repetitive sweep while humans own the gate. Sourcegraph’s cross-repository React2Shell remediation is one example: identify every vulnerable pattern, generate reviewable fixes, and validate the result before merge [251]. AWS Lambda runtime upgrade automation shows the same shape on infrastructure-adjacent code: detect the upgrade target, generate the repetitive changes, and keep human review at the execution boundary [263].
That gives a narrower but better-supported rule than “agents are good at refactoring in general.” Agents are strongest at maintenance operations where the transformation class is known ahead of time, the verifier is external, and the rollout can happen in bounded waves. Define the change precisely, run the agent on one slice, verify the slice, then expand. The human reviewer owns two questions at every wave: did the transformation preserve behavior, and did the agent invent anything beyond the requested maintenance operation?
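A sketch of the wave driver, assuming a hypothetical apply_fix(paths) that invokes the agent on one slice and a test suite as the external verifier:

```python
import subprocess

def run_tests() -> bool:
    # External verifier: the agent never judges its own sweep.
    return subprocess.run(["pytest", "-q"]).returncode == 0

def sweep_in_waves(files: list[str], apply_fix, wave_size: int = 10) -> None:
    # Apply a precisely defined maintenance change in bounded waves. Each wave
    # must pass the verifier and human review before the next one starts;
    # a failing wave stops the rollout.
    for i in range(0, len(files), wave_size):
        wave = files[i:i + wave_size]
        apply_fix(wave)
        if not run_tests():
            raise RuntimeError(f"wave starting at {wave[0]} failed verification; stop and review")
        print(f"wave {i // wave_size + 1} verified; awaiting human review before expanding")
```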
Without tests or other mechanical verification (see Chapter 12), even well-bounded maintenance becomes a fast way to spread uncertainty. Invest in the verification harness first, then let the agent handle the repetition.
14.9 Documentation and Execution Byproducts
The cleanest final insight in this chapter is that some lifecycle work should be generated as a byproduct of execution rather than recreated afterward. Documentation is the clearest example. Release notes, changelog entries, API references, ADR drafts, and annotated screenshots are all mechanical translations from work the system already knows about [63], [58]. The agent drafts them while context is still fresh; a human owner verifies whether the artifact reflects the actual decision, behavior, or public promise.
The same logic applies to execution-facing artifacts that teams usually assemble after the fact. A release packet, a review bundle that gathers the relevant screenshots and decisions, a draft risk summary, or an ADR shell that already includes the changed surface area all fit this pattern because they summarize facts the system already has. A weekly status packet built from merged PRs, incident summaries, and issue state is another good example: the agent can collect and summarize facts quickly, while a human manager still owns commitments, trade-offs, and tone. The leverage comes from timing: generate the artifact while the code, discussion, and screenshots are still live in context, then let a human owner correct the final framing.
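A sketch of the release-notes byproduct, assuming the GitHub CLI is available and reusing the hypothetical call_agent wrapper; the agent groups facts the system already has, and a human still edits the framing:

```python
import json
import subprocess

def merged_prs(limit: int = 30) -> list[dict]:
    # Facts the system already has: recently merged PRs from the GitHub CLI.
    out = subprocess.run(
        ["gh", "pr", "list", "--state", "merged", "--limit", str(limit),
         "--json", "number,title,author"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

def draft_release_notes(prs: list[dict], call_agent) -> str:
    # Draft, don't publish: a human owner edits framing and public promises.
    bullet_list = "\n".join(f"- #{p['number']} {p['title']}" for p in prs)
    return call_agent(
        "Group these merged changes into user-facing release notes. "
        "Do not promise anything not evidenced by the list.\n" + bullet_list
    )
```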
Project-management automation follows the same rule. One team runs a weekly ClickUp-style sprint-planning agent that scores open tasks, proposes the next sprint slice, and forces every item to carry documented success criteria and expected impact before it can be pulled forward [271]. Another durable pattern is to let the prototype or implementation thread act as the single source of truth, then derive stakeholder updates, release packets, or sales-facing briefs from that living artifact instead of maintaining parallel prose by hand [257].
That is the governing boundary. Let agents draft artifacts that reflect work already done or constraints already known. Keep humans responsible for final framing, public promises, and any commitment that extends beyond the evidence already in hand. When teams follow that line, documentation stops being neglected cleanup work and becomes part of the normal execution flow.
14.10 Conclusion
The lifecycle-wide pattern is consistent: agents create leverage where work is structurally repetitive, mechanically verifiable, and bounded in blast radius. Transcript-to-spec conversion, Terraform generation, security scanning, batch dependency upgrades, and release note generation all meet these criteria cleanly. Architecture decisions, novel debugging, and abstraction discovery do not — and treating them as if they do produces the kind of deferred-design debt and silent correctness bugs that practitioners warn about. The orchestrator’s skill is not knowing how to use agents in each phase, but knowing where the boundary falls in each phase — and adjusting the workflow accordingly. Teams have to institutionalize those boundaries, not leave them to individual instinct: who approves infrastructure changes, who owns the remediation queue, who decides whether a requirements artifact is stable enough to execute. The SDLC patterns only scale when those ownership lines become team norms rather than personal habits.
14.11 Takeaways
- Let agents drive work only when it is repetitive, mechanically verifiable, and bounded in blast radius; for architecture decisions, abstraction discovery, and novel debugging, keep agents in an exploration role rather than an autonomous execution role.
- When a prototype is driving requirements, convert it into a human-approved execution packet — user-visible behavior, acceptance criteria, constraints, open questions, non-goals, and sign-off — before any implementation agent executes from it; the prototype discovers intent but must not silently become the spec.
- When using an agent for repository archaeology, explicitly ask for a structured separation: a “current behavior” section grounded in the code as it stands today and a “historical intent” section grounded in commit messages, PR threads, and ADRs — and then correct any historical signals that no longer match the code.
- For agent-generated Terraform, follow a generate → validate → deploy pattern with a multi-stage validation cascade — terraform fmt, terraform validate, tflint, and duplicate resource detection — before any deployment, and provide Terraform modules, state structure documentation, and provider configuration upfront to reduce generation errors.
- When using an agent to debug, keep the operator loop short and explicit: provide the failing test or log slice, state the expected behavior, ask for a ranked hypothesis tree, have the agent instrument or isolate one path at a time, then verify each hypothesis independently before accepting a patch.
- For security triage, have the agent trace the vulnerable data flow, rank exploitability and blast radius, and propose remediation order; it may draft a provisional patch and tests, but humans own sequencing, rollout tolerance, boundary decisions, and final verification.
- For batch maintenance sweeps, run the agent on one slice first, verify that slice, then expand — and at each wave ask two explicit questions: did the transformation preserve behavior, and did the agent invent anything beyond the requested maintenance operation?
- Let agents draft release notes, ADR shells, and similar artifacts from work already done or constraints already known, then have a human owner verify the final framing, public promises, and any commitment that goes beyond the evidence in hand.