13 Source Control and Release Discipline

Agents win in domains where diffs already exist, and they lose where they don’t — every other release-discipline rule in this chapter follows from that one fact.

13.1 Reviewable Diffs and Commit Hygiene

The testing gates from Chapter 12 become release controls here. A test result is only useful to the release process when it is attached to a diff, branch, PR, or deployment decision that someone can accept or reject. This chapter starts where verification leaves off: once the checks say what is true about the change, source control decides what can move forward, what must stay isolated, and what can be rolled back.

Coding agents work better than agents in any other domain not because models are smarter at code, but because code lives inside Git, where almost any local mistake can be undone with git reset or git revert and every change is visible as a diff [237], [238]. Watch agent-assisted slide editors, spreadsheet copilots, or CRM “AI assistants” stumble where Cursor and Claude Code thrive, and the difference isn’t model quality; it’s the diff infrastructure underneath. Karpathy’s framing is the cleanest: agents need diffs [238]. The practical consequence for you is that source control is not paperwork around agent work; it is the substrate that makes agent work safe. If you find yourself fighting your agent’s commit habits, fix them at the root; don’t paper over them in review.

This is why the “chat on the side” pattern keeps re-emerging — Cursor, v0, Replit, Lovable — and why the ones that touch persistent state need a versioned backend underneath [239]. The cautionary case is Gemini deleting an entire Google Sheet with no undo path; the working case is Cursor letting you reject a hunk because the agent’s edit is staged in your working tree, not committed to main [239]. The same principle extends past code: DoltCash’s accounting agent gets to touch the books only because every write lands on a Dolt branch with a run ID, and only after a human reviews the diff does it merge into the system of record [240]. Treat your agent’s commit as a proposal, not a fact, and treat the merge as the moment a human takes accountability for it [237].

The hygiene problem starts with how the agent commits. Aider’s defaults are the right reference: it commits each change with a descriptive message, commits any pre-existing dirty files first to keep your work separate from its work, marks both the author and committer fields with (aider), and can prefix messages with aider: so git log shows exactly which lines came from the model [241]. The pattern to copy, regardless of which agent you actually use:

auto-commits          → on (one logical change per commit)
dirty-commits         → on (commit human edits separately, first)
attribute-author      → on
attribute-committer   → on
commit-message-prefix → "agent: " (or equivalent)
commit-prompt         → Conventional Commits style

The point isn’t aider specifically. The point is that an agent that lumps human edits and AI edits into a single commit destroys your ability to ever audit “what did the model actually change here.” If your agent doesn’t separate authorship by default, configure it to. When something breaks three weeks from now, that attribution is the difference between a five-minute git log --author and an afternoon of forensic guesswork.
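To make the audit payoff concrete, here is a minimal sketch that splits a log export into agent and human commits. It assumes the log was exported with something like git log --format='%h|%an|%s', and that agent commits carry either an (aider) suffix on the author name or an "agent: " subject prefix, as described above. The function names and the sample commits are hypothetical.

```python
from typing import NamedTuple


class Commit(NamedTuple):
    sha: str
    author: str
    subject: str


def parse_log(text: str) -> list:
    """Parse lines shaped like 'sha|author name|subject'."""
    commits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # maxsplit=2 keeps any '|' inside the subject intact
        sha, author, subject = line.split("|", 2)
        commits.append(Commit(sha.strip(), author.strip(), subject.strip()))
    return commits


def is_agent_commit(c: Commit, name_suffix="(aider)", subject_prefix="agent: ") -> bool:
    """Assumed convention: agent marks either the author name or the subject."""
    return c.author.endswith(name_suffix) or c.subject.startswith(subject_prefix)


def split_by_authorship(commits):
    agent = [c for c in commits if is_agent_commit(c)]
    human = [c for c in commits if not is_agent_commit(c)]
    return agent, human


# Hypothetical sample data, not taken from a real repository.
SAMPLE_LOG = """\
a1b2c3d|Dana Lee (aider)|fix: reject expired session tokens at middleware
e4f5a6b|Dana Lee|docs: update onboarding notes
c7d8e9f|Dana Lee|agent: bump pyjwt to 2.10.1
"""

agent_commits, human_commits = split_by_authorship(parse_log(SAMPLE_LOG))
print("agent:", [c.sha for c in agent_commits])
print("human:", [c.sha for c in human_commits])
```

If human and agent edits had been lumped into one commit, no filter of this kind could separate them after the fact.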

Commit size is the other half. The failure mode you actually see in practice: you tell an agent “fix the auth bug,” it touches eleven files across three subsystems, and you stare at a 600-line diff that no human can review honestly. The fix is to keep agent work batched into small reviewable units. Ben Houston’s “GitHub Mode” workflow runs agents in roughly 30-minute task chunks tied to GitHub issues, each producing a focused PR that you review asynchronously instead of supervising live; that constraint is the unlock for his measured 30 → 160 commits/day jump, because every commit stays small enough to actually merge with confidence [242]. Multi-agent orchestrators apply the same constraint structurally — Claude Code’s Agent Teams give each teammate its own context window, and tools like Gas Town and Multiclaude give each agent its own worktree and branch, so their diffs don’t intermix [243]. One issue, one branch, one PR, one logical change. When the agent strays into a multi-purpose blob, scope it back; don’t let “while I was in there” creep into a release.

A concrete contrast helps. The wrong shape:

* abc123  fix: auth bug + refactor user model + update deps + reformat templates
  37 files changed, 612 insertions(+), 489 deletions(-)

The right shape, same scope of work:

* feat: extract User.from_session helper (agent: claude)
* fix: reject expired session tokens at middleware (agent: claude)
* test: add coverage for expired-token path (agent: claude)
* chore: bump pyjwt to 2.10.1 (agent: claude)

Every commit in the second shape is independently reviewable, independently revertable, and tells a story when read in order. The first shape is unreviewable, no matter how good the agent’s diff is.
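One way to keep the first shape out of a release is a pre-review check on commit size and subject. The sketch below is illustrative only: the thresholds (10 files, 300 changed lines) and the " + " heuristic for bundled subjects are assumptions you would tune for your own repository, and the stats for the focused commit are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class CommitStat:
    subject: str
    files_changed: int
    lines_changed: int  # insertions + deletions


def review_problems(c: CommitStat, max_files: int = 10, max_lines: int = 300) -> list:
    """Return the reasons a commit is too large or too mixed to review honestly."""
    problems = []
    if " + " in c.subject:
        problems.append("subject bundles several changes")
    if c.files_changed > max_files:
        problems.append(f"{c.files_changed} files changed (limit {max_files})")
    if c.lines_changed > max_lines:
        problems.append(f"{c.lines_changed} lines changed (limit {max_lines})")
    return problems


# The multi-purpose blob from the text: 612 insertions plus 489 deletions.
blob = CommitStat(
    "fix: auth bug + refactor user model + update deps + reformat templates",
    files_changed=37,
    lines_changed=612 + 489,
)
# Hypothetical stats for one of the focused commits.
focused = CommitStat("fix: reject expired session tokens at middleware", 2, 40)

for commit in (blob, focused):
    print(commit.subject[:40], "->", review_problems(commit) or "ok")
```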

Use the agent for the implementation, but write the PR description yourself. Simon Willison’s CSRF refactor on Datasette ran across ten agent-driven commits, with Claude Code doing the work and GPT-5 cross-reviewing — and Willison hand-wrote the PR description anyway, both to keep it concise and to keep himself honest about what actually changed [244]. The PR description is the moment you reconstruct the change in your own head. If you outsource that, you are merging code you don’t understand. Let the agent draft the commits and a first-pass description; rewrite the description before you click “Ready for review.” A good agent-era PR description names the discarded approaches as well as the chosen one — partly so the next reviewer doesn’t ask “did you consider X,” partly so the next agent session doesn’t retry the same dead ends, in the same spirit as Anthropic’s CHANGELOG.md progress files for long-running work [245].

Two operator habits make all of this stick. First, checkpoint aggressively. Replit’s snapshot engine builds copy-on-write filesystem forks so an agent can run, fail, and roll back in constant time [246]; when you don’t have that, your local equivalent is a feature branch plus disciplined git stash, git worktree, and frequent commits. Second, treat the /undo command as load-bearing. Aider exposes it directly because the assumption is that you will need it [241]. Build the same reflex even with agents that don’t ship it: when the diff feels off, revert immediately and re-prompt. Trying to “fix forward” from a bad agent commit is how multi-purpose blobs get born.

13.2 CI Interaction and Release Gates

Once an agent’s work leaves your laptop, your CI pipeline is the only thing standing between speculative output and production. The default failure mode is the agent treating CI as an inconvenience to route around: disabling tests, marking flaky cases as skipped, or pushing a “fix CI” commit that changes the assertion instead of the code. The discipline is to wire agents into pipelines as participants, not as merge-bots that bypass them. Chapter 11 established the principle for human review; the CI-side equivalent is that no autonomous loop closes on the agent’s self-assessment that “the build passed.”

The high-leverage pattern is comment-triggered GitHub Actions. Houston’s MyCoder demonstrates the canonical shape: a /agent comment on an issue spins up an Action that runs the agent against the repo, opens a PR, and waits — and the same comment thread is where you refine the implementation plan mid-flight, without re-triggering the workflow [247], [242]. What matters is what the workflow does not do: it doesn’t merge, it doesn’t bypass branch protection, it doesn’t auto-approve. The agent opens the PR, the existing CI checks run against it, and a human still merges. That is how agents iterate to a green build without you watching, while the review gate stays where it belongs.

Validation gates inside CI matter more than they did before, because agents will happily produce code that compiles and lints clean while being architecturally wrong. The durable pattern is Generate, then Validate, then only let the release path continue: formatter, type checker, linter, and one domain-specific check (schema validation, dependency audit, security scan, infrastructure validation) should each fail independently before any agent-authored change is treated as merge-ready [248]. Auto-approval after passing all gates is fine; auto-approval that bypasses any of them is how generated code reaches production [248]. For deeper verifier and oracle design — what makes a check externally trustworthy rather than just present — see Chapter 12; here the rule is only that release promotion never closes on the agent’s own grade.
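The "each gate fails independently" rule can be sketched as a small runner that executes every gate even after one has failed, so the report shows all problems at once. This is a minimal sketch under stated assumptions: the gate callables here are stand-ins that return canned results, where a real pipeline would invoke your formatter, type checker, linter, and domain check.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class GateResult(NamedTuple):
    name: str
    passed: bool
    detail: str


Gate = Callable[[], Tuple[bool, str]]


def run_gates(gates: Dict[str, Gate]) -> List[GateResult]:
    """Run every gate; a crash in one gate counts as a failure of that gate only."""
    results = []
    for name, gate in gates.items():
        try:
            passed, detail = gate()
        except Exception as exc:
            passed, detail = False, f"gate crashed: {exc}"
        results.append(GateResult(name, passed, detail))
    return results


def merge_ready(results: List[GateResult]) -> bool:
    # An empty gate list is never merge-ready: "no checks ran" is not a pass.
    return bool(results) and all(r.passed for r in results)


# Stand-in gates with canned outcomes (hypothetical).
gates = {
    "formatter": lambda: (True, "no changes needed"),
    "type checker": lambda: (True, "0 errors"),
    "linter": lambda: (False, "2 violations"),
    "schema validation": lambda: (True, "ok"),
}

results = run_gates(gates)
for r in results:
    print(f"{r.name}: {'pass' if r.passed else 'FAIL'} ({r.detail})")
print("merge-ready:", merge_ready(results))
```

Note that nothing in the runner consults the agent's own opinion of its work; the verdict comes only from the gates.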

For unattended runs, your config file is the safety contract. Claude Code’s --dangerously-skip-permissions (“YOLO” mode) is what lets agents close the loop in CI, but it is only safe inside a sandbox plus an allowedTools whitelist that restricts what the agent can do even when all permissions are bypassed [249]. The whitelist is the actual permission boundary; “YOLO” only refers to skipping interactive prompts. If allowedTools lets the agent run arbitrary Bash, you have no boundary at all — you have an unattended shell that happens to also write code.
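A cheap guard is to lint the whitelist itself before the unattended job starts. The sketch below assumes rule strings shaped like Tool or Tool(pattern), for example Bash(git diff:*); treat that syntax as an assumption and check your agent's documentation for the exact format. The function name and the sample rules are hypothetical.

```python
import re

# Assumed rule shape: ToolName or ToolName(pattern).
RULE = re.compile(r"^(?P<tool>[A-Za-z_]+)(?:\((?P<pattern>.*)\))?$")


def unbounded_shell_rules(allowed_tools, shell_tools=("Bash",)):
    """Return rules that grant a shell tool with no meaningful restriction."""
    risky = []
    for rule in allowed_tools:
        m = RULE.match(rule.strip())
        if m is None:
            risky.append(rule)  # unparseable rules are treated as unsafe
            continue
        if m["tool"] in shell_tools and m["pattern"] in (None, "", "*", ":*"):
            risky.append(rule)
    return risky


scoped = ["Read", "Edit", "Bash(git diff:*)", "Bash(npm run test:*)"]
wide_open = scoped + ["Bash"]

print(unbounded_shell_rules(scoped))     # nothing flagged
print(unbounded_shell_rules(wide_open))  # flags the bare Bash rule
```

A check like this only catches the obvious case; a pattern that permits an interpreter or a package script can still amount to an arbitrary shell, which is why the sandbox stays mandatory.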

Autonomous remediation is the most dangerous CI pattern, and the one most likely to be sold to you in 2026 marketing decks. Agents that detect a misconfigured S3 bucket and “fix it in real time” will eventually misread context and disable a control that was intentional [250]. The hard rule: autonomous agents can propose via pull request; they should not bypass branch protection to push to main. The MTTR you save by skipping review is the same MTTR you’ll spend cleaning up the day the agent silently rotates the wrong key.

13.3 Release Mechanics: From Approved PR to Tagged Artifact

A merged PR is not a release. The mechanics between “approved on main” and “running in production” are where agent-driven pipelines either earn their speed or quietly accumulate the failure modes that make rollback expensive. Treat each step below as a first-class operator move with a named owner, an audit trail, and a documented reverse.

Release branch and tag. Cut a release branch (release/2026.04) from the approved main commit; agents may open backport PRs against it but never push directly. The tag (v2026.04.0) is applied by a human or by a workflow keyed to a human-issued comment trigger, never by the implementation agent itself. Tags are immutable; if a release candidate fails, you cut a new patch tag, you do not retag. Aider’s (aider) author tag and DoltCash’s run IDs make it easy to query which commits in the release window were agent-authored, under which prompt, against which plan [241], [240] — surface that count in the release notes so reviewers know what they are blessing.

Version bump. The version bump is a separate commit on the release branch, not folded into a feature PR. Agents are good at proposing the bump (semver from changelog labels, lockfile regeneration, embedded version constants); they are bad at deciding whether a change is breaking. Require the bump PR to list every public-API delta the release contains and to cite the commits that introduced each — the same Generate/Validate split that governs feature work [248]. A failing API-diff check blocks the bump.
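The split between proposing and deciding can be sketched as two separate functions: one derives a bump level from Conventional Commits labels, the other blocks the bump when an API-diff report lists breaking deltas that the labels did not declare. This is a minimal sketch; the label conventions (a trailing "!" for breaking changes), the function names, and the sample delta are assumptions for illustration.

```python
def propose_level(labels):
    """Agent-side proposal: derive the bump level from commit labels."""
    if any(label.endswith("!") for label in labels):
        return "major"
    if "feat" in labels:
        return "minor"
    return "patch"


def bump(version, level):
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


def bump_allowed(level, breaking_api_deltas):
    """Validation side: undeclared breaking deltas block anything below major."""
    return not breaking_api_deltas or level == "major"


labels = ["fix", "chore"]
level = propose_level(labels)
print(level, "->", bump("1.4.2", level))

# Hypothetical API-diff finding that contradicts the proposed patch bump.
deltas = ["removed public method User.legacy_login"]
print("bump allowed:", bump_allowed(level, deltas))
```

The labels are only the agent's claim about the change; the API diff is the independent check, and a human still decides what to do when the two disagree.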

Changelog and release-note review. The changelog draft can be agent-generated from commit messages and PR bodies, but it must be human-edited before the tag. Agents reliably under-report user-visible behavior changes hidden inside refactors and over-report churn that doesn’t matter to operators. The reviewable artifact is a diff of CHANGELOG.md against the previous release, plus an “operator-impact” section the human writes by hand. No tag without a merged changelog PR.

Promotion from approved PR to release candidate. This is the gate that matters most. The build that ran on the PR is not the build you ship; you rebuild from the release branch tip with release flags, produce an -rc.N artifact, and run it through a release-candidate suite (smoke tests, canary metrics, integration tests against a staging dependency graph). Promotion is a deliberate move — a comment trigger like /promote rc1 on the release-tracking issue, scoped to a small group — not an automatic consequence of green CI. Sourcegraph’s Batch Changes pattern is the right shape for cross-repo promotion: a single spec describes the promotion, and a deterministic engine applies it across every service, while the LLM only generates the spec [251], [252].
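A comment trigger scoped to a small group might look like the sketch below: the command is parsed strictly, and the author is checked against an explicit set before anything runs. The handle names, the function name, and the choice to raise PermissionError are assumptions for illustration; a real workflow would take the author from the CI platform's event payload.

```python
import re

# Accepts exactly "/promote rcN" or "/reject rcN", nothing looser.
COMMAND = re.compile(r"^/(?P<action>promote|reject)\s+rc(?P<n>\d+)$")


def parse_release_command(comment, author, release_managers):
    """Return (action, rc_number), or None if the comment is not a command."""
    m = COMMAND.match(comment.strip())
    if m is None:
        return None
    if author not in release_managers:
        raise PermissionError(f"{author} may not {m['action']} release candidates")
    return m["action"], int(m["n"])


# Hypothetical handles.
managers = {"rm-alice", "rm-bob"}

print(parse_release_command("/promote rc1", "rm-alice", managers))
print(parse_release_command("looks good to me", "rm-alice", managers))
try:
    parse_release_command("/promote rc1", "implementation-agent", managers)
except PermissionError as exc:
    print("refused:", exc)
```

Green CI never appears as an input here; promotion happens only because a named person asked for it.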

Artifact promotion. Artifacts move through environments by re-tagging in the registry, not by rebuilding. Build once on the release branch, sign the artifact, push to a staging repository, and promote the same digest to production after the RC bakes. Re-building between environments lets non-deterministic inputs (dependency resolution, codegen seeds, agent-influenced toolchains) drift between what you tested and what you shipped. The signature on the artifact is also the audit anchor: the run ID that produced it, the commits it contains, and the human who promoted it should all be recoverable from the registry alone.
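The build-once rule reduces to a simple invariant: the digest that reaches production is byte-for-byte the digest that was tested. The sketch below models registries as in-memory tag-to-digest maps to show the check; the Registry class, the tag names, and the promote function are hypothetical stand-ins, not a real registry client.

```python
import hashlib


class Registry:
    """In-memory stand-in for an artifact registry: tag -> digest."""

    def __init__(self):
        self._tags = {}

    def push(self, tag, digest):
        self._tags[tag] = digest

    def resolve(self, tag):
        return self._tags[tag]


def build_once(artifact_bytes):
    """Content digest of the single build produced on the release branch."""
    return "sha256:" + hashlib.sha256(artifact_bytes).hexdigest()


def promote(staging, production, rc_tag, release_tag, tested_digest):
    """Re-tag the tested digest in production; never rebuild."""
    digest = staging.resolve(rc_tag)
    if digest != tested_digest:
        raise RuntimeError(f"{rc_tag} moved since testing; refusing to promote")
    production.push(release_tag, digest)
    return digest


staging, production = Registry(), Registry()
digest = build_once(b"release-branch build output")
staging.push("app:1.4.3-rc.1", digest)

promoted = promote(staging, production, "app:1.4.3-rc.1", "app:1.4.3", tested_digest=digest)
print(production.resolve("app:1.4.3") == digest)
```

If someone rebuilds and re-pushes the RC tag between testing and promotion, the digest comparison fails and the promotion is refused, which is the drift the paragraph above warns about.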

Revert and rollback packaging. Every release ships with its rollback rehearsed and packaged, not improvised at 2 a.m. That means three artifacts, not one: the new release, a revert PR pre-prepared against main that backs out the user-visible changes, and a rollback runbook (which artifact digest to redeploy, which migrations to reverse, which feature flags to flip). Schema changes get the heaviest treatment — forward and backward migrations both run in CI on every release branch, and a release that can’t roll its schema back is flagged as one-way before the tag is cut. DoltCash’s invariant enforcement is the model: if the rollback would violate a domain invariant (a half-applied ledger migration, an orphaned foreign key), the release system rejects the release plan, not the rollback [240].
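The "three artifacts, not one" rule is easy to enforce mechanically before the tag is cut. The following is a minimal sketch, assuming a release plan is represented as a small record; the field names, the sample PR number, runbook path, and migration names are all hypothetical, and the accept_one_way flag stands in for whatever explicit human sign-off your process uses for irreversible schema changes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Migration:
    name: str
    has_backward: bool  # a backward migration exists and ran in CI


@dataclass
class ReleasePlan:
    artifact_digest: str
    revert_pr: Optional[str] = None
    runbook: Optional[str] = None
    migrations: List[Migration] = field(default_factory=list)


def rollback_gaps(plan):
    """List the rollback artifacts that are missing from the plan."""
    gaps = []
    if not plan.revert_pr:
        gaps.append("no pre-prepared revert PR")
    if not plan.runbook:
        gaps.append("no rollback runbook")
    return gaps


def one_way_migrations(plan):
    return [m.name for m in plan.migrations if not m.has_backward]


def may_tag(plan, accept_one_way=False):
    """Tagging requires a full rollback package; one-way schemas need sign-off."""
    if rollback_gaps(plan):
        return False
    return accept_one_way or not one_way_migrations(plan)


complete = ReleasePlan("sha256:abc123", revert_pr="#123", runbook="docs/rollback.md",
                       migrations=[Migration("0041_add_index", True)])
risky = ReleasePlan("sha256:abc123", revert_pr="#123", runbook="docs/rollback.md",
                    migrations=[Migration("0042_drop_legacy_column", False)])

print(may_tag(complete), may_tag(risky), one_way_migrations(risky))
```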

13.4 Vignette: One Promote/Reject Cycle End to End

A Friday afternoon. An agent run keyed to issue #4812 (“upgrade node-tar past CVE-2026-1190”) opens PR #4901 against main: lockfile bump, two call-site adjustments where the API tightened, regenerated snapshots. CI is green; a reviewer reads the diff, notes the agent flagged one snapshot change as “behavior-preserving” and confirms it manually, and merges. So far this is ordinary feature work.

The release manager comments /cut release/2026.04.3 on the release-tracking issue. A workflow branches from the merge commit, runs the version-bump agent (patch bump, since the changelog labels are all fix: or chore:), and opens PR #4903 against release/2026.04.3 with the bumped version, a regenerated CHANGELOG.md draft, and an API-diff report showing zero public-surface deltas. The release manager edits the operator-impact section by hand — “no operator action required; node-tar CVE remediation only” — and merges.

/promote rc1 triggers the release-candidate build. The artifact is built once, signed, pushed to the staging registry as app:2026.04.3-rc.1@sha256:…, and deployed to staging. The RC suite runs: smoke tests pass, but the canary metric for tarball-extraction p99 latency regresses 35% against the previous release’s baseline. The promotion workflow halts, posts the regression to the release-tracking issue, and tags the on-call.
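The canary check in this step is plain arithmetic against a stored baseline. As a sketch under stated assumptions: the latency values below (200 ms baseline, 270 ms candidate) are invented to reproduce a 35% regression, and the 10% tolerance is a hypothetical threshold, not one the vignette specifies.

```python
def regression_pct(baseline, candidate):
    """Percentage change of candidate relative to baseline (positive = slower)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) * 100.0 / baseline


def canary_passes(baseline, candidate, max_regression_pct=10.0):
    return regression_pct(baseline, candidate) <= max_regression_pct


baseline_p99_ms = 200.0   # previous release (illustrative)
candidate_p99_ms = 270.0  # rc.1 on staging (illustrative)

change = regression_pct(baseline_p99_ms, candidate_p99_ms)
print(f"p99 change: {change:.1f}%")
print("promote" if canary_passes(baseline_p99_ms, candidate_p99_ms) else "halt and notify on-call")
```

The important design choice is that the comparison is against the previous release's recorded baseline, not against whatever the agent reports about its own change.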

The on-call has a choice: fix forward or reject. The agent is asked to investigate, edits a progress file with what it has tried, and reports the regression traces to a buffering change in the upgraded library [245]. Rather than ship a hotfix on Friday, the on-call comments /reject rc1. The workflow drops the staging tag, opens a pre-prepared revert PR (#4905) that backs PR #4901 out of the release branch only — main is untouched — and re-cuts release/2026.04.3 from the prior tag. No artifact is promoted to production; the rejected rc.1 digest stays in the staging registry, frozen, for forensic comparison.

On Monday, the agent’s progress file is the starting context for a follow-up PR that pins the upgrade to a different patch version. A fresh rc.2 is cut and the canary clears. /promote rc2 re-tags the same digest to the production registry, and the release manager runs the rollback rehearsal one more time before the deploy. The release ships. The rollback never has to run, but the rollback package sat ready, signed, and tested for the entire window. That readiness is what made /reject rc1 cheap enough to choose on a Friday.

The throughline across every step above is that agents propose and humans promote, that the artifact you ship is the artifact you tested, and that the reverse of every release is packaged before the release goes out. Source control gives you the history; release mechanics give you the discipline that turns history into something you can actually steer. This chapter builds on the verifier and oracle design from Chapter 12; here the job is done when every release has a named owner, a signed artifact, and a rollback you have already rehearsed.

13.5 Takeaways

Configure your agent to commit each logical change separately, attribute author and committer fields to the model, and prefix commit messages (e.g., ‘agent:’) so git log shows exactly which lines came from the model.
Scope each agent task to one issue, one branch, and one PR — when the agent strays into a multi-purpose blob, interrupt and re-scope it rather than letting ‘while I was in there’ changes accumulate.
Write the PR description yourself after every agent-driven implementation, naming both the chosen approach and the discarded alternatives, so you can confirm you understand what actually changed before clicking ‘Ready for review.’
Have unattended agent workflows open PRs into your normal branch-protection path rather than merge directly; let the standard checks run, and keep final merge authority with a human.
Add independent validation gates in CI — formatter, type checker, linter, and at least one domain-specific check — each failing independently before any agent-authored change is treated as merge-ready.
If you skip interactive permissions in CI, run the agent only inside a sandbox with an explicit tool whitelist; skipping prompts removes friction, not the need for a real boundary.
Build artifacts once from the release branch, sign them, and promote the same digest across environments by re-tagging in the registry rather than rebuilding — rebuilding between environments lets non-deterministic inputs drift between what you tested and what you shipped.
Before cutting a release tag, package rollback with the release: prepare a revert PR, write a rollback runbook that names the artifact digest, migrations, and feature flags to reverse, and run forward and backward schema migrations in CI before promotion.