6 Execution Modes and Switching Signals
The agent’s autonomy mode is set once and then forgotten — until it deletes the wrong file in the integration branch or interrupts you seventeen times during a twenty-line refactor. Mode mismatch is the most common workflow failure, and it always feels like a model problem when it is actually a configuration problem.
6.1 Mode Selection: When to Stay Manual, When to Assist, When to Delegate
The agent’s involvement is a knob with six stops, not a binary toggle: Manual, Assisted editing, Plan mode, Act mode, Autonomous loop, and Background queued. Each stop trades supervision for speed, and the mode-selection question is not “how much do I trust this agent?” — it is “what kind of feedback loop does this task need, and what is the cost of misalignment in that loop?”
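The six stops form an ordered ladder, and the ordering is what makes the later mode-switch signals actionable: "drop to a more supervised mode" always means one step toward the manual end. A minimal sketch of that ladder, following the chapter's own ordering (all names here are illustrative, not any tool's API):

```python
from enum import IntEnum

class Mode(IntEnum):
    """The six autonomy stops, ordered from most to least supervised."""
    MANUAL = 0
    ASSISTED = 1
    PLAN = 2
    ACT = 3
    AUTONOMOUS = 4
    BACKGROUND = 5

def step_down(mode: Mode) -> Mode:
    """'Drop back' is one step toward the supervised end, never past MANUAL."""
    return Mode(max(mode - 1, Mode.MANUAL))

def step_up(mode: Mode) -> Mode:
    """Release autonomy one stop at a time, never past BACKGROUND."""
    return Mode(min(mode + 1, Mode.BACKGROUND))
```

Treating the knob as ordered rather than binary is what lets the rest of the chapter speak of single-step corrections instead of all-or-nothing trust decisions.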
The first decision is whether to involve the agent at all. The dominant productivity unit is shifting from the file-edit loop to the specify-delegate-review loop [102], but the IDE survives as a subordinate instrument for deep inspection — particularly for multi-file refactorings in large repositories and for situations where a 90%-correct-but-broken diff is more expensive to fix than to write by hand [102]. Stay manual when the task is small enough that explaining it to an agent costs more than doing it, when the codebase uses idiosyncratic tooling that language models handle badly [103], or when verification cost dwarfs generation cost.
Assisted editing — autocomplete, multi-line completions, auto-edit — is the right default for in-flow code maintenance. You spend more time editing existing code than writing new code, and pure autocomplete breaks flow precisely because it only proposes changes after the cursor [104]. Auto-edit’s contribution is that it watches recent edits and suggests related changes elsewhere — the parameter rename in another file, the deletion you forgot to propagate [104]. Use this mode when the next change is small, local, and already shaped in your head; reach for plan or act mode when the next change is large enough that you cannot predict its boundaries.
Plan mode is a distinct execution mode, not a polite habit. The operator move is to lock the agent into a read-only state, get a reviewable artifact, then release the gate. Replit’s Plan Mode and Claude Code’s --plan flag are the canonical implementations: a non-executing pass with an explicit “Start Building” release [105]; Aider’s /ask is the lighter-weight version, a chat phase before “go ahead” [90]. Tools without a named planning mode can enforce the same gate by prompt discipline — ask for a numbered step list and file list before any implementation instruction. The artifact form matters: a numbered plan tied to concrete file paths is reviewable; a vague summary is not. Entering plan mode means you are still accountable for direction before the agent becomes accountable for execution. That is why the gate matters: it is the last cheap point to correct scope before edits begin. Enter plan mode when the task touches more than two files, crosses a module boundary, involves an irreversible operation (migration, deletion, published API change, infrastructure config), or contains scope you cannot summarize in one sentence. Skip it for a single-file edit you could finish by hand in under a minute, or for a mechanical refactor the type checker is already driving.
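The gate itself is a small state machine: read-only until a plan exists and the human explicitly releases it, and any new plan re-arms the gate. A sketch of that discipline, assuming a hypothetical session wrapper rather than any specific tool's API:

```python
class PlanGate:
    """Hypothetical plan-mode gate: the agent stays read-only until a
    reviewable plan exists and the human explicitly releases it."""

    def __init__(self):
        self.plan: list[str] = []     # numbered steps tied to file paths
        self.released = False

    def submit_plan(self, steps: list[str]) -> None:
        self.plan = steps
        self.released = False         # a new or revised plan re-arms the gate

    def release(self) -> None:
        """The human's 'go ahead' / 'Start Building' moment."""
        if not self.plan:
            raise RuntimeError("refusing to release: no plan to review")
        self.released = True

    def may_write(self) -> bool:
        """File writes are allowed only after explicit release."""
        return self.released
```

The design point is the re-arm in submit_plan: if the agent revises its plan after your approval, the old authorization should not silently carry over.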
The gate should also be granular: file writes, shell commands, and external resource access are separate trust decisions. The same agent action that is fine in one category — reading files, for instance — may be unacceptable in another, like running shell commands that mutate the filesystem or calling external services. Make those categories visible in the control surface: Cline-style auto-approve panels split file edits, command execution, browser use, and MCP actions into separate approvals, so you can allow one category without silently authorizing the rest [106]. The practical move is to scope tool permissions per category rather than wrapping every action in a single global mode toggle. A mode that grants the agent broader command surfaces or external resources than the task needs is already the wrong mode — and the corrective is usually to narrow categories, not to drop into full plan mode.
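Per-category scoping can be as simple as a permission table with a deny-by-default lookup. A sketch in the spirit of Cline-style auto-approve panels; the category names and defaults here are illustrative, not any tool's configuration schema:

```python
# Hypothetical per-category permission table: each category is its own
# trust decision, set for this task rather than globally.
PERMISSIONS = {
    "read_files":   True,    # safe to auto-approve for this task
    "edit_files":   True,
    "run_commands": False,   # shell can mutate the filesystem: ask first
    "browser":      False,
    "mcp_actions":  False,
}

def is_allowed(category: str, permissions: dict[str, bool]) -> bool:
    """Unknown categories default to denied: a mode that grants more
    surface than the task needs is already the wrong mode."""
    return permissions.get(category, False)
```

The deny-by-default fallthrough is the corrective the paragraph names: narrowing categories, not dropping the whole session into a lower mode.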
Act mode is where most agent-assisted work actually lives: the agent executes, you watch the diff, you accept or revert each step. The concrete trigger for stepping up from assisted editing into act mode is when the next change is bigger than your head — multiple files, a small refactor, a new function with its tests — but small enough that you can still hold the diff. Worked contrast: assisted editing is “rename this variable and let auto-edit propagate it”; act mode is “add a retries parameter to the HTTP client, thread it through the three call sites, and add a test for the retry path.” That second task is too wide to drive from the cursor and too narrow to need a written plan. The move is: pin the goal in one sentence, name the files in scope, let the agent edit, read each diff before accepting, and revert immediately if the change touches anything outside the named scope. The checkpoints that remain in act mode are the diff itself and the test run after each cluster of edits; if either becomes too noisy to read carefully, you have over-scoped the act-mode session and should drop back to plan mode. Your role at this level is not inspecting code line by line but designing and maintaining the harness that controls agent behavior across nested loops [107]. Tight act-mode supervision pays off when the change is novel or the codebase is unfamiliar; it becomes a tax once you trust the agent’s pattern on similar work.
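The "name the files in scope, revert if the change strays" discipline can be mechanized as a subset check on the diff. A sketch, with hypothetical file names standing in for the retries-parameter example above:

```python
def out_of_scope(diff_files: set[str], named_scope: set[str]) -> set[str]:
    """Files the agent touched that were never named in the act-mode
    scope. A non-empty result is the revert-immediately signal."""
    return diff_files - named_scope

# Hypothetical act-mode session for the retries-parameter task.
scope = {"http_client.py", "api.py", "jobs.py", "test_http_client.py"}
diff = {"http_client.py", "api.py", "settings.py"}   # agent's actual diff
violations = out_of_scope(diff, scope)
```

Here settings.py was never in the pinned scope, so the check flags it before you accept the diff; whether the right response is a revert or a re-scoped plan is your call, but the signal itself is cheap to compute.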
Autonomous loops raise autonomy further: the agent runs edit-test-fix cycles for minutes or hours, with humans reviewing at checkpoints rather than per action. This works when verification is cheap and comprehensive — DoltLite reached correctness parity with SQLite by passing 5.7M sqllogictest cases, which let the practitioner hand the loop to the agent for sustained “hand-to-hand combat” without inspecting every diff [108]. The pattern transfers wherever the test suite is strong enough to act as the reviewer: the agent iterates against the test signal, you review the test signal. Where the verification surface is weak, autonomous mode is a confidence trap — the agent will confidently report green when work is broken [109].
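Structurally, the autonomous loop is an edit-test-fix cycle where the test harness, not the agent's self-report, decides when to stop. A toy sketch with stubbed callables (propose_fix and run_tests are hypothetical stand-ins for the agent and the test harness):

```python
def autonomous_loop(propose_fix, run_tests, max_iters: int = 50):
    """Edit-test-fix cycle: propose_fix(failures) mutates the workspace,
    run_tests() returns the currently failing cases. The loop ends on a
    green harness or on an iteration cap, never on the agent's say-so."""
    failures = run_tests()
    for i in range(max_iters):
        if not failures:
            return ("green", i)      # verifier says done; human reviews
        propose_fix(failures)
        failures = run_tests()       # re-verify after every edit cluster
    return ("stuck", max_iters)      # checkpoint: escalate to the human

# Toy harness: three failing cases, each 'fix' clears one.
state = {"failing": ["case1", "case2", "case3"]}
status, iters = autonomous_loop(
    propose_fix=lambda f: state["failing"].pop(),
    run_tests=lambda: list(state["failing"]),
)
```

The iteration cap is the other half of the contract: without it, a weak fix strategy loops forever instead of surfacing a "stuck" checkpoint for human review.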
Background and queued execution is the sixth stop: dispatch a task to run while you do something else, instead of blocking on every agent turn. Replit’s Queue is the explicit version — you enqueue prompts, reorder them, attach per-task model and extended-thinking settings, and the agent works through them while you keep thinking [110]. The same pattern shows up in IDE workflows: running up to two non-competing agent workflows in parallel — one synchronous focused task, one asynchronous background task on bug fixes or tech debt — is sustainable in practice [111]. Use background mode for work that is well-specified, low-blast-radius, and where you would otherwise context-switch and lose ideas waiting. Do not use it for work whose plan you have not already pinned down — async runs amplify the cost of bad direction because nobody is watching.
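Everything that matters about a queued task has to be pinned at enqueue time, because nobody is watching once it runs. A sketch of that shape — the field names are illustrative, not Replit's schema:

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class QueuedTask:
    """What must be frozen at enqueue time: the prompt, the model, the
    thinking budget, and the verifier the returned work has to pass."""
    prompt: str
    model: str = "default"
    extended_thinking: bool = False
    success_criteria: str = ""

queue: deque = deque()
queue.append(QueuedTask("fix flaky retry test", model="small-fast",
                        success_criteria="test_retry passes 20/20 runs"))
queue.append(QueuedTask("migrate config loader", model="large",
                        extended_thinking=True,
                        success_criteria="all config tests green"))
queue.rotate(1)        # reorder: the migration now runs first
next_task = queue[0]
```

Note that success_criteria travels with the task: the verification you would otherwise do live has to be encoded in the dispatch itself.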
6.2 Transition Signals: What Tells You the Mode Is No Longer Right
The most consequential transition signal is the inner-to-outer loop boundary. The inner loop is fast: file-level edits, a unit test or compile cycle, immediate feedback under tight human supervision. The outer loop is slow: CI, integration tests, deployment, cross-system verification — a feedback cycle that spans minutes to hours and may affect shared state. Keep two different switches separate. The plan-versus-act decision is an inner-loop switch: it changes how much freedom the agent has while you are still inside one local session. The local-versus-CI or local-versus-release decision is an inner-to-outer loop switch: it changes which feedback loop owns the work at all. Applying the same autonomy settings to both loops fails in opposite directions: outer-loop ceremony in the inner loop interrupts you constantly during local iteration; inner-loop autonomy in the outer loop lets the agent push to a shared branch or trigger a deployment without the checkpoint that should have caught it. When the task crosses from local edit-test work into CI, shared-branch, or deployment territory, drop to a higher-ceremony mode: plan mode for any new outer-loop action, explicit human approval before merge, narrower tool scopes than you allowed locally. The boundary is a design decision, not an implicit assumption — see Chapter 8 for how to assign tasks to inner versus outer loops upfront.
The second signal is risk threshold crossing within a single task. Plan-then-act exists precisely because most costly agent mistakes happen when execution starts before requirements are fully understood [105]. The trigger signals for entering plan mode are concrete: the agent is about to touch more than two or three files, the request contains words like “refactor”, “migrate”, “rename across”, or “change all”, or the downstream impact involves production data, a public API, or shared infrastructure. The reverse transition — releasing the gate to act mode — is also a signal moment: read the plan, edit it where wrong, then explicitly authorize. Plans approved without reading become rubber stamps, and rubber stamps are worse than no gate because they create false confidence.
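Because the trigger signals are concrete, they can be written down as a check that forces the plan-mode gate. A sketch with illustrative thresholds (the chapter's "two or three files" becomes a cutoff you would tune per codebase):

```python
RISK_PHRASES = ("refactor", "migrate", "rename across", "change all")

def needs_plan_mode(request: str, files_to_touch: int,
                    touches_shared_state: bool) -> bool:
    """Hypothetical trigger check: a wide blast radius, risky verbs in
    the request, or production/shared-infrastructure impact each force
    a plan before execution. Thresholds are illustrative."""
    wide = files_to_touch > 3
    risky_words = any(p in request.lower() for p in RISK_PHRASES)
    return wide or risky_words or touches_shared_state
```

Any single trigger suffices; the point of the disjunction is that a one-file "migrate" deserves the same gate as a ten-file edit.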
The third signal is review surface mismatch. Review the artifact on the surface where it is cheapest to evaluate. Visual artifacts — UI prototypes, dashboards — should be iterated in the preview first; code artifacts — logic, state, data fetching — on the diff. The mismatch shows up most often when a task changes artifact shape mid-flight. A typical case: what starts as “tweak the dashboard layout” is reviewed in preview and looks fine, but the agent’s fix routes through a new data-fetching hook and a small state-machine change. The artifact has shifted from pixels to control flow; preview can no longer see what matters, and continuing to review there will green-light a structurally wrong diff. The corrective is to switch surfaces the moment the unit of work changes shape — drop to diff review, and often back into plan mode to re-scope the now-larger task. Hallucination is the defining feature of LLM output, not a bug [109] — the surface you review on is what determines whether you catch the wrong hallucinations.
The fourth signal is within-session degradation — the agent’s behavior is telling you it is time to back off autonomy. Distinguish two threshold levels. Early signals call for within-session correction without escalation: the agent repeats a solution you already rejected, or it ignores a constraint stated earlier in the conversation. Tighten the prompt, restate the constraint with explicit failure criteria, or switch models — Aider’s working practice of swapping one model for another when one gets stuck is the cheapest reset that does not lose context [90]. Late signals call for escalation: the agent’s responses start summarizing the conversation instead of answering it (compaction has begun replacing reasoning), tests keep failing in the same way despite nominal “fixes”, or you are correcting the same mistake on every turn without convergence. Another late signal is comprehension collapse on the human side: if you can no longer explain what the agent changed or why the current branch state is trustworthy, the mode is already wrong even if the agent sounds confident. At the late-signal threshold, drop to plan mode or reset — the conversation itself has become the problem, not the model. The full mechanics of compaction and recovery belong to Chapter 5; the point here is that compaction symptoms are a mode-switch signal.
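The early/late distinction is itself a small decision table: early signals get within-session correction, any late signal escalates. A sketch, where the signal names are just labels for the symptoms described above, not outputs any tool emits:

```python
EARLY = {"repeats_rejected_solution", "ignores_stated_constraint"}
LATE = {"summarizes_instead_of_answers", "same_test_failure_repeats",
        "same_correction_every_turn", "human_cannot_explain_changes"}

def triage(signals: set[str]) -> str:
    """Any late signal escalates, even when early signals are present;
    only purely-early signal sets stay within the session."""
    if signals & LATE:
        return "escalate: drop to plan mode or reset the session"
    if signals & EARLY:
        return "correct in-session: tighten prompt or swap models"
    return "continue"
```

The ordering matters: checking LATE first encodes the rule that a degrading conversation is not rescued by fixing the prompt.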
The fifth signal is flow disruption asymmetry. If the agent is interrupting you constantly during local edits — confirmation dialogs, every-file approvals, repeated permission prompts — your inner loop is being supervised at outer-loop intensity. Loosen autonomy or drop to assisted-editing mode. Conversely, if the agent has just made a cross-system change without surfacing it for review, your outer loop is running at inner-loop autonomy. The corrective is the inverse: tighten scope, reduce auto-approve categories, require explicit confirmation for shared-state actions. Symmetric autonomy across asymmetric loops is the misconfiguration that produces both burnout and risk.
The sixth signal is agent capability ceiling. Some mode failures are not about session quality at all; they are about the toolchain falling outside the agent’s learned comfort zone. Non-standard build systems, domain-specific DSLs, notebook-derived code workflows such as nbdev, or bespoke deployment wrappers are all signs that the agent may keep producing plausible but structurally wrong edits [103]. When that happens, do not just switch models and hope. Drop the mode back toward manual or plan-only, add tighter examples, or keep the agent on read-heavy explanation work while you execute the tricky change yourself.
The seventh signal is review fatigue. Review fatigue is a real and underestimated cost of multi-agent workflows; staring at twelve diffs from twelve agents at day’s end is unsustainable [102]. When you notice you are accepting changes you do not fully understand because you are tired, that is the signal to reduce parallel agents, raise the planning bar so each agent produces fewer reviewable outputs per unit of intent, or queue work for tomorrow rather than absorb it now [110]. The test for whether you should keep going: can you explain, for the diff you are about to accept, what the agent changed, what evidence says it is correct, and what the next step would be if it is not? If not, that is the signal to drop to a more supervised mode — or stop for the day.
Together these signals answer the chapter’s central question: when is the current mode no longer right? When the loop boundary changed under you, when the risk surface widened, when the artifact shape shifted out from under your review surface, when the session is degrading, when interruption frequency does not match the work, when the toolchain is past the agent’s ceiling, or when your own attention is no longer reliable enough to be the reviewer.
6.3 Handoffs and Checkpoints: Keeping Control While Changing Mode
Mode changes happen mid-task. The handoff is what determines whether they preserve or destroy the work in flight.
The plan-to-act handoff. Plan mode produces an artifact: a numbered list of steps, the files involved, the open questions. The handoff is reading that artifact, editing it where wrong, and then releasing the gate — saying “go ahead” in Aider, hitting “Start Building” in Replit, or whatever the tool’s specific release looks like. The discipline is that the artifact should be specific enough to verify: concrete file paths, function names, the assumption being made about an API contract, the one failure mode the agent is most worried about. Vague plans defeat the gate. The interview-driven version — a structured workflow where the agent interviews you to produce a plan, you review it, then implementation follows — collapses days of guesswork into a single day of focused execution [111]. The principle transfers: plan, review, authorize, execute. Each step is an explicit handoff, and each handoff is reviewable.
The act-to-autonomous handoff. Before you let the agent run a multi-step loop without per-action approval, the verification surface has to be ready. DoltLite’s autonomous-loop success was bounded by its 5.7M-test sqllogictest harness — the agent could iterate freely because the test signal could detect regressions reliably [108]. Without that signal, autonomous mode produces detritus and abandoned attempts that are hard to spot in review [108]. The handoff condition is therefore: do I have a verifier the agent cannot bluff past? If not, stay in act mode. The agent will report all-tests-green when tests are actually failing [109], so “the agent says it works” is not a verifier. The full treatment of test-suite design and verifier strength belongs to Chapter 12; here the rule is simply that autonomous mode requires a verifier the agent did not write.
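"A verifier the agent did not write" is checkable at the file level: the verifier counts only if the agent's session diff never touched it. A sketch of that handoff condition, with hypothetical file names:

```python
def verifier_is_trustworthy(agent_edited: set[str],
                            verifier_files: set[str]) -> bool:
    """Handoff condition for autonomous mode: a verifier must exist and
    must be disjoint from everything the agent edited this session. An
    agent-reported 'all tests green' is not a verifier."""
    return bool(verifier_files) and not (agent_edited & verifier_files)

# Hypothetical session: the agent edited a test file alongside the code.
edited = {"engine.py", "planner.py", "tests/test_engine.py"}
harness = {"tests/test_engine.py", "tests/test_planner.py"}
ok = verifier_is_trustworthy(edited, harness)
```

Here ok is False because the agent rewrote part of its own harness, which is exactly the case where green output proves nothing; the empty-harness case fails the check too, since no verifier at all means stay in act mode.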
The inner-to-outer loop handoff. When work crosses from local iteration into CI, integration, or deployment territory, three things must transition: the working state has to be committable and reversible; agent tool scopes should narrow so categories that were auto-approved during local editing now require explicit confirmation; and the human review boundary should move from per-diff to per-merge. AI-only review at this boundary is nondeterministic and misses dependency-level vulnerabilities, which is why deterministic static analysis and dependency scanning have to backstop the agent at the outer-loop transition [112]. The mechanics of commit hygiene and PR structure belong to Chapter 13, and the layered review infrastructure belongs to Chapter 11. The mode-switching point here is just the trigger: crossing the loop boundary is a handoff, and you cannot enter higher-autonomy outer-loop modes responsibly without those checkpoints already in place.
The handoff between humans and the harness. Humans should operate at the boundary between intent and execution — designing the harness, writing the specifications, maintaining the quality checks — rather than approving each line of code [107]. The “why loop” (what we are building and for whom) stays human-driven; the “how loop” (intermediate artifacts, file edits, test runs) can shift increasingly to agents as harness maturity grows [107]. The practical implication for mode selection is that the human checkpoint should sit at the highest layer that still meaningfully gates the outcome. If you can put the gate at “approve the spec”, do that; if the spec is settled, gate at “approve the plan”; if the plan is settled, gate at “approve the diff”; if the diff is settled and the test suite is strong, gate at “approve the merge”. Each upward shift trades inspection for leverage.
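The "highest layer that still meaningfully gates the outcome" rule is a short cascade: walk down from spec to merge, stopping at the first layer that is not yet settled. A sketch of that cascade as a pure decision function (the boolean inputs are judgments you make, not measurable quantities):

```python
def checkpoint_layer(spec_settled: bool, plan_settled: bool,
                     diff_settled: bool, suite_strong: bool) -> str:
    """Place the human gate at the highest unsettled layer; each step
    down trades leverage for inspection."""
    if not spec_settled:
        return "approve the spec"
    if not plan_settled:
        return "approve the plan"
    if not diff_settled:
        return "approve the diff"
    if suite_strong:
        return "approve the merge"
    return "approve the diff"   # weak test suite: stay at diff review
```

The last branch encodes the caveat the chapter attaches to merge-level gating: you only earn it when the test suite is strong enough to stand in for diff review.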
The async dispatch checkpoint. When you queue work for background execution, the checkpoint is upfront: the prompt, the model choice, the tool scope, the success criteria. Once dispatched, you are no longer watching, so the prompt has to encode the verification you would otherwise do live. Replit’s Queue exposes per-task model selection and extended-thinking toggles for exactly this reason — the configuration done at enqueue time is the only configuration that runs [110]. Treat enqueue as the same kind of handoff as plan-to-act: specific, scoped, with a verifier the returned work has to pass. Background mode also has a soft cap. In practice, one synchronous focused task plus one async background task is the easiest supervised pattern to sustain; going beyond that is possible, but only if the review surface and ownership boundaries are already strong enough that extra parallelism does not simply pile up unread diffs [111], [102].
The mid-session reset and rewind. Sometimes the right handoff is no handoff: stop, throw away the conversation, start over. Reset is structural — fresh session, fresh priming, anchored to a progress note and the git state. Rewind is surgical — undo the last turn or two when those turns made the repo worse and the state before was fine. Treat them as different tools. Reset when the conversation itself has stopped helping; rewind when only the last move was wrong. The mechanics of session resume, rewind, and compaction recovery belong to Chapter 5; what matters here is that “throw away the session” is a legitimate mode-switching move, not a failure. Agents bloat code and chase abandoned approaches by default [111], [108] — restarting with a sharpened prompt is often faster than nursing a drifting session through five more turns.
The post-session handoff to the next session. Each session should leave behind one reusable artifact: a session note, a sharper rule, a new test, a workflow command. Anything you have pasted into chat twice should become something the next session can load without folklore. Procedures belong in the repo as versioned commands, scripts, or templates, where they survive teammate turnover and evolve under review. The full treatment of how to design those reusable artifacts — when a procedure becomes a skill, when a skill becomes a subagent — belongs to Chapter 4; the mode-switching relevance here is that the cheapest way to reduce future mode mismatches is to leave the next session with the right defaults already loaded.
The thread through all of this: control during a mode change is not about resisting the change — it is about making the handoff explicit. Plan-to-act is a handoff. Inner-to-outer is a handoff. Reset is a handoff. Each one has a specific artifact (a plan, a commit, a progress note), a specific verifier (review, CI, the spec), and a specific authorization (your “go ahead”). When all three are present, mode changes accelerate the work. When any one is missing, the change is the failure mode the chapter started with — the wrong file deleted, the seventeen interruptions, the agent confidently reporting green on broken work.
6.4 Takeaways
- Choose execution mode based on the feedback loop the task needs and the cost of misalignment in that loop, not on a blanket judgment about how much you trust the agent.
- Default to plan mode and require a numbered plan tied to concrete file paths before execution when a task touches more than two files, crosses a module boundary, involves an irreversible operation, or exceeds a one-sentence summary; skip that ceremony for trivial single-file edits and type-checker-driven mechanical refactors.
- Scope agent tool permissions per category — file edits, shell commands, browser use, MCP actions — rather than wrapping all agent actions in a single global mode toggle.
- When work crosses from local edit-test iteration into CI, shared-branch, or deployment territory, raise ceremony: require plan mode for any new outer-loop action, explicit human approval before merge, and narrower tool scopes than you allowed locally.
- Switch from preview to diff review the moment a task shifts from visual output to logic or control-flow changes, and often return to plan mode to re-scope the now-larger task.
- Treat late within-session degradation signals — responses that summarize instead of answer, tests failing the same way despite repeated fixes, or your own inability to explain what the agent changed — as a hard trigger to drop to plan mode or reset the session entirely.
- Before releasing an agent into an autonomous loop, confirm you have a verifier the agent did not write and cannot bluff past; without that verifier, stay in act mode.
- Every mode change requires three things to be explicit: a concrete artifact (plan, commit, or progress note), a verifier the agent cannot self-certify (review, CI, or the spec), and an authorization point; when any one is missing, the mode change itself is the failure mode.