1 The Agent Landscape and Tool Choice
The question is never “which agent is best?” — it is which operating surface and capability set fit the task, the codebase, and the verification you can actually do.
The coding-agent ecosystem has fragmented into distinct surfaces, and most engineers end up juggling several. A CLI for repository-wide refactors, an IDE assistant for local edits and visual context, a cloud agent for a bounded backlog issue, an in-browser builder for a quick prototype — that is sequential tool switching across moments in the workflow, not parallel agents racing on one branch. Treat brand competition as a distraction. The choice that actually matters is the capability shape of the tool you reach for at each step: what it can see, what it can run, what it can verify, and how much of its own state it shows you when the loop drifts.
Three terms run through the rest of this book. A model is the underlying language model. A harness is the infrastructure around the model — context assembly, tool exposure, permissions, session state, recovery. A coding agent is the working system you actually use: model plus harness plus surface plus workflow. Every coding agent runs the same observe → plan → execute → observe loop on the inside, and the outer engineering loop you live in is Spec → Generate → Review → Test → Refine → Commit. What changes across products is how that loop is exposed to you, how much it can do without asking, and how cleanly its output enters review.
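To make the inner loop concrete, here is a minimal, illustrative sketch in Python. The names (run_agent, Action, plan, tools) are hypothetical, and every production harness wraps this skeleton in context assembly, permission checks, and session recovery.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Action:
    kind: str                      # "tool" to keep going, "finish" to hand off
    tool: str = ""
    args: dict = field(default_factory=dict)
    summary: str = ""

def run_agent(plan: Callable[[list], Action], tools: dict[str, Callable],
              goal: str, max_turns: int = 20) -> str:
    """Minimal inner loop: observe -> plan -> execute -> observe, until handoff."""
    history = [("user", goal)]                      # observe: the task itself
    for _ in range(max_turns):
        action = plan(history)                      # plan: the model picks the next step
        if action.kind == "finish":
            return action.summary                   # reviewable handoff, not a raw transcript
        result = tools[action.tool](**action.args)  # execute: edit files, run commands, run tests
        history.append(("tool", result))            # observe the outcome and go around again
    raise RuntimeError("turn budget exhausted; surface the session state for review")
```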
Across every surface, the leverage zone is the same: bounded inputs, output you can verify by tests or inspection, semantics that are legible in text rather than pixels. Bulk file moves, path rewrites, scaffolding, search index regeneration, and other repetitive transforms are where agents convincingly outpace hand work [1]. The weak zone is also stable: tasks that need tacit architectural judgment supplied mid-flight, work without a verification loop, and review queues that cannot absorb the throughput. A blunt counterweight worth keeping visible: an experienced practitioner who tracked their own usage carefully concluded that inline code suggestions create a net-negative productivity hit because of the constant context-switching tax, even when the suggestions are accurate [2]. Agent leverage is conditional on task shape and verification, not automatic.
1.1 CLI, IDE, and Cloud Tradeoffs
Three execution surfaces change the feedback loop in fundamentally different ways: CLI, IDE, and cloud or in-platform. Open-source and self-hosted operation is a fourth, cross-cutting axis: a deployment and governance choice that can apply to any of the three surfaces, treated separately later in this section.
Reach for a CLI surface when the task needs tight, text-first iteration across many files. Terminal-driven tools keep the loop explicit: read files, run commands, execute tests, repeat. They are the natural home for coordinated refactors, codebase exploration, and any work where the agent needs to invoke real CLI tools in your repo. The reason a CLI is often the right surface is mechanical: agents understand standard CLI tools and their flags far better than they understand project-specific wrappers, so the more transparent your toolchain is, the better the agent reasons about it [3]. The same logic explains why a CLI-first agent doubles as a general computer-automation agent — anything achievable from the command line is reachable [4]. The trade-off: setup effort, less product polish, and no built-in visual feedback.
Reach for an IDE surface when visual context and local interaction speed matter most. Editor-integrated agents earn their place when you want to watch the code while the agent edits it, compare inline diffs, and exploit local indexing. Newer agentic IDEs push this further: agents reason about full project structure and cross-framework dependencies before writing code, capture and react to visual previews in real time, and execute multi-file tasks autonomously while you stay in the editor [5]. The IDE wins when seeing the code and its rendered output evolve is part of how you supervise. IDE agents also give you the cleanest place to standardize team behavior without overriding personal preferences — Continue, for example, supports a three-tier configuration where workspace-level settings can merge with each developer’s local config rather than overwrite it [6].
Reach for a cloud or in-platform surface when the task is bounded enough to run asynchronously, or when integrated infrastructure shortens the path to a running artifact. Remote agents fit backlog issues, batch fixes, and work you can hand off and review later in one pass; modern platforms now run autonomous sessions of up to about 200 minutes with built-in self-testing and configurable autonomy levels [7]. Integrated platforms also compress the prototype path dramatically: a designer with no CS background shipped a production AR game in a week using an in-browser agent backed by built-in auth, database, deployment, and storage — work that would have taken weeks of manual wiring against third-party services [8]. The trade-off is exactly what the same story exposed: the last 20% of polish — dev/production parity, performance tuning, real-device testing — still takes at least 50% of total time, and agent-generated UIs need a real verification loop before they can be trusted.
The cross-cutting choice: open-source or self-hosted versus managed. Independent of which surface above you pick, you also pick how the harness is owned and run. Managed products optimize for onboarding, integrated cloud features, and platform telemetry. Open-source and self-hosted agents optimize for inspectability, model choice, and environmental control. Cline is a useful concrete reference here: it is open-source with a client-side, zero-trust execution model so user code never touches a vendor server, and it is model-agnostic by design, letting you switch between Claude, GPT-4, Gemini, DeepSeek, and local models without re-platforming [9]. The cost is real — you give up the smoothest onboarding and most cloud-side polish, and you take on harness maintenance — but in exchange nothing about the system is opaque. If your blocker is “we cannot send this code to a vendor,” or “we need to know exactly what the agent does,” start from a self-hosted or open-source option and treat managed agents as the alternative you have to justify.
Most strong practitioners combine surfaces: CLI for tight control, IDE for visual context, cloud for bounded asynchronous work, with the open-source-versus-managed decision made once per environment and then carried through. Match the surface to the feedback loop the task actually needs.
1.2 Capability Assessment
Stop comparing agents by brand and start comparing them by capability surface. Six capabilities determine almost everything about whether a tool will fit a given task. Use this list as a checklist when you evaluate a new entrant — keep each question evaluative; the deeper mechanics belong to the chapters on harnesses and on skills, prompts, and specialization.
Repository and terminal execution. Can the agent read your files, run real commands, and execute your tests in a tight loop? This is table stakes for any work that has to interact with your toolchain rather than only describe code. It is also the capability that rewards a transparent project layout — direct CLI invocations, standard tool flags, minimal wrapper scripts — because the agent reasons better about the tools it already knows than about your bespoke shorthand [3]. When this capability is missing or shallow, you end up pasting code in and out of a chat window, which is where the productivity tax shows up [2].
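As a sketch of what that tight loop looks like mechanically, the snippet below shells out to the test runner with a plain, standard invocation and hands the failure text back as the next observation. The pytest command and the output truncation are assumptions to adapt to your own toolchain.

```python
import subprocess

def run_tests(repo_dir: str = ".") -> tuple[bool, str]:
    """Run the project's tests with a transparent, standard invocation (assumed: pytest)."""
    proc = subprocess.run(
        ["pytest", "-q", "--maxfail=5"],   # standard flags the agent already understands
        cwd=repo_dir,
        capture_output=True,
        text=True,
        timeout=600,
    )
    return proc.returncode == 0, proc.stdout + proc.stderr

ok, output = run_tests()
if not ok:
    # Feed only the tail back to the agent; the last few kilobytes usually carry the failure.
    next_observation = output[-4000:]
```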
Browser feedback and self-testing. Can the agent see the rendered output of its own work? Work on CSS, layout, and UI behavior is unreliable without it; an agent without a browser loop cannot tell that a change broke a flow it just shipped [1]. Tools that self-test in a real browser between turns close that loop without you in the middle [7], and IDE agents that capture visual previews give you the same property inside the editor [5]. When visual correctness matters, this capability is non-negotiable.
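One way to close that loop yourself is a small browser check run after each change. The sketch below uses Playwright against an assumed local dev server; the URL, page text, and choice of library are illustrative rather than prescribed.

```python
from playwright.sync_api import sync_playwright

# Assumed: a dev server on http://localhost:3000 and Playwright installed
# (pip install playwright, then: playwright install chromium).
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:3000/checkout")          # the flow the agent just touched
    page.screenshot(path="checkout_after_change.png", full_page=True)
    # A cheap assertion keeps the loop honest: fail loudly if the rendered flow broke.
    assert page.locator("text=Place order").is_visible(), "checkout button missing after edit"
    browser.close()
```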
Async or autonomous execution with reviewable handoff. Can the agent run for a long time without you, and does it return something you can actually review — a branch, a PR, a set of artifacts — rather than a wall of tool calls? Long-running sessions of tens of minutes to a few hours are now viable, but only if the platform produces task lists, plans, screenshots, or recordings as first-class artifacts you can scan instead of replaying every step [10], [7]. Without an artifact-shaped handoff, autonomy just shifts cognitive load from execution to forensic review.
Tool extension and custom integration. Can you teach the agent about your stack? The evaluative question is simple: when your codebase falls outside the public happy path — unusual frameworks, custom internal APIs, domain types the model has never seen — does the tool give you a clean way to inject that knowledge? Two common surfaces are protocol-based integrations like MCP, and on-demand recipe files like skills [4], [5]. The mechanics, token economics, and trade-offs between them are the subject of a later chapter on skills, prompts, and agent specialization; here, only ask whether the surface exists and whether your team can actually use it.
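To make the extension surface concrete, here is a hedged sketch of a custom tool exposed over MCP using the Python SDK's FastMCP helper. The server name, the tool, and the internal API it pretends to wrap are hypothetical; a skills-style recipe file reaches a similar end with a plain document instead of code.

```python
from mcp.server.fastmcp import FastMCP

# Hypothetical internal-API wrapper exposed to the agent as an MCP tool.
mcp = FastMCP("internal-billing")

@mcp.tool()
def lookup_invoice(invoice_id: str) -> str:
    """Return the status of an invoice from the internal billing service."""
    # A real integration would call your internal API here; stubbed for illustration.
    return f"invoice {invoice_id}: status=paid, amount=1200.00"

if __name__ == "__main__":
    mcp.run()   # stdio transport by default, which most agent harnesses expect
```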
Observability and command discoverability. Can you see what the agent can do, what it just did, and what state it is carrying? Observability also has a governance dimension that many teams discover too late. GitHub Copilot’s enterprise audit log, for example, captures plan changes and platform-side agent activity but explicitly excludes local client session data — local prompts your developers send are not in the audit trail unless you instrument it yourself, and events are retained for only 180 days unless you stream to a SIEM [11]. At selection time, two blunt yes-or-no questions decide it: when something goes wrong in a session, can you see what happened? And when audit asks what the team did last quarter, can you answer? The deeper mechanics of harness state, compaction, and rewind belong to the harness chapter.
Local execution and data residency control. Can the work be done where your code is allowed to live, and can you choose the model? Three concrete operator surfaces make this real. First, model choice and BYOK: Cline ships with a model-agnostic architecture that lets you switch providers and route to local models without leaving the agent [9], and Continue’s three-tier configuration lets workspace-level config pin specific models and rules across a team without overriding individual setups [6]. Second, true local runtime: Ollama runs locally and does not send prompts or responses to its cloud by default, binds to localhost on port 11434 unless you explicitly set OLLAMA_HOST to expose it, and supports an air-gapped mode by setting disable_ollama_cloud in ~/.ollama/server.json [12]. Third, the gotchas you need to know before you trust the setup: Ollama defaults to a 4096-token context window and silently truncates anything larger without warning, so coding agents like aider explicitly resize the window per request and recommend the ollama_chat/ model prefix for compatibility [13]. The trade-off is straightforward: stronger residency and control guarantees usually mean weaker raw coding performance, less cloud-side polish, and operational hazards like silent context truncation that you have to manage yourself. A useful auxiliary control when running closer to local models is system-prompt and temperature tuning — system prompts enforce hard constraints far more reliably than user prompts, and dropping temperature toward 0.0–0.3 reduces hallucination in tasks where you cannot afford it [2].
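To make the truncation and temperature points concrete, the sketch below calls a local Ollama server directly, resizing the context window per request and pinning a low temperature. The model name and window size are assumptions, and agents such as aider make an equivalent per-request adjustment for you, as noted above.

```python
import requests

# Assumed: a local Ollama server on its default localhost:11434 bind, with a model
# already pulled (the model name below is only an example).
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5-coder:14b",
        "messages": [
            {"role": "system", "content": "Answer only from the provided diff; say 'unknown' otherwise."},
            {"role": "user", "content": "Summarize the riskiest change in this diff: ..."},
        ],
        "stream": False,
        "options": {
            "num_ctx": 32768,     # enlarge the window; the small default silently truncates long prompts
            "temperature": 0.1,   # low temperature for tasks where hallucination is unaffordable
        },
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```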
When you assess a new tool, walk this list before reading any feature page. Most tool comparisons collapse into one of these six axes, and the differences between products are usually capability gaps wearing different marketing.
1.3 Choosing Under Constraints
Capability shopping is constrained by three forces: your codebase, your environment, and your review capacity. Each one narrows the practical choice set.
Codebase fit determines how much capability you can actually use. Three codebase signals matter most: enough test coverage that generated changes can be verified quickly, modular boundaries that keep edits coherent, and mainstream frameworks or well-documented internal conventions that give the model a strong prior. Even capable agents fail at the seam where defaults disagree with reality — initial monorepo and workspace setup is a notorious example, because agents drift toward simpler patterns and miss workspace-specific details even when explicitly told otherwise [1]. When your stack diverges from training-data norms, raw model power matters less than whether the tool’s extension surface can carry your domain semantics [4]. The cleanest version of this rule is operational: keep your toolchain transparent, keep your conventions discoverable, and prefer agents that can see both [3].
Environment narrows the menu before capability does. If the code cannot leave your network, hosted frontier-model agents come off the table regardless of how good they look in a benchmark, and your real choice collapses to local-model setups, self-hosted managed platforms, or open-source agents you run yourself — Cline-style client-side harnesses with model-agnostic routing and Ollama-backed local models are the concrete shape this usually takes [9], [12]. If a regulator needs an audit trail, you need a harness whose actions are inspectable end-to-end, and you need to know up front what your platform’s audit log does and does not capture [11]. If your team works from spotty connectivity, a near-network or local setup beats a cloud agent that stalls every turn. When the constraint is environmental, the cross-cutting choice between managed and self-hosted is usually settled before model quality enters the conversation.
Team workflow shapes the rest. Combination workflows are the norm, not the exception. Two recur often enough to plan around. The first is prototype-to-production: sketch fast in an in-browser builder or design-mode tool, then move into a CLI or IDE workflow once tests, architecture, and maintainability matter — and budget for the fact that the last 20% of polish will take at least half the total time, regardless of how fast the first cut arrived [8], [7]. The second is async-to-local: hand a bounded backlog issue to a long-running cloud agent, then review and integrate locally where you can run tests and inspect the diff. The handoff is the moment to verify, not the moment to trust.
Review capacity is the hard ceiling on parallelism. More agents and more autonomy are only useful when the output can still be verified. If your reviewers cannot absorb the volume, the extra throughput is fake, and the cost shows up later as cognitive debt — teams shipping code they no longer fully understand because generation outran comprehension [14]. The disciplined countermove is small and concrete: ask the agent to explain what it produced, treat that explanation as part of the review, and refuse to ship work that no human on the team can defend in plain language.
1.4 A Selection Filter
When you sit down to start a piece of work, run through this filter in order. The questions are ranked: the first three should dominate most choices, and the rest only matter if the first three are satisfied.
- Is there a verification loop available? Tests, browser feedback, inspectable output. If not, step back. The right tool is not an agent yet.
- What execution surface fits the feedback loop the task needs? CLI for coordinated edits in a known toolchain. IDE for visual or UI work where you supervise inline. Cloud or in-platform for bounded async work that returns reviewable artifacts (branch, PR, plan, screenshots), not raw transcripts. Throwaway prototypes belong in an in-browser builder, with budget for the last-mile polish tax when they graduate.
- Does environment force a deployment choice? If code cannot leave your network, or you need to audit the harness, you are picking from open-source and self-hosted options — Cline-style client-side harnesses, local model runtimes like Ollama, or self-hosted managed platforms. Managed products are the alternative you have to justify here.
Only after those three are settled do the secondary questions matter (a compact sketch of the whole filter follows this list):
- Unusual stack or internal APIs the model has not seen? The chosen surface needs a real extension surface (MCP, skills, or equivalent) you can teach.
- Will autonomy be cranked up? The chosen surface needs observability deep enough to recover when the loop drifts, and an audit trail that matches your governance needs.
- Reviewer capacity? Hard ceiling on how much parallelism you can actually absorb. Throttle agents to match it.
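The sketch below folds the full filter into one short-circuiting pass. The field names and data model are hypothetical; the point is the ordering, not the implementation.

```python
from dataclasses import dataclass

@dataclass
class Task:
    has_verification: bool       # tests, browser feedback, or inspectable output
    needs_visual_review: bool
    bounded_and_async: bool
    code_must_stay_onprem: bool
    unusual_stack: bool
    high_autonomy: bool

def select(task: Task) -> list[str]:
    """Apply the filter in order; the first question can veto everything else."""
    if not task.has_verification:
        return ["stop: build a verification loop before reaching for an agent"]
    plan = []
    if task.needs_visual_review:
        plan.append("surface: IDE, supervise inline with visual previews")
    elif task.bounded_and_async:
        plan.append("surface: cloud or in-platform, reviewable artifacts on handoff")
    else:
        plan.append("surface: CLI, tight text-first iteration")
    if task.code_must_stay_onprem:
        plan.append("deployment: open-source or self-hosted; justify anything managed")
    if task.unusual_stack:
        plan.append("requirement: a real extension surface (MCP, skills, or equivalent)")
    if task.high_autonomy:
        plan.append("requirement: observability and an audit trail that matches governance")
    plan.append("ceiling: throttle parallelism to reviewer capacity")
    return plan
```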
The rule the rest of this book builds on is the one that sits behind every row of that filter: choose the tool landscape that preserves a verifiable workflow. The right stack is not the one with the strongest model or the slickest demo. It is the one whose surface, capability set, and handoff points leave you able to understand and trust what ships.
1.5 Takeaways
- Before reaching for an agent, confirm a verification loop exists — tests, browser feedback, or inspectable output. If there is no verification loop, the right answer is not yet to pick an agent; step back and build one first.
- Match the execution surface to the feedback loop the task actually needs: CLI for coordinated multi-file edits in a known toolchain, IDE for visual or UI work where you supervise inline, cloud or in-platform for bounded async work that returns a branch, PR, plan, or screenshots — not a raw transcript.
- When code cannot leave your network or you need full harness auditability, start from open-source or self-hosted options and treat managed cloud agents as the alternative you must justify — not the default.
- Treat reviewer capacity as the hard ceiling on agent parallelism: throttle the number of concurrent agents and autonomous sessions to what your reviewers can actually absorb, and refuse to ship work that no human on the team can defend in plain language.
- Evaluate new coding tools against the six capability axes — repository and terminal execution, browser feedback and self-testing, async execution with reviewable handoff, tool extension, observability, and local execution and data residency control — rather than brand reputation or benchmark scores.
- When evaluating or configuring a long-running async agent, confirm the platform produces task lists, plans, screenshots, or recordings as first-class reviewable artifacts — if it only returns raw tool-call transcripts, reject it for autonomous work because you have simply shifted cognitive load from execution to forensic reconstruction.
- Keep your project’s toolchain transparent — use direct CLI invocations with standard flags and minimal bespoke wrappers — so that the agent can reason about tools it already knows rather than trying to infer your custom shorthand.
- Before deploying an agent in an enterprise environment, check exactly what the platform’s audit log captures and what it excludes — verify whether local client prompts are recorded, how long events are retained, and whether you need to stream to a SIEM to satisfy your governance requirements.