19 BYOK, Local Models, and Self-Hosted Agents
The vendor’s default model and the vendor’s billing relationship are conveniences, not requirements — and the practitioners getting the most out of coding agents know exactly when to break both.
Chapter 18 ended at the boundary where routing and caching are no longer enough. Use that chapter’s attribution analysis first: if the cost table shows waste inside the current vendor path, fix model routing, cache hits, and context budget. Move to BYOK, local models, or self-hosting only when the decision criteria point outside the vendor default: your model bill is still dominated by high-volume repeat work after optimization, source code cannot transit the vendor proxy, finance needs spend in your cloud account, a committed provider contract must be used, data residency is mandatory, or the network cannot make outbound calls to api.anthropic.com. Each of these has the same answer: take ownership of the credential, the endpoint, or the runtime, in that order. This chapter walks the portability ladder — your own key, your own provider, your own model — and then closes with the discipline that separates “I configured a local model” from “I can prove no code left this machine.”
19.1 Bring Your Own Key: the credential layer
The first rung of portability is injecting your own provider key in place of whatever credentials the tool ships with. The mechanism is almost always one of three surfaces: an environment variable (ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY/GOOGLE_API_KEY), a CLI flag (aider --anthropic-api-key ...), or a file the tool writes for you (OpenCode’s /connect command stores credentials in ~/.local/share/opencode/auth.json) [294], [350]. The effect is identical regardless of surface: agent calls now hit your account, model-tier choice is yours, spend appears on your invoice, and code stops transiting the vendor’s infrastructure. GitHub Copilot CLI’s BYOK support — letting practitioners point the CLI at any OpenAI-compatible endpoint, Azure OpenAI, Anthropic direct, or a local model server — confirms that even tools that had previously abstracted credentials entirely now treat key injection as table stakes [351], [352].
Reach for BYOK when policy review flags vendor-proxied traffic, when finance wants inference spend in your own cloud account rather than a vendor-billed line item, or when a model variant or region is only available via direct API [353], [354]. Skip it when bundled credentials already meet the compliance and cost-attribution bar; the rotation overhead is real. And know what BYOK does not buy you: per-key rate limits replace pooled vendor capacity, so a team migrating from a Pro/Max plan to BYOK can hit throttling that never appeared before [355]. Tool-side features that depend on vendor-bundled credentials — collaborative sessions, centralized dashboards, certain audit hooks — also tend to silently degrade.
Two failure modes recur. The first is hardcoded keys in config files: a rotation breaks every checked-out copy at once, and a stray commit leaks a credential that grants unlimited model access. Inject keys at runtime through a secret manager or environment, never in opencode.json, ~/.config/goose/config.yaml, or .aider.conf.yml. The second is the consumer-account training default: Pro/Max Claude Code accounts default to opt-in training, meaning proprietary code routed through a personal account may end up in training data unless you explicitly disable it [353]. BYOK is the cleanest way out — your own organizational key gets your organizational data terms.
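A minimal sketch of the runtime-injection pattern, assuming AWS Secrets Manager and a secret name invented for illustration; any secret manager with a CLI works the same way:

```sh
# Minimal sketch: resolve the key at runtime from a secret manager instead of
# writing it to any config file. The secret name is illustrative.
export ANTHROPIC_API_KEY="$(aws secretsmanager get-secret-value \
  --secret-id prod/agents/anthropic-api-key \
  --query SecretString --output text)"
aider   # or claude, goose, opencode -- any harness that reads the variable
```

Nothing lands on disk, rotation is a single secret update, and a leaked repo contains no credential to revoke.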
19.2 Alternative providers and provider routing
The next rung up is not “which key for this backend” but “which backend.” Most modern harnesses expose provider name and model as first-class config, so the same agent workflow can hit Anthropic on Monday, Bedrock on Tuesday, and a local Ollama instance on Wednesday with no code changes. Aider’s --model flag accepts any OpenAI-compatible endpoint, and .aider.model.settings.yml lets you keep per-model profiles — including budget_tokens, custom token-limit metadata, and the aider/extra_params wildcard that applies settings to every model unless overridden [350]. Goose configuration in ~/.config/goose/config.yaml and the GOOSE_PROVIDER/GOOSE_MODEL environment variables route the entire session — Bedrock, SageMaker, Azure, Anthropic, OpenAI, or Ollama — through one switch [355]. OpenCode’s six-layer config precedence (remote organizational → global user → custom env var → project → inline env var → managed admin) means a team can ship a sensible default provider in opencode.json while letting a project pin a different one for sensitive code, all without users editing anything [294], [97].
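As a concrete sketch of the one-switch routing, the Goose environment variables above can retarget the same session command across three backends; the model identifiers here are illustrative, not recommendations:

```sh
# Same workflow, three backends: only the routing switch changes.
GOOSE_PROVIDER=anthropic GOOSE_MODEL=claude-sonnet-4-20250514 goose session
GOOSE_PROVIDER=bedrock   GOOSE_MODEL=anthropic.claude-sonnet-4-20250514-v1:0 goose session
GOOSE_PROVIDER=ollama    GOOSE_MODEL=qwen2.5-coder:14b goose session
```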
Copilot CLI exposes the same surface. Setting COPILOT_PROVIDER plus COPILOT_PROVIDER_BASE_URL and a credential redirects all traffic to Azure OpenAI, Anthropic direct, an OpenAI-compatible gateway, or a local model server [351]. The hard requirements practitioners hit immediately: the model must support tool calling and streaming, and a 128k+ token context window is recommended for anything beyond simple completion [352]. Built-in sub-agents (explore, task, code-review) inherit the configured provider automatically — a useful demonstration that provider routing is meant to be set once at the harness level, not threaded through each sub-task.
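A sketch of that redirect using the variables named above; the endpoint is a placeholder, the provider value is illustrative, and the credential variable name is an assumption to verify against your CLI version's documentation:

```sh
# Sketch of the Copilot CLI redirect described above. The provider value and
# the credential variable name are assumptions -- check your CLI version.
export COPILOT_PROVIDER=openai                  # e.g. "azure" for Azure OpenAI
export COPILOT_PROVIDER_BASE_URL=https://llm-gateway.internal.example.com/v1
export COPILOT_PROVIDER_API_KEY=...             # hypothetical credential variable
copilot
```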
Reach for alternative providers when high-volume agentic workloads make per-token markup material, when an existing enterprise contract for Bedrock or Azure OpenAI is sitting unused, or when an entitlement governance requirement says “use this account, not that one” [354]. The entitlement-reuse case is worth highlighting separately: Aider’s GitHub Copilot integration extracts the oauth_token from hosts.json so an existing corporate Copilot subscription becomes the credential for an entirely different harness — BYOK as governance reuse, not as cost optimization [350]. Skip alternative providers when the default model is genuinely the most capable option for the work and quality is non-negotiable.
The dominant failure mode here is silent. The alternative provider returns valid responses, the harness keeps running, and outputs look fine — but instruction-following fidelity has dropped just enough that multi-step tool-use loops degrade, the agent re-plans more, and tasks that used to land in three turns now take seven. There is no error, just a quietly worse workflow. The 2026 harness comparison treats provider flexibility and model resilience as a first-order ranking dimension precisely because of this: a harness that handles graceful degradation, retries, and provider switching is materially more robust than one that just exposes a config field [355]. Before committing a high-stakes workflow to an alternative provider, run it through a known-good task you have ground truth for. If turn count, tool-call accuracy, or final code quality drift, you have your answer.
19.3 Cloud-IAM model access
For organizations on AWS or GCP with active IAM governance — and for Azure shops using Azure OpenAI under Entra ID — BYOK should usually skip the API key entirely and bind model invocations to cloud identities. The operator question shifts from “which key is invoking this model” to “which IAM role is invoking this model.” The credential chain — instance profile, workload identity, service account, named profile, environment variable — resolves the same way it does for any other cloud resource, and model calls show up in CloudTrail or Cloud Audit Logs alongside S3 reads and Lambda invocations [353], [354].
The concrete pattern: an AWS Bedrock-backed agent needs an IAM policy granting bedrock:InvokeModel and, critically, bedrock:InvokeModelWithResponseStream for streaming responses. Goose accepts AWS_PROFILE, AWS_ACCESS_KEY_ID, AWS_BEARER_TOKEN_BEDROCK, and SAGEMAKER_ENDPOINT_NAME for routing through the standard credential chain or to a custom fine-tuned SageMaker deployment [355]. On GCP, Vertex AI authentication uses GOOGLE_APPLICATION_CREDENTIALS plus GOOGLE_CLOUD_PROJECT, with gcloud auth application-default login for developer machines. On Azure, the same idea lands as Azure OpenAI behind Entra ID: the Copilot CLI azure provider type takes a deployment-scoped endpoint and accepts a Microsoft Entra token through the standard Azure credential chain, so model calls inherit the role assignments already used for Key Vault and Storage [352]. The benefit is structural across all three: rotating, granting, and revoking model access reduces to standard IAM operations, audit logs come for free, and least-privilege scoping down to specific model ARNs, deployments, or regions is enforceable through policy rather than wishful thinking [353].
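To make the Bedrock pattern concrete, here is a minimal policy sketch with both actions granted up front; the role name and model ARN pattern are illustrative and should be scoped to your actual deployment:

```sh
# Minimal sketch of the Bedrock policy described above. Note that both
# InvokeModel and InvokeModelWithResponseStream are granted, so streaming
# agent sessions do not break after batch tests pass.
cat > bedrock-agent-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "bedrock:InvokeModel",
      "bedrock:InvokeModelWithResponseStream"
    ],
    "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-*"
  }]
}
EOF
aws iam put-role-policy --role-name agent-runner \
  --policy-name bedrock-invoke --policy-document file://bedrock-agent-policy.json
```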
Three failure modes are worth memorizing. First, the credential chain order: AWS_ACCESS_KEY_ID overrides AWS_PROFILE, so a stale environment variable silently downgrades you to a less-privileged identity, and the permission denial only surfaces at model invocation, not at startup. Second, the streaming gap: a policy that grants InvokeModel but omits InvokeModelWithResponseStream looks fine for batch tests and breaks every interactive agent session. Third, the region or deployment mismatch: a SageMaker endpoint in us-west-2 accessed by an agent configured for us-east-1, or an Azure OpenAI deployment name that does not match the configured base URL, produces generic connection errors that look like network problems. None of these require rocket science to fix — they require knowing they exist before the security review.
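A short pre-flight catches the first and third failure modes before an agent session trips over them; this is a sketch for the Bedrock case using standard AWS CLI calls:

```sh
# Pre-flight checks for the credential-chain and region failure modes above.
env | grep -E '^AWS_(ACCESS_KEY_ID|PROFILE)='  # stale AWS_ACCESS_KEY_ID wins over AWS_PROFILE
aws sts get-caller-identity                    # which identity will actually invoke the model
aws configure get region                       # does this match the endpoint's region?
```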
Skip cloud-IAM model access for personal projects where the IAM setup overhead exceeds the audit benefit, and for purely local-runtime workflows where there is no cloud identity to bind to. Reach for it the moment a security team asks for an audit trail of model API calls, the moment offboarding includes “rotate AI keys” as a manual step, or the moment you realize your existing Bedrock, Vertex AI, or Azure OpenAI footprint already covers what you need [354].
19.4 Target the protocol, not the provider
The OpenAI REST API has become the de facto inference protocol — local runtimes, cloud proxies, and alternative providers all implement it — which means a single baseURL override is the universal portability lever. Continue.dev configured against an Ollama localhost endpoint, OpenCode with a provider.baseURL field in opencode.json, Aider pointed at any compatible server with --model, Goose with GOOSE_PROVIDER and a custom endpoint — the operational pattern is the same in every case [356], [294], [350]. The 2026 harness comparison treats this protocol-portability as a ranking dimension on par with reasoning depth and memory, on the argument that harness-plus-provider design — retry policy, tool routing, prompt scaffolding, and which model you can route to — materially affects coding outcomes in production, not just headline benchmark numbers [355].
Targeting the protocol decouples your tool upgrade cycle from your provider upgrade cycle. A new Ollama model release does not require a harness update; a new provider entering the market does not require waiting for first-party integration. The same opencode.json block routes to a public API today and a self-hosted vLLM cluster tomorrow:
```json
{
  "provider": {
    "name": "openai-compatible",
    "baseURL": "http://localhost:11434/v1",
    "model": "qwen2.5-coder:14b"
  }
}
```

The protocol-mismatch failures are concrete and worth pre-empting. Some compatible servers do not implement streaming deltas correctly, and the harness sits frozen waiting for chunks that never arrive. Some do not support parallel tool calls, breaking agentic loops that issue concurrent reads. Some return responses without the structured-output fields the harness expects, surfacing as malformed-response errors instead of clean failures. And quantized local models often expose smaller context windows than the cloud equivalents the prompt was designed for, causing truncation that is harder to diagnose than a clean cloud-side 400. When you add a new compatible endpoint, run a known agentic task end-to-end before trusting it for real work; verify streaming, verify tool calling, verify a long-context turn.
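The streaming check is the fastest of the three verifications; this sketch reuses the endpoint and model from the config above and simply watches whether deltas arrive incrementally:

```sh
# Streaming spot-check against a new OpenAI-compatible endpoint (sketch).
# Healthy streaming shows "data:" lines arriving incrementally; a long pause
# followed by one final blob means deltas are not implemented correctly.
curl -sN http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-coder:14b", "stream": true,
       "messages": [{"role": "user", "content": "Count to five."}]}' | head -n 5
```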
19.5 Local model runtime integration
Pointing a harness at a local runtime is mechanically the same move as alternative-provider routing — GOOSE_PROVIDER=ollama, OpenCode /connect with a localhost provider, Continue.dev’s local provider block, LM Studio’s OpenAI-compatible HTTP server on localhost:1234. What changes is the operational profile underneath. Inference now runs on your hardware, costs collapse from per-token to per-watt, data stays on the machine, and the network is no longer in the loop [356], [357], [358]. The tradeoffs land in three places: hardware floor, model capability, and tool-calling reliability.
The hardware floor is non-negotiable. 7B parameter models need roughly 10 GB of disk and 16 GB of RAM to run cleanly; 13–16B models work on 16 GB but tighten quickly under concurrent agent workloads; 40B+ models need serious local GPUs or optimized cloud hosts [358], [357]. 7B models respond in under 500 ms on modern hardware, which makes them suitable for tab-completion latency, but the smaller capability envelope shows up immediately in multi-step reasoning [357]. A practical pattern that survives daily use is splitting models by interaction shape: a fast 7B model for completion, a heavier 13–16B (or larger if hardware allows) for chat and agentic work, both behind the same Continue.dev or OpenCode configuration [357].
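A sketch of that split with Ollama, where the model picks are examples rather than recommendations:

```sh
# Two-model split described above: small model for completion latency,
# heavier model for chat and agentic work. Model tags are illustrative.
ollama pull qwen2.5-coder:7b    # fast: tab-completion latency
ollama pull qwen2.5-coder:14b   # heavier: chat and agentic tasks
ollama list                     # confirm both fit before wiring the harness
```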
Model capability differences are real and not all 7B models are interchangeable. A direct comparison on a single coding prompt produced clean separation: codeqwen:7B correctly inferred the requests library and used raise_for_status() for error handling, meeting the baseline; deepseek-coder:6.7B reached for click.getchar() instead of reading from stdin, missing the requirement; codellama:7B added type annotations and rich.console but skipped the __name__ == '__main__' guard [358]. The lesson is not “pick codeqwen” — it is that model selection has to be tested against your actual prompts. Ollama’s abstraction layer makes this cheap: the editor configuration does not change, only the model name does [358].
Tool-calling reliability is the precondition that distinguishes “can autocomplete” from “can run an agent loop.” Many small open models handle completion competently and fall apart on structured tool use, breaking agentic workflows that work fine with frontier APIs. Before committing a local model to agent-style tasks, test it on a multi-step tool-use sequence with the harness you actually use; if the model hallucinates tool names, returns malformed JSON, or stops calling tools partway through, it is a chat model in this configuration, not an agent model. The other recurring failures are predictable: the runtime process is not running when the harness starts, surfacing as a cryptic connection-refused error; the model does not fit in RAM and the runtime swaps until inference is unusably slow; the quantized context window is much smaller than the cloud equivalent and long sessions silently truncate.
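A minimal tool-calling smoke test can run straight against the runtime's OpenAI-compatible endpoint before the harness is even involved; the tool definition here is a dummy invented for illustration, and the only question is whether the model returns a structured tool call:

```sh
# Tool-calling smoke test (sketch; requires jq). The tool is a dummy -- what
# matters is whether the model emits a well-formed tool_calls array.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:14b",
    "messages": [{"role": "user", "content": "List the files in /tmp using the tool."}],
    "tools": [{"type": "function", "function": {
      "name": "list_files",
      "description": "List files in a directory",
      "parameters": {"type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"]}}}]
  }' | jq '.choices[0].message.tool_calls'
# Non-null, well-formed output is the precondition for agent loops;
# null or free text means this model is chat-only in this configuration.
```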
19.6 Self-hosted agent gateways: when to pay the operations bill
A private model endpoint is not a self-hosted agent. Pointing Claude Code or Cursor at Bedrock with BYOK gives you private inference; it does not give you a place to enforce policy across users, persist sessions across machines, or answer “what did the agent do for tenant X last Tuesday?” If you only need provider isolation and key custody, stop at BYOK — that is the right resting point for most teams. If you need a single chokepoint for auth, audit, and tool scope across many developers and CI runners, the next step is an agent gateway sitting between developer tooling and the model. The full governance contract — managed-settings policy, MCP allow-listing, audit retention, bypass-mode prevention, guardrail design — is owned by Chapter 21; this section covers only the deployment-shape decision and the operational bill that comes with it.
The deployment shape is small enough to describe in one paragraph. Developer CLIs, IDE extensions, and CI runners all speak to an internal HTTPS endpoint instead of directly to the model vendor. That endpoint is a thin service — usually a wrapper around an OSS agent runtime such as a forked OpenCode-style provider router, sometimes a commercial proxy like the TrueFoundry MCP/AI Gateway pattern that documents the centralization story explicitly [359], [360]. It owns three things developers used to own individually: provider selection (the gateway resolves a logical name like code-default to Bedrock, Azure, or a self-hosted vLLM cluster), the downstream credential (Bedrock/Azure keys live on the gateway’s IRSA or workload identity, never on laptops), and session/transcript persistence so a developer can resume on a different machine and reviewers can replay trajectories. Client config files like opencode.json collapse from full credentials and tool definitions into thin pointers at the gateway [294].
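As a sketch of that collapse, assuming the hypothetical gateway hostname below, the earlier opencode.json example shrinks to a pointer; code-default is the logical name the gateway resolves to a real backend:

```sh
# Sketch: the client config collapses to a thin pointer at the gateway.
# Hostname is a placeholder; no credential or tool definition lives here.
cat > opencode.json <<'EOF'
{
  "provider": {
    "name": "openai-compatible",
    "baseURL": "https://agent-gateway.internal.example.com/v1",
    "model": "code-default"
  }
}
EOF
```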
What a gateway buys you over plain BYOK, in one sentence each: a single place to enforce tool-use policy that laptops cannot opt out of, centralized credential custody that survives offboarding without hunting through machines, and a unified audit pipeline where every prompt and tool call lands in one queryable stream [360], [353]. The mechanics of how those controls are designed, scoped, reviewed, and proved to auditors belong to the governance chapter; the deployment-layer point is only that none of them are reachable with plain BYOK no matter how carefully you scope the API key.
The operational burden, stated honestly: you are now running a tier-1 developer service. Patch the gateway and client SDKs on the upstream runtime’s cadence, staff an on-call rotation for queue depth and per-team cost overruns, retain session and audit data, and document a “use direct Bedrock with break-glass keys” fallback so engineering does not stall when the gateway goes down. Build the gateway when the triggers are observable rather than numeric: a security or compliance review demands an auditable trail of tool calls and prompts, you are routing more than one model provider in production, regulated data passes through agent sessions, MCP server proliferation is creating per-laptop trust decisions you cannot review centrally, or offboarding can no longer be done by revoking a single shared key. Stay on plain BYOK with a shared MCP config in the repo when none of those triggers are present — most small teams on a single provider with no compliance pressure get the credential and routing benefits without the operational tax.
19.7 Air-gaps, zero-egress, and the validation discipline
The leap from “I configured a local model” to “I can prove no code left this machine” is the entire point of air-gapped operation, and it is where most setups quietly fail. Configuration intent is insufficient. The discipline is network-layer verification, every time [356], [361].
Pre-download is step one and easy to get wrong. Every model weight, every harness asset, every plugin you intend to use has to land on the machine before it crosses into the isolated network — once you are inside, there is no fetching anything. Build a checklist that includes Ollama models, Continue.dev or OpenCode binaries, any MCP servers, and language-server assets. Step two is endpoint pinning with no fallback: configure the harness so its only provider is the localhost endpoint, and verify there is no implicit cloud default. Continue.dev’s local provider config, OpenCode’s opencode.json baseURL override, and GOOSE_PROVIDER=ollama all pin a single endpoint, but some IDE extensions and harnesses fall back to their default cloud API if the configured local endpoint is unreachable. That fallback is the silent leak — requests go out, and the user sees no error [356].
Step three is enforcement at the network layer rather than the configuration layer. An egress firewall rule that blocks AI API hostnames and outbound traffic from the developer machine catches what configuration alone cannot. Step four is verification — running tcpdump or an equivalent network capture during a real agent session and confirming zero packets to external destinations. The 45-minute air-gapped setup pattern with Continue.dev plus Ollama gets this right: hardware floor (16 GB+ RAM), model pre-download, IDE configuration pinning the local endpoint, performance tuning, and the explicit network-traffic validation step that confirms zero egress before declaring the setup done [356].
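A sketch of the verification step; the capture filter excludes common private ranges, which you should adjust to your own network before trusting an empty result:

```sh
# Egress verification sketch: capture during a real agent session, then
# confirm nothing left for non-private destinations.
sudo tcpdump -i any -nn -w agent-session.pcap \
  'not (dst net 127.0.0.0/8 or dst net 10.0.0.0/8 or dst net 192.168.0.0/16)' &
# ...run a complete agent task in another terminal, then stop the capture...
kill %1
sudo tcpdump -nn -r agent-session.pcap | head   # empty output is the proof artifact
```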
Copilot CLI’s COPILOT_OFFLINE=true flag illustrates an important nuance. Setting it disables telemetry and GitHub server contact, which is exactly what an offline workflow needs — but the documentation is explicit that offline mode only provides true network isolation if COPILOT_PROVIDER_BASE_URL points to a local or same-network provider; a remote OpenAI-compatible endpoint still transmits prompts and code over the network [351], [352]. “Offline” as a flag is not the same as “offline” as a verified network state. Air-gap as a property holds only when the egress validation confirms it.
Tabnine positioning itself as the only production-grade air-gapped AI development platform — and Windsurf moving its on-premises offering to maintenance mode while Cursor and Copilot remain SaaS-bound — is the market signal that air-gap is a differentiated procurement requirement, not a niche [361]. The driving constraint is sovereign control: defense, aerospace, healthcare under HIPAA, financial services under data-residency rules, and public sector environments cannot send code, architecture, or system context to external servers, and cloud-dependent platforms operate as black boxes with no visibility into training-data handling or auditability of suggestions. For practitioners in those environments, the choice is not air-gapped versus SaaS — it is air-gapped or nothing.
19.8 Closing the ladder: rung, trigger, proof
The portability ladder ends where your constraints end, and each rung has the same shape: a trigger that justifies the move, a credential or endpoint surface that implements it, and a proof artifact that holds up under review. Use this as the chapter’s decision matrix.
| Rung | Trigger | Credential / endpoint surface | Proof |
|---|---|---|---|
| Bundled credentials | None of the others apply | Vendor default | Vendor invoice and data terms |
| BYOK env var | Vendor proxy or training defaults flagged | `ANTHROPIC_API_KEY`, `OPENAI_API_KEY` | Provider invoice in your account |
| BYOK in secret manager | Multi-developer or CI deployment | Runtime injection, no file storage | No keys in repo, rotation runbook |
| Cloud IAM | Audit trail or least-privilege required | Bedrock/Vertex/Azure OpenAI via credential chain | CloudTrail/Audit Logs entries |
| Alternative provider | Cost markup or entitlement reuse | `GOOSE_PROVIDER`, `--model`, `baseURL` | Baselined task on known-good ground truth |
| Local runtime | Zero per-token cost or no-egress | `GOOSE_PROVIDER=ollama`, localhost `baseURL` | Tool-calling test plus latency check |
| Self-hosted gateway | Org-wide policy, audit, or multi-provider routing | Internal HTTPS endpoint, IRSA-backed | Replayable transcript pipeline |
| Verified air-gap | Sovereignty: ITAR, FedRAMP High, HIPAA, classified | Pinned local endpoint, no fallback | tcpdump capture showing zero egress |
Two adoption signals shape how this lands in practice. Tool selection is increasingly driven by company size and procurement constraints rather than individual preference, with developers switching readily to whichever tool clears the legal and compliance bar [362], [363]. And with weekly AI usage now near-universal among engineers [362], being locked out of AI assistance because of a sovereignty constraint is no longer marginal — it is a structural disadvantage that justifies the operational cost of the higher rungs.
Two rules for staying honest about which rung you are actually on. First, do not confuse intent with proof: a config file that says baseURL=localhost is configuration, not isolation, until a packet capture confirms it [356]. Second, baseline before you commit: every alternative provider, every local model, every gateway hop should run a known-good agentic task end-to-end first, with attention to tool-calling fidelity and turn count, not just final output [355]. For most teams, BYOK with cloud-IAM access on a frontier provider is the right resting point. For some, alternative-provider routing through a private gateway is essential. For a smaller but real cohort, full air-gap with verified zero egress is the only path to AI-assisted development at all. Knowing which rung you are on, and proving it at the network layer rather than the configuration layer, is the difference between a controlled posture and a hopeful one.
19.9 Takeaways
- Inject API keys at runtime through a secret manager or environment variable; never store them in agent config files like `opencode.json`, `~/.config/goose/config.yaml`, or `.aider.conf.yml`.
- When proprietary code would otherwise flow through a personal or vendor-managed account with unacceptable data terms, switch the agent to an organizational BYOK credential so inference runs under your organization’s contract.
- When your organization already governs model access through AWS, GCP, or Azure identity systems, bind agent calls to cloud identities instead of API keys, and grant the provider’s streaming permission before testing interactive sessions so batch-safe policies do not fail at runtime.
- Stay on plain BYOK until you have observable triggers requiring a gateway — centrally enforced tool policy, a unified audit pipeline across teams, or multi-provider routing in production — because a gateway creates a tier-1 service obligation with on-call, patching, and break-glass fallback requirements.
- Before using a local model in an agentic workflow, run a multi-step tool-use sequence on your actual harness; if the model hallucinates tool names, returns malformed JSON, or stops calling tools mid-loop, treat it as a chat model only — not an agent model — in that configuration.
- Before trusting a new provider or OpenAI-compatible endpoint, run a known agentic task end-to-end and verify streaming, tool calling, and one long-context turn against a known-good baseline.
- For air-gapped setups, configure the harness so its only provider is the local endpoint and verify there is no cloud fallback, because some harnesses silently send requests to their default cloud API when the local endpoint is unreachable.
- After completing an air-gapped setup, run `tcpdump` or an equivalent network capture during a real agent session and confirm zero packets reach external destinations — configuration intent is not network isolation until a packet capture proves it.