21  Teams, Governance, and Enterprise Constraints

Some teams will legitimately ship with fewer humans and more agents — but only when ownership boundaries, compliance controls, and rollout discipline are already strong enough to support that compression.

21.1 Team Adoption and Code Ownership

Implementation capacity is rising faster than review and coordination capacity, and that asymmetry is reshaping team structure. Three- and four-person teams running on agent leverage now ship work that previously required ten to twenty engineers, but only when the bottleneck had been implementation labor rather than architectural judgment or review throughput [383]. The structural reason smaller teams work is that agents reduce the coordination tax that made large teams necessary. When implementation is automated, two or three people making high-bandwidth architectural decisions move faster than fifteen synchronizing through standups and sprint ceremonies — provided review load stays within their capacity [383]. AI coding tools have reached mainstream adoption across Big Tech and startups, but the actual workflow impact is still being calibrated, which means rollout decisions should favor experimentation grounded in measurement over either uncritical adoption or dismissal [384].

The emerging staffing model is the centaur pod [385]. This is the team-level version of the conductor-to-orchestrator shift introduced in Chapter 1 and operationalized in Chapter 15: humans stop being the only implementers and become the owners of task boundaries, review load, and integration judgment. One concrete pattern that works for teams building features on codebases with strong test coverage:

  • Senior Orchestrator (1 person): Owns architecture decisions, task decomposition, and specification quality. Reviews agent output for architectural coherence — does this change fit the system’s direction, or quietly introduce a second pattern for the same problem?
  • Verification Specialist (1 person): Owns defect capture, acceptance criteria, and test coverage adequacy. Reviews agent output for correctness and catches plausible-but-wrong logic. AI-generated code carries roughly 1.7x more critical issues than human-written code on average, which makes this role load-bearing rather than ceremonial [385].
  • Domain Engineer (1 person): Owns subsystem context, codebase maintenance, and the institutional knowledge agents lack — business rules, edge cases, regulatory constraints.
  • Agents: Own implementation — code generation, refactoring, test writing, documentation. Multiple agents run in parallel, but always within the review capacity of the three humans above.

Puzzmo demonstrated the upper bound of this pattern: a single engineer cleared years of accumulated tech debt — converting hundreds of React Native components, replacing three framework systems, and building complex REPLs — enabled by a well-structured monorepo with co-located schema and API definitions, explicit and well-trained-on tech stacks, and well-framed GitHub issues as task input [386]. The model breaks down in specific conditions. Greenfield architecture work demands more than one person making structural decisions. Compliance-heavy domains need deeper human review on every change. Legacy codebases without test coverage make verification unscalable because every agent change requires manual validation. In those cases, start with a larger team and compress as infrastructure improves. The practical test is simple: do you already have strong tests, clear module boundaries, enough reviewer bandwidth, and at least one human with deep domain context for every critical subsystem? If not, do not compress yet.

Code ownership has to follow the new staffing pattern, not the old one. Agents produce code that belongs to no one — nobody wrestled through the design decisions, so nobody feels responsible for maintaining the result. Three failure modes follow: cognitive debt (shared understanding erodes), intent debt (nobody remembers why code was designed a certain way), and cognitive surrender (uncritical acceptance of agent output) [387]. The organizational antidote is named module ownership: this person reviews and maintains this subsystem regardless of who or what wrote the code. Treat AI-generated code as a draft requiring human review and explicit authorship, not as a finished product that arrived from nowhere [388]. Pull requests heavy with AI-generated code take roughly 26% longer to review, and syntactically polished output lowers reviewer skepticism, increasing the chance subtle bugs slip through [388]. Owners need that review time, and they need permission to push back when output looks plausible but is wrong.

Make ownership enforceable rather than aspirational. Three concrete surfaces turn the principle into mechanism: a CODEOWNERS file that routes every PR touching a subsystem to its named maintainer for required review; a PR template field that names the human author of record — the person accountable for the change after merge regardless of how much of the diff an agent produced; and a branch-protection rule that blocks merge on AI-heavy PRs without an owner approval, not just a thumbs-up from another contributor. The pattern survives turnover because the routing lives in the repo and is reviewed in code review like anything else.
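The three artifacts compose into a small amount of repo configuration. A sketch of the CODEOWNERS piece follows; the paths and team handles are placeholders. Pairing it with branch protection's "Require review from Code Owners" setting makes the named owner's approval a merge blocker rather than a convention:

```
# CODEOWNERS — every PR touching a subsystem routes to its named maintainer.
# (Paths and handles below are illustrative.)
/services/billing/    @acme/billing-owners
/infra/migrations/    @acme/dba-leads
/src/auth/            @jsmith
```

The author-of-record field lives in the PR template as a required line, and the branch-protection rule completes the loop: without the code owner's review, the merge button stays disabled no matter how many other approvals the PR collects.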

The deepest organizational risk in an agent-heavy team is the talent hollow: routine implementation work disappears before organizations replace it with a better path to judgment. Between 2018 and 2024, entry-level software job postings dropped from 43% to 28% of total postings while overall demand stayed comparatively steady, and 55% of employers now report regret over AI-driven layoffs [389]. Manual QA postings fell 43% since 2023, even as QA headcount grew in AI-intensive environments — the role is concentrating on strategic oversight rather than vanishing [385]. The practical response is to redesign junior and mid-level roles around guided verification and subsystem ownership: rotate early-career engineers through verifier and domain-owner responsibilities, require them to explain agent-generated changes and trace blast radius, and run hiring loops that ask candidates to detect, explain, and repair defects in a working but flawed codebase rather than write everything from scratch. The developer role is shifting from creator to curator, and the differentiator is no longer raw generation speed but quality velocity — the ability to evaluate, orchestrate, and refine agent output reliably [390].

Convention files are the lowest-overhead governance artifact a team has. Externalize style rules, library preferences, type conventions, and prohibited patterns into a committed file the agent loads on every session — CLAUDE.md, AGENTS.md, GEMINI.md, .cursorrules, or whatever the tool reads. The point is not personal prompt craft. The file lives in the repo, every PR touching it is reviewable, and violations surface through normal code review, which makes ambient agent behavior auditable. When teams extract tacit review knowledge from senior engineers, they often discover that different leads have been enforcing different unwritten standards all along; convention files force those standards into the open. Tabnine’s enterprise pattern of ingesting internal repositories and design docs to align generated code with organizational standards makes the same point at the platform tier — standardized, governance-enforced output reduces review bottlenecks because consistency itself is a quality control [391]. Treat the convention file like a test suite: owned, maintained, and updated when failures expose a gap.
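What such a file looks like in practice, with illustrative rules — the specific conventions below are placeholders, not recommendations; the point is that each line is reviewable and enforceable through normal code review:

```markdown
# CLAUDE.md — team conventions (owned by the platform lead, updated via PR)

## Style
- TypeScript strict mode everywhere; no `any` in exported signatures.

## Libraries
- HTTP calls go through the internal client wrapper, never raw `fetch`.

## Prohibited patterns
- Do not introduce a second state-management pattern; use the existing store.
- Never edit files under `src/generated/` — regenerate from schema instead.
```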

21.2 Governance, Compliance, and Auditability

When more than one or two developers run agents with shared or org-provisioned credentials, governance stops being optional and becomes prerequisite infrastructure. Coding agents run with broad system permissions and generate unbounded API costs that conventional budget controls cannot see [359]. The proactive posture — designed before incidents occur — covers four surfaces: identity and access, tool scope, data boundary, and audit trail [359]. The gap between vendor-tier controls (admin console seat caps, SSO/SAML wiring) and runtime-tier controls (network egress allow-lists, container scopes, MCP registry) is where most real failures happen, so plan both layers explicitly rather than assuming one covers the other. Each control surface that follows answers exactly one operator question — who is accountable, what can connect, what can egress, what is the blast radius, what can the session actually touch, what evidence survives audit? — and none of them substitutes for another.

Identity and access — who is accountable. Wire agent access to the organization’s identity provider so accountability binds to a person, not a shared key. GitHub Copilot Enterprise and the Anthropic Console support SSO/SAML and per-seat license-key gating [359]; Tabnine layers organizational policy over its enterprise distribution to enforce consistent code quality and compliance [391]. The decision behind those controls matters more than the product list. Bind model invocations to cloud IAM identities — AWS IAM roles for Bedrock, Google service accounts for Vertex AI, per-team or per-project roles for cost attribution — so that offboarding reduces to standard IAM revocation rather than a manual hunt for shared API keys, and so that finance can attribute spend without a separate billing system. Bedrock logs InvokeModel, Converse, and InvokeModelWithResponseStream to CloudTrail as management events by default, giving you a model-invocation audit trail with no additional tooling [392]. Note the asymmetric coverage: GitHub’s Copilot audit log captures plan changes, seat assignments, and agent activity but explicitly excludes local client session data, retains events for 180 days only, and pushes anything longer-lived to SIEM streaming [11]. Enterprise tiers of vendor tools also come with materially different data-handling terms — a personal Gemini Code Assist account, a Workspace Standard account, and a Gemini Code Assist Enterprise account carry contractually different commitments, and only the highest tier belongs in regulated work. Chapter 19 covers the BYOK and IAM mechanics; the policy this chapter insists on is using IAM for attribution, audit, and revocation, not for credential convenience.
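As a sketch of what "bind invocations to IAM" means in policy terms — the region, account layout, and model ARN pattern below are assumptions, and a real policy would be tightened to the team's approved model list:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AgentModelInvokeOnly",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-*"
    }
  ]
}
```

Attach a policy like this to a per-team role rather than a shared user: every InvokeModel event in CloudTrail then carries that role's identity for cost attribution, and offboarding reduces to a role-membership change.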

Tool scope — what can connect. MCP servers execute arbitrary code with the agent’s credentials, can read environment variables, and can make outbound connections without additional prompts after initial setup [393]. Per-user opt-in is structurally insufficient in regulated environments. The current gold-standard pattern is Amazon Q Developer’s registry-over-HTTPS model: an admin publishes a JSON registry that lists approved servers (with name regex, transport types — stdio, streamable-http, sse — and package runners for npm, pypi, or oci); every Q Developer client fetches that registry on startup and refreshes on a 24-hour cycle; servers absent from the registry are terminated automatically; an org-wide disable toggle kills MCP entirely; and account-level profiles supersede organization-level profiles, so team-by-team tightening is possible without loosening the org floor [92]. The negative exemplar is Devin’s per-session, per-user MCP configuration with three transport methods and a marketplace, but no org-level allow-list governing which servers a session may connect to [393] — the developer-freedom posture works for a small co-located team and fails the moment a compliance auditor asks “what external services can your agent reach?” The TrueFoundry MCP Gateway pattern generalizes the lesson at the platform tier: centralized credential management, RBAC, runtime guardrails on tool invocation, and audit logging are first-class concerns once MCP adoption scales beyond a single team [360]. Two honest caveats: enforcement is client-side, so the registry is a policy floor rather than a network control, and the periodic sync cadence creates a revocation window in which a server pulled from the registry can keep running for up to 24 hours after a security incident; user-customizable env vars or HTTP headers on registry-defined servers can also pass credentials in ways the registry author did not anticipate [92].
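What a registry entry encodes can be sketched as follows. This is an illustrative shape, not Amazon Q's published schema — the field names here are hypothetical, chosen only to show the three controls the text describes (name regex, transport allow-list, package runner):

```json
{
  "mcpEnabled": true,
  "servers": [
    {
      "namePattern": "^corp-postgres-readonly$",
      "transports": ["stdio"],
      "runner": { "type": "npm", "package": "@corp/mcp-postgres-readonly" }
    }
  ]
}
```

Anything a client finds running that does not match an entry is terminated on the next refresh cycle, which is exactly why the 24-hour sync cadence is a revocation window worth documenting in the incident-response runbook.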

What an approval decision looks like in practice: a team rolling out a database integration approves mcp__postgres__query in the registry with the connection string scoped to a read-only Postgres role and the credentials stored in the gateway, not in developer environments. Write-capable tools like mcp__postgres__execute are deliberately omitted; if a workflow needs writes, the team files a separate approval that names the specific write tool, the bounded environment (staging only), and a hook-enforced confirmation step before each invocation. The agent platform exposes this through an mcpServers configuration block with the namespace convention mcp__<server>__<tool>, and OAuth-based MCPs should use service accounts rather than personal accounts because access is shared across the organization [393]. The connectors and runtime chapters cover the developer-side mechanics; what governance owns is the read-versus-write partition.
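Declared in a Claude Code-style configuration, that approved surface might look roughly like this. The package name, connection string, and role are placeholders, and in practice the server definition and the permission rules live in separate files (the project's .mcp.json and its settings file); they are merged here for readability:

```json
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-postgres",
        "postgresql://readonly_role@db.internal:5432/app"
      ]
    }
  },
  "permissions": {
    "allow": ["mcp__postgres__query"],
    "deny": ["mcp__postgres__execute"]
  }
}
```

Per the approval above, a production setup would keep the real credentials in the gateway rather than inline; the connection string appears here only to make the sketch self-contained.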

Data boundary — what can egress. The other surface attackers and accidents touch is context — what files, repositories, and tools the agent is allowed to see. Default-include is dangerous: any file visible on disk can become model context, flowing to a third-party LLM on every request. Treat boundary control as a compliance mechanism, not a developer preference. Sourcegraph Cody’s cody.contextFilters is the strongest exemplar: site admins author RE2 include and exclude rules in site config; the configuration is gated behind a double-gate of an Enterprise license check (Sourcegraph Enterprise ≥5.4.0) and an explicit cody-context-filters-enabled feature flag; rules are version-controlled and auditable; and the system honors .gitignore, files.exclude, and search.exclude so the boundary policy reuses artifacts the team already maintains [26]. Anticipate the side effect: when filters are active, the Prompts feature is disabled, which creates pressure to loosen rules to recover lost functionality — resist that pressure or reroute the workflow [26]. For regulated regimes that prohibit any external transmission of code — ITAR, EAR, HIPAA, FedRAMP High, classified networks — boundary control alone is not enough; the deployment must be air-gapped with verified zero egress, and Tabnine’s positioning as a dedicated air-gapped platform shows that procurement now treats this as a category, not a configuration. Chapter 19 covers the operational walkthrough; this chapter only requires that you map each compliance regime to a deployment mode and validate at the network layer that configuration intent matches actual egress.
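In site configuration, the filter shape follows Sourcegraph's documented schema; the repository patterns below are illustrative:

```json
{
  "cody.contextFilters": {
    "include": [
      { "repoNamePattern": "^github\\.example\\.com/corp/.*" }
    ],
    "exclude": [
      { "repoNamePattern": ".*-payments$" },
      { "repoNamePattern": ".*/vendor-contracts$" }
    ]
  }
}
```

Because the rules are RE2 patterns in version-controlled site config, a boundary change is a reviewable diff with an author and a timestamp — the same audit posture the rest of this section demands.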

Layered permissions — what blast radius is allowed. Chapter 16 establishes the permission architecture; the governance angle here is that organizations have to manage a multi-scope settings stack with an exact precedence the operator can rely on. For Claude Code, the durable on-disk scopes evaluate in this order, from highest to lowest: Managed > Local > Project > User, with a session-level command-line override sitting between Managed and Local at invocation time but not surviving the session [19]. The non-obvious move there is that Local — a single developer’s per-repo override — outranks shared Project policy. That is intentional for personal experimentation, and it is exactly why Project is not the enforcement layer for organizational policy. Managed is. Managed-scope settings cannot be overridden by user, project, local, or command-line settings, which makes Managed the only durable layer where the security team’s denylist actually holds: no Bash(git push:*) in CI containers, no writes to credential files, no destructive shell commands without a hook [19]. Project settings sit one tier below Local and standardize the team baseline; they are checked into git and reviewed in pull requests like code [19]. Gemini CLI’s enterprise model is structurally similar: settings merge from system defaults to user to project to system overrides, with system overrides taking final precedence on single-value settings, and a determined user with local administrative privileges can still circumvent the controls — they prevent accidental misuse and enforce policy in managed environments; they do not stop a malicious local actor [21]. Allow-listing tools through tools.core is more secure than blocklisting through tools.exclude, but tools.exclude deployed at the enterprise layer is exactly the same denylist pattern a solo developer uses as a personal safeguard, lifted into an org-mandated floor [21], [22].
The composition rule that matters most: allowedTools does not constrain bypassPermissions, so use disallowedTools at the Managed layer for hard blocks that must hold even when an operator runs in bypass mode. For multi-tenant setups, set CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 to prevent automatic memory loading from leaking context across tenants. The decision rule: anything you cannot afford for any developer to override goes in Managed; anything that defines the team baseline goes in Project; Local is a personal scratchpad, not a policy layer.
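A managed-settings sketch of that deny floor, deployed at the OS-level managed path (for example /etc/claude-code/managed-settings.json on Linux). The rule syntax follows Claude Code's permission-rule format, and the entries are examples, not a complete policy:

```json
{
  "permissions": {
    "deny": [
      "Bash(git push:*)",
      "Bash(rm -rf:*)",
      "Read(./.env)",
      "Write(./.env)"
    ]
  }
}
```

Because this file sits in the Managed scope, no user, project, local, or command-line setting can loosen it — which is precisely what makes it the right home for the security team's non-negotiables and the wrong home for anything a team might legitimately need to tune.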

Runtime containment — what the session can actually touch. Settings precedence and registry policy are configuration-time controls; they fail open if a misconfigured agent or a hostile MCP server tries to read what configuration intended to forbid. Runtime containment closes that gap by reducing the privileges the agent process actually holds when it executes. The pattern for production-adjacent work: run the agent inside an ephemeral container (a fresh dev container, an isolated VM, or a CI runner that exits after the task) rather than on the developer’s primary workstation. Mount only the repository roots the task requires — read-only for source the agent should not modify, read-write only for the working tree it owns. Issue task-scoped cloud credentials at session start instead of mounting the developer’s full credential cache: an STS-derived role good for an hour with a bedrock:InvokeModel policy and nothing else, or a Vertex service account scoped to a single project, fed in as environment variables the container does not persist. Restrict outbound network access to an approved egress list — model API endpoints, the MCP registry host, the package mirrors the build needs — through a sidecar proxy or container network policy, so a compromised MCP server cannot dial an arbitrary domain even if it tries. The composition is the point: managed settings say what the agent is configured to do, and the container, scoped role, and egress allow-list say what it is able to do regardless. Use the lighter sandbox for routine work; reach for the full ephemeral-container model when the task touches production credentials, customer data, or write access to systems whose blast radius you cannot afford to discover empirically.
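A bootstrap sketch under those assumptions. The role ARN, image, mount paths, and network name are placeholders, and the agent-egress-only Docker network is presumed to already route through an approved proxy:

```shell
# 1. Task-scoped credentials: one hour, invoke-only role,
#    nothing from the developer's credential cache.
eval "$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/agent-invoke-only \
  --role-session-name "agent-task-${TASK_ID}" \
  --duration-seconds 3600 \
  --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
  --output text | awk '{printf "export AWS_ACCESS_KEY_ID=%s AWS_SECRET_ACCESS_KEY=%s AWS_SESSION_TOKEN=%s\n", $1, $2, $3}')"

# 2. Ephemeral container: read-write only on the owned working tree,
#    read-only on shared source, egress limited to the approved network.
docker run --rm \
  --network agent-egress-only \
  -v "$PWD/services/checkout:/work:rw" \
  -v "$PWD/libs:/deps:ro" \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN \
  agent-runner:latest
```

When the container exits, the working copy, the session token, and any state the agent accumulated all disappear together — which is the property that makes the model safe to repeat.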

Audit trail and provenance — what evidence survives. Logs are the proof your governance worked. Ship the agent client’s audit logs into the organization’s SIEM — necessary because vendor-side retention is short and incomplete on its own [11] — route Bedrock CloudTrail InvokeModel events to the same place [392], and treat the resulting model-invocation timeline as a first-class compliance artifact. Bedrock agent invocations (InvokeAgent, InvokeInlineAgent) and knowledge-base operations require advanced event selectors to be logged as data events; budget for that explicitly because data events incur additional CloudTrail charges [392]. Provenance has its own enforcement point: the moment that matters most for license risk is the CI/CD gate, not the suggestion-time review — by the time AI-generated code reaches main, the cost of detecting a GPL or other non-permissive contamination is much higher than the cost of blocking it before merge [394]. Embed provenance and license-attribution checks in the build pipeline so policy enforcement does not depend on developer or reviewer awareness. For regulated buyers, demonstrating provenance and attribution checks in CI/CD has become a procurement differentiator [394]; a license-contamination claim is painful for a startup and existential for an enterprise.
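Enabling those data events takes an advanced event selector on the trail. The resource type names below follow CloudTrail's Bedrock documentation, but verify them against the current list before deploying:

```json
{
  "AdvancedEventSelectors": [
    {
      "Name": "Bedrock agent invocations",
      "FieldSelectors": [
        { "Field": "eventCategory", "Equals": ["Data"] },
        { "Field": "resources.type", "Equals": ["AWS::Bedrock::AgentAlias"] }
      ]
    },
    {
      "Name": "Knowledge-base operations",
      "FieldSelectors": [
        { "Field": "eventCategory", "Equals": ["Data"] },
        { "Field": "resources.type", "Equals": ["AWS::Bedrock::KnowledgeBase"] }
      ]
    }
  ]
}
```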

A concrete operating example keeps this from becoming doctrine. Suppose an agent introduces a risky schema migration because the shared instructions did not require rollback planning. The platform owner does not rewrite every team’s prompts by hand. The incident owner records the failure mode, the team lead updates the shared migration checklist in the convention file, and the ratifying group approves the new standing rule — and a corresponding disallowedTools entry blocking unscoped destructive migrations — in the Managed settings layer before the next release train. Tool approval is centralized, standards are ratified by the people who enforce them, the registry and Managed deny rules carry the policy, and incident learning becomes a durable artifact rather than a Slack memory.

21.3 Rollout Discipline and Operating Models

When agents handle code generation, verification becomes the expensive constraint. The organizational unit of work shifts from “what did we ship” to “what did we validate” [387]. That shift directly changes how managers must staff, plan, and set limits — and it means a controlled rollout is distinguished from an uncontrolled one by a sequence of explicit gates rather than a single go/no-go decision.

The ladder below operationalizes the chapter’s three governing constraints in sequence: each stage opens only when review capacity is protected, ownership is preserved, and concrete governance — registry, audit, managed deny floor — is in place rather than promised. A stage that produces an unreviewed merge, a registry bypass, or an audit gap drops back one rung until the gap closes.

  • Stage 1 — Constrained pilot. Pick one or two teams with strong test coverage, clear module boundaries, and an existing CI/CD pipeline. Lock tool scope to read-only MCP servers and a narrow allowedTools list. Deploy Managed settings with a deny floor before any developer logs in. Measure baseline review throughput, defect-origin rate, and time-to-merge.
  • Stage 2 — Audit-readiness gate. Before widening, confirm three things: model-invocation audit trails actually reach the SIEM, context-exclusion rules are enforced at the site-admin tier, and the MCP registry is published over HTTPS with a documented refresh cadence [92], [392]. If any of those are stubs, do not widen.
  • Stage 3 — Bounded expansion. Add teams whose codebases meet the pilot’s structural prerequisites. Allow write-capable MCP servers only behind named approval paths. Introduce CI/CD provenance and license-attribution checks before any AI-heavy team merges to main [394]. Watch review-queue depth as a leading indicator: when it grows faster than completed reviews, freeze headcount expansion.
  • Stage 4 — Steady-state autonomy. Loosen permission modes only where review capacity, hook coverage, and incident-response loops have proven they can absorb the additional blast radius. Most teams should never reach blanket bypass; they should reach narrowly scoped autonomy on specific task classes (test generation, doc updates, small refactors) while keeping production-adjacent work gated.
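The Stage 3 leading indicator can be made mechanical. A minimal sketch, with the cycle metrics and the three-cycle window as assumptions rather than prescribed values:

```python
from dataclasses import dataclass

@dataclass
class CycleStats:
    """Per-cycle review metrics for the expansion gate (illustrative model)."""
    opened: int       # agent-generated PRs opened this cycle
    completed: int    # reviews completed this cycle
    queue_depth: int  # unreviewed PRs remaining at cycle end

def freeze_expansion(history: list[CycleStats], window: int = 3) -> bool:
    """Signal a rollout freeze when the review queue grows monotonically
    while completed reviews lag opened PRs over the trailing window."""
    recent = history[-window:]
    if len(recent) < window:
        return False  # not enough cycles to read the trend yet
    queue_growing = all(b.queue_depth > a.queue_depth
                        for a, b in zip(recent, recent[1:]))
    reviews_lagging = (sum(c.completed for c in recent)
                       < sum(c.opened for c in recent))
    return queue_growing and reviews_lagging
```

The thresholds are deliberately crude; the point is that the freeze decision is computed from observable metrics each cycle instead of argued about after the queue has already become unreviewable.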

Staffing ratios anchor on review load, not generation capacity. When review is the bottleneck, adding another developer with agent access does not increase throughput — it increases the review queue. Sprint planning should be driven by how many agent-generated PRs the team can meaningfully review per cycle, not by how many an agent can produce. The Tech Lead role grows rather than shrinks because validation of AI output and architectural coherence become more valuable when implementation is automated; the Engineering Manager role concentrates on irreducibly human work — psychological safety, conflict resolution, noticing struggle — because the process administration that AI now automates was never the hard part [385]. The sentiment gap between executives and practitioners is the warning signal worth tracking: roughly 55% of executives report feeling prepared for AI adoption versus 50% of developers and testers, and the qualitative read on that gap is more important than the point estimate — practitioners see implementation risks that leaders gloss over [390], and unbounded parallelism is one of those risks.

WIP limits need enforcement. Limit concurrent agent tasks per engineer to what they can review within the same day. AI amplifies existing organizational practices for good and ill — teams with strong CI/CD and review culture see agents as a force multiplier; teams with broken pipelines see agents accelerate technical debt non-linearly. Stably’s experience is the positive case: testing was the bottleneck for autonomous coding agents, and removing infrastructure decision-making from the critical path let a six-person team run revenue-generating autonomous testing agents and convert enterprise pilots to contracts on a single platform [395]. The lesson generalizes: when validation is integrated and infrastructure is abstracted, the constraint shifts from execution to planning, and team culture shifts from asking permission to shipping same-day [395]. Without that integration, raising the WIP ceiling just builds an unreviewable queue.

Decision rights and the feedback flywheel. The differentiator between teams that plateau with AI tools and teams that compound is whether learnings from individual sessions get routed into shared artifacts [396]. Each session generates four types of signal: context gaps that point to priming documents, instruction gaps that point to commands, workflow patterns that point to playbooks, and failures with root causes that point to guardrails [396]. Without an active practice of feeding those signals back, static infrastructure goes passive and stops evolving. Make the routing explicit: a named platform owner approves tools centrally, team leads ratify the standards that shape day-to-day behavior (rules files, PR templates, review checklists, escalation paths), and after any meaningful agent-caused incident a designated owner decides whether the fix belongs in a convention file, a specification template, a test harness, a Managed-settings deny rule, an MCP registry change, a context-filter update, or a hook. Plugin systems make this concrete: Claude Code plugins bundle slash commands, subagents, MCP servers, and hooks together into a toggleable unit, which lets engineering leaders standardize code review and testing workflows across teams without re-deploying the underlying tool, and toggling unused plugins off keeps the system prompt context lean for tasks that do not need them [397].

Build agent integrations on the verification-first pattern. When a team adds an MCP-backed integration — database access, ticket-system queries, observability platforms — design it so that human reviewers can audit in seconds what previously took minutes. Sourcegraph’s DataBot illustrates the move: tool composition that mirrors human workflow (discovery → schema lookup → execution), thread context retention across multi-turn conversations, and verification-first design that exposes the underlying SQL with every answer so reviewers can spot a bad join, tweak the prompt or context, and watch the next answer improve [398]. The governance point is the same as in the previous section: which servers a team is allowed to add routes back to the registry, the Managed settings deny list, and the context filter, not to per-developer choice. The shipping point is that an integration that does not expose its underlying actions to review will become opaque under load.

Onboarding and ramp-up are the unambiguous wins. Time-to-productivity on unfamiliar codebases drops sharply when new engineers can feed an agent the codebase with a convention file for context — a six-week experience report from a team that ran Claude Code intensively shows years of accumulated tech debt cleared without increasing total working hours, with the gains concentrated where the monorepo is well-structured and the stack is one the model has been trained on [386]. The corollary is sobering: agents are most valuable not for replacing team members but for reducing the ramp-up cost of adding them, especially when engineers switch projects or domains. The budget math behind sustained agent use is also real and should be planned as a per-developer line item, not as ambient cost — coding agents run materially more productively than chat-based coding for engineers who structure tasks well, and the daily LLM spend that unlocks that productivity is a real number that finance teams will eventually ask about [399].
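The line-item arithmetic is trivial but worth writing down so finance sees it coming; the per-day figure below is an illustrative assumption, not a quoted price:

```python
def monthly_agent_budget(devs: int, daily_spend_usd: float,
                         working_days: int = 21) -> float:
    """Per-team monthly LLM line item for sustained agent use.

    daily_spend_usd is an assumed per-developer figure; measure your
    own team's actual usage rather than trusting any published number.
    """
    return devs * daily_spend_usd * working_days
```

For example, a three-person centaur pod at an assumed $40/day runs about $2,520/month — small against a loaded engineering salary, but large enough that it belongs in the plan rather than on a surprise invoice.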

The governing model of this chapter is simple: smaller, agent-leveraged teams work only when three conditions hold simultaneously. First, review capacity is protected — staffing ratios, WIP limits, and sprint planning all anchor on what can be verified, not what can be generated. Second, the talent pipeline is preserved — junior roles are redefined rather than eliminated, and the apprenticeship ladder is rebuilt around guided verification, debugging, and subsystem ownership. Third, governance is concrete — every subsystem has a named human owner, every standard is encoded in a committed convention file, every external tool is in the registry, every context boundary is admin-managed, every model invocation produces an IAM-anchored audit trail, every CI/CD pipeline checks provenance, every prod-adjacent session runs in a contained runtime with scoped credentials, and every agent failure feeds a closed loop that improves the next cycle. Get all three right and a small centaur pod can outperform a much larger implementation-heavy team. Miss any one and the speed of agent-generated code becomes the speed at which you accumulate debt you cannot repay. The decision that matters is not a single yes-or-no on adoption: it is which workflows to bring under agent leverage, under what controls, and at what pace — with some domains and task classes appropriately deferred until the human and infrastructure structures that make agent-generated output trustworthy at scale are actually in place.

21.4 Takeaways

  • Enforce code ownership by wiring three repo artifacts together: a CODEOWNERS file that routes every PR touching a subsystem to its named maintainer for required review, a PR template field that names a human author of record accountable for the change after merge, and a branch-protection rule that blocks merge on AI-heavy PRs without that owner’s approval — not just any contributor thumbs-up.
  • Redesign junior and mid-level roles around guided verification and subsystem ownership rather than eliminating them: rotate early-career engineers through verifier and domain-owner responsibilities, require them to explain agent-generated changes and trace blast radius, and reshape hiring to ask candidates to detect, explain, and repair defects in a working codebase rather than write from scratch.
  • Bind model invocations to cloud IAM identities — AWS IAM roles for Bedrock, Google service accounts for Vertex AI — so that enterprise governance runs on controls you already operate: offboarding becomes standard IAM revocation rather than a manual hunt for shared API keys, finance attributes spend without a parallel billing system, and Bedrock’s default CloudTrail logging of InvokeModel events gives your compliance team a model-invocation audit trail with no additional tooling.
  • Govern MCP server access through an admin-published registry rather than per-user opt-in: publish a JSON registry over HTTPS listing only approved servers with named transport types and package runners, configure clients to terminate servers absent from the registry, and add an org-wide disable toggle — treating the registry as the policy floor, not developer preference.
  • Put non-negotiable tool and command denylists in the Managed settings scope, not Project or Local: Managed cannot be overridden, so it is the only layer where a hard block actually holds.
  • Treat data-boundary controls as admin-managed compliance policy: use site-level include and exclude filters, keep the rules versioned and auditable, and validate zero-egress requirements at the network layer when external code transmission is prohibited.
  • For production-adjacent agent work, run the session in an ephemeral container mounted read-write only to the working tree the task owns, issue task-scoped cloud credentials at session start (an STS-derived role limited to the required policy for an hour, not the developer’s full credential cache), and restrict outbound network to an approved egress list enforced by a sidecar proxy or container network policy — so that a misconfigured or compromised MCP server cannot reach an arbitrary external domain even if it tries.
  • Gate rollout through explicit stages: start with constrained pilots, require audit-readiness before widening, watch review-queue depth during bounded expansion, and freeze expansion when completed reviews stop keeping pace.