The Machine

Making Agents Reliable

Once multiple agents run together in production, compounding errors and coordination overhead surface failure modes a single-agent harness cannot prevent. Reliability comes from disciplined composition (cheap-base plus expensive-QA model routing, two-retries-then-human, deterministic scripts inside agent loops), deliberate exploration budgets against consensus mediocrity, and hard caps that bound blast radius when the harness alone is not enough. Topology choice is the primary cost and reliability lever today — with today's frontier model, the most expensive topologies fail most often. The next model generation may collapse some of today's chains into single calls, and the reliability patterns here are written with one eye on today's constraint and one eye on what is coming.

The previous chapter ended with a single agent wrapped in a disciplined harness — the canonical pipeline, three gates, an incident memory, and a five-stage maturity ladder that ranks any harness from ad-hoc scripts through self-healing. That discipline is necessary but not sufficient once agents compose with each other, run in fleets, and process enough volume that aggregate failure modes surface. A ninety-percent-reliable single agent produces acceptable output nine times in ten on a solo task; five of those agents composed sequentially produce a roughly-sixty-percent-reliable chain (0.9^5 ≈ 0.59), and real production stacks chain more than five steps once tool calls, handoffs, and intermediate parses get counted.

This chapter covers the patterns the harness needs once multiple agents compose at production scale: topology choices that decide cost and reliability, composition rules beyond single-agent principles, fleet-level reliability patterns for exploration and supervision, hard caps and constrained rollback to bound blast radius, and the durability stack that carries long-horizon agent work across sessions. The honest framing on models: better models unambiguously improve reliability, and every frontier release has moved problems from multi-agent coordination in one generation to single-agent capability in the next. The patterns here matter because the firm ships today against today's frontier, which still benefits from disciplined composition. The reliability work is written with one eye on today's constraint and one eye on what is coming.

Four topologies govern multi-agent cost and reliability

Four building blocks sit at the core of any multi-agent system. Each one carries a distinct cost signature and a distinct failure surface.

  • Sequential (pipeline). One agent after another. Predictable cost, clear debugging, clean failure localization at a step boundary. Runs at roughly one-to-two-times baseline cost versus a single-agent call. The right default when the task decomposes into ordered steps and each step has a known output shape.
  • Concurrent (fan-out/fan-in). Parallel sub-agents working on pieces of the task, results aggregated by an orchestrator. Cost scales linearly with agent count. Risk: contradictions between parallel findings that the orchestrator has to resolve before the final output lands.
  • Handoff (role-based). Specialist agents pass work across stages — research → draft → edit → verify, each a different agent with a different skill file. Enables specialization at the cost of bottleneck risk: any slow role holds up the chain.
  • Judge Loop (generator-discriminator). One agent produces, another evaluates, the loop continues until the output passes a quality threshold. Roughly three-to-six-times baseline cost for non-trivial tasks. The right choice when the quality bar is higher than any single generator call can reliably clear.

Adaptive routing — having an LLM select the topology per request — is the highest-variance, highest-risk pattern of the four. Defer until the deterministic versions are stable; adaptive routing can be the right call eventually, but only once the firm has evals and traces for each deterministic variant and knows which shape works for which class of task.

The pattern that survives model upgrades is a deterministic outer loop with agentic inner nodes. The outer loop guarantees sequencing and constraints in code; each inner node is allowed to be stochastic. A CyberFund research pipeline illustrates the shape concretely: a research agent produces a draft, five parallel judges evaluate on practicality, data-freshness, language, topicality, and completeness, the researcher revises, the loop runs three times, and the session produces in roughly forty minutes what would take a human analyst a full workday. The deterministic outer loop is about control structure, not model capability; when the next frontier model lands, the outer loop stays the same and the inner nodes get cheaper or smarter without the surrounding code changing.
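A minimal sketch of that shape, under loud assumptions: call_llm is a placeholder for whatever model client the harness uses, and the five judges run serially here where a production harness would fan them out in parallel. The point is that the round cap, the judge dimensions, and the sequencing live in code the model cannot alter.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for the harness's model client."""
    raise NotImplementedError("wire to your model client")

JUDGE_DIMENSIONS = [
    "practicality", "data-freshness", "language", "topicality", "completeness",
]

def research_loop(task: str, max_rounds: int = 3) -> str:
    """Deterministic outer loop: sequencing, judge fan-out, and the round
    cap live in code. Only the inner calls are stochastic."""
    draft = call_llm(f"Research and draft: {task}")
    for _ in range(max_rounds):          # hard round cap, enforced in code
        critiques = [
            call_llm(f"Judge this draft on {dim} only:\n{draft}")
            for dim in JUDGE_DIMENSIONS  # five judges; serial here, parallel in production
        ]
        draft = call_llm(
            "Revise the draft to address these critiques:\n"
            + "\n---\n".join(critiques)
            + f"\n\nDRAFT:\n{draft}"
        )
    return draft
```

When the next frontier model lands, only call_llm changes; the loop, the dimensions, and the cap stay exactly as written.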

Topology is today's primary cost lever, and the next model may collapse the chain

A planner-judge-consensus loop runs roughly an order of magnitude more expensive than a sequential pipeline on the same task against today's frontier models. Even the largest model-selection decision — Haiku-class to Opus-class — moves cost within a single order of magnitude, while topology moves the same task across orders. Before optimizing model choice at today's capability frontier, audit topology.

The canonical public example is Klarna's plan-and-execute pattern: a capable frontier model creates the complete plan for a customer-service task, and cheaper, faster execution models handle each step — pulling account data, processing refunds, generating responses. The planning model touches the task once; the execution models handle the volume. Routing planning to one capable model and execution to cheaper models cuts costs by up to ninety percent compared to running frontier models for every step. Cost-aware routing as a component, not a prompt trick, is where this pays off. ClawRouter — the open-source cost-aware routing engine mentioned in the previous chapter — picks the cheapest model capable of the current task from a graded pool and reports between seventy-eight and ninety-two percent cost reduction versus a uniform high-cost-model baseline. The discipline is picking the router layer before picking the models and routing by skill-level requirements (tool-use, reasoning depth, output schema) into a graded pool, not per-request on a hunch.
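A sketch of the routing discipline, not ClawRouter's actual API: the tier names, prices, and skill fields below are illustrative placeholders. What it shows is routing by declared skill requirements into a graded pool sorted cheapest-first, with the plan-and-execute split falling out naturally.

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_mtok: float      # illustrative blended $/M tokens
    tool_use: bool
    reasoning_depth: int      # 1 = shallow, 3 = frontier-grade
    structured_output: bool

# Graded pool, sorted cheapest-first. Names and prices are placeholders.
POOL = [
    ModelTier("cheap-base",    0.50, True, 1, True),
    ModelTier("mid-workhorse", 3.00, True, 2, True),
    ModelTier("frontier",     15.00, True, 3, True),
]

def route(needs_tools: bool, min_reasoning: int, needs_schema: bool) -> ModelTier:
    """Pick the cheapest tier whose skill profile covers the declared
    requirements. Routing is by requirement, never by per-request hunch."""
    for tier in POOL:
        if ((tier.tool_use or not needs_tools)
                and tier.reasoning_depth >= min_reasoning
                and (tier.structured_output or not needs_schema)):
            return tier
    return POOL[-1]  # nothing cheaper qualifies; fall back to the frontier tier

# Plan-and-execute split: planning needs depth, execution steps do not.
planner = route(needs_tools=False, min_reasoning=3, needs_schema=False)
executor = route(needs_tools=True, min_reasoning=1, needs_schema=True)
```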

The forecast caveat matters. The next frontier model may make a single Opus-equivalent call succeed where today's chain of five cheaper calls plus a judge is needed. When that happens, topology moves from "primary cost lever" to "premature optimization" overnight. Build the simplest topology that works today. Plan to collapse topologies when the model catches up. A Judge Loop that runs at six-times baseline cost today becomes a wasted engineering investment the week a single frontier call produces the same quality at one-sixth the token spend.

Coordination overhead consumes a large share of multi-agent tokens

Anthropic's own engineering team published the canonical topology experiment in February 2026. Nicholas Carlini set sixteen instances of Claude Opus loose on a shared codebase with minimal supervision, tasking them with building a C compiler from scratch. The agents built a working compiler. They also spent a large share of their tokens re-deriving context they should have shared, producing redundant findings, and arguing over handoff format. Public measurements from practitioners running similar multi-agent pipelines put the coordination overhead well above what the actual task content requires — context serialized into JSON, re-transmitted at every hop, re-parsed at each end, fed back into an LLM with the accumulated history of the entire conversation. The token cost of coordination frequently exceeds the token cost of the work itself.

The practical claim the pattern supports is that many enterprise workflows are monotonic — ordered, predictable, no adaptive reasoning required — and do not need multi-agent architecture at all. A pipeline with a single capable agent and a schema-validated output gate frequently beats a chain of five specialized agents on both cost and quality. Multi-agent architecture earns its cost when the task genuinely decomposes into parallel or role-specialized work, not when it can be rewritten as a single agent with a clearer eval.

Composition rules beyond the four framing principles already named

The harness-engineering discipline in the single-agent case already named four framing principles — the dumbest variant that works is the best variant, the checking model must be smarter than the executor, two retries then human, and eval-first before anything else gets tuned. Those principles still apply at multi-agent scale. Four additional composition rules extend the set.

A tool can call a human or another agent. A tool does not have to be a deterministic script. The main agent's "call specialist" tool can trigger a research sub-agent, which does its work and returns structured findings. The office agent's reset_printer tool can page the sysadmin in a Slack DM — Vasya restarts the printer, the tool returns "printer operational," and the agent continues. Human-in-the-loop becomes a first-class tool call rather than an escalation exception, and the firm's existing human processes extend naturally into the agent's action surface.
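A sketch of the pattern, assuming hypothetical page_sysadmin and wait_for_ack helpers wired to the firm's Slack: the agent calls reset_printer like any other tool, and the human is the implementation.

```python
def page_sysadmin(message: str) -> None:
    """Hypothetical helper: DM the on-call human in Slack."""
    raise NotImplementedError("wire to your Slack client")

def wait_for_ack(timeout_s: int) -> bool:
    """Hypothetical helper: block until the human confirms, or time out."""
    raise NotImplementedError("wire to your Slack client")

def reset_printer() -> str:
    """Registered in the agent's tool manifest like any script. The agent
    sees an ordinary tool result; a human does the actual work."""
    page_sysadmin("Printer down on floor 3 - please power-cycle it.")
    if wait_for_ack(timeout_s=900):
        return "printer operational"
    return "no human response in 15 minutes; printer still down, escalating"
```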

Embed a deterministic script inside the agent loop where the step is deterministic. LLMs forget roughly every fifth time; deterministic scripts never forget. For link validation, file-dependency tracking, update propagation, SQL queries, data-format transforms, and any other step where the right answer is unambiguous, the LLM plans and the script executes. A private legal-services pipeline at Skala illustrates the discipline in a regulated domain: a deterministic questionnaire routes to one of several canonical contract templates; the script fills placeholders deterministically; the LLM handles only minor edits and the "special conditions" override clause on page one that legally supersedes contradicting terms in the body. The firm reports roughly 99.9 percent success on standardized legal documents under this pattern, and the LLM explicitly does not generate contracts from scratch.
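A sketch of that split, with hypothetical template and field names rather than Skala's actual pipeline: the script owns every placeholder, raises instead of guessing when a field is missing, and leaves the LLM nothing but the narrow stochastic remainder.

```python
import re

def fill_template(template: str, answers: dict[str, str]) -> str:
    """Deterministic step: placeholder substitution never forgets a field,
    and it raises instead of guessing when an answer is missing."""
    def sub(match: re.Match) -> str:
        key = match.group(1)
        if key not in answers:
            raise KeyError(f"questionnaire missing required field: {key}")
        return answers[key]
    return re.sub(r"\{\{(\w+)\}\}", sub, template)

# The LLM never sees this step. It handles only the narrow stochastic
# remainder, such as the special-conditions clause, appended afterward.
contract = fill_template(
    "SERVICE AGREEMENT between {{client}} and {{vendor}}, term {{term}}.",
    {"client": "Acme GmbH", "vendor": "Example Legal Ltd", "term": "12 months"},
)
```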

Explicit handoff schemas between agents. An agent handing work to another agent must pass a structured object — task ID, status, context summary under five hundred tokens, attempts to date, constraints, and the artifacts produced so far. Without the schema, receiving agents re-derive context by reading conversation history, which is where context drift compounds into hallucination. A private seven-phase sales-agent pipeline illustrates the discipline: hypothesis → MVP validation → spec → dev → paper-trading simulation → live → continuous improvement, with explicit gates at each phase transition. The gates are the schema; the schema is what keeps handoffs from drifting across the seven agents. Four or five strategies survive per hundred tested; the failed ones become training data for the hypothesis agent on the next run.
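A minimal version of the handoff object, assuming a crude four-characters-per-token proxy in place of a real tokenizer; the field set follows the schema named above.

```python
from dataclasses import dataclass, field

MAX_SUMMARY_TOKENS = 500  # the under-five-hundred-token cap from the schema

@dataclass
class Handoff:
    """The object every agent-to-agent handoff serializes to. Receiving
    agents work from this, never from raw conversation history."""
    task_id: str
    status: str                 # e.g. "in_progress", "blocked", "done"
    context_summary: str        # capped at MAX_SUMMARY_TOKENS
    attempts: int               # attempts to date; feeds two-retries-then-human
    constraints: list[str] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)  # paths or IDs produced so far

def validate_handoff(h: Handoff) -> None:
    """Enforced by outer-loop code at every hop, not by agent prompts."""
    if not (h.task_id and h.status):
        raise ValueError("handoff missing required fields")
    if len(h.context_summary) // 4 > MAX_SUMMARY_TOKENS:  # ~4 chars/token proxy
        raise ValueError("context summary over the token cap")
```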

Surrender to complexity when a single agent with a strong eval will do. The counter-pattern to multi-agent hype: if the firm can write an end-to-end eval for the task, collapse the five-agent or twelve-agent chain into a single agent with an evolved prompt. The resulting prompt will be unreadable to humans, and that is fine. Teams split into multi-agent chains not because the problem requires it but because they cannot hold single-prompt complexity in their heads. The eval does the holding. The agent executes against it. The multi-agent chain is a human coping mechanism for complexity, not an answer to it.

Exploration budgets escape consensus mediocrity

Bee colonies have eight to ten percent of bees genetically incapable of reading the waggle dance. They fly randomly. Ninety percent of the time they are inefficient foragers. Ten percent of the time they discover new fields that save the entire colony from local maxima when the old field dries up. The critical design error in 2026 multi-agent systems is over-indexing on exploitation and under-indexing on exploration.

When all agents produce the same thing, the fleet converges on mediocrity, because consensus among stochastic systems always selects the median. A twenty-agent fan-out that votes or averages its findings produces the median finding; the valuable tails — the insight one agent caught that the others missed, the unusual angle that becomes the actual breakthrough — get smoothed out by aggregation. Design an explicit exploration budget. A fraction of the fleet gets instructed to try random directions. Outputs merge via ranking, not averaging, so the tail is preserved rather than averaged away. The exploration budget is a harness-layer decision; no amount of prompt cleverness inside individual agents produces it.
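A sketch of the budget and the merge, assuming hypothetical run_agent and score hooks into the firm's harness and eval:

```python
def run_agent(task: str, explore: bool) -> str:
    """Hypothetical hook into the harness. Explorers get a random-direction
    instruction instead of the main approach."""
    raise NotImplementedError("wire to your agent runner")

def score(output: str) -> float:
    """Hypothetical eval hook: higher is better."""
    raise NotImplementedError("wire to your eval")

def run_fleet(task: str, n_agents: int = 20, explore_share: float = 0.10) -> list[str]:
    """Fan-out with an explicit exploration budget, merged by ranking."""
    outputs = []
    for i in range(n_agents):
        explore = i < int(n_agents * explore_share)  # ~10% of the fleet explores
        outputs.append(run_agent(task, explore=explore))
    # Rank, never average: top-3 by eval score preserves the tails.
    return sorted(outputs, key=score, reverse=True)[:3]
```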

Shared rails carry fleet coherence; supervision ratios set the human cap

Agent fleets need common rails the way any large system does. Shared schemas. Protocol contracts. A single culture.md or constraints.md that every agent reads at session start. Local autonomy inside those rails — per-agent instructions, skill files, memory — is where specialization happens. Encoding the shared piece as a set of files every agent pulls at boot, and the local piece as per-agent configuration, is the pattern that survives scale.

Entropy management is the operational challenge at AI-scale output. Ramp's Glass team put the framing precisely: the teams that figure out the meta-game — not how to get the model to write good code, but how to keep the codebase healthy as the model writes a lot of code very fast — are the teams that build things that were not possible before. At multi-agent scale the discipline is automated defragmentation, shared design systems the agents pull from, documentation validation at PR time, pre-commit gates that the agents cannot bypass. None of this is fancy. It is normal software engineering, pointed at a vibe-coded codebase.

Plan the supervision ratio explicitly. Low-stakes agent fleets operate at roughly ten to twenty agents per human supervisor. Medium-stakes fleets — internal tooling, ops automation, mid-value workflows — run at four to eight agents per supervisor. High-stakes fleets in financial, medical, or legal contexts run at one to three agents per supervisor. The "Agent Manager" role is formalizing in 2026 org charts with its own career ladder. The failure mode the ratio prevents is defaulting to "as many agents as the harness can run in parallel" — which produces fleets the human layer cannot actually supervise at the stakes level the work requires.

The failure taxonomy determines the mitigation

Multi-agent incidents cluster into four categories, and each category responds to a different mitigation.

  • Protocol failures — bad handoff formats, missing fields in the schema, structural drift between agents. Mitigation: explicit handoff schemas enforced by the outer-loop code, not by agent prompts.
  • Semantic failures — confident-but-wrong outputs that pass HTTP 200 checks. Mitigation: output validation on schema conformance and semantic sanity even when the agent's response format parses.
  • Behavioral failures — infinite loops, recursive spawning, an agent that decides to rerun the same tool call fifty times. Mitigation: hard caps the harness enforces regardless of what the agent asks.
  • Supply-chain failures — malicious skills, compromised MCP servers, injected instructions in untrusted content the agent processes. Mitigation: sandboxed execution for agent-generated code, signed-skill provenance, network-egress controls on what the agent can reach.

Five hard caps the harness must enforce

Five caps are not optional at production scale, and all five live in the harness code, not in the prompt. The agent cannot override them with natural-language cleverness. A harness-level sketch of all five follows the list.

  • Max tool calls per invocation. Loop detection triggers at roughly fifteen to twenty tool calls on a single task. Above that threshold the harness halts the agent and escalates.
  • Max recursive spawning depth. One level by default — Claude Code's subagent tooling enforces this by design, so teammates cannot spawn teammates. Depth-two spawning is permitted explicitly per use case; depth-three essentially never is. The recursive-spawning failure mode is documented in the oh-my-openagent tracker as a real production pattern where a root agent delegates research, the research sub-agent is still allowed to delegate, similar prompts get re-issued, and the session becomes unstable.
  • Per-invocation cost ceiling. Alert at three times the agent's rolling-average cost; halt at a hard cap the agent cannot exceed. A documented 2025 incident provides the motivating case: a four-agent LangChain research system ran for eleven days with two agents stuck in an infinite conversation loop before the team noticed; cost escalated from $127 in week one to $891, then $6,240, then $18,400, landing at $47,000 before the team pulled the plug. Dashboards showed healthy activity and normal latency the entire time.
  • Fleet-wide kill switch. Halt every agent in the firm within thirty seconds. The time between a runaway incident starting and the finance team noticing the invoice is typically measured in hours; the halt has to execute faster than the reporting delay.
  • Output validation on HTTP 200. The most dangerous agent failure is the confident-but-wrong one, and traditional monitoring never fires on it. Anthropic's own Project Vend experiment is the public reference: Claude Opus (named Claudius) running a vending kiosk coordinated with a partner agent on pricing in ways that produced a de facto cartel pattern, and failed to recognize when it needed human help; the response shapes were valid while the substance was wrong. Schema validation on HTTP 200 responses, plus semantic sanity checks where the task admits them, is the failure layer that catches this.
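The sketch referenced above, with illustrative thresholds and a hypothetical alert_human pager hook; everything here is harness code that runs regardless of what the agent says.

```python
def alert_human(message: str) -> None:
    """Hypothetical pager hook."""
    raise NotImplementedError("wire to your alerting")

class HardCapViolation(Exception):
    """Raised by the harness; the agent never gets a chance to argue."""

MAX_TOOL_CALLS = 20       # loop-detection band is roughly 15-20
MAX_SPAWN_DEPTH = 1       # depth 2 only by explicit per-use-case exception
COST_CEILING_USD = 25.00  # illustrative hard per-invocation halt
COST_ALERT_MULT = 3.0     # alert at 3x the agent's rolling average

def enforce_caps(tool_calls: int, spawn_depth: int, cost_usd: float,
                 rolling_avg_usd: float, kill_switch: bool,
                 http_ok: bool, output_valid: bool) -> None:
    if kill_switch:
        raise HardCapViolation("fleet-wide kill switch engaged")
    if tool_calls > MAX_TOOL_CALLS:
        raise HardCapViolation("suspected tool-call loop; halt and escalate")
    if spawn_depth > MAX_SPAWN_DEPTH:
        raise HardCapViolation("recursive spawn depth exceeded")
    if cost_usd > COST_CEILING_USD:
        raise HardCapViolation("per-invocation cost ceiling hit")
    if rolling_avg_usd and cost_usd > COST_ALERT_MULT * rolling_avg_usd:
        alert_human(f"cost anomaly: ${cost_usd:.2f} vs ${rolling_avg_usd:.2f} average")
    if http_ok and not output_valid:
        # The confident-but-wrong case: HTTP 200 is not success.
        raise HardCapViolation("output failed schema/semantic validation on a 200")
```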

A private trading pipeline at a crypto-focused firm illustrates the caps in production. The Polymarket pipeline runs hypothesis → validator (Docker-sandboxed, internet-access only through a sanitization proxy that strips injection vectors from web content) → tech-spec agent → developer agent → paper trading → live agent → analyst monitoring. Over four to five months the pipeline tested roughly 100 strategies; two survived to profitable live trading. The pipeline's caps are not decorative — every stochastic step runs in an isolated sandbox, and profitable strategies get manually transferred to a hardened server with zero AI agents on it before they touch production capital. The production surface holds nothing stochastic.

Constrained automated rollback repairs the harness inside a narrow action space

Hard caps prevent blast-radius expansion when things go wrong. Constrained rollback is what happens after: automated repair that cannot invent new fixes, only choose from a pre-approved set.

Microsoft's Azure SRE Agent is the canonical 2026 public implementation. The agent combines anomaly detection from the observability layer with a remediation action manifest that is narrow by design: rollback to the last green commit, disable a failing tool, or throttle ingress. It cannot write new code to work around the failure, cannot modify the production deployment outside that action set, and cannot override the human operator's stop signal. Microsoft's own production incident-management work (the Triangle system) documents roughly forty percent MTTR reduction and ninety-one percent Time-to-Engage reduction from automated triage on real Azure incidents. The pattern is the Stage-4 harness rung from the maturity ladder made concrete — the harness repairs itself, but only inside a deliberately small action space.

The key principle is that the caps and the rollback live in the harness, not the prompt. The agent cannot argue its way around them. A remediation agent with three verbs in its manifest cannot cause the class of incident it was deployed to contain, which is the entire design intent — the narrow action space is the safety property, not a limitation to engineer around.
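A sketch of the manifest pattern, not Azure SRE Agent's actual interface: the enum is the whole action space, any verb outside it fails to parse, and the human stop signal short-circuits everything.

```python
from enum import Enum

class Remediation(Enum):
    """The entire action manifest. The repair layer chooses from this
    set; it cannot express anything outside it."""
    ROLLBACK_LAST_GREEN = "rollback_last_green"
    DISABLE_FAILING_TOOL = "disable_failing_tool"
    THROTTLE_INGRESS = "throttle_ingress"

def apply(action: Remediation) -> None:
    """Hypothetical dispatcher to real infra operations."""
    raise NotImplementedError("wire to your deploy/ops layer")

def remediate(proposed: str, human_stop: bool) -> None:
    if human_stop:                  # the operator's stop signal wins, always
        return
    action = Remediation(proposed)  # ValueError for any verb not in the manifest
    apply(action)
```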

The hard caps bound the damage; knowing why the damage happens helps the policy author size the caps correctly. Production reports through 2025-2026 converge on roughly 35 to 45 percent of tokens consumed by failure modes that produce no usable output. Four mechanisms account for most of the leak.

  • Re-summarization loops — the agent re-summarizes its own conversation history each turn rather than working from a structured state passed forward, so context grows quadratically while the useful information stays constant.
  • Tool-call amnesia — the agent re-invokes the same tool because it does not recall the prior response from context, burning tokens on redundant calls whose results have already been computed.
  • Retry spirals — a missing two-retry cap lets the agent loop through dozens of failed attempts on what is actually a context problem rather than an execution problem.
  • Hallucinated tool calls — the agent invokes tools that do not exist, or real tools with malformed arguments; error handling then burns more tokens without producing diagnostic gain.

Public anchor incidents: NotebookCheck documented an OpenClaw agent burning roughly 120K tokens every 30 minutes of idle time, accumulating roughly $250 per week in noise spend; Galileo's 2026 retry-cascade pattern shows how a single cost-observability gap silently triples hourly spend; and the overnight-bill story from a marketer who left an agent running unsupervised is the canonical practitioner warning for teams standing up their first hard cap.

Agent-specific observability catches confident-but-wrong outputs

Traditional application monitoring fails on agents because agents fail quietly. A confident response, well-formatted output, HTTP 200 — but the content is wrong. The Datadog or New Relic dashboard shows healthy green; the underlying work is broken. Agent-specific alerts cover four signals the traditional stack does not.

Loop detection — the agent's tool-call sequence shows repetition past a threshold. Alerts fire before the hard cap halts the session, so the human can inspect before the run dies.

Cost anomaly detection — the agent's spend in a window exceeds three times its rolling average. The $47K incident is the motivating case: week-by-week cost escalation invisible to the team because no individual day's cost crossed the traditional threshold.

Confidence degradation — the slow drift from correct outputs to plausibly-wrong ones without a sudden failure. Catching drift requires trending the agent's eval-pass rate over time, not just alerting on single failures.

Output validation failures — schema non-conformance, semantic-sanity-check failures. Even on HTTP 200. Especially on HTTP 200.
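A sketch of the first two signals, with illustrative thresholds; the harness feeds in each tool call and each window's spend, and a true return value means page a human before the hard cap kills the run.

```python
from collections import Counter, deque

class AgentSignals:
    """Loop detection and cost anomaly detection, harness-side.
    Thresholds are illustrative defaults, not recommendations."""
    def __init__(self, loop_threshold: int = 5, cost_multiplier: float = 3.0):
        self.recent_calls = deque(maxlen=20)  # sliding window of (tool, args) pairs
        self.loop_threshold = loop_threshold
        self.cost_history = deque(maxlen=30)  # per-window spend, e.g. hourly
        self.cost_multiplier = cost_multiplier

    def on_tool_call(self, tool: str, args_fingerprint: str) -> bool:
        """True = repetition past threshold; alert before the hard cap halts."""
        self.recent_calls.append((tool, args_fingerprint))
        most_common = Counter(self.recent_calls).most_common(1)
        return bool(most_common) and most_common[0][1] >= self.loop_threshold

    def on_window_cost(self, usd: float) -> bool:
        """True = spend exceeds 3x rolling average; the $47K shape."""
        avg = sum(self.cost_history) / len(self.cost_history) if self.cost_history else usd
        self.cost_history.append(usd)
        return usd > self.cost_multiplier * avg
```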

Agent traces become queryable analytical data rather than debug output. Google Cloud's BigQuery Agent Analytics and open-source OpenObserve let teams run analytics across agent runs the same way they run analytics across user events — correlate agent decisions with downstream business outcomes, compute cost per task class, detect drift over time, identify skill versions where tail performance degrades. This is the Stage-3-to-Stage-4 transition from the maturity model: telemetry stops being a debugging afterthought and becomes a queryable surface that drives cost routing, self-healing, and skill retirement decisions.
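A sketch of what one queryable trace row might carry; the field names are illustrative, not any vendor's schema.

```python
import json
import time
import uuid

def emit_trace(agent_id: str, task_class: str, skill_version: str,
               cost_usd: float, eval_passed: bool, outcome_ref: str) -> str:
    """One analytical row per agent step - the shape that lets cost per
    task class, drift trends, and skill-version tail analysis run as
    plain queries over a table instead of log spelunking."""
    return json.dumps({
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "agent_id": agent_id,
        "task_class": task_class,        # group-by key for cost per task class
        "skill_version": skill_version,  # join key for tail-performance regressions
        "cost_usd": cost_usd,
        "eval_passed": eval_passed,      # trend this for confidence degradation
        "outcome_ref": outcome_ref,      # correlate with downstream business outcomes
    })
```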

The durability stack carries long-horizon agent work

Production agent workflows that touch real systems need durable execution. Motif, Inngest, and Temporal are the standard durable-execution orchestrators — checkpointing, retries, deduplication, multi-service coordination, graceful recovery from partial failures. For any agent workflow that runs past a few seconds and touches external systems, the orchestrator is not optional.

Hibernate-and-wake checkpointing extends the pattern to long-horizon agent work. Meta's engineering team operates a production Ranking Engineer Agent that resumes interrupted multi-hour ML-pipeline tasks without losing context or progress — the durable-execution pattern applied at the agent layer with one added requirement: the agent's full working-memory state has to serialize and deserialize cleanly. The practitioner lesson is specific: long-horizon tasks benefit more from checkpointable state than from larger context windows. A one-million-token context window solves a different problem than the one a long-running ML-pipeline agent has; hibernate-and-wake targets the actual problem, which is resume-from-crash rather than hold-everything-in-memory.
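A minimal sketch of the serialization requirement, with illustrative state fields: everything the agent needs to resume must round-trip through hibernate and wake, and nothing load-bearing may live only in process memory.

```python
import json
import pathlib

class AgentState:
    """Working memory that must round-trip through serialization. The
    fields are illustrative; real pipelines carry much more."""
    def __init__(self, task_id: str, step: int = 0, notes: list[str] | None = None):
        self.task_id, self.step, self.notes = task_id, step, notes or []

    def hibernate(self, state_dir: pathlib.Path) -> None:
        (state_dir / f"{self.task_id}.json").write_text(json.dumps(vars(self)))

    @classmethod
    def wake(cls, state_dir: pathlib.Path, task_id: str) -> "AgentState":
        data = json.loads((state_dir / f"{task_id}.json").read_text())
        return cls(**data)

# Resume-from-crash, not hold-everything-in-memory:
# state = AgentState.wake(pathlib.Path("/var/agents"), "pipeline-42")
```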

The architectural shift is harness separated from execution compute. Anthropic's Managed Agents and OpenAI's Agent SDK both separate the harness (orchestration, tool registry, verification, memory) from the execution compute (filesystem, sandbox, tools). The 2024-2025 pattern of thick custom harnesses bundled with compute is giving way to modular designs in 2026 — the harness runs on one piece of infrastructure, the compute runs on another, connected by a protocol. Teams migrating in this direction report that modular designs let them swap to state-of-the-art harnesses as the frontier moves, rather than rewriting the substrate every three to six months. The durable asset in the stack is the execution interface, not any particular harness running above it.

The framework landscape ages fast. Treating any specific 2026 choice as durable is a mistake. LangChain as a harness is effectively obsolete, though LangFuse (tracing, analytics) and LangGraph (stateful orchestration) remain useful. AutoGen fits R&D and research-task agents; CrewAI fits fast MVPs with role-based teams, with scale caveats. For new builds in 2026 the realistic choices are Claude Agent SDK, Google's Agent Development Kit, AWS Bedrock AgentCore, and OpenAI's Agents SDK. Standards converge on MCP (agent-to-tool) and A2A (agent-to-agent), and the direction of travel across analyst and practitioner coverage is that enterprise agents increasingly lean on those two protocols through 2026-2027.

Horizon. The framework landscape in this section is a 2026 snapshot, not a recommendation. Specific product choices made in mid-2026 will look stale by mid-2027. The category shape — durable execution, harness-separated-from-compute, open standards for agent-to-tool and agent-to-agent — is stable; the vendor names inside each slot continue to churn on a quarterly cadence. The safe posture is to architect around the category and treat specific vendor selection as a 12-18 month refresh rather than a multi-year lock-in.

Fountain and three private pipelines show the full shape

One public case carries the full treatment; three private pipelines add shape to specific reliability patterns.

Fountain (frontline workforce management). What Fountain runs: hierarchical multi-agent orchestration for high-volume frontline hiring — candidate screening, automated document generation, sentiment analysis, and candidate conversations across channels. What the architecture uniquely carries: Fountain Copilot as the central orchestrator coordinating specialized sub-agents, each with dedicated context, whose results synthesize back into integrated output. What the architecture makes possible: fifty percent faster screening, forty percent quicker onboarding, doubled candidate conversions, and one logistics customer cutting fulfillment-center staffing time from more than a week to under seventy-two hours. The canonical 2026 public example of a working supervisor-plus-specialists topology at enterprise scale.

Skala (legal). A private legal-services platform running a deterministic-questionnaire-plus-template pipeline for standardized legal documents. Routing logic picks the right template based on questionnaire answers; a script fills placeholders deterministically; the LLM handles only minor edits and the special-conditions override clause. Reported success rate on standardized legal documents sits near 99.9 percent — the discipline is that the LLM explicitly does not generate contracts from scratch, and the deterministic scripts carry the load-bearing steps. The pipeline branch applied to a regulated domain where precision is legally mandatory.

Seven-phase sales agent pipeline (private). Hypothesis → MVP validation → spec → dev → paper-trading simulation → live deployment → continuous improvement. Staged multi-agent deployment with explicit gates at each phase transition. Four or five strategies survive per hundred tested; the failed ones become training data for the hypothesis agent on the next run. The structural feature to notice is that the gates are the schema — each handoff between phases is a typed object with status, context summary, attempts, constraints, and artifacts — and the schema is what keeps handoffs from drifting across seven agents.

Polymarket trading pipeline (private). Hypothesis agent → validator (Docker-sandboxed with internet access through a sanitization proxy that strips injection vectors) → tech-spec agent → developer agent → paper trading → live agent → analyst monitoring each cycle. Over four to five months the pipeline tested roughly 100 strategies; two survived to profitable live trading. Profitable strategies get manually transferred to a hardened server with zero AI agents on it — the production surface holds nothing stochastic. The purest expression of "bound the blast radius when agents fail" — every stochastic step runs in isolation, and the path from stochastic discovery to deterministic production is explicit and human-mediated.

Seven failure modes map to composition gaps

Each named failure ties back to one of the composition, reliability, or blast-radius patterns above.

  • Skipped harness. Agent deployed without Preflight / Plan / Approve / Verify gates. Looks cheap to build; becomes unmanageable by the tenth production task, and every subsequent task compounds the technical debt.
  • Naive multi-agent. Chains of five-to-twelve agents where the problem could be one agent with an end-to-end eval. Multi-agent as a coping mechanism for complexity rather than an answer to it. Fix is the Surrender-to-Complexity rule: if a binary eval exists end-to-end, collapse the chain.
  • Recursive spawning. Agent spawns sub-agents that spawn sub-agents; the state tree explodes; cost spikes invisibly. Hard caps on delegation depth, duplicate-task suppression, and backpressure on spawn velocity are not optional — the $47K eleven-day incident is the motivating case.
  • Silent hallucination. HTTP 200, valid JSON, confident content, completely wrong answer. Traditional monitoring cannot detect this; agent-specific output validation at the Verify gate is required. Project Vend's cartel-pattern behavior is the public reference.
  • Coordination without shared state. Agents re-derive context from conversation history; context drifts; outputs contradict. Fix is explicit handoff schemas enforced by outer-loop code, not by agent prompts.
  • Single-model stack. Everything running on Opus "because quality matters." Topology is the cost lever, not model choice; a cheap-base-plus-expensive-QA split usually matches or beats a uniform-Opus stack at a fraction of the cost — Klarna's ninety-percent reduction via plan-and-execute is the canonical measurement.
  • Over-centralized consensus. All agents converge on the median; the valuable tails disappear. Fix is an explicit exploration budget — eight-to-ten percent of the fleet producing random-direction outputs, merged via ranking rather than averaging.

Run this week

Six concrete tasks, each with a time box and deliverable.

  1. Topology audit for the firm's highest-volume agent system (2 hours). Pick the production multi-agent system that handles the most daily traffic. Classify its topology — sequential, concurrent, handoff, or judge-loop. Measure its per-task cost and compare against a single-agent baseline for the same task on today's best cheap model. Deliverable — a two-row comparison table with decision: keep the current topology, or collapse.
  2. Coordination-overhead measurement on one pipeline (2-3 hours). Pick one multi-agent pipeline. Count the tokens spent on actual task content versus tokens spent on handoff and context re-transmission across the chain. If coordination overhead exceeds thirty percent, redesign with explicit handoff schemas (typed object with task ID, status, context summary, attempts, constraints, artifacts). Deliverable — a before table of token distribution and the schema spec for the after version.
  3. Asymmetric model routing on one Verify gate (2 hours). Identify the one pipeline where executor and Verify run the same model. Swap in a smarter model at the Verify gate only, keep executor on the cheaper model. Measure ten runs; record cost delta and quality delta. The Klarna plan-and-execute pattern is the reference — planning runs on the capable model, execution volume runs on cheaper models. Deliverable — a ten-row before/after cost/quality table.
  4. Five hard caps inventory (1-2 hours). For the firm's production agent fleet, verify every cap is in place: max tool calls per invocation, max recursive spawning depth, per-invocation cost ceiling, fleet-wide kill switch, output validation on HTTP 200. Any missing cap is a $47K-incident waiting to happen. Deliverable — a five-row table marking each cap present or absent, with the code commit that implements each.
  5. Exploration-budget pilot on one fleet (4-8 hours). Pick one multi-agent fan-out fleet. Instruct eight to ten percent of agents to try random directions instead of the main approach. Merge outputs via ranking (top-three by eval score) rather than averaging. Compare output diversity and best-case quality against the consensus baseline. Deliverable — a short report naming the tail findings the consensus version would have smoothed away.
  6. Durability + observability gaps on one long-horizon agent (1 day). Pick a long-running agent workflow — one that goes past a thirty-minute session. Audit four things: is there a durable-execution orchestrator (Motif, Inngest, Temporal); does the agent's working memory serialize and deserialize cleanly across crashes; are loop-detection and cost-anomaly alerts wired into the observability layer; are agent traces queryable as analytical data rather than just debug output. Each missing item is a future incident the firm has not yet budgeted for. Deliverable — a four-row audit with the next engineering ticket for each gap.

The next chapter picks up skills — how the firm codifies procedural knowledge into markdown files the agents execute against, turning the reliability substrate this chapter built into repeatable organizational capability that compounds rather than decays.