The Machine
Harness Engineering
Multi-agent systems fail through compounding errors. Even ninety-percent-reliable per-step chains drop to roughly sixty percent across five steps, and real production stacks chain far more than five steps once tool calls, intermediate parses, and handoffs are counted. The fix is not a better model — it is a disciplined harness: tool registries and governance, verification and evaluation pipelines, persistent memory, sandboxed runtimes, agent-specific tracing, middleware hooks, and a canonical six-stage pipeline wrapped in three gates and an incident memory. By early 2026 the discipline has stabilized enough to be named, and harness-only changes produce measurable benchmark gains while model upgrades inside a weak harness usually do not.
Every agent step has a small chance of going wrong, and errors compound down the chain. A ninety-percent-reliable per-step chain drops to roughly sixty percent reliable across five steps, and production systems routinely run more than five steps once tool calls, intermediate parses, and handoffs get counted. The load-bearing problem is the loop around the model, not the model itself. By 2026 the practitioner community has stabilized that loop into a named discipline.
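The arithmetic is worth seeing once. A minimal sketch, assuming independent per-step failures (real chains correlate errors, but the direction of the effect holds):

```python
# Rough reliability arithmetic: per-step success probability compounded
# across the chain length. Illustrative numbers, not a measurement.
def chain_reliability(per_step: float, steps: int) -> float:
    return per_step ** steps

print(chain_reliability(0.90, 5))   # ~0.59, the "roughly sixty percent" above
print(chain_reliability(0.90, 12))  # ~0.28, a realistic count once tool calls and handoffs are included
print(chain_reliability(0.99, 12))  # ~0.89, why harness work targets per-step reliability
```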
That discipline is harness engineering. It is the practice that surrounds the model: tool registries and governance, verification and evaluation pipelines, persistent memory, sandboxed runtimes, agent-specific tracing, middleware hooks, CI-integrated evals, self-healing remediation, and cost-aware orchestration. In 2024 the surrounding practice was still called prompt engineering — a single-turn framing that implicitly treated the model as the system. By mid-2026 the term harness engineering recurs across LangChain's engineering blog, the community-curated awesome-harness-engineering repository, Microsoft Foundry's release notes, and OpenAI's own harness-engineering post. The claim that keeps surfacing in public practitioner reports is that harness quality often determines production performance more than model choice does; harness-only changes produce measurable benchmark gains on fixed models, while same-size model upgrades inside a weak harness tend to show up in cost per task rather than in output quality.
The five-stage maturity model ranks any harness on a single ladder
A team that claims "we have a harness" without being able to locate itself on a maturity ladder is usually at Stage 0 or Stage 1. The ladder has five rungs, and each rung has specific indicators.
- Stage 0 — Ad-hoc scripts. No formal harness. Manual tool invocation. No registry, no sandboxing, no structured logging. Brittle, not production-viable. Indicator: the fastest way to answer "what tools does this agent have access to?" is to grep the source tree.
- Stage 1 — Basic harness. Schema-first specs for tools. Simple tool registry as a manifest. Basic memory store. Minimal verification through unit tests. Indicators: tool manifest exists as a first-class artifact (a minimal manifest sketch follows this list), unit evals run in CI, tracing is limited to structured logs.
- Stage 2 — Verified harness. Static verification (lint, type checks) in CI. Sandboxed execution for agent-generated code. Structured tracing. Behavioral evals in CI. Branch-per-agent pattern — each experimental change runs in its own Git branch, rollback is a single `git revert`, and A/B comparison runs the same eval suite on both branches. LangChain DeepAgents, Stripe's Minions content harness, and most mature internal agent stacks operate at this rung. Indicators: automated static checks per commit, per-agent test suites, rollback capability.
- Stage 3 — Observability-first harness. End-to-end agent tracing correlating steps, tool calls, context, and LLM input/output. LLM-as-judge for behavioral evals. Composable middleware. Versioned memory. Indicators: production-grade traces, automated LLM scoring, scenario replay for debugging.
- Stage 4 — Self-healing and cost-optimized. Automated remediation on anomaly detection. Cost-aware orchestration against budget SLAs. Policy-as-code tool governance. Indicators: remediation incident logs, cost-SLA adherence, policy audit trails, per-agent MTTD and MTTR.
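The Stage 1 manifest indicator is smaller than it sounds. A minimal sketch of a schema-first tool entry kept as a first-class, version-controlled artifact; the field names are illustrative, not any particular framework's format:

```python
# Hypothetical schema-first tool manifest, kept next to the agent in version
# control rather than scattered through the source tree.
TOOL_MANIFEST = [
    {
        "name": "search_invoices",
        "description": "Full-text search over the invoice store; returns at most `limit` matches.",
        "parameters": {                                       # JSON-Schema-style parameter spec
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "limit": {"type": "integer", "default": 10},
            },
            "required": ["query"],
        },
        "permissions": {"network": False, "writes": False},   # governance defaults per tool
        "owner": "billing-platform",                           # who answers for the tool
    },
]

def allowed_tools(granted: set[str]) -> list[dict]:
    """Answer "what tools does this agent have access to?" from the manifest, not from grep."""
    return [t for t in TOOL_MANIFEST if t["name"] in granted]
```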
Most production teams sit at Stage 2 in 2026 and choose between investing in Stage 3 observability or Stage 4 optimization based on risk and volume. The Stage-4 rung has concrete public examples. ClawRouter is an open-source cost-aware routing engine that analyzes each request locally in under a millisecond and routes it to the cheapest model capable of the task; reported inference-cost reductions run between 78 and 92 percent versus a uniform-high-cost-model baseline, bringing blended average cost to roughly $2-$3 per million tokens. Azure's site-reliability-engineering work combines anomaly detection with constrained remediation agents whose action manifest is narrow by design — rollback to last green commit, disable a failing tool, throttle ingress — producing the Stage-4 pattern of a harness that repairs itself inside a deliberately small action space. Meta's engineering team operates a production Ranking Engineer Agent that uses hibernate-and-wake checkpointing to resume multi-hour ML pipeline tasks without losing state.

The ladder is not software-only. A mid-market insurer running a claims-triage agent, a logistics firm running route-optimization agents, and a law firm running first-pass redline agents each face the same rung progression — the tool registry, the sandboxed runtime, the eval gate, and the self-healing response to anomalies do not change when the agent is reading insurance contracts rather than parsing code.
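The routing pattern behind the first of those examples is easy to sketch. What follows is not ClawRouter's implementation, only an illustration of the idea: classify the request cheaply and locally, then send it to the cheapest model whose capability tier covers it. Model names, prices, and the heuristic are placeholders:

```python
# Illustrative cost-aware router: a local heuristic picks the cheapest model
# tier that can plausibly handle the request. Names and prices are placeholders.
MODELS = [
    {"name": "small-fast", "tier": 1, "usd_per_mtok": 0.5},
    {"name": "mid",        "tier": 2, "usd_per_mtok": 3.0},
    {"name": "frontier",   "tier": 3, "usd_per_mtok": 15.0},
]

def required_tier(request: str) -> int:
    # Stand-in heuristic; a real router would use task type, request features, and history.
    if len(request) > 4000 or "refactor" in request:
        return 3
    if any(keyword in request for keyword in ("analyze", "plan", "multi-step")):
        return 2
    return 1

def route(request: str) -> dict:
    tier = required_tier(request)
    return min((m for m in MODELS if m["tier"] >= tier), key=lambda m: m["usd_per_mtok"])

print(route("summarize this ticket in two sentences")["name"])  # small-fast
```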
Horizon. The Stage-4 rung is emerging practice through 2026, not yet the standard. ClawRouter, Azure SRE remediation patterns, and Meta's durable checkpointing are each less than a year old in production. The ladder shape is stable — most teams have converged on roughly these five rungs — while the specific Stage-4 tools that occupy the top rung will rotate on a quarterly cadence as the discipline matures.
Harness-only changes moved a fixed model from rank 30 to top-five on Terminal Bench
LangChain's engineering team published the cleanest public measurement in February 2026: deepagents-cli moved from 52.8 to 66.5 on Terminal Bench 2.0 — a 13.7-point jump that took the agent from rank 30 to top-five on the public leaderboard — with the model held fixed at gpt-5.2-codex. The changes were harness-only: self-verification loops, enhanced tools and context injection, loop-detection middleware that catches agents stuck in repeated tool calls, and tracing at scale to identify failure modes and iterate against them. Anyone arguing "we need a better model" on a production agent that has not yet been pushed up the maturity ladder is choosing the more expensive of two moves that both improve quality, and the cheaper move usually produces the bigger gain.
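Loop-detection middleware is the easiest of those changes to picture. A minimal sketch, not LangChain's implementation, assuming the harness sees every tool call before dispatching it:

```python
from collections import deque

class LoopDetector:
    """Abort a run when the same tool call with the same arguments repeats
    too often inside a sliding window. Thresholds are illustrative."""

    def __init__(self, window: int = 10, max_repeats: int = 3):
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def check(self, tool_name: str, args: dict) -> None:
        signature = (tool_name, repr(sorted(args.items())))
        self.recent.append(signature)
        if self.recent.count(signature) >= self.max_repeats:
            raise RuntimeError(f"loop detected: {tool_name} repeated {self.max_repeats} times")
```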
The problem is almost never the model
Every agent system runs three layers: the model itself, the orchestration and tooling around it, and the context the system feeds in. When something fails, the problem is almost never the model — modern frontier models are all capable enough for most business workflows. It is almost always either poor orchestration (missing error handling, bad tool descriptions, unstable state passing between steps) or poorly assembled context (wrong data, thin coverage, no recency signal).
The diagnostic runs from least to most expensive to fix. Audit the orchestration layer first — tool manifests, retry budgets, handoff schemas, failure logging. If orchestration looks clean, move to the context layer: the Context Bundle assembled per task, the skill file loaded for the current request, the memory the agent reads at session start. Most production incidents trace to one of these two layers, not to a model deficiency, and the fix is usually cheaper than the budget a model swap would consume.
Every agent runs the same loop with the same four components
Every agent system — ChatGPT, Claude, Gemini, internal custom harnesses — runs the same cycle. Receive task, choose a step, execute that step (call a tool or think), receive feedback, evaluate whether the task is complete. If not, loop. Every agent consists of four components: memory, context, an LLM, and tools. Everything beyond this is nomenclature.
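A minimal sketch of that loop, with the four components as plain parameters. The shape is the point; the function names are illustrative, not any vendor's API:

```python
def run_agent(task: str, llm, tools: dict, memory: list[str], max_steps: int = 20) -> str:
    """Illustrative agent loop: choose a step, execute it, feed back the result, check completion."""
    context = [task] + memory                   # context: the task plus whatever memory supplies
    for _ in range(max_steps):
        decision = llm(context)                 # the LLM chooses the next step
        if decision["done"]:
            return decision["answer"]
        tool = tools[decision["tool"]]          # execute: call the selected tool...
        observation = tool(**decision["args"])  # ...and capture its feedback
        context.append(str(observation))
    raise TimeoutError("step budget exhausted") # the harness, not the model, enforces this bound
```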
Three parameters determine how well that universal loop runs. The quality of context the agent has on hand. The quality of the LLM under the hood — fast versus slow, smart versus cheap. The quality of the harness — how well tools are described to the agent and how well tools actually function when called. A tool fails the agent in two distinct ways. First, well-coded but poorly described — the agent never selects it when it's the right tool. Second, well-described but buggy — the agent selects it and gets nothing useful back. Both failure modes belong to the harness layer, not to the model.
The harness is the regulator on the token pipeline that 1.1 named. The pipeline's flow rate, loss rate, and cost per token at the firm level all resolve to harness-layer parameters at the agent level: how fast the agent loops, how much context survives each step, and which steps are worth the LLM call versus the cheaper code path. A weak harness produces a leaky pipeline. The reliability work in this chapter is the engineering interface to that economics.
Pipelines, agents, and self-improvers are parallel branches, not a progression
Three patterns sit side by side in 2026 production stacks. A classic pipeline has code calling the LLM step by step: code is the orchestrator, and the LLM is a transformation step in the middle. An agentic system has the LLM calling code and tools: the LLM is the orchestrator, and code is the substrate. A self-improving agent modifies its own prompt or tool set based on eval feedback: the agent adapts its own definition of how to do the work.
These are parallel evolutionary branches, not three steps on a ladder. Pipelines offer better observability (every step is inspectable), better cost control (each call's token count is known), and simpler debugging (failures localize cleanly at a step boundary). An agent costs more and is harder to cost-control — one run calls three tools, the next run calls fifteen, the third spends an unpredictable share of its budget thinking — but it handles task shapes a pipeline cannot pre-specify. A self-improver earns its cost only when the task has a binary eval and the team is willing to let the prompt grow beyond human readability. Choose the branch that fits the problem. The common failure mode is assuming "more agentic is more evolved" and deploying an agent where a pipeline would have worked at a tenth of the cost.
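The structural difference between the first two branches fits in a few lines. A sketch with placeholder steps and a stand-in `llm` callable: in the pipeline, code fixes the step order and the LLM transforms data inside one step; in the agent, the LLM decides which tool runs next:

```python
def pipeline(doc: str, llm) -> str:
    """Classic pipeline: code is the orchestrator, the LLM is one step in the middle."""
    fields = doc.splitlines()[:3]               # deterministic pre-processing (placeholder)
    summary = llm(f"Summarize: {fields}")       # the single LLM transformation step
    return f"REPORT\n{summary}"                 # deterministic post-processing (placeholder)

def agent(doc: str, llm, tools: dict) -> str:
    """Agentic system: the LLM is the orchestrator, code (the tools) is the substrate."""
    state = doc
    for _ in range(10):                         # the step budget lives in the harness
        step = llm(state)                       # the LLM decides which tool runs next
        if step["tool"] == "finish":
            break
        state = tools[step["tool"]](state)
    return state
```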
The harness goal is to make the agent boring
An agent given freedom behaves like an employee with no onboarding: smart but unpredictable, prone to over-engineering, to inventing unnecessary work, and to producing side effects. The solution is not a cleverer prompt. It is a formal process layer that constrains agent behavior. A boring agent is a good agent — its steps are traceable in the incident log, its outputs verifiable against a schema, and its failures inspectable without paging the team. The model is the creative layer; the harness is the boring layer that makes the model's creativity safe to ship.
The discipline runs against a real temptation. Elaborate prompts and sprawling multi-agent chains produce roughly five percent improvement in output quality for roughly ninety-five percent more engineering work. In production, with unstable infrastructure and stochastic models, the simplest variant that works is usually the correct answer. Karpathy's framing captures it: the dumbest variant that works is the best variant. The harness earns its complexity only where a simpler version fails.
The canonical pipeline runs six stages, three gates, and an incident memory
Every agent should run the same pipeline shape. The six stages:
- Preflight. The agent gathers context and constraints, determines what it can do.
- Plan. The agent decomposes the high-level task into atomic tasks with expected artifacts.
- Approve. Mandatory human confirmation of the plan before execution. This is the gate, not a rubber stamp.
- Tasks. Each task produces code plus an artifact — the artifact is documentation written during execution that makes the work traceable after the fact.
- Verify. A quality gate with pre-defined criteria, hash checksums on produced artifacts, and formal checking of output structure.
- Finish. Export additional artifacts if needed, close the session, emit the incident log entry.
Three gates narrow the agent's degrees of freedom across the pipeline. The scope boundary defines what the agent can touch — files in its folder only, this repository only, no writes outside the session directory. Permissions are the network access, write access, tool access defaults the agent carries, along with the explicit grants required to exceed them. The responsibility boundary names who plans, who executes, and who verifies — which roles hold which authorities when the pipeline runs with more than one agent.
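A skeleton of the six stages and the gates, with the stage bodies left to a hypothetical `agent` object and a caller-supplied `verify` criterion. What matters in the sketch is that Approve blocks on a human, the scope boundary is checked before every write, and Verify applies a pre-defined check rather than a vibe:

```python
import hashlib

def run_pipeline(task: str, agent, scope: set[str], verify) -> dict:
    """Illustrative six-stage skeleton: Preflight, Plan, Approve, Tasks, Verify, Finish."""
    context = agent.preflight(task)                        # 1. gather context and constraints
    plan = agent.plan(task, context)                       # 2. decompose into atomic tasks

    if input(f"Approve plan?\n{plan}\n[y/N] ").lower() != "y":
        return {"status": "rejected at Approve"}           # 3. mandatory human gate

    artifacts = []
    for step in plan:                                      # 4. each task produces an artifact
        assert step["path"] in scope, f"scope violation: {step['path']}"
        output = agent.execute(step)
        artifacts.append({
            "step": step,
            "output": output,
            "sha256": hashlib.sha256(output.encode()).hexdigest(),  # checksum for Verify
        })

    if not all(verify(a) for a in artifacts):              # 5. pre-defined Verify criterion
        return {"status": "failed Verify", "artifacts": artifacts}

    return {"status": "done", "artifacts": artifacts}      # 6. Finish: export, close session, log
```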
The pipeline also maintains an incident memory. When an agent encounters a problem, it automatically saves the incident to the task folder. On the next run against a similar task, that incident gets added to the agent's context. If the same problem appears again, the agent already knows the workaround. Over time, the incident log becomes fuel for distilling new skills — compound learning that accrues at the harness layer rather than at the model layer.
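A sketch of the incident memory itself, assuming one plain JSON file per task folder and a deliberately naive similarity check; a real harness would version the store and match incidents more carefully:

```python
import json
import pathlib

def record_incident(task_dir: pathlib.Path, problem: str, workaround: str) -> None:
    """Append the incident to the task folder so the next similar run can read it."""
    log = task_dir / "incidents.json"
    incidents = json.loads(log.read_text()) if log.exists() else []
    incidents.append({"problem": problem, "workaround": workaround})
    log.write_text(json.dumps(incidents, indent=2))

def load_incidents(task_dir: pathlib.Path, task: str) -> list[dict]:
    """Naive retrieval: surface any incident whose problem shares a word with the new task."""
    log = task_dir / "incidents.json"
    if not log.exists():
        return []
    words = set(task.lower().split())
    return [i for i in json.loads(log.read_text()) if words & set(i["problem"].lower().split())]
```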
KISS, asymmetric QA, two retries, and an eval-first gate carry the reliability weight
Four principles carry most of the reliability weight across the canonical pipeline. The first is the KISS discipline already named: the dumbest variant that works is the best variant. Elaborate prompts rarely beat simple prompts under production noise, and the maintenance burden of an elaborate prompt compounds every time the schema around it changes.
The second is the asymmetric quality-assurance rule. The model doing the checking must be smarter than the model doing the execution. The base agent runs on a cheap, fast model (Haiku-class, Gemini Flash, GPT-class-mini). The Verify gate, the tone-of-voice check, the factual-grounding check, and the license-compliance check each invoke a more powerful model for that single call, then drop back to the cheap model for the next task. A symmetric setup — same model handling execution and verification — tends to produce confident approval of wrong work. Asymmetry is what catches the errors worth catching, at the cost of a single expensive call per Verify gate rather than across every step.
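The asymmetry is a routing decision, not an architecture. A minimal sketch with placeholder model handles: execution stays on the cheap model, and only the Verify call pays for the stronger one:

```python
def run_with_asymmetric_qa(task: str, cheap_llm, strong_llm) -> str:
    """Cheap model executes; a single call to the stronger model verifies."""
    draft = cheap_llm(f"Do the task:\n{task}")
    verdict = strong_llm(                       # the one expensive call, at the Verify gate only
        f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
        "Answer PASS or FAIL and name the first concrete defect."
    )
    if not verdict.strip().upper().startswith("PASS"):
        raise ValueError(f"Verify rejected the draft: {verdict}")
    return draft
```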
The third is the two-retries-then-human rule. Cap agent retries at two on any single gate failure. If two retries do not resolve the failure, the problem is in the context or the specification, and five retries will not help either. The variant pattern is two retries on the base model, then fall back to a smarter model for two more retries, then escalate to a human. Beyond that budget, the agent burns compounding spend without compounding results.
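The variant pattern as code, with hypothetical callables and a `gate` standing in for the binary Verify criterion of the next principle:

```python
def with_retry_budget(task: str, base_llm, strong_llm, gate) -> str:
    """Two retries on the base model, two on a smarter model, then escalate to a human."""
    for model, tries in ((base_llm, 2), (strong_llm, 2)):
        for _ in range(tries):
            candidate = model(task)
            if gate(candidate):                 # binary pass/fail check on the output
                return candidate
    raise RuntimeError("retry budget exhausted; escalate to a human instead of looping")
```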
The fourth is the eval-first rule. The Verify gate needs a binary criterion for "did this work" before anything else in the pipeline gets tuned. A structured-output check against a schema, a reference comparison against a known-good output within a defined similarity threshold, an LLM-as-judge rubric that has been calibrated against human scoring on a sample — any of these counts. What does not count is the absence of an eval; a pipeline with no Verify criterion is a pipeline that cannot be improved, only hoped over.
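One acceptable criterion, sketched end to end: a structured-output check against a schema, hand-rolled here rather than tied to any particular validation library:

```python
import json

def verify_schema(output: str, required: dict[str, type]) -> bool:
    """Binary Verify criterion: output must be JSON carrying the required fields and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(field), expected) for field, expected in required.items())

# Example: a hypothetical Verify gate for an invoice-extraction task.
CRITERION = {"vendor": str, "total_cents": int, "currency": str}
assert verify_schema('{"vendor": "Acme", "total_cents": 1299, "currency": "EUR"}', CRITERION)
assert not verify_schema("looks good to me", CRITERION)
```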
LangChain DeepAgents, Claude Code, and Goose are the three reference harnesses to study
Three public implementations sit at different rungs of the maturity ladder and give the reader three concrete reference harnesses to study.
LangChain DeepAgents is the public benchmark anchor. Terminal Bench 2.0 score moved from 52.8 to 66.5 on a fixed model through harness-only changes — structured verification loops, enhanced tools with context injection, loop-detection middleware, and tracing at scale. Branch-per-agent experimentation lets the team A/B-test changes against the same eval suite on both branches. The case sits cleanly at the Stage-2-to-Stage-3 transition: verified harness with branching, moving into observability-first territory through LangSmith tracing.
Claude Code is the commercial reference harness. A team that does not want to assemble components from parts can fork an opinionated starter. Claude Code carries substantial orchestration around the model — tool registry, sandbox runtime, skill-routing, session persistence, middleware hooks — while keeping the underlying capability visible enough that a team can read the harness as a template rather than a black box.
Goose is the open-source counterpart. Block's coding harness launched January 28 2025, model-agnostic and MCP-native, and by 2026 sits at the top of the open-source harness list. Goose demonstrates that a team can run production agent workloads without committing to a specific model vendor, which matters when the model frontier moves on a quarterly cadence and the team does not want to rewrite its orchestration layer on every generation.
Five failure modes map to specific harness gaps
Each named failure maps to a specific pipeline stage or discipline.
- Skipped harness — Stage 0 pretending to be Stage 2. The team says "we have a harness" but cannot locate itself on the maturity ladder. No Preflight, no Approve gate, no Verify criterion. Looks cheap to build; becomes unmanageable by the tenth production task, and every subsequent task compounds the technical debt.
- No Approve gate — The Plan stage runs but the human confirmation step gets skipped because "the plan was obvious." Small plan errors propagate into six hours of wrong work. The Approve gate is the single cheapest place to catch a misframed task; removing it is the most common harness-debt mistake.
- No incident memory — The same failure repeats, and the harness cannot tell because it never saved the first occurrence. Compound learning at the harness layer requires that the harness actually remembers.
- Symmetric verification — Same model running execution and the Verify gate. Confident approval of wrong work. The fix is structural: route Verify to a smarter or differently-trained model than the executor.
- No binary eval — The pipeline claims to be verified but the Verify criterion is "does this look good." Without a binary, repeatable criterion, the pipeline cannot be tuned and the rest of the harness discipline collapses. The operating principle holds across the discipline: learn to build evals first, and everything else follows.
Run this week
Six concrete tasks, each with a time box and deliverable.
- Locate the firm's harness on the five-stage ladder (1 hour). Pick the single production agent that handles the most daily traffic. Walk through the indicators at each rung — tool manifest as first-class artifact, unit evals in CI, branch-per-agent, end-to-end tracing, self-healing remediation. Stop at the first rung where an indicator is missing. Deliverable — a one-sentence verdict ("we are at Stage 1 because branch-per-agent is absent") with the specific missing indicator named.
- Three-layer diagnostic on one recent failure (1 hour). Pick one agent failure from the last week. Ask three questions in order: Is the model capable on similar tasks? Is the orchestration layer passing state cleanly, with handoff schemas and retry budgets? Is the context the agent saw complete and current? The first "no" is where the fix goes. Deliverable — a one-page failure report naming the layer.
- Audit the canonical pipeline for one agent (2 hours). Check that the agent runs Preflight, Plan, Approve, Tasks, Verify, Finish. Missing stages get flagged; the Approve gate specifically is the highest-leverage one to add if absent. Deliverable — a six-row checklist with the agent's current state per stage and the specific code change to fix each gap.
- Install branch-per-agent for one experimental change (2-4 hours). Pick one harness change the team wants to make. Create a Git branch for the change. Run the same eval suite on both branches; a sketch of the runner follows this list. Rolling back is a single `git revert` if the change regresses; ship the change if the eval improves. Deliverable — a before/after eval report plus the PR or `git revert` artifact.
- Write or strengthen one eval (1 day). Pick the one task where the Verify gate currently runs on "does this look good." Write a binary criterion — schema check, reference comparison, calibrated LLM-as-judge rubric. Run it against the last twenty agent outputs and record the pass/fail ratio. Deliverable — the eval function plus the twenty-row result table.
- Add asymmetric QA to one pipeline (2 hours). Identify the single pipeline where the executor and Verify model are the same. Swap in a smarter model at the Verify gate only, keeping the executor on the cheaper model. Measure before/after cost and quality on ten runs. Deliverable — a before/after cost/quality table.
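For the branch-per-agent task, a sketch of the comparison runner, assuming a clean working tree and an eval suite invoked as a command that prints a JSON object with a `score` field; both assumptions are placeholders to adapt:

```python
import json
import subprocess

def eval_on_branch(branch: str, eval_cmd: list[str]) -> float:
    """Check out the branch and run the eval suite; assumes the suite prints JSON with a score."""
    subprocess.run(["git", "checkout", branch], check=True)
    out = subprocess.run(eval_cmd, check=True, capture_output=True, text=True).stdout
    return json.loads(out)["score"]

def compare(baseline: str, candidate: str, eval_cmd: list[str]) -> None:
    before = eval_on_branch(baseline, eval_cmd)
    after = eval_on_branch(candidate, eval_cmd)
    print(f"{baseline}: {before:.3f}  {candidate}: {after:.3f}  delta: {after - before:+.3f}")
    # If the candidate regresses, a single `git revert` on its branch undoes the change.

# compare("main", "agent/loop-detection", ["python", "-m", "evals.run", "--json"])
```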
The next chapter picks up agent reliability — topology choices, composition rules, exploration budgets for escaping consensus mediocrity, circuit breakers for failure containment, and observability patterns — on top of the harness discipline this chapter established.