The Machine
Context Engineering
The binding constraint on agent quality is not model intelligence but context engineering — the systematic design of what the agent perceives, remembers, and can trace back to ground truth. A weak model in rich context routinely outperforms a strong model in thin context. The architecture is Token Metabolism: the firm ingests tokens (Slack messages, calls, commits, tickets, docs, emails) and processes them through a four-stage pipeline into a living knowledge graph that every agent queries. Get this layer right and one sentence of retrieved context produces the same quality as ten minutes of prompt engineering.
Agent deployments that disappoint are frequently diagnosed as a model problem and are almost never a model problem. A frontier model given no loaded context produces a confident-looking strategy that hallucinates the firm into the ground. Move the same task onto a mid-tier model running against the firm's full operating context and the output becomes verifiable on first read, often at less than a tenth of the cost. The binding constraint on quality in 2026 is what the agent perceives, remembers, and can trace back to ground truth — not model capability, which already exceeds what most business workflows ask of it.
The useful name for the architecture is Token Metabolism. Every Slack message, call recording, commit, Google Doc, Jira ticket, email, marketing campaign, and deal-pipeline event is a token. The firm ingests those tokens, runs them through a four-stage pipeline — ingest, data model, governance, ontologize — and consolidates them into a living knowledge graph that every agent queries at task time. The earlier Token Pipeline analogy described the flow of tokens between nodes; Token Metabolism describes the stack that processes them.
Context engineering is the system, not the prompt
The single most useful diagnostic when an agent disappoints is the ratio of rewrite time to review time. If the human spends more time rewriting the agent's output than reviewing it, the problem is almost never the prompt. It is that the agent does not know enough about the firm to produce the output the human would have produced directly. The fix is always upstream — richer context, better-structured memory, a more specific skill file — and can be diagnosed by watching where the human's time goes.
The prompt is one component. Context is the entire system: folder structure, CLAUDE.md files, skills, MCP servers, retrieved data, working memory, session state, and agent identity. A weak model in rich context outperforms a strong model in thin context on almost every business task that is not at the frontier of model capability.
The worked public example is Browserbase's internal Slack-native agent bb. The production agent routes most traffic to the cheapest model capable of the task, because a skill-routing table in the agent's system prompt maps request patterns to domain-specific skill files. Each skill is a markdown file that carries the schema, the decision tree, and the exact multi-source debugging playbook a senior engineer would follow. A fragment of bb's routing rubric reads like this:
```
Skill routing:
  Session investigation, debugging, error analysis → investigate-session
  Pull request, code change, Linear ticket → create-pr
  Customer data, usage trends, account questions → customer-intelligence
  Feature request logging or triage → log-feature-request
  Warehouse queries or SQL analysis → data-warehouse
  Browser automation, web data retrieval → write-browserbase-web-automations
  Notion pages or databases → notion

  Load only what you need. Do not load all skills for every request.
```
The agent loads only the skill it needs and executes against it. The same class of task routed to a frontier model without the skill file produces worse and less auditable output. The skill file carries the reasoning path; the model executes against it. For most firms in 2026 the default shape is cheap-tier-plus-rich-context, not frontier-model-plus-naive-prompt.
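A minimal sketch of the mechanics, with the routing table as a Python dict for illustration — bb's actual router lives in its system prompt, and the pattern keys and file layout here are assumptions:

```python
from pathlib import Path

# Hypothetical routing table mirroring the rubric above:
# keyword patterns → skill file name.
ROUTES = {
    ("session", "debug", "error"): "investigate-session",
    ("pull request", "code change", "ticket"): "create-pr",
    ("customer", "usage", "account"): "customer-intelligence",
    ("warehouse", "sql"): "data-warehouse",
}

def load_skill_for(request: str, skills_dir: Path = Path("skills")) -> str:
    """Return the one skill file relevant to this request, not all of them."""
    text = request.lower()
    for patterns, skill in ROUTES.items():
        if any(p in text for p in patterns):
            return (skills_dir / f"{skill}.md").read_text()
    return ""  # no skill matched: fall through to the bare model

# The matched skill file is appended to the system prompt for this request
# only, keeping every other request's context lean.
```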
The substrate target is every firm token reachable to agents under identity policy
The destination of the work in this chapter is precise. The substrate is in place when, modulo the per-identity permissions developed in 2.2, an agent can answer any question about firm state without a human first compiling the data. Every artifact the firm produces — calls, meetings, Slack threads, email, calendars, code, CRM records, support tickets, internal databases, finance ledgers — is reachable through a single retrieval surface, and the agent's read on any artifact respects the same access policy a human would. That is the operational definition of what 2.2's identity layer enforces against and what every architectural decision in this chapter contributes to.
Current frontier context windows cannot hold a firm's full token volume in a single call, which is why the context graph is the architecture rather than long-context cramming. A mid-sized firm produces tokens at a rate that exceeds even the largest published context window by orders of magnitude across any meaningful operating window, and even where the volume technically fits, the long-context-degradation effect tracked in HELM benchmarks compresses recall toward the middle of the input. The graph is what makes the substrate target reachable: ingest, normalize, govern, ontologize, and the agent retrieves the relevant slice at task time.
The diagnostic that a firm has reached the substrate target is operational rather than architectural. Pick five questions across five functions — what was decided about pricing in last quarter's exec session, which customer accounts have raised the same compliance concern in the last 90 days, what are the last three commits to the highest-value repository and what tests are gating their merge, which decisions in this month's hiring funnel reference an AI-fluency criterion, what is the current churn risk on the firm's top ten accounts. An agent should answer each question against the substrate without a human pre-compiling any input, with provenance back to the artifacts the answer references. Five out of five within seconds is the substrate target met. Anything less is the work that remains.
Token Metabolism ingests, digests, and ontologizes every operational token
The metabolism is a four-stage processing pipeline with two streams feeding it. A fully-remote firm tokenizes close to 100 percent of its information by necessity — there are no hallway conversations to lose. A co-located firm has to explicitly capture calls, meetings, and hand-written artifacts to match. The target is the same on both sides: every operational artifact reaches the pipeline.
- Ingest. Every token-producing source connects through a managed connector layer with policy enforcement in the middle. Batch sources (CRMs, ERPs, git repos, email archives) sit alongside streaming sources (live calls, product analytics, support chats). Failure mode at this stage: an uncaptured channel produces a blind spot the agent fills with hallucinated context.
- Data model. A schema on top of the raw data. Normalization, entity extraction, timestamp alignment, and deduplication turn raw tokens into typed objects with references to source (a sketch of the typed-object shape follows this list). Failure mode: a data model built by the agent rather than by humans produces a probabilistic reconstruction of the firm's operational model, not a map of it.
- Governance. Correctness, entity resolution across channels, deduplication of near-duplicates, and access control. This is the layer where the same person appearing as "Nazar P" in Telegram and "Nazar Petrov" in Gmail becomes one canonical entity. Failure mode: an ontology with no named owner compounds staleness — within months the graph cannot be trusted, and the agent's output degrades in lockstep.
- Ontologize. The agent reverse-engineers processes and available actions from the governed data. Cross-functional agents can produce a diagram of how orders flow from pipeline to invoice without anyone having documented that flow by hand. Failure mode: ontology rebuilt without provenance chains loses its ability to ground any individual claim against raw source.
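The typed-object shape the data-model stage produces, as referenced in the second bullet above — a sketch with illustrative field names, not a published schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TypedToken:
    """One normalized artifact: a typed object that always points back to raw source."""
    token_id: str
    kind: str                  # "slack_message", "call_segment", "commit", ...
    text: str
    entities: list[str] = field(default_factory=list)  # canonical IDs, not raw aliases
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    source_uri: str = ""       # path or URI into the raw store
    source_offset: tuple[int, int] = (0, 0)  # byte range in the raw artifact

token = TypedToken(
    token_id="tok-8841",
    kind="call_segment",
    text="We agreed to hold the enterprise tier at current pricing through Q2.",
    entities=["person:nazar-petrov", "deal:acme-renewal"],
    source_uri="raw/calls/2026-03-11-acme.vtt",
    source_offset=(10432, 10519),
)
```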
A second stream runs alongside the batch pipeline. Ad-hoc tokens feed in real-time — a live call streams into context as it happens, so an agent joining the same call five minutes later can reason over what has already been said.
The pattern is not only for software firms. Walmart's frontline-AI rollout inside the associate app works the same way at retail scale: the associate agent's context bundle spans store-floor layout, inventory state, associate schedules, and customer-pattern data, and the agent reasons across all of them when it compresses shift planning from roughly 90 minutes to 30. Different substrate, same metabolism.
Agents assemble a Context Bundle from three ontology layers plus a live stream
An agent working on any task queries three ontology layers in parallel — personal (the individual's Slack threads, email, calendar), company (the firm's operational artifacts), and world (research, external events, general knowledge) — and pulls from the ad-hoc stream of tokens arriving right now.
The assembled bundle carries five components: the company knowledge graph; the detailed product and domain taxonomy (every feature, every use case, with links to code and example SQL); the per-client or per-project sub-graph; the ad-hoc stream; and the agent's own past and current sessions for continuity. The architectural claim is that there is no operational token outside these five components; anything the agent needs to reason well is somewhere in the bundle. The agent retrieves the relevant slice at task time rather than carrying the full library in every context. The Context Bundle is the working unit; the metabolism is the pipeline that produces the material the bundle pulls from.
Memory is text files scored by recency-weighted confidence at retrieval
Three memory types apply to agents, borrowed from neurophysiology. Semantic memory is facts and knowledge — what pre-training gave the model. Episodic memory is cause-and-effect tied to sequences, where chronology matters: a customer refused, was offered a discount, then bought again. Procedural memory is how-to — the step-by-step instructions that, in an AI-native firm, are skill files. Models today arrive with only semantic memory. Episodic and procedural have to be built on top, and the storage layer that survived 2025-2026 is almost always the same shape: markdown files.
The public corroboration is Anthropic Cowork. Felix Rieseberg's framing: "Memory is just text files. It's really the model being instructed: hey, if anything is pertinent that you might want to remember in the future, just write it down." The most capable agent product of 2026 ships with the simplest possible memory model — markdown, per-project isolation, no database. The sophistication is elsewhere: in the sandbox, the consent model, the per-project boundary. Memory storage is not where the problem lives.
Relevance decays with age. Every extracted fact carries a timestamp, and retrieval applies a recency-weighted confidence score: the more time elapsed since a fact was recorded, the lower the weight a retrieving agent assigns to it. A linear decay across the last two years with a hard cutoff beyond that is enough for most firms. Without the decay function the knowledge base degrades into a museum — the agent cites the pre-Q4 pricing tier as current, quotes a policy that was revised last quarter, references a Product Director who left the firm in October. The cost of the missing decay policy lands as a plausible summary that is quietly out-of-date, and by the time the cost is visible, the agent has already produced work from the stale base.
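A minimal implementation of that policy — linear decay over two years, hard cutoff beyond:

```python
from datetime import datetime, timezone

HORIZON_DAYS = 730  # two-year linear decay window

def recency_weight(recorded_at: datetime, now: datetime | None = None) -> float:
    """Confidence multiplier for a stored fact: 1.0 when fresh, falling
    linearly to 0.0 at the two-year horizon, 0.0 beyond it."""
    now = now or datetime.now(timezone.utc)
    age_days = (now - recorded_at).days
    if age_days >= HORIZON_DAYS:
        return 0.0  # hard cutoff: retrieved only if re-confirmed since
    return 1.0 - age_days / HORIZON_DAYS

# At retrieval, a fact's score is its base relevance times
# recency_weight(fact.recorded_at): stale facts sink, fresh ones float.
```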
Ground truth chains preserve provenance; continuous graph updates preserve freshness
Every synthesized document links back to its source tokens. When an agent analyzes calls, it inserts direct quotes with timestamps so a human can listen to the exact moment. Every node in the knowledge graph traces back to raw source data. The chain reads: insight → reasoning → quote → source file → raw artifact. Implementation is lightweight — markdown links with byte-offset anchors, Obsidian wiki-style [[source#line-42]] backlinks, or a citation-registry sidecar file the agent is required to populate before emitting output. Without the chain, agents produce plausible summaries nobody can verify, and those summaries accumulate as the firm's operational record. With the chain intact, a VP reading the agent's quarterly review can click any claim and hear the call that produced it.
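One lightweight enforcement of the chain, assuming the sidecar-registry variant — the marker syntax and field names here are illustrative:

```python
import re

# Sidecar registry the agent must populate before emitting output:
# claim marker → provenance record down to the raw artifact.
registry = {
    "c1": {
        "quote": "we'll commit to the Q3 date if legal signs off this week",
        "source_file": "processed/calls/2026-02-19-acme.md",
        "raw_artifact": "raw/calls/2026-02-19-acme.mp3",
        "timestamp": "00:41:07",
    },
}

def verify_claims(output_md: str) -> list[str]:
    """Return claim markers like [^c1] that have no registry entry.
    An output is emitted only when this list is empty."""
    cited = set(re.findall(r"\[\^(c\d+)\]", output_md))
    return sorted(cited - set(registry))

draft = "The customer confirmed the Q3 rollout date on the Feb 19 call.[^c1]"
assert verify_claims(draft) == []  # every claim traces back to a raw artifact
```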
The graph is not rebuilt nightly. New tokens are compared against the existing graph as they arrive. Novel information writes directly to the appropriate node. Information that contradicts an existing node is flagged for an agent, which resolves the contradiction if the data is clear and escalates to a human when ambiguous. Ground-truth references are always preserved across the update. The pattern eliminates stale cache and batch-reindex downtime at once, and makes it possible for a query issued at 2 pm to reflect a call that ended at 1:55 pm.
Horizon. Continuous live-graph updates with real-time contradiction resolution are an emerging pattern with partial production deployments in 2026 rather than a universal default. Most firms still run overnight or every-few-hours batch reindex. The operating claim in this chapter is that continuous updates are what the maturing substrate is converging on; the practical threshold where a firm switches from batch to continuous is when a stale half-day of data starts producing agent-output errors the firm can measure.
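A sketch of that routing decision, with the resolve-versus-escalate rule reduced to a timestamp-and-provenance check — the rule itself is an assumption for illustration, not a production policy:

```python
def apply_token(graph: dict, node_key: str, new_fact: dict) -> str:
    """Route one arriving token: write if novel, resolve if clearly
    contradictory, escalate to a human when ambiguous. Ground-truth
    references travel with the fact in every branch."""
    existing = graph.get(node_key)
    if existing is None:
        graph[node_key] = new_fact                  # novel: direct write
        return "written"
    if existing["value"] == new_fact["value"]:
        return "duplicate"                          # nothing to do
    # Contradiction: newer, well-sourced facts win automatically;
    # anything else is queued for a human.
    if new_fact["recorded_at"] > existing["recorded_at"] and new_fact["source_uri"]:
        graph[node_key] = new_fact
        return "resolved"
    return "escalated"
```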
Raw stays preserved, processed feeds the agent, and the agent reads only consumable surfaces
The pipeline runs three stages in order. Raw source data (transcripts, PDFs, spreadsheets, Notion pages, Slack exports, email archives, call recordings) moves through a processed intermediate where normalization, deduplication, entity resolution, and timestamp alignment happen. This is where LLMs build relationships and the context graph. The output is a clean agent-readable surface — the only thing agents query.
Two practical rules carry most of the weight. Normalization is roughly eighty percent of the work — punctuation cleanup, entity extraction, deduplication, tagging, and schema validation together form a sub-pipeline with several small models and deterministic scripts. A firm that under-invests in normalization discovers it at the moment an agent produces a confidently wrong answer that traces to a bad row in the processed layer. The second rule is to never skip stages. Agents see only the clean surface; the raw stays in a separate store (Git, object storage, or equivalent) as insurance so the firm can reprocess when schemas change or a cleaning pass was wrong. API access can be revoked at any time — account bans, service shutdowns, geopolitical restrictions, vendor churn. A local copy of the raw data is the only guarantee. The pipeline steps — Source, Extraction, Normalization, Storage, Indexing, Consumer — get documented explicitly for every data source the firm ingests.
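One way to make the per-source documentation concrete — a sketch of a single source's six-stage record, with every field value illustrative:

```python
# One row of the per-source pipeline inventory (see the pipeline-stage
# task in Run this week). A plain dict keeps it greppable and diffable.
GONG_CALLS = {
    "source":        "Gong API, daily batch export",
    "extraction":    "transcript + metadata pulled per call ID",
    "normalization": "speaker labels, timestamps, topic markers, dedup",
    "storage":       "raw/ in object storage; processed/ as markdown + YAML",
    "indexing":      "entity index + BM25 over the processed surface",
    "consumer":      "customer-intelligence agents (read processed only)",
}
```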
Identify the atomic unit of meaning for each data stream before running it through the pipeline. A Telegram message is too granular to be the atomic unit; a dialogue segment with enough context to understand intent is. A single email is too granular; a thread of replies is. A financial transaction is too granular on its own; a transaction plus its metadata — participants, associated contracts, legal entities, dates — is the unit that stays self-contained when the agent reads it later. A raw call transcription without speaker labels, timestamps, or topic markers is an unstructured dump; a formatted transcript with those annotations is the unit that carries meaning. Getting the granule wrong is the most common reason agents produce shallow results from deep data.
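A sketch of granule construction for a chat stream — the 30-minute silence gap as the segment boundary is an illustrative choice, not a published constant:

```python
from datetime import timedelta

GAP = timedelta(minutes=30)  # silence longer than this starts a new segment

def to_dialogue_segments(messages: list[dict]) -> list[list[dict]]:
    """Group raw chat messages into dialogue segments — the atomic
    unit with enough surrounding context to carry intent."""
    segments: list[list[dict]] = []
    for msg in sorted(messages, key=lambda m: m["ts"]):
        if segments and msg["ts"] - segments[-1][-1]["ts"] <= GAP:
            segments[-1].append(msg)   # same conversation continues
        else:
            segments.append([msg])     # gap exceeded: new atomic unit
    return segments
```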
Pick the substrate — files or graph — by workload, not by engineering preference
Files-as-substrate is a production-proven pattern, not a law of physics. Markdown plus YAML frontmatter is human-legible, agent-legible, Git-diffable, and grep-able. Hierarchical CLAUDE.md files acting as tables of contents, with cross-references and regex-validated links, scale to thousands of files without structural redesign. Claude Code and similar harnesses walk the markdown tree the way a careful human would.
Graph-first is the counter-pattern, and it wins under specific workloads. Multi-hop relationship traversal — legal entities with ownership chains, fraud-network analysis, supply-chain routing, compliance investigation across related parties — benefits from graph databases (Neo4j, Neptune, TigerGraph) where the agent issues Cypher or SPARQL queries and reasons over structured results. Markdown-first assumes the agent loads text into context and reasons over it; graph-first assumes the agent issues queries against a schema. The decision follows the workload:
| Primary workload | Substrate | Why |
|---|---|---|
| Load human-readable context, reason fluidly | Markdown + YAML + Git (Claude Code, Obsidian, filesystem-native harness) | Grep-able, diff-able, readable by both human and agent |
| Multi-hop relationship traversal (ownership chains, fraud networks, supply chains, compliance) | Graph DB (Neo4j, Neptune, TigerGraph) | Cypher/SPARQL over structured results scales past what a markdown tree can answer |
| Mixed: under 50,000 entities with occasional traversal | Markdown with JSON entity index | Filesystem + index hybrid; both queries work |
| Unknown — still discovering the workload | Start with markdown; measure query patterns for a quarter | Substrate-change cost grows with data volume |
Picking the substrate by engineering preference rather than by workload produces a stack that looks coherent on a whiteboard and breaks under the first real load. The recurring failure mode is the same from both directions: the wrong substrate amplifies the cost of every query the agent is actually asked to run.
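What the graph-first row looks like in practice — a sketch using the Neo4j Python driver, with the entity label, relationship name, and credentials assumed for illustration:

```python
from neo4j import GraphDatabase

# Multi-hop ownership-chain traversal: the query a markdown tree cannot
# answer without loading every entity file into context.
CYPHER = """
MATCH path = (e:LegalEntity {name: $name})-[:OWNS*1..5]->(target:LegalEntity)
RETURN [n IN nodes(path) | n.name] AS chain, length(path) AS hops
ORDER BY hops
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(CYPHER, name="Acme Holdings"):
        print(record["chain"], record["hops"])
driver.close()
```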
Skip RAG when the corpus fits in context, use hybrid-with-citations when it does not
For business tasks where the useful corpus fits inside a frontier context window — roughly several hundred thousand tokens today and rising — a retrieval pipeline is often net-negative overhead. Build cost, maintenance cost, and failure surface exceed the savings. A firm whose operational corpus fits in a few hundred thousand tokens can hand the whole thing to a cheap-tier frontier model in one call; inference pricing has been falling by large multiples year-over-year, and the economics that forced retrieval pipelines in 2023-2024 have shifted.
Long-context degradation is a real effect at the far end of a large context window — language models systematically lose information positioned in the middle of long inputs, a finding first established in 2024 research and tracked in continuing benchmarks. The practical threshold at which a firm moves from in-context to retrieval is not a single number; it is the size at which the firm's own evals start showing quality drop.
When the corpus is too large, hybrid retrieval with citations is what survives production. The 2026 reference implementation is NetApp's Hybrid RAG — BM25 first for deterministic traceability on exact identifiers, vector re-ranking for semantic relevance, every chunk tagged with source document, version, ingestion timestamp, and byte offsets. Outputs lacking a valid citation are rejected before they reach the user. The decision follows corpus size and governance requirements:
| Corpus size + constraints | Approach | Stack |
|---|---|---|
| Fits in context, no multi-tenant isolation required | Full corpus in-call with schema-guided JSON output | Cheap-tier frontier model + strict output schema |
| Too large for context, or requires regulated provenance | Hybrid retrieval with enforced citation | BM25 + vector re-ranker + OpenSearch or equivalent + provenance tags (NetApp pattern) |
| Unknown, still discovering corpus shape | Eval-driven threshold test at 10K / 50K / 200K / 500K token context sizes | Pick the knee where quality drops |
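A minimal sketch of the hybrid row — BM25 candidates first, vector re-rank second, provenance tags carried through — assuming the rank_bm25 package and a stand-in embed() function in place of a real embedding model:

```python
import numpy as np
from rank_bm25 import BM25Okapi

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g. a sentence-transformer)."""
    raise NotImplementedError

def hybrid_retrieve(query: str, chunks: list[dict], k: int = 5) -> list[dict]:
    """BM25 for deterministic recall on exact identifiers, then vector
    re-ranking of the candidates for semantic relevance. Every chunk
    keeps its provenance tags, so citations survive retrieval."""
    bm25 = BM25Okapi([c["text"].split() for c in chunks])
    scores = bm25.get_scores(query.split())
    candidates = [chunks[i] for i in np.argsort(scores)[::-1][: k * 4]]
    q = embed(query)
    candidates.sort(key=lambda c: float(np.dot(q, embed(c["text"]))), reverse=True)
    top = candidates[:k]
    for c in top:  # mirror of the rejection rule described above
        if not (c.get("source") and c.get("ingested_at")):
            raise ValueError("chunk missing provenance tags — reject before the user sees it")
    return top
```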
What fails in both regimes is the 2024 naive pattern: embed everything into a vector database without provenance or structure. Pure vector search destroys entity relationships, temporal context, and cross-domain connections. It produces a probabilistic reconstruction rather than knowledge — a confident summary that cannot be verified against any specific source. The industry's move to hybrid-and-schema-aware retrieval happened because the pure-vector pattern failed often enough in production that the cost became visible.
The corollary that holds across every subsection of this chapter: never dump raw data into the agent's memory. The agent sees only processed, normalized, schema-validated data. The never-skip-stages rule is what keeps the boundary intact.
Entity resolution, documentation, and schema remain human responsibilities
Three responsibilities inside the pipeline stay structurally human in 2026, and the firm that tries to hand them off to the agent produces a context layer that decays faster than it accumulates.
Entity resolution is the first. Language models cannot reliably determine that "User1567" in Telegram is the same person as "Nazar Petrov" in email. Master entity profiles — mapping all aliases, contact methods, and associated records for a person, a company, a deal, or a product — need to be created and curated by humans. Once the master profiles exist, even small open-source models can do the linking at the pipeline layer. Most reported "agent hallucinations" on business data are this data-layer failure in disguise: the agent faithfully reproduced bad input, and the input was bad because the entity graph was incomplete.
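A sketch of the master-profile shape: the profiles themselves are hand-curated, and the linking that follows is mechanical enough for a dict lookup or a small model — names and aliases here are illustrative:

```python
# Hand-curated master profile: the part that stays human.
MASTER_PROFILES = {
    "person:nazar-petrov": {
        "display_name": "Nazar Petrov",
        "aliases": {"Nazar P", "Nazar Petrov", "User1567", "nazar@firm.com"},
    },
}

# Alias → canonical-ID index derived from the profiles: the part a
# small model (or a plain lookup) can run at the pipeline layer.
ALIAS_INDEX = {
    alias: entity_id
    for entity_id, profile in MASTER_PROFILES.items()
    for alias in profile["aliases"]
}

def resolve(alias: str) -> str | None:
    return ALIAS_INDEX.get(alias)

assert resolve("User1567") == resolve("Nazar P") == "person:nazar-petrov"
```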
Documentation runs on two separate disciplines that both matter. The first is mutual exclusivity with collective exhaustiveness — no overlap between sub-points while the set covers the domain. The second is link-don't-duplicate — when the same claim appears in two places, the second is a reference to the first, not a copy. In an AI-native firm, agents execute on the documentation. A CLAUDE.md that is out of date produces wrong agent behavior the same way stale code does, and the two disciplines together are what keep the documentation maintainable as the firm's skill library grows into the hundreds.
Schema is the third. Forcing JSON output validated against a strict schema — and describing reasoning steps as JSON as well — improves quality and consistency more than prompt tuning does. Small open-source models produce results close to frontier quality under this discipline. The pattern generalizes: the harder the schema the agent has to respect, the less the agent's variance shows up in the output. Validation has to happen at the data layer with deterministic scripts, not at the model layer through prompt instructions, for the rail to actually hold.
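A minimal example of the rail, using the jsonschema package — the schema itself is illustrative:

```python
from jsonschema import ValidationError, validate

# Strict schema for one agent output: reasoning steps are data too.
SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "answer": {"type": "string"},
        "citations": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
    "required": ["reasoning", "answer", "citations"],
    "additionalProperties": False,
}

def accept(raw_model_output: dict) -> dict:
    """Deterministic validation at the data layer: output that fails the
    schema never enters the pipeline, however fluent it reads."""
    try:
        validate(instance=raw_model_output, schema=SCHEMA)
    except ValidationError as err:
        raise ValueError(f"rejected agent output: {err.message}") from err
    return raw_model_output
```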
Four anchor cases carry the context pattern across four substrates
Four public implementations cover different slices of Token Metabolism. Reading all four shows the category converging rather than any single vendor defining it. Each case runs through the same three questions: what does the firm ingest, what does its context substrate uniquely carry, and what does the context architecture make possible that a thinner stack would not?
Shopify. What it ingests: every role request, every hiring justification, every internal argument about whether a workflow can be automated. What the substrate uniquely carries: the phase metadata — which workflows are in prototype mode, which are in build, which are in operate — plus the accountability assignments that travel with each phase transition. What the substrate makes possible: Tobi Lütke's "prove AI can't do it" hiring default. A firm that cannot compute whether AI can do the work for a proposed role has no way to enforce the rule. The context substrate is what turns the memo from a public statement into an operational gate.
Block. What it ingests: Slack messages, emails, pull requests, code, Google documents, recorded meetings — the full artifact stream of the post-restructure firm. What the substrate uniquely carries: the intelligence-layer abstraction on top of the artifact stream, which Dorsey and Botha describe as the move from hierarchy to intelligence — the firm becomes queryable about its own state rather than legible only through layered reporting. What the substrate makes possible: board members can ask the company a question and the company answers, and the February 2026 workforce restructure (developed in 1.3) landed cleanly because the coordination substrate was already operating.
Anthropic Cowork. What it ingests: the user's prompts, per-folder scoped file access, per-domain network access, and the model's own decisions about what to remember. What the substrate uniquely carries: per-project isolated markdown memory, with the model itself deciding what to write down. What the substrate makes possible: the most capable agent product of 2026 with the simplest possible memory model. Context engineering is not a database problem. Firms that spend engineering quarters on an elaborate memory store and produce little user-visible benefit have optimized the wrong layer.
Improvado. What it ingests: roughly a thousand SaaS-vendor data integrations plus internal operational artifacts — calls, commits, docs, tickets. What the substrate uniquely carries: the parallel query across the personal, company, and world ontologies, assembled at task time from a knowledge substrate built on markdown with YAML frontmatter, with session as the first-class unit of work. What the substrate makes possible: the workspace the firm built around this — Miras — is agent-first from the inside out, with sessions as first-class entities and one-click switching between UI view and terminal. The deepest available evidence that the architectural claims in this chapter work at the scale of a firm running its business on them.
Six failure modes map to specific pipeline stages
Each named failure ties to a specific Token Metabolism stage. The pattern is the failure one encounters when that stage is skipped or misbuilt, not a generic cautionary tag.
- Raw dump — Ingest stage failure. Ten thousand unstructured files in an Obsidian vault is not context. The data is visible but not retrievable; the agent runs blind and hallucinates. The fix is the three-stage pipeline — raw stays separate, processed is what the agent queries, and the index is what the agent searches.
- Atomic-unit error — Data-model stage failure. Treating a single message as the atomic unit when the dialogue segment is the correct granule. A single transaction when the thread is. Produces plausible summaries that miss the point. The fix is per-stream granule design before any data reaches the pipeline.
- Confidently wrong — Cross-stage failure that surfaces at the agent. Better models produce more convincing wrong answers on bad data. The most common version at business scale: the agent cites a customer-support policy that was revised two quarters ago, or quotes a price tier that changed last month, because the stale entries in the processed layer never got a recency penalty at retrieval. The fix is upstream: the schema and provenance chain that guarantee the agent is reading what the firm meant to store.
- Stale memory — Governance stage failure. No recency scoring, no decay policy. Two-year-old assumptions treated as current. A product-marketing agent confidently references a SKU the firm sunset in Q3 of the prior year; a customer-success agent cites an SLA commitment that was renegotiated six months ago. The fix is the recency-weighted confidence score at retrieval plus a named ontology owner who signs off on staleness sweeps on a scheduled cadence.
- Retrieval without structure — Ontology stage failure. The 2024 pattern — embed everything, retrieve by similarity, hope for the best — became the 2026 anti-pattern. Destroys entity relationships, temporal context, and cross-domain connections. The fix is hybrid with citations when the corpus is too large to fit in context, and no retrieval pipeline at all when it fits.
- Ontology without governance — Governance stage failure at the concept level. A knowledge graph is only as good as its oldest unreviewed entry. Without explicit ownership, staleness compounds across entity categories; within months the graph cannot be trusted, and the agent's output degrades in lockstep. The fix is a named owner per ontology layer plus scheduled reviews that resolve contradictions flagged at ingest.
Run this week
Six concrete tasks, each with a deliverable and a time box.
- Context-sufficiency diagnostic (1 hour). Pick the firm's highest-volume production agent. Pick a representative task it ran in the last 24 hours. Assemble the Context Bundle it had access to — the five components (company knowledge graph, product taxonomy, per-client sub-graph, ad-hoc stream, prior sessions). Ask a fresh instance of the same agent for three next actions on the same task. Two or more strong suggestions means the context layer is carrying its weight. Three generic or hedged answers means the data, skills, or memory layer is thin. Deliverable — a one-page note marking which of the five Bundle components were empty or thin.
- Atomic-unit audit (2-3 hours). For each of the firm's top five ingested data streams, build a four-column table: stream name, current atomic unit, correct atomic unit, pipeline change required. Most firms discover they are treating messages as the unit when dialogue segments are correct, or transactions when threads are. Deliverable — the filled table plus an owner per row.
- Pipeline-stage inventory (4-8 hours). Document the Source → Extraction → Normalization → Storage → Indexing → Consumer chain for the top five data sources the firm ingests. Any source where raw is not separately preserved is a reprocessing risk. Any source where the agent queries raw instead of processed is a confidently-wrong risk. Deliverable — six-column matrix, five rows, with risks flagged per cell.
- Ground-truth-chain test on one agent output (1 hour). Pick the most recent synthesized document the agent produced. Trace any single claim back through the chain — insight, reasoning, quote, source file, raw artifact. If no synthesized documents exist yet because the firm is pre-implementation, pick one recent agent output and enumerate which source tokens should have been cited; the gap is the provenance work to add at ingest. Deliverable — a single-claim trace, pass or fail.
- Entity-resolution pass on the master profile (1-2 days). Pull the firm's top fifty entities — people, companies, deals, products. For each, enumerate every alias the firm's data uses. Build the canonical profile by hand. Substrate options: a Google Sheet for the fifty-row table, markdown files per-entity in an entities/ directory, or directly into the firm's existing CRM if it supports aliases as a first-class concept. Once the profile exists, even a small model can do the ongoing linking. Deliverable — a living fifty-row master file with named owner.
- Long-context threshold test (2-3 hours). Take one representative task. Run it at 10,000, 50,000, 200,000, and 500,000 token context sizes using a fixed input set (a harness sketch follows this list). Score output quality on a binary rubric — acceptable or not. Record the knee — the size at which quality drops from stable to degraded. That number is the firm's in-context-vs-retrieval boundary and the threshold at which a retrieval pipeline earns its keep. Deliverable — a four-row table with quality-score per context size.
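A harness sketch for the threshold test, as referenced in the last item — run_agent is a stand-in for the firm's actual model call, and the token-to-character ratio is a crude assumption:

```python
SIZES = [10_000, 50_000, 200_000, 500_000]

def run_agent(task: str, context: str) -> str:
    """Stand-in for the firm's model call — not a real API."""
    raise NotImplementedError

def threshold_test(task: str, corpus: str, tokens_per_char: float = 0.25) -> dict:
    """Run one fixed task at four context sizes; score each output
    pass/fail by hand and record where quality drops — the knee."""
    results = {}
    for size in SIZES:
        context = corpus[: int(size / tokens_per_char)]  # crude token→char cut
        output = run_agent(task, context)
        print(f"--- {size:,} tokens ---\n{output}\n")
        results[size] = input("acceptable? [y/n] ").strip() == "y"
    return results  # the first False after a True marks the knee
```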
The next chapter picks up harness engineering — the outer loop that wraps a model with verification, tool-registry, memory-writeback, and loop-detection so the Context Bundle assembled here compounds into production-grade agent output rather than a single-shot answer.