The Playbook

The Compounding Firm

A Compounding Firm runs three numbers up every quarter without growing headcount: revenue per employee, cycle time, and idea-to-prototype latency. The mechanism is recursion through time. The same agents, skills, and processes that ran the firm yesterday rewrite themselves to run it better tomorrow, and the deltas accumulate. Two feedback loops do the rewriting. An inner loop makes existing work cheaper, faster, and more reliable. An outer loop searches for what the firm should become next. The previous chapter developed the engineering shape of those loops; this chapter develops what compounds when the founder runs both loops together for long enough that the curve goes vertical.

This through-time recursion pairs with the across-scale recursion named in the previous chapter by Stafford Beer's Viable System Model. Across-scale recursion says the same five-function pattern operates at every level of the firm. Through-time recursion says the same five-function pattern operates at every horizon — yesterday's loop is the substrate today's loop runs on, and tomorrow's loop will be the substrate next quarter's outer-loop search builds against. The compounding curve emerges only when both recursions hold. Without the time axis, a firm runs efficiently against today's plan and gets caught by competitors who searched out the next form of the business. Without the scale axis, the same firm produces interesting one-off products and never compounds across functions.

A compounding firm runs an inner exploitation loop and an outer exploration loop on the same scoreboard

The firm has two metabolisms. The inner loop runs exploitation. The outer loop runs exploration. James March's 1991 paper Exploration and Exploitation in Organizational Learning named the dynamic: the two loops compete for the same scarce resources, and a firm that over-invests in either fails on long horizons. The cybernetic framing sharpens the consequence. The inner loop is the firm's S3, the operations layer that executes against today's plan. The outer loop is the firm's S5, the identity layer that decides what the firm should be next. When S5 collapses into S3, when the only loop running is the inner one, the firm hits Ashby's variety wall on the time axis rather than the scale axis: the regulator stops keeping up with the regulated system's evolution.

The inner loop tracks cost per successful task, cycle time, and incident rate. The outer loop is less familiar but just as concrete in its instrumentation: experiments per quarter, novelty rate of the wins, kill rate. The kill rate matters most because it distinguishes a real outer loop from a marketing slide that says the team runs experiments. A firm that does not kill products on a fixed cadence is running gradient ascent on the existing portfolio, which is what Stanley and Lehman's Picbreeder result predicts will fail. Their distillation: to achieve your highest goals, you must be willing to abandon them. Operationally, the team writes down the kill threshold before the experiment runs and honors the threshold when the experiment misses, even when the team that built the experiment wants to keep iterating.

Two 2026 firms run the loops publicly enough to anchor the pattern. Stripe's published engineering discipline runs the inner loop with the Minions Blueprints state machine. Agent nodes for code production are wrapped inside deterministic nodes for linting, testing, and pushing. The blueprint runs a subset of linters as a deterministic node within the agent devloop and loops on that lint node locally before pushing, then runs one iteration against the full CI suite as part of the standard blueprint. After the second push and CI run, the branch goes back to a human operator for manual scrutiny — the published rule that caps each task at two CI rounds before escalation. Stripe's disclosed numbers run as inner-loop metrics: more than 1,300 minion-produced PRs merged each week, agent-merged PR rate, CI pass rate, cycle time. The discipline is to make existing work cheaper and faster, week over week, with measurable deltas tracked against the prior quarter's baseline.

Anthropic's product organization runs the outer loop publicly enough to read. Boris Cherny, Head of Claude Code, told Lenny's Podcast in February 2026 that he ships 20-30 pull requests per day running multiple Claude instances in parallel, and that he has not written a single line of code by hand since November 2025. He documented a thirty-day stretch of 259 PRs, 497 commits, and 40,000 lines added on his own X account in March 2026, with every line written by Claude Code plus Opus 4.5. In the same Lenny's episode, Cherny describes Cowork — a Claude Code surface for non-engineers — being built in under two weeks before its January 12, 2026 launch. The pattern works because the experiment cost has collapsed: building a full product surface in under two weeks rather than a quarter changes what experiments-per-quarter means in operational terms. Running ten experiments and killing seven costs less than running one experiment and shipping it whether or not it landed.

The pattern across both firms holds whether the work is exploitation or exploration: the inner loop has a measurable scoreboard the team checks weekly, the outer loop has a measurable scoreboard the team checks at the same cadence, and neither loop is a side project. The compounding signal is whether both loops are improving at the same time — outer-loop experiments-per-quarter rising while inner-loop cycle-time falls. The anti-signal is the most common failure mode at single-loop firms: inner-loop wins are easier to celebrate, the outer loop is allowed to atrophy, and growth flattens about three quarters later when the existing-product cycle of refinement runs out of room. The firm-level diagnostic is whether the team can name a specific product killed last quarter against a written-down threshold. Without that artifact, the language of compounding is being applied to optimization, and optimization has a ceiling the team will hit during a quarter when the existing-product growth rate flattens for the first time without an obvious cause.

Every operator gets a mirror agent that other agents can query

Every operator in a compounding firm has a mirror agent. The operator can be an employee, a contractor, or an external coordinating agent. The mirror holds the operator's full context: calendar, recent decisions, current ownership, open threads, the running record of what the operator is working on this week. Other agents can query the mirror on the operator's behalf, scoped to what the requesting party is authorized to see. The mirror functions as an authoritative interface to the operator's state, exposed to the firm's coordination fabric, rather than as a chatbot or a productivity tool aimed at the operator.

Why this compounds is a marginal-cost argument. In a 100-person firm without the layer, getting a single answer that depends on inputs from a dozen people requires scheduling time on a dozen calendars, holding the meeting, and writing up the result. With the mirror layer, the requesting agent reads each operator's mirror and synthesizes the answer without scheduling anything human. The synchronous meeting becomes the exception rather than the default. The first-order benefit is meeting time saved. The compounding effect is the second-order one: hours that used to go into alignment flow into the outer loop instead — into experiments, kills, and the search for what to ship next.
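The fan-out can be sketched in a few lines. Everything below is illustrative — a hypothetical `Mirror` interface with per-key visibility rules, not any vendor's API — but it shows the two properties that matter: the answer is scoped at retrieval time to what the caller may see, and no human calendar is touched.

```python
from dataclasses import dataclass, field

@dataclass
class Mirror:
    """Hypothetical mirror agent: authoritative state for one operator."""
    operator: str
    state: dict                                      # e.g. {"owns": ..., "open_threads": [...]}
    visibility: dict = field(default_factory=dict)   # key -> set of roles allowed to read it

    def query(self, key: str, caller_role: str):
        # Scope at retrieval time: return only what this caller is authorized to see.
        if caller_role not in self.visibility.get(key, set()):
            return None
        return self.state.get(key)

def synthesize(mirrors, key, caller_role):
    """Fan one question out to every mirror and collect the authorized answers,
    replacing the meeting that would otherwise be scheduled."""
    return {m.operator: answer
            for m in mirrors
            if (answer := m.query(key, caller_role)) is not None}

mirrors = [
    Mirror("alice", {"owns": "billing-migration"}, {"owns": {"planner"}}),
    Mirror("bob",   {"owns": "refund-policy-v3"},  {"owns": {"planner", "support"}}),
]
print(synthesize(mirrors, "owns", "planner"))
# {'alice': 'billing-migration', 'bob': 'refund-policy-v3'}
print(synthesize(mirrors, "owns", "support"))
# {'bob': 'refund-policy-v3'}
```

The design choice worth noticing is that authorization lives in the mirror, next to the state it guards, so the requesting agent never holds data it was not entitled to read.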

Glean's Enterprise Context platform is the cleanest public anchor for the architecture at scale. Glean's January 2026 engineering post describes the layered substrate: connectors, indexes, an organization-wide knowledge graph, and a per-employee personal graph that reflects each operator's projects, collaborators, and working style. The personal graph is the mirror's authoritative state. The connectors and indexes keep it current. The knowledge graph is what the requesting agent traverses to find the right operator to query. Glean's February 2026 third-generation release added two features that operationalize the mirror further: personal-graph insights the user can see, edit, and delete; and agent sandboxes that allow the assistant to run long-running tasks without LLM context-window limits while still respecting enterprise permissions. The user-controllable layer matters because it is what keeps the mirror from becoming surveillance theater. When the operator cannot see what their mirror exposes to whom, trust collapses and adoption stops.

Three architectural options surface in 2026 for where the mirror's authoritative state actually lives. The first is read-only over the operator's own workspace — calendar plus mail plus files plus tasks, exposed to the firm's agents through MCP connectors over Google Workspace, Microsoft 365, or equivalent. This is what most firms ship first because the technical investment is low and the privacy story is clean. The second is a centralized people-graph maintained by a vendor or a central platform team — the Glean shape — which is the right answer once the firm passes a few hundred operators. The third is a federated architecture where each operator's own agent is the source of truth and the firm's coordination layer queries those agents directly. The federated design scales further than the centralized one because the bottleneck moves from a single platform team to the operators themselves, and because the operator owns the policy decisions about what to expose. Most firms ship the first option as a starting point, run the second once the platform team can support it, and migrate toward the third when the platform matures. Botha and Dorsey named the same shape from a different angle: capabilities at the bottom, interfaces in the middle, mirrors-and-coordination at the top.

The failure mode is mirror agents that turn into surveillance theater. The discipline that holds is twofold. Operators see and edit what their mirror exposes to whom, and the policy is enforced at retrieval time rather than at ingestion time. The agent never returns what its caller is not authorized to see. Without that retrieval-time enforcement, the architecture's privacy story collapses on the first internal incident, and adoption goes with it.

SOPs that live as versioned code in a repository compound; SOPs in Confluence decay

A standard operating procedure that lives as a Markdown file in a repository, with version history, a review process, and machine-executable steps, behaves qualitatively differently from one that lives in Confluence. The first compounds because every revision is reviewable, reversible, and tied to the incident or feedback or regulatory change that motivated it. The second decays because none of those properties hold.

The ladder runs in three rungs. SOP-as-document is the Confluence baseline: a written description of how the work should be done, accessible to humans who go looking, invisible to agents that need to execute. SOP-as-skill is the next rung — a callable skill file with eval criteria, executable by an agent at runtime, versioned in the team's skill repository, with the merge process governed the way code is governed. The Anthropic Skills format (a SKILL.md plus an evaluation harness in a .claude/skills/<name>/ directory) is the publicly documented form factor at this rung; Stripe's blueprint definitions and Block Goose's recipe files are equivalent shapes elsewhere. SOP-as-policy-bundle is the third rung — a skill plus the permissions plus the metrics plus the kill switch, deployed and observed as a coherent unit. A firm at the third rung answers "what is our current refund policy" with a git log against a single file, audits when it last changed and who reviewed it, and runs a regression suite against the policy to verify that the agents executing it produce the expected outcome on a panel of historical cases.
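As a sketch, the third rung can be pictured as a single versioned artifact. The `PolicyBundle` shape below is illustrative — an assumption of this sketch, not a vendor format — but it makes the point that permissions, metrics, and the kill switch ship and roll back together with the skill.

```python
from dataclasses import dataclass, field

@dataclass
class PolicyBundle:
    """Hypothetical third-rung artifact: skill + permissions + metrics + kill
    switch, deployed and observed as one unit (all field names illustrative)."""
    name: str
    skill_path: str                         # e.g. ".claude/skills/refunds/SKILL.md"
    allowed_tools: list = field(default_factory=list)
    watch_metrics: list = field(default_factory=list)
    kill_switch: bool = False               # one flag disables the skill everywhere

    def executable(self) -> bool:
        # Agents check this gate before running the skill.
        return not self.kill_switch

refunds = PolicyBundle(
    name="refund-policy",
    skill_path=".claude/skills/refunds/SKILL.md",
    allowed_tools=["stripe_refund", "crm_lookup"],
    watch_metrics=["approval_rate", "chargeback_rate"],
)
assert refunds.executable()
refunds.kill_switch = True    # flipped in one diff, propagated with the bundle
assert not refunds.executable()
```

Because the whole bundle is one file under version control, "when did this change and who reviewed it" is a git question rather than an archaeology project.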

The compounding mechanism is in the diff log. Institutional memory that used to live in a senior employee's head and walked out the door when they left now accumulates in the policy repository. The firm's policy state at any moment is git checkout away. The new hire who wants to know why the refund policy is what it is reads the merge request that introduced the current version, sees the incident that motivated it, and understands the reasoning without having to find the senior employee who happens to remember.

GitLab's public handbook is the reference implementation, predating the AI-native era by years. Every GitLab policy lives as Markdown in a public repository with a clear merge workflow. The handbook is queryable, diffable, and forkable. The AI-native contribution is the executability layer above it — the SOP file becomes a callable artifact agents can run, not just human-readable Markdown. Microsoft's Agent Governance Toolkit, released as open source on April 2, 2026, is the citable infrastructure for the policy-execution layer. The toolkit's Agent OS package is a stateless policy engine that intercepts every agent's action before execution at sub-millisecond latency, supporting YAML rules, OPA Rego, and Cedar policy languages. The Agent Mesh package adds cryptographic identity using decentralized identifiers with Ed25519 and the Inter-Agent Trust Protocol for secure agent-to-agent communication. The Agent Compliance package maps controls to the EU AI Act, HIPAA, SOC2, and all ten OWASP agentic AI risk categories. The package set is what production-grade SOP-as-code infrastructure looks like in 2026.

The killer metric for this layer is policy propagation latency: the time from "we changed the policy" to "every agent in the firm runs the new policy."

  • Sub-day: what compounding firms hit in practice.
  • Sub-week: the firm is doing well, not yet compounding.
  • Sub-month: the policy still lives as document rather than as code.
  • Over a quarter: the firm has a Confluence problem dressed up as a governance problem.

The metric matters because it is the leading indicator of the firm's ability to learn. A firm whose policies take a quarter to propagate is operating against last quarter's environment, while a firm that ships the same change in a day is running an outer loop on its own procedures and accumulating the deltas against competitors that do not.
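The metric itself is simple arithmetic over two logs. A toy computation, assuming the merge timestamp comes from the policy repository and the per-agent deploy timestamps from the fleet's deploy log (names and numbers illustrative):

```python
from datetime import datetime, timedelta

def propagation_latency(policy_merged_at, agent_deploy_times):
    """Time from the policy merge to the LAST agent running the new version.
    The slowest agent sets the number, which is the point of the metric."""
    slowest = max(agent_deploy_times.values())
    return slowest - policy_merged_at

merged = datetime(2026, 3, 2, 9, 0)
deploys = {
    "support-agent": datetime(2026, 3, 2, 9, 20),
    "billing-agent": datetime(2026, 3, 2, 11, 5),
    "triage-agent":  datetime(2026, 3, 2, 16, 40),
}
lat = propagation_latency(merged, deploys)
print(lat)                       # 7:40:00
assert lat < timedelta(days=1)   # sub-day: the compounding-firm threshold
```

Note that an average over agents would hide the laggard; the max is what tells you whether some agent somewhere is still executing last quarter's policy.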

The failure mode is SOPs-as-code that are not actually machine-readable: PDFs in a repository, screenshots inside Markdown, prose paragraphs that an agent cannot parse into executable steps. The repository becomes version-control theater. The test for whether an SOP is actually code is whether an agent can retrieve, parse, and execute the SOP at runtime against a test case and produce the expected output. If the answer requires a human to translate the SOP into agent instructions every time it runs, the SOP is still document and the firm has not climbed the ladder.
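The executability test can be run mechanically. A minimal sketch: the numbered-step-with-backticked-tool convention below is an assumption of this example, not a standard, but any parseable convention supports the same check — either the agent can extract concrete steps at runtime, or the SOP is still a document.

```python
import re

def executable_steps(sop_markdown: str):
    """Hypothetical check: an SOP counts as code only if an agent can parse it
    into concrete tool calls. Here a step is a numbered line that names a tool
    in `backticks` (a convention assumed for this sketch)."""
    steps = []
    for line in sop_markdown.splitlines():
        m = re.match(r"\s*\d+\.\s+.*`(\w+)`", line)
        if m:
            steps.append(m.group(1))
    return steps

sop = """\
# Refund SOP
1. Look up the order with `crm_lookup`.
2. If under $50, issue the refund with `stripe_refund`.
3. Log the decision with `audit_log`.
"""
assert executable_steps(sop) == ["crm_lookup", "stripe_refund", "audit_log"]

prose_sop = "Refunds should generally be handled promptly and with care."
assert executable_steps(prose_sop) == []   # a document, not code: fails the test
```

A real harness would go one step further and execute the parsed steps against a panel of historical cases, but the pass/fail boundary is the same: zero parseable steps means the firm has not climbed off the first rung.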

The enterprise operations graph is the firm's working memory across teams and across time

The substrate beneath the loops, the mirrors, and the SOPs is the operations graph. Without an operations graph the firm runs on a goldfish architecture: every cross-team query starts from zero. With one, the same query lands on a state that is consistent across teams and over time.

Three layers compose the graph. At the bottom, the data layer holds records, events, and embeddings — the firm's empirical state of the world. The ontology layer above it carries entities, relationships, and semantic links that turn those records into a model of what they mean. Access rules live in a third, policy layer that decides who can see what under which conditions and with what justification. Every agent in the firm reads from this graph; every action the firm takes writes back to it.

The compounding mechanism is cross-team and cross-time accumulation of context. Context that lives in the graph compounds across teams because a salesperson's question about a customer reads from the same authoritative state engineering used to ship the last release and finance used to recognize the revenue. Context that lives in spreadsheets and chat threads decays in place because the spreadsheet's author retires, the chat thread scrolls off the screen, and the next person who needs the same answer rebuilds it from primary sources.

Four productized patterns dominate the 2026 vendor landscape. Palantir's Foundry Ontology is the longest-running public exemplar: object types define entities, properties define their characteristics, link types define relationships, and action types define how the firm writes back to the graph. The result is what Palantir calls a digital twin of the organization, with governance enforced at retrieval time. The agent never sees what its caller is not authorized to see. Databricks Unity Catalog plus Genie is the second: Genie is the agent layer that queries the catalog, with governance inherited from the platform and lineage tracked at table and column level. The third is Glean Enterprise Graph (cited above for the personal-graph layer); the same graph that powers the mirror agents serves as the operations graph the rest of the firm queries. Salesforce Agentforce on Data Cloud is the fourth, with unified customer profiles serving as the grounding substrate for agent actions on customer-facing surfaces. Each is missing pieces that the firm has to fill in: deeper ontology coverage in adjacent domains, finer-grained policy enforcement on long-running tasks, better lineage on agent-written records, integration with the firm's specific tooling.

A buy heuristic per pattern: Foundry for regulated or multi-system enterprise ontology, Unity Catalog when the firm is already on Databricks, Glean for knowledge-work coordination, Salesforce Agentforce + Data Cloud when the firm's center of gravity is the customer-facing surface. The build-versus-buy decision splits at scale. Founder-led firms under 100 people should buy; the integration tax of building exceeds the strategic value of owning the substrate at that scale, and time-to-first-useful-graph is measured in weeks rather than quarters. Above 500 people the calculus inverts toward building, because the integration surface is too specific and the ontology too domain-specific for an off-the-shelf vendor to capture. The middle is where most of 2026's mistakes live — half-built graphs that never get a permission model, half-bought graphs that never get the firm's actual ontology, and the worst case where both sides are partially complete and the firm runs two operations graphs simultaneously without either being authoritative.

The failure mode is operations graphs that are not query-time-permissioned. If the agent can see records its caller cannot, the graph is a data lake with marketing copy on top, and the next regulatory incident is already scheduled. The discipline that distinguishes a real operations graph from a re-labeled data lake is the same one Palantir and Databricks productized: governance enforced at retrieval, on every query, with an audit trail that says which agent saw which record on whose behalf and why.
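A toy version of that discipline, with the audit trail the paragraph describes. The record schema and ACL shape are illustrative assumptions, not any vendor's model; the point is that filtering happens inside the query path, on every call, and each disclosure is logged with the who/for-whom/why triple.

```python
from datetime import datetime, timezone

AUDIT = []

def graph_query(records, agent, on_behalf_of, acl, reason):
    """Governance at retrieval time: filter every query by the CALLER's
    permissions and log which agent saw which record, on whose behalf, and why."""
    visible = [r for r in records if on_behalf_of in acl.get(r["id"], set())]
    for r in visible:
        AUDIT.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "on_behalf_of": on_behalf_of,
            "record": r["id"],
            "reason": reason,
        })
    return visible

records = [{"id": "cust-17", "arr": 120_000}, {"id": "cust-23", "arr": 45_000}]
acl = {"cust-17": {"sales", "finance"}, "cust-23": {"finance"}}

rows = graph_query(records, agent="pipeline-bot", on_behalf_of="sales",
                   acl=acl, reason="renewal forecast")
assert [r["id"] for r in rows] == ["cust-17"]   # cust-23 filtered at query time
assert AUDIT[0]["on_behalf_of"] == "sales"
```

If the filter instead ran at ingestion time, a later permission change would leave stale copies readable; enforcing at retrieval means the current policy governs every answer.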

A small founding team can ship many products from one operating substrate

The architectural shifts above produce a new economic shape: a small founding team that ships many products from one operating substrate. Teams are often three people or fewer; the cadence runs to two or three new products a week up to first paid traffic; the operating discipline is to kill losers fast and scale winners. Software has become nearly free in marginal cost, so the cost has migrated to taste, distribution, and the judgment about what to keep. The Business Factory is what a compounding firm looks like at the consumer-software margin structure.

The cleanest public anchor for the pattern is Wix's June 2025 acquisition of Base44. Solo founder Maor Shlomo, six months from founding to acquisition, $80 million cash plus earn-outs through 2029, 250,000 users at acquisition, profitable. Wix bought the operating discipline that produced the product — the discipline of killing losers fast and scaling winners fast, paired with the new-app-template tempo — rather than the codebase. The codebase was a six-month-old artifact that any competent team could have rebuilt in a quarter. The premium was for the discipline that compounds.

The pattern works in well-defined domains and breaks in others. It works in consumer software, SMB tools, content products, vertical AI applications, and most internal-tools surface area inside larger firms. Three classes of business reject it. Regulated industries (healthcare, finance, legal) hit it on the cost of being wrong: a single bad launch can produce a liability that compounds faster than the speed of iteration recovers. Physical-operations companies (logistics, manufacturing, energy) hit a different ceiling, because the marginal cost of the next product is dominated by physical infrastructure rather than software. Deep-tech with multi-year R&D cycles is the third exclusion; the experiment-per-week tempo cannot be retrofitted onto a multi-year synthesis loop. The Business Factory is a software-margin pattern, and retrofitting it onto businesses with different margin structures is the most common founder mistake when the pattern goes viral on social media.

The composite skill the founder is exercising in a Business Factory is running an outer loop on the product portfolio at much higher tempo than a traditional firm can sustain, with the inner loop shared across products via the same operating substrate. Each product launch is a single experiment in a portfolio the firm runs continuously, so the operating discipline is closer to a venture portfolio than to a product team — most launches die on schedule, and the deaths are the system working rather than the system failing. The founders who succeed at the pattern have explicit kill thresholds, honor them when the team that built the killed product wants to keep iterating, and reinvest the freed capacity into the next experiment rather than sitting on the savings. Most operators run the kill review on a weekly or biweekly cadence with thresholds written down before each experiment ships; quarterly is the slowest cadence that produces a kill rate the diagnostic triad can pick up.

The failure mode is Business Factory rhetoric applied to single-product firms. The deck says the team kills products fast. The execution shows one product that has been shipping for two years and three abandoned experiments that nobody tracks anymore. On its own the language is harmless; the failure starts when the firm believes the deck and skips the discipline the deck was meant to describe.

Four diagnostics distinguish a compounding firm from one that has only the marketing copy of compounding

The Diagnostic Triad. Revenue per employee, cycle time, and idea-to-prototype latency improve quarter over quarter for at least four consecutive quarters. Improvement here means non-trivial slope, not measurement noise. Sub-3-percent quarter-over-quarter movement on any of the three is within noise and does not constitute compounding; firms running the discipline well typically post high-single-digit to low-double-digit percent improvement on at least one of the three each quarter, with the other two not regressing. Failure to improve any one of the three for two consecutive quarters is the canary that something is off — either the inner loop has saturated, or the outer loop has stopped killing experiments, or the operations graph has accumulated enough drift that cross-team queries no longer return clean answers.
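The noise floor can be checked mechanically. A sketch with the 3 percent threshold from above and illustrative numbers; metrics that improve downward (cycle time, latency) are negated so the same test applies:

```python
def compounding(series, noise=0.03, quarters=4):
    """True if the metric beat the noise floor on each of the last `quarters`
    quarter-over-quarter steps. `series` is oldest-first; pass improving-down
    metrics negated so that improvement always reads as an increase."""
    steps = list(zip(series, series[1:]))[-quarters:]
    return len(steps) >= quarters and all(
        (b - a) / abs(a) > noise for a, b in steps)

rev_per_employee = [400, 420, 455, 500, 560]    # $k: steps of 5-12% q/q
cycle_time_days  = [9.0, 8.2, 7.4, 6.5, 5.9]    # improving down, so negate

assert compounding(rev_per_employee)
assert compounding([-d for d in cycle_time_days])
assert not compounding([400, 405, 410, 415, 420])  # ~1.2% steps: within noise
```

The last case is the one the paragraph warns about: a smooth upward line that never clears the noise floor is measurement drift wearing a compounding costume.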

Working / Theater pairs. The diagnostic that distinguishes the actual pattern from the rhetoric of the pattern. Each pair contrasts a working version with the theater version that looks superficially similar.

  • A skill written by one team runs across every agent in the firm in the same release. The theater version is a chat channel where people post skills and nobody else reuses them.
  • An SOP change merges and propagates to every agent within twenty-four hours. The theater version is an SOP change posted in a Confluence page nobody reads.
  • The operations graph answers a cross-team question in seconds. The theater version routes the same question to four different people and takes a day, ending with a hand-curated spreadsheet emailed to the requester.
  • Outer-loop experiments-per-quarter rises while inner-loop cycle-time falls. The theater version is both metrics flat while the firm reports "AI productivity gains" on earnings calls.
  • The firm has a non-zero kill rate on its own products in the most recent quarter. The theater version is a backlog of products the firm has not formally killed but does not actively maintain either.

Non-Engineer Contribution Rate as the single best leading indicator. The percentage of skills shipped by people whose job title is not "engineer" reveals whether the substrate is actually substrate or whether it is a central-team gatekeeping function. Measure it as the share of skills called in production at least once in the trailing thirty days whose authoring commit was made by a non-engineer (skills in the registry that are never invoked do not count). Practitioner reports through 2026 cluster the working range in the middle to upper single digits; firms with effectively zero non-engineer contribution to production skills are running a central-AI-team architecture and have already started to bottleneck, whether or not the bottleneck is yet visible in team metrics. The leading indicator captures three things at once: the substrate is usable by non-experts, the firm's culture rewards contribution from outside engineering, and the operations graph is permissioned well enough that non-engineers can write to it without breaking things.
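Measured as defined above, the computation is a filter and a ratio. A toy sketch with day numbers standing in for timestamps and illustrative skill and title data:

```python
def non_engineer_rate(skill_authors, last_called_day, today, window_days=30):
    """Share of production-live skills (called within the trailing window)
    whose authoring commit came from a non-engineer. `skill_authors` maps
    skill -> author job title; `last_called_day` maps skill -> day of last call."""
    live = {s for s, day in last_called_day.items() if today - day <= window_days}
    if not live:
        return 0.0
    non_eng = sum(1 for s in live if "engineer" not in skill_authors[s].lower())
    return non_eng / len(live)

skill_authors = {"triage": "Support Lead", "refunds": "Engineer",
                 "forecast": "RevOps Analyst", "deploy": "Engineer",
                 "dormant": "Designer"}
last_called_day = {"triage": 95, "refunds": 99, "forecast": 80,
                   "deploy": 88, "dormant": 20}   # dormant: not called in window

rate = non_engineer_rate(skill_authors, last_called_day, today=100)
assert rate == 0.5   # 2 of 4 live skills (triage, forecast) by non-engineers
```

The dormant skill is excluded by construction, which is the definitional point: registry entries nobody invokes should not flatter the number.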

The central-AI-team anti-pattern. When a single central team owns every skill, functional teams wait on central, central burns out under load, and the transformation stalls early. The structural fix is the inverse of the instinct that produced the failure: central owns primitives (the operations graph, the skill registry, the LLM gateway, the FinOps controls); functional teams own their own skills, with central reviewing for security and compliance but not for content. Most firms get this wrong on first attempt because the central team's first instinct when adoption goes well is to control the next wave from the same center. The fix is to push ownership outward as soon as the primitives are stable, even when the central team feels they could do it better themselves. They probably could; the cost of the bottleneck exceeds the cost of the quality variance.

The 1% reframe. Almost no work in a compounding firm runs at one hundred percent automation. The last one percent is irrational to automate because the blast radius of a wrong answer does not justify the savings. The compounding firm stops comparing AI versus human cost on the same task and asks three different questions: where does the cost of being wrong exceed any savings; which decisions carry blast radius a human must retain; what hurts most if it fails silently. The set left after that filter — production-database migrations, customer-refund decisions over a dollar threshold, public-statement sign-off, hiring-manager final calls, regulatory filings — is what the remaining humans hold, and the chapter on what remains human develops what those humans actually do.

No firm has fully reached this state in 2026. The compounding curve compounds quietly at first; the firm one or two years ahead does not look two or more times more productive on any single quarter's metrics, but the gap on the four-quarter trajectory is visible to a careful reader. By the time the gap shows up in market-cap-per-employee comparisons, it has been forming for six to eight quarters in the operating metrics that compound underneath. The next part of the playbook takes up the market consequences — what survives when building is cheap, where the Intent Economy reshapes distribution, how the cyber-economy fork plays out — for the firms that have crossed this line and the firms that have not.