The Shift
Map every recurring process to one of four autonomy levels
The transition exercise from Part 1 to Part 2: every recurring process in the team or organization gets assigned to one of four autonomy levels — Always Human; AI Prepares, Human Finalizes; AI Executes, Human Supervises; or Fully Autonomous — then ranked by the gap between its current level and the level it could structurally reach. The map is the input Part 2 assumes the reader has produced.
The founder's four-market filter and the executive's two-axis test from 1.4 collapse into one concrete next action. Before reading Part 2, list every recurring process your team or organization runs and record, for each one, both the level where the work sits today and the level it could structurally reach if the tool constraint were removed. Part 2 is the architecture of moving a process up the autonomy ladder, and this exercise is what produces the map.
Four autonomy levels cover every recurring process
The levels sort processes by how much human involvement each one requires at execution time.
- L1 — Always Human. The task requires physical presence, legal signature, or judgment that cannot be specified in advance. Signing a $2M wire, representing the firm in court, presenting to the board, making a strategic acquisition call.
- L2 — AI prepares, human finalizes. The agent drafts; the human reviews, adjusts, approves. The defining property of L2 is that the human is in the loop per-artifact — every output passes through a gate before it leaves the firm. Contract redlines, earnings-call scripts, monthly financial commentary, architectural design documents.
- L3 — AI executes, human supervises. The agent runs the workflow; the human monitors for drift and handles exceptions. The defining property of L3 is that the human is in the loop per-exception — supervisors watch aggregate metrics and intervene when the distribution shifts, not when any single output looks wrong. Lead qualification, inbound-ticket routing, market scanning, first-pass resume screening.
- L4 — Fully autonomous. The agent operates within a budget, a set of circuit breakers, and defined escalation rules, with no per-run human in the loop. Competitor monitoring, code generation inside a tightly scoped domain, invoice data extraction, anomaly detection.
A single firm runs all four levels at once. Board governance sits at L1 by definition. External communication has typically been an L2 task, with agents drafting and humans approving. Inside-sales operations — lead qualification, outbound sequencing, CRM hygiene — can run at L3 in a firm that has built the supervision layer. Competitor intelligence is one of the earliest processes firms routinely run at L4: a nightly crawl with a scored summary, with human attention reserved for the weekly trend review rather than for any single run. Choosing the right level per process is design work.
The four wrong-level failures each have a specific cost:
- L4 overreach. An agent placed where judgment was actually required produces silent errors at machine speed; the cost lags and compounds because nobody reviews individual outputs. The invoice-extraction agent that was circuit-broken on dollar thresholds but not on vendor-name drift is the canonical symptom.
- L3 supervision gap. A process nominally at L3 without the actual supervision mechanism — aggregated metrics, exception queues, drift detection — behaves as de facto L4 without the L4 guardrails. The common symptom is an exception queue that grows faster than the supervisor can drain it, at which point the system silently reverts to unsupervised execution.
- L2 under-delivery. Keeping an end-to-end automatable flow at L2 burns the margin recovery the firm could have captured; the cost surfaces as unit-economic drift rather than a visible failure, and the gate itself becomes a rubber-stamp theater that pays the supervisor-time cost without the oversight benefit.
- Defensive L1. A process held at L1 on autopilot when the tool layer has moved past it hides a gap competitors have already crossed. Vendor-agreement reviews under a $50K threshold and policy-compliance checks against a codified rubric are the categories most often stuck here.
Each autonomy level carries its own governance envelope. The level determines not only how much the agent does on its own but also what governance the firm applies around it. The envelope varies along four axes — scope of access, approval gate, budget primitive, and audit mode:
- L1 — Always Human. Read-only scope; full human review per output; no write budget needed because the agent does not act; full audit.
- L2 — AI Prepares. Draft-scope writes; human approval per output (the per-artifact gate is L2's defining property); capped per-output token budget; full audit.
- L3 — AI Executes, Human Supervises. Pre-approved scope; exception-only escalation (per-exception, not per-artifact — this is L3's defining property); skill-level token budget; sampled audit with full audit on exceptions.
- L4 — Fully Autonomous. Self-directed within budget; circuit breakers active; per-identity token budget with alerts; structured audit on every invocation.
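One way to make the envelope concrete is to encode it as a policy table that deployment tooling can check before an agent runs. The sketch below is illustrative: the field names, string values, and `policy_diff` helper are assumptions for this example, not a standard schema the book prescribes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GovernanceEnvelope:
    """One autonomy level's policy record (field names are illustrative assumptions)."""
    scope: str          # what the agent may read or write
    approval_gate: str  # per-artifact, per-exception, or circuit breakers only
    budget: str         # shape of the token/spend budget
    audit: str          # full, sampled, or structured per invocation

# The four envelopes from the list above, one per autonomy level.
ENVELOPES = {
    "L1": GovernanceEnvelope("read-only", "human review per output",
                             "none (agent does not act)", "full"),
    "L2": GovernanceEnvelope("draft-scope writes", "human approval per output",
                             "capped per-output tokens", "full"),
    "L3": GovernanceEnvelope("pre-approved", "exception-only escalation",
                             "skill-level tokens", "sampled; full on exceptions"),
    "L4": GovernanceEnvelope("self-directed within budget", "circuit breakers",
                             "per-identity tokens with alerts", "structured per invocation"),
}

def policy_diff(from_level: str, to_level: str) -> dict:
    """Return the axes whose policy changes when moving between two levels."""
    a, b = ENVELOPES[from_level], ENVELOPES[to_level]
    return {f: (getattr(a, f), getattr(b, f))
            for f in ("scope", "approval_gate", "budget", "audit")
            if getattr(a, f) != getattr(b, f)}
```

Diffing L2 against L3 in this table returns all four axes, which previews the point made next: an upgrade changes the whole policy shape, not a single dial.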
The consequence for practitioners: upgrading an agent from L2 to L3 is not a volume knob. It is a policy redesign. The scope expands, the approval gate moves from per-artifact to per-exception, the budget primitive changes shape, and the audit mode shifts. Teams that treat an L2-to-L3 upgrade as a toggle discover the transition produces the compliance and cost incidents the rest of the playbook catalogs.
Six maturity levels describe how far the firm has climbed as a whole
Per-process autonomy is one axis. Firm-level maturity is the orthogonal axis the firm walks across all of its processes at once. A firm can run a single process at L4 (a competitor-monitoring crawler operating fully autonomously) while the firm itself sits at maturity Level 1, where individuals use AI tools but nothing about the firm's substrate has changed. The two axes answer different questions, and the firm's transition plan needs to address both.
Six maturity levels cover the path from no real adoption to a hypothetical fully autonomous firm. The diagnostic test for each level is what is true at the firm rather than what is true at the most advanced process inside the firm.
- L0 — Theater. Vendors run demos, the firm runs pilots, and nothing the firm produces externally has changed. The diagnostic test: when an outside observer compares this firm's output to a peer that has not adopted AI at all, the outputs are indistinguishable. The McKinsey-versus-BCG measurement gap from 1.2 (88 percent of organizations report AI use, 5 percent achieve scale) is the pool of L0 firms reporting adoption while the substrate stays empty. L0 ends when at least one process has shipped a measurable change in unit economics on a cited business metric, not when the first license is purchased.
- L1 — Personal productivity. Individuals use AI tools (Cursor, Claude, ChatGPT, Copilot) on their own work, with no firm-wide substrate beneath the use. The diagnostic test: each individual's productivity gains are real but disappear when the individual leaves, because nothing has been encoded into the firm's substrate that another individual could reuse. Most enterprises in mid-2026 sit at L1 across most functions, even when a small number of processes inside the firm have moved further. L1 ends when the firm starts encoding the high-leverage individual workflows into shared, agent-readable artifacts.
- L2 — Team-level skills. Teams share skill files, prompts, and agent harnesses, with at least one master skill or harness library that grows as the team's working knowledge does. Ramp's Dojo marketplace — a Git-backed library of 350-plus shared skills serving 700 daily active users — is the public anchor for the shape. The diagnostic test: a new team member productively uses an agent on the team's work within a day because the team's tacit knowledge has become legible to the agent. L2 ends when the team's skills compose with the firm's other functions through a shared substrate rather than living as parallel team-level libraries.
- L3 — Org infrastructure. The load-bearing maturity milestone. Every artifact produced inside the firm is reachable by agents subject to per-identity policy: calls, meetings, messages, mail, calendars, code, CRM records, support tickets, internal databases, finance ledgers. The 2.3 chapter on context engineering develops the substrate work; this is the firm-level state when the substrate is in place. Block's restructure (1.3) and the company-as-intelligence framing in Dorsey and Botha's Sequoia essay are the public anchors for the operating-model consequence. The diagnostic test: any question about firm state that an exec used to ask through a status-meeting chain returns from a single agent query against the substrate, with provenance, in seconds. Most firms that look advanced at the function level are still pre-L3 because at least one of the artifact streams (call recordings, finance ledgers, customer support archives) is not yet reachable.
- L4 — Initiative-driven self-improving. The firm runs at L3 and the substrate now generates initiatives the firm acts on. Agents propose process redesigns, surface accounts the sales team has not yet prioritized, identify exception patterns the operations team has not yet labeled, and flag decisions where the firm's prior reasoning would have produced a different outcome than the one shipping today. The 2.7 chapter on self-improving skills describes the loop at the skill level; L4 is the firm-level expression. The diagnostic test: at least one initiative the firm acted on in the last quarter originated with an agent's proposal rather than with a human operator. No published 2026 firm has reached this level firm-wide; the closest public examples are individual loops at firms otherwise operating at L3. The Shopify "prove AI can't do it" memo is the leading-indicator signal for the cultural posture L4 requires.
- L5 — Autonomous org. The firm itself initiates and executes projects with humans setting the policy envelope and monitoring on exception. Aspirational. The closest existing analog is finance trading desks, where decision algorithms have operated autonomously inside policy envelopes for over a decade and the human role is risk policy and monitoring rather than per-trade decision-making. No firm outside that narrow domain operates at L5 in early 2026, and the playbook treats it as a horizon target rather than a current-state plan.
The maturity axis is firm-level; the autonomy axis named earlier in this chapter is per-process. Most firms will run their first L4-autonomy process before reaching firm-level maturity L3, because the first-process work is concentrated and the firm-level substrate work is broad. The two move on different clocks: the autonomy-map exercise that follows produces the per-process plan, and the maturity ladder produces the firm-level plan. 2.1's operating-model duality, 2.3's context-engineering substrate, 4.5's cybernetic-firm framing, and 4.7's compounding-firm thesis each build toward the maturity-ladder destination, with L3 as the load-bearing milestone the rest of Part 2 develops.
Run the exercise in four steps
Work through the full recurring-process list once, top to bottom. Budget roughly three hours of focused work for a 50-person organization and a full day for a 500-person organization.
- List every recurring process. Pull from three sources: every recurring meeting on your team's calendars over the past four weeks, every role description on the HR wiki or HRIS with its work verbs extracted (review, approve, draft, reconcile, qualify, triage), and the top five Slack channels by volume scanned for verbs that repeat weekly. Aim for 20-40 items. Do not filter yet.
- Classify each process against today's level. Apply a three-question rubric per process. Does a human produce the first draft today? Yes puts it at L1 or L2. Does a human review every output before it reaches its audience? Yes is L2; no opens L3. Does the system run without human presence per run, bounded only by budget and alerts? Yes is L4; no is L3. First-pass classifications typically overstate current state — a PM copy-pasting a ChatGPT draft into a deck twice a week is L2 today, not L3, no matter how much the PM feels AI is doing the work.
- Mark the gap between today and the structurally achievable level. Apply three screens. Reversibility: Is the action reversible within a working day at acceptable cost? If no (wire transfer, published statement, production deploy without rollback), cap at L3. Regulatory: Does a regulator, contract, or board policy require a named human signatory? If yes, hard cap at L2. Input shape: Are the inputs bounded and the decision rule encodable? If yes, at minimum L3 is on the table; if inputs are open-ended, cap at L2 until the supervision layer is built. A process that could be L3 but runs at L1 because no agent has been deployed gets the notation L1→L3. One that could sit at L4 but operates at L2 because nobody has built the circuit breakers is L2→L4. A process correctly at L1 stays L1=L1.
- Rank by gap-volume and record. For each row: process | owner | current_level | structural_level | gap (integer) | monthly_runs | minutes_per_run | monthly_minutes_at_stake (gap × runs × minutes). Rank by monthly_minutes_at_stake. The top five are the targets. A quarterly board review with a one-level gap loses to a daily inbound-triage workflow with a two-level gap precisely because the volume multiplies the structural leverage.
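The ranking step can be sketched in a few lines. The column names mirror the row schema above; the two rows of data are invented for illustration, not taken from any real firm.

```python
# Rank processes by monthly_minutes_at_stake = gap × monthly_runs × minutes_per_run.
# Example data is invented: a quarterly board review vs. a daily triage workflow.
rows = [
    {"process": "board review",   "owner": "CEO", "current_level": 1,
     "structural_level": 2, "monthly_runs": 0.33, "minutes_per_run": 240},
    {"process": "inbound triage", "owner": "Ops", "current_level": 1,
     "structural_level": 3, "monthly_runs": 22, "minutes_per_run": 45},
]

for r in rows:
    r["gap"] = r["structural_level"] - r["current_level"]
    r["monthly_minutes_at_stake"] = r["gap"] * r["monthly_runs"] * r["minutes_per_run"]

ranked = sorted(rows, key=lambda r: r["monthly_minutes_at_stake"], reverse=True)
targets = ranked[:5]  # the top five are the targets
```

On this invented data the daily triage workflow (gap 2 × 22 runs × 45 minutes = 1,980 minutes at stake) outranks the quarterly board review (gap 1 × 0.33 × 240 ≈ 79 minutes), which is exactly the volume effect the ranking step describes.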
The deliverable is a single spreadsheet with the columns above, plus a one-paragraph summary at the top: "X percent of recurring minutes sit at L1-L2 today; structural ceiling is Y percent; top five moves recover Z hours per month."
Warning. First-pass autonomy maps overestimate how much work sits at L3 or L4 today and underestimate how much could sit there structurally. Overclaiming the current state hides the restructure work that still has to be done. The opposite error — holding a process at its current level when the tool layer has already moved past it — pays out as margin competitors are capturing six months before the firm notices. Validate the map with two readers: one function-peer in another company who can challenge the structural_level column without political cost, and one cross-function peer inside the firm who can flag processes the owner is quietly protecting. If either reader classifies more than three rows differently, redo the rubric before ranking.
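The two-reader validation threshold reduces to a disagreement count. A minimal sketch, assuming each map is a dict from process name to structural level (the function name and signature are illustrative):

```python
def needs_rubric_redo(owner_map: dict, reader_map: dict, threshold: int = 3) -> bool:
    """True if the reader classifies more than `threshold` rows
    differently from the owner, signalling the rubric should be redone."""
    disagreements = sum(1 for process, level in owner_map.items()
                        if reader_map.get(process) != level)
    return disagreements > threshold
```

Run it once per reader; if either call returns True, redo the rubric before ranking.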
The exercise output is the firm's current autonomy distribution and a sequenced list of the moves that recover the most margin. Part 2 walks through the operating-model work (2.1), identity and policy (2.2), context engineering (2.3), harness engineering and agent reliability (2.4 and 2.5), skills (2.6), and the self-improving loop (2.7) that each upward move requires. The measurement-as-adoption trap from 1.2 recurs at autonomy-map level — a team reports an L3 deployment while the process actually runs at L2 behind the scenes, and the adoption metric hides the restructure work still required.
The reference box that follows names the implementation vocabulary Part 2 uses — traditional ML, LLM chat, workflow, agent, skill. Those five are independent from the four autonomy levels here. Autonomy describes how much human involvement a process requires at execution time. Automation describes which implementation technique fits a given process. A single process at L3 autonomy might run as a workflow, an agent, or a skill-loaded agent depending on problem shape.