The Machine

AI-native Operating Model

The AI-native operating model is the co-design of software and organization around capabilities that have to be discovered before they can be engineered. Every discovered capability forces two questions at once: how the system should route tokens through it, and which roles, incentives, and accountabilities must change around it. Firms that answer only one side produce AI theater or a PowerPoint transformation on a predictable schedule.

Procter & Gamble ran a 776-person field experiment on its product-innovation teams, partnering with Harvard Business School researchers, and published the findings in March 2025. Professionals worked on real product-innovation challenges in a 2×2 design — individual versus paired teams, with AI versus without — and the headline finding collapsed two separate design questions into one. Individuals with AI matched the performance of teams without AI. The same dataset also broke functional silos: R&D professionals working without AI leaned toward technical solutions while their Commercial counterparts leaned toward commercial proposals, but both groups produced balanced cross-functional solutions when AI joined the work. Professionals using AI reported more positive emotional responses than professionals working alone. The authors close the abstract with a direct editorial claim: "AI adoption at scale in knowledge work reshapes not only performance but also how expertise and social connectivity manifest within teams, compelling organizations to rethink the very structure of collaborative work".

Six months after the P&G paper was written, Morgan Stanley publicly described DevGen.AI, an in-house tool built on OpenAI GPT models and launched in January 2025. By June 2025 it had reviewed nine million lines of legacy code and saved its roughly 15,000 developers approximately 280,000 hours — the equivalent of 140 developer-years. The capability that made DevGen.AI work was not obvious to the labs building the underlying models. Commercial products did not handle Morgan Stanley's Perl codebases and internal dialects well enough to be useful; the firm's applied-AI team had to probe the models against the actual codebase, find that LLMs could read these languages well enough to translate them into plain-English specs, and only then design the AI-to-human translation workflow that turned the capability into a product engineers could use at scale.

A reader arriving with a completed autonomy map of the firm's recurring processes and the five-level automation vocabulary in hand has done the exercise that sets up this chapter. The architecture is a co-design — every discovered capability forces a software-side question (how does the system route tokens through it) and an organizational-side question (which roles, incentives, and accountabilities change around it). The operating-model duality named here runs across every architectural layer that follows.

AI capabilities arrive like physics and leave like engineering

Model capabilities in a given firm's context are under-determined by the model card. Pretraining distribution, post-training alignment, context assembly, harness amplification, and task specification all interact, and only the firm can supply the last three. The behavior of a mid-tier model on a specific pricing-segmentation task, or on a specific customer-support intent, cannot be predicted from the lab's announcement notes; it has to be pushed against actual production inputs. A pricing team probing segmentation finds that the model does 80 percent of what they had been outsourcing to a consultancy. A customer-support team pushing the same model against its ticket queue watches it collapse on domain-specific intents the firm never documented. Both findings come from contact with real inputs rather than from reading the model card.

Once a capability is visible and repeatable, compounding requires engineering it into the operating substrate. The engineering move consists of four concrete artifacts: a skill file that captures the working prompt and its invariants, a harness that enforces quality gates and logging, a workflow specification that names the inputs and outputs on either side, and a human owner whose name attaches to the accountability when the workflow misbehaves. A capability that stays in discovery generates demos and does not compound at the unit-economics level. Engineering too early locks the firm into an architecture shaped by the first few prompts it ever saw, and the next model upgrade breaks it.

Two modes run permanently in any healthy operating model. Perpetual discovery without consolidation accumulates pilots that never reach production. The reverse failure — engineering before the capability has been properly mapped — takes longer to surface but produces systems that encode wrong assumptions about the model's ceiling. The routing machinery between the two modes is itself a load-bearing component. Ramp's public Glass story is the clearest recent instance of the routing machinery failing quietly: internal AI adoption hit 99 percent, the capability was demonstrably discovered, but nothing moved from discovery to engineering until a product manager built a harness as a weekend project. No one owned the transition.

The two modes also run across two domains. Software design and organizational design have to be redrawn together. Re-architecting the harness without re-architecting the organization produces AI theater — technology that works in isolation while the firm's outputs stay flat because the surrounding roles never changed. Re-titling roles on top of unchanged tooling produces a PowerPoint transformation — new charts above hands doing the same work with the same tools. The lead case below is the deepest available evidence of both sides of the redesign happening in lockstep.

Three tests separate discovery from engineering

Three working tests carry most of the weight.

The fifty-iteration test is the crudest. If the skill has not been run roughly fifty times against real production inputs rather than demos or curated happy paths, assume the capability is still in discovery. Engineering before then locks in artifacts of the first few prompts — the initial framing, the sample inputs that happened to be on hand, the one behavior the model produced on the third try that looked like cleverness. The test comes from a simple observation: the tail of failure modes in a new capability usually surfaces between iteration twenty and iteration fifty, and a team that consolidates at ten ships a skill that generalizes poorly and gets rewritten inside a quarter.
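A minimal sketch of the count, assuming the harness writes one JSON line per skill invocation to a runs.jsonl log with a "source" field separating production traffic from demos; the file name and field names are illustrative, not a standard format.

```python
import json

def production_runs(log_path: str, skill: str) -> int:
    """Count real production invocations of a skill, excluding demo traffic."""
    count = 0
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("skill") == skill and record.get("source") == "production":
                count += 1
    return count

runs = production_runs("runs.jsonl", "pricing-segmentation")
print(f"{runs} runs:", "engineering candidate" if runs >= 50 else "still in discovery")
```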

The 2-of-3 test operationalizes context sufficiency. Give the agent the full working context it would have in production — relevant files, relevant history, relevant skills and tools — and ask for three next actions on a representative task. Two or more suggestions that are obviously correct or genuinely novel indicate the context layer is stable enough to engineer. Three generic or hedged suggestions indicate the data, skills, or memory layer is still thin. The test misfires when the agent is pattern-matching plausibly without access to domain reality; cross-check by reading the agent's reasoning trace on the suggestion it chose not to pick.
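A minimal sketch of administering the test, assuming an `agent.suggest_next_actions(task, n=3)` call on the team's own harness; that interface is an assumption, not a real API. Each suggestion is graded by hand.

```python
def two_of_three(agent, task: str) -> bool:
    """Pass if at least two of three suggestions are obviously correct or genuinely novel."""
    suggestions = agent.suggest_next_actions(task, n=3)  # agent runs with full production context
    grades = [input(f"Grade [strong/generic]: {s}\n> ") for s in suggestions]
    return sum(g == "strong" for g in grades) >= 2  # pass: context layer stable enough to engineer
```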

The eval-before-harness test is the hardest to pass. Until the team can write a binary criterion for "did this work" on the task, engineering around the capability is provisional. The binary criterion can be a structured-output check (schema validation on produced JSON), a reference comparison (the output matches a known-good example within a defined similarity threshold), or an LLM-as-judge rubric that has itself been calibrated against human scoring on a sample. What counts as "binary" softens for tasks that resist it — writing quality, strategic judgment, design decisions — but the softening is the place where the eval discipline is easy to abandon. Ramp's Glass team puts the rule in the voice a skeptical engineer would use: "the engineering discipline doesn't go away just because the AI is writing the code. If anything, you need more of it". What transfers from pre-AI engineering is the evaluation discipline. A capability without an eval cannot be engineered; it can only be hoped.
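Two of the three criteria are mechanical enough to sketch. A minimal illustration using jsonschema for the structured-output check and difflib for the reference comparison; the schema and threshold are placeholders for the task's own.

```python
import difflib
import json
from jsonschema import ValidationError, validate

SCHEMA = {  # placeholder schema for a structured-output task
    "type": "object",
    "required": ["segment", "price_floor"],
    "properties": {"segment": {"type": "string"}, "price_floor": {"type": "number"}},
}

def schema_eval(output: str) -> bool:
    """Binary: does the produced JSON validate against the schema?"""
    try:
        validate(json.loads(output), SCHEMA)
        return True
    except (ValidationError, json.JSONDecodeError):
        return False

def reference_eval(output: str, known_good: str, threshold: float = 0.85) -> bool:
    """Binary: is the output within a defined similarity threshold of a known-good example?"""
    return difflib.SequenceMatcher(None, output, known_good).ratio() >= threshold
```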

Improvado redesigns software and organization together

Improvado is a data-pipeline SaaS with more than a thousand integrations with SaaS vendors. Over the 2025-to-early-2026 window it rebuilt the software stack and the organization in parallel, and the pairing is the clearest available evidence of what the co-redesign looks like when neither side is free to lag.

On the software side: the firm abandoned its graph database after concluding the UI and maintenance overhead exceeded the analytical value at its scale. It moved to a markdown-plus-YAML file system that is human-legible, agent-legible, and Git-diffable. It built a token-metabolism pipeline — operational artifacts are ingested through a data model, pass through a governance layer that handles correctness and entity resolution, and consolidate into a knowledge graph that every agent queries at task time. The skills library the firm accumulated over 2025 eventually consolidated into a single living ontology rather than a growing catalog of one-offs, and the agent retrieves the relevant slice at the moment it needs it rather than carrying the full library in every context.
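A minimal sketch of what the markdown-plus-YAML substrate looks like in code, assuming skill files carry YAML front matter between --- fences and a tags field used for slice retrieval; the layout is an illustration, not Improvado's actual format.

```python
from pathlib import Path
import yaml  # PyYAML

def load_skill(path: Path) -> dict:
    """Parse one skill file: YAML front matter, then a markdown body."""
    front, body = path.read_text().split("---", 2)[1:]
    skill = yaml.safe_load(front)
    skill["body"] = body.strip()
    return skill  # every file stays human-legible, agent-legible, Git-diffable

def relevant_slice(skills_dir: Path, task_tags: set[str]) -> list[dict]:
    """Retrieve only the skills tagged for the task, not the full library."""
    skills = (load_skill(p) for p in skills_dir.glob("*.md"))
    return [s for s in skills if task_tags & set(s.get("tags", []))]
```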

On the organizational side, in lockstep: four role categories were eliminated wholesale. The shape common to all four is what makes the elimination defensible and applicable beyond this single case:

  1. Bounded demand for the role's output. A fixed number of outbound lead touches, a fixed number of content pieces, a fixed set of product-management deliverables, a fixed surface of frontend components to maintain.
  2. End-to-end task work an agent can now perform. Not partial assist; the task completes inside the harness without handoff back to a human mid-flow.
  3. A workflow in which the human is doing the part the agent is now better at. Routing information, formatting outputs, translating specs into artifacts — the coordination and production layer, not the judgment layer.

The four role categories Improvado eliminated (specified in 1.4) each score yes on all three questions. The role disappeared not because a single team decided a function was optional but because the agent-plus-substrate loop was engineered, which became possible only after iteration discovered that agents could produce acceptable output given the right harness. Neither the software choice nor the organizational choice is coherent alone — a clean knowledge graph on top of an org chart still routing work through the old coordination layer produces a worse outcome than either change in isolation, and cutting the roles before the substrate is in place produces a quality collapse the remaining team absorbs for two to three quarters.

Practitioner audit. Run the three-question test on the team's top eight functions: is demand for the role's output bounded? Can an agent now do the task end-to-end? In the current workflow, is the human doing the part the agent does better? Functions that answer yes to all three are the elimination shortlist. Functions that answer no to any of the three are not yet in engineering range, and the audit tells the firm what has to change — grow demand, improve the agent, or shift the human role up the judgment ladder — before that function is ready to restructure.
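The audit reduces to three booleans per function. A minimal sketch, with illustrative function names and data:

```python
from dataclasses import dataclass

@dataclass
class FunctionAudit:
    name: str
    bounded_demand: bool     # demand for the role's output is fixed
    agent_end_to_end: bool   # the task completes inside the harness, no mid-flow handoff
    human_below_agent: bool  # the human is doing the part the agent does better

functions = [
    FunctionAudit("outbound lead touches", True, True, True),
    FunctionAudit("pricing strategy", False, False, False),
]
shortlist = [f.name for f in functions
             if f.bounded_demand and f.agent_end_to_end and f.human_below_agent]
print("elimination shortlist:", shortlist)
```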

Five public cases trace the pattern across substrates

The same co-design move shows up in software-native firms, in frontline retail, in regulated finance, and in the public-company CEO essays that describe the shift. Five 2025-2026 cases are legible from public sources.

Block. The February 2026 restructure (developed at full length in 1.3) is the public benchmark; the chapter-specific angle here is the sequence. Goose, Block's open-source coding harness, launched January 28, 2025. The intelligence layer that Dorsey describes as the new coordination substrate was operating a full year before the workforce cut, and the post-cut org shape from the March 2026 Sequoia essay — compressed hierarchy, three roles, every artifact piped into a queryable intelligence layer — became feasible only because the substrate was already there. The software redesign opened the possibility of the org redesign rather than following from it.

Shopify. Tobi Lütke's April 2025 "prove AI can't do it" memo set the hiring rule: every role request must first justify why AI cannot do the work. The chapter-specific angle from the co-design frame: Shopify treats phase transitions as explicit modes — prototype, build, operate — with a defined handoff where accountability transfers between phases. Both the software and the organizational redesigns run as one engineering problem, and the company itself becomes the engineered object alongside the product it ships.

Walmart. In June 2025, Walmart rolled out an AI-powered suite inside the associate app: task management that compressed shift planning from roughly 90 minutes to 30, real-time translation across 44 languages, and a conversational GenAI that turns process guides into step-by-step instructions the associate executes on the floor. Three weeks later Walmart eliminated the "market coordinator" role — a corporate position whose job was coordinating data compilation and analysis for store managers. The software did the coordination the role had existed to perform. The Walmart case is the clearest frontline, non-software-industry evidence in the public record that corporate coordination roles are the canaries in co-design: positions whose work is to shape and transmit data between other humans are the first to retire when the data layer becomes machine-readable.

NatWest. In February 2026 NatWest reported creating a Chief AI Research Office, hiring almost 1,000 software engineers in 2025 alone, and deploying AI tools to approximately 60,000 colleagues. AI now writes roughly 35 percent of the bank's code; Cora, the GenAI digital assistant, scaled from four customer journeys to twenty-one; AI tools saved more than 70,000 hours in Retail through automated call summaries. The Chief AI Research Office is the organizational counterpart to the 35-percent software position — neither figure reads coherently without the other. NatWest is the regulated-finance version of the co-design: a named executive office whose accountability is the research-to-production pipeline for AI, created at the same moment the firm's code is being materially written by AI.

Ramp Glass. Ramp's internal AI-tool adoption hit 99 percent in early 2026 and stalled. The discovery side was fully validated; the engineering side was missing, so productivity did not compound. A product manager with an engineering background built the harness as a weekend project; twenty daily users were asking for more features before a team formally existed; engineering then layered in a Defrag skill, a shared design system, and the Dojo skills marketplace backed by Git, eventually reaching 350-plus skills and 700 daily active users within three months. The case is the control group for the other four: discovery without engineering produced a plateau, and the plateau only moved once a single engineer owned the routing from discovery to engineering.

The five cases differ in substrate — software-native, retail, regulated finance — but share the move. A software capability that worked turned a role into redundant coordination, and the firm restructured once the substrate was stable enough to carry the load. The firm that skipped the substrate step and still tried to restructure is not in this list because it did not work.

Harness engineering turns discovery into production

Harness engineering stabilized into a named discipline by early 2026. It covers the surrounding software that orchestrates the model: tool registries and governance, verification and evaluation pipelines, persistent memory and context assembly, sandboxed runtimes, agent-specific tracing, middleware hooks that detect loop conditions, CI-integrated evals, self-healing remediation, and cost-aware orchestration. The practical claim running across public reports is that harness quality often determines production performance more than the choice of model. LangChain published the cleanest public measurement in February 2026: deepagents-cli moved from 52.8 to 66.5 on Terminal Bench 2.0 — a 13.7-point jump that took the agent from rank 30 to top-five on the public leaderboard — with the model held fixed at gpt-5.2-codex. The changes were harness-only: self-verification loops, enhanced tools and context injection, middleware hooks that detect agents getting stuck in repeated tool calls, and LangSmith tracing at scale to identify failure modes and iterate against them.
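The loop-detection hook is concrete enough to sketch. A standalone illustration of the idea: an agent issuing the identical tool call several times in a row gets interrupted. LangChain's actual middleware interface differs from this sketch, and the window size is illustrative.

```python
from collections import deque

class LoopDetector:
    """Interrupt an agent that repeats the identical tool call N times in a row."""

    def __init__(self, window: int = 3):
        self.recent: deque = deque(maxlen=window)

    def on_tool_call(self, tool: str, args: dict) -> None:
        # hashable fingerprint of the call: tool name plus canonicalized arguments
        self.recent.append((tool, repr(sorted(args.items()))))
        if len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1:
            raise RuntimeError(f"loop detected: {tool} repeated {self.recent.maxlen} times")
```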

Two observations about sequencing a harness build. First, teams that build harnesses in 2026 typically start with evals — without a binary criterion, no other component can be tuned — and add tracing, tool registry, middleware hooks, self-verification loops, and a cost governor in roughly that order. Second, a team that does not want to assemble the components from parts forks an opinionated starter harness. Claude Code is the commercial reference; Goose is the open-source reference. Both run substantial orchestration around the model; both keep the underlying capability visible enough that a team can read the harness as a template rather than a black box.
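The self-verification loop is the component most worth sketching, because it is where the evals-first ordering pays off: the loop cannot exist without the binary criterion. A minimal sketch; the generate and evaluate callables stand in for the team's own model call and eval.

```python
from typing import Callable

def self_verify(
    generate: Callable[[str], str],   # model call through the harness
    evaluate: Callable[[str], bool],  # the binary eval written first
    task: str,
    max_attempts: int = 3,
) -> str | None:
    prompt = task
    for _ in range(max_attempts):
        output = generate(prompt)
        if evaluate(output):
            return output
        # feed the failure back so the next attempt can correct it
        prompt = f"{task}\n\nPrevious attempt failed verification:\n{output}"
    return None  # repeated failure: escalate to a human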

Harness is temporary; engineered knowledge is permanent

A clever harness pushes a model's effective capability beyond its baseline. Frontier labs observe what harnesses do, absorb those capabilities into the next model generation, and release a model that does natively what the harness was adding. Specific public instances of the absorption: native tool use appeared in Claude 3 after early harnesses had been routing tool calls manually; extended thinking in recent Claude and GPT versions absorbed the scratchpad patterns teams had been appending to prompts; inline agentic browsing absorbed what Cursor and early coding agents had been gluing together with tool chains. The cycle has a rough cadence but the specific months vary by capability.

Horizon. The prediction that harness capabilities will continue to be absorbed by frontier models on a roughly quarterly-to-semiannual cadence is a bet on pattern continuation from the last two years. The cadence could slow if model-lab incentives shift, or accelerate if open-source harnesses start setting the frontier faster than labs can consolidate. The operating-model advice below assumes the pattern holds; the firm should monitor it.

The durable advantage is what persists across harness generations. Four artifacts carry over independently of which harness the firm is running at a given moment:

  • the evaluation suite that lets the team diagnose what worked
  • the dataset of tagged inputs and expected outputs the evals run against
  • the feedback loops that route human corrections back into the skill files
  • the skill library that codifies operational knowledge

A team that rebuilds the harness every six months while keeping the same evals, the same tagged data, and the same skill library compounds faster than the alternative posture of protecting static harness code and rewriting skills on every model release. Rebuilding the harness is the work of maintaining the investment; the evals, data, and skills are the investment itself.
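A minimal sketch of the portability claim, assuming the tagged dataset is one JSON line per case. The harness enters as a single callable, so a harness rebuild swaps one adapter while the suite and its pass-rate history stay comparable.

```python
import json
from typing import Callable

def run_suite(
    dataset_path: str,                     # tagged inputs and expected outputs
    harness: Callable[[str], str],         # whichever harness the firm runs this quarter
    evaluate: Callable[[str, str], bool],  # binary criterion: (output, expected) -> pass
) -> float:
    with open(dataset_path) as f:
        cases = [json.loads(line) for line in f]
    passed = sum(evaluate(harness(c["input"]), c["expected"]) for c in cases)
    return passed / len(cases)  # pass rate stays comparable across harness rebuilds
```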

Four failure modes map to the co-design question

Each of the four named failures traces back to the co-design question — software redesigned without organization, organization redesigned without software, or either side redesigned with the wrong invariants locked in.

  • Premature consolidation — engineering the capability before roughly fifty production iterations. The skill locks in artifacts of the first few prompts and generalizes poorly. Self-diagnostic: has the team rewritten its main skill file in the last three months because the first version missed edge cases?
  • Workshop theater — perpetual discovery with no consolidation move. A hundred pilots accumulate; none reach production; the firm celebrates the demo shelf while unit economics stay flat. Self-diagnostic: can the team name a pilot from Q2 2025 that is in production today?
  • AI theater — the software redesign happens, the organization does not. Tooling works in isolation; the firm's customer-visible outputs do not change because the information-routing middle layer persists. Self-diagnostic: has the firm's external output actually changed, or only its internal tooling?
  • PowerPoint transformation — the org chart changes, the tooling does not. New titles sit above the same work done with the same tools. Self-diagnostic: are the people in the newly named roles doing measurably different work from the people in the old roles, or just reporting differently?

Mollick names the mechanism underneath the four modes: "all of these things broke because they all depended on there being only one form of intelligence available. Now we're in a world where that isn't the case. So things have to be rebuilt from the ground up". The operating model that survives the rebuild co-designs software and organization as one problem. Everything else lands as one of the four failures on a predictable schedule.

The duality runs on a slower clock in contexts where either side of the redesign is partially constrained externally — heavily unionized workforces where role changes require negotiation, regulated sectors with mandated reporting structures, public-sector functions where roles are defined by law. The co-design framework still applies; the velocity is capped by the slower of the two sides.

Software 1/2/3 pairs with Org 1/2/3 because the substrates evolve together

Andrej Karpathy's June 2025 YC keynote named three software paradigms: Software 1.0 is handwritten code; Software 2.0 is trained neural-network weights that replace handwritten code in narrow domains; Software 3.0 is natural-language prompts directing LLMs to do the work. The paradigms describe how the firm's code gets produced. Each paradigm pairs with an organizational counterpart that describes how the firm's work gets produced, because the same substrate that writes the code also routes the labor.

  • Software 1.0 and Org 1.0. Handwritten code paired with handwritten procedures. The firm runs on documents, role descriptions, and hierarchical management approving each non-trivial decision. Both substrates are produced by humans, scale by adding more humans, and degrade as both compress under volume.
  • Software 2.0 and Org 2.0. Trained ML embedded in narrow workflows paired with CRM/ERP-augmented hierarchies. The firm offloads pattern-recognition tasks to learned models (recommendation, fraud detection, lead scoring) while humans still hold the routing and approval substrate. Salesforce, NetSuite, and Workday are the operational shape of Org 2.0 — the system tracks the work but humans still route it.
  • Software 3.0 and Org 3.0. Declarative goals directing LLMs to pick the path paired with the cybernetic firm where agents hold the routing substrate. The firm's controller reasons across artifacts, agents act on the reasoning, and humans hold the four responsibilities (Architect, Relationships, Validation, Accountability) developed in 3.1. The Block and NatWest restructures named earlier in the chapter are the pairing in production at firm scale. 4.5 develops the cybernetic firm in full.

Co-design fails when the firm advances on one substrate without the other. Software 3.0 grafted onto Org 1.0 produces AI theater — the agents work but the firm still routes their output through hierarchical approval that the agents could have skipped. Org 3.0 attempted on Software 1.0 produces PowerPoint transformation — new role names above the same hand-built tooling, with the agent layer absent from the substrate the new roles were supposed to operate. The pairing rule is that the firm advances one paradigm step at a time on both axes; skipping a side is the load-bearing source of the four failure modes named above.


Run this week

A reader who has a completed autonomy map can convert it into an operating-model audit inside a week; a minimal scoring sketch follows the list:

  1. Pick three initiatives the firm has already started. Prefer visible activity: a coding-assistant rollout, a customer-support pilot, an internal-knowledge-base project.
  2. Apply the fifty-iteration test to each. Count real production runs, not demos. Below fifty: discovery. Above fifty: engineering is justified if the eval exists.
  3. Apply the 2-of-3 test to initiatives past fifty iterations. If two of three suggested next actions are strong, the context layer is sufficient. If not, add to the backlog "strengthen context before engineering."
  4. Check for a binary eval on every initiative claimed to be in engineering. Initiatives without an eval are running on vibes; move them back to discovery with an eval-writing task attached.
  5. Map each initiative to a failure-mode risk using the four self-diagnostics above. Most firms find that one failure mode dominates across their initiatives.
  6. Name the missing co-design side for each. An initiative in AI theater needs a named role change paired with the tooling; one in PowerPoint transformation needs real tooling beneath the new title. The output is a list of paired software-and-org moves, sequenced by how much margin each recovers.
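A minimal sketch of steps 2 through 4 as a single pass over the initiative list. The fields mirror the three tests; the initiative names and numbers are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Initiative:
    name: str
    production_runs: int     # step 2: fifty-iteration test
    context_test_pass: bool  # step 3: 2-of-3 test
    has_binary_eval: bool    # step 4: eval-before-harness test

    def status(self) -> str:
        if self.production_runs < 50:
            return "discovery"
        if not self.has_binary_eval:
            return "back to discovery: write the eval first"
        if not self.context_test_pass:
            return "strengthen context before engineering"
        return "engineering justified"

for i in [Initiative("coding-assistant rollout", 120, True, True),
          Initiative("customer-support pilot", 30, False, False)]:
    print(f"{i.name}: {i.status()}")
```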

The next chapter develops the substrate that every subsequent architectural layer runs on — identity, access policy, and audit — because the operating-model duality named here is expensive to run when the firm cannot answer which agent acted, on whose behalf, under what policy.