The Shift

Production systems combine all five automation levels

Shared vocabulary for Part 2. Five automation levels — Traditional ML / RPA, LLM Chat, Workflow, Agent, Skill — are co-deployed parallel branches rather than a maturity ladder. A firm running at 2026 standards combines all five deliberately across its system portfolio, each chosen per task. Most organizational failures at the architecture layer come from choosing wrong, typically deploying an agent where a workflow would do.

The autonomy map the reader just produced classified each process by the human-involvement level it requires. This reference box names the implementation vocabulary: the five automation levels that Part 2's chapters invoke when describing how to build a process at a given autonomy level. The two axes are orthogonal. Autonomy describes how much human involvement a process needs at execution; automation describes which implementation technique fits that process. A single L3 process might run as a workflow, as an agent, or as a skill-loaded agent, depending on the shape of the problem.

The five levels below run side by side in production rather than stacking into a maturity ladder, so the walkthrough that follows gives each level its own fit criteria, reference implementation, and first failure mode.

Traditional ML and RPA remain correct for bounded, high-reliability problems

Deterministic models and rule-based automation solve problems with stable inputs and strict reliability targets. Stripe Radar is the reference public-company example: a machine-learning fraud-prevention system trained on over a trillion dollars of annual payment volume, scanning hundreds of signals per transaction, reducing fraud by 38 percent on average. Radar is not replaced by an LLM because the failure-cost asymmetry demands deterministic behavior: a single missed-fraud event costs more than the savings from thousands of correctly labeled benign transactions. Logistic regression, OCR, rule-engine classification, and traditional robotic process automation all keep their place in the stack whenever the input distribution is stable and the per-error cost is asymmetric. The failure mode here is substitution anxiety: a team replaces a working deterministic model with an LLM because LLMs dominate the current conversation, then watches the new system's error bar widen without any offsetting gain.

LLM chat is the single-turn workhorse

Single-turn requests to a language model are a common automation unit and the one most often applied to the wrong task shape. Typical uses (summarizing a contract, drafting an email, translating a passage) are one-request-one-response transactions at low unit cost, useful immediately when the request actually needs a single step. In its pure form, the mode holds no state beyond the current conversation and runs no tool calls. Consumer-facing examples at that pure end include Gmail Smart Compose, GitHub Copilot's inline suggestions, and a raw ChatGPT chat box used for single-paragraph text transformations. Once a chat UI layers in tool calls or persistent memory, it has crossed into agent or skill-loaded-agent territory under a chat-shaped interface. The mode breaks down when the task actually requires multiple dependent steps, external tool calls, or a decomposition the caller has not thought through. The typical error is treating the chat box as a universal interface: pasting a 40-row CSV into ChatGPT and asking for personalized outreach for each row collapses a four-step workflow (CRM lookup, segment, template, send) into a single hallucinated paragraph per row. LLM chat fits single-step work; the signal that a task has outgrown the chat box is that the caller can already name the dependent steps they were hoping to compress into one prompt.
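The dependency structure is what the single prompt erases. A minimal sketch of those four steps, with every helper name hypothetical (none of this is a real API); the point is the shape: each step consumes the previous step's real output instead of a guess.

```python
from dataclasses import dataclass

@dataclass
class Contact:
    email: str
    name: str
    plan: str  # hypothetical CRM fields the segmentation actually needs

# Hypothetical stand-ins for the systems a chat box cannot reach.
def crm_lookup(email: str) -> Contact: ...          # step 1: fetch the real record
def classify_segment(contact: Contact) -> str: ...  # step 2: segment on actual fields
def draft_outreach(contact: Contact, segment: str) -> str: ...  # step 3: one scoped LLM call
def send_email(to: str, body: str) -> None: ...     # step 4: send through an audited channel

def outreach_for(email: str) -> None:
    contact = crm_lookup(email)              # depends on nothing
    segment = classify_segment(contact)      # depends on step 1
    body = draft_outreach(contact, segment)  # depends on steps 1 and 2
    send_email(contact.email, body)          # depends on step 3
```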

Workflows make multi-step LLM execution observable and cheap

A workflow is a deterministic chain of LLM calls wired together by code. Code is the orchestrator; the LLM is a component at each step. Workflows trade flexibility for measurable gains in observability, cost predictability, and debuggability. Steps are isolated from each other, token spend is auditable per call, and failures localize at the call boundary where debugging is actually tractable. A press-release pipeline that extracts the angle from a brief, drafts copy, translates to multiple languages, runs an LLM-as-judge quality check, and loops back to revision (capped at two retries before a human is called) is a workflow — not because the team was too conservative to build an agent, but because the steps were known in advance and the system needed predictable cost per run. Orchestration substrates that fit this pattern include LangGraph, Temporal, Inngest, and Prefect, each of which exposes the individual step as a first-class unit for retries, timeouts, and observability. The error at this level is premature promotion — teams treat a workflow as a scaffold on the way to an agent when the fixed-step workflow was already the correct final form.
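A minimal sketch of that press-release pipeline, assuming a hypothetical `call_llm(prompt) -> str` client rather than any particular vendor SDK. Code owns the control flow, the model is one component per step, and the revision loop is capped at two retries before a human is pulled in.

```python
MAX_REVISIONS = 2  # cap retries before escalating to a human, per the pipeline above

def call_llm(prompt: str) -> str:
    """Hypothetical single model call; swap in the client of your choice."""
    raise NotImplementedError

def press_release(brief: str, languages: list[str]) -> dict[str, str]:
    # Each step is an isolated, auditable call with its own token spend.
    angle = call_llm(f"Extract the key angle from this brief:\n{brief}")
    draft = call_llm(f"Draft a press release around this angle:\n{angle}")

    for attempt in range(MAX_REVISIONS + 1):
        verdict = call_llm(f"Judge this draft; answer 'pass' or list defects:\n{draft}")
        if verdict.strip().lower().startswith("pass"):
            break
        if attempt == MAX_REVISIONS:
            raise RuntimeError("revision cap hit; route to a human editor")
        draft = call_llm(f"Revise the draft to fix these defects:\n{verdict}\n---\n{draft}")

    return {lang: call_llm(f"Translate into {lang}:\n{draft}") for lang in languages}
```

In production this chain would sit inside one of the orchestrators named above so each step gets retries, timeouts, and traces as first-class units; the control flow stays exactly this shape.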

Agents pick their own tool calls when the path cannot be pre-specified

An agent is a language model that chooses its own tool calls and their order. Agents handle ambiguity where workflows cannot, at the cost of higher token spend, less predictable behavior, and harder debugging. The trade is worth it only when the step sequence cannot be pre-specified. Customer support with unbounded ticket shapes is one case. Production-log investigation across multiple systems where the query path is unknown at invocation time is another. Research tasks that require running code mid-session to decide the next step round out the common set. Browserbase's internal agent bb is the reference 2026 implementation: one generalized agent lives in Slack and serves engineering, ops, sales, support, and execs by lazy-loading the right skill and the right scoped permissions for each task, rather than being built as one bot per function. The reference failure mode: a team spins up an internal Slack agent without scoped permissions or a skill library, and on day fourteen the agent either hits a 20x cost spike because it retries tool calls indefinitely, or executes a destructive action an engineer had assumed was read-only. The diagnostic: if the agent does not require a permission scope per invocation and does not load a skill file per task, it is not the bb pattern; it is a bot wrapper that will fail on the first edge case.
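That diagnostic is concrete enough to sketch. Every name below is an assumption for illustration, not Browserbase's implementation: each invocation carries a permission scope, loads one skill file, and runs under a hard step cap so a retry loop cannot become a 20x cost spike.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str            # "tool" or "final"
    tool: str = ""
    args: dict = field(default_factory=dict)
    content: str = ""

# Hypothetical stand-ins for the model and the tool layer.
def choose_action(context: list[dict]) -> Action: ...
def execute_tool(action: Action) -> dict: ...

ALLOWED_TOOLS = {  # permissions scoped per invocation, never granted globally
    "support": {"search_tickets", "read_ticket", "draft_reply"},
    "ops": {"query_logs", "read_dashboard"},
}

def run_agent(task: str, scope: str, skill_path: str, max_steps: int = 20) -> str:
    with open(skill_path) as f:
        skill = f.read()                  # skill loaded per task, not baked into the agent
    allowed = ALLOWED_TOOLS[scope]        # an unknown scope fails loudly here
    context = [{"role": "system", "content": skill},
               {"role": "user", "content": task}]
    for _ in range(max_steps):            # hard cap bounds token spend
        action = choose_action(context)
        if action.kind == "final":
            return action.content
        if action.tool not in allowed:    # blocks the "assumed read-only" destructive call
            raise PermissionError(f"{action.tool!r} is outside scope {scope!r}")
        context.append(execute_tool(action))
    raise RuntimeError("step cap hit; escalate to a human instead of retrying")
```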

Skills are codified operational knowledge the agent loads on demand

A skill is a markdown file codifying operational knowledge for a specific job. The file defines who the team is, what it does, how it handles exceptions, and when it escalates. A skill declares scope, output schema, and failure mode up front, which is the property a prompt does not have. The test: open the markdown file. If it has no declared scope (what jobs it accepts), no declared output shape (what the result returns), and no named escalation rule (when to surface to a human), it is a prompt, not a skill. Ramp's Dojo marketplace is the reference public-company example: 350+ skills contributed across the company via a Git-backed repository, each versioned and reviewed like code, loaded by the agent at the moment it is needed rather than shipped as part of the agent itself. At execution time the agent retrieves a matching skill from the library based on the task (via embeddings search, tool-call routing, or explicit skill selection by the orchestrating model), and the skill's contents enter the agent's context just for that task. Agents run on top of skills, and an agent without skills is an un-onboarded intern with root access. The main pathology is skill hoarding: a library polluted with one-off skills that the agent's discovery layer cannot effectively route. Chapter 2.6 develops the library discipline that prevents it.
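The prompt-versus-skill test is mechanical enough to lint. A minimal sketch, assuming a skill layout with three named sections; the section headers are this sketch's convention, not Ramp's published schema.

```python
from pathlib import Path

# Assumed section headers; a real library would pin these in its contribution guide.
REQUIRED_SECTIONS = ("## Scope", "## Output schema", "## Escalation")

def is_skill(path: Path) -> bool:
    """A markdown file missing any declared section is a prompt, not a skill."""
    text = path.read_text()
    return all(section in text for section in REQUIRED_SECTIONS)

def lint_library(root: Path) -> list[Path]:
    """Return every file in the library that fails the test."""
    return [p for p in root.glob("*.md") if not is_skill(p)]
```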

Accounts payable in production spans four of the five levels

Ramp Bill Pay is the reference production system at this layer. It processes invoices with 99-percent-accurate OCR, runs three named AI agents (auto-coding, fraud prevention, approval), and reports near-100 percent invoice-processing automation at 2.4x the speed of legacy software. Ramp markets the system as "agents," but the underlying architecture splits across four of the five automation levels:

  • GL coding of routine invoices runs as a workflow because the steps (classify vendor, look up account, apply historical pattern match) are pre-specified and need to run at predictable cost per invoice.
  • Fraud detection runs as traditional ML because the reliability target is strict and the input distribution is stable — suspicious banking-detail changes, unexpected vendor email domains, unverified accounts, all observable on fixed features.
  • Exception resolution (unrecognized vendor, duplicate-invoice flag, multi-invoice reconciliation against a contract) runs as an agent because the investigation path is not known in advance.
  • Escalation rules (when to route to a controller, when to pause payment, when to trigger a compliance review) live as a skill loaded by the agent at the moment an exception surfaces.

Four levels cohabit in one accounts-payable system, each placed because the level matched the problem structure. LLM chat does not appear in the core pipeline (no step in the production flow is a single-turn, one-response transaction), but the same firm's finance team still uses LLM chat for ad-hoc questions against the AP database ("what do we still owe Acme?"). The failure pattern most teams produce looks different. They build everything as "the AP agent," then discover in sequence: the fraud classifier beats the agent on fraud, the pre-specified workflow beats the agent on routine coding, and the agent itself fails on exceptions because the escalation skill was never codified.
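A sketch of that four-level split as routing code, with every handler name assumed; this illustrates the architecture, not Ramp's implementation. The dispatch itself is plain deterministic code, and only the exception path is agentic, with its escalation rules arriving as a loaded skill.

```python
FRAUD_THRESHOLD = 0.98  # assumed cutoff for the deterministic fraud model

# Hypothetical handlers, one per automation level.
def fraud_score(invoice: dict) -> float: ...        # traditional ML on fixed features
def is_routine(invoice: dict) -> bool: ...          # known vendor, matched pattern
def gl_coding_workflow(invoice: dict) -> str: ...   # workflow: pre-specified steps
def exception_agent(invoice: dict, skill: str) -> str: ...  # agent: open-ended path
def pause_and_review(invoice: dict) -> str: ...     # human-facing hold queue
def load_skill(name: str) -> str: ...               # escalation rules from the library

def process_invoice(invoice: dict) -> str:
    if fraud_score(invoice) > FRAUD_THRESHOLD:      # traditional ML: strict reliability
        return pause_and_review(invoice)
    if is_routine(invoice):                         # workflow: predictable cost per run
        return gl_coding_workflow(invoice)
    # Agent plus skill: unknown investigation path, rules loaded on demand.
    return exception_agent(invoice, skill=load_skill("ap-escalation"))
```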

A one-page decision table for the five levels

| Level | Pick when | Avoid when | Reference implementation | First failure mode |
| --- | --- | --- | --- | --- |
| Traditional ML / RPA | Inputs are stable; per-error cost is asymmetric; reliability target ≥ 99.9%. | Inputs drift frequently; rules change faster than retraining cycles. | Stripe Radar [stripe-radar-engineering] | Substitution anxiety: replacing a working model with an LLM for no gain. |
| LLM chat | The task is a single-turn transaction and the caller can name the output shape. | The task actually needs dependent steps, tool calls, or retrieval beyond the prompt. | Gmail Smart Compose; GitHub Copilot inline suggestions | Pasting a workflow's worth of input into a chat box and treating hallucinated output as a valid answer. |
| Workflow | The step sequence is known in advance; cost per run must be predictable; each call needs to be debuggable in isolation. | The path changes per input and cannot be specified up front. | LangGraph / Temporal / Inngest / Prefect orchestrations; the GL-coding leg of Ramp Bill Pay [ramp-bill-pay] | Premature promotion: treating the workflow as a scaffold on the way to an agent. |
| Agent | The step sequence cannot be pre-specified; tool-use reasoning is required; ambiguity is in the problem rather than in the spec. | The problem has a known deterministic decomposition. | Browserbase bb [browserbase-bb-blog] | 20x cost spike or unintended destructive action, both from missing permission scopes and absent skill libraries. |
| Skill | The agent needs codified operational knowledge (scope, output schema, escalation rules) loaded per task rather than baked into system prompts. | The content would only run once, or the scope is too broad to declare. | Ramp Dojo marketplace [ramp-glass-linkedin] | Skill hoarding: polluting the library with one-off skills the discovery layer cannot route. |

Part 2's chapters use these five labels throughout:

  • 2.1 treats the consequences of co-deploying all five.
  • 2.2 develops the scoped-permissions layer that keeps skills and agents safe.
  • 2.4 develops harness engineering; 2.5 covers agent reliability specifically.
  • 2.6 covers skills as a library discipline, with the skill-hoarding failure mode developed in full.
  • 2.7 closes Part 2 with the self-improving loop and the four controls self-modifying agents need.

Traditional ML and LLM chat do not have their own chapters — they remain as legacy substrate on which the other architectural decisions sit.