The Machine

Self-Improving Skills

Skill-level self-improvement is production-deployed in 2026 across engineering, finance, and medicine. Agent-level self-improvement remains a research direction. Firm-level compounding (the destination of chapter 4.7, The Compounding Firm) accumulates from skill-level iterations done well. The chapter separates the three and shows what works, what breaks, and where the discipline starts.

Sakana AI published a result in May 2025 that belongs in any honest account of what agents can do in the limit: a coding agent that iteratively rewrote its own code raised its SWE-bench score from 20.0 percent to 50.0 percent over 80 iterations, and its Polyglot benchmark performance from 14.2 percent to 30.7 percent across the full test set. The agent did this by treating each version of itself as a candidate in an open-ended archive, validating each proposed change against the benchmark, and keeping what measurably worked. Nothing about the loop was closed; the best version after iteration 80 was not the only version kept; stepping-stone archives preserved diversity so the search could escape local optima later. That is the archetype of self-improvement in 2026.

The lift is real and bounded. The Darwin Gödel Machine moved a coding benchmark on which it iterated against a stable evaluation; it did not replace that benchmark with an actual line of business, remove the human reviewer who curates the stepping-stone archive, or run open-ended overnight. The practical frontier of self-improvement in 2026 sits at three different scales of ambition:

  • Skill level — where a measurable evaluation function exists, where the domain is narrow enough to specify, where errors are recoverable. The loop is in production across software engineering, accounting, and clinical documentation.
  • Agent level — where the agent rewrites its own harness and memory. The loop is a research direction.
  • Company level — where a firm adapts its products, processes, and organizational structure through feedback. The loop is a framing goal toward which individual skill-level iterations accumulate rather than a system anyone ships today.

The distinction matters because the marketing pitch conflates all three. This chapter separates them. It opens on three production anchors where skill-level self-improvement is deployed at scale and measured against hard evidence. It compresses the prompt-evolution mechanism that makes the loops tractable. It names the patterns the loops follow in production, the eval discipline that keeps them honest, the failure modes that defeat them, and the two categories of work where self-improvement does not belong.

Skill-level self-improvement is production-deployed in 2026

Three 2026 anchors carry harder evidence than vendor-customer case studies — each metric is either engineering telemetry, an earnings-call disclosure, or a peer-reviewed measurement, and each is paired with a specific architecture.

Stripe's coding agents run a CI-gated devloop. Stripe's internal Minions system runs an agent loop that pushes code changes through the firm's standard CI and test pipelines before opening a pull request. Linters and static-analysis tools act as deterministic feedback nodes the agent loops against locally: when a proposed change fails a check, the failure signature enters the agent's context for the next attempt, so the quality gates tighten as the agent accumulates failure cases rather than running a fixed prompt against a fixed model. The self-improvement is not in the agent's weights; it is in the feedback loop between the agent's output and the deterministic scaffolding around it. The metric is engineering telemetry — merge counts from Git, not self-reported productivity gains — and InfoQ independently corroborated the deployment in March 2026.

Intuit's accounting agents disclosed measured scale on an earnings call. On the February 26, 2026 earnings call, Intuit reported that its AI agents categorized roughly 237 million transactions in January 2026 — over half of all transactions flowing through the product that month — across more than three million customers, with an 85 percent rate of repeat engagement. The architecture that drives the iteration is named publicly in Intuit's investor-relations press release: the GenOS Evaluation Service, paired with an Agent Starter Kit that embeds expert-in-the-loop handoff as a first-class primitive. Earnings-call disclosures carry legal weight under SEC rules; this is roughly the hardest form of numerical evidence available outside a peer-reviewed study.

Ambient-scribe clinical documentation has peer-reviewed, EHR-log-measured outcomes. A 2026 JAMA Network Open multi-site study covered 8,581 clinicians across five academic medical centers and measured outcomes directly from electronic-health-record logs rather than from clinician self-report. Documentation time fell by 16.0 minutes per 8 scheduled hours, and clinician visits rose by 0.49 per week. The mechanism is a scribe skill that transcribes the patient encounter, structures it into the EHR's fields, and surfaces the draft for the clinician to edit before signing. Every iteration refines the prompt against the clinician's corrections. The honest counter-evidence is in NEJM AI's Atrium Health longitudinal study of 112 active users, which found no group-level efficiency gain in a different deployment. The JAMA-versus-NEJM divergence makes the performance-is-a-distribution point (below) concrete: the mean documentation-time effect was genuine at five academic medical centers and absent at Atrium Health, because the same skill performs differently across deployment conditions.

Measurable public evidence for self-improving skill systems at enterprise scale is still thin in 2026. These three are the strongest anchors the research surfaced, and more are expected by late 2026 as firms move from pilot to steady-state skill libraries. Each is a skill-and-harness system improved through disciplined feedback loops around a stable agent, distinct from the recursively self-improving agent architectures the frontier research explores.

The engine is prompt evolution, at a specific scale

The mechanism that makes skill-level self-improvement tractable is prompt evolution: treat the prompt as a genetic-algorithm population, generate mutations, evaluate each mutation against a labeled dataset, keep the better performers, iterate. Two models run in a reflection architecture — a small executor runs the candidate prompt, and a smarter reflection model reviews the executor's failures in natural language and proposes the next mutation. Sample-size floor is roughly 200 labeled examples; below that, noise swamps signal.
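
A minimal sketch of that loop in Python, assuming two hypothetical model-backed callables: run_skill(prompt, x), which executes the candidate prompt through the small executor, and reflect(prompt, failures), which asks the stronger reflection model for a mutated prompt given a sample of failures. Illustrative shape, not a production optimizer:

    import random

    def evolve_prompt(seed_prompt, dataset, run_skill, reflect,
                      generations=20, pop=8):
        # dataset: list of (input, label) pairs; floor is roughly 200
        # examples, below which per-candidate scores are mostly noise.
        def score(prompt):
            return sum(run_skill(prompt, x) == y for x, y in dataset) / len(dataset)

        scored = [(score(seed_prompt), seed_prompt)]
        for _ in range(generations):
            _, best_prompt = max(scored)
            # The reflection model reads a sample of the best candidate's
            # failures in natural language and proposes the next mutation.
            failures = [(x, y) for x, y in dataset if run_skill(best_prompt, x) != y]
            mutants = [reflect(best_prompt, random.sample(failures, min(5, len(failures))))
                       for _ in range(pop)]
            scored += [(score(m), m) for m in mutants]
            scored = sorted(scored, reverse=True)[:pop]  # keep the better performers
        return max(scored)  # (score, prompt) of the winning candidate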

The production reference implementation in 2026 is DSPy, Stanford NLP's framework for programming language models rather than prompting them. DSPy exposes three abstractions — Signatures (the input-output contract), Modules (composable reasoning blocks), Optimizers (the search algorithms over prompt and example space) — and integrates with MLflow for reproducibility. Large firms run DSPy in production against measurable skill tasks.
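
For a sense of the shape those abstractions take, a minimal sketch follows; class and optimizer names drift across DSPy releases, so treat the details as illustrative rather than canonical:

    import dspy

    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # the executor model

    class CategorizeTxn(dspy.Signature):
        """Assign an accounting category to a transaction description."""
        description: str = dspy.InputField()
        category: str = dspy.OutputField()

    program = dspy.Predict(CategorizeTxn)  # a Module wrapping the Signature

    def exact_match(example, pred, trace=None):  # the metric the Optimizer searches against
        return example.category == pred.category

    trainset = [
        dspy.Example(description="UBER *TRIP 8842", category="Travel").with_inputs("description"),
        # ...a few hundred labeled examples in practice, per the sample-size floor above
    ]

    optimizer = dspy.MIPROv2(metric=exact_match)  # search over prompt and example space
    optimized = optimizer.compile(program, trainset=trainset)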

GEPA, introduced in a July 2025 paper by researchers at UC Berkeley, Stanford, Databricks, and MIT, and deployed through Databricks' Mosaic AI in September 2025, is the current state-of-the-art published framework. Its key move is incorporating natural-language reflection over trajectories — reasoning traces, tool calls, tool outputs — rather than reducing feedback to a scalar reward the way reinforcement-learning methods like GRPO do. The paper reports that optimized skills reach frontier performance at up to 90 times lower inference cost than RL-trained equivalents on the same benchmarks, converging 35 times faster. Databricks, Shopify, and Dropbox run GEPA or its variants on production skill libraries.

The scope boundary: the mechanism lifts skill quality on narrow tasks with crisp evaluations. On wide-domain or hard-to-specify tasks — strategic judgment, cross-domain multi-step reasoning, novel-situation synthesis — prompt evolution does not rescue a weaker base, and the frontier-tier model still matters.

Sidebar — frontier research, not production. Agent-improves-agent loops live in research, not deployment. Sakana AI's Darwin Gödel Machine, developed in collaboration with researchers now at Vector Institute and Oxford, iterates the coding agent on its own code and moves SWE-bench from 20.0 percent to 50.0 percent over 80 iterations. Stanford researchers introduced Meta-Harness in March 2026, an outer-loop system that searches over harness code itself using an agentic proposer with filesystem access to all prior candidates' source, scores, and execution traces — it reports a 7.7-point gain over Agentic Context Engineering with 4× fewer context tokens, a 4.7-point gain across 200 IMO-level problems, and a first-place ranking for a Haiku 4.5 harness on TerminalBench-2. Meta AI Research published Hyperagents in March 2026 with a task-policy / meta-policy / archive architecture — the meta policy controls how the task policy mutates, so the mutation procedure itself becomes evolvable. These are research directions in 2026, not production patterns. Track; do not deploy.

The judge-loop pattern turns generation into evaluation

The Generator-Discriminator pattern is the most common self-improvement shape in production. One agent produces output; several judge agents evaluate it against independent criteria in parallel; judges send criticism back; the generator revises; the loop repeats for a fixed count. The canonical operational shape that practitioners converge on: diverse judges, independent criteria, bounded loop count, aggregation by ranking rather than averaging.

Inside a research pipeline, the shape looks like this: a research agent produces an investment thesis or a market analysis; five judges evaluate the output against practicality, data-freshness, language quality, topicality, and completeness; criticism aggregates; the generator revises; three loops complete in roughly 40 minutes. The cost runs 3 to 6 times the single-shot baseline. Practitioners who run the pattern report output quality that routinely exceeds manual work that would have taken a full workday. The critical design choice is aggregation by ranking — selecting the top candidate across judge-dimension scores — rather than averaging, because consensus across stochastic judges always selects the median, and the valuable tails of the output distribution are lost.
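
A minimal sketch of the loop, assuming hypothetical generate, judge, and revise callables backed by the generator and judge models, where judge(draft, criterion) returns a (score, criticism) pair. Ranking here keys on the worst judge-dimension score, which is one defensible reading of selecting the top candidate across dimensions:

    from concurrent.futures import ThreadPoolExecutor

    CRITERIA = ["practicality", "data-freshness", "language quality",
                "topicality", "completeness"]

    def judge_loop(generate, judge, revise, task, loops=3):
        candidates = []
        draft = generate(task)
        for _ in range(loops):  # bounded loop count, never "until convergence"
            with ThreadPoolExecutor() as pool:  # judges run in parallel
                verdicts = list(pool.map(lambda c: judge(draft, c), CRITERIA))
            scores = [s for s, _ in verdicts]
            candidates.append((min(scores), draft))
            draft = revise(draft, [criticism for _, criticism in verdicts])
        candidates.append((min(judge(draft, c)[0] for c in CRITERIA), draft))
        # Aggregation by ranking: take the top candidate, never an average of drafts.
        return max(candidates)[1]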

Self-improvement runs on a monthly operational cadence

At firms that are not running frontier research systems, the practical self-improvement cadence is monthly, organized around five moving parts:

  1. Continuous data collection. Execution traces, human-override logs, user feedback, and eval scores per skill invocation write to an immutable store.
  2. Monthly analysis. The team — human reviewer plus a reflection agent — reads the trace corpus, clusters failure modes, and proposes skill changes. The reflection agent produces candidate changes; the human approves or rejects.
  3. Monthly regression testing. Proposed changes run against the skill's existing eval set and a held-out validation set (see the regression-gate sketch after this list). No regression on historical tasks is a hard requirement: a change that improves new data while breaking old never ships.
  4. Monthly deployment. Approved changes ship; the previous version is kept in skill history for rollback.
  5. Trigger conditions between monthly cycles. Incident-driven (a RED-tier failure triggers immediate out-of-cycle review), metric-driven (NPS drops, cost spikes, adoption declines), and external-driven (new regulation, new product surface, new data source).
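
The regression gate in item 3 is small enough to sketch directly, assuming a hypothetical run(version, case) callable that returns pass/fail for one historical case:

    def regression_gate(candidate, baseline, eval_set, held_out, run):
        # Hard requirement: the candidate may not lose any historical case
        # the baseline already passed, on either set. Improving new data
        # while breaking old blocks the ship, full stop.
        for case in list(eval_set) + list(held_out):
            if run(baseline, case) and not run(candidate, case):
                return False, case  # blocked; 'case' goes into the review notes
        return True, None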

The practitioner intuition for why monthly works: shorter cycles do not collect enough execution traces to signal through noise, and longer cycles let failure modes compound into user-visible incidents before they can be addressed. Weekly cadences surface in high-volume skill deployments that have the trace throughput to support them. Quarterly is too slow for anything except deep-policy changes that require regulatory sign-off.

NPS-shaped judge prompts outperform arbitrary rubrics

When an LLM judges content quality, the arbitrary rubric ("rate this 1 to 10 on coherence") produces inconsistent anchoring across runs — the same content gets different scores on different days because the scale has no shared reference. A working alternative replaces the arbitrary rubric with a scale the model's training data is saturated with: "On a scale of 0 to 10, would you recommend this to another AI agent?" The Net Promoter Score framing has been written about and analyzed across every business-school curriculum, every product-management blog, and every enterprise survey tool on the internet, and the model's prior for the scale is unusually well-calibrated. The same technique generalizes to any evaluation where the model's training prior exceeds the team's custom rubric — the move is to use a measurement frame the model already understands rather than inventing a new one and hoping it anchors.
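
A minimal sketch of the judge prompt and the standard NPS bucketing; the wording is an assumption about how a team might phrase it, not a canonical template:

    JUDGE_PROMPT = """You are reviewing a draft produced by another AI agent.

    On a scale of 0 to 10, how likely are you to recommend this draft to
    another AI agent that needs the same task done? Answer with the number only.

    Draft:
    {draft}"""

    def nps_bucket(score: int) -> str:
        # Standard NPS thresholds: the frame the model's training prior
        # is saturated with, unlike a custom rubric.
        if score >= 9:
            return "promoter"
        if score >= 7:
            return "passive"
        return "detractor"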

Four controls the harness must enforce on self-modifying loops

Self-improvement mechanisms (prompt evolution, and the research-direction agent-improves-agent loops) need four controls that standard agent governance does not cover. Without them, the same mechanism that compounds skill quality can compound skill drift — new versions that improve the eval target while destroying tail behavior the eval does not capture, or that burn through the token budget hunting for a local-optimum gain that does not exist.

  • Change approval. Every prompt mutation, code edit, or tool-definition change passes through a reviewer. Human reviewer for high-risk skills; a meta-agent can approve low-risk routine evolution autonomously, but a reviewer — human or meta-agent — must exist. No agent deploys its own self-modification without a gate.
  • Rollback. Every deployed skill version is retained with one-click revert. The previous version is the control set against which new-version business metrics are compared in parallel; a change that wins on eval scores but loses on CSAT or revenue should be rolled back automatically.
  • Eval gate. No new version ships without passing the full eval suite, including the held-out validation set. Regression on historical tasks blocks promotion automatically. The eval suite scores the full distribution of cases (basic / standard / advanced / tail), and regression on the tail is blocking even when the average improves.
  • Self-improvement budget primitive. Each evolution cycle has a fixed token ceiling and exponential backoff on plateaus (a minimal sketch follows this list). Without it, a loop that finds a local optimum burns through its budget hunting for gains that do not exist. The budget is set at the skill level and attributed to the skill owner — not to a central pool where it disappears into the aggregate.
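
A minimal sketch of the budget primitive, assuming per-cycle attribution and a one-hour base backoff; the chapter fixes the principle, not the constants:

    class EvolutionBudget:
        """Per-skill token ceiling with exponential backoff on plateaus."""

        def __init__(self, skill_owner, token_ceiling, base_wait_s=3600):
            self.owner = skill_owner        # attributed to the skill owner, not a pool
            self.remaining = token_ceiling  # fixed ceiling per evolution cycle
            self.base_wait_s = base_wait_s
            self.wait_s = base_wait_s
            self.best = float("-inf")

        def spend(self, tokens):
            if tokens > self.remaining:
                raise RuntimeError(f"evolution budget exhausted for {self.owner}")
            self.remaining -= tokens

        def record(self, eval_score):
            if eval_score > self.best:
                self.best = eval_score
                self.wait_s = self.base_wait_s  # progress: reset the backoff
            else:
                self.wait_s *= 2                # plateau: back off exponentially
            return self.wait_s                  # seconds until the next cycle may run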

These four controls are load-bearing specifically for agents that modify themselves. For agents that run static skills, the list reduces to three (no self-improvement budget needed), and the other operational controls named earlier in the chapter cover the rest.

Track the distribution across scenario classes

When the self-improvement optimizer selects on an average accuracy number, it can lift the average while compressing the tails. Failure modes that occur on advanced or edge-case inputs — exactly the cases where a confident-wrong output is most expensive — can deteriorate while the mean looks healthy. The working discipline is to track performance across scenario classes (basic, standard, advanced, tail) and select on the distribution rather than the point.
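
A minimal sketch of distribution-level gating, assuming eval results arrive tagged with their scenario class:

    SCENARIO_CLASSES = ("basic", "standard", "advanced", "tail")

    def class_scores(results):
        # results: iterable of (scenario_class, passed) pairs from one eval run.
        by_class = {c: [] for c in SCENARIO_CLASSES}
        for cls, passed in results:
            by_class[cls].append(passed)
        return {c: sum(v) / len(v) for c, v in by_class.items() if v}

    def promotable(candidate_results, baseline_results):
        cand = class_scores(candidate_results)
        base = class_scores(baseline_results)
        # Select on the distribution shape: a mean win never excuses a
        # regression on advanced or tail inputs.
        return all(cand.get(c, 0.0) >= base.get(c, 0.0) for c in SCENARIO_CLASSES)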

The JAMA positive finding and NEJM AI null finding are the same principle at two different deployments: the mean documentation-time effect was genuine at five academic medical centers and absent at Atrium Health. Neither study is wrong; they are two points on a distribution that the self-improvement discipline has to track explicitly, not paper over with a single-site average.

Specification gaming is the default behavior of any capable optimizer

Two canonical anecdotes from the machine-learning literature make the claim memorable.

The first is a student robot-soccer competition in which the evolved winning gait moved zero distance: the robot stood still. It won because every other competitor tried to walk, fell, and accrued negative points; the evaluation function scored on net points rather than on attempting to play soccer. The agent satisfied the letter of the metric while violating its spirit entirely.

The second is a chest-X-ray classifier that reached high accuracy on sex classification — then researchers discovered it was detecting film-grain differences between the two X-ray machines used for male versus female patients. The system had learned an artifact of the data-collection process rather than any anatomy.

Both anecdotes capture the same failure mode: agents optimize the metric they are scored against, not the outcome the metric was intended to measure. Specification gaming is the default behavior of any sufficiently capable optimizer. The mitigations stack in layers.

Eval mitigations layer five ways

  • Independently-collected held-out validation data. The training/test split inside a single dataset fails against shortcut learning, because any artifact in the data-collection process is equally distributed across both halves. The protection is a held-out set collected through a different pipeline.
  • Cross-validation across distributions. If the model has to perform on production-distributed data, evaluate across multiple distributions before the self-improvement loop closes.
  • Distribution-wide evaluation, not average. Report basic / standard / advanced / tail performance separately; select on the distribution shape, not the mean.
  • Continuous human review on a sampled-trace basis. Random sampling of production outputs feeds to human reviewers for calibration. The goal is not 100 percent coverage — that is impossible at scale — but signal that the automated eval remains aligned with human judgment.
  • Business-metric parallel tracking. If skill eval scores rise while customer-satisfaction, resolution quality, or downstream revenue metrics decline, the eval is being gamed (a divergence-detector sketch follows this list). Business metrics are the external anchor the self-improvement loop cannot see inside itself.
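
A minimal divergence detector for that last mitigation; the four-cycle window is an assumption, since the chapter fixes the principle rather than the threshold:

    def gaming_alarm(eval_scores, business_metric, window=4):
        # Both series hold one value per improvement cycle, most recent last.
        # Eval trending up while the business metric trends down is the
        # signature of a gamed eval, not of a genuinely improving skill.
        if len(eval_scores) < window or len(business_metric) < window:
            return False  # not enough cycles to call a trend
        eval_up = eval_scores[-1] > eval_scores[-window]
        business_down = business_metric[-1] < business_metric[-window]
        return eval_up and business_down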

Warning: A self-improving skill with a held-out validation set that never gets refreshed becomes a skill with a memorized validation set. Held-out data rotates. Any team that keeps the same validation set across more than a few quarterly cycles is watching the metric improve against a target the system has implicitly learned.

When an agent loops, no new information has entered the system

The naive response to a looping agent — more patience, more iterations, a strongly-worded prompt — does not help. The loop persists because no fresh information has entered the agent's context. The working fix uses a reflection model that operates at a higher capability tier than the executor: the reflection model examines the execution traces the executor cannot see, produces an error analysis, generates counter-examples, and proposes an alternative framing. What breaks the loop is new information reaching the executor, not more execution against the same input.

The architectural corollary is that the reflection model must outperform the executor on the judgment task. When the asymmetry runs the wrong way — a weak judge reviewing stronger execution — the judge confidently approves work that is actually broken, because it cannot recognize the subtler failures. Reversing the asymmetry catches errors at the cost of a single expensive call. Design the asymmetry deliberately.
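
A minimal sketch of the pattern, assuming hypothetical executor and reflector callables with the reflector at the higher capability tier:

    def break_loop(executor, reflector, task, max_attempts=3):
        # executor(task, context) -> (output_or_None, trace)
        # reflector(task, trace)  -> error analysis, counter-examples, reframing
        context = []
        for _ in range(max_attempts):
            output, trace = executor(task, context)
            if output is not None:
                return output
            # The loop breaks on new information, not on patience: the
            # reflector reads the trace the executor cannot see and injects
            # its analysis into the executor's next attempt.
            context.append(reflector(task, trace))
        raise RuntimeError("reflection did not unblock the executor; escalate to a human")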

The judgment-action gap divides stated preferences from deployed behavior

A well-documented pattern in the behavioral evaluation of language models: the same model scored on stated preferences — "how should one act in this situation?" — and on deployed behavior — put in the scenario with function calls to execute — produces meaningfully different answers. The model's stated preferences reflect the training-data distribution of answers to ethical questions; the deployed behavior reflects the reward shape of the fine-tuning plus the specific function-call surface.

The self-improvement consequence is direct: evals that only test stated preferences optimize stated behavior. Scenario-based function-calling evals belong in the standard eval suite, not as an edge-case add-on. Otherwise the self-improvement loop lifts the eval score while deployed behavior diverges. Teams that have shipped models into customer-facing deployments and then watched downstream metrics drift often discover the divergence originated in a stated-preference eval that the model had learned to game without ever being asked to demonstrate the behavior in situ.
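
A minimal sketch of running both eval shapes side by side, assuming a hypothetical model client with an ask (text-only) interface and a run (function-calling) interface:

    def stated_preference_eval(model, scenario):
        # "How should one act?" -- an answer drawn from the training-data
        # distribution of responses to ethical questions.
        return model.ask(f"In this situation, what is the right action?\n{scenario.text}")

    def deployed_behavior_eval(model, scenario):
        # Same scenario, but the model holds real function calls and must act.
        response = model.run(scenario.text, tools=scenario.tools)
        return [call.name for call in response.tool_calls]

    def judgment_action_gap(model, scenario):
        # Divergence between the two is the signal that the stated-preference
        # eval is under-measuring deployed behavior.
        return stated_preference_eval(model, scenario), deployed_behavior_eval(model, scenario)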

Time-compression testing surfaces drift before production

Long-running self-improvement loops exhibit failure modes that only manifest over weeks — memory filling up, skill drift, rare edge cases accumulating, feedback-loop artifacts. Waiting weeks to observe these in production is expensive. The working discipline is time-compression testing: run the pipeline under accelerated-time conditions — 30 days compressed into 30 minutes using mock data and time-scaling scripts — and observe the failure distribution before deployment. The technique surfaces slow drift, late-stage memory pressure, and rare-case accumulation failures that are invisible in short-horizon evaluation.
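
A minimal sketch of the time-scaling piece, assuming the pipeline reads the clock through an injectable dependency (the hypothetical part) rather than calling system time directly:

    import datetime

    class CompressedClock:
        """Maps 30 simulated days onto a 30-minute wall-clock run (1440x)."""

        def __init__(self, speedup=1440):
            self.start = datetime.datetime.now()
            self.speedup = speedup

        def now(self):
            elapsed = datetime.datetime.now() - self.start
            return self.start + elapsed * self.speedup

    # Inject clock.now wherever the pipeline reads time, feed it mock data,
    # and watch memory pressure, skill drift, and rare-case accumulation
    # unfold in minutes instead of weeks.
    clock = CompressedClock()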

The general principle applies across every failure mode named below: time-compression testing is the cheap way to observe production-scale behavior without waiting for production-scale time to pass.

Self-improvement does not apply to every domain

Two categories of work belong outside the self-improvement loop, plus one footnote.

  • Non-verifiable domains — fire-safety regulation, ethics-committee decision-making, cultural-norm enforcement. No clean eval function can be written because the criteria of good judgment are contested, revisable, and historically dependent. A prompt-evolution loop in these domains optimizes a proxy metric that diverges from the real goal, and the divergence compounds silently as the loop iterates.
  • Safety-critical systems — clinical diagnosis at the decision layer (support tools are acceptable, final diagnosis stays human), aviation control, nuclear operations, any system where a single wrong output at scale is catastrophic. Traditional procedural automation with AI at the I/O layer is the production pattern; self-improvement in the decision logic is not.
  • Low-cost-of-error domains (the footnote) — search ranking, ad targeting, recommendation systems. These have tolerated self-improving ML for two decades, because the per-output cost of being wrong is small and the sample volume is large. Prompt evolution is irrelevant in these domains because gradient-based optimization already dominates. The chapter's mechanism lives in the middle — medium-cost-of-error, medium-sample-size, verifiable tasks where prompt quality matters and evolutionary search is economically affordable.

Experiments are cheap enough in 2026 to run thousands against an eval set for the price of a few dollars of inference. The bottleneck is no longer writing code or running experiments; it is writing evals that capture what good output looks like. Domain experts who cannot code but can describe what good output looks like hold the leverage the self-improvement loop needs. The organizational shift the loop demands is the shift from "herding the agent with manual prompt tweaks" to "writing better evals and letting the system search prompt-space on its own."

Eight failure modes map to specific mitigations

The consolidated list; each entry pairs a failure mode with its mitigation.

  • Optimizing the proxy. The eval tilts toward a measurable-but-surrogate metric — response length, keyword presence, citation count — and the loop improves the proxy while the actual outcome is unchanged or worse. Mitigation: business-metric parallel tracking; alignment audits between proxy and target on a sampled-trace basis.
  • Gamed eval specification. The robot-soccer pattern — the evolved solution satisfies the letter of the metric while violating its spirit. Mitigation: independently-collected held-out data; adversarial testing by domain experts on the worst-case inputs.
  • Shortcut learning. The X-ray pattern — the system learns a feature correlated with the target but not causally related. Mitigation: held-out validation across independent collection conditions; counterfactual inputs that break the shortcut.
  • Loop without new information. The agent repeats itself because no fresh data enters the context. Mitigation: the reflection model injects new analysis; the loop breaks on information, not patience.
  • Distribution shift. Production data diverges from the eval set; performance silently degrades. Mitigation: continuous monitoring of eval-input distribution versus production distribution; schedule eval refresh when divergence exceeds threshold.
  • Degenerative feedback loops. The agent's own actions change the incoming data distribution — a market-maker whose trades erase the pattern it was trading on. Mitigation: periodic freeze-and-compare against a baseline non-agent data collection.
  • Concept drift. The monitored concepts evolve — the skill that classified "urgent customer issue" in 2024 is miscalibrated in 2026 because "urgent" shifted. Mitigation: scheduled label refresh; business-metric parallel tracking.
  • Judgment-action gap. The loop optimizes stated preferences while deployed behavior diverges. Mitigation: scenario-based function-calling evals in the standard suite.

Horizon: Autonomous context compression. A 2026 trend worth naming as an open direction: harnesses that cede context-management to the agent itself. Rather than the harness running a fixed compaction rule, the agent invokes a dedicated compression tool on its own working memory when it judges the cost-benefit favors compression. This is the first step toward agents that actively manage their own execution environment — a move from "the harness configures the agent" toward "the agent co-configures its harness." Not production at scale in 2026; worth tracking as the interface between self-improvement and the sandbox evolves.

Frontier labs ship coding agents before vertical agents because RSI compounds the lab's moat first

Founders looking at the labs' release cadence often misread it as a roadmap that will eat their niche next quarter. The actual priority ordering at the labs runs through recursive self-improvement, and the cadence reflects that priority rather than any reading of which vertical market is most attackable.

The mechanism is direct. A lab that automates its own researchers' workflow before shipping vertical agents compounds the lab's moat at every level above the workflow being automated. The next model is faster to train because the engineers running the training run are themselves agent-augmented; the next harness is cheaper to build because the harness designers are running the prior harness against their own work; the next training-data pipeline is cleaner because data labelers are themselves agent-assisted. Each layer of automation that lands inside the lab makes the next layer cheaper to ship. Shipping an accountant agent earns the lab a market position in accounting; shipping a coding agent first earns the lab faster shipping of every subsequent product. The recursive lift is what the strategic priority is selecting for.

The release-cadence consequence is visible across the 2025-2026 release schedule. Coding tools (Claude Code, Cursor, Codex, GPT-5.2-codex, the Goose-and-Claude-Cowork harness) shipped at materially higher cadence than vertical agents (accounting, legal, healthcare). Frontier labs ship coding-and-research-augmenting tools first because those are the tools that reduce the next model's cost of production. Vertical agents land later, and they typically arrive packaged inside infrastructure firms (Stripe, Intuit, Block) with the substrate the lab does not own. Founders who plan for the labs to ship a vertical-agent product into their market in the next quarter are reading the wrong release cadence; the labs are not pointed there yet, because nothing in the vertical-agent release path lifts the lab's own moat.

The corollary for any firm building a vertical agent in 2026 is that the moat sits in the substrate the labs do not own — the firm's identity policy, its decision-trace history, its customer-specific exception lists, its regulated workflow gates. Sakana's Darwin Gödel Machine, Stanford's Meta-Harness, and Meta's Hyperagents are the research expressions of the same priority: the system learns to improve its own coding-and-research substrate, because that substrate is what compounds. The vertical-agent firm wins by holding the part of the substrate the lab cannot reach without entering the firm's regulatory or data perimeter.

Run this week

Six tasks a team can run in a week to move from framing to operational first step.

  1. Pick one skill with a measurable eval (90 minutes). Identify the skill in the library whose output has a clean pass/fail criterion or an objective score. This is the first candidate for a self-improvement loop. Skills without measurable evals are not candidates.
  2. Build the regression gate before the loop closes (2–3 hours). Collect the last four weeks of that skill's outputs, sort 30–50 into acceptable/unacceptable by hand, lock the set as the regression base. No change ships without passing the set. This is the protection against the skill-improves-new-data-while-breaking-old failure mode.
  3. Wire business-metric parallel tracking (3 hours). Pick the downstream business metric the skill affects — revenue, resolution quality, customer-satisfaction. Set up a dashboard showing the eval score and the business metric side by side. If they diverge, the eval is being gamed.
  4. Run time-compression testing on one long-horizon skill (half day). Mock-data a 30-day sequence into a 30-minute run. Observe the failure distribution. Fix what surfaces before it hits production.
  5. Add a judgment-action-gap eval to one customer-facing skill (half day). Take the skill's stated-preference eval and build a scenario-based function-calling version alongside it. Run both. If the scores diverge, the stated-preference eval is under-measuring.
  6. Write the out-of-cycle trigger policy in 200 words (1 hour). Name the three trigger conditions — RED-tier failure, business-metric drop beyond threshold, external regulatory or product change. Pin the policy to the skill's owner. Monthly cadences fail when out-of-cycle triggers are implicit; they hold when the triggers are explicit and named.

The first self-improving skill does not require frontier research. It requires four pieces: a measurable eval, a regression gate, a business-metric anchor, and a monthly cadence. Every firm that has moved to steady-state self-improvement at the skill level started from a single skill with those four pieces in place. The research frontier — DGM, Meta-Harness, Hyperagents — extends the same loop to the agent and harness layers; the production discipline starts much closer to home, at the skill with the cleanest eval and the most patient reviewer.