The Machine

Skills

Skills are procedural memory for the company: encoded workers whose knowledge, workflows, constraints, and escalation rules turn a general model into a specific, repeatable organizational capability that compounds across model generations.

Stanford's 1976 MYCIN system diagnosed bacterial blood infections at the level of a senior physician by separating two things that had been tangled together in every prior attempt: the medical knowledge (hundreds of rules about organism identification, antibiotic sensitivity, dosing) and the reasoning engine that searched through them. The architecture split the what from the how. The knowledge could be audited, extended, and debugged as a standalone artifact. The inference engine became a substrate: Stanford researchers stripped MYCIN's rules, kept the engine, and produced EMYCIN, which let domain experts write new rules for new domains. Knowledge engineering as a discipline was born from that split. Fifty years later, with LLMs serving as the new reasoning engine, the same architectural move re-emerges under a new name.

A skill, in the 2026 practitioner sense, is the encoded worker. The LLM is the reasoning engine. The skill file — markdown with YAML frontmatter, living in a versioned repository — specifies the role definition, the entity types the worker handles, the workflows it executes, the hard constraints it refuses to cross, the escalation rules when it cannot proceed, and the worked examples that train its pattern recognition. Strip the skill file and the model on its own produces generally capable output; the file is what turns that generally capable output into a specific worker executing a specific job under specific guardrails. The skill is a specification of a competent worker: what they know, what they do, what they refuse, and where they escalate.
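A minimal sketch of what such a file can look like, assuming the markdown-plus-YAML-frontmatter shape described above; the field names, the variance-analysis role, and the escalation channel are illustrative, not a fixed standard:

```markdown
---
name: variance-analysis
version: 0.4.1
owner: finance-ops
escalate_to: "#finance-close"   # hypothetical channel; adjust to your harness
---

# Role
You are a financial analyst executing the monthly variance analysis against the
approved chart of accounts. You do not post journal entries; you prepare them.

# Entities
- GL detail file (CSV export, one row per posting)
- Budget and prior-period comparatives
- Materiality threshold (from config, not from judgment)

# Workflow
1. Reconcile GL totals to the trial balance before any analysis.
2. Compute variances vs. budget and prior period per account group.
3. Flag variances above the materiality threshold with a candidate root cause.

# Constraints
- Never post to the system of record; output a draft journal entry only.
- Never round beyond the precision of the source GL.

# Escalation
If the GL does not reconcile to the trial balance, stop and escalate; do not proceed.

# Worked example
Account 6210: actual 182,400 vs budget 150,000 -> variance +21.6%, above
threshold -> flag with candidate cause and the supporting rows.
```

The point is the structure: role, entities, workflow, constraints, escalation, and a worked example, each addressable and auditable on its own.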

The skill file is the encoded worker

The model alone will produce plausible output for any task you name. Plausible is the failure mode. A variance-analysis run that reconciles the GL to the wrong precision, a contract-review pass that flags the wrong clauses as risky, an outreach email that pattern-matches to a persona but misses the buying signal — each of these is plausible and each of these is wrong. What separates plausible from correct is the encoded knowledge the worker applies: the specific chart-of-accounts taxonomy, the organization's negotiation playbook, the ICP definition that draws the line between a qualified lead and a noisy one. That knowledge does not live in the model; it lives in the skill file.

The operational consequence is that every role that holds procedural knowledge in someone's head becomes, in part, a spec-writing role. A logistics coordinator encodes how the agent handles exception escalation. A financial analyst captures the firm's variance-attribution playbook; a support lead, the tier-one resolution tree. Ramp's 2026 disclosure is the clearest public number on this shift: non-engineers account for 12 percent of all human-initiated PRs on the production codebase, thousands per month, using the firm's in-house coding agent. The PRs are written by operators, not by engineers. The skill files encode what the operators know.

Warning: A skill file that reads as "a longer prompt" — paragraphs of instructions without role definition, without entity types, without constraints, without examples — produces expensive unreliability. The model has to re-derive structure on every invocation. Skill files are specifications, not monologues. If the file does not name the worker's role, enumerate the entities, list the hard constraints, and show worked examples, it is a draft, not a skill.

Do not write a skill until the task has been done at least twice

The first time you execute a task, you are discovering what it requires. The second time, you are learning what varied and what was repetitive. Starting the skill after the second execution is what lets you distinguish the parts that generalize from the parts that were incidental to a single case. Writing a skill after the first execution packages a single instance as if it were a pattern; writing one after the tenth execution means nine cycles of procedural work the organization could have captured earlier. Marketplace evidence points in the same direction: Ramp's Dojo has more than 350 shared skills, each one backed by an operator who executed the underlying process manually enough times to know what generalized.

Most skill-file bloat in 2026 comes from violating this rule in the other direction — packaging one-time tasks as if they were processes. A one-off market analysis does not need a skill; it needs a two-step manual pipeline and an archive for the output. A one-off M&A diligence on a specific target does not need a skill; the next target will be different enough that the encoded workflow will misfit. The skill library that works is composed of processes that repeat at least weekly, with enough stability in the inputs that the encoded worker's assumptions hold across runs.

Eval-first makes skills tractable

Once a task recurs often enough to deserve a skill, the first artifact to produce is not the skill itself — it is the evaluation function. The analogy is concrete: solving a Sudoku is hard, but checking a solved Sudoku is fast and cheap. The evaluator's job is checking, not solving. Once you can write an evaluation function that takes the agent's output and returns a quality score against ground truth, everything else becomes tractable: the agent can self-improve, failure modes become detectable, monitoring becomes possible, and the procedure can generate itself through iteration against the eval.
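A minimal sketch of the idea, assuming the agent's output and the ground truth are both parsed into dictionaries; the field names and the variance-report framing are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float          # 0.0 to 1.0
    failures: list[str]   # human-readable reasons, reused by the improvement loop

def evaluate_variance_report(output: dict, truth: dict, tolerance: float = 0.01) -> EvalResult:
    failures = []
    # Checking is cheap: compare each reported variance to the ground-truth number.
    for account, expected in truth["variances"].items():
        got = output.get("variances", {}).get(account)
        if got is None:
            failures.append(f"missing account {account}")
        elif abs(got - expected) > tolerance:
            failures.append(f"account {account}: got {got}, expected {expected}")
    # Hard constraint from the skill file: every flagged item carries a candidate cause.
    for item in output.get("flags", []):
        if not item.get("candidate_cause"):
            failures.append(f"flag without candidate cause: {item.get('account')}")
    checks = len(truth["variances"]) + len(output.get("flags", [])) or 1
    return EvalResult(score=1.0 - len(failures) / checks, failures=failures)
```

Checking is the cheap half; producing the report is the expensive half, and that is what the skill and the model do.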

The practical consequence is that domain experts should write evals, not procedures. The expert knows what good output looks like: they can flip through 100 variance reports and sort them into acceptable and unacceptable within minutes, even when they cannot articulate the rules they applied. They cannot easily write the procedure the agent should follow, because they have not consciously abstracted it. They can write the eval, because the eval only needs to capture the judgment. Browserbase's internal agent bb makes this concrete at the skill-routing level: the system prompt carries a routing table that maps request patterns to skills (session investigation loads the investigate-session skill; PR creation loads create-pr; customer-data questions load snowflake plus customer-intelligence) and each skill's body encodes the exact debugging playbook a senior engineer would follow. The skill did not have to be invented from scratch; the evaluator (does this output match what a senior engineer would have produced?) was the starting point.
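A sketch of the routing-table idea at its simplest, with the request patterns and skill names echoing the bb example; the regex matching is an assumed implementation detail, not Browserbase's:

```python
import re

# Maps request patterns to the skill files the orchestrator should load into context.
SKILL_ROUTES = [
    (re.compile(r"session|replay|what happened", re.I), ["investigate-session"]),
    (re.compile(r"\b(pr|pull request)\b",         re.I), ["create-pr"]),
    (re.compile(r"customer|account usage",         re.I), ["snowflake", "customer-intelligence"]),
]

def route(request: str) -> list[str]:
    """Return the skills to load for a request; an empty list means no skill matched."""
    for pattern, skills in SKILL_ROUTES:
        if pattern.search(request):
            return skills
    return []

# route("why did this session fail to load the page?") -> ["investigate-session"]
```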

Track performance as a distribution across scenario classes

A skill that returns 85 percent accuracy on average can fail 40 percent of the time on its hardest 10 percent of inputs. In autoregressive systems, the probability of error grows with sequence length: the agent is more likely to be wrong on a long-tail contract clause than on a boilerplate one, more likely to be wrong on a novel financial transaction than on a standard one. A single accuracy number hides where the expensive failures live.

Designing skills for the distribution means tracking performance across scenario classes — basic, standard, advanced, tail — and setting separate thresholds for each. A failure in the basic case is a skill defect that should be fixed before ship. When the tail case fails, the skill is surfacing a scenario it was not asked to handle: that is a known-unknown that belongs with human review, not a defect. The confusion between these two categories is the most common source of skill incidents. A skill evaluated only on standard cases and deployed against tail inputs will produce confident-wrong output in exactly the cases where confident-wrong output is most expensive to clean up. Every non-trivial skill needs its evaluation set stratified across the distribution before it ships.
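A sketch of what distribution-aware evaluation looks like in practice, assuming each labeled case carries a scenario-class tag; the thresholds are illustrative, not recommendations:

```python
# None means the class is routed to human review rather than gated on a pass rate.
THRESHOLDS = {"basic": 0.99, "standard": 0.95, "advanced": 0.85, "tail": None}

def stratified_report(cases: list[dict]) -> dict:
    """cases: [{"class": "basic", "passed": True}, ...] -> per-class pass rate and verdict."""
    report = {}
    for cls, threshold in THRESHOLDS.items():
        subset = [c for c in cases if c["class"] == cls]
        if not subset:
            continue
        rate = sum(c["passed"] for c in subset) / len(subset)
        if threshold is None:
            verdict = "human-review"   # tail failures are known-unknowns, not defects
        else:
            verdict = "ship" if rate >= threshold else "fix-before-ship"
        report[cls] = {"pass_rate": round(rate, 3), "n": len(subset), "verdict": verdict}
    return report
```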

Prompt evolution compounds across model generations

Once the eval is written, the skill's prompt becomes evolvable. The pattern in the 2026 practitioner literature — sometimes referred to as reflective prompt evolution or by related names — treats the prompt as a genetic-algorithm population: the system generates variant prompts as mutations, evaluates them against the labeled data, and keeps the better-performing versions. The architecture is two-model: a base model executes the prompt; a smarter reflection model reviews errors and proposes edits. Documented production cases move skill accuracy meaningfully through this iteration loop — an under-appreciated lever once the eval exists — with the specific lift varying by task and starting point. The evolved prompt is typically unreadable by humans, long, full of conditions the author could not have articulated, specific to nuances the author could not have described. That is the mechanism working, not failing.
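A compressed sketch of the loop, under the assumption that execute runs the base model with a candidate prompt, evaluate is the eval function from earlier, and reflect stands in for the reflection model proposing edits; none of these names come from a specific library:

```python
import random

def evolve_prompt(seed_prompt, labeled_cases, execute, evaluate, reflect,
                  generations=10, population_size=8, keep=3):
    population = [seed_prompt]
    for _ in range(generations):
        scored = []
        for prompt in population:
            results = [evaluate(execute(prompt, case["input"]), case["truth"])
                       for case in labeled_cases]
            score = sum(r.score for r in results) / len(results)
            errors = [f for r in results for f in r.failures]
            scored.append((score, prompt, errors))
        scored.sort(key=lambda t: t[0], reverse=True)
        survivors = scored[:keep]
        # Mutation: the reflection model reads the errors and proposes edited prompts.
        population = [p for _, p, _ in survivors]
        while len(population) < population_size:
            _, parent, errors = random.choice(survivors)
            population.append(reflect(parent, errors))
    return scored[0][1]   # best prompt from the last evaluated generation
```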

The durability claim for skills rests on this evolution. A well-designed skill file keeps producing better output as the underlying model improves, because the eval survives model upgrades and the evolution loop can rerun against the new capability frontier. Model-specific tuning does not survive. The skill compounds across generations; the model choice sits on top as a per-task lever. Both the model and the skill library matter — the frontier tier is meaningfully better than mid-tier on hard tasks, and the next release will improve agent output quality on tasks where it was previously thin — but what skills add is durability across model generations that model-specific tuning cannot match.

Intelligence can live in the prompt for narrow task shapes

Under prompt evolution, a small executor (1.5B parameters up through mid-tier) can be pushed to useful performance on tasks where the domain is narrow and the eval is crisp. Variance analysis against a fixed chart of accounts, clause extraction against a fixed negotiation playbook, persona-matched outreach against a fixed ICP — these are task shapes where evolved prompts let a team downgrade the executor as quality compounds. The corresponding cost reduction is what makes high-volume skill deployment economically tractable at firm scale.

The limit is the task shape. On tasks where the domain is wide or the evaluation criterion is hard to specify — strategic judgment, multi-step reasoning across domains, novel-situation synthesis — the frontier-tier model still matters and prompt evolution does not rescue a weaker base. The skill compounds across model upgrades; model selection is the per-task lever on top of that asset.

Horizon: Whether prompt evolution generalizes from narrow-eval tasks to wide-domain or hard-to-specify tasks is the open question for 2026 through 2027. The documented production gains sit in narrow domains. Extrapolating to strategic-judgment tasks is speculative and should be treated as research direction, not current practice.

A non-trivial skill decomposes into named sub-agents

A skill that takes meaningfully more than a single model call to execute is, internally, a composition. The 2026 canonical set of roles practitioners converge on:

  • Orchestrator — plans the workflow and dispatches tasks.
  • Researcher — retrieves information (RAG against internal docs, web search, structured databases).
  • Data-Prep — cleans, normalizes, and aligns schemas across heterogeneous sources.
  • Analyst — computes metrics, runs comparisons, produces the numerical substrate.
  • Writer — drafts narrative content from the analyst's output.
  • Formatter — generates the final artifact (an Excel workbook, a slide deck, a PDF, a CSV).
  • Verifier — runs checks against the eval criteria before returning.

The workflow is typically fan-out / fan-in: the Orchestrator dispatches to sub-agents in parallel, then merges deterministic outputs. Per-task timeouts, exponential-backoff retries, and human-in-the-loop escalation after N failures are standard. The composition shape at the skill level — fan out to specialists, merge at an orchestrator, verify before return — is the same topology that recurs at the multi-agent fleet level, scaled up. Inside a single skill invocation, the pattern is compact. Across a fleet of skills running in parallel, it is the same pattern at larger scale.
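A sketch of that topology in code, assuming each sub-agent is an async callable; the timeouts, retry counts, and escalation message are illustrative:

```python
import asyncio

async def run_with_retries(name, agent, payload, timeout=120, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(agent(payload), timeout=timeout)
        except Exception as exc:
            if attempt == max_attempts:
                raise RuntimeError(f"{name}: escalate to human after {attempt} failures") from exc
            await asyncio.sleep(2 ** attempt)   # exponential backoff before the retry

async def run_skill(payload, researcher, data_prep, analyst, writer, formatter, verifier):
    # Fan out: independent sub-agents run in parallel.
    research, cleaned = await asyncio.gather(
        run_with_retries("researcher", researcher, payload),
        run_with_retries("data-prep", data_prep, payload),
    )
    # Fan in: downstream steps consume the merged results sequentially.
    numbers = await run_with_retries("analyst", analyst, {"research": research, "data": cleaned})
    draft = await run_with_retries("writer", writer, numbers)
    artifact = await run_with_retries("formatter", formatter, draft)
    return await run_with_retries("verifier", verifier, artifact)   # checks eval criteria before return
```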

Hooks turn approval from a prompt instruction into a harness contract

A prompt instruction that says "always ask for approval before posting a journal entry" is a suggestion the model can forget on the hundredth run, the tenth edge case, or the next model upgrade. A hook is a harness-level contract the model cannot bypass. The skill file specifies which hooks fire and when; the harness enforces them.

The three hook types that recur across 2026 production skills:

  • Pre-write hooks fire before the skill modifies a system of record. The variance-analysis skill pauses before posting a journal entry; the contract-review skill pauses before redlining a template; the outreach skill pauses before sending the first email in a sequence. A named human or role receives a notification and must return an approval token before execution resumes.
  • Post-draft hooks fire before the skill finalizes an artifact visible to a stakeholder. The CFO-memo skill pauses after generating the narrative summary but before exporting the PDF; an analyst reviews the draft against the source data; the hook waits for the approval before the artifact ships.
  • High-risk-flag hooks fire when the skill's own confidence score falls below threshold or when the input matches a risk-tier pattern (a contract flagged RED by the playbook, a variance above the materiality threshold, an outreach target in a regulated category). The hook routes to human review with the confidence score and the triggering evidence attached.

Hooks are the bridge between the skill file and the harness. The file is declarative: it names the step and the condition under which a given hook fires. The harness handles the imperative work — pausing execution, sending the notification, waiting on the approval token, logging the decision, resuming the workflow. Neither component alone carries the contract, and without the enforcement layer the declaration would be just another suggestion the model can drift past on a long run.
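A sketch of the split, with the declarative side as data the skill file could carry and the imperative side as the harness loop; the hook entries and the approval transport are assumptions:

```python
# Declarative side (what the skill file names): hook type, triggering step, approver role.
HOOKS = [
    {"type": "pre-write",      "step": "post_journal_entry", "approver": "controller"},
    {"type": "post-draft",     "step": "export_cfo_memo",    "approver": "senior-analyst"},
    {"type": "high-risk-flag", "step": "*", "approver": "reviewer",
     "condition": lambda ctx: ctx["confidence"] < 0.7},
]

# Imperative side (what the harness does): pause, notify, wait for a token, log, resume.
def enforce_hooks(step: str, context: dict, request_approval) -> None:
    for hook in HOOKS:
        step_matches = hook["step"] in (step, "*")
        condition = hook.get("condition", lambda ctx: True)
        if step_matches and condition(context):
            token = request_approval(hook["approver"], step, context)   # blocks until answered
            if not token:
                raise PermissionError(f"{hook['type']} hook on {step}: approval denied")
            context.setdefault("audit_log", []).append(
                {"step": step, "hook": hook["type"], "approved_by": hook["approver"], "token": token}
            )
```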

Tag every claim with its source and a confidence score

For any skill whose output a human will later cite — research synthesis, contract review, legal memo, financial analysis, regulatory filing — three elements belong in the skill design from the first draft.

  • Verbatim quotes with source identifiers. The skill outputs its claims as claim + source-file + page or line reference + direct quote. No synthesis without a traceable anchor. This is the operational answer to "how do we stop the agent from making things up at scale": the agent is not asked to suppress fabrication, it is required to surface evidence.
  • Confidence scoring. Each extracted claim carries a model-generated confidence score and a citation-alignment score (the embedding similarity between the claim text and the cited quote). Novel claims, abstract statements, and claims whose citation does not align above threshold are flagged automatically.
  • Approval thresholds keyed to the score. Automated rules on the confidence score route the output: auto-approve above 0.9, flag for human review in the 0.7-to-0.9 band, require re-extraction or cut below 0.7. Novel claims and any claim with an unverified DOI or URL always go to human review regardless of the score.

When every output must carry a verbatim quote plus a source identifier, the dominant failure mode shifts: confident fabrication becomes harder to produce than verifiable synthesis. The numbers vary from shop to shop with how "unsupported" is defined, but 2026 knowledge-work deployments show meaningful reductions in unsupported-claim rates against prompt-and-hope baselines. The mechanism is the load-bearing claim here, not any specific rate — provenance enforcement changes what the agent can produce, not just what it is asked to produce.
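A sketch of the threshold routing, using the band boundaries from the list above; the claim fields are hypothetical:

```python
def route_claim(claim: dict) -> str:
    """Return one of: auto-approve, human-review, re-extract."""
    if claim.get("novel") or not claim.get("reference_verified", True):
        return "human-review"   # novel claims and unverified DOIs or URLs always go to review
    # Gate on the weaker of the two scores: model confidence and citation alignment.
    score = min(claim["confidence"], claim["citation_alignment"])
    if score >= 0.9:
        return "auto-approve"
    if score >= 0.7:
        return "human-review"
    return "re-extract"

# route_claim({"confidence": 0.93, "citation_alignment": 0.88, "novel": False}) -> "human-review"
```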

Skills compound into ontology, and the Master Skill is the operational form

A single skill serves a single process. A skill library grows past several dozen files and starts to show DRY and MECE violations: the same logic for resolving an entity appears in five places, the same chart-of-accounts taxonomy in three places, two skills contradict each other on how to handle the same exception. The endpoint of mature skill accumulation is consolidation into a unified knowledge graph — the single source of truth that replaces hundreds of individual skill duplications.

The operational form of skill accumulation at team scale is the Master Skill: one file that links to all the sub-skills for a process. Ramp's 2026 Glass architecture shows the public version of this pattern at codebase-maintenance scale. The maintenance Master Skill composes three primitives: a Defrag skill that scans the codebase for fragmentation (duplicated components, inconsistent patterns, files that should be consolidated) and fixes what it finds; a shared design-system library so every new UI element inherits the same visual language instead of generating CSS from scratch; and documentation-validation at PR time, where adding a feature that is not documented fails the pipeline gate. Each primitive is a skill. Together, they form a Master Skill — a self-maintenance process the agent executes against. The cultural artifact that makes it stick is the rule: work on the codebase flows through the Master Skill pipeline, not around it. Stripe's engineering organization uses the same shape in its Blueprint system — workflows defined in code that combine deterministic steps with agent-driven steps, gated by plan approval and human PR review — and ships over 1,000 fully-agent-produced pull requests per week, each reviewed before merge. Both firms pair deterministic scaffolding with evolved skill content. The same blueprint primitives — branch-per-agent, plan-approval, policy linters — that Stripe applies to coding generalize to non-engineering skill pipelines at other firms.

The vocabulary for reading another organization's stack is compact and stable: a skill is an auto-loaded markdown file that activates based on context; a command is an explicit-invoke instruction set called via /name; a subagent is a fresh-context process spawned by a skill or command to run a specific task and return results; a script is plain code with no AI involvement, used for scheduling, file operations, and headless invocation. Most production stacks use all four. When in doubt about which primitive a step requires, practitioners default to the script.

Skills have taken root in five non-engineering functions

The 2026 center of gravity for skill deployment is non-engineering knowledge work. The common pattern across functions: a skill that encodes a domain process the expert previously executed manually, sub-agent decomposition inside the skill (Orchestrator plus Researcher plus Formatter at minimum), hooks at irreversible or high-risk points, provenance and confidence tagging for outputs humans will cite, and immutable audit logging.

Finance — monthly close and variance analysis. General Ledger detail files land in a shared folder; the finance team triggers /variance-analysis. Sub-agents read the GL, compute variances against budget and prior period, generate a formatted Excel workbook with pivot tables and conditional-formatted variance columns, and produce an auto-reconciliation tab flagging mismatches with candidate root causes. A companion skill (/cfo-memo) synthesizes the numerical findings into a narrative executive summary. A pre-write hook requires a manager's approval token before any journal entry writes. Every invocation logs to an immutable store with skill version, user, and source files, archived for seven years for SOX compliance. Reported 2026 cycle-time reductions are roughly an order of magnitude: a workbook that took thirty to forty-five minutes of analyst time completes in a few minutes, freeing the analyst to review and investigate rather than produce.

Marketing operations — multi-platform ad performance reporting. A scheduled skill ingests data from Meta, Google Ads, and LinkedIn through CSV exports or API connectors. It normalizes platform-specific metric names and dimensions into a canonical schema (mappings stored as version-controlled JSON files), writes to a central store (BigQuery, Parquet on S3, or Notion tables), then runs analysis — week-over-week changes, anomaly detection, cohort analysis — and generates branded slide decks populated from master templates. Quality gates include data-freshness alerts on source-timestamp versus ingest-timestamp latency, schema-drift detection via header checksums, and deterministic-grounding prompts that force the agent to cite the underlying metric rows in its narrative. Reported 2026 impact: a 20-to-60 percent reduction in reporting time; reporting latency from 24-to-72 hours down to under an hour; reconciliation gaps held below 5 percent.
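A sketch of two of those quality gates, the canonical-schema mapping loaded from a version-controlled JSON file and schema-drift detection via a header checksum; the paths and mapping contents are illustrative:

```python
import csv, hashlib, json

def load_mapping(platform: str) -> dict:
    # e.g. mappings/meta.json: {"Amount Spent (USD)": "spend", "Impressions": "impressions"}
    with open(f"mappings/{platform}.json") as f:
        return json.load(f)

def header_matches(csv_path: str, expected_checksum: str) -> bool:
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    checksum = hashlib.sha256("|".join(header).encode()).hexdigest()
    return checksum == expected_checksum   # False -> raise a schema-drift alert, skip ingest

def normalize_rows(csv_path: str, platform: str) -> list[dict]:
    mapping = load_mapping(platform)
    with open(csv_path, newline="") as f:
        # Rename platform-specific columns into the canonical schema; pass unknowns through.
        return [{mapping.get(k, k): v for k, v in row.items()} for row in csv.DictReader(f)]
```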

Knowledge work — automated literature review and synthesis. The pipeline sits in a structured file system (raw_pdfs/, canonical_papers/, metadata/, notes/, outputs/). Two-step PDF ingestion (OCR and document parsing, then CrossRef or Unpaywall for DOI and metadata). Two-pass deduplication: exact DOI match, then embedding similarity for semantic clustering. Per-paper extraction skill pulls research question, methods, results, and limitations as verbatim quotes with page numbers — structured JSON output. Two-stage synthesis: extractive summarization per section, then abstractive synthesis across papers with provenance tags linking every claim back to paper, quote, and page. Hybrid retrieval combines BM25 for keywords with vector similarity for semantics. Human approval gates fire on novel or abstract claims. Reported 2026 throughput: six to twelve papers per hour for full extraction; thirty to a hundred for metadata-only passes; citation parsing accuracy around 95 percent post-verification; editor review time for a 20-paper annotated bibliography drops from three to five hours down to thirty to sixty minutes.
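A sketch of the two-pass deduplication, assuming an embedding function is available; the similarity threshold is illustrative:

```python
import numpy as np

def dedupe(papers: list[dict], embed, threshold: float = 0.95) -> list[dict]:
    # Pass 1: exact DOI match (case-insensitive), keeping the first occurrence.
    seen_dois, survivors = set(), []
    for p in papers:
        doi = (p.get("doi") or "").lower()
        if doi and doi in seen_dois:
            continue
        seen_dois.add(doi)
        survivors.append(p)
    # Pass 2: embedding similarity on title plus abstract for semantic near-duplicates.
    kept, vectors = [], []
    for p in survivors:
        v = np.asarray(embed(p["title"] + " " + p.get("abstract", "")))
        v = v / np.linalg.norm(v)
        if any(float(v @ u) >= threshold for u in vectors):
            continue   # near-duplicate of a paper already kept, drop it
        kept.append(p)
        vectors.append(v)
    return kept
```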

Sales and customer success — automated outreach plus account-brief generation. The workflow is event-driven: a new lead signal in the CRM or a significant account event triggers the skill. Enrichment calls CRM connectors (Salesforce, HubSpot) and external data sources through MCP connectors with field-level allow and deny lists that redact PII while fetching firmographics and tech-stack data. Core generation produces a one-page account brief (executive summary, top buying signals, buying-committee map, three-question call plan) plus two or three personalized email variants per persona (executive, technical, end-user). Mandatory human-in-the-loop review: drafts present in the agent UI or a dedicated Slack channel for the account executive to edit and approve; only after approval does the skill invoke a scoped send API and log the activity into the CRM. Safeguards layer tightly: OAuth with narrow scopes (read-only for research, write-only for logging), MCP manifest allowlists at field level, middleware tokenizing PII before the model sees it, and an immutable audit log. Impact is measured through A/B testing against human-crafted sequences over four-to-eight-week periods, tracking response rate, meetings set per 100 touches, qualified-meeting rate, and the rep-time reduction on call preparation.

Legal and compliance — first-pass contract review and redlining. Contract ingest through a sanctioned connector or local file access. The skill performs clause extraction and normalization (indemnity, liability caps, confidentiality, term, IP). Each clause compares against the organization's negotiation playbook; risk is tagged GREEN, YELLOW, or RED with rationale. The skill generates suggested redlines and rationale comments for flagged clauses. Output: an issues summary plus a change-tracked redline document, routed to attorney review by automated triage (RED flags go to senior counsel; more than three YELLOW flags trigger mandatory senior review). Every action logs with user, timestamp, model version, and playbook version. Privacy posture is either local-only processing or a sanctioned MCP connector into the enterprise VPC; never public cloud. Reported 2026 impact: 30-to-70 percent reduction in time-to-first-review; 70-to-90 percent precision on standard risky-clause detection; measurable share of standard NDAs cleared without attorney escalation; the team tracks the disagreement rate between model flags and attorney final decisions to tune the playbook.

Morgan Stanley's DevGen.AI sits outside this functional grouping but inside the same pattern. The in-house tool, built on OpenAI models and introduced in January 2025, reverse-engineers legacy COBOL and Perl into plain-English specifications — "extracting the business logic from the reams of patchy COBOL and custom libraries and turning them into a clear and comprehensive set of specifications," in the words of the firm's global head of technology and operations. The encoded worker is a spec-writer; the knowledge is the firm's entire code base; the inference engine is the LLM; the skill file enforces the output format (specification, not new code). Nine million lines of legacy code reviewed; roughly 280,000 developer hours freed for the work of writing the new systems. The shape is MYCIN inverted: instead of encoding medical expertise so a machine can reason about bacterial infections, the skill extracts business logic from legacy code so humans can rewrite it in a modern language.

Onboarding follows a week-by-week sequence

A team introducing skills for the first time works through a sequence that is stable across 2026 practitioner reports. The target is not maximum velocity in week one; it is a durable discipline by month three.

  • Day 1-2. Identify target use cases (highest volume, most repetitive, lowest regulatory risk first). Browse the skill marketplace — internal if one exists, public repositories otherwise — for pre-built skills in the target domain. Pick the one or two that match most closely. The goal is to run a skill against real data, not to build one.
  • Day 3-5. Install, configure with the team's actual data sources, and run against real historical data. Compare output against the manual baseline. Where the skill matches, note it. Where it misses, note the specific failure mode.
  • Week 2. Customize (team-specific hooks, output-format adjustments, operational-metric tuning). Write the first team-specific skill, ideally by adapting the pre-built one rather than starting blank.
  • Week 3-4. Run in parallel with the manual process. Every discrepancy is a skill improvement. Start logging every invocation; the logs are the training data for the next iteration.
  • Month 2. Retire the manual process for this use case. The skill is now the primary path; manual is the fallback.
  • Month 3. Iterate on operational metrics (adoption, NPS, cost per execution). Retire or rebuild any skill failing the thresholds — adoption below 40 percent after thirty days, NPS below 5 of 9, or rising incident trend.

The discipline that carries through the sequence is always log execution. Every skill's final step is to record timestamp, input, output, tools called, success probability. Without execution logs, skill improvement is guesswork; with them, the next iteration has training data. Execution logs are also what let a linter like agnix catch skill configurations that have drifted into anti-patterns before they ship — agnix validates skill manifests, prompt files, and hook definitions against several hundred rules and catches in pre-commit what would otherwise surface as silent production failures.
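A sketch of the minimum viable execution log, one append-only JSONL record per invocation; the path and field names are illustrative, not a standard:

```python
import json, time, uuid

def log_execution(skill: str, version: str, user: str, inputs: dict,
                  output_ref: str, tools_called: list[str], success_probability: float,
                  path: str = "logs/skill_executions.jsonl") -> str:
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "skill": skill, "skill_version": version, "user": user,
        "inputs": inputs,               # or a content hash if the inputs are sensitive
        "output_ref": output_ref,       # pointer to the artifact, not the artifact itself
        "tools_called": tools_called,
        "success_probability": success_probability,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["id"]
```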

The proficiency levels map cleanly to hiring signal

The 2026 practitioner-defined rubric for a knowledge worker using skills is three-level. Junior: runs pre-built skills against standard inputs; reviews the generated artifacts before submission; comfortable with the interface and the basic safety checks. Intermediate: customizes skill templates for team-specific taxonomies; writes team-level hooks (approval gates, pre-post checks); validates outputs using basic QA methods. Expert: composes multi-skill pipelines spanning multiple stages (for a finance analyst, GL ingestion through reconciliation through memo generation); owns audit posture and compliance-appropriate versioning; mentors colleagues on governance practice. The same junior-runs, intermediate-composes, expert-builds pattern applies to adjacent roles (Finance Analyst, Marketing Ops lead, Legal Counsel, Sales Rep, Support Lead) with domain-specific verbs swapped in.

The hiring signal that has emerged in 2026 is direct: "junior" means the candidate can run a skill; "intermediate" means they can compose skills; "expert" means they can build skills. Performance review moves in the same direction. Time spent using skills is table stakes; time spent encoding skills is the leverage metric.

The organizational implications compound faster than most firms expect

Encoded skills persist when the employee leaves. The interview pattern — a skill that conducts a deep interview with an employee about how they answer emails, what they do when a client says X, what happens when something breaks, then generates a first-draft skill encoding the answers — captures what would otherwise walk out the door on the last day. Over twelve months, a company that systematically packages tacit knowledge into skills ends up with a library that a new hire can use on day one and a departing expert cannot take with them. The knowledge-worker employment model inverts: the company owns the encoded procedures, and the human owns the taste and judgment that decides which procedure to deploy and when to override.

Skills compound where humans cannot. AI starts much worse than any employee in any domain. But every skill iteration — every fix, every new edge case handled, every eval pass — accumulates into the skill file permanently. A human's investment lifecycle is short and fragile, capped by the employment relationship. A skill's is indefinitely long. The incremental cost of another skill iteration is near zero by month twenty-four; the value of the accumulated library grows with every use.

The backlog eventually becomes obsolete. When an agent with full company context can execute most ideas from a single prompt, queueing the idea costs more than executing it. Not every company reaches this state, and the ones that do change how they plan — less roadmap, more execution velocity, fewer quarterly planning cycles, more weekly deliverables. The firms that arrive here first discover that the organizational practice of maintaining a backlog was a workaround for a constraint (human execution speed) that has softened enough to make the workaround obsolete.

Skills produce analysis, not judgment. Models produce the arithmetic average of their training data; the tone of the question tilts the output bullish or bearish. A skill can package research, comparison, measurement, and counter-argument generation. It should not package "what should we do?" The human holds the judgment call. The principle applies to every high-stakes domain: investment, hiring, strategy, product decisions, legal posture. The skill brings the evidence to the decision-maker; the decision-maker decides. Skills that output judgments — recommend this candidate, approve this loan, pick this strategy — produce confident mediocrity at scale, because the arithmetic-average failure mode compounds with every invocation.

The anti-patterns cluster into a small named set

The anti-patterns that recur across 2026 skill-library audits cluster into a small set.

  • Skill hoarding. Packaging one-time tasks as skills. Violates the two-execution rule. Produces library bloat that increases search cost without increasing value.
  • Auto-generated skill files. Research on large auto-generated Agent MD files (400-plus lines) shows they harm performance. Skill files must be hand-curated. They are code, not output.
  • Orphan skills. No owner, no retirement policy, no eval. Within months, the library contains skills that no longer match the business and that nobody can safely delete. The skill that nobody owns is the skill that nobody can fix.
  • Skill sprawl without ontology. Hundreds of overlapping skills with DRY violations multiplying. The same domain logic re-encoded in five places. The fix is consolidation into ontology, not more skills.
  • Gamed evals. Agents will optimize for any metric you give them. An algorithm asked to maximize a walking-gait score for a soccer robot once evolved toward standing perfectly still, because that scored better than falling. Held-out validation, distribution-wide evaluation (not just the average), and continuous human review are non-negotiable.
  • Opinion skills. Skills that output judgments rather than analysis. Every instance at scale produces confident mediocrity.
  • Skill-as-prompt. Treating a skill as just a longer prompt, without role definition, constraints, entity types, examples, or escalation rules. The MYCIN architectural point inverted: all knowledge, no structure; the model has to re-derive structure on every invocation. Expensive and unreliable.
  • Wrong-tier deployment. Running a regulated workflow with a permissive posture, or gating a fast-iteration skill behind heavyweight approval flows. The 2026 practitioner tier framework — Lockdown for regulated workloads (full HITL on every irreversible action, local-only or VPC processing, 7-year retention); Controlled for internal sensitive data (OAuth-scoped connectors, HITL on writes only, 90-day audit); Open for exploratory read-only skills (default connectors, lightweight logging) — fails when the tier does not match the workload. Over-tiering kills iteration speed; under-tiering is where compliance incidents come from.

Run this week

Six tasks a team can run in a week to move from the chapter's framing to an operational first step.

  1. Identify three tasks each run more than twice this month (90 minutes). Cross-check against the two-execution rule. Pick the one with highest volume and lowest regulatory risk. This becomes the first-skill target.
  2. Write the eval before the skill (2 hours). Sort 20 historical outputs of the chosen task into acceptable and unacceptable. Articulate what separates the two classes, even if the articulation is incomplete. The eval set becomes the training signal for everything downstream.
  3. Install a pre-built skill against real data (2-3 hours). Browse the skill marketplace for anything in the target domain. Run it against one week's worth of historical inputs. Compare against the manual baseline. Note where it matches and where it misses.
  4. Add the three hook types to one existing skill (2 hours). Pre-write before any system-of-record modification; post-draft before any stakeholder-visible artifact ships; high-risk-flag when the confidence score falls below threshold. Ship the hook enforcement, not the prompt instruction.
  5. Run agnix or an equivalent linter over the skill library (90 minutes). Catch the skill configurations that drifted into anti-patterns before they surface as production failures.
  6. Interview one domain expert with the People Compiler pattern (half day). Let the skill conduct a deep interview on how the expert executes a specific repetitive task. Generate a first-draft skill file. Review it with the expert. The next iteration is their edits.

The first skill does not need to be perfect; it needs to exist and to be evolving against an eval. Everything that makes a skill library work at scale — provenance, hooks, sub-agent composition, the Master Skill, the proficiency rubric — starts from a single skill with a real eval and a real execution log. Where this discipline iterates into agent-improves-agent research is the subject of the next chapter, which picks up how the evaluation loop itself becomes automated and how firms build toward the self-improving mode.