2026-06-02
Stop Correcting Your AI. Grade It, Tune It, and Compound the ROI.

I arrived here by giving the AI more autonomy.
The more it runs on its own, the faster a slightly-off instruction turns into wasted work. Your code-review-improve loops run longer than they should. You burn time curating skills, subagent prompts, rules files, and CLAUDE.md content, tweaking by gut feel and hoping the next run goes smoother.
I went looking for a way to make prompt quality less of a guess. By “prompt” I mean any instruction you hand the agent: skills, CLAUDE.md, rule files, subagent system prompts, all of it.
None of this is new. The people shipping LLM products solved the measurement problem a while ago, and the advice is boring and consistent: don’t trust vibes, build evals, read what the model actually did. Anthropic’s guidance on agent evals leans hard on reading transcripts, and OpenAI ships a whole trace-grading guide for the same reason.
When someone’s results are lackluster, the reflex answer is “use TDD” or “try Spec Driven Development.” Both are good practices, but they structure the work without touching the instructions you feed the agent: your skills, rules files, and subagent prompts. I rarely see anyone share how to improve those.
The Drift
Things work most of the time. Then a skill fires on something it wasn’t meant for, or stays quiet on the exact task you built it for. You tweak the wording, run it again, and it’s better on one case and worse on another. Without a way to see what’s actually happening, you’re tuning by feel, and feel doesn’t compound.
Intent, Instructions, Behavior
There are three layers, and keeping them straight is the whole game:
- Intent is what you want (e.g., trigger on accessibility, stay quiet on brand styling).
- Instructions are your attempt to put that intent into words (the description or skill body).
- Behavior is what the agent actually did.
Evaluating is just measuring the gap between behavior and intent. To close that gap, you tune the instructions, which is the only layer you control directly. You only tune your intent when you realize your intentions were misguided.
To measure behavior, read the transcript instead of asking the model. Asking “did you use the skill?” gets you a self-report that can hallucinate. The transcript is ground truth: either the tool call occurred or it didn’t. Then there is the actual work it did, and that can be observed for accuracy.
To stop a lopsided skill from hiding behind a good average, you track two numbers:
- Sensitivity: Did it fire when it should have?
- Specificity: Did it stay quiet when it should have?
Just averaging them would let “fires on everything” tie with “fires on nothing.” I did some research, then had the AI generate a gate using min(sensitivity, specificity). It’s a harsh bar, but it works.
The Setup
I’m working in Claude Code, latest version as of this writing. To run this eval, I use Claude Code’s Dynamic Workflows feature. That’s where the eval lives.
I pasted two prompts into Claude Code to generate the eval and the improve loop. Treat them as a starting point: do some research, run them, and iterate to dial them in for your own setup. The point is to show how the AI can improve its own instructions.
Prompt A: The Behavioral Probe + Scorer
I have “skills”, instruction files my agent auto-loads when a task matches their description. I want to measure whether a skill actually fires, behaviorally, not by prediction. Build:
A probe runner. Given a skill name and trigger cases (
{ id, prompt, should_trigger, split }, split = tune|holdout), spawn one fresh sub-agent per case with ONLY the bare task, no hint about which skill to use (I want natural routing). Tell it to take just the first step and not write files. Tag each probe with a unique per-run marker in the task text so I can find its transcript.A scorer. Read the sub-agent transcripts (the JSON-lines session files). A “fire” = a transcript line that is a tool-use of the skill-invocation tool naming THIS skill (JSON.parse and match the structure, no brittle substring matching). Score each run from ITS OWN transcript folder; skip the top-level session log and any transcript with more than one distinct marker (those are orchestrators and cause false positives). Per case: fired/samples → firerate; score = firerate if shouldtrigger else 1-firerate. Aggregate sensitivity, specificity, and a balanced = min(sensitivity, specificity) gate, reported for tune / holdout / overall. A case with zero transcripts is a failure, not a silent drop.
Prompt B: The Improve Loop
Using that probe + scorer, build a bounded improvement loop for a skill’s
description. Each round: probe every case (per-round marker) → score → gate on the HELD-OUT balanced rate. If it passes, stop. Otherwise a tuner edits ONLY thedescription:field, learning from the TUNE-split misses (broaden in-scope language for false-negatives; tighten the out-of-scope boundary for false-positives), respecting the skill spec’s limits. Re-probe; repeat, bounded by maxRounds. Hard rules: gate on held-out / tune on tune-split (never the same cases), never commit (write a report; I review the diff), validate the skill after each edit, and be honest that skill discovery may be cached within a run so a clean confirming pass can need a fresh run.
What It Generated
A note on the code below: the examples are redacted and abbreviated, the load-bearing parts of the generated workflows, not the full files. Your AI will produce something similar when you run the prompts above.
The workflows have a lot of plumbing, but the load-bearing parts are small.
The Probe Runner
// skill-trigger-eval.js: a Claude Code Dynamic Workflow
// Spawns one fresh sub-agent per case, NO skill hints, tagged per run.
// (Redacted/abbreviated: see the section note above for context.)
const runId = generateNonce() // isolates this run's transcripts
for (const testCase of cases) {
for (let sample = 0; sample < SAMPLES; sample++) {
// Bare task + unique marker. No mention of which skill to use.
// The agent sees a normal task and routes naturally.
const prompt = `
You are an agent working in this project. Begin the following task
the way you normally would. Take only the FIRST concrete step,
then stop. Do NOT create or edit files.
Task [evaltag:${testCase.id}#${runId}#${sample}]: ${testCase.prompt}`
// Spawn a fresh sub-agent via the workflow API.
// Real skill routing happens here and gets recorded in its transcript.
await agent(prompt)
}
}
// Then point the scorer at THIS run's transcript folder.
The probe withholds every hint on purpose. You’re measuring whether the skill’s description earns the trigger on its own, which is the only thing the agent has to go on at routing time.
The Scorer
// scorer reads sub-agent transcripts and decides per case: did the skill fire?
// The fire detector. ~10 lines that do the real work.
// A "fire" = a tool_use node invoking the Skill tool for THIS skill.
// Parse the JSON, walk the tree, match the structure. No substring matching.
function skillUsed(node: unknown, name: string): boolean {
if (!node || typeof node !== "object") return false;
if (Array.isArray(node)) return node.some((n) => skillUsed(n, name));
const o = node as Record<string, unknown>;
if (o.type === "tool_use" && o.name === "Skill") {
const input = o.input as { skill?: string } | undefined;
if (input?.skill === name) return true;
}
return Object.values(o).some((v) => typeof v === "object" && skillUsed(v, name));
}
// Per case: find this case's transcripts by its evaltag marker,
// check each one for the skill tool_use, compute fire_rate.
// ISOLATION matters: only score transcripts from THIS run,
// skip orchestrator logs (they contain multiple evaltags).
// Aggregate: sensitivity, specificity, and the gate
const sens = mean(positives.map(r => r.fire_rate)); // fires when it should
const spec = mean(negatives.map(r => 1 - r.fire_rate)); // quiet when it should be
const balanced = Math.min(sens, spec); // the gate
// Report per split (tune / holdout / overall).
The whole idea is in that one function. The grader is deterministic and about ten lines, and no model gets a vote on whether the skill fired.
The Improve Loop
// improve-skill-trigger.js: a Claude Code Dynamic Workflow
// Measure → tune description → re-measure. Bounded, never commits.
const GATE = 0.9;
let round = 0, prev = -1;
while (true) {
// PROBE: run the eval workflow for all cases, tagged per round
await probeAllCases(round);
// SCORE: a sub-agent runs the scorer against this round's transcripts
const ev = await score(round);
const balanced = ev.holdout.balanced;
// GATE on held-out, not tune
if (balanced >= GATE) { log("passing"); break; }
if (round > 0 && balanced <= prev) { log("stalled"); break; }
if (round >= MAX_ROUNDS) { log("maxed out"); break; }
// TUNE: learn from TUNE-split misses only
// false negatives (should fire, didn't) → broaden in-scope language
// false positives (fired, shouldn't) → tighten out-of-scope boundary
// A tuner agent edits ONLY the description field in SKILL.md.
const misses = ev.perCase.filter(c => c.split === "tune" && c.score < 1);
await agent(tunePrompt(misses));
// Validate the skill still parses, then re-probe next round
await agent(validatePrompt());
prev = balanced; round++;
}
// Write a report with per-round scores and the before/after description.
// NEVER commits. You review the diff and decide.
await agent(reportPrompt());
The tune/holdout split is me borrowing train/test discipline from machine learning.
The loop auto-edits descriptions based on the metric. The guardrails are deliberate: it validates the skill after every edit, and it never commits. You review the diff and decide.
The Example Skill
I picked a skill trigger because it’s the cleanest case to teach. The skill fires for a good reason, or it doesn’t fire for a bad reason. The same shape works on output grading, but it’s busier, so I’ll cover that in another post.
To test this I built a deliberately vague accessibility skill:
---
name: a11y-review
description: Helps make the app nicer for the people who use it.
---
This description is vague on purpose. To test it, I had Claude generate 18 test cases, balanced between should-fire and should-not-fire, split into tune and holdout:
{ "id": "p-aria-form", "should_trigger": true, "split": "tune", "prompt": "Add ARIA labels to the household search form." },
{ "id": "p-contrast", "should_trigger": true, "split": "tune", "prompt": "Our dashboard text fails WCAG AA color contrast, fix it." },
{ "id": "n-redesign", "should_trigger": false, "split": "tune", "prompt": "Redesign the landing page to look more modern and on-brand." },
{ "id": "n-brand-color", "should_trigger": false, "split": "tune", "prompt": "Change the primary button color to match our new brand palette." },
{ "id": "p-alt-text", "should_trigger": true, "split": "holdout", "prompt": "Add alt text to the uploaded document thumbnails." },
{ "id": "n-ci", "should_trigger": false, "split": "holdout", "prompt": "Set up CI to run the test suite on pull requests." }
The near-miss pairs are where this earns its keep. “Fix the contrast for WCAG compliance” should fire; “change the button color to match the new brand” should not. They sit a few words apart, and a vague description can’t tell them apart.
How the Description Improved
Round 0 fails. The skill misses real accessibility tasks and fires on generic “improve the app” ones, which is low sensitivity and low specificity at the same time.
The loop rewrites the description from the tune-split misses. After a few rounds it goes from this:
Helps make the app nicer for the people who use it.
to something like this:
Audits UI code for accessibility issues and suggests fixes. Covers ARIA
attributes, semantic HTML, keyboard navigation, screen-reader support, color
contrast, focus management, and WCAG compliance. Use whenever the task involves
accessibility, a11y, assistive technology, or inclusive design. NOT for visual
redesign, performance, copywriting, or brand styling.
The held-out balanced rate climbs as that happens. The description generalized. It routes correctly on cases it was never tuned against, which is the whole reason for splitting them.
Where This Goes Next
The same shape works on the instructions too. Once routing is solid, the next question is whether the skill body, the subagent prompt, or the rule file actually produces good work.
That eval is more demanding on two fronts:
- You have to define the desired outcome (no triggering heuristic to fall back on).
- The agent has to do the work, so it writes files. Once the eval produces artifacts, isolation stops being optional.
I run them on throwaway git worktrees so scenarios can’t collide and I can discard everything after scoring. Worktrees are my own adaptation, not a standard. The UK AI Safety Institute’s Inspect framework sandboxes eval runs in containers. Either way: don’t let an artifact-producing eval contaminate the thing you’re measuring.
For user-facing agents in production, the same playbook holds with higher stakes:
- Archive prompt iterations with their scores so you can see whether a change helped or just moved the failure around.
- Run the eval as a dress rehearsal before you ship a change.
- Grade against real inputs you gathered from production, not scenarios you made up.
Hamel Husain’s evals writing puts it well: when monitoring surfaces a new failure, that failure becomes a permanent test case. Fabricated cases are fine for a cold start, but they drift, so periodically add real cases over time.
Caveats
It measures proxies, not outcomes. “The skill fired” doesn’t mean “the agent produced correct accessible code.” Firing is necessary, not sufficient.
LLM judges can be confidently wrong. When you extend this to grade output quality, the judge is a second opinion, not ground truth. Test the judge on cases you know the answer to. Until it agrees, treat it as advisory.
Transcript isolation matters more than the scoring. If the scorer picks up transcripts from the orchestrator or a previous run, the numbers look great and mean nothing. I hit exactly that lie firsthand before I locked down the per-run isolation. Score from this run’s own folder, skip multi-tag files, match on exact markers.
Skill discovery may be cached. A description edit mid-session might not take effect until a fresh run, which can make a confirming pass look like a stall.
There’s no settled convention for any of this. The measurement discipline is borrowed and solid: read the transcript, split your cases, only trust scores on cases you didn’t tune on.
Make It Yours
The example skill and all 18 test cases are below. The skill is a casual demo I sketched; the eval, the improve loop, and the test cases were all generated by Claude from the prompts above. Run the same prompts against your own skill to generate an eval and cases for it, then tweak. Copy the skill into .claude/skills/a11y-review/SKILL.md as a starting point if you want, but experiment with your own skills and other prompts as needed.
Example skill: SKILL.md
---
name: a11y-review
description: Helps make the app nicer for the people who use it.
---
# a11y-review (deliberately weak example)
Ground truth (what this skill is really for): web accessibility review of the
UI, covering ARIA attributes, semantic HTML, keyboard navigation, screen-reader support,
color contrast, focus management, WCAG. NOT general UX/visual design, performance,
copywriting, data/model work, or ops.
Example test cases: triggering.json
{
"skill_name": "a11y-review",
"cases": [
{ "id": "p-aria-form", "should_trigger": true, "split": "tune", "prompt": "Add ARIA labels to the household search form." },
{ "id": "p-keyboard-table", "should_trigger": true, "split": "tune", "prompt": "Make the distributions table fully navigable by keyboard." },
{ "id": "p-contrast", "should_trigger": true, "split": "tune", "prompt": "Our dashboard text fails WCAG AA color contrast, fix it." },
{ "id": "p-screenreader", "should_trigger": true, "split": "tune", "prompt": "Announce to screen readers when a record saves via Turbo." },
{ "id": "p-focus-dialog", "should_trigger": true, "split": "tune", "prompt": "Move focus into the dialog when it opens and trap it there." },
{ "id": "n-perf", "should_trigger": false, "split": "tune", "prompt": "Make the dashboard load faster." },
{ "id": "n-redesign", "should_trigger": false, "split": "tune", "prompt": "Redesign the landing page to look more modern and on-brand." },
{ "id": "n-model", "should_trigger": false, "split": "tune", "prompt": "Add a new column to the households table." },
{ "id": "n-release-notes", "should_trigger": false, "split": "tune", "prompt": "Write the release notes for version 2.3." },
{ "id": "n-brand-color", "should_trigger": false, "split": "tune", "prompt": "Change the primary button color to match our new brand palette." },
{ "id": "p-alt-text", "should_trigger": true, "split": "holdout", "prompt": "Add alt text to the uploaded document thumbnails." },
{ "id": "p-audit-form", "should_trigger": true, "split": "holdout", "prompt": "Audit the new intake form for accessibility issues." },
{ "id": "p-skip-link", "should_trigger": true, "split": "holdout", "prompt": "Add a skip-to-content link for keyboard users." },
{ "id": "p-contrast-button", "should_trigger": true, "split": "holdout", "prompt": "Improve the contrast of the button text so low-vision users can read it." },
{ "id": "n-copy", "should_trigger": false, "split": "holdout", "prompt": "Reword the error messages to sound friendlier." },
{ "id": "n-ci", "should_trigger": false, "split": "holdout", "prompt": "Set up CI to run the test suite on pull requests." },
{ "id": "n-query", "should_trigger": false, "split": "holdout", "prompt": "Optimize the slow household export query." },
{ "id": "n-deploy", "should_trigger": false, "split": "holdout", "prompt": "Configure the production deploy pipeline." }
]
}
The Signal
When a skill keeps misfiring and you’ve tweaked the wording three times, build the loop. Try it on the next thing that misfires.