Build This Now
speedy_devv, koen_salo

Trace to Skill

A 4-step workflow to extract real rules from real execution traces. Works on any task, any agent.

Every skill you download was written by an AI that never ran your task. It guessed. Sometimes that's fine. When your task has specific failure modes, guessing doesn't cut it.

The better approach: run your agent 20 times, tell it what was good and what wasn't, and let it extract the rules itself.

That's it. That's the whole idea.

This is based on Trace2Skill, a research paper by Alibaba's Qwen team. They showed that skills built from real execution traces consistently beat human-written ones on hard benchmarks, and transfer across model sizes.


The 4 steps

run 20 times  →  write your feedback  →  4 analysts read everything together  →  merge into SKILL.md

Step 1: run your agent 20 times

First, generate 20 variations of your task — easy ones, hard ones, edge cases. Use Claude to write them:

Generate 20 variations of this task for a Claude Code agent:

Task: [your task]

- 5 easy, straightforward versions
- 8 normal versions
- 4 hard versions with tricky edge cases
- 3 adversarial versions designed to break the agent

Output: a numbered list. Each item is a complete, self-contained task prompt.

Then run them:

claude -p "[variation 1]"
claude -p "[variation 2]"
# repeat for all 20

Claude Code saves every session automatically to ~/.claude/projects/[your-project]/. You don't need to do anything else.
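If you keep the 20 prompts in a file, a small runner saves the copy-paste. This is a sketch, not part of Claude Code: it assumes a `variations.txt` (hypothetical filename) with one complete prompt per line, and defaults to printing the commands so you can eyeball them before launching real runs.

```python
import shlex
import subprocess

def run_variations(path="variations.txt", dry_run=True):
    """Shell out to `claude -p` once per line of the variations file.

    dry_run=True only prints the commands; flip it to actually run them.
    """
    cmds = []
    with open(path) as f:
        for i, task in enumerate(f, 1):
            cmd = f"claude -p {shlex.quote(task.strip())}"
            cmds.append(cmd)
            if dry_run:
                print(f"run {i}: {cmd}")
            else:
                subprocess.run(cmd, shell=True, check=True)
    return cmds
```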


Step 2: write your feedback

Look at what each run produced — the website it built, the post it wrote, the code it generated. You don't need to read the internal logs. Just look at the output.

Write one sentence per run. That's your entire job here. It takes 10-15 minutes.

Run 1: good
Run 2: bad — too many cards, looks cluttered
Run 3: good
Run 4: bad — icons wrapped in colored divs, looks cheap
Run 5: good
Run 6: bad — two CTAs above the fold, confusing
...

The agents will handle the rest. They know how to read the internal traces. You know if the output was actually good. That's the split.
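The one-sentence-per-run format above is easy to parse if you ever want to feed it to the analysts programmatically. A throwaway sketch (`parse_feedback` is a hypothetical helper, not part of any tool):

```python
import re

def parse_feedback(text):
    """Turn 'Run N: good/bad — note' lines into {run: {label, note}}."""
    runs = {}
    for line in text.strip().splitlines():
        m = re.match(r"Run (\d+):\s*(good|bad)(?:\s*[—-]\s*(.*))?", line)
        if m:
            num, label, note = m.groups()
            runs[int(num)] = {"label": label, "note": note or ""}
    return runs

feedback = """Run 1: good
Run 2: bad — too many cards, looks cluttered
Run 3: good"""
print(parse_feedback(feedback)[2]["label"])  # bad
```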


Step 3: spawn 4 analysts in parallel

Create these four files in .claude/agents/. Each one focuses on a different angle, but all four read the same 20 sessions together. Running them in parallel removes the bias you'd get from a single reviewer.

.claude/agents/error-analyst.md

---
name: error-analyst
description: Reads all 20 sessions and my feedback. Finds the root cause behind every bad run. Proposes rules that would have prevented each failure. Run in parallel with the other analysts.
---

You analyze why runs went wrong.

You receive:
- A task description
- My feedback on each run (one sentence per run, labeled good/bad)
- Access to the last 20 session files in ~/.claude/projects/[project]/

Process:
1. Read all 20 sessions
2. For every run I marked bad: find the root cause in the actual trace (not just the error message)
3. Check if the same problem appears in multiple bad runs
4. Propose a rule that would have prevented it

Rule format:
- RULE: [what to do, imperative]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Only propose rules that show up in 2+ runs.

.claude/agents/success-analyst.md

---
name: success-analyst
description: Reads all 20 sessions and my feedback. Finds what the agent did right in good runs that it didn't do in bad ones. Run in parallel with the other analysts.
---

You find what made the good runs good.

You receive:
- A task description
- My feedback on each run
- Access to the last 20 session files in ~/.claude/projects/[project]/

Process:
1. Read all 20 sessions
2. For every run I marked good: find the behaviors that made it work
3. Find behaviors present in good runs that are absent in bad runs
4. Propose rules that encode those behaviors

Rule format:
- RULE: [what to do, imperative]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Skip obvious rules. Look for the non-obvious things that actually made the difference.

.claude/agents/structure-analyst.md

---
name: structure-analyst
description: Reads all 20 sessions and my feedback. Looks at the sequence of steps taken, not the content. Finds ordering patterns that correlate with good or bad outcomes. Run in parallel with the other analysts.
---

You look at the shape of runs, not what was produced.

You receive:
- A task description
- My feedback on each run
- Access to the last 20 session files in ~/.claude/projects/[project]/

Look at tool call sequences, step ordering, verification steps, unnecessary detours.

Questions:
- Which sequences appear in good runs but not bad ones?
- Are verification steps missing from bad runs?
- Are there steps that add noise without improving the output?

Rule format:
- RULE: [ordering or sequencing rule, imperative]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Only rules with 2+ run support.

.claude/agents/edge-analyst.md

---
name: edge-analyst
description: Reads all 20 sessions and my feedback. Focuses on the hard and adversarial runs. Finds assumptions the agent makes that break under pressure. Run in parallel with the other analysts.
---

You focus on the runs I marked bad, especially the tricky ones.

You receive:
- A task description
- My feedback on each run
- Access to the last 20 session files in ~/.claude/projects/[project]/

Find:
- What inputs broke the agent that shouldn't have?
- What assumptions does the agent make that fail at the edges?
- What checks are missing?

Write rules as guards: "Before doing X, verify Y."

Rule format:
- RULE: [defensive check or guard]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Every rule must link to a specific run.

Now run all four at once, passing your feedback and pointing them at your sessions:

Run these 4 agents in parallel. Give each the same context.

Task: [your task description]
Project slug: [your-project] (sessions are in ~/.claude/projects/[your-project]/)

My feedback:
Run 1: good
Run 2: bad — too many cards, looks cluttered
Run 3: good
[... all 20]

Agents to run:
- error-analyst
- success-analyst
- structure-analyst
- edge-analyst

Each agent should read the actual session files to understand what happened in each run.

Step 4: merge into one SKILL.md

You have four sets of proposed rules. Most overlap. Run this to consolidate:

Merge these 4 analyst outputs into a single SKILL.md.

Task: [your task]
Existing SKILL.md: [paste or write "none"]

[paste all 4 analyst outputs]

Rules for merging:
- Merge rules that say the same thing
- When two rules conflict, keep the one with more run evidence
- 8+ runs: core rule (goes in main SKILL.md)
- 4-7 runs: guidance (main SKILL.md, secondary section)
- 2-3 runs: edge case (goes in references/ subfolder)
- 1 run: discard

Output as a SKILL.md:

# [Task name]

## When to use this skill
[one short paragraph]

## Core rules
[numbered list]

## Patterns
[bullet points]

## Failure modes
["If X, do Y" format]

Max 30 rules in the main file.
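The evidence thresholds in the merge prompt boil down to a simple bucketing function. A sketch, using the cutoffs above (8+, 4-7, 2-3, 1):

```python
def bucket(rule):
    """Assign a proposed rule to a SKILL.md section by evidence count."""
    n = len(rule["evidence"])  # evidence = list of supporting run numbers
    if n >= 8:
        return "core"        # main SKILL.md, core rules
    if n >= 4:
        return "guidance"    # main SKILL.md, secondary section
    if n >= 2:
        return "edge-case"   # references/ subfolder
    return "discard"         # single-run rules don't survive the merge

rules = [
    {"rule": "One hero, one CTA", "evidence": [2, 7, 9, 14, 18]},
    {"rule": "No icon divs", "evidence": [3, 6]},
    {"rule": "One-off quirk", "evidence": [12]},
]
for r in rules:
    print(r["rule"], "->", bucket(r))
```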

What you end up with

.claude/
  agents/
    error-analyst.md
    success-analyst.md
    structure-analyst.md
    edge-analyst.md
  skills/
    [your-task]/
      SKILL.md
      references/
        edge-cases.md

A real example from running this on a landing page builder:

# Landing Page Builder

## Core rules
1. Never wrap icons in a div with a background. Use the SVG path directly. (runs 3, 6, 11, 15)
2. One hero section, one CTA. Pages with two CTAs above the fold had lower click-through. (runs 2, 7, 9, 14, 18)
3. Limit feature sections to 3 items. Grids of 6+ cards look like AI slop and nobody reads them. (runs 4, 8, 12, 17)

## Failure modes
- If the output has a "Features" section with more than 4 cards: cut to the 3 strongest
- If there are Lucide icons inside colored background divs: replace with inline SVG paths
- If the hero has more than 2 buttons: remove the secondary one

Every rule links to the exact runs that produced it.
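That traceability is worth enforcing. A hypothetical lint pass that flags core rules with no run citation (assumes the `(runs 3, 6, ...)` convention from the example above):

```python
import re

def uncited_rules(skill_md):
    """Return numbered rule lines that don't cite any run numbers."""
    flagged = []
    for line in skill_md.splitlines():
        is_rule = re.match(r"\s*\d+\.", line)
        has_runs = re.search(r"\(runs? [\d, ]+\)", line)
        if is_rule and not has_runs:
            flagged.append(line.strip())
    return flagged

skill = """# Landing Page Builder
## Core rules
1. Never wrap icons in a div. (runs 3, 6, 11, 15)
2. One hero section, one CTA.
"""
print(uncited_rules(skill))  # ['2. One hero section, one CTA.']
```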


Why this works

The reason downloaded skills underperform is simple: the AI that wrote them never ran your task. It invented rules based on what it thought might matter.

This workflow inverts that. You run first, judge the outputs (which you can actually see), and let four independent agents dig through the traces to figure out why each run went the way it did. Running them in parallel means no single perspective dominates.

The paper behind this found that parallel analysis consistently outperforms both human-written skills and sequential AI review. A 35B model evolving its own skills this way outperformed a 122B model using a hand-written skill on some benchmarks.


The shortcut

Build This Now is a production framework for shipping SaaS, internal tools, and client projects. Payments, auth, email, frontend, backend, all wired together and ready to ship.

It also ships with a full AI harness: agent orchestration, parallel dispatch, skill evolution, trace collection, and the patterns top Claude Code setups use in production. The skill workflow above is one of them. You get all of it without building the plumbing yourself.
