Trace to Skill

Every skill you download was written by an AI that never ran your task. It guessed. Sometimes that's enough. When your task has specific failure patterns, guessing isn't.

The better approach: run your agent 20 times, tell it what was good and what was bad, and let it derive the rules itself.

That's it. That's the whole idea.

This is based on Trace2Skill, a research paper from Alibaba's Qwen team. They showed that skills derived from real execution traces consistently beat hand-written ones on hard benchmarks and transfer across model sizes.

The 4 steps

Run 20 times  →  write feedback  →  4 analysts read everything together  →  merge into SKILL.md

Step 1: Run your agent 20 times

First, generate 20 variations of your task: easy ones, hard ones, and edge cases. Use Claude to write them:

Generate 20 variations of this task for a Claude Code agent:

Task: [your task]

- 5 easy, straightforward versions
- 8 normal versions
- 4 hard versions with tricky edge cases
- 3 adversarial versions designed to break the agent

Output: a numbered list. Each item is a complete, self-contained task prompt.

Then run them:

claude -p "[variation 1]"
claude -p "[variation 2]"
# repeat for all 20
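
Typing twenty commands by hand gets old fast. Here is a minimal Python sketch of that loop, assuming you saved the generated list to a file in the numbered format requested above. The helper names `parse_numbered_list` and `run_all` are made up for illustration and are not part of Claude Code. The only CLI call is the `claude -p` command shown above.

```python
import re
import subprocess

# An item starts with "1." or "1)" followed by the prompt text.
ITEM_START = re.compile(r"^\s*(\d+)[.)]\s+(.*)$")


def parse_numbered_list(text: str) -> list[str]:
    """Split a numbered list into one prompt per item.

    Lines that do not start with a number are treated as a continuation
    of the previous item. Prompts that contain their own numbered
    sub-steps would be split too, so check the result before running.
    """
    items: list[str] = []
    for line in text.splitlines():
        match = ITEM_START.match(line)
        if match:
            items.append(match.group(2).strip())
        elif items and line.strip():
            items[-1] += " " + line.strip()
    return items


def run_all(prompts: list[str], runner=subprocess.run) -> None:
    """Run each prompt through `claude -p`, one after another."""
    for number, prompt in enumerate(prompts, start=1):
        print(f"Run {number}/{len(prompts)}")
        # check=False: a failed run is still a useful trace, so keep going.
        runner(["claude", "-p", prompt], check=False)


# Usage (the file name is hypothetical):
#   from pathlib import Path
#   run_all(parse_numbered_list(Path("variations.txt").read_text(encoding="utf-8")))
```

Running the prompts sequentially keeps the run numbers in the same order as your list, which makes the feedback step easier to line up.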

Claude Code saves every session automatically in ~/.claude/projects/[your-project]/. You don't need to do anything else.
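
If you want to confirm that all 20 runs actually landed there before moving on, a small Python sketch like this one can list the most recent session files. It assumes sessions are stored as `.jsonl` files directly inside the project folder. That extension is an assumption, so adjust `pattern` to whatever you see on disk. `latest_sessions` is an illustrative helper, not a Claude Code API.

```python
from pathlib import Path


def latest_sessions(project_dir: Path, count: int = 20, pattern: str = "*.jsonl") -> list[Path]:
    """Return the `count` most recently modified session files, oldest first.

    `pattern` defaults to "*.jsonl", which is an assumption about how
    sessions are stored; change it if your files look different.
    """
    files = [path for path in project_dir.glob(pattern) if path.is_file()]
    files.sort(key=lambda path: path.stat().st_mtime)
    return files[-count:]


# Usage ("your-project" is the placeholder from the path above):
#   for path in latest_sessions(Path.home() / ".claude" / "projects" / "your-project"):
#       print(path.name)
```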

Step 2: Write your feedback

Look at what each run produced: the website it built, the post it wrote, the code it generated. You don't need to read the internal logs. Just look at the output.

Write one sentence per run. That's your whole job here. It takes 10-15 minutes.

Run 1: good
Run 2: bad — too many cards, looks cluttered
Run 3: good
Run 4: bad — icons wrapped in colored divs, looks cheap
Run 5: good
Run 6: bad — two CTAs above the fold, confusing
...
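
Before you hand the feedback to the analysts, it can help to check that no run is missing and that every bad run has a reason. Here is a minimal Python sketch, assuming the `Run N: good|bad` format above with an optional dash and note. `Feedback`, `parse_feedback` and `check_feedback` are hypothetical names for this example.

```python
import re
from dataclasses import dataclass

# Matches "Run 2: bad", optionally followed by a dash (hyphen, en dash
# or em dash) and a free-text note.
FEEDBACK_LINE = re.compile(
    r"^Run\s+(\d+):\s*(good|bad)\b\s*(?:[\u2014\u2013-]+\s*(.*))?$",
    re.IGNORECASE,
)


@dataclass
class Feedback:
    run: int
    verdict: str  # "good" or "bad"
    note: str     # empty when no reason was given


def parse_feedback(text: str) -> list[Feedback]:
    """Parse feedback lines; lines in any other shape are skipped."""
    entries = []
    for line in text.splitlines():
        match = FEEDBACK_LINE.match(line.strip())
        if match:
            run, verdict, note = match.groups()
            entries.append(Feedback(int(run), verdict.lower(), (note or "").strip()))
    return entries


def check_feedback(entries: list[Feedback], expected_runs: int = 20) -> list[str]:
    """Return a list of problems; an empty list means the feedback is usable."""
    problems = []
    seen = {entry.run for entry in entries}
    missing = sorted(set(range(1, expected_runs + 1)) - seen)
    if missing:
        problems.append(f"missing runs: {missing}")
    for entry in entries:
        if entry.verdict == "bad" and not entry.note:
            problems.append(f"run {entry.run} is marked bad without a reason")
    return problems
```

A bad run without a reason still tells the analysts something, but the one-sentence note is what lets them connect the trace to what you actually saw.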

The agents handle the rest. They know how to read internal traces. You know whether the output was actually good. That's the division of labor.

Step 3: Run 4 analysts in parallel

Create these four files in .claude/agents/. Each focuses on a different angle, but all four read the same 20 sessions. Running them in parallel removes the bias you'd get from a single reviewer.

`.claude/agents/error-analyst.md`

---
name: error-analyst
description: Reads all 20 sessions and my feedback. Finds the root cause behind every bad run. Proposes rules that would have prevented each failure. Run in parallel with the other analysts.
---

You analyze why runs went wrong.

You receive:
- A task description
- My feedback on each run (one sentence per run, labeled good/bad)
- Access to the last 20 session files in ~/.claude/projects/[project]/

Process:
1. Read all 20 sessions
2. For every run I marked bad: find the root cause in the actual trace (not just the error message)
3. Check if the same problem appears in multiple bad runs
4. Propose a rule that would have prevented it

Rule format:
- RULE: [what to do, imperative]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Only propose rules that show up in 2+ runs.

`.claude/agents/success-analyst.md`

---
name: success-analyst
description: Reads all 20 sessions and my feedback. Finds what the agent did right in good runs that it didn't do in bad ones. Run in parallel with the other analysts.
---

You find what made the good runs good.

You receive:
- A task description
- My feedback on each run
- Access to the last 20 session files in ~/.claude/projects/[project]/

Process:
1. Read all 20 sessions
2. For every run I marked good: find the behaviors that made it work
3. Find behaviors present in good runs that are absent in bad runs
4. Propose rules that encode those behaviors

Rule format:
- RULE: [what to do, imperative]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Skip obvious rules. Look for the non-obvious things that actually made the difference.

`.claude/agents/structure-analyst.md`

---
name: structure-analyst
description: Reads all 20 sessions and my feedback. Looks at the sequence of steps taken, not the content. Finds ordering patterns that correlate with good or bad outcomes. Run in parallel with the other analysts.
---

You look at the shape of runs, not what was produced.

You receive:
- A task description
- My feedback on each run
- Access to the last 20 session files in ~/.claude/projects/[project]/

Look at tool call sequences, step ordering, verification steps, unnecessary detours.

Questions:
- Which sequences appear in good runs but not bad ones?
- Are verification steps missing from bad runs?
- Are there steps that add noise without improving the output?

Rule format:
- RULE: [ordering or sequencing rule, imperative]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Only rules with 2+ run support.

`.claude/agents/edge-analyst.md`

---
name: edge-analyst
description: Reads all 20 sessions and my feedback. Focuses on the hard and adversarial runs. Finds assumptions the agent makes that break under pressure. Run in parallel with the other analysts.
---

You focus on the runs I marked bad, especially the tricky ones.

You receive:
- A task description
- My feedback on each run
- Access to the last 20 session files in ~/.claude/projects/[project]/

Find:
- What inputs broke the agent that shouldn't have?
- What assumptions does the agent make that fail at the edges?
- What checks are missing?

Write rules as guards: "Before doing X, verify Y."

Rule format:
- RULE: [defensive check or guard]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Every rule must link to a specific run.

Now start all four at once, and give them your feedback and a pointer to your sessions:

Run these 4 agents in parallel. Give each the same context.

Task: [your task description]
Project slug: [your-project] (sessions are in ~/.claude/projects/[your-project]/)

My feedback:
Run 1: good
Run 2: bad — too many cards, looks cluttered
Run 3: good
[... all 20]

Agents to run:
- error-analyst
- success-analyst
- structure-analyst
- edge-analyst

Each agent should read the actual session files to understand what happened in each run.
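
If you repeat this workflow for several tasks, you may want to fill in the prompt above from your files instead of pasting it by hand. A minimal Python sketch of that templating follows. `build_dispatch_prompt` is a hypothetical helper that only assembles text and does not launch any agents.

```python
# The four analyst names defined in .claude/agents/ above.
ANALYSTS = ("error-analyst", "success-analyst", "structure-analyst", "edge-analyst")


def build_dispatch_prompt(task: str, project_slug: str, feedback: str, agents=ANALYSTS) -> str:
    """Assemble the dispatch prompt so every analyst gets identical context."""
    agent_lines = "\n".join(f"- {name}" for name in agents)
    return (
        f"Run these {len(agents)} agents in parallel. Give each the same context.\n\n"
        f"Task: {task}\n"
        f"Project slug: {project_slug} "
        f"(sessions are in ~/.claude/projects/{project_slug}/)\n\n"
        f"My feedback:\n{feedback.strip()}\n\n"
        f"Agents to run:\n{agent_lines}\n\n"
        "Each agent should read the actual session files "
        "to understand what happened in each run.\n"
    )
```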

Step 4: Merge into a SKILL.md

You now have four sets of proposed rules. Most of them overlap. Run this to consolidate them:

Merge these 4 analyst outputs into a single SKILL.md.

Task: [your task]
Existing SKILL.md: [paste or write "none"]

[paste all 4 analyst outputs]

Rules for merging:
- Merge rules that say the same thing
- When two rules conflict, keep the one with more run evidence
- 8+ runs: core rule (goes in main SKILL.md)
- 4-7 runs: guidance (main SKILL.md, secondary section)
- 2-3 runs: edge case (goes in references/ subfolder)
- 1 run: discard

Output as a SKILL.md:

# [Task name]

## When to use this skill
[one short paragraph]

## Core rules
[numbered list]

## Patterns
[bullet points]

## Failure modes
["If X, do Y" format]

Max 30 rules in the main file.
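
The evidence thresholds in that prompt are simple enough to check by hand or in code. Here is a minimal Python sketch, assuming the analysts stick to the RULE / EVIDENCE / PRIORITY format from Step 3. It only merges rules whose text is identical. Spotting rules that say the same thing in different words still needs the model or you. `parse_rules`, `tier` and `tier_rules` are illustrative names.

```python
import re


def parse_rules(text: str) -> list[dict]:
    """Collect RULE / EVIDENCE / PRIORITY triples from one analyst output."""
    rules: list[dict] = []
    for line in text.splitlines():
        key, sep, value = line.strip().lstrip("- ").partition(":")
        if not sep:
            continue
        key = key.strip().upper()
        if key == "RULE":
            rules.append({"rule": value.strip(), "runs": set(), "priority": ""})
        elif key == "EVIDENCE" and rules:
            # Every number counts as a run, so write "3, 4, 5" rather than "3-5".
            rules[-1]["runs"] = {int(n) for n in re.findall(r"\d+", value)}
        elif key == "PRIORITY" and rules:
            rules[-1]["priority"] = value.strip().lower()
    return rules


def tier(run_count: int) -> str:
    """Map the number of supporting runs to the tiers from the merge prompt."""
    if run_count >= 8:
        return "core"
    if run_count >= 4:
        return "guidance"
    if run_count >= 2:
        return "edge case"
    return "discard"


def tier_rules(analyst_outputs: list[str]) -> dict[str, list[str]]:
    """Union the evidence of identically worded rules, then sort them into tiers."""
    merged: dict[str, set[int]] = {}
    for output in analyst_outputs:
        for rule in parse_rules(output):
            merged.setdefault(rule["rule"], set()).update(rule["runs"])
    tiers: dict[str, list[str]] = {"core": [], "guidance": [], "edge case": [], "discard": []}
    for text, runs in merged.items():
        tiers[tier(len(runs))].append(text)
    return tiers
```

Counting distinct runs rather than mentions matters here: two analysts citing the same run should not double its weight.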

What you end up with

.claude/
  agents/
    error-analyst.md
    success-analyst.md
    structure-analyst.md
    edge-analyst.md
  skills/
    [your-task]/
      SKILL.md
      references/
        edge-cases.md

A real example from this process, for a landing page builder:

# Landing Page Builder

## Core rules
1. Never wrap icons in a div with a background. Use the SVG path directly. (runs 3, 6, 11, 15)
2. One hero section, one CTA. Pages with two CTAs above the fold had lower click-through. (runs 2, 7, 9, 14, 18)
3. Limit feature sections to 3 items. Grids of 6+ cards look like AI slop and nobody reads them. (runs 4, 8, 12, 17)

## Failure modes
- If the output has a "Features" section with more than 4 cards: cut to the 3 strongest
- If there are Lucide icons inside colored background divs: replace with inline SVG paths
- If the hero has more than 2 buttons: remove the secondary one

Every rule points to the exact runs it came from.
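
That traceability is easy to lose when you edit the file later. Here is a minimal Python sketch that flags numbered rules without a run citation, assuming citations follow the `(runs 3, 6, 11, 15)` style at the end of each rule as in the example above. `cited_runs` and `uncited_rules` are hypothetical helpers.

```python
import re

# A numbered rule line such as "1. Never wrap icons ...".
NUMBERED_RULE = re.compile(r"^\s*(\d+)\.\s+(.*)$")
# A trailing citation such as "(runs 3, 6, 11, 15)" or "(run 4)".
CITATION = re.compile(r"\(runs?\s+([\d,\s]+)\)\s*$")


def cited_runs(skill_text: str) -> dict[int, list[int]]:
    """Map each numbered rule to the run numbers cited at its end."""
    result: dict[int, list[int]] = {}
    for line in skill_text.splitlines():
        numbered = NUMBERED_RULE.match(line)
        if not numbered:
            continue
        citation = CITATION.search(numbered.group(2))
        runs = [int(n) for n in re.findall(r"\d+", citation.group(1))] if citation else []
        result[int(numbered.group(1))] = runs
    return result


def uncited_rules(skill_text: str) -> list[int]:
    """Return the numbers of rules that cite no runs at all."""
    return [number for number, runs in cited_runs(skill_text).items() if not runs]
```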

Why this works

The reason downloaded skills perform worse is simple: the AI that wrote them never ran your task. It invented rules based on what it thought was relevant.

This workflow flips that. You run first, judge the outputs (which you can actually see), and let four independent agents work through the traces to figure out why each run went the way it did. Running them in parallel means no single perspective dominates.

The paper behind this found that parallel analysis consistently outperforms both hand-written skills and sequential AI review. A 35B model that evolved its own skills this way outperformed a 122B model using a hand-written skill on some benchmarks.

The shortcut

Build This Now is a production framework for shipping SaaS, internal tools, and client projects. Payments, auth, email, frontend, backend, all wired up and ready to ship.

It also ships with a full AI harness: agent orchestration, parallel dispatch, skill evolution, trace collection, and the patterns that top Claude Code setups use in production. The skill workflow above is one of them. You get all of it without building the infrastructure yourself.
