From Trace to Skill

Every skill you download was written by an AI that never ran your task. It guessed. Sometimes that's fine. When your task has specific failure modes, guessing isn't enough.

The better approach: run your agent 20 times, tell it what was good and what wasn't, and let it extract the rules itself.

That's it. That's the whole idea.

This is based on Trace2Skill, a research paper from Alibaba's Qwen team. They showed that skills built from real execution traces consistently outperform human-written ones on hard benchmarks, and that they transfer across model sizes.

The 4 steps

run 20 times  →  write your feedback  →  4 analysts read everything together  →  merge into SKILL.md

Step 1: run your agent 20 times

First, generate 20 variations of your task: easy, hard, and edge cases. Use Claude to write them:

Generate 20 variations of this task for a Claude Code agent:

Task: [your task]

- 5 easy, straightforward versions
- 8 normal versions
- 4 hard versions with tricky edge cases
- 3 adversarial versions designed to break the agent

Output: a numbered list. Each item is a complete, self-contained task prompt.

Then run them:

claude -p "[variation 1]"
claude -p "[variation 2]"
# repeat for all 20
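
If typing 20 commands by hand gets tedious, the loop can be scripted. The sketch below assumes you saved Claude's numbered list to a text file; the function names are hypothetical, and the only CLI call used is the `claude -p` invocation shown above.

```python
import re
import subprocess


def parse_numbered_list(text: str) -> list[str]:
    """Split '1. ...' or '2) ...' style output into one prompt per item."""
    items: list[str] = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+[.)]\s+(.*)", line)
        if match:
            items.append(match.group(1).strip())
        elif line.strip() and items:
            # Continuation line: it belongs to the previous item.
            items[-1] += "\n" + line.strip()
    return items


def build_command(prompt: str) -> list[str]:
    """argv for one non-interactive run, mirroring `claude -p "<prompt>"`."""
    return ["claude", "-p", prompt]


def run_all(prompts: list[str]) -> None:
    """Run every variation in order, one session per prompt."""
    for number, prompt in enumerate(prompts, start=1):
        print(f"--- run {number} ---")
        # check=False: a failed run is still a trace worth analyzing.
        subprocess.run(build_command(prompt), check=False)
```

To use it, read your saved list (for example a hypothetical `variations.txt`), pass the text to `parse_numbered_list`, and hand the result to `run_all`. Running sequentially keeps the run numbers aligned with the order of your list, which matters when you write feedback later.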

Claude Code saves each session automatically in ~/.claude/projects/[your-project]/. You don't need to do anything else.
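
To confirm that the runs actually landed there, a small check like the one below can list the most recent session files. It assumes sessions are stored as `.jsonl` files directly inside the project folder; verify that against your own directory before relying on it. The function name is hypothetical.

```python
from pathlib import Path


def latest_sessions(project_dir: Path, count: int = 20) -> list[Path]:
    """Return the `count` most recently modified session files, oldest first."""
    # Assumption: one .jsonl file per session, directly in project_dir.
    sessions = sorted(project_dir.glob("*.jsonl"), key=lambda p: p.stat().st_mtime)
    return sessions[-count:]
```

If you opened other sessions in the same project in between, "the last 20" will include them, so it is worth checking the list before pointing the analysts at it.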

Step 2: write your feedback

Look at what each run produced: the site it built, the post it wrote, the code it generated. You don't need to read the internal logs. Just look at the result.

Write one sentence per run. This is your job here. It takes 10 to 15 minutes.

Run 1: good
Run 2: bad — too many cards, looks cluttered
Run 3: good
Run 4: bad — icons wrapped in colored divs, looks cheap
Run 5: good
Run 6: bad — two CTAs above the fold, confusing
...

The agents handle the rest. They know how to read the internal traces. You know whether the result was actually good. That's the division of labor.
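
Before handing the feedback over, it can help to check that every run has a verdict. The sketch below parses lines in the "Run N: good/bad" format shown above; the `Feedback` class and function names are hypothetical, and the dash before the note may be a hyphen, en dash, or em dash.

```python
import re
from dataclasses import dataclass


@dataclass
class Feedback:
    run: int
    good: bool
    note: str


# "Run 2: bad", then an optional dash (U+002D, U+2013, or U+2014) and a note.
LINE = re.compile(
    r"Run\s+(\d+):\s*(good|bad)\b\s*(?:[-\u2013\u2014]\s*(.*))?$", re.IGNORECASE
)


def parse_feedback(text: str) -> list[Feedback]:
    """Turn feedback lines into records; lines in any other shape are skipped."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            run, verdict, note = match.groups()
            entries.append(Feedback(int(run), verdict.lower() == "good", (note or "").strip()))
    return entries


def missing_runs(entries: list[Feedback], total: int = 20) -> list[int]:
    """Run numbers from 1..total that have no feedback line yet."""
    seen = {entry.run for entry in entries}
    return [n for n in range(1, total + 1) if n not in seen]
```

An empty result from `missing_runs` means all 20 runs are labeled and the analysts will not have to guess at any of them.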

Step 3: create 4 analysts in parallel

Create these four files in .claude/agents/. Each one focuses on a different angle, but all four read the same 20 sessions together. Running them in parallel reduces the bias you would get from a single reviewer.

`.claude/agents/error-analyst.md`

---
name: error-analyst
description: Reads all 20 sessions and my feedback. Finds the root cause behind every bad run. Proposes rules that would have prevented each failure. Run in parallel with the other analysts.
---

You analyze why runs went wrong.

You receive:
- A task description
- My feedback on each run (one sentence per run, labeled good/bad)
- Access to the last 20 session files in ~/.claude/projects/[project]/

Process:
1. Read all 20 sessions
2. For every run I marked bad: find the root cause in the actual trace (not just the error message)
3. Check if the same problem appears in multiple bad runs
4. Propose a rule that would have prevented it

Rule format:
- RULE: [what to do, imperative]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Only propose rules that show up in 2+ runs.

`.claude/agents/success-analyst.md`

---
name: success-analyst
description: Reads all 20 sessions and my feedback. Finds what the agent did right in good runs that it didn't do in bad ones. Run in parallel with the other analysts.
---

You find what made the good runs good.

You receive:
- A task description
- My feedback on each run
- Access to the last 20 session files in ~/.claude/projects/[project]/

Process:
1. Read all 20 sessions
2. For every run I marked good: find the behaviors that made it work
3. Find behaviors present in good runs that are absent in bad runs
4. Propose rules that encode those behaviors

Rule format:
- RULE: [what to do, imperative]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Skip obvious rules. Look for the non-obvious things that actually made the difference.

`.claude/agents/structure-analyst.md`

---
name: structure-analyst
description: Reads all 20 sessions and my feedback. Looks at the sequence of steps taken, not the content. Finds ordering patterns that correlate with good or bad outcomes. Run in parallel with the other analysts.
---

You look at the shape of runs, not what was produced.

You receive:
- A task description
- My feedback on each run
- Access to the last 20 session files in ~/.claude/projects/[project]/

Look at tool call sequences, step ordering, verification steps, unnecessary detours.

Questions:
- Which sequences appear in good runs but not bad ones?
- Are verification steps missing from bad runs?
- Are there steps that add noise without improving the output?

Rule format:
- RULE: [ordering or sequencing rule, imperative]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Only rules with 2+ run support.

`.claude/agents/edge-analyst.md`

---
name: edge-analyst
description: Reads all 20 sessions and my feedback. Focuses on the hard and adversarial runs. Finds assumptions the agent makes that break under pressure. Run in parallel with the other analysts.
---

You focus on the runs I marked bad, especially the tricky ones.

You receive:
- A task description
- My feedback on each run
- Access to the last 20 session files in ~/.claude/projects/[project]/

Find:
- What inputs broke the agent that shouldn't have?
- What assumptions does the agent make that fail at the edges?
- What checks are missing?

Write rules as guards: "Before doing X, verify Y."

Rule format:
- RULE: [defensive check or guard]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Every rule must link to a specific run.
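
All four analysts report in the same RULE / EVIDENCE / PRIORITY format, which makes their output easy to check mechanically. The sketch below is one way to parse it and enforce the "2+ runs" requirement; the `Rule` class and function names are hypothetical, and it assumes run numbers are written out individually (a range like "2-5" would be read as just 2 and 5).

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    text: str
    runs: set[int] = field(default_factory=set)
    priority: str = "medium"


def parse_rules(output: str) -> list[Rule]:
    """Collect RULE / EVIDENCE / PRIORITY triples from one analyst's output."""
    rules: list[Rule] = []
    for raw in output.splitlines():
        line = raw.strip().lstrip("-* ").strip()
        key, _, value = line.partition(":")
        key, value = key.strip().upper(), value.strip()
        if key == "RULE":
            rules.append(Rule(text=value))
        elif key == "EVIDENCE" and rules:
            rules[-1].runs = {int(n) for n in re.findall(r"\d+", value)}
        elif key == "PRIORITY" and rules:
            rules[-1].priority = value.lower()
    return rules


def well_supported(rules: list[Rule], minimum: int = 2) -> list[Rule]:
    """Keep only rules backed by at least `minimum` distinct runs."""
    return [rule for rule in rules if len(rule.runs) >= minimum]
```

For the edge-analyst, which only requires each rule to link to a specific run, calling `well_supported(rules, minimum=1)` applies the looser check.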

Now run all four at the same time, passing in your feedback and pointing them at your sessions:

Run these 4 agents in parallel. Give each the same context.

Task: [your task description]
Project slug: [your-project] (sessions are in ~/.claude/projects/[your-project]/)

My feedback:
Run 1: good
Run 2: bad — too many cards, looks cluttered
Run 3: good
[... all 20]

Agents to run:
- error-analyst
- success-analyst
- structure-analyst
- edge-analyst

Each agent should read the actual session files to understand what happened in each run.
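
If you repeat this workflow for several tasks, the dispatch prompt above can be assembled from its parts rather than retyped. This is a plain string template, nothing more; the function name is hypothetical and the wording is copied from the prompt above.

```python
AGENTS = ["error-analyst", "success-analyst", "structure-analyst", "edge-analyst"]


def build_dispatch_prompt(task: str, slug: str, feedback: list[str]) -> str:
    """Fill in the parallel-dispatch prompt with a task, project slug, and feedback lines."""
    lines = [
        "Run these 4 agents in parallel. Give each the same context.",
        "",
        f"Task: {task}",
        f"Project slug: {slug} (sessions are in ~/.claude/projects/{slug}/)",
        "",
        "My feedback:",
        *feedback,
        "",
        "Agents to run:",
        *[f"- {name}" for name in AGENTS],
        "",
        "Each agent should read the actual session files to understand what happened in each run.",
    ]
    return "\n".join(lines)
```

Every analyst receives the same text, which is the point: the only thing that differs between them is the angle defined in their agent file.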

Step 4: merge into a single SKILL.md

You now have four sets of proposed rules. Most of them overlap. Run this to consolidate them:

Merge these 4 analyst outputs into a single SKILL.md.

Task: [your task]
Existing SKILL.md: [paste or write "none"]

[paste all 4 analyst outputs]

Rules for merging:
- Merge rules that say the same thing
- When two rules conflict, keep the one with more run evidence
- 8+ runs: core rule (goes in main SKILL.md)
- 4-7 runs: guidance (main SKILL.md, secondary section)
- 2-3 runs: edge case (goes in references/ subfolder)
- 1 run: discard

Output as a SKILL.md:

# [Task name]

## When to use this skill
[one short paragraph]

## Core rules
[numbered list]

## Patterns
[bullet points]

## Failure modes
["If X, do Y" format]

Max 30 rules in the main file.
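
The evidence thresholds in the merge prompt are simple enough to express as code, which is a handy way to sanity-check what the model returns. The sketch below is an illustration under stated assumptions, not the method the prompt itself uses: the names are hypothetical, and duplicates are detected by normalized text only, whereas the model merges rules that mean the same thing even when worded differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    text: str
    runs: frozenset[int]


def tier(rule: Rule) -> str:
    """Map the number of supporting runs to the tiers from the merge prompt."""
    support = len(rule.runs)
    if support >= 8:
        return "core"
    if support >= 4:
        return "guidance"
    if support >= 2:
        return "edge case"
    return "discard"


def merge_duplicates(rules: list[Rule]) -> list[Rule]:
    """Union the evidence of rules whose text matches after normalization."""
    merged: dict[str, Rule] = {}
    for rule in rules:
        key = " ".join(rule.text.lower().split()).rstrip(".")
        if key in merged:
            merged[key] = Rule(merged[key].text, merged[key].runs | rule.runs)
        else:
            merged[key] = rule
    return list(merged.values())


def resolve_conflict(a: Rule, b: Rule) -> Rule:
    """Keep the rule with more run evidence; a tie keeps the first one."""
    return a if len(a.runs) >= len(b.runs) else b


def bucket(rules: list[Rule]) -> dict[str, list[Rule]]:
    """Merge duplicates, then sort every rule into its tier."""
    buckets: dict[str, list[Rule]] = {"core": [], "guidance": [], "edge case": [], "discard": []}
    for rule in merge_duplicates(rules):
        buckets[tier(rule)].append(rule)
    return buckets
```

Note that merging happens before tiering: two analysts citing different runs for the same rule can lift it into a higher tier than either would reach alone.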

What you end up with

.claude/
  agents/
    error-analyst.md
    success-analyst.md
    structure-analyst.md
    edge-analyst.md
  skills/
    [your-task]/
      SKILL.md
      references/
        edge-cases.md
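
A quick way to confirm the layout above is in place is to check for each expected file. The sketch below only reads the filesystem; the function names are hypothetical, and `task` stands for whatever folder name you chose for `[your-task]`.

```python
from pathlib import Path

ANALYSTS = ["error-analyst", "success-analyst", "structure-analyst", "edge-analyst"]


def expected_files(task: str) -> list[Path]:
    """Relative paths of the four agent files plus the skill and its references."""
    agents = [Path(".claude/agents") / f"{name}.md" for name in ANALYSTS]
    skill = Path(".claude/skills") / task
    return agents + [skill / "SKILL.md", skill / "references" / "edge-cases.md"]


def missing_files(root: Path, task: str) -> list[Path]:
    """Expected files that do not exist under `root` (the project directory)."""
    return [path for path in expected_files(task) if not (root / path).is_file()]
```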

A real example from running this on a landing page builder:

# Landing Page Builder

## Core rules
1. Never wrap icons in a div with a background. Use the SVG path directly. (runs 3, 6, 11, 15)
2. One hero section, one CTA. Pages with two CTAs above the fold had lower click-through. (runs 2, 7, 9, 14, 18)
3. Limit feature sections to 3 items. Grids of 6+ cards look like AI slop and nobody reads them. (runs 4, 8, 12, 17)

## Failure modes
- If the output has a "Features" section with more than 4 cards: cut to the 3 strongest
- If there are Lucide icons inside colored background divs: replace with inline SVG paths
- If the hero has more than 2 buttons: remove the secondary one

Every rule is linked to the exact runs that produced it.
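
That link is worth verifying, since a rule without cited runs is back to being a guess. The sketch below pulls the run numbers out of numbered rules written in the "(runs 3, 6, 11, 15)" style of the example; the function names are hypothetical, and it assumes the citation sits at the end of the rule's line.

```python
import re

RULE_LINE = re.compile(r"^\s*(\d+)\.\s+(.*)$")
EVIDENCE = re.compile(r"\(runs?\s+([\d,\s]+)\)\s*$")


def rule_evidence(skill_md: str) -> dict[int, list[int]]:
    """Map each numbered rule to the run numbers cited at the end of its line."""
    evidence: dict[int, list[int]] = {}
    for line in skill_md.splitlines():
        rule = RULE_LINE.match(line)
        if not rule:
            continue
        cited = EVIDENCE.search(rule.group(2))
        runs = [int(n) for n in re.findall(r"\d+", cited.group(1))] if cited else []
        evidence[int(rule.group(1))] = runs
    return evidence


def unlinked(evidence: dict[int, list[int]]) -> list[int]:
    """Rule numbers that cite no runs at all."""
    return [number for number, runs in evidence.items() if not runs]
```

Anchoring the pattern to the end of the line keeps other numbers in the rule text, such as "6+ cards", from being mistaken for evidence.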

Why this works

The reason downloaded skills underperform is simple: the AI that wrote them never ran your task. It made up rules based on what it thought might matter.

This workflow flips that. You run first, judge the results (which you can see), and let four independent agents dig into the traces to find out why each run went the way it did. Running them in parallel means no single perspective dominates.

The paper behind this found that parallel analysis consistently beats both human-written skills and sequential AI review. A 35B model evolving its own skills this way outperformed a 122B model using a hand-written skill on some benchmarks.

The shortcut

Build This Now is a production framework for shipping SaaS, internal tools, and client projects. Payments, auth, email, frontend, backend: all wired up and ready to ship.

It also ships with a complete AI harness: agent orchestration, parallel dispatch, skill evolution, trace collection, and the patterns the best Claude Code setups use in production. The skill workflow above is one of them. You get all of it without having to build the plumbing yourself.
