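From Traces to Skills

Every skill you have downloaded was written by an AI that has never run your task. It was guessing. Sometimes that is good enough. When your task has specific failure modes, guessing falls short.

A better approach: run your agent 20 times, tell it what went well and what went badly, and let it extract the rules itself.

That's it. That's the whole idea.

This is based on Trace2Skill, a research paper from Alibaba's Qwen team. They showed that skills built from real execution traces consistently outperform human-written ones on hard benchmarks and transfer across model sizes.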

The 4 steps

Run 20 times  →  Write feedback  →  4 analysts read everything together  →  Merge into SKILL.md

Step 1: Run your agent 20 times

First, generate 20 variations of your task: easy ones, hard ones, edge cases. Have Claude write them.

Generate 20 variations of this task for a Claude Code agent:

Task: [your task]

- 5 easy, straightforward versions
- 8 normal versions
- 4 hard versions with tricky edge cases
- 3 adversarial versions designed to break the agent

Output: a numbered list. Each item is a complete, self-contained task prompt.

Then run them.

claude -p "[variation 1]"
claude -p "[variation 2]"
# repeat for all 20
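
Typing 20 commands by hand gets old fast, so the loop can be scripted. This is a minimal sketch, assuming the variations live one per line in a hypothetical file called variations.txt and that the `claude` CLI is on your PATH. The function names are illustrative and not part of Claude Code.

```python
import subprocess
from pathlib import Path


def load_variations(path: str) -> list[str]:
    """Read one task prompt per non-empty line (assumed file layout)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_command(prompt: str) -> list[str]:
    """Build the argv for one headless run: claude -p "<prompt>"."""
    return ["claude", "-p", prompt]


def run_all(variations: list[str]) -> list[int]:
    """Run each variation in sequence and return the exit codes."""
    exit_codes = []
    for number, prompt in enumerate(variations, start=1):
        print(f"Run {number}/{len(variations)}")
        result = subprocess.run(build_command(prompt), check=False)
        exit_codes.append(result.returncode)
    return exit_codes


# Usage (not executed here):
# run_all(load_variations("variations.txt"))
```

Running the variations in order keeps run numbers aligned with the numbered list, which matters once you start writing feedback per run.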

Claude Code automatically saves every session to ~/.claude/projects/[your-project]/. You don't need to do anything else.
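
If you want to confirm that the 20 sessions are actually there, a quick check helps. This sketch assumes each session is stored as a `.jsonl` file directly inside the project folder; that layout is an assumption, so adjust the glob pattern if your folder looks different. `latest_sessions` is a hypothetical helper name.

```python
from pathlib import Path


def latest_sessions(project_dir: Path, count: int = 20) -> list[Path]:
    """Return the `count` most recently modified .jsonl files, oldest first."""
    if count <= 0:
        return []
    # Assumption: one .jsonl transcript per session in this folder.
    files = sorted(project_dir.glob("*.jsonl"), key=lambda p: p.stat().st_mtime)
    return files[-count:]


# Usage (replace the placeholder with your own project folder name):
# sessions = latest_sessions(Path.home() / ".claude" / "projects" / "[your-project]")
# print(len(sessions), "session files found")
```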

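Step 2: Write feedback

Look at what each run produced (the website it built, the post it wrote, the code it generated). You don't need to read the internal logs. Just look at the output.

Write one sentence per run. That is your entire job here. It takes 10 to 15 minutes.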

Run 1: good
Run 2: bad — too many cards, looks cluttered
Run 3: good
Run 4: bad — icons wrapped in colored divs, looks cheap
Run 5: good
Run 6: bad — two CTAs above the fold, confusing
...

The agents handle the rest. They know how to read the internal traces. You know whether the output was actually good. That's the division of labor.
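
Before handing the feedback over, it can be worth checking that no run was skipped. The following is a rough sketch that parses feedback lines in the "Run N: good / bad, reason" shape shown above. The `Feedback` class and both function names are made up for illustration.

```python
import re
from dataclasses import dataclass

# Matches "Run 2: bad <dash> reason"; the dash may be an em dash (\u2014),
# an en dash (\u2013), or a plain hyphen. The reason part is optional.
FEEDBACK_LINE = re.compile(
    r"^Run\s+(\d+):\s*(good|bad)\s*(?:[\u2014\u2013-]\s*(.*))?$", re.IGNORECASE
)


@dataclass
class Feedback:
    run: int
    verdict: str  # "good" or "bad"
    reason: str   # empty when no reason was given


def parse_feedback(text: str) -> list[Feedback]:
    """Parse feedback lines, skipping anything that does not match (e.g. "...")."""
    entries = []
    for line in text.splitlines():
        match = FEEDBACK_LINE.match(line.strip())
        if match:
            run, verdict, reason = match.groups()
            entries.append(Feedback(int(run), verdict.lower(), (reason or "").strip()))
    return entries


def missing_runs(entries: list[Feedback], total: int = 20) -> list[int]:
    """List run numbers in 1..total that have no feedback line yet."""
    seen = {entry.run for entry in entries}
    return [n for n in range(1, total + 1) if n not in seen]
```

An empty result from `missing_runs` means every run has a verdict, which is all the analysts need from you.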

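Step 3: Launch 4 analysts in parallel

Create these 4 files in .claude/agents/. Each focuses on a different angle, but all four read the same 20 sessions together. Running them in parallel reduces the bias that comes from relying on a single reviewer.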

`.claude/agents/error-analyst.md`

---
name: error-analyst
description: Reads all 20 sessions and my feedback. Finds the root cause behind every bad run. Proposes rules that would have prevented each failure. Run in parallel with the other analysts.
---

You analyze why runs went wrong.

You receive:
- A task description
- My feedback on each run (one sentence per run, labeled good/bad)
- Access to the last 20 session files in ~/.claude/projects/[project]/

Process:
1. Read all 20 sessions
2. For every run I marked bad: find the root cause in the actual trace (not just the error message)
3. Check if the same problem appears in multiple bad runs
4. Propose a rule that would have prevented it

Rule format:
- RULE: [what to do, imperative]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Only propose rules that show up in 2+ runs.

`.claude/agents/success-analyst.md`

---
name: success-analyst
description: Reads all 20 sessions and my feedback. Finds what the agent did right in good runs that it didn't do in bad ones. Run in parallel with the other analysts.
---

You find what made the good runs good.

You receive:
- A task description
- My feedback on each run
- Access to the last 20 session files in ~/.claude/projects/[project]/

Process:
1. Read all 20 sessions
2. For every run I marked good: find the behaviors that made it work
3. Find behaviors present in good runs that are absent in bad runs
4. Propose rules that encode those behaviors

Rule format:
- RULE: [what to do, imperative]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Skip obvious rules. Look for the non-obvious things that actually made the difference.

`.claude/agents/structure-analyst.md`

---
name: structure-analyst
description: Reads all 20 sessions and my feedback. Looks at the sequence of steps taken, not the content. Finds ordering patterns that correlate with good or bad outcomes. Run in parallel with the other analysts.
---

You look at the shape of runs, not what was produced.

You receive:
- A task description
- My feedback on each run
- Access to the last 20 session files in ~/.claude/projects/[project]/

Look at tool call sequences, step ordering, verification steps, unnecessary detours.

Questions:
- Which sequences appear in good runs but not bad ones?
- Are verification steps missing from bad runs?
- Are there steps that add noise without improving the output?

Rule format:
- RULE: [ordering or sequencing rule, imperative]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Only rules with 2+ run support.

`.claude/agents/edge-analyst.md`

---
name: edge-analyst
description: Reads all 20 sessions and my feedback. Focuses on the hard and adversarial runs. Finds assumptions the agent makes that break under pressure. Run in parallel with the other analysts.
---

You focus on the runs I marked bad, especially the tricky ones.

You receive:
- A task description
- My feedback on each run
- Access to the last 20 session files in ~/.claude/projects/[project]/

Find:
- What inputs broke the agent that shouldn't have?
- What assumptions does the agent make that fail at the edges?
- What checks are missing?

Write rules as guards: "Before doing X, verify Y."

Rule format:
- RULE: [defensive check or guard]
- EVIDENCE: [run numbers]
- PRIORITY: high / medium / low

Every rule must link to a specific run.
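
All four analysts share the same RULE / EVIDENCE / PRIORITY format, which makes their output easy to post-process if you ever want to. Here is a minimal sketch of a parser, assuming the analysts follow the format exactly and list evidence as plain run numbers. Model output can drift from the format, so treat this as a starting point. `Rule` and `parse_rules` are hypothetical names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    text: str
    runs: set[int] = field(default_factory=set)
    priority: str = "medium"


def parse_rules(analyst_output: str) -> list[Rule]:
    """Collect RULE / EVIDENCE / PRIORITY triples from one analyst's output."""
    rules: list[Rule] = []
    for raw in analyst_output.splitlines():
        # Drop list markers such as "- " before looking for the key.
        line = raw.strip().lstrip("-* ").strip()
        key, _, value = line.partition(":")
        key, value = key.strip().upper(), value.strip()
        if key == "RULE":
            rules.append(Rule(text=value))
        elif key == "EVIDENCE" and rules:
            rules[-1].runs = {int(n) for n in re.findall(r"\d+", value)}
        elif key == "PRIORITY" and rules:
            rules[-1].priority = value.lower()
    return rules
```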

Next, run all four at the same time, passing in your feedback and pointing them at the sessions.

Run these 4 agents in parallel. Give each the same context.

Task: [your task description]
Project slug: [your-project] (sessions are in ~/.claude/projects/[your-project]/)

My feedback:
Run 1: good
Run 2: bad — too many cards, looks cluttered
Run 3: good
[... all 20]

Agents to run:
- error-analyst
- success-analyst
- structure-analyst
- edge-analyst

Each agent should read the actual session files to understand what happened in each run.

Step 4: Merge into one SKILL.md

You now have 4 sets of proposed rules. Most of them overlap. Run this to consolidate them.

Merge these 4 analyst outputs into a single SKILL.md.

Task: [your task]
Existing SKILL.md: [paste or write "none"]

[paste all 4 analyst outputs]

Rules for merging:
- Merge rules that say the same thing
- When two rules conflict, keep the one with more run evidence
- 8+ runs: core rule (goes in main SKILL.md)
- 4-7 runs: guidance (main SKILL.md, secondary section)
- 2-3 runs: edge case (goes in references/ subfolder)
- 1 run: discard

Output as a SKILL.md:

# [Task name]

## When to use this skill
[one short paragraph]

## Core rules
[numbered list]

## Patterns
[bullet points]

## Failure modes
["If X, do Y" format]

Max 30 rules in the main file.
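
The tiering thresholds in the merge prompt are simple enough to express directly. The sketch below mirrors them in code. In the workflow itself the model decides which rules "say the same thing"; here that is crudely approximated by matching normalized rule text exactly, which is an assumption made only to keep the example short. All function names are hypothetical.

```python
def tier_for(run_count: int) -> str:
    """Map the number of supporting runs to the tier named in the merge prompt."""
    if run_count >= 8:
        return "core"
    if run_count >= 4:
        return "guidance"
    if run_count >= 2:
        return "edge case"
    return "discard"


def merge_evidence(rules: list[tuple[str, set[int]]]) -> dict[str, set[int]]:
    """Union the run evidence of rules whose normalized text is identical."""
    merged: dict[str, set[int]] = {}
    for text, runs in rules:
        key = " ".join(text.lower().split())  # lowercase, collapse whitespace
        merged.setdefault(key, set()).update(runs)
    return merged


def bucket(rules: list[tuple[str, set[int]]]) -> dict[str, list[str]]:
    """Group merged rules by tier, dropping anything supported by one run."""
    tiers: dict[str, list[str]] = {"core": [], "guidance": [], "edge case": []}
    for text, runs in merge_evidence(rules).items():
        tier = tier_for(len(runs))
        if tier != "discard":
            tiers[tier].append(text)
    return tiers
```

Counting distinct runs after merging, not before, is the design choice that matters: the same rule proposed by two analysts with different evidence can move up a tier once the evidence is combined.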

What you end up with

.claude/
  agents/
    error-analyst.md
    success-analyst.md
    structure-analyst.md
    edge-analyst.md
  skills/
    [your-task]/
      SKILL.md
      references/
        edge-cases.md

A real example from running this on a landing page builder:

# Landing Page Builder

## Core rules
1. Never wrap icons in a div with a background. Use the SVG path directly. (runs 3, 6, 11, 15)
2. One hero section, one CTA. Pages with two CTAs above the fold had lower click-through. (runs 2, 7, 9, 14, 18)
3. Limit feature sections to 3 items. Grids of 6+ cards look like AI slop and nobody reads them. (runs 4, 8, 12, 17)

## Failure modes
- If the output has a "Features" section with more than 4 cards: cut to the 3 strongest
- If there are Lucide icons inside colored background divs: replace with inline SVG paths
- If the hero has more than 2 buttons: remove the secondary one

Every rule links back to the exact runs that produced it.
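
That link is easy to lose when a SKILL.md gets edited by hand later. As a sanity check, the sketch below flags numbered rules that do not end with a "(runs ...)" citation in the style of the example above. It only looks at numbered list items, and the citation pattern is an assumption based on that one example. `uncited_rules` is a hypothetical name.

```python
import re

NUMBERED_RULE = re.compile(r"^\d+\.\s+(.*)$")
RUN_CITATION = re.compile(r"\(runs?\s+[\d,\s]+\)\s*$")


def uncited_rules(skill_md: str) -> list[str]:
    """Return numbered rules that do not end with a "(runs ...)" citation."""
    missing = []
    for line in skill_md.splitlines():
        rule = NUMBERED_RULE.match(line.strip())
        if rule and not RUN_CITATION.search(rule.group(1)):
            missing.append(rule.group(1))
    return missing
```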

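Why this works

The reason downloaded skills underperform is simple. The AI that wrote them never ran your task. It invented rules based on what it thought might matter.

This workflow flips that. You run first, judge the output (the part you can actually see), and have 4 independent agents dig through the traces to work out how each run got there. Running them in parallel keeps any single perspective from dominating.

The paper behind this found that parallel analysis consistently beat both human-written skills and sequential AI review. A 35B model evolving its own skills this way outperformed a 122B model using hand-written skills on several benchmarks.

The shortcut

Build This Now is a production framework for shipping SaaS, internal tools, and client projects. Payments, auth, email, frontend, backend: everything is wired up and ready to ship.

It also includes a complete AI harness: agent orchestration, parallel dispatch, skill evolution, trace collection, and the patterns used by top production Claude Code setups. The skill workflow above is one of them. You get all of it without building the plumbing yourself.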
