
GAN Loop

One agent generates. One agent tears it apart. They loop until the score stops improving. A complete implementation guide with agent definitions, rubric templates, and real examples.

Your AI's first draft is confident slop. Every time.

Not because the model is bad. Because it has no one to fight back.

Add a second agent whose only job is to tear the first one apart, and suddenly quality converges in two or three passes. This pattern is called the GAN Loop. Here is the full implementation guide.


What is a GAN Loop?

GAN stands for Generative Adversarial Network. In traditional machine learning, two neural networks compete: a generator creates fake images, a discriminator tries to spot the fakes. They train each other until the generator gets good enough to fool the discriminator.

We apply the same idea to AI agents. No neural networks required. Just two Claude instances with different jobs and a loop that runs until the output hits a quality threshold.

Generator creates. Evaluator judges. Generator improves. Repeat until score plateaus.
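That control flow is small enough to sketch in a few lines. Here is a minimal Python harness with the generator and evaluator injected as plain callables; the function names and signatures are mine, not part of any library, and you would swap in whatever actually invokes your agents:

```python
def run_gan_loop(generate, evaluate, threshold=7.0, max_iterations=3):
    """Generator creates, evaluator judges, repeat until the score
    clears the threshold or the iteration budget runs out."""
    feedback = None
    best = None
    for iteration in range(1, max_iterations + 1):
        draft = generate(feedback)         # generator sees feedback, never the rubric
        score, feedback = evaluate(draft)  # evaluator sees rubric + draft, nothing else
        if best is None or score > best[0]:
            best = (score, draft)
        if score >= threshold:
            return draft, score, "accept"
    # Budget exhausted: ship the best attempt rather than block forever
    return best[1], best[0], "accept-with-notes"
```

The important property is visible in the two comments: the rubric never flows into `generate`, and nothing about the generation process flows into `evaluate`.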

The Problem It Solves

Ask a single AI to write something. It will produce output that looks reasonable. Structured. Confident.

Then look closer. The hook is generic. The numbers are vague. The CTA could apply to anything. The AI had no external pressure, so it optimized for "looks good on first read" instead of "actually performs."

The AI does not know what it does not know. It cannot evaluate its own blind spots.

A second agent, reading the output cold with a rubric in hand, will catch everything the generator missed. And it will give specific, actionable feedback because that is its only job.

The Two Agents

Every GAN Loop has exactly two roles. They must never be the same agent.

The Generator has full context, tools, and creative skills. It produces the best output it can given the brief. It does NOT see the evaluation rubric ahead of time. Giving it the rubric causes it to optimize for the rubric rather than genuine quality, which defeats the whole point.

The Evaluator has the rubric and the output. That is it. No memory of the generation process. No emotional investment in the draft. It reads cold, scores every dimension with a rationale, and returns a structured verdict: accept, refine, or reject. When it rejects, it provides exact quotes from the output, explains the problem, and gives a concrete rewrite.

This separation is everything. The generator is optimistic by nature. The evaluator is adversarial by design. Neither role works well if the same agent plays both.

The Agent Definitions

Here are the two agent files you need. Save them to .claude/agents/ in your project.

The Generator Agent

---
name: gan-generator
description: "GAN Harness — Generator agent. Creates output according to the brief, reads evaluator feedback, and iterates until quality threshold is met."
tools: ["Read", "Write", "Edit", "Bash", "Grep", "Glob"]
model: claude-sonnet-4-6
---

You are the Generator in a GAN-style multi-agent harness.

## Your Role

You are the Creator. You build the output according to the spec.
After each iteration, the Evaluator will score your work.
You then read the feedback and improve.

## Key Rules

1. Read the spec first — always start by reading the brief or spec file
2. Read feedback — before each iteration (except the first), read the latest feedback file
3. Address every issue — feedback items are not suggestions, fix them all
4. Do not self-evaluate — your job is to create, not to judge
5. Commit between iterations — so the Evaluator sees clean diffs

## Workflow

### First Iteration
1. Read the brief / spec
2. Produce the output (post, code, design, document — whatever is specified)
3. Write generator-state.md: what you built, known issues, open questions

### Subsequent Iterations
1. Read feedback/feedback-NNN.md (latest)
2. List every issue the Evaluator raised
3. Fix by priority: critical issues first, then major, then minor
4. Update generator-state.md

## Generator State File

Write to generator-state.md after each iteration:

# Generator State — Iteration NNN

## What Was Built
- [output 1]
- [output 2]

## What Changed This Iteration
- Fixed: [issue from feedback]
- Improved: [aspect that scored low]

## Known Issues
- [anything you could not fix]

The Evaluator Agent

---
name: gan-evaluator
description: "GAN Harness — Evaluator agent. Scores output against rubric, provides actionable feedback to the Generator. Be ruthlessly strict."
tools: ["Read", "Write", "Grep", "Glob"]
model: claude-sonnet-4-6
---

You are the Evaluator in a GAN-style multi-agent harness.

## Your Role

You are the Critic. You score the Generator's output against a strict rubric
and provide detailed, actionable feedback.

## Core Principle: Be Ruthlessly Strict

> You are NOT here to be encouraging. You are here to find every flaw.
> A passing score must mean the output is genuinely good — not "good for an AI."

Your natural tendency is to be generous. Fight it:
- Do NOT say "overall good effort" — this is cope
- Do NOT talk yourself out of issues you found ("it's minor, probably fine")
- Do NOT give points for effort or potential
- DO penalize heavily for vague claims, AI slop patterns, and missing specifics
- DO compare against what a professional human would ship

## Evaluation Workflow

### Step 1: Read the Rubric
Read the criteria file for this task type.
Read the spec / brief for what was asked.
Read generator-state.md for what was built.

### Step 2: Score

Score each criterion on a 1-10 scale using the rubric file.

Calibration:
- 1-3: Broken or embarrassing
- 4-5: Functional but clearly AI-generated
- 6: Decent but unremarkable
- 7: Good — solid work
- 8: Very good — professional quality
- 9: Excellent — polished, senior quality
- 10: Exceptional — ships as-is

### Step 3: Write Feedback

Write to feedback/feedback-NNN.md:

# Evaluation — Iteration NNN

## Scores

| Criterion | Score | Weight | Weighted |
|-----------|-------|--------|----------|
| [criterion] | X/10 | 0.X | X.X |
| TOTAL | | | X.X/10 |

## Verdict: PASS / FAIL (threshold: 7.0)

## Critical Issues (must fix)
1. [Issue]: [exact quote] → [how to fix]

## Major Issues (should fix)
1. [Issue]: [exact quote] → [how to fix]

## Minor Issues (nice to fix)
1. [Issue]: [exact quote] → [how to fix]

## What Improved Since Last Iteration
- [improvement]

## Feedback Quality Rules

1. Every issue must have a concrete "how to fix" — not just "this is bad"
2. Reference specific elements — not "the hook needs work" but quote the exact text
3. Quantify when possible — "3 out of 5 items have no concrete numbers"
4. Acknowledge genuine improvements — calibrates the loop

The Loop Configuration

Save this as gan.json in your project (or inside your .claude/ folder):

{
  "default_threshold": 7.0,
  "max_iterations": 3,
  "escalation": "accept-with-notes",

  "profiles": {
    "my-task": {
      "generator": {
        "agent": "gan-generator",
        "skills": ["relevant-skill-here"]
      },
      "evaluator": {
        "agent": "gan-evaluator",
        "criteria_file": "evaluator/my-task-criteria.md"
      },
      "scoring": {
        "dimensions": [
          {"name": "hook_power",     "weight": 0.25},
          {"name": "value_density",  "weight": 0.25},
          {"name": "brand_voice",    "weight": 0.20},
          {"name": "clarity",        "weight": 0.20},
          {"name": "cta",            "weight": 0.10}
        ],
        "threshold": 7.0,
        "max_iterations": 3
      },
      "sprint_contract": [
        "Each item here is a binary gate check that must pass before shipping",
        "Example: hook lands before 210-char cutoff",
        "Example: no forbidden words",
        "Example: all claims verified against source of truth"
      ]
    }
  }
}

The sprint_contract is a list of binary pass/fail rules. The Evaluator checks these first, before even computing the weighted score. A single failed gate means reject, regardless of how good the rest is.
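In code terms, that ordering looks like this. A sketch, not the harness's actual implementation: dimension names and weights follow the config above, gates are plain booleans, and this version folds "refine" into "below threshold" (a fuller version would also hold back accept while major issues remain):

```python
def verdict(scores, weights, gates, threshold=7.0):
    """Gate checks run first: any failed gate is an outright reject,
    regardless of the weighted score. Otherwise the weighted total
    decides accept vs refine."""
    if not all(gates.values()):
        return "reject", 0.0
    total = sum(scores[name] * w for name, w in weights.items())
    return ("accept" if total >= threshold else "refine"), round(total, 2)

weights = {"hook_power": 0.25, "value_density": 0.25,
           "brand_voice": 0.20, "clarity": 0.20, "cta": 0.10}
```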

What a Real Rubric Looks Like

The Evaluator is only as good as its rubric. Vague criteria produce vague scores.

A real rubric defines each dimension with three anchor examples: exceptional (9-10), acceptable (6-7), and reject (1-4). Each anchor includes an example AND a reason, so the Evaluator can calibrate consistently across iterations.

Here is the Hook Power dimension from the LinkedIn post rubric:

### Hook Power (weight: 0.20)

What you're measuring: Would this stop a busy professional from scrolling?
Must land before the 210-character cutoff.

Score 9-10 (Exceptional):
  "I was quoted $15,000 for an MVP. I built it for $197 in a weekend."
  Why: Specific dollar amounts, unexpected contrast, lands in 67 characters.
  Personal stake. Reader wants to know how.

Score 6-7 (Acceptable):
  "Building a SaaS doesn't have to take months. Here's a faster way."
  Why: Clear promise but no specificity, no numbers, no personal stake.

Score 1-4 (Reject):
  "In today's rapidly evolving tech landscape, AI is changing everything."
  Why: Generic, forbidden phrase, says nothing unique. Reader scrolls past.

The anchor examples are the most important part. Without them, two evaluations of the same output will score it differently every run. With them, scores stay consistent across iterations, which is what makes the loop converge.

A Real Example: Loop by Loop

Here is a LinkedIn post going through 3 rounds.

Brief: Write a LinkedIn post about how Build This Now uses 18 AI agents to build SaaS features.

Loop 1 — Generator runs cold:

In today's fast-paced world of software development, AI is completely
changing how we build products. Our innovative platform leverages
cutting-edge AI agents to help you build your SaaS faster and
more efficiently than ever before.

#AI #SaaS #startup #tech #innovation

Evaluator scores Loop 1:

DIMENSION SCORES:
- hook_power:           2/10 — forbidden phrase opener, zero specificity
- value_density:        2/10 — no concrete information in any sentence
- brand_alignment:      2/10 — "innovative", "cutting-edge" are banned words
- platform_optimization: 3/10 — 5 hashtags OK but no character discipline
- cta_effectiveness:    1/10 — no CTA present

OVERALL: 2.1/10

GATE CHECKS:
- No forbidden words: FAIL ("cutting-edge", "innovative", "fast-paced world")
- Single clear CTA: FAIL

VERDICT: reject

CRITICAL ISSUES:
1. Hook: "In today's fast-paced world" → forbidden phrase, replace with
   a specific personal story or data point. Example:
   "I rebuilt auth and payments 3 times in 2 years. Same 3-week slog. Every time."

Loop 2 — Generator incorporates feedback:

I spent 3 weekends rebuilding the same auth and payments system.

With Build This Now, I describe the feature. 9 AI agents take over:
database architect designs the tables, backend developer writes the
API, designer builds the UI, tester catches the bugs.

One feature, 10-60 minutes. No more rebuilt boilerplate.

What are you still building by hand that you shouldn't be?

#BuildThisNow #indiehacker #buildinpublic #solofounder #SaaS

Evaluator scores Loop 2:

DIMENSION SCORES:
- hook_power:           7/10 — personal, relatable, but "I spent 3 weekends"
                               opens weakly. "3 weekends" needs more punch.
- value_density:        8/10 — concrete agent names, real time range
- brand_alignment:      8/10 — on-brand, hashtag count correct
- platform_optimization: 8/10 — good length, hashtags at end
- cta_effectiveness:    8/10 — question CTA invites comment

OVERALL: 7.7/10

GATE CHECKS: all pass

VERDICT: refine (score clears 7.0, but an unresolved major issue blocks accept)

MAJOR ISSUES:
1. Hook: The data point "3 weekends" is good, but "I spent" is a limp opener.
   Rewrite to front-load the pain with
   a harder verb. Example:
   "I rebuilt auth and payments 3 times in 2 years. Same 3-week slog. Every time."

Loop 3 — One targeted fix:

I rebuilt auth and payments 3 times in 2 years. Same 3-week slog. Every time.

Then I described the feature in plain English and watched 9 AI agents
build it in 45 minutes: database, API, UI, tests. All wired. All clean.

Build This Now ships what used to take a week, in an afternoon.

What are you still rebuilding by hand?

#BuildThisNow #indiehacker #buildinpublic #solofounder #SaaS

Evaluator scores Loop 3:

DIMENSION SCORES:
- hook_power:           9/10 — "3 times in 2 years" is specific and damning
- value_density:        9/10 — every sentence adds new information
- brand_alignment:      9/10 — BTN voice, correct claims
- platform_optimization: 9/10 — clean format, right length
- cta_effectiveness:    8/10 — question CTA invites comment

OVERALL: 8.9/10

GATE CHECKS: all pass

VERDICT: accept

Three loops. 2.1 to 8.9.
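The overall numbers are just the weighted sums. Assuming this profile uses the same 0.25/0.25/0.20/0.20/0.10 weight pattern as the config above, the arithmetic checks out against the dimension scores in each loop (loop 2's raw total is 7.75, shown truncated as 7.7):

```python
# Weight pattern from the config above: hook, value, brand, platform, cta
weights = [0.25, 0.25, 0.20, 0.20, 0.10]

def total(scores):
    return sum(s * w for s, w in zip(scores, weights))

loop1 = total([2, 2, 2, 3, 1])   # 2.1  -> reject
loop2 = total([7, 8, 8, 8, 8])   # 7.75 -> refine (shown truncated as 7.7)
loop3 = total([9, 9, 9, 9, 8])   # 8.9  -> accept
```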

How to Run It

The simplest way to run the loop is manually, with claude -p:

# Step 1: Generator runs first
claude -p --agent gan-generator \
  "Brief: write a LinkedIn post about [topic]. Save output to output/draft.md.
   Write generator-state.md with what you produced."

# Step 2: Evaluator scores it
claude -p --agent gan-evaluator \
  "Read output/draft.md and evaluator/linkedin-criteria.md.
   Score against the rubric. Write feedback to feedback/feedback-001.md.
   Be ruthlessly strict."

# Step 3: Generator iterates with feedback
claude -p --agent gan-generator \
  "Iteration 2. Read feedback/feedback-001.md FIRST.
   Address every issue. Update output/draft.md.
   Update generator-state.md."

# Repeat until VERDICT: accept
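The "repeat until VERDICT: accept" check is easy to script. Here is a small helper that reads the newest feedback file and extracts the verdict; the file layout matches the template above, but the function name is mine:

```python
import re
from pathlib import Path

def latest_verdict(feedback_dir="feedback"):
    """Return the verdict from the newest feedback-NNN.md, or None if absent."""
    # Zero-padded NNN means lexicographic sort equals chronological order
    files = sorted(Path(feedback_dir).glob("feedback-*.md"))
    if not files:
        return None
    match = re.search(r"VERDICT:\s*([\w-]+)", files[-1].read_text())
    return match.group(1).lower() if match else None
```

Wrap the two claude -p calls in a shell or Python loop that stops as soon as this returns "accept".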

You can also run it fully automated with the loop-operator agent from everything-claude-code:

# Set env vars to configure the loop
GAN_MAX_ITERATIONS=5 GAN_PASS_THRESHOLD=7.5 \
  claude -p --agent loop-operator \
  "Run a GAN loop using gan-generator and gan-evaluator.
   Brief: [your task]. Criteria file: evaluator/my-criteria.md.
   Stop when score >= 7.5 or after 5 iterations."

The Critical Rule: No Memory Between Passes

The Evaluator reads the rubric fresh on every pass. This is not optional.

An evaluator with memory of previous scores will inflate them over time. It scored something 6 last round, so it scores it 7 this round to show "progress," even if the actual improvement was minimal. Fresh rubric, fresh eyes, every time.

The loop configuration enforces this:

{
  "max_iterations": 3,
  "escalation": "accept-with-notes"
}

When max_iterations is hit without passing threshold, the system ships the best version with notes rather than blocking forever. The output may not be perfect, but it is the best the generator could produce in 3 rounds, and you have a record of what the evaluator flagged.
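The escalation bookkeeping is a few lines. A sketch of "ship the best version with notes", with illustrative names and a hypothetical history structure:

```python
def escalate(history):
    """Called when max_iterations passes without a PASS verdict.
    history: one dict per iteration, e.g.
    {"score": 6.8, "draft": "...", "issues": ["hook still generic"]}.
    Ships the best-scoring draft plus a record of what the
    evaluator still flagged."""
    best = max(history, key=lambda h: h["score"])
    notes = "\n".join(f"- {issue}" for issue in best["issues"])
    return best["draft"], f"shipped at {best['score']}/10; open issues:\n{notes}"
```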

Bonus: Use Codex as Your Evaluator

The most powerful version of this loop uses two different AI systems. Claude as generator. OpenAI Codex as evaluator.

Why this matters: the evaluator is looking for flaws in Claude's output. But Claude and Codex were trained on different data, with different objectives, different architectures, and different sets of known weaknesses. Claude evaluating Claude misses the blind spots they share. Codex evaluating Claude catches a different class of issues entirely.

Two AI labs, fighting over your output.

If you have the codex-plugin-cc installed in Claude Code:

# Install the plugin
/plugin marketplace add openai/codex-plugin-cc

# Use it in your loop
/codex:adversarial-review output/draft.md

Or invoke Codex as the evaluator agent directly in your loop config:

{
  "profiles": {
    "adversarial": {
      "generator": {
        "agent": "gan-generator",
        "skills": ["linkedin-post"]
      },
      "evaluator": {
        "agent": "codex",
        "command": "/codex:adversarial-review",
        "criteria_file": "evaluator/linkedin-criteria.md"
      }
    }
  }
}

The Codex evaluator runs /codex:adversarial-review on the generator output and passes the criteria file as context. It will challenge design decisions, flag assumptions Claude does not question, and score from a completely different perspective.

Cross-model evaluation is not a gimmick. When the generator and evaluator share the same training distribution, they share the same blind spots. A cross-model loop closes that gap.

What You Get After 3 Rounds

Output that converges. Not perfect, but reliably above threshold.

In 3 automated loops, the system ships content that would have taken an hour of manual review. You stop reading every line. You stop second-guessing hooks. You read the evaluator summary and spot-check the final output.

The rubric does the work. You define what "good" looks like once. Every piece of output gets held to that same standard, every time, with no drift and no ego.

The full file set for a working GAN Loop:

.claude/
  agents/
    gan-generator.md      ← generator agent definition
    gan-evaluator.md      ← evaluator agent definition
  subsystems/
    content/
      gan.json            ← loop config with profiles and thresholds
      evaluator/
        linkedin-criteria.md    ← rubric for LinkedIn posts
        carousel-text-criteria.md
        x-thread-criteria.md
        reddit-criteria.md
output/
  draft.md                ← generator output
  generator-state.md      ← what was built each iteration
  feedback/
    feedback-001.md       ← evaluator feedback per round
    feedback-002.md
    feedback-003.md

Copy the agent definitions above, write one rubric file for your task type, set a threshold, and run it.


Posted by @speedy_devv

Stop configuring. Start building.

SaaS builder templates with AI orchestration.

Get Build This Now
