Test-Driven Development with Claude Code

The highest-leverage way to get reliable output from Claude Code is to make it write the test before it writes the code. A failing test is a spec the agent cannot fake. Code can look correct and be wrong; a test that goes from red to green is an objective definition of done that the agent can check itself, on every iteration, without you in the loop.

This guide shows the exact loop: have Claude write failing tests from your spec, confirm they fail, implement until green, and refuse to let it edit the tests to cheat. Then it shows how to enforce that loop with hooks and a test-runner subagent so the gate runs whether you remember it or not.

Arrête de tout configurer. Place à la construction.

Des templates SaaS avec orchestration IA.

Why TDD fits agentic coding

The core problem with AI-generated code is that "plausible" and "correct" look identical until something breaks. Claude is very good at producing code that reads like it works. Without a check, you are the check, and you are slower and less consistent than a test suite.

A test changes what the agent is optimizing for. Instead of "produce code that looks like a solution," the target becomes "produce code that makes this specific assertion pass." That is a closed loop the agent can run on its own. It writes, it runs the suite, it sees a failure, it fixes, it runs again. You only step in when the loop converges.

This is why TDD beats prompting harder. A longer prompt still leaves the definition of done fuzzy. A test makes it concrete. The agent stops when the suite is green, not when it feels finished.

There is a catch. Claude writes implementation first by default, then writes tests that pass against that implementation. That is the inverse of TDD and it proves nothing, because the test is shaped to the code rather than to the requirement. You have to flip the order on purpose.

The core loop

Four steps, in order, every time.

Write a failing test from the spec. Describe the behavior you want and tell Claude to write a test for it with no implementation yet. The test encodes the requirement, not the code.
Confirm it fails for the right reason. Run the test. It should fail because the behavior does not exist, not because of a typo or a bad import. A test that passes on the first run is testing nothing.
Implement until green. Now let Claude write the minimum code to pass the test. It can run the suite as many times as it needs and self-correct.
Refuse to edit the test to cheat. The test is frozen once you approve it. If the suite is red, the fix goes in the implementation, never in the assertions. Changing the spec to match broken code is the one move that breaks the whole approach.

The discipline lives in steps 2 and 4. Confirming the red state stops the agent from quietly writing a test that already passes. Freezing the test stops it from "passing" by deleting the part that was failing.

A concrete example

Say you need a pure function that splits a payment into equal cents across N recipients and handles the remainder so the totals always reconcile. This is exactly the kind of logic that looks trivial and hides off-by-one bugs in the rounding.

Start by giving Claude the behavior, not the implementation, and forcing the test-first order.

Write a Vitest test file for a function `splitAmount(totalCents, recipients)`
that splits an integer number of cents into `recipients` equal parts.
Rules: every part is a whole number of cents, the parts sum exactly to
totalCents, and any remainder is distributed one cent at a time to the
earliest recipients. Throw if recipients < 1.

Write the FAILING test first. Do NOT write the implementation yet.
Run it and show me that it fails.

What you want back is a test that pins down the tricky cases (exact division, a remainder, a single recipient, the error path) and nothing else. Here is the kind of file that spec produces.

// src/split-amount.test.ts
import { describe, it, expect } from 'vitest'
import { splitAmount } from './split-amount'

describe('splitAmount', () => {
  it('splits evenly when it divides exactly', () => {
    expect(splitAmount(900, 3)).toEqual([300, 300, 300])
  })

  it('gives the remainder to the earliest recipients', () => {
    expect(splitAmount(1000, 3)).toEqual([334, 333, 333])
  })

  it('always sums back to the total', () => {
    const parts = splitAmount(10001, 7)
    expect(parts.reduce((a, b) => a + b, 0)).toBe(10001)
  })

  it('handles a single recipient', () => {
    expect(splitAmount(500, 1)).toEqual([500])
  })

  it('throws when there are no recipients', () => {
    expect(() => splitAmount(500, 0)).toThrow()
  })
})

Run it and it fails, because splitAmount does not exist yet. That failure is the signal you want before any implementation gets written. Confirm it, then tell Claude to make it pass with the minimum code and no changes to the test.

// src/split-amount.ts
export function splitAmount(totalCents: number, recipients: number): number[] {
  if (recipients < 1) {
    throw new Error('recipients must be at least 1')
  }
  const base = Math.floor(totalCents / recipients)
  const remainder = totalCents - base * recipients
  return Array.from({ length: recipients }, (_, i) =>
    i < remainder ? base + 1 : base,
  )
}

The suite goes green. The "sums back to the total" assertion is the one that matters here, because it catches the rounding bug class directly. If Claude had written the implementation first, that assertion might never have existed, and the bug would ship.

Enforce it with hooks and subagents

Asking for TDD works until you forget to ask. Hooks make the gate automatic. Only 13% of public Claude Code repositories use hooks, the mechanism that can enforce a test gate without you remembering, per our analysis of 2,500 repos. It is the most underused lever in the whole tool.

Two events do the job. A PostToolUse hook matched to Edit and Write runs the suite right after Claude touches a file, giving fast feedback. A Stop hook runs when the turn ends and can block: exit code 2 prevents Claude from stopping and pushes it to keep working. The Stop hook is the real gate, because it can refuse to let the agent finish on a red suite.

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npm test --silent" }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "npm test --silent" }
        ]
      }
    ]
  }
}

Put that in .claude/settings.json. The PostToolUse run surfaces failures the moment a file changes; the Stop run is the one that holds the line, because a failing suite there blocks the turn from ending. For more on the event model and exit codes, see the hooks guide.

For heavier suites, push the running into a dedicated test-runner subagent so the output and the noise stay out of your main context. The subagent runs the suite, reports pass or fail and the failing cases, and your main session decides what to do. See how to create your first Claude Code subagent for the setup, or the best subagents in 2026 for ready-made ones.

Pitfalls

The dangerous failure mode is reward hacking: the agent makes the suite green by weakening the test instead of fixing the code. It deletes a failing assertion, adds .skip to a case, loosens an exact match to a toBeDefined, or comments the hard part out. The suite passes and the bug ships.

Three guards stop it. Tell Claude in plain words that the test file is frozen once you approve it and that fixes go in the implementation only. Review the diff, because a deleted assertion or a .skip is obvious the moment you look. And run the full suite on a Stop hook, so a skipped test still shows up as a smaller suite rather than a green checkmark.

The other trap is confirming red too lazily. A test can fail for a boring reason (a bad import, a typo in the function name) and you read it as "the behavior is missing." Glance at the failure message. It should fail because the thing under test does not work yet, not because the test file is broken.

What you get

You get output you can trust without reading every line. The test is the contract, the suite is the proof, and the agent does the iterating against both. Bugs that would have surfaced in production surface as a red test in the loop instead.

You also get a feedback loop that compounds. Every test you keep is a regression guard for every future change, including changes Claude makes months from now. The same hook that gates today's feature gates tomorrow's refactor. Pair this with a code review pass and the broader agentic workflow habits, and quality stops being something you hope for and becomes something the system enforces.

If you would rather not wire all of this from scratch, the Build This Now Code Kit is a Claude Code harness for Next.js and Supabase that ships adversarial evaluators and quality gates so the test-and-verify loop runs for you, alongside planning agents, a build pipeline, and auth, payments, and database already wired. It is $29 one-time, no subscription.

Frequently Asked Questions

Why does TDD work so well with Claude Code?

A failing test is an executable spec the agent cannot fake. With normal prompting, Claude writes code that looks right and you find out it is wrong later. With TDD, you write the test first so "done" has an objective definition: the suite goes green. The agent gets a tight feedback loop it can run on its own, so it self-corrects until the behavior actually matches the requirement instead of stopping at plausible.

How do I stop Claude from writing the implementation before the test?

Be explicit. Claude defaults to implementation-first, so tell it: "Write a failing test for this behavior. Do not write any implementation yet. Run the test and show me it fails." Confirm the red state yourself before letting it implement. If you skip that confirmation, the test can quietly be shaped to fit code that does not exist yet, which defeats the point.

How do I keep the agent from weakening a test just to pass it?

This is reward hacking and it is the main failure mode. Three guards: tell Claude explicitly that the test file is frozen once you approve it, review the diff so a deleted assertion is visible, and run a hook on Stop that runs the full suite so a skipped or commented-out test cannot hide. If a test needs to change, you change it deliberately, not the agent mid-task.

Which hook should run my tests, PostToolUse or Stop?

Use both for different jobs. A PostToolUse hook matching Edit and Write gives fast feedback right after a file changes, but it cannot block. A Stop hook runs when the turn ends and can block with exit code 2, which pushes Claude to keep working until the suite is green. The Stop hook is the real gate because it can refuse to let the agent finish on a red suite.

Do I still need to write tests myself if Claude writes them?

You write the spec and review the tests; Claude writes the test code. The leverage is not in typing the assertions, it is in deciding what "correct" means and confirming the test actually checks it. Read every generated test once. A test you never looked at is not a safety net, it is a second thing that can be wrong.

Test-Driven Development with Claude Code

On this page