Why QA Is the Real Bottleneck in AI Development

The hardest unsolved problem in AI software development today is quality assurance, not generation. Building features with AI is cheap and nearly solved. Verifying that those features actually work, at the speed and scale you can now generate them, is the real frontier. QA is the ceiling on agent autonomy right now, because verification does not parallelize the way generation does.

We have shipped AI-built SaaS for about 18 months, from Claude Opus 4.1 through Claude Fable 5. Over that span, the build side got radically faster while the verification side barely moved. That gap is the whole story.

Building Stopped Being the Hard Part

Build This Now is an AI-powered SaaS build system: a production codebase plus 18 specialist AI agents that plan, build, test, and ship features from plain-English instructions. With a good agent harness and a real production codebase underneath, generation is no longer the constraint.

An MVP that used to take us a month now takes a day. Auth, payments, database security, email, background jobs: most of that is already wired up, and the AI builds the custom features on top. The cost of producing a working feature dropped by an order of magnitude.

So we did the obvious thing. We tried to build more features in parallel.

That is where we hit the wall.

The Wall: You Cannot Verify Everything You Can Build

In our experience, we could not reliably run more than about four features in parallel before it turned into a mess. The build agents kept up fine. The verification did not.

Past roughly four concurrent features, three things went wrong:

Test runs started colliding. Shared state, shared database rows, shared ports. One feature's test setup stepped on another's teardown.
State drifted. The environment the tests assumed no longer matched the environment that existed, because a sibling agent had changed it underneath them.
The testing agents spun into loops. They re-ran and re-checked the same flows without converging, burning time and tokens, unsure whether a failure was real or a collision.

None of these are generation problems. They are verification problems. And they get worse, not better, as you add more parallel work.
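The first failure mode is easy to reproduce in miniature. The sketch below is a toy, not production code: `SharedStore`, `FeatureTest`, and `run_interleaved` are hypothetical names, and a plain dict stands in for the shared database. It interleaves two test runs so that one run's teardown lands before the other's check, then repeats the same schedule with a per-feature key prefix as one simple form of isolation.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStore:
    """Stand-in for a shared test database: one dict used by every run."""
    rows: dict = field(default_factory=dict)


class FeatureTest:
    """Hypothetical test run with setup, check, and teardown phases."""

    def __init__(self, store: SharedStore, feature: str, namespace: str = ""):
        self.store = store
        self.feature = feature
        # With an empty namespace, every feature writes to the same key.
        self.key = f"{namespace}user:1"

    def setup(self) -> None:
        self.store.rows[self.key] = {"owner": self.feature}

    def check(self) -> bool:
        row = self.store.rows.get(self.key)
        return row is not None and row["owner"] == self.feature

    def teardown(self) -> None:
        self.store.rows.pop(self.key, None)


def run_interleaved(isolated: bool) -> dict:
    """Run two tests on a fixed schedule where B tears down before A checks."""
    store = SharedStore()
    a = FeatureTest(store, "A", "A:" if isolated else "")
    b = FeatureTest(store, "B", "B:" if isolated else "")
    a.setup()
    b.setup()
    verdict_b = b.check()
    b.teardown()
    verdict_a = a.check()  # A's feature is fine; only its fixture may be gone
    a.teardown()
    return {"A": verdict_a, "B": verdict_b}


shared = run_interleaved(isolated=False)
isolated = run_interleaved(isolated=True)
print(shared)    # {'A': False, 'B': True}
print(isolated)  # {'A': True, 'B': True}
```

In the shared run, A reports a failure even though nothing is wrong with A. That is the kind of verdict a testing agent cannot tell apart from a real bug without extra information.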

Generation Scales. Verification Does Not.

This is the core asymmetry. Generating a feature is mostly an independent task: give an agent a spec and a codebase, and it can work in isolation. Verifying a feature is a shared-world task: the test has to observe real behavior in a real environment, and that environment is the thing every other agent is also touching.

Property	Generation	Verification
Unit of work	One spec, mostly isolated	One behavior, observed in a shared environment
Parallelism	Scales near-linearly	Collides as concurrency rises
State dependence	Low (writes its own code)	High (depends on environment others mutate)
Failure mode at scale	Slower, but correct	Loops, false failures, non-convergence
Cost trend as you add agents	Roughly flat per feature	Rises sharply (coordination overhead)

You can throw ten agents at ten features and get ten features. You cannot throw ten agents at ten test suites in one environment and get ten clean verdicts. The verifiers contend for the same world.
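A back-of-the-envelope model makes the arithmetic concrete. It rests on one loudly stated assumption: the only safe way for verifiers to share an environment is to take turns, so verification is limited by the number of environments rather than the number of agents. The function name `wall_clock` and the unit durations are illustrative, not measurements.

```python
def wall_clock(durations: list[float], slots: int) -> float:
    """Wall-clock time to finish all tasks with `slots` running at once.

    Greedy schedule: the longest remaining task goes to whichever slot
    frees up first.
    """
    if slots < 1:
        raise ValueError("need at least one slot")
    finish = [0.0] * slots
    for d in sorted(durations, reverse=True):
        earliest = finish.index(min(finish))
        finish[earliest] += d
    return max(finish)


AGENTS = 10
ENVIRONMENTS = 1
features = [1.0] * 10  # ten features, one time unit each (made-up numbers)

# Generation: each agent works in isolation, so every agent is a usable slot.
build_time = wall_clock(features, slots=AGENTS)

# Verification: agents beyond the number of environments just wait in line.
verify_time = wall_clock(features, slots=min(AGENTS, ENVIRONMENTS))

print(build_time)   # 1.0
print(verify_time)  # 10.0
```

In this model, adding agents shortens the first number and does nothing for the second. Only adding environments does, and letting verifiers skip the queue is what produces the collisions instead.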

That is why QA is the bottleneck. It is not that testing is hard in some abstract sense. It is that testing resists the exact thing that made generation cheap: running many copies at once.

Reliability Drops as Parallelism Rises

Here is the shape of what we saw: the number of features in parallel against how reliable verification stayed. These are our observed bands, not benchmarks.

Features in parallel	What verification looked like
1 to 2	Clean. Tests ran, failures were real, results were trustworthy.
3 to 4	Mostly fine. Occasional collisions, manageable with isolation.
5 to 6	Drift and false failures. Testing agents started re-running without converging.
7 or more	Unreliable. Loops, contention, results you could not trust without re-checking by hand.

The build side held steady across all of these bands. We could generate seven features as easily as two. We just could not believe the test results for seven.
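One way to decide whether a verdict deserves belief is to check for drift before accepting it. The sketch below is a minimal illustration under assumptions: the environment is summarized as a JSON-serializable dict, the test only reads that state, and `fingerprint` and `verdict` are hypothetical helper names. If the environment no longer matches what the verifier assumed, the result is marked untrusted rather than reported as a pass or a fail.

```python
import hashlib
import json


def fingerprint(env: dict) -> str:
    """Stable hash of the environment state a test depends on."""
    canonical = json.dumps(env, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verdict(env: dict, assumed_fp: str, test) -> str:
    """Run a read-only test, then refuse to trust it if the env drifted."""
    passed = test(env)
    if fingerprint(env) != assumed_fp:
        return "untrusted"
    return "pass" if passed else "fail"


def billing_test(env: dict) -> bool:
    """Hypothetical check that depends on a feature flag."""
    return env["feature_flags"]["billing"] is True


env = {"schema_version": 3, "feature_flags": {"billing": True}}
assumed = fingerprint(env)  # what the verifier believed when it planned the run

clean = verdict(env, assumed, billing_test)

# A sibling agent changes the environment underneath the verifier.
env["feature_flags"]["billing"] = False
drifted = verdict(env, assumed, billing_test)

print(clean)    # pass
print(drifted)  # untrusted
```

The drifted run would otherwise have been logged as a failure of the billing feature, which is exactly the false failure that sends a testing agent back around the loop.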

Why Better Models Help, but Do Not Solve It

Stronger models genuinely move the needle. More context means an agent can hold more of the system in its head and make fewer mistakes, which means less to catch downstream. Fewer bugs generated means fewer bugs to verify.

Claude Fable 5, Anthropic's newest model for complex long-running work (priced at $10 per million input tokens and $50 per million output tokens), absorbs part of the problem a different way. It runs longer chains without drifting, so a verification agent can stay coherent across a long test-and-fix loop instead of losing the thread halfway through and starting to spin. That directly attacks the loop failure mode we kept hitting.

But this raises the ceiling. It does not remove it. The asymmetry is structural. As long as verification has to observe behavior in a shared environment, adding parallel verifiers adds contention. A better model converges faster and drifts less, which buys you more concurrent features before the wall. It does not move the wall to infinity.
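Whatever the model, a harness can also put a hard cap on spinning. The following is a rough sketch, assuming a hypothetical `rerun_isolated` callable that re-runs one failed check in a fresh, isolated environment and returns `True` on a pass. The labels and the default of three attempts are illustrative choices, not tuned values.

```python
from typing import Callable, Tuple


def triage_failure(
    rerun_isolated: Callable[[], bool],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Classify a failed check with a bounded number of isolated re-runs.

    Returns a label and the number of re-runs used. A pass in isolation
    suggests the original failure came from contention, not the feature.
    """
    for attempt in range(1, max_attempts + 1):
        if rerun_isolated():
            return ("likely_collision", attempt)
    return ("likely_real", max_attempts)


# Scripted outcomes stand in for real test runs.
scripted = iter([False, True])
flaky = triage_failure(lambda: next(scripted))
broken = triage_failure(lambda: False)

print(flaky)   # ('likely_collision', 2)
print(broken)  # ('likely_real', 3)
```

The cap does not make verification parallel. It only guarantees that an ambiguous failure costs a fixed number of re-runs before it is escalated, instead of an open-ended loop.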

QA at scale is still the frontier, and it is the next real unlock. The team that figures out how to parallelize verification the way we already parallelize generation gets the next order-of-magnitude jump in agent autonomy. We have written about why building is not the bottleneck from first principles, and you can see the same limit show up in practice in how we run an autonomous AI swarm and why we lean on adversarial evaluators to keep verifiers honest.

FAQ

Why can't AI agents test code at scale?

AI agents cannot test code at scale because verification depends on a shared environment, while generation does not. Each test has to observe real behavior in a real system, and when many agents test in parallel they contend for the same database, state, and resources. Generation is isolated and parallelizes well. Verification collides. In our experience, test runs start interfering past about four concurrent features.

How many AI agents can run in parallel before quality breaks down?

In our experience building AI SaaS for roughly 18 months, we could not reliably run more than about four features in parallel before verification became unreliable. The build agents scaled fine past that, but the testing agents began colliding, drifting, and looping without converging. The exact number depends on environment isolation, but the asymmetry between generation and verification is consistent: building scales, verifying does not.

What is the real bottleneck in AI software development?

The real bottleneck in AI software development is quality assurance, not generation. Producing a working feature with a good agent harness now takes a day instead of a month. Verifying those features at the speed you can generate them is unsolved, because verification does not parallelize the way generation does. QA is the actual ceiling on agent autonomy today.

Do better models fix the QA bottleneck?

Better models help but do not fix it. More context and fewer mistakes mean fewer bugs to catch, and models like Claude Fable 5 run longer verification chains without drifting, which reduces the loop failure mode. But the bottleneck is structural: verification observes a shared environment, so adding parallel verifiers adds contention. Stronger models raise the ceiling; they do not remove it.

Where This Goes Next

Generation is solved enough that it is no longer interesting. The open problem is verification that scales like generation does. Whoever cracks parallel QA unlocks the next level of autonomy, and that is the work we are watching, and building toward, with Claude Fable 5 and the harness around it.
