Build This Now
speedy_devv, koen_salo

Why QA Is the Real Bottleneck in AI Development

The hardest unsolved problem in AI software development is not generating features. It is verifying them at scale. QA does not parallelize like generation does.

Stop configuring. Start building.

SaaS builder templates with AI orchestration.

Published Jun 11, 2026 | 7 min read | Real Builds hub

The hardest unsolved problem in AI software development today is quality assurance, not generation. Building features with AI is cheap and nearly solved. Verifying that those features actually work, at the speed and scale you can now generate them, is the real frontier. QA is the ceiling on agent autonomy right now, because verification does not parallelize the way generation does.

We have shipped AI-built SaaS for about 18 months, from Claude Opus 4.1 through Claude Fable 5. Over that span, the build side got radically faster while the verification side barely moved. That gap is the whole story.




Building Stopped Being the Hard Part

Build This Now is an AI-powered SaaS build system: a production codebase plus 18 specialist AI agents that plan, build, test, and ship features from plain-English instructions. With a good agent harness and a real production codebase underneath, generation is no longer the constraint.

An MVP that used to take us a month now takes a day. Auth, payments, database security, email, background jobs: most of that is already wired up, and the AI builds the custom features on top. The cost of producing a working feature dropped by an order of magnitude.

So we did the obvious thing. We tried to run more of them in parallel.

That is where we hit the wall.

The Wall: You Cannot Verify Everything You Can Build

In our experience, we could not reliably run more than about four features in parallel before it turned into a mess. The build agents kept up fine. The verification did not.

Past roughly four concurrent features, three things went wrong:

  1. Test runs started colliding. Shared state, shared database rows, shared ports. One feature's test setup stepped on another's teardown.
  2. State drifted. The environment the tests assumed no longer matched the environment that existed, because a sibling agent had changed it underneath them.
  3. The testing agents spun into loops. They re-ran and re-checked the same flows without converging, burning time and tokens, unsure whether a failure was real or a collision.

None of these are generation problems. They are verification problems. And they get worse, not better, as you add more parallel work.
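
The first failure mode can be reproduced in a few lines. The sketch below is a toy model, not our harness: a plain dict stands in for the shared test database, the interleaving order is fixed by hand, and `namespaced` is a hypothetical helper. It shows two suites that both fail when they share a key, even though neither feature is broken, and both pass once each suite writes under its own prefix.

```python
# Illustrative sketch only: a dict stands in for a shared test database.

def namespaced(feature: str, key: str) -> str:
    """Hypothetical helper: prefix a key with the feature that owns it."""
    return f"{feature}/{key}"

def interleaved_run(key_a: str, key_b: str) -> tuple[bool, bool]:
    """Run two suites' setup/check/teardown steps in a fixed interleaved order.

    Returns (suite A passed, suite B passed).
    """
    db: dict[str, str] = {}
    db[key_a] = "feature-a"                # A: setup
    db[key_b] = "feature-b"                # B: setup (clobbers A if keys match)
    a_ok = db.get(key_a) == "feature-a"    # A: check
    db.pop(key_a, None)                    # A: teardown
    b_ok = db.get(key_b) == "feature-b"    # B: check
    db.pop(key_b, None)                    # B: teardown
    return a_ok, b_ok

shared = interleaved_run("user:1", "user:1")
isolated = interleaved_run(namespaced("feature-a", "user:1"),
                           namespaced("feature-b", "user:1"))
print(shared)    # (False, False): two false failures from one collision
print(isolated)  # (True, True)
```

Real environments share more than rows (ports, queues, caches), so key prefixes alone are an assumption that only covers the simplest case.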

Generation Scales. Verification Does Not.

This is the core asymmetry. Generating a feature is mostly an independent task: give an agent a spec and a codebase, and it can work in isolation. Verifying a feature is a shared-world task: the test has to observe real behavior in a real environment, and that environment is the thing every other agent is also touching.

| Property | Generation | Verification |
| --- | --- | --- |
| Unit of work | One spec, mostly isolated | One behavior, observed in a shared environment |
| Parallelism | Scales near-linearly | Collides as concurrency rises |
| State dependence | Low (writes its own code) | High (depends on environment others mutate) |
| Failure mode at scale | Slower, but correct | Loops, false failures, non-convergence |
| Cost trend as you add agents | Roughly flat per feature | Rises sharply (coordination overhead) |

You can throw ten agents at ten features and get ten features. You cannot throw ten agents at ten test suites in one environment and get ten clean verdicts. The verifiers contend for the same world.

That is why QA is the bottleneck. It is not that testing is hard in some abstract sense. It is that testing resists the exact thing that made generation cheap: running many copies at once.
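
One rough way to see why contention grows faster than the agent count: assume n verifiers share one environment and any two of them can interfere. This is an illustrative model, not a measurement. The number of verifier pairs that can collide is then:

```latex
\[
P(n) = \binom{n}{2} = \frac{n(n-1)}{2},
\qquad P(2) = 1,\quad P(4) = 6,\quad P(7) = 21,\quad P(10) = 45
\]
```

Going from four verifiers to ten multiplies the agent count by 2.5 but the possible collision pairs by 7.5. Isolated generation tasks have no such pairs at all.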

Reliability Drops as Parallelism Rises

Here is the shape of what we saw, framed as features-in-parallel against how reliable the verification stayed. These are our observed bands, not benchmarks.

| Features in parallel | What verification looked like |
| --- | --- |
| 1 to 2 | Clean. Tests ran, failures were real, results were trustworthy. |
| 3 to 4 | Mostly fine. Occasional collisions, manageable with isolation. |
| 5 to 6 | Drift and false failures. Testing agents started re-running without converging. |
| 7 or more | Unreliable. Loops, contention, results you could not trust without re-checking by hand. |

The build side stayed flat across all of these bands. We could generate seven features as easily as two. We just could not trust the test results for seven.
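
A blunt way to live with these bands is to leave builds unbounded and gate only verification. The following is a minimal sketch under stated assumptions, not a description of our harness: `build` and `verify` are hypothetical placeholders for agent calls, and the cap of 4 is simply the observed band from the table, not a tuned value.

```python
import asyncio

MAX_PARALLEL_VERIFY = 4  # assumption: the "3 to 4" band observed above

async def build(feature: str) -> str:
    """Placeholder for an isolated build agent; no shared state touched."""
    await asyncio.sleep(0)
    return f"{feature}:built"

async def verify(feature: str, gate: asyncio.Semaphore, stats: dict) -> str:
    """Placeholder for a test run; only a few may hold the gate at once."""
    async with gate:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stands in for the actual test run
        stats["active"] -= 1
    return f"{feature}:verified"

async def pipeline(features: list[str]):
    gate = asyncio.Semaphore(MAX_PARALLEL_VERIFY)
    stats = {"active": 0, "peak": 0}
    # Builds all run at once; verification is throttled by the semaphore.
    built = await asyncio.gather(*(build(f) for f in features))
    verified = await asyncio.gather(*(verify(f, gate, stats) for f in features))
    return built, verified, stats["peak"]

def run_pipeline(features: list[str]):
    return asyncio.run(pipeline(features))

built, verified, peak = run_pipeline([f"feature-{i}" for i in range(7)])
print(len(built), len(verified), peak)
```

The trade-off is explicit: all seven features still get verified, but never more than four at a time, so verification time grows while trust in each verdict is preserved.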

Why Better Models Help, but Do Not Solve It

Stronger models genuinely move the needle. More context means an agent can hold more of the system in its head and make fewer mistakes, which means less to catch downstream. Fewer bugs generated means fewer bugs to verify.

Claude Fable 5, Anthropic's newest model for complex long-running work (priced at $10 per million input tokens and $50 per million output tokens), absorbs part of the problem a different way. It runs longer chains without drifting, so a verification agent can stay coherent across a long test-and-fix loop instead of losing the thread halfway through and starting to spin. That directly attacks the loop failure mode we kept hitting.

But this raises the ceiling. It does not remove it. The asymmetry is structural. As long as verification has to observe behavior in a shared environment, adding parallel verifiers adds contention. A better model converges faster and drifts less, which buys you more concurrent features before the wall. It does not move the wall to infinity.
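
The loop failure mode can also be bounded from the harness side, whatever model is running. Here is a minimal sketch, assuming hypothetical `run_tests` and `apply_fix` callables supplied by the caller: it stops when tests pass, when the same set of failures comes back a second time (a sign the agent is spinning), or when a round budget runs out.

```python
from typing import Callable, Iterable

def verify_until_stable(
    run_tests: Callable[[], Iterable[str]],
    apply_fix: Callable[[frozenset], None],
    max_rounds: int = 5,
) -> tuple[str, int]:
    """Test-and-fix loop with a round budget and a repeat detector.

    run_tests returns the names of failing tests; apply_fix attempts a repair.
    Both are hypothetical hooks, not part of any real library.
    """
    seen: set[frozenset] = set()
    for round_no in range(1, max_rounds + 1):
        failures = frozenset(run_tests())
        if not failures:
            return ("pass", round_no)
        if failures in seen:
            # Same failures as an earlier round: escalate instead of re-running.
            return ("stuck", round_no)
        seen.add(failures)
        apply_fix(failures)
    return ("gave_up", max_rounds)

# Usage with scripted results standing in for real test runs.
scripted = iter([["login", "billing"], ["billing"], []])
print(verify_until_stable(lambda: next(scripted), lambda f: None))  # ('pass', 3)
```

A "stuck" verdict does not say whether the failure is real or a collision. It only stops the token burn and hands the question to something outside the loop.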

QA at scale is still the frontier, and it is the next real unlock. The team that figures out how to parallelize verification the way we already parallelize generation gets the next order-of-magnitude jump in agent autonomy. We have written about why building is not the bottleneck from first principles, and you can see the same limit show up in practice in how we run an autonomous AI swarm and why we lean on adversarial evaluators to keep verifiers honest.

FAQ

Why can't AI agents test code at scale?

AI agents cannot test code at scale because verification depends on a shared environment, while generation does not. Each test has to observe real behavior in a real system, and when many agents test in parallel they contend for the same database, state, and resources. Generation is isolated and parallelizes well. Verification collides. In our experience, test runs start interfering past about four concurrent features.

How many AI agents can run in parallel before quality breaks down?

In our experience building AI SaaS for roughly 18 months, we could not reliably run more than about four features in parallel before verification became unreliable. The build agents scaled fine past that, but the testing agents began colliding, drifting, and looping without converging. The exact number depends on environment isolation, but the asymmetry between generation and verification is consistent: building scales, verifying does not.

What is the real bottleneck in AI software development?

The real bottleneck in AI software development is quality assurance, not generation. Producing a working feature with a good agent harness now takes a day instead of a month. Verifying those features at the speed you can generate them is unsolved, because verification does not parallelize the way generation does. QA is the actual ceiling on agent autonomy today.

Do better models fix the QA bottleneck?

Better models help but do not fix it. More context and fewer mistakes mean fewer bugs to catch, and models like Claude Fable 5 run longer verification chains without drifting, which reduces the loop failure mode. But the bottleneck is structural: verification observes a shared environment, so adding parallel verifiers adds contention. Stronger models raise the ceiling; they do not remove it.

Where This Goes Next

Generation is solved enough that it is no longer interesting. The open problem is verification that scales like generation does. Whoever cracks parallel QA unlocks the next level of autonomy, and that is the work we are watching, and building toward, with Claude Fable 5 and the harness around it.
