Agent Swarm Orchestration
Four infrastructure layers that stop agent swarms from double-claiming tasks, drifting on field names, and collapsing under merge chaos.
Running multiple AI agents in parallel without falling apart is harder than it looks.
Most swarms fail the same way. Agents double-claim the same task. They invent different field names for the same data. Merges turn into chaos. One agent loops forever and nobody notices. The failures are consistent because the causes are consistent: missing infrastructure.
Four layers fix all of that. This post walks through each one.
Why most swarms break
The single-agent model is easy to reason about. One agent reads a task, builds it, and either finishes or gets stuck. You see what it did. You fix what it missed.
Add a second agent and the problems multiply. Both agents can see the same task queue. Both can grab the same task at the same time. One of them does work the other already started. You now have two half-finished versions of the same feature and no way to know which one to keep.
Add a third agent and field names start diverging. Agent A calls it userId. Agent B calls it user_id. Agent C writes uid. Three agents, three conventions, three branches that won't merge cleanly.
This is not a model quality problem. Claude Code agents are good at writing code. The problem is coordination infrastructure. Without it, even well-prompted agents produce broken swarms.
The four failure modes appear in order:
| Failure | What happens | When it appears |
|---|---|---|
| Double-claiming | Two agents grab the same task | As soon as 2+ agents run |
| Field name drift | Agents invent different names for shared data | First cross-agent feature |
| Merge chaos | Branches conflict because agents wrote to the same files | At merge time |
| Silent looping | One agent repeats the same failed step indefinitely | Long runs |
Fix these four and a swarm becomes reliable. Skip any one and it breaks.
The four layers
Every working swarm shares the same architecture. The names vary. The shape does not.
| Layer | Name | Job |
|---|---|---|
| 01 | Task Graph | Atomic task claims via database |
| 02 | Process Shell | Each agent in its own worktree |
| 03 | Contracts First | Shared interfaces injected before coding starts |
| 04 | Merge Queue | Serialized merges with tiered conflict resolution |
These are not optional. Remove any one and a specific failure mode returns. The task graph stops double-claiming. The process shell stops file conflicts. Contracts stop field drift. The merge queue stops bad code landing on main.
Layer 1: the task graph
The most common fix people try is a markdown plan file. Agents read it, pick a task, update it. In practice this breaks immediately. Two agents read the file at the same time. Both see the same unclaimed task. Both write status: in-progress in parallel. The file has a race condition baked in.
The fix is a database with atomic transactions.
A task graph is a table with one row per task. Each row has a status column: pending, claimed, done, or failed. Agents claim tasks with a SQL transaction that checks and updates in one atomic step:
```sql
UPDATE tasks
SET status = 'claimed', agent_id = $1, claimed_at = NOW()
WHERE id = $2 AND status = 'pending'
RETURNING id;
```

If two agents run this query at the same time with the same task ID, the database serializes them. One gets the row back. One gets nothing. The agent that gets nothing moves on to the next unclaimed task. No race condition. No duplicate work.
The task graph also tracks dependencies. Task B can only be claimed after Task A reaches done. This keeps agents from trying to build a payment form before the payment table exists.
Three columns do most of the work:
| Column | Type | Purpose |
|---|---|---|
| status | enum | pending / claimed / done / failed |
| agent_id | text | Which agent holds the claim |
| depends_on | int[] | Task IDs that must complete first |
A SQLite file on disk is enough for single-machine swarms. Supabase or Postgres works for anything distributed. The database is not the complex part. The transaction pattern is.
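The transaction pattern is small enough to sketch. Here is a minimal Python version of the claim step against SQLite, assuming the tasks schema described later in this post (the function name and connection handling are illustrative, not from any specific library):

```python
import sqlite3

def claim_task(conn: sqlite3.Connection, task_id: int, agent_id: str) -> bool:
    """Try to claim one task; returns True only if this agent won the claim.

    The UPDATE checks status='pending' and flips it to 'claimed' in a single
    statement, so the database serializes competing claimers for us.
    """
    cur = conn.execute(
        "UPDATE tasks SET status = 'claimed', agent_id = ?, "
        "claimed_at = datetime('now') "
        "WHERE id = ? AND status = 'pending'",
        (agent_id, task_id),
    )
    conn.commit()
    # rowcount is 1 for the winner and 0 for everyone else.
    return cur.rowcount == 1
```

An agent that gets False back simply asks for the next pending task instead of retrying the same one.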
Layer 2: process isolation
Agents sharing a working directory fight over the same files. Two agents editing the same file at the same time produce conflicts at best and corrupted output at worst. Git's index can only track one active operation at a time. When two agents both run git add in the same repo simultaneously, one of them fails with index.lock.
Git worktrees solve this completely.
A worktree is a separate checkout of the same repository at a different path on disk. Each checkout has its own working directory, its own index, and its own HEAD. The agents share the underlying object store but nothing else.
You create one worktree per agent at the start of a swarm run:
```shell
git worktree add ../agent-a-worktree feature/auth
git worktree add ../agent-b-worktree feature/payments
git worktree add ../agent-c-worktree feature/email
```

Agent A works in agent-a-worktree/. Agent B works in agent-b-worktree/. They never touch each other's directories. No index locks. No file conflicts during the build phase.
The worktree for each agent is pointed at its own branch. When Agent A is done with feature/auth, that branch merges back through the merge queue (Layer 4). The worktree is then cleaned up or reused for the next task.
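The full lifecycle is only a handful of Git commands. A self-contained sketch (the throwaway repo, agent directories, and branch names are invented for the demo; in a real swarm run you already have a repository):

```shell
# Demo repo so the snippet runs standalone.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git -c user.email=swarm@local -c user.name=swarm commit -q --allow-empty -m "init"

# Setup: one worktree per agent, each on its own new branch.
git worktree add -b feature/auth ../agent-a-worktree
git worktree add -b feature/payments ../agent-b-worktree

# ... agents build in their own directories ...

# Teardown after each branch lands through the merge queue.
git worktree remove ../agent-a-worktree
git worktree remove ../agent-b-worktree
git worktree prune
```

Removing a worktree keeps its branch; only the checkout directory goes away, so the merge queue can still land the branch later.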
What each agent gets:
| Resource | Shared | Per-agent |
|---|---|---|
| Git object store | Yes | No |
| Working directory | No | Yes |
| Index | No | Yes |
| HEAD pointer | No | Yes |
| Branch | No | Yes |
This is the layer that makes true parallelism possible. Agents cannot accidentally overwrite each other's work because they are never writing to the same location.
Layer 3: contracts first
Field name drift is invisible until merge time. Agent A builds an API that returns { userId: "abc" }. Agent B builds a frontend that reads data.user_id. Both work in isolation. At merge time, the frontend reads undefined and the team spends two hours tracing why.
The fix is shared type contracts injected into every agent prompt before coding starts.
A contract is a TypeScript interface (or JSON schema, or plain type definition) that all agents agree to use. You write the contracts before any agent starts:
```typescript
// contracts/user.ts
export interface User {
  userId: string;
  email: string;
  createdAt: string;
}

export interface ApiResponse<T> {
  data: T;
  error: string | null;
}
```

Every agent gets these contracts in its system prompt. The orchestrator injects the full contracts file at the top of each agent's context. Agents are instructed to use the defined types and not invent new field names.
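The injection itself is mechanical: read the contracts file and prepend it to each agent's prompt. A minimal Python sketch (the function name and prompt wording are illustrative; your orchestrator's API will differ):

```python
from pathlib import Path

def build_system_prompt(task_description: str, contracts_path: str) -> str:
    """Prepend the shared type contracts to an agent's system prompt."""
    contracts = Path(contracts_path).read_text()
    return (
        "You MUST use exactly the types and field names defined below. "
        "Do not invent alternative names (no user_id, no uid).\n\n"
        "```typescript\n" + contracts + "\n```\n\n"
        "Task: " + task_description + "\n"
    )
```

Because every agent sees the same contracts text at the top of its context, the field names it emits converge on the contract's spelling.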
The result is measurable. Without contracts, a six-agent swarm building a SaaS backend produced three variants of the user ID field across six branches. Three of the six branches failed to merge cleanly. The integration quality score (measured by counting type errors across the merged codebase) was 28.
With contracts injected at the start, a four-agent swarm building the same feature used userId everywhere. Zero branches failed at merge. Quality score reached 68.
What changes with contracts:
| Check | Without | With contracts |
|---|---|---|
| User field name | userId / user_id / uid | userId everywhere |
| Branch merge failures | 3 of 6 fail | 0 of 4 fail |
| Quality score | 28 | 68 |
| Merge time | Unpredictable | FIFO, tiered |
The contracts file does not have to be large. Five to ten type definitions covering the shared data models are enough for most features. Add to it as the codebase grows.
Layer 4: the merge queue
Parallel branches are useful until they need to land. Without a queue, the team hits git merge on two branches at the same time, gets conflicts on both, and loses track of which resolution to keep.
A FIFO merge queue serializes landings and handles conflicts in tiers.
Agents push their completed branches to the queue. The queue processes one branch at a time, oldest first. For each branch, it tries four resolution steps:
```text
Tier 1: git merge --no-ff (clean merge, no conflicts)
   ↓ fails
Tier 2: deterministic auto-resolve (whitespace, import order, lock files)
   ↓ fails
Tier 3: LLM resolver per conflicted file (Claude reads both versions, picks one)
   ↓ fails
Tier 4: human review (branch parked, notification sent)
```

Most merges land at Tier 1 or Tier 2. Tier 3 handles the cases where two agents both modified the same function with different changes. Tier 4 is rare and reserved for conflicts where neither automatic approach is safe.
The key constraint: the LLM resolver in Tier 3 is bounded. It resolves one file at a time. It must produce valid code or it rejects the merge entirely. Prose output is not accepted. A merge that cannot be resolved automatically reaches Tier 4 and parks there until a human reviews it.
This design keeps the queue predictable. Branches land in order. Every landing is logged with the tier it required. Over time, a pattern of Tier 3 conflicts in the same files tells you where the contracts are incomplete.
The cost math
Swarms cost more than sequential runs. That is true and worth knowing before you run one.
A single Claude Code agent completing a task uses a baseline token count. Add a second agent and you roughly double the tokens (two agents, two context windows). Add parallel coordination overhead and the multiplier rises further.
For complex multi-module tasks:
| Scenario | Tokens | Cost | Quality gain |
|---|---|---|---|
| Sequential (1 agent) | 1x | $9 | Baseline |
| Swarm (20 agents, 6 hrs) | 6.7x | $60 | +28% quality |

The swarm costs $51 more and finishes roughly 2 hours sooner.
The 3.4x token multiplier for complex tasks produces 28% better output quality, measured by type error count and test pass rate on the merged codebase. For simple tasks the multiplier is higher (3.9x), but so is the quality gain (+32%).
The rule is straightforward:
Use a swarm when you have three or more independent modules that can be built in parallel. Auth, payments, and email are a good example. They share types but do not share implementation files. Three agents building in parallel with proper contracts and worktrees finish faster and produce cleaner code than one agent doing all three sequentially.
Do not use a swarm when the work fits in one context window. A single agent with full context is cheaper, simpler to debug, and produces equivalent output for tasks that are inherently sequential.
How to build your own version
You do not need a complex stack to run this. The minimum shape works on one machine.
What you need:
- A SQLite file (or Postgres if you want multiple machines)
- The git worktree command (built into Git, no install needed)
- A contracts file with your shared types
- A merge script that implements the four tiers
Start with the task graph. Create a SQLite table with the columns above. Write a small script that lets agents claim tasks atomically. Test it with two agents racing to claim the same task. Only one should succeed.
```sql
CREATE TABLE tasks (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  title TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending' CHECK (status IN ('pending','claimed','done','failed')),
  agent_id TEXT,
  claimed_at DATETIME,
  depends_on TEXT -- JSON array of task IDs
);
```

Add worktrees next. Write a setup script that creates one worktree per agent before the swarm starts. The script should also clean up worktrees after agents finish. Stale worktrees accumulate fast in long swarm runs.
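The two-agents-racing test is worth automating. A sketch using two threads with separate connections to the same SQLite file (the harness and agent names are invented; the UPDATE is the same check-and-claim statement from Layer 1):

```python
import os
import sqlite3
import tempfile
import threading

# Hypothetical harness: two "agents" race to claim task 1.
db = os.path.join(tempfile.mkdtemp(), "tasks.db")
conn = sqlite3.connect(db)
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, "
    "status TEXT DEFAULT 'pending', agent_id TEXT, claimed_at TEXT)"
)
conn.execute("INSERT INTO tasks (id, title) VALUES (1, 'build auth')")
conn.commit()
conn.close()

winners = []

def race(agent_id: str) -> None:
    # timeout=5 makes the loser wait for the lock instead of erroring out.
    c = sqlite3.connect(db, timeout=5)
    cur = c.execute(
        "UPDATE tasks SET status='claimed', agent_id=? "
        "WHERE id=1 AND status='pending'",
        (agent_id,),
    )
    c.commit()
    if cur.rowcount == 1:
        winners.append(agent_id)
    c.close()

threads = [threading.Thread(target=race, args=(a,)) for a in ("agent-a", "agent-b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(winners))  # 1 — exactly one claim succeeds
```

Whichever thread commits first flips the status; the other thread's UPDATE then matches zero rows. If both claims ever succeed, the claim path is not atomic and the graph is broken.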
Write your contracts file before creating any agent prompt. Put it in a shared location that every agent can access. Make it a non-negotiable part of the agent's system prompt.
Build the merge queue last. Start with Tier 1 and Tier 4 only. A clean merge lands immediately. A conflict parks for human review. Add Tier 2 and Tier 3 once you have a sense of what kinds of conflicts come up most in your codebase.
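A Tier 1 + Tier 4 queue fits in a few lines. A Python sketch shelling out to Git (function names are invented; a real queue would add logging and notifications):

```python
import subprocess

def land(branch: str) -> str:
    """Tier 1: attempt a clean --no-ff merge. Anything else parks (Tier 4)."""
    result = subprocess.run(
        ["git", "merge", "--no-ff", "--no-edit", branch],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        return "tier1"
    # Abort so the tree is clean and the next branch in the queue can still land.
    subprocess.run(["git", "merge", "--abort"], capture_output=True)
    return "tier4"  # parked for human review

def drain(branches: list[str]) -> dict[str, str]:
    """FIFO: oldest branch first, strictly one merge at a time."""
    return {branch: land(branch) for branch in branches}
```

The important property is that a failed merge leaves the repository clean: the queue keeps draining and only the conflicted branch waits for a human.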
One rule per layer:
- Task graph: always use transactions. Never a file.
- Process shell: one worktree per agent. Never a shared working directory.
- Contracts: inject at the top of every agent prompt. Non-negotiable.
- Merge queue: never merge two branches simultaneously. Always serialize.
Where else this pattern applies
The four-layer architecture is not specific to feature builds.
Security audits benefit from the same shape. Multiple agents scan different parts of the codebase in parallel, each in its own worktree, each writing findings to a shared task graph. The merge queue combines their reports without duplication.
Content pipelines use it too. Multiple agents draft different sections of a document in parallel. Contracts define the shared outline structure. The merge queue combines sections in the right order.
Performance profiling runs several agents in parallel across different subsystems. Contracts define the shared benchmark format so all reports are comparable. The queue serializes which recommendations land.
The specific tools change. SQLite becomes Postgres. Worktrees become Docker containers. TypeScript contracts become JSON schemas. The four layers stay the same.
Task graph stops double-claiming. Process shell stops file conflicts. Contracts stop drift. Merge queue stops bad code reaching main. That is the whole model.