The harness is every layer around your AI agent except the model itself. Learn the five control levers, the constraint paradox, and why harness design determines agent performance more than the model does.
Problem: You upgraded the model. Performance barely moved. You keep seeing the same failure modes, session after session, regardless of which model version you run.
Quick Win: Stop tuning the model. Tune the harness. One team cut from 15 tools down to a single bash call and watched accuracy jump from 80% to 100% while token use dropped 37%.
Understanding: Agent = Model + Harness. The model handles reasoning. The harness handles everything else: what tools the agent can reach, what context it receives, how errors surface, how results get verified. In practice, the harness is the binding constraint on real-world performance, not the model.
The harness is the software layer that surrounds an AI model and manages everything except the model's reasoning. It calls tools. It manages memory. It routes tasks. It runs verification checks. It decides what context arrives in the prompt window and in what form.
Martin Fowler's definition is precise: the harness is everything in an agent except the model itself. Two directions of control run through it: feedforward controls, which shape behavior before the agent acts (instructions, injected context, constraints), and feedback controls, which evaluate behavior after it acts (sensors, verification checks, correction loops).
Neither direction alone is enough. Feedback without feedforward produces an agent that corrects the same mistakes on repeat. Feedforward without feedback produces an agent that encodes rules but never learns when they failed.
The SWE-bench leaderboard makes the argument plainly. On coding benchmarks, the same model can score 42% with one scaffold and 78% with a better one. The model did not change. The harness did.
The Vercel case is the clearest example in circulation. Their team stripped an agent from 15+ tools down to a single bash tool. On their benchmark: accuracy went from 80% to 100%, tokens dropped 37%, and speed improved 3.5x. They did not touch the model. They removed harness complexity.
The pattern holds across teams. Harvey's legal AI team improved task completion from 40.8% to 87.7% by improving the system around the model. OpenAI's internal tooling team concluded: "Our most difficult challenges now center on designing environments, feedback loops, and control systems."
The model is the engine. The harness is the car. Upgrading the engine helps. Building a better car is usually the higher-leverage move.
Every harness is built from the same set of control points. Pulling the right ones in the right order is where the craft lives.
The tools available to an agent define its capability surface. Too many tools and the agent wastes context on disambiguation. Too few and it resorts to workarounds. The Vercel result is the canonical reference point: fewer, sharper tools outperform a broad general toolkit.
Design tools around outcomes, not capabilities. A tool called run_tests outperforms one called execute_command because the agent does not have to decide what to execute.
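The contrast can be sketched as two tool definitions. The schema shape below is a generic illustration, not any specific agent framework's API, and the tool names are hypothetical:

```python
# Illustrative tool definitions -- a generic sketch, not a real framework's schema.

# Capability-shaped: the agent must still decide WHAT to execute,
# spending context and reasoning on disambiguation.
execute_command = {
    "name": "execute_command",
    "description": "Run an arbitrary shell command.",
    "parameters": {"command": {"type": "string"}},
}

# Outcome-shaped: the decision is baked into the tool itself.
run_tests = {
    "name": "run_tests",
    "description": "Run the project's test suite and return pass/fail output.",
    "parameters": {},  # no choices left to the agent
}
```

The outcome-shaped tool trades generality for a smaller decision surface, which is the trade the Vercel result suggests is usually worth making.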
What the agent knows before it acts determines what it produces. Context injection means putting the right information in the prompt window at the right time: architecture docs, coding standards, recent errors, the current file tree.
The failure mode is over-injection: flooding the window with everything available. Relevant context at the right moment outperforms a large, undifferentiated dump.
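A minimal sketch of selective injection, assuming a deliberately naive keyword-overlap filter; real harnesses use embeddings or retrieval, but the principle is the same: inject what is relevant, within a budget, instead of everything available:

```python
def build_context(task: str, sources: dict[str, str], budget: int = 4000) -> str:
    """Select only sources whose names overlap the task, up to a token budget.

    A naive relevance filter for illustration -- the point is the shape of
    the decision (select, then cap), not the matching strategy.
    """
    task_words = set(task.lower().split())
    selected, used = [], 0
    for name, text in sources.items():
        if task_words & set(name.lower().split()):        # crude relevance test
            cost = len(text) // 4                          # rough token estimate
            if used + cost <= budget:
                selected.append(f"## {name}\n{text}")
                used += cost
    return "\n\n".join(selected)
```

Swapping the keyword test for a retrieval call changes the quality of the filter, not the structure of the lever.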
A single-session agent forgets everything when the window closes. A properly harnessed agent carries forward: decisions made, patterns observed, errors seen before. Memory can be short-term (conversation state), medium-term (session logs), or long-term (persistent files the agent reads on start).
Claude Code's CLAUDE.md file is a long-term memory lever. The agent reads it at session start. What you put there shapes every session that follows.
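A minimal CLAUDE.md sketch; the specific rules are illustrative placeholders, not recommendations for any particular project:

```markdown
# Project conventions (loaded at session start)

## Architecture
- All database access goes through the repository layer; never query directly.

## Rules that repeat
- Run the type checker after editing source files.
- Do not modify anything under the production config directory.
```

Short and specific beats long and exhaustive here, for the same reason the constraint paradox predicts: every line consumes context in every session.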
The harness runs checks the agent cannot skip. Type checkers, linters, build commands, test suites. When these run as automated sensors after every agent action, errors surface in the same session that created them. The cost of correction drops.
The principle: keep quality checks as far left as possible. A type error caught before commit costs seconds. The same error caught in review costs an hour.
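A sketch of the post-action sensor loop, assuming your project's checks can be expressed as shell commands; `mypy` and `ruff` here are stand-ins for whatever type checker and linter you actually run:

```python
import subprocess

def verify_after_edit(checks=(["mypy", "."], ["ruff", "check", "."])) -> list[str]:
    """Run fast computational checks after every agent edit; return failures.

    The default commands are illustrative -- substitute your project's
    type checker, linter, and test runner.
    """
    failures = []
    for cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"{' '.join(cmd)}:\n{result.stdout or result.stderr}")
    return failures
```

An empty return means the agent can proceed; a non-empty one gets fed back into the same session, which is what keeps the cost of correction at seconds rather than an hour.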
What the agent cannot do matters as much as what it can. Restricting write access to production configs, blocking certain tool calls, and limiting which files an agent can touch are all constraint levers. The goal is not to hobble the agent but to eliminate the class of mistakes that constraints make impossible.
Adding more rules to an agent does not reliably produce better behavior. This is the constraint paradox: a harness overloaded with instructions can degrade performance below what the agent achieves with minimal guidance.
The mechanism is straightforward. Every constraint takes up context. Contradictory rules create noise the agent has to navigate. A long instruction set that covers every edge case often performs worse than a short, clear one that covers the most important five.
The practical resolution is prioritization. Identify the failure modes that actually repeat. Write constraints for those. Leave everything else to the model's judgment. The harness should eliminate the failures that happen often, not attempt to enumerate every possible bad outcome.
Harness controls split into two types.
Computational controls are deterministic and fast. Type checkers, linters, test runners, build systems. They run in milliseconds to seconds. Results are reliable and cheap enough to run on every agent action.
Inferential controls use a model as the judge. Code review agents, semantic quality checks, "LLM-as-evaluator" patterns. They are slower, more expensive, and non-deterministic. They catch what computational controls miss: semantic duplication, misapplied patterns, instructions misunderstood.
The practical split: run computational controls on every change, automatically. Run inferential controls selectively, on higher-stakes changes or post-integration. The two layers are complementary, not competitive.
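The split can be sketched as a two-tier gate. The callables and the "high-stakes" path prefixes below are hypothetical stand-ins for a real linter/test runner, a real LLM evaluator, and whatever your team considers sensitive:

```python
def review_change(diff: str, files_touched: list[str], run_checks, ask_model) -> dict:
    """Two-tier review: cheap deterministic checks always, model judge selectively.

    `run_checks` and `ask_model` are injected callables -- placeholders for
    your actual computational sensors and LLM-as-evaluator.
    """
    report = {"computational": run_checks(diff), "inferential": None}
    # Illustrative stakes test: only spend model tokens on sensitive paths.
    high_stakes = any(f.startswith(("src/auth", "src/payments")) for f in files_touched)
    if high_stakes:
        report["inferential"] = ask_model(diff)
    return report
```

The computational tier runs on every change because it is nearly free; the inferential tier is gated because it is neither fast nor deterministic.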
A harness regulates different dimensions of agent output. Distinguishing between them helps because the right controls differ by category.
Maintainability harness: guides and sensors around code quality, style, and structure. This is the easiest category to build. Existing tooling (linters, type checkers, coverage tools) plugs in directly. Most teams start here.
Architecture fitness harness: guides and sensors that enforce structural constraints. Module boundaries, dependency rules, performance budgets. Fitness functions run as automated checks. Architectural drift gets caught rather than discovered months later in a review.
Behavior harness: guides and sensors around functional correctness. This is the hardest category. AI-generated test suites are not yet reliable enough to fully substitute for specified behavior. Most teams currently combine specification documents (feedforward) with AI-generated tests plus selective manual verification (feedback). The open problem is building enough confidence in agent-generated tests to reduce manual oversight.
Claude Code exposes several harness levers directly.
CLAUDE.md is your primary feedforward control. Architecture decisions, naming conventions, patterns to follow or avoid, explicit rules about what the agent should not do. This file loads on every session start. What you put here shapes the agent's default behavior without any prompting.
.claude/agents/ specialist definitions let you constrain scope per agent. A database agent that can only touch migration files cannot accidentally modify frontend components. Scope restriction is one of the cheapest constraints to add.
Hooks let you wire computational sensors into the agent loop. A post-edit hook that runs the type checker means every file change gets validated before the agent moves to the next task.
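As a sketch, a post-edit type-check hook in settings.json might look like the following. The schema shown matches recent Claude Code versions but may change, so verify against the current hooks documentation; the typecheck command is a placeholder for your project's own check:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npm run typecheck" }
        ]
      }
    ]
  }
}
```

The hook fires after every file-writing tool call, which is exactly the "computational sensor on every action" pattern described above.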
Permission rules in settings.json define which tools the agent can access. Starting with fewer tools and expanding is better than starting with everything and trying to restrict later.
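A narrow-by-default permissions sketch; the rule syntax follows Claude Code's documented allow/deny patterns, but the specific commands and paths here are illustrative, so check the current permissions documentation before copying:

```json
{
  "permissions": {
    "allow": [
      "Read",
      "Bash(npm run test:*)"
    ],
    "deny": [
      "Bash(rm:*)"
    ]
  }
}
```

Starting from a list this small and widening it when the agent hits a genuine wall is the cheap direction; clawing back permissions after an incident is the expensive one.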
A good harness does not aim to eliminate human input entirely. It directs human attention to where it matters most: the decisions that sensors cannot catch and that judgment, not rules, must resolve.
What is agent harness engineering?
Agent harness engineering is the practice of designing everything around an AI model except the model itself. This includes tool selection, context injection, memory architecture, verification loops, and constraint design. The harness determines what information the agent receives, what actions it can take, and how errors are detected and corrected before they reach human review.
Why does the harness matter more than the model?
On SWE-bench coding benchmarks, the same model scores 42% with one scaffold and 78% with a better one. Vercel removed 80% of their agent's tools and saw accuracy jump from 80% to 100% with 37% fewer tokens used. The model's reasoning is fixed. The harness determines how much of that reasoning capability actually reaches the task.
What is an AI agent scaffold?
A scaffold is the structural layer built around an AI model to enable complex, multi-step tasks. It provides the agent with tools to call, memory to read and write, a loop to re-run after errors, and feedback signals to self-correct. Scaffold and harness are used interchangeably in most contexts. The distinction, when made, is that a harness emphasizes control and governance while a scaffold emphasizes structure and enablement.
How do I build a good agent harness?
Start with the five levers in order: tool design first, then context injection, then memory, then verification loops, then constraints. For each lever, identify the failure modes you actually observe, not theoretical ones. Build the simplest control that prevents each failure. Run computational checks (type checker, linter, tests) automatically after every agent action. Add inferential controls only where computational ones cannot reach.
What is the constraint paradox in AI agents?
Adding more constraints does not reliably improve agent behavior. A long, exhaustive rule set often performs worse than a short, clear one because every constraint consumes context and contradictory rules create noise. The resolution is prioritization: identify the failure modes that repeat most often and write constraints only for those. Leave everything else to the model's judgment.
Why did Vercel cut from 15+ tools to one?
Vercel's team found that a broad tool set forced their agent to spend context on disambiguation, deciding which tool to call rather than how to solve the problem. Cutting to a single bash tool removed that overhead. Accuracy went from 80% to 100%, tokens dropped 37%, and speed improved 3.5x on their benchmark. The principle: fewer, sharper tools outperform a large general toolkit.