Headroom: Cut AI Agent Token Costs by Compressing Context

Problem: Most of what your agent reads is junk. Tool outputs, logs, RAG chunks, and file dumps fill the context window with redundant tokens, and you pay for every one. That bill is about to sting more after the June 15 Agent SDK billing change.

Quick Win: pip install "headroom-ai[all]" then headroom wrap claude. Headroom compresses everything your agent reads before it reaches the model, and the originals stay on your machine.

What is Headroom?

Headroom is an open-source context compression layer for AI agents. It sits between your agent and the LLM, compresses the bulky inputs (tool outputs, logs, files, RAG chunks, conversation history) before they hit the model, and keeps the originals on your machine so the model can pull full data back when it actually needs it.

The repo lives at github.com/chopratejas/headroom and is licensed Apache 2.0. It was built by Tejas Chopra, a senior engineer at Netflix, and open-sourced in January 2026 (Open Source For You, June 2, 2026). As of this writing the repo shows roughly 18,000 stars and over 1,100 forks, at version 0.22. The version number tells you the honest truth: this is young software moving fast, not a frozen 1.0.

The README's own one-line pitch is "Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers." Those numbers are the repo's claim, drawn from its own benchmarks, not an independent measurement. Treat them as a ceiling for ideal cases, not a guarantee for your workload.

It ships in several shapes so you can pick the least invasive one: a Python or TypeScript library, a drop-in local proxy, an agent wrapper for tools like Claude Code, and an MCP server. The repo describes its internals as a content router that detects the type of each input and routes it to a specialized compressor (JSON, code with AST awareness, prose via a trained model), plus a cache-aligner that tries to keep provider KV cache prefixes stable.

How does context compression save money?

Every token you send to the model is metered. Chopra estimates that "up to 90% of tokens sent to large language models can be redundant" (Open Source For You, June 2, 2026). A 100-line stack trace, a giant JSON tool response, a directory listing, a RAG chunk that repeats the same boilerplate ten times: the model rarely needs all of it verbatim, but you pay full price to ship it.

Headroom's idea is to shrink that payload before it crosses the wire, while leaving the full version retrievable. The README reports these benchmark results (its own numbers, not independently verified):

Workload	Before	After	Savings
Code search (100 results)	17,765	1,408	92%
SRE incident debugging	65,694	5,118	92%
GitHub issue triage	54,174	14,761	73%
Codebase exploration	78,502	41,254	47%

The spread matters. Highly redundant inputs (search results, structured logs) compress hard. Dense, unique content like a sprawling codebase compresses less. The 60-95% range is real but workload-dependent, so expect a number, not the number.

The piece that makes this safe rather than lossy is reversibility. The README puts it plainly: "Originals never deleted; LLM retrieves on demand." Headroom inserts markers where it compressed something, keeps the original on your local machine, and exposes a retrieval tool so the model can fetch the full version the second it needs it. Coverage of the project notes the tool runs as a local proxy that the LLM can call back through to retrieve the original context (The Register, May 31, 2026).

Why does this matter more after June 15?

Starting June 15, 2026, Anthropic stops letting Agent SDK and claude -p workloads draw from your interactive subscription pool. Programmatic usage now consumes a separate monthly credit (reported at roughly $20 for Pro, $100 for Max 5x, $200 for Max 20x) metered at standard API list rates, after which automated requests stop unless you enable overflow billing (Tech Times, June 2, 2026). Interactive chat and Claude Code in the terminal are unaffected, but anything you script, automate, or run in CI now bills per token at API rates.

Translation: token efficiency used to be a nice-to-have for scripted agents on a flat subscription. After June 15 it is a direct line item. Cutting input tokens by half on automated workloads roughly doubles how far that credit stretches.

How to install and wire it up

Install the Python package with all extras, or the TypeScript package if you live in Node. Headroom needs Python 3.10 or newer for the Python path.

# Python (all features)
pip install "headroom-ai[all]"

# TypeScript / Node
npm install headroom-ai

If you do not want every extra, the README exposes granular installs like [proxy], [mcp], [ml], [code], and [memory], so you can pull only what you use:

pip install "headroom-ai[proxy,mcp,code]"

The fastest path for an agent like Claude Code is the wrapper. This launches Claude Code with Headroom intercepting and compressing context, no code changes on your side:

headroom wrap claude

The README also lists wrappers for other coding agents, so the same pattern works beyond Claude Code:

headroom wrap cursor
headroom wrap aider

If you would rather not wrap the binary, run Headroom as a local proxy and point your tooling at it. This is the zero-code-change route for anything that talks to an LLM over HTTP:

headroom proxy --port 8787

You can confirm it is actually compressing before you trust it with real spend. The README ships a performance command for exactly this:

headroom perf

For agents that speak MCP, install Headroom as an MCP server. This is the cleanest way to give the model the retrieval tool so it can pull originals back on demand:

headroom mcp install

The README says this exposes three tools to the model: headroom_compress, headroom_retrieve, and headroom_stats. The headroom_retrieve tool is the reversibility lever, the model calls it when a compressed marker is not enough and it needs the full original.

If you embed it in code rather than wrapping a CLI, the library exposes a direct compression call. Pass your message array and get back a compressed prompt:

from headroom import compress

compressed = compress(messages)  # returns the compressed prompt

The README also shows SDK and framework integrations, including wrapping an Anthropic client and adapters for LangChain and Agno. One caveat: integration surfaces on a v0.x project change fast, so check the README for the exact current signature before you commit it to production code.

How much does it actually save?

Honest answer: somewhere in the 47-92% range on the README's own benchmarks, and your real number depends entirely on how redundant your inputs are. The repo's headline is "60-95% fewer tokens," but that is the repo's claim from its own evals, not an independent result, so do not budget against the top of the range.

A widely cited figure says Headroom has saved its users an estimated $700,000 and recovered around 200 billion tokens since January (Open Source For You, June 2, 2026). That estimate is attributed to the project and its coverage, not verified independently, so read it as a directional signal rather than a number you can audit.

Where it helps most:

Agents that fire lots of tools and ingest big structured outputs (search results, API responses, log dumps). These compress hard.
RAG pipelines stuffing repetitive chunks into context.
Long-running, automated sessions where the same files and history get re-sent across turns.

Where to temper expectations:

Dense, unique prose or novel code compresses far less (the codebase-exploration benchmark only hit 47%).
The compress/retrieve round-trip adds a moving part. If the model over-calls headroom_retrieve, you can claw back some of the savings, so measure with headroom perf on your own traffic.
It is v0.22. Expect rough edges, changing APIs, and the occasional surprise. Pin a version and test before you point production agents at it.

A predictable bill comes from controlling what you send, not just how you compress it. That is the same instinct behind a build system like Build This Now: an opinionated, type-safe stack where the agent workflow is scoped and the context it touches is bounded, so your token spend stays legible after the June 15 change instead of ballooning with every ad-hoc run. Headroom trims the payload; a tight build process trims the number of payloads.

Frequently asked questions

Is Headroom free?

Yes. It is open source under the Apache 2.0 license at github.com/chopratejas/headroom. You install it locally and run it on your own machine. You still pay your LLM provider for whatever tokens reach the model after compression.

Does compression hurt answer quality?

The README claims "same answers" because compression is reversible, the originals stay local and the model can call headroom_retrieve to pull full context when it needs it. That is the design intent, and the repo's benchmarks report no accuracy drop. It is still the project's own claim, so validate it on your workload with headroom perf before trusting it on anything that matters.

How does it work with Claude Code specifically?

The simplest path is headroom wrap claude, which launches Claude Code with Headroom intercepting and compressing context in front of the model. Alternatively, run headroom proxy --port 8787 or install the MCP server with headroom mcp install to expose the compress and retrieve tools. Check the current README for the exact wrapper flags, since a v0.x project moves quickly.

Will this help after the June 15 Agent SDK billing change?

It should help anywhere you run programmatic or scripted Claude usage, which is exactly what gets metered at API rates after June 15 (Tech Times, June 2, 2026). Fewer input tokens per call means your separate Agent SDK credit stretches further. Interactive terminal use is unaffected by both the billing change and the need for compression.

How mature is the project?

It is young. The repo is at version 0.22, open-sourced in January 2026, with roughly 18,000 stars at the time of writing. The fundamentals (local proxy, MCP retrieval, multiple compressors) are in place, but treat it as fast-moving early software: pin a version, read the changelog, and test before production.

Posted by @speedy_devv

Headroom: Cut AI Agent Token Costs by Compressing Context

On this page