How to Cut Your Claude Code Token Bill in Half

To reduce Claude Code token cost, stack five changes: cap extended thinking at 8,000 to 10,000 tokens, switch your default model from Opus to Sonnet 4.6, turn on MCP deferred loading, move long workflow guides out of CLAUDE.md into on-demand skills, and add a hook that filters noisy tool output before it reaches Claude. Most teams cut their Claude Code bill 40 to 85 percent this way without touching a single line of product code.

Hören Sie auf zu konfigurieren. Fangen Sie an zu bauen.

SaaS-Builder-Vorlagen mit KI-Orchestrierung.

Why this matters to you

Claude Code bills you per token, both for what you send (input) and what it writes back (output). A token is roughly three-quarters of an English word. Anthropic's own docs put average Claude Code spend at about $13 per developer per active day, with most teams landing at $150 to $250 per developer per month, and 90 percent of users staying under $30 per active day (Anthropic Claude Code cost docs). When people say their bill "jumped," it is almost never because the work got harder. It is because tokens are leaking somewhere they cannot see. Below are the leaks and the exact fixes.

The stealth spike: the Opus 4.7+ tokenizer change

Here is the part most cost guides miss. Starting with Opus 4.7, Anthropic changed how the model counts tokens. The same codebase, the same prompt, the same task can now register up to 35 percent more tokens than it did on an older model version (reported, based on community token-count comparisons). Nothing about your work changed. You upgraded the model, and the meter started running faster.

If your bill spiked right after a model update and you could not explain why, this is the likely reason. The fix is not to avoid new models. It is to control the other drains below so the higher per-token count lands on far fewer tokens.

The three silent drains

Uncapped extended thinking. Extended thinking is Claude's private scratch work before it answers. It is billed as output tokens, and by default it can run to tens of thousands of tokens per request. You pay for every one.
Full MCP tool schemas on every request. An MCP server (Model Context Protocol, a way to plug external tools into Claude) injects the full description of every tool into context before you type anything. Undeferred, that is 7,000 to 55,000 tokens of overhead per request.
A bloated CLAUDE.md. CLAUDE.md is the project memory file Claude reads on every message. The bigger it is, the more context every single turn carries.

The five fixes, ranked by impact

1. Cap MAX_THINKING_TOKENS

This is the single highest-impact setting. Add this to your settings.json:

{
  "env": {
    "MAX_THINKING_TOKENS": "8000"
  }
}

This caps the scratch work at 8,000 tokens instead of letting it run wild. On routine development (reading files, writing functions, fixing a bug) deep multi-step reasoning is not needed, and this is estimated to cut spend 30 to 40 percent (reported). Quality on everyday tasks does not meaningfully drop. Raise the cap only for the rare session where the model genuinely needs to reason hard.

2. Default to Sonnet 4.6, opt in to Opus

Sonnet 4.6 costs $3 per million input tokens and $15 per million output tokens. Opus 4.8 costs $5 and $25. That makes Sonnet about 40 percent cheaper on both sides. Sonnet handles roughly 80 percent of coding work at comparable quality. Make Sonnet your default and treat Opus as a deliberate choice for hard architecture or gnarly debugging, not the everyday driver. You can switch models with the /model command mid-session.

3. Enable MCP deferred loading

Deferred loading means only the tool names enter context up front. The full schema for a tool loads only when Claude actually calls that tool. Anthropic now ships this on by default in recent versions, and teams with large MCP catalogs can cut tool-schema overhead by 58 to 92 percent (reported). If you run several MCP servers, confirm deferred loading is active and prune servers you never use.

4. Trim CLAUDE.md and move guides into skills

Keep CLAUDE.md under about 200 lines. Prompt caching means the first turn is the only full-rate hit, but the file still occupies context on every message after that. Long workflow guides ("how we run migrations," "our PR checklist") do not belong there. Move them into on-demand skills that load only when invoked. Your project memory stays lean and every message gets cheaper.

5. Add a PreToolUse hook to filter noisy output

When a test runner or build command dumps 2,000 lines of output, Claude reads all of it and you pay for all of it. A PreToolUse hook is a small script that runs before a tool's result reaches Claude. Use it to strip passing-test noise and keep only failures and summaries. This trims input tokens on every verbose command.

Two more levers for heavy users

Batch API for non-real-time work. Jobs that do not need an instant answer (bulk refactors, doc generation, overnight test sweeps) can route through the Batch API for a flat 50 percent discount on all tokens.
Cost hygiene for agent teams. Running parallel Claude Code teammates uses about 7 times more tokens than a single session, because each teammate carries its own full context window. Use Sonnet for teammates, keep teams small, and shut each teammate down the moment its sub-task is done.

Claude Code cost levers at a glance

Technique	Estimated savings	Effort	Config location	Confirmed or reported
Cap MAX_THINKING_TOKENS	30 to 40%	Low	settings.json	Reported
Use Sonnet instead of Opus	~40% per token	Low	/model or settings.json	Confirmed (pricing)
Enable MCP deferred loading	58 to 92% of tool overhead	Low	Default in recent versions	Reported
Trim CLAUDE.md + use skills	Varies	Medium	CLAUDE.md + skills	Reported
Add PreToolUse hook	Varies	Medium	settings.json hooks	Confirmed (mechanism)
Batch API for non-real-time	50% flat	Medium	API request type	Confirmed
Shut down agent teammates early	Avoids ~7x multiplier	Low	Workflow habit	Reported

A note on setup

If wiring hooks, skills, and a lean CLAUDE.md from scratch sounds like a project of its own, that is exactly what the $29 Code Kit packages: a build system for Claude Code with the hooks, skills, and workflows already configured, plus a production SaaS skeleton (auth, Stripe payments, PostgreSQL with row-level security on every table). It runs on your own Claude subscription and deploys anywhere.

FAQ

How much does Claude Code cost per month?

Anthropic's docs report an average of $150 to $250 per developer per month across enterprise deployments, with an average of $13 per active day and 90 percent of users under $30 per active day. Your actual spend scales with how much context you use and which model you pick.

How do I reduce Claude Code token usage?

The four highest-impact changes are: cap MAX_THINKING_TOKENS at 8,000 to 10,000 in settings.json, switch your default model from Opus to Sonnet 4.6, enable MCP deferred tool loading so tool schemas only enter context when a tool is actually called, and add a PreToolUse hook to filter verbose test-runner output before it reaches Claude.

Is Claude Sonnet good enough for Claude Code instead of Opus?

Yes for most tasks. Sonnet 4.6 handles the roughly 80 percent of coding work that does not need deep multi-step reasoning (reading files, writing functions, running tests, making edits) at about 40 percent lower cost than Opus 4.8. Reserve Opus for complex architecture decisions or hard debugging where extended thinking genuinely improves the result.

What is MAX_THINKING_TOKENS in Claude Code?

MAX_THINKING_TOKENS is a settings.json parameter that caps how many output tokens Claude can spend on extended thinking (its internal scratch work) before it answers. The default is effectively uncapped, running to tens of thousands per request. Setting it to 8,000 to 10,000 is estimated to cut spend 30 to 40 percent on routine work with no meaningful drop in code quality.

How to Cut Your Claude Code Token Bill in Half

On this page