Build This Now
speedy_devvkoen_salo

How to Cut Your Claude Code Token Bill in Half

Reduce Claude Code token cost with five fixes: cap thinking tokens, default to Sonnet, defer MCP loading, trim CLAUDE.md, and filter tool output.

Stop configuring. Start building.

SaaS builder templates with AI orchestration.

Published Jun 22, 2026 · 7 min read

To reduce Claude Code token cost, stack five changes: cap extended thinking at 8,000 to 10,000 tokens, switch your default model from Opus to Sonnet 4.6, turn on MCP deferred loading, move long workflow guides out of CLAUDE.md into on-demand skills, and add a hook that filters noisy tool output before it reaches Claude. Combined, these changes are reported to cut Claude Code bills by 40 to 85 percent, without touching a single line of product code.



Why this matters to you

On API billing, Claude Code charges per token, both for what you send (input) and for what it writes back (output); on a Claude subscription, the same tokens count against your plan's usage limits instead. A token is roughly three-quarters of an English word. Anthropic's own docs put average Claude Code spend at about $13 per developer per active day, with most teams landing at $150 to $250 per developer per month and 90 percent of users staying under $30 per active day (Anthropic Claude Code cost docs). When a bill "jumps," it is almost never because the work got harder. It is because tokens are leaking somewhere you cannot see. Below are the leaks and the exact fixes.

The stealth spike: the Opus 4.7+ tokenizer change

Here is the part most cost guides miss. Starting with Opus 4.7, Anthropic changed how the model counts tokens. The same codebase, the same prompt, the same task can now register up to 35 percent more tokens than it did on an older model version (reported, based on community token-count comparisons). Nothing about your work changed. You upgraded the model, and the meter started running faster.

If your bill spiked right after a model update and you could not explain why, this is the likely reason. The fix is not to avoid new models. It is to control the other drains below so the higher per-token count lands on far fewer tokens.
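The break-even math is short. If the same work now registers (1 + i) times as many tokens, you need to cut a fraction r of your token volume where (1 - r)(1 + i) = 1, so r = 1 - 1/(1 + i). A minimal Python sketch, using the reported 35 percent worst case and a hypothetical one million tokens of daily work; substitute your own measured inflation ratio:

```python
# Break-even arithmetic for a tokenizer that counts more tokens for the same text.
# The 35% inflation is the reported worst case; the daily token volume is hypothetical.


def tokens_after_upgrade(old_tokens: int, inflation: float) -> float:
    """Token count for the same work after the tokenizer change."""
    return old_tokens * (1 + inflation)


def breakeven_reduction(inflation: float) -> float:
    """Fraction of tokens to cut so the inflated count matches the old one.

    Solves (1 - r) * (1 + inflation) = 1 for r.
    """
    return 1 - 1 / (1 + inflation)


if __name__ == "__main__":
    inflation = 0.35   # reported upper bound, not a guaranteed figure
    old = 1_000_000    # hypothetical tokens for one day of work
    print(f"Same work, new tokenizer: {tokens_after_upgrade(old, inflation):,.0f} tokens")
    print(f"Cut needed to break even: {breakeven_reduction(inflation):.1%}")
```

At 35 percent inflation the break-even cut is about 26 percent, smaller than the savings the fixes below are reported to deliver.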

The three silent drains

  1. Uncapped extended thinking. Extended thinking is Claude's private scratch work before it answers. It is billed as output tokens, and by default it can run to tens of thousands of tokens per request. You pay for every one.
  2. Full MCP tool schemas on every request. An MCP server (Model Context Protocol, a way to plug external tools into Claude) injects the full description of every tool into context before you type anything. Undeferred, that is 7,000 to 55,000 tokens of overhead per request.
  3. A bloated CLAUDE.md. CLAUDE.md is the project memory file Claude reads on every message. The bigger it is, the more context every single turn carries.

The five fixes, ranked by impact

1. Cap MAX_THINKING_TOKENS

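This is the single highest-impact setting. MAX_THINKING_TOKENS is an environment variable, so set it in the env block of your settings.json (~/.claude/settings.json for every project, or .claude/settings.json inside one repo):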

{
  "env": {
    "MAX_THINKING_TOKENS": "8000"
  }
}

This caps the scratch work at 8,000 tokens per request instead of leaving it unbounded. Routine development (reading files, writing functions, fixing a bug) rarely needs deep multi-step reasoning, and the cap is estimated to cut spend 30 to 40 percent (reported). Quality on everyday tasks does not meaningfully drop. Raise the cap only for the rare session where the model genuinely needs to reason hard.
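To see what the cap is worth in dollars, price thinking tokens as output tokens. A rough sketch: the per-million output prices are the Sonnet 4.6 and Opus 4.8 rates quoted in the next section, while the request count and the 32,000-token uncapped figure are hypothetical. Treat the results as ceilings, since a request can use less thinking than either number.

```python
# Upper-bound cost of extended thinking, billed as output tokens.
# Prices are dollars per million output tokens, as quoted in this article.
OUTPUT_PRICE_PER_MTOK = {"sonnet-4.6": 15.00, "opus-4.8": 25.00}


def thinking_cost(thinking_tokens: int, model: str) -> float:
    """Dollar cost of a given number of thinking (output) tokens."""
    return thinking_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK[model]


if __name__ == "__main__":
    requests_per_day = 50   # hypothetical
    uncapped = 32_000       # hypothetical worst-case thinking per request
    capped = 8_000          # the MAX_THINKING_TOKENS value from settings.json
    for model in OUTPUT_PRICE_PER_MTOK:
        before = thinking_cost(uncapped, model) * requests_per_day
        after = thinking_cost(capped, model) * requests_per_day
        print(f"{model}: ${before:,.2f}/day worst case -> ${after:,.2f}/day with the cap")
```

Because the cap is a ceiling rather than a fixed charge, real savings depend on how often requests would have run past 8,000 thinking tokens.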

2. Default to Sonnet 4.6, opt in to Opus

Sonnet 4.6 costs $3 per million input tokens and $15 per million output tokens. Opus 4.8 costs $5 and $25. That makes Sonnet about 40 percent cheaper on both sides. Sonnet handles roughly 80 percent of coding work at comparable quality. Make Sonnet your default and treat Opus as a deliberate choice for hard architecture or gnarly debugging, not the everyday driver. You can switch models with the /model command mid-session.
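A quick per-session comparison at those list prices. The session size (2 million input tokens, 150,000 output tokens) is hypothetical, and the sketch ignores prompt-caching discounts, which lower the input side for both models.

```python
# Per-session cost at the list prices quoted above (dollars per million tokens).
PRICES = {
    "sonnet-4.6": {"input": 3.00, "output": 15.00},
    "opus-4.8": {"input": 5.00, "output": 25.00},
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one session, ignoring caching and batch discounts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


if __name__ == "__main__":
    inp, out = 2_000_000, 150_000  # hypothetical session
    sonnet = session_cost("sonnet-4.6", inp, out)
    opus = session_cost("opus-4.8", inp, out)
    print(f"Sonnet: ${sonnet:.2f}  Opus: ${opus:.2f}  savings: {1 - sonnet / opus:.0%}")
```

Because both of Sonnet's rates are three-fifths of Opus's, the savings come out at 40 percent whatever the input-to-output mix.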

3. Enable MCP deferred loading

Deferred loading means only the tool names enter context up front. The full schema for a tool loads only when Claude actually calls that tool. Anthropic now ships this on by default in recent versions, and teams with large MCP catalogs can cut tool-schema overhead by 58 to 92 percent (reported). If you run several MCP servers, confirm deferred loading is active and prune servers you never use.
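To make the mechanism concrete, here is a conceptual sketch in Python. It is not how Claude Code or MCP implement deferred loading; the ToolRegistry class, the 40 sample tools, and the four-characters-per-token estimate are all hypothetical. It only shows why sending names up front and full schemas on first use shrinks per-request overhead.

```python
import json


def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


class ToolRegistry:
    """Hypothetical model of eager vs deferred tool-schema loading."""

    def __init__(self, schemas: dict[str, dict]):
        self._schemas = schemas
        self._loaded: set[str] = set()

    def eager_context(self) -> str:
        """Undeferred: every full schema is sent on every request."""
        return json.dumps(self._schemas)

    def deferred_context(self) -> str:
        """Deferred: tool names, plus full schemas only for tools already called."""
        loaded = {name: self._schemas[name] for name in sorted(self._loaded)}
        return json.dumps({"available": sorted(self._schemas), "loaded": loaded})

    def call(self, name: str) -> None:
        if name not in self._schemas:
            raise KeyError(f"unknown tool: {name}")
        self._loaded.add(name)  # this schema enters context from now on


if __name__ == "__main__":
    big = {"description": "x" * 2000, "input_schema": {"type": "object"}}
    registry = ToolRegistry({f"tool_{i}": big for i in range(40)})
    print("eager:", estimate_tokens(registry.eager_context()))
    print("deferred, nothing called:", estimate_tokens(registry.deferred_context()))
    registry.call("tool_3")
    print("deferred, one tool called:", estimate_tokens(registry.deferred_context()))
```

The gap between the eager and deferred numbers grows with every tool you connect but never call.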

4. Trim CLAUDE.md and move guides into skills

Keep CLAUDE.md under about 200 lines. Prompt caching makes repeat reads of the file cheaper after the first turn, but the file still occupies context, and still costs something, on every message. Long workflow guides ("how we run migrations," "our PR checklist") do not belong there. Move them into on-demand skills that load only when invoked. Your project memory stays lean and every message gets cheaper.
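A small audit script can show whether CLAUDE.md has outgrown the guideline and which sections to move into skills first. A minimal sketch in Python: the 200-line budget comes from this section, the four-characters-per-token estimate is a rough heuristic (Anthropic's token-counting API gives exact figures), and section detection assumes the file uses markdown "## " headings.

```python
import sys
from pathlib import Path

LINE_BUDGET = 200  # the ~200-line guideline from this section


def section_sizes(text: str) -> list[tuple[str, int]]:
    """Return (heading, line_count) for each '## ' section, largest first."""
    sizes: dict[str, int] = {}
    current = "(preamble)"
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
        sizes[current] = sizes.get(current, 0) + 1
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)


def audit(text: str) -> dict:
    """Summarize size, a rough token estimate, and the biggest sections."""
    lines = text.splitlines()
    return {
        "lines": len(lines),
        "approx_tokens": len(text) // 4,  # heuristic: ~4 characters per token
        "over_budget": len(lines) > LINE_BUDGET,
        "largest_sections": section_sizes(text)[:3],
    }


def main() -> None:
    path = Path(sys.argv[1] if len(sys.argv) > 1 else "CLAUDE.md")
    if not path.is_file():
        print(f"{path} not found")
        return
    report = audit(path.read_text(encoding="utf-8"))
    print(f"{path}: {report['lines']} lines, ~{report['approx_tokens']} tokens")
    if report["over_budget"]:
        print(f"Over the {LINE_BUDGET}-line guideline. Largest sections to move into skills:")
        for heading, count in report["largest_sections"]:
            print(f"  {heading}: {count} lines")


if __name__ == "__main__":
    main()
```

Run it from the repo root, optionally passing a path; the largest sections it lists are the first candidates to turn into skills.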

5. Add a PreToolUse hook to filter noisy output

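When a test runner or build command dumps 2,000 lines of output, Claude reads all of it and you pay for all of it. A PreToolUse hook is a small script that Claude Code runs before it executes a tool call; it sees the tool's input (not its output) and can approve, block, or modify that input. Use it to rewrite noisy commands so their output passes through a filter that strips passing-test noise and keeps only failures and summaries. This trims input tokens on every verbose command.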
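A minimal sketch of such a hook in Python, under several assumptions: your Claude Code version lets PreToolUse hooks return updatedInput to rewrite a tool call, the Bash tool runs commands in a shell that supports set -o pipefail (such as bash), and the runner list, the awk filter, and the 80-line tail are illustrative choices rather than recommended values. It reads the hook payload from stdin and, if the Bash command looks like a test run, wraps it so per-test pass lines are dropped and only the last 80 lines (where most runners print failures and the summary) reach Claude. The hooks reference documents how updatedInput combines with permissionDecision.

```python
#!/usr/bin/env python3
"""Hypothetical PreToolUse hook that trims test-runner output before Claude reads it.

Registered in settings.json with the standard hooks format, for example:
  {"hooks": {"PreToolUse": [{"matcher": "Bash",
      "hooks": [{"type": "command", "command": "python3 ~/.claude/hooks/trim_tests.py"}]}]}}
The script path, runner list, and filter below are illustrative assumptions.
"""
import json
import re
import sys
from typing import Optional

# Commands treated as noisy test runners (adjust for your stack).
TEST_RUNNER = re.compile(
    r"^\s*(pytest|python3? -m pytest|npm test|npm run test|pnpm test|yarn test|go test|cargo test)\b"
)
# Drop per-test pass lines, then keep the tail, where most runners print
# failure details and the final summary.
OUTPUT_FILTER = "awk '!/PASSED|✓/' | tail -n 80"


def rewrite(command: str) -> Optional[str]:
    """Return a filtered version of a test command, or None to leave it alone."""
    if not TEST_RUNNER.match(command) or "|" in command:
        return None  # not a test run, or the command is already piped somewhere
    # pipefail makes the pipeline report the runner's exit status, not tail's.
    return f"set -o pipefail; ({command}) 2>&1 | {OUTPUT_FILTER}"


def main() -> None:
    if sys.stdin is None or sys.stdin.isatty():
        return  # started by hand with no hook payload piped in
    raw = sys.stdin.read()
    if not raw.strip():
        return
    event = json.loads(raw)  # Claude Code sends the hook input as JSON on stdin
    if event.get("tool_name") != "Bash":
        return
    tool_input = event.get("tool_input", {})
    new_command = rewrite(tool_input.get("command", ""))
    if new_command is None:
        return  # exit 0 with no output: the tool call proceeds unchanged
    print(json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            # Assumes updatedInput is supported by your Claude Code version.
            "updatedInput": {**tool_input, "command": new_command},
        }
    }))


if __name__ == "__main__":
    main()
```

The pipefail prefix matters: without it, the pipeline's exit status would come from tail rather than the test runner, and Claude could read a failing run as a success. Commands that already contain a pipe are left alone.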

Two more levers for heavy users

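  • Batch API for non-real-time work. Jobs that do not need an instant answer (bulk refactors, doc generation, overnight test sweeps) can be sent directly to Anthropic's Message Batches API instead of run in an interactive Claude Code session. Batches are processed asynchronously, within 24 hours, at a flat 50 percent discount on input and output tokens.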
  • Cost hygiene for agent teams. Running parallel Claude Code teammates uses about 7 times more tokens than a single session, because each teammate carries its own full context window. Use Sonnet for teammates, keep teams small, and shut each teammate down the moment its sub-task is done.
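Both levers reduce to simple multipliers. A back-of-envelope sketch: the 50 percent batch discount and the roughly 7x team multiplier come from the bullets above, while the job size and the Sonnet 4.6 prices plugged in at the bottom are a hypothetical example.

```python
def batch_cost(standard_cost: float, discount: float = 0.50) -> float:
    """Cost of the same tokens sent through the Batch API instead of real time."""
    return standard_cost * (1 - discount)


def team_cost(single_session_cost: float, multiplier: float = 7.0) -> float:
    """Approximate cost of an agent team relative to one equivalent session."""
    return single_session_cost * multiplier


if __name__ == "__main__":
    # Hypothetical overnight doc-generation job at Sonnet 4.6 prices ($3 in / $15 out per 1M).
    job = (4_000_000 * 3.00 + 500_000 * 15.00) / 1_000_000
    print(f"Real time: ${job:.2f}  Batch API: ${batch_cost(job):.2f}")
    print(f"Same job as an agent team: ~${team_cost(job):.2f}")
```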

Claude Code cost levers at a glance

| Technique | Estimated savings | Effort | Config location | Confirmed or reported |
|---|---|---|---|---|
| Cap MAX_THINKING_TOKENS | 30 to 40% | Low | settings.json | Reported |
| Use Sonnet instead of Opus | ~40% per token | Low | /model or settings.json | Confirmed (pricing) |
| Enable MCP deferred loading | 58 to 92% of tool overhead | Low | Default in recent versions | Reported |
| Trim CLAUDE.md + use skills | Varies | Medium | CLAUDE.md + skills | Reported |
| Add PreToolUse hook | Varies | Medium | settings.json hooks | Confirmed (mechanism) |
| Batch API for non-real-time | 50% flat | Medium | API request type | Confirmed |
| Shut down agent teammates early | Avoids ~7x multiplier | Low | Workflow habit | Reported |

A note on setup

If wiring hooks, skills, and a lean CLAUDE.md from scratch sounds like a project of its own, that is exactly what the $29 Code Kit packages: a build system for Claude Code with the hooks, skills, and workflows already configured, plus a production SaaS skeleton (auth, Stripe payments, PostgreSQL with row-level security on every table). It runs on your own Claude subscription and deploys anywhere.

FAQ

How much does Claude Code cost per month?

Anthropic's docs report an average of $150 to $250 per developer per month across enterprise deployments, with an average of $13 per active day and 90 percent of users under $30 per active day. Your actual spend scales with how much context you use and which model you pick.

How do I reduce Claude Code token usage?

The four highest-impact changes are: cap MAX_THINKING_TOKENS at 8,000 to 10,000 in settings.json, switch your default model from Opus to Sonnet 4.6, enable MCP deferred tool loading so tool schemas only enter context when a tool is actually called, and add a PreToolUse hook that rewrites verbose test commands so their output is filtered before it reaches Claude.

Is Claude Sonnet good enough for Claude Code instead of Opus?

Yes for most tasks. Sonnet 4.6 handles the roughly 80 percent of coding work that does not need deep multi-step reasoning (reading files, writing functions, running tests, making edits) at about 40 percent lower cost than Opus 4.8. Reserve Opus for complex architecture decisions or hard debugging where extended thinking genuinely improves the result.

What is MAX_THINKING_TOKENS in Claude Code?

MAX_THINKING_TOKENS is an environment variable, typically set in the env block of settings.json, that caps how many output tokens Claude can spend on extended thinking (its internal scratch work) before it answers. The default is effectively uncapped, running to tens of thousands per request. Setting it to 8,000 to 10,000 is estimated to cut spend 30 to 40 percent on routine work with no meaningful drop in code quality.
