Claude Code Prompt Caching: The Token Discount Most People Never Turn On
Claude Code prompt caching is automatic and bills cached tokens at ~10% of normal input. Here's how to stop leaking the 90% discount, with real cost math.
Arrête de tout configurer. Place à la construction.
Des templates SaaS avec orchestration IA.
Prompt caching in Claude Code is automatic. It caches the system prompt, tool definitions, and CLAUDE.md on every turn, and those cached input tokens bill at roughly 10% of the standard input rate, a discount of about 90% (Claude Code docs). So the title is a small lie, and an honest one: claude code prompt caching is not a setting you switch on. The discount is already running. The "token discount most people never turn on" is really the cache hits you keep accidentally throwing away. The one thing you can literally turn on is the 1-hour cache lifetime. This post covers both.
Arrête de tout configurer. Place à la construction.
Des templates SaaS avec orchestration IA.
The discount you are already getting
Here is the honest version. Caching is on by default in Claude Code. Every turn, it tries to reuse the stable front of your prompt instead of reprocessing it at full price. When that works, you pay about a tenth of the normal rate for those tokens.
So your two levers are not "enable caching." They are: stop breaking the cache, and decide whether to extend its lifetime from 5 minutes to 1 hour.
If you are shaky on what a token even is, start with what a token actually is, then come back. The rest of this assumes you know your bill is measured in tokens, and you want fewer of them billed at full price.
How the math works: read tokens vs write tokens
There are two events in a cache: writing it and reading it. They cost different amounts.
A cache write costs a one-time premium of 1.25x the base input price on the default 5-minute cache, or 2x on the 1-hour cache (Claude API prompt caching docs). A cache read costs about 0.1x the base input price. That read number is the whole point: roughly 90% off.
The write premium is why caching is not free on the first turn. You pay 1.25x to store the prefix. But the first read at 0.1x already pays it back. Two calls of the same prefix cost 1.25x plus 0.1x, which is 1.35x. Without caching, two calls cost 2x. You are ahead after a single reuse (Claude API prompt caching docs). The 1-hour cache breaks even after two reads instead of one, because the write costs 2x.
What cache read and cache write actually mean
Every API response carries three input counters, and reading them is how you know caching is doing its job:
cache_creation_input_tokensare the tokens written into the cache this turn, billed at 1.25x.cache_read_input_tokensare tokens served from an existing cache, billed at about 0.1x.input_tokensis the uncached remainder, your new message and fresh tool results, billed at full price (Claude Code docs).
A healthy session shows a high read number and a low creation number, turn after turn. If creation stays high, something keeps invalidating your prefix.
What Claude Code caches for you
Claude Code caches in three prefix layers, ordered from most stable to most volatile. The order matters, because caching is a prefix match: a single byte change anywhere in the prefix invalidates everything after it.
| Layer | What is in it | Invalidates when |
|---|---|---|
| System prompt | Core instructions, tool definitions, output style | The tool set changes, or Claude Code is upgraded |
| Project context | CLAUDE.md, auto memory, unscoped rules | Session start, /clear, or /compact |
| Conversation | Your messages, Claude's replies, tool results | Every turn (the cheap, expected miss) |
The conversation layer changing every turn is fine. That is the small new chunk you always pay full price for. The expensive mistake is knocking out the two layers above it.
There are also two hidden inputs to the cache key that are not text. The model is one: each model has its own cache. The effort level is the other: each effort level keeps a separate cache for the same model (Claude Code docs). Switch either and you start cold.
Because the request renders in the order tools, then system prompt, then messages, the most stable content sits at the front and the volatile content at the back. That is also why a tidy, stable CLAUDE.md pays off. If you rewrite it constantly, you keep nuking the project layer. More on that in structuring your CLAUDE.md.
The before and after cost
Numbers make this real. The figures below are illustrative, the math is reproducible, and every assumption is stated so you can swap in your own.
Assume Opus 4.8 with an input price around $5 per million tokens. That per-token dollar figure is a moving target, so treat the percentage as the durable takeaway, not the exact dollars. Assume the cacheable prefix (system prompt, tools, CLAUDE.md, and the conversation that does not change) averages 40,000 tokens across the session, a mid-session steady state. Each turn adds about 2,000 new tokens of fresh message and tool results. The session runs 30 turns. You keep working, so the 5-minute cache stays warm and every turn after the first is a read. Output tokens are ignored because they are identical either way; this isolates the input-side discount.
Without caching, every turn reprocesses the whole 42,000-token input at full price. That is about $0.21 per turn, and 30 turns comes to roughly $6.30.
With caching, turn 1 writes the 40,000-token prefix at 1.25x and adds 2,000 new tokens at full price, about $0.26. Turns 2 through 30 read the 40,000-token prefix at 0.1x and add 2,000 new tokens at full price, about $0.03 each. That is $0.26 plus 29 times $0.03, roughly $1.13.
| Without cache | With cache (default) | |
|---|---|---|
| Input cost, 30-turn session | $6.30 | $1.13 |
| Savings | n/a | ~$5.17 (about 82%) |
On a 40K-prefix, 30-turn Opus 4.8 session, caching cuts input cost from about $6.30 to about $1.13, roughly 82%, and you pay nothing extra to get it. Real savings scale with prefix size and session length. This is the caching deep-dive of the broader playbook to cut your Claude Code token costs; the bigger and longer-lived your stable prefix, the more this matters.
The invalidation tax: one careless
/modelswitch at turn 15 forces a full re-read of the 40,000-token prefix at full price, with no hit. That is about $0.20 extra plus a slow turn. Do it ten times a day and the leak adds up fast.
Where people leak the discount
This is the part the title is really about. The default cache is generous; most high bills come from quietly breaking it. The actions below all force a fresh cache, verified against the Claude Code docs.
Switching models mid-task
Each model keeps its own cache. Switch with /model, toggle opusplan between Opus and Sonnet, or fall back to Fable 5, and the entire history is re-read with zero hits even when the content is byte-identical. Pick a model and stay on it for the stretch of work.
Changing effort or turning on fast mode
Effort level is part of the cache key, so /effort starts a cold cache (Claude Code prompts you to confirm). Fast mode adds a header that is also part of the cache key, so the first time you turn it on in a conversation, the prefix re-writes.
MCP servers loading into the prefix
Connecting or disconnecting an MCP server whose tools load into the prefix invalidates the system layer, because the tool-definition set changed. The same goes for enabling or disabling a plugin that provides an MCP server. Deferred and tool-search tools do not load into the prefix, so they are cache-safe, and other plugin component types are safe too.
Compacting at the wrong moment
/compact rebuilds the conversation layer by design, so it throws away the cache. That is sometimes worth it for context room, but do not reach for it reflexively. /recap appends instead of rebuilding, and /rewind truncates back to a prefix that is already cached. Both keep the cache.
Editing CLAUDE.md mid-session does not even apply
This is the counterintuitive one. Editing CLAUDE.md mid-session does not invalidate the cache, which sounds good, except the edit also does not take effect. The version loaded at session start persists until the next /clear, /compact, or restart. So you are not paying a cache penalty, but you are also not getting your change. If you want the new CLAUDE.md live, restart or /clear, and accept the one-time re-write that comes with it.
A few things that feel risky but keep the cache: editing files in your repo, changing output style mid-session, changing permission mode (except the opusplan plan-mode switch, which is really a model swap), invoking skills and commands, and spawning a subagent (it gets a separate cache and leaves the parent prefix untouched).
The dial you actually control: the 1-hour cache
The default cache lives for 5 minutes and resets on every hit, so as long as you keep working it stays warm. Step away for a coffee and it can expire, costing you a re-write when you return. That is the case for the 1-hour option.
On a Claude subscription, Claude Code requests the 1-hour TTL automatically; it is included in the plan, and only drops back to 5 minutes if you spill into usage credits. On an API key, Bedrock, Vertex, Foundry, or AWS, you opt in with an environment variable.
# Claude Code: extend the cache lifetime on an API key.
# On a Claude subscription this is requested automatically.
export ENABLE_PROMPT_CACHING_1H=1 # 1-hour TTL (write premium becomes 2x)
# Force the 5-minute cache back on, e.g. while debugging cache behavior:
# export FORCE_PROMPT_CACHING_5M=1
# Turn caching off entirely (you almost never want this):
# export DISABLE_PROMPT_CACHING=1Is the 1-hour cache worth it? Only when you reread the prefix enough to clear its higher break-even. The 2x write pays back after two reads instead of one, so it wins for long sessions with gaps between turns and loses for short bursts. The split also tracks your auth method, which is one more reason to understand Max plan vs API billing before you tune this.
If you are calling the API yourself
Maybe you are not in the Claude Code terminal at all. You are writing your own agent against the raw API and want the same discount. The mechanics are the same, you just place the cache breakpoint by hand.
Two rules decide whether your prefix caches at all. First, it has to clear the per-model minimum, or it silently will not cache, with cache_creation_input_tokens coming back as 0 and no error.
| Model | Minimum cacheable prefix |
|---|---|
| Opus 4.8, Haiku 4.5 | 4096 tokens |
| Sonnet 4.6, Fable 5 | 2048 tokens |
| Sonnet 4.5 and earlier Sonnets | 1024 tokens |
So a 3,000-token system prompt caches on Sonnet 4.6 and Fable 5 but not on Opus 4.8 (Claude API prompt caching docs). Second, the prefix has to be byte-identical between calls. Anything that varies silently invalidates it: a Date.now() timestamp baked into the system prompt, unsorted JSON whose key order shifts, or a per-user ID where the stable content should be. Keep the volatile stuff after the breakpoint.
You can set up to 4 cache breakpoints per request, and each one walks back at most 20 content blocks to find a matching prior entry. The usual move is one breakpoint on the last stable block, with the varying user message sitting after it.
# Raw Claude API: cache the stable prefix so reads bill at ~0.1x.
# Render order is tools -> system -> messages, so put the breakpoint on
# the LAST stable block. Min cacheable prefix: 4096 tok on Opus 4.8.
resp = client.messages.create(
model="claude-opus-4-8",
max_tokens=4096,
system=[
{
"type": "text",
"text": LARGE_STABLE_SYSTEM_PROMPT, # no timestamps, no per-request IDs
"cache_control": {"type": "ephemeral"}, # 5-min TTL (default)
# "cache_control": {"type": "ephemeral", "ttl": "1h"}, # 1-hour, 2x write
}
],
messages=[
{"role": "user", "content": VARYING_QUESTION}, # volatile -> after the breakpoint
],
)
print(resp.usage.cache_creation_input_tokens) # written this turn (1.25x)
print(resp.usage.cache_read_input_tokens) # served from cache (0.1x)
print(resp.usage.input_tokens) # uncached remainder (1x)The first call prints a nonzero cache_creation_input_tokens and a zero read. Every call after it, within the TTL, flips that: a big read number, a creation number near zero. That flip is your proof the cache is live.
How to check it is working
You do not have to guess. In Claude Code, read cache_read_input_tokens against cache_creation_input_tokens, exposed through the usage object (a statusline script can surface it live). A high read-to-creation ratio means the cache is doing its job. If creation stays high turn after turn, walk back through the leak list above and find what you keep invalidating.
For the broader tactics around this, including how caching fits with the rest of your spend, see optimizing your usage. The single best habit is the dullest one: pick a model, keep CLAUDE.md stable, and stop reaching for /compact and /model out of reflex.
Frequently asked questions
Does Claude Code use prompt caching by default?
Yes. Claude Code automatically caches the system prompt, tool definitions, and CLAUDE.md every turn. You do not enable anything. Cached input tokens are billed at roughly 10% of the normal input rate.
How much does Claude Code prompt caching save?
Cache reads cost about 10% of the standard input rate, a discount of roughly 90%. The catch is a one-time write premium of 1.25x the base input price on the 5-minute cache, which the first read pays back.
What is the difference between cache read and cache write tokens?
Three usage fields tell the story: cache_creation_input_tokens are written this turn at 1.25x, cache_read_input_tokens are served from cache at about 0.1x, and input_tokens is the uncached remainder at full price.
What is the Claude prompt cache TTL?
The default lifetime is 5 minutes and it resets on every cache hit. A 1-hour option exists at a higher 2x write premium instead of 1.25x.
How do I enable the 1-hour cache in Claude Code?
On an API key, Bedrock, Vertex, Foundry, or AWS, set the environment variable ENABLE_PROMPT_CACHING_1H=1. On a Claude subscription, Claude Code requests the 1-hour TTL automatically.
Does editing CLAUDE.md break the cache?
No. A mid-session edit does not invalidate the cache, but it also does not apply. The version loaded at session start persists until the next /clear, /compact, or restart.
Why is my Claude Code usage so high even with caching?
Something keeps invalidating the prefix: model switches, effort changes, MCP servers loading into the prefix, /compact, fast mode, or a Claude Code upgrade. Watch for cache_creation_input_tokens staying high turn after turn.
What is the minimum prompt size to be cacheable?
It is model-dependent: 4096 tokens on Opus 4.8 and Haiku 4.5, 2048 on Sonnet 4.6 and Fable 5, and 1024 on Sonnet 4.5 and earlier Sonnets. Below the minimum, the prefix silently will not cache.
Arrête de tout configurer. Place à la construction.
Des templates SaaS avec orchestration IA.
Claude Code Max Plan vs API Cost: Break-Even Guide
Claude Code max plan vs api cost, decided by math. A Max 5x subscription beats an API key once you spend about $3.33/day; Max 20x breaks even at $6.67/day.
Run a Team of AI Agents in Parallel with Git Worktrees
A hands-on workflow for running multiple Claude Code agents at once using git worktrees: real commands, a directory layout, the merge strategy, and the practical 5-7 agent ceiling.