Claude Code 1M Context in Practice: When Bigger Isn't Better

Claude Code's 1M-token context window, available on Claude Opus 4.8 and Sonnet 4.6, is now GA at flat standard pricing with no long-context surcharge, but it is a capacity ceiling, not a free lunch. Research-verified "context rot" means accuracy degrades as input grows, often well before you hit the limit, so the 1M window earns its keep on large unfamiliar codebases and cross-file reasoning while it wastes both tokens and quality on iterative edits and well-scoped tasks that /compact, subagents, and dynamic workflows handle better.

This post covers what GA actually changed, why bigger context degrades quality, a decision framework for when to load it all versus keep it small, the token-cost math, and the gotchas that bite once you stop equating "fits in the window" with "the model uses it well."

What GA Actually Changed
Why Bigger Context Degrades Quality (Context Rot)
The Decision Framework: Load It All vs Keep It Small
The Token-Cost Math
The Three Tools That Keep Context Small
Enabling and Verifying 1M in Claude Code
Gotchas Developers Hit
Frequently Asked Questions

What GA Actually Changed

The 1M context window went GA on March 13, 2026 for Opus 4.6 and Sonnet 4.6, and it removed the long-context premium entirely. A 900,000-token request is now billed at the same per-token rate as a 9,000-token one (awesomeagents.ai). During the beta the 1M window cost 2x on input and 1.5x on output; GA deleted that surcharge (pasqualepillitteri.it).

So the headline number per model is simple:

Claude Opus 4.8: $5 / 1M input tokens, $25 / 1M output tokens, flat across the entire 1M window.
Claude Sonnet 4.6: $3 / 1M input tokens, $15 / 1M output tokens, flat across the entire 1M window.

The 1M window covers Opus 4.8, Opus 4.7, Opus 4.6, and Sonnet 4.6 on the Claude API, Amazon Bedrock, and Vertex AI. Other models, including Sonnet 4.5, stay at a 200K-token window (Anthropic docs). One caveat worth flagging early: Microsoft Foundry caps Opus 4.8 at 200K, so the 1M window is API, Bedrock, and Vertex only. Do not assume 1M everywhere.

Here is the part most "bigger is better" takes miss. Anthropic's own documentation says the quiet part out loud: "As token count grows, accuracy and recall degrade, a phenomenon known as context rot," which makes curation as important as capacity (Anthropic docs). Flat pricing changed the bill. It did not change the physics.

Why Bigger Context Degrades Quality (Context Rot)

Two pieces of research explain why a bigger window does not mean a smarter model.

Lost in the middle. The "Lost in the Middle" paper (Liu, Lin, Liang et al., arXiv:2307.03172, TACL 2023) found that LLM performance is highest when the relevant information sits at the beginning or end of the context, and degrades significantly when it lands in the middle. This was replicated across GPT-3.5, GPT-4, Claude 1.3, and others (arXiv:2307.03172). Practical takeaway: put the load-bearing content near the edges of your prompt, not buried in the center of a 500K-token dump.

Context rot. Chroma's 2025 "Context Rot" study tested 18 frontier models and found every one degrades as input length grows, with the steepest degradation in the 100K–500K token range. A 200K-window model showed significant degradation at just 50K tokens (Chroma research). All 18 models showed monotonically decreasing F1 as input length grew.

The key insight: context rot is not window overflow. It happens well before the token limit, and degradation accelerates when question-answer semantic similarity is low and when distractors are present in the context (Chroma). The one bit of good news for Claude users: Claude models showed the lowest hallucination rates among those tested.

So "it fits in the window" and "the model uses it well" are two different claims. The first is about capacity. The second is about signal-to-noise. The 1M window raises the ceiling; context rot lowers the quality floor as you approach it.

The Decision Framework: Load It All vs Keep It Small

Here is the honest call on when the 1M window wins and when keeping context small wins. Each row is a real scenario with the reasoning.

Scenario	Load it all (1M)	Keep it small (/compact, subagents, fork)	Why
Large unfamiliar codebase, need cross-file reasoning	WINS	—	The model needs to see relationships it cannot predict; pre-curation is impossible
One-shot Q&A over a big doc set	WINS	—	Single pass, no iteration, so pay once with no re-summarizing
Iterative edits on a few known files	LOSES	WINS	Context rot degrades precision on stale tokens; re-reading 500K each turn is pure cost
Well-scoped task with clear inputs/outputs	LOSES	WINS	A scoped subagent or fork avoids dragging irrelevant context
Long agentic loop accumulating tool results	LOSES (rots)	WINS (context editing + compaction)	Old tool results are noise; clearing or summarizing keeps signal high
Sub-task needing a cheaper model	— (model switch kills cache)	WINS (fork to subagent)	Switching models mid-session invalidates the whole prompt cache

Read it as a heuristic, not a law. Load it all when you genuinely cannot pre-curate: you do not know which files matter, the reasoning crosses many of them, and it is a single pass. Keep it small when you can scope the work: you know the files, you are iterating, or the loop is piling up tool output you will never reference again.

The trap is iterative work. A long edit session feels like it benefits from "the model remembering everything," but every turn re-sends a growing context that is mostly stale, paying full token cost on input that context rot is actively degrading. That is the worst of both bills.

The Token-Cost Math

Flat pricing is not free pricing. A 1M-token Opus 4.8 request still costs about $5 of input every time you send it. Some anchors at the standard rates:

Opus 4.8 at $5 / 1M input: 500K input ≈ $2.50, 1M input ≈ $5.00.
Sonnet 4.6 at $3 / 1M input: 500K input ≈ $1.50, 1M input ≈ $3.00.

Now the decision that actually moves your bill: pay to keep a big context loaded, or re-summarize each turn? Here is the math on Opus 4.8 ($5/M input) for a 10-turn session over 500K tokens of context.

Option A, re-send 500K raw context every turn, no cache. 500K × $5/M = $2.50 per turn. Ten turns = $25.00.

Option B, prompt caching. Cache reads cost about 0.1x base input price; cache writes cost about 1.25x (5-min TTL) or 2x (1-hour TTL) ([claude-api skill: prompt-caching]). First turn writes at 1.25x (~~$3.13), each later turn reads at 0.1x (~~$0.25). Ten turns ≈ $5.38 total.

Option C, /compact to a ~50K summary. Then 50K × $5/M = $0.25 per turn. Ten turns = $2.50, but you lose detail.

The break-even rules:

Caching beats re-summarizing when the context is reused unchanged at least twice. Caching break-even is 2 requests at the 5-min TTL and 3+ at the 1-hour TTL (the 1h write costs 2x instead of 1.25x).
Compaction wins when older detail is genuinely disposable AND quality (fighting context rot) matters more than completeness.

For a fuller treatment of trimming the bill, see how to cut Claude Code token costs and the Claude Code pricing breakdown. If you are sorting out the post-cutover billing model specifically, what changed for Claude Code costs after June 15 walks through it.

Strategy	Cost over 10 turns (Opus 4.8, 500K ctx)	Keeps full detail?	Best when
Re-send raw every turn (no cache)	~$25.00	Yes	Never; strictly dominated
Prompt caching	~$5.38	Yes	Context reused unchanged 2+ times
`/compact` to 50K summary	~$2.50	No	Older detail is disposable, quality matters
Subagent fork (scoped 50K)	varies	N/A (isolated)	Sub-task is well-bounded or wants a cheaper model

The Three Tools That Keep Context Small

When the framework says "keep it small," these are the three mechanisms, and they are not interchangeable.

1. Compaction (/compact). Summarizes earlier conversation near the limit. Server-side compaction is Anthropic's recommended context-management strategy: it auto-summarizes earlier conversation when approaching the limit, with a default trigger around 150K tokens, available in beta (header compact-2026-01-12) for Opus 4.8/4.7/4.6 and Sonnet 4.6 ([claude-api skill: compaction]). One implementation gotcha: you must append the full response.content (the compaction block) back on each turn, not just the text, or you silently lose the compaction state.

2. Context editing. Prunes stale tool results and thinking blocks within a session. It removes content rather than summarizing it ([claude-api skill: agent-design]). This is the right tool for a long agentic loop where old tool outputs are pure noise.

3. Subagents. Fork a separate, isolated context (and can use a cheaper model) for a sub-task. The reason this matters for cost: switching models or changing the tool set mid-session invalidates the entire prompt cache, because the cache is a prefix match, so any byte change anywhere in the prefix invalidates everything after it ([claude-api skill: agent-design]). So instead of swapping the main loop's model for a cheap sub-task, you spawn a subagent and keep the main loop on one model and tool set.

These compose. Context editing prunes within a session, compaction summarizes near the limit, and memory persists across sessions, so long-running agents often use all three. When the orchestration itself is the bottleneck, the next step up is moving coordination out of the context window entirely with Claude Code dynamic workflows, which keep intermediate agent results in script variables rather than the main context.

Two more capacity details worth knowing. Sonnet 4.6 and Haiku 4.5 have "context awareness": they receive a token_budget tag (1M or 200K) and post-tool-call system warnings of remaining capacity, so they self-moderate on long tasks. And extended/adaptive thinking blocks from previous turns are automatically stripped from the context-window calculation by the API. They are not carried forward as input tokens, which preserves capacity (Anthropic docs).

Enabling and Verifying 1M in Claude Code

In Claude Code, no beta header is needed for 1M anymore. The access path depends on your plan (Claude Code Camp):

Max, Team, and Enterprise: Opus auto-upgrades to 1M.
Pro: opt in via /extra-usage.
Force it on any plan: use a [1m] model suffix via /model.

To verify it is active, run /context. It shows usage as Xk/1000k when the 1M window is active, or Xk/200k when it is not.

# 1M window active
/context
  Context: 142k/1000k

# 200K window
/context
  Context: 142k/200k

One operational rule for Claude Code: auto-compaction triggers around 95% capacity, and by then precision on older tokens is already lost (wmedia.es). Running /compact proactively beats waiting for the auto-trigger. For the full GA mechanics, see the 1M context GA writeup.

A few hard limits on the 1M-window models: max output is 128K tokens (you need streaming at that size to avoid SDK timeouts), and you can attach up to 600 images or PDF pages per request, versus 100 on the 200K models (Anthropic docs).

Gotchas Developers Hit

Flat pricing is not free. A 1M-token Opus 4.8 request still costs about $5 of input every time you send it. The surcharge is gone. The per-token bill is not.

Context rot hits before the limit. Chroma found a 200K-window model degraded at 50K tokens, with the steepest drop in the 100K–500K range. Do not equate "fits in the window" with "the model uses it well."

Lost in the middle is real and Claude-relevant. Information buried mid-context is recalled worse than information at the start or end (Liu et al. 2023, replicated across model families). Put load-bearing content near the edges.

Auto-compaction fires too late. Claude Code's auto-compaction kicks in around 95% capacity, by which point precision on older tokens has already degraded. Run /compact proactively.

Model and tool-set switches kill the cache. Switching models or changing the tool set mid-session invalidates the entire prompt cache (prefix match). Use a subagent for the cheaper-model sub-task instead of swapping the main loop's model.

Compaction state must be passed back whole. Compaction requires appending the full response.content block back on each turn, not just the text. Extracting only the text silently loses the compaction state.

Caching break-even depends on TTL. The 5-min TTL pays off after 2 requests; the 1-hour TTL needs 3+ because the write costs 2x instead of 1.25x.

1M is not everywhere. Microsoft Foundry caps Opus 4.8 at 200K. The 1M window is API, Bedrock, and Vertex only.

Frequently Asked Questions

Does the 1M context window in Claude cost extra?

No. Since the March 2026 GA, the 1M window on Opus 4.8/4.7/4.6 and Sonnet 4.6 is billed at flat standard rates ($5/$25 per million for Opus, $3/$15 for Sonnet) with no long-context premium. The beta's 2x-input and 1.5x-output surcharge was removed. You still pay per token, so a 1M-token request genuinely costs about $5 of input on Opus 4.8.

If context is flat-priced now, should I just load everything into Claude Code?

No. Pricing is flat but quality is not. Context rot, confirmed across 18 models in Chroma's 2025 study and in the "Lost in the Middle" research, means accuracy degrades as input grows, often well before the limit, and a 200K model degraded at 50K. Loading 1M tokens of mostly-irrelevant context both burns money and lowers answer quality. Curate what is in context.

When does the 1M context window actually help?

On large unfamiliar codebases and cross-file reasoning where the model cannot know in advance which files matter (situations where you genuinely cannot pre-curate), and on one-shot Q&A over a big document set. It loses on iterative edits, well-scoped tasks, and long tool-calling loops, where /compact, subagents, and context editing keep the signal-to-noise ratio high and cost low.

How do I enable the 1M window in Claude Code?

No beta header is needed anymore. On Max, Team, and Enterprise, Opus auto-upgrades to 1M. On Pro you opt in via /extra-usage, or force it with a [1m] model suffix via /model. Run /context to verify; it shows Xk/1000k when 1M is active, Xk/200k when not.

Is it cheaper to keep a big context loaded or to re-summarize each turn?

It depends on reuse and how disposable the detail is. Prompt caching makes reused context about 0.1x to read, beating re-sending after just two turns. Compaction (summarizing to a smaller working set) is cheaper still per turn but loses detail. Pay for full context when you keep needing the detail; compact when older context is genuinely disposable and you want to fight context rot.

What's the difference between /compact, subagents, and context editing?

/compact (compaction) summarizes earlier conversation server-side near the limit. Context editing prunes stale tool results and thinking blocks within a session. Subagents fork a separate, isolated context (and can use a cheaper model) for a sub-task. Long-running agents typically combine all three. Switching models mid-main-loop is avoided because it invalidates the prompt cache.

Wrapping Up

The 1M context window is genuinely useful, and GA made it cheap to reach. But flat pricing changed the bill, not the physics. Context rot is real, measured across 18 models, and it hits well before the limit. The skill now is not "fit more in the window." It is deciding, per task, whether the model needs to see everything at once or whether /compact, context editing, a scoped subagent, or a dynamic workflow keeps the signal high and the cost low.

Start with the decision table. If you cannot pre-curate and it is a single pass, load it all. If you are iterating on known files or piling up tool results, keep it small. And run /context often enough that you notice when you are paying to drag stale tokens through a session that no longer needs them.

Posted by @speedy_devv