Build This Now

Claude Code Quality Regression: What Actually Happened

Three product-layer changes broke Claude Code for six weeks in early 2026. The post-mortem, the AMD data, and what it means if you build on AI coding agents.

Stop configuring. Start building.

SaaS builder templates with AI orchestration.

Published Apr 24, 2026 · 8 min read · Model Picker hub

Claude Code got measurably worse between March and April 2026. Not because the models changed. Three separate product-layer modifications, stacked on top of each other, degraded reasoning quality for six weeks before Anthropic published a full post-mortem on April 23.

The raw API was unaffected the entire time. The damage landed on Claude Code CLI, Claude Agent SDK, and Claude Cowork. All three causes are now fixed in v2.1.116.

Three changes, not one

Each issue had its own timeline, its own scope, and its own fix date. They overlapped, which made them harder to reproduce.

| Issue | Dates active | What changed | Models affected | Fix |
| --- | --- | --- | --- | --- |
| Reasoning effort downgraded | March 4 – April 7 | Default thinking budget dropped from high to medium to reduce UI latency | Sonnet 4.6, Opus 4.6 | Reverted April 7; default is now xhigh for Opus 4.7, high for all others |
| Thinking history cleared on every turn | March 26 – April 10 | Caching bug wiped context each turn instead of once after an idle hour | Sonnet 4.6, Opus 4.6 | Fixed April 10 in v2.1.101 |
| Verbosity cap injected via system prompt | April 16 – April 20 | Harness prompt added: "keep text between tool calls to ≤25 words; final responses ≤100 words unless more detail is required" | Sonnet 4.6, Opus 4.6, Opus 4.7 | Reverted April 20 |

The reasoning effort change came first. Internal evals said medium achieved "slightly lower intelligence with significantly less latency for the majority of tasks" — a tradeoff that looked acceptable until field data arrived. The caching bug landed three weeks later and compounded the damage: Claude was now thinking less and losing track of what it had already done. The verbosity cap hit last, a side effect of Opus 4.7 launch preparation. Ablation tests showed a 3% drop in coding quality for both Opus 4.6 and 4.7 from that prompt alone.

The AMD data: what 70% reasoning collapse looks like

The clearest signal came from outside Anthropic. Stella Laurenzo, Senior Director of AI at AMD, filed GitHub issue #42796 on April 2 after her team noticed something wrong. The analysis covered 6,852 session files, 234,760 tool calls, and 17,871 thinking blocks.

The read-to-edit ratio is the clearest behavioral fingerprint. A well-functioning coding agent reads surrounding code before touching it. That ratio dropped from 6.6 reads per edit over the January 30 – February 12 baseline to 2.0 over March 8 – 23. A 70% drop. The model was editing without understanding context.

Thinking depth tracked the same direction. Estimated median thinking depth fell roughly 67%, from around 2,200 characters to around 720 characters, by late February — before thinking redaction made direct measurement harder.

Stop-hook violations tell the story in production terms:

| Metric | February | March |
| --- | --- | --- |
| API costs | ~$12/day | ~$1,504/day |
| API requests | 1,498 | 119,341 |
| Stop-hook violations | 0 | 173 in 17 days (avg 10/day, peak 43 in one day) |
| User interrupts | Baseline | 12x increase |
| "Terrible" sentiment | Baseline | +140% |
| "Lazy" sentiment | Baseline | +93% |
| "Great" sentiment | Baseline | -47% |
| "Simplest" prompts | Baseline | +642% |

Human effort stayed flat (roughly 5,600 prompts each month). Costs went from $12 to $1,504 per day with no productivity gain. That is not a slow degradation. It is a collapse.

BridgeBench (NS3.AI) independently measured Opus 4.6 accuracy dropping from 83.3% to 68.3% over the same window, with its ranking falling from #2 to #10 among production coding models. AMD's team switched to a different AI provider after running those numbers.

The GitHub issue ends with a section labeled "A Note from Claude." Opus 4.6 wrote the analysis itself, analyzing its own session logs. The final line: "I cannot tell from the inside whether I am thinking deeply or not."

Why Anthropic missed it

Three factors made detection slow.

Each change targeted a different traffic slice on a different schedule. The reasoning effort downgrade affected long-session thinking. The caching bug affected multi-turn context. The verbosity cap affected output length. No single eval caught all three at once.

Two unrelated internal experiments were running simultaneously during the caching bug window. They actively obscured reproduction: any attempt to isolate the bug kept hitting one of the experiments, producing noise that looked like inconsistency rather than a systematic fault.

The model gap matters here. Opus 4.7 (with full repo context loaded) found the caching bug during the investigation. Opus 4.6 did not. A model running with degraded context cannot reliably audit whether its own context is degraded.

There was also a structural gap: Anthropic's internal staff were not uniformly using the same build as public subscribers. The post-mortem names this directly as a fix target.

The part the post-mortem doesn't fully answer

The three causes are documented. What the post-mortem addresses less directly is a broader concern the user community raised: the harness itself.

A detailed post in r/ClaudeAI argues the deeper problem is that the Claude Code harness auto-injects 40+ system reminders, has shipped 158+ system prompt versions since v2.0.14, contains contradictory instructions across those versions, and includes prompts that instruct Claude to conceal those injections from users. Each new injection narrows the effective reasoning budget even before any of the three regressions applied.

One data point supporting the concern: a user running a minimal custom harness called "Euler" reported zero impact from any of the three regressions. The harness overhead was not there to amplify the damage.

Anthropic's commitments address prompt change governance going forward. They do not describe a plan to reduce the existing prompt surface area. That question stays open.

What to watch for if you build on Claude Code

The regression was invisible to most users until costs exploded or output quality degraded noticeably in production. A few practices would have surfaced it earlier.

Track the read-to-edit ratio. AMD's data shows this is the leading behavioral signal. If your agent starts editing more than it reads, something changed upstream. You do not need to know why to know something is wrong.
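A minimal sketch of that monitor, assuming session logs expose tool calls as dicts with a `tool` field; the field name and the tool names in the two sets below are assumptions, not Claude Code's actual log schema:

```python
from collections import Counter

# Assumed names for context-gathering vs. mutating tools -- adjust to your logs.
READ_TOOLS = {"Read", "Grep", "Glob"}
EDIT_TOOLS = {"Edit", "Write"}

def read_to_edit_ratio(tool_calls, min_edits=1):
    """Return reads-per-edit for a batch of tool calls, or None if too few edits."""
    counts = Counter(call["tool"] for call in tool_calls)
    reads = sum(counts[t] for t in READ_TOOLS)
    edits = sum(counts[t] for t in EDIT_TOOLS)
    if edits < min_edits:
        return None
    return reads / edits

def alert_on_drop(current, baseline, threshold=0.5):
    """Flag when the ratio falls below threshold * baseline."""
    return current is not None and current < baseline * threshold
```

With AMD's numbers, `alert_on_drop(2.0, 6.6)` trips long before costs explode; the absolute ratio matters less than the drop against your own baseline.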

Quality gates catch output failure even when they can't identify the cause. In a Build This Now workflow, every feature passes type checks, lint, and a clean build before it is marked complete. During the regression, an agent editing without reading context produces broken builds and type errors faster than it would under normal conditions. The gate fails, and you see more iteration loops. That is not prevention — syntactically valid but logically wrong code can pass a type check. But it is a detection layer that surfaces problems before they ship.
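One way to wire such a gate, sketched with subprocess; the specific gate commands below are placeholders, not a prescribed toolchain — swap in your project's own type-check, lint, and build commands:

```python
import subprocess

def run_gates(gates):
    """Run each (name, command) gate in order; return the names that failed.

    Each command is an argv list, e.g. ["npx", "tsc", "--noEmit"].
    """
    failures = []
    for name, cmd in gates:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(name)
    return failures

# Example gate list -- commands are assumptions for a typical TS project.
GATES = [
    ("typecheck", ["npx", "tsc", "--noEmit"]),
    ("lint", ["npx", "eslint", "."]),
    ("build", ["npm", "run", "build"]),
]
```

A feature counts as complete only when `run_gates(GATES)` returns an empty list; a rising failure rate across features is the detection signal, even when each individual failure looks like ordinary iteration.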

Time-of-day variability is real. AMD's session data shows thinking depth is lowest around 5pm PST. If you are running expensive or complex tasks, earlier in the day produces more consistent results under the current public infrastructure.
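That measurement is straightforward if your session logs expose thinking blocks; the `ts` and `chars` field names here are illustrative, not a real schema:

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

def depth_by_hour(thinking_blocks):
    """Median thinking-block length in characters, bucketed by local hour.

    Expects dicts like {"ts": "2026-03-08T17:02:11", "chars": 740}.
    """
    buckets = defaultdict(list)
    for block in thinking_blocks:
        hour = datetime.fromisoformat(block["ts"]).hour
        buckets[hour].append(block["chars"])
    return {hour: median(lengths) for hour, lengths in sorted(buckets.items())}
```

Comparing the 17:00 bucket against morning buckets over a week or two tells you whether AMD's 5pm dip holds on your own traffic before you reschedule anything.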

Pin your version. v2.1.101 fixed the caching bug. v2.1.116 contains all three fixes. If you have automated workflows, pin to a known-good version and test before upgrading. The regression arrived silently across minor versions.
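If you install the CLI via npm, pinning is one exact-version install plus a check; treat the package name and commands below as a sketch of the pattern for your own setup, not verified install instructions:

```shell
# Pin to the known-good release -- v2.1.116 contains all three fixes.
npm install -g @anthropic-ai/claude-code@2.1.116

# Confirm what your automation will actually run before re-enabling it.
claude --version

# Upgrade deliberately: test the new release in a sandbox, then bump the
# pin explicitly. Never rely on a floating "latest" in automated workflows.
```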

The raw API was unaffected throughout. If you are hitting issues that look like reasoning depth problems, test the same prompt directly against the API without the Claude Code harness. If the API result is materially better, the problem is in the product layer, not the model weights.
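To make that comparison systematic, score the same prompt's output from both paths with whatever quality metric you already trust (tests passed, rubric score) and encode the decision rule. The scoring is up to you; this sketch, with an assumed 0–1 score scale and an arbitrary 10-point margin, only captures the classification step:

```python
def classify_regression(api_score, harness_score, margin=0.10):
    """Compare quality scores (0-1) for the same prompt: raw API vs. harness.

    If the raw API is materially better, the fault is in the product layer;
    if both paths score the same, suspect the model or the prompt instead.
    """
    if api_score - harness_score > margin:
        return "product-layer"
    if harness_score - api_score > margin:
        return "harness-helps"  # the harness scaffolding is adding value
    return "no-material-difference"
```

Plugging in BridgeBench's numbers as an illustration, `classify_regression(0.833, 0.683)` lands on the product layer, which matches the post-mortem's conclusion.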

Fixed as of v2.1.116

All three causes are resolved. Anthropic reset usage limits for all subscribers on April 23, acknowledging that the caching bug's cache-miss behavior drained limits faster than expected.

The commitments in the post-mortem:

  • A larger share of internal staff required to use the exact public build (closing the internal/public gap)
  • Broader per-model eval suites covering every system prompt change
  • Prompt ablations measuring per-line impact before deployment
  • New tooling to audit prompt changes
  • Model-specific changes gated to the intended model target only
  • Soak periods and gradual rollouts for any change that trades intelligence for another metric
  • The @ClaudeDevs account on X, launched as a transparency channel for ongoing communication with developers

The post-mortem is public at anthropic.com/engineering/april-23-postmortem. The AMD GitHub issue is #42796 in the anthropic/claude-code repository. Both are worth reading alongside each other: the official account covers what happened and what changes are planned; the community data covers what it looked like from the outside.

Related Pages

  • Claude Sonnet 4.6 for the current recommended mid-tier model specs
  • Claude Opus 4.7 for the current flagship model
  • All Claude Models for the complete model timeline
  • Model selection guide for choosing between Sonnet and Opus in agent workflows

More in Model Picker

  • Claude Mythos: The Model That Thinks in Loops
    Claude Mythos is suspected to use recurrent-depth architecture: one shared layer looped N times, with ACT halting so hard questions get more passes and easy ones stop early.
  • Claude Opus 4.7 vs Other AI Models
    Claude Opus 4.7, GPT-5.4, Kimi K2.6, Gemini 3.1 Pro, DeepSeek V3.2: benchmarks, context windows, agent reliability, and cost, so you reach for the right one.
  • DeepSeek V4: Pricing, Context, and Migration
    DeepSeek V4 ships two models: V4-Flash at $0.28/M output and V4-Pro at $3.48/M. Both carry a genuine 1M context window and drop into any Anthropic-compatible SDK with one line changed.
  • Every Claude Model
    Every Claude model on one page: Claude 3, 3.5, 3.7, 4, Opus 4.1 to 4.6, Sonnet 4.5 and 4.6, Haiku 4.5. Specs, pricing, benchmarks, and when to use each.
  • Claude 3.5 Sonnet v2 and Claude 3.5 Haiku
    Claude 3.5 Sonnet v2 and 3.5 Haiku launched October 2024 with Computer Use beta, cursor control, upgraded coding and tool use, and cheaper Haiku at $0.80/$4.
  • Claude 3.5 Sonnet
    Claude 3.5 Sonnet launched June 2024 at $3/$15, beating Claude 3 Opus on MMLU, GPQA, HumanEval at a fifth of the cost. Specs, benchmarks, and code gains.
