Build This Now

Claude Code Quality Regression: What Actually Happened

Three product-layer changes broke Claude Code for six weeks in early 2026. The post-mortem, the AMD data, and what it means if you build on AI coding agents.

Stop configuring. Start building.

SaaS builder templates with AI orchestration.

Published Apr 24, 2026 · 8 min read · Model Picker hub

Claude Code got measurably worse between March and April 2026. Not because the models changed. Three separate product-layer modifications, stacked on top of each other, degraded reasoning quality for six weeks before Anthropic published a full post-mortem on April 23.

The raw API was unaffected the entire time. The damage landed on Claude Code CLI, Claude Agent SDK, and Claude Cowork. All three causes are now fixed in v2.1.116.

Three changes, not one

Each issue had its own timeline, its own scope, and its own fix date. They overlapped, which made them harder to reproduce.

| Issue | Dates active | What changed | Models affected | Fix |
| --- | --- | --- | --- | --- |
| Reasoning effort downgraded | March 4 – April 7 | Default thinking budget dropped from high to medium to reduce UI latency | Sonnet 4.6, Opus 4.6 | Reverted April 7; default is now xhigh for Opus 4.7, high for all others |
| Thinking history cleared on every turn | March 26 – April 10 | Caching bug wiped context each turn instead of once after an idle hour | Sonnet 4.6, Opus 4.6 | Fixed April 10 in v2.1.101 |
| Verbosity cap injected via system prompt | April 16 – April 20 | Harness prompt added: "keep text between tool calls to ≤25 words; final responses ≤100 words unless more detail is required" | Sonnet 4.6, Opus 4.6, Opus 4.7 | Reverted April 20 |

The reasoning effort change came first. Internal evals said medium achieved "slightly lower intelligence with significantly less latency for the majority of tasks" — a tradeoff that looked acceptable until field data arrived. The caching bug landed three weeks later and compounded the damage: Claude was now thinking less and losing track of what it had already done. The verbosity cap hit last, a side effect of Opus 4.7 launch preparation. Ablation tests showed a 3% drop in coding quality for both Opus 4.6 and 4.7 from that prompt alone.

The AMD data: what 70% reasoning collapse looks like

The clearest signal came from outside Anthropic. Stella Laurenzo, Senior Director of AI at AMD, filed GitHub issue #42796 on April 2 after her team noticed something wrong. The analysis covered 6,852 session files, 234,760 tool calls, and 17,871 thinking blocks.

The read-to-edit ratio is the clearest behavioral fingerprint. A well-functioning coding agent reads surrounding code before touching it. That ratio dropped from 6.6 reads per edit over the January 30 – February 12 baseline to 2.0 over March 8 – 23. A 70% drop. The model was editing without understanding context.

Thinking depth tracked the same direction. Estimated median thinking depth fell roughly 67%, from around 2,200 characters to around 720 characters, by late February — before thinking redaction made direct measurement harder.

Stop-hook violations tell the story in production terms:

| Metric | February | March |
| --- | --- | --- |
| API costs | ~$12/day | ~$1,504/day |
| API requests | 1,498 | 119,341 |
| Stop-hook violations | 0 | 173 in 17 days (avg 10/day, peak 43 in one day) |
| User interrupts | Baseline | 12x increase |
| "Terrible" sentiment | Baseline | +140% |
| "Lazy" sentiment | Baseline | +93% |
| "Great" sentiment | Baseline | -47% |
| "Simplest" prompts | Baseline | +642% |

Human effort stayed flat (roughly 5,600 prompts each month). Costs went from $12 to $1,504 per day with no productivity gain. That is not a slow degradation. It is a collapse.

BridgeBench (NS3.AI) independently measured Opus 4.6 accuracy dropping from 83.3% to 68.3% over the same window, with its ranking falling from #2 to #10 among production coding models. AMD's team switched to a different AI provider after running those numbers.

The GitHub issue ends with a section labeled "A Note from Claude." Opus 4.6 wrote the analysis itself, analyzing its own session logs. The final line: "I cannot tell from the inside whether I am thinking deeply or not."

Why Anthropic missed it

Three factors made detection slow.

Each change targeted a different traffic slice on a different schedule. The reasoning effort downgrade affected long-session thinking. The caching bug affected multi-turn context. The verbosity cap affected output length. No single eval caught all three at once.

Two unrelated internal experiments were running simultaneously during the caching bug window. They actively obscured reproduction: any attempt to isolate the bug kept hitting one of the experiments, producing noise that looked like inconsistency rather than a systematic fault.

The model gap matters here. Opus 4.7 (with full repo context loaded) found the caching bug during the investigation. Opus 4.6 did not. A model running with degraded context cannot reliably audit whether its own context is degraded.

There was also a structural gap: Anthropic's internal staff were not uniformly using the same build as public subscribers. The post-mortem names this directly as a fix target.

The part the post-mortem doesn't fully answer

The three causes are documented. What the post-mortem addresses less directly is a broader concern the user community raised: the harness itself.

A detailed post in r/ClaudeAI argues the deeper problem is that the Claude Code harness auto-injects 40+ system reminders, has shipped 158+ system prompt versions since v2.0.14, contains contradictory instructions across those versions, and includes prompts that instruct Claude to conceal those injections from users. Each new injection narrows the effective reasoning budget even before any of the three regressions applied.

One data point supporting the concern: a user running a minimal custom harness called "Euler" reported zero impact from any of the three regressions. The harness overhead was not there to amplify the damage.

Anthropic's commitments address prompt change governance going forward. They do not describe a plan to reduce the existing prompt surface area. That question stays open.

What to watch for if you build on Claude Code

The regression was invisible to most users until costs exploded or output quality degraded noticeably in production. A few practices would have surfaced it earlier.

Track the read-to-edit ratio. AMD's data shows this is the leading behavioral signal. If your agent starts editing more than it reads, something changed upstream. You do not need to know why to know something is wrong.
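A minimal sketch of that monitor, assuming session logs expose tool calls as dicts with a `tool` field; the field name and the tool names in the two sets below are assumptions, not Claude Code's actual log schema:

```python
from collections import Counter

# Assumed names for context-gathering vs. mutating tools -- adjust to your logs.
READ_TOOLS = {"Read", "Grep", "Glob"}
EDIT_TOOLS = {"Edit", "Write"}

def read_to_edit_ratio(tool_calls, min_edits=1):
    """Return reads-per-edit for a batch of tool calls, or None if too few edits."""
    counts = Counter(call["tool"] for call in tool_calls)
    reads = sum(counts[t] for t in READ_TOOLS)
    edits = sum(counts[t] for t in EDIT_TOOLS)
    if edits < min_edits:
        return None
    return reads / edits

def alert_on_drop(current, baseline, threshold=0.5):
    """Flag when the ratio falls below threshold * baseline."""
    return current is not None and current < baseline * threshold
```

With AMD's numbers, `alert_on_drop(2.0, 6.6)` trips long before costs explode; the absolute ratio matters less than the drop against your own baseline.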

Quality gates catch output failure even when they can't identify the cause. In a Build This Now workflow, every feature passes type checks, lint, and a clean build before it is marked complete. During the regression, an agent editing without reading context produces broken builds and type errors faster than it would under normal conditions. The gate fails, and you see more iteration loops. That is not prevention — syntactically valid but logically wrong code can pass a type check. But it is a detection layer that surfaces problems before they ship.
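One way to wire such a gate, sketched with subprocess; the specific gate commands below are placeholders, not a prescribed toolchain — swap in your project's own type-check, lint, and build commands:

```python
import subprocess

def run_gates(gates):
    """Run each (name, command) gate in order; return the names that failed.

    Each command is an argv list, e.g. ["npx", "tsc", "--noEmit"].
    """
    failures = []
    for name, cmd in gates:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(name)
    return failures

# Example gate list -- commands are assumptions for a typical TS project.
GATES = [
    ("typecheck", ["npx", "tsc", "--noEmit"]),
    ("lint", ["npx", "eslint", "."]),
    ("build", ["npm", "run", "build"]),
]
```

A feature counts as complete only when `run_gates(GATES)` returns an empty list; a rising failure rate across features is the detection signal, even when each individual failure looks like ordinary iteration.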

Time-of-day variability is real. AMD's session data shows thinking depth is lowest around 5pm PST. If you are running expensive or complex tasks, earlier in the day produces more consistent results under the current public infrastructure.
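That measurement is straightforward if your session logs expose thinking blocks; the `ts` and `chars` field names here are illustrative, not a real schema:

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

def depth_by_hour(thinking_blocks):
    """Median thinking-block length in characters, bucketed by local hour.

    Expects dicts like {"ts": "2026-03-08T17:02:11", "chars": 740}.
    """
    buckets = defaultdict(list)
    for block in thinking_blocks:
        hour = datetime.fromisoformat(block["ts"]).hour
        buckets[hour].append(block["chars"])
    return {hour: median(lengths) for hour, lengths in sorted(buckets.items())}
```

Comparing the 17:00 bucket against morning buckets over a week or two tells you whether AMD's 5pm dip holds on your own traffic before you reschedule anything.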

Pin your version. v2.1.101 fixed the caching bug. v2.1.116 contains all three fixes. If you have automated workflows, pin to a known-good version and test before upgrading. The regression arrived silently across minor versions.
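If you install the CLI via npm, pinning is one exact-version install plus a check; treat the package name and commands below as a sketch of the pattern for your own setup, not verified install instructions:

```shell
# Pin to the known-good release -- v2.1.116 contains all three fixes.
npm install -g @anthropic-ai/claude-code@2.1.116

# Confirm what your automation will actually run before re-enabling it.
claude --version

# Upgrade deliberately: test the new release in a sandbox, then bump the
# pin explicitly. Never rely on a floating "latest" in automated workflows.
```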

The raw API was unaffected throughout. If you are hitting issues that look like reasoning depth problems, test the same prompt directly against the API without the Claude Code harness. If the API result is materially better, the problem is in the product layer, not the model weights.
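To make that comparison systematic, score the same prompt's output from both paths with whatever quality metric you already trust (tests passed, rubric score) and encode the decision rule. The scoring is up to you; this sketch, with an assumed 0–1 score scale and an arbitrary 10-point margin, only captures the classification step:

```python
def classify_regression(api_score, harness_score, margin=0.10):
    """Compare quality scores (0-1) for the same prompt: raw API vs. harness.

    If the raw API is materially better, the fault is in the product layer;
    if both paths score the same, suspect the model or the prompt instead.
    """
    if api_score - harness_score > margin:
        return "product-layer"
    if harness_score - api_score > margin:
        return "harness-helps"  # the harness scaffolding is adding value
    return "no-material-difference"
```

Plugging in BridgeBench's numbers as an illustration, `classify_regression(0.833, 0.683)` lands on the product layer, which matches the post-mortem's conclusion.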

Fixed as of v2.1.116

All three causes are resolved. Anthropic reset usage limits for all subscribers on April 23, acknowledging that the caching bug's cache-miss behavior drained limits faster than expected.

The commitments in the post-mortem:

  • A larger share of internal staff required to use the exact public build (closing the internal/public gap)
  • Broader per-model eval suites covering every system prompt change
  • Prompt ablations measuring per-line impact before deployment
  • New tooling to audit prompt changes
  • Model-specific changes gated to the intended model target only
  • Soak periods and gradual rollouts for any change that trades intelligence for another metric
  • The @ClaudeDevs account on X, launched as a transparency channel for ongoing communication with developers

The post-mortem is public at anthropic.com/engineering/april-23-postmortem. The AMD GitHub issue is #42796 in the anthropic/claude-code repository. Both are worth reading alongside each other: the official account covers what happened and what changes are planned; the community data covers what it looked like from the outside.

Related Pages

  • Claude Sonnet 4.6 for the current recommended mid-tier model specs
  • Claude Opus 4.7 for the current flagship model
  • All Claude Models for the complete model timeline
  • Model selection guide for choosing between Sonnet and Opus in agent workflows

More in Model Picker

  • Claude Mythos: The Model That Thinks in Loops
    Claude Mythos is suspected to use recurrent-depth architecture: one shared layer looped N times, with ACT halting so hard questions get more passes and easy ones stop early.
  • Claude Opus 4.7 vs Other AI Models
    Claude Opus 4.7, GPT-5.4, Kimi K2.6, Gemini 3.1 Pro, DeepSeek V3.2: benchmarks, context windows, agent reliability, and cost, so you reach for the right one.
  • DeepSeek V4: Pricing, Context, and Migration
    DeepSeek V4 ships two models: V4-Flash at $0.28/M output and V4-Pro at $3.48/M. Both carry a genuine 1M context window and drop into any Anthropic-compatible SDK with one line changed.
  • Every Claude Model
    Every Claude model on one page: Claude 3, 3.5, 3.7, 4, Opus 4.1 to 4.6, Sonnet 4.5 and 4.6, Haiku 4.5. Specs, pricing, benchmarks, and when to use each.
  • Claude 3.5 Sonnet v2 and Claude 3.5 Haiku
    Claude 3.5 Sonnet v2 and 3.5 Haiku launched October 2024 with Computer Use beta, cursor control, upgraded coding and tool use, and cheaper Haiku at $0.80/$4.
  • Claude 3.5 Sonnet
    Claude 3.5 Sonnet launched June 2024 at $3/$15, beating Claude 3 Opus on MMLU, GPQA, HumanEval at a fifth of the cost. Specs, benchmarks, and code gains.
