Moonshot AI released K2.6 with near-doubled tool reliability, a 300-agent swarm, and SWE-Bench Pro scores above Claude Opus 4.6, all at $0.80 per million input tokens.
Moonshot AI dropped K2.6 on April 18, 2026. No launch event. The blog post went up quietly. By the time most developers noticed, Claude Code was already the third-largest consumer of K2.6 tokens on OpenRouter.
The model is worth understanding. Not because of the headline benchmark, but because of what changed under the hood.
Most coverage leads with SWE-Bench Pro: 58.6%, above Claude Opus 4.6's 53.4%. That is real, but it is not the most interesting jump.
The two benchmarks that matter more for anyone running agents:
Toolathlon went from 27.8% to 50.0%. That is near-doubling in a single version. Toolathlon measures whether a model uses tools correctly across complex, multi-step sequences. K2.5 was genuinely unreliable here. K2.6 is not.
MCPMark went from 29.5% to 55.9%. MCPMark is specifically MCP tool-calling accuracy. For Claude Code users routing through OpenRouter, this is the number that predicts whether agent runs complete cleanly or stall mid-task. K2.5's 29.5% was the reason people kept reporting flaky behavior. K2.6's 55.9% is why those reports stopped.
Hallucination rate dropped from 65% to 39%. K2.5 was too unreliable for architecture work. K2.6 is now in Claude Opus territory on factual reliability.
The full coding picture against Opus 4.6:
| Benchmark | Kimi K2.6 | Claude Opus 4.6 |
|---|---|---|
| SWE-Bench Pro | 58.6% | 53.4% |
| SWE-Bench Verified | 80.2% | 80.8% |
| Terminal-Bench 2.0 | 66.7% | 65.4% |
| LiveCodeBench v6 | 89.6% | 88.8% |
| HLE-Full w/ tools | 54.0% | 53.0% |
| HLE-Full no tools | 34.7% | 40.0% |
| Toolathlon | 50.0% | 47.2% |
| MCPMark | 55.9% | 56.7% |
K2.6 leads on most of these rows. Opus 4.6 stays narrowly ahead on SWE-Bench Verified and MCPMark, and clearly ahead on pure reasoning (the HLE no-tools row). That reasoning gap is real and worth knowing.
Benchmarks show averages. The demos show ceilings.
Moonshot ran K2.6 on a real engineering task: deploy Qwen3.5-0.8B locally on a Mac, implement inference in Zig, and optimize until it ran fast. Zig is a niche language with thin training data. The model ran for 12 hours, made 4,000+ tool calls, and went through 14 iterations. Final result: throughput went from ~15 to ~193 tokens per second, 20% faster than LM Studio on the same hardware.
A second task: overhaul exchange-core, an 8-year-old open-source financial matching engine already running near its performance limits. The model ran for 13 hours, made 1,000+ tool calls, and modified 4,000+ lines of code. It analyzed CPU flame graphs, identified hidden bottlenecks, and reconfigured the core thread topology. Median throughput went from 0.43 to 1.24 MT/s. A 185% improvement on a system that engineers had been squeezing for years.
Moonshot's own RL infrastructure team ran a K2.6 agent autonomously for 5 consecutive days managing monitoring, incident response, and system operations. Not a benchmark. Production infrastructure.
The pattern across all three: K2.6 does not stall when it hits a wall. It pivots, finds another path, and keeps going.
This feature ships without much fanfare, and most setup guides skip it entirely.
K2.6 supports `preserve_thinking`: the model retains its full reasoning content across multi-turn interactions. In standard mode, the thinking from turn 1 is gone by turn 2. With `preserve_thinking` enabled, every subsequent turn can reference what the model reasoned through earlier.
For coding agents running multi-step tasks, this is significant. The model does not re-derive architectural context on every tool call. It carries its reasoning forward.
Enable it by passing `{'thinking': {'type': 'enabled', 'keep': 'all'}}` in `extra_body`. On vLLM or SGLang: `{'chat_template_kwargs': {'thinking': True, 'preserve_thinking': True}}`.
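As a concrete sketch, here is the request body those flags produce: OpenAI-compatible SDKs merge whatever you pass in `extra_body` into the top level of the payload. The model slug below is the OpenRouter one used later in this article and is an assumption for any other provider:

```python
import json

# Chat request with preserve_thinking enabled. An OpenAI-compatible SDK
# produces this body when the "thinking" dict is passed via extra_body.
# "moonshotai/kimi-k2.6" is the OpenRouter slug; adjust per provider.
payload = {
    "model": "moonshotai/kimi-k2.6",
    "messages": [{"role": "user", "content": "Continue the refactor from turn 1."}],
    "thinking": {"type": "enabled", "keep": "all"},  # retain reasoning across turns
}
print(json.dumps(payload, indent=2))
```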
K2.5 introduced Agent Swarm as a research preview: 100 sub-agents, 1,500 coordinated steps. K2.6 scales that to 300 sub-agents executing 4,000 steps simultaneously.
This is not just a bigger number. At 100 agents you can parallelize research tasks. At 300 agents running 4,000 steps you can parallelize software engineering pipelines where subtasks have dependencies. Moonshot used it internally to run their own content production: Demo Makers, Benchmark Makers, Social Media agents, and Video Makers all coordinated by K2.6 in a single run.
Claw Groups extends this further as a research preview. Any agent, on any device, running any model can join the swarm. A laptop running a local Llama model and a cloud instance running K2.6 can operate as genuine collaborators under K2.6's coordination. When an agent stalls, the coordinator detects it, reassigns the task, and manages the full lifecycle through to completion.
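The stall-handling loop described above can be pictured with a toy coordinator. Everything here, the agent functions, the timeout, the names, is hypothetical: a behavioral sketch of detect-stall-and-reassign, not Moonshot's actual protocol:

```python
import asyncio

async def run_with_reassignment(task, agents, timeout):
    """Try each agent in order; if one stalls past `timeout`, cancel it and reassign."""
    for agent in agents:
        try:
            return await asyncio.wait_for(agent(task), timeout)
        except asyncio.TimeoutError:
            continue  # stalled agent: the coordinator reassigns the task
    raise RuntimeError("all agents stalled")

async def stalled_agent(task):
    await asyncio.sleep(60)  # never completes within the timeout

async def healthy_agent(task):
    await asyncio.sleep(0.01)
    return f"done: {task}"

result = asyncio.run(
    run_with_reassignment("index repo", [stalled_agent, healthy_agent], timeout=0.2)
)
print(result)  # → done: index repo
```

The real system presumably does this across devices and models rather than coroutines, but the lifecycle is the same shape: watch for stalls, hand the task to another worker, carry it through to completion.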
Honest accounting: K2.6 does not win everywhere.
Pure reasoning without tools: Opus 4.6 scores 40.0% on HLE-Full, K2.6 scores 34.7%. For open-ended architectural thinking not grounded in code, that gap matters.
AIME 2026 and HMMT math: GPT-5.4 and Gemini 3.1 Pro lead here. K2.6 is competitive but not the best reasoning model available.
Data sovereignty: Moonshot AI is a Chinese lab. Anthropic named Moonshot in a February 2026 legal complaint. Enterprise data policies at many companies prohibit routing code through Chinese-owned infrastructure. Check before using this on client work.
Modified MIT license: the model weights ship under what Moonshot calls "Modified MIT." This is not a standard recognized license. Read the actual terms before commercial deployment of the weights.
OpenRouter pricing for Kimi K2.6 (Moonshot AI provider): $0.80/M input, $3.50/M output, $0.20/M cache reads.
The Moonshot AI provider achieves a 93.1% cache hit rate in production. Effective input cost with caching: ~$0.215/M tokens.
For a Claude Code-style workload at 20 prompts/day across 22 working days, the monthly cost lands around $12-15. Claude Sonnet 4.6 at the same volume runs ~$44/month.
Claude Code reads `ANTHROPIC_BASE_URL` at startup and routes to any Anthropic-compatible API. The non-obvious part: Claude Code uses three internal model tiers (Haiku for aux tasks, Sonnet for main coding, Opus for complex reasoning), and all three must be mapped or you get intermittent 404 errors mid-session. Add this to `~/.zshrc`:
```shell
export OPENROUTER_API_KEY="sk-or-..."
export ANTHROPIC_BASE_URL="https://openrouter.ai/api"
export ANTHROPIC_AUTH_TOKEN="$OPENROUTER_API_KEY"
export ANTHROPIC_API_KEY=""
export ANTHROPIC_DEFAULT_HAIKU_MODEL="moonshotai/kimi-k2.6"
export ANTHROPIC_DEFAULT_SONNET_MODEL="moonshotai/kimi-k2.6"
export ANTHROPIC_DEFAULT_OPUS_MODEL="moonshotai/kimi-k2.6"
export CLAUDE_CODE_SUBAGENT_MODEL="moonshotai/kimi-k2.6"
```

Use the Moonshot AI provider specifically (moonshotai/kimi-k2.6). Tool error rate across providers: Moonshot AI 0.20%, NovitaAI 0.44%, Cloudflare 1.86%.
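Before launching Claude Code, a quick sanity check (my addition, not part of any official setup) confirms the routing variables are actually exported in the current shell:

```shell
# Print ok/missing for each routing variable this setup depends on.
for v in ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN ANTHROPIC_DEFAULT_SONNET_MODEL; do
  if [ -n "$(printenv "$v")" ]; then
    echo "ok: $v"
  else
    echo "missing: $v"
  fi
done
```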
The Toolathlon and MCPMark jumps from K2.5 to K2.6 are the story. A model that could not reliably use tools is now one of the best at it. At $0.80/M input tokens, that is the combination that makes K2.6 worth switching for personal and cost-sensitive work.