Moonshot AI released K2.6 with near-doubled tool reliability, a 300-agent swarm, and SWE-Bench Pro scores above Claude Opus 4.6, all at $0.80 per million input tokens.
Moonshot AI dropped K2.6 on April 18, 2026. No launch event. The blog post went up quietly. By the time most developers noticed, Claude Code was already the third-largest consumer of K2.6 tokens on OpenRouter.
The model is worth understanding. Not because of the headline benchmark, but because of what changed under the hood.
Most coverage leads with SWE-Bench Pro: 58.6%, above Claude Opus 4.6's 53.4%. That is real, but it is not the most interesting jump.
The two benchmarks that matter more for anyone running agents:
Toolathlon went from 27.8% to 50.0%. That is near-doubling in a single version. Toolathlon measures whether a model uses tools correctly across complex, multi-step sequences. K2.5 was genuinely unreliable here. K2.6 is not.
MCPMark went from 29.5% to 55.9%. MCPMark is specifically MCP tool-calling accuracy. For Claude Code users routing through OpenRouter, this is the number that predicts whether agent runs complete cleanly or stall mid-task. K2.5's 29.5% was the reason people kept reporting flaky behavior. K2.6's 55.9% is why those reports stopped.
Hallucination rate dropped from 65% to 39%. K2.5 was too unreliable for architecture work. K2.6 is now in Claude Opus territory on factual reliability.
The full coding picture against Opus 4.6:
| Benchmark | Kimi K2.6 | Claude Opus 4.6 |
|---|---|---|
| SWE-Bench Pro | 58.6% | 53.4% |
| SWE-Bench Verified | 80.2% | 80.8% |
| Terminal-Bench 2.0 | 66.7% | 65.4% |
| LiveCodeBench v6 | 89.6% | 88.8% |
| HLE-Full w/ tools | 54.0% | 53.0% |
| HLE-Full no tools | 34.7% | 40.0% |
| Toolathlon | 50.0% | 47.2% |
| MCPMark | 55.9% | 56.7% |
K2.6 leads on most of these rows. Opus 4.6 stays narrowly ahead on SWE-Bench Verified and MCPMark, and clearly ahead on pure reasoning (the HLE no-tools row). That reasoning gap is real and worth knowing.
Benchmarks show averages. The demos show ceilings.
Moonshot ran K2.6 on a real engineering task: deploy Qwen3.5-0.8B locally on a Mac, implement inference in Zig, and optimize until it ran fast. Zig is a niche language with thin training data. The model ran for 12 hours, made 4,000+ tool calls, and went through 14 iterations. Final result: throughput went from ~15 to ~193 tokens per second, 20% faster than LM Studio on the same hardware.
A second task: overhaul exchange-core, an 8-year-old open-source financial matching engine already running near its performance limits. The model ran for 13 hours, made 1,000+ tool calls, and modified 4,000+ lines of code. It analyzed CPU flame graphs, identified hidden bottlenecks, and reconfigured the core thread topology. Median throughput went from 0.43 to 1.24 MT/s. A 185% improvement on a system that engineers had been squeezing for years.
Moonshot's own RL infrastructure team ran a K2.6 agent autonomously for 5 consecutive days managing monitoring, incident response, and system operations. Not a benchmark. Production infrastructure.
The pattern across all three: K2.6 does not stall when it hits a wall. It pivots, finds another path, and keeps going.
This feature ships without much fanfare, and most setup guides skip it entirely.
K2.6 supports `preserve_thinking`: the model retains its full reasoning content across multi-turn interactions. In standard mode, the thinking from turn 1 is gone by turn 2. With `preserve_thinking` enabled, every subsequent turn can reference what the model reasoned through earlier.
For coding agents running multi-step tasks, this is significant. The model does not re-derive architectural context on every tool call. It carries its reasoning forward.
Enable it by passing `{'thinking': {'type': 'enabled', 'keep': 'all'}}` in `extra_body`. On vLLM or SGLang: `{'chat_template_kwargs': {'thinking': True, 'preserve_thinking': True}}`.
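As a concrete sketch, here is the request body those flags produce: OpenAI-compatible SDKs merge whatever you pass in `extra_body` into the top level of the payload. The model slug below is the OpenRouter one used later in this article and is an assumption for any other provider:

```python
import json

# Chat request with preserve_thinking enabled. An OpenAI-compatible SDK
# produces this body when the "thinking" dict is passed via extra_body.
# "moonshotai/kimi-k2.6" is the OpenRouter slug; adjust per provider.
payload = {
    "model": "moonshotai/kimi-k2.6",
    "messages": [{"role": "user", "content": "Continue the refactor from turn 1."}],
    "thinking": {"type": "enabled", "keep": "all"},  # retain reasoning across turns
}
print(json.dumps(payload, indent=2))
```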
K2.5 introduced Agent Swarm as a research preview: 100 sub-agents, 1,500 coordinated steps. K2.6 scales that to 300 sub-agents executing 4,000 steps simultaneously.
This is not just a bigger number. At 100 agents you can parallelize research tasks. At 300 agents running 4,000 steps you can parallelize software engineering pipelines where subtasks have dependencies. Moonshot used it internally to run their own content production: Demo Makers, Benchmark Makers, Social Media agents, and Video Makers all coordinated by K2.6 in a single run.
Claw Groups extends this further as a research preview. Any agent, on any device, running any model can join the swarm. A laptop running a local Llama model and a cloud instance running K2.6 can operate as genuine collaborators under K2.6's coordination. When an agent stalls, the coordinator detects it, reassigns the task, and manages the full lifecycle through to completion.
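The stall-handling loop described above can be pictured with a toy coordinator. Everything here, the agent functions, the timeout, the names, is hypothetical: a behavioral sketch of detect-stall-and-reassign, not Moonshot's actual protocol:

```python
import asyncio

async def run_with_reassignment(task, agents, timeout):
    """Try each agent in order; if one stalls past `timeout`, cancel it and reassign."""
    for agent in agents:
        try:
            return await asyncio.wait_for(agent(task), timeout)
        except asyncio.TimeoutError:
            continue  # stalled agent: the coordinator reassigns the task
    raise RuntimeError("all agents stalled")

async def stalled_agent(task):
    await asyncio.sleep(60)  # never completes within the timeout

async def healthy_agent(task):
    await asyncio.sleep(0.01)
    return f"done: {task}"

result = asyncio.run(
    run_with_reassignment("index repo", [stalled_agent, healthy_agent], timeout=0.2)
)
print(result)  # → done: index repo
```

The real system presumably does this across devices and models rather than coroutines, but the lifecycle is the same shape: watch for stalls, hand the task to another worker, carry it through to completion.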
Honest accounting: K2.6 does not win everywhere.
Pure reasoning without tools: Opus 4.6 scores 40.0% on HLE-Full, K2.6 scores 34.7%. For open-ended architectural thinking not grounded in code, that gap matters.
AIME 2026 and HMMT math: GPT-5.4 and Gemini 3.1 Pro lead here. K2.6 is competitive but not the best reasoning model available.
Data sovereignty: Moonshot AI is a Chinese lab. Anthropic named Moonshot in a February 2026 legal complaint. Enterprise data policies at many companies prohibit routing code through Chinese-owned infrastructure. Check before using this on client work.
Modified MIT license: the model weights ship under what Moonshot calls "Modified MIT." This is not a standard recognized license. Read the actual terms before commercial deployment of the weights.
OpenRouter pricing for Kimi K2.6 (Moonshot AI provider): $0.80/M input, $3.50/M output, $0.20/M cache reads.
The Moonshot AI provider achieves a 93.1% cache hit rate in production. Effective input cost with caching: ~$0.215/M tokens.
For a Claude Code-style workload at 20 prompts/day across 22 working days, the monthly cost lands around $12-15. Claude Sonnet 4.6 at the same volume runs ~$44/month.
Claude Code reads `ANTHROPIC_BASE_URL` at startup and routes to any Anthropic-compatible API. The non-obvious part: Claude Code uses three internal model tiers (Haiku for aux tasks, Sonnet for main coding, Opus for complex reasoning), and all three must be mapped or you get intermittent 404 errors mid-session. Add this to `~/.zshrc`:
```shell
export OPENROUTER_API_KEY="sk-or-..."
export ANTHROPIC_BASE_URL="https://openrouter.ai/api"
export ANTHROPIC_AUTH_TOKEN="$OPENROUTER_API_KEY"
export ANTHROPIC_API_KEY=""
export ANTHROPIC_DEFAULT_HAIKU_MODEL="moonshotai/kimi-k2.6"
export ANTHROPIC_DEFAULT_SONNET_MODEL="moonshotai/kimi-k2.6"
export ANTHROPIC_DEFAULT_OPUS_MODEL="moonshotai/kimi-k2.6"
export CLAUDE_CODE_SUBAGENT_MODEL="moonshotai/kimi-k2.6"
```

Use the Moonshot AI provider specifically (moonshotai/kimi-k2.6). Tool error rate across providers: Moonshot AI 0.20%, NovitaAI 0.44%, Cloudflare 1.86%.
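Before launching Claude Code, a quick sanity check (my addition, not part of any official setup) confirms the routing variables are actually exported in the current shell:

```shell
# Print ok/missing for each routing variable this setup depends on.
for v in ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN ANTHROPIC_DEFAULT_SONNET_MODEL; do
  if [ -n "$(printenv "$v")" ]; then
    echo "ok: $v"
  else
    echo "missing: $v"
  fi
done
```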
The Toolathlon and MCPMark jumps from K2.5 to K2.6 are the story. A model that could not reliably use tools is now one of the best at it. At $0.80/M input tokens, that is the combination that makes K2.6 worth switching for personal and cost-sensitive work.