DeepSeek V4: Pricing, Context, and Migration

DeepSeek V4 ships two models: V4-Flash at $0.28/M output and V4-Pro at $3.48/M. Both carry a genuine 1M context window and drop into any Anthropic-compatible SDK with one line changed.

Published Apr 24, 2026 · 11 min read · Model Picker hub

DeepSeek V4 landed with pricing that is hard to ignore. V4-Flash comes in at $0.28 per million output tokens. V4-Pro sits at $3.48. GPT-5.5 is $30. That is a 107x gap at the Flash tier, and it compounds when you factor in cache hits.

Two things make this worth paying attention to beyond the headline number. First, the 1M context window actually works. On the MRCR benchmark at 1M tokens, V4-Pro publishes 83.5% recall; GPT-5.5 drops to 74%, and Opus 4.7 falls to 32.2%. Of the frontier models that currently ship a 1M window, V4-Pro holds up best at long context. Second, you do not need to rewrite anything. DeepSeek V4 uses an Anthropic-compatible endpoint: change one URL and one model string. Same SDK, same tools, different price.

The Two Models

V4 ships as two variants. They run on very different hardware profiles.

| Model | Total Parameters | Active Parameters | Output Price |
|---|---|---|---|
| V4-Flash | 284B | 13B | $0.28 / 1M tokens |
| V4-Pro | 1.6T | 49B | $3.48 / 1M tokens |

Both are Mixture-of-Experts (MoE) models. Here is what that means in plain terms: the model has a large number of parameters, but only a small fraction of them fire on any given token. V4-Flash has 284 billion parameters total, but only 13 billion are active per inference pass. V4-Pro has 1.6 trillion total, but only 49 billion activate.

Why this matters for cost: You pay for active parameters, not total ones. A 1.6T model that runs 49B active parameters is cheaper to serve than a dense 49B model. You get the knowledge of a trillion-parameter model at the compute cost of a much smaller one.
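The active-parameter ratio is easy to make concrete. Using the figures from the table, each token touches under 5% of either model's weights:

```python
# Parameter counts from the table above.
flash_total, flash_active = 284e9, 13e9
pro_total, pro_active = 1.6e12, 49e9

print(f"V4-Flash: {flash_active / flash_total:.1%} of weights active per token")  # 4.6%
print(f"V4-Pro:   {pro_active / pro_total:.1%} of weights active per token")      # 3.1%
```

V4-Pro is roughly 5.6x larger in total parameters than V4-Flash, but only about 3.8x larger in active parameters, which is why the serving-cost gap between them is smaller than the headline sizes suggest.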

| Spec | V4-Flash | V4-Pro |
|---|---|---|
| Context window | 1M tokens | 1M tokens |
| Architecture | Mixture-of-Experts | Mixture-of-Experts |
| Active params | 13B | 49B |
| Total params | 284B | 1.6T |
| Endpoint | Anthropic-compatible | Anthropic-compatible |

The Cost Breakdown

Put the output pricing side by side with the current frontier models:

| Model | Output Price (per 1M tokens) |
|---|---|
| GPT-5.5 | $30.00 |
| Claude Opus 4.7 | $25.00 |
| DeepSeek V4-Pro | $3.48 |
| DeepSeek V4-Flash | $0.28 |

The math for a daily agentic workload: if your pipeline outputs 1 million tokens per day, V4-Flash saves $29.72 every single day compared to GPT-5.5. That is $891 per month on a single pipeline, before cache pricing enters the picture.

When the savings stack: Agentic loops generate output on every iteration. A single long-running agent that processes tool results and emits next-step instructions might push 500K to 2M output tokens in a day. At that volume, the per-token gap between V4-Flash and GPT-5.5 moves from "interesting" to "this decides the project's economics."
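The math above reduces to a few lines. Prices are the published output rates from the table; the volumes are the example figures from this section:

```python
# Output price per 1M tokens, from the table above.
OUTPUT_PRICE = {
    "gpt-5.5": 30.00,
    "claude-opus-4.7": 25.00,
    "deepseek-v4-pro": 3.48,
    "deepseek-v4-flash": 0.28,
}

def daily_output_cost(tokens_per_day: int, model: str) -> float:
    """Dollar cost of one day's output tokens at published rates."""
    return tokens_per_day / 1_000_000 * OUTPUT_PRICE[model]

# 1M output tokens/day: Flash vs GPT-5.5 is $29.72/day, roughly $891/month.
saving = daily_output_cost(1_000_000, "gpt-5.5") - daily_output_cost(1_000_000, "deepseek-v4-flash")
```

Swap in your own token volumes to see where your pipeline lands; cache pricing (covered below) only widens the gap.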

The Context Window

A 1M token context window is only useful if the model can actually retrieve information from it. Many models claim long context but show retrieval accuracy that collapses past 128K or 256K tokens. The right way to verify this is the MRCR benchmark (Multi-Record Contextual Retrieval), which all three frontier labs now publish by depth.

Here are the vendor-published MRCR numbers at 1M tokens:

| Model | MRCR at 1M | Source |
|---|---|---|
| DeepSeek V4-Pro | 83.5% | DeepSeek technical report |
| GPT-5.5 | 74% | OpenAI MRCR v2 (512K-1M bucket) |
| Claude Opus 4.7 | 32.2% | Anthropic system card, pp. 195-196 |

V4-Pro has the best published 1M recall of any frontier model as of today. The gap versus Opus 4.7 is the story that did not make the launch posts: 83.5% versus 32.2% on the same benchmark, a 51-point spread. Opus 4.7 is also a regression from 4.6, which scored 78.3% at 1M in Anthropic's own system card. Whatever trade-off Anthropic made to move Opus forward did not preserve long-context recall.

What Engram does: Most transformer models use positional embeddings that were trained at shorter lengths. When you extend them to 1M tokens, the model's sense of "where am I in this document" degrades. Engram is DeepSeek's approach to keeping that signal accurate across the full window. The Engram research paper demonstrated the architecture on a 27B test model hitting 97% NIAH at 1M tokens; production V4-Pro uses the same approach and publishes 83.5% on the harder MRCR benchmark at the same depth.

For practical use: if you are building a system that processes entire codebases, long legal documents, or extended conversation histories, V4-Pro has the most reliable long-context recall of any frontier model that currently ships a 1M window.

Cache Pricing: Where the Real Story Is

Cache hits are where the cost math changes most dramatically. When your system prompt or document context is repeated across many requests, DeepSeek charges a fraction of the base rate for the cached portion.

| Pattern | No Cache | Cached | Saving |
|---|---|---|---|
| System prompt | $18.00/day | $0.90/day | 95% |
| Long doc context | $42.00/run | $2.10/run | 95% |
| Agent loop tool calls | $8.40/hr | $0.42/hr | 95% |

The system prompt gets cached after the first call. Every subsequent call in that session treats the system prompt tokens as a cache hit. For a long-running agent with a detailed system prompt, the per-call cost of those prompt tokens drops to near zero after the first inference.

Long doc context follows the same logic. If you load a 500-page document and run multiple queries against it, only the first call pays full price. Every follow-up query pays the cache rate. At 95% off, a workload that would cost $42 per run lands at $2.10.

Agent loops are where this compounds fastest. Every iteration of an agent loop that reads the same tool results or working memory pays cache rates for the repeated tokens. Scale the loop up and the savings grow linearly with iteration count.

The practical implication: the pricing advantage of V4 over frontier models is not just at the base rate. It widens with every cached token. Long-running agentic systems get the biggest discount.
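Every row in the table above follows one formula: cached tokens pay roughly 5% of the base rate. A sketch of that arithmetic, using the table's 95% discount (actual cached rates come from DeepSeek's price list):

```python
def run_cost(full_cost: float, cached_fraction: float = 1.0, discount: float = 0.95) -> float:
    """Cost of a run where `cached_fraction` of the spend hits the cache at `discount` off."""
    cached = full_cost * cached_fraction * (1 - discount)
    uncached = full_cost * (1 - cached_fraction)
    return cached + uncached

# Long-document row from the table: $42 full price, fully cached follow-up run.
followup = run_cost(42.00)  # ~ $2.10
```

Lowering `cached_fraction` models runs where only the system prompt and static context repeat while the dynamic tail pays full price.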

Migrating in One Line

DeepSeek V4 uses an Anthropic-compatible API endpoint. If you are already using the Anthropic SDK or any OpenAI-compatible SDK, nothing else needs to change.

The complete migration from Claude to DeepSeek V4-Pro:

```python
# before
base_url = "https://api.anthropic.com"
model = "claude-opus-4-5"

# after (one line changed)
base_url = "https://api.deepseek.com"
model = "deepseek-v4-pro"
```

Same SDK. Same tools. Same workflow. The endpoint speaks the same protocol.

| SDK | Compatible |
|---|---|
| Claude SDK (Anthropic) | Yes |
| OpenAI SDK | Yes |
| LangChain | Yes |

No refactor needed. Tool calls, function definitions, message formats, system prompts — all of it carries over. You are pointing the same code at a different URL with a different model string.

For the Flash variant, swap deepseek-v4-pro for deepseek-v4-flash. That is the entire change to move from Pro pricing ($3.48/M) to Flash pricing ($0.28/M).

Flash vs Pro: Which One to Use

The decision is straightforward once you know what each model is optimized for.

| Use case | Recommended | Why |
|---|---|---|
| High-throughput pipelines, classification, extraction | V4-Flash | 13B active params, lowest cost per token |
| Complex multi-step reasoning, long-horizon agents | V4-Pro | 49B active params, higher reasoning ceiling |
| Document Q&A with repeated context | V4-Flash | Cache hits compound fast at $0.28 base |
| Code generation on hard problems | V4-Pro | More active capacity for harder tasks |
| Summarization at scale | V4-Flash | Output volume is where Flash wins most |
| Architecture planning, system design | V4-Pro | Fewer errors on ambiguous requirements |

A useful heuristic: Flash is the right default. Move to Pro when a task fails or produces unreliable output at Flash. The 12x price difference between them ($0.28 vs $3.48) means you can run a lot of Flash iterations before Pro becomes the cheaper option on a per-correct-output basis.
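The per-correct-output framing can be checked directly. Expected spend per correct result is price divided by success rate, assuming failed attempts are simply retried (the success rates below are illustrative, not benchmarks):

```python
flash_price, pro_price = 0.28, 3.48  # $ per 1M output tokens

def cost_per_correct_output(price_per_m: float, success_rate: float) -> float:
    """Expected spend per correct result when failed attempts are retried."""
    return price_per_m / success_rate

# Even if Flash succeeds only 30% of the time against 95% for Pro,
# Flash is still cheaper per correct output:
flash = cost_per_correct_output(flash_price, 0.30)  # ~ $0.93
pro = cost_per_correct_output(pro_price, 0.95)      # ~ $3.66
```

On this model, Pro only wins on cost when Flash's success rate falls below roughly one twelfth of Pro's, which is why "Flash first, escalate on failure" is a safe default.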

How to Plug DeepSeek V4 Into Your Own Agent

Getting the model configured is one step. Making it work well in an agentic loop takes a few more decisions.

Step 1: Route by task type. Build a simple router that picks Flash for extraction, summarization, and tool-result parsing. Reserve Pro for planning steps, ambiguous multi-turn reasoning, and anywhere you are already seeing model errors at Flash.
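A router in that spirit can be a single lookup. The task labels here are hypothetical; adapt them to however your pipeline tags work items:

```python
# Hypothetical task labels, not an API. Extend as your pipeline needs.
FLASH_TASKS = {"classification", "extraction", "summarization", "tool_result_parsing"}

def pick_model(task_type: str, flash_failed: bool = False) -> str:
    """Default to Flash; escalate to Pro for planning-grade work or after a Flash failure."""
    if flash_failed or task_type not in FLASH_TASKS:
        return "deepseek-v4-pro"
    return "deepseek-v4-flash"
```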

Step 2: Set up caching deliberately. Cache hits require the cached tokens to appear at the start of the context in a consistent order. Put your system prompt first, then any static context (documents, rules, reference data), then the dynamic portion (conversation history, tool results). If the static portion changes between calls, you lose the cache hit.

The setup in Python looks like this:

```python
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.deepseek.com",
    api_key="your-deepseek-api-key",
)

response = client.messages.create(
    model="deepseek-v4-flash",
    max_tokens=4096,
    system="Your system prompt here — this gets cached after the first call.",
    messages=[
        {"role": "user", "content": "Your dynamic message here."}
    ],
)
```

Step 3: Monitor cache hit rates. DeepSeek returns cache hit information in the usage object. Log it. If your cache hit rate is below 80% on a workload with a static system prompt, the context ordering is probably wrong.
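The hit-rate check itself is arithmetic on the usage counts. The exact field names in DeepSeek's usage object are not specified here, so treat the token counts as inputs you extract from your SDK's response:

```python
def cache_hit_rate(cached_input_tokens: int, total_input_tokens: int) -> float:
    """Fraction of input tokens served from cache on one call."""
    if total_input_tokens == 0:
        return 0.0
    return cached_input_tokens / total_input_tokens

# With a static system prompt this should sit near 1.0 after the first call;
# sustained values under 0.8 usually mean the context prefix changes between calls.
```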

Step 4: Size the context window. V4's 1M token window gives you room to include large reference documents without chunking. For retrieval-heavy agents, loading the full document is often more reliable and cheaper (via cache hits) than running a retrieval step on every query.

Step 5: Pick the model per sub-agent. In a multi-agent system, different agents have different requirements. An orchestrator doing routing decisions might work fine on Flash. A specialist doing code generation might need Pro. Set ANTHROPIC_DEFAULT_HAIKU_MODEL, ANTHROPIC_DEFAULT_SONNET_MODEL, and ANTHROPIC_DEFAULT_OPUS_MODEL to mix models by tier if you are running through a framework that uses those variables.
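If your framework resolves models through those Claude Code-style variables, the tier mapping can be set before it starts. A minimal sketch; that your framework reads these variables, and that it honors `ANTHROPIC_BASE_URL` for the endpoint, are assumptions to verify against its documentation:

```python
import os

# Map Claude Code-style model tiers onto DeepSeek variants.
# Assumption: the framework you run under resolves models via these variables.
os.environ["ANTHROPIC_DEFAULT_HAIKU_MODEL"] = "deepseek-v4-flash"  # cheap tier
os.environ["ANTHROPIC_DEFAULT_SONNET_MODEL"] = "deepseek-v4-pro"   # mid tier
os.environ["ANTHROPIC_DEFAULT_OPUS_MODEL"] = "deepseek-v4-pro"     # top tier
os.environ["ANTHROPIC_BASE_URL"] = "https://api.deepseek.com"
```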

Quick Specs Reference

| Spec | V4-Flash | V4-Pro |
|---|---|---|
| Input price | $0.07 / 1M tokens | $0.87 / 1M tokens |
| Output price | $0.28 / 1M tokens | $3.48 / 1M tokens |
| Cached input | ~$0.004 / 1M tokens | ~$0.044 / 1M tokens |
| Context window | 1M tokens | 1M tokens |
| MRCR recall at 1M (published) | — | 83.5% |
| Active parameters | 13B | 49B |
| Total parameters | 284B | 1.6T |
| Endpoint | api.deepseek.com | api.deepseek.com |
| SDK compatibility | Anthropic, OpenAI, LangChain | Anthropic, OpenAI, LangChain |

FAQ

Is DeepSeek V4 actually 107x cheaper than GPT-5.5?

At the V4-Flash output rate. GPT-5.5 output is $30 per million tokens. V4-Flash output is $0.28 per million tokens. That is a 107x difference. V4-Pro at $3.48 is about 8.6x cheaper than GPT-5.5. The 107x figure applies when you are comparing the cheapest V4 variant to the most expensive widely-used frontier model.

Does the 1M context window actually work, or is it a marketing number?

Yes. V4-Pro publishes 83.5% on the MRCR benchmark (Multi-Record Contextual Retrieval) at 1M tokens. That is the highest MRCR score of any frontier model with a published 1M number. GPT-5.5 comes in at 74% in the same 512K-1M bucket. Claude Opus 4.7 drops to 32.2% at 1M per pages 195-196 of Anthropic's system card. Cite the MRCR number when someone asks if the context window is real. The 97% figure quoted in some coverage is from the Engram research paper on a 27B test model, not the production V4-Pro.

Will my existing Claude or OpenAI code work with DeepSeek V4?

Yes. DeepSeek V4 exposes an Anthropic-compatible endpoint. If you are using the Anthropic SDK, change base_url to https://api.deepseek.com and set the model to deepseek-v4-pro or deepseek-v4-flash. If you are using the OpenAI SDK or LangChain, the same base URL change applies. No other code changes are required.

When should I use Flash versus Pro?

Flash is the right starting point for most workloads. It handles classification, extraction, summarization, tool-result parsing, and high-throughput pipelines well. Move to Pro when a task requires sustained multi-step reasoning, produces unreliable output at Flash, or involves complex code generation on hard problems. The 12x price gap means Flash gives you a lot of retries before Pro becomes the cheaper option per correct output.

How does cache pricing work in practice?

After the first API call, any tokens that appear identically at the start of the context are treated as cache hits on subsequent calls. System prompts, static document context, and fixed reference data all cache well. The effective rate on cached tokens is about 95% off the base input price. For an agent that runs 100 iterations with the same 10K-token system prompt, only the first call pays full price for those 10K tokens. The other 99 calls pay cache rates.
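That 100-iteration example works out like this, using the V4-Flash input rate from the spec table and the ~95% cache discount:

```python
# 100 agent iterations, same 10K-token system prompt, V4-Flash input rates.
base_rate = 0.07 / 1e6          # $ per input token at $0.07/M
cached_rate = base_rate * 0.05  # ~95% off when cached
prompt_tokens = 10_000
iterations = 100

first_call = prompt_tokens * base_rate
later_calls = (iterations - 1) * prompt_tokens * cached_rate
with_cache = first_call + later_calls
without_cache = iterations * prompt_tokens * base_rate

print(f"without cache: ${without_cache:.4f}, with cache: ${with_cache:.4f}")  # ~94% saved
```

The absolute numbers are tiny at Flash rates, but the same ratio applies to Pro-priced prompts and to much larger static contexts, where the first-call cost is no longer negligible.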

Related Pages

  • Model selection guide for per-task model routing inside Claude Code
  • All Claude Models for the complete Anthropic model timeline
  • Claude Opus 4.7 vs GPT-5.5 for the current frontier comparison
  • Kimi K2.6 for another cost-efficient alternative with MoE architecture
