DeepSeek V4: Pricing, Context, and Migration

DeepSeek V4 ships two models: V4-Flash at $0.28/M output and V4-Pro at $3.48/M. Both carry a genuine 1M context window and drop into any Anthropic-compatible SDK with one line changed.

Published Apr 24, 2026 · 11 min read · Model Picker hub

DeepSeek V4 landed with pricing that is hard to ignore. V4-Flash comes in at $0.28 per million output tokens. V4-Pro sits at $3.48. GPT-5.5 is $30. That is a 107x gap at the Flash tier, and it compounds when you factor in cache hits.

Two things make this worth paying attention to beyond the headline number. First, the 1M context window actually works. On the MRCR benchmark at 1M tokens, V4-Pro publishes 83.5% recall; GPT-5.5 drops to 74%, and Opus 4.7 falls to 32.2%. Of the frontier models that currently ship a 1M window, V4-Pro holds up best at long context. Second, you do not need to rewrite anything. DeepSeek V4 uses an Anthropic-compatible endpoint: change one URL and one model string. Same SDK, same tools, different price.

The Two Models

V4 ships as two variants. They run on very different hardware profiles.

| Model | Total Parameters | Active Parameters | Output Price |
|---|---|---|---|
| V4-Flash | 284B | 13B | $0.28 / 1M tokens |
| V4-Pro | 1.6T | 49B | $3.48 / 1M tokens |

Both are Mixture-of-Experts (MoE) models. Here is what that means in plain terms: the model has a large number of parameters, but only a small fraction of them fire on any given token. V4-Flash has 284 billion parameters total, but only 13 billion are active per inference pass. V4-Pro has 1.6 trillion total, but only 49 billion activate.

Why this matters for cost: You pay for active parameters, not total ones. A 1.6T model that runs 49B active parameters is cheaper to serve than a dense 49B model. You get the knowledge of a trillion-parameter model at the compute cost of a much smaller one.
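The active-parameter ratio is easy to make concrete. Using the figures from the table, each token touches under 5% of either model's weights:

```python
# Parameter counts from the table above.
flash_total, flash_active = 284e9, 13e9
pro_total, pro_active = 1.6e12, 49e9

print(f"V4-Flash: {flash_active / flash_total:.1%} of weights active per token")  # 4.6%
print(f"V4-Pro:   {pro_active / pro_total:.1%} of weights active per token")      # 3.1%
```

V4-Pro is roughly 5.6x larger in total parameters than V4-Flash, but only about 3.8x larger in active parameters, which is why the serving-cost gap between them is smaller than the headline sizes suggest.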

| Spec | V4-Flash | V4-Pro |
|---|---|---|
| Context window | 1M tokens | 1M tokens |
| Architecture | Mixture-of-Experts | Mixture-of-Experts |
| Active params | 13B | 49B |
| Total params | 284B | 1.6T |
| Endpoint | Anthropic-compatible | Anthropic-compatible |

The Cost Breakdown

Put the output pricing side by side with the current frontier models:

| Model | Output Price (per 1M tokens) |
|---|---|
| GPT-5.5 | $30.00 |
| Claude Opus 4.7 | $25.00 |
| DeepSeek V4-Pro | $3.48 |
| DeepSeek V4-Flash | $0.28 |

The math for a daily agentic workload: if your pipeline outputs 1 million tokens per day, V4-Flash saves $29.72 every single day compared to GPT-5.5. That is $891 per month on a single pipeline, before cache pricing enters the picture.

When the savings stack: Agentic loops generate output on every iteration. A single long-running agent that processes tool results and emits next-step instructions might push 500K to 2M output tokens in a day. At that volume, the per-token gap between V4-Flash and GPT-5.5 moves from "interesting" to "this decides the project's economics."
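The math above reduces to a few lines. Prices are the published output rates from the table; the volumes are the example figures from this section:

```python
# Output price per 1M tokens, from the table above.
OUTPUT_PRICE = {
    "gpt-5.5": 30.00,
    "claude-opus-4.7": 25.00,
    "deepseek-v4-pro": 3.48,
    "deepseek-v4-flash": 0.28,
}

def daily_output_cost(tokens_per_day: int, model: str) -> float:
    """Dollar cost of one day's output tokens at published rates."""
    return tokens_per_day / 1_000_000 * OUTPUT_PRICE[model]

# 1M output tokens/day: Flash vs GPT-5.5 is $29.72/day, roughly $891/month.
saving = daily_output_cost(1_000_000, "gpt-5.5") - daily_output_cost(1_000_000, "deepseek-v4-flash")
```

Swap in your own token volumes to see where your pipeline lands; cache pricing (covered below) only widens the gap.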

The Context Window

A 1M token context window is only useful if the model can actually retrieve information from it. Many models claim long context but show retrieval accuracy that collapses past 128K or 256K tokens. The right way to verify this is the MRCR benchmark (Multi-Record Contextual Retrieval), which all three frontier labs now publish by depth.

Here are the vendor-published MRCR numbers at 1M tokens:

| Model | MRCR at 1M | Source |
|---|---|---|
| DeepSeek V4-Pro | 83.5% | DeepSeek technical report |
| GPT-5.5 | 74% | OpenAI MRCR v2 (512K-1M bucket) |
| Claude Opus 4.7 | 32.2% | Anthropic system card, pp. 195-196 |

V4-Pro has the best published 1M recall of any frontier model as of today. The gap versus Opus 4.7 is the story that did not make the launch posts: 83.5% versus 32.2% on the same benchmark, a 51-point spread. Opus 4.7 is also a regression from 4.6, which scored 78.3% at 1M in Anthropic's own system card. Whatever trade-off Anthropic made to move Opus forward did not preserve long-context recall.

What Engram does: Most transformer models use positional embeddings that were trained at shorter lengths. When you extend them to 1M tokens, the model's sense of "where am I in this document" degrades. Engram is DeepSeek's approach to keeping that signal accurate across the full window. The Engram research paper demonstrated the architecture on a 27B test model hitting 97% NIAH at 1M tokens; production V4-Pro uses the same approach and publishes 83.5% on the harder MRCR benchmark at the same depth.

For practical use: if you are building a system that processes entire codebases, long legal documents, or extended conversation histories, V4-Pro has the most reliable long-context recall of any frontier model that currently ships a 1M window.

Cache Pricing: Where the Real Story Is

Cache hits are where the cost math changes most dramatically. When your system prompt or document context is repeated across many requests, DeepSeek charges a fraction of the base rate for the cached portion.

| Pattern | No Cache | Cached | Saving |
|---|---|---|---|
| System prompt | $18.00/day | $0.90/day | 95% |
| Long doc context | $42.00/run | $2.10/run | 95% |
| Agent loop tool calls | $8.40/hr | $0.42/hr | 95% |

The system prompt gets cached after the first call. Every subsequent call in that session treats the system prompt tokens as a cache hit. For a long-running agent with a detailed system prompt, the per-call cost of those prompt tokens drops to near zero after the first inference.

Long doc context follows the same logic. If you load a 500-page document and run multiple queries against it, only the first call pays full price. Every follow-up query pays the cache rate. At 95% off, a workload that would cost $42 per run lands at $2.10.

Agent loops are where this compounds fastest. Every iteration of an agent loop that reads the same tool results or working memory pays cache rates for the repeated tokens. Scale the loop up and the savings grow linearly with iteration count.

The practical implication: the pricing advantage of V4 over frontier models is not just at the base rate. It widens with every cached token. Long-running agentic systems get the biggest discount.
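Every row in the table above follows one formula: cached tokens pay roughly 5% of the base rate. A sketch of that arithmetic, using the table's 95% discount (actual cached rates come from DeepSeek's price list):

```python
def run_cost(full_cost: float, cached_fraction: float = 1.0, discount: float = 0.95) -> float:
    """Cost of a run where `cached_fraction` of the spend hits the cache at `discount` off."""
    cached = full_cost * cached_fraction * (1 - discount)
    uncached = full_cost * (1 - cached_fraction)
    return cached + uncached

# Long-document row from the table: $42 full price, fully cached follow-up run.
followup = run_cost(42.00)  # ~ $2.10
```

Lowering `cached_fraction` models runs where only the system prompt and static context repeat while the dynamic tail pays full price.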

Migrating in One Line

DeepSeek V4 uses an Anthropic-compatible API endpoint. If you are already using the Anthropic SDK or any OpenAI-compatible SDK, nothing else needs to change.

The complete migration from Claude to DeepSeek V4-Pro:

```python
# before
base_url = "https://api.anthropic.com"
model = "claude-opus-4-5"

# after (one line changed)
base_url = "https://api.deepseek.com"
model = "deepseek-v4-pro"
```

Same SDK. Same tools. Same workflow. The endpoint speaks the same protocol.

| SDK | Compatible |
|---|---|
| Claude SDK (Anthropic) | Yes |
| OpenAI SDK | Yes |
| LangChain | Yes |

No refactor needed. Tool calls, function definitions, message formats, system prompts — all of it carries over. You are pointing the same code at a different URL with a different model string.

For the Flash variant, swap deepseek-v4-pro for deepseek-v4-flash. That is the entire change to move from Pro pricing ($3.48/M) to Flash pricing ($0.28/M).

Flash vs Pro: Which One to Use

The decision is straightforward once you know what each model is optimized for.

| Use case | Recommended | Why |
|---|---|---|
| High-throughput pipelines, classification, extraction | V4-Flash | 13B active params, lowest cost per token |
| Complex multi-step reasoning, long-horizon agents | V4-Pro | 49B active params, higher reasoning ceiling |
| Document Q&A with repeated context | V4-Flash | Cache hits compound fast at $0.28 base |
| Code generation on hard problems | V4-Pro | More active capacity for harder tasks |
| Summarization at scale | V4-Flash | Output volume is where Flash wins most |
| Architecture planning, system design | V4-Pro | Fewer errors on ambiguous requirements |

A useful heuristic: Flash is the right default. Move to Pro when a task fails or produces unreliable output at Flash. The 12x price difference between them ($0.28 vs $3.48) means you can run a lot of Flash iterations before Pro becomes the cheaper option on a per-correct-output basis.
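The per-correct-output framing can be checked directly. Expected spend per correct result is price divided by success rate, assuming failed attempts are simply retried (the success rates below are illustrative, not benchmarks):

```python
flash_price, pro_price = 0.28, 3.48  # $ per 1M output tokens

def cost_per_correct_output(price_per_m: float, success_rate: float) -> float:
    """Expected spend per correct result when failed attempts are retried."""
    return price_per_m / success_rate

# Even if Flash succeeds only 30% of the time against 95% for Pro,
# Flash is still cheaper per correct output:
flash = cost_per_correct_output(flash_price, 0.30)  # ~ $0.93
pro = cost_per_correct_output(pro_price, 0.95)      # ~ $3.66
```

On this model, Pro only wins on cost when Flash's success rate falls below roughly one twelfth of Pro's, which is why "Flash first, escalate on failure" is a safe default.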

How to Plug DeepSeek V4 Into Your Own Agent

Getting the model configured is one step. Making it work well in an agentic loop takes a few more decisions.

Step 1: Route by task type. Build a simple router that picks Flash for extraction, summarization, and tool-result parsing. Reserve Pro for planning steps, ambiguous multi-turn reasoning, and anywhere you are already seeing model errors at Flash.
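A router in that spirit can be a single lookup. The task labels here are hypothetical; adapt them to however your pipeline tags work items:

```python
# Hypothetical task labels, not an API. Extend as your pipeline needs.
FLASH_TASKS = {"classification", "extraction", "summarization", "tool_result_parsing"}

def pick_model(task_type: str, flash_failed: bool = False) -> str:
    """Default to Flash; escalate to Pro for planning-grade work or after a Flash failure."""
    if flash_failed or task_type not in FLASH_TASKS:
        return "deepseek-v4-pro"
    return "deepseek-v4-flash"
```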

Step 2: Set up caching deliberately. Cache hits require the cached tokens to appear at the start of the context in a consistent order. Put your system prompt first, then any static context (documents, rules, reference data), then the dynamic portion (conversation history, tool results). If the static portion changes between calls, you lose the cache hit.

The setup in Python looks like this:

```python
import anthropic

client = anthropic.Anthropic(
    base_url="https://api.deepseek.com",
    api_key="your-deepseek-api-key",
)

response = client.messages.create(
    model="deepseek-v4-flash",
    max_tokens=4096,
    system="Your system prompt here — this gets cached after the first call.",
    messages=[
        {"role": "user", "content": "Your dynamic message here."}
    ],
)
```

Step 3: Monitor cache hit rates. DeepSeek returns cache hit information in the usage object. Log it. If your cache hit rate is below 80% on a workload with a static system prompt, the context ordering is probably wrong.
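The hit-rate check itself is arithmetic on the usage counts. The exact field names in DeepSeek's usage object are not specified here, so treat the token counts as inputs you extract from your SDK's response:

```python
def cache_hit_rate(cached_input_tokens: int, total_input_tokens: int) -> float:
    """Fraction of input tokens served from cache on one call."""
    if total_input_tokens == 0:
        return 0.0
    return cached_input_tokens / total_input_tokens

# With a static system prompt this should sit near 1.0 after the first call;
# sustained values under 0.8 usually mean the context prefix changes between calls.
```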

Step 4: Size the context window. V4's 1M token window gives you room to include large reference documents without chunking. For retrieval-heavy agents, loading the full document is often more reliable and cheaper (via cache hits) than running a retrieval step on every query.

Step 5: Pick the model per sub-agent. In a multi-agent system, different agents have different requirements. An orchestrator doing routing decisions might work fine on Flash. A specialist doing code generation might need Pro. Set ANTHROPIC_DEFAULT_HAIKU_MODEL, ANTHROPIC_DEFAULT_SONNET_MODEL, and ANTHROPIC_DEFAULT_OPUS_MODEL to mix models by tier if you are running through a framework that uses those variables.
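If your framework resolves models through those Claude Code-style variables, the tier mapping can be set before it starts. A minimal sketch; that your framework reads these variables, and that it honors `ANTHROPIC_BASE_URL` for the endpoint, are assumptions to verify against its documentation:

```python
import os

# Map Claude Code-style model tiers onto DeepSeek variants.
# Assumption: the framework you run under resolves models via these variables.
os.environ["ANTHROPIC_DEFAULT_HAIKU_MODEL"] = "deepseek-v4-flash"  # cheap tier
os.environ["ANTHROPIC_DEFAULT_SONNET_MODEL"] = "deepseek-v4-pro"   # mid tier
os.environ["ANTHROPIC_DEFAULT_OPUS_MODEL"] = "deepseek-v4-pro"     # top tier
os.environ["ANTHROPIC_BASE_URL"] = "https://api.deepseek.com"
```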

Quick Specs Reference

| Spec | V4-Flash | V4-Pro |
|---|---|---|
| Input price | $0.07 / 1M tokens | $0.87 / 1M tokens |
| Output price | $0.28 / 1M tokens | $3.48 / 1M tokens |
| Cached input | ~$0.004 / 1M tokens | ~$0.044 / 1M tokens |
| Context window | 1M tokens | 1M tokens |
| MRCR recall at 1M (published) | — | 83.5% |
| Active parameters | 13B | 49B |
| Total parameters | 284B | 1.6T |
| Endpoint | api.deepseek.com | api.deepseek.com |
| SDK compatibility | Anthropic, OpenAI, LangChain | Anthropic, OpenAI, LangChain |

FAQ

Is DeepSeek V4 actually 107x cheaper than GPT-5.5?

At the V4-Flash output rate. GPT-5.5 output is $30 per million tokens. V4-Flash output is $0.28 per million tokens. That is a 107x difference. V4-Pro at $3.48 is about 8.6x cheaper than GPT-5.5. The 107x figure applies when you are comparing the cheapest V4 variant to the most expensive widely-used frontier model.

Does the 1M context window actually work, or is it a marketing number?

Yes. V4-Pro publishes 83.5% on the MRCR benchmark (Multi-Record Contextual Retrieval) at 1M tokens. That is the highest MRCR score of any frontier model with a published 1M number. GPT-5.5 comes in at 74% in the same 512K-1M bucket. Claude Opus 4.7 drops to 32.2% at 1M per pages 195-196 of Anthropic's system card. Cite the MRCR number when someone asks if the context window is real. The 97% figure quoted in some coverage is from the Engram research paper on a 27B test model, not the production V4-Pro.

Will my existing Claude or OpenAI code work with DeepSeek V4?

Yes. DeepSeek V4 exposes an Anthropic-compatible endpoint. If you are using the Anthropic SDK, change base_url to https://api.deepseek.com and set the model to deepseek-v4-pro or deepseek-v4-flash. If you are using the OpenAI SDK or LangChain, the same base URL change applies. No other code changes are required.

When should I use Flash versus Pro?

Flash is the right starting point for most workloads. It handles classification, extraction, summarization, tool-result parsing, and high-throughput pipelines well. Move to Pro when a task requires sustained multi-step reasoning, produces unreliable output at Flash, or involves complex code generation on hard problems. The 12x price gap means Flash gives you a lot of retries before Pro becomes the cheaper option per correct output.

How does cache pricing work in practice?

After the first API call, any tokens that appear identically at the start of the context are treated as cache hits on subsequent calls. System prompts, static document context, and fixed reference data all cache well. The effective rate on cached tokens is about 95% off the base input price. For an agent that runs 100 iterations with the same 10K-token system prompt, only the first call pays full price for those 10K tokens. The other 99 calls pay cache rates.
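That 100-iteration example works out like this, using the V4-Flash input rate from the spec table and the ~95% cache discount:

```python
# 100 agent iterations, same 10K-token system prompt, V4-Flash input rates.
base_rate = 0.07 / 1e6          # $ per input token at $0.07/M
cached_rate = base_rate * 0.05  # ~95% off when cached
prompt_tokens = 10_000
iterations = 100

first_call = prompt_tokens * base_rate
later_calls = (iterations - 1) * prompt_tokens * cached_rate
with_cache = first_call + later_calls
without_cache = iterations * prompt_tokens * base_rate

print(f"without cache: ${without_cache:.4f}, with cache: ${with_cache:.4f}")  # ~94% saved
```

The absolute numbers are tiny at Flash rates, but the same ratio applies to Pro-priced prompts and to much larger static contexts, where the first-call cost is no longer negligible.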

Related Pages

  • Model selection guide for per-task model routing inside Claude Code
  • All Claude Models for the complete Anthropic model timeline
  • Claude Opus 4.7 vs GPT-5.5 for the current frontier comparison
  • Kimi K2.6 for another cost-efficient alternative with MoE architecture
