DeepSeek R2 is a 32-billion-parameter open-weight reasoning model released in April 2026. It scores 92.7% on AIME 2025, runs on a single 24 GB consumer GPU, and undercuts Western frontier reasoning APIs by roughly 70% on token cost — a dramatic pivot from the long-rumored 1.2T MoE design that had been delayed for nearly a year.
What is DeepSeek R2?
DeepSeek R2 is the second generation of DeepSeek’s reasoning-first model line. Where R1 (January 2025) was a 671-billion-parameter Mixture-of-Experts behemoth, R2 ships as a 32B dense transformer released under MIT license — small enough to fit on a single RTX 4090 or A6000, big enough to clear 92% on the hardest publicly graded math benchmark in current use.
If you’ve been following the saga, this is not the release the AI press was forecasting. Throughout 2025, leaks suggested R2 would be a 1.2-trillion-parameter MoE model trained on Huawei Ascend chips, then on Nvidia hardware after Ascend stability problems forced a pivot. The final model that actually shipped is something quite different: a small, dense, locally-runnable model that beats the rumored monster on the only thing most users care about — quality per dollar.
Why does the size drop from 1.2T to 32B matter?
It matters because it inverts the assumption that drove the entire post-GPT-4 era: that frontier reasoning requires hundreds of billions of activated parameters. R2 puts most of its intelligence into post-training — specifically a refined version of the GRPO reinforcement-learning pipeline DeepSeek introduced with R1 — rather than into raw scale.
The practical consequences for developers are immediate:
| Property | DeepSeek R1 (Jan 2025) | DeepSeek R2 (Apr 2026) |
|---|---|---|
| Architecture | 671B MoE (37B active) | 32B dense |
| License | MIT | MIT |
| AIME 2025 | ~74% (independent) | 92.7% (announced) |
| Local hardware floor | 8× H100 cluster | 1× RTX 4090 (24 GB) |
| API price vs. Western frontier (at release) | ~25× cheaper | ~70% cheaper than GPT-5 / Claude 4.6 |
| Context window | 128K | 128K |
A 92.7% AIME 2025 score is not a casual benchmark. AIME — the American Invitational Mathematics Examination — is the qualifying exam for the USA Mathematical Olympiad. A score of 92.7% means R2 correctly answers roughly 14 of the 15 problems, each of which demands multi-step symbolic reasoning. For comparison, the original R1 hovered around 74% on the same benchmark in independent evaluations, and GPT-5’s reported scores sit in a similar range without tool use.
⚠️ A note on benchmark inflation. Vendor-reported AIME scores have historically run several points higher than independent evaluations (Vals, Artificial Analysis, MathArena). DeepSeek’s own R1 result of ~89% dropped to ~74% under stricter pass/fail conditions. Treat the 92.7% figure as the upper bound until third-party harnesses publish their own runs.
How did DeepSeek get 32B parameters to reason like a 671B model?
The short answer: distillation plus a much longer reinforcement-learning post-training phase. The longer answer involves three techniques DeepSeek has been refining since R1.
1. Reasoning distillation from a larger teacher
DeepSeek used the full R1 (and likely DeepSeek-V3.2-Speciale, the IMO-gold-medal variant from late 2025) as a teacher model. The teacher generates millions of long chain-of-thought traces for math, code, and logic problems; the 32B student is then fine-tuned on those traces. This is the same playbook that produced the original R1-Distill-Qwen-32B back in January 2025 — but with eighteen months of accumulated technique on top.
If you want the full mechanics of this process, our explainer on LoRA fine-tuning covers the parameter-efficient side, and our piece on deep learning fundamentals walks through why distillation transfers reasoning patterns more efficiently than retraining from scratch.
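In outline, the data side of that playbook is simple: each teacher trace becomes one supervised fine-tuning record for the student. The sketch below is illustrative only — the trace format, the `<think>` delimiters, and the helper name are assumptions, not DeepSeek's published pipeline:

```python
# Sketch of turning one teacher chain-of-thought trace into an SFT record
# for the 32B student. Format details are assumptions for illustration.

def render_sft_record(problem: str, teacher_trace: str, final_answer: str) -> dict:
    """Pack a teacher reasoning trace into a chat-style training example."""
    return {
        "messages": [
            {"role": "user", "content": problem},
            # The student is trained to reproduce the full reasoning trace,
            # not just the final answer.
            {"role": "assistant",
             "content": f"<think>\n{teacher_trace}\n</think>\n\n{final_answer}"},
        ]
    }

record = render_sft_record(
    problem="Find all x with x^2 - 4x + 3 = 0.",
    teacher_trace="Factor: x^2 - 4x + 3 = (x - 1)(x - 3), so x = 1 or x = 3.",
    final_answer="\\boxed{x \\in \\{1, 3\\}}",
)
print(record["messages"][1]["content"])
```

Millions of such records, filtered for correct final answers, form the distillation corpus the student is fine-tuned on.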
2. GRPO with self-verification
Group Relative Policy Optimization (GRPO) was DeepSeek’s original RL contribution: instead of training a separate value network, you sample a group of responses to the same prompt, score them against a verifier, and update the policy toward the higher-scoring members of the group. R2 layers self-verification on top — the model is trained to check its own intermediate reasoning steps before committing to a final answer, the same trick that pushed DeepSeekMath V2 to gold-medal performance in late 2025.
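The group-relative scoring at the heart of GRPO reduces to a few lines. This is a simplified sketch of the advantage computation only (real GRPO also applies a clipped policy-gradient update and a KL penalty, omitted here):

```python
# Minimal sketch of the GRPO advantage: each sampled response is scored
# relative to its own group, so no separate value network is needed.
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """advantage_i = (r_i - mean(group)) / std(group)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to the same prompt, scored by a verifier (1 = correct).
rewards = [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))  # correct answers get ~+1, wrong get ~-1
```

The policy is then nudged toward the above-average members of each group, which is what lets a binary pass/fail verifier train long reasoning chains.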
3. Dense, not sparse
This is the structural choice that makes everything else viable. A dense 32B transformer activates every parameter on every token, which means no expert-routing overhead, no load-balancing tricks, and — critically — no need for the 8-GPU minimum that MoE inference imposes. The trade-off is that you cannot scale dense architectures to trillion-parameter sizes the way MoE can. For a model that’s deliberately small, that ceiling doesn’t matter. Our transformer architecture explainer goes deeper on why dense vs. MoE is the central design fork in modern LLM engineering.
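The memory-versus-compute trade is easy to see in rough numbers. The sketch below uses the standard ≈2·N FLOPs-per-token rule of thumb for a forward pass and the parameter counts from the comparison table; it is a back-of-the-envelope approximation, not a profiled measurement:

```python
# Back-of-the-envelope per-token compute: roughly 2 FLOPs per *active*
# parameter per token for a forward pass.
def flops_per_token(active_params_b: float) -> float:
    """Approximate forward-pass compute per token, in teraFLOPs."""
    return 2 * active_params_b * 1e9 / 1e12

dense_r2 = flops_per_token(32)  # all 32B parameters fire on every token
moe_r1 = flops_per_token(37)    # 671B total, but only 37B active per token

print(f"R2 dense: {dense_r2:.3f} TFLOPs/token")
print(f"R1 MoE:   {moe_r1:.3f} TFLOPs/token")
# Per-token compute is similar; the difference is that the MoE model still
# needs all 671B parameters resident in memory, hence the multi-GPU floor.
```

Per-token compute is nearly identical; what the dense design buys is the memory footprint, which is what collapses the hardware floor from a cluster to a single card.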
How much does DeepSeek R2 actually cost to run?
Two paths, two cost structures.
Via the official API, R2 sits at roughly 70% below the blended price of GPT-5 and Claude 4.6 for equivalent reasoning workloads. DeepSeek has historically priced its reasoning models at $0.45–$0.55 per million input tokens and $2.00–$2.20 per million output tokens — and the R2 launch keeps that envelope. By contrast, frontier Western reasoning APIs sit closer to $3 / $15 per million tokens for the highest tiers. For an agentic workflow that burns 20 million tokens per day, that’s the difference between a $40/day bill and a $250/day bill.
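The arithmetic behind that comparison is worth making explicit. The input/output split below is an assumption (the text gives only the 20M-token daily total); the per-million prices are the figures quoted above, so the exact totals differ slightly from the rounded $40/$250:

```python
# Daily API cost for a reasoning workload, token volumes in millions.
def daily_cost(input_m: float, output_m: float,
               price_in: float, price_out: float) -> float:
    """Cost in dollars: input and output tokens priced per million."""
    return input_m * price_in + output_m * price_out

# Assume the 20M daily tokens split 5M input / 15M output.
deepseek = daily_cost(5, 15, price_in=0.50, price_out=2.10)
frontier = daily_cost(5, 15, price_in=3.00, price_out=15.00)

print(f"DeepSeek R2: ${deepseek:.2f}/day")  # ~$34/day
print(f"Frontier:    ${frontier:.2f}/day")  # ~$240/day
```

Shift the split toward output-heavy agentic traffic and the gap widens further, since output pricing is where the two tiers diverge most.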
Self-hosted, the 32B dense architecture changes the economics entirely. At INT4 quantization, R2 runs on a single RTX 4090 (24 GB VRAM, roughly $1,800 used) at 30–40 tokens per second. That hardware floor is the point below which API economics stop making sense for high-volume, latency-tolerant workloads. Until R2, the strongest model at that floor was the 14B distilled R1 — capable but visibly weaker on hard reasoning. Now it’s a model that scores in the low 90s on AIME.
```python
# Run DeepSeek R2 locally via Ollama (assumes 24 GB+ VRAM)
# ollama pull deepseek-r2:32b-q4
from ollama import chat

response = chat(
    model="deepseek-r2:32b-q4",
    messages=[{
        "role": "user",
        "content": (
            "A particle moves along the curve y = x^3 - 6x^2 + 9x. "
            "Find the values of x where the tangent line is horizontal. "
            "Reason step by step, then give the final answer in \\boxed{}."
        ),
    }],
    options={"temperature": 0.6, "top_p": 0.95},
)
print(response.message.content)
```
Note the temperature=0.6 and the explicit “reason step by step” instruction — both inherited from DeepSeek’s recommended R1 settings. R2 still benefits from this scaffolding; without it, the model occasionally truncates its own chain of thought.
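One nice property of the example prompt is that it is trivially checkable: a horizontal tangent requires y' = 3x^2 - 12x + 9 = 0, which factors as 3(x - 1)(x - 3), so a correct answer must contain x = 1 and x = 3. A small verifier along those lines (the string check is deliberately crude):

```python
# Verifier for the example prompt: the tangent to y = x^3 - 6x^2 + 9x is
# horizontal where the derivative y' = 3x^2 - 12x + 9 vanishes.
def horizontal_tangent_xs() -> list[float]:
    """Solve 3x^2 - 12x + 9 = 0 via the quadratic formula."""
    a, b, c = 3.0, -12.0, 9.0
    disc = (b * b - 4 * a * c) ** 0.5
    return sorted([(-b - disc) / (2 * a), (-b + disc) / (2 * a)])

expected = horizontal_tangent_xs()
print(expected)  # [1.0, 3.0]

def answer_is_correct(model_output: str) -> bool:
    """Crude string check against the model's final answer."""
    return all(str(int(x)) in model_output for x in expected)
```

This kind of programmatic verifier is exactly what makes math prompts good smoke tests for a new reasoning model: you can grade hundreds of outputs without reading a single chain of thought.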
How does R2 compare to GPT-5, Claude 4.6, and Gemini 3.1 Pro?
The honest answer is: nobody knows yet, because R2 has been public for less than a week and the standardized benchmark harnesses (Artificial Analysis, MathArena, LMArena) have not finished their independent runs. What we can say with confidence based on the announced numbers and DeepSeek’s track record:
| Capability | R2 position |
|---|---|
| Pure math (AIME, HMMT, MATH-500) | Likely competitive with GPT-5 and Claude 4.6 in non-tool-use mode |
| Competitive coding (LiveCodeBench, Codeforces) | Probably second tier — DeepSeek’s reasoning models have always trailed their own coder line |
| Long-context multi-hop reasoning | Weaker — distilled dense models tend to lose cross-document reasoning during compression |
| Tool use / agent workflows | Solid but not best-in-class; MCP integration works but lags behind purpose-built agent models |
| Multilingual quality (incl. Polish) | Strong in Chinese and English; mid-tier in Slavic languages compared to Gemini and Claude |
| Cost per useful token | Best in class by a wide margin |
The picture that emerges is the same one R1 painted in January 2025: R2 is not the absolute best at anything, but it is the cheapest entry point to “good enough” reasoning by a margin that nothing else in the open ecosystem comes close to matching.
What does R2 mean for the open-source AI ecosystem?
Three concrete shifts are already visible in the days since the release.
First, the economic case for closed reasoning APIs gets harder. If you’re building a product that does math, code review, or structured analysis at scale, R2 lets you move that workload off GPT-5 or Claude 4.6 and either onto DeepSeek’s API at 30% of the cost, or onto your own GPU at zero marginal cost. The quality gap is no longer big enough to justify the premium for most use cases.
Second, the geopolitics tighten. R2 was reportedly trained on Nvidia hardware after the Huawei Ascend pivot failed in 2025 — a quiet acknowledgment that domestic Chinese AI silicon is still 18–24 months behind. U.S. export-control hawks will use this to argue for tighter restrictions; Chinese policymakers will use the 92.7% AIME score to argue that the restrictions don’t work. Both arguments are partially correct.
Third, the distillation playbook becomes the dominant strategy. R2 is, structurally, a vindication of the idea that you train a giant teacher once, then distill it into many small specialized students. Expect every major lab to publish 7B / 14B / 32B distilled versions of their flagships within the next quarter, with reasoning quality that would have been unthinkable a year ago.
💡 Practical takeaway. If you maintain a production AI workload built around GPT-5 or Claude 4.6 reasoning, run a parallel evaluation against DeepSeek R2 this week. Use your real prompts, your real eval set, and your real cost budget. The decision to migrate (or not) will be obvious within a few hundred test cases — and if R2 holds up, the savings compound from day one.
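That parallel evaluation can be as simple as a loop over your eval set. Everything below is a placeholder skeleton: `call_model`, the graders, and the model names are yours to supply, not part of any DeepSeek tooling:

```python
# Placeholder skeleton for an A/B evaluation. `call_model(model, prompt)`
# is whatever client wrapper you already have; `graders[i]` grades the
# answer to `prompts[i]`. Model names are illustrative.
from collections import Counter

def run_parallel_eval(prompts, graders, call_model,
                      models=("deepseek-r2", "incumbent")):
    """Send every prompt to every model and tally how many answers pass."""
    passed = Counter()
    for prompt, grade in zip(prompts, graders):
        for model in models:
            if grade(call_model(model, prompt)):
                passed[model] += 1
    return passed

# Toy wiring so the skeleton runs end to end:
prompts = ["What is 2 + 2?"]
graders = [lambda out: "4" in out]
fake_call = lambda model, prompt: "The answer is 4."
print(run_parallel_eval(prompts, graders, fake_call))
```

Swap the fake caller for your real API clients, run your real prompts through both sides, and the pass-rate gap (or its absence) answers the migration question directly.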
What does this mean in practice for developers in 2026?
The deeper shift R2 reveals is that the dominant constraint on useful AI is no longer raw model capability — it’s the cost of running that capability against your actual data. We’ve crossed into an era where the relevant question is not “which model is smartest” but “which model is smart enough at a price I can sustain”. For most reasoning workloads in mid-2026, that question now has a 32-billion-parameter open-weight answer. The implication for anyone building AI agents or RAG pipelines is that the model layer is rapidly becoming a commodity — and the durable engineering value is moving up the stack into context engineering, retrieval quality, and evaluation infrastructure.
That’s not bad news. It means smaller teams with sharper engineering can compete on equal terms with labs that have ten thousand times the capital. R2 just made that competition meaningfully more affordable.
FAQ
Is DeepSeek R2 really only 32 billion parameters?
Yes — that’s the core surprise of the release. The 1.2T MoE model that leaked throughout 2025 was either scrapped, postponed, or repurposed as the teacher model used for distillation. The version DeepSeek actually shipped is a dense 32B transformer under MIT license.
Can I run DeepSeek R2 on a single GPU?
Yes. At 4-bit quantization (Q4_K_M in GGUF format), R2 uses approximately 20 GB of VRAM, which fits on an RTX 4090, RTX 3090, or A6000. Expect 30–40 tokens per second on consumer hardware, depending on context length.
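The arithmetic behind the ~20 GB figure is straightforward; the KV-cache allowance below is a rough assumption that grows with context length, not a measured number:

```python
# Sanity check on the "fits in 24 GB" claim: weight memory at 4-bit
# quantization plus a ballpark KV-cache allowance.
def int4_weight_gb(params_b: float) -> float:
    """Weights at 4 bits/param = 0.5 bytes/param, in GB."""
    return params_b * 1e9 * 0.5 / 1e9

weights = int4_weight_gb(32)  # 16.0 GB for 32B parameters
kv_cache_allowance = 4.0      # assumed; grows with context length
print(f"~{weights + kv_cache_allowance:.0f} GB total")  # inside 24 GB
```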
How much cheaper is DeepSeek R2 than GPT-5 or Claude 4.6?
Roughly 70% cheaper on a blended input-plus-output basis at the API level. For self-hosted deployments, the marginal cost per token after hardware amortization approaches zero, which is where R2 becomes structurally impossible to compete with for high-volume workloads.
Is the 92.7% AIME 2025 score reliable?
It’s the vendor-reported number, which has historically run a few points above independent evaluations. Wait for results from Artificial Analysis, MathArena, and Vals before treating it as definitive. Even at the more conservative ~85% an independent harness would likely produce, R2 would still be competitive with frontier closed models.
What’s the license, and can I use DeepSeek R2 commercially?
R2 ships under the MIT License, identical to R1. You can use it commercially, modify it, redistribute it, distill it into other models, and build products on top of it without paying royalties. The only requirement is preserving the copyright notice.
How does R2 compare to DeepSeek V3.2 and the V3.2-Speciale variant?
V3.2 is a general-purpose 671B MoE model focused on agentic workflows; V3.2-Speciale is its high-compute reasoning variant that achieved gold-medal scores at IMO and IOI 2025 under relaxed token budgets. R2 is the consumer-grade reasoning model of the family — smaller, cheaper, runnable locally, optimized for everyday math and code reasoning rather than competition-grade proof writing.
Will DeepSeek R2 replace R1 in production?
For most reasoning workloads, yes. R2 is smaller, cheaper, faster, and reportedly stronger on math benchmarks. The exceptions are workloads that depend on R1’s specific quirks (long-context multi-hop tasks where the 671B MoE still has an edge) or pipelines locked to R1 by tooling. New deployments should default to R2.
Bibliography & sources
- DeepSeek-AI — DeepSeek-V3.2 technical report (arXiv 2512.02556) · arxiv.org/html/2512.02556v1
- DeepSeek API documentation — V3.2 and Speciale release notes · api-docs.deepseek.com
- DeepSeek-AI — DeepSeek-R1 paper, “Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” (arXiv 2501.12948) · arxiv.org/abs/2501.12948
- Hugging Face — DeepSeek-R1-Distill-Qwen-32B model card · huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
- Artificial Analysis — DeepSeek V3.2 intelligence and pricing benchmarks · artificialanalysis.ai/models/deepseek-v3-2
- Wikipedia — DeepSeek (R2 release timeline and Huawei Ascend pivot) · en.wikipedia.org/wiki/DeepSeek