DeepSeek V4 on NVIDIA: Specs, GPU Needs & Cost

Last updated: June 2026 · By Ignacy Kwiecień, founder & editor-in-chief, DecodeTheFuture.org

DeepSeek V4 ships in two sizes, V4-Pro (~1.6T total / ~49B active MoE) and the smaller V4-Flash (284B / 13B active), both with a 1M-token context window and an MIT license. On NVIDIA Blackwell you reach it three ways: day-0 as a hosted NIM on build.nvidia.com, through TensorRT-LLM, or via the Dynamo dev build v1.3.0-deepseek-v4-dev.1 (pinned to TensorRT-LLM 1.3.0rc15.post1, with an example config on 4× GB300; flagged experimental, not for production). NVIDIA reports over 150 tokens/sec/user on GB200 NVL72, and DeepSeek’s own model card reports a ~73% drop in per-token FLOPs versus V3.2 in the 1M-token context setting. The FP8 weight footprint is large (~500GB, third-party estimate), so most teams start on the hosted endpoint.

V4-Pro 1.6T / 49B active · V4-Flash 284B / 13B active · 1M context · MIT · Dynamo + TensorRT-LLM · Day-0 NIM on Blackwell

What are DeepSeek V4-Pro and V4-Flash specs, side by side?

DeepSeek V4 is a two-model family, not one model, and that distinction is the single thing most spec recaps blur. V4-Pro is the flagship Mixture-of-Experts model at roughly 1.6 trillion total parameters with about 49B activated per token — only the selected experts fire on any given token, so it carries frontier-scale knowledge at a fraction of a dense 1.6T model’s per-step compute. V4-Flash is the lighter sibling at 284B total / 13B activated, built for cheaper, faster serving where you do not need Pro’s full reasoning depth.

Both members share a 1 million-token context window and ship under the permissive MIT License, which is unusually liberal for a model of this size — you can use it commercially without the bespoke community-license clauses some competitors attach. V4-Pro’s headline architectural move is a hybrid attention stack: Compressed Sparse Attention (CSA) plus Heavily Compressed Attention (HCA), the mechanism behind the ~90% cut in KV-cache memory versus V3.2 in the 1M-token context setting that DeepSeek reports on its model card.

Spec	DeepSeek V4-Pro	DeepSeek V4-Flash
Total / active parameters	~1.6T total / ~49B active (MoE)	284B total / 13B active (vendor-listed, treat as directional)
Context window	1,000,000 tokens	1,000,000 tokens
License	MIT	MIT
Attention design	Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA)	Compressed hybrid attention
Reasoning modes	Non-Think, Think High, Think Max	Non-Think, Think High, Think Max
Stored weight precision	FP4 + FP8 mixed (MoE experts FP4, other params FP8)	FP4 + FP8 mixed
NVIDIA serving format	Native MXFP4 (optimized with NVFP4 + CUDA kernels)	MXFP4 / NVFP4

One spec deserves a flag: the V4-Flash 284B / 13B figure is vendor-listed and we could not independently verify it against a stable primary model card at the time of writing, so treat it as directional rather than a confirmed audited number. The V4-Pro 1.6T / 49B figure is confirmed in NVIDIA’s own Dynamo release notes and is the safer one to plan around.

What reasoning modes does DeepSeek V4 have, and why do they matter for cost?

DeepSeek V4 exposes three reasoning modes — Non-Think, Think High and Think Max — and they are the most direct lever you have over your inference bill. Non-Think answers without an extended reasoning trace, which is cheapest and fastest. Think High allocates a moderate reasoning budget. Think Max unlocks the deepest chain and is where V4-Pro posts its strongest benchmark numbers.

Those benchmarks are V4-Pro’s selling point, and they come from the model card in Max reasoning mode: 80.6 Resolved on SWE-bench Verified (a real-world coding-agent benchmark), 93.5 Pass@1 on LiveCodeBench, and 87.5 EM on MMLU-Pro. The practical takeaway for a buyer: you only pay for Think Max’s token cost when you actually need that depth. For high-volume agent loops, defaulting to Non-Think and escalating to Think Max only on hard turns is the same cost discipline you would apply to retries — budget reasoning per call, not globally.
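The per-call budgeting idea can be sketched as a small routing helper. This is an illustration under stated assumptions, not NVIDIA’s or DeepSeek’s API: the `reasoning_mode` field and the `think_high` value are the ones used in this article’s hosted-endpoint example, while `non_think`, `think_max`, the difficulty labels and both function names are hypothetical and should be checked against the live model card.

```shell
#!/usr/bin/env bash
# Sketch: choose a reasoning mode per call instead of globally.
# ASSUMPTIONS: "non_think" and "think_max" are guessed by analogy with the
# "think_high" value in this article's curl example; the easy/medium/hard
# labels and the function names are hypothetical.

pick_reasoning_mode() {
  # $1 = caller-supplied difficulty label for this turn
  case "$1" in
    hard)   echo "think_max" ;;   # deepest chain, highest token cost
    medium) echo "think_high" ;;  # moderate reasoning budget
    *)      echo "non_think" ;;   # default to the cheapest mode
  esac
}

build_payload() {
  # $1 = difficulty label, $2 = user prompt
  # No JSON escaping is done here, so the prompt must not contain
  # double quotes or backslashes.
  local mode
  mode="$(pick_reasoning_mode "$1")"
  printf '{"model":"deepseek-ai/deepseek-v4-pro","messages":[{"role":"user","content":"%s"}],"reasoning_mode":"%s"}\n' "$2" "$mode"
}

build_payload easy "Rename this variable."
build_payload hard "Find the race condition in this scheduler."
```

An agent loop would call `build_payload` with `easy` by default and pass `hard` only for turns that failed or were flagged as difficult, so the Think Max token cost is paid on those turns alone.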

Where the V4 cost story actually lives

DeepSeek’s model card reports that V4 uses ~73% fewer per-token inference FLOPs than V3.2 and cuts KV-cache memory by ~90%, both measured in the 1M-token context setting (vendor-reported). The saving is largest at long context and should not be read as a flat discount across all context lengths. Together with the reasoning-mode dial, this is the architecture-level reason V4 can be cheaper to serve than its parameter count suggests: each token costs fewer FLOPs and less KV cache by design, independent of any setting you toggle.

What GPUs and VRAM does DeepSeek V4 need on NVIDIA Blackwell?

There is no consumer-GPU or single-card path to V4-Pro; it is a multi-GPU, datacenter-class deployment. The hardest number to pin down across search results is VRAM, because sources quote wildly different figures depending on precision. A widely cited third-party estimate puts V4-Pro’s weights near ~500GB at FP8 and ~1TB at BF16 (Spheron’s deployment guide; treat as a directional estimate, not an NVIDIA-audited figure). Either way you are well past a single 80GB card, which is why NVIDIA’s own reference points are whole nodes, not chips.

The serving format matters as much as raw weight size, and it helps to separate two things. DeepSeek’s model card describes the model’s own stored weight precision as FP4 + FP8 mixed (MoE experts in FP4, the remaining parameters in FP8). NVIDIA then serves the model in native MXFP4 on Blackwell, further tuned with NVFP4 and custom CUDA kernels. The OCP MXFP4 serving format is NVIDIA’s runtime representation and is not identical to DeepSeek’s storage precision. Low-precision formats shrink both the weight and activation footprint, which is what lets the model fit on fewer GPUs and stretch toward its 1M context. The deeper mechanics of why a low-precision KV cache buys you context headroom are their own subject; if you want that, see our explainer on vector quantization and the KV cache rather than re-deriving it here.

Precision	V4-Pro weight footprint	What it means for hardware
BF16	~1TB (third-party estimate)	Max fidelity; needs the largest multi-node configs
FP8	~500GB (third-party estimate)	Roughly halves the BF16 footprint; still multi-GPU
MXFP4 / NVFP4 (NVIDIA serving format)	Smaller still	NVIDIA’s recommended path on Blackwell; best throughput-per-GPU and context headroom

For a concrete NVIDIA-confirmed reference, the Dynamo dev build that serves V4-Pro ships an example configuration on 4× GB300. That is a Blackwell-generation node, and it is the closest thing to an official “this is what it runs on” number currently published. The FP8/BF16 footprint figures above come from third-party guides, so use them to sanity-check your own hardware, not to size a final budget.
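For that kind of sanity check, a rough sketch is enough. It shows two pieces of arithmetic: a naive parameters-times-bytes footprint, and a weights-only GPU count for a given per-GPU memory. The function names are hypothetical, and the sketch ignores KV cache, activations, runtime overhead and any mixed-precision layout, so it gives a lower bound on card count rather than a sizing.

```shell
#!/usr/bin/env bash
# Sketch: back-of-envelope weight sizing. Weights only; KV cache,
# activations and runtime overhead are NOT included.
# Function names are hypothetical.

weight_footprint_gb() {
  # $1 = parameter count in billions, $2 = bytes per parameter
  # (billions of params) * (bytes/param) = gigabytes (decimal GB)
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b }'
}

gpus_needed() {
  # $1 = weight footprint in GB, $2 = memory per GPU in GB
  # Rounds up, because a partial GPU still means one more card.
  awk -v f="$1" -v m="$2" 'BEGIN { n = int(f / m); if (n * m < f) n++; print n }'
}

weight_footprint_gb 1600 1   # ~1.6T params at 1 byte/param   -> 1600
weight_footprint_gb 1600 0.5 # ~1.6T params at 0.5 byte/param -> 800
gpus_needed 500 80           # third-party FP8 estimate on 80GB cards   -> 7
gpus_needed 1000 80          # third-party BF16 estimate on 80GB cards  -> 13
```

Note that the naive product for ~1.6T parameters (1,600GB at one byte per parameter) comes out larger than the ~500GB third-party FP8 estimate. The sketch cannot tell you which is closer to the real checkpoint, which is one more reason to treat the third-party figures as directional and to measure the downloaded weights yourself.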

How do you run DeepSeek V4 with Dynamo and TensorRT-LLM?

The shipped path to self-serve V4-Pro on NVIDIA is the Dynamo dev build v1.3.0-deepseek-v4-dev.1, released June 6, 2026, with its TensorRT-LLM backend pinned to 1.3.0rc15.post1. That pin is not optional flavour text — the V4-Pro attention kernels (CSA/HCA) need that exact TensorRT-LLM build to serve correctly, which is the gotcha most third-party guides skip. The same June 2026 Dynamo wave shipped day-0 dev builds for MiniMax-M3, Kimi K2.6 and Nemotron-3 Super alongside DeepSeek-V4-Pro.

Bash – serve DeepSeek-V4-Pro via the Dynamo dev build

# Dynamo dev build for DeepSeek-V4-Pro (June 6, 2026)
# Backend is pinned to TensorRT-LLM 1.3.0rc15.post1 - the
# CSA/HCA attention kernels need this exact build to serve.
# NVIDIA's example config targets 4x GB300 (Blackwell).
#
# WARNING: this is an EXPERIMENTAL dev build, NOT for production.

git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
git checkout v1.3.0-deepseek-v4-dev.1

# TensorRT-LLM backend pin (required):
#   tensorrt-llm==1.3.0rc15.post1

# Launch the Dynamo serving graph on a 4x GB300 node
# (see release notes for the deepseek-v4-pro example config)
dynamo serve graphs.deepseek_v4_pro:Frontend \
  --tensor-parallel-size 4 \
  --enable-think-modes  # Non-Think / Think High / Think Max
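
Because the pin has to match exactly, a guard before launch can catch an environment that has drifted. The sketch below is a plain string comparison. The function name `check_trtllm_pin` is hypothetical, and reading the installed version through `tensorrt_llm.__version__` assumes the Python package is importable in the serving environment.

```shell
#!/usr/bin/env bash
# Sketch: refuse to launch if the installed TensorRT-LLM version differs
# from the pin named in the release notes. Function name is hypothetical.

check_trtllm_pin() {
  # $1 = installed version string, $2 = required pin
  if [ "$1" = "$2" ]; then
    echo "ok"
  else
    echo "mismatch: have $1, need $2"
    return 1
  fi
}

# Usage (assumes the tensorrt_llm Python package is importable):
#   installed="$(python3 -c 'import tensorrt_llm; print(tensorrt_llm.__version__)')"
#   check_trtllm_pin "$installed" "1.3.0rc15.post1" || exit 1
```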

If you do not need a full Dynamo serving graph, you can work with TensorRT-LLM directly: its recent releases carry the V4 attention kernels and AutoConfig support, which makes it the lower-level building block under both Dynamo and the NIM container. Comparing runtimes (vLLM, TensorRT-LLM, SGLang) is a topic in its own right and we do not benchmark them against each other here. For V4-Pro specifically, the NVIDIA-blessed path is TensorRT-LLM, because that is where the CSA/HCA kernels live.

Production gotcha most guides miss

The Dynamo V4 image is explicitly an experimental dev build, not for production. If you need a stable, supported V4 endpoint today, the day-0 NIM container or a hosted endpoint on build.nvidia.com is the safer path — the dev build is for evaluation and kernel testing, not for putting a customer-facing agent behind it.

How do you access DeepSeek V4 day-0 as an NVIDIA NIM?

DeepSeek V4 is available day-0 as an NVIDIA NIM container and through GPU-accelerated endpoints on build.nvidia.com, which is the fastest way to call the model without owning a Blackwell node. The NIM packages V4 with NVIDIA’s optimized serving stack — the MXFP4/NVFP4 kernels, TensorRT-LLM backend and the reasoning-mode controls — behind an OpenAI-compatible API. If NIM is new to you, our NVIDIA NIM API explained primer covers how the microservice works internally, and you can grab a key via the free NVIDIA API key walkthrough before you call it.

Bash – call DeepSeek V4 via the build.nvidia.com endpoint

# Day-0 hosted endpoint on build.nvidia.com (OpenAI-compatible).
# No GPUs to provision - pay per token, escalate reasoning per call.
curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/deepseek-v4-pro",
    "messages": [{"role":"user","content":"Plan a 3-step refactor."}],
    "reasoning_mode": "think_high"
  }'

The hosted route is where most teams should start. It sidesteps the FP8 ~500GB footprint, the 4× GB300 floor and the experimental-dev-build caveat entirely — you validate V4’s quality on your own prompts first, measure real tokens-per-turn, and only then decide whether a dedicated node is worth it. For a broader view of how NVIDIA’s hosted offering stacks up against managed substitutes, our best inference APIs 2026 roundup and the NVIDIA NIM alternatives guide map the wider market.

When does self-hosting DeepSeek V4 beat the hosted endpoint?

The hosted NIM endpoint wins until your V4 token volume is high and sustained enough to keep a 4-plus-GPU Blackwell node near full utilization 24/7. Below that line you are paying for idle GB300s; above it, a busy node amortizes the hardware and the per-token economics flip in your favour. NVIDIA frames the economics with two vendor-reported figures: per-user throughput of over 150 tokens/sec/user on GB200 NVL72, and 30× better performance per watt on Blackwell than on H200. Those numbers describe a busy, well-utilized node, which is exactly the condition under which self-hosting makes sense.

The serving stack underneath this matured fast: NVIDIA Dynamo 1.0 entered production on March 16, 2026 and, per NVIDIA, boosts Blackwell inference performance by up to 7×. So the V4-Pro dev image sits on top of a serving layer that already has a stable 1.0 production lineage: the model integration is experimental, the underlying Dynamo platform is not. The diagram makes the decision boundary concrete.

Concretely: a developer evaluating V4 on a handful of prompts, or a product with bursty traffic, has no business standing up four GB300s and pinning a release-candidate TensorRT-LLM build. The day-0 NIM absorbs the spikes and you pay only for what you use. Self-hosting earns its place when a fleet of agents runs continuously, when data residency forbids an external API, or when sustained latency control justifies dedicated hardware — and even then, V4-Pro’s serving image is still flagged experimental, so a production self-host means accepting that caveat. For the wider hosted-vs-managed cost picture, the inference APIs roundup works through the same break-even maths.
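The break-even described above reduces to two divisions, sketched here. Every number in the example calls is a hypothetical placeholder (node cost, hosted price and peak throughput are not taken from NVIDIA or any provider), and the function names are made up for illustration; substitute your own quotes and measured throughput.

```shell
#!/usr/bin/env bash
# Sketch: hosted-vs-self-hosted break-even. All inputs are placeholders
# you must replace with real quotes and measurements.

breakeven_tokens_per_hour() {
  # $1 = all-in node cost per hour (USD)
  # $2 = hosted price per million tokens (USD)
  # Tokens per hour at which the hosted bill equals the node cost.
  awk -v c="$1" -v p="$2" 'BEGIN { printf "%.0f\n", c / p * 1000000 }'
}

required_utilization_pct() {
  # $1 = break-even tokens per hour
  # $2 = node peak tokens per hour at full load (measure this yourself)
  awk -v b="$1" -v m="$2" 'BEGIN { printf "%.1f\n", b / m * 100 }'
}

# Hypothetical: $40/hour node, $2 per million hosted tokens.
breakeven_tokens_per_hour 40 2              # -> 20000000
# Hypothetical: node peaks at 50M tokens/hour.
required_utilization_pct 20000000 50000000  # -> 40.0
```

With these placeholder inputs the node would have to run above 40% of its peak throughput around the clock before self-hosting costs less than the hosted endpoint. A required utilization above 100% means the node cannot break even at that hosted price.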

What are the caveats before you commit to DeepSeek V4 in production?

Three things will shape a real deployment more than the spec sheet.

1. The V4-Pro serving image is experimental. The Dynamo v1.3.0-deepseek-v4-dev.1 build is explicitly a dev build, not for production. For anything customer-facing today, the day-0 NIM or hosted endpoint is the supported path; the dev build is for kernel testing and evaluation.

2. The benchmarks and economics are vendor-reported. The 80.6 SWE-bench, 93.5 LiveCodeBench and 87.5 MMLU-Pro scores are from the DeepSeek model card in Max mode; the >150 tok/s/user and 30× perf/watt figures are NVIDIA’s own, while the ~73% fewer FLOPs and ~90% KV-cache figures come from the DeepSeek model card, measured in the 1M-token context setting. Use them as directional benchmarks to justify a pilot, not audited guarantees to size a cluster. Run your own workload first.

3. Some specs are firmer than others. V4-Pro’s ~1.6T/49B and the Dynamo/TensorRT-LLM pins are confirmed in NVIDIA’s release notes. The V4-Flash 284B/13B figure and the FP8 ~500GB footprint are vendor-listed or third-party estimates — verify them against the live model card before you bet hardware on the exact numbers.

FAQ

What is the difference between DeepSeek V4-Pro and V4-Flash?

V4-Pro is the flagship Mixture-of-Experts model at roughly 1.6T total parameters with about 49B activated per token, built for deep reasoning and coding. V4-Flash is the lighter sibling at a vendor-listed 284B total / 13B activated, for cheaper and faster serving. Both share a 1M-token context window, an MIT license and the three reasoning modes. The V4-Flash parameter figure is vendor-listed and should be treated as directional until confirmed on the live model card.

What GPUs and how much VRAM does DeepSeek V4 need?

V4-Pro is a multi-GPU, datacenter-class deployment with no single-card path. DeepSeek stores the weights in FP4 + FP8 mixed precision (MoE experts FP4, other params FP8); a widely cited third-party estimate puts the footprint near ~500GB at FP8 and ~1TB at BF16, and NVIDIA’s native MXFP4 serving format on Blackwell shrinks the runtime footprint further. NVIDIA’s Dynamo dev build ships an example configuration on 4× GB300 (Blackwell), which is the closest official hardware reference. Treat the FP8/BF16 footprint numbers as directional third-party estimates, not NVIDIA-audited figures.

How do I run DeepSeek V4-Pro with Dynamo and TensorRT-LLM?

Use the Dynamo dev build v1.3.0-deepseek-v4-dev.1 (released June 6, 2026), with its TensorRT-LLM backend pinned to 1.3.0rc15.post1; that exact build carries the CSA/HCA attention kernels V4-Pro needs. NVIDIA’s example configuration targets a 4× GB300 node. The image is flagged as an experimental dev build and is not for production; for stable serving use the day-0 NIM container instead.

Is DeepSeek V4 available as an NVIDIA NIM?

Yes. DeepSeek V4 is available day-0 as an NVIDIA NIM container and through GPU-accelerated endpoints on build.nvidia.com. The NIM wraps the model with NVIDIA’s optimized MXFP4/NVFP4 serving stack and reasoning-mode controls behind an OpenAI-compatible API, so you can call it without owning a Blackwell node. This is the recommended starting path for evaluation and most production traffic.

What context window and reasoning modes does DeepSeek V4 support?

Both V4-Pro and V4-Flash support a 1 million-token context window. The family exposes three reasoning modes: Non-Think (cheapest and fastest, no extended trace), Think High (moderate reasoning budget) and Think Max (deepest chain, where V4-Pro posts its strongest benchmark scores). Defaulting to Non-Think and escalating to Think Max only on hard turns is the main lever for controlling inference cost.

How fast and efficient is DeepSeek V4 on NVIDIA Blackwell?

NVIDIA reports over 150 tokens/sec/user on GB200 NVL72 and 30× better performance per watt on Blackwell than on H200. Separately, DeepSeek’s own model card reports a ~73% reduction in per-token inference FLOPs and a ~90% reduction in KV-cache memory burden versus V3.2, both measured in the 1M-token context setting (so the saving is largest at long context, not a flat figure). All of these are vendor-reported, so use them as directional benchmarks rather than audited guarantees, and validate against your own workload before sizing hardware.

Should I self-host DeepSeek V4 or use the hosted endpoint?

Use the day-0 hosted NIM on build.nvidia.com until your V4 token volume is high and sustained enough to keep a 4-plus-GPU Blackwell node near full utilization 24/7. Below that break-even the hosted endpoint is cheaper and simpler and avoids the experimental dev-build caveat; above it, a busy node amortizes the hardware. Data residency or latency requirements can override the pure cost maths.

Bibliography (11 sources)

Sources prioritise primary material: NVIDIA developer/newsroom publications, the DeepSeek-V4-Pro model card, NVIDIA build.nvidia.com model cards, and the ai-dynamo / TensorRT-LLM release notes. Vendor performance figures (>150 tok/s/user and 30× perf/watt from NVIDIA; ~73% fewer FLOPs and ~90% KV-cache from the DeepSeek model card, in the 1M-token context setting) are treated as vendor-reported unless independently audited. VRAM footprint figures are third-party directional estimates, not NVIDIA list specs. Links accessed June 2026.

DeepSeek-AI — DeepSeek-V4-Pro model card (Hugging Face) (specs, 1M context, MIT license, CSA/HCA attention, Max-mode benchmarks, FP4+FP8 mixed weight precision, ~27% of FLOPs and ~10% of KV cache vs V3.2 in the 1M-token context setting). huggingface.co/deepseek-ai/DeepSeek-V4-Pro
NVIDIA — DeepSeek-V4-Pro model card (build.nvidia.com) (hosted endpoint, day-0 availability). build.nvidia.com/deepseek-ai/deepseek-v4-pro/modelcard
NVIDIA — DeepSeek-V4-Flash model card (build.nvidia.com) (V4-Flash 284B/13B, treated as vendor-listed / directional). build.nvidia.com/deepseek-ai/deepseek-v4-flash/modelcard
NVIDIA Technical Blog — Build with DeepSeek V4 Using NVIDIA Blackwell and GPU-Accelerated Endpoints (Apr 24 2026; >150 tok/s/user, 30× perf/watt, native MXFP4 serving format, day-0 NIM). developer.nvidia.com/blog/build-with-deepseek-v4…
ai-dynamo/dynamo — Release v1.3.0-deepseek-v4-dev.1 (V4-Pro 1.6T/49B, reasoning modes, TensorRT-LLM 1.3.0rc15.post1 pin, 4× GB300, experimental). github.com/ai-dynamo/dynamo/releases/tag/v1.3.0-deepseek-v4-dev.1
ai-dynamo/dynamo — Releases (June 2026 dev builds: MiniMax-M3, Kimi K2.6, Nemotron-3 Super). github.com/ai-dynamo/dynamo/releases
NVIDIA Newsroom — NVIDIA Enters Production With Dynamo 1.0 (Mar 16 2026 production launch; up to 7× Blackwell speedup). nvidianews.nvidia.com/news/dynamo-1-0
NVIDIA Technical Blog — Introducing NVIDIA Dynamo, a Low-Latency Distributed Inference Framework (Dynamo architecture context). developer.nvidia.com/blog/introducing-nvidia-dynamo…
NVIDIA/TensorRT-LLM — Releases (DeepSeek V4 attention kernels, AutoConfig). github.com/NVIDIA/TensorRT-LLM/releases
NVIDIA Blog — Leading Inference Providers Achieve Lowest Token Cost With Open Models on Blackwell (inference economics context; metrics vendor-reported). blogs.nvidia.com/blog/inference-open-source-models-blackwell-reduce-cost-per-token
Spheron — Deploy DeepSeek V4 on GPU Cloud: MoE Inference with vLLM and Expert Parallelism (third-party FP8 ~500GB / BF16 ~1TB VRAM estimate, support only). spheron.network/blog/deploy-deepseek-v4-gpu-cloud
