NVIDIA Nemotron 3 Ultra: 550B Open Agent Model

Last updated: June 2026 · By Ignacy Kwiecień, founder & editor-in-chief, DecodeTheFuture.org

NVIDIA Nemotron 3 Ultra is an open 550B-parameter Mixture-of-Experts reasoning model (550B-A55B, 55B active per token) built for long-running agents, released June 4, 2026 under the permissive OpenMDW-1.1 license. Open weights do not mean cheap to run: the BF16 floor is 16× H100, 8× H200, or 8× Blackwell. For most teams a hosted API (build.nvidia.com NIM, OpenRouter, or third-party hosts at around $0.40–0.50 per million input tokens, provider-set; OpenRouter lists ~$0.50/M) wins until your agent’s token volume is high and sustained enough to keep a dedicated 8–16 GPU node near full utilization 24/7.

550B-A55B MoE · Up to 1M context · OpenMDW-1.1 open weights · 8× Blackwell / 16× H100 floor · Hosted ~$0.40–0.50/M input

What is Nemotron 3 Ultra and what are its exact specs?

NVIDIA Nemotron 3 Ultra is the largest open model in NVIDIA’s Nemotron 3 family (Nano, Super, Ultra), published June 4, 2026 as part of NVIDIA’s June agentic-AI push around Computex 2026. It is a 550B-total-parameter Mixture-of-Experts model with 55B active parameters per token — the “550B-A55B” naming you see on the model cards. Only the active experts fire on any given token, so it carries frontier-scale knowledge at a fraction of a dense 550B model’s compute per step.
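
To put the 550B-A55B naming in numbers, here is a minimal sketch. The "2 FLOPs per active parameter" figure is a common rule of thumb for a forward pass, not a number from NVIDIA, and it ignores attention and state-space costs; the function names are illustrative.

```python
# Rough per-token compute for a sparse MoE versus a dense model of equal size.
# Assumption: a forward pass costs ~2 FLOPs per *active* parameter.

TOTAL_PARAMS = 550e9   # 550B total, from the model naming
ACTIVE_PARAMS = 55e9   # 55B active per token


def active_fraction(total: float, active: float) -> float:
    """Share of parameters that take part in one token's forward pass."""
    return active / total


def approx_forward_flops(active_params: float) -> float:
    """Approximate forward FLOPs per token (rule of thumb: 2 x active params)."""
    return 2.0 * active_params


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active fraction: {frac:.0%}")
    print(f"MoE   ~{approx_forward_flops(ACTIVE_PARAMS):.2e} FLOPs/token")
    print(f"dense ~{approx_forward_flops(TOTAL_PARAMS):.2e} FLOPs/token")
```

Under that approximation the per-token compute is about a tenth of a dense 550B model's. The memory needed to hold the weights is not reduced: all 550B parameters still have to be resident.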

Architecturally it is a Hybrid Mamba-Attention design. NVIDIA combines Mamba state-space layers, which handle long-context sequences efficiently, with Transformer attention layers that preserve recall, then adds LatentMoE routing and Multi-Token Prediction (MTP) layers for speculative decoding. That hybrid is the engineering reason the model can hold very long agent transcripts without the quadratic attention cost a pure Transformer would pay. It was pre-trained on roughly 20T tokens, with a pre-training data cutoff of September 2025 and post-training cutoff of May 2026.

Spec	Nemotron 3 Ultra
Parameters	550B total / 55B active per token (MoE)
Architecture	Hybrid Mamba-Attention + LatentMoE + Multi-Token Prediction
Context window	BF16 default 262,144 tokens (256K); extendable to full 1M (env-var flag)
License	OpenMDW-1.1 (Linux Foundation permissive open weights)
Checkpoints	BF16, NVFP4 (quantized), Base-BF16
Released	June 4, 2026 (Nemotron 3 family)
Pre-training	~20T tokens; data cutoff Sept 2025 / post-train May 2026

The license matters for builders. OpenMDW-1.1 is the Linux Foundation’s Open Model, Data and Weights license — permissive enough for commercial use — and NVIDIA released not just weights but large portions of the pre-training and post-training datasets plus an end-to-end training recipe in its Nemotron Developer Repository. That is meaningfully more open than a weights-only drop.

How much faster and cheaper is Nemotron 3 Ultra for agents, really?

NVIDIA’s headline claim is up to 5× higher inference throughput versus other open models in its class, plus roughly 30% lower cost on agentic coding tasks such as SWE-bench Verified. Both numbers are real, both are vendor-reported, and both need unpacking before you build a budget on them.

On the throughput claim, NVIDIA’s research page cites a specific 8K-input / 64K-output configuration where Ultra reaches 5.9×, 4.8× and 1.6× higher throughput than GLM-5.1-754B-A40B, Kimi-K2.6-1T-A32B and Qwen-3.5-397B-A17B respectively, at on-par accuracy. Read that carefully: the multiplier swings from 5.9× to 1.6× depending on the competitor, so “up to 5×” describes the best cases, not the average. Treat it as a directional benchmark, not an audited guarantee.
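
A quick way to see how far the headline sits from the typical case is to summarize the three vendor-reported multipliers. This is only arithmetic on the numbers quoted above; the shortened model names and the `summarize` helper are illustrative.

```python
from statistics import geometric_mean, mean

# Vendor-reported throughput multipliers (8K input / 64K output), from the text.
SPEEDUPS = {"GLM-5.1": 5.9, "Kimi-K2.6": 4.8, "Qwen-3.5": 1.6}


def summarize(speedups: dict[str, float]) -> dict[str, float]:
    """Return the spread of a set of speedup multipliers."""
    values = list(speedups.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        # The geometric mean is the usual way to average ratios.
        "geomean": geometric_mean(values),
    }


if __name__ == "__main__":
    for name, value in summarize(SPEEDUPS).items():
        print(f"{name:8s} {value:.2f}x")
```

Both averages land well below the best case, which is why a budget should be sized against the competitor and configuration you actually care about.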

The “30% cheaper” figure is the one agent builders should care about most — and the one most likely to be misquoted. NVIDIA attributes it to using fewer total tokens and fewer tokens per turn to finish SWE-bench Verified, a coding-agent benchmark. It is task-specific. It is not a universal 30% cut on every workload, and it does not come from a lower per-token price. The saving comes from the model spending fewer reasoning tokens to reach the same answer.

Where the cost lever actually is

For long-running agents, the real cost driver is not the headline per-token price — it is tokens-per-turn. Nemotron 3 Ultra’s reasoning-budget controls (enable_thinking and a medium_effort flag, covered below) are where NVIDIA’s “30% cheaper” genuinely comes from. Tune those before you tune the GPU bill.
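
The tokens-per-turn argument can be sketched as simple arithmetic. The turn counts and token figures below are made-up placeholders, and blending input and output tokens under a single price is a simplification; only the ~$0.50/M rate comes from the text.

```python
def task_cost_usd(turns: int, tokens_per_turn: float, price_per_million: float) -> float:
    """Cost of one agent task: turns x tokens per turn x price per token."""
    return turns * tokens_per_turn * price_per_million / 1_000_000


def relative_saving(baseline: float, tuned: float) -> float:
    """Fractional saving of `tuned` relative to `baseline`."""
    return 1.0 - tuned / baseline


if __name__ == "__main__":
    price = 0.50  # USD per million tokens (third-party hosted rate from the text)
    # Hypothetical agent: 200 turns, 6,000 tokens per turn before tuning.
    baseline = task_cost_usd(200, 6_000, price)
    # Same agent, same price, 30% fewer tokens per turn.
    tuned = task_cost_usd(200, 4_200, price)
    print(f"baseline ${baseline:.2f}, tuned ${tuned:.2f}, "
          f"saving {relative_saving(baseline, tuned):.0%}")
```

The per-token price never changes in this example; the whole saving comes from the token count, which is the mechanism NVIDIA describes.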

What does Nemotron 3 Ultra’s 1M context actually require?

First, clear up a number that trips people up: 256K and 262K are the same window. Ultra’s BF16 checkpoint natively defaults to 262,144 tokens, which is exactly 256 × 1024 — “256K” and “262K” are two ways of writing one figure, and it is the same default NVIDIA’s docs cite for Ultra (Super defaults to the same 262,144). The full 1M-token ceiling is opt-in. The real gate on long context is not a context cap but KV-cache memory: the longer the window, the more GPU memory each request consumes.

To unlock the full window on vLLM you set an explicit environment flag, because the long context dramatically increases KV-cache memory:

Bash – enable the full 1M context on vLLM

# BF16 defaults to 262144 tokens (= 256K). To address the
# full 1M window, tell vLLM to allow a longer max model length:
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

vllm serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 \
  --max-model-len 1000000 \
  --tensor-parallel-size 8

The catch is memory. Long context is bounded by KV-cache size, which is exactly where the quantized NVFP4 checkpoint earns its keep: a smaller weight footprint leaves more GPU memory for the KV cache, which means more room for context and higher per-GPU throughput. The mechanics of why a low-precision KV cache lets you stretch the window are their own subject; for the deep version, see our explainer on vector quantization and the KV cache.
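
To make the memory pressure concrete, here is a rough sketch of the standard KV-cache sizing formula. The layer count, KV-head count and head dimension are hypothetical placeholders, not Nemotron 3 Ultra's real configuration (read those from the model's config before relying on any number). In a hybrid design only the attention layers grow a cache with context; state-space layers keep a fixed-size state.

```python
def kv_cache_bytes(tokens: int, attention_layers: int, kv_heads: int,
                   head_dim: int, bytes_per_element: int) -> int:
    """KV-cache size for one sequence: keys and values for every attention layer."""
    return 2 * attention_layers * kv_heads * head_dim * bytes_per_element * tokens


# HYPOTHETICAL architecture values, for illustration only.
ATTENTION_LAYERS = 12
KV_HEADS = 8
HEAD_DIM = 128

if __name__ == "__main__":
    for tokens in (262_144, 1_000_000):
        for label, width in (("16-bit cache", 2), ("8-bit cache", 1)):
            size = kv_cache_bytes(tokens, ATTENTION_LAYERS, KV_HEADS, HEAD_DIM, width)
            print(f"{tokens:>9,} tokens, {label}: {size / 2**30:.1f} GiB per sequence")
```

The cache scales linearly with both context length and element width, so going from the 262,144-token default to 1M roughly quadruples the per-request footprint, and halving the cache precision halves it again.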

What GPUs do you need to run Nemotron 3 Ultra, and which checkpoint?

This is the gating fact the “open weights!” headlines bury: you cannot self-host Nemotron 3 Ultra cheaply. The single-node BF16 hardware floor is one of the following: 8× GB200/B200/GB300/B300, or 16× H100, or 8× H200. There is no consumer-GPU path and no single-card path. “Open” describes the license, not the affordability.

Checkpoint	Footprint	Hardware fit	Pick it when…
BF16 (full)	Largest	8× Blackwell, 8× H200, or 16× H100	Max fidelity, you already own a big node
NVFP4 (quantized)	Much smaller	Runs on Ampere, Hopper and Blackwell; up to 5× throughput/GPU vs BF16 on Blackwell (vendor-reported)	You want longer context, lower GPU count, best throughput-per-dollar
Base-BF16	Largest	Same as BF16	You are doing your own post-training / fine-tuning

For nearly every self-hosting team, NVFP4 is the practical default. NVIDIA reports it runs across Hopper, Blackwell and even Ampere GPUs, and delivers up to 5× higher throughput per GPU than BF16 on Blackwell (again, vendor-reported). It is what makes the full 1M context realistic and what lowers the GPU count needed to serve the model. Pick BF16 only when you need maximum fidelity and already have the cluster; pick Base-BF16 only if you intend to fine-tune.
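
A back-of-the-envelope check shows why the floor sits where it does. This sketch assumes nominal per-GPU HBM capacities (80 GB for H100, 141 GB for H200, 192 GB for B200; usable memory is lower in practice) and treats NVFP4 as roughly 4.5 bits per parameter (4-bit values plus one 8-bit scale per 16-element block). Real checkpoints keep some tensors at higher precision, so treat the NVFP4 figure as a lower bound.

```python
# Nominal HBM per GPU in GB. Assumption: vendor spec-sheet capacities.
GPU_HBM_GB = {"H100": 80, "H200": 141, "B200": 192}


def weight_gb(params: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return params * bits_per_param / 8 / 1e9


def node_hbm_gb(gpu: str, count: int) -> int:
    """Aggregate HBM across a node of identical GPUs."""
    return GPU_HBM_GB[gpu] * count


if __name__ == "__main__":
    bf16 = weight_gb(550e9, 16)
    nvfp4 = weight_gb(550e9, 4.5)
    print(f"BF16 weights  ~{bf16:.0f} GB, NVFP4 weights ~{nvfp4:.0f} GB")
    for gpu, count in (("H100", 16), ("H200", 8), ("B200", 8)):
        total = node_hbm_gb(gpu, count)
        print(f"{count:>2}x {gpu}: {total} GB total, "
              f"{total - bf16:.0f} GB left after BF16 weights")
```

Whatever is left after the weights has to cover the KV cache, activations and runtime overhead, which is why the quantized checkpoint changes the picture so much for long context.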

How do you run Nemotron 3 Ultra: open weights, NIM, or hosted APIs?

There are three ways to reach the model, and they map almost exactly onto the three NIM usage modes. Open weights sit on HuggingFace in BF16, NVFP4 and Base-BF16 variants; you pull and serve them with vLLM yourself. The NIM microservice packages the model as a production-grade container with NVIDIA’s optimized serving stack; if you have never used NIM, start with our NVIDIA NIM API explained primer on how the microservice works internally. Hosted endpoints include build.nvidia.com plus third-party access via Perplexity, OpenRouter and DeepInfra (around $0.40–0.50 per million input tokens, provider-set; OpenRouter lists ~$0.50/M), with day-0 availability on Eigen AI, AWS, Google Cloud and Microsoft Foundry.

The fastest local path is open weights through vLLM:

Bash – pull and serve the open NVFP4 checkpoint

# 1. Pull the quantized open checkpoint from HuggingFace
huggingface-cli download nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4

# 2. Serve it with vLLM (OpenAI-compatible endpoint on :8000)
vllm serve nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 \
  --tensor-parallel-size 8 \
  --max-model-len 262144   # 262144 = 256K, the native default

# 3. Call it like any OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4",
    "messages": [{"role":"user","content":"Plan a 3-step refactor."}],
    "chat_template_kwargs": {"enable_thinking": true, "medium_effort": true}
  }'
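
The same request can be made from an agent loop in Python. This is a minimal sketch using only the standard library; it assumes the vLLM server from step 2 is listening on localhost:8000 and returns the usual OpenAI-style response shape. The helper names `build_payload` and `chat` are illustrative, not part of any SDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # local vLLM endpoint from step 2
MODEL = "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4"


def build_payload(prompt: str, enable_thinking: bool = True,
                  medium_effort: bool = True) -> dict:
    """Build the same JSON body as the curl example above."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {
            "enable_thinking": enable_thinking,
            "medium_effort": medium_effort,
        },
    }


def chat(prompt: str, **reasoning_flags: bool) -> str:
    """POST one chat completion and return the assistant's text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt, **reasoning_flags)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Usage (requires the server to be running):
#   print(chat("Plan a 3-step refactor.", enable_thinking=True, medium_effort=True))
```

Keeping the payload builder separate from the network call makes it easy to log or unit-test exactly which reasoning flags each turn sends.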

If you would rather not stand up GPUs at all, route to a hosted endpoint. The full hosted-API cost picture (free tier, ~40 RPM baseline, AI Enterprise pricing, the 90-day eval) is covered in our NVIDIA NIM API pricing and limits guide. For managed substitutes (OpenRouter, Together, Fireworks and friends), see NVIDIA NIM alternatives.

Worth noting on the agentic side: NVIDIA shipped Nemotron 3 Ultra alongside the open-source NeMo Agent Toolkit, a library for connecting and optimizing teams of AI agents. (Some early write-ups mention an “OpenShell runtime” — we could not verify that as an NVIDIA product in any primary source, so treat it with caution; the confirmed adjacent tool is the NeMo Agent Toolkit.)

When does self-hosting Nemotron 3 Ultra beat a hosted NIM API?

The decision rule is utilization, not idealism. A hosted API wins until your long-running-agent token volume is high and sustained enough to keep a dedicated 8–16 GPU node near full utilization 24/7. Below that line, you are paying for idle Blackwells; above it, self-hosting amortizes and the per-token economics flip in your favour.

The diagram makes the boundary concrete.

Concretely: a developer experimenting with one agent, or a product with bursty traffic, has no business buying eight Blackwells. The hosted endpoints — including third-party hosts at roughly $0.40–0.50/M input tokens or the build.nvidia.com NIM — absorb the spikes and you pay only for what you use. Self-hosting earns its place when a fleet of agents runs continuously, when data residency forbids an external API, or when sustained latency control justifies dedicated hardware. The break-even maths is the same one we worked through for hosted NIM in the pricing and limits guide: divide your fully-loaded hourly node cost by your steady tokens/hour and compare to the API’s per-token rate.
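
The break-even rule in the last sentence can be written down directly. The $40/hour node cost below is a placeholder, not a quote; substitute your own fully-loaded figure (hardware amortization, power, hosting, staff). The sketch also assumes one blended per-token API rate, whereas real providers price input and output tokens separately.

```python
def self_host_cost_per_million(hourly_node_cost_usd: float,
                               tokens_per_hour: float) -> float:
    """Effective self-hosted cost in USD per million tokens at a steady load."""
    return hourly_node_cost_usd / (tokens_per_hour / 1_000_000)


def breakeven_tokens_per_hour(hourly_node_cost_usd: float,
                              api_price_per_million: float) -> float:
    """Sustained tokens/hour above which self-hosting beats the API rate."""
    return hourly_node_cost_usd / api_price_per_million * 1_000_000


if __name__ == "__main__":
    node_cost = 40.0   # HYPOTHETICAL fully-loaded USD per hour for the node
    api_rate = 0.50    # USD per million tokens (hosted rate from the text)
    threshold = breakeven_tokens_per_hour(node_cost, api_rate)
    print(f"break-even: {threshold:,.0f} tokens/hour, sustained")
    for load in (10e6, 80e6, 200e6):
        cost = self_host_cost_per_million(node_cost, load)
        print(f"{load:>13,.0f} tokens/hour -> ${cost:.2f} per million self-hosted")
```

The node costs the same whether it is busy or idle, so the self-hosted per-token cost falls only as sustained load rises; bursty traffic never reaches the threshold.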

What are the limits and caveats for production agents?

Three things will shape a production deployment more than the spec sheet.

1. Reasoning-budget controls are the cost dial. The chat template exposes enable_thinking (True/False) and a medium_effort flag to cap reasoning-token consumption. For agents that loop hundreds of times, turning thinking off for cheap turns and capping effort on the rest is where you reclaim the “30% cheaper” NVIDIA advertises. This is a per-call decision, not a global setting — budget reasoning the way you budget retries.

2. The license is genuinely permissive, but verify your use. OpenMDW-1.1 is a Linux Foundation permissive open-model license; NVIDIA shipped weights plus data plus recipe. That is unusually open, but read the terms for your specific commercial scenario rather than assuming MIT-style freedom.

3. The benchmarks are vendor-reported. The 5× throughput and 30% SWE-bench cost figures come from NVIDIA’s own research and blog, measured in specific configurations against specific competitors. Use them as directional benchmarks to justify a pilot, not as audited guarantees to size a budget. Run your own workload before committing a cluster.
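
For the first point, a per-call reasoning policy might look like the sketch below. The turn categories and the mapping are hypothetical choices for illustration; only the `enable_thinking` and `medium_effort` flag names come from the chat template described above, and the assumption that `medium_effort` set to false means uncapped reasoning should be checked against the model card.

```python
from enum import Enum


class TurnKind(Enum):
    """Hypothetical categories of agent turns."""
    TOOL_ROUTING = "tool_routing"   # pick the next tool; cheap
    SUMMARIZE = "summarize"         # condense tool output; cheap
    PLAN = "plan"                   # multi-step planning; moderate
    DEBUG = "debug"                 # hard reasoning over failures; expensive


CHEAP_TURNS = {TurnKind.TOOL_ROUTING, TurnKind.SUMMARIZE}


def reasoning_kwargs(kind: TurnKind) -> dict:
    """Choose chat_template_kwargs for one call based on the kind of turn."""
    if kind in CHEAP_TURNS:
        # No reasoning tokens at all on cheap turns.
        return {"enable_thinking": False, "medium_effort": False}
    if kind is TurnKind.PLAN:
        # Reason, but with the capped budget.
        return {"enable_thinking": True, "medium_effort": True}
    # Everything else gets the full reasoning budget (assumed meaning).
    return {"enable_thinking": True, "medium_effort": False}


if __name__ == "__main__":
    for kind in TurnKind:
        print(f"{kind.value:13s} -> {reasoning_kwargs(kind)}")
```

The returned dictionary drops straight into the `chat_template_kwargs` field of each request, so the budget decision is made per call rather than once per deployment.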

Practical warning

“Open weights” is a licensing fact, not a cost promise. With a 16× H100 / 8× Blackwell floor, the wrong move is to self-host a low-traffic agent because the model is free to download. Validate quality on a hosted endpoint first, measure your real tokens/hour, and only then price a dedicated node.

FAQ

What is NVIDIA Nemotron 3 Ultra?

Nemotron 3 Ultra is NVIDIA’s largest open model in the Nemotron 3 family, released June 4, 2026. It is a 550B-total-parameter Mixture-of-Experts model with 55B active parameters per token (550B-A55B), built on a hybrid Mamba-Attention architecture for long-running agents, supporting up to 1M tokens of context and released under the permissive OpenMDW-1.1 license.

What is the context window of Nemotron 3 Ultra?

The model supports up to 1 million tokens, but the BF16 checkpoint defaults to 256K to keep memory manageable. You extend it to the full 1M on vLLM by setting the environment variable VLLM_ALLOW_LONG_MAX_MODEL_LEN=1. The default 256K is exactly 262,144 tokens — the same figure NVIDIA’s docs cite for Ultra. Both Super and Ultra default to 262,144 and support up to 1M.

What GPUs do I need to self-host Nemotron 3 Ultra?

The single-node BF16 floor is 8× GB200/B200/GB300/B300, or 16× H100, or 8× H200. There is no consumer or single-GPU path. The quantized NVFP4 checkpoint runs on Ampere, Hopper and Blackwell and lowers the GPU count needed, making it the practical default for most self-hosting teams.

Is Nemotron 3 Ultra cheaper to run than other open models?

NVIDIA reports up to 5× higher inference throughput and about 30% lower cost to complete the SWE-bench Verified benchmark versus other open frontier models, driven by fewer tokens per turn. These are vendor-reported, task-specific figures, not audited universal cost cuts. The real cost lever is the reasoning-budget controls plus GPU utilization, not the raw per-token price.

Should I self-host Nemotron 3 Ultra or use a hosted API?

Use a hosted API (build.nvidia.com NIM, OpenRouter, or DeepInfra at roughly $0.40–0.50/M input tokens; OpenRouter lists ~$0.50/M) until your agent’s token volume is high and sustained enough to keep a dedicated 8–16 GPU node near full utilization 24/7. Below that break-even, the hosted API is cheaper and simpler; above it, self-hosting amortizes the hardware. Data residency or latency needs can override the pure cost maths.

What license is Nemotron 3 Ultra released under?

OpenMDW-1.1, the Linux Foundation’s permissive Open Model, Data and Weights license. NVIDIA released open weights plus large portions of the pre-training and post-training datasets and an end-to-end training recipe. It is more open than a weights-only release, but verify the terms for your specific commercial use.

How do I control reasoning cost on Nemotron 3 Ultra?

The chat template exposes enable_thinking (True/False) and a medium_effort flag that caps reasoning-token consumption. For long-running agents that loop many times, disabling thinking on cheap turns and capping effort on the rest is where NVIDIA’s advertised ~30% cost saving actually comes from. Treat reasoning budget as a per-call decision.

Bibliography (10 sources)

Sources prioritise NVIDIA primary documentation: official research pages, the developer blog, HuggingFace model cards, and the build.nvidia.com model card. Vendor performance figures (5x throughput, ~30% SWE-bench cost saving) are treated as vendor-reported unless independently audited. Third-party API prices are provider-set, not NVIDIA list prices. Links accessed June 2026.

NVIDIA Newsroom — NVIDIA Debuts Nemotron 3 Family of Open Models (launch announcement, June 2026). nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models
NVIDIA Developer Blog — NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents (30% SWE-bench cost, NVFP4 throughput). developer.nvidia.com/blog/nvidia-nemotron-3-ultra…
NVIDIA Research — Nemotron 3 Ultra lab page (architecture, throughput benchmarks vs GLM/Kimi/Qwen). research.nvidia.com/labs/nemotron/Nemotron-3-Ultra
NVIDIA Research — Nemotron 3 Ultra Technical Report (PDF). research.nvidia.com/…/NVIDIA-Nemotron-3-Ultra-Technical-Report.pdf
HuggingFace — NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 model card (specs, 256K default, 1M flag, GPU floor, reasoning controls, license). huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
HuggingFace — NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4 model card (quantized checkpoint, GPU support). huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4
NVIDIA — build.nvidia.com Nemotron 3 Ultra model card (hosted endpoint, third-party availability). build.nvidia.com/nvidia/nemotron-3-ultra-550b-a55b/modelcard
NVIDIA — NeMo Agent Toolkit (open-source library for connecting and optimizing AI agents). github.com/NVIDIA/NeMo-Agent-Toolkit
NVIDIA — Nemotron developer hub (family overview, repository, recipes). developer.nvidia.com/nemotron
OpenRouter — Nemotron 3 Ultra 550B-A55B API pricing (third-party hosted input/output price reference; ~$0.50/M input). openrouter.ai/nvidia/nemotron-3-ultra-550b-a55b
