NVIDIA NIM API pricing has no public per-token price. The hosted catalog at build.nvidia.com is free for prototyping through the NVIDIA Developer Program, capped by a community-acknowledged baseline of around 40 requests per minute (model- and traffic-dependent, not a published SLA). Production needs NVIDIA AI Enterprise from $4,500 per GPU per year or about $1 per GPU hour in the cloud — and there is now a free 90-day evaluation license so you can run production-grade NIM before paying.
What does NVIDIA NIM API pricing actually cost in 2026?
The honest answer to NVIDIA NIM API pricing is counter-intuitive: there is no “$X per 1M tokens” rate at all. NVIDIA splits NIM into three usage modes — hosted API endpoints for prototyping (free, rate-limited), downloadable NIM containers for development and testing (free from NVIDIA, you pay the GPUs), and production deployments under NVIDIA AI Enterprise (from $4,500 per GPU per year). Most confusion comes from mixing these three modes into one phrase: “NVIDIA NIM API.”
If you are using the NVIDIA API catalog from build.nvidia.com, you are using a hosted endpoint accelerated by NVIDIA infrastructure for development, testing, research, or evaluation. If you download a NIM container and run it on your own GPU, NVIDIA is not charging you per request during allowed dev/test use; your real cost is the GPU infrastructure. If you serve real end users or business transactions, NVIDIA classes that as production, and production requires an NVIDIA AI Enterprise license.
This is the pricing-and-limits page. For what NIM is and how it works internally, see the companion NVIDIA NIM API explained; for managed-API substitutes, see NVIDIA NIM alternatives. This page stays on cost, limits, and the buy-vs-self-host decision.
NIM is priced like enterprise AI infrastructure, not like a consumer API. The free hosted endpoint is a developer-acquisition layer; the production product is a licensed, self-hostable inference stack where cost is driven by GPUs, utilization, support, and operational control.
The 3 NIM usage modes: free endpoint, downloadable NIM, production license
The most useful way to read NIM pricing is by deployment mode. The API surface looks similar across modes, but the economics and the legal boundary are different.
| Mode | Who hosts it? | What it costs | Main limit | Best for |
|---|---|---|---|---|
| Hosted API catalog | NVIDIA / partners on DGX Cloud-style infrastructure | Free for prototyping via Developer Program; no public per-token price | Rate limits (~40 RPM baseline, model/traffic dependent) | Testing models, demos, agents, comparing latency and quality |
| Downloadable NIM | You: workstation, data center, or cloud GPU | No per-request NVIDIA charge for allowed dev/test (up to 16 GPUs); you pay infrastructure | GPU memory, model support matrix, license scope | Local prototyping, private-data tests, enterprise integration |
| NVIDIA AI Enterprise | You or your cloud provider, with enterprise license/support | From $4,500/GPU/year or ~$1/GPU/hour in cloud (plus CSP instance cost); free 90-day eval | GPU count, utilization, support contract, architecture | Production apps, real users, business transactions, regulated workflows |
The subtle point: the hosted API and the self-hosted NIM container are not two pricing tiers of one SaaS API. They are two different paths into the same NVIDIA inference ecosystem. The first is convenient and rate-limited. The second gives you control, but pushes cost and operations onto your infrastructure.
How NVIDIA NIM free API access works in 2026
NVIDIA’s official NIM FAQ says NVIDIA Developer Program members get free access to NIM API endpoints for prototyping, plus a license to download and self-host NIM microservices for research, application development, and experimentation on up to 16 GPUs on any infrastructure. That access runs through build.nvidia.com and is tied to development/test use, not production.
The hosted flow is simple: visit build.nvidia.com, choose a model, generate a free NVIDIA API key, and call the endpoint. NVIDIA’s quickstart shows hosted calls going to https://integrate.api.nvidia.com/v1/chat/completions with an OpenAI-compatible request shape. The 2026 catalog has expanded fast: DeepSeek-V4-Pro and DeepSeek-V4-Flash (announced April 24, 2026, both up to a 1M-token context, available day-0 as downloadable NIM containers), the Nemotron 3 family (hybrid Mamba-Transformer MoE, 1M context, multimodal), and GLM-5.1 all sit alongside Llama, Qwen, Kimi, Mistral and Microsoft Phi. Exact catalog availability and model IDs change, so confirm the current slug at build.nvidia.com before you ship.
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # export the key from build.nvidia.com first
)

response = client.chat.completions.create(
    model="deepseek-ai/deepseek-v4-pro",  # confirm current slug at build.nvidia.com
    messages=[
        {"role": "system", "content": "Answer concisely."},
        {"role": "user", "content": "Explain NVIDIA NIM pricing in one paragraph."},
    ],
    temperature=0.2,
    max_tokens=300,
)
print(response.choices[0].message.content)
```
That OpenAI-compatible interface is why NIM is attractive for quick experiments: point existing tools at the NVIDIA base URL, swap the model name, compare output. But the production question is separate — once you serve real users, you move from trial access to a production license and deployment plan.
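To make "OpenAI-compatible request shape" concrete, here is a minimal standard-library sketch that builds the HTTP request without sending it. The helper name `build_chat_request` and the placeholder key are illustrative, and the Bearer-token header is an assumption based on the usual OpenAI-style convention; check NVIDIA's quickstart for the authoritative shape.

```python
import json
import urllib.request

# Hosted endpoint base URL as shown in NVIDIA's quickstart.
NIM_BASE_URL = "https://integrate.api.nvidia.com/v1"


def build_chat_request(base_url, api_key, model, messages, max_tokens=300):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed OpenAI-style auth header
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    NIM_BASE_URL,
    api_key="placeholder-key",  # hypothetical value, not a real key
    model="deepseek-ai/deepseek-v4-pro",  # confirm current slug at build.nvidia.com
    messages=[{"role": "user", "content": "ping"}],
)
print(req.get_method(), req.full_url)
```

Swapping providers in this sketch means changing only `base_url`, `api_key`, and `model`, which is the portability the paragraph above describes.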
The 7 limits that matter before you build on NIM
Search results often frame NIM as “free AI inference.” That is true only in a narrow development sense. The useful engineering question is which limits shape the system before you rely on it.
1. Free hosted access is for prototyping, not production
The official FAQ draws a bright line. In NVIDIA’s words: “Production use involves any use of NIM for purposes other than development, testing, research or evaluation such as conducting business transactions and any non-testing activity including activity serving real end-users.” Production requires NVIDIA AI Enterprise. This decides whether you are allowed to treat a free hosted endpoint as part of your app’s runtime.
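The quoted definition can be turned into a rough pre-flight check for a design review. This is only a sketch of one reading of the FAQ wording: the `Workload` fields and the function name are hypothetical, and the license text itself remains the authority.

```python
from dataclasses import dataclass

# Purposes the FAQ quote treats as non-production.
NON_PRODUCTION_PURPOSES = {"development", "testing", "research", "evaluation"}


@dataclass(frozen=True)
class Workload:
    purpose: str                          # e.g. "testing" or "customer support bot"
    serves_real_end_users: bool = False
    conducts_business_transactions: bool = False


def needs_ai_enterprise(workload: Workload) -> bool:
    """Return True when the workload looks like production under the quoted definition."""
    if workload.serves_real_end_users or workload.conducts_business_transactions:
        return True
    return workload.purpose not in NON_PRODUCTION_PURPOSES


prototype = Workload(purpose="evaluation")
live_bot = Workload(purpose="customer support bot", serves_real_end_users=True)
print(needs_ai_enterprise(prototype), needs_ai_enterprise(live_bot))
```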
2. The free tier is rate-limit based, around 40 RPM
NVIDIA staff confirmed in the Developer Forums that build.nvidia.com trial usage is not credit-based; it is governed by a rate limit “dependent on model, use-case and the amount of current overall traffic using the same access.” Two numbers are in play, and they are compatible: your account dashboard shows your own ceiling, while the practical baseline that NVIDIA staff have openly referenced is roughly 40 requests per minute (“The team is aware of the implications of the 40 rpm rate limit,” MarkusHoHo, May 13, 2026). Treat ~40 RPM as a community-acknowledged baseline that varies by model and traffic, not as a guaranteed, published SLA.
This is the most important practical limit. If your agent needs long continuous coding sessions, parallel workers, or high-volume batch jobs, a free hosted endpoint can fail even when it is “free.” That bites workflows like AI coding assistants, where token volume and session length spike quickly. Architect for backoff, fallbacks, and a production migration path.
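One way to architect for the limit is to throttle on the client side so that requests never exceed the ceiling in the first place. The sketch below is a simple sliding-window limiter using only the standard library; the class name is hypothetical, the default of 40 calls per 60 seconds mirrors the community baseline rather than any published quota, and the class is not thread-safe as written.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any rolling `period` seconds."""

    def __init__(self, max_calls=40, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._stamps = deque()   # timestamps of calls inside the current window

    def acquire(self):
        """Block until a call is allowed; return the number of seconds waited."""
        waited = 0.0
        while True:
            now = self._clock()
            # Drop timestamps that have aged out of the window.
            while self._stamps and now - self._stamps[0] >= self.period:
                self._stamps.popleft()
            if len(self._stamps) < self.max_calls:
                self._stamps.append(now)
                return waited
            # Window is full: wait until the oldest call expires.
            pause = self.period - (now - self._stamps[0])
            self._sleep(pause)
            waited += pause


limiter = SlidingWindowLimiter(max_calls=40, period=60.0)
limiter.acquire()  # returns immediately; call this before each API request
```

Set `max_calls` from the ceiling shown in your own dashboard rather than the default, and keep a retry path in place since the server-side limit also depends on overall traffic.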
3. There is no single public per-token price for hosted NIM endpoints
Unlike OpenAI, Anthropic, Groq, or many aggregator APIs, NVIDIA does not publish one token-price table for hosted NIM catalog calls. The public pricing anchor is the production NVIDIA AI Enterprise license, not a commodity per-token bill. That makes NIM harder to compare in spreadsheet form, but easier to understand as GPU economics: the paid product is software plus support for running optimized inference on GPUs.
4. Production pricing is per GPU, not per model
NVIDIA’s NIM FAQ says production requires NVIDIA AI Enterprise, with licenses starting at $4,500 per GPU per year or about $1 per GPU hour in the cloud. Pricing is based on the number of GPUs, not the number of NIMs, and is the same regardless of GPU size. Your final cost still depends on cloud GPU rental or owned hardware, utilization, networking, storage, observability, and operations staff time.
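Because the license is counted per GPU, a first-pass estimate needs only the GPU count. The sketch below uses the two starting figures quoted from the FAQ; the function names are illustrative, and the results cover the license only, not GPU rental or operations.

```python
ANNUAL_LICENSE_PER_GPU_USD = 4_500       # starting price per GPU per year (NIM FAQ)
CLOUD_LICENSE_PER_GPU_HOUR_USD = 1.0     # approximate cloud rate, excludes CSP instance cost


def annual_subscription_cost(gpu_count: int) -> int:
    """One-year license cost; the number of NIMs running on the GPUs is not an input."""
    return gpu_count * ANNUAL_LICENSE_PER_GPU_USD


def cloud_license_cost(gpu_count: int, hours: float) -> float:
    """Usage-based license cost for `hours` of runtime on each GPU."""
    return gpu_count * hours * CLOUD_LICENSE_PER_GPU_HOUR_USD


print(annual_subscription_cost(4))     # four GPUs, one year
print(cloud_license_cost(4, 730))      # four GPUs for roughly one month of hours
```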
5. Downloadable NIMs are free for dev/test, but GPUs are not free
For Developer Program members, downloadable NIMs can be used for research, application development, and experimentation on up to 16 GPUs. That does not mean zero cost. On cloud H100/H200/B200/B300-style machines the GPU bill can dominate; on local workstations the constraint becomes VRAM, throughput, driver compatibility, and ops time.
6. Not every catalog model is equally downloadable or production-ready
NVIDIA’s docs distinguish hosted catalog models from select downloadable container images supported with NVIDIA AI Enterprise entitlement. Some models are only convenient hosted endpoints; others are packaged as NIM containers (DeepSeek-V4-Pro and V4-Flash, for instance, shipped day-0 as containers). The catalog changes frequently, so the production question is not “can I call this today?” but “can I deploy this model family with support, licensing, hardware fit, and acceptable latency?”
7. Self-hosted NIM still has infrastructure limits
Self-hosting gives you control, not magic. NVIDIA’s deployment FAQ recommends one model per pod/container and allocates about 90% of remaining GPU RAM to KV cache. NIM chooses optimized TensorRT-LLM engines for supported GPU/model combinations and vLLM otherwise. Throughput and cost therefore depend heavily on hardware fit, model size, batch behavior, context length, and whether your deployment lands on an optimized path.
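The "about 90% of remaining GPU RAM" figure gives a quick way to sanity-check hardware fit. The sketch below is a back-of-the-envelope estimate: the per-token formula assumes a standard transformer KV cache (keys plus values for every layer and KV head), which will not hold for hybrid or otherwise unusual architectures, and the GPU size, weight size, and model shape are hypothetical inputs.

```python
def kv_cache_budget_gb(gpu_memory_gb, weights_gb, kv_fraction=0.9):
    """Memory left for KV cache after loading weights, using the ~90% figure."""
    remaining = gpu_memory_gb - weights_gb
    if remaining <= 0:
        raise ValueError("model weights do not fit on this GPU")
    return remaining * kv_fraction


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: keys and values for each layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_cached_tokens(budget_gb, bytes_per_token):
    """Total tokens across all concurrent sequences that fit in the budget (1 GB = 1e9 bytes)."""
    return int(budget_gb * 1e9 // bytes_per_token)


# Hypothetical inputs: 80 GB GPU, 16 GB of weights, 32 layers, 8 KV heads, head_dim 128, fp16.
budget = kv_cache_budget_gb(80, 16)
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(budget, per_token, max_cached_tokens(budget, per_token))
```

The token count is shared between concurrent requests, so long context windows and high concurrency compete for the same budget.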
If you need predictable SLA, stable model availability, enterprise support, or real-user traffic, do not design around the free hosted endpoint as your only runtime. Use it to validate model quality, then price the production path separately.
NVIDIA AI Enterprise pricing: the full SKU ladder
The production anchor is NVIDIA AI Enterprise, priced per GPU. The licensing guide (last updated June 8, 2026) shows no 2026 price increase versus the long-standing $4,500 base — but it does expose a fuller SKU ladder, multi-year discounts, a perpetual option, and a 75% education/Inception discount. The multi-year and perpetual SKUs are additional detail, not a hike.
| SKU (self-managed, per GPU) | Price | Effective annual | Notes |
|---|---|---|---|
| 1-year subscription | $4,500 | $4,500/yr | The headline anchor |
| 2-year subscription | $9,000 | $4,500/yr | No multi-year discount |
| 3-year subscription | $13,500 | $4,500/yr | No multi-year discount |
| 4-year subscription | $18,000 | $4,500/yr | No multi-year discount |
| 5-year subscription | $18,000 | $3,600/yr | Multi-year discount applies |
| Perpetual license | $22,500 | One-time | Includes 5 years of support |
| Education / Inception | ~$1,125 (1-yr) | 75% off | Startups and education programs |
| Cloud production | ~$1/GPU/hour | Usage-based | License only — plus the CSP instance cost |
Read the cloud line carefully: NVIDIA’s page states “$1/hour/GPU + CSP Instance Cost(s).” The $1/hour is the license on top of GPU rental, not an all-in number. Per-GPU, per-year pricing is the same regardless of GPU size, so the lever is utilization, not which card you pick.
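The two license anchors imply a break-even point in hours per year. The arithmetic below compares license cost only and assumes the annual subscription and the hourly cloud license are interchangeable for your deployment, which may not be the case; the function name is illustrative.

```python
HOURS_PER_YEAR = 8_760


def license_breakeven_hours(annual_per_gpu_usd=4_500.0, cloud_per_gpu_hour_usd=1.0):
    """GPU-hours per year above which the annual license is cheaper than hourly."""
    return annual_per_gpu_usd / cloud_per_gpu_hour_usd


hours = license_breakeven_hours()
share_of_year = hours / HOURS_PER_YEAR
print(f"{hours:.0f} hours, about {share_of_year:.0%} of the year")
```

On these figures, a GPU that runs more than roughly half the year favors the annual license, while bursty usage favors the hourly rate.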
Try production NIM free for 90 days
The strongest new 2026 angle for cost-conscious teams: organizations can access a free 90-day NVIDIA AI Enterprise evaluation license with production-grade features, enterprise security, and API stability — before buying. That closes the gap between the rate-limited prototype endpoint and the paid production license. If the ~40 RPM free tier is your wall but you are not ready to commit $4,500/GPU/year, the 90-day eval is the path to test the production NIM stack with real workloads and decide on real evidence.
Prototype free on build.nvidia.com → hit the ~40 RPM wall → run the 90-day AI Enterprise eval at production grade → then choose: buy a per-GPU license, self-host a downloadable NIM, or route production to a managed API.
How to check and raise your NVIDIA NIM rate limits
To check your own ceiling: sign in at build.nvidia.com, open a model, and look at the account/usage panel — NVIDIA staff confirm your maximum rate limit is shown there. The limit is model-specific, so a large reasoning model and a small embedding model will not share the same ceiling, and the practical baseline sits near 40 RPM.
There is no self-service “increase my free limit” button, and NVIDIA’s forums explicitly state that requesting a rate-limit increase in the forum does not grant one: “There is no official way to circumvent this rate limit or to receive a rate limit increase on that same tier.” In practice there are three real ways past the free ceiling:
- Self-host a downloadable NIM — throughput becomes a function of your own GPUs, not NVIDIA’s shared trial limits.
- Move to NVIDIA AI Enterprise (or the free 90-day eval first) — the production path where stable, higher-volume serving is supported.
- Use a managed inference API for the rate-limited portion — many teams prototype on NIM, then route production traffic to a provider with published quotas (comparison below).
If your workload’s success depends on a specific requests-per-minute number, the free hosted endpoint is the wrong layer to depend on. Free NIM proves model fit; guaranteed throughput lives behind self-hosting or a licensed/managed tier.
How to fix NVIDIA NIM 429 rate-limit errors
A 429 Too Many Requests response means you hit the model’s rate limit on the hosted endpoint — the most common error when you push a free NIM endpoint (around that ~40 RPM baseline) past prototype volume. The fix is not “request a higher limit” (NVIDIA does not grant those); it is to make your client well-behaved and add fallbacks.
Three engineering responses, in priority order:
- Exponential backoff with jitter. On a 429, wait and retry with increasing delay. If the response includes a Retry-After header, honour it instead of guessing.
- Concurrency caps. Limit in-flight requests. Most 429 storms come from parallel workers firing at once, not total daily volume.
- Model and provider fallback. If NIM is saturated, fail over to a second model or a managed API so the workflow does not stall mid-run.
```python
import os
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # export the key from build.nvidia.com first
)


def call_with_backoff(messages, model, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError as e:
            # Honour Retry-After if the server sent one, else exponential backoff + jitter.
            retry_after = e.response.headers.get("retry-after")
            try:
                wait = float(retry_after)
            except (TypeError, ValueError):  # header missing or not a number of seconds
                wait = delay + random.uniform(0, 0.5)
            time.sleep(wait)
            delay = min(delay * 2, 30)
    raise RuntimeError("NIM still rate-limited after retries - fall back to another model")
```
This pattern matters most for long-running AI coding assistants and multi-agent runs — the kind built with the tools in Best AI Agent Frameworks 2026 — where a single mid-session 429 breaks an otherwise working workflow. Treat backoff and fallback as mandatory, not optional, on the free tier.
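The other two responses, a concurrency cap and provider fallback, can be sketched independently of any SDK. In the example below every name is hypothetical: `RateLimited` stands in for whatever 429 exception your client raises, and each provider is just a callable that takes a prompt.

```python
import threading


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's 429 error."""


def call_with_fallback(providers, prompt):
    """Try each (name, callable) in order; move on when one is rate-limited."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited:
            continue
    raise RuntimeError("all providers rate-limited")


class CappedCaller:
    """Limit in-flight requests so parallel workers cannot fire all at once."""

    def __init__(self, providers, max_in_flight=4):
        self._providers = providers
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def __call__(self, prompt):
        with self._slots:  # blocks while max_in_flight calls are already running
            return call_with_fallback(self._providers, prompt)


# Fake providers for illustration only.
def saturated_nim(prompt):
    raise RateLimited()


def backup_model(prompt):
    return prompt.upper()


caller = CappedCaller([("nim", saturated_nim), ("backup", backup_model)], max_in_flight=2)
print(caller("hello"))
```

In a real client, each callable would wrap its own backoff loop, so fallback triggers only after retries on the primary are exhausted.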
NVIDIA NIM pricing diagram: where the cost actually moves
The diagram below shows the cost boundary. Hosted API catalog usage is the fastest path to a working prototype. The moment you need production reliability, cost moves from “free endpoint with rate limits” to “licensed software plus GPUs plus operations.”
How to estimate NIM production cost
The useful cost model is not “price per token.” It is cost per useful unit of throughput. For production NIM, start with this formula:
```
hourly_system_cost =
    gpu_infrastructure_cost_per_hour              # the CSP instance cost
  + nvidia_ai_enterprise_license_per_gpu_hour     # ~$1/GPU/hour, license only
  + storage_network_observability_cost_per_hour
  + ops_cost_per_hour

cost_per_1m_output_tokens =
    hourly_system_cost / output_tokens_per_hour * 1_000_000

cost_per_successful_task =
    hourly_system_cost / successful_tasks_per_hour
```
The last line is the one that matters. A cheap model that needs three retries can cost more than a stronger model that succeeds once. A self-hosted GPU at 8% utilization is expensive even if its raw hourly price looks good. A rate-limited free endpoint is priceless for experimentation and nearly unusable for a high-volume agent if it stalls mid-workflow.
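The formula translates directly into a few helper functions. All inputs below are illustrative numbers chosen to show the retry effect, not measurements or NVIDIA-published rates, and the function names simply mirror the formula's terms.

```python
def hourly_system_cost(gpu_infra, license_fee, storage_net_obs, ops):
    """Sum of the four hourly cost components, in USD per hour."""
    return gpu_infra + license_fee + storage_net_obs + ops


def cost_per_1m_output_tokens(hourly_cost, output_tokens_per_hour):
    return hourly_cost / output_tokens_per_hour * 1_000_000


def successful_tasks_per_hour(attempts_per_hour, attempts_per_success):
    """Attempts include retries, so more retries means fewer finished tasks."""
    return attempts_per_hour / attempts_per_success


def cost_per_successful_task(hourly_cost, tasks_per_hour):
    return hourly_cost / tasks_per_hour


# Illustrative: a cheap setup needing 4 attempts per success vs a pricier one needing 1.
cheap = cost_per_successful_task(2.0, successful_tasks_per_hour(100, 4))
strong = cost_per_successful_task(4.0, successful_tasks_per_hour(60, 1))
print(f"cheap: ${cheap:.3f}/task, strong: ${strong:.3f}/task")
```

With these made-up inputs the setup that costs twice as much per hour still comes out cheaper per finished task, which is the point of measuring successful tasks rather than tokens.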
Worked examples: what a NIM deployment actually costs
Numbers make the model concrete. The GPU rental rates below are illustrative 2026 cloud ranges — verify current prices with your provider — but the structure is what matters. The license figure uses NVIDIA’s published anchor of about $1/GPU/hour in cloud (or $4,500/GPU/year, ≈ $0.51/GPU/hour if amortised over a fully-utilised year). These per-token derivations are illustrative, not NVIDIA-published rates.
| Scenario | Setup | Rough hourly system cost | Best for |
|---|---|---|---|
| A. Prototype | 1 hosted free endpoint, ~40 RPM | $0 (dev/test only) | Model evaluation, demos |
| B. Single-GPU self-host | 1 cloud H100 (~$3/hr illustrative) + ~$1/hr AI Enterprise + ~$0.5/hr storage/ops | ~$4.5/hr (~$3,250/mo at 24/7) | Steady internal workload, one model |
| C. Production cluster | 4 GPUs (~$12/hr) + ~$4/hr license + ~$2/hr ops | ~$18/hr (~$13,000/mo) | Customer-facing, higher throughput |
The decisive variable is utilization. Scenario B at 24/7 full load might serve roughly 1.5–2M output tokens/hour on a well-fit model. At ~$4.5/hour, that works out to about $2.25–$3.00 per 1M output tokens, which is competitive with managed APIs. The same GPU at 10% utilization costs the same $4.5/hour but does a tenth of the work, so its cost per 1M tokens is 10× worse (about $22.50–$30). That is why a pay-per-token managed API often wins for spiky, low-volume traffic, while self-hosted NIM wins for steady, high-volume, latency- or data-sensitive workloads.
Self-hosted NIM beats a managed per-token API only past a utilization threshold. Before committing, estimate your steady tokens/hour, divide your fully-loaded hourly cost by it, and compare to the managed API’s per-token price. If your traffic can’t keep the GPU busy, the managed API is usually cheaper and simpler.
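The threshold can be computed by inverting the per-token cost: divide the fully loaded hourly cost by the managed API's price to get the tokens per hour at which the two options cost the same. The managed price of $3 per 1M output tokens below is a hypothetical figure for illustration, as are the function names.

```python
def breakeven_tokens_per_hour(hourly_cost_usd, managed_price_per_1m_usd):
    """Steady output tokens/hour at which self-hosting matches the managed API."""
    return hourly_cost_usd / managed_price_per_1m_usd * 1_000_000


def cheaper_option(steady_tokens_per_hour, hourly_cost_usd, managed_price_per_1m_usd):
    threshold = breakeven_tokens_per_hour(hourly_cost_usd, managed_price_per_1m_usd)
    return "self-hosted" if steady_tokens_per_hour > threshold else "managed API"


# Illustrative: $4.5/hour fully loaded vs a hypothetical $3 per 1M output tokens.
print(breakeven_tokens_per_hour(4.5, 3.0))
print(cheaper_option(2_000_000, 4.5, 3.0))
print(cheaper_option(200_000, 4.5, 3.0))
```

The comparison ignores engineering time and any cost of idle capacity held for peaks, so treat the threshold as a lower bound on the volume you need.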
NIM vs normal inference APIs: the decision table
NIM sits in a different market position from OpenAI, Anthropic, Together, Fireworks, OpenRouter, or other inference providers. Those services are usually easier to price per token. NIM is stronger when you care about self-hosting, NVIDIA-optimized inference, data control, and production deployment on your own GPU estate. For a full provider-by-provider roundup, see NVIDIA NIM alternatives; the table here is the lightweight decision view.
| Question | NIM is strong when… | Use a standard API when… |
|---|---|---|
| Do you need free prototyping? | You want to test many open and NVIDIA models quickly. | You need predictable free-tier quotas or published token pricing. |
| Do you need production SLA? | You can buy NVIDIA AI Enterprise (or run the 90-day eval first) and operate GPUs. | You want the provider to own scaling, billing, and model serving. |
| Do you have private data? | You want data to stay in your own enclave or cloud VPC. | Your data policy permits a managed external model API. |
| Do you need cost predictability? | You can keep GPUs highly utilized and model throughput is stable. | Your traffic is spiky and pay-per-token billing is cleaner. |
| Do you need model portability? | You want OpenAI-compatible local endpoints and self-hosted containers. | You mostly consume one proprietary frontier model. |
When NVIDIA NIM free endpoints are enough
The free hosted endpoint is enough when work is exploratory: evaluating model behavior, testing prompts, checking tool compatibility, building demos, prototyping an internal agent, or comparing model families before you commit to a deployment path. It is especially useful for testing multiple open or NVIDIA-tuned models — including the 2026 DeepSeek-V4 and Nemotron 3 additions — behind an OpenAI-style API without provisioning GPUs.
It is not enough when your application needs predictable throughput, customer-facing availability, batch processing at scale, long multi-agent runs, commercial guarantees, security review, or stable model availability. At that point, move to the 90-day eval and then NVIDIA AI Enterprise / self-hosted NIM, or pick a managed inference API with public token pricing and production SLAs. This is the same throughput-versus-price issue we covered in Anthropic’s Claude limits and SpaceX compute deal: rate limits, not only token prices, decide whether an AI workflow is usable.
Production migration checklist
Before moving from build.nvidia.com experiments to a real NIM deployment, answer these in writing.
- Use rights: Is the workload still development/testing, or production under NVIDIA’s definition?
- Model packaging: Is the model available as a downloadable NIM container with the right entitlement?
- Hardware fit: Does the model fit your GPU memory budget with KV cache and target context length?
- Throughput target: How many successful tasks per hour do you need, not just tokens per second?
- Utilization: Will GPUs stay busy enough to beat pay-per-token APIs?
- Observability: Are readiness, liveness, metrics, traces, and model errors wired into your stack?
- Fallbacks: What happens when a model is slow, unavailable, or fails quality evals?
- Security: How are API keys, model access, data boundaries, and tool permissions controlled?
- Support: Who owns incidents: your platform team, NVIDIA AI Enterprise support, a cloud partner, or all three?
The migration is less about changing code and more about changing accountability. The prototype endpoint proves the model is useful. Production NIM proves the system is operable.
Common mistakes with NVIDIA NIM pricing
- Calling the hosted endpoint “unlimited free inference.” NVIDIA describes free developer access, but rate limits apply — around 40 RPM, model-specific.
- Comparing NIM to OpenAI only by token price. NIM’s production economics are GPU/license/utilization economics, not per-token billing.
- Reading “$1/GPU/hour” as all-in. That is the license only; the CSP GPU instance cost is on top.
- Ignoring the production boundary. A demo is not production. Real users and business transactions change the license requirement.
- Assuming every catalog model is self-hostable. Check whether the specific model is available as a downloadable NIM and under what entitlement.
- Skipping fallbacks. Free endpoints are excellent for tests, but agent workflows need backoff, retry, alternate models, and state checkpoints.
Bottom line
NVIDIA NIM is one of the most useful free developer inference surfaces in 2026, but the word “free” needs precision. Hosted NIM endpoints are free for prototyping, capped near 40 RPM and by model availability. Downloadable NIMs are free from NVIDIA’s side for allowed dev/testing, but you pay the GPU infrastructure. Production is NVIDIA AI Enterprise — from $4,500/GPU/year, now with a free 90-day evaluation so you can try the production stack before paying.
That makes NIM a strong choice when you want to evaluate models quickly and eventually run optimized inference under your own control. It is weaker if you want a simple public token-price table and a managed API that absorbs all operational complexity. Treat NIM as a bridge from prototype to self-hosted production, not as a permanently unlimited free API. For the conceptual picture, start with NVIDIA NIM API explained; for substitutes, see NVIDIA NIM alternatives.
FAQ
How much does the NVIDIA NIM API cost?
There is no per-token price. The hosted catalog at build.nvidia.com is free for prototyping through the NVIDIA Developer Program (rate-limited near 40 RPM). Production requires NVIDIA AI Enterprise, listed in NVIDIA’s NIM FAQ from $4,500 per GPU per year or about $1 per GPU hour in the cloud (plus the CSP instance cost). A free 90-day evaluation license is available before you buy.
Is the NVIDIA NIM API free?
Yes, but only in the development sense. NVIDIA Developer Program members get free access to hosted NIM endpoints for prototyping and to downloadable NIM microservices for research, development, testing and experimentation on up to 16 GPUs. Production use requires an NVIDIA AI Enterprise license — though a free 90-day production-grade evaluation now exists.
How many requests per minute does the NVIDIA NIM free tier allow?
NVIDIA does not publish a guaranteed quota, but NVIDIA staff have openly referenced a baseline of around 40 requests per minute, and your exact account ceiling is shown inside the build.nvidia.com UI. Treat ~40 RPM as a community-acknowledged, model- and traffic-dependent baseline, not a published SLA. If your workflow depends on a specific RPM, self-host a downloadable NIM or move to a licensed/managed tier.
Does NVIDIA NIM have per-token pricing?
No. NVIDIA does not publish a universal per-token table for hosted NIM catalog usage. The public production anchor is NVIDIA AI Enterprise, from $4,500 per GPU per year or about $1 per GPU hour in the cloud (license only; the GPU instance cost is on top). Final cost depends on GPU infrastructure and utilization.
Can I use NVIDIA NIM in production for free?
Not indefinitely. NVIDIA defines production as use beyond development, testing, research or evaluation — including real end users and business transactions — and that requires NVIDIA AI Enterprise. However, you can run a free 90-day AI Enterprise evaluation license with production-grade features before committing to a paid per-GPU license.
How do I fix a 429 error on NVIDIA NIM?
A 429 means you hit the model’s rate limit (around the ~40 RPM free baseline). NVIDIA does not grant increases on request, so the fix is client-side: add exponential backoff with jitter (honour the Retry-After header if present), cap concurrent requests, and fail over to a second model or managed API so the workflow doesn’t stall. For sustained volume, self-host or move to a production tier.
NVIDIA NIM vs a managed API like OpenRouter or Together – which is cheaper?
For occasional or spiky calls to open models, a managed/aggregator API (OpenRouter, Together, Fireworks) is usually cheaper and simpler because you pay per token with no infrastructure. NIM becomes cheaper at steady, high utilization when you self-host on your own GPUs, because cost is then GPU economics rather than per-token billing. The break-even depends on how busy you can keep the hardware — see our NVIDIA NIM alternatives roundup for the provider comparison.
Bibliography (10 sources)
Sources prioritise NVIDIA primary documentation and official developer-forum staff posts. Vendor performance and pricing figures are treated as vendor-reported unless independently audited; the ~40 RPM figure is a community-observed baseline that NVIDIA staff acknowledged, not a formally published SLA. Links accessed June 2026.
- NVIDIA — NIM General FAQ (pricing, free access, 16-GPU dev/test, production definition, 90-day eval). docs.api.nvidia.com/nim/docs/product
- NVIDIA — AI Enterprise Licensing Guide: Pricing (SKU ladder, $4,500 anchor, perpetual, edu discount, “Last updated Jun 08, 2026”). docs.nvidia.com/ai-enterprise/…/pricing.html
- NVIDIA — API Catalog Quickstart Guide (OpenAI-compatible hosted endpoint URL). docs.api.nvidia.com/nim/docs/api-quickstart
- NVIDIA Developer Blog — Build with DeepSeek-V4 using NVIDIA Blackwell and GPU-accelerated endpoints (Apr 24, 2026; V4-Pro/Flash, 1M context, day-0 NIM container). developer.nvidia.com/blog/build-with-deepseek-v4…
- NVIDIA Developer Forums — API Rate Limit Increase is NOT granted by requesting it here (40 RPM acknowledgement, rate-limit-not-credit, no increases on request). forums.developer.nvidia.com/t/…/368420
- NVIDIA — Try NVIDIA NIM APIs (hosted catalog, free API key, model list). build.nvidia.com
- NVIDIA — NIM Deployment FAQ (one model per pod, ~90% KV cache, TensorRT-LLM vs vLLM). docs.api.nvidia.com/nim/docs/deployment
- NVIDIA — LLM APIs reference (catalog model list incl. Nemotron, DeepSeek, GLM). docs.api.nvidia.com/nim/reference/llm-apis
- NVIDIA — NIM for Large Language Models: Get Started (self-hosted OpenAI-compatible endpoints). docs.nvidia.com/nim/large-language-models/latest/get-started
- NVIDIA — NVIDIA AI Enterprise (product page) (production support, security, API stability). nvidia.com/en-us/data-center/products/ai-enterprise
