vLLM vs TensorRT-LLM vs SGLang (2026): Which to Pick

Last updated: June 2026 · By Ignacy Kwiecień, founder & editor-in-chief, DecodeTheFuture.org

Pick vLLM as the default open serving engine for the widest hardware reach (NVIDIA, AMD ROCm, Intel XPU, TPU) and the easiest pip install; pick SGLang when prefix sharing dominates your workload, because its RadixAttention reuses cached prefixes automatically; pick NVIDIA TensorRT-LLM only when you are all-in on NVIDIA Blackwell/GB200 and can absorb the compiled-engine, Docker-driven setup cost. As of June 2026 the engines are vLLM 0.23.0, SGLang 0.5.13 and TensorRT-LLM 1.2.1 stable. Surface features have converged; the choice is architectural. If you do not want to run a GPU at all, do not self-host — use a hosted API or a packaged microservice instead.

Which LLM inference engine should you pick in 2026?

Pick by hardware and workload, not by a single throughput number. As of June 2026 the practical default is vLLM because it runs on the broadest hardware and installs with one pip command, exposing an OpenAI-compatible server immediately. Choose SGLang when your traffic re-uses long shared prefixes (RAG with a fixed corpus, multi-turn agents, structured decoding) because its RadixAttention turns that prefix overlap into a cache hit automatically. Choose NVIDIA TensorRT-LLM when you are committed to NVIDIA’s newest silicon (Blackwell, GB200, GB300) and want the deepest kernel-level optimization for it — accepting that you pay for that in build complexity.

This page is strictly about the self-hosted open serving-engine layer — the case where you rent or own the GPU and run the model yourself. It is deliberately not about hosted endpoints (if you just want a URL to call, our best inference APIs comparison for 2026 owns that decision) and not about packaged microservices (NVIDIA NIM bundles an engine, a model and an endpoint into one container — see how NVIDIA NIM works and the NIM alternatives rundown). Read on only if you have already decided to run the engine yourself.

The do-not-self-host escape hatch

If “I do not want to manage GPUs” describes you, none of these three is your answer — a hosted API or a NIM-style packaged microservice is. Self-hosting a serving engine only pays off when you need cost control at volume, data residency, custom models, or hardware you already own. Otherwise the operational overhead outweighs the per-token saving.

How do PagedAttention, RadixAttention and a compiled engine differ?

The three engines differ at the memory and compilation layer, and that difference is now the main reason to choose one over another. vLLM’s PagedAttention manages the KV cache the way an operating system manages virtual memory: it splits the cache into fixed-size pages, so memory fragmentation stays near zero and batches pack tightly. SGLang’s RadixAttention goes one step further by storing the KV cache in a radix tree keyed on the token prefix, so two requests that share an opening prefix automatically share its cached computation, with no manual cache directives. TensorRT-LLM instead compiles the model into an optimized inference engine tuned to a specific NVIDIA GPU, trading flexibility for kernel-level speed on that hardware.
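To make the paging idea concrete, here is a minimal toy sketch of on-demand page allocation for a KV cache. It is not vLLM's implementation: the class name `PagedKVCache`, the page size of 16 tokens, and the request ids are all assumptions chosen for illustration.

```python
# Toy model of paged KV-cache allocation (illustrative only, not vLLM's code).
PAGE_SIZE = 16  # tokens per page; an assumed value for this sketch


class PagedKVCache:
    def __init__(self, num_pages: int) -> None:
        self.free_pages = list(range(num_pages))      # physical page ids
        self.page_tables: dict[str, list[int]] = {}   # request id -> its pages
        self.lengths: dict[str, int] = {}             # request id -> tokens stored

    def append_tokens(self, request_id: str, n_tokens: int) -> None:
        """Reserve room for n_tokens more tokens, allocating pages on demand."""
        table = self.page_tables.setdefault(request_id, [])
        new_len = self.lengths.get(request_id, 0) + n_tokens
        pages_needed = -(-new_len // PAGE_SIZE)  # ceiling division
        extra = pages_needed - len(table)
        if extra > len(self.free_pages):
            raise MemoryError("out of KV-cache pages")
        for _ in range(extra):
            table.append(self.free_pages.pop())
        self.lengths[request_id] = new_len

    def release(self, request_id: str) -> None:
        """Return a finished request's pages to the free pool."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

    def wasted_slots(self) -> int:
        """Allocated-but-unused token slots: at most PAGE_SIZE - 1 per request."""
        return sum(
            len(pages) * PAGE_SIZE - self.lengths[rid]
            for rid, pages in self.page_tables.items()
        )


cache = PagedKVCache(num_pages=8)
cache.append_tokens("req-a", 20)  # 20 tokens -> 2 pages
cache.append_tokens("req-b", 5)   # 5 tokens  -> 1 page
print(cache.wasted_slots())       # waste is bounded by the last page of each request
```

Because a request only ever wastes part of its final page, the pool can be shared by many requests of very different lengths, which is the property the virtual-memory analogy describes.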

Here is the practical consequence. PagedAttention is a throughput-and-memory win on almost any workload. RadixAttention adds a second, workload-dependent win that is large when prefixes overlap and negligible when every request is unique. The compiled-engine approach is a hardware-dependent win that is large on the exact GPU it was built for and unavailable everywhere else. The surface feature list has largely converged: per Inference Engineering’s comparison, all three now implement continuous batching, paged KV caching and FP8 quantization, so the distinction is architectural rather than a feature checkbox.
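The workload dependence of prefix caching can be sketched with a toy prefix cache. This is a simplification under stated assumptions: it uses a plain trie with one node per token rather than a compressed radix tree, it stores no actual KV tensors, and the `PrefixCache` name and the integer token ids are hypothetical.

```python
# Toy prefix cache: a plain trie keyed on token ids (a simplification of a
# radix tree; illustrative only, not SGLang's implementation).
class PrefixCache:
    def __init__(self) -> None:
        self.root: dict = {}

    def match_and_insert(self, tokens: list[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        hits = 0
        for tok in tokens:
            if tok in node:
                hits += 1          # this token's KV entry could be reused
            else:
                node[tok] = {}     # first time this prefix is seen
            node = node[tok]       # after a miss, node is empty, so later tokens miss too
        return hits


system_prompt = [1, 2, 3, 4]              # stands in for a long shared prefix
cache = PrefixCache()
first = cache.match_and_insert(system_prompt + [10, 11])   # cold cache
second = cache.match_and_insert(system_prompt + [20])      # shares the prefix
unique = cache.match_and_insert([9, 9, 9])                  # nothing in common
print(first, second, unique)
```

The second request reuses the whole shared prefix while the unrelated request reuses nothing, which mirrors the "large when prefixes overlap, negligible when every request is unique" behavior described above.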

Engine	Core memory architecture	Latest version (Jun 2026)	Governance
vLLM	PagedAttention + continuous batching	0.23.0 (Jun 13, 2026)	vLLM project / open source
SGLang	RadixAttention prefix-tree KV cache	0.5.13 (Jun 13, 2026)	LMSYS / open source
TensorRT-LLM	Compiled engine, PyTorch-default backend	1.2.1 stable (Apr 20, 2026)	NVIDIA

What hardware does each engine run on?

Hardware breadth is the single hardest split between the three, and it is where TensorRT-LLM is the odd one out. vLLM is multi-platform: NVIDIA CUDA, AMD ROCm, Intel XPU and Google TPU (under vLLM V1). SGLang reports native support across NVIDIA, AMD, Intel Xeon, Google TPU and Ascend NPU accelerators, per its official documentation. TensorRT-LLM is NVIDIA-only by design — it compiles CUDA engines and is optimized for Blackwell-class hardware (GB200, GB300). If you run, or might run, on AMD MI300/MI355X, Intel, TPU or Ascend, TensorRT-LLM is off the table before any benchmark; vLLM and SGLang are the only candidates.

Most 2026 comparisons skip this entirely and quietly assume NVIDIA. That assumption is wrong for a growing share of deployments: AMD’s MI300/MI355X line is now a first-class target for both vLLM and SGLang, and SGLang also lists Ascend NPUs as natively supported (vLLM’s Ascend coverage is partial). If you are quantizing aggressively, note that vLLM supports AWQ, GPTQ, FP8 and INT8 weight quantization plus FP8 KV cache, with FP4 on NVIDIA Blackwell and AMD MI300 per AMD’s ROCm guidance, so the low-precision story is no longer NVIDIA-exclusive either.

Accelerator	vLLM	SGLang	TensorRT-LLM
NVIDIA CUDA (incl. Blackwell)	Yes	Yes	Yes (primary)
AMD ROCm (MI300 / MI355X)	Yes	Yes	No
Intel (XPU / Xeon)	Yes (XPU)	Yes (Xeon)	No
Google TPU	Yes (V1)	Yes	No
Huawei Ascend NPU	Partial	Yes	No
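
The table can be turned into a quick elimination step before any benchmarking. The sketch below simply encodes the rows above; the dictionary keys and the `candidates` helper are hypothetical names, and the "intel" row glosses over the fact that vLLM targets XPU while SGLang targets Xeon.

```python
# Support matrix copied from the table above (values: "yes", "partial", "no").
SUPPORT = {
    "nvidia-cuda": {"vLLM": "yes", "SGLang": "yes", "TensorRT-LLM": "yes"},
    "amd-rocm":    {"vLLM": "yes", "SGLang": "yes", "TensorRT-LLM": "no"},
    "intel":       {"vLLM": "yes", "SGLang": "yes", "TensorRT-LLM": "no"},  # XPU vs Xeon
    "google-tpu":  {"vLLM": "yes", "SGLang": "yes", "TensorRT-LLM": "no"},
    "ascend-npu":  {"vLLM": "partial", "SGLang": "yes", "TensorRT-LLM": "no"},
}
ENGINES = ["vLLM", "SGLang", "TensorRT-LLM"]


def candidates(accelerators: list[str], allow_partial: bool = False) -> list[str]:
    """Engines that cover every accelerator you run, or might run, on."""
    accepted = {"yes", "partial"} if allow_partial else {"yes"}
    return [
        engine
        for engine in ENGINES
        if all(SUPPORT[acc][engine] in accepted for acc in accelerators)
    ]


print(candidates(["nvidia-cuda"]))
print(candidates(["nvidia-cuda", "amd-rocm"]))
print(candidates(["ascend-npu"]))
```

Listing every accelerator you might plausibly deploy on, not just today's, is the point: an engine that drops out here never needs to be benchmarked.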

How should you read the 5x and 25x throughput claims?

Treat the big multipliers as vendor/lab-reported and workload-specific, not as a universal ranking. SGLang’s original “up to 5x higher throughput” figure (from LMSYS) was measured against vLLM v0.2.5 and Guidance v0.1.8 on prefix-sharing workloads (Llama-7B on A10G, Mixtral-8x7B) — exactly the case RadixAttention is built for. It is a real, lab-reported result, but it is not “SGLang is 5x faster than vLLM at everything,” and the vLLM baseline it used is many versions old. On workloads with no shared prefix, the gap shrinks sharply.

The newer “up to 25x” headline is even more conditional. LMSYS reported up to 25x higher performance running DeepSeek R1 on an NVIDIA GB300 NVL72 versus an H200 baseline — but that H200 baseline was latency-constrained at 50 tokens/second per user, and the figure is explicitly an NVIDIA + SGLang collaboration. That makes it an apples-to-oranges hardware-generation comparison (new GB300 silicon vs older H200 under a tight latency cap), not an engine-vs-engine result. Use it to understand that SGLang scales on GB300, not to rank the three engines. The honest takeaway: benchmark on your model, hardware and traffic shape, because every published multiplier was measured on someone else’s.

Why vendor multipliers mislead

A throughput multiplier is only meaningful with its baseline pinned: which engine version, which GPU, which model, and crucially whether latency was held constant. The 5x (old vLLM, prefix-sharing) and 25x (GB300 vs latency-capped H200) numbers fail at least one of those checks for a general pick. They are directional evidence that SGLang is strong on prefix-heavy and GB300 workloads — nothing more.
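
One way to apply those checks consistently is a small checklist structure. This is a sketch, not an established methodology: the `Claim` class and field names are hypothetical, and the values filled in for the two claims are only this article's reading of them, with anything the article does not state left as unknown.

```python
from dataclasses import dataclass
from typing import Optional

# The four baseline checks named above; None means "not stated".
CHECKS = {
    "baseline_version_current": "baseline engine version is current",
    "same_gpu": "both sides ran on the same GPU",
    "model_named": "model is named",
    "latency_held_constant": "latency was held constant",
}


@dataclass
class Claim:
    label: str
    baseline_version_current: Optional[bool] = None
    same_gpu: Optional[bool] = None
    model_named: Optional[bool] = None
    latency_held_constant: Optional[bool] = None


def open_questions(claim: Claim) -> list[str]:
    """List every check that fails or cannot be confirmed for a claim."""
    problems = []
    for field, description in CHECKS.items():
        value = getattr(claim, field)
        if value is None:
            problems.append(f"unknown: {description}")
        elif value is False:
            problems.append(f"fails: {description}")
    return problems


# Values below reflect only what this article says about each claim.
five_x = Claim("SGLang up to 5x", baseline_version_current=False, model_named=True)
twenty_five_x = Claim("GB300 up to 25x", same_gpu=False, model_named=True)

for claim in (five_x, twenty_five_x):
    print(claim.label, open_questions(claim))
```

A claim with an empty list would be one you could reasonably generalize from; neither headline number gets there.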

How hard is each engine to deploy?

vLLM and SGLang install via pip and serve an OpenAI-compatible endpoint in one command; TensorRT-LLM is effectively Docker-image-driven and, per Yotta Labs’ 2026 comparison, costs significantly more developer-hours unless you are already inside the NVIDIA ecosystem. Roundups rarely cover that ergonomics gap, and for small teams it often outweighs raw throughput. The reason is the compiled engine: TensorRT-LLM wants you to build an engine artifact tuned to your GPU and model, which is powerful but adds a build-and-version step that a plain "pip install vllm" simply does not have.

Here is the deploy reality for each. Note that TensorRT-LLM does ship an OpenAI-compatible server (trtllm-serve), so the API surface matches once you are running — the friction is in getting there.

bash — one-command OpenAI-compatible servers

# vLLM: pip install, then serve an OpenAI-compatible endpoint
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# SGLang: same shape, RadixAttention prefix caching on by default
pip install "sglang[all]"
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000

# TensorRT-LLM: Docker-driven; pull the image and publish the port,
# then run trtllm-serve inside the container shell that opens
docker run --gpus all -it -p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:1.2.1
trtllm-serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000

# All three then answer the same OpenAI client call
# (use port 30000 for the SGLang server started above):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"...","messages":[{"role":"user","content":"hi"}]}'

The takeaway: if your team is not already running NVIDIA’s container stack, vLLM or SGLang will be live in minutes while TensorRT-LLM is still building. If you are on NVIDIA Blackwell and chasing the last increment of per-GPU throughput, that build cost buys you something real.
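
For application code, the same call can be made from Python with only the standard library. This is a minimal sketch: the helper names `build_chat_request` and `chat` are hypothetical, the ports are the ones used in the commands above, and the response parsing assumes the usual OpenAI chat-completions shape.

```python
import json
import urllib.request

# Base URLs matching the ports used in the shell commands above.
ENDPOINTS = {
    "vllm": "http://localhost:8000",
    "sglang": "http://localhost:30000",
    "tensorrt-llm": "http://localhost:8000",
}


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send one chat turn and return the assistant text (needs a running server)."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example, once a server is up:
#   chat(ENDPOINTS["sglang"], "meta-llama/Llama-3.1-8B-Instruct", "hi")
```

Switching engines then means changing only the base URL, which is what makes a prototype-on-one, migrate-to-another path cheap.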

What is the pick-X-when decision matrix?

Use the matrix below as a decision tree, reading top to bottom — the first row that matches your situation is your answer. It encodes the architecture-first logic above: hardware constraints decide first, then workload shape, then ergonomics.

One nuance the matrix compresses: SGLang and vLLM are not mutually exclusive with the rest of your stack. Both speak the OpenAI API, so you can prototype on vLLM and migrate to SGLang if prefix-cache wins materialize, with no client rewrite. And if KV-cache memory is your bottleneck rather than raw compute, the technique layer matters too — see our explainer on vector quantization for the KV cache for a complementary lever that applies across engines.
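
The pick-when logic stated in this article (hardware first, then workload shape, then ergonomics) can be sketched as a small function. This is one reading of the prose, not a reproduction of the matrix: the function name, parameters, and in particular the tie-break order when several conditions hold are assumptions.

```python
def pick_engine(
    *,
    wants_to_manage_gpus: bool,
    nvidia_only: bool,
    all_in_on_blackwell: bool,
    can_absorb_build_cost: bool,
    prefix_heavy: bool,
) -> str:
    """First matching rule wins, mirroring a top-to-bottom decision matrix."""
    # Escape hatch: no GPU operations means no self-hosted engine at all.
    if not wants_to_manage_gpus:
        return "hosted API or packaged microservice"
    # Hardware: TensorRT-LLM is only on the table for committed NVIDIA shops.
    if nvidia_only and all_in_on_blackwell and can_absorb_build_cost:
        return "TensorRT-LLM"
    # Workload shape: shared prefixes favor RadixAttention.
    if prefix_heavy:
        return "SGLang"
    # Ergonomics and hardware reach: the default.
    return "vLLM"


print(pick_engine(
    wants_to_manage_gpus=True,
    nvidia_only=False,
    all_in_on_blackwell=False,
    can_absorb_build_cost=False,
    prefix_heavy=True,
))
```

Placing the Blackwell rule ahead of the prefix rule is a judgment call; a team that is both all-in on Blackwell and prefix-heavy should benchmark both engines rather than trust the ordering.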

How do these engines fit an agent or coding workflow?

For agent and coding workloads, prefix reuse is the decisive factor, which tilts the pick toward SGLang — but only if your framework actually re-sends a stable prefix. Agent loops typically replay a long, fixed system prompt and tool-definition block on every turn; that is precisely the shared-prefix pattern RadixAttention rewards. SGLang also ships Spec V2 as its default speculative-decoding path (with tree drafting across multiple backends as of v0.5.13), which helps the short, latency-sensitive completions common in tool-calling. vLLM remains the safer default when the agent runs across heterogeneous hardware or when you value the larger ecosystem.
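
Whether a framework "actually re-sends a stable prefix" is easy to check offline. The sketch below compares two ways of assembling an agent prompt; the prompt text, helper names, and whitespace tokenization are all assumptions for illustration (a real engine matches on tokenizer ids, not words).

```python
SYSTEM = "You are a coding agent. Follow the tool schema exactly."
TOOLS = "tools: read_file(path), run_tests(target)"


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def stable_prompt(turn_text: str) -> list[str]:
    # Fixed blocks first, per-turn content last: the prefix never changes.
    return f"{SYSTEM} {TOOLS} {turn_text}".split()


def unstable_prompt(turn_text: str, timestamp: str) -> list[str]:
    # A per-request value at the front changes the very first token.
    return f"[{timestamp}] {SYSTEM} {TOOLS} {turn_text}".split()


stable = shared_prefix_len(
    stable_prompt("list the failing tests"),
    stable_prompt("open utils.py"),
)
unstable = shared_prefix_len(
    unstable_prompt("list the failing tests", "12:00:01"),
    unstable_prompt("open utils.py", "12:00:02"),
)
print(stable, unstable)
```

With the fixed blocks first, every turn shares the full system and tool prefix; with a timestamp in front, nothing is shared and a prefix cache has nothing to reuse, regardless of engine.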

The engine is only one layer of an agent stack, though. The orchestration framework on top decides how prefixes are constructed and whether they stay stable — if you are still choosing that, our 2026 guide to AI agent frameworks covers the layer above the engine, and for the developer-tool end specifically, the best AI coding assistants for 2026 rundown covers the products that ultimately call these engines.

Source note

Throughput multipliers in this article (SGLang 5x, GB300 25x) are vendor/lab-reported by LMSYS and NVIDIA and are labeled as such; they are workload- and hardware-specific, not audited engine rankings. Version numbers are pinned to PyPI and GitHub release history as of June 2026. Deploy-ergonomics framing draws on trade analysis and is presented as directional, not a guarantee. Always benchmark on your own model, hardware and traffic before committing.

FAQ

Is SGLang always faster than vLLM?

No. SGLang’s headline “up to 5x” advantage (lab-reported by LMSYS) was measured on prefix-sharing workloads against an old vLLM v0.2.5 baseline. When requests share long prefixes — RAG with a fixed corpus, multi-turn agents — RadixAttention gives SGLang a real edge. When every request is unique, the gap narrows sharply and vLLM’s PagedAttention is competitive. Benchmark on your own traffic shape rather than trusting a single multiplier.

Does TensorRT-LLM run on AMD or non-NVIDIA GPUs?

No. TensorRT-LLM is NVIDIA-only by design: it compiles CUDA engines and is optimized for Blackwell-class hardware like GB200 and GB300. If you run or might run on AMD ROCm (MI300/MI355X), Intel, Google TPU or Huawei Ascend, your only candidates are vLLM and SGLang. Coverage differs by accelerator (for example, vLLM targets Intel XPU while SGLang targets Intel Xeon, and vLLM’s Ascend support is partial), so check the hardware table before committing.

What is RadixAttention and how is it different from PagedAttention?

PagedAttention (vLLM) manages the KV cache in fixed-size pages like virtual memory, cutting fragmentation and packing batches tightly. RadixAttention (SGLang) stores the KV cache in a radix tree keyed on the token prefix, so two requests that share an opening prefix automatically reuse its cached computation. PagedAttention is a near-universal memory win; RadixAttention adds a second, workload-dependent win that is large only when prefixes overlap.

Which engine is easiest to deploy?

vLLM and SGLang both install via pip and launch an OpenAI-compatible server in one command. TensorRT-LLM is effectively Docker-image-driven and, per trade analysis, costs significantly more developer-hours unless you are already inside the NVIDIA container ecosystem, because it requires building a compiled engine tuned to your GPU. For small teams without an existing NVIDIA stack, vLLM or SGLang is the faster path to a live endpoint.

What are the current versions in June 2026?

As of June 2026: vLLM 0.23.0 (released June 13, 2026, on a roughly biweekly cadence per PyPI), SGLang v0.5.13 (released June 13, 2026, per GitHub), and TensorRT-LLM 1.2.1 stable (released April 20, 2026, on PyPI, with 1.3.0 release candidates in testing). TensorRT-LLM’s v1.0.0 milestone in September 2025 made the PyTorch-based architecture the stable default backend.

Should I self-host an engine or just use a hosted API?

Self-host only if you need cost control at high volume, data residency, custom models, or hardware you already own. If “I do not want to manage GPUs” describes you, a hosted inference API or a packaged microservice like NVIDIA NIM is the better answer — the operational overhead of running vLLM, SGLang or TensorRT-LLM yourself only pays off at scale or under specific constraints.

Can I move between these engines without rewriting my client?

Yes, largely. All three expose an OpenAI-compatible API — vLLM and SGLang via their launch servers, TensorRT-LLM via trtllm-serve — so the same OpenAI client call works against any of them once running. That lets you prototype on vLLM and migrate to SGLang if prefix-cache wins materialize, or to TensorRT-LLM on NVIDIA Blackwell, without changing application-side request code.

Bibliography (16 sources)

Sources prioritise official project documentation and release notes (vLLM, SGLang, NVIDIA TensorRT-LLM) and vendor primary disclosures, with trade media used only for deploy-ergonomics and feature-parity framing. Vendor and lab throughput multipliers (SGLang 5x, GB300 25x) are treated as vendor/lab-reported, workload- and hardware-specific, not audited engine rankings. Links accessed June 2026.

vLLM project — vLLM on PyPI (latest version 0.23.0, June 13 2026; version history). Primary source for current version and release cadence. pypi.org/project/vllm
vLLM project — Official documentation. PagedAttention, continuous batching and OpenAI-compatible server reference. docs.vllm.ai
vLLM project — GitHub releases. Per-release feature and version history. github.com/vllm-project/vllm
AMD — ROCm docs: vLLM V1 performance optimization and quantization on ROCm. Source for AMD hardware breadth and FP8/FP4/GPTQ quantization on ROCm. rocm.docs.amd.com
SGLang / LMSYS — GitHub repository and releases (v0.5.13, Spec V2 default). Primary source for current version and speculative-decoding default. github.com/sgl-project/sglang
SGLang / LMSYS — Official documentation. Self-description, hardware breadth and OpenAI-compatible API. docs.sglang.io
LMSYS — Fast and Expressive LLM Inference with RadixAttention and SGLang. Original up-to-5x throughput claim and methodology. lmsys.org
LMSYS — Unlocking 25x Inference Performance with SGLang on NVIDIA GB300 NVL72. Source for the GB300 vs latency-capped H200 figure. lmsys.org
PyTorch Blog — Serving DeepSeek-V4 on GB300 with SGLang: 5x higher throughput since Day-0. Vendor-primary context for GB300 day-0 serving. pytorch.org
NVIDIA — tensorrt-llm on PyPI (latest stable 1.2.1, Apr 20 2026; version history). Primary source for current TensorRT-LLM stable version. pypi.org/project/tensorrt-llm
NVIDIA — TensorRT-LLM official release notes. v1.0 GA, PyTorch default backend and deprecation policy. nvidia.github.io/TensorRT-LLM
NVIDIA — TensorRT-LLM GitHub releases. NVFP4, EAGLE3, Blackwell support and trtllm-serve. github.com/NVIDIA/TensorRT-LLM
NVIDIA Developer — TensorRT-LLM product page. Vendor-primary positioning and hardware scope. developer.nvidia.com
NVIDIA Technical Blog — Build with DeepSeek V4 using NVIDIA Blackwell and GPU-accelerated endpoints. Vendor-primary context for Blackwell serving. developer.nvidia.com
Inference Engineering — vLLM vs SGLang vs TensorRT-LLM. Support source for feature-parity and latency framing. inferenceengineering.tech
Yotta Labs — Best LLM Inference Engines 2026 (vLLM, TensorRT-LLM, TGI, SGLang compared). Support source for deploy-ergonomics framing. yottalabs.ai
