Mixture of Experts (MoE) is a neural network architecture that splits each feed-forward layer into multiple parallel “expert” sub-networks and routes every input token to only 1–2 of them. The result is a sparse model: total parameter count can reach hundreds of billions, but compute per token stays equivalent to a model 5–10× smaller. In 2026, MoE powers nearly every frontier LLM — from DeepSeek-R1 (671 B total, 37 B active) to Meta’s Llama 4 Maverick (128 routed experts) — and is the single biggest reason open-source models now rival proprietary ones at a fraction of the training cost.
What Is a Mixture of Experts and Why Does It Matter?
A dense transformer routes every token through every parameter in the model. A Mixture of Experts model does something fundamentally different: it replaces the monolithic feed-forward network (FFN) inside each transformer block with N smaller FFN copies — the “experts” — and a lightweight gating network (also called a router) that decides which experts process each token.
The gating network produces a probability distribution over all experts. Only the top-k experts (typically k = 1 or 2) actually run. The rest stay idle for that token. Mathematically:
y = Σᵢ G(x)ᵢ · Eᵢ(x) where G(x) = Softmax(TopK(x · Wg, k))
Only k gates are non-zero → only k experts compute Eᵢ(x)
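The gating step can be sketched in plain Python (a minimal illustration of the formula above, not an optimized implementation):

```python
import math

def top_k_gating(logits, k=2):
    """Top-k gating: softmax restricted to the k largest router logits.

    Returns (expert_index, weight) pairs for the selected experts; every
    other expert gets zero weight and does no computation for this token.
    """
    # Indices of the k largest router logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over only the selected logits (the rest are treated as -inf)
    exps = {i: math.exp(logits[i]) for i in top}
    total = sum(exps.values())
    return [(i, exps[i] / total) for i in top]

# 8 experts; the router favors experts 3 and 7, whose weights sum to 1
weights = top_k_gating([0.1, -0.5, 0.2, 1.9, 0.0, -1.2, 0.3, 1.8], k=2)
```

The output y is then the weighted sum of the selected experts' outputs, exactly as in the formula.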
This is called conditional computation: the model has massive capacity (all expert weights combined) but pays the inference cost of only a small slice. Mixtral 8×7B, for example, holds 46.7 billion parameters but activates roughly 12.9 billion per token — running at the speed and memory-bandwidth cost of a 13 B dense model.
Why this matters in 2026: the gap between open-source and proprietary models has narrowed precisely because MoE lets labs train huge-capacity models at a fraction of the FLOPs a dense equivalent would require. DeepSeek-V3 was trained on approximately 2,048 H800 GPUs in roughly two months — an order of magnitude less compute than a hypothetical 671 B dense model would need.
How Does the Router Select Experts?
Consider the most common pattern: top-2 routing. The gating network computes a softmax over all 8 experts and selects the two with the highest probabilities, say Expert 3 at 0.41 and Expert 7 at 0.38; the final output is their weighted sum. The remaining 6 experts do zero computation for this token.
There are three main routing strategies used in production today:
| Strategy | How it works | Used by |
|---|---|---|
| Top-K routing | Router picks the K experts with highest gating scores per token | Mixtral (K=2), Switch Transformer (K=1) |
| Expert-Choice routing | Each expert picks the top-K tokens it wants to process (inverted selection) | Google research papers |
| Shared + Routed | One expert always runs (shared); router selects 1 additional from a routed pool | Llama 4 Maverick, DeepSeek-V3 |
The shared-plus-routed pattern is a 2025–2026 innovation worth highlighting: Llama 4 Maverick activates 1 shared expert plus 1 routed expert from a pool of 128, which stabilizes generalization (the shared expert provides a consistent baseline) while allowing token-level specialization through routing.
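In code, the shared-plus-routed forward pass reduces to a one-line combination; `shared_expert` and `routed_experts` below are hypothetical stand-ins for FFN callables, not Llama 4's actual module names:

```python
def shared_plus_routed_ffn(x, shared_expert, routed_experts, router_logits):
    """One shared expert always runs; the router adds one routed expert.

    With top-1 routing, the softmax weight over the single selected
    expert is 1.0, so the routed output is simply added to the shared
    baseline.
    """
    best = max(range(len(routed_experts)), key=lambda i: router_logits[i])
    return shared_expert(x) + routed_experts[best](x)

# Toy usage with scalar "experts": the shared expert doubles x,
# the routed pool shifts it up or down
out = shared_plus_routed_ffn(
    3.0,
    shared_expert=lambda x: 2 * x,
    routed_experts=[lambda x: x + 1, lambda x: x - 1],
    router_logits=[0.2, 1.5],  # router prefers expert 1
)
```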
Why Does MoE Beat Dense Models at Scale?
The scaling argument for MoE boils down to one insight: model capacity and compute cost are decoupled. In a dense transformer, doubling the parameter count roughly doubles both the capacity and the FLOPs per forward pass. In an MoE transformer, you can double capacity (by adding more experts) without changing the FLOPs per token — because only k experts fire.
Concrete numbers illustrate the gap:
| Model | Total Params | Active Params / Token | Effective Dense Equivalent |
|---|---|---|---|
| Mixtral 8×7B | 46.7 B | 12.9 B | ~13 B |
| DeepSeek-V3 / R1 | 671 B | 37 B | ~37 B |
| Llama 4 Maverick | 400 B* | ~17 B* | ~17 B |
| Kimi K2 | 1 T | 32 B | ~32 B |
* Llama 4 Maverick figures are approximate based on public architecture details (1 shared + 1 routed from 128).
DeepSeek-V3 matches GPT-4-class performance on most benchmarks while requiring roughly 5.5 million H800 GPU-hours — a fraction of what a dense 671 B model would need. The saving comes from the fact that during training, each token only backpropagates through 37 B parameters, not 671 B.
MoE doesn’t reduce memory — all expert weights must be loaded into VRAM. It reduces compute per token. This is why MoE shines in throughput-bound scenarios (serving thousands of concurrent users) but can be challenging for latency-bound single-user setups on limited hardware.
What Are the Biggest Challenges When Training MoE Models?
Load Balancing: The Expert Collapse Problem
Without intervention, the router tends to converge on sending most tokens to the same 1–2 “popular” experts. This creates a self-reinforcing loop: favored experts get trained more, become better, get selected even more — while the remaining experts stagnate. The result is a model that barely uses its capacity.
The standard fix is an auxiliary loss that penalizes uneven expert utilization. Switch Transformer introduced a simple formulation: multiply the fraction of tokens dispatched to each expert by the fraction of the router’s probability allocated to that expert, then minimize the sum. This gently pushes the router toward spreading tokens more evenly.
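That formulation fits in a few lines (a sketch; the Switch Transformer paper scales the term by a small coefficient α, on the order of 10⁻²):

```python
def switch_aux_loss(dispatch_fractions, mean_router_probs, alpha=0.01):
    """Switch Transformer load-balancing loss (sketch).

    dispatch_fractions[i]: fraction of tokens routed to expert i (f_i)
    mean_router_probs[i]:  mean router probability for expert i (P_i)
    The loss alpha * N * sum_i f_i * P_i is minimized (value alpha)
    when both distributions are uniform at 1/N.
    """
    n = len(dispatch_fractions)
    return alpha * n * sum(f * p for f, p in zip(dispatch_fractions, mean_router_probs))

# Uniform routing over 4 experts hits the minimum, alpha itself
balanced = switch_aux_loss([0.25] * 4, [0.25] * 4)       # -> 0.01
# Total collapse onto one expert is penalized 4x harder
collapsed = switch_aux_loss([1, 0, 0, 0], [1, 0, 0, 0])  # -> 0.04
```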
In practice, even with auxiliary loss, load imbalance persists. NVIDIA’s empirical analysis of Mixtral 8×7B on the MMLU benchmark showed that the busiest expert receives 40–60% more tokens than the least busy one — and that specific domains (e.g. abstract algebra) heavily favor particular expert pairs. Perfect balance is neither achieved nor necessary; what matters is that no expert starves completely.
Router Z-Loss: Stabilizing Large-Scale Training
Google’s ST-MoE paper introduced router z-loss, which penalizes large logit values in the gating network. Large logits cause the softmax to saturate, making routing decisions nearly deterministic and fragile. Z-loss keeps the logit magnitudes moderate, improving training stability without degrading model quality — and it’s cheap to compute.
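A sketch of the penalty, assuming the usual formulation (the mean squared log-sum-exp of each token's router logits):

```python
import math

def router_z_loss(logit_rows):
    """Router z-loss (sketch): mean squared log-sum-exp per token.

    Large-magnitude logits inflate the log-sum-exp, so minimizing this
    keeps the router's logits moderate and the softmax unsaturated.
    """
    def lse(row):
        m = max(row)  # subtract the max for numerical stability
        return m + math.log(sum(math.exp(v - m) for v in row))
    return sum(lse(row) ** 2 for row in logit_rows) / len(logit_rows)

# Moderate logits incur a far smaller penalty than saturated ones
small = router_z_loss([[0.5, -0.5, 0.1]])
large = router_z_loss([[50.0, -5.0, 1.0]])
```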
Memory vs. Compute Trade-off
MoE models need all expert weights resident in memory even though only a fraction activates per token. A 671 B MoE at FP16 needs ~1.3 TB of VRAM just for the weights. This is why MoE inference almost always happens across multiple GPUs, even when the active compute per token would fit on a single one. Expert parallelism (distributing different experts across GPUs) and tensor parallelism (splitting individual experts across GPUs) are the two strategies for managing this — and we’ll cover them in detail below.
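The 1.3 TB figure is straightforward arithmetic; a back-of-envelope helper makes the weights-only estimate explicit (KV cache, activations, and framework overhead come on top):

```python
def weight_memory_gb(total_params, bytes_per_param):
    """Rough VRAM needed just for the weights."""
    return total_params * bytes_per_param / 1e9

# All 671 B parameters must be resident, regardless of the 37 B active per token
fp16 = weight_memory_gb(671e9, 2)  # ~1342 GB, i.e. ~1.3 TB at FP16
fp8 = weight_memory_gb(671e9, 1)   # ~671 GB at FP8
```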
Which Models Use MoE in 2026?
As of April 2026, MoE is no longer an exotic architecture — it’s the default for frontier models. On the Artificial Analysis leaderboard, 9 of the top 10 open-source models use MoE. Here is the current landscape:
| Model | Release | Experts | Routing | Key Innovation |
|---|---|---|---|---|
| DeepSeek-V3.2 Speciale | 2026 | 256 (8 active) | Shared + top-2 routed | Fine-grained experts; IMO gold medal reasoning |
| Llama 4 Maverick | Apr 2025 | 128 (+1 shared) | 1 shared + 1 routed | Shared expert stabilizes generalization |
| Llama 4 Scout | Apr 2025 | 16 (+1 shared) | 1 shared + 1 routed | Smaller MoE for constrained deployment |
| Kimi K2 | Jul 2025 | 384 (8 active) | Shared + top-8 routed | 1 T params, 32 B active; top reasoning scores |
| Mixtral 8×7B | Dec 2023 | 8 | Top-2 | First open MoE LLM at frontier quality |
| DeepSeek-R1 | Jan 2025 | 256 (8 active) | Shared + top-2 routed | RL-trained reasoning on MoE backbone |
A pattern emerges: the number of experts is increasing rapidly (from 8 in Mixtral to 128–256 in latest models) while the number of active experts per token stays small (1–2). This means the ratio of total-to-active parameters is growing, which pushes memory requirements up but keeps per-token compute constant.
How Do You Serve a MoE Model in Production?
Deploying a MoE model is harder than deploying an equivalently-performing dense model because of the memory footprint. A 671 B parameter model at FP8 needs ~670 GB of VRAM — more than a single A100 (80 GB) or H100 (80 GB) can hold. This is where parallelism strategies come in.
Expert Parallelism (EP)
The most MoE-specific strategy: different experts live on different GPUs. When a token needs Expert 3, the hidden state is sent to whichever GPU holds Expert 3, processed, and sent back. The communication pattern is an all-to-all exchange — every GPU may need to send tokens to every other GPU.
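The dispatch half of that exchange can be illustrated with a toy bucketing function; it only shows which tokens each rank would send where, and stands in for (rather than implements) the real all-to-all collective:

```python
def bucket_tokens_by_gpu(token_expert_ids, num_experts, num_gpus):
    """Group token indices by the GPU hosting their assigned expert.

    Experts are sharded contiguously: GPU g owns experts
    [g * experts_per_gpu, (g + 1) * experts_per_gpu). Each bucket is the
    set of tokens one rank would ship to GPU g during dispatch.
    """
    experts_per_gpu = num_experts // num_gpus
    buckets = {g: [] for g in range(num_gpus)}
    for token_idx, expert_id in enumerate(token_expert_ids):
        buckets[expert_id // experts_per_gpu].append(token_idx)
    return buckets

# 8 experts over 4 GPUs: experts 0-1 on GPU 0, 2-3 on GPU 1, and so on
buckets = bucket_tokens_by_gpu([3, 0, 7, 2, 5], num_experts=8, num_gpus=4)
```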
Combined Strategies in vLLM
In 2025–2026, the LLM serving ecosystem converged on combining multiple parallelism strategies. vLLM, the most widely used open-source serving framework, supports:
| Strategy | What It Splits | When to Use |
|---|---|---|
| Tensor Parallelism (TP) | Individual weight matrices across GPUs | When you need low latency on a single request |
| Expert Parallelism (EP) | Different experts on different GPUs | When you have many experts and high throughput |
| Data Parallelism (DP) | Duplicate model; split batch across replicas | When throughput > latency matters |
| Pipeline Parallelism (PP) | Different layers on different GPUs | When model doesn’t fit with TP alone |
The current best practice for large MoE models like DeepSeek-R1 is DP attention + EP MoE: use data parallelism for the attention layers (which are shared, not sparse) and expert parallelism for the MoE layers. This achieves up to 1.8× per-GPU throughput compared to pure tensor parallelism.
Here’s how you’d launch DeepSeek-R1 with vLLM on an 8-GPU node (H200-class or larger, since the FP8 weights alone exceed 8 × 80 GB):
```bash
# Serve DeepSeek-R1 with Expert Parallelism on 8 GPUs
# (DeepSeek-R1 ships native FP8 weights; vLLM reads the quantization
#  config from the checkpoint, so no dtype flag is needed)
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 1 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 32768 \
  --trust-remote-code
```
Even with FP8 quantization, DeepSeek-R1 (671 B) needs ~670 GB VRAM. That’s a minimum of 9× H100-80GB GPUs. In practice, teams use NVIDIA GB200 NVL72 racks or multi-node setups. The llm-d project (Red Hat, Google Cloud, IBM, NVIDIA) provides Kubernetes-native orchestration for exactly this scenario.
FP8 Quantization: The MoE Sweet Spot
FP8 quantization is particularly effective for MoE because each expert is an independent FFN — you can quantize experts individually without cross-expert interference. Benchmarks show FP8 achieving 25–30% higher throughput than FP16 on Mixtral 8×7B with negligible quality loss. For larger models like DeepSeek-V3, FP8 is essentially mandatory for practical deployment.
How Do You Run a MoE Model Locally?
For experimentation and development, you don’t need a GPU cluster. Mixtral 8×7B with 4-bit quantization fits on a single GPU with 24 GB VRAM. Here’s a minimal inference example using Hugging Face Transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # auto-shard across available GPUs
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # ~24 GB VRAM
)

messages = [{"role": "user", "content": "Explain MoE routing in one paragraph."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

with torch.no_grad():
    # temperature only takes effect with sampling enabled
    output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
For production workloads, switch to vLLM or TensorRT-LLM. Both frameworks include fused MoE kernels that merge expert selection, routing, and FFN computation into a single GPU kernel — eliminating intermediate memory transfers and significantly improving throughput.
What Does the EU AI Act Mean for MoE Models?
The EU AI Act’s General-Purpose AI (GPAI) provisions became applicable in August 2025, and full enforcement arrives in August 2026. Two thresholds matter: a model trained with more than 10²³ FLOPs that can generate language is presumed to be a GPAI model, and above 10²⁵ FLOPs it is presumed to pose systemic risk, which triggers mandatory reporting, red-teaming, and incident monitoring.
MoE introduces an interesting wrinkle. DeepSeek-V3 has 671 B total parameters but its training compute is dominated by the 37 B active parameters per token. The training FLOPs are therefore closer to what a 37 B dense model would require — but the 10²³ threshold is still likely exceeded given the scale of training data (14.8 T tokens).
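That estimate follows from the standard ≈6·N·D rule of thumb for training compute (6 FLOPs per active parameter per training token, forward plus backward); this is an approximation, not DeepSeek's published accounting:

```python
def training_flops(active_params, tokens):
    """Rule-of-thumb training compute: ~6 FLOPs per active parameter
    per training token (forward + backward pass)."""
    return 6 * active_params * tokens

# 37 B active parameters over 14.8 T tokens
flops = training_flops(37e9, 14.8e12)  # ~3.3e24 FLOPs
exceeds_threshold = flops > 1e23       # well above the EU AI Act's 1e23 line
```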
If you’re fine-tuning or deploying a MoE model, the EU AI Act cares about the original pre-training compute, not just your fine-tuning cost. A LoRA fine-tune of DeepSeek-R1 is cheap, but the base model’s training compute still matters for classification purposes. Document the base model’s training compute in your technical documentation even if you only fine-tuned it.
Do MoE Experts Specialize in Specific Domains?
A common misconception is that Expert 1 handles “math,” Expert 2 handles “language,” and so on. The reality is more nuanced: experts tend to specialize at the syntactic and token level rather than the semantic level.
NVIDIA’s empirical analysis of Mixtral 8×7B across the 57-topic MMLU benchmark revealed that while some domain clustering exists (abstract algebra heavily activates experts 3 and 8, while professional law uses expert 4), the strongest specialization patterns are structural: punctuation tokens like “:” and “.” consistently route to the same expert pairs across layers, regardless of topic.
DeepSeek’s fine-grained expert architecture (256 experts, 8 active) pushes specialization further. With more experts to choose from, the router can create more specific clusters. But even here, the specialization is emergent — not designed — and the experts don’t have interpretable “job descriptions” that a human could label.
This connects directly to how deep learning works in general: the model finds whatever internal representation is most useful for reducing loss, regardless of whether that representation aligns with human-interpretable categories. MoE just makes this process modular.
When Should You Choose MoE Over a Dense Model?
MoE isn’t universally better. The right choice depends on your constraints:
| Factor | Choose MoE | Choose Dense |
|---|---|---|
| Throughput | Serving thousands of concurrent users — MoE gives you more capacity per FLOP | Low-concurrency, latency-sensitive applications |
| Training budget | Want frontier performance at lower training cost | Smaller models where MoE overhead isn’t justified |
| Memory budget | You have a multi-GPU setup with high aggregate VRAM | Single GPU or memory-constrained edge deployment |
| Fine-tuning | Knowledge-heavy tasks (Q&A, translation) where expert capacity helps | Reasoning-heavy tasks; MoE historically overfits more during fine-tuning |
| Inference cost | Token throughput matters more than per-request latency | Per-request latency is the primary metric |
A practical heuristic: if your deployment target is ≥4 GPUs and you’re optimizing for cost-per-token at scale, MoE almost certainly wins. If you’re running on a single GPU or optimizing for minimum latency, a smaller dense model (or a heavily quantized MoE) is likely better.
For a broader context on how MoE fits into the machine learning ecosystem, consider that MoE is one of several strategies for scaling models efficiently — others include neural architecture search, pruning, and distillation. MoE’s advantage is that it’s an architecture-level decision made at training time, not a post-hoc optimization.
FAQ
What is a Mixture of Experts model?
A Mixture of Experts model is a neural network where each layer contains multiple smaller sub-networks (experts) and a router that sends each input to only 1–2 of them. This makes the model sparse: it has massive total capacity but only uses a small fraction of its parameters for any given input, reducing compute cost while maintaining quality.
How many parameters does a MoE model use per token?
Typically 5–20% of the total. Mixtral 8×7B uses 12.9 B of its 46.7 B parameters per token. DeepSeek-R1 uses 37 B of its 671 B. The exact ratio depends on the number of experts and the top-k routing value.
Is MoE the same as an ensemble of models?
No. An ensemble runs multiple complete models on the same input and averages their outputs. MoE is a single model where different sub-networks within the same architecture process different inputs. MoE is much more efficient because only a fraction of the model runs per input.
Can you run a MoE model on consumer hardware?
Yes, with quantization. Mixtral 8×7B at 4-bit quantization fits in ~24 GB VRAM. Larger models like DeepSeek-R1 require multiple GPUs or cloud inference. The memory requirement is determined by total parameters (not active parameters), since all experts must be loaded.
Why do MoE routers need load balancing?
The router tends to develop preferences, sending most tokens to a few “popular” experts in a self-reinforcing loop. Auxiliary losses penalize this imbalance but don’t eliminate it. Some imbalance is actually beneficial — it reflects genuine specialization patterns in the data.
What is the difference between top-1 and top-2 routing?
Top-1 routing (used in Switch Transformer) sends each token to exactly one expert — maximizing sparsity but sacrificing quality. Top-2 routing (used in Mixtral) sends each token to two experts and combines their outputs — slightly more expensive but consistently better in practice. Most production models use top-2.
Is MoE cheaper to serve than a dense model?
MoE reduces compute cost per token (fewer FLOPs) but increases memory cost (all experts loaded). For high-throughput serving, MoE is cheaper per token. For single-user, latency-sensitive inference, the memory overhead can make MoE more expensive than a dense model of equivalent active size.
Sources & Further Reading
- Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017 — arxiv.org/abs/1701.06538
- Fedus et al., Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2022 — arxiv.org/abs/2101.03961
- Jiang et al. (Mistral AI), Mixtral of Experts, 2024 — arxiv.org/abs/2401.04088
- DeepSeek-AI, DeepSeek-V3 Technical Report, 2024 — arxiv.org/abs/2412.19437
- Zoph et al., ST-MoE: Designing Stable and Transferable Sparse Expert Models, 2022 — arxiv.org/abs/2202.08906
- NVIDIA Developer Blog, Applying Mixture of Experts in LLM Architectures — developer.nvidia.com
- AMD ROCm Blog, The vLLM MoE Playbook: TP, DP, PP and Expert Parallelism, 2025 — rocm.blogs.amd.com
- European Commission, Guidelines on obligations for General-Purpose AI providers (EU AI Act) — digital-strategy.ec.europa.eu