What Is Mixture of Experts? MoE Architecture in 7 Key Facts

Last updated: April 2026

Mixture of Experts (MoE) is a neural network architecture that splits each feed-forward layer into multiple parallel “expert” sub-networks and routes every input token to only 1–2 of them. The result is a sparse model: total parameter count can reach hundreds of billions, but compute per token stays equivalent to a model 5–10× smaller. In 2026, MoE powers nearly every frontier LLM — from DeepSeek-R1 (671 B total, 37 B active) to Meta’s Llama 4 Maverick (128 routed experts) — and is the single biggest reason open-source models now rival proprietary ones at a fraction of the training cost.


What Is a Mixture of Experts and Why Does It Matter?

A dense transformer routes every token through every parameter in the model. A Mixture of Experts model does something fundamentally different: it replaces the monolithic feed-forward network (FFN) inside each transformer block with N smaller FFN copies — the “experts” — and a lightweight gating network (also called a router) that decides which experts process each token.

The gating network produces a probability distribution over all experts. Only the top-k experts (typically k = 1 or 2) actually run. The rest stay idle for that token. Mathematically:

Math
y = Σᵢ G(x)ᵢ · Eᵢ(x)     where G(x) = Softmax(TopK(x · Wg, k))

Only k gates are non-zero → only k experts compute Eᵢ(x)

This is called conditional computation: the model has massive capacity (all expert weights combined) but pays the inference cost of only a small slice. Mixtral 8×7B, for example, holds 46.7 billion parameters but activates roughly 12.9 billion per token — running at the speed and memory-bandwidth cost of a 13 B dense model.
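To make the gating step concrete, here is a minimal NumPy sketch of top-2 routing with toy sizes and random weights. Real implementations typically apply the softmax over masked logits; taking the softmax over only the k kept logits, as here, is equivalent up to the renormalization.

```python
import numpy as np

def top2_gate(x, Wg, k=2):
    """Score all experts, keep the top k, renormalize their weights."""
    logits = x @ Wg                            # one logit per expert
    topk = np.argsort(logits)[-k:]             # indices of the k best experts
    w = np.exp(logits[topk] - logits[topk].max())
    w /= w.sum()                               # softmax over the k kept logits
    return topk, w

rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
x = rng.normal(size=d_model)                   # one token's hidden state
Wg = rng.normal(size=(d_model, n_experts))     # gating weights

# Toy "experts": each a distinct linear map standing in for an FFN
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

idx, w = top2_gate(x, Wg)
y = sum(wi * (x @ experts[i]) for wi, i in zip(w, idx))
# Only 2 of the 8 experts did any work for this token
```

The loop over `idx` is the whole trick: the other six expert matrices are never multiplied.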

Why this matters in 2026: the gap between open-source and proprietary models has narrowed precisely because MoE lets labs train huge-capacity models at a fraction of the FLOPs a dense equivalent would require. DeepSeek-V3 was trained on approximately 2,048 H800 GPUs in roughly two months — an order of magnitude less compute than a hypothetical 671 B dense model would need.

How Does the Router Select Experts?

Figure: MoE Token Routing — Top-2 Gating. An input token passes through the gating network, which produces softmax probabilities over 8 experts ([0.02, 0.04, 0.41, 0.03, 0.01, 0.05, 0.38, 0.06]). The top-2 (Expert 3 and Expert 7) process the token in parallel and their outputs are combined as a weighted sum (0.52 × E₃ + 0.48 × E₇); the other 6 experts stay idle at zero compute cost. (© DecodeTheFuture.org)

The diagram above shows the most common pattern: top-2 routing. The gating network computes a softmax over all 8 experts and selects the two with the highest probabilities (Expert 3 at 0.41 and Expert 7 at 0.38). The final output is their weighted sum, with the two selected probabilities renormalized to sum to 1 (0.41/0.79 ≈ 0.52 and 0.38/0.79 ≈ 0.48). The remaining 6 experts do zero computation for this token.

There are three main routing strategies used in production today:

| Strategy | How it works | Used by |
|---|---|---|
| Top-K routing | Router picks the K experts with the highest gating scores per token | Mixtral (K=2), Switch Transformer (K=1) |
| Expert-Choice routing | Each expert picks the top-K tokens it wants to process (inverted selection) | Google research papers |
| Shared + Routed | A shared expert always runs; the router selects K more from a routed pool | Llama 4 Maverick (K=1), DeepSeek-V3 (K=8) |

The shared-plus-routed pattern is a 2025–2026 innovation worth highlighting: Llama 4 Maverick activates 1 shared expert plus 1 routed expert from a pool of 128, which stabilizes generalization (the shared expert provides a consistent baseline) while allowing token-level specialization through routing.
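A toy sketch of the shared-plus-routed forward pass, under the same simplifications as before (random linear maps for experts; whether the routed expert's output is scaled by its gate probability is a per-model implementation detail we assume here):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_routed = 8, 4          # toy sizes; Maverick routes over 128 experts

shared_W = rng.normal(size=(d, d))                     # always runs
routed_W = [rng.normal(size=(d, d)) for _ in range(n_routed)]
Wg = rng.normal(size=(d, n_routed))                    # router weights

def shared_plus_routed(x):
    """Every token gets the shared expert plus its single best routed expert."""
    logits = x @ Wg
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    i = int(np.argmax(probs))              # top-1 routed expert
    return x @ shared_W + probs[i] * (x @ routed_W[i])

y = shared_plus_routed(rng.normal(size=d))
```

The shared term gives every token the same baseline transformation; only the second term varies with the router's decision.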

Why Does MoE Beat Dense Models at Scale?

The scaling argument for MoE boils down to one insight: model capacity and compute cost are decoupled. In a dense transformer, doubling the parameter count roughly doubles both the capacity and the FLOPs per forward pass. In an MoE transformer, you can double capacity (by adding more experts) without changing the FLOPs per token — because only k experts fire.

Concrete numbers illustrate the gap:

| Model | Total Params | Active Params / Token | Effective Dense Equivalent |
|---|---|---|---|
| Mixtral 8×7B | 46.7 B | 12.9 B | ~13 B |
| DeepSeek-V3 / R1 | 671 B | 37 B | ~37 B |
| Llama 4 Maverick | 400 B* | ~17 B* | ~17 B |
| Kimi K2 | 1 T | 32 B | ~32 B |

* Llama 4 Maverick figures are approximate based on public architecture details (1 shared + 1 routed from 128).

DeepSeek-V3 matches GPT-4-class performance on most benchmarks while requiring roughly 2.8 million H800 GPU-hours (about $5.6 million at the rental rates assumed in its technical report) — a fraction of what a dense 671 B model would need. The saving comes from the fact that during training, each token only backpropagates through 37 B parameters, not 671 B.

Key insight for practitioners

MoE doesn’t reduce memory — all expert weights must be loaded into VRAM. It reduces compute per token. This is why MoE shines in throughput-bound scenarios (serving thousands of concurrent users) but can be challenging for latency-bound single-user setups on limited hardware.
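The arithmetic behind this note is easy to sanity-check. A tiny helper (GB for the weights alone; KV cache and activations come on top):

```python
def weight_vram_gb(total_params_billions, bytes_per_param):
    """Billions of params x bytes/param = GB, since 1e9 params * 1 byte = 1 GB."""
    return total_params_billions * bytes_per_param

# Memory is set by TOTAL parameters; compute is set by ACTIVE parameters.
print(weight_vram_gb(46.7, 2))     # Mixtral 8x7B at FP16  -> 93.4 GB
print(weight_vram_gb(671, 1))      # DeepSeek-R1 at FP8    -> 671 GB
print(weight_vram_gb(671, 0.5))    # 4-bit quantization    -> 335.5 GB
```

Note that the active-parameter count never appears: it governs FLOPs per token, not residency.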

What Are the Biggest Challenges When Training MoE Models?

Load Balancing: The Expert Collapse Problem

Without intervention, the router tends to converge on sending most tokens to the same 1–2 “popular” experts. This creates a self-reinforcing loop: favored experts get trained more, become better, get selected even more — while the remaining experts stagnate. The result is a model that barely uses its capacity.

The standard fix is an auxiliary loss that penalizes uneven expert utilization. Switch Transformer introduced a simple formulation: multiply the fraction of tokens dispatched to each expert by the fraction of the router’s probability allocated to that expert, then minimize the sum. This gently pushes the router toward spreading tokens more evenly.
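The formulation described above can be sketched in a few lines of NumPy (the α coefficient here is a typical small value, an assumption rather than something this article prescribes):

```python
import numpy as np

def switch_aux_loss(router_probs, assignment, n_experts, alpha=0.01):
    """Load-balancing loss: alpha * N * sum_i(f_i * P_i), minimized when
    both dispatched tokens and router probability spread evenly over experts."""
    # f_i: fraction of tokens actually dispatched to expert i
    f = np.bincount(assignment, minlength=n_experts) / len(assignment)
    # P_i: mean router probability allocated to expert i
    P = router_probs.mean(axis=0)
    return alpha * n_experts * float(np.sum(f * P))

probs_even = np.full((8, 4), 0.25)            # uniform router probabilities
even = np.array([0, 1, 2, 3, 0, 1, 2, 3])     # tokens spread evenly
print(switch_aux_loss(probs_even, even, 4))   # ~0.01: the balanced minimum, alpha

probs_skew = np.full((8, 4), 0.05)
probs_skew[:, 0] = 0.85                       # router favors expert 0...
skew = np.zeros(8, dtype=int)                 # ...and sends every token there
print(switch_aux_loss(probs_skew, skew, 4))   # ~0.034: imbalance is penalized
```

Because f is a hard count and P is differentiable, gradients flow through P, nudging the router toward the experts it currently underuses.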

In practice, even with auxiliary loss, load imbalance persists. NVIDIA’s empirical analysis of Mixtral 8×7B on the MMLU benchmark showed that the busiest expert receives 40–60% more tokens than the least busy one — and that specific domains (e.g. abstract algebra) heavily favor particular expert pairs. Perfect balance is neither achieved nor necessary; what matters is that no expert starves completely.

Router Z-Loss: Stabilizing Large-Scale Training

Google’s ST-MoE paper introduced router z-loss, which penalizes large logit values in the gating network. Large logits cause the softmax to saturate, making routing decisions nearly deterministic and fragile. Z-loss keeps the logit magnitudes moderate, improving training stability without degrading model quality — and it’s cheap to compute.
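The z-loss itself is just the squared log-sum-exp of the router logits, averaged over tokens. A minimal sketch (the coefficient is a typical choice, not taken from this article):

```python
import numpy as np

def router_z_loss(logits, coeff=1e-3):
    """ST-MoE z-loss: mean squared logsumexp of the per-token router logits."""
    lse = np.log(np.sum(np.exp(logits), axis=-1))   # logsumexp per token
    return coeff * float(np.mean(lse ** 2))

calm = np.array([[0.1, -0.2, 0.3]])    # moderate logits, soft routing
hot = calm * 100                        # saturated softmax, fragile routing
assert router_z_loss(hot) > router_z_loss(calm)
```

Since logsumexp grows with the largest logit, the penalty pulls all logit magnitudes toward zero without preferring any particular expert.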

Memory vs. Compute Trade-off

MoE models need all expert weights resident in memory even though only a fraction activates per token. A 671 B MoE at FP16 needs ~1.3 TB of VRAM just for the weights. This is why MoE inference almost always happens across multiple GPUs, even when the active compute per token would fit on a single one. Expert parallelism (distributing different experts across GPUs) and tensor parallelism (splitting individual experts across GPUs) are the two strategies for managing this — and we’ll cover them in detail below.

Which Models Use MoE in 2026?

As of April 2026, MoE is no longer an exotic architecture — it’s the default for frontier models. On the Artificial Analysis leaderboard, 9 of the top 10 open-source models use MoE. Here is the current landscape:

| Model | Release | Experts | Routing | Key Innovation |
|---|---|---|---|---|
| DeepSeek-V3.2 Speciale | 2026 | 256 (8 active) | 1 shared + top-8 routed | Fine-grained experts; IMO gold medal reasoning |
| Llama 4 Maverick | Apr 2025 | 128 (+1 shared) | 1 shared + 1 routed | Shared expert stabilizes generalization |
| Llama 4 Scout | Apr 2025 | 16 (+1 shared) | 1 shared + 1 routed | Smaller MoE for constrained deployment |
| Kimi K2 | 2025 | Unknown | Sparse | 1 T params, 32 B active; top reasoning scores |
| Mixtral 8×7B | Dec 2023 | 8 | Top-2 | First open MoE LLM at frontier quality |
| DeepSeek-R1 | Jan 2025 | 256 (8 active) | 1 shared + top-8 routed | RL-trained reasoning on MoE backbone |

A pattern emerges: the number of experts is increasing rapidly (from 8 in Mixtral to 128–256 in latest models) while the number of active experts per token stays small (1–2). This means the ratio of total-to-active parameters is growing, which pushes memory requirements up but keeps per-token compute constant.

How Do You Serve a MoE Model in Production?

Deploying a MoE model is harder than deploying an equivalently-performing dense model because of the memory footprint. A 671 B parameter model at FP8 needs ~670 GB of VRAM — more than a single A100 (80 GB) or H100 (80 GB) can hold. This is where parallelism strategies come in.

Expert Parallelism (EP)

The most MoE-specific strategy: different experts live on different GPUs. When a token needs Expert 3, the hidden state is sent to whichever GPU holds Expert 3, processed, and sent back. The communication pattern is an all-to-all exchange — every GPU may need to send tokens to every other GPU.
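A toy sketch of the dispatch bookkeeping behind that all-to-all, from the perspective of one rank's slice of the batch (top-1 routing and a uniform expert-to-GPU layout are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, n_experts, n_gpus = 12, 8, 4
experts_per_gpu = n_experts // n_gpus      # experts 0-1 on GPU 0, 2-3 on GPU 1, ...

assignment = rng.integers(0, n_experts, size=n_tokens)  # router's top-1 picks
dest_gpu = assignment // experts_per_gpu   # which GPU owns each token's expert

# Send plan for the all-to-all exchange: how many hidden states this rank
# must ship to each destination GPU (and later receive back)
send_counts = np.bincount(dest_gpu, minlength=n_gpus)
print(send_counts)   # uneven counts mean communication and load imbalance
```

This is exactly why load balancing (discussed above) matters operationally: skewed `send_counts` leave some GPUs idle while others queue.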

Combined Strategies in vLLM

In 2025–2026, the LLM serving ecosystem converged on combining multiple parallelism strategies. vLLM, the most widely used open-source serving framework, supports:

| Strategy | What It Splits | When to Use |
|---|---|---|
| Tensor Parallelism (TP) | Individual weight matrices across GPUs | When you need low latency on a single request |
| Expert Parallelism (EP) | Different experts on different GPUs | When you have many experts and high throughput |
| Data Parallelism (DP) | Duplicate model; split batch across replicas | When throughput matters more than latency |
| Pipeline Parallelism (PP) | Different layers on different GPUs | When the model doesn't fit with TP alone |

The current best practice for large MoE models like DeepSeek-R1 is DP attention + EP MoE: use data parallelism for the attention layers (which are shared, not sparse) and expert parallelism for the MoE layers. This achieves up to 1.8× per-GPU throughput compared to pure tensor parallelism.

Here’s how you’d launch DeepSeek-R1 with vLLM using DP attention + EP MoE on an 8-GPU node (see the memory reality check below: the full model needs more aggregate VRAM than a single 8×H100-80GB node provides, but the flags are the same on larger setups):

Bash
# Serve DeepSeek-R1 with Expert Parallelism on 8 GPUs
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 1 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 32768 \
  --quantization fp8 \
  --trust-remote-code
Memory reality check

Even with FP8 quantization, DeepSeek-R1 (671 B) needs ~670 GB VRAM. That’s a minimum of 9× H100-80GB GPUs. In practice, teams use NVIDIA GB200 NVL72 racks or multi-node setups. The llm-d project (Red Hat, Google Cloud, IBM, NVIDIA) provides Kubernetes-native orchestration for exactly this scenario.

FP8 Quantization: The MoE Sweet Spot

FP8 quantization is particularly effective for MoE because each expert is an independent FFN — you can quantize experts individually without cross-expert interference. Benchmarks show FP8 achieving 25–30% higher throughput than FP16 on Mixtral 8×7B with negligible quality loss. For larger models like DeepSeek-V3, FP8 is essentially mandatory for practical deployment.
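The "no cross-expert interference" point is easiest to see with per-expert scales. The sketch below uses a rounded integer grid as a stand-in for real e4m3 encoding (which has mantissa bits, not an integer grid), so it illustrates the scaling logic only:

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite value representable in e4m3

def quantize_per_expert(weight):
    """One scale per expert tensor, chosen so the max weight maps to 448."""
    scale = np.abs(weight).max() / FP8_E4M3_MAX
    q = np.round(weight / scale)           # stand-in for the fp8 encode step
    return q, scale

rng = np.random.default_rng(3)
# Three experts whose weight magnitudes span four orders of magnitude
experts = [rng.normal(scale=s, size=(4, 4)) for s in (0.01, 1.0, 50.0)]

# Because each expert gets its own scale, the error (measured in quant
# steps) stays uniform; a shared scale would crush the small expert.
step_errors = []
for w in experts:
    q, s = quantize_per_expert(w)
    step_errors.append(np.abs(w - q * s).max() / s)   # error in quant steps
print(step_errors)   # each value is at most 0.5 (half a quantization step)
```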

How Do You Run a MoE Model Locally?

For experimentation and development, you don’t need a GPU cluster. Mixtral 8×7B with 4-bit quantization fits on a single GPU with 24 GB VRAM. Here’s a minimal inference example using Hugging Face Transformers:

Python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                        # auto-shard across available GPUs
    quantization_config=BitsAndBytesConfig(   # 4-bit quantization → ~24 GB VRAM
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)

messages = [{"role": "user", "content": "Explain MoE routing in one paragraph."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

print(tokenizer.decode(output[0], skip_special_tokens=True))

For production workloads, switch to vLLM or TensorRT-LLM. Both frameworks include fused MoE kernels that merge expert selection, routing, and FFN computation into a single GPU kernel — eliminating intermediate memory transfers and significantly improving throughput.

What Does the EU AI Act Mean for MoE Models?

The EU AI Act’s General-Purpose AI (GPAI) provisions became applicable in August 2025, and full enforcement arrives in August 2026. Two thresholds matter: a model trained with more than 10²³ FLOPs that can generate language is presumed to be a GPAI model, and above 10²⁵ FLOPs it is presumed to be a GPAI model with systemic risk — triggering mandatory reporting, red-teaming, and incident monitoring.

MoE introduces an interesting wrinkle. DeepSeek-V3 has 671 B total parameters, but its training compute scales with the 37 B parameters active per token. The training FLOPs are therefore closer to what a 37 B dense model would require — yet the 10²³ GPAI threshold is still clearly exceeded given the scale of the training data (14.8 T tokens).
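Using the common ~6·N·D approximation for training FLOPs (N = active parameters, D = training tokens — an approximation, since the exact count depends on architecture), the compliance arithmetic is a one-liner:

```python
# Back-of-envelope training FLOPs: ~6 * N_active * D
active_params = 37e9      # DeepSeek-V3 active parameters per token
train_tokens = 14.8e12    # reported pre-training tokens
flops = 6 * active_params * train_tokens

print(f"{flops:.2e}")     # 3.29e+24, well above the 1e23 GPAI threshold
```

Even counting only active parameters, the result lands more than an order of magnitude above 10²³ FLOPs.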

Practical takeaway for EU compliance

If you’re fine-tuning or deploying a MoE model, the EU AI Act cares about the original pre-training compute, not just your fine-tuning cost. A LoRA fine-tune of DeepSeek-R1 is cheap, but the base model’s training compute still matters for classification purposes. Document the base model’s training compute in your technical documentation even if you only fine-tuned it.

Do MoE Experts Specialize in Specific Domains?

A common misconception is that Expert 1 handles “math,” Expert 2 handles “language,” and so on. The reality is more nuanced: experts tend to specialize at the syntactic and token level rather than the semantic level.

NVIDIA’s empirical analysis of Mixtral 8×7B across the 57-topic MMLU benchmark revealed that while some domain clustering exists (abstract algebra heavily activates experts 3 and 8, while professional law uses expert 4), the strongest specialization patterns are structural: punctuation tokens like “:” and “.” consistently route to the same expert pairs across layers, regardless of topic.

DeepSeek’s fine-grained expert architecture (256 experts, 8 active) pushes specialization further. With more experts to choose from, the router can create more specific clusters. But even here, the specialization is emergent — not designed — and the experts don’t have interpretable “job descriptions” that a human could label.

This connects directly to how deep learning works in general: the model finds whatever internal representation is most useful for reducing loss, regardless of whether that representation aligns with human-interpretable categories. MoE just makes this process modular.

When Should You Choose MoE Over a Dense Model?

MoE isn’t universally better. The right choice depends on your constraints:

| Factor | Choose MoE | Choose Dense |
|---|---|---|
| Throughput | Serving thousands of concurrent users; MoE gives you more capacity per FLOP | Low-concurrency, latency-sensitive applications |
| Training budget | Want frontier performance at lower training cost | Smaller models where MoE overhead isn't justified |
| Memory budget | You have a multi-GPU setup with high aggregate VRAM | Single GPU or memory-constrained edge deployment |
| Fine-tuning | Knowledge-heavy tasks (Q&A, translation) where expert capacity helps | Reasoning-heavy tasks; MoE historically overfits more during fine-tuning |
| Inference cost | Token throughput matters more than per-request latency | Per-request latency is the primary metric |

A practical heuristic: if your deployment target is ≥4 GPUs and you’re optimizing for cost-per-token at scale, MoE almost certainly wins. If you’re running on a single GPU or optimizing for minimum latency, a smaller dense model (or a heavily quantized MoE) is likely better.

For a broader context on how MoE fits into the machine learning ecosystem, consider that MoE is one of several strategies for scaling models efficiently — others include neural architecture search, pruning, and distillation. MoE’s advantage is that it’s an architecture-level decision made at training time, not a post-hoc optimization.

FAQ

What is a Mixture of Experts model?

A Mixture of Experts model is a neural network where each layer contains multiple smaller sub-networks (experts) and a router that sends each input to only 1–2 of them. This makes the model sparse: it has massive total capacity but only uses a small fraction of its parameters for any given input, reducing compute cost while maintaining quality.

How many parameters does a MoE model use per token?

Typically 5–20% of the total. Mixtral 8×7B uses 12.9 B of its 46.7 B parameters per token. DeepSeek-R1 uses 37 B of its 671 B. The exact ratio depends on the number of experts and the top-k routing value.

Is MoE the same as an ensemble of models?

No. An ensemble runs multiple complete models on the same input and averages their outputs. MoE is a single model where different sub-networks within the same architecture process different inputs. MoE is much more efficient because only a fraction of the model runs per input.

Can you run a MoE model on a single GPU?

Yes, with quantization. Mixtral 8×7B at 4-bit quantization fits in ~24 GB VRAM. Larger models like DeepSeek-R1 require multiple GPUs or cloud inference. The memory requirement is determined by total parameters (not active parameters), since all experts must be loaded.

Why does the router need load balancing?

The router tends to develop preferences, sending most tokens to a few “popular” experts in a self-reinforcing loop. Auxiliary losses penalize this imbalance but don’t eliminate it. Some imbalance is actually beneficial — it reflects genuine specialization patterns in the data.

What’s the difference between top-1 and top-2 routing?

Top-1 routing (used in Switch Transformer) sends each token to exactly one expert — maximizing sparsity but sacrificing quality. Top-2 routing (used in Mixtral) sends each token to two experts and combines their outputs — slightly more expensive but consistently better in practice. Most production models use top-2.

Is MoE cheaper to run than a dense model?

MoE reduces compute cost per token (fewer FLOPs) but increases memory cost (all experts loaded). For high-throughput serving, MoE is cheaper per token. For single-user, latency-sensitive inference, the memory overhead can make MoE more expensive than a dense model of equivalent active size.

Sources & Further Reading

  1. Shazeer et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017 — arxiv.org/abs/1701.06538
  2. Fedus et al., Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2022 — arxiv.org/abs/2101.03961
  3. Jiang et al. (Mistral AI), Mixtral of Experts, 2024 — arxiv.org/abs/2401.04088
  4. DeepSeek-AI, DeepSeek-V3 Technical Report, 2024 — arxiv.org/abs/2412.19437
  5. Zoph et al., ST-MoE: Designing Stable and Transferable Sparse Expert Models, 2022 — arxiv.org/abs/2202.08906
  6. NVIDIA Developer Blog, Applying Mixture of Experts in LLM Architectures — developer.nvidia.com
  7. AMD ROCm Blog, The vLLM MoE Playbook: TP, DP, PP and Expert Parallelism, 2025 — rocm.blogs.amd.com
  8. European Commission, Guidelines on obligations for General-Purpose AI providers (EU AI Act) — digital-strategy.ec.europa.eu
