What Is a Transformer? Architecture, Attention & 7 Facts

Last updated: March 2026

A transformer is a neural network architecture introduced in the 2017 paper “Attention Is All You Need” that processes entire sequences in parallel using a mechanism called self-attention. Instead of reading tokens one by one like earlier recurrent models, transformers compute relationships between all tokens simultaneously — enabling faster training and stronger long-range context understanding.

Transformers are the foundation behind every major large language model in 2026, including GPT-5, Claude, Gemini, and Llama, as well as vision models (ViT), speech systems, and protein-folding tools like AlphaFold.

If you’ve used ChatGPT, Claude, or Google Gemini, you’ve interacted with a transformer. If you’ve searched Google in the last five years, a transformer helped rank your results. If you’ve used automatic translation, voice-to-text, or AI-generated images — transformers were almost certainly involved. Understanding how transformers work is no longer optional for anyone serious about artificial intelligence or deep learning.

This guide breaks down the transformer architecture from first principles — what problems it solved, how its components work mechanically, why it dominates modern AI, and what comes next as alternatives like Mamba challenge the transformer’s supremacy.

Why Were Transformers Invented? The Problem with Sequential Processing

Before 2017, the dominant approach for processing language was the recurrent neural network (RNN) and its improved variant, Long Short-Term Memory (LSTM). Both architectures process sequences one token at a time, maintaining an internal “hidden state” that carries context forward. This sequential nature created two fundamental bottlenecks.

First, long-range dependency decay. By the time an RNN reaches the 50th word in a sentence, information about the 1st word has been compressed and distorted through dozens of state updates. LSTMs improved this with gating mechanisms, but the problem never fully disappeared. In a sentence like “The doctor who treated the patients at the rural clinic during the monsoon season was exhausted,” an RNN struggles to connect “doctor” (singular) to “was” across the intervening clause.

Second, no parallelism. Because each step depends on the previous hidden state, RNNs must process tokens strictly in order. This made them slow to train, because the work along the sequence could not be spread across the thousands of parallel cores available on modern GPUs. Google’s 2016 Neural Machine Translation system, built on 8-layer LSTM stacks in both its encoder and decoder, needed days of training on a large GPU cluster for a single model.

Transformers solved both problems at once. By replacing recurrence with attention — a mechanism that lets every token “look at” every other token directly — transformers eliminated the information bottleneck. And because attention computations are independent across positions, the entire sequence can be processed in parallel. The 2017 paper by Vaswani et al. at Google didn’t just propose a marginal improvement; it introduced a paradigm shift that would reshape the entire field of machine learning.

How Does Self-Attention Work?

Self-attention is the core innovation that makes transformers possible. The intuition is simple: for every token in a sequence, compute how relevant every other token is, then create a new representation that blends information weighted by that relevance.

Queries, Keys, and Values

Each input token is projected into three vectors: a Query (Q), a Key (K), and a Value (V). Think of it like a search engine: the Query is your search term, the Keys are the index entries, and the Values are the actual content. For each token, the model computes a dot product between its Query and all Keys to get a “relevance score,” normalizes those scores with softmax, and uses them to create a weighted sum of all Values.

Mathematically, the scaled dot-product attention is:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

Where:
  Q = input × Wq   (query projection)
  K = input × Wk   (key projection)
  V = input × Wv   (value projection)
  dₖ = dimension of key vectors (scaling factor)

The division by √dₖ is crucial — without it, dot products grow large with vector dimensionality, pushing softmax into saturated regions where gradients vanish. This seemingly small detail was one of the key engineering decisions that made transformer training stable.
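The effect of the scaling factor can be checked numerically. The sketch below assumes query and key components are independent with unit variance (an idealization, not a property of any trained model) and uses NumPy; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d_k = 512
q = rng.standard_normal((10_000, d_k))  # assumed unit-variance query components
k = rng.standard_normal((10_000, d_k))  # assumed unit-variance key components

raw = (q * k).sum(axis=-1)       # unscaled dot products
scaled = raw / np.sqrt(d_k)      # what the attention formula actually feeds to softmax

raw_std = raw.std()              # grows like sqrt(d_k)
scaled_std = scaled.std()        # stays near 1 regardless of d_k

# Softmax over the same 8 scores, with and without scaling
peak_raw = softmax(raw[:8]).max()
peak_scaled = softmax(scaled[:8]).max()

print(f"std unscaled: {raw_std:.2f}, std scaled: {scaled_std:.2f}")
print(f"largest softmax weight unscaled: {peak_raw:.3f}, scaled: {peak_scaled:.3f}")
```

Without scaling, the largest weight sits much closer to 1, which is the saturated regime where softmax gradients become tiny.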

Multi-Head Attention: Multiple Perspectives

A single attention computation captures one type of relationship. But language is rich with multiple simultaneous relationships: syntactic dependencies, semantic similarity, coreference, and more. Multi-head attention solves this by running several attention computations in parallel, each with its own learned Q, K, V projections. The original paper used 8 heads; larger models use many more, for example 96 in GPT-3 175B and 128 in Llama 3.1 405B.

Each head operates on a smaller slice of the embedding dimension (d_model / n_heads), so the total computation cost is roughly the same as a single full-dimensional attention. The outputs of all heads are concatenated and linearly projected to produce the final result.

MultiHead(Q, K, V) = Concat(head₁, head₂, ..., headₕ) · Wₒ

Where each headᵢ = Attention(Q·Wqᵢ, K·Wkᵢ, V·Wvᵢ)
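As a rough illustration of the formula, here is a minimal NumPy sketch. It assumes the per-head projections are stored as column blocks of one d_model × d_model matrix per role, a common implementation shortcut that is equivalent to separate per-head matrices. The function name and toy sizes are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    heads = softmax(scores) @ V                          # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # Concat(head_1..head_h)
    return concat @ W_o                                  # final output projection

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 64, 8, 10
x = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
out = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (10, 64)
```

Each head works with 64 / 8 = 8 dimensions, so the total matrix-multiply cost matches a single 64-dimensional head.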

Research has shown that different heads specialize organically during training — some track syntactic structure, others handle positional relationships, and some focus on rare-word semantics. This emergent specialization is one reason transformers generalize so well across tasks.

What Are the Components of the Transformer Architecture?

The original transformer follows an encoder-decoder structure. The encoder reads the full input and builds a contextual representation; the decoder generates output one token at a time, attending both to its own prior outputs and to the encoder’s representation. Here are the key building blocks that appear in both.

Tokenization and Embeddings

Raw text first goes through a tokenizer that splits it into subword units (tokens). Modern tokenizers like Byte-Pair Encoding (BPE) or SentencePiece balance vocabulary size against sequence length — GPT-4 uses roughly 100,000 tokens in its vocabulary. Each token is then mapped to a dense vector through a learned embedding table, typically 768 to 12,288 dimensions depending on model size.
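To make the idea of subword merging concrete, here is a toy sketch of the BPE training loop: count adjacent symbol pairs, merge the most frequent one, repeat. The corpus and helper names are made up for illustration; production tokenizers work on bytes and add many details not shown here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a {symbols: frequency} corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical corpus: word (as a tuple of characters) -> frequency
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)        # learned merge rules, most frequent first
print(list(corpus))  # words re-expressed with the merged symbols
```

More merges mean a larger vocabulary and shorter token sequences, which is the trade-off described above.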

Positional Encoding

Because self-attention treats all positions equally (it has no inherent notion of “first” or “last”), transformers need explicit position information. The original paper used fixed sinusoidal functions at different frequencies. Modern models have moved to Rotary Position Embeddings (RoPE), which encode relative positions into the attention computation itself and scale better to long contexts. RoPE is used in Llama, Mistral, and many other open-source models.
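The relative-position property of RoPE can be demonstrated in a few lines. This sketch rotates adjacent feature pairs and assumes the commonly used base of 10000. Real implementations differ in how they pair dimensions and how they scale to long contexts, so treat it as an illustration, not any specific model's code.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of a 1-D vector by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one rotation frequency per feature pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Same offset (4 positions apart) at two different absolute positions
score_a = rope(q, 3) @ rope(k, 7)
score_b = rope(q, 103) @ rope(k, 107)
# Different offset (5 positions apart)
score_c = rope(q, 3) @ rope(k, 8)

print(np.isclose(score_a, score_b))  # the score depends only on the offset
print(np.isclose(score_a, score_c))
```

Because the attention score between two rotated vectors depends only on how far apart the tokens are, no separate position vector has to be added to the embeddings.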

Feed-Forward Networks

After each attention layer, every token independently passes through a position-wise feed-forward network (FFN) — typically two linear layers with a nonlinear activation in between. Modern models often use SwiGLU or GeGLU activations instead of the original ReLU. The FFN is where much of the model’s “knowledge” is stored — factual associations, language patterns, and reasoning templates are encoded in these weights.
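A minimal sketch of a SwiGLU feed-forward block follows, with made-up sizes and random weights. It omits biases and uses three weight matrices (gate, up, down), which is one common way to lay out the gated variant.

```python
import numpy as np

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Position-wise FFN with a SwiGLU gate: (swish(x W_gate) * (x W_up)) W_down."""
    gate = x @ W_gate
    swish = gate / (1.0 + np.exp(-gate))   # swish(z) = z * sigmoid(z)
    return (swish * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_hidden, seq_len = 16, 64, 5     # illustrative sizes
W_gate = rng.standard_normal((d_model, d_hidden)) * 0.1
W_up = rng.standard_normal((d_model, d_hidden)) * 0.1
W_down = rng.standard_normal((d_hidden, d_model)) * 0.1

x = rng.standard_normal((seq_len, d_model))
y = swiglu_ffn(x, W_gate, W_up, W_down)

# "Position-wise": each token is transformed on its own,
# so reordering the tokens only reorders the outputs.
perm = np.array([4, 2, 0, 3, 1])
y_perm = swiglu_ffn(x[perm], W_gate, W_up, W_down)
print(y.shape, np.allclose(y_perm, y[perm]))
```

Unlike attention, nothing in this block mixes information between positions.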

Layer Normalization and Residual Connections

In the original design, each sub-layer (attention or FFN) is wrapped with a residual connection (the input is added directly to the output) followed by layer normalization. Modern practice has shifted to “pre-norm” (normalize the input to each sub-layer instead of its output), which stabilizes training at large scales. These residual connections are essential: without them, gradient signals would decay exponentially through dozens of layers, making deep transformers untrainable.
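The difference between the two orderings is easiest to see side by side. The sketch below uses a layer norm without learnable gain and bias, and a zero-output stand-in for the sub-layer; both are simplifications chosen for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified: no learnable gain or bias
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm(x, sublayer):
    return layer_norm(x + sublayer(x))   # ordering from the 2017 paper

def pre_norm(x, sublayer):
    return x + sublayer(layer_norm(x))   # normalize the sub-layer input instead

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)) * 3 + 1

# A sub-layer that contributes nothing, to expose the residual path on its own
zero_sublayer = lambda h: np.zeros_like(h)

print(np.array_equal(pre_norm(x, zero_sublayer), x))   # identity path is untouched
print(np.allclose(post_norm(x, zero_sublayer), x))     # the path itself gets rescaled
```

With pre-norm, the input reaches the output through a pure addition, which is one intuition for why very deep stacks train more stably in that layout.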

What Are the 3 Transformer Families?

The original encoder-decoder design quickly spawned three architectural variants, each optimized for different tasks. Understanding which variant does what is essential for practitioners choosing the right model for their use case.

Encoder-Only: BERT and Understanding Tasks

BERT (Bidirectional Encoder Representations from Transformers, Google 2018) uses only the encoder stack and processes text bidirectionally — each token attends to all tokens in both directions. This makes encoder-only models excellent at understanding tasks: classification, named entity recognition, sentiment analysis, and semantic search. BERT is still widely used in production search engines and retrieval systems. Notable successors include RoBERTa, DeBERTa, and domain-specific models like BioBERT and FinBERT.

Decoder-Only: GPT and Generation Tasks

GPT (Generative Pre-trained Transformer, OpenAI 2018) uses only the decoder stack with causal masking — each token can only attend to previous tokens, never future ones. This makes decoder-only models natural at generation tasks: text completion, conversation, code writing, and creative content. The decoder-only architecture has proven remarkably scalable. Every major large language model in 2026 — GPT-5, Claude, Gemini, Llama 3, Mistral — uses decoder-only transformers, often with hundreds of billions of parameters.
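Causal masking amounts to setting the scores for future positions to negative infinity before the softmax. A minimal NumPy sketch on random scores (the function name is illustrative):

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over scores where position i may only see positions 0..i."""
    seq_len = scores.shape[-1]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                     # exp(-inf) = 0, so future tokens get zero weight
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.standard_normal((5, 5)))
print(np.round(weights, 2))  # lower-triangular: each row only weights earlier tokens
```

The first token can only attend to itself, so its row is a single 1; every later row spreads weight over the tokens that came before it.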

Encoder-Decoder: T5 and Sequence-to-Sequence Tasks

T5 (Text-to-Text Transfer Transformer, Google 2019) preserves the full encoder-decoder structure and frames every NLP task as a text-to-text problem — give it “translate English to French: The house is big” and it outputs “La maison est grande.” This architecture excels at translation, summarization, and tasks where the input and output are fundamentally different sequences. Google’s newer models and some specialized systems still use encoder-decoder designs, particularly for machine translation.

Architecture	Example Models	Best For	Attention Type
Encoder-only	BERT, RoBERTa, DeBERTa	Classification, search, NER	Bidirectional
Decoder-only	GPT-5, Claude, Llama, Mistral	Text generation, chat, code	Causal (left-to-right)
Encoder-decoder	T5, BART, mBART	Translation, summarization	Bidirectional + Causal

How Do Transformers Work Beyond Text? Vision and Multimodal Models

One of the most powerful aspects of the transformer architecture is its domain-agnosticism. The self-attention mechanism doesn’t inherently “know” about language — it operates on sequences of vectors. This flexibility has allowed transformers to conquer domains far beyond NLP.

The Vision Transformer (ViT), introduced by Google in 2020, splits an image into fixed-size patches (typically 16×16 pixels), treats each patch as a “token,” and feeds the resulting sequence through a standard transformer encoder. ViT and its successors now match or exceed convolutional neural networks (CNNs) on image classification, object detection, and segmentation tasks — particularly when pre-trained on large datasets.
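The patch-splitting step is just a reshape. The sketch below assumes a 224×224 RGB input, a common ViT configuration, and stops before the learned linear projection and position embeddings that a real model applies next.

```python
import numpy as np

def image_to_patch_tokens(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be a multiple of the patch size"
    grid = image.reshape(H // patch, patch, W // patch, patch, C)
    grid = grid.transpose(0, 2, 1, 3, 4)         # (patch_rows, patch_cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * C)   # one row per patch

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patch_tokens(image)
print(tokens.shape)  # (196, 768)
```

A 224×224 image becomes a sequence of 14 × 14 = 196 "tokens", each a 16 × 16 × 3 = 768-dimensional vector, which the encoder then treats like any other sequence.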

Modern frontier models are multimodal: they process text, images, audio, and video within a single transformer-based architecture. GPT-4 and Gemini can understand images alongside text. Meta’s ImageBind connects six modalities. These multimodal transformers represent the current frontier of AI capability, enabling systems that can reason across different types of input — describing images, answering questions about charts, or generating code from screenshots.

Transformers have also transformed scientific computing. AlphaFold 2 (DeepMind) uses a transformer-based architecture to predict protein structures with near-experimental accuracy. Transformer-based models are used in drug discovery, weather forecasting (Huawei’s Pangu-Weather), and even music composition.

What Is the Quadratic Complexity Problem, and How Does Flash Attention Solve It?

Self-attention has a fundamental limitation: its computation scales quadratically with sequence length. For a sequence of n tokens, the attention matrix is n × n, requiring O(n²) operations and O(n²) memory. Double the sequence length and you quadruple the compute. This is why early transformers were limited to 512 or 2,048 tokens.
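A quick back-of-the-envelope calculation shows how fast the score matrix grows. The sketch assumes FP16 scores (2 bytes each) and counts a single head in a single layer; real models multiply this by heads, layers, and batch size.

```python
def attention_matrix_cost(n_tokens, bytes_per_score=2):
    """Entries and bytes for one n x n score matrix (one head, one layer)."""
    entries = n_tokens * n_tokens
    return entries, entries * bytes_per_score

for n in (512, 2_048, 128_000):
    entries, size = attention_matrix_cost(n)
    print(f"{n:>7} tokens -> {entries:>14,} scores, {size / 2**20:,.1f} MiB in FP16")
```

Going from 512 to 2,048 tokens is a 4× longer sequence but a 16× larger matrix.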

Context windows have exploded in recent years: from 2K tokens in GPT-3 to 128K in GPT-4 Turbo and 1M in Gemini 1.5 Pro. A major enabler of this expansion was Flash Attention, a family of algorithms developed by Tri Dao and collaborators at Stanford, starting in 2022.

How Flash Attention Works

Flash Attention’s core insight is that the bottleneck isn’t raw compute; it’s memory bandwidth. Standard attention writes enormous intermediate matrices to GPU high-bandwidth memory (HBM), which is slow. Flash Attention restructures the computation into small tiles that fit entirely in the GPU’s fast on-chip SRAM (roughly 10× the bandwidth of HBM on an A100, but limited to about 20MB across the whole chip). Each tile computes attention end-to-end using an “online softmax” technique, so the full n × n attention matrix never materializes in memory.

The result is mathematically identical to standard attention — no approximation, no information loss — but 2–4× faster with linear memory scaling. Flash Attention 3 (2024) further exploits NVIDIA Hopper GPU features (H100), achieving up to 75–85% of theoretical peak throughput at 740–840 TFLOPS in FP16 and 1.2+ PFLOPS in FP8 precision — a 1.5–2× speedup over Flash Attention 2.
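The online softmax idea can be sketched on the CPU. This NumPy version tiles only over keys and values and keeps a running maximum and denominator per query. It illustrates the arithmetic, not the fused GPU kernel, which also tiles over queries and manages SRAM explicitly. Function names and sizes are illustrative.

```python
import numpy as np

def standard_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])            # full (n, n) score matrix
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, tile=16):
    n, d = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full((n, 1), -np.inf)   # running max score per query
    row_sum = np.zeros((n, 1))           # running softmax denominator per query
    for start in range(0, K.shape[0], tile):
        K_t, V_t = K[start:start + tile], V[start:start + tile]
        S = Q @ K_t.T / np.sqrt(d)                   # only an (n, tile) block of scores
        new_max = np.maximum(row_max, S.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)       # rescale everything accumulated so far
        P = np.exp(S - new_max)
        row_sum = row_sum * correction + P.sum(axis=-1, keepdims=True)
        out = out * correction + P @ V_t
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((100, 32)) for _ in range(3))
print(np.allclose(tiled_attention(Q, K, V), standard_attention(Q, K, V)))
```

The two functions agree up to floating-point rounding, even though the tiled version never holds more than one block of scores at a time.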

Practitioner tip: Since PyTorch 2.0, Flash Attention is available via torch.nn.functional.scaled_dot_product_attention — PyTorch automatically selects the optimal backend. In Hugging Face Transformers, enable it with attn_implementation="flash_attention_2" during model initialization.

Hands-On: Self-Attention in PyTorch

Understanding self-attention becomes clearer when you implement it. Here is a minimal, working implementation of scaled dot-product self-attention in PyTorch — the exact computation at the heart of every transformer layer.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.d_model = d_model
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: (batch, seq_len, d_model)
        Q = self.W_q(x)  # Query projection
        K = self.W_k(x)  # Key projection
        V = self.W_v(x)  # Value projection

        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_model)
        attn_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, V)
        return output

# Demo: 2 sequences, each 10 tokens, 64-dim embeddings
x = torch.randn(2, 10, 64)
attn = SelfAttention(d_model=64)
out = attn(x)
print(f"Input shape:  {x.shape}")   # [2, 10, 64]
print(f"Output shape: {out.shape}")  # [2, 10, 64]
print(f"Attention matrix: {10}×{10} = 100 entries per sequence")

This is single-head attention. To make it multi-head, you’d split d_model across h heads, run each independently, concatenate, and project. For production code, use PyTorch’s built-in nn.MultiheadAttention or the scaled_dot_product_attention function, which automatically dispatches to Flash Attention on supported hardware.

What Is the State of Transformer Research in 2026?

The transformer remains the dominant architecture, but the landscape is evolving rapidly. Three major trends define the current research frontier.

Efficiency at Scale: Mixture of Experts

The Mixture of Experts (MoE) approach activates only a subset of the model’s parameters for each input token, dramatically reducing compute while maintaining capacity. Mistral’s Mixtral 8x7B, for example, holds about 47B parameters but uses only about 13B per token, and Mistral reported it matching or outperforming the dense Llama 2 70B on most benchmarks. Google’s Switch Transformer pushed the same idea to 1.6T parameters. DeepSeek-V2 combined MoE with Multi-head Latent Attention (MLA) to compress the KV cache, further reducing memory requirements.
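A minimal sketch of top-k expert routing follows. It assumes 8 experts with 2 active per token and a softmax over the selected router logits; each "expert" is reduced to a single random linear map for brevity. Load balancing and batching, which real MoE layers need, are left out.

```python
import numpy as np

def moe_layer(x, W_router, experts, top_k=2):
    logits = x @ W_router                              # (n_tokens, n_experts)
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]   # top_k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, chosen[t]]
        gates = np.exp(sel - sel.max())
        gates /= gates.sum()                           # softmax over the selected experts only
        for g, e in zip(gates, chosen[t]):
            out[t] += g * (x[t] @ experts[e])          # only top_k experts run for this token
    return out, chosen

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 6, 16, 8                # illustrative sizes
x = rng.standard_normal((n_tokens, d_model))
W_router = rng.standard_normal((d_model, n_experts))
experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.1

out, chosen = moe_layer(x, W_router, experts)
active_fraction = chosen.shape[1] / n_experts
print(chosen)
print(f"experts used per token: {active_fraction:.0%}")
```

Every token touches only a quarter of the expert weights here, while the layer as a whole still stores all eight experts.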

State Space Models and Mamba: The Challenger Architecture

The most significant architectural challenge to transformers comes from State Space Models (SSMs), particularly the Mamba architecture developed by Albert Gu (CMU) and Tri Dao (Princeton). Mamba processes sequences in linear time (O(n)) rather than quadratic, maintains a compact internal state instead of storing a full KV cache, and achieves up to 5× inference throughput improvements over comparably sized transformers.
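To see where the linear-time claim comes from, consider a bare-bones state space recurrence. This sketch uses a fixed diagonal transition and is not Mamba: Mamba makes the parameters input-dependent ("selective") and uses a hardware-aware parallel scan, neither of which is shown. All names and sizes are illustrative.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """y_t = c . h_t with h_t = a * h_{t-1} + b * x_t (diagonal, time-invariant)."""
    h = np.zeros_like(a)
    ys = []
    for x_t in x:              # one pass over the sequence: O(n) time
        h = a * h + b * x_t    # the state keeps the same size however long the input is
        ys.append(c @ h)
    return np.array(ys), h

rng = np.random.default_rng(0)
state_size = 4
a = rng.uniform(0.5, 0.95, state_size)   # per-channel decay
b = rng.standard_normal(state_size)
c = rng.standard_normal(state_size)

x = rng.standard_normal(1000)
y, final_state = ssm_scan(x, a, b, c)
print(y.shape, final_state.shape)        # (1000,) (4,)
```

After 1,000 tokens the model carries only a 4-number state, whereas an attention layer would need to keep keys and values for all 1,000 positions.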

Mamba-3, released in March 2026, introduced three key innovations — Exponential-Trapezoidal Discretization, Complex-Valued SSMs with the “RoPE Trick,” and Multi-Input Multi-Output (MIMO) for increased arithmetic intensity. It improved language modeling perplexity by nearly 4% over Mamba-2 while using only half the state size and doubling inference throughput.

However, transformers still outperform SSMs on tasks requiring precise information retrieval from context (the “needle in a haystack” problem). The most promising direction may be hybrid architectures like AI21’s Jamba, which interleaves transformer attention layers with Mamba SSM layers — getting the best of both worlds: strong retrieval capabilities from attention and efficient long-sequence processing from SSMs.

Hardware Co-Evolution

The AI hardware ecosystem has been deeply optimized for transformer workloads, specifically the massive matrix multiplications that dominate attention and FFN computations. NVIDIA’s H100 and B200 GPUs include Transformer Engines with hardware-level mixed-precision support. As hybrid architectures gain adoption, we may see hardware designers adding dedicated support for recurrent-style operations alongside attention, much as NPU chips have emerged for edge-device deep learning inference.

What Does the EU AI Act Mean for Transformer-Based Models?

The European Union’s AI Act (Regulation (EU) 2024/1689) is the world’s first comprehensive AI regulation, and it directly addresses the transformer-powered foundation models that dominate the industry. The Act categorizes large transformer models as General-Purpose AI (GPAI) models — defined as models capable of performing a wide range of tasks and being integrated into downstream systems.

Key obligations for GPAI providers, in force since August 2, 2025, include publishing training data summaries, maintaining technical documentation, and complying with EU copyright law; providers established outside the EU must also designate an EU representative. Models classified as posing systemic risk face additional requirements. A model is presumed to fall into this class when its training compute exceeds 10²⁵ FLOPs, and the Commission can also designate models using criteria such as reaching 10,000 or more registered EU business users. The additional requirements are mandatory model evaluations, adversarial testing, cybersecurity protections, serious incident reporting, and energy consumption disclosure.
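For a sense of scale on the 10²⁵ FLOPs figure, a widely used rule of thumb estimates dense-transformer training compute as roughly 6 × parameters × training tokens. That approximation is an assumption brought in here, not part of the Act, and the two training runs below are hypothetical. It is no substitute for a provider's own compute accounting.

```python
SYSTEMIC_RISK_THRESHOLD_FLOPS = 1e25  # compute threshold cited above

def estimate_training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

def exceeds_threshold(n_params, n_tokens):
    return estimate_training_flops(n_params, n_tokens) > SYSTEMIC_RISK_THRESHOLD_FLOPS

# Hypothetical training runs, for illustration only
runs = {
    "7B params, 2T tokens": (7e9, 2e12),
    "400B params, 15T tokens": (400e9, 15e12),
}
for name, (n_params, n_tokens) in runs.items():
    flops = estimate_training_flops(n_params, n_tokens)
    print(f"{name}: {flops:.1e} FLOPs, above threshold: {exceeds_threshold(n_params, n_tokens)}")
```

Under this estimate, only the very largest training runs cross the line.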

From August 2, 2026, the European Commission gains full enforcement powers over GPAI providers, with fines of up to €15 million or 3% of global annual turnover, whichever is higher. (The Act’s headline ceiling of €35 million or 7% applies to prohibited AI practices, not to GPAI obligations.) For practitioners building on transformer-based models, this means understanding your position in the GPAI supply chain (provider, deployer, or integrator) and ensuring documentation and risk assessments are in place. For a broader overview of AI regulation, see our guide on what is artificial intelligence.

Real-World Applications of Transformers in 2026

Transformers power applications across virtually every industry. In healthcare, transformer-based models analyze medical imaging, predict protein structures, and assist clinicians with diagnostic reasoning. In finance, they detect fraud patterns, generate market analysis, and power the conversational interfaces of trading platforms. In software engineering, AI agents built on transformer LLMs write, review, and debug code — tools like GitHub Copilot and Claude Code have fundamentally changed how developers work.

The Retrieval-Augmented Generation (RAG) pattern — combining transformer LLMs with external knowledge retrieval — has become the standard architecture for enterprise AI. LoRA fine-tuning enables organizations to customize transformer models for specific domains at a fraction of full training cost. And the Model Context Protocol (MCP) standardizes how transformer-based agents connect to external tools and data sources.
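The cost saving from LoRA comes from training two thin matrices instead of one large one. The sketch below assumes a single 1024×1024 weight and rank 8 (both hypothetical) and leaves out the scaling factor that LoRA implementations usually apply to the update.

```python
import numpy as np

d_in, d_out, rank = 1024, 1024, 8   # hypothetical layer size and LoRA rank

rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out)).astype(np.float32)          # frozen pretrained weight
A = rng.standard_normal((d_in, rank)).astype(np.float32) * 0.01    # trainable down-projection
B = np.zeros((rank, d_out), dtype=np.float32)                      # trainable up-projection, starts at zero

def lora_forward(x):
    return x @ W + (x @ A) @ B   # low-rank update added alongside the frozen path

full_params = W.size
lora_params = A.size + B.size
print(f"full fine-tuning: {full_params:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora_params:,} trainable parameters "
      f"({lora_params / full_params:.2%} of full)")

x = rng.standard_normal((2, d_in)).astype(np.float32)
print(np.allclose(lora_forward(x), x @ W))  # B starts at zero, so behavior is unchanged at first
```

Because B starts at zero, the adapted model initially behaves exactly like the pretrained one, and training only has to update a small fraction of the layer's parameters.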

The key insight for practitioners: transformers are not just a research curiosity — they are the computational engine behind the current wave of AI products and services. Understanding their mechanics, capabilities, and limitations is foundational to working effectively with any modern AI system.

Frequently Asked Questions

What is a transformer in simple terms?

A transformer is a type of neural network that processes all parts of an input (like all the words in a sentence) at the same time, rather than one by one. It uses a mechanism called “attention” to figure out which parts of the input are most relevant to each other. This allows it to understand context much better than older approaches and is why it powers chatbots like ChatGPT, Claude, and Google Gemini.

What is the difference between a transformer and an RNN?

RNNs process sequences token by token, carrying a hidden state forward. This makes them slow (no parallelism) and poor at remembering information from far back in the sequence. Transformers process all tokens in parallel using self-attention, which can directly relate any two positions regardless of distance. This makes transformers faster to train and much better at capturing long-range dependencies.

Why does self-attention scale quadratically?

Self-attention computes a relevance score between every pair of tokens. For a sequence of n tokens, that means n × n comparisons, giving O(n²) time and memory complexity. For 1,000 tokens, that’s 1 million scores; for 100,000 tokens, it’s 10 billion. Flash Attention and related optimizations reduce the memory overhead but don’t change the fundamental O(n²) compute — which is why alternative architectures like Mamba (O(n) linear time) are being actively explored.

Is GPT a transformer?

Yes. GPT stands for “Generative Pre-trained Transformer.” It uses a decoder-only variant of the transformer architecture with causal masking, meaning each token can only attend to previous tokens. This makes GPT-style models excellent at generating text one token at a time. All versions of GPT (1 through 5), as well as Claude, Llama, and Mistral, are decoder-only transformers.

Will transformers be replaced by Mamba or state space models?

Not imminently. Mamba-3 (March 2026) improves language modeling perplexity by nearly 4% and doubles inference throughput relative to Mamba-2, but transformers still outperform SSMs on tasks requiring precise retrieval from context. The most promising direction is hybrid architectures like Jamba that combine transformer attention layers with Mamba SSM layers. Transformers will likely remain dominant for high-precision reasoning tasks, while SSMs gain ground in efficiency-critical deployments and very long sequences.

How does the EU AI Act affect transformer models?

The EU AI Act classifies large transformer-based models as General-Purpose AI (GPAI) models. Providers must publish training data summaries, maintain technical documentation, and comply with copyright rules. Models exceeding 10²⁵ FLOPs in training compute are presumed to pose “systemic risk” and face additional obligations: adversarial testing, incident reporting, and cybersecurity requirements. Full enforcement powers for the European Commission begin August 2, 2026, with fines for GPAI providers of up to €15 million or 3% of global annual turnover.

What is Flash Attention and why does it matter?

Flash Attention is an algorithm that computes exact self-attention much faster by restructuring memory access on GPUs. Instead of writing the full n×n attention matrix to slow GPU memory (HBM), it processes attention in small tiles that fit in fast on-chip SRAM. Flash Attention 3 (2024) achieves 75–85% GPU utilization on NVIDIA H100 — up from 35% with Flash Attention 2. It’s a key reason context windows grew from 2K tokens to 128K+ in recent years, and is natively available in PyTorch 2.0+.

Bibliography

Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30. arxiv.org/abs/1706.03762
Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arxiv.org/abs/1810.04805
Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI. openai.com
Dosovitskiy, A., et al. (2020). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arxiv.org/abs/2010.11929
Dao, T., et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. arxiv.org/abs/2205.14135
Shah, J., Dao, T., et al. (2024). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. arxiv.org/abs/2407.08608
Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arxiv.org/abs/2312.00752
Lahoti, A., et al. (2026). Mamba-3: Improved Sequence Modeling using State Space Principles. arxiv.org/abs/2603.15569
Turner, R. E. (2023, updated 2026). An Introduction to Transformers. arxiv.org/abs/2304.10557
Regulation (EU) 2024/1689 of the European Parliament and of the Council — Artificial Intelligence Act. Official Journal of the European Union. eur-lex.europa.eu
European Commission. (2025). Guidelines for providers of general-purpose AI models. digital-strategy.ec.europa.eu
PyTorch Foundation. (2024). FlashAttention-3 Integration. pytorch.org/blog/flashattention-3
