# Transformer Architecture: Complete Visual Guide

*How transformers work from input to output - the architecture behind GPT, BERT, and modern LLMs*
The transformer is the foundation of modern AI. GPT-4, Claude, Gemini—they’re all transformers at their core. Understanding this architecture is essential for understanding how LLMs work.
Introduced in the seminal 2017 paper “Attention is All You Need” by Vaswani et al., the transformer architecture revolutionized natural language processing by replacing recurrent neural networks (RNNs) with a pure attention-based mechanism. The key insight? Instead of processing text sequentially (word by word), transformers can attend to all positions simultaneously through self-attention, enabling massive parallelization during training.
This parallel processing capability is why we can train models on trillions of tokens—something impossible with sequential architectures like LSTMs. But the real magic lies in how attention allows each token to dynamically gather information from every other token in the sequence, creating rich contextual representations.
The computational breakthrough: RNNs require $O(n)$ sequential steps for a sequence of length $n$, making parallelization impossible. Transformers reduce this to $O(1)$ sequential steps—all tokens are processed simultaneously. This is the difference between training GPT-4 in 3 months versus 30 years.
However, this parallelism comes at a cost: $O(n^2)$ memory and compute for attention. The entire field of efficient transformers (Linformer, Performer, FlashAttention) is dedicated to taming this quadratic beast while preserving the architectural elegance.
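To see where the quadratic term comes from: standard attention materializes an $n \times n$ score matrix per head. A minimal back-of-the-envelope sketch (the head count and fp16 assumption here are illustrative, not tied to any particular model):

```python
def attention_scores_memory_gb(seq_len, num_heads=32, bytes_per_val=2):
    # Standard attention materializes one (seq_len x seq_len)
    # score matrix per head - this is the O(n^2) term
    return num_heads * seq_len**2 * bytes_per_val / (1024**3)

for n in [1_000, 10_000, 100_000]:
    print(f"n={n:>7,}: {attention_scores_memory_gb(n):8.2f} GB per layer")
# n=  1,000:     0.06 GB per layer
# n= 10,000:     5.96 GB per layer
# n=100,000:   596.05 GB per layer  ← the quadratic blowup
```

FlashAttention gets its speedup largely by never materializing this full matrix in GPU main memory.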
## High-Level Overview
```
Input Text: "The cat sat on the mat"
↓
┌───────────────────────────────────┐
│ 1. Tokenization │
│ ["The", "cat", "sat", ...] │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ 2. Embedding Layer │
│ Convert tokens → vectors │
│ [0.2, 0.5, ...], [0.1, ...] │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ 3. Position Encoding │
│ Add position information │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ 4. Transformer Blocks (x N) │
│ ┌─────────────────────┐ │
│ │ Multi-Head Attention│ │
│ └─────────────────────┘ │
│ ↓ │
│ ┌─────────────────────┐ │
│ │ Feed-Forward Net │ │
│ └─────────────────────┘ │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ 5. Output Layer │
│ Predict next token │
│ "mat" → probabilities │
└───────────────────────────────────┘
```
## Component Breakdown
### 1. Token Embeddings
The first step transforms discrete tokens (words or subwords) into continuous vector representations. Think of this as mapping each word to a point in a high-dimensional space where semantic similarity corresponds to geometric proximity.
Mathematically, we maintain an embedding matrix $E \in \mathbb{R}^{V \times d_{\text{model}}}$, where $V$ is the vocabulary size and $d_{\text{model}}$ is the embedding dimension. For a token with ID $i$, we simply look up row $i$:

$$\text{embed}(i) = E_i \in \mathbb{R}^{d_{\text{model}}}$$

For a sequence of $n$ tokens $(t_1, t_2, \dots, t_n)$, we get a matrix $X \in \mathbb{R}^{n \times d_{\text{model}}}$ where each row is an embedding vector.
Why continuous representations? Neural networks need gradients to learn, which require continuous (differentiable) inputs. These embeddings are learned during training—words with similar meanings naturally cluster together through backpropagation.
```python
import numpy as np

class TokenEmbedding:
    def __init__(self, vocab_size=50000, d_model=512):
        # Embedding matrix: (vocab_size, d_model)
        # Each token gets a unique vector (randomly initialized here,
        # then learned during training)
        self.embeddings = np.random.randn(vocab_size, d_model)

    def forward(self, token_ids):
        """
        token_ids: [15, 847, 592, ...] (integer IDs)
        returns: embedding vectors
        """
        return self.embeddings[token_ids]

# Example:
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
emb_layer = TokenEmbedding(vocab_size=5, d_model=512)

# Token IDs: [0, 1, 2] → "the cat sat"
embeddings = emb_layer.forward([0, 1, 2])
# Shape: (3, 512) - three 512-dimensional vectors
```

Visual:
Token "cat" (ID: 1)
↓
Embedding Matrix row 1:
[0.23, 0.15, -0.45, 0.67, ..., 0.12] (512 numbers)
↓
This vector represents "cat" in a continuous space
where similar words have similar vectors
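A quick way to quantify "similar words have similar vectors" is cosine similarity. A minimal sketch with made-up 4-dimensional vectors (real embeddings are learned and much larger; these numbers are purely illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction, 0 = orthogonal, -1.0 = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "embeddings" (illustrative only - real ones come from training)
cat = np.array([0.8, 0.1, 0.6, -0.2])
dog = np.array([0.7, 0.2, 0.5, -0.1])
car = np.array([-0.3, 0.9, -0.4, 0.8])

print(cosine_similarity(cat, dog))  # high (~0.99): related concepts
print(cosine_similarity(cat, car))  # low (~-0.41): unrelated concepts
```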
### 2. Positional Encoding
Here’s the problem: unlike RNNs that process sequences left-to-right, transformers see all tokens simultaneously. Without position information, “dog bites man” and “man bites dog” would be indistinguishable!
We inject positional information by adding a positional encoding to each embedding. The original transformer uses sinusoidal functions:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the position and $i$ indexes the dimension pair. This creates a unique encoding for each position using different frequencies.
Why sin/cos? Multiple brilliant reasons:
- Continuous: Smooth function allows the model to interpolate for unseen positions
- Relative positioning: Due to trigonometric identities, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, helping the model learn relative distances (see the numerical check after the heatmap below)
- Extrapolation: Can handle sequences longer than those seen during training
The final input to the transformer is:

$$X_{\text{input}} = X_{\text{embed}} + PE$$
```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """
    Create position encodings using sin/cos functions
    Different frequencies for different dimensions
    """
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) *
                      -(np.log(10000.0) / d_model))

    pos_encoding = np.zeros((seq_len, d_model))
    # Even dimensions: sine
    pos_encoding[:, 0::2] = np.sin(position * div_term)
    # Odd dimensions: cosine
    pos_encoding[:, 1::2] = np.cos(position * div_term)
    return pos_encoding

# Add to embeddings:
# embeddings_with_pos = token_embeddings + positional_encoding

# Why sin/cos?
# - Continuous function (smooth transitions)
# - Can extrapolate to longer sequences
# - Relative positions: PE(pos+k) can be represented as
#   a linear function of PE(pos)
```

Visual pattern:
```
Position encoding heatmap (first 100 positions, 128 dims):

Pos    Dimensions →
0      ████████████████░░░░░░░░░░░░░░░░
1      ███████████░░░░░░████████████░░░
2      ██████░░░░░░████████████░░░░░░██
3      ███░░░░░████████████░░░░░░██████
...
50     ░░████████░░░░░░████████░░░░░░██
100    ████░░░░░░████████░░░░░░████████

Low freq (left) → High freq (right)
Encodes position at multiple scales
```
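The "linear function of $PE_{pos}$" claim from the list above can be checked numerically: for a fixed offset $k$, each (sin, cos) pair of $PE_{pos+k}$ is a 2D rotation of the corresponding pair of $PE_{pos}$, by an angle that depends only on $k$ and that pair's frequency. A small sketch reusing `sinusoidal_position_encoding` from above:

```python
import numpy as np

d_model, k, pos = 64, 7, 20
pe = sinusoidal_position_encoding(seq_len=128, d_model=d_model)

# Per-frequency rotation angles: theta_i = k / 10000^(2i/d_model)
freqs = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
theta = k * freqs

pairs = pe[pos].reshape(-1, 2)        # (sin, cos) pairs at position pos
sin_p, cos_p = pairs[:, 0], pairs[:, 1]

# Rotate each pair by theta - a linear map independent of pos
sin_pk = sin_p * np.cos(theta) + cos_p * np.sin(theta)
cos_pk = cos_p * np.cos(theta) - sin_p * np.sin(theta)

predicted = np.stack([sin_pk, cos_pk], axis=1).reshape(-1)
print(np.allclose(predicted, pe[pos + k]))  # True
```

This is what lets an attention head compute "7 tokens back" with one fixed linear map, regardless of absolute position.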
3. Transformer Block: The Core Engine
The transformer block is where the magic happens—this is the repeating unit stacked $N$ times (GPT-3 uses $N = 96$!). Each block has two main sub-components:
- Multi-Head Self-Attention: Allows tokens to gather context from other tokens
- Feed-Forward Network: Processes each position independently
Both use residual connections and layer normalization for training stability. The residual connections create “skip paths” that help gradients flow through deep networks without vanishing.
Mathematical formulation of a transformer block:
$$\begin{align*}
\text{Step 1: } & Z = \text{LayerNorm}(X + \text{MultiHeadAttention}(X)) \\
\text{Step 2: } & \text{Output} = \text{LayerNorm}(Z + \text{FFN}(Z))
\end{align*}$$

The residual connections ($+$ operations) are critical. Without them, gradients vanish in deep networks:

$$\frac{\partial \mathcal{L}}{\partial X} = \frac{\partial \mathcal{L}}{\partial \text{Output}} \cdot \underbrace{\frac{\partial \text{FFN}}{\partial Z}}_{\text{can be small}} \cdot \underbrace{\frac{\partial \text{Attn}}{\partial X}}_{\text{can be small}}$$

With residuals, we get an additional gradient path:

$$\frac{\partial \mathcal{L}}{\partial X} = \frac{\partial \mathcal{L}}{\partial \text{Output}} \cdot \underbrace{1}_{\text{from residual}} + \frac{\partial \mathcal{L}}{\partial \text{Output}} \cdot \frac{\partial \text{Functions}}{\partial X}$$

The "1" ensures gradients always flow, even if the function path vanishes. This is why we can train 96-layer models!

```
Input (seq_len, d_model)
        ↓
┌─────────────────────────────┐
│ Multi-Head Self-Attention   │ ← Tokens look at each other
│ + Residual + LayerNorm      │
└─────────────────────────────┘
        ↓
┌─────────────────────────────┐
│ Feed-Forward Network        │ ← Process each position
│ + Residual + LayerNorm      │
└─────────────────────────────┘
        ↓
Output (seq_len, d_model)
```

<CodeBlock language="python" filename="transformer_block.py" code={`class TransformerBlock:
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)

    def forward(self, x):
        # 1. Multi-head attention with residual
        attn_out = self.attention(x)
        x = self.norm1(x + attn_out)  # Residual + normalize

        # 2. Feed-forward with residual
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)  # Residual + normalize
        return x

class FeedForward:
    def __init__(self, d_model=512, d_ff=2048):
        self.W1 = np.random.randn(d_model, d_ff)
        self.W2 = np.random.randn(d_ff, d_model)

    def forward(self, x):
        # Two-layer network: expand then compress
        hidden = relu(x @ self.W1)  # (seq, 512) → (seq, 2048)
        output = hidden @ self.W2   # (seq, 2048) → (seq, 512)
        return output

def relu(x):
    return np.maximum(0, x)`} />

### Why Residual Connections?

```
Without residual:
Input → [Block] → Output
(can distort original info)

With residual:
Input → [Block] → (+) → Output
  |________________|
(original preserved + learned changes)
```

Residual connections solve a fundamental problem in deep learning: as networks get deeper, gradients either vanish (approach zero) or explode during backpropagation. The residual formulation:

$$x_{\text{out}} = x_{\text{in}} + F(x_{\text{in}})$$

where $F$ is the block's transformation, creates a direct gradient pathway. During backpropagation:

$$\frac{\partial L}{\partial x_{\text{in}}} = \frac{\partial L}{\partial x_{\text{out}}} \cdot \left(1 + \frac{\partial F}{\partial x_{\text{in}}}\right)$$

The "$+1$" term ensures gradients can flow directly backward even if $\frac{\partial F}{\partial x_{\text{in}}}$ is very small. This enables training of 96+ layer networks!

Benefits:

- **Prevents vanishing gradients**: The $+1$ term guarantees gradient flow
- **Easier optimization**: Model can learn identity (do nothing) by setting $F(x) \approx 0$
- **Allows very deep networks**: GPT-3 has 96 layers, GPT-4 likely has 120+

### 4. Layer Normalization: Stabilizing Training

Layer normalization standardizes the inputs to each layer, dramatically stabilizing training dynamics.
Unlike batch normalization (which normalizes across the batch dimension), layer norm normalizes across the feature dimension. For an input $x \in \mathbb{R}^{d_{\text{model}}}$, layer norm computes:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where:

- $\mu = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} x_i$ is the mean
- $\sigma^2 = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} (x_i - \mu)^2$ is the variance
- $\gamma, \beta \in \mathbb{R}^{d_{\text{model}}}$ are learned scale and shift parameters
- $\epsilon$ is a small constant (e.g., $10^{-6}$) for numerical stability
- $\odot$ denotes element-wise multiplication

**Why layer norm over batch norm?**

1. **Sequence length independence**: Works with variable-length sequences
2. **Batch size independence**: Each example normalized independently
3. **Better for transformers**: Normalizes the representation of each token independently

<CodeBlock language="python" filename="layer_norm.py" code={`class LayerNorm:
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)  # Learned scale
        self.beta = np.zeros(d_model)  # Learned shift
        self.eps = eps

    def forward(self, x):
        # x shape: (batch, seq_len, d_model)
        # Compute mean and variance across d_model dimension
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)

        # Normalize
        x_norm = (x - mean) / np.sqrt(var + self.eps)

        # Scale and shift
        return self.gamma * x_norm + self.beta

# Why Layer Norm?
# - Stabilizes training
# - Each token normalized independently
# - Works well with variable sequence lengths`} />

### 5. Output Layer: From Hidden States to Predictions

The final layer projects the model's hidden representations back to vocabulary space, producing logits for each possible next token. Given the final hidden state $h \in \mathbb{R}^{d_{\text{model}}}$, we compute:

$$\text{logits} = hW^{\text{out}} \quad \text{where } W^{\text{out}} \in \mathbb{R}^{d_{\text{model}} \times V}$$

Then apply softmax to get probabilities:

$$P(\text{token}_i \mid \text{context}) = \frac{\exp(\text{logit}_i)}{\sum_{j=1}^V \exp(\text{logit}_j)}$$

This gives a probability distribution over the entire vocabulary.
During **generation**, we sample from this distribution (or take argmax for greedy decoding):

$$\text{next\_token} = \arg\max_i P(\text{token}_i \mid \text{context})$$

**Temperature sampling**: We can control randomness by dividing logits by temperature $T$ before softmax:

- $T > 1$: More random (flatter distribution)
- $T < 1$: More confident (peakier distribution)
- $T = 0$: Greedy (always pick most likely)

<CodeBlock language="python" filename="output_layer.py" code={`def softmax(x, axis=-1):
    # Subtract max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class OutputLayer:
    def __init__(self, d_model=512, vocab_size=50000):
        # Project from hidden dim to vocabulary
        self.W = np.random.randn(d_model, vocab_size)

    def forward(self, x):
        """
        x: (batch, seq_len, d_model)
        returns: (batch, seq_len, vocab_size) - probabilities
        """
        logits = x @ self.W  # Shape: (batch, seq_len, vocab_size)
        # Apply softmax to get probabilities
        probs = softmax(logits, axis=-1)
        return probs

def sample_with_temperature(logits, T=1.0):
    # Sketch: divide 1-D logits by T before softmax
    # T > 1 flattens, T < 1 sharpens; T → 0 approaches argmax
    probs = softmax(logits / T)
    return np.random.choice(len(probs), p=probs)

# During generation, given probs from a forward pass:
#   next_token_probs = probs[:, -1, :]                 # (batch, vocab_size)
#   next_token = np.argmax(next_token_probs, axis=-1)  # greedy decoding`} />

## Complete Forward Pass

<CodeBlock language="python" filename="full_transformer.py" code={`class GPTModel:
    def __init__(
        self,
        vocab_size=50000,
        d_model=768,
        num_layers=12,
        num_heads=12,
        d_ff=3072,
        max_seq_len=2048
    ):
        self.embedding = TokenEmbedding(vocab_size, d_model)
        self.pos_encoding = sinusoidal_position_encoding(
            max_seq_len, d_model
        )
        self.blocks = [
            TransformerBlock(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ]
        self.output = OutputLayer(d_model, vocab_size)

    def forward(self, token_ids):
        """
        token_ids: (batch, seq_len) - input token IDs
        returns: (batch, seq_len, vocab_size) - probabilities
        """
        seq_len = token_ids.shape[1]

        # 1. Embed tokens
        x = self.embedding.forward(token_ids)

        # 2. Add positional encoding
        x = x + self.pos_encoding[:seq_len]

        # 3. Pass through transformer blocks
        for block in self.blocks:
            x = block.forward(x)

        # 4. Project to vocabulary
        probs = self.output.forward(x)
        return probs

# Example usage:
model = GPTModel(
    vocab_size=50000,
    d_model=768,
    num_layers=12,
    num_heads=12
)

# Input: "The cat" (a batch of one sequence)
input_ids = np.array([[15, 847]])  # Token IDs, shape (1, 2)
probs = model.forward(input_ids)

# Get next token prediction
next_token_id = np.argmax(probs[0, -1, :])
print(f"Predicted next token: {next_token_id}")  # e.g., "sat"`} />

## Architecture Variants

### GPT (Decoder-Only)

```
Input: "The cat"
  ↓
[Causal Mask] - can't see future tokens
  ↓
Transformer Blocks (decoder only)
  ↓
Output: Next token prediction → "sat"
```

**Use case:** Text generation (GPT-4, Claude, LLaMA)

### BERT (Encoder-Only)

```
Input: "The [MASK] sat"
  ↓
[Bidirectional] - can see all tokens
  ↓
Transformer Blocks (encoder only)
  ↓
Output: Fill in [MASK] → "cat"
```

**Use case:** Understanding tasks (classification, Q&A)

### T5/BART (Encoder-Decoder)

```
Encoder Input: "Translate: Hello"
  ↓
Encoder Blocks (bidirectional)
  ↓
Decoder Input: "Bonjour" (during training)
  ↓
Decoder Blocks (causal) + Cross-Attention to encoder
  ↓
Output: French translation
```

**Use case:** Translation, summarization

## Scaling Laws: The Power of Scale

One of the most remarkable discoveries in modern AI is the predictability of scaling. Performance improves as a power law with model size, dataset size, and compute budget.

**The Chinchilla Scaling Law** (Hoffmann et al., 2022) found that for compute-optimal training:

$$N_{\text{params}} \propto C^{0.5} \quad \text{and} \quad D_{\text{tokens}} \propto C^{0.5}$$

where $C$ is compute budget (in FLOPs), $N_{\text{params}}$ is model size, and $D_{\text{tokens}}$ is dataset size.
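To make these proportionalities concrete, here is a minimal sketch using two common rules of thumb from the scaling-law literature: training compute $C \approx 6ND$ and the Chinchilla-optimal ratio $D \approx 20N$ (both are approximations, not exact constants from this guide):

```python
import math

def chinchilla_optimal(compute_flops):
    # Rules of thumb (approximate): C ≈ 6·N·D and D ≈ 20·N
    # Substituting D = 20N into C = 6ND gives N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: a ~10^24 FLOP training budget
n, d = chinchilla_optimal(1e24)
print(f"Model: ~{n/1e9:.0f}B params, data: ~{d/1e12:.1f}T tokens")
# → roughly 91B params and 1.8T tokens; the same D ≈ 20N ratio
#   gives the "70B model → ~1.4T tokens" figure mentioned below
```

Note how both $N$ and $D$ grow as $C^{0.5}$, exactly the proportionality above.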
**Key insight**: Most models are undertrained! To optimally use compute, scale data and parameters equally. For a 70B parameter model, you should train on ~1.4 trillion tokens (not the ~300B GPT-3 used).

**Loss scaling**: Test loss decreases predictably:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha}$$

where $\alpha \approx 0.076$ for language modeling. This means doubling model size gives a ~5% reduction in loss.

<TechSpecs
  specs={[
    { label: 'GPT-2', value: '1.5B params, 48 layers', iconName: 'cube' },
    { label: 'GPT-3', value: '175B params, 96 layers', iconName: 'database' },
    { label: 'GPT-4', value: '~1.7T params (rumored), 120+ layers', iconName: 'trending-up' },
    { label: 'LLaMA 2 70B', value: '70B params, 80 layers', iconName: 'activity' },
  ]}
/>

**Scaling trends:**

- Loss decreases predictably with size
- Compute optimal: train bigger models on more data
- Emergence: capabilities appear at certain scales

## Memory & Compute

For inference on a 70B model:

<CodeBlock language="python" filename="memory_calculation.py" code={`def calculate_inference_memory(
    num_params=70e9,  # 70 billion
    precision="fp16",
    batch_size=1,
    seq_len=2048,
    num_layers=80,
    d_model=8192
):
    bytes_per_param = 2 if precision == "fp16" else 4

    # Model weights
    model_memory = num_params * bytes_per_param / (1024**3)

    # KV cache: store keys and values for all layers
    kv_cache = (
        2 *  # K and V
        batch_size * num_layers * seq_len * d_model * bytes_per_param
    ) / (1024**3)

    # Activations (rough estimate with a 10x multiplier)
    activation_memory = (
        batch_size * seq_len * d_model * bytes_per_param
    ) / (1024**3) * 10

    total = model_memory + kv_cache + activation_memory
    print(f"Model weights: {model_memory:.1f} GB")
    print(f"KV cache: {kv_cache:.1f} GB")
    print(f"Activations: {activation_memory:.1f} GB")
    print(f"Total: {total:.1f} GB")
    return total

calculate_inference_memory()
# Output:
# Model weights: 130.4 GB
# KV cache: 5.0 GB
# Activations: 0.3 GB
# Total: 135.7 GB

# This is why you need multiple A100 80GB GPUs!`} />

## Training Process

```
1. Pre-training (unsupervised)
   ├─ Objective: Predict next token
   ├─ Data: Massive text corpus (trillions of tokens)
   └─ Cost: $100M+ for GPT-4 scale

2. Supervised Fine-Tuning (SFT)
   ├─ Objective: Follow instructions
   ├─ Data: High-quality examples (10K-100K)
   └─ Cost: $10K-$100K
3. RLHF (Reinforcement Learning)
   ├─ Objective: Align with human preferences
   ├─ Data: Human feedback on outputs
   └─ Cost: $100K-$1M
```

---

## Resources & Further Reading

### 📄 Foundational Papers

<div class="resources-grid">

**The Original Paper**
- <a href="https://arxiv.org/abs/1706.03762" target="_blank" rel="noopener noreferrer">Attention is All You Need</a> (Vaswani et al., 2017)
- The paper that started it all - introduces the transformer architecture
- Must-read for understanding modern LLMs

**Scaling Laws**
- <a href="https://arxiv.org/abs/2001.08361" target="_blank" rel="noopener noreferrer">Scaling Laws for Neural Language Models</a> (Kaplan et al., 2020) - How performance scales with model size, data, and compute
- <a href="https://arxiv.org/abs/2203.15556" target="_blank" rel="noopener noreferrer">Training Compute-Optimal LLMs</a> (Hoffmann et al., 2022) - Chinchilla paper showing most models are undertrained

**Architecture Improvements**
- <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf" target="_blank" rel="noopener noreferrer">GPT-2: Language Models are Unsupervised Multitask Learners</a>
- <a href="https://arxiv.org/abs/2005.14165" target="_blank" rel="noopener noreferrer">GPT-3: Language Models are Few-Shot Learners</a>
- <a href="https://arxiv.org/abs/2302.13971" target="_blank" rel="noopener noreferrer">LLaMA: Open and Efficient Foundation Models</a>

</div>

### 🔧 Implementation Resources

- **The Annotated Transformer**: <a href="http://nlp.seas.harvard.edu/annotated-transformer/" target="_blank" rel="noopener noreferrer">Harvard NLP's line-by-line guide</a>
- **Hugging Face Transformers**: <a href="https://github.com/huggingface/transformers" target="_blank" rel="noopener noreferrer">Production-ready implementations</a>
- **nanoGPT**: <a href="https://github.com/karpathy/nanoGPT" target="_blank" rel="noopener noreferrer">Minimal GPT implementation by Karpathy</a>
- **PyTorch Transformer Tutorial**: [Official PyTorch guide](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)

### 📚 Related Technical Guides

- [Attention Mechanisms Explained →](/technical/attention-mechanisms) - Deep dive into self-attention
- [Position Encodings →](/technical/position-encodings) - How transformers understand order
- [Long-Context Architecture →](/technical/long-context-architecture) - Extending to longer sequences
- [Training Large Language Models →](/technical/llm-training) - How to train transformers at scale
- [Inference Optimization →](/technical/inference-optimization) - Making transformers faster
- [KV Cache Optimization →](/technical/kv-cache-optimization) - Memory management for generation

### 🎓 Educational Content

**Video Lectures**
- [Stanford CS324: Large Language Models](https://stanford-cs324.github.io/winter2022/)
- [Andrej Karpathy: Let's build GPT](https://www.youtube.com/watch?v=kCc8FmEb1nY)
- [3Blue1Brown: Transformers Visualized](https://www.youtube.com/watch?v=wjZofJX0v4M)

**Interactive Tools**
- [Transformer Explainer](https://poloclub.github.io/transformer-explainer/)
- [LLM Visualization](https://bbycroft.net/llm)
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)

### 🔬 Advanced Topics

- [FlashAttention](https://arxiv.org/abs/2205.14135) - IO-aware exact attention
- [Mixture of Experts](https://arxiv.org/abs/2101.03961) - Scaling with sparsity
- [Grouped Query Attention](https://arxiv.org/abs/2305.13245) - Efficient multi-head attention

---

*Last updated: August 2025*