# Transformer Architecture: Complete Visual Guide

*How transformers work from input to output - the architecture behind GPT, BERT, and modern LLMs*
The transformer is the foundation of modern AI. GPT-4, Claude, Gemini—they’re all transformers at their core. Understanding this architecture is essential for understanding how LLMs work.
Introduced in the seminal 2017 paper “Attention is All You Need” by Vaswani et al., the transformer architecture revolutionized natural language processing by replacing recurrent neural networks (RNNs) with a pure attention-based mechanism. The key insight? Instead of processing text sequentially (word by word), transformers can attend to all positions simultaneously through self-attention, enabling massive parallelization during training.
This parallel processing capability is why we can train models on trillions of tokens—something impossible with sequential architectures like LSTMs. But the real magic lies in how attention allows each token to dynamically gather information from every other token in the sequence, creating rich contextual representations.
The computational breakthrough: RNNs require $O(n)$ sequential steps for a sequence of length $n$, making parallelization impossible. Transformers reduce this to $O(1)$ sequential steps—all tokens are processed simultaneously. This is the difference between training GPT-4 in 3 months versus 30 years.
However, this parallelism comes at a cost: $O(n^2)$ memory and compute for attention. The entire field of efficient transformers (Linformer, Performer, FlashAttention) is dedicated to taming this quadratic beast while preserving the architectural elegance.
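To see where the quadratic term comes from: standard attention materializes an $n \times n$ score matrix per head. A minimal back-of-the-envelope sketch (the head count and fp16 assumption here are illustrative, not tied to any particular model):

```python
def attention_scores_memory_gb(seq_len, num_heads=32, bytes_per_val=2):
    # Standard attention materializes one (seq_len x seq_len)
    # score matrix per head - this is the O(n^2) term
    return num_heads * seq_len**2 * bytes_per_val / (1024**3)

for n in [1_000, 10_000, 100_000]:
    print(f"n={n:>7,}: {attention_scores_memory_gb(n):8.2f} GB per layer")
# n=  1,000:     0.06 GB per layer
# n= 10,000:     5.96 GB per layer
# n=100,000:   596.05 GB per layer  ← the quadratic blowup
```

FlashAttention gets its speedup largely by never materializing this full matrix in GPU main memory.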
## High-Level Overview
```
Input Text: "The cat sat on the mat"
↓
┌───────────────────────────────────┐
│ 1. Tokenization │
│ ["The", "cat", "sat", ...] │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ 2. Embedding Layer │
│ Convert tokens → vectors │
│ [0.2, 0.5, ...], [0.1, ...] │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ 3. Position Encoding │
│ Add position information │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ 4. Transformer Blocks (x N) │
│ ┌─────────────────────┐ │
│ │ Multi-Head Attention│ │
│ └─────────────────────┘ │
│ ↓ │
│ ┌─────────────────────┐ │
│ │ Feed-Forward Net │ │
│ └─────────────────────┘ │
└───────────────────────────────────┘
↓
┌───────────────────────────────────┐
│ 5. Output Layer │
│ Predict next token │
│ "mat" → probabilities │
└───────────────────────────────────┘
```
## Component Breakdown
### 1. Token Embeddings
The first step transforms discrete tokens (words or subwords) into continuous vector representations. Think of this as mapping each word to a point in a high-dimensional space where semantic similarity corresponds to geometric proximity.
Mathematically, we maintain an embedding matrix $E \in \mathbb{R}^{V \times d_{\text{model}}}$, where $V$ is the vocabulary size and $d_{\text{model}}$ is the embedding dimension. For a token with ID $i$, we simply look up row $i$:

$$\text{embed}(i) = E_i \in \mathbb{R}^{d_{\text{model}}}$$

For a sequence of $n$ tokens $(t_1, t_2, \dots, t_n)$, we get a matrix $X \in \mathbb{R}^{n \times d_{\text{model}}}$ where each row is an embedding vector.
Why continuous representations? Neural networks need gradients to learn, which require continuous (differentiable) inputs. These embeddings are learned during training—words with similar meanings naturally cluster together through backpropagation.
```python
import numpy as np

class TokenEmbedding:
    def __init__(self, vocab_size=50000, d_model=512):
        # Embedding matrix: (vocab_size, d_model)
        # Each token gets a unique vector (randomly initialized here,
        # then learned during training)
        self.embeddings = np.random.randn(vocab_size, d_model)

    def forward(self, token_ids):
        """
        token_ids: [15, 847, 592, ...] (integer IDs)
        returns: embedding vectors
        """
        return self.embeddings[token_ids]

# Example:
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
emb_layer = TokenEmbedding(vocab_size=5, d_model=512)

# Token IDs: [0, 1, 2] → "the cat sat"
embeddings = emb_layer.forward([0, 1, 2])
# Shape: (3, 512) - three 512-dimensional vectors
```

Visual:
Token "cat" (ID: 1)
↓
Embedding Matrix row 1:
[0.23, 0.15, -0.45, 0.67, ..., 0.12] (512 numbers)
↓
This vector represents "cat" in a continuous space
where similar words have similar vectors
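A quick way to quantify "similar words have similar vectors" is cosine similarity. A minimal sketch with made-up 4-dimensional vectors (real embeddings are learned and much larger; these numbers are purely illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction, 0 = orthogonal, -1.0 = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "embeddings" (illustrative only - real ones come from training)
cat = np.array([0.8, 0.1, 0.6, -0.2])
dog = np.array([0.7, 0.2, 0.5, -0.1])
car = np.array([-0.3, 0.9, -0.4, 0.8])

print(cosine_similarity(cat, dog))  # high (~0.99): related concepts
print(cosine_similarity(cat, car))  # low (~-0.41): unrelated concepts
```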
### 2. Positional Encoding
Here’s the problem: unlike RNNs that process sequences left-to-right, transformers see all tokens simultaneously. Without position information, “dog bites man” and “man bites dog” would be indistinguishable!
We inject positional information by adding a positional encoding to each embedding. The original transformer uses sinusoidal functions:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the position and $i$ indexes the dimension pair. This creates a unique encoding for each position using different frequencies.
Why sin/cos? Multiple brilliant reasons:
- Continuous: Smooth function allows the model to interpolate for unseen positions
- Relative positioning: Due to trigonometric identities, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, helping the model learn relative distances (see the numerical check after the heatmap below)
- Extrapolation: Can handle sequences longer than those seen during training
The final input to the transformer is:

$$X_{\text{input}} = X_{\text{embed}} + PE$$
```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """
    Create position encodings using sin/cos functions
    Different frequencies for different dimensions
    """
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) *
                      -(np.log(10000.0) / d_model))

    pos_encoding = np.zeros((seq_len, d_model))
    # Even dimensions: sine
    pos_encoding[:, 0::2] = np.sin(position * div_term)
    # Odd dimensions: cosine
    pos_encoding[:, 1::2] = np.cos(position * div_term)
    return pos_encoding

# Add to embeddings:
# embeddings_with_pos = token_embeddings + positional_encoding

# Why sin/cos?
# - Continuous function (smooth transitions)
# - Can extrapolate to longer sequences
# - Relative positions: PE(pos+k) can be represented as
#   a linear function of PE(pos)
```

Visual pattern:
```
Position encoding heatmap (first 100 positions, 128 dims):

Pos    Dimensions →
0      ████████████████░░░░░░░░░░░░░░░░
1      ███████████░░░░░░████████████░░░
2      ██████░░░░░░████████████░░░░░░██
3      ███░░░░░████████████░░░░░░██████
...
50     ░░████████░░░░░░████████░░░░░░██
100    ████░░░░░░████████░░░░░░████████

Low freq (left) → High freq (right)
Encodes position at multiple scales
```
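The "linear function of $PE_{pos}$" claim from the list above can be checked numerically: for a fixed offset $k$, each (sin, cos) pair of $PE_{pos+k}$ is a 2D rotation of the corresponding pair of $PE_{pos}$, by an angle that depends only on $k$ and that pair's frequency. A small sketch reusing `sinusoidal_position_encoding` from above:

```python
import numpy as np

d_model, k, pos = 64, 7, 20
pe = sinusoidal_position_encoding(seq_len=128, d_model=d_model)

# Per-frequency rotation angles: theta_i = k / 10000^(2i/d_model)
freqs = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
theta = k * freqs

pairs = pe[pos].reshape(-1, 2)        # (sin, cos) pairs at position pos
sin_p, cos_p = pairs[:, 0], pairs[:, 1]

# Rotate each pair by theta - a linear map independent of pos
sin_pk = sin_p * np.cos(theta) + cos_p * np.sin(theta)
cos_pk = cos_p * np.cos(theta) - sin_p * np.sin(theta)

predicted = np.stack([sin_pk, cos_pk], axis=1).reshape(-1)
print(np.allclose(predicted, pe[pos + k]))  # True
```

This is what lets an attention head compute "7 tokens back" with one fixed linear map, regardless of absolute position.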
3. Transformer Block: The Core Engine
The transformer block is where the magic happens—this is the repeating unit stacked $N$ times (GPT-3 uses $N = 96$!). Each block has two main sub-components:
- Multi-Head Self-Attention: Allows tokens to gather context from other tokens
- Feed-Forward Network: Processes each position independently
Both use residual connections and layer normalization for training stability. The residual connections create “skip paths” that help gradients flow through deep networks without vanishing.
Mathematical formulation of a transformer block:
$$\begin{align*}
\text{Step 1: } & Z = \text{LayerNorm}(X + \text{MultiHeadAttention}(X)) \\
\text{Step 2: } & \text{Output} = \text{LayerNorm}(Z + \text{FFN}(Z))
\end{align*}$$

The residual connections ($+$ operations) are critical. Without them, gradients vanish in deep networks:

$$\frac{\partial \mathcal{L}}{\partial X} = \frac{\partial \mathcal{L}}{\partial \text{Output}} \cdot \underbrace{\frac{\partial \text{FFN}}{\partial Z}}_{\text{can be small}} \cdot \underbrace{\frac{\partial \text{Attn}}{\partial X}}_{\text{can be small}}$$

With residuals, we get an additional gradient path:

$$\frac{\partial \mathcal{L}}{\partial X} = \frac{\partial \mathcal{L}}{\partial \text{Output}} \cdot \underbrace{1}_{\text{from residual}} + \frac{\partial \mathcal{L}}{\partial \text{Output}} \cdot \frac{\partial \text{Functions}}{\partial X}$$

The "1" ensures gradients always flow, even if the function path vanishes. This is why we can train 96-layer models!

```
Input (seq_len, d_model)
        ↓
┌─────────────────────────────┐
│ Multi-Head Self-Attention   │ ← Tokens look at each other
│ + Residual + LayerNorm      │
└─────────────────────────────┘
        ↓
┌─────────────────────────────┐
│ Feed-Forward Network        │ ← Process each position
│ + Residual + LayerNorm      │
└─────────────────────────────┘
        ↓
Output (seq_len, d_model)
```

<CodeBlock language="python" filename="transformer_block.py" code={`class TransformerBlock:
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)

    def forward(self, x):
        # 1. Multi-head attention with residual
        attn_out = self.attention(x)
        x = self.norm1(x + attn_out)  # Residual + normalize

        # 2. Feed-forward with residual
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)  # Residual + normalize
        return x

class FeedForward:
    def __init__(self, d_model=512, d_ff=2048):
        self.W1 = np.random.randn(d_model, d_ff)
        self.W2 = np.random.randn(d_ff, d_model)

    def forward(self, x):
        # Two-layer network: expand then compress
        hidden = relu(x @ self.W1)  # (seq, 512) → (seq, 2048)
        output = hidden @ self.W2   # (seq, 2048) → (seq, 512)
        return output

def relu(x):
    return np.maximum(0, x)`} />

### Why Residual Connections?

```
Without residual:
Input → [Block] → Output
(can distort original info)

With residual:
Input → [Block] → (+) → Output
  |________________|
(original preserved + learned changes)
```

Residual connections solve a fundamental problem in deep learning: as networks get deeper, gradients either vanish (approach zero) or explode during backpropagation. The residual formulation:

$$x_{\text{out}} = x_{\text{in}} + F(x_{\text{in}})$$

where $F$ is the block's transformation, creates a direct gradient pathway. During backpropagation:

$$\frac{\partial L}{\partial x_{\text{in}}} = \frac{\partial L}{\partial x_{\text{out}}} \cdot \left(1 + \frac{\partial F}{\partial x_{\text{in}}}\right)$$

The "$+1$" term ensures gradients can flow directly backward even if $\frac{\partial F}{\partial x_{\text{in}}}$ is very small. This enables training of 96+ layer networks!

Benefits:

- **Prevents vanishing gradients**: The $+1$ term guarantees gradient flow
- **Easier optimization**: Model can learn identity (do nothing) by setting $F(x) \approx 0$
- **Allows very deep networks**: GPT-3 has 96 layers, GPT-4 likely has 120+

### 4. Layer Normalization: Stabilizing Training

Layer normalization standardizes the inputs to each layer, dramatically stabilizing training dynamics.
Unlike batch normalization (which normalizes across the batch dimension), layer norm normalizes across the feature dimension. For an input $x \in \mathbb{R}^{d_{\text{model}}}$, layer norm computes:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where:

- $\mu = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} x_i$ is the mean
- $\sigma^2 = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} (x_i - \mu)^2$ is the variance
- $\gamma, \beta \in \mathbb{R}^{d_{\text{model}}}$ are learned scale and shift parameters
- $\epsilon$ is a small constant (e.g., $10^{-6}$) for numerical stability
- $\odot$ denotes element-wise multiplication

**Why layer norm over batch norm?**

1. **Sequence length independence**: Works with variable-length sequences
2. **Batch size independence**: Each example normalized independently
3. **Better for transformers**: Normalizes the representation of each token independently

<CodeBlock language="python" filename="layer_norm.py" code={`class LayerNorm:
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)  # Learned scale
        self.beta = np.zeros(d_model)  # Learned shift
        self.eps = eps

    def forward(self, x):
        # x shape: (batch, seq_len, d_model)
        # Compute mean and variance across d_model dimension
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)

        # Normalize
        x_norm = (x - mean) / np.sqrt(var + self.eps)

        # Scale and shift
        return self.gamma * x_norm + self.beta

# Why Layer Norm?
# - Stabilizes training
# - Each token normalized independently
# - Works well with variable sequence lengths`} />

### 5. Output Layer: From Hidden States to Predictions

The final layer projects the model's hidden representations back to vocabulary space, producing logits for each possible next token. Given the final hidden state $h \in \mathbb{R}^{d_{\text{model}}}$, we compute:

$$\text{logits} = hW^{\text{out}} \quad \text{where } W^{\text{out}} \in \mathbb{R}^{d_{\text{model}} \times V}$$

Then apply softmax to get probabilities:

$$P(\text{token}_i \mid \text{context}) = \frac{\exp(\text{logit}_i)}{\sum_{j=1}^V \exp(\text{logit}_j)}$$

This gives a probability distribution over the entire vocabulary.
During **generation**, we sample from this distribution (or take argmax for greedy decoding):

$$\text{next\_token} = \arg\max_i P(\text{token}_i \mid \text{context})$$

**Temperature sampling**: We can control randomness by dividing logits by temperature $T$ before softmax:

- $T > 1$: More random (flatter distribution)
- $T < 1$: More confident (peakier distribution)
- $T = 0$: Greedy (always pick most likely)

<CodeBlock language="python" filename="output_layer.py" code={`def softmax(x, axis=-1):
    # Subtract max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class OutputLayer:
    def __init__(self, d_model=512, vocab_size=50000):
        # Project from hidden dim to vocabulary
        self.W = np.random.randn(d_model, vocab_size)

    def forward(self, x):
        """
        x: (batch, seq_len, d_model)
        returns: (batch, seq_len, vocab_size) - probabilities
        """
        logits = x @ self.W  # Shape: (batch, seq_len, vocab_size)
        # Apply softmax to get probabilities
        probs = softmax(logits, axis=-1)
        return probs

def sample_with_temperature(logits, T=1.0):
    # Sketch: divide 1-D logits by T before softmax
    # T > 1 flattens, T < 1 sharpens; T → 0 approaches argmax
    probs = softmax(logits / T)
    return np.random.choice(len(probs), p=probs)

# During generation, given probs from a forward pass:
#   next_token_probs = probs[:, -1, :]                 # (batch, vocab_size)
#   next_token = np.argmax(next_token_probs, axis=-1)  # greedy decoding`} />

## Complete Forward Pass

<CodeBlock language="python" filename="full_transformer.py" code={`class GPTModel:
    def __init__(
        self,
        vocab_size=50000,
        d_model=768,
        num_layers=12,
        num_heads=12,
        d_ff=3072,
        max_seq_len=2048
    ):
        self.embedding = TokenEmbedding(vocab_size, d_model)
        self.pos_encoding = sinusoidal_position_encoding(
            max_seq_len, d_model
        )
        self.blocks = [
            TransformerBlock(d_model, num_heads, d_ff)
            for _ in range(num_layers)
        ]
        self.output = OutputLayer(d_model, vocab_size)

    def forward(self, token_ids):
        """
        token_ids: (batch, seq_len) - input token IDs
        returns: (batch, seq_len, vocab_size) - probabilities
        """
        seq_len = token_ids.shape[1]

        # 1. Embed tokens
        x = self.embedding.forward(token_ids)

        # 2. Add positional encoding
        x = x + self.pos_encoding[:seq_len]

        # 3. Pass through transformer blocks
        for block in self.blocks:
            x = block.forward(x)

        # 4. Project to vocabulary
        probs = self.output.forward(x)
        return probs

# Example usage:
model = GPTModel(
    vocab_size=50000,
    d_model=768,
    num_layers=12,
    num_heads=12
)

# Input: "The cat" (a batch of one sequence)
input_ids = np.array([[15, 847]])  # Token IDs, shape (1, 2)
probs = model.forward(input_ids)

# Get next token prediction
next_token_id = np.argmax(probs[0, -1, :])
print(f"Predicted next token: {next_token_id}")  # e.g., "sat"`} />

## Architecture Variants

### GPT (Decoder-Only)

```
Input: "The cat"
  ↓
[Causal Mask] - can't see future tokens
  ↓
Transformer Blocks (decoder only)
  ↓
Output: Next token prediction → "sat"
```

**Use case:** Text generation (GPT-4, Claude, LLaMA)

### BERT (Encoder-Only)

```
Input: "The [MASK] sat"
  ↓
[Bidirectional] - can see all tokens
  ↓
Transformer Blocks (encoder only)
  ↓
Output: Fill in [MASK] → "cat"
```

**Use case:** Understanding tasks (classification, Q&A)

### T5/BART (Encoder-Decoder)

```
Encoder Input: "Translate: Hello"
  ↓
Encoder Blocks (bidirectional)
  ↓
Decoder Input: "Bonjour" (during training)
  ↓
Decoder Blocks (causal) + Cross-Attention to encoder
  ↓
Output: French translation
```

**Use case:** Translation, summarization

## Scaling Laws: The Power of Scale

One of the most remarkable discoveries in modern AI is the predictability of scaling. Performance improves as a power law with model size, dataset size, and compute budget.

**The Chinchilla Scaling Law** (Hoffmann et al., 2022) found that for compute-optimal training:

$$N_{\text{params}} \propto C^{0.5} \quad \text{and} \quad D_{\text{tokens}} \propto C^{0.5}$$

where $C$ is compute budget (in FLOPs), $N_{\text{params}}$ is model size, and $D_{\text{tokens}}$ is dataset size.
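To make these proportionalities concrete, here is a minimal sketch using two common rules of thumb from the scaling-law literature: training compute $C \approx 6ND$ and the Chinchilla-optimal ratio $D \approx 20N$ (both are approximations, not exact constants from this guide):

```python
import math

def chinchilla_optimal(compute_flops):
    # Rules of thumb (approximate): C ≈ 6·N·D and D ≈ 20·N
    # Substituting D = 20N into C = 6ND gives N = sqrt(C / 120)
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: a ~10^24 FLOP training budget
n, d = chinchilla_optimal(1e24)
print(f"Model: ~{n/1e9:.0f}B params, data: ~{d/1e12:.1f}T tokens")
# → roughly 91B params and 1.8T tokens; the same D ≈ 20N ratio
#   gives the "70B model → ~1.4T tokens" figure mentioned below
```

Note how both $N$ and $D$ grow as $C^{0.5}$, exactly the proportionality above.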
**Key insight**: Most models are undertrained! To optimally use compute, scale data and parameters equally. For a 70B parameter model, you should train on ~1.4 trillion tokens (not the ~300B GPT-3 used).

**Loss scaling**: Test loss decreases predictably:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha}$$

where $\alpha \approx 0.076$ for language modeling. This means doubling model size gives a ~5% reduction in loss.

<TechSpecs
  specs={[
    { label: 'GPT-2', value: '1.5B params, 48 layers', iconName: 'cube' },
    { label: 'GPT-3', value: '175B params, 96 layers', iconName: 'database' },
    { label: 'GPT-4', value: '~1.7T params (rumored), 120+ layers', iconName: 'trending-up' },
    { label: 'LLaMA 2 70B', value: '70B params, 80 layers', iconName: 'activity' },
  ]}
/>

**Scaling trends:**

- Loss decreases predictably with size
- Compute optimal: train bigger models on more data
- Emergence: capabilities appear at certain scales

## Memory & Compute

For inference on a 70B model:

<CodeBlock language="python" filename="memory_calculation.py" code={`def calculate_inference_memory(
    num_params=70e9,  # 70 billion
    precision="fp16",
    batch_size=1,
    seq_len=2048,
    num_layers=80,
    d_model=8192
):
    bytes_per_param = 2 if precision == "fp16" else 4

    # Model weights
    model_memory = num_params * bytes_per_param / (1024**3)

    # KV cache: store keys and values for all layers
    kv_cache = (
        2 *  # K and V
        batch_size * num_layers * seq_len * d_model * bytes_per_param
    ) / (1024**3)

    # Activations (rough estimate with a 10x multiplier)
    activation_memory = (
        batch_size * seq_len * d_model * bytes_per_param
    ) / (1024**3) * 10

    total = model_memory + kv_cache + activation_memory
    print(f"Model weights: {model_memory:.1f} GB")
    print(f"KV cache: {kv_cache:.1f} GB")
    print(f"Activations: {activation_memory:.1f} GB")
    print(f"Total: {total:.1f} GB")
    return total

calculate_inference_memory()
# Output:
# Model weights: 130.4 GB
# KV cache: 5.0 GB
# Activations: 0.3 GB
# Total: 135.7 GB

# This is why you need multiple A100 80GB GPUs!`} />

## Training Process

```
1. Pre-training (unsupervised)
   ├─ Objective: Predict next token
   ├─ Data: Massive text corpus (trillions of tokens)
   └─ Cost: $100M+ for GPT-4 scale

2. Supervised Fine-Tuning (SFT)
   ├─ Objective: Follow instructions
   ├─ Data: High-quality examples (10K-100K)
   └─ Cost: $10K-$100K
3. RLHF (Reinforcement Learning)
   ├─ Objective: Align with human preferences
   ├─ Data: Human feedback on outputs
   └─ Cost: $100K-$1M
```

---

## Resources & Further Reading

### 📄 Foundational Papers

<div class="resources-grid">

**The Original Paper**
- <a href="https://arxiv.org/abs/1706.03762" target="_blank" rel="noopener noreferrer">Attention is All You Need</a> (Vaswani et al., 2017)
- The paper that started it all - introduces the transformer architecture
- Must-read for understanding modern LLMs

**Scaling Laws**
- <a href="https://arxiv.org/abs/2001.08361" target="_blank" rel="noopener noreferrer">Scaling Laws for Neural Language Models</a> (Kaplan et al., 2020) - How performance scales with model size, data, and compute
- <a href="https://arxiv.org/abs/2203.15556" target="_blank" rel="noopener noreferrer">Training Compute-Optimal LLMs</a> (Hoffmann et al., 2022) - Chinchilla paper showing most models are undertrained

**Architecture Improvements**
- <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf" target="_blank" rel="noopener noreferrer">GPT-2: Language Models are Unsupervised Multitask Learners</a>
- <a href="https://arxiv.org/abs/2005.14165" target="_blank" rel="noopener noreferrer">GPT-3: Language Models are Few-Shot Learners</a>
- <a href="https://arxiv.org/abs/2302.13971" target="_blank" rel="noopener noreferrer">LLaMA: Open and Efficient Foundation Models</a>

</div>

### 🔧 Implementation Resources

- **The Annotated Transformer**: <a href="http://nlp.seas.harvard.edu/annotated-transformer/" target="_blank" rel="noopener noreferrer">Harvard NLP's line-by-line guide</a>
- **Hugging Face Transformers**: <a href="https://github.com/huggingface/transformers" target="_blank" rel="noopener noreferrer">Production-ready implementations</a>
- **nanoGPT**: <a href="https://github.com/karpathy/nanoGPT" target="_blank" rel="noopener noreferrer">Minimal GPT implementation by Karpathy</a>
- **PyTorch Transformer Tutorial**: [Official PyTorch guide](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)

### 📚 Related Technical Guides

- [Attention Mechanisms Explained →](/technical/attention-mechanisms) - Deep dive into self-attention
- [Position Encodings →](/technical/position-encodings) - How transformers understand order
- [Long-Context Architecture →](/technical/long-context-architecture) - Extending to longer sequences
- [Training Large Language Models →](/technical/llm-training) - How to train transformers at scale
- [Inference Optimization →](/technical/inference-optimization) - Making transformers faster
- [KV Cache Optimization →](/technical/kv-cache-optimization) - Memory management for generation

### 🎓 Educational Content

**Video Lectures**
- [Stanford CS324: Large Language Models](https://stanford-cs324.github.io/winter2022/)
- [Andrej Karpathy: Let's build GPT](https://www.youtube.com/watch?v=kCc8FmEb1nY)
- [3Blue1Brown: Transformers Visualized](https://www.youtube.com/watch?v=wjZofJX0v4M)

**Interactive Tools**
- [Transformer Explainer](https://poloclub.github.io/transformer-explainer/)
- [LLM Visualization](https://bbycroft.net/llm)
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)

### 🔬 Advanced Topics

- [FlashAttention](https://arxiv.org/abs/2205.14135) - IO-aware exact attention
- [Mixture of Experts](https://arxiv.org/abs/2101.03961) - Scaling with sparsity
- [Grouped Query Attention](https://arxiv.org/abs/2305.13245) - Efficient multi-head attention

---

*Last updated: August 2025*