Position Encodings in Transformers Explained
How transformers understand word order - from sinusoidal to RoPE and ALiBi
Transformers have no inherent sense of order. Position encodings teach them that “the” at position 1 is different from “the” at position 5.
The attention mechanism is permutation-equivariant: shuffling input tokens produces shuffled outputs, but the model can’t detect the shuffle! This is catastrophic for language, where “dog bites man” ≠ “man bites dog”. Position encodings inject order information, transforming the model into a sequence processor.
The evolution from sinusoidal encodings (2017) to RoPE (2021) to ALiBi (2022) represents a fascinating arc of increasingly elegant solutions to the same problem: how do we tell a model where each token is, without sacrificing the parallelism that makes transformers fast?
The Problem
Without position information:
Input: "The cat ate the mouse"
Input: "The mouse ate the cat"
Without positions → Same representation!
Model can't distinguish word order
Attention treats tokens as an unordered set: permuting the inputs simply permutes the outputs, so word order carries no information.
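A minimal NumPy sketch of this (a toy single-head attention with no learned projections, purely for illustration): permuting the input rows just permutes the output rows, leaving no trace of the original order.

import numpy as np

def self_attention(X):
    # Toy single-head attention without learned projections
    scores = X @ X.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ X

X = np.random.randn(5, 8)        # 5 tokens, 8-dim embeddings
perm = np.random.permutation(5)  # shuffle the token order

out = self_attention(X)
out_perm = self_attention(X[perm])

# The shuffled input produces exactly the shuffled output
print(np.allclose(out[perm], out_perm))  # True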
Solution 1: Absolute Sinusoidal Encoding (Original Transformer)
The original “Attention is All You Need” paper proposed adding fixed sinusoidal functions to embeddings:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the position and $i$ is the dimension index. Each dimension oscillates at a different frequency, creating a unique “fingerprint” for each position.

The beautiful property: Due to trigonometric identities, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$:

$$\begin{pmatrix} PE_{(pos+k,\,2i)} \\ PE_{(pos+k,\,2i+1)} \end{pmatrix} = \begin{pmatrix} \cos(k\omega_i) & \sin(k\omega_i) \\ -\sin(k\omega_i) & \cos(k\omega_i) \end{pmatrix} \begin{pmatrix} PE_{(pos,\,2i)} \\ PE_{(pos,\,2i+1)} \end{pmatrix}, \qquad \omega_i = 10000^{-2i/d_{\text{model}}}$$
This means the model can learn to attend to relative positions (“3 tokens back”) even though we only provide absolute positions.
Proof of the key property: Consider the dot product between position encodings at positions $pos$ and $pos+k$:

$$PE_{pos} \cdot PE_{pos+k} = \sum_{i} \big[ \sin(\omega_i\, pos)\sin(\omega_i (pos+k)) + \cos(\omega_i\, pos)\cos(\omega_i (pos+k)) \big]$$

Using the trigonometric identity $\cos(a-b) = \cos a \cos b + \sin a \sin b$:

$$PE_{pos} \cdot PE_{pos+k} = \sum_{i} \cos(\omega_i\, k)$$

Notice this depends only on $k$ (the relative distance), not on the absolute position $pos$! This is the magic that allows the model to learn relative position patterns. When computing attention scores $q^\top k$, the position encoding contribution naturally captures relative distance.
import numpy as np
def sinusoidal_position_encoding(seq_len, d_model):
    """
    Original position encoding from 'Attention is All You Need'
    Uses sin/cos with different frequencies for each dimension
    """
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(
        np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model)
    )
    pe = np.zeros((seq_len, d_model))
    # Even dimensions: sine
    pe[:, 0::2] = np.sin(position * div_term)
    # Odd dimensions: cosine
    pe[:, 1::2] = np.cos(position * div_term)
    return pe
# Visualization for position 0-100, first 8 dimensions:
# Pos Dim0 Dim1 Dim2 Dim3 Dim4 Dim5 Dim6 Dim7
# 0 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00
# 1 0.84 0.54 0.10 1.00 0.01 1.00 0.00 1.00
# 5 -0.96 0.28 0.48 0.88 0.05 1.00 0.01 1.00
# 10 -0.54 -0.84 0.84 0.54 0.10 0.99 0.01 1.00
# Pattern: Low dimensions change fast, high dimensions change slow
# This encodes position at multiple scales
Why sin/cos?
- Continuous function: Smooth transitions between positions
- Bounded: Values stay in [-1, 1]
- Deterministic: Same position always gets same encoding
- Extrapolation: Can handle longer sequences than training
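A quick numeric check of the relative-position property derived above, reusing the sinusoidal_position_encoding function from the previous snippet (the particular seq_len, d_model, and offset values here are arbitrary):

pe = sinusoidal_position_encoding(seq_len=128, d_model=64)

# Dot product between encodings k=4 apart, measured at different absolute positions
for pos in [0, 10, 50, 100]:
    print(pos, round(float(pe[pos] @ pe[pos + 4]), 4))

# All four values are (nearly) identical: the dot product depends only on the
# offset k, not on the absolute position - matching the proof above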
def transformer_with_positional_encoding(tokens):
    # Get token embeddings
    token_embs = embedding_layer(tokens)  # [seq_len, d_model]

    # Get position encodings
    seq_len, d_model = token_embs.shape
    pos_encoding = sinusoidal_position_encoding(seq_len, d_model)

    # Add them together!
    input_embs = token_embs + pos_encoding

    # Now transformer can distinguish positions
    output = transformer_layers(input_embs)
    return output
# Example:
# Token "the" at position 0: [0.2, 0.5, ...] + [0.0, 1.0, ...] = [0.2, 1.5, ...]
# Token "the" at position 3: [0.2, 0.5, ...] + [0.9, 0.4, ...] = [1.1, 0.9, ...]
# Different positions → Different representations!
Solution 2: Learned Position Embeddings (BERT, GPT-2)
Learn position embeddings like token embeddings:
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_seq_len=512, d_model=768):
        super().__init__()
        # Just a lookup table!
        # Position 0 → vector, Position 1 → different vector, etc.
        self.pos_embeddings = nn.Embedding(max_seq_len, d_model)

    def forward(self, seq_len):
        positions = torch.arange(seq_len)
        return self.pos_embeddings(positions)
# BERT uses this
# GPT-2 uses this
# Pros:
# - Simple
# - Model learns best encoding for the task
# Cons:
# - Fixed max length (can't extrapolate)
# - Uses more parameters
Solution 3: RoPE (Rotary Position Embedding)
Used in: LLaMA, GPT-NeoX, PaLM
RoPE (Su et al., 2021) is elegant: instead of adding position to embeddings, rotate the query and key vectors by an angle proportional to their position.
For a 2D subspace, the rotation matrix for position $m$ is:

$$R_m = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}$$

Key insight: The dot product between rotated vectors at positions $m$ and $n$ is

$$(R_m q)^\top (R_n k) = q^\top R_{n-m}\, k$$

The result depends only on the relative distance $n - m$! The rotation difference naturally encodes relative position.
Why this works better:
- Relative positions emerge naturally from rotation algebra
- Extrapolation: Models trained on shorter sequences can handle longer ones because rotations extend smoothly to unseen positions
- No parameters: Unlike learned embeddings, RoPE is deterministic
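A tiny sanity check of the identity above (plain NumPy, toy 2-D vectors and an arbitrary per-step angle): rotating q and k by their own positions leaves a score that depends only on the offset.

import numpy as np

def rotate(v, angle):
    # 2D rotation matrix R_m applied to a vector
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

q = np.array([0.3, -1.2])
k = np.array([0.7, 0.5])
theta = 0.1  # rotation per position step

# Same offset (n - m = 3) at different absolute positions
for m in [0, 10, 100]:
    n = m + 3
    print(m, round(float(rotate(q, m * theta) @ rotate(k, n * theta)), 6))

# Every line prints the same score: it depends only on n - m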
import math
import torch

def rotate_half(x):
    """
    Split the last dimension in half and rotate: (x1, x2) → (-x2, x1)
    """
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
    """
    Apply rotary position embedding to Q and K

    Key insight: Rotate Q and K based on their positions
    Relative position naturally emerges from rotation difference
    """
    # Apply rotation matrix (elementwise form of the 2D rotations)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
class RoPEAttention:
    def __init__(self, d_model=512, max_seq_len=2048):
        self.d_model = d_model
        # Projection matrices for Q, K, V (randomly initialised here for illustration)
        self.W_q = torch.randn(d_model, d_model) / math.sqrt(d_model)
        self.W_k = torch.randn(d_model, d_model) / math.sqrt(d_model)
        self.W_v = torch.randn(d_model, d_model) / math.sqrt(d_model)
        # Precompute rotation angles
        position = torch.arange(max_seq_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) *
            -(math.log(10000.0) / d_model)
        )
        # Rotation angles for each position and dimension pair,
        # duplicated so they cover the full d_model width
        angles = position * div_term                   # [max_seq_len, d_model/2]
        angles = torch.cat((angles, angles), dim=-1)   # [max_seq_len, d_model]
        self.cos = torch.cos(angles)
        self.sin = torch.sin(angles)

    def forward(self, x):
        # x: [seq_len, d_model]
        seq_len = x.shape[0]
        # Generate Q, K, V
        Q = x @ self.W_q
        K = x @ self.W_k
        V = x @ self.W_v
        # Apply rotary embeddings to Q and K only
        Q_rot, K_rot = apply_rotary_pos_emb(
            Q, K,
            self.cos[:seq_len],
            self.sin[:seq_len]
        )
        # Attention with rotated Q, K
        scores = Q_rot @ K_rot.T / math.sqrt(self.d_model)
        weights = torch.softmax(scores, dim=-1)
        output = weights @ V
        return output
# Why RoPE is better:
# 1. Naturally encodes relative positions
# 2. Better extrapolation to longer sequences
# 3. No added parameters (unlike learned embeddings)
# 4. Works better in practice (LLaMA uses this)
RoPE Visualization:
Position 0 and Position 5:
├─ Rotation difference = 5 steps
├─ This creates a consistent "distance" signal
└─ Works for any position pair!
Traditional encoding:
├─ pos_0 = [0.0, 1.0, 0.0, ...]
├─ pos_5 = [0.96, -0.28, 0.48, ...]
└─ Relative position signal is only implicit (the model must learn to extract it)
Solution 4: ALiBi (Attention with Linear Biases)
Used in: BLOOM, MPT
ALiBi (Press et al., 2022) takes a radically simple approach: don’t encode positions at all. Instead, add a distance-based bias directly to attention scores:

$$\text{score}(q_i, k_j) = q_i \cdot k_j - m \cdot |i - j|$$

where $m$ is a head-specific slope (different heads use different slopes to capture multiple scales).
Why this is brilliant:
- No position embeddings: Saves memory and parameters
- Perfect linearity: Distance bias is exactly linear, making extrapolation trivial
- Inductive bias: Closer tokens naturally get higher attention scores
Mathematical intuition: The bias creates a “locality preference” - tokens far apart need stronger query-key compatibility to attend to each other. This matches linguistic intuition: nearby words are usually more related.
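As a toy illustration of that locality preference (made-up values, not from the ALiBi paper): with identical content scores, the bias alone concentrates attention on nearby tokens, and larger-slope heads focus more tightly.

import torch

seq_len = 6
positions = torch.arange(seq_len)
distance = (positions.unsqueeze(1) - positions.unsqueeze(0)).abs()

# Pretend every query-key pair has the same content score (zero),
# so only the ALiBi bias shapes the attention pattern
for slope in [0.0625, 0.5]:
    weights = torch.softmax(-slope * distance.float(), dim=-1)
    print(f"slope={slope}:")
    print(weights[3])  # attention distribution of the token at position 3

# The small-slope head spreads attention broadly; the large-slope head
# focuses on immediate neighbours - different heads cover different scales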
import torch

def alibi_attention(Q, K, V, num_heads=8):
    """
    Add a linear bias based on the distance between positions.

    No position embeddings at all - just bias the attention scores.
    Q, K, V: [num_heads, seq_len, d_head]
    """
    seq_len = Q.shape[1]
    # Compute attention scores per head
    scores = Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5  # [num_heads, seq_len, seq_len]
    # Create ALiBi bias matrix
    # Bias magnitude increases with distance
    positions = torch.arange(seq_len)
    distance = (positions.unsqueeze(1) - positions.unsqueeze(0)).abs()
    # Different slopes for different attention heads (geometric sequence 1/2, 1/4, ...)
    slopes = torch.tensor([2 ** (-(i + 1)) for i in range(num_heads)])
    # Apply bias (negative, so distant tokens get lower scores)
    bias = -distance.unsqueeze(0) * slopes.view(num_heads, 1, 1)
    # Add bias to scores
    scores = scores + bias
    # Rest is normal attention
    weights = torch.softmax(scores, dim=-1)
    output = weights @ V
    return output
# ALiBi bias matrix (head with slope m = 0.5):
# Pos0 Pos1 Pos2 Pos3 Pos4
# Pos0 0 -0.5 -1.0 -1.5 -2.0
# Pos1 -0.5 0 -0.5 -1.0 -1.5
# Pos2 -1.0 -0.5 0 -0.5 -1.0
# ...
# Benefits:
# - No position embeddings needed (saves memory)
# - Excellent extrapolation (works well beyond the training length)
# - Simple and fast
Comparison
Sinusoidal (original Transformer):
├─ Pros: Simple, deterministic, no parameters, extrapolates in principle
├─ Cons: Not as good as learned embeddings in practice
└─ Used in: Original Transformer
Learned (BERT, GPT-2):
├─ Pros: Task-specific, performs well
├─ Cons: Fixed max length, more parameters
└─ Used in: BERT, GPT-2, GPT-3, early models
RoPE (LLaMA):
├─ Pros: Best performance, great extrapolation
├─ Cons: Slightly more complex
└─ Used in: LLaMA, GPT-NeoX, PaLM, most modern LLMs
ALiBi (BLOOM):
├─ Pros: Best extrapolation, no embeddings
├─ Cons: Less common, newer technique
└─ Used in: BLOOM, MPT, some research models
Extrapolation Test
How well models handle sequences longer than training:
# Train on 2K tokens, test on longer sequences
results = {
    'Sinusoidal': {
        '2K (train)': 2.1,      # Perplexity
        '4K (2× train)': 3.8,
        '8K (4× train)': 7.2,
    },
    'Learned': {
        '2K (train)': 2.0,
        '4K (2× train)': 15.3,  # Breaks down!
        '8K (4× train)': 48.7,  # Completely fails
    },
    'RoPE': {
        '2K (train)': 2.0,
        '4K (2× train)': 2.3,   # Good!
        '8K (4× train)': 3.1,   # Still works
    },
    'ALiBi': {
        '2K (train)': 2.0,
        '4K (2× train)': 2.1,   # Best!
        '8K (4× train)': 2.4,   # Excellent
    },
}
# Learned embeddings fail hard
# RoPE and ALiBi handle longer sequences well
Modern Best Practices
For new models:
- Use RoPE if you want proven performance (LLaMA 2 approach)
- Use ALiBi if you need extreme extrapolation (BLOOM approach)
- Avoid learned unless you have fixed max length (BERT legacy)
- Avoid sinusoidal unless you want simplicity (outdated)
For extending context:
# Example: Extend LLaMA from 4K to 32K context
# Method 1: RoPE Scaling (linear position interpolation)
def scaled_rope(position, scale=8.0):
    # Divide position by scale factor so long contexts map
    # back into the position range seen during training
    return rope(position / scale)

# Method 2: RoPE NTK-Aware Scaling
def ntk_scaled_rope(position, scale=8.0):
    # Adjust the rotary base frequency instead of the position
    # Better preservation of local (high-frequency) patterns
    return rope_with_adjusted_freq(position, scale)
# Both methods allow 4K model to handle 32K tokens
# GPT-4 likely uses similar techniques
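A slightly more concrete sketch of what the two methods do to the rotary frequencies (minimal NumPy, assuming the standard base of 10000 and the commonly cited NTK base adjustment base·scale^(d/(d−2)); treat the exact numbers as illustrative):

import numpy as np

d_model, base, scale = 128, 10000.0, 8.0
dims = np.arange(0, d_model, 2)

# Standard RoPE inverse frequencies: angle(pos, dim) = pos * inv_freq[dim]
inv_freq = 1.0 / base ** (dims / d_model)

# Method 1: linear scaling (position interpolation) - keep the frequencies,
# squeeze the positions: angle = (pos / scale) * inv_freq
inv_freq_linear = inv_freq / scale

# Method 2: NTK-aware scaling - keep the positions, enlarge the base so the
# slow frequencies stretch while the fast (local) ones barely change
ntk_base = base * scale ** (d_model / (d_model - 2))
inv_freq_ntk = 1.0 / ntk_base ** (dims / d_model)

# Fast dims (local detail) are nearly untouched by NTK scaling...
print(inv_freq[:2] / inv_freq_ntk[:2])    # ≈ [1.00, 1.03]
# ...while the slowest dims end up scaled by the full factor, like linear scaling
print(inv_freq[-2:] / inv_freq_ntk[-2:])  # ≈ [7.7, 8.0]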
Resources & Further Reading
📄 Foundational Papers
Sinusoidal Encodings
- Attention is All You Need (Vaswani et al., 2017)
- Original transformer paper introducing sinusoidal position encodings
- Section 3.5 covers the mathematical foundation
RoPE (Rotary Position Embeddings)
- RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
- Introduces rotation-based position encoding
- Used in LLaMA, GPT-NeoX, PaLM
ALiBi
- Train Short, Test Long: Attention with Linear Biases (Press et al., 2022)
- Linear biases for better extrapolation
- Used in BLOOM, MPT
🔧 Implementation Resources
- Hugging Face Transformers: Position encoding implementations
- LLaMA RoPE: Meta’s official implementation
- Flash Attention with RoPE: Optimized CUDA kernels
📚 Related Technical Guides
- Attention Mechanisms → Understanding the attention computation
- Transformer Architecture → Complete architecture overview
- Long-Context Architecture → Extending to longer sequences
- Inference Optimization → Making inference faster
Last updated: October 2025