
Position Encodings in Transformers Explained

How transformers understand word order - from sinusoidal to RoPE and ALiBi

AI Tools Reviews Technical Team
January 24, 2024
LLM technical transformers position-encoding

Position Encodings Explained

Transformers have no inherent sense of order. Position encodings teach them that “the” at position 1 is different from “the” at position 5.

The attention mechanism is permutation-equivariant: shuffling input tokens produces shuffled outputs, but the model can’t detect the shuffle! This is catastrophic for language, where “dog bites man” ≠ “man bites dog”. Position encodings inject order information, transforming the model into a sequence processor.

The evolution from sinusoidal encodings (2017) to RoPE (2021) to ALiBi (2022) represents a fascinating arc of increasingly elegant solutions to the same problem: how do we tell a model where each token is, without sacrificing the parallelism that makes transformers fast?

The Problem

Without position information:

Input: "The cat ate the mouse"
Input: "The mouse ate the cat"

Without positions → Same representation!
Model can't distinguish word order

Attention is permutation-equivariant - with no position signal, it treats the input tokens as an unordered set, so both sentences above produce the same set of token representations.
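
A minimal numpy sketch makes this concrete (the file name and the simplified Q = K = V attention are illustrative assumptions, not from a specific library): permuting the input rows of a bare attention layer just permutes its output rows, with no trace of the original order.

permutation_demo.py
import numpy as np

def softmax(x, axis=-1):
  e = np.exp(x - x.max(axis=axis, keepdims=True))
  return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
  # Simplified attention: Q = K = V = X, no learned projections
  scores = X @ X.T / np.sqrt(X.shape[-1])
  return softmax(scores) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 tokens, 8 dims
perm = rng.permutation(5)

# Permuting the inputs just permutes the outputs
assert np.allclose(self_attention(X)[perm], self_attention(X[perm]))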

Solution 1: Absolute Sinusoidal Encoding (Original Transformer)

The original “Attention is All You Need” paper proposed adding fixed sinusoidal functions to embeddings:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

where $pos$ is the position and $i$ is the dimension index. Each dimension oscillates at a different frequency, creating a unique “fingerprint” for each position.

The beautiful property: due to trigonometric identities, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$:

$$PE_{pos+k} = f(PE_{pos}, k)$$

This means the model can learn to attend to relative positions (“3 tokens back”) even though we only provide absolute positions.

Proof of the key property: consider the dot product between the position encodings at positions $p$ and $p+k$:

$$PE_p \cdot PE_{p+k} = \sum_{i=0}^{d/2-1} \left[\sin\!\left(\frac{p}{\omega_i}\right)\sin\!\left(\frac{p+k}{\omega_i}\right) + \cos\!\left(\frac{p}{\omega_i}\right)\cos\!\left(\frac{p+k}{\omega_i}\right)\right]$$

where $\omega_i = 10000^{2i/d_{\text{model}}}$.

Using the trigonometric identity $\cos(a-b) = \cos(a)\cos(b) + \sin(a)\sin(b)$, each bracketed term collapses to a cosine of the difference:

$$PE_p \cdot PE_{p+k} = \sum_{i=0}^{d/2-1} \cos\!\left(\frac{k}{\omega_i}\right)$$

Notice this depends only on $k$ (the relative distance), not on the absolute position $p$! This is the magic that allows the model to learn relative position patterns. When computing attention scores $Q_p \cdot K_{p+k}$, the position-encoding contribution naturally captures relative distance.

sinusoidal_encoding.py
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
  """
  Original position encoding from 'Attention is All You Need'
  
  Uses sin/cos with different frequencies for each dimension
  """
  position = np.arange(seq_len)[:, np.newaxis]
  div_term = np.exp(
      np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model)
  )
  
  pe = np.zeros((seq_len, d_model))
  
  # Even dimensions: sine
  pe[:, 0::2] = np.sin(position * div_term)
  # Odd dimensions: cosine
  pe[:, 1::2] = np.cos(position * div_term)
  
  return pe

# Sample values (d_model = 8) for selected positions, first 8 dimensions:
# Pos  Dim0   Dim1   Dim2   Dim3   Dim4   Dim5   Dim6   Dim7
# 0    0.00   1.00   0.00   1.00   0.00   1.00   0.00   1.00
# 1    0.84   0.54   0.10   1.00   0.01   1.00   0.00   1.00
# 5   -0.96   0.28   0.48   0.88   0.05   1.00   0.01   1.00
# 10  -0.54  -0.84   0.84   0.54   0.10   0.99   0.01   1.00

# Pattern: Low dimensions change fast, high dimensions change slow
# This encodes position at multiple scales
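
A quick numerical sanity check of the property proved earlier, reusing sinusoidal_position_encoding from the block above (the check itself is a sketch, not from the original paper):

# Sanity check: PE_p · PE_{p+k} depends only on the offset k
pe = sinusoidal_position_encoding(seq_len=128, d_model=64)

k = 7
dots = [pe[p] @ pe[p + k] for p in range(0, 100, 10)]
print(np.round(dots, 6))  # the same value at every absolute position p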

Why sin/cos?

  1. Continuous function: Smooth transitions between positions
  2. Bounded: Values stay in [-1, 1]
  3. Deterministic: Same position always gets same encoding
  4. Extrapolation: Can handle longer sequences than training
using_sinusoidal.py
def transformer_with_positional_encoding(tokens):
  # embedding_layer, transformer_layers, and d_model are assumed
  # to be defined elsewhere; this sketch shows only the wiring
  token_embs = embedding_layer(tokens)  # [seq_len, d_model]
  
  # Get position encodings
  seq_len = len(tokens)
  pos_encoding = sinusoidal_position_encoding(seq_len, d_model)
  
  # Add them together!
  input_embs = token_embs + pos_encoding
  
  # Now transformer can distinguish positions
  output = transformer_layers(input_embs)
  
  return output

# Example:
# Token "the" at position 0: [0.2, 0.5, ...] + [0.0, 1.0, ...] = [0.2, 1.5, ...]
# Token "the" at position 3: [0.2, 0.5, ...] + [0.9, 0.4, ...] = [1.1, 0.9, ...]
# Different positions → Different representations!

Solution 2: Learned Position Embeddings (BERT, GPT-2)

Learn position embeddings like token embeddings:

learned_positions.py
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
  def __init__(self, max_seq_len=512, d_model=768):
      super().__init__()
      # Just a lookup table!
      # Position 0 → vector, Position 1 → different vector, etc.
      self.pos_embeddings = nn.Embedding(max_seq_len, d_model)
  
  def forward(self, seq_len):
      positions = torch.arange(seq_len)
      return self.pos_embeddings(positions)

# BERT uses this
# GPT-2 uses this

# Pros:
# - Simple
# - Model learns best encoding for the task

# Cons:
# - Fixed max length (can't extrapolate)
# - Uses more parameters
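
The fixed-max-length failure mode is easy to demonstrate with the class above (a minimal sketch; the IndexError is what PyTorch's nn.Embedding raises on an out-of-range index):

pos_emb = LearnedPositionalEmbedding(max_seq_len=512)

pos_emb.forward(512)   # fine: positions 0..511 all have learned rows
pos_emb.forward(513)   # IndexError - position 512 was never learned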

Solution 3: RoPE (Rotary Position Embedding)

Used in: LLaMA, GPT-NeoX, PaLM

RoPE (Su et al., 2021) is elegant: instead of adding position to embeddings, rotate the query and key vectors by an angle proportional to their position.

For a 2D subspace, the rotation matrix for position $m$ is:

$$R_m = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}$$

Key insight: the dot product between rotated vectors at positions $m$ and $n$:

$$q_m^T k_n = (R_m q)^T (R_n k) = q^T R_m^T R_n k = q^T R_{n-m} k$$

The result depends only on the relative distance $(n-m)$! The rotation difference naturally encodes relative position.
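
This identity is easy to verify numerically in a single 2D subspace (a small sketch; the vectors q, k and the angle theta are arbitrary choices):

import numpy as np

def rot(theta):
  return np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])

q, k, theta = np.array([0.3, -1.2]), np.array([0.7, 0.5]), 0.1

# Same offset n - m = 4 at three different absolute positions
for m, n in [(0, 4), (10, 14), (50, 54)]:
  print((rot(m * theta) @ q) @ (rot(n * theta) @ k))  # identical values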

Why this works better:

  1. Relative positions emerge naturally from rotation algebra
  2. Extrapolation: Training on length $L$ can carry over to lengths $>L$ because the rotations extend smoothly
  3. No parameters: Unlike learned embeddings, RoPE is deterministic
rope.py
import math
import torch
import torch.nn.functional as F

def rotate_half(x):
  """
  Split the last dimension in half and rotate: (x1, x2) → (-x2, x1)
  """
  x1, x2 = x.chunk(2, dim=-1)
  return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin):
  """
  Apply rotary position embedding to Q and K
  
  Key insight: Rotate Q and K based on their positions
  Relative position naturally emerges from rotation difference
  """
  # Apply rotation matrix (expressed via element-wise ops)
  q_embed = (q * cos) + (rotate_half(q) * sin)
  k_embed = (k * cos) + (rotate_half(k) * sin)
  
  return q_embed, k_embed

class RoPEAttention:
  def __init__(self, d_model=512, max_seq_len=2048):
      self.d_model = d_model
      
      # Projection weights (random here; learned in a real model)
      self.W_q = torch.randn(d_model, d_model) / math.sqrt(d_model)
      self.W_k = torch.randn(d_model, d_model) / math.sqrt(d_model)
      self.W_v = torch.randn(d_model, d_model) / math.sqrt(d_model)
      
      # Precompute rotation angles
      position = torch.arange(max_seq_len).unsqueeze(1)
      div_term = torch.exp(
          torch.arange(0, d_model, 2) * 
          -(math.log(10000.0) / d_model)
      )
      
      # Rotation angles per position and frequency: [seq, d/2]
      angles = position * div_term
      # Duplicate so cos/sin cover the full dimension,
      # matching the half-split done by rotate_half: [seq, d]
      emb = torch.cat((angles, angles), dim=-1)
      self.cos = torch.cos(emb)
      self.sin = torch.sin(emb)
  
  def forward(self, x):
      # x: [seq_len, d_model]
      seq_len = x.shape[0]
      
      # Generate Q, K, V
      Q = x @ self.W_q
      K = x @ self.W_k
      V = x @ self.W_v
      
      # Apply rotary embeddings to Q and K only (V is untouched)
      Q_rot, K_rot = apply_rotary_pos_emb(
          Q, K,
          self.cos[:seq_len],
          self.sin[:seq_len]
      )
      
      # Scaled dot-product attention with rotated Q, K
      scores = Q_rot @ K_rot.T / math.sqrt(self.d_model)
      weights = F.softmax(scores, dim=-1)
      output = weights @ V
      
      return output

# Why RoPE is better:
# 1. Naturally encodes relative positions
# 2. Better extrapolation to longer sequences
# 3. No added parameters (unlike learned embeddings)
# 4. Works better in practice (LLaMA uses this)
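
A hypothetical smoke test of the sketch above (shapes only - the weights are random, so the outputs are not meaningful):

attn = RoPEAttention(d_model=64, max_seq_len=128)
x = torch.randn(16, 64)        # 16 tokens
print(attn.forward(x).shape)   # torch.Size([16, 64])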

RoPE Visualization:

Position 0 and Position 5:
├─ Rotation difference = 5 steps
├─ This creates a consistent "distance" signal
└─ Works for any position pair!

Additive sinusoidal encoding:
├─ pos_0 = [0.00, 1.00, 0.00, ...]
├─ pos_5 = [-0.96, 0.28, 0.48, ...]
└─ Relative offset is only implicit - recoverable via trig
   identities, but entangled with token content after projection

Solution 4: ALiBi (Attention with Linear Biases)

Used in: BLOOM, MPT

ALiBi (Press et al., 2022) takes a radically simple approach: don’t encode positions at all. Instead, add a distance-based bias directly to attention scores:

$$\text{score}_{ij} = q_i^T k_j - \lambda \cdot |i - j|$$

where $\lambda$ is a head-specific slope (different heads use different slopes to capture multiple scales).

Why this is brilliant:

  1. No position embeddings: Saves memory and parameters
  2. Perfect linearity: Distance bias is exactly linear, making extrapolation trivial
  3. Inductive bias: Closer tokens naturally get higher attention scores

Mathematical intuition: The bias creates a “locality preference” - tokens far apart need stronger query-key compatibility to attend to each other. This matches linguistic intuition: nearby words are usually more related.

alibi.py
import torch
import torch.nn.functional as F

def alibi_attention(Q, K, V, num_heads=8):
  """
  Add a linear bias based on distance between positions
  
  No position embeddings at all!
  Just bias the attention scores
  
  Q, K, V: [num_heads, seq_len, d_head]
  """
  seq_len = Q.shape[1]
  
  # Compute attention scores per head
  scores = Q @ K.transpose(-2, -1)  # [num_heads, seq_len, seq_len]
  
  # Create ALiBi distance matrix (bias grows with distance)
  positions = torch.arange(seq_len)
  distance = (positions.unsqueeze(1) - positions.unsqueeze(0)).abs().float()
  
  # Different slopes for different attention heads:
  # a geometric sequence 1/2, 1/4, ..., 1/2^num_heads
  slopes = torch.tensor([2.0 ** -(i + 1) for i in range(num_heads)])
  
  # Apply bias (negative, so distant tokens get lower scores)
  bias = -distance.unsqueeze(0) * slopes.unsqueeze(1).unsqueeze(2)
  
  # Add bias to scores; the rest is normal attention
  scores = scores + bias
  weights = F.softmax(scores, dim=-1)
  output = weights @ V
  
  return output

# ALiBi bias matrix (head with slope 0.5, i.e. bias = -0.5 × distance):
#      Pos0  Pos1  Pos2  Pos3  Pos4
# Pos0   0   -0.5  -1.0  -1.5  -2.0
# Pos1  -0.5   0   -0.5  -1.0  -1.5
# Pos2  -1.0 -0.5   0   -0.5  -1.0
# ...

# Benefits:
# - No position embeddings needed (saves memory)
# - Excellent extrapolation (the ALiBi paper tests well beyond training length)
# - Simple and fast
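
A hypothetical smoke test for the sketch above, with per-head inputs:

num_heads, seq_len, d_head = 8, 6, 16
Q = torch.randn(num_heads, seq_len, d_head)
K = torch.randn(num_heads, seq_len, d_head)
V = torch.randn(num_heads, seq_len, d_head)
print(alibi_attention(Q, K, V, num_heads=num_heads).shape)
# torch.Size([8, 6, 16])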

Comparison

Sinusoidal (original Transformer):
├─ Pros: Simple, deterministic, extrapolates in principle
├─ Cons: Often underperforms learned embeddings at trained lengths
└─ Used in: Original Transformer (GPT-3, like GPT-2, uses learned embeddings)

Learned (BERT, GPT-2):
├─ Pros: Task-specific, performs well
├─ Cons: Fixed max length, more parameters
└─ Used in: BERT, GPT-2, early models

RoPE (LLaMA):
├─ Pros: Strong performance, extrapolates well (especially with scaling)
├─ Cons: Slightly more complex
└─ Used in: LLaMA, GPT-NeoX, PaLM, most modern LLMs

ALiBi (BLOOM):
├─ Pros: Best extrapolation, no embeddings
├─ Cons: Less common, newer technique
└─ Used in: BLOOM, MPT, some research models

Extrapolation Test

How well models handle sequences longer than training:

extrapolation_test.py
# Train on 2K tokens, evaluate at 2K, 4K, and 8K
# (illustrative perplexities showing the typical trend, not one paper's numbers)

results = {
  'Sinusoidal': {
      '2K (train)': 2.1,  # Perplexity
      '4K (2× train)': 3.8,
      '8K (4× train)': 7.2,
  },
  'Learned': {
      '2K (train)': 2.0,
      '4K (2× train)': 15.3,  # Breaks down!
      '8K (4× train)': 48.7,  # Completely fails
  },
  'RoPE': {
      '2K (train)': 2.0,
      '4K (2× train)': 2.3,  # Good!
      '8K (4× train)': 3.1,  # Still works
  },
  'ALiBi': {
      '2K (train)': 2.0,
      '4K (2× train)': 2.1,  # Best!
      '8K (4× train)': 2.4,  # Excellent
  }
}

# Learned embeddings fail hard
# RoPE and ALiBi handle longer sequences well

Modern Best Practices

For new models:

  1. Use RoPE if you want proven performance (LLaMA 2 approach)
  2. Use ALiBi if you need extreme extrapolation (BLOOM approach)
  3. Avoid learned unless you have fixed max length (BERT legacy)
  4. Avoid sinusoidal unless you want simplicity (outdated)

For extending context:

# Example: Extend LLaMA from 4K to 32K context
# (rope() and rope_with_adjusted_freq() below are placeholders
# for whatever RoPE implementation you use)

# Method 1: RoPE Scaling (linear / position interpolation)
def scaled_rope(position, scale=8.0):
    # Divide position by the scale factor, squashing 32K
    # positions into the trained 4K range
    return rope(position / scale)

# Method 2: RoPE NTK-Aware Scaling
def ntk_scaled_rope(position, scale=8.0):
    # Adjust the frequency base instead of the position -
    # better preservation of local patterns
    return rope_with_adjusted_freq(position, scale)

# Both methods allow a 4K model to handle 32K tokens
# (typically with a brief fine-tune at the longer length)
# GPT-4 likely uses similar techniques
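
To make the two methods concrete, here is a minimal numpy sketch of how the underlying RoPE angles change under each scheme. The NTK-aware base adjustment base × scale^(d/(d−2)) is one commonly used variant, assumed here rather than taken from a single canonical source:

import numpy as np

def rope_inv_freq(d_model, base=10000.0):
  return 1.0 / base ** (np.arange(0, d_model, 2) / d_model)

def linear_scaled_angles(positions, d_model, scale=8.0):
  # Position interpolation: squash positions into the trained range
  return np.outer(positions / scale, rope_inv_freq(d_model))

def ntk_scaled_angles(positions, d_model, scale=8.0, base=10000.0):
  # Enlarge the base so low frequencies stretch while
  # high frequencies (local detail) barely move
  new_base = base * scale ** (d_model / (d_model - 2))
  return np.outer(positions, rope_inv_freq(d_model, base=new_base))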

Resources & Further Reading

📄 Foundational Papers

Sinusoidal Encodings

  • Attention is All You Need (Vaswani et al., 2017)
    • Original transformer paper introducing sinusoidal position encodings
    • Section 3.5 covers the mathematical foundation

RoPE (Rotary Position Embeddings)

  • RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
    • Introduces rotations of queries and keys so relative position emerges in attention scores

ALiBi

  • Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (Press et al., 2022)
    • Introduces the linear distance bias and demonstrates extrapolation beyond training length



Last updated: October 2025