
Training Large Language Models: Complete Guide

How LLMs are trained from scratch - pre-training, fine-tuning, RLHF, and everything in between

AI Tools Reviews Technical Team
January 22, 2024
LLM technical training deep-learning RLHF

Training Large Language Models

Training a model like GPT-4 or Claude costs millions of dollars and takes months. Understanding the training pipeline is essential for understanding why these models behave the way they do.

The scale is staggering: GPT-4 was trained on approximately 13 trillion tokens using an estimated 25,000 A100 GPUs for 90-120 days, consuming roughly 50 gigawatt-hours of electricity, enough to power a small city for a month. The training cost? Estimates range from $50M to $100M just for the compute.

But why does training take so long? The bottleneck is the gradient computation. For each training batch, we must:

  1. Forward pass through $L$ layers (typically 80-100 for 175B-scale models)
  2. Compute loss across $V$ vocabulary items (typically 50K-100K)
  3. Backward pass to compute $P$ parameter gradients (175 billion for GPT-3; GPT-4's count is undisclosed but believed to be larger)
  4. Update parameters with optimizer state (2-3x memory overhead)

With $O(n^2)$ attention complexity, processing a batch of 4M tokens (typical for large-scale training) requires computing 4M × 4M = 16 trillion attention scores per layer. Multiply by 96 layers, and you're looking at over a quadrillion attention scores per batch.
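As a quick sanity check on those numbers, here is a back-of-envelope calculation; the 4M-token batch and 96 layers are the illustrative figures from above, not any lab's disclosed configuration:

attention_scores_estimate.py
# Back-of-envelope count of attention scores per training batch
batch_tokens = 4_000_000   # illustrative batch size in tokens
num_layers = 96            # illustrative layer count

scores_per_layer = batch_tokens ** 2               # 1.6e13 = 16 trillion
scores_per_batch = scores_per_layer * num_layers   # ~1.5e15

print(f"{scores_per_layer:.1e} scores per layer")   # 1.6e+13
print(f"{scores_per_batch:.1e} scores per batch")   # 1.5e+15 (over a quadrillion)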

The Three-Stage Training Pipeline

Stage 1: Pre-training (Unsupervised)
├─ Goal: Learn language patterns
├─ Data: Trillions of tokens from the internet
├─ Cost: $100M+ for GPT-4 scale
└─ Duration: 2-4 months

Stage 2: Supervised Fine-Tuning (SFT)
├─ Goal: Learn to follow instructions
├─ Data: 10K-100K high-quality examples
├─ Cost: $100K-$1M
└─ Duration: 1-2 weeks

Stage 3: Reinforcement Learning (RLHF)
├─ Goal: Align with human preferences
├─ Data: Human feedback on outputs
├─ Cost: $500K-$5M
└─ Duration: 2-4 weeks

Stage 1: Pre-Training

Data Collection

Pre-training requires massive amounts of text data:

data_collection.py
class DatasetBuilder:
  def __init__(self):
      self.sources = {
          'web_crawl': 0.60,      # Common Crawl, web pages
          'books': 0.20,           # Books corpus
          'code': 0.10,            # GitHub, StackOverflow
          'academic': 0.05,        # arXiv, papers
          'wikipedia': 0.05        # Wikipedia
      }
  
  def collect_training_data(self, target_tokens=1e12):
      """
      Collect 1 trillion tokens for pre-training
      """
      dataset = []
      
      for source, proportion in self.sources.items():
          tokens_needed = int(target_tokens * proportion)
          data = self.crawl_source(source, tokens_needed)
          dataset.extend(data)
      
      # GPT-4 was trained on ~13 trillion tokens
      # Claude on ~10 trillion tokens
      return dataset
  
  def filter_quality(self, data):
      """
      Remove low-quality content
      """
      filters = [
          self.remove_duplicates,
          self.filter_gibberish,
          self.remove_toxic_content,
          self.check_language_quality,
      ]
      
      for filter_fn in filters:
          data = filter_fn(data)
      
      return data

Data Quality Issues:

  • Duplicate content: 30-50% of Common Crawl is duplicates
  • Low quality: SEO spam, auto-generated content
  • Toxic content: Hate speech, harmful instructions
  • Copyright: Books, articles (legal gray area)
  • Bias: Overrepresentation of certain viewpoints
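Deduplication is usually the first and highest-impact of these filters. Below is a minimal sketch of exact-match deduplication using content hashes; the function name is illustrative, and production pipelines layer MinHash/LSH on top to catch near-duplicates as well:

dedup_sketch.py
import hashlib

def dedup_exact(documents):
  """
  Drop exact duplicate documents by hashing normalized text.
  Near-duplicate detection (MinHash/LSH) is added on top in practice.
  """
  seen = set()
  unique_docs = []
  for doc in documents:
      key = hashlib.sha256(doc.strip().lower().encode('utf-8')).hexdigest()
      if key not in seen:
          seen.add(key)
          unique_docs.append(doc)
  return unique_docs

docs = ["The cat sat on the mat.", "the cat sat on the mat. ", "A different page."]
print(len(dedup_exact(docs)))  # 2: the two near-identical pages collapse into one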

Pre-Training Objective: Next Token Prediction

The training objective is deceptively simple yet profoundly effective: predict the next token. Given a sequence of tokens $x_1, x_2, \ldots, x_{n-1}$, the model learns to predict $x_n$.

Mathematically, we maximize the log-likelihood of the training data:

$$\mathcal{L} = \sum_{i=1}^{N} \log P(x_i \mid x_1, \ldots, x_{i-1})$$

where $N$ is the total number of tokens in the training corpus (trillions for GPT-4).

The model outputs a probability distribution over the vocabulary at each position:

$$P(x_i = w \mid x_1, \ldots, x_{i-1}) = \frac{\exp(z_w)}{\sum_{w' \in V} \exp(z_{w'})}$$

where $z_w$ is the logit for word $w$ and $V$ is the vocabulary.

Cross-entropy loss measures how well these predictions match the actual next tokens:

$$\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_1, \ldots, x_{i-1})$$

Lower loss means better predictions. GPT-4 is estimated to achieve a loss around 2.0-2.5 on held-out data, which corresponds to a perplexity of roughly $e^{2.2} \approx 9$, or about 3.2 bits of uncertainty per token. That is remarkably low considering a 100K-token vocabulary allows $\log_2(100{,}000) \approx 16.6$ bits of maximum entropy.

Training dynamics: Initially, the model predicts nearly uniformly across the vocabulary (loss $\approx \log(V) \approx 11.5$ for a 100K vocabulary). Over millions of steps, the loss gradually decreases as the model learns:

Step 0:      Loss = 11.5  (random predictions)
Step 10K:    Loss = 8.2   (learning common words)
Step 100K:   Loss = 5.1   (learning syntax)
Step 1M:     Loss = 3.4   (learning semantics)
Step 10M:    Loss = 2.2   (near-human performance)

This smooth, predictable improvement is why scaling laws work so well for LLMs.
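To make those loss numbers concrete, here is a tiny helper (not from any particular codebase) that converts a cross-entropy loss in nats into perplexity and bits per token:

loss_to_perplexity.py
import math

def describe_loss(loss_nats):
  """Convert cross-entropy loss (nats/token) into perplexity and bits/token."""
  perplexity = math.exp(loss_nats)
  bits_per_token = loss_nats / math.log(2)
  return perplexity, bits_per_token

for loss in [11.5, 8.2, 5.1, 3.4, 2.2]:
  ppl, bits = describe_loss(loss)
  print(f"loss {loss:>4.1f} -> perplexity {ppl:>9.1f}, {bits:4.1f} bits/token")

# loss 2.2 -> perplexity ~9, ~3.2 bits/token (the final row above)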

pretraining_loss.py
import torch
import torch.nn.functional as F

def compute_pretraining_loss(model, batch):
  """
  Compute next-token prediction loss
  """
  # Input:  "The cat sat on the"
  # Target: "cat sat on the mat"
  
  input_ids = batch['input_ids']  # [batch, seq_len]
  
  # Forward pass
  logits = model(input_ids)  # [batch, seq_len, vocab_size]
  
  # Shift for next-token prediction
  logits = logits[:, :-1, :]   # drop the prediction after the last token
  targets = input_ids[:, 1:]   # drop the first token
  
  # Cross-entropy loss over the flattened sequence
  vocab_size = logits.size(-1)
  loss = F.cross_entropy(
      logits.reshape(-1, vocab_size),
      targets.reshape(-1)
  )
  
  return loss

# Example training loop
for batch in dataloader:
  loss = compute_pretraining_loss(model, batch)
  loss.backward()
  optimizer.step()
  optimizer.zero_grad()

# After training on trillions of tokens:
# - Model learns grammar, facts, reasoning
# - But doesn't follow instructions well

Training Infrastructure

Technical Specifications

Model         | Hardware            | Duration
GPT-3 (175B)  | ~10,000 V100 GPUs   | ~34 days
GPT-4         | ~25,000 A100 GPUs   | 90-100 days (estimated)
LLaMA 2 70B   | ~2,000 A100 GPUs    | ~21 days

Power cost: roughly $1-2M/month at GPT-4 scale
distributed_training.py
class DistributedTrainer:
  def __init__(
      self,
      model_size='175B',
      num_gpus=10000,
      parallelism_strategy='3D'
  ):
      # 3D Parallelism: Data + Pipeline + Tensor
      self.data_parallel = 100   # Copy model across nodes
      self.pipeline_parallel = 20 # Split layers across GPUs
      self.tensor_parallel = 5    # Split matrices within layers
      
      assert num_gpus == (
          self.data_parallel * 
          self.pipeline_parallel * 
          self.tensor_parallel
      )
  
  def split_model_layers(self, model, num_stages):
      """
      Pipeline parallelism: split model into stages
      """
      layers_per_stage = len(model.layers) // num_stages
      
      stages = []
      for i in range(num_stages):
          start = i * layers_per_stage
          end = (i + 1) * layers_per_stage
          stages.append(model.layers[start:end])
      
      return stages
  
  def forward_backward_pipeline(self, batch, stages, loss_fn):
      """
      Micro-batching for pipeline efficiency
      """
      micro_batches = split_batch(batch, num_microbatches=8)
      
      # In a real schedule (GPipe / 1F1B) stages work on different
      # micro-batches at the same time:
      # Stage 1: micro-batch 1 forward
      # Stage 2: micro-batch 1 forward, micro-batch 2 forward
      # Stage 3: all stages busy
      # This sketch runs the micro-batches sequentially for clarity.
      
      total_loss = 0.0
      for mb in micro_batches:
          x = mb['input_ids']
          for stage in stages:
              x = stage(x)
          
          # Scale so the summed micro-batch gradients match the full batch
          loss = loss_fn(x, mb['labels']) / len(micro_batches)
          loss.backward()
          total_loss += loss.item()
      
      return total_loss

Stage 2: Supervised Fine-Tuning (SFT)

Creating Instruction Datasets

High-quality examples teach the model to follow instructions:

instruction_data.py
instruction_examples = [
  {
      "instruction": "Write a haiku about programming",
      "input": "",
      "output": "Code flows like water\nBugs lurk in shadowed corners\nDebug brings the light"
  },
  {
      "instruction": "Translate to French",
      "input": "Hello, how are you?",
      "output": "Bonjour, comment allez-vous?"
  },
  {
      "instruction": "Explain quantum computing to a 10-year-old",
      "input": "",
      "output": "Imagine if your computer could try all possible answers to a puzzle at the same time, instead of one by one. That's kind of what quantum computers do! They use special quantum bits that can be multiple things at once."
  }
]

# Exact SFT dataset sizes are mostly undisclosed: the LLaMA 2 paper
# reports ~27K high-quality annotations, and other labs are believed
# to use tens to hundreds of thousands of examples

Quality over Quantity:

  • 10K high-quality examples > 100K mediocre ones
  • Human-written preferred over AI-generated
  • Diverse tasks: coding, writing, math, reasoning
  • Consistent style and format

SFT Training Process

sft_training.py
import torch
import torch.nn.functional as F

def sft_loss(model, example):
  """
  Supervised fine-tuning loss
  Only compute loss on the output tokens
  """
  # Format: <instruction> + <input> + <output>
  prompt = format_prompt(example['instruction'], example['input'])
  full_text = prompt + example['output']
  
  # Tokenize (the tokenizer returns a list of token ids in this sketch)
  tokens = torch.tensor(tokenizer(full_text))
  prompt_len = len(tokenizer(prompt))
  
  # Forward pass: the logit at position i predicts token i + 1
  logits = model(tokens.unsqueeze(0)).squeeze(0)  # [seq_len, vocab_size]
  
  # Only compute loss on output tokens (ignore prompt)
  output_logits = logits[prompt_len - 1:-1]   # predictions for the output positions
  output_targets = tokens[prompt_len:]
  
  loss = F.cross_entropy(output_logits, output_targets)
  
  return loss

# Training config
training_config = {
  'learning_rate': 1e-5,  # Much smaller than pre-training
  'batch_size': 64,
  'epochs': 3,            # Just a few passes
  'warmup_steps': 100,
}

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

RLHF is the secret sauce that transforms a language model into a helpful assistant. Without it, models complete text but don’t follow instructions well.

The core problem: How do you optimize for “helpfulness”, “harmlessness”, and “honesty” when these aren’t differentiable loss functions? You can’t compute gradients of “is this response helpful?”

The solution: Train a reward model to predict human preferences, then use reinforcement learning to maximize expected reward.

The mathematics behind RLHF is elegant. We model text generation as a Markov Decision Process (MDP) where:

  • State: The prompt and partial response so far
  • Action: Choosing the next token
  • Reward: Human preference score (from reward model)
  • Policy: The language model $\pi_\theta(a_t \mid s_t)$ that generates tokens

The goal is to find policy parameters $\theta^*$ that maximize expected cumulative reward:

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)} \left[ r(x, y) \right]$$

where $r(x, y)$ is the reward for generating response $y$ to prompt $x$.
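In practice (and in the PPO step below), the maximized reward also includes a KL penalty that keeps the policy close to the SFT model, weighted by a coefficient $\beta$ (the 0.1 factor in the code later in this section):

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)} \left[ r(x, y) - \beta \, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x) \right) \right]$$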

Step 1: Train Reward Model

Humans rank model outputs to create a reward model:

reward_model.py
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
  """
  Learn to predict human preferences
  """
  def __init__(self, base_model, d_model):
      super().__init__()
      self.model = base_model
      # Replace the LM output head with a single scalar reward value
      self.reward_head = nn.Linear(d_model, 1)
  
  def forward(self, text):
      hidden = self.model(text)              # [seq_len, d_model]
      reward = self.reward_head(hidden[-1])  # read reward from the last token
      return reward

# Training data format
comparison_data = [
  {
      'prompt': 'Explain photosynthesis',
      'response_A': 'Plants make food from sun...',  # Good
      'response_B': 'Idk google it',                 # Bad
      'preference': 'A'  # Human chose A
  }
]

def train_reward_model(model, comparisons, optimizer):
  for item in comparisons:
      reward_A = model(item['prompt'] + item['response_A'])
      reward_B = model(item['prompt'] + item['response_B'])
      
      # Bradley-Terry pairwise loss: higher reward for the preferred response
      if item['preference'] == 'A':
          loss = -F.logsigmoid(reward_A - reward_B)
      else:
          loss = -F.logsigmoid(reward_B - reward_A)
      
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

Step 2: PPO Optimization

Use the reward model to improve the language model:

ppo_training.py
def ppo_step(policy_model, sft_model, reward_model, prompt, optimizer):
  """
  Simplified RLHF policy update (a REINFORCE-style stand-in for PPO).
  Real PPO adds a value model, per-token advantages, and a clipped
  probability-ratio objective.
  """
  # Generate a response with the current policy
  response = policy_model.generate(prompt)
  
  # Get reward from the reward model
  reward = reward_model(prompt + response)
  
  # Also penalize divergence from the SFT model (KL penalty)
  # This prevents the model from "hacking" the reward model
  old_logprobs = sft_model.get_logprobs(prompt, response)   # [num_tokens]
  new_logprobs = policy_model.get_logprobs(prompt, response)
  kl_penalty = (new_logprobs - old_logprobs).sum()  # single-sample KL estimate
  
  # Combined sequence-level reward
  total_reward = reward - 0.1 * kl_penalty
  
  # Policy gradient: raise the log-probability of the sampled response
  # in proportion to its (KL-regularized) reward
  loss = -total_reward.detach() * new_logprobs.sum()
  
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  
  return loss

# This stage is what makes ChatGPT-style models helpful, harmless, and honest
# rather than just predicting internet text

Why RLHF Works:

  • Aligns with human preferences, not just internet text
  • Reduces harmful/biased outputs
  • Makes model more helpful and honest
  • But: Can make model overly cautious

Training Costs Breakdown

GPT-4 Estimated Training Cost:

Pre-training:
├─ Compute: $63M (25,000 A100s × 100 days)
├─ Power: $2M
├─ Data: $5M (cleaning, processing)
└─ Subtotal: ~$70M

Fine-tuning:
├─ SFT: $500K
├─ RLHF (reward + PPO): $2M
└─ Subtotal: ~$2.5M

Infrastructure:
├─ Networking: $5M
├─ Storage: $2M
├─ Staff: $10M (100 engineers × 6 months)
└─ Subtotal: ~$17M

Total: ~$100M minimum
Likely $150-200M including failed experiments
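These figures are outside estimates rather than disclosed numbers, but the compute line is easy to sanity-check; the per-GPU-hour rate below is an assumed bulk price, not a quoted one:

compute_cost_estimate.py
num_gpus = 25_000
days = 100
price_per_gpu_hour = 1.05   # assumed bulk rate in USD (not a quoted price)

gpu_hours = num_gpus * days * 24                  # 60 million GPU-hours
compute_cost = gpu_hours * price_per_gpu_hour

print(f"{gpu_hours / 1e6:.0f}M GPU-hours -> ~${compute_cost / 1e6:.0f}M compute")
# 60M GPU-hours -> ~$63M, matching the pre-training compute line above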

Modern Training Techniques

Flash Attention

Faster attention computation:

flash_attention.py
import math
import torch

# Traditional attention: O(n²) memory (materializes the full score matrix)
def standard_attention(Q, K, V):
  scores = (Q @ K.T) / math.sqrt(Q.shape[-1])
  weights = torch.softmax(scores, dim=-1)
  output = weights @ V
  return output

# Flash-Attention-style blocked computation: O(n) extra memory, much faster in
# its fused GPU form. The key idea is an *online softmax*: keep a running max
# and running normalizer per query row so the full score matrix is never stored.
def flash_attention(Q, K, V, block_size=128):
  n, d = Q.shape
  output = torch.zeros_like(V)
  row_max = torch.full((n, 1), float('-inf'))
  row_sum = torch.zeros(n, 1)
  
  for start in range(0, n, block_size):
      k_block = K[start:start + block_size]
      v_block = V[start:start + block_size]
      
      scores = (Q @ k_block.T) / math.sqrt(d)              # [n, block]
      block_max = scores.max(dim=-1, keepdim=True).values
      new_max = torch.maximum(row_max, block_max)
      
      # Rescale previously accumulated results to the new running max
      scale_old = torch.exp(row_max - new_max)
      exp_scores = torch.exp(scores - new_max)
      
      output = output * scale_old + exp_scores @ v_block
      row_sum = row_sum * scale_old + exp_scores.sum(dim=-1, keepdim=True)
      row_max = new_max
  
  return output / row_sum

# Used (as a fused GPU kernel) in most modern LLM training stacks

Mixed Precision Training

mixed_precision.py
import torch
from torch.cuda.amp import autocast, GradScaler
from torch.optim import AdamW

class MixedPrecisionTrainer:
  """
  Use FP16 for speed, FP32 for stability
  """
  def __init__(self, model):
      # Master weights stay in FP32; autocast runs selected ops in FP16
      self.model = model
      self.optimizer = AdamW(model.parameters())
      self.scaler = GradScaler()  # rescales small FP16 gradients to avoid underflow
  
  def training_step(self, batch):
      self.optimizer.zero_grad()
      
      # Forward in FP16 (faster, roughly half the activation memory)
      with autocast():
          loss = self.model(batch)
      
      # Backward with gradient scaling
      self.scaler.scale(loss).backward()
      
      # Optimizer step on FP32 master weights (stable)
      self.scaler.step(self.optimizer)
      self.scaler.update()
      
      return loss

# ~2x activation-memory savings and large speedups on tensor cores
# Mixed precision (FP16 or BF16) is used almost universally in modern LLM training

Evaluation During Training

Track multiple metrics:

Validation loss: 3.2 → 2.8 → 2.1 (lower = better)
├─ Cross-entropy loss in nats per token (perplexity = e^loss)
└─ Target: ~2.0 or lower for strong models

Downstream Tasks:
├─ MMLU (knowledge): 45% → 67% → 85%
├─ HumanEval (coding): 15% → 48% → 67%
└─ GSM8K (math): 20% → 58% → 92%

Human Eval:
├─ Helpfulness: 3.2 → 4.5 → 4.8 / 5
├─ Harmlessness: 3.8 → 4.6 → 4.9 / 5
└─ Honesty: 3.5 → 4.2 → 4.7 / 5
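The loss column comes from a held-out evaluation loop like the minimal sketch below; it reuses compute_pretraining_loss from earlier, eval_dataloader is an assumed hold-out loader, and benchmark scores such as MMLU or GSM8K come from separate task-specific harnesses:

evaluation_loop.py
import math
import torch

@torch.no_grad()
def evaluate(model, eval_dataloader):
  """Average next-token loss on held-out data, plus perplexity."""
  model.eval()
  total_loss, num_batches = 0.0, 0
  for batch in eval_dataloader:
      loss = compute_pretraining_loss(model, batch)  # same loss fn as training
      total_loss += loss.item()
      num_batches += 1
  model.train()
  avg_loss = total_loss / num_batches
  return avg_loss, math.exp(avg_loss)  # (loss, perplexity)

# Run every few thousand steps and log alongside MMLU / HumanEval / GSM8K
# scores and human preference ratings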