
LLM Inference Optimization: Speed & Cost Guide

How to make LLM inference faster and cheaper - quantization, batching, KV caching, and more

AI Tools Reviews Technical Team
January 23, 2024
Tags: LLM, technical, optimization, inference, performance


Running a 70B parameter model costs thousands per month. These optimizations can reduce that 10x while maintaining quality.

The economics are brutal: at commercial scale, every millisecond of latency costs money. If your model serves 1M requests/day with an average latency of 2 seconds per request, that's 2 million GPU-seconds per day. On an A100 ($3/hour), that's approximately:

$$\text{Daily cost} = \frac{2{,}000{,}000 \text{ GPU-sec}}{3600 \text{ sec/hour}} \times \$3/\text{hour} \approx \$1{,}667/\text{day}$$

Or about **$50,000/month** just for inference compute. Cut latency in half through optimization, and you save $25K/month. These optimizations aren't academic: they're the difference between a profitable and an unprofitable AI business.

The latency breakdown for a typical 70B model generating 100 tokens:

Prompt processing (prefill):     200ms  (20%)   ← Parallelizable
First token generation:          100ms  (10%)   ← Serial bottleneck  
Remaining 99 tokens:             700ms  (70%)   ← Auto-regressive
                                 ------
Total:                          1000ms (100%)

The key insight: token generation is memory-bandwidth bound, not compute-bound. Loading 140GB of weights from GPU memory 100 times dominates the cost. Quantization, batching, and speculative decoding all attack this bottleneck from different angles.
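A back-of-the-envelope check of that bandwidth ceiling (assuming a single A100 at roughly 1,935 GB/s HBM bandwidth, no batching):

weights_gb = 140          # 70B parameters in FP16
bandwidth_gbps = 1935     # A100 80GB HBM bandwidth, approximate

time_per_token = weights_gb / bandwidth_gbps   # ~0.072 s per full weight pass
print(f"Bandwidth-bound ceiling: {1 / time_per_token:.0f} tokens/s")  # ~14 tokens/s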

The Inference Cost Problem

GPT-4 API Pricing (Jan 2024):
├─ Input: $0.03 per 1K tokens
├─ Output: $0.06 per 1K tokens
└─ 1M requests/day = $1,800-$3,600/day

Self-Hosted 70B Model:
├─ Hardware: 4× A100 80GB = $10K/month rental
├─ Power: $500/month
├─ Staff: $20K/month (engineers)
└─ Total: ~$30K/month

Goal: Optimize to run on cheaper hardware
2× A100 80GB (with INT8 quantization) instead of 4× A100 80GB = 50% cost savings

Optimization Techniques

1. Quantization: Trading Precision for Speed

Reduce precision from FP16 to INT8/INT4. The core idea: most weights don’t need 16 bits of precision.

Mathematical foundation: Quantization maps floating-point values to integers using an affine transformation:

$$x_{\text{int}} = \text{round}\!\left(\frac{x_{\text{float}} - z}{s}\right)$$

where:

  • $s$ is the scale factor (step size between quantized values)
  • $z$ is the zero point (which float value maps to 0)

Dequantization recovers the approximate value:

$$\hat{x}_{\text{float}} = s \cdot x_{\text{int}} + z$$

The quantization error is:

$$\epsilon = x_{\text{float}} - \hat{x}_{\text{float}} \in \left[-\frac{s}{2}, \frac{s}{2}\right]$$

For INT8 (256 levels) quantizing a range $[w_{\min}, w_{\max}]$:

$$s = \frac{w_{\max} - w_{\min}}{255}$$

The magic: Neural networks are remarkably robust to this quantization noise! Experiments show that INT8 quantization (256 levels instead of 65,536) causes only 1-2% accuracy loss for most LLMs. Why? The model learns to tolerate noise during training, and quantization error acts like regularization.

quantization.py
import torch

class QuantizedLinear:
    """
    Quantize weights from FP16 (16 bits) to INT8 (8 bits)
    """
    def __init__(self, weight_fp16):
        # Find min/max values
        self.min_val = weight_fp16.min()
        self.max_val = weight_fp16.max()

        # Affine quantization: map [min_val, max_val] to [-128, 127]
        self.scale = (self.max_val - self.min_val) / 255
        self.zero_point = -128 - self.min_val / self.scale

        self.weight_int8 = torch.round(
            weight_fp16 / self.scale + self.zero_point
        ).clamp(-128, 127).to(torch.int8)

    def dequantize(self):
        """Convert back to FP16 for computation"""
        return ((self.weight_int8 - self.zero_point) * self.scale).half()

    def forward(self, x):
        # Dequantize weights on the fly, then compute in FP16.
        # (The savings here are memory; real INT8 kernels also
        # use integer matmuls for compute savings.)
        weight_fp16 = self.dequantize()
        return x @ weight_fp16

# Memory savings:
# 70B model in FP16: 140 GB
# 70B model in INT8: 70 GB (50% reduction)
# 70B model in INT4: 35 GB (75% reduction)

# Quality impact:
# INT8: ~1-2% accuracy loss
# INT4: ~3-5% accuracy loss
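
A quick usage sketch for the class above (illustrative 4096×4096 weight; exact error varies with the weight distribution):

import torch

w = torch.randn(4096, 4096, dtype=torch.float16)
q = QuantizedLinear(w)

max_err = (w - q.dequantize()).abs().max()
print(f"INT8: {q.weight_int8.numel() / 1e6:.1f} MB, "
      f"FP16: {w.numel() * 2 / 1e6:.1f} MB, max error: {max_err:.4f}")
# Half the memory; per-weight error is bounded by scale/2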

Quantization Methods:

Post-Training Quantization (PTQ)
├─ Quantize after training
├─ Fast (hours)
└─ 1-3% accuracy loss

Quantization-Aware Training (QAT)
├─ Train with quantization in mind
├─ Slow (retraining needed)
└─ <1% accuracy loss

GPTQ (GPT Quantization)
├─ Layer-by-layer quantization
├─ INT4 with minimal loss
└─ Used in GPTQ-for-LLaMa, AutoGPTQ
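
In practice, PTQ is often a one-flag change. A minimal sketch assuming the Hugging Face transformers + bitsandbytes + accelerate stack (API as of early 2024; newer versions prefer passing a BitsAndBytesConfig):

from transformers import AutoModelForCausalLM

# Post-training INT8 quantization applied at load time (no retraining)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    load_in_8bit=True,    # bitsandbytes INT8 weights
    device_map="auto",    # shard across available GPUs
)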

2. KV Caching

Cache key/value tensors to avoid recomputation:

kv_cache.py
import torch
import torch.nn.functional as F

class AttentionWithKVCache:
    """
    Cache keys and values for autoregressive generation.
    Single-head, unbatched: x has shape (new_tokens, d_model).
    """
    def __init__(self, d_model=512):
        self.W_q = torch.randn(d_model, d_model) / d_model ** 0.5
        self.W_k = torch.randn(d_model, d_model) / d_model ** 0.5
        self.W_v = torch.randn(d_model, d_model) / d_model ** 0.5
        self.kv_cache = {'keys': [], 'values': []}

    def forward(self, x, use_cache=True):
        # Project only the new token(s)
        Q = x @ self.W_q
        K = x @ self.W_k
        V = x @ self.W_v

        if use_cache:
            # Append new K, V to cache
            self.kv_cache['keys'].append(K)
            self.kv_cache['values'].append(V)

            # Attend over all cached keys/values
            K_full = torch.cat(self.kv_cache['keys'], dim=0)
            V_full = torch.cat(self.kv_cache['values'], dim=0)
        else:
            K_full, V_full = K, V

        # Scaled dot-product attention with full context
        scores = Q @ K_full.T / K_full.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=-1)
        return weights @ V_full

# Generation with KV cache:
prompt_tokens = tokenize("The cat sat on the")  # pseudo-tokenizer

# Prefill: compute (and cache) K,V for all prompt tokens at once
output = model.forward(prompt_tokens, use_cache=True)

# Decode: each step computes K,V for the single new token only
for _ in range(100):  # Generate 100 tokens
    new_token = sample(output)
    output = model.forward(new_token, use_cache=True)
    # Only 1 new K,V computed; the rest is reused from the cache

# Attention cost per generated token:
# Without cache: O(n²)  (recompute K,V for all n previous tokens)
# With cache:    O(n)   (one new K,V, attend over n cached entries)
# 10-100x faster generation

3. Batch Processing: Amortizing the Cost

Process multiple requests simultaneously to maximize GPU utilization. The key insight: loading model weights from memory dominates the cost, so amortize it across multiple examples.

The economics: For a 70B model in FP16:

  • Model weights: 140 GB
  • A100 memory bandwidth: 1,935 GB/s
  • Time to load weights: $\frac{140 \text{ GB}}{1935 \text{ GB/s}} \approx 72 \text{ ms}$

For a single token:

  • Compute: ~1 ms (matrix multiplication)
  • Memory transfer: 72 ms (loading weights)
  • Utilization: 1/73 = 1.4% ← Terrible!

With batch size $B$:

  • Load weights once: 72 ms
  • Compute for $B$ examples: $\approx B$ ms (parallelized)
  • Time per example: $\frac{72 + B}{B}$ ms

$$\text{Throughput} = \frac{B}{72 + B} \text{ tokens/ms}$$

As $B \to \infty$, throughput approaches 1 token/ms. With $B = 32$:

$$\text{Time per example} = \frac{72 + 32}{32} = 3.25 \text{ ms}$$

versus 73 ms without batching—a 22x improvement!
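
The whole tradeoff falls out of this two-line amortization model (72 ms weight load and ~1 ms of compute per example, the rough A100 numbers from above):

LOAD_MS, COMPUTE_MS = 72, 1

# Time per example: (load + B * compute) / B
for B in (1, 8, 32, 128):
    per_example = (LOAD_MS + B * COMPUTE_MS) / B
    throughput = 1000 * B / (LOAD_MS + B * COMPUTE_MS)  # tokens/s
    print(f"B={B:4d}: {per_example:6.2f} ms/example, {throughput:5.0f} tokens/s")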

The latency-throughput tradeoff: Larger batches mean waiting for more requests to arrive, increasing latency for early requests. The sweet spot depends on your SLA:

Batch Size  |  Latency (p50)  |  Throughput  |  Cost/Token
-----------------------------------------------------------------
1           |  75 ms          |  13 tok/s    |  $0.020
8           |  150 ms         |  89 tok/s    |  $0.003
32          |  300 ms         |  280 tok/s   |  $0.001
128         |  800 ms         |  720 tok/s   |  $0.0004

Google/OpenAI likely use batch sizes of 100-500 to balance cost and user experience.

dynamic_batching.py
import time

import torch
import torch.nn.functional as F

class DynamicBatcher:
    """
    Batch requests with different lengths efficiently
    """
    def __init__(self, model, max_batch_size=32, timeout_ms=100):
        self.model = model
        self.max_batch_size = max_batch_size
        self.timeout_ms = timeout_ms
        self.pending_requests = []
        self.first_request_time = None

    def add_request(self, request):
        if not self.pending_requests:
            self.first_request_time = time.monotonic()
        self.pending_requests.append(request)

        # Process when batch is full or timeout reached
        if (len(self.pending_requests) >= self.max_batch_size or
                self.time_since_first_request() > self.timeout_ms):
            return self.process_batch()

    def time_since_first_request(self):
        return (time.monotonic() - self.first_request_time) * 1000  # ms

    def process_batch(self):
        batch = self.pending_requests[:self.max_batch_size]
        self.pending_requests = self.pending_requests[self.max_batch_size:]
        self.first_request_time = (
            time.monotonic() if self.pending_requests else None
        )

        # Right-pad every token tensor to the longest sequence in the batch
        max_len = max(len(r['tokens']) for r in batch)
        padded_batch = [
            F.pad(r['tokens'], (0, max_len - len(r['tokens'])))
            for r in batch
        ]

        # Process all at once
        return self.model.forward(torch.stack(padded_batch))

# Throughput improvement:
# Sequential: 10 requests/sec
# Batched (32): 200 requests/sec
# 20x throughput improvement!

# But: Increases latency for individual requests
# Trade-off: throughput vs latency
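
A minimal usage sketch (`incoming_requests` and `dispatch` are hypothetical stand-ins for your serving loop):

batcher = DynamicBatcher(model, max_batch_size=32, timeout_ms=100)

# Requests trickle in; a batch fires once it is full or the oldest
# pending request has waited 100 ms
for tokens in incoming_requests:        # hypothetical request stream
    outputs = batcher.add_request({'tokens': tokens})
    if outputs is not None:
        dispatch(outputs)               # hypothetical: route results back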

4. Speculative Decoding: Gambling on Predictions

Generate multiple tokens per forward pass using a small “draft” model to propose candidates that a large model verifies in parallel.

The standard autoregressive bottleneck: Each token requires a full forward pass:

$$T_{\text{total}} = n \times T_{\text{forward}}$$

For 100 tokens at 100ms/token: 10 seconds. Can we parallelize?

Speculative decoding idea (Chen et al., 2023):

  1. A small draft model quickly generates $K$ candidate tokens
  2. The large model verifies all $K$ candidates in parallel (one forward pass)
  3. Accept the longest prefix the large model agrees with; the first rejected position is resampled from the large model (rejection sampling keeps the output distribution exact)

Mathematical guarantee: The final distribution is identical to standard sampling! The speedup is “free” — no quality loss.

Expected speedup with acceptance rate $\alpha$, lookahead $K$, and draft/target cost ratio $r = T_{\text{draft}} / T_{\text{target}}$ (each round pays for $K$ draft passes plus one target pass, and yields $1 + K\alpha$ tokens on average):

$$\text{Speedup} = \frac{1 + K\alpha}{1 + K \cdot r}$$

With $K=5$, $\alpha=0.7$ (70% acceptance), and a draft model 10x faster ($r = 0.1$):

$$\text{Speedup} = \frac{1 + 5 \times 0.7}{1 + 5 \times 0.1} = \frac{4.5}{1.5} = 3\text{x}$$

For easier tasks (high $\alpha$) or cheaper draft models, larger speedups are possible. The draft model can be:

  • A smaller model (7B drafting for 70B)
  • The same model quantized to INT4
  • Early-exit layers from the target model

Use small model to predict, large model to verify:

speculative_decoding.py
def speculative_decode(large_model, small_model, prompt, k=4, max_length=512):
    """
    Use small model to draft k tokens;
    large model verifies them in one parallel forward pass.

    Greedy variant: a draft token is accepted only if it matches the
    large model's own greedy choice, so output is identical to greedy
    decoding with the large model alone. 2-3x speedup, no quality loss.
    """
    tokens = tokenize(prompt)

    while len(tokens) < max_length:
        # Small model: generate k candidate tokens (fast)
        candidates = small_model.generate(
            tokens,
            num_tokens=k,
            temperature=0  # Greedy
        )

        # Large model: score all candidate positions in ONE forward pass.
        # scores[i] is its next-token distribution after tokens + candidates[:i]
        scores = large_model.score_sequence(tokens + candidates)

        # Accept draft tokens while they match the large model's choice
        accepted = 0
        for i, candidate in enumerate(candidates):
            if candidate == scores[i].argmax():
                tokens.append(candidate)
                accepted += 1
            else:
                break

        # On the first mismatch, take the large model's token instead
        # (with temperature > 0, the published algorithm uses rejection
        # sampling here to preserve the sampling distribution exactly)
        if accepted < k:
            tokens.append(scores[accepted].argmax())

    return tokens

# Example:
# Small model (7B) drafts: "cat", "sat", "on", "mat"
# Large model (70B) verifies: accepts "cat", "sat", "on", rejects "mat",
# and supplies "rug" from the same forward pass
# Result: 4 tokens from 1 large-model call instead of 4 calls
# ~4x fewer large-model passes in this example
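
Plugging different acceptance rates into the speedup model from above:

def speculative_speedup(k, alpha, draft_cost_ratio):
    """Speedup under the simplified cost model: (1 + k*alpha) / (1 + k*r)."""
    return (1 + k * alpha) / (1 + k * draft_cost_ratio)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {speculative_speedup(5, alpha, 0.1):.1f}x")
# alpha=0.5: 2.3x, alpha=0.7: 3.0x, alpha=0.9: 3.7x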

5. Flash Attention

Memory-efficient attention computation:

flash_attention_v2.py
import torch

def flash_attention_v2(Q, K, V, block_size=128):
    """
    Simplified Flash Attention-style tiling (pure PyTorch sketch).

    Key ideas:
    1. Tile-based computation (blocks sized to fit GPU SRAM)
    2. Online softmax (running max + normalizer), so the full
       seq_len x seq_len score matrix is never materialized
    3. The real kernel fuses all of this into one GPU kernel
       and is 2-4x faster than standard attention
    """
    seq_len, d_k = Q.shape
    output = torch.zeros_like(V)

    for i in range(0, seq_len, block_size):
        q = Q[i:i+block_size]
        # Running statistics for the online softmax
        m = torch.full((q.shape[0], 1), float('-inf'))  # row-wise max
        l = torch.zeros(q.shape[0], 1)                  # row-wise normalizer
        acc = torch.zeros(q.shape[0], d_k)              # unnormalized output

        for j in range(0, seq_len, block_size):
            k = K[j:j+block_size]
            v = V[j:j+block_size]
            s = q @ k.T / d_k ** 0.5                    # scores for this tile

            # Rescale previous partial results whenever a new max appears
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            p = torch.exp(s - m_new)
            correction = torch.exp(m - m_new)
            l = l * correction + p.sum(dim=-1, keepdim=True)
            acc = acc * correction + p @ v
            m = m_new

        output[i:i+block_size] = acc / l  # normalize once at the end

    return output

# Performance comparison (illustrative):
# Standard attention:  100 tokens/sec
# Flash Attention v1:  250 tokens/sec
# Flash Attention v2:  400 tokens/sec
# 4x improvement!

# Used in virtually every modern LLM training and serving stack
# (vLLM, TGI, PyTorch's scaled_dot_product_attention, ...)
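
A quick sanity check, continuing the file above, that the tiled version matches standard attention:

Q, K, V = (torch.randn(256, 64) for _ in range(3))

reference = torch.softmax(Q @ K.T / 64 ** 0.5, dim=-1) @ V
tiled = flash_attention_v2(Q, K, V, block_size=64)
assert torch.allclose(tiled, reference, atol=1e-5)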

Hardware Optimizations

GPU Selection

GPU         |  Price/Month  |  Notes
------------------------------------------------
A100 80GB   |  $2,500       |  Best for 70B models
A100 40GB   |  $1,500       |  Good for 30B models
A10G 24GB   |  $500         |  Good for 7B models
H100 80GB   |  $4,000       |  ~2x A100 speed
hardware_requirements.py
def calculate_gpu_memory(
    num_params,
    precision='fp16',
    batch_size=1,
    seq_len=2048,
    num_layers=80,
    d_model=8192
):
    """
    Rough GPU memory estimate: weights + KV cache + activations
    (defaults match LLaMA 2 70B: 80 layers, d_model 8192)
    """
    bytes_per_param = {
        'fp32': 4,
        'fp16': 2,
        'int8': 1,
        'int4': 0.5
    }[precision]

    # Model weights
    model_memory = num_params * bytes_per_param

    # KV cache: 2 tensors (K and V) per layer, per token
    kv_cache = (
        2 *  # K and V
        batch_size *
        seq_len *
        num_layers *
        d_model *
        bytes_per_param
    )

    # Activations (rough estimate)
    activations = batch_size * seq_len * d_model * 4 * bytes_per_param

    total_gb = (model_memory + kv_cache + activations) / 1e9

    return {
        'model': model_memory / 1e9,
        'kv_cache': kv_cache / 1e9,
        'activations': activations / 1e9,
        'total': total_gb
    }

# Example: LLaMA 2 70B
memory = calculate_gpu_memory(
    num_params=70e9,
    precision='fp16',
    batch_size=1,
    seq_len=4096
)
print(memory)
# {
#   'model': 140.0,        # GB
#   'kv_cache': 10.7,      # GB
#   'activations': 0.3,    # GB
#   'total': 151.0         # GB
# }
# Needs 2× A100 80GB (160 GB total)
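
Rerunning the same estimate with INT8 weights (note this rough model quantizes the KV cache too; real deployments often keep it in FP16):

memory_int8 = calculate_gpu_memory(
    num_params=70e9,
    precision='int8',
    batch_size=1,
    seq_len=4096
)
# model: 70 GB, total: ~76 GB -> fits on a single A100 80GB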

Model Parallelism

Split model across GPUs:

model_parallelism.py
import torch
import torch.nn as nn

class TensorParallelLinear:
    """
    Tensor parallelism: split the weight matrix column-wise,
    one shard per GPU
    """
    def __init__(self, weight, num_gpus=2):
        # Place each column shard on its own GPU
        self.weight_shards = [
            shard.to(f'cuda:{i}')
            for i, shard in enumerate(torch.chunk(weight, num_gpus, dim=1))
        ]

    def forward(self, x):
        # Each GPU computes its slice of the output
        outputs = [
            (x.to(shard.device) @ shard).to('cuda:0')
            for shard in self.weight_shards
        ]
        # Gather the slices and concatenate
        return torch.cat(outputs, dim=-1)

class PipelineParallelModel:
    """
    Pipeline parallelism: split consecutive layers across GPUs;
    activations flow from one stage to the next
    """
    def __init__(self, layers, num_gpus=4):
        layers_per_gpu = len(layers) // num_gpus

        self.gpu_stages = []
        for i in range(num_gpus):
            start = i * layers_per_gpu
            end = (i + 1) * layers_per_gpu
            stage = nn.Sequential(*layers[start:end]).to(f'cuda:{i}')
            self.gpu_stages.append(stage)

    def forward(self, x):
        for gpu_id, stage in enumerate(self.gpu_stages):
            # Move activations to the GPU that holds this stage
            x = stage(x.to(f'cuda:{gpu_id}'))
        return x

# Enables serving models too large for any single GPU, e.g. a 70B model
# across 4× 40GB, or (with INT8 weights) roughly across 2× 40GB
# But: Adds communication overhead between GPUs
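
A toy usage sketch, continuing the file above (hypothetical stack of plain feed-forward layers; needs 4 visible GPUs):

layers = [nn.Linear(8192, 8192) for _ in range(80)]
model = PipelineParallelModel(layers, num_gpus=4)

x = torch.randn(1, 8192)
y = model.forward(x)  # activations hop cuda:0 -> cuda:1 -> cuda:2 -> cuda:3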

Deployment Strategies

Inference Servers

vLLM (Recommended)
├─ PagedAttention for KV cache
├─ Continuous batching
├─ 10-20x throughput vs naive
└─ pip install vllm

TensorRT-LLM
├─ NVIDIA optimized
├─ INT8/INT4 quantization
├─ Best for NVIDIA GPUs
└─ Requires more setup

Text Generation Inference (TGI)
├─ HuggingFace solution
├─ Good for standard models
├─ Easy Docker deployment
└─ docker pull ghcr.io/huggingface/text-generation-inference
vllm_deployment.py
from vllm import LLM, SamplingParams

# Initialize model with optimizations
llm = LLM(
  model="meta-llama/Llama-2-70b-hf",
  tensor_parallel_size=2,  # Split across 2 GPUs
  dtype="half",            # FP16
  max_num_seqs=256,        # Large batch size
  max_num_batched_tokens=8192
)

# Efficient sampling
sampling_params = SamplingParams(
  temperature=0.8,
  top_p=0.95,
  max_tokens=512
)

# Process requests
prompts = [
  "Explain quantum computing",
  "Write a haiku about AI",
  "Translate to Spanish: Hello"
]

outputs = llm.generate(prompts, sampling_params)

# vLLM automatically:
# - Batches requests
# - Uses PagedAttention for KV cache
# - Applies Flash Attention
# - Handles different sequence lengths

# Result: 10-20x throughput improvement

Cost Optimization Summary

Baseline (70B FP16):
├─ Hardware: 4× A100 80GB = $10K/month
├─ Throughput: 10 req/sec
└─ Cost per 1M tokens: $50

After Optimizations:
├─ Quantization (INT8): 2× A100 80GB = $5K/month
├─ vLLM + batching: 200 req/sec (20x)
├─ Flash Attention: Included in vLLM
└─ Cost per 1M tokens: $1.25

97.5% cost reduction!

Optimization Checklist:

  • ✅ Use INT8 quantization (50% memory savings)
  • ✅ Enable KV caching (10x generation speedup)
  • ✅ Deploy with vLLM (20x throughput)
  • ✅ Use Flash Attention (included in vLLM)
  • ✅ Batch requests (maximize GPU utilization)
  • ✅ Choose right GPU (A100 40GB often sufficient)