LLM Inference Optimization: Speed & Cost Guide
How to make LLM inference faster and cheaper - quantization, batching, KV caching, and more
LLM Inference Optimization
Running a 70B parameter model costs thousands per month. These optimizations can reduce that 10x while maintaining quality.
The economics are brutal: at commercial scale, every millisecond of latency costs money. If your model serves 1M requests/day at an average latency of 2 seconds per request (with a GPU occupied for the full duration), that's 2 million GPU-seconds per day. On an A100 ($3/hour), that's approximately:

2,000,000 GPU-seconds/day ÷ 3,600 s/hour ≈ 556 GPU-hours/day × $3/hour ≈ $1,700/day

Or about $50K/month. These optimizations aren't academic; they're the difference between a profitable and an unprofitable AI business.
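The same back-of-the-envelope arithmetic in Python, using the request volume, latency, and A100 rate assumed above:

```python
# Back-of-the-envelope serving cost (assumptions from the paragraph above)
requests_per_day = 1_000_000
gpu_seconds_per_request = 2.0      # average latency, GPU busy the whole time
a100_dollars_per_hour = 3.0

gpu_hours_per_day = requests_per_day * gpu_seconds_per_request / 3600
daily_cost = gpu_hours_per_day * a100_dollars_per_hour
print(f"{gpu_hours_per_day:,.0f} GPU-hours/day  "
      f"${daily_cost:,.0f}/day  ${daily_cost * 30:,.0f}/month")
# ≈ 556 GPU-hours/day, ≈ $1,667/day, ≈ $50,000/month
```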
The latency breakdown for a typical 70B model generating 100 tokens:
Prompt processing (prefill): 200ms (20%) ← Parallelizable
First token generation: 100ms (10%) ← Serial bottleneck
Remaining 99 tokens: 700ms (70%) ← Auto-regressive
------
Total: 1000ms (100%)
The key insight: token generation is memory-bandwidth bound, not compute-bound. Loading 140GB of weights from GPU memory 100 times dominates the cost. Quantization, batching, and speculative decoding all attack this bottleneck from different angles.
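To see why memory bandwidth dominates, estimate how long it takes just to stream the weights from GPU memory for each decoding step. This short sketch uses the 140 GB weight size from above and the ~1,935 GB/s A100 bandwidth quoted in the batching section below:

```python
# Per-token lower bound from memory bandwidth alone (single GPU)
weights_gb = 140               # 70B parameters in FP16
hbm_bandwidth_gb_s = 1935      # A100 80GB memory bandwidth

ms_per_token = weights_gb / hbm_bandwidth_gb_s * 1000
print(f"≈ {ms_per_token:.0f} ms/token just reading weights")   # ≈ 72 ms
# Tensor parallelism over N GPUs divides this roughly by N;
# quantization and batching shrink or amortize it further.
```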
The Inference Cost Problem
GPT-4 API Pricing (Jan 2024):
├─ Input: $0.03 per 1K tokens
├─ Output: $0.06 per 1K tokens
└─ 1M requests/day = $1,800-$3,600/day
Self-Hosted 70B Model:
├─ Hardware: 4× A100 80GB = $10K/month rental
├─ Power: $500/month
├─ Staff: $20K/month (engineers)
└─ Total: ~$30K/month
Goal: Optimize to run on cheaper hardware
2× A100 40GB instead of 4× A100 80GB = 50% cost savings
Optimization Techniques
1. Quantization: Trading Precision for Speed
Reduce precision from FP16 to INT8/INT4. The core idea: most weights don’t need 16 bits of precision.
Mathematical foundation: Quantization maps floating-point values to integers using an affine transformation:

q = round(x / s + z)

where:
- s is the scale factor (the step size between quantized values)
- z is the zero point (the integer that the real value 0 maps to)

Dequantization recovers an approximate value:

x̂ = (q − z) · s

The quantization error is at most half a step:

|x − x̂| ≤ s / 2

For INT8 (256 levels) quantizing a range [x_min, x_max]:

s = (x_max − x_min) / 255, so the worst-case error is about (x_max − x_min) / 510
The magic: Neural networks are remarkably robust to this quantization noise! Experiments show that INT8 quantization (256 levels instead of 65,536) causes only 1-2% accuracy loss for most LLMs. Why? The model learns to tolerate noise during training, and quantization error acts like regularization.
import torch

class QuantizedLinear:
    """
    Quantize weights from FP16 (16 bits) to INT8 (8 bits).
    """
    def __init__(self, weight_fp16):
        # Find min/max values of the weight tensor
        self.min_val = weight_fp16.min()
        self.max_val = weight_fp16.max()

        # Affine quantization: map [min_val, max_val] onto [-128, 127]
        scale = (self.max_val - self.min_val) / 255
        zero_point = -128 - self.min_val / scale

        self.weight_int8 = torch.clamp(
            torch.round(weight_fp16 / scale + zero_point), -128, 127
        ).to(torch.int8)
        self.scale = scale
        self.zero_point = zero_point

    def dequantize(self):
        """Convert back to floating point for computation"""
        return (self.weight_int8 - self.zero_point) * self.scale

    def forward(self, x):
        # Dequantize weights, then compute as usual
        weight_fp16 = self.dequantize()
        output = x @ weight_fp16
        return output
# Memory savings:
# 70B model in FP16: 140 GB
# 70B model in INT8: 70 GB (50% reduction)
# 70B model in INT4: 35 GB (75% reduction)
# Quality impact:
# INT8: ~1-2% accuracy loss
# INT4: ~3-5% accuracy loss

Quantization Methods:
Post-Training Quantization (PTQ)
├─ Quantize after training
├─ Fast (hours)
└─ 1-3% accuracy loss
Quantization-Aware Training (QAT)
├─ Train with quantization in mind
├─ Slow (retraining needed)
└─ <1% accuracy loss
GPTQ
├─ Layer-by-layer post-training quantization with error compensation
├─ INT4 with minimal loss
└─ Used for INT4 LLaMA checkpoints (AutoGPTQ, ExLlama)
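In practice, post-training quantization is usually applied at model-load time by a library rather than by hand. A minimal sketch using the Hugging Face transformers + bitsandbytes INT8 integration (the checkpoint name is an example, and the exact config API varies across library versions):

```python
# Load-time INT8 quantization via transformers + bitsandbytes
# (BitsAndBytesConfig path; check your installed version's docs)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",                      # example checkpoint
    device_map="auto",                                 # spread layers across GPUs
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
# Weights now occupy ~70 GB instead of ~140 GB; matmuls dequantize on the fly.
```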
2. KV Caching
Cache key/value tensors to avoid recomputation:
import torch
import torch.nn.functional as F

class AttentionWithKVCache:
    """
    Cache keys and values for autoregressive generation.
    """
    def __init__(self, d_model):
        # Projection matrices (a real model would use nn.Linear with trained weights)
        self.W_q = torch.randn(d_model, d_model) / d_model ** 0.5
        self.W_k = torch.randn(d_model, d_model) / d_model ** 0.5
        self.W_v = torch.randn(d_model, d_model) / d_model ** 0.5
        self.kv_cache = {'keys': [], 'values': []}

    def forward(self, x, use_cache=True):
        # x: [batch, new_tokens, d_model] -- only the *new* tokens
        Q = x @ self.W_q
        K = x @ self.W_k
        V = x @ self.W_v

        if use_cache:
            # Append the new K, V to the cache
            self.kv_cache['keys'].append(K)
            self.kv_cache['values'].append(V)

            # Attend over all cached keys/values (prompt + everything generated so far)
            K_full = torch.cat(self.kv_cache['keys'], dim=1)
            V_full = torch.cat(self.kv_cache['values'], dim=1)
        else:
            K_full, V_full = K, V

        # Attention of the new queries over the full cached context
        scores = Q @ K_full.transpose(-2, -1) / Q.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=-1)
        output = weights @ V_full
        return output
# Generation with KV cache:
prompt = "The cat sat on the"
# First token: compute full attention
output = model.forward(prompt, use_cache=True)
# Cache now contains K,V for all prompt tokens
# Subsequent tokens: only compute new K,V
for _ in range(100):  # Generate 100 tokens
    new_token = sample(output)
    output = model.forward(new_token, use_cache=True)
    # Only computes 1 new K,V, reuses cached ones

# Speed improvement:
# Without cache: O(n²) for each token
# With cache: O(n) for each token
# 10-100x faster generation

3. Batch Processing: Amortizing the Cost
Process multiple requests simultaneously to maximize GPU utilization. The key insight: loading model weights from memory dominates the cost, so amortize it across multiple examples.
The economics: For a 70B model in FP16:
- Model weights: 140 GB
- A100 memory bandwidth: 1,935 GB/s
- Time to load weights: 140 GB ÷ 1,935 GB/s ≈ 72 ms
For a single token:
- Compute: ~1 ms (matrix multiplication)
- Memory transfer: 72 ms (loading weights)
- Utilization: 1/73 = 1.4% ← Terrible!
With batch size B:
- Load weights once: 72 ms
- Compute for B examples: ≈ B × 1 ms (parallelized)
- Time per example: (72 + B) / B ms

As B → ∞, throughput approaches 1 token/ms per example. With B = 32:

(72 + 32) / 32 ≈ 3.25 ms per example

versus 73 ms without batching: a 22x improvement!
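The same latency model in a few lines of Python (assuming the 72 ms weight-load time and ~1 ms of per-example compute from above):

```python
# Simple model of per-example latency vs. batch size
WEIGHT_LOAD_MS = 72.0   # stream 140 GB of weights once per step
COMPUTE_MS = 1.0        # per-example compute per token (assumed)

for batch_size in [1, 8, 32, 128]:
    step_ms = WEIGHT_LOAD_MS + COMPUTE_MS * batch_size
    per_example_ms = step_ms / batch_size
    print(f"B={batch_size:>3}: {per_example_ms:6.2f} ms/example, "
          f"~{1000 / per_example_ms:4.0f} tok/s total")
```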
The latency-throughput tradeoff: Larger batches mean waiting for more requests to arrive, increasing latency for early requests. The sweet spot depends on your SLA:
Batch Size | Latency (p50) | Throughput  | Cost/Token
-----------|---------------|-------------|-----------
     1     |     75 ms     |   13 tok/s  |  $0.020
     8     |    150 ms     |   89 tok/s  |  $0.003
    32     |    300 ms     |  280 tok/s  |  $0.001
   128     |    800 ms     |  720 tok/s  |  $0.0004
Google/OpenAI likely use batch sizes of 100-500 to balance cost and user experience.
import time
import torch
import torch.nn.functional as F

class DynamicBatcher:
    """
    Batch requests with different lengths efficiently.
    Each request is a dict whose 'tokens' entry is a 1-D tensor of token IDs.
    """
    def __init__(self, model, max_batch_size=32, timeout_ms=100):
        self.model = model
        self.max_batch_size = max_batch_size
        self.timeout_ms = timeout_ms
        self.pending_requests = []
        self.first_request_time = None

    def time_since_first_request(self):
        if self.first_request_time is None:
            return 0.0
        return (time.monotonic() - self.first_request_time) * 1000

    def add_request(self, request):
        if not self.pending_requests:
            self.first_request_time = time.monotonic()
        self.pending_requests.append(request)

        # Process when the batch is full or the timeout has been reached
        if (len(self.pending_requests) >= self.max_batch_size or
                self.time_since_first_request() > self.timeout_ms):
            return self.process_batch()
        return None

    def process_batch(self):
        batch = self.pending_requests[:self.max_batch_size]
        self.pending_requests = self.pending_requests[self.max_batch_size:]
        self.first_request_time = None

        # Pad sequences to the same length so they can be stacked
        max_len = max(len(r['tokens']) for r in batch)
        padded_batch = [
            F.pad(r['tokens'], (0, max_len - len(r['tokens'])))
            for r in batch
        ]

        # Process all requests in a single forward pass
        outputs = self.model.forward(torch.stack(padded_batch))
        return outputs
# Throughput improvement:
# Sequential: 10 requests/sec
# Batched (32): 200 requests/sec
# 20x throughput improvement!
# But: Increases latency for individual requests
# Trade-off: throughput vs latency

4. Speculative Decoding: Gambling on Predictions
Generate multiple tokens per forward pass using a small “draft” model to propose candidates that a large model verifies in parallel.
The standard autoregressive bottleneck: each new token requires a full forward pass conditioned on everything generated so far:

token_{n+1} = LLM(token_1, ..., token_n)

For 100 tokens at 100 ms/token: 10 seconds. Can we parallelize?
Speculative decoding idea (Chen et al., 2023):
- Small draft model quickly generates candidate tokens
- Large model verifies all candidates in parallel (one forward pass)
- Accept the longest prefix of draft tokens that the large model agrees with
Mathematical guarantee: with the right accept/reject rule, the final distribution is identical to sampling from the large model alone. The speedup is "free": no quality loss.

Expected number of tokens produced per large-model forward pass, with per-token acceptance rate α and lookahead k:

E[tokens] = (1 − α^(k+1)) / (1 − α)

With k = 4, α = 0.7 (70% acceptance), and a draft model roughly 10x faster than the target:

E[tokens] = (1 − 0.7⁵) / 0.3 ≈ 2.8, i.e. roughly a 2-3x speedup

For easier tasks (high α), speedups of 5-8x are common (see the short calculation after this list). The draft model can be:
- A smaller model (7B drafting for 70B)
- The same model quantized to INT4
- Early-exit layers from the target model
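To get a feel for the formula, here is the expected number of tokens per verification pass for a few acceptance rates, using the k = 4 lookahead from above:

```python
# E[tokens per large-model pass] = (1 - alpha**(k + 1)) / (1 - alpha)
def expected_tokens(alpha, k=4):
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"acceptance {alpha:.0%}: ~{expected_tokens(alpha):.1f} tokens per pass")
# acceptance 50%: ~1.9 tokens per pass
# acceptance 70%: ~2.8 tokens per pass
# acceptance 90%: ~4.1 tokens per pass
```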
Use small model to predict, large model to verify:
def speculative_decode(large_model, small_model, prompt, k=4, max_length=512):
    """
    Use the small model to draft k tokens; the large model verifies them
    in a single parallel forward pass. Typically 2-3x speedup with no
    quality loss (given a proper accept/reject rule).

    `tokenize`, `small_model.generate`, `large_model.score_sequence`,
    `large_model.generate_one`, and `is_acceptable` are assumed interfaces;
    a real implementation compares draft and target token probabilities.
    """
    tokens = tokenize(prompt)

    while len(tokens) < max_length:
        # Small model: draft k candidate tokens (fast)
        candidates = small_model.generate(
            tokens,
            num_tokens=k,
            temperature=0  # Greedy drafting
        )

        # Large model: score all candidates in one forward pass.
        # This is much cheaper than k sequential generations!
        scores = large_model.score_sequence(tokens + candidates)

        # Accept the longest prefix the large model agrees with
        accepted = 0
        for i, candidate in enumerate(candidates):
            if is_acceptable(scores[i]):
                tokens.append(candidate)
                accepted += 1
            else:
                break

        # If a draft token was rejected, fall back to the large model
        if accepted < k:
            next_token = large_model.generate_one(tokens)
            tokens.append(next_token)

    return tokens
# Example:
# Small model (7B): "The cat sat on the mat"
# Large model (70B): Verifies first 3 tokens, rejects "mat", generates "rug"
# Result: 3 tokens in 1 large model call instead of 4
# 3x speedup in this example

5. Flash Attention
Memory-efficient attention computation:
import torch

def flash_attention_v2(Q, K, V, block_size=128):
    """
    Tiled attention in the spirit of Flash Attention v2: 2-4x faster than
    standard attention on real hardware.

    Key ideas:
      1. Tile-based computation (blocks stay in fast GPU SRAM)
      2. Online softmax: running max + normalizer, never materialize the full matrix
      3. Fused kernels and recomputation in backward (in the real CUDA version)
    """
    seq_len, d_k = Q.shape
    output = torch.zeros_like(V)

    # Process queries in blocks that fit in GPU cache (SRAM)
    for i in range(0, seq_len, block_size):
        q_block = Q[i:i + block_size]                    # [bq, d_k]
        bq = q_block.shape[0]

        # Running softmax statistics for this block of queries
        row_max = Q.new_full((bq, 1), float('-inf'))     # running max of scores
        row_sum = Q.new_zeros(bq, 1)                     # running softmax denominator
        acc = Q.new_zeros(bq, d_k)                       # unnormalized output accumulator

        for j in range(0, seq_len, block_size):
            k_block = K[j:j + block_size]
            v_block = V[j:j + block_size]

            scores = q_block @ k_block.T / d_k ** 0.5    # [bq, bk]
            block_max = scores.max(dim=-1, keepdim=True).values
            new_max = torch.maximum(row_max, block_max)

            # Rescale what we have so far to the new max, then add this block
            correction = torch.exp(row_max - new_max)
            p = torch.exp(scores - new_max)
            acc = acc * correction + p @ v_block
            row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
            row_max = new_max

        # Normalize: equivalent to softmax over the full row, computed blockwise
        output[i:i + block_size] = acc / row_sum

    return output
# Performance comparison:
# Standard attention: 100 tokens/sec
# Flash Attention v1: 250 tokens/sec
# Flash Attention v2: 400 tokens/sec
# 4x improvement!
# Used in: LLaMA 2 and (reportedly) GPT-4, Claude, and most modern LLMs

Hardware Optimizations
GPU Selection
- A100 80GB: ~$2,500/month, best for 70B models
- A100 40GB: ~$1,500/month, good for 30B models
- A10G 24GB: ~$500/month, good for 7B models
- H100 80GB: ~$4,000/month, roughly 2x A100 speed
def calculate_gpu_memory(
    num_params,
    precision='fp16',
    batch_size=1,
    seq_len=2048,
    num_layers=80
):
    """
    Rough GPU memory requirement for inference.
    (Ignores optimizations like grouped-query attention, which shrink the KV cache.)
    """
    bytes_per_param = {
        'fp32': 4,
        'fp16': 2,
        'int8': 1,
        'int4': 0.5
    }[precision]

    # Model weights
    model_memory = num_params * bytes_per_param

    # KV cache: K and V for every layer, position, and batch element
    kv_cache = (
        2 *          # K and V
        batch_size *
        seq_len *
        num_layers *
        8192 *       # d_model (LLaMA 2 70B hidden size)
        bytes_per_param
    )

    # Activations (rough estimate)
    activations = batch_size * seq_len * 8192 * 4 * bytes_per_param

    total_gb = (model_memory + kv_cache + activations) / 1e9
    return {
        'model': model_memory / 1e9,
        'kv_cache': kv_cache / 1e9,
        'activations': activations / 1e9,
        'total': total_gb
    }

# Example: LLaMA 2 70B
memory = calculate_gpu_memory(
    num_params=70e9,
    precision='fp16',
    batch_size=1,
    seq_len=4096
)
print(memory)
# {
#   'model': 140.0 GB,
#   'kv_cache': 10.7 GB,
#   'activations': 0.27 GB,
#   'total': ~151 GB
# }
# Needs 2× A100 80GB (160 GB total)

Model Parallelism
Split model across GPUs:
import torch
import torch.nn as nn

class TensorParallelLinear:
    """
    Split a weight matrix column-wise across GPUs (tensor parallelism).
    """
    def __init__(self, weight, num_gpus=2):
        # Split columns across GPUs and move each shard to its device
        shards = torch.chunk(weight, num_gpus, dim=1)
        self.weight_shards = [
            shard.to(f'cuda:{gpu_id}') for gpu_id, shard in enumerate(shards)
        ]
        self.num_gpus = num_gpus

    def forward(self, x):
        # Each GPU computes its slice of the output columns
        outputs = []
        for gpu_id, weight_shard in enumerate(self.weight_shards):
            x_gpu = x.to(f'cuda:{gpu_id}')
            outputs.append(x_gpu @ weight_shard)

        # Gather the partial outputs onto one device and concatenate
        outputs = [out.to('cuda:0') for out in outputs]
        return torch.cat(outputs, dim=-1)


class PipelineParallelModel:
    """
    Split consecutive layers across GPUs (pipeline parallelism).
    """
    def __init__(self, layers, num_gpus=4):
        layers_per_gpu = len(layers) // num_gpus
        self.gpu_stages = []
        for i in range(num_gpus):
            start = i * layers_per_gpu
            end = (i + 1) * layers_per_gpu
            stage = nn.Sequential(*layers[start:end]).to(f'cuda:{i}')
            self.gpu_stages.append(stage)

    def forward(self, x):
        # Activations flow GPU to GPU, one stage at a time
        for gpu_id, stage in enumerate(self.gpu_stages):
            x = x.to(f'cuda:{gpu_id}')
            x = stage(x)
        return x

# Enables running a 70B model on 2×40GB instead of 2×80GB (with INT8 weights)
# But: adds cross-GPU communication overhead

Deployment Strategies
Inference Servers
vLLM (Recommended)
├─ PagedAttention for KV cache
├─ Continuous batching
├─ 10-20x throughput vs naive
└─ pip install vllm
TensorRT-LLM
├─ NVIDIA optimized
├─ INT8/INT4 quantization
├─ Best for NVIDIA GPUs
└─ Requires more setup
Text Generation Inference (TGI)
├─ HuggingFace solution
├─ Good for standard models
├─ Easy Docker deployment
└─ docker pull ghcr.io/huggingface/text-generation-inference
from vllm import LLM, SamplingParams

# Initialize model with optimizations
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=2,        # Split across 2 GPUs
    dtype="half",                  # FP16
    max_num_seqs=256,              # Large batch size
    max_num_batched_tokens=8192
)

# Efficient sampling
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=512
)

# Process requests
prompts = [
    "Explain quantum computing",
    "Write a haiku about AI",
    "Translate to Spanish: Hello"
]
outputs = llm.generate(prompts, sampling_params)

# vLLM automatically:
# - Batches requests
# - Uses PagedAttention for the KV cache
# - Applies Flash Attention
# - Handles different sequence lengths
# Result: 10-20x throughput improvement

Cost Optimization Summary
Baseline (70B FP16):
├─ Hardware: 4× A100 80GB = $10K/month
├─ Throughput: 10 req/sec
└─ Cost per 1M tokens: $50
After Optimizations:
├─ Quantization (INT8): 2× A100 80GB = $5K/month
├─ vLLM + batching: 200 req/sec (20x)
├─ Flash Attention: Included in vLLM
└─ Cost per 1M tokens: $1.25
97.5% cost reduction!
Optimization Checklist:
- ✅ Use INT8 quantization (50% memory savings)
- ✅ Enable KV caching (10x generation speedup)
- ✅ Deploy with vLLM (20x throughput)
- ✅ Use Flash Attention (included in vLLM)
- ✅ Batch requests (maximize GPU utilization)
- ✅ Choose the right GPU (A100 40GB is often sufficient)