LLM Benchmarking: Latency, Throughput, TTFT, TPS
Benchmarking LLM inference is fundamentally different from benchmarking traditional web services. A single request involves two distinct compute phases (prefill and decode), latency is measured per-token, and throughput depends on dynamic batching behavior under load. This post defines the key metrics, demonstrates rigorous benchmarking methodology, and shows how to translate benchmark results into capacity planning decisions.
Key Metrics Definitions
LLM inference has its own vocabulary of performance metrics. Understanding what each measures — and what it doesn't — is critical for making correct optimization decisions.
Latency Metrics
- TTFT (Time To First Token): Time from request arrival to first generated token. Dominated by prefill latency. Users perceive this as "response start time." Target: <500ms for chat, <2s for long context.
- ITL (Inter-Token Latency): Time between consecutive generated tokens. Determines perceived "typing speed." Target: <50ms (20+ tok/s) for smooth chat experience.
- TPOT (Time Per Output Token): Average time per output token after the first, commonly computed as (E2E - TTFT) / (output_tokens - 1). Similar to ITL but averaged over the full response.
- E2E Latency: Total time from request to last token. Because TTFT already covers the first token, E2E = TTFT + (output_tokens - 1) × TPOT.
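The latency definitions above can be made concrete with a small sketch that derives each metric from client-side timestamps. The `RequestTimeline` class and `latency_metrics` function are hypothetical names introduced for illustration, and the sketch assumes TPOT is averaged over the tokens after the first one, which is one common convention rather than the only one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTimeline:
    """Client-side timestamps (seconds) for one streamed request (hypothetical helper)."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token, in order


def latency_metrics(t: RequestTimeline) -> dict:
    """Derive TTFT, ITL, TPOT and E2E from the raw timestamps."""
    if not t.token_times:
        raise ValueError("no tokens received")
    ttft = t.token_times[0] - t.sent_at
    e2e = t.token_times[-1] - t.sent_at
    # One ITL sample per gap between consecutive tokens
    itl = [b - a for a, b in zip(t.token_times, t.token_times[1:])]
    n = len(t.token_times)
    # Assumption: TPOT excludes the first token, whose cost is already in TTFT
    tpot = (e2e - ttft) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "itl": itl, "tpot": tpot, "e2e": e2e}


timeline = RequestTimeline(sent_at=0.0, token_times=[0.20, 0.25, 0.30, 0.40])
metrics = latency_metrics(timeline)
print(metrics)
```

Keeping the full ITL list, rather than only the TPOT average, makes it possible to spot stalls in the middle of a response that the average would hide.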
Throughput Metrics
- TPS (Tokens Per Second): Total tokens generated per second across all concurrent requests. The primary throughput metric.
- Request Throughput: Completed requests per second. Less useful than TPS because request sizes vary widely.
- Prefill Throughput: Input tokens processed per second during prefill. Typically 10-50× higher than decode throughput due to parallelism.
- Decode Throughput: Output tokens generated per second during decode. Memory-bandwidth bound.
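As a rough illustration of the difference between TPS and request throughput, the sketch below computes both over the wall-clock window spanned by a set of completed requests. The function name and the record fields (`start`, `end`, `output_tokens`) are assumptions made for this example, not part of any particular tool's output format.

```python
def throughput_metrics(requests):
    """Aggregate throughput over the wall-clock window covered by `requests`.

    Each request is a dict with `start` and `end` timestamps in seconds
    and an `output_tokens` count (hypothetical record layout).
    """
    if not requests:
        raise ValueError("no requests")
    window = max(r["end"] for r in requests) - min(r["start"] for r in requests)
    if window <= 0:
        raise ValueError("measurement window must be positive")
    total_output = sum(r["output_tokens"] for r in requests)
    return {
        "output_tps": total_output / window,          # system-wide tokens/s
        "requests_per_sec": len(requests) / window,   # completed requests/s
    }


# Three overlapping requests that together span a 5-second window
sample = [
    {"start": 0.0, "end": 4.0, "output_tokens": 100},
    {"start": 0.0, "end": 5.0, "output_tokens": 200},
    {"start": 1.0, "end": 5.0, "output_tokens": 200},
]
summary = throughput_metrics(sample)
print(summary)
```

Dividing by the shared wall-clock window, instead of summing per-request speeds, is what makes the figure a system-level number when requests overlap.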
Metrics Timeline
Benchmarking Methodology
A rigorous benchmark requires controlled conditions and realistic workloads. A widely used source of realistic traffic is the ShareGPT dataset: real conversation traces with natural distributions of input and output lengths.
```python
## Step 1: Prepare realistic workload from ShareGPT
import json
import random


def prepare_sharegpt_workload(dataset_path, num_requests=1000):
    """Load ShareGPT conversations and build prompt/max_tokens requests."""
    with open(dataset_path) as f:
        data = json.load(f)

    # Keep only conversations that have both a user turn and an assistant turn
    data = [c for c in data if len(c.get("conversations", [])) >= 2]

    workload = []
    # Never ask for more samples than the dataset contains
    for conv in random.sample(data, min(num_requests, len(data))):
        # Use first user turn as input, first assistant turn length as target
        user_msg = conv["conversations"][0]["value"]
        # Rough token estimate: about 1.3 tokens per whitespace-separated word
        expected_output_len = len(conv["conversations"][1]["value"].split()) * 1.3
        workload.append({
            "prompt": user_msg,
            "max_tokens": max(1, int(expected_output_len)),
        })

    # Distribution stats (typical ShareGPT):
    #   Input lengths:  mean=290, median=162, p95=1024, p99=4096
    #   Output lengths: mean=215, median=128, p95=768,  p99=2048
    return workload


## Step 2: Configure benchmark parameters
benchmark_config = {
    "target_url": "http://localhost:8000/v1/completions",
    "concurrency_levels": [1, 2, 4, 8, 16, 32, 64, 128],  # sweep concurrency
    "warmup_requests": 50,        # discard first 50 for JIT warmup
    "measurement_requests": 500,  # measure on 500 requests per level
    "request_rate": None,         # None = closed-loop (max throughput)
    # Alternative: fixed rate (open-loop) for SLA testing
    # "request_rate": 10.0,       # 10 req/s Poisson arrival
}
```
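The open-loop option mentioned in the configuration relies on Poisson arrivals, which can be generated by drawing exponentially distributed gaps between requests. The following is a minimal sketch; `poisson_arrival_times` is a hypothetical helper, and the fixed seed is only there to make runs repeatable.

```python
import random


def poisson_arrival_times(request_rate, num_requests, seed=0):
    """Return send times (seconds from start) for an open-loop Poisson load.

    Inter-arrival gaps of a Poisson process with rate `request_rate` (req/s)
    are exponentially distributed with mean 1 / request_rate.
    """
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(num_requests):
        now += rng.expovariate(request_rate)
        times.append(now)
    return times


schedule = poisson_arrival_times(request_rate=10.0, num_requests=5)
print([round(t, 3) for t in schedule])
```

A load generator would sleep until each scheduled time and send the request regardless of whether earlier ones have finished, which is what exposes queuing delay under load.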
Load Testing Tools
Purpose-built tools for LLM benchmarking handle the unique challenges of streaming responses and per-token timing:
vLLM Benchmark Suite
- Built-in: benchmark_serving.py
- Supports ShareGPT + synthetic workloads
- Measures TTFT, TPOT, ITL, E2E, throughput
- Open-loop (Poisson arrival) and closed-loop modes
- Reports p50/p95/p99 for all metrics
Custom Locust Benchmark
- Flexible: custom arrival patterns, mixed workloads
- Distributed load generation across machines
- Real-time dashboards during benchmark
- Requires custom SSE parsing for per-token metrics
- Good for production-like scenarios
```python
## vLLM benchmark_serving.py: the standard benchmark
## Run from the vLLM repository:
# python benchmarks/benchmark_serving.py \
#     --backend vllm \
#     --model meta-llama/Llama-3.1-70B-Instruct \
#     --dataset-name sharegpt \
#     --dataset-path ShareGPT_V3_unfiltered.json \
#     --num-prompts 1000 \
#     --request-rate 10 \
#     --endpoint /v1/completions

## Custom Locust benchmark with per-token timing
import time

import sseclient
from locust import HttpUser, task, between

# Assumes prepare_sharegpt_workload() from Step 1 is defined or imported here.


class LLMUser(HttpUser):
    wait_time = between(0.5, 2.0)  # think time between requests

    def on_start(self):
        self.workload = prepare_sharegpt_workload("sharegpt.json", 5000)
        self.idx = 0

    @task
    def chat_completion(self):
        req = self.workload[self.idx % len(self.workload)]
        self.idx += 1

        start = time.perf_counter()
        first_token_time = None
        token_times = []

        with self.client.post(
            "/v1/completions",
            json={
                "model": "meta-llama/Llama-3.1-70B-Instruct",
                "prompt": req["prompt"],
                "max_tokens": req["max_tokens"],
                "stream": True,
                "temperature": 0.7,
            },
            stream=True,
            catch_response=True,
            name="chat_stream",
        ) as resp:
            client = sseclient.SSEClient(resp)
            for event in client.events():
                if event.data == "[DONE]":
                    break
                now = time.perf_counter()
                if first_token_time is None:
                    first_token_time = now - start
                token_times.append(now)  # each SSE event is counted as one token

        if first_token_time is None:
            return  # nothing was streamed, so there is no latency to record

        total_tokens = len(token_times)
        e2e = token_times[-1] - start

        # Record custom metrics (Locust expects response_time in milliseconds)
        custom_metrics = [
            ("TTFT", "time_to_first_token", first_token_time),
            ("E2E", "end_to_end_latency", e2e),
        ]
        for request_type, name, seconds in custom_metrics:
            self.environment.events.request.fire(
                request_type=request_type,
                name=name,
                response_time=seconds * 1000,
                response_length=0,
                exception=None,
                context={"tokens": total_tokens},
            )
```
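The custom SSE parsing mentioned for Locust can also be done without a client library. The sketch below assumes an OpenAI-style completions stream, where each event line looks like `data: {...}` with the generated text under `choices[0].text` and the stream ends with `data: [DONE]`. The function name is hypothetical, and a chunk may carry more or less than one token, so counting chunks only approximates counting tokens.

```python
import json


def iter_completion_chunks(lines):
    """Yield generated text pieces from the lines of an SSE response body."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        text = choices[0].get("text", "") if choices else ""
        if text:
            yield text  # empty chunks are skipped so they are not timed as tokens


sample_stream = [
    'data: {"choices": [{"text": "Hello"}]}',
    "",
    'data: {"choices": [{"text": " world"}]}',
    "",
    'data: {"choices": [{"text": ""}]}',
    "data: [DONE]",
]
print(list(iter_completion_chunks(sample_stream)))
```

In a benchmark client, a timestamp would be taken each time this generator yields, giving the per-chunk arrival times that TTFT and ITL are computed from.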
Statistical Analysis
Raw benchmark numbers are meaningless without proper statistical treatment. LLM latency distributions are typically right-skewed (long tail) and multimodal (prefill-dominated vs decode-dominated requests).
```python
## Analyze benchmark results with proper statistics
import numpy as np


def analyze_benchmark(results):
    """Compute summary statistics from per-request benchmark results."""
    ttft = np.array([r["ttft_ms"] for r in results])
    tpot = np.array([r["tpot_ms"] for r in results])
    e2e = np.array([r["e2e_ms"] for r in results])

    report = {}
    for name, data in [("TTFT", ttft), ("TPOT", tpot), ("E2E", e2e)]:
        report[name] = {
            "mean": np.mean(data),
            "median": np.median(data),
            "std": np.std(data),
            "p50": np.percentile(data, 50),
            "p90": np.percentile(data, 90),
            "p95": np.percentile(data, 95),
            "p99": np.percentile(data, 99),
            "min": np.min(data),
            "max": np.max(data),
        }

    # Throughput (aggregate, not per-request). Use the earliest start and the
    # latest end so the result does not depend on the order of `results`.
    total_tokens = sum(r["output_tokens"] for r in results)
    total_time = (max(r["end_time"] for r in results)
                  - min(r["start_time"] for r in results))
    report["throughput_tps"] = total_tokens / total_time
    return report


## Example output (Llama-3.1-70B on 8×H100, 32 concurrent):
##   TTFT: p50=142ms, p90=285ms, p95=410ms, p99=1240ms
##   TPOT: p50=26ms,  p90=35ms,  p95=42ms,  p99=78ms
##   E2E:  p50=3.2s,  p90=6.8s,  p95=9.1s,  p99=15.3s
##   Throughput: 1847 tok/s (system-wide)
```
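Because tail percentiles of a skewed distribution are estimated from only a few samples, it can help to attach an uncertainty range to them. One option, not part of the analysis above, is a bootstrap confidence interval. The sketch below uses only the standard library and a nearest-rank percentile; the function names and the synthetic log-normal data are assumptions made for illustration.

```python
import math
import random


def percentile(sorted_vals, q):
    """Nearest-rank percentile (q in 0..100) of an already sorted list."""
    if not sorted_vals:
        raise ValueError("empty sample")
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[min(rank, len(sorted_vals)) - 1]


def bootstrap_percentile_ci(samples, q=99, n_boot=1000, alpha=0.05, seed=0):
    """Return (point_estimate, lower, upper) for the q-th percentile."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and recompute the percentile each time
    estimates = sorted(
        percentile(sorted(rng.choices(samples, k=n)), q) for _ in range(n_boot)
    )
    lower = percentile(estimates, 100 * alpha / 2)
    upper = percentile(estimates, 100 * (1 - alpha / 2))
    return percentile(sorted(samples), q), lower, upper


# Synthetic right-skewed TTFT samples in milliseconds (illustrative only)
data_rng = random.Random(1)
ttft_ms = [data_rng.lognormvariate(5.0, 0.6) for _ in range(500)]
point, low, high = bootstrap_percentile_ci(ttft_ms, q=99)
print(f"p99 = {point:.0f} ms, 95% CI = [{low:.0f}, {high:.0f}] ms")
```

A wide interval around p99 is a hint that the run needs more measurement requests before two configurations can be compared on tail latency.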
Capacity Planning
Translating benchmark results into production GPU requirements:
```python
## Capacity planning from benchmark data
import numpy as np


def plan_capacity(
    target_rps,           # target requests per second
    avg_input_tokens,     # average input length (not used by this simplified model)
    avg_output_tokens,    # average output length
    benchmark_tps,        # measured system TPS from benchmark
    benchmark_gpus,       # GPUs used in benchmark
    ttft_p99_target_ms,   # TTFT SLA
    benchmark_ttft_p99,   # measured TTFT p99
    headroom_factor=1.3,  # 30% headroom for traffic spikes
):
    # Required output throughput
    required_tps = target_rps * avg_output_tokens
    print(f"Required throughput: {required_tps} tok/s")

    # GPU scaling (assume linear with headroom)
    tps_per_gpu = benchmark_tps / benchmark_gpus
    gpus_for_throughput = (required_tps / tps_per_gpu) * headroom_factor
    print(f"GPUs for throughput: {gpus_for_throughput:.1f}")

    # Check TTFT constraint
    # Higher concurrency → higher TTFT. May need more GPUs to meet SLA.
    if benchmark_ttft_p99 > ttft_p99_target_ms:
        # Need more GPUs to reduce per-GPU concurrency.
        # Rough heuristic: assumes TTFT p99 falls in proportion to added GPUs.
        scale_factor = benchmark_ttft_p99 / ttft_p99_target_ms
        gpus_for_latency = gpus_for_throughput * scale_factor
        print(f"GPUs for TTFT SLA: {gpus_for_latency:.1f}")
    else:
        gpus_for_latency = gpus_for_throughput

    total_gpus = int(np.ceil(max(gpus_for_throughput, gpus_for_latency)))
    print(f"Total GPUs needed: {total_gpus}")

    # Cost estimate (H100 at $3.50/hr on-demand)
    monthly_cost = total_gpus * 3.50 * 24 * 30
    print(f"Monthly cost: ${monthly_cost:,.0f}")
    return total_gpus


## Example: 50 req/s, avg 300 input + 150 output tokens
## Benchmark: 1847 tok/s on 8 H100s, TTFT p99 = 1240ms
## Target: TTFT p99 < 500ms
plan_capacity(
    target_rps=50,
    avg_input_tokens=300,
    avg_output_tokens=150,
    benchmark_tps=1847,
    benchmark_gpus=8,
    ttft_p99_target_ms=500,
    benchmark_ttft_p99=1240,
)
## Required throughput: 7500 tok/s
## GPUs for throughput: 42.2
## GPUs for TTFT SLA: 104.7 (TTFT is the bottleneck!)
## Total GPUs needed: 105
## Monthly cost: $264,600
```
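One way to sanity-check a plan like this is Little's law: the average number of in-flight requests equals the arrival rate times the average time each request spends in the system. The sketch below is a cross-check under stated assumptions, not the method used above. It takes the p50 TTFT and TPOT from the example output as stand-ins for mean latency, and it assumes each replica is operated at the concurrency it was benchmarked at. Both function names are hypothetical.

```python
import math


def required_in_flight(target_rps, ttft_s, tpot_s, avg_output_tokens):
    """Little's law: in-flight requests = arrival rate × time in system."""
    # Approximate per-request E2E latency from TTFT and TPOT
    e2e_s = ttft_s + (avg_output_tokens - 1) * tpot_s
    return target_rps * e2e_s


def replicas_needed(in_flight, benchmark_concurrency):
    """Replicas required if each one serves `benchmark_concurrency` requests."""
    return math.ceil(in_flight / benchmark_concurrency)


# 50 req/s, 150 output tokens, TTFT 142 ms, TPOT 26 ms, benchmarked at 32 concurrent
in_flight = required_in_flight(50, 0.142, 0.026, 150)
replicas = replicas_needed(in_flight, benchmark_concurrency=32)
print(f"in-flight requests: {in_flight:.1f}, replicas: {replicas}")
```

If the GPU count implied by this estimate differs sharply from the throughput-based figure, that usually points to a mismatch between the benchmarked load level and the intended production operating point.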
Common Pitfalls
Benchmarking mistakes that lead to wrong capacity decisions:
Measurement Pitfalls
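- Not warming up: The first 50-100 requests can trigger CUDA kernel compilation, Triton JIT, and other one-time initialization (including weight loading on servers that load lazily), which inflates their latency. Always discard warmup requests.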
- Fixed prompt length: Using identical prompt lengths creates unrealistic batching. Use ShareGPT or similar for natural distribution.
- Closed-loop only: Measures max throughput but hides queuing behavior. Use open-loop for latency SLAs.
- Ignoring output length: Capping max_tokens at 100 when production needs 500 makes E2E latency look roughly 5× better, because E2E grows almost linearly with output length.
Analysis Pitfalls
- Reporting means: Mean TTFT hides tail latency. Always report p95/p99.
- Single concurrency level: Performance at a single concurrent request says little about behavior under load. Sweep 1→128+ and find the throughput-latency knee.
- Not testing at saturation: The system performs differently at 50% vs 90% capacity. Test beyond expected peak load.
- Comparing different models: Token counts vary by tokenizer. 100 tokens from Llama ≠ 100 tokens from GPT-4.
```python
## Proper benchmark sweep: load level vs throughput vs latency
import json
import subprocess

results = {}
for concurrency in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    result_file = f"results_c{concurrency}.json"
    subprocess.run(
        [
            "python", "benchmarks/benchmark_serving.py",
            "--backend", "vllm",
            "--model", "meta-llama/Llama-3.1-70B-Instruct",
            "--dataset-name", "sharegpt",
            "--dataset-path", "ShareGPT_V3_unfiltered.json",
            "--num-prompts", "500",
            # Open-loop: this sets the Poisson arrival rate (req/s), which only
            # approximates the nominal concurrency level.
            "--request-rate", str(concurrency * 2),
            # Result-file flags differ between vLLM versions; confirm with --help.
            "--save-result",
            "--result-filename", result_file,
        ],
        capture_output=True,
        check=True,  # fail loudly instead of reading a stale or missing file
    )
    with open(result_file) as f:
        results[concurrency] = json.load(f)

## Plot the throughput-latency curve
##   X-axis: system throughput (tok/s)
##   Y-axis: TTFT p99 (ms)
## Find the "knee": the point where latency starts spiking
##
## Typical results (Llama-70B, 8×H100):
##   Concurrency   TPS    TTFT_p50   TTFT_p99
##   1             35     45ms       52ms
##   4             138    48ms       68ms
##   16            520    72ms       180ms
##   32            980    142ms      520ms     ← sweet spot
##   64            1450   310ms      1800ms
##   128           1780   890ms      5200ms    ← degraded
##   256           1850   2100ms     12000ms   ← saturated
```
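Once the sweep is done, choosing an operating point can be automated instead of read off a plot. A simple rule, offered here as one possible approach, is to pick the highest-throughput level whose TTFT p99 still meets the SLA. The sketch below applies that rule to the typical results listed above; the `SweepPoint` type and `best_operating_point` function are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class SweepPoint(NamedTuple):
    concurrency: int
    tps: float
    ttft_p99_ms: float


def best_operating_point(points: List[SweepPoint],
                         ttft_p99_sla_ms: float) -> Optional[SweepPoint]:
    """Return the highest-throughput point that meets the TTFT p99 SLA."""
    feasible = [p for p in points if p.ttft_p99_ms <= ttft_p99_sla_ms]
    if not feasible:
        return None  # no tested load level satisfies the SLA
    return max(feasible, key=lambda p: p.tps)


# Values copied from the typical-results table above
sweep = [
    SweepPoint(1, 35, 52),
    SweepPoint(4, 138, 68),
    SweepPoint(16, 520, 180),
    SweepPoint(32, 980, 520),
    SweepPoint(64, 1450, 1800),
    SweepPoint(128, 1780, 5200),
    SweepPoint(256, 1850, 12000),
]

print(best_operating_point(sweep, ttft_p99_sla_ms=500))
print(best_operating_point(sweep, ttft_p99_sla_ms=600))
```

With a strict 500 ms target the rule selects concurrency 16, while relaxing the target to 600 ms admits concurrency 32. The choice of operating point is therefore sensitive to the SLA near the knee, which argues for sweeping more finely in that region.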
Key takeaway: LLM benchmarking is a discipline, not a single number. Always measure TTFT, TPOT, ITL, and E2E across a sweep of concurrency levels with realistic workloads. Report percentiles (p50/p95/p99), not just means. Use the throughput-latency curve to find the optimal operating point, then add about 30% headroom for production capacity. For interactive workloads, the TTFT SLA is often the binding constraint for capacity planning, as it is in the example above.