LLM Benchmarking: Latency, Throughput, TTFT, TPS

MLOps Series · LLM Inference & Serving

Benchmarking LLM inference is fundamentally different from benchmarking traditional web services. A single request involves two distinct compute phases (prefill and decode), latency is measured per-token, and throughput depends on dynamic batching behavior under load. This post defines the key metrics, demonstrates rigorous benchmarking methodology, and shows how to translate benchmark results into capacity planning decisions.

Key Metrics Definitions

LLM inference has its own vocabulary of performance metrics. Understanding what each measures — and what it doesn't — is critical for making correct optimization decisions.

Latency Metrics

TTFT (Time To First Token): Time from request arrival to first generated token. Dominated by prefill latency. Users perceive this as "response start time." Target: <500ms for chat, <2s for long context.
ITL (Inter-Token Latency): Time between consecutive generated tokens. Determines perceived "typing speed." Target: <50ms (20+ tok/s) for smooth chat experience.
TPOT (Time Per Output Token): Average time per output token after the first = decode time / (output_tokens − 1), where decode time = E2E − TTFT. Similar to ITL but averaged over the full response.
E2E Latency: Total time from request to last token. E2E = TTFT + (output_tokens − 1) × TPOT, since the first token is already covered by TTFT.
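
The four latency metrics above can all be derived from one list of token arrival timestamps. The following is a minimal sketch, not a reference implementation: the function name `request_metrics` and its arguments are hypothetical, and it assumes TPOT is averaged over the gaps after the first token (for long outputs this is nearly identical to dividing by the full output length).

```python
from statistics import mean

def request_metrics(start, token_times):
    """Derive per-request latency metrics (seconds) from token arrival timestamps.

    start:       time the request was sent
    token_times: arrival time of each generated token, in order
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - start                  # time to first token
    e2e = token_times[-1] - start                  # request sent -> last token
    # Inter-token latencies: gaps between consecutive tokens
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = mean(itl) if itl else 0.0               # average gap after the first token
    return {"ttft": ttft, "itl": itl, "tpot": tpot, "e2e": e2e}

m = request_metrics(start=0.0, token_times=[0.25, 0.5, 0.75, 1.0])
print(m["ttft"], m["tpot"], m["e2e"])
```

With these hypothetical timestamps, E2E (1.0 s) equals TTFT (0.25 s) plus three inter-token gaps of 0.25 s each.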

Throughput Metrics

TPS (Tokens Per Second): Total tokens generated per second across all concurrent requests. The primary throughput metric.
Request Throughput: Completed requests per second. Less useful than TPS because request sizes vary widely.
Prefill Throughput: Input tokens processed per second during prefill. Typically 10-50× higher than decode throughput due to parallelism.
Decode Throughput: Output tokens generated per second during decode. Memory-bandwidth bound.
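
To make the difference between system TPS and request throughput concrete, here is a small sketch that computes both over the same measurement window. The record fields (`start`, `end`, `output_tokens`) and the function name are assumptions chosen for illustration.

```python
def throughput_metrics(requests):
    """Aggregate throughput over the wall-clock window spanned by all requests.

    requests: list of dicts with 'start' and 'end' (seconds) and 'output_tokens'.
    """
    window = max(r["end"] for r in requests) - min(r["start"] for r in requests)
    total_tokens = sum(r["output_tokens"] for r in requests)
    return {
        "tps": total_tokens / window,           # system-wide output tokens per second
        "req_per_sec": len(requests) / window,  # completed requests per second
    }

# Two overlapping requests with very different output sizes (hypothetical values)
batch = [
    {"start": 0.0, "end": 2.0, "output_tokens": 100},
    {"start": 0.5, "end": 4.0, "output_tokens": 300},
]
print(throughput_metrics(batch))
```

System TPS is computed over the shared window, not by summing per-request rates: the two requests run at 50 and about 86 tok/s individually, but the system delivered 400 tokens in 4 seconds.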

The TTFT vs TPS trade-off: Optimizing for throughput (high TPS) typically hurts TTFT because the system batches more requests, increasing queue wait time. In production, you must choose: optimize for latency (small batches, low TTFT, lower TPS) or throughput (large batches, higher TTFT, higher TPS). SLAs should specify both: e.g., "TTFT p99 < 1s AND throughput > 2000 tok/s."

Figure: Metrics Timeline

Benchmarking Methodology

A rigorous benchmark requires controlled conditions and realistic workloads. A common reference workload is the ShareGPT dataset: real conversation traces with natural distributions of input and output lengths.

## Step 1: Prepare realistic workload from ShareGPT
import json, random

def prepare_sharegpt_workload(dataset_path, num_requests=1000):
    """Sample ShareGPT conversations into prompts with per-request output budgets."""
    with open(dataset_path) as f:
        data = json.load(f)

    # Keep only conversations that have both a prompt turn and a reply turn
    data = [c for c in data if len(c.get("conversations", [])) >= 2]

    workload = []
    # min() guards against requesting more samples than the dataset holds
    for conv in random.sample(data, min(num_requests, len(data))):
        # Use first turn as input, second turn's length as the output target
        user_msg = conv["conversations"][0]["value"]
        reply = conv["conversations"][1]["value"]
        # Rough token estimate: ~1.3 tokens per whitespace-separated word
        expected_output_len = max(1, int(len(reply.split()) * 1.3))
        workload.append({
            "prompt": user_msg,
            "max_tokens": expected_output_len,
        })

    # Distribution stats (illustrative; recompute on your copy of the dataset
    # with the serving model's tokenizer):
    # Input lengths:  mean=290, median=162, p95=1024, p99=4096
    # Output lengths: mean=215, median=128, p95=768,  p99=2048
    return workload

## Step 2: Configure benchmark parameters
benchmark_config = {
    "target_url": "http://localhost:8000/v1/completions",
    "concurrency_levels": [1, 2, 4, 8, 16, 32, 64, 128],  # sweep concurrency
    "warmup_requests": 50,         # discard first 50 for JIT warmup
    "measurement_requests": 500,    # measure on 500 requests per level
    "request_rate": None,           # None = closed-loop (max throughput)
    # Alternative: fixed rate (open-loop) for SLA testing
    # "request_rate": 10.0,  # 10 req/s Poisson arrival
}

Open-loop vs closed-loop: Closed-loop benchmarks (send next request immediately after previous completes) measure maximum throughput but hide queuing effects. Open-loop benchmarks (send requests at a fixed rate regardless of completion) reveal how the system behaves under realistic load patterns. Always use open-loop for SLA validation.
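
An open-loop run needs a send schedule that does not depend on when responses come back. A minimal sketch of a Poisson arrival schedule, using only the standard library, is shown here; the function name `poisson_arrival_times` and the fixed seed are assumptions for illustration.

```python
import random

def poisson_arrival_times(rate_per_sec, num_requests, seed=0):
    """Return send offsets (seconds from benchmark start) for an open-loop run."""
    rng = random.Random(seed)  # fixed seed so the schedule is reproducible
    t = 0.0
    arrivals = []
    for _ in range(num_requests):
        # Inter-arrival gaps of a Poisson process are exponentially distributed
        t += rng.expovariate(rate_per_sec)
        arrivals.append(t)
    return arrivals

schedule = poisson_arrival_times(rate_per_sec=10.0, num_requests=1000)
print(f"last request sent at ~{schedule[-1]:.1f}s")
```

A load generator would sleep until each offset and fire the request without waiting for earlier ones to finish, so queuing delay shows up in the measured TTFT instead of being hidden.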

Load Testing Tools

Purpose-built tools for LLM benchmarking handle the unique challenges of streaming responses and per-token timing:

vLLM Benchmark Suite

Built-in: benchmark_serving.py
Supports ShareGPT + synthetic workloads
Measures TTFT, TPOT, ITL, E2E, throughput
Open-loop (Poisson arrival) and closed-loop modes
Reports p50/p95/p99 for all metrics

Custom Locust Benchmark

Flexible: custom arrival patterns, mixed workloads
Distributed load generation across machines
Real-time dashboards during benchmark
Requires custom SSE parsing for per-token metrics
Good for production-like scenarios

## vLLM benchmark_serving.py — the standard benchmark
## Run from the vLLM repository:
# python benchmarks/benchmark_serving.py \
#   --backend vllm \
#   --model meta-llama/Llama-3.1-70B-Instruct \
#   --dataset-name sharegpt \
#   --dataset-path ShareGPT_V3_unfiltered.json \
#   --num-prompts 1000 \
#   --request-rate 10 \
#   --endpoint /v1/completions

## Custom Locust benchmark with per-token timing
import time, json, sseclient
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    wait_time = between(0.5, 2.0)  # think time between requests

    def on_start(self):
        self.workload = prepare_sharegpt_workload("sharegpt.json", 5000)
        self.idx = 0

    @task
    def chat_completion(self):
        req = self.workload[self.idx % len(self.workload)]
        self.idx += 1

        start = time.perf_counter()
        first_token_time = None
        token_times = []
        total_tokens = 0

        with self.client.post(
            "/v1/completions",
            json={
                "model": "meta-llama/Llama-3.1-70B-Instruct",
                "prompt": req["prompt"],
                "max_tokens": req["max_tokens"],
                "stream": True,
                "temperature": 0.7,
            },
            stream=True,
            catch_response=True,
            name="chat_stream"
        ) as resp:
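            client = sseclient.SSEClient(resp)  # from the sseclient-py package
            for event in client.events():
                if event.data == "[DONE]":
                    break
                now = time.perf_counter()
                if first_token_time is None:
                    first_token_time = now - start
                token_times.append(now)
                total_tokens += 1  # counts SSE chunks; a chunk may carry more than one token

        if first_token_time is None:
            return  # nothing was streamed (error or empty response); no metrics to record

        # Record custom metrics
        self.environment.events.request.fire(
            request_type="TTFT", name="time_to_first_token",
            response_time=first_token_time * 1000,
            response_length=0, exception=None,
            context={"tokens": total_tokens}
        )
        if total_tokens > 1:
            # TPOT: mean gap between chunks after the first, in ms
            tpot_ms = (token_times[-1] - token_times[0]) / (total_tokens - 1) * 1000
            self.environment.events.request.fire(
                request_type="TPOT", name="time_per_output_token",
                response_time=tpot_ms,
                response_length=0, exception=None,
                context={"tokens": total_tokens}
            )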

Statistical Analysis

Raw benchmark numbers are meaningless without proper statistical treatment. LLM latency distributions are typically right-skewed (long tail) and multimodal (prefill-dominated vs decode-dominated requests).

## Analyze benchmark results with proper statistics
import numpy as np

def analyze_benchmark(results):
    """Compute comprehensive statistics from benchmark results."""
    ttft = np.array([r["ttft_ms"] for r in results])
    tpot = np.array([r["tpot_ms"] for r in results])
    e2e  = np.array([r["e2e_ms"] for r in results])

    report = {}
    for name, data in [("TTFT", ttft), ("TPOT", tpot), ("E2E", e2e)]:
        report[name] = {
            "mean":   np.mean(data),
            "median": np.median(data),
            "std":    np.std(data),
            "p50":    np.percentile(data, 50),
            "p90":    np.percentile(data, 90),
            "p95":    np.percentile(data, 95),
            "p99":    np.percentile(data, 99),
            "min":    np.min(data),
            "max":    np.max(data),
        }

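    # Throughput (aggregate, not per-request): total output tokens over the
    # wall-clock window from the earliest start to the latest finish.
    # The results list may not be ordered by time, so don't rely on position.
    total_tokens = sum(r["output_tokens"] for r in results)
    total_time = max(r["end_time"] for r in results) - min(r["start_time"] for r in results)
    report["throughput_tps"] = total_tokens / total_time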

    return report

## Example output (illustrative; Llama-3.1-70B on 8×H100, 32 concurrent):
## TTFT:  p50=142ms,  p90=285ms,  p95=410ms,  p99=1240ms
## TPOT:  p50=26ms,   p90=35ms,   p95=42ms,   p99=78ms
## E2E:   p50=3.2s,   p90=6.8s,   p95=9.1s,   p99=15.3s
## Throughput: 1847 tok/s (system-wide)

Always report percentiles, not averages. The mean TTFT might be 200ms, but if p99 is 5 seconds, 1% of requests wait at least 25× longer than the mean. For SLAs, define requirements at p95 or p99. The gap between p50 and p99 shows how well the system handles tail latency, which is often caused by long-prompt requests, GC pauses, or memory pressure from KV cache eviction.
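
A small standard-library sketch makes the point with numbers. The sample values are hypothetical (98 fast requests and 2 slow outliers), chosen only to show how a mean can sit far from both the typical and the worst case.

```python
from statistics import mean, quantiles

# Hypothetical TTFT samples in ms: 98 fast requests and 2 slow outliers
ttft_ms = [150.0] * 98 + [5000.0] * 2

# quantiles(n=100) returns 99 cut points; cuts[k-1] is the k-th percentile
cuts = quantiles(ttft_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean(ttft_ms):.0f}ms p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

Here the mean is 247 ms, which matches no real request: typical requests take 150 ms and the tail takes 5000 ms. Only the p99 exposes the tail.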

Capacity Planning

Translating benchmark results into production GPU requirements:

## Capacity planning from benchmark data
import numpy as np
def plan_capacity(
    target_rps,          # target requests per second
    avg_input_tokens,    # average input length
    avg_output_tokens,   # average output length
    benchmark_tps,       # measured system TPS from benchmark
    benchmark_gpus,      # GPUs used in benchmark
    ttft_p99_target_ms,  # TTFT SLA
    benchmark_ttft_p99,  # measured TTFT p99
    headroom_factor=1.3  # 30% headroom for traffic spikes
):
    # Required output throughput
    required_tps = target_rps * avg_output_tokens
    print(f"Required throughput: {required_tps} tok/s")

    # GPU scaling (assume linear with headroom)
    tps_per_gpu = benchmark_tps / benchmark_gpus
    gpus_for_throughput = (required_tps / tps_per_gpu) * headroom_factor
    print(f"GPUs for throughput: {gpus_for_throughput:.1f}")

    # Check TTFT constraint
    # Higher concurrency → higher TTFT. May need more GPUs to meet SLA.
    if benchmark_ttft_p99 > ttft_p99_target_ms:
        # Heuristic: assume TTFT p99 falls in proportion to per-GPU load, so
        # scale the fleet by measured/target. Re-benchmark to confirm.
        scale_factor = benchmark_ttft_p99 / ttft_p99_target_ms
        gpus_for_latency = gpus_for_throughput * scale_factor
        print(f"GPUs for TTFT SLA: {gpus_for_latency:.1f}")
    else:
        gpus_for_latency = gpus_for_throughput

    total_gpus = max(gpus_for_throughput, gpus_for_latency)
    print(f"Total GPUs needed: {int(np.ceil(total_gpus))}")

    # Cost estimate (H100 at $3.50/hr on-demand)
    monthly_cost = int(np.ceil(total_gpus)) * 3.50 * 24 * 30
    print(f"Monthly cost: ${monthly_cost:,.0f}")
    return int(np.ceil(total_gpus))

## Example: 50 req/s, avg 300 input + 150 output tokens
## Benchmark: 1847 tok/s on 8 H100s, TTFT p99 = 1240ms
## Target: TTFT p99 < 500ms

plan_capacity(
    target_rps=50, avg_input_tokens=300, avg_output_tokens=150,
    benchmark_tps=1847, benchmark_gpus=8,
    ttft_p99_target_ms=500, benchmark_ttft_p99=1240
)
## Required throughput: 7500 tok/s
## GPUs for throughput: 42.2
## GPUs for TTFT SLA: 104.7 (TTFT is the bottleneck!)
## Total GPUs needed: 105
## Monthly cost: $264,600

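TTFT is often the binding constraint. In the example above, throughput alone requires 43 GPUs, but meeting the TTFT SLA requires 105. The estimate scales GPU count linearly with the TTFT gap, which is only a first approximation: TTFT degrades super-linearly with concurrency near saturation, so confirm the real requirement by re-benchmarking at the lower per-GPU load. Solutions: use chunked prefill, reduce batch size per GPU (add more GPUs), or relax the TTFT SLA to p99 < 1s.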
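
One practical refinement: a model served across several GPUs scales in whole replicas, not single GPUs. The sketch below rounds the estimate up to full replicas. It assumes each replica occupies the same 8 GPUs as the benchmark deployment, which the example does not state explicitly, and the helper name `round_to_replicas` is hypothetical.

```python
import math

def round_to_replicas(total_gpus, gpus_per_replica):
    """Round a fractional GPU estimate up to whole model replicas."""
    replicas = math.ceil(total_gpus / gpus_per_replica)
    return replicas, replicas * gpus_per_replica

# 104.7 GPUs from the TTFT-constrained estimate, 8 GPUs per replica (assumed)
replicas, gpus = round_to_replicas(total_gpus=104.7, gpus_per_replica=8)
print(f"{replicas} replicas, {gpus} GPUs")
```

Under that assumption the plan becomes 14 replicas (112 GPUs), or $282,240 per month at the same $3.50/hr rate.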

Common Pitfalls

Benchmarking mistakes that lead to wrong capacity decisions:

Measurement Pitfalls

Not warming up: The first 50-100 requests can trigger one-time costs such as CUDA kernel compilation, Triton JIT, and (on servers that load lazily) weight loading. Always discard warmup.
Fixed prompt length: Using identical prompt lengths creates unrealistic batching. Use ShareGPT or similar for natural distribution.
Closed-loop only: Measures max throughput but hides queuing behavior. Use open-loop for latency SLAs.
Ignoring output length: Capping max_tokens at 100 when production needs 500 makes E2E latency look roughly 5× better, because decode time scales with output length.

Analysis Pitfalls

Reporting means: Mean TTFT hides tail latency. Always report p95/p99.
Single concurrency level: Performance at a concurrency of 1 says little about behavior under load. Sweep 1→128+ and find the throughput-latency knee.
Not testing at saturation: The system performs differently at 50% vs 90% capacity. Test beyond expected peak load.
Comparing different models: Token counts vary by tokenizer. 100 tokens from Llama ≠ 100 tokens from GPT-4.

## Proper benchmark sweep: load level vs throughput vs latency
## Flag names follow recent vLLM versions; confirm with --help for yours
import subprocess, json

results = {}
for concurrency in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    result_file = f"results_c{concurrency}.json"
    subprocess.run([
        "python", "benchmarks/benchmark_serving.py",
        "--backend", "vllm",
        "--model", "meta-llama/Llama-3.1-70B-Instruct",
        "--dataset-name", "sharegpt",
        "--dataset-path", "ShareGPT_V3_unfiltered.json",
        "--num-prompts", "500",
        # open-loop: the request rate (req/s) acts as a proxy for the load level
        "--request-rate", str(concurrency * 2),
        "--save-result", "--result-filename", result_file,
    ], capture_output=True, check=True)  # check=True: fail loudly on a bad run

    with open(result_file) as f:
        results[concurrency] = json.load(f)

## Plot the throughput-latency curve
## X-axis: system throughput (tok/s)
## Y-axis: TTFT p99 (ms)
## Find the "knee" — where latency starts spiking
##
## Illustrative results (Llama-70B, 8×H100):
## Concurrency  TPS    TTFT_p50  TTFT_p99
## 1            35     45ms      52ms
## 4            138    48ms      68ms
## 16           520    72ms      180ms
## 32           980    142ms     520ms    ← sweet spot
## 64           1450   310ms     1800ms
## 128          1780   890ms     5200ms   ← degraded
## 256          1850   2100ms    12000ms  ← saturated

The throughput-latency knee is your operating point. In the sweep above, concurrency 32 gives 980 tok/s with TTFT p99 of 520ms, a good balance. Going to 64 raises throughput by about 48% to 1450 tok/s, but TTFT p99 more than triples to 1.8s. The right operating point depends on your SLA: if TTFT p99 must be <1s, the limit lies somewhere between 32 and 64 concurrent requests (roughly 40-50 by interpolation; measure to confirm). Beyond that, add more GPUs.
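
Picking the operating point from a sweep can be automated. The sketch below selects the highest-throughput measured level whose TTFT p99 still meets the SLA, using the rows from the sweep table above. The function name `best_operating_point` is hypothetical, and the sketch only considers measured levels; it does not interpolate between them.

```python
# (concurrency, throughput tok/s, TTFT p99 ms) copied from the sweep table above
SWEEP = [
    (1, 35, 52), (4, 138, 68), (16, 520, 180), (32, 980, 520),
    (64, 1450, 1800), (128, 1780, 5200), (256, 1850, 12000),
]

def best_operating_point(sweep, ttft_p99_sla_ms):
    """Return the highest-throughput measured level whose TTFT p99 meets the SLA."""
    ok = [row for row in sweep if row[2] <= ttft_p99_sla_ms]
    if not ok:
        return None  # no measured level satisfies the SLA
    return max(ok, key=lambda row: row[1])

print(best_operating_point(SWEEP, ttft_p99_sla_ms=1000))
```

For a 1 s SLA this selects concurrency 32; finer-grained levels between 32 and 64 would have to be benchmarked before pushing higher.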

Key takeaway: LLM benchmarking is a discipline, not a single number. Always measure TTFT, TPOT, ITL, and E2E across a sweep of concurrency levels with realistic workloads. Report percentiles (p50/p95/p99), not means. Use the throughput-latency curve to find the optimal operating point, then add 30% headroom for production capacity. The TTFT SLA is often the binding constraint for capacity planning.