vLLM vs TGI vs TensorRT-LLM vs Triton
Framework Overview
Choosing an LLM serving framework is one of the highest-leverage infrastructure decisions for AI teams. The wrong choice can mean 2–5× higher GPU costs or unacceptable latency. This guide provides a data-driven comparison based on real benchmarks and production experience.
vLLM
Origin: UC Berkeley (Sky Lab)
Language: Python + CUDA
Philosophy: Memory-efficient serving via PagedAttention. Broadest model support, OpenAI-compatible API, easiest to get started.
Best for: General-purpose LLM serving, rapid prototyping, teams prioritizing flexibility.
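To make the PagedAttention idea concrete, here is a toy sketch of block-based KV-cache bookkeeping. It is not vLLM's implementation: the class name `PagedKVAllocator`, its methods, and the block size of 16 tokens are assumptions chosen for illustration. The point is that memory is reserved one fixed-size block at a time, so at most one partially filled block per sequence is wasted.

```python
# Toy illustration of paged KV-cache bookkeeping (not vLLM's actual code).
# Hypothetical names: PagedKVAllocator, BLOCK_SIZE.
BLOCK_SIZE = 16  # tokens per KV block (assumed for this sketch)


class PagedKVAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # current block is full, or no block yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=8)
for _ in range(40):  # a 40-token sequence needs ceil(40 / 16) = 3 blocks
    allocator.append_token("seq-a")
print(len(allocator.block_tables["seq-a"]))  # 3
allocator.free("seq-a")
print(len(allocator.free_blocks))            # 8
```

Because blocks need not be contiguous, a finished sequence's blocks can be reused immediately by any other request, which is what lets the server pack more concurrent sequences into the same GPU memory.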
TGI
Origin: HuggingFace
Language: Rust (router) + Python (model)
Philosophy: Production-ready server with Rust performance for routing/streaming. Native HuggingFace Hub integration.
Best for: HuggingFace ecosystem users, teams needing watermarking, built-in token streaming.
TensorRT-LLM
Origin: NVIDIA
Language: C++ runtime + Python build
Philosophy: Ahead-of-time compilation to GPU-specific optimized engines. Maximum performance at the cost of flexibility.
Best for: Large-scale production on NVIDIA GPUs, FP8 on H100, maximum throughput/$.
Triton Inference Server
Origin: NVIDIA
Language: C++ core
Philosophy: Model-agnostic serving platform. Wraps any backend (TRT-LLM, PyTorch, ONNX). Enterprise-grade orchestration.
Best for: Multi-model deployments, ensemble pipelines, teams already on NVIDIA's ML stack.
Feature Matrix
A comprehensive comparison of capabilities across all four frameworks:
| Feature | vLLM | TGI | TensorRT-LLM | Triton + TRT-LLM |
|---|---|---|---|---|
| Continuous Batching | ✓ Yes | ✓ Yes | ✓ In-flight | ✓ Yes |
| PagedAttention | ✓ Core feature | ✓ Yes | ✓ Yes | ✓ Via TRT-LLM |
| Flash Attention 2 | ✓ Yes | ✓ Yes | ~ Custom fused kernels | ~ Via TRT-LLM |
| Token Streaming | ✓ SSE | ✓ SSE (Rust) | ~ Callback-based | ✓ gRPC stream |
| OpenAI-Compatible API | ✓ Native | ~ Messages API | ✗ Custom API | ✗ Triton API |
| Tensor Parallelism | ✓ NCCL | ✓ NCCL | ✓ NCCL | ✓ Via TRT-LLM |
| Pipeline Parallelism | ✓ Yes | ✗ No | ✓ Yes | ✓ Via TRT-LLM |
| FP8 Quantization | ✓ Yes | ✗ No | ✓ Native H100 | ✓ Via TRT-LLM |
| INT4 (AWQ/GPTQ) | ✓ Yes | ✓ Yes | ✓ Yes | ✓ Via TRT-LLM |
| Speculative Decoding | ✓ Yes | ✓ Yes | ✓ Yes | ✓ Via TRT-LLM |
| Multi-LoRA | ✓ Hot-swap | ✓ Yes | ~ Limited | ~ Limited |
| Prefix Caching | ✓ APC | ✗ No | ~ Manual | ~ Manual |
| AMD GPU Support | ✓ ROCm | ~ Experimental | ✗ NVIDIA only | ✗ NVIDIA only |
| Model Count | 50+ architectures | 30+ architectures | 20+ architectures | Via backend |
| Structured Output | ✓ Grammar | ✓ Grammar | ✗ No | ✗ No |
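A matrix like this is easier to apply when it is encoded as data and filtered by hard requirements. The sketch below copies a subset of the rows above; the function name `frameworks_supporting` and the `yes` / `partial` / `no` encoding of ✓ / ~ / ✗ are assumptions made for illustration.

```python
FRAMEWORKS = ["vLLM", "TGI", "TensorRT-LLM", "Triton + TRT-LLM"]

# Support levels per framework, in FRAMEWORKS order (subset of the matrix above).
FEATURES = {
    "OpenAI-Compatible API": ["yes", "partial", "no", "no"],
    "Pipeline Parallelism":  ["yes", "no", "yes", "yes"],
    "FP8 Quantization":      ["yes", "no", "yes", "yes"],
    "Prefix Caching":        ["yes", "no", "partial", "partial"],
    "AMD GPU Support":       ["yes", "partial", "no", "no"],
    "Structured Output":     ["yes", "yes", "no", "no"],
}


def frameworks_supporting(required, allow_partial=False):
    """Return the frameworks that support every feature in `required`."""
    accepted = {"yes", "partial"} if allow_partial else {"yes"}
    return [
        name
        for i, name in enumerate(FRAMEWORKS)
        if all(FEATURES[feature][i] in accepted for feature in required)
    ]


print(frameworks_supporting(["FP8 Quantization", "Pipeline Parallelism"]))
print(frameworks_supporting(["FP8 Quantization", "Structured Output"]))
```

Treating "partial" as a separate level keeps the filter honest: a requirement that is only experimentally or manually supported should be an explicit opt-in, not a silent pass.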
Throughput Benchmarks
All benchmarks below use the ShareGPT dataset (variable input/output lengths, mean ~250 input / ~250 output tokens) with 64 concurrent requests unless otherwise noted. Hardware: 4×A100-80GB with NVLink, CUDA 12.2.
Llama-2-7B (1×A100-80GB, FP16)
- vLLM: 2,850 tok/s (24.1 req/s)
- TGI: 2,420 tok/s (20.3 req/s)
- TRT-LLM: 3,380 tok/s (28.5 req/s)
- TRT-LLM + Triton: 3,310 tok/s
- TRT-LLM is ~1.2× faster than vLLM
Llama-2-70B (4×A100-80GB, TP=4, FP16)
- vLLM: 1,280 tok/s (10.8 req/s)
- TGI: 980 tok/s (8.2 req/s)
- TRT-LLM: 1,920 tok/s (16.1 req/s)
- TRT-LLM FP8: 2,640 tok/s (22.2 req/s)
- TRT-LLM FP8 is ~2× faster than vLLM FP16
Mixtral-8x7B (2×A100-80GB, TP=2)
- vLLM: 1,650 tok/s
- TGI: 1,380 tok/s
- TRT-LLM: 2,180 tok/s
- MoE routing overhead varies by framework
- vLLM's expert-parallel is still maturing
High-Concurrency (Llama-2-7B, 256 concurrent requests)
- vLLM: 4,200 tok/s (scales well)
- TGI: 3,100 tok/s (queue pressure)
- TRT-LLM: 5,100 tok/s
- At 256 concurrent requests, vLLM's PagedAttention and TRT-LLM's C++ runtime both shine
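Throughput only becomes a cost figure once a GPU price is attached. The sketch below converts the Llama-2-70B (TP=4, FP16) numbers above into dollars per million tokens. The hourly price is a hypothetical placeholder, not a quoted rate, and the function name `cost_per_million_tokens` is an assumption for illustration.

```python
def cost_per_million_tokens(tokens_per_second: float, gpu_count: int,
                            gpu_hourly_usd: float) -> float:
    """Cost in USD to produce one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_count * gpu_hourly_usd / tokens_per_hour * 1_000_000


GPU_HOURLY_USD = 2.00  # hypothetical price per A100-hour; substitute your own

# Throughput (tok/s) from the Llama-2-70B, 4×A100, FP16 results above.
LLAMA2_70B_TP4 = {"vLLM": 1280, "TGI": 980, "TRT-LLM": 1920}

for name, tps in LLAMA2_70B_TP4.items():
    cost = cost_per_million_tokens(tps, gpu_count=4, gpu_hourly_usd=GPU_HOURLY_USD)
    print(f"{name}: ${cost:.2f} per 1M tokens")
```

Since cost scales inversely with throughput, the relative ranking does not depend on the price you plug in; only the absolute dollar figures do.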
Latency Comparison
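Latency has two critical components: TTFT (Time to First Token, how quickly the first token arrives) and TPS (Tokens Per Second, the decode speed after the first token). They respond to different optimizations: TTFT is driven mainly by prefill and queueing, while TPS is driven by per-token decode speed.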
| Metric | vLLM | TGI | TRT-LLM |
|---|---|---|---|
| TTFT (512-token prompt, 7B) | ~45 ms | ~50 ms | ~28 ms |
| TTFT (2048-token prompt, 7B) | ~120 ms | ~135 ms | ~75 ms |
| TTFT (512-token prompt, 70B, TP=4) | ~180 ms | ~210 ms | ~110 ms |
| TPS per user (7B, batch=1) | ~62 tok/s | ~58 tok/s | ~78 tok/s |
| TPS per user (70B TP=4, batch=1) | ~28 tok/s | ~25 tok/s | ~38 tok/s |
| P99 TTFT under load (7B, 64 reqs) | ~280 ms | ~350 ms | ~160 ms |
TensorRT-LLM's TTFT advantage comes from fused prefill kernels and CUDA graph execution: the engine skips Python overhead entirely during inference. Its lead over vLLM narrows at high batch sizes, where GPU compute (not framework overhead) dominates.
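The two metrics combine into a rough end-to-end estimate: the first token arrives after TTFT, and each remaining token arrives at the decode rate. The sketch below applies that to the 7B rows of the table (512-token prompt, batch=1). The 256-token response length and the names `estimated_latency_s` and `TABLE_7B` are assumptions for illustration, and the model ignores queueing and batching effects.

```python
def estimated_latency_s(ttft_ms: float, tps: float, output_tokens: int) -> float:
    """Approximate response time: TTFT plus decode time for the remaining tokens."""
    return ttft_ms / 1000 + (output_tokens - 1) / tps


# (TTFT in ms, TPS per user) for a 7B model, taken from the table above.
TABLE_7B = {"vLLM": (45, 62), "TGI": (50, 58), "TRT-LLM": (28, 78)}

OUTPUT_TOKENS = 256  # assumed response length

for name, (ttft_ms, tps) in TABLE_7B.items():
    total = estimated_latency_s(ttft_ms, tps, OUTPUT_TOKENS)
    print(f"{name}: ~{total:.2f} s for {OUTPUT_TOKENS} tokens")
```

For responses of this length the decode term dominates, so differences in TPS matter far more to total response time than differences in TTFT; TTFT matters most for perceived responsiveness in streaming UIs.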
```python
# Benchmarking script for a fair comparison (OpenAI-compatible streaming endpoint)
import asyncio
import time

import aiohttp


async def benchmark(url, prompts, concurrency=64):
    semaphore = asyncio.Semaphore(concurrency)
    results = []

    async def send_request(session, prompt):
        async with semaphore:
            t0 = time.perf_counter()
            first_token_time = None
            chunks = 0
            payload = {
                "model": "llama",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 256,
                "stream": True,
            }
            async with session.post(url, json=payload) as resp:
                async for raw in resp.content:
                    line = raw.decode("utf-8").strip()
                    # Count only SSE data events; skip blank separators and the end marker
                    if not line.startswith("data:") or line.endswith("[DONE]"):
                        continue
                    if first_token_time is None:
                        first_token_time = time.perf_counter() - t0
                    chunks += 1
            total = time.perf_counter() - t0
            results.append({
                "ttft": first_token_time,
                "total": total,
                # End-to-end rate; streamed chunks approximate tokens
                "tps": chunks / total,
            })

    # Share one session so connection setup does not inflate TTFT
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[send_request(session, p) for p in prompts])
    return results
```
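The raw per-request results still need to be reduced to the figures reported in the table, such as P99 TTFT. A minimal sketch of that aggregation follows, assuming each result is a dict with `ttft`, `total`, and `tps` keys as produced by the script above. The helper names `percentile` and `summarize` are hypothetical, and nearest-rank is only one of several percentile definitions.

```python
import math
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a pre-sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(results):
    """Reduce per-request measurements to mean and tail-latency figures."""
    # Requests that never produced a token have ttft=None and are excluded.
    ttfts = sorted(r["ttft"] for r in results if r["ttft"] is not None)
    return {
        "requests": len(results),
        "mean_ttft": statistics.mean(ttfts),
        "p99_ttft": percentile(ttfts, 99),
        "mean_tps": statistics.mean(r["tps"] for r in results),
    }


# Example with made-up measurements (seconds and tokens/second).
sample = [
    {"ttft": 0.10, "total": 1.0, "tps": 10.0},
    {"ttft": 0.30, "total": 2.0, "tps": 20.0},
]
print(summarize(sample))
```

With only 64 requests per run, P99 is effectively the slowest request, so several runs are needed before tail-latency numbers are stable enough to compare.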
Deployment Complexity
Time-to-first-inference matters for developer productivity and iteration speed. Here's a realistic assessment:
vLLM — Easiest
Time to first request: ~5 minutes
- pip install vllm
- Single command to start server
- OpenAI-compatible API (zero client changes)
- No compilation step
- Docker image: ~8 GB
```bash
# Literally 2 commands
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct
```
TGI — Easy
Time to first request: ~10 minutes
- Docker-first deployment
- Auto-downloads model from Hub
- Custom generate API, plus a partially OpenAI-compatible Messages API for chat
- No compilation step
- Docker image: ~10 GB
```bash
# Docker one-liner
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Llama-3.1-8B-Instruct
```
TensorRT-LLM — Hard
Time to first request: 1–3 hours
- Convert checkpoint (10–30 min)
- Build engine (30–120 min)
- GPU-specific (rebuild per GPU type)
- Configuration-specific (rebuild per batch/seq)
- Docker image: ~15 GB
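Because engines are specific to both the GPU type and the batch/sequence configuration, the number of builds grows multiplicatively. The sketch below enumerates that build matrix and applies the 30–120 minute per-engine range quoted above. The GPU list, the configurations, and the function name `build_plan` are hypothetical, and checkpoint conversion time is not included.

```python
from itertools import product


def build_plan(gpu_types, configs, minutes_per_build=(30, 120)):
    """List every engine to build and the total build-time range in minutes."""
    engines = list(product(gpu_types, configs))  # one engine per (GPU, config) pair
    low = len(engines) * minutes_per_build[0]
    high = len(engines) * minutes_per_build[1]
    return engines, (low, high)


# Hypothetical fleet: two GPU types, two (max batch size, max sequence length) configs.
gpu_types = ["A100", "H100"]
configs = [(64, 4096), (256, 2048)]

engines, (low, high) = build_plan(gpu_types, configs)
for gpu, (max_batch, max_seq) in engines:
    print(f"build engine: gpu={gpu} max_batch={max_batch} max_seq={max_seq}")
print(f"{len(engines)} engines, roughly {low}-{high} minutes if built sequentially")
```

Even this small fleet implies hours of build time per model update, which is why teams using TensorRT-LLM typically cache built engines as deployment artifacts.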
Triton + TRT-LLM — Hardest
Time to first request: 2–5 hours
- All of the TRT-LLM build steps, plus:
- Model repository structure setup
- Pre/post-processing models
- config.pbtxt configuration
- Ensemble pipeline wiring
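The setup steps above amount to creating a model repository with one directory per pipeline stage. The sketch below scaffolds such a layout in a temporary directory. The component names follow the layout commonly used with the TRT-LLM backend but should be treated as assumptions, the function name `create_model_repository` is hypothetical, and the `config.pbtxt` files written here are placeholders, not working configurations.

```python
import tempfile
from pathlib import Path

# Assumed pipeline stages: tokenize, run the engine, detokenize, and an
# ensemble that wires the three together.
COMPONENTS = ["preprocessing", "tensorrt_llm", "postprocessing", "ensemble"]


def create_model_repository(root: Path) -> list[Path]:
    """Create <root>/<model>/1/ and a placeholder config.pbtxt for each stage."""
    created = []
    for name in COMPONENTS:
        version_dir = root / name / "1"  # Triton expects numbered version directories
        version_dir.mkdir(parents=True, exist_ok=True)
        config = root / name / "config.pbtxt"
        config.write_text(f'name: "{name}"\n')  # placeholder; real configs need I/O specs
        created.append(config)
    return created


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    create_model_repository(root)
    for path in sorted(root.rglob("*")):
        print(path.relative_to(root))
```

Each real `config.pbtxt` must additionally declare the backend and the input and output tensors for its stage, which is where most of the extra setup time goes.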
Quantization & Hardware Support
Quantization support varies significantly and can be the deciding factor for cost-sensitive deployments:
| Quantization | vLLM | TGI | TRT-LLM |
|---|---|---|---|
| FP16 / BF16 | ✓ | ✓ | ✓ |
| FP8 (H100) | ✓ | ✗ | ✓ Native |
| INT8 (SmoothQuant) | ~ Via SQ | ✓ EETQ | ✓ W8A8 |
| INT4-GPTQ | ✓ | ✓ | ✓ |
| INT4-AWQ | ✓ | ✓ | ✓ |
| GGUF (llama.cpp) | ~ Experimental | ✗ | ✗ |
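The cost impact of these formats follows from simple arithmetic: weight memory is the parameter count times the bits per weight. The sketch below estimates weights only; it ignores the KV cache, activations, and the small overhead of quantization scales, and the function name `weight_memory_gb` is an assumption for illustration.

```python
BITS_PER_WEIGHT = {"FP16": 16, "BF16": 16, "FP8": 8, "INT8": 8, "INT4": 4}


def weight_memory_gb(num_params_billion: float, fmt: str) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params_billion * 1e9 * BITS_PER_WEIGHT[fmt] / 8
    return total_bytes / 1e9


for fmt in ("FP16", "FP8", "INT4"):
    print(f"70B in {fmt}: ~{weight_memory_gb(70, fmt):.0f} GB of weights")
```

By this estimate a 70B model needs about 140 GB in FP16, which exceeds a single 80 GB GPU, while the 8-bit and 4-bit formats bring the weights down to about 70 GB and 35 GB respectively.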
Hardware support:
vLLM Hardware
- NVIDIA: A100, H100, L40S, A10G, T4, RTX
- AMD: MI250, MI300X (ROCm)
- AWS Neuron: Inferentia2 (experimental)
- TPU: Community support (limited)
- Broadest hardware support
TRT-LLM Hardware
- NVIDIA only: A100, H100, H200, L40S
- FP8: H100/H200 only
- No AMD/Intel/TPU support
- GPU-specific engines (compile per arch)
- Narrowest but deepest optimization
Decision Flowchart
Use this flowchart to pick the right framework for your use case:
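One way to express such a decision in code is sketched below. It is derived only from the "Best for" notes and the hardware tables in this guide; the order of the questions, the parameter names, and the function name `pick_framework` are assumptions, not a transcription of the flowchart's exact branches.

```python
def pick_framework(nvidia_only: bool, multi_model: bool,
                   max_throughput: bool, hf_ecosystem: bool) -> str:
    """Suggest a serving framework from four yes/no questions."""
    if not nvidia_only:
        return "vLLM"              # the only option here with ROCm support marked ✓
    if multi_model:
        return "Triton + TRT-LLM"  # multi-model deployments and ensemble pipelines
    if max_throughput:
        return "TensorRT-LLM"      # maximum performance at the cost of flexibility
    if hf_ecosystem:
        return "TGI"               # native Hugging Face Hub integration
    return "vLLM"                  # general-purpose default, easiest to start with


print(pick_framework(nvidia_only=True, multi_model=False,
                     max_throughput=True, hf_ecosystem=False))
```

Hardware comes first because it is the only hard constraint; the remaining questions are preferences, and teams may reasonably reorder them.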
The LLM serving landscape evolves rapidly. vLLM adds new optimizations monthly, TGI continues to improve its Rust router, and TensorRT-LLM keeps pushing the performance ceiling. Re-benchmark quarterly, and don't over-invest in any single framework—keep your serving layer abstracted behind an OpenAI-compatible API so you can swap engines without changing client code.
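A minimal sketch of that abstraction: the client builds one OpenAI-style chat request, and only the base URL changes per engine. The endpoint URLs are assumptions for a local setup (vLLM's default port 8000, and TGI mapped to port 8080 as in the Docker command earlier), and the names `ENGINES`, `build_chat_request`, and `chat` are hypothetical.

```python
import json
import urllib.request

# Assumed local endpoints; in production these would come from configuration.
ENGINES = {
    "vllm": "http://localhost:8000/v1",
    "tgi": "http://localhost:8080/v1",
}


def build_chat_request(engine: str, model: str, prompt: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the chosen engine."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{ENGINES[engine]}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(engine: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(engine, model, prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("vllm", "meta-llama/Llama-3.1-8B-Instruct", "Hello")
print(request.full_url)
```

Swapping engines then means changing the `engine` key or the URL it maps to; the payload and the response parsing stay the same as long as the backend honors the chat completions schema.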