vLLM vs TGI vs TensorRT-LLM vs Triton

MLOps Series LLM Inference & Serving

Framework Overview

Choosing an LLM serving framework is one of the highest-leverage infrastructure decisions for AI teams. The wrong choice can mean 2–5× higher GPU costs or unacceptable latency. This guide provides a data-driven comparison based on real benchmarks and production experience.

vLLM

Origin: UC Berkeley (Sky Computing Lab)

Language: Python + CUDA

Philosophy: Memory-efficient serving via PagedAttention. Broadest model support, OpenAI-compatible API, easiest to get started.

Best for: General-purpose LLM serving, rapid prototyping, teams prioritizing flexibility.

TGI

Origin: HuggingFace

Language: Rust (router) + Python (model)

Philosophy: Production-ready server with Rust performance for routing/streaming. Native HuggingFace Hub integration.

Best for: HuggingFace ecosystem users, teams needing watermarking, built-in token streaming.

TensorRT-LLM

Origin: NVIDIA

Language: C++ runtime + Python build

Philosophy: Ahead-of-time compilation to GPU-specific optimized engines. Maximum performance at the cost of flexibility.

Best for: Large-scale production on NVIDIA GPUs, FP8 on H100, maximum throughput/$.

Triton Inference Server

Origin: NVIDIA

Language: C++ core

Philosophy: Model-agnostic serving platform. Wraps any backend (TRT-LLM, PyTorch, ONNX). Enterprise-grade orchestration.

Best for: Multi-model deployments, ensemble pipelines, teams already on NVIDIA's ML stack.

Feature Matrix

A comprehensive comparison of capabilities across all four frameworks:

Feature	vLLM	TGI	TensorRT-LLM	Triton + TRT-LLM
Continuous Batching	✓ Yes	✓ Yes	✓ In-flight	✓ Yes
PagedAttention	✓ Core feature	✓ Yes	✓ Yes	✓ Via TRT-LLM
Flash Attention 2	✓ Yes	✓ Yes	~ Custom fused kernels	~ Via TRT-LLM
Token Streaming	✓ SSE	✓ SSE (Rust)	~ Callback-based	✓ gRPC stream
OpenAI-Compatible API	✓ Native	~ Messages API	✗ Custom API	✗ Triton API
Tensor Parallelism	✓ NCCL	✓ NCCL	✓ NCCL	✓ Via TRT-LLM
Pipeline Parallelism	✓ Yes	✗ No	✓ Yes	✓ Via TRT-LLM
FP8 Quantization	✓ Yes	✗ No	✓ Native H100	✓ Via TRT-LLM
INT4 (AWQ/GPTQ)	✓ Yes	✓ Yes	✓ Yes	✓ Via TRT-LLM
Speculative Decoding	✓ Yes	✓ Yes	✓ Yes	✓ Via TRT-LLM
Multi-LoRA	✓ Hot-swap	✓ Yes	~ Limited	~ Limited
Prefix Caching	✓ APC	✗ No	~ Manual	~ Manual
AMD GPU Support	✓ ROCm	~ Experimental	✗ NVIDIA only	✗ NVIDIA only
Model Count	50+ architectures	30+ architectures	20+ architectures	Via backend
Structured Output	✓ Grammar	✓ Grammar	✗ No	✗ No

Throughput Benchmarks

All benchmarks below use the ShareGPT dataset (variable input/output lengths, mean ~250 input / ~250 output tokens) with 64 concurrent requests unless otherwise noted. Hardware: 4×A100-80GB with NVLink, CUDA 12.2.

Llama-2-7B (1×A100-80GB, FP16)

vLLM: 2,850 tok/s (24.1 req/s)
TGI: 2,420 tok/s (20.3 req/s)
TRT-LLM: 3,380 tok/s (28.5 req/s)
TRT-LLM + Triton: 3,310 tok/s
TRT-LLM is ~1.2× faster than vLLM

Llama-2-70B (4×A100-80GB, TP=4, FP16)

vLLM: 1,280 tok/s (10.8 req/s)
TGI: 980 tok/s (8.2 req/s)
TRT-LLM: 1,920 tok/s (16.1 req/s)
TRT-LLM FP8: 2,640 tok/s (22.2 req/s)
TRT-LLM FP8 is ~2× faster than vLLM FP16 (A100 lacks native FP8, so read the FP8 row as an H100-class result)

Mixtral-8x7B (2×A100-80GB, TP=2)

vLLM: 1,650 tok/s
TGI: 1,380 tok/s
TRT-LLM: 2,180 tok/s
MoE routing overhead varies by framework
vLLM's expert-parallel is still maturing

High-Concurrency (Llama-2-7B, 256 concurrent requests)

vLLM: 4,200 tok/s (scales well)
TGI: 3,100 tok/s (queue pressure)
TRT-LLM: 5,100 tok/s
At 256 concurrent requests, vLLM's PagedAttention and TRT-LLM's C++ runtime both shine

The FP8 factor: On H100 GPUs, TensorRT-LLM with FP8 quantization achieves roughly 2× the throughput of vLLM in FP16 (and about 1.4× TRT-LLM's own FP16 result in the Llama-2-70B benchmark above) at <0.5% quality degradation. If you're on H100s, FP8 via TRT-LLM is almost always the right choice for throughput-oriented workloads.
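The relative speedups quoted in this section follow directly from the listed throughput numbers. A minimal sketch that recomputes them; the dictionary layout and the function name are illustrative only, and the figures are copied from the lists above:

```python
# Throughput figures (tok/s) copied from the benchmark lists above.
THROUGHPUT = {
    "Llama-2-7B": {"vLLM": 2850, "TGI": 2420, "TRT-LLM": 3380},
    "Llama-2-70B": {"vLLM": 1280, "TGI": 980, "TRT-LLM": 1920, "TRT-LLM FP8": 2640},
    "Mixtral-8x7B": {"vLLM": 1650, "TGI": 1380, "TRT-LLM": 2180},
}


def speedup(model: str, framework: str, baseline: str = "vLLM") -> float:
    """Return the framework's throughput divided by the baseline's throughput."""
    row = THROUGHPUT[model]
    return row[framework] / row[baseline]


# Print every framework relative to vLLM for each benchmarked model.
for model, row in THROUGHPUT.items():
    for framework in row:
        if framework != "vLLM":
            print(f"{model}: {framework} = {speedup(model, framework):.2f}x vLLM")
```

Passing a different `baseline` (for example `"TRT-LLM"` for the FP8 row) shows how much of a gain comes from the engine and how much from the lower precision.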

Latency Comparison

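Latency has two critical components: TTFT (Time to First Token: how fast the first token arrives) and TPS (Tokens Per Second: decode speed after the first token). They respond to different optimizations: TTFT is dominated by prefill (processing the prompt), while TPS is dominated by the per-token decode loop.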
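To make the two definitions concrete, here is a small sketch that derives both metrics from token arrival timestamps. The function name and input format are assumptions for illustration; any client that records when each token arrives could feed it.

```python
def latency_metrics(request_sent: float, token_times: list[float]) -> dict:
    """Compute TTFT and decode TPS from token arrival timestamps (in seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_sent
    # Decode speed only counts tokens that arrive after the first one.
    decode_time = token_times[-1] - token_times[0]
    decode_tokens = len(token_times) - 1
    tps = decode_tokens / decode_time if decode_time > 0 else 0.0
    return {"ttft": ttft, "tps": tps}


# Request sent at t=0; first token after 50 ms, then one token every 20 ms.
metrics = latency_metrics(0.0, [0.05, 0.07, 0.09, 0.11])
print(metrics)
```

Keeping the first token out of the TPS calculation prevents a slow prefill from being misread as a slow decode loop.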

Metric	vLLM	TGI	TRT-LLM
TTFT (512-token prompt, 7B)	~45 ms	~50 ms	~28 ms
TTFT (2048-token prompt, 7B)	~120 ms	~135 ms	~75 ms
TTFT (512-token prompt, 70B, TP=4)	~180 ms	~210 ms	~110 ms
TPS per user (7B, batch=1)	~62 tok/s	~58 tok/s	~78 tok/s
TPS per user (70B TP=4, batch=1)	~28 tok/s	~25 tok/s	~38 tok/s
P99 TTFT under load (7B, 64 reqs)	~280 ms	~350 ms	~160 ms

TensorRT-LLM's TTFT advantage comes from fused prefill kernels and CUDA graph execution: the engine skips Python overhead entirely during inference. That advantage over vLLM narrows at high batch sizes, where GPU compute (not framework overhead) dominates.

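# Benchmarking script for fair comparison. Targets an OpenAI-compatible
# chat completions endpoint that streams Server-Sent Events (SSE).
import asyncio, json, time
import aiohttp

async def benchmark(url, prompts, concurrency=64):
    semaphore = asyncio.Semaphore(concurrency)  # cap in-flight requests
    results = []

    async def send_request(session, prompt):
        async with semaphore:
            t0 = time.perf_counter()
            first_token_time = None
            chunks = 0
            async with session.post(url, json={
                "model": "llama",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 256, "stream": True
            }) as resp:
                resp.raise_for_status()
                async for raw in resp.content:
                    line = raw.decode("utf-8").strip()
                    # Skip blank separators, keep-alives, and the end marker
                    if not line.startswith("data:") or line.endswith("[DONE]"):
                        continue
                    choices = json.loads(line[len("data:"):]).get("choices") or []
                    if not choices or not (choices[0].get("delta") or {}).get("content"):
                        continue  # role-only or empty chunk, no generated text
                    if first_token_time is None:
                        first_token_time = time.perf_counter() - t0
                    chunks += 1  # one content chunk is roughly one token
            total = time.perf_counter() - t0
            decode_time = total - (first_token_time or 0.0)
            results.append({
                "ttft": first_token_time,
                "total": total,
                # Decode speed after the first token, matching the TPS definition
                "tps": (chunks - 1) / decode_time if chunks > 1 and decode_time > 0 else 0.0
            })

    # One shared session reuses connections instead of reconnecting per request
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[send_request(session, p) for p in prompts])
    return results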

Benchmarking pitfall: Always measure under realistic concurrency. Single-request (batch=1) latency favors TensorRT-LLM by 30–40%. At 128+ concurrent requests, the gap narrows to 15–20% as all frameworks become GPU-compute-bound rather than overhead-bound.
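Raw per-request results are easier to compare once reduced to the percentiles used in the latency table. A minimal sketch, assuming a list of dictionaries with the `ttft` and `tps` keys produced by the benchmarking script above; the helper names are hypothetical and the nearest-rank percentile method is one of several common definitions:

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(results):
    """Reduce per-request results to median/P99 TTFT and mean per-user TPS."""
    # Requests that never produced a token have no TTFT and are excluded.
    ttfts = [r["ttft"] for r in results if r["ttft"] is not None]
    return {
        "requests": len(results),
        "ttft_p50": percentile(ttfts, 50),
        "ttft_p99": percentile(ttfts, 99),
        "mean_tps": statistics.mean([r["tps"] for r in results]),
    }
```

Running `summarize(asyncio.run(benchmark(url, prompts)))` at several concurrency levels shows how the gap between frameworks changes as load increases.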

Deployment Complexity

Time-to-first-inference matters for developer productivity and iteration speed. Here's a realistic assessment:

vLLM — Easiest

Time to first request: ~5 minutes

pip install vllm
Single command to start server
OpenAI-compatible API (zero client changes)
No compilation step
Docker image: ~8 GB

# Literally 2 commands
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct
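Once the server is up, any OpenAI-style client can talk to it. The sketch below uses only the Python standard library and assumes the server listens on `http://localhost:8000` (vLLM's default port) and exposes `/v1/chat/completions`; the helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request in the OpenAI chat completions format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server from the two commands above running, `chat("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", "Say hello.")` should return the model's reply.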

TGI — Easy

Time to first request: ~10 minutes

Docker-first deployment
Auto-downloads model from Hub
Native /generate API, plus an OpenAI-compatible Messages API for chat
No compilation step
Docker image: ~10 GB

# Docker one-liner
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Llama-3.1-8B-Instruct
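A quick way to check the container is to call TGI's native `/generate` endpoint. This is a minimal standard-library sketch: it assumes the port mapping from the Docker command above (`8080` on the host) and the `inputs`/`parameters` request shape of the native API; the helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str,
                           max_new_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for TGI's native /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["generated_text"]
```

With the container running, `generate("http://localhost:8080", "What is continuous batching?")` should return a completion string.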

TensorRT-LLM — Hard

Time to first request: 1–3 hours

Convert checkpoint (10–30 min)
Build engine (30–120 min)
GPU-specific (rebuild per GPU type)
Configuration-specific (rebuild per batch/seq)
Docker image: ~15 GB

Triton + TRT-LLM — Hardest

Time to first request: 2–5 hours

All of TRT-LLM build steps, plus:
Model repository structure setup
Pre/post-processing models
config.pbtxt configuration
Ensemble pipeline wiring
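The extra steps mostly amount to laying out a model repository: one directory per model, each with a `config.pbtxt` and a numbered version subdirectory. The sketch below creates that skeleton. The four component names are assumptions modelled on a typical pre-processing, engine, post-processing, and ensemble split, and the generated `config.pbtxt` files are placeholders that a real deployment would have to fill in with backend, input, and output definitions.

```python
from pathlib import Path

# Hypothetical component names; adjust to match your own pipeline.
COMPONENTS = ("preprocessing", "tensorrt_llm", "postprocessing", "ensemble")


def create_model_repository(root: Path, components=COMPONENTS,
                            version: str = "1") -> list[Path]:
    """Create <root>/<model>/config.pbtxt and <root>/<model>/<version>/ for each model."""
    configs = []
    for name in components:
        model_dir = root / name
        # Triton expects at least one numeric version directory per model.
        (model_dir / version).mkdir(parents=True, exist_ok=True)
        config = model_dir / "config.pbtxt"
        if not config.exists():
            # Placeholder only: real configs also declare backend, inputs, outputs.
            config.write_text(f'name: "{name}"\n')
        configs.append(config)
    return configs
```

Calling `create_model_repository(Path("model_repo"))` produces the directory tree that the server would be pointed at; the engine files and tokenizer scripts still have to be copied into the version directories.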

Quantization & Hardware Support

Quantization support varies significantly and can be the deciding factor for cost-sensitive deployments:

Quantization	vLLM	TGI	TRT-LLM
FP16 / BF16	✓	✓	✓
FP8 (H100)	✓	✗	✓ Native
INT8 (SmoothQuant)	~ Via SQ	✓ EETQ	✓ W8A8
INT4-GPTQ	✓	✓	✓
INT4-AWQ	✓	✓	✓
GGUF (llama.cpp)	~ Experimental	✗	✗
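Quantization matters for cost mainly because weight memory scales with bytes per parameter. A rough sketch of that rule of thumb follows; it counts weights only and ignores KV cache, activations, and the scale/zero-point metadata that formats such as AWQ and GPTQ store, so real footprints are somewhat higher. The names are illustrative.

```python
# Approximate storage per parameter for each precision, in bytes.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params_billion: float, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes) for a given precision."""
    return num_params_billion * BYTES_PER_PARAM[dtype]


for dtype in ("fp16", "fp8", "int4"):
    print(f"70B in {dtype}: ~{weight_memory_gb(70, dtype):.0f} GB of weights")
```

By this estimate a 70B model needs about 140 GB of weights in FP16 (more than one 80 GB GPU) but about 35 GB in INT4, which leaves room for the KV cache on a single A100-80GB.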

Hardware support:

vLLM Hardware

NVIDIA: A100, H100, L40S, A10G, T4, RTX
AMD: MI250, MI300X (ROCm)
AWS Neuron: Inferentia2 (experimental)
TPU: Community support (limited)
Broadest hardware support

TRT-LLM Hardware

NVIDIA only: A100, H100, H200, L40S
FP8: Hopper (H100/H200) and Ada (L40S) only
No AMD/Intel/TPU support
GPU-specific engines (compile per arch)
Narrowest but deepest optimization

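Cost optimization math: A Llama-70B model in INT4-AWQ on vLLM fits on 1×A100-80GB (~$2/hr on cloud). The same model in FP16 on TRT-LLM needs 4×A100 (~$8/hr) but serves 3× more requests. On those figures the quantized single-GPU setup is still cheaper per request: 4× the hourly cost for 3× the throughput works out to about 1.33× the cost per request. The 4×A100 deployment becomes cheaper per request only once its throughput advantage exceeds 4×, so measure the actual ratio for your model and traffic before committing.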
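This kind of trade-off is easy to check numerically: cost per request is the hourly price divided by the requests served per hour. In the sketch below, the hourly prices and the 3× throughput factor come from the paragraph above, while the baseline rate of 5 req/s is an assumed placeholder (the ratio between the two setups does not depend on it).

```python
def cost_per_million_requests(hourly_cost_usd: float, requests_per_second: float) -> float:
    """Cost in USD to serve one million requests at a sustained request rate."""
    requests_per_hour = requests_per_second * 3600
    return hourly_cost_usd / requests_per_hour * 1_000_000


BASELINE_RPS = 5.0  # assumed placeholder, not a measured value

vllm_int4 = cost_per_million_requests(2.0, BASELINE_RPS)      # 1xA100, INT4-AWQ
trt_fp16 = cost_per_million_requests(8.0, 3 * BASELINE_RPS)   # 4xA100, FP16, 3x throughput

print(f"vLLM INT4:    ${vllm_int4:.2f} per million requests")
print(f"TRT-LLM FP16: ${trt_fp16:.2f} per million requests")
print(f"Cost ratio:   {trt_fp16 / vllm_int4:.2f}x")
```

With these inputs the ratio is 4/3, so the larger deployment breaks even only when its throughput multiple matches its price multiple. Substituting measured request rates for both setups gives the real answer for a specific workload.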

Decision Flowchart

Use this flowchart to pick the right framework for your use case:

Production recommendation: Start with vLLM for prototyping and early production. If throughput becomes a bottleneck and you're on NVIDIA hardware, benchmark TensorRT-LLM for your specific model and workload. The 1–3 day setup investment typically pays for itself within a month at scale through 30–50% lower GPU costs.
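The guidance in this article can be condensed into a few rules. The function below is one possible encoding of the "Best for" notes from the overview and the recommendation above, not an official decision procedure; the field names and the ordering of the checks are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    nvidia_gpus: bool = True                 # False for AMD or other accelerators
    multi_model_pipeline: bool = False       # ensembles or many models on one server
    throughput_critical: bool = False        # throughput per dollar is the main goal
    can_afford_engine_builds: bool = False   # team accepts the build/rebuild workflow
    needs_watermarking: bool = False         # a TGI-specific feature in the overview


def recommend(req: Requirements) -> str:
    """Map requirements to a framework using the article's 'Best for' guidance."""
    if not req.nvidia_gpus:
        return "vLLM"  # the only option listed here with AMD (ROCm) support
    if req.multi_model_pipeline:
        return "Triton + TRT-LLM"
    if req.throughput_critical and req.can_afford_engine_builds:
        return "TensorRT-LLM"
    if req.needs_watermarking:
        return "TGI"
    return "vLLM"  # default starting point for prototyping and early production
```

For example, `recommend(Requirements(throughput_critical=True, can_afford_engine_builds=True))` returns `"TensorRT-LLM"`, while the default `Requirements()` returns `"vLLM"`.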

The LLM serving landscape evolves rapidly. vLLM adds new optimizations monthly, TGI continues to improve its Rust router, and TensorRT-LLM keeps pushing the performance ceiling. Re-benchmark quarterly, and don't over-invest in any single framework—keep your serving layer abstracted behind an OpenAI-compatible API so you can swap engines without changing client code.