vLLM vs TGI vs TensorRT-LLM vs Triton

Framework Overview

Choosing an LLM serving framework is one of the highest-leverage infrastructure decisions for AI teams. The wrong choice can mean 2–5× higher GPU costs or unacceptable latency. This guide provides a data-driven comparison based on real benchmarks and production experience.

vLLM

Origin: UC Berkeley (Sky Lab)

Language: Python + CUDA

Philosophy: Memory-efficient serving via PagedAttention. Broadest model support, OpenAI-compatible API, easiest to get started.

Best for: General-purpose LLM serving, rapid prototyping, teams prioritizing flexibility.
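To make the PagedAttention idea concrete, here is a toy sketch of paged KV-cache bookkeeping. It is not vLLM's implementation: the class name, the block size of 16, and the method names are all assumptions chosen for illustration. It only shows the core idea that sequences own lists of fixed-size blocks, so unused slots can appear only in each sequence's last block.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed value for illustration)


class ToyBlockAllocator:
    """Toy paged KV-cache bookkeeping: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_tokens(self, seq_id: str, n: int) -> None:
        """Grow a sequence by n tokens, allocating new blocks only when needed."""
        new_len = self.lengths.get(seq_id, 0) + n
        table = self.tables.get(seq_id, [])
        extra = math.ceil(new_len / BLOCK_SIZE) - len(table)
        if extra > len(self.free):
            raise MemoryError("out of KV-cache blocks")
        self.tables[seq_id] = table + [self.free.pop() for _ in range(extra)]
        self.lengths[seq_id] = new_len

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Count allocated but unused token slots across all sequences."""
        return sum(len(t) * BLOCK_SIZE - self.lengths[s] for s, t in self.tables.items())
```

Because blocks are allocated on demand rather than reserved for the maximum sequence length, more concurrent sequences fit in the same GPU memory.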

TGI

Origin: HuggingFace

Language: Rust (router) + Python (model)

Philosophy: Production-ready server with Rust performance for routing/streaming. Native HuggingFace Hub integration.

Best for: HuggingFace ecosystem users, teams needing watermarking, built-in token streaming.

TensorRT-LLM

Origin: NVIDIA

Language: C++ runtime + Python build

Philosophy: Ahead-of-time compilation to GPU-specific optimized engines. Maximum performance at the cost of flexibility.

Best for: Large-scale production on NVIDIA GPUs, FP8 on H100, maximum throughput/$.

Triton Inference Server

Origin: NVIDIA

Language: C++ core

Philosophy: Model-agnostic serving platform. Wraps any backend (TRT-LLM, PyTorch, ONNX). Enterprise-grade orchestration.

Best for: Multi-model deployments, ensemble pipelines, teams already on NVIDIA's ML stack.

Feature Matrix

A comprehensive comparison of capabilities across all four frameworks:

| Feature | vLLM | TGI | TensorRT-LLM | Triton + TRT-LLM |
|---|---|---|---|---|
| Continuous Batching | ✓ Yes | ✓ Yes | ✓ In-flight | ✓ Yes |
| PagedAttention | ✓ Core feature | ✓ Yes | ✓ Yes | ✓ Via TRT-LLM |
| Flash Attention 2 | ✓ Yes | ✓ Yes | ~ Custom fused kernels | ~ Via TRT-LLM |
| Token Streaming | ✓ SSE | ✓ SSE (Rust) | ~ Callback-based | ✓ gRPC stream |
| OpenAI-Compatible API | ✓ Native | ~ Messages API | ✗ Custom API | ✗ Triton API |
| Tensor Parallelism | ✓ NCCL | ✓ NCCL | ✓ NCCL | ✓ Via TRT-LLM |
| Pipeline Parallelism | ✓ Yes | ✗ No | ✓ Yes | ✓ Via TRT-LLM |
| FP8 Quantization | ✓ Yes | ✗ No | ✓ Native H100 | ✓ Via TRT-LLM |
| INT4 (AWQ/GPTQ) | ✓ Yes | ✓ Yes | ✓ Yes | ✓ Via TRT-LLM |
| Speculative Decoding | ✓ Yes | ✓ Yes | ✓ Yes | ✓ Via TRT-LLM |
| Multi-LoRA | ✓ Hot-swap | ✓ Yes | ~ Limited | ~ Limited |
| Prefix Caching | ✓ APC | ✗ No | ~ Manual | ~ Manual |
| AMD GPU Support | ✓ ROCm | ~ Experimental | ✗ NVIDIA only | ✗ NVIDIA only |
| Model Count | 50+ architectures | 30+ architectures | 20+ architectures | Via backend |
| Structured Output | ✓ Grammar | ✓ Grammar | ✗ No | ✗ No |

Throughput Benchmarks

All benchmarks below use the ShareGPT dataset (variable input/output lengths, mean ~250 input / ~250 output tokens) with 64 concurrent requests unless otherwise noted. Hardware: 4×A100-80GB with NVLink, CUDA 12.2.

Llama-2-7B (1×A100-80GB, FP16)

  • vLLM: 2,850 tok/s (24.1 req/s)
  • TGI: 2,420 tok/s (20.3 req/s)
  • TRT-LLM: 3,380 tok/s (28.5 req/s)
  • TRT-LLM + Triton: 3,310 tok/s
  • TRT-LLM is ~1.2× faster than vLLM

Llama-2-70B (4×A100-80GB, TP=4, FP16)

  • vLLM: 1,280 tok/s (10.8 req/s)
  • TGI: 980 tok/s (8.2 req/s)
  • TRT-LLM: 1,920 tok/s (16.1 req/s)
  • TRT-LLM FP8 (needs H100/H200; A100 has no FP8 support): 2,640 tok/s (22.2 req/s)
  • TRT-LLM FP8 is ~2× faster than vLLM FP16

Mixtral-8x7B (2×A100-80GB, TP=2)

  • vLLM: 1,650 tok/s
  • TGI: 1,380 tok/s
  • TRT-LLM: 2,180 tok/s
  • MoE routing overhead varies by framework
  • vLLM's expert-parallel is still maturing

High-Concurrency (Llama-2-7B, 256 concurrent requests)

  • vLLM: 4,200 tok/s (scales well)
  • TGI: 3,100 tok/s (queue pressure)
  • TRT-LLM: 5,100 tok/s
  • At 256 concurrent requests, vLLM's PagedAttention and TRT-LLM's C++ runtime both shine

The FP8 factor: On H100 GPUs, TensorRT-LLM with FP8 quantization achieves roughly 2× the throughput of any FP16 framework at <0.5% quality degradation. If you're on H100s, FP8 via TRT-LLM is almost always the right choice for throughput-oriented workloads.
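The speedup ratios quoted in the lists above can be reproduced with a few lines of arithmetic. This sketch simply copies the tok/s figures from this section; the dictionary layout and function name are illustrative choices, not part of any benchmark tool.

```python
# Throughput figures (tok/s) copied from the benchmark lists above.
THROUGHPUT = {
    "Llama-2-7B": {"vLLM": 2850, "TGI": 2420, "TRT-LLM": 3380},
    "Llama-2-70B": {"vLLM": 1280, "TGI": 980, "TRT-LLM": 1920, "TRT-LLM FP8": 2640},
    "Mixtral-8x7B": {"vLLM": 1650, "TGI": 1380, "TRT-LLM": 2180},
}


def relative_to(baseline: str, results: dict) -> dict:
    """Express every framework's throughput as a multiple of the baseline's."""
    base = results[baseline]
    return {name: round(value / base, 2) for name, value in results.items()}


for model, results in THROUGHPUT.items():
    print(model, relative_to("vLLM", results))
```

Normalizing against one baseline makes it easier to see that the gap between frameworks differs by model size and precision rather than being a single constant.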

Latency Comparison

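Latency has two critical components: TTFT (time to first token: how fast the first token arrives) and TPS (tokens per second: decode speed after the first token). They optimize differently: TTFT is dominated by prefill, which is compute-bound, while decode speed is limited mainly by memory bandwidth.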
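As a minimal sketch of how the two metrics separate, the function below derives both from per-token arrival times. The function name and the input format (a request start time plus a list of token timestamps on the same clock) are assumptions for illustration.

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Split one streamed response into TTFT and decode speed.

    token_times holds the arrival time of each token in seconds,
    measured on the same clock as request_start.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    # Decode TPS counts only tokens after the first, so prefill time is excluded.
    decode_tps = (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0
    return {"ttft_s": ttft, "decode_tps": decode_tps}
```

Dividing total tokens by total wall time would fold prefill into the decode figure, which understates decode speed for long prompts.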

| Metric | vLLM | TGI | TRT-LLM |
|---|---|---|---|
| TTFT (512-token prompt, 7B) | ~45 ms | ~50 ms | ~28 ms |
| TTFT (2048-token prompt, 7B) | ~120 ms | ~135 ms | ~75 ms |
| TTFT (512-token prompt, 70B, TP=4) | ~180 ms | ~210 ms | ~110 ms |
| TPS per user (7B, batch=1) | ~62 tok/s | ~58 tok/s | ~78 tok/s |
| TPS per user (70B TP=4, batch=1) | ~28 tok/s | ~25 tok/s | ~38 tok/s |
| P99 TTFT under load (7B, 64 reqs) | ~280 ms | ~350 ms | ~160 ms |

TensorRT-LLM's TTFT advantage comes from fused prefill kernels and CUDA graph execution: the engine skips Python overhead entirely during inference. That advantage narrows at high batch sizes, where GPU compute (not overhead) dominates.

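# Benchmarking script for fair comparison (expects an OpenAI-compatible streaming endpoint)
import asyncio, json, time
import aiohttp

async def benchmark(url, prompts, concurrency=64):
    semaphore = asyncio.Semaphore(concurrency)  # caps in-flight requests
    results = []

    async def send_request(session, prompt):
        async with semaphore:
            t0 = time.perf_counter()
            first_token_time = None
            chunks = 0
            async with session.post(url, json={
                "model": "llama",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 256, "stream": True
            }) as resp:
                resp.raise_for_status()
                async for raw in resp.content:
                    line = raw.decode("utf-8").strip()
                    # SSE framing: ignore blank separators and non-data lines
                    if not line.startswith("data:"):
                        continue
                    data = line[len("data:"):].strip()
                    if data == "[DONE]":
                        continue
                    choices = json.loads(data).get("choices") or []
                    if not choices or not (choices[0].get("delta") or {}).get("content"):
                        continue
                    if first_token_time is None:
                        first_token_time = time.perf_counter() - t0
                    chunks += 1  # one content chunk is approximately one token
            total = time.perf_counter() - t0
            results.append({
                "ttft": first_token_time,
                "total": total,
                "tps": chunks / total if total > 0 else 0.0
            })

    # Share one session so per-request connection setup does not inflate TTFT
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[send_request(session, p) for p in prompts])
    return results

Benchmarking pitfall: Always measure under realistic concurrency. Single-request (batch=1) latency favors TensorRT-LLM by 30–40%. At 128+ concurrent requests, the gap narrows to 15–20% as all frameworks become GPU-compute-bound rather than overhead-bound.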
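To turn the per-request results from the benchmarking script into figures comparable to the latency table (including P99 TTFT), a summary step along these lines could be used. This is a sketch under assumptions: the helper names are hypothetical, the percentile uses the simple nearest-rank method, and the input is the list of dicts with "ttft", "total", and "tps" keys that the script returns.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(results):
    """Aggregate per-request results into mean, median, and tail latency."""
    # Requests that never produced a token have ttft=None and are excluded here.
    ttfts = [r["ttft"] for r in results if r["ttft"] is not None]
    return {
        "requests": len(results),
        "mean_ttft": statistics.mean(ttfts),
        "p50_ttft": percentile(ttfts, 50),
        "p99_ttft": percentile(ttfts, 99),
        "mean_tps": statistics.mean(r["tps"] for r in results),
    }
```

Reporting the tail alongside the mean matters because queueing under load shows up in P99 long before it moves the average.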

Deployment Complexity

Time-to-first-inference matters for developer productivity and iteration speed. Here's a realistic assessment:

vLLM — Easiest

Time to first request: ~5 minutes

  • pip install vllm
  • Single command to start server
  • OpenAI-compatible API (zero client changes)
  • No compilation step
  • Docker image: ~8 GB

# Two commands: install, then serve
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct
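Once the server is up, any client that speaks the OpenAI chat completions format should work. The sketch below uses only the standard library and assumes vLLM's default port of 8000 on localhost; the function names are hypothetical, and the request-building step is separated out so it can be inspected without a running server.

```python
import json
import urllib.request

# Assumption: the server started above is listening on localhost:8000.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str,
                       model: str = "meta-llama/Llama-3.1-8B-Instruct",
                       max_tokens: int = 128) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request; requires the server to be running."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Hello")` against a live server returns the assistant's reply as a string.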

TGI — Easy

Time to first request: ~10 minutes

  • Docker-first deployment
  • Auto-downloads model from Hub
  • Native API plus an OpenAI-compatible Messages API
  • No compilation step
  • Docker image: ~10 GB

# Docker one-liner (HF_TOKEN is needed because the Llama weights are gated)
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Llama-3.1-8B-Instruct
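For reference, here is a minimal sketch of a request against TGI's native `/generate` endpoint, assuming the container above is reachable on localhost port 8080. The function names are hypothetical and only the standard library is used.

```python
import json
import urllib.request

# Assumption: host port 8080 is mapped to the container, as in the docker command above.
TGI_URL = "http://localhost:8080"


def build_generate_request(prompt: str, max_new_tokens: int = 128) -> urllib.request.Request:
    """Build a request for TGI's native /generate endpoint without sending it."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{TGI_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the request; requires the TGI container to be running."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["generated_text"]
```

The native payload uses `inputs` and `parameters` rather than a `messages` list, which is the main difference a client has to account for.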

TensorRT-LLM — Hard

Time to first request: 1–3 hours

  • Convert checkpoint (10–30 min)
  • Build engine (30–120 min)
  • GPU-specific (rebuild per GPU type)
  • Configuration-specific (rebuild when max batch size or sequence length changes)
  • Docker image: ~15 GB
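Because engines are tied to both the GPU type and the build configuration, it can help to track which combinations have already been built. The following is a hypothetical bookkeeping sketch, not part of TensorRT-LLM: the `EngineSpec` fields and helper names are assumptions that mirror the rebuild triggers listed above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EngineSpec:
    """Everything that, per the list above, forces a rebuild when it changes."""
    model: str
    gpu_arch: str          # e.g. "sm80" for A100, "sm90" for H100
    tensor_parallel: int
    dtype: str
    max_batch_size: int
    max_seq_len: int


def engine_key(spec: EngineSpec) -> str:
    """Stable short hash identifying one built engine."""
    blob = json.dumps(asdict(spec), sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


def needs_rebuild(spec: EngineSpec, built_keys: set) -> bool:
    """True when no previously built engine matches this exact spec."""
    return engine_key(spec) not in built_keys
```

Keying a build cache this way makes the cost of the 30–120 minute build step visible whenever someone changes the GPU type or batch settings.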

Triton + TRT-LLM — Hardest

Time to first request: 2–5 hours

  • All of the TRT-LLM build steps, plus:
  • Model repository structure setup
  • Pre/post-processing models
  • config.pbtxt configuration
  • Ensemble pipeline wiring
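The extra steps above amount to laying out a model repository with one directory per model, each holding a `config.pbtxt` and a numeric version subdirectory. The sketch below scaffolds such a layout. The four model names follow a layout commonly used with the TRT-LLM backend and should be treated as assumptions, and the generated configs are skeletons only.

```python
import tempfile
from pathlib import Path

# Assumed model names and backends; adjust to match your actual backend templates.
MODELS = {
    "preprocessing": 'backend: "python"',
    "tensorrt_llm": 'backend: "tensorrtllm"',
    "postprocessing": 'backend: "python"',
    "ensemble": 'platform: "ensemble"',
}


def scaffold_repository(root: Path) -> list[Path]:
    """Create <root>/<model>/config.pbtxt and <root>/<model>/1/ for each model."""
    created = []
    for name, backend_line in MODELS.items():
        version_dir = root / name / "1"  # numeric version subdirectory
        version_dir.mkdir(parents=True, exist_ok=True)
        config = root / name / "config.pbtxt"
        # Skeleton only: real configs also declare inputs, outputs, and batching.
        config.write_text(f'name: "{name}"\n{backend_line}\n')
        created.append(config)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold_repository(Path(tmp)):
            print(path.relative_to(tmp))
```

The ensemble entry is what wires tokenization, the engine, and detokenization into a single callable pipeline, which is where most of the additional setup time goes.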

Quantization & Hardware Support

Quantization support varies significantly and can be the deciding factor for cost-sensitive deployments:

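| Quantization | vLLM | TGI | TRT-LLM |
|---|---|---|---|
| FP16 / BF16 | ✓ | ✓ | ✓ |
| FP8 (H100) | ✓ | ✗ | ✓ Native |
| INT8 (SmoothQuant) | ~ Via SQ | ✓ EETQ | ✓ W8A8 |
| INT4-GPTQ | ✓ | ✓ | ✓ |
| INT4-AWQ | ✓ | ✓ | ✓ |
| GGUF (llama.cpp) | ~ Experimental | | |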
Hardware support:

vLLM Hardware

  • NVIDIA: A100, H100, L40S, A10G, T4, RTX
  • AMD: MI250, MI300X (ROCm)
  • AWS Neuron: Inferentia2 (experimental)
  • TPU: Community support (limited)
  • Broadest hardware support

TRT-LLM Hardware

  • NVIDIA only: A100, H100, H200, L40S
  • FP8: H100/H200 only
  • No AMD/Intel/TPU support
  • GPU-specific engines (compile per arch)
  • Narrowest but deepest optimization
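
Cost optimization math: A Llama-70B model in INT4-AWQ on vLLM fits on 1×A100-80GB (~$2/hr on cloud). The same model in FP16 on TRT-LLM needs 4×A100 (~$8/hr) but serves 3× as many requests. At those figures the larger deployment costs 4× as much for 3× the throughput, or about 1.33× more per request, so it becomes cheaper per request only when its measured throughput advantage exceeds 4×. At >100 req/s, where either option needs several replicas, benchmark both configurations on your own workload before committing.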
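The per-request arithmetic can be checked with a short sketch. It uses the hourly prices and the 3× throughput ratio quoted above; the baseline of 10 req/s for the single-GPU deployment is a placeholder assumption, and the function names are illustrative.

```python
def cost_per_million_requests(hourly_cost_usd: float, requests_per_second: float) -> float:
    """USD spent per one million requests at a sustained request rate."""
    requests_per_hour = requests_per_second * 3600
    return hourly_cost_usd / requests_per_hour * 1_000_000


def break_even_speedup(small_hourly_usd: float, large_hourly_usd: float) -> float:
    """Throughput multiple the larger deployment needs to match cost per request."""
    return large_hourly_usd / small_hourly_usd


BASELINE_RPS = 10.0  # placeholder: measured throughput of the 1×A100 deployment

small = cost_per_million_requests(2.0, BASELINE_RPS)
large = cost_per_million_requests(8.0, BASELINE_RPS * 3)
print(f"1×A100 INT4: ${small:.2f} per 1M requests")
print(f"4×A100 FP16: ${large:.2f} per 1M requests")
print(f"break-even speedup: {break_even_speedup(2.0, 8.0):.1f}×")
```

The break-even ratio equals the price ratio, so the comparison depends only on measured throughput per deployment and not on the absolute request rate.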

Decision Flowchart

Use this flowchart to pick the right framework for your use case:

Need to serve an LLM?

  • Using NVIDIA GPUs only? No (AMD/mixed) → vLLM. Yes → continue.
  • Max throughput or fast iteration?
  • Fast iteration → HuggingFace ecosystem? Yes → TGI. No → vLLM.
  • Max throughput → Multi-model or ensemble needed? Yes → Triton. No → TensorRT-LLM.

Quick Decision Rules

  • vLLM: Default choice. Best model coverage, easiest setup, OpenAI-compat API. Start here.
  • TGI: HuggingFace-native. Great if you need watermarking, Rust performance, or Inference Endpoints.
  • TensorRT-LLM: When you need every last token/sec/$. Budget 1–3 days setup. H100 FP8 is unbeatable.
  • Triton: Multi-model serving, ensemble pipelines, or when your org already uses NVIDIA's ML stack.

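The same decision path can be written as a small function, which is handy for encoding the choice in internal tooling or documentation checks. The function and parameter names are hypothetical; the branches mirror the flowchart questions above and nothing more.

```python
def pick_framework(nvidia_only: bool,
                   priority: str,
                   huggingface_ecosystem: bool = False,
                   multi_model: bool = False) -> str:
    """Follow the flowchart: hardware first, then priority, then the follow-up question.

    priority must be "fast_iteration" or "max_throughput".
    """
    if not nvidia_only:
        return "vLLM"  # AMD or mixed hardware
    if priority == "fast_iteration":
        return "TGI" if huggingface_ecosystem else "vLLM"
    if priority == "max_throughput":
        return "Triton" if multi_model else "TensorRT-LLM"
    raise ValueError(f"unknown priority: {priority!r}")


print(pick_framework(nvidia_only=True, priority="max_throughput"))
```

Treat the result as a starting point for benchmarking rather than a final answer, since the flowchart ignores factors such as team experience and existing infrastructure.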
Production recommendation: Start with vLLM for prototyping and early production. If throughput becomes a bottleneck and you're on NVIDIA hardware, benchmark TensorRT-LLM for your specific model and workload. The 1–3 day setup investment typically pays for itself within a month at scale through 30–50% lower GPU costs.

The LLM serving landscape evolves rapidly. vLLM adds new optimizations monthly, TGI continues to improve its Rust router, and TensorRT-LLM keeps pushing the performance ceiling. Re-benchmark quarterly, and don't over-invest in any single framework—keep your serving layer abstracted behind an OpenAI-compatible API so you can swap engines without changing client code.
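One way to keep the serving layer swappable is to resolve the endpoint from configuration, so clients never hard-code an engine. This is a minimal sketch: the hostnames are hypothetical, and it assumes each backend exposes an OpenAI-compatible chat completions route, directly or through a proxy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str  # root of an OpenAI-compatible API, ending in /v1


# Hypothetical internal endpoints; swapping engines means changing only the active key.
BACKENDS = {
    "vllm": Backend("vllm", "http://vllm.internal:8000/v1"),
    "tgi": Backend("tgi", "http://tgi.internal:8080/v1"),
}


def chat_completions_url(active: str) -> str:
    """Resolve the chat completions URL for whichever backend is active."""
    backend = BACKENDS[active]
    return f"{backend.base_url.rstrip('/')}/chat/completions"


print(chat_completions_url("vllm"))
```

With this indirection, a quarterly re-benchmark that favors a different engine becomes a configuration change rather than a client migration.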