LLM Inference Cost: Tokens/Dollar
The Cost Model: From GPU-Hours to $/Million Tokens
Every LLM inference cost calculation reduces to one equation: Cost per token = GPU cost per second ÷ tokens per second. But getting accurate numbers for both sides requires understanding the full stack — hardware throughput, batching efficiency, utilization rates, and operational overhead.
```python
# Cost-per-token calculation
def cost_per_million_tokens(
    gpu_cost_per_hour: float,   # $/hr per GPU, e.g., $2.21 for A100 80GB on AWS
    tokens_per_second: float,   # aggregate throughput, e.g., 2400 tok/s at batch=64
    utilization: float = 0.70,  # average GPU utilization
    num_gpus: int = 1,
) -> float:
    gpu_cost_per_sec = (gpu_cost_per_hour * num_gpus) / 3600
    effective_tps = tokens_per_second * utilization
    cost_per_token = gpu_cost_per_sec / effective_tps
    return cost_per_token * 1_000_000


# Example: Llama-2 70B on 4×A100 80GB
cost = cost_per_million_tokens(
    gpu_cost_per_hour=2.21,   # per-GPU rate: p4d.24xlarge at $32.77/hr ÷ 8 GPUs ÷ ~1.85
    tokens_per_second=2400,   # vLLM, batch=64, FP16
    utilization=0.70,
    num_gpus=4,
)
print(f"${cost:.2f} / 1M tokens")
# (2.21 * 4 / 3600) / (2400 * 0.7) * 1e6 = $1.46 / 1M tokens
```
Right-Sizing GPUs: A100 vs H100 vs A10G vs L4
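Choosing the right GPU is one of the highest-impact cost decisions, and the most expensive GPU isn't always the most cost-efficient. The key metric is tokens per dollar, not raw throughput. The table below lists approximate single-GPU figures for Llama-2 7B; its $/1M tokens column is the hourly price divided by throughput at full utilization.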
| GPU | VRAM | FP16 TFLOPS | Mem BW (TB/s) | Cloud $/hr | Llama-2 7B tok/s | $/1M tokens |
|---|---|---|---|---|---|---|
| H100 SXM | 80 GB | 989 | 3.35 | ~$4.50 | ~4,800 | $0.26 |
| A100 80GB | 80 GB | 312 | 2.0 | ~$2.21 | ~2,900 | $0.21 |
| A10G | 24 GB | 125 | 0.6 | ~$1.01 | ~850 | $0.33 |
| L4 | 24 GB | 121 | 0.3 | ~$0.54 | ~620 | $0.24 |
| T4 | 16 GB | 65 | 0.3 | ~$0.38 | ~350 | $0.30 |
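As a quick check on the table, the following sketch recomputes the $/1M tokens column and the equivalent tokens-per-dollar figure from the listed prices and throughputs. The `GPUS` dictionary and both helper functions are illustrative names, not part of any library, and the calculation assumes 100% utilization, which is what the table's figures correspond to.

```python
# Approximate figures copied from the table above: (cloud $/hr, Llama-2 7B tok/s)
GPUS = {
    "H100 SXM": (4.50, 4800),
    "A100 80GB": (2.21, 2900),
    "A10G": (1.01, 850),
    "L4": (0.54, 620),
    "T4": (0.38, 350),
}


def dollars_per_million_tokens(price_per_hour, tokens_per_second, utilization=1.0):
    """Hourly price divided by tokens served per hour, scaled to 1M tokens."""
    tokens_per_hour = tokens_per_second * utilization * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


def tokens_per_dollar(price_per_hour, tokens_per_second, utilization=1.0):
    """The inverse view: how many tokens one dollar of GPU time buys."""
    return tokens_per_second * utilization * 3600 / price_per_hour


# Rank GPUs from cheapest to most expensive per token
ranking = sorted(GPUS, key=lambda name: dollars_per_million_tokens(*GPUS[name]))
for name in ranking:
    price, tps = GPUS[name]
    print(
        f"{name:10s} ${dollars_per_million_tokens(price, tps):.2f}/1M tokens  "
        f"{tokens_per_dollar(price, tps) / 1e6:.2f}M tokens/$"
    )
```

With these inputs the A100 80GB ranks first and the A10G last, even though the H100 has the highest raw throughput. Passing a lower `utilization` raises every cost by the same factor and leaves the ranking unchanged.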
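For larger models that require multi-GPU, the economics shift (the $/1M tokens column assumes full utilization; at 70% utilization the first row becomes the $1.46 from the earlier example):
| Model | Config | GPUs | Total $/hr | Batch tok/s | $/1M tokens |
|---|---|---|---|---|---|
| Llama-2 70B FP16 | 4×A100 80GB (TP=4) | 4 | $8.84 | 2,400 | $1.02 |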
| Llama-2 70B FP16 | 4×H100 SXM (TP=4) | 4 | $18.00 | 6,200 | $0.81 |
| Llama-2 70B INT4 (AWQ) | 2×A100 80GB (TP=2) | 2 | $4.42 | 1,800 | $0.68 |
| Llama-2 70B INT4 (AWQ) | 1×H100 SXM | 1 | $4.50 | 1,500 | $0.83 |
| Mixtral 8×7B FP16 | 2×A100 80GB (TP=2) | 2 | $4.42 | 3,200 | $0.38 |
Batching Economics: The Throughput Multiplier
Continuous batching is the most impactful cost optimization technique. Without batching, a single H100 serving Llama-2 7B generates ~90 tokens/second. With continuous batching at batch size 64, throughput jumps to ~4,800 tokens/second — a 53× improvement.
The economics are dramatic because decode is memory-bandwidth-bound. Reading model weights costs the same whether you compute one output token or 64. By batching, you amortize the weight-read cost across all sequences in the batch.
```python
# Batching throughput model for the decode phase (bandwidth-bound upper estimate)
def decode_throughput(
    model_size_bytes: float,       # e.g., 14e9 for 7B FP16
    mem_bandwidth: float,          # bytes/s, e.g., 3.35e12 for H100
    batch_size: int,
    overhead_factor: float = 1.2,  # KV cache reads, activation memory
) -> float:
    # Time to read model weights once (constant per decode step)
    weight_read_time = model_size_bytes / mem_bandwidth
    # KV cache reads actually scale with batch × sequence length; here they
    # are folded into a fixed overhead factor (typically 10-30% of weight
    # reads at moderate batch sizes), so this model scales linearly with batch
    total_time = weight_read_time * overhead_factor
    tokens_per_step = batch_size  # one token per sequence
    return tokens_per_step / total_time


# H100 + Llama-2 7B FP16: upper bounds from this simplified model
decode_throughput(14e9, 3.35e12, batch_size=1)    # ~ 200 tok/s
decode_throughput(14e9, 3.35e12, batch_size=32)   # ~ 6,400 tok/s
decode_throughput(14e9, 3.35e12, batch_size=64)   # ~ 12,800 tok/s
decode_throughput(14e9, 3.35e12, batch_size=128)  # ~ 25,500 tok/s

# Real throughput falls below these bounds as KV cache reads and compute grow
# with batch size. The figures used in this article are ~3,800 (batch 32),
# ~4,800 (batch 64), and ~5,100 tok/s (batch 128), i.e. diminishing returns.
```
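The diminishing returns at large batch sizes come mostly from KV cache traffic, which grows with batch size while the weight read stays constant. The sketch below makes that term explicit instead of using a fixed overhead factor. It is a rough model under stated assumptions: Llama-2 7B dimensions (32 layers, hidden size 4096, FP16 values), an assumed average context of 1,024 tokens per sequence, decode that is purely bandwidth-bound, and no check that the KV cache fits in GPU memory. The function names are illustrative.

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """KV cache bytes stored per token: one K and one V vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


def decode_throughput_with_kv(
    model_size_bytes: float,
    mem_bandwidth: float,      # bytes/s
    batch_size: int,
    avg_context_tokens: int,   # assumed average tokens already in each sequence
    kv_per_token: int,         # bytes, from kv_bytes_per_token()
) -> float:
    """Tokens/s when each decode step reads the weights once plus every sequence's KV cache."""
    weight_time = model_size_bytes / mem_bandwidth
    kv_time = batch_size * avg_context_tokens * kv_per_token / mem_bandwidth
    return batch_size / (weight_time + kv_time)


# Assumed Llama-2 7B shape: 32 layers, hidden size 4096, FP16 KV cache
KV_PER_TOKEN = kv_bytes_per_token(num_layers=32, hidden_size=4096)

throughputs = {
    batch: decode_throughput_with_kv(14e9, 3.35e12, batch, 1024, KV_PER_TOKEN)
    for batch in (1, 32, 64, 128)
}
for batch, tps in throughputs.items():
    print(f"batch={batch:3d}: {tps:,.0f} tok/s")
```

Under these assumptions the estimate flattens quickly: doubling the batch from 32 to 64 adds more throughput than doubling it again to 128, and the curve approaches a ceiling of bandwidth divided by per-sequence KV bytes. Longer contexts lower that ceiling.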
The cost impact of batching at different utilization levels, for an H100 at ~$4.50/hr serving Llama-2 7B:
Low Utilization (20%)
- Average batch: 2-4 requests
- Effective throughput: ~400 tok/s
- Cost: $3.13 / 1M tokens
- GPU idle 80% of time
- Common in: dev/staging, low-traffic APIs
High Utilization (80%)
- Average batch: 32-64 requests
- Effective throughput: ~3,800 tok/s
- Cost: $0.33 / 1M tokens
- GPU saturated during peak
- Common in: production chat, API services
Caching ROI: Prefix, Semantic, and Prompt Caching
Caching is the second most impactful cost lever. Three types of caching apply to LLM inference, each with different ROI profiles:
1. Prefix caching (KV cache reuse): When multiple requests share the same system prompt or prefix, the KV cache for shared tokens is computed once and reused. With a 2048-token system prompt and a 256-token user query, a cache hit skips 2048 of the 2304 prefill tokens, roughly 89% of the prefill work for that request.
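```python
# Prefix caching ROI calculation
system_prompt_tokens = 2048
user_query_tokens = 256
requests_per_hour = 10_000
cache_hit_rate = 0.85             # 85% of requests share a system prompt
prefill_price_per_million = 0.50  # $ per 1M prefill tokens

full_prompt_tokens = system_prompt_tokens + user_query_tokens  # 2304

# Prefill cost without caching (all tokens every time)
total_prefill_tokens = full_prompt_tokens * requests_per_hour  # = 23.04M tokens/hr

# Prefill cost with caching (only the user query for cache hits)
cached_prefill = user_query_tokens * requests_per_hour * cache_hit_rate           # = 2.176M tokens
uncached_prefill = full_prompt_tokens * requests_per_hour * (1 - cache_hit_rate)  # = 3.456M tokens
total_with_cache = cached_prefill + uncached_prefill                              # = 5.632M tokens/hr

# Savings: 75.6% reduction in prefill compute
savings_fraction = 1 - total_with_cache / total_prefill_tokens
# At $0.50/1M prefill tokens: $8.70/hr saved
dollars_saved_per_hour = (total_prefill_tokens - total_with_cache) / 1e6 * prefill_price_per_million
```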
2. Semantic caching: Cache entire responses for semantically similar queries. A vector similarity search (cosine similarity ≥ 0.95) determines if a cached response can be reused. Hit rates of 10-30% are common for FAQ-style workloads, saving both prefill and decode cost.
3. Prompt template compilation: For structured prompts (RAG pipelines, tool-use), pre-compute and cache the KV state for the fixed template portions. Only the dynamic context (retrieved documents, user input) needs fresh computation.
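The semantic caching lookup from item 2 can be sketched in a few lines. This is a minimal illustration, not a production design: the `SemanticCache` class is hypothetical, embeddings are assumed to come from whatever embedding model the application already uses, and the linear scan stands in for a real vector index.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Stores (embedding, response) pairs and reuses a response on a close match."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, query_embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(query_embedding), response))

    def get(self, query_embedding: Sequence[float]) -> Optional[str]:
        # Linear scan for the most similar cached query
        best_score, best_response = 0.0, None
        for embedding, response in self._entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Only reuse the response if the match clears the similarity threshold
        return best_response if best_score >= self.threshold else None


# Toy 2-dimensional embeddings, for illustration only
cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "cached answer")
near_hit = cache.get([0.99, 0.05])  # nearly the same direction as the cached query
miss = cache.get([0.5, 0.5])        # 45 degrees away from the cached query
```

On a hit the request skips both prefill and decode, so the savings per hit are the full cost of the request minus the embedding and lookup. The threshold trades hit rate against the risk of returning an answer to a subtly different question.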
Spot Instances & Preemptible GPUs
Spot/preemptible instances offer 60-70% discounts on GPU compute. For inference workloads (stateless, short-lived), spot is viable with proper architecture:
✅ Spot-Friendly Patterns
- Batch inference jobs (offline processing)
- Overflow capacity for peak traffic
- Non-latency-sensitive workloads
- Evaluation and benchmarking runs
- Workloads with <5 min generation time
❌ Spot-Hostile Patterns
- Primary serving fleet (SLO-bound)
- Long multi-turn conversations (state loss)
- Single-replica deployments
- Workloads needing consistent P99 latency
- Models with >5 min cold-start time
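The hybrid approach works best: maintain a base fleet on reserved/on-demand instances (sized for P50 traffic) and scale out with spot instances for peak demand. In the example below, this saves about 20% against an all-on-demand fleet sized for average traffic, and the gap widens if the alternative is on-demand capacity provisioned for peak: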
```python
# Hybrid spot/on-demand cost model
base_fleet = 4             # On-demand A100s for guaranteed capacity
spot_fleet_range = (0, 8)  # Spot A100s for burst (autoscaled between min and max)
on_demand_rate = 2.21      # $/hr per A100
spot_rate = 0.73           # $/hr per A100 (~67% discount)
hours_per_month = 730

# Average traffic needs 6 GPUs, peak needs 10
monthly_base_cost = base_fleet * on_demand_rate * hours_per_month  # = $6,453
# 4 spot GPUs running 60% of the time
monthly_spot_cost = 4 * spot_rate * hours_per_month * 0.6          # = $1,279
total_hybrid = monthly_base_cost + monthly_spot_cost               # = $7,732/month

# Compare: all on-demand for average 6 GPUs
all_on_demand = 6 * on_demand_rate * hours_per_month               # = $9,680/month
savings = 1 - total_hybrid / all_on_demand                         # = 20.1% with hybrid approach
```
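The size of the saving depends heavily on which on-demand fleet the hybrid setup is compared against. The sketch below wraps the same arithmetic in two hypothetical helper functions and evaluates it against both an average-sized fleet (6 GPUs) and a peak-sized fleet (10 GPUs). It assumes the rates above and ignores spot interruptions and price fluctuations.

```python
HOURS_PER_MONTH = 730


def hybrid_monthly_cost(base_gpus, avg_spot_gpus, on_demand_rate, spot_rate):
    """Monthly cost of an always-on base fleet plus a time-averaged spot fleet."""
    base = base_gpus * on_demand_rate * HOURS_PER_MONTH
    spot = avg_spot_gpus * spot_rate * HOURS_PER_MONTH
    return base + spot


def savings_vs_on_demand(hybrid_cost, on_demand_gpus, on_demand_rate):
    """Fractional saving relative to an all-on-demand fleet of the given size."""
    baseline = on_demand_gpus * on_demand_rate * HOURS_PER_MONTH
    return 1 - hybrid_cost / baseline


# 4 on-demand A100s plus 4 spot A100s running 60% of the time (2.4 on average)
hybrid = hybrid_monthly_cost(
    base_gpus=4, avg_spot_gpus=4 * 0.6, on_demand_rate=2.21, spot_rate=0.73
)
vs_average = savings_vs_on_demand(hybrid, on_demand_gpus=6, on_demand_rate=2.21)
vs_peak = savings_vs_on_demand(hybrid, on_demand_gpus=10, on_demand_rate=2.21)

print(f"hybrid: ${hybrid:,.0f}/month")
print(f"vs 6 on-demand GPUs:  {vs_average:.1%}")
print(f"vs 10 on-demand GPUs: {vs_peak:.1%}")
```

Against the average-sized fleet the saving is about 20%. Against a fleet provisioned for peak it is a little over 50%, which is the more relevant comparison when the on-demand alternative must also absorb peak traffic.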
Multi-Tenancy Savings
Running multiple models or tenants on the same GPU infrastructure amortizes fixed costs and improves utilization. The key insight: most tenants don't peak simultaneously.
Model multiplexing: Load multiple LoRA adapters on a shared base model. A single Llama-2 70B base with 50 LoRA adapters (each ~100MB) serves 50 different fine-tuned models with only 1-3% throughput overhead from adapter switching. Without multiplexing, each fine-tuned model needs its own dedicated GPUs, which is up to 50× the cost.
Statistical multiplexing: With 10 independent tenants averaging 40% utilization each, a shared fleet can run at ~85-90% combined utilization, because independent tenants rarely peak at the same time. Compared to dedicated infrastructure at 40% utilization, this is roughly a 2× cost reduction.
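The 2× figure follows from simple fleet sizing. The sketch below assumes each tenant would otherwise occupy one dedicated GPU, and that a shared fleet can be sized by dividing aggregate average demand by a target utilization. That is a simplification: it presumes peaks are uncorrelated and leaves no extra headroom for bursts. The `gpus_needed` helper is an illustrative name.

```python
import math


def gpus_needed(total_demand_gpus: float, target_utilization: float) -> int:
    """Smallest whole number of GPUs that serves the demand at or below the target utilization."""
    return math.ceil(total_demand_gpus / target_utilization)


tenants = 10
avg_utilization = 0.40

# Dedicated: one GPU per tenant, each 40% busy on average
dedicated_gpus = tenants

# Shared: the same total work (4 GPU-equivalents) packed onto a common fleet
total_demand = tenants * avg_utilization
shared_gpus = gpus_needed(total_demand, target_utilization=0.85)

cost_reduction = dedicated_gpus / shared_gpus
print(f"dedicated: {dedicated_gpus} GPUs, shared: {shared_gpus} GPUs, "
      f"reduction: {cost_reduction:.1f}x")
```

Targeting 90% instead of 85% gives the same fleet size here, since both round up to 5 GPUs.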
Full Cost Comparison: Self-Hosted vs API
The build-vs-buy decision depends on scale. Here's a comparison for serving a 70B-parameter model at different monthly token volumes. The API monthly figures correspond to an even split between input and output tokens:
| Metric | API (GPT-4o) | Self-Hosted (Llama-70B, 4×A100) | Self-Hosted + Optimized |
|---|---|---|---|
| Input cost / 1M tokens | $2.50 | $1.46 | $0.68 (INT4 + caching) |
| Output cost / 1M tokens | $10.00 | $1.46 | $0.68 |
| Monthly cost @ 1B tokens | $6,250 | $6,453 | $3,024 |
| Monthly cost @ 10B tokens | $62,500 | $6,453* | $4,966* |
| Engineering effort | ~0 (API calls) | 1-2 engineers | 2-3 engineers |
| Latency control | None | Full | Full |
| Data privacy | Shared infra | Full control | Full control |
* Self-hosted costs plateau only while the fixed GPU fleet has headroom; you add GPUs once throughput is saturated. At 2,400 tok/s and 70% utilization, 4×A100 serves about 4.4B tokens/month, so 10B tokens/month needs roughly 8-12 GPUs, raising cost to ~$13-19K/month.
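The step-function nature of self-hosted cost is easier to see in code. The sketch below compares a per-token API bill with a self-hosted fleet that grows in whole 4×A100 replicas. It uses the prices from the table and the 2,400 tok/s, 70% utilization figures from the cost model; the 50/50 input/output split, the replica-granular scaling, and the function names are assumptions made for illustration. Engineering cost is not included.

```python
import math

HOURS_PER_MONTH = 730


def api_monthly_cost(tokens, input_price=2.50, output_price=10.00, input_share=0.5):
    """API bill for a month, with prices in $ per 1M tokens."""
    blended_price = input_share * input_price + (1 - input_share) * output_price
    return tokens / 1e6 * blended_price


def self_hosted_monthly_cost(
    tokens, gpus_per_replica=4, gpu_rate=2.21, replica_tps=2400, utilization=0.70
):
    """Fleet cost for a month; capacity is added one full replica at a time."""
    capacity_per_replica = replica_tps * utilization * 3600 * HOURS_PER_MONTH
    replicas = max(1, math.ceil(tokens / capacity_per_replica))
    return replicas * gpus_per_replica * gpu_rate * HOURS_PER_MONTH


for volume in (1e9, 4e9, 10e9):
    print(
        f"{volume / 1e9:4.0f}B tokens: API ${api_monthly_cost(volume):>9,.0f}  "
        f"self-hosted ${self_hosted_monthly_cost(volume):>9,.0f}"
    )

# Volume at which a single replica costs the same as the API
break_even_tokens = self_hosted_monthly_cost(1) / api_monthly_cost(1e6) * 1e6
```

Under these assumptions the break-even sits just above 1B tokens/month, and at 10B tokens/month the fleet has grown to three replicas (12 GPUs), consistent with the upper end of the footnote's range.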
Optimization stack summary: cumulative cost reduction from a $3.00/1M token baseline, with each step applied to the result of the previous one:
- INT4 quantization: -50% → $1.50/1M (fewer GPUs needed, higher throughput)
- Continuous batching: -60% → $0.60/1M (53× throughput vs no batching)
- Prefix caching: -40% → $0.36/1M (workload-dependent)
- Spot instances (hybrid): -20% → $0.29/1M
- Multi-tenancy: -30% → $0.20/1M
Total potential reduction: 93% from naive single-request, FP16, on-demand serving. In practice, achieving 80-85% reduction is realistic for high-volume production workloads.
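The reductions in the list compound multiplicatively rather than adding up, which is why five steps totalling 200 percentage points yield a 93% reduction rather than something impossible. A short sketch of the arithmetic, using the percentages from the list as given:

```python
baseline = 3.00  # $/1M tokens: naive single-request, FP16, on-demand serving

# (optimization, fractional reduction applied to the current cost)
steps = [
    ("INT4 quantization", 0.50),
    ("Continuous batching", 0.60),
    ("Prefix caching", 0.40),
    ("Spot instances (hybrid)", 0.20),
    ("Multi-tenancy", 0.30),
]

cost = baseline
for name, reduction in steps:
    cost *= 1 - reduction  # each step shrinks whatever cost remains
    print(f"{name:25s} -{reduction:.0%} -> ${cost:.2f}/1M")

total_reduction = 1 - cost / baseline
print(f"Total reduction: {total_reduction:.0%}")
```

Because each step multiplies the remaining cost, later optimizations save less in absolute terms than earlier ones, and the final figure is the same regardless of the order in which the steps are applied.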