LLM Inference Cost: Tokens/Dollar


The Cost Model: From GPU-Hours to $/Million Tokens

Every LLM inference cost calculation reduces to one equation: Cost per token = GPU cost per second ÷ tokens per second. But getting accurate numbers for both sides requires understanding the full stack — hardware throughput, batching efficiency, utilization rates, and operational overhead.

# Cost-per-token calculation
def cost_per_million_tokens(
    gpu_cost_per_hour: float,     # e.g., $2.21 for A100 80GB on AWS
    tokens_per_second: float,     # e.g., 2400 tok/s at batch=64
    utilization: float = 0.70,    # average GPU utilization
    num_gpus: int = 1
) -> float:
    gpu_cost_per_sec = (gpu_cost_per_hour * num_gpus) / 3600
    effective_tps    = tokens_per_second * utilization
    cost_per_token   = gpu_cost_per_sec / effective_tps
    return cost_per_token * 1_000_000

# Example: Llama-2 70B on 4×A100 80GB
cost_per_million_tokens(
    gpu_cost_per_hour=2.21,   # per GPU: p4d.24xlarge at $32.77/hr ÷ 8 GPUs ≈ $4.10, ÷ ~1.85 ≈ $2.21 (the 1.85 divisor presumably reflects a discounted rate)
    tokens_per_second=2400,   # vLLM, batch=64, FP16
    utilization=0.70,
    num_gpus=4
)
# = (2.21 * 4 / 3600) / (2400 * 0.7) * 1e6 = $1.46 / 1M tokens

Right-Sizing GPUs: A100 vs H100 vs A10G vs L4

Choosing the right GPU is the single highest-impact cost decision. The most expensive GPU isn't always the most cost-efficient. The key metric is tokens per dollar, not raw throughput.

GPU	VRAM	FP16 TFLOPS	Mem BW (TB/s)	Cloud $/hr	Llama-2 7B tok/s	$/1M tokens (at 100% utilization)
H100 SXM	80 GB	989	3.35	~$4.50	~4,800	$0.26
A100 80GB	80 GB	312	2.0	~$2.21	~2,900	$0.21
A10G	24 GB	125	0.6	~$1.01	~850	$0.33
L4	24 GB	121	0.3	~$0.54	~620	$0.24
T4	16 GB	65	0.3	~$0.38	~350	$0.30

Surprising result: The A100 80GB often offers the best $/token for mid-size models (7-13B) due to its favorable memory bandwidth to cost ratio. The H100 pulls ahead for large models (70B+) where compute becomes the bottleneck. For small models (<7B), the L4 wins on cost efficiency with INT8 quantization.
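The ranking can be reproduced from the table's own price and throughput columns. The sketch below is illustrative only: the dictionary and function names are made up for this example, the figures are copied from the table, and it assumes 100% utilization, which is what the table's last column works out to.

```python
# (cloud $/hr, Llama-2 7B tok/s) copied from the table above.
GPUS = {
    "H100 SXM": (4.50, 4800),
    "A100 80GB": (2.21, 2900),
    "A10G": (1.01, 850),
    "L4": (0.54, 620),
    "T4": (0.38, 350),
}


def dollars_per_million(price_per_hour: float, tokens_per_second: float) -> float:
    """$/1M tokens at 100% utilization."""
    return price_per_hour / 3600 / tokens_per_second * 1_000_000


def tokens_per_dollar(price_per_hour: float, tokens_per_second: float) -> float:
    """Tokens generated per dollar of GPU time at 100% utilization."""
    return tokens_per_second * 3600 / price_per_hour


# Cheapest $/1M tokens first.
ranking = sorted(GPUS, key=lambda name: dollars_per_million(*GPUS[name]))
for name in ranking:
    price, tps = GPUS[name]
    print(f"{name:10s} ${dollars_per_million(price, tps):.2f}/1M  "
          f"{tokens_per_dollar(price, tps) / 1e6:.1f}M tokens/$")
```

With these inputs the A100 80GB sorts first and the H100 third, which is the point of the paragraph above: price per hour and raw throughput each mislead on their own.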

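For larger models that require multiple GPUs, the economics shift. The $/1M figures below assume 100% utilization; at the 70% utilization used in the worked example above, the first row rises to $1.46:

Model	Config	GPUs	Total $/hr	Batch tok/s	$/1M tokens
Llama-2 70B FP16	4×A100 80GB (TP=4)	4	$8.84	2,400	$1.02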
Llama-2 70B FP16	4×H100 SXM (TP=4)	4	$18.00	6,200	$0.81
Llama-2 70B INT4 (AWQ)	2×A100 80GB (TP=2)	2	$4.42	1,800	$0.68
Llama-2 70B INT4 (AWQ)	1×H100 SXM	1	$4.50	1,500	$0.83
Mixtral 8×7B FP16	2×A100 80GB (TP=2)	2	$4.42	3,200	$0.38
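Utilization moves every one of these figures by the same factor, so it helps to recompute the column at more than one level. A minimal sketch, using the table's hourly prices and throughputs; the 70% figure is the assumption from the earlier worked example, and the names are illustrative.

```python
# (label, total $/hr, batch tok/s) as listed in the table above.
CONFIGS = [
    ("Llama-2 70B FP16, 4xA100", 8.84, 2400),
    ("Llama-2 70B FP16, 4xH100", 18.00, 6200),
    ("Llama-2 70B INT4, 2xA100", 4.42, 1800),
    ("Llama-2 70B INT4, 1xH100", 4.50, 1500),
    ("Mixtral 8x7B FP16, 2xA100", 4.42, 3200),
]


def cost_per_million(total_per_hour: float, tokens_per_second: float,
                     utilization: float = 1.0) -> float:
    """$/1M tokens for a whole deployment at the given average utilization."""
    return total_per_hour / 3600 / (tokens_per_second * utilization) * 1_000_000


for label, price, tps in CONFIGS:
    full = cost_per_million(price, tps)
    at_70 = cost_per_million(price, tps, utilization=0.70)
    print(f"{label:27s} 100%: ${full:.2f}   70%: ${at_70:.2f}")
```

Dropping from 100% to 70% utilization raises each cost by about 43% (1 / 0.7), so comparisons are only meaningful when every row uses the same utilization assumption.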

Batching Economics: The Throughput Multiplier

Continuous batching is the most impactful cost optimization technique. Without batching, a single H100 serving Llama-2 7B generates ~90 tokens/second. With continuous batching at batch size 64, throughput jumps to ~4,800 tokens/second — a 53× improvement.

The economics are dramatic because decode is memory-bandwidth-bound. Reading model weights costs the same whether you compute one output token or 64. By batching, you amortize the weight-read cost across all sequences in the batch.

# Batching throughput model for decode phase
def decode_throughput(
    model_size_bytes: float,      # e.g., 14e9 for 7B FP16
    mem_bandwidth: float,         # e.g., 3.35e12 for H100
    batch_size: int,
    overhead_factor: float = 1.2  # KV cache reads, activation memory
) -> float:
    # Time to read model weights once (constant per decode step)
    weight_read_time = model_size_bytes / mem_bandwidth
    # Time for KV cache reads scales with batch × sequence length
    # but is typically 10-30% of weight reads at moderate batch sizes
    total_time = weight_read_time * overhead_factor
    tokens_per_step = batch_size  # one token per sequence
    return tokens_per_step / total_time

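# H100 + Llama-2 7B FP16
# These are bandwidth-bound upper limits: the model above is linear in batch
# size, so it does not capture the flattening seen in practice
# (compare the ~4,800 tok/s quoted above for batch size 64).
decode_throughput(14e9, 3.35e12, batch_size=1)    # ~    200 tok/s
decode_throughput(14e9, 3.35e12, batch_size=32)   # ~  6,400 tok/s
decode_throughput(14e9, 3.35e12, batch_size=64)   # ~ 12,800 tok/s
decode_throughput(14e9, 3.35e12, batch_size=128)  # ~ 25,500 tok/s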

Diminishing returns: Beyond batch size ~64-128, throughput gains flatten because KV cache reads grow linearly with batch size and start competing with weight reads for memory bandwidth. Additionally, KV cache memory consumption limits maximum batch size: at 2048 tokens, each sequence needs roughly 1 GB of FP16 KV cache for a 7B model (Llama-2 7B stores about 0.5 MB per token: 2 × 32 layers × 4096 hidden × 2 bytes).
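The flattening can be made explicit by charging each decode step for KV cache reads as well as weight reads. The sketch below is a simplified bandwidth-only model, not a benchmark: the function names are illustrative, the layer count and hidden size are those of Llama-2 7B, the 1024-token average context is an assumption, and compute limits, activations, and scheduler overhead are ignored. With this shape a 2048-token sequence works out to roughly 1 GB of FP16 KV cache, the same order as the figure quoted above.

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of keys plus values cached per token (plain multi-head attention)."""
    return 2 * num_layers * hidden_size * bytes_per_value


def decode_throughput_with_kv(weight_bytes: float, mem_bandwidth: float,
                              batch_size: int, context_tokens: int,
                              kv_per_token: int) -> float:
    """Tokens/s when each step reads the weights once plus every sequence's KV cache."""
    kv_bytes = batch_size * context_tokens * kv_per_token
    step_time = (weight_bytes + kv_bytes) / mem_bandwidth
    return batch_size / step_time


def max_batch_size(gpu_mem_bytes: float, weight_bytes: float,
                   context_tokens: int, kv_per_token: int) -> int:
    """Largest batch whose KV cache fits next to the weights (ignores activations)."""
    return int((gpu_mem_bytes - weight_bytes) // (context_tokens * kv_per_token))


KV_PER_TOKEN = kv_bytes_per_token(num_layers=32, hidden_size=4096)  # 524,288 bytes

# H100 (3.35 TB/s), 14 GB of FP16 weights, assumed 1024-token average context.
for batch in (1, 32, 64, 96):
    tps = decode_throughput_with_kv(14e9, 3.35e12, batch, 1024, KV_PER_TOKEN)
    print(f"batch={batch:3d}: {tps:,.0f} tok/s")

print("max batch, 2048-token contexts, 80 GB:",
      max_batch_size(80e9, 14e9, 2048, KV_PER_TOKEN))
```

In this model, throughput approaches mem_bandwidth / (context_tokens × kv_per_token) as the batch grows, roughly 6,200 tok/s for these inputs, so each doubling of batch size buys less than the one before.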

The cost impact of batching at different utilization levels (the figures below are consistent with a single H100 at ~$4.50/hr serving Llama-2 7B):

Low Utilization (20%)

Average batch: 2-4 requests
Effective throughput: ~400 tok/s
Cost: $3.13 / 1M tokens
GPU idle 80% of time
Common in: dev/staging, low-traffic APIs

High Utilization (80%)

Average batch: 32-64 requests
Effective throughput: ~3,800 tok/s
Cost: $0.33 / 1M tokens
GPU saturated during peak
Common in: production chat, API services

Caching ROI: Prefix, Semantic, and Prompt Caching

Caching is the second most impactful cost lever. Three types of caching apply to LLM inference, each with different ROI profiles:

1. Prefix caching (KV cache reuse): When multiple requests share the same system prompt or prefix, the KV cache for shared tokens is computed once and reused. For a 2048-token system prompt, every subsequent request with the same prefix skips prefill for those 2048 tokens; with a 256-token user query, that is about 89% of the prefill tokens on each cache hit.

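# Prefix caching ROI calculation
system_prompt_tokens = 2048
user_query_tokens    = 256
requests_per_hour    = 10_000
cache_hit_rate       = 0.85      # 85% of requests share a system prompt
prefill_price_per_m  = 0.50      # $ per 1M prefill tokens

full_prompt_tokens = system_prompt_tokens + user_query_tokens   # = 2304

# Prefill cost without caching (all tokens every time)
total_prefill_tokens = full_prompt_tokens * requests_per_hour   # = 23.04M tokens/hr

# Prefill cost with caching (only user query for cache hits)
cached_prefill   = user_query_tokens * requests_per_hour * cache_hit_rate          # = 2.176M tokens
uncached_prefill = full_prompt_tokens * requests_per_hour * (1 - cache_hit_rate)   # = 3.456M tokens
total_with_cache = cached_prefill + uncached_prefill                               # = 5.632M tokens/hr

# Savings: 75.6% reduction in prefill compute
savings_fraction = 1 - total_with_cache / total_prefill_tokens
# At $0.50/1M prefill tokens: $8.70/hr saved
dollars_saved_per_hour = (total_prefill_tokens - total_with_cache) / 1e6 * prefill_price_per_m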

2. Semantic caching: Cache entire responses for semantically similar queries. A vector similarity search (cosine similarity ≥ 0.95) determines if a cached response can be reused. Hit rates of 10-30% are common for FAQ-style workloads, saving both prefill and decode cost.

3. Prompt template compilation: For structured prompts (RAG pipelines, tool-use), pre-compute and cache the KV state for the fixed template portions. Only the dynamic context (retrieved documents, user input) needs fresh computation.

Combined caching impact: In production RAG workloads, prefix caching + semantic caching together reduce inference costs by 40-60%. The implementation cost (Redis/Memcached for KV cache, vector DB for semantic cache) is typically <5% of GPU costs.
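The semantic cache lookup described in item 2 can be sketched in a few lines. This is a toy illustration, not a production design: the `SemanticCache` class is hypothetical, the linear scan stands in for a vector database, and the caller is assumed to supply embeddings from whatever embedding model is in use. The 0.95 threshold is the one mentioned above.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached response when a stored query is similar enough."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Only reuse the response if the closest match clears the threshold.
        return best_response if best_score >= self.threshold else None


# Toy 3-dimensional embeddings, purely for illustration.
cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.0], "Reset it from Settings > Account.")
hit = cache.get([0.99, 0.05, 0.0])   # near-duplicate query: served from cache
miss = cache.get([0.6, 0.8, 0.0])    # different query: falls through to the model
```

A hit skips both prefill and decode, which is why even modest hit rates matter; the threshold trades savings against the risk of returning an answer to a question that was only superficially similar.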

Spot Instances & Preemptible GPUs

Spot/preemptible instances offer 60-70% discounts on GPU compute. For inference workloads (stateless, short-lived), spot is viable with proper architecture:

✅ Spot-Friendly Patterns

Batch inference jobs (offline processing)
Overflow capacity for peak traffic
Non-latency-sensitive workloads
Evaluation and benchmarking runs
Workloads with <5 min generation time

❌ Spot-Hostile Patterns

Primary serving fleet (SLO-bound)
Long multi-turn conversations (state loss)
Single-replica deployments
Workloads needing consistent P99 latency
Models with >5 min cold-start time

The hybrid approach works best: maintain a base fleet on reserved/on-demand instances (sized for P50 traffic) and scale out with spot instances for peak demand. With this pattern, cost savings of roughly 20-40% are realistic, depending on how much of the load lands on spot; the worked example below comes out at about 20%:

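# Hybrid spot/on-demand cost model
base_fleet = 4               # On-demand A100s for guaranteed capacity
spot_fleet_range = (0, 8)    # Spot A100s for burst (autoscaled min, max)
avg_spot_gpus = 4            # Spot GPUs running while burst capacity is active
spot_active_fraction = 0.6   # Burst capacity is active 60% of the time
on_demand_rate = 2.21        # $/hr per A100
spot_rate = 0.73             # $/hr per A100 (~67% discount)
hours_per_month = 730

# Average traffic needs 6 GPUs, peak needs 10
monthly_base_cost = base_fleet * on_demand_rate * hours_per_month                        # = $6,453
monthly_spot_cost = avg_spot_gpus * spot_rate * hours_per_month * spot_active_fraction   # = $1,279
total_hybrid = monthly_base_cost + monthly_spot_cost                                     # = $7,732/month

# Compare: all on-demand for average 6 GPUs
all_on_demand = 6 * on_demand_rate * hours_per_month   # = $9,680/month
savings = 1 - total_hybrid / all_on_demand             # = 20.1% with hybrid approach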

Multi-Tenancy Savings

Running multiple models or tenants on the same GPU infrastructure amortizes fixed costs and improves utilization. The key insight: most tenants don't peak simultaneously.

Model multiplexing: Load multiple LoRA adapters on a shared base model. A single Llama-2 70B base with 50 LoRA adapters (each ~100MB) serves 50 different fine-tuned models with only 1-3% throughput overhead from adapter switching. Without multiplexing, each model needs dedicated GPUs — 50× the cost.
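The memory side of this argument is simple arithmetic. A rough sketch, counting weights only and assuming FP16 (2 bytes per parameter) and that each dedicated deployment would hold its own full copy of the 70B weights; KV cache and activations are left out, and the names are illustrative.

```python
def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (decimal) for a dense model."""
    return params * bytes_per_param / 1e9


num_adapters = 50
adapter_mb = 100                                     # per-adapter size from the text above

base_gb = weight_memory_gb(70e9, 2)                  # Llama-2 70B in FP16
adapters_gb = num_adapters * adapter_mb / 1000       # all adapters together
shared_gb = base_gb + adapters_gb                    # one base plus every adapter
dedicated_gb = num_adapters * base_gb                # one full copy per fine-tune

print(f"shared: {shared_gb:.0f} GB, dedicated: {dedicated_gb:.0f} GB, "
      f"ratio: {dedicated_gb / shared_gb:.1f}x")
```

Under these assumptions the 50 adapters add about 5 GB to a 140 GB base, so the shared deployment needs a small fraction of the memory that 50 dedicated copies would.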

Statistical multiplexing: With 10 independent tenants averaging 40% utilization each, the combined utilization is ~85-90% (by the law of large numbers, peaks don't align). Compared to dedicated infrastructure at 40% utilization, this is a 2× cost reduction.

Multi-tenant economics: Serving 10 tenants on shared infrastructure costs ~$15K/month vs $40K+/month with dedicated GPUs per tenant — a 62% reduction. The tradeoff is noisy-neighbor risk and more complex SLO management.

Full Cost Comparison: Self-Hosted vs API

The build-vs-buy decision depends on scale. Here's a comprehensive comparison for serving a 70B parameter model at different request volumes:

Metric	API (GPT-4o)	Self-Hosted (Llama-70B, 4×A100)	Self-Hosted + Optimized
Input cost / 1M tokens	$2.50	$1.46	$0.68 (INT4 + caching)
Output cost / 1M tokens	$10.00	$1.46	$0.68
Monthly cost @ 1B tokens	$6,250	$6,453	$3,024
Monthly cost @ 10B tokens	$62,500	$6,453*	$4,966*
Engineering effort	~0 (API calls)	1-2 engineers	2-3 engineers
Latency control	None	Full	Full
Data privacy	Shared infra	Full control	Full control

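* Self-hosted cost is a step function of fleet size, not of token volume, so the starred figures only hold while the existing fleet has headroom. A 4×A100 fleet at 2,400 tok/s and 70% utilization tops out near 4.4B tokens/month, so 10B tokens/month needs roughly 8-12 GPUs, raising cost to ~$13-19K/month.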

Rule of thumb: Self-hosting becomes cost-effective at ~2-5B tokens/month for 70B models, or ~500M tokens/month for 7B models. Below that, API providers offer better economics when you factor in engineering time ($150-250K/yr per ML engineer).
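The 70B rule of thumb can be sanity-checked with a break-even calculation. This is a sketch under stated assumptions: a 50/50 input/output token mix (which gives a blended API price of $6.25/1M from the table's $2.50 and $10.00), the $6,453/month 4×A100 fleet from the table, and engineering headcount and salary picked from the ranges quoted above. It also assumes the fleet can actually serve the break-even volume; the function name is illustrative.

```python
def breakeven_tokens_per_month(fleet_cost_per_month: float, engineers: float,
                               salary_per_year: float,
                               api_price_per_million: float) -> float:
    """Monthly token volume at which the API bill equals self-hosting fixed costs."""
    monthly_fixed = fleet_cost_per_month + engineers * salary_per_year / 12
    return monthly_fixed / api_price_per_million * 1_000_000


API_BLENDED = (2.50 + 10.00) / 2   # $/1M tokens, assuming a 50/50 input/output mix

low = breakeven_tokens_per_month(6453, 1, 150_000, API_BLENDED)
high = breakeven_tokens_per_month(6453, 1.5, 200_000, API_BLENDED)
print(f"break-even: {low / 1e9:.1f}B to {high / 1e9:.1f}B tokens/month")
```

With these inputs the break-even lands at roughly 3B to 5B tokens/month, in line with the range above. Engineering salaries dominate the fixed cost, so the estimate is more sensitive to headcount than to GPU pricing.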

Optimization stack summary — cumulative cost reduction from a $3.00/1M token baseline:

INT4 quantization: -50% → $1.50/1M (fewer GPUs needed, higher throughput)
Continuous batching: -60% → $0.60/1M (53× throughput vs no batching)
Prefix caching: -40% → $0.36/1M (workload-dependent)
Spot instances (hybrid): -20% → $0.29/1M
Multi-tenancy: -30% → $0.20/1M

Total potential reduction: 93% from naive single-request, FP16, on-demand serving. In practice, achieving 80-85% reduction is realistic for high-volume production workloads.
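The cumulative figures follow from multiplying the remaining fractions rather than adding the percentages. A short sketch using the reductions listed above; the names are illustrative.

```python
# (optimization, fractional reduction) in the order listed above.
STEPS = [
    ("INT4 quantization", 0.50),
    ("Continuous batching", 0.60),
    ("Prefix caching", 0.40),
    ("Spot instances (hybrid)", 0.20),
    ("Multi-tenancy", 0.30),
]


def apply_stack(baseline: float, steps):
    """Return (name, cost after this step) pairs, compounding each reduction."""
    cost = baseline
    trail = []
    for name, reduction in steps:
        cost *= 1 - reduction
        trail.append((name, cost))
    return trail


BASELINE = 3.00   # $/1M tokens
trail = apply_stack(BASELINE, STEPS)
for name, cost in trail:
    print(f"{name:24s} ${cost:.2f}/1M")

final_cost = trail[-1][1]
total_reduction = 1 - final_cost / BASELINE
print(f"total reduction: {total_reduction:.0%}")
```

Because the reductions compound, each later step removes a smaller absolute amount: the last 30% cut is worth about $0.09/1M, while the first 50% cut is worth $1.50/1M.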