
LLM Inference Cost: Tokens/Dollar

The Cost Model: From GPU-Hours to $/Million Tokens

Every LLM inference cost calculation reduces to one equation: Cost per token = GPU cost per second ÷ tokens per second. But getting accurate numbers for both sides requires understanding the full stack — hardware throughput, batching efficiency, utilization rates, and operational overhead.

Inputs to the cost model:

  • GPU cost: $/hour per GPU × hours/month
  • Throughput: tokens/sec (batched) × utilization %
  • Utilization: request arrival rate ÷ max throughput

$ / Million Tokens = GPU cost per hour ÷ (throughput × utilization × 3600) × 1,000,000

Levers: cheaper GPUs ↓ | higher throughput ↑ | better utilization ↑ | caching ↑

# Cost-per-token calculation
def cost_per_million_tokens(
    gpu_cost_per_hour: float,     # e.g., $2.21 for A100 80GB on AWS
    tokens_per_second: float,     # e.g., 2400 tok/s at batch=64
    utilization: float = 0.70,    # average GPU utilization
    num_gpus: int = 1
) -> float:
    gpu_cost_per_sec = (gpu_cost_per_hour * num_gpus) / 3600
    effective_tps    = tokens_per_second * utilization
    cost_per_token   = gpu_cost_per_sec / effective_tps
    return cost_per_token * 1_000_000

# Example: Llama-2 70B on 4×A100 80GB
cost_per_million_tokens(
    gpu_cost_per_hour=2.21,   # per GPU; p4d.24xlarge at $32.77/hr ÷ 8 GPUs ≈ $4.10, ÷ ~1.85 (assumed discount)
    tokens_per_second=2400,   # vLLM, batch=64, FP16
    utilization=0.70,
    num_gpus=4
)
# = (2.21 * 4 / 3600) / (2400 * 0.7) * 1e6 = $1.46 / 1M tokens

Right-Sizing GPUs: A100 vs H100 vs A10G vs L4

Choosing the right GPU is the single highest-impact cost decision. The most expensive GPU isn't always the most cost-efficient. The key metric is tokens per dollar, not raw throughput.

| GPU | VRAM | FP16 TFLOPS | Mem BW (TB/s) | Cloud $/hr | Llama-2 7B tok/s | $/1M tokens |
|---|---|---|---|---|---|---|
| H100 SXM | 80 GB | 989 | 3.35 | ~$4.50 | ~4,800 | $0.26 |
| A100 80GB | 80 GB | 312 | 2.0 | ~$2.21 | ~2,900 | $0.21 |
| A10G | 24 GB | 125 | 0.6 | ~$1.01 | ~850 | $0.33 |
| L4 | 24 GB | 121 | 0.3 | ~$0.54 | ~620 | $0.24 |
| T4 | 16 GB | 65 | 0.3 | ~$0.38 | ~350 | $0.30 |

Surprising result: The A100 80GB often offers the best $/token for mid-size models (7-13B) due to its favorable memory bandwidth to cost ratio. The H100 pulls ahead for large models (70B+) where compute becomes the bottleneck. For small models (<7B), the L4 wins on cost efficiency with INT8 quantization.
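
To make the tokens-per-dollar comparison concrete, here is a minimal sketch that ranks the GPUs using only the hourly prices and Llama-2 7B throughputs from the table above. It assumes 100% utilization, which is what the table's $/1M column implies; the names `GPUS`, `tokens_per_dollar`, and `dollars_per_million` are illustrative, not from any library.

```python
# Figures copied from the single-GPU table; 100% utilization is assumed.
GPUS = {
    "H100 SXM": (4.50, 4800),    # (cloud $/hr, Llama-2 7B tok/s)
    "A100 80GB": (2.21, 2900),
    "A10G": (1.01, 850),
    "L4": (0.54, 620),
    "T4": (0.38, 350),
}


def tokens_per_dollar(cost_per_hour: float, tokens_per_second: float) -> float:
    """Tokens generated per dollar of GPU time at full utilization."""
    return tokens_per_second * 3600 / cost_per_hour


def dollars_per_million(cost_per_hour: float, tokens_per_second: float) -> float:
    """Inverse view of the same metric: $ per 1M tokens."""
    return 1_000_000 / tokens_per_dollar(cost_per_hour, tokens_per_second)


# Rank GPUs from most to least cost-efficient.
ranking = sorted(GPUS, key=lambda name: tokens_per_dollar(*GPUS[name]), reverse=True)

for name in ranking:
    cost, tps = GPUS[name]
    print(f"{name:10s} {tokens_per_dollar(cost, tps):>12,.0f} tok/$  "
          f"${dollars_per_million(cost, tps):.2f}/1M")
```

With these inputs the A100 80GB ranks first and the L4 second, ahead of the H100, even though the H100 has the highest raw throughput.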

For larger models that require multi-GPU, the economics shift:

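| Model | Config | GPUs | Total $/hr | Batch tok/s | $/1M tokens |
|---|---|---|---|---|---|
| Llama-2 70B FP16 | 4×A100 80GB (TP=4) | 4 | $8.84 | 2,400 | $1.02 |
| Llama-2 70B FP16 | 4×H100 SXM (TP=4) | 4 | $18.00 | 6,200 | $0.81 |
| Llama-2 70B INT4 (AWQ) | 2×A100 80GB (TP=2) | 2 | $4.42 | 1,800 | $0.68 |
| Llama-2 70B INT4 (AWQ) | 1×H100 SXM | 1 | $4.50 | 1,500 | $0.83 |
| Mixtral 8×7B FP16 | 2×A100 80GB (TP=2) | 2 | $4.42 | 3,200 | $0.38 |

The $/1M tokens column assumes 100% utilization, like the single-GPU table. At the 70% utilization used in the earlier worked example, the 4×A100 FP16 row comes to $1.46.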

Batching Economics: The Throughput Multiplier

Continuous batching is the most impactful cost optimization technique. Without batching, a single H100 serving Llama-2 7B generates ~90 tokens/second. With continuous batching at batch size 64, throughput jumps to ~4,800 tokens/second — a 53× improvement.

The economics are dramatic because decode is memory-bandwidth-bound. Reading model weights costs the same whether you compute one output token or 64. By batching, you amortize the weight-read cost across all sequences in the batch.

# Batching throughput model for the decode phase (memory-bandwidth-bound upper limit)
def decode_throughput(
    model_size_bytes: float,          # e.g., 14e9 for 7B FP16
    mem_bandwidth: float,             # bytes/s, e.g., 3.35e12 for H100
    batch_size: int,
    kv_bytes_per_seq: float = 0.5e9,  # assumed: ~0.5 MB/token × ~1,000 tokens of context
) -> float:
    # Time to read model weights once (constant per decode step)
    weight_read_time = model_size_bytes / mem_bandwidth
    # Time for KV cache reads scales with batch × sequence length
    kv_read_time = batch_size * kv_bytes_per_seq / mem_bandwidth
    total_time = weight_read_time + kv_read_time
    tokens_per_step = batch_size      # one token per sequence
    return tokens_per_step / total_time

# H100 + Llama-2 7B FP16 (bandwidth-bound ceilings; measured throughput is lower)
decode_throughput(14e9, 3.35e12, batch_size=1)    # ~  230 tok/s
decode_throughput(14e9, 3.35e12, batch_size=32)   # ~3,570 tok/s
decode_throughput(14e9, 3.35e12, batch_size=64)   # ~4,660 tok/s
decode_throughput(14e9, 3.35e12, batch_size=128)  # ~5,500 tok/s (diminishing returns)
Diminishing returns: Beyond batch size ~64-128, throughput gains flatten because KV cache reads grow linearly with batch size and start competing with weight reads for memory bandwidth. KV cache memory also caps the maximum batch size: in FP16, each 2048-token sequence needs ~1 GB of KV cache for Llama-2 7B (2 × 32 layers × 4096 hidden × 2 bytes ≈ 0.5 MB per token).
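
The memory ceiling on batch size can be estimated with a short sketch. It assumes FP16 KV values, full multi-head attention (no grouped-query attention), and the published Llama-2 7B shape of 32 layers with hidden size 4096. The helper names and the 10% VRAM reserve for activations and fragmentation are illustrative assumptions, not measured values.

```python
def kv_cache_bytes_per_token(num_layers: int, hidden_size: int,
                             bytes_per_value: int = 2) -> int:
    """One key vector and one value vector per layer, per token."""
    return 2 * num_layers * hidden_size * bytes_per_value


def max_batch_size(vram_bytes: float, model_bytes: float, context_tokens: int,
                   per_token_bytes: int, reserve_fraction: float = 0.10) -> int:
    """Largest number of full-length sequences whose KV cache fits in VRAM.

    reserve_fraction is an assumed safety margin for activations and
    allocator fragmentation.
    """
    usable = vram_bytes * (1 - reserve_fraction) - model_bytes
    return max(0, int(usable // (context_tokens * per_token_bytes)))


per_token = kv_cache_bytes_per_token(num_layers=32, hidden_size=4096)  # Llama-2 7B, FP16
per_sequence_gb = per_token * 2048 / 1e9

print(f"KV cache per token:    {per_token / 1e6:.2f} MB")
print(f"KV cache per sequence: {per_sequence_gb:.2f} GB at 2048 tokens")
print("Max batch on 80 GB:", max_batch_size(80e9, 14e9, 2048, per_token))
print("Max batch on 24 GB:", max_batch_size(24e9, 14e9, 2048, per_token))
```

Under these assumptions the per-sequence figure is on the same order as the one quoted above, and an 80 GB card tops out at a few dozen full-length sequences. Paged KV cache allocators raise the effective limit when most sequences are shorter than the maximum context.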

The cost impact of batching at different utilization levels (the figures correspond to one GPU at ~$4.50/hr, the H100 rate used earlier):

Low Utilization (20%)

  • Average batch: 2-4 requests
  • Effective throughput: ~400 tok/s
  • Cost: $3.13 / 1M tokens
  • GPU idle 80% of time
  • Common in: dev/staging, low-traffic APIs

High Utilization (80%)

  • Average batch: 32-64 requests
  • Effective throughput: ~3,800 tok/s
  • Cost: $0.33 / 1M tokens
  • GPU saturated during peak
  • Common in: production chat, API services

Caching ROI: Prefix, Semantic, and Prompt Caching

Caching is the second most impactful cost lever. Three types of caching apply to LLM inference, each with different ROI profiles:

1. Prefix caching (KV cache reuse): When multiple requests share the same system prompt or prefix, the KV cache for the shared tokens is computed once and reused. For a 2048-token system prompt followed by a 256-token user query, each cache hit skips ~89% of prefill compute (2048 of 2304 tokens).

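# Prefix caching ROI calculation
system_prompt_tokens = 2048
user_query_tokens    = 256
requests_per_hour    = 10_000
cache_hit_rate       = 0.85      # 85% of requests share a system prompt
prefill_price_per_m  = 0.50      # $ per 1M prefill tokens

full_prompt_tokens = system_prompt_tokens + user_query_tokens                      # 2,304

# Prefill without caching (all tokens every time)
total_prefill_tokens = full_prompt_tokens * requests_per_hour                      # 23.04M tokens/hr

# Prefill with caching (only the user query on cache hits)
cached_prefill   = user_query_tokens * requests_per_hour * cache_hit_rate          # 2.176M tokens
uncached_prefill = full_prompt_tokens * requests_per_hour * (1 - cache_hit_rate)   # 3.456M tokens
total_with_cache = cached_prefill + uncached_prefill                               # 5.632M tokens/hr

savings_fraction = 1 - total_with_cache / total_prefill_tokens                     # 75.6% less prefill compute
savings_per_hour = (total_prefill_tokens - total_with_cache) / 1e6 * prefill_price_per_m  # $8.70/hr saved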

2. Semantic caching: Cache entire responses for semantically similar queries. A vector similarity search (cosine similarity ≥ 0.95) determines if a cached response can be reused. Hit rates of 10-30% are common for FAQ-style workloads, saving both prefill and decode cost.

3. Prompt template compilation: For structured prompts (RAG pipelines, tool-use), pre-compute and cache the KV state for the fixed template portions. Only the dynamic context (retrieved documents, user input) needs fresh computation.
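
The semantic caching lookup in item 2 can be sketched as follows. This is a minimal in-memory illustration, not a production design: the `SemanticCache` class is hypothetical, embeddings are assumed to come from whatever embedding model the caller already uses, and the linear scan stands in for the vector index a real deployment would use.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical response cache keyed by query embedding."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold                       # minimum similarity for a hit
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, query_embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(query_embedding), response))

    def get(self, query_embedding: Sequence[float]) -> Optional[str]:
        """Return the closest cached response if it clears the threshold, else None."""
        best_score, best_response = 0.0, None
        for embedding, response in self._entries:        # linear scan: fine for a sketch
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "Our return window is 30 days.")   # toy 2-D embedding
hit = cache.get([0.99, 0.05])                            # near-duplicate query
miss = cache.get([0.7, 0.7])                             # unrelated query
```

On a hit, both prefill and decode are skipped entirely. The threshold trades hit rate against the risk of returning an answer to a subtly different question, so it is worth tuning per workload.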

Combined caching impact: In production RAG workloads, prefix caching + semantic caching together reduce inference costs by 40-60%. The implementation cost (Redis/Memcached for KV cache, vector DB for semantic cache) is typically <5% of GPU costs.

Spot Instances & Preemptible GPUs

Spot/preemptible instances offer 60-70% discounts on GPU compute. For inference workloads (stateless, short-lived), spot is viable with proper architecture:

✅ Spot-Friendly Patterns

  • Batch inference jobs (offline processing)
  • Overflow capacity for peak traffic
  • Non-latency-sensitive workloads
  • Evaluation and benchmarking runs
  • Workloads with <5 min generation time

❌ Spot-Hostile Patterns

  • Primary serving fleet (SLO-bound)
  • Long multi-turn conversations (state loss)
  • Single-replica deployments
  • Workloads needing consistent P99 latency
  • Models with >5 min cold-start time

The hybrid approach works best: maintain a base fleet on reserved/on-demand instances (sized for P50 traffic) and scale out with spot instances for peak demand. With this pattern, realistic savings range from about 20% against an on-demand fleet sized for average traffic to about 50% against one provisioned for peak:

# Hybrid spot/on-demand cost model
base_fleet      = 4       # On-demand A100s for guaranteed capacity
max_spot_fleet  = 8       # Spot A100s for burst (autoscaled between 0 and 8)
on_demand_rate  = 2.21    # $/hr per A100
spot_rate       = 0.73    # $/hr per A100 (~67% discount)
hours_per_month = 730

# Average traffic needs 6 GPUs, peak needs 10.
# Assume 4 spot GPUs are running 60% of the time.
monthly_base_cost = base_fleet * on_demand_rate * hours_per_month   # = $6,453
monthly_spot_cost = 4 * spot_rate * hours_per_month * 0.6           # = $1,279
total_hybrid = monthly_base_cost + monthly_spot_cost                # = $7,732/month

# Compare: all on-demand for average 6 GPUs
all_on_demand = 6 * on_demand_rate * hours_per_month                # = $9,680/month
savings = 1 - total_hybrid / all_on_demand                          # = 20.1% with hybrid approach

Multi-Tenancy Savings

Running multiple models or tenants on the same GPU infrastructure amortizes fixed costs and improves utilization. The key insight: most tenants don't peak simultaneously.

Model multiplexing: Load multiple LoRA adapters on a shared base model. A single Llama-2 70B base with 50 LoRA adapters (each ~100MB) serves 50 different fine-tuned models with only 1-3% throughput overhead from adapter switching. Without multiplexing, each model needs dedicated GPUs — 50× the cost.
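
The memory side of this argument is simple arithmetic, sketched below. It assumes FP16 weights (2 bytes per parameter) and counts weight memory only, ignoring KV cache and activations; the function name is illustrative.

```python
BASE_MODEL_BYTES = 70e9 * 2   # 70B parameters in FP16
ADAPTER_BYTES = 100e6         # ~100 MB per LoRA adapter (figure from the text)
NUM_MODELS = 50


def weight_memory_gb(num_models: int, shared_base: bool) -> float:
    """Total weight memory in GB for num_models fine-tuned variants."""
    if shared_base:
        # One copy of the base model plus one small adapter per variant.
        total = BASE_MODEL_BYTES + num_models * ADAPTER_BYTES
    else:
        # Each variant deployed as its own full copy of the weights.
        total = num_models * BASE_MODEL_BYTES
    return total / 1e9


shared_gb = weight_memory_gb(NUM_MODELS, shared_base=True)
dedicated_gb = weight_memory_gb(NUM_MODELS, shared_base=False)
ratio = dedicated_gb / shared_gb

print(f"Shared base:    {shared_gb:,.0f} GB")
print(f"Dedicated:      {dedicated_gb:,.0f} GB")
print(f"Memory ratio:   {ratio:.1f}x")
```

The adapters add only a few percent on top of the base weights, which is why the dedicated deployment needs close to 50 times the weight memory.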

Statistical multiplexing: With 10 independent tenants averaging 40% utilization each, the combined utilization is ~85-90% (by the law of large numbers, peaks don't align). Compared to dedicated infrastructure at 40% utilization, this is a 2× cost reduction.

Multi-tenant economics: Serving 10 tenants on shared infrastructure costs ~$15K/month vs $40K+/month with dedicated GPUs per tenant — a 62% reduction. The tradeoff is noisy-neighbor risk and more complex SLO management.

Full Cost Comparison: Self-Hosted vs API

The build-vs-buy decision depends on scale. Here's a comprehensive comparison for serving a 70B parameter model at different request volumes:

| Metric | API (GPT-4o) | Self-Hosted (Llama-70B, 4×A100) | Self-Hosted + Optimized |
|---|---|---|---|
| Input cost / 1M tokens | $2.50 | $1.46 | $0.68 (INT4 + caching) |
| Output cost / 1M tokens | $10.00 | $1.46 | $0.68 |
| Monthly cost @ 1B tokens | $6,250 | $6,453 | $3,024 |
| Monthly cost @ 10B tokens | $62,500 | $6,453* | $4,966* |
| Engineering effort | ~0 (API calls) | 1-2 engineers | 2-3 engineers |
| Latency control | None | Full | Full |
| Data privacy | Shared infra | Full control | Full control |

* Self-hosted costs plateau because GPU fleet is fixed — you add GPUs only when throughput is saturated. At 10B tokens/month, you may need 8-12 GPUs, raising cost to ~$13-19K/month.

Rule of thumb: Self-hosting becomes cost-effective at ~2-5B tokens/month for 70B models, or ~500M tokens/month for 7B models. Below that, API providers offer better economics when you factor in engineering time ($150-250K/yr per ML engineer).
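
A rough break-even check for the 70B case, using the table's figures, might look like this. The assumptions are a 50/50 input/output token mix (which is what the table's $6,250 API cost at 1B tokens implies), a fixed 4×A100 fleet at $2.21/hr per GPU, and one engineer at $200K/yr taken from the range quoted above. The function name is illustrative.

```python
def breakeven_tokens_per_month(fleet_cost_per_month: float,
                               engineering_cost_per_month: float,
                               api_price_per_million: float) -> float:
    """Monthly token volume above which a fixed self-hosted fleet beats the API."""
    fixed_cost = fleet_cost_per_month + engineering_cost_per_month
    return fixed_cost / api_price_per_million * 1_000_000


API_BLENDED_PRICE = 0.5 * 2.50 + 0.5 * 10.00   # $/1M tokens at a 50/50 input/output mix
FLEET_COST = 4 * 2.21 * 730                    # 4×A100 on-demand, $/month
ENGINEER_COST = 200_000 / 12                   # one ML engineer, $/month (assumed)

gpu_only = breakeven_tokens_per_month(FLEET_COST, 0.0, API_BLENDED_PRICE)
with_engineer = breakeven_tokens_per_month(FLEET_COST, ENGINEER_COST, API_BLENDED_PRICE)

print(f"Break-even, GPU cost only:     {gpu_only / 1e9:.2f}B tokens/month")
print(f"Break-even, plus one engineer: {with_engineer / 1e9:.2f}B tokens/month")
```

Counting GPU cost alone, break-even lands near 1B tokens/month; adding one engineer pushes it into the 2-5B range of the rule of thumb. The fleet's capacity matters too: at 2,400 tok/s and 70% utilization, four A100s serve roughly 4.4B tokens/month before more GPUs are needed.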

Optimization stack summary: relative to a $3.00/1M token baseline (naive single-request, FP16, on-demand serving), the total potential reduction from stacking these optimizations is 93%. In practice, an 80-85% reduction is realistic for high-volume production workloads.