Rate Limiting & Throttling
Why Rate Limit?
Every public-facing API that doesn't enforce rate limits is an incident waiting to happen. Rate limiting is the practice of controlling the number of requests a client can make to a service within a given time window. It sits at the intersection of reliability engineering, security, and cost management.
Core Motivations
| Goal | What It Prevents |
|---|---|
| Prevent Abuse | Malicious actors flooding your API with millions of requests, scraping data, or brute-forcing authentication endpoints. |
| Protect Resources | A single runaway client shouldn't saturate your database connections, exhaust thread pools, or crash downstream services. |
| Ensure Fair Usage | Without limits, one aggressive client can starve all other users of capacity (the "noisy neighbor" problem). |
| DDoS Mitigation | Rate limiting is the first line of defense. While it can't stop a massive volumetric attack, it handles application-layer (L7) floods effectively. |
| Cost Control | Cloud resources cost money. Unbounded traffic means unbounded bills. Rate limits cap the financial exposure. |
Where to Rate Limit
Rate limiting can be applied at multiple layers in your architecture, and in practice you should use multiple layers:
```
Client → CDN/WAF (IP-level)             e.g. Cloudflare, AWS WAF
       → API Gateway (user-level)       e.g. Kong, Envoy, nginx
       → Application (endpoint-level)   e.g. middleware / decorator
       → Database (connection pooling)  e.g. PgBouncer, ProxySQL
```
- API Gateway / Reverse Proxy — The most common place. Catches abuse before it reaches your application code. Configure in Kong, Envoy, nginx, or AWS API Gateway.
- Application Level — Per-endpoint rate limits (e.g., login endpoint gets 5 req/min, search gets 100 req/min). Implemented as middleware or decorators; a minimal sketch follows this list.
- Per-User / Per-API-Key — Identify the caller via API key, JWT, or OAuth token. Essential for multi-tenant SaaS platforms.
- Per-IP — Simplest approach but problematic with NAT (thousands of users behind one IP). Good for anonymous endpoints.
- Global — Hard ceiling on total system throughput to prevent cascading failures.
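To make the application-level option concrete, here is a minimal per-endpoint decorator sketch in Python. It is single-process and in-memory only (production systems back this with Redis, as shown later), and the `rate_limit` decorator and `RateLimitExceeded` exception are illustrative names, not any specific framework's API:

```python
import time
from functools import wraps

class RateLimitExceeded(Exception):
    pass

def rate_limit(max_requests: int, per_seconds: float):
    """Per-endpoint fixed-window limiter, keyed by the decorated function.

    In-memory and not thread-safe — a sketch of the idea only.
    """
    state = {"window": 0, "count": 0}

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            window = int(time.time() // per_seconds)
            if window != state["window"]:
                # New window: reset the counter
                state["window"], state["count"] = window, 0
            state["count"] += 1
            if state["count"] > max_requests:
                raise RateLimitExceeded(
                    f"{fn.__name__}: over {max_requests} req / {per_seconds}s"
                )
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@rate_limit(max_requests=5, per_seconds=60)
def login(username: str, password: str):
    ...
```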
Token Bucket Algorithm
The token bucket is the most widely used rate-limiting algorithm, adopted by AWS, Stripe, and most API gateways. It's elegant because it naturally allows bursts while enforcing a long-term average rate.
How It Works
- A bucket holds tokens. The bucket has a maximum capacity (e.g., 10 tokens).
- Tokens are added at a fixed refill rate (e.g., 1 token per second).
- Each incoming request must consume one token. If a token is available, the request proceeds. If not, the request is rejected (or queued).
- If the bucket is full, new tokens are discarded (the bucket can't overflow).
Parameters
| Parameter | Description | Example |
|---|---|---|
| `bucket_capacity` | Maximum burst size | 10 |
| `refill_rate` | Tokens added per second | 2 tokens/sec |
Implementation
```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity          # start full
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill = now

    def allow_request(self, tokens: int = 1) -> bool:
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

# Usage: 10 requests max burst, 2 requests/sec sustained
limiter = TokenBucket(capacity=10, refill_rate=2)

for i in range(15):
    if limiter.allow_request():
        print(f"Request {i}: ✅ Allowed")
    else:
        print(f"Request {i}: ❌ Rate limited")
```
*Animation: tokens refill, get consumed by requests, and requests are rejected when the bucket is empty.*
Leaky Bucket Algorithm
The leaky bucket is the dual of the token bucket. Instead of tokens controlling admission, a queue (the bucket) holds incoming requests, and they "leak" out at a fixed rate — like water dripping through a hole at the bottom of a bucket.
How It Works
- Incoming requests are added to a FIFO queue (the bucket).
- The queue has a maximum size. If the queue is full, new requests are dropped.
- Requests are processed (leaked) at a constant rate, regardless of how fast they arrive.
Key Difference from Token Bucket
| Aspect | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst handling | Allows bursts (up to bucket capacity) | Smooths out bursts completely |
| Output rate | Variable (matches input up to limit) | Constant (always fixed rate) |
| When bucket is full | Reject immediately | Drop from queue |
| Use case | API rate limiting (most common) | Traffic shaping, network QoS |
Implementation
```python
import time
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests processed per second
        self.queue = deque()
        self.last_leak = time.monotonic()

    def _leak(self):
        now = time.monotonic()
        elapsed = now - self.last_leak
        leaked = int(elapsed * self.leak_rate)
        for _ in range(min(leaked, len(self.queue))):
            self.queue.popleft()
        if leaked > 0:
            self.last_leak = now

    def allow_request(self, request_id: str) -> bool:
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request_id)
            return True   # queued (will be processed at leak rate)
        return False      # queue full, dropped

# Queue up to 5 requests, process 2 per second
limiter = LeakyBucket(capacity=5, leak_rate=2)
```
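A quick usage sketch against the limiter just defined: a burst of 8 simultaneous requests fills the 5-slot queue, and the rest are dropped:

```python
# Burst of 8 requests arrives at once: the first 5 are queued,
# the remaining 3 are dropped (queue is at capacity).
for i in range(8):
    status = "queued" if limiter.allow_request(f"req-{i}") else "dropped"
    print(f"Request {i}: {status}")
```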
Fixed Window Counter
The simplest rate-limiting algorithm: divide time into fixed windows and count requests per window.
How It Works
- Time is divided into fixed windows (e.g., each minute: 12:00:00–12:00:59, 12:01:00–12:01:59).
- Each window has a counter, starting at 0.
- On each request, increment the counter for the current window.
- If the counter exceeds the limit, reject the request.
- When a new window starts, the counter resets to 0.
Implementation with Redis
```python
import redis
import time

r = redis.Redis()

def fixed_window_rate_limit(user_id: str, limit: int, window_seconds: int) -> bool:
    """Returns True if request is allowed, False if rate limited."""
    # Key includes the window timestamp
    window = int(time.time() // window_seconds)
    key = f"rate:{user_id}:{window}"

    # Atomic increment
    current = r.incr(key)

    # Set expiry only on first request of the window
    if current == 1:
        r.expire(key, window_seconds + 1)  # +1 for clock skew

    return current <= limit

# Allow 100 requests per minute
allowed = fixed_window_rate_limit("user_123", limit=100, window_seconds=60)
```
The Boundary Problem
Fixed windows have a critical flaw: a user can make 2× the allowed requests in a short burst spanning the window boundary.
```
Window 1: 12:00:00 ──────────── 12:00:59 │ Window 2: 12:01:00 ──────────── 12:01:59
                                         │
        ... idle ...          100 reqs   │   100 reqs          ... idle ...
                        (12:00:58–59)    │   (12:01:00–01)
                                         │
                → 200 requests in ~2 seconds! (limit was 100/min)
```
Both windows are individually within limits, but the user effectively sent 200 requests in 2 seconds. This is why fixed windows are rarely used in production for strict rate limiting.
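To see the flaw concretely, here is a demonstration sketch against the `fixed_window_rate_limit` function above. It assumes a local Redis and that each batch of 100 calls completes within a couple of seconds; the function and user ID are illustrative:

```python
import time

def demo_boundary_burst():
    """Fire 100 requests just before a window boundary and 100 just after.

    Every one of the 200 is allowed, even though the limit is 100/min.
    """
    window = 60
    # Sleep until ~1 second before the next window boundary
    time.sleep(max(0, window - (time.time() % window) - 1))
    first = sum(fixed_window_rate_limit("burst_demo", 100, window)
                for _ in range(100))
    time.sleep(2)  # cross into the next window
    second = sum(fixed_window_rate_limit("burst_demo", 100, window)
                 for _ in range(100))
    print(f"Allowed {first + second} of 200 requests")  # prints 200
```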
Sliding Window Log
The sliding window log fixes the boundary problem by tracking every request timestamp and counting requests in a trailing window.
How It Works
- For each user, maintain a sorted set of request timestamps.
- On each new request, remove timestamps older than `now - window_size`.
- Count the remaining entries. If the count is below the limit, allow the request and add its timestamp; otherwise, reject it.
Redis Implementation
```python
import redis
import time
import uuid

r = redis.Redis()

def sliding_window_log(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    key = f"rate:swl:{user_id}"
    window_start = now - window_seconds

    # Use a pipeline for atomicity
    pipe = r.pipeline()
    # Remove timestamps outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Count remaining requests in the window
    pipe.zcard(key)
    # Add current request timestamp (score = timestamp, member = unique ID)
    pipe.zadd(key, {str(uuid.uuid4()): now})
    # Set TTL to auto-cleanup
    pipe.expire(key, window_seconds + 1)
    results = pipe.execute()

    request_count = results[1]  # zcard result

    if request_count >= limit:
        # Over the limit: remove the entry we just added
        pipe = r.pipeline()
        pipe.zpopmax(key)
        pipe.execute()
        return False
    return True

# 100 requests per 60-second sliding window
allowed = sliding_window_log("user_123", limit=100, window_seconds=60)
```
Lua Script (Fully Atomic)
```lua
-- sliding_window_log.lua
-- KEYS[1] = rate limit key
-- ARGV[1] = window size in seconds
-- ARGV[2] = max requests
-- ARGV[3] = current timestamp
-- ARGV[4] = unique request ID

local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local req_id = ARGV[4]

-- Remove expired entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current entries
local count = redis.call('ZCARD', key)

if count < limit then
    -- Add new request
    redis.call('ZADD', key, now, req_id)
    redis.call('EXPIRE', key, window + 1)
    return 1  -- allowed
else
    return 0  -- rejected
end
```
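For completeness, a Python caller for this script might look like the sketch below. It assumes redis-py and that the script is saved as `sliding_window_log.lua` next to the application; the helper name `swl_allowed` is illustrative:

```python
import time
import uuid

import redis

r = redis.Redis()
with open("sliding_window_log.lua") as f:
    SWL_SCRIPT = r.register_script(f.read())

def swl_allowed(user_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    # ARGV order matches the script: window, limit, timestamp, request ID
    result = SWL_SCRIPT(
        keys=[f"rate:swl:{user_id}"],
        args=[window_seconds, limit, time.time(), str(uuid.uuid4())],
    )
    return result == 1
```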
Sliding Window Counter
The sliding window counter is the sweet spot: it combines the accuracy of sliding windows with the memory efficiency of fixed windows. This is the algorithm Cloudflare uses at scale.
How It Works
Instead of storing individual timestamps, use a weighted average of the current and previous window's counts:
```
Effective count = (previous_window_count × overlap_percentage) + current_window_count
```
Example:
```
Window size:    60 seconds
Current time:   12:01:15 (15 seconds into the current window)

Previous window (12:00:00–12:00:59): 84 requests
Current window  (12:01:00–12:01:59): 36 requests

Overlap of previous window: (60 − 15) / 60 = 75%
Effective count = 84 × 0.75 + 36 = 63 + 36 = 99

If limit is 100 → this request is allowed (99 < 100)
```
Implementation
```python
import redis
import time

r = redis.Redis()

def sliding_window_counter(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    current_window = int(now // window_seconds)
    previous_window = current_window - 1

    current_key = f"rate:swc:{user_id}:{current_window}"
    previous_key = f"rate:swc:{user_id}:{previous_window}"

    # Get counts for both windows
    pipe = r.pipeline()
    pipe.get(current_key)
    pipe.get(previous_key)
    results = pipe.execute()

    current_count = int(results[0] or 0)
    previous_count = int(results[1] or 0)

    # Calculate elapsed fraction of current window
    elapsed = now - (current_window * window_seconds)
    weight = 1 - (elapsed / window_seconds)

    # Weighted count
    effective_count = previous_count * weight + current_count
    if effective_count >= limit:
        return False

    # Increment current window
    pipe = r.pipeline()
    pipe.incr(current_key)
    pipe.expire(current_key, window_seconds * 2)
    pipe.execute()
    return True

# 100 requests per minute, only 2 Redis keys per user
allowed = sliding_window_counter("user_123", limit=100, window_seconds=60)
```
Why This Is the Best Trade-Off
| Algorithm | Memory per User | Accuracy | Burst Handling |
|---|---|---|---|
| Fixed Window | 8 bytes | Poor (2× at boundary) | Allows boundary bursts |
| Sliding Log | O(limit) entries | Perfect | Precise |
| Sliding Counter | 16 bytes | ~99.7% accurate | Smooth |
| Token Bucket | 16 bytes | Good | Allows controlled bursts |
*Animation: the boundary problem with fixed windows, and how sliding windows solve it.*
Distributed Rate Limiting
In a single-server world, rate limiting is trivial — an in-memory counter works fine. But modern architectures run dozens or hundreds of API servers behind a load balancer. The fundamental challenge: how do all servers agree on the request count?
The Problem
```
                    ┌──────────────┐
Request ──────────▶ │ API Server 1 │  Local counter: 45
                    └──────────────┘
                    ┌──────────────┐
Request ──────────▶ │ API Server 2 │  Local counter: 38
                    └──────────────┘
                    ┌──────────────┐
Request ──────────▶ │ API Server 3 │  Local counter: 42
                    └──────────────┘

Each server thinks the user made ~40 requests,
but the REAL total is 45 + 38 + 42 = 125!
If the limit is 100, we've allowed 25% over-limit.
```
Solution: Centralized Store (Redis)
Use Redis as a shared counter that all API servers query atomically:
```
                    ┌──────────────┐
Request ──────────▶ │ API Server 1 │ ────┐
                    └──────────────┘     │
                    ┌──────────────┐     │
Request ──────────▶ │ API Server 2 │ ────┼──▶ Redis (single counter: 125)
                    └──────────────┘     │
                    ┌──────────────┐     │
Request ──────────▶ │ API Server 3 │ ────┘
                    └──────────────┘
```
Atomic Rate Limiting with Lua
The classic INCR + EXPIRE pattern has a race condition. If the server crashes between INCR and EXPIRE, the key lives forever. Use a Lua script for atomicity:
```lua
-- rate_limit.lua — Atomic fixed-window rate limiter
-- KEYS[1] = rate limit key
-- ARGV[1] = max requests
-- ARGV[2] = window size in seconds

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
    -- First request: set the expiry
    redis.call('EXPIRE', key, window)
end

if current > limit then
    return 0  -- rejected
else
    return 1  -- allowed
end
```
```python
# Python caller
import redis
import time

r = redis.Redis()

# Load the Lua script
RATE_LIMIT_SCRIPT = r.register_script("""
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return 0
else
    return 1
end
""")

def is_allowed(user_id: str, limit: int = 100, window: int = 60) -> bool:
    window_id = int(time.time() // window)
    key = f"rl:{user_id}:{window_id}"
    result = RATE_LIMIT_SCRIPT(keys=[key], args=[limit, window])
    return result == 1
```
Token Bucket with Redis
```lua
-- token_bucket.lua — Distributed token bucket
-- KEYS[1] = bucket key
-- ARGV[1] = capacity
-- ARGV[2] = refill rate (tokens/sec)
-- ARGV[3] = current timestamp (seconds, float)
-- ARGV[4] = tokens to consume

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

-- Get current state
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens
local elapsed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + elapsed * refill_rate)

-- Try to consume
if tokens >= requested then
    tokens = tokens - requested
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return 1  -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return 0  -- rejected
end
```
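A matching Python caller, as a sketch: it assumes redis-py and that the script is saved as `token_bucket.lua`; the key prefix `tb:` and helper name `bucket_allowed` are illustrative:

```python
import time

import redis

r = redis.Redis()
with open("token_bucket.lua") as f:
    TOKEN_BUCKET = r.register_script(f.read())

def bucket_allowed(user_id: str, capacity: int = 10,
                   refill_rate: float = 2.0, tokens: int = 1) -> bool:
    # ARGV order matches the script: capacity, refill rate, timestamp, tokens
    result = TOKEN_BUCKET(
        keys=[f"tb:{user_id}"],
        args=[capacity, refill_rate, time.time(), tokens],
    )
    return result == 1
```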
Race Conditions and Mitigations
| Race Condition | Cause | Solution |
|---|---|---|
| INCR without EXPIRE | Crash between INCR and EXPIRE | Lua script (atomic) |
| GET-then-SET | Two servers read same count, both allow | INCR (atomic) or Lua |
| Clock skew | Servers have different clocks | Use Redis TIME, or NTP sync |
| Redis failover | Replica promoted, counters lost | Accept brief over-limit; use WAIT for sync |
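On the clock-skew row: instead of trusting each application server's clock when passing timestamps into a Lua script, you can read the clock from Redis itself, so every server sees the same time. A minimal redis-py sketch:

```python
import redis

r = redis.Redis()

# TIME is answered by the Redis server, so all app servers share one clock
seconds, microseconds = r.time()
now = seconds + microseconds / 1_000_000
# Pass `now` as the timestamp argument (ARGV[3]) of the token-bucket
# script instead of the local time.time().
```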
*Animation: multiple API servers share a Redis counter — atomic INCR + EXPIRE and 429 rejection.*
Rate Limiting in Practice
HTTP Response Headers
The standard way to communicate rate-limit status to clients. These aren't in the HTTP spec (yet) but are a de facto standard adopted by GitHub, Stripe, Twitter, and others:
```http
HTTP/1.1 200 OK
X-RateLimit-Limit: 100          ← Max requests per window
X-RateLimit-Remaining: 42       ← Requests left in current window
X-RateLimit-Reset: 1714435200   ← Unix timestamp when window resets
```
429 Too Many Requests
```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714435200

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Try again in 30 seconds.",
    "retry_after": 30,
    "limit": 100,
    "remaining": 0,
    "reset_at": "2024-04-30T00:00:00Z"
  }
}
```
Implementing in Express.js Middleware
```javascript
const Redis = require('ioredis');
const redis = new Redis();

const RATE_LIMIT_LUA = `
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return {0, current, tonumber(redis.call('TTL', KEYS[1]))}
else
    return {1, current, tonumber(redis.call('TTL', KEYS[1]))}
end
`;

function rateLimit({ limit = 100, windowSeconds = 60 } = {}) {
  return async (req, res, next) => {
    const userId = req.user?.id || req.ip;
    const window = Math.floor(Date.now() / 1000 / windowSeconds);
    const key = `rl:${userId}:${window}`;

    const [allowed, count, ttl] = await redis.eval(
      RATE_LIMIT_LUA, 1, key, limit, windowSeconds
    );

    // Always set rate limit headers
    res.set('X-RateLimit-Limit', String(limit));
    res.set('X-RateLimit-Remaining', String(Math.max(0, limit - count)));
    res.set('X-RateLimit-Reset', String(
      Math.floor(Date.now() / 1000) + ttl
    ));

    if (!allowed) {
      res.set('Retry-After', String(ttl));
      return res.status(429).json({
        error: {
          code: 'rate_limit_exceeded',
          message: `Rate limit exceeded. Try again in ${ttl}s.`,
          retry_after: ttl,
        }
      });
    }
    next();
  };
}

// Usage
app.use('/api/', rateLimit({ limit: 100, windowSeconds: 60 }));
app.use('/api/login', rateLimit({ limit: 5, windowSeconds: 300 }));
```
Client-Side: Exponential Backoff with Jitter
When you receive a 429, don't retry immediately — use exponential backoff with jitter to avoid the thundering herd problem:
```python
import time
import random
import requests

def request_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response

        # Prefer the Retry-After header if present
        retry_after = response.headers.get('Retry-After')
        if retry_after:
            wait = int(retry_after)
        else:
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s
            base_wait = 2 ** attempt
            # Add jitter: random value between 0 and base_wait
            wait = base_wait + random.uniform(0, base_wait)

        print(f"Rate limited. Retrying in {wait:.1f}s (attempt {attempt + 1})")
        time.sleep(wait)

    raise Exception(f"Failed after {max_retries} retries")
```
```javascript
// JavaScript equivalent
async function fetchWithBackoff(url, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url);
    if (response.status !== 429) return response;

    const retryAfter = response.headers.get('Retry-After');
    const baseWait = retryAfter
      ? parseInt(retryAfter, 10)
      : Math.pow(2, attempt);
    const jitter = Math.random() * baseWait;
    const wait = baseWait + jitter;

    console.log(`Rate limited. Retry in ${wait.toFixed(1)}s`);
    await new Promise(r => setTimeout(r, wait * 1000));
  }
  throw new Error(`Failed after ${maxRetries} retries`);
}
```
Multi-Level Rate Limiting
Production systems rarely use a single rate limit. Instead, they layer multiple limits at different granularities:
Layered Limits
```
┌──────────────────────────────────────────────────────┐
│ Layer 1: Global                                      │
│   All traffic combined: 50,000 req/sec               │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Layer 2: Per-IP                                  │ │
│ │   Anonymous: 60 req/min per IP                   │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ Layer 3: Per-User                            │ │ │
│ │ │   Authenticated: 1,000 req/min per user      │ │ │
│ │ │ ┌──────────────────────────────────────────┐ │ │ │
│ │ │ │ Layer 4: Per-Endpoint                    │ │ │ │
│ │ │ │   POST /login:   5 req/min               │ │ │ │
│ │ │ │   GET /search:   30 req/min              │ │ │ │
│ │ │ │   GET /api/data: 1,000 req/min           │ │ │ │
│ │ │ └──────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
```
API Tier Limits
Most SaaS APIs offer tiered pricing with different rate limits:
| Tier | Rate Limit | Burst | Daily Quota |
|---|---|---|---|
| Free | 60 req/min | 10 | 1,000 |
| Pro | 600 req/min | 100 | 50,000 |
| Enterprise | 6,000 req/min | 1,000 | Unlimited |
Implementation with Multiple Buckets
```python
import redis
import time

r = redis.Redis()

# Lua script that checks multiple rate limits atomically
MULTI_LIMIT_LUA = """
-- Check all limits; reject if ANY limit is exceeded
for i = 1, #KEYS do
    local current = redis.call('INCR', KEYS[i])
    if current == 1 then
        redis.call('EXPIRE', KEYS[i], ARGV[i * 2])
    end
    if current > tonumber(ARGV[i * 2 - 1]) then
        return {0, i}  -- rejected, which limit was hit
    end
end
return {1, 0}  -- all limits passed
"""

def multi_rate_limit(user_id: str, ip: str, endpoint: str, tier: str):
    now = int(time.time())

    # Define limits per tier
    tier_limits = {
        'free': {'per_min': 60, 'per_hour': 1000},
        'pro': {'per_min': 600, 'per_hour': 50000},
        'enterprise': {'per_min': 6000, 'per_hour': 500000},
    }
    limits = tier_limits.get(tier, tier_limits['free'])

    minute_window = now // 60
    hour_window = now // 3600

    keys = [
        f"rl:ip:{ip}:{minute_window}",                  # Per-IP: 120/min
        f"rl:user:{user_id}:{minute_window}",           # Per-user: tier limit/min
        f"rl:user:{user_id}:h:{hour_window}",           # Per-user: tier limit/hour
        f"rl:ep:{endpoint}:{user_id}:{minute_window}",  # Per-endpoint
    ]
    args = [
        120, 60,                   # IP: 120 req, 60s window
        limits['per_min'], 60,     # User per minute
        limits['per_hour'], 3600,  # User per hour
        30, 60,                    # Endpoint: 30 req/min
    ]

    result = r.eval(MULTI_LIMIT_LUA, len(keys), *keys, *args)
    return result[0] == 1  # True if all limits passed
```
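A hypothetical call site, with illustrative identifiers:

```python
# Hypothetical call site — the user ID, IP, and endpoint key are illustrative
allowed = multi_rate_limit(
    user_id="user_123",
    ip="203.0.113.7",
    endpoint="GET:/search",
    tier="pro",
)
if not allowed:
    ...  # respond with 429 and a Retry-After header
```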
API Gateway Configuration (Kong)
```yaml
# kong.yml — Rate limiting plugin configuration
plugins:
  - name: rate-limiting
    config:
      minute: 100
      hour: 5000
      policy: redis              # Use Redis for distributed counting
      redis_host: redis.internal
      redis_port: 6379
      redis_timeout: 2000
      redis_database: 0
      limit_by: consumer         # Per API key / consumer
      fault_tolerant: true       # Allow traffic if Redis is down
      hide_client_headers: false

  # Stricter limit for auth endpoints
  - name: rate-limiting
    route: login-route
    config:
      minute: 5
      hour: 20
      policy: redis
      redis_host: redis.internal
      limit_by: ip
```
Envoy Proxy Configuration
```yaml
# envoy.yaml — Rate limit filter
http_filters:
  - name: envoy.filters.http.ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
      domain: my_api
      rate_limit_service:
        grpc_service:
          envoy_grpc:
            cluster_name: rate_limit_cluster
        transport_api_version: V3
```

```yaml
# Rate limit service descriptor configuration
domain: my_api
descriptors:
  - key: user_id
    rate_limit:
      unit: minute
      requests_per_unit: 100
  - key: user_id
    value: free_tier
    rate_limit:
      unit: minute
      requests_per_unit: 10
```
One final operational decision for any distributed limiter: if the backing store (Redis, the rate limit service) is unreachable, do you fail open (let traffic through unthrottled) or fail closed (reject everything)? Failing open is the usual choice, since a rate-limiter outage shouldn't become a full API outage — in the Kong configuration above, `fault_tolerant: true` does exactly this.