Design: Rate Limiter
The Problem: Protecting APIs from Abuse
Every production API faces the same existential threat: uncontrolled traffic. A single misbehaving client—whether a buggy mobile app stuck in a retry loop, a data scraper harvesting your catalogue, or a deliberate DDoS attack—can overwhelm your servers, spike cloud bills, and degrade service for everyone.
A rate limiter is the gatekeeper that controls how many requests a client can make within a time window. It’s the first line of defence and one of the most commonly asked system design interview questions because it touches on distributed systems, data structures, caching, atomicity, and API design all in one problem.
| Goal | What It Prevents |
|---|---|
| Prevent abuse | Brute-force attacks, credential stuffing, API scraping, denial-of-service. |
| Fair usage | One tenant consuming all resources in a multi-tenant system. |
| Cost control | Runaway cloud spend when traffic spikes unexpectedly. |
| Stability | Cascading failures when backend services are overloaded. |
| Compliance | Enforcing contractual SLAs and tier-based API plans. |
Requirements
Functional Requirements
- Configurable rules — Limit requests by user ID, IP address, API endpoint, or any combination. Rules should be expressible in a declarative format (YAML/JSON).
- Multiple algorithms — Support token bucket, sliding window log, sliding window counter, and fixed window counter depending on the use case.
- Accurate counting — In a distributed deployment across N servers, the total count for a key must be globally consistent, not per-server.
- Informative responses — When a request is throttled, return HTTP 429 with Retry-After and X-RateLimit-* headers so clients can back off intelligently.
- Hot-reload rules — Changing rate limit configuration should not require a deployment or server restart.
Non-Functional Requirements
- Low latency — The rate limit check must add <5ms to request latency (p99). This rules out round-trips to relational databases for every request.
- High availability — If the rate limiter is down, the system should degrade gracefully (fail-open) rather than reject all traffic.
- Distributed — Must work across multiple API gateway instances sharing global counters.
- Scalable — Handle millions of unique rate-limit keys without memory pressure on any single node.
- Minimal false positives — Legitimate users should almost never be incorrectly throttled.
Back-of-the-Envelope Estimation
Let’s size a rate limiter for a mid-scale SaaS API:
| Metric | Value |
|---|---|
| Active users | 1,000,000 |
| Avg requests per user per day | 500 |
| Total requests per day | 500M |
| Average RPS | ~5,800 req/s |
| Peak RPS (3× average) | ~17,400 req/s |
| Rate limit check latency budget | <5ms p99 |
| Redis keys (per-user counters, 1-min window) | ~1M keys |
| Memory per key (key + counter + TTL) | ~100 bytes → ~100 MB total |
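These figures are simple arithmetic; a quick sanity-check script using the inputs from the table above:
# Back-of-the-envelope check for the table above
users = 1_000_000
requests_per_user_per_day = 500

total_per_day = users * requests_per_user_per_day      # 500,000,000
avg_rps = total_per_day / 86_400                       # ~5,787 req/s
peak_rps = 3 * avg_rps                                 # ~17,361 req/s

bytes_per_key = 100                                    # key name + counter + TTL
redis_memory_mb = users * bytes_per_key / 1_000_000    # ~100 MB

print(f"avg {avg_rps:,.0f} req/s, peak {peak_rps:,.0f} req/s, ~{redis_memory_mb:.0f} MB in Redis")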
Where to Place the Rate Limiter
There are three possible locations, each with trade-offs:
Option 1: Client-Side
The client throttles itself before making requests. This is unreliable because malicious or buggy clients can bypass the throttle. You cannot trust the client.
Option 2: Server-Side (Application Layer)
Each API server checks a shared counter (Redis) before processing a request. This gives you full control and access to application-level context (authenticated user ID, subscription tier), but it means every service must implement rate limiting logic.
Option 3: Middleware / API Gateway (Recommended)
Place the rate limiter as middleware in an API gateway that sits in front of all backend services. This is the industry-standard approach used by AWS API Gateway, Kong, Nginx, Envoy, and Cloudflare.
| Placement | Pros | Cons |
|---|---|---|
| Client | Reduces server load | Can’t be trusted; easily bypassed |
| Server | Full application context; flexible | Logic scattered across services; inconsistent enforcement |
| API Gateway | Centralized; language-agnostic; handled before auth/routing | Less application context; additional hop |
High-Level Architecture
The architecture follows a request’s journey from client to backend, with the rate limiter middleware intercepting every request before it reaches business logic.
[Interactive diagram: Distributed Rate Limiter Architecture — a request’s lifecycle: client → API Gateway → Redis counter check → allow or reject (429).]
Component Breakdown
- Clients — Mobile apps, web browsers, third-party integrations sending HTTP requests.
- API Gateway — Entry point (Nginx, Kong, Envoy, AWS API GW). Runs rate limiter middleware before request routing.
- Rate Limiter Middleware — Extracts the rate-limit key (IP, user ID, API key), queries Redis for the current counter, and decides allow or reject.
- Redis — Shared, in-memory counter store. Sub-millisecond reads/writes. All gateway instances share the same Redis, ensuring global rate limiting.
- Rules Engine — Loads YAML/JSON rate limit configuration. Determines which rule applies to each request.
- Backend Services — Microservices that handle business logic. Only reached if the request passes rate limiting.
Rules Engine & Configuration
Rate limit rules should be declarative, stored in a configuration file or database, and hot-reloadable. Here’s a YAML-based approach used by systems like Lyft’s Envoy rate limiter:
# rate-limit-rules.yaml
domain: api_gateway
descriptors:
# Global: 10,000 requests/minute across all clients
- key: global
rate_limit:
requests_per_unit: 10000
unit: minute
# Per-IP: 100 requests/minute (unauthenticated)
- key: remote_address
rate_limit:
requests_per_unit: 100
unit: minute
action: reject # HTTP 429
# Per-user: different limits by subscription tier
- key: user_id
descriptors:
- key: plan
value: free
rate_limit:
requests_per_unit: 60
unit: minute
action: reject
- key: plan
value: pro
rate_limit:
requests_per_unit: 600
unit: minute
action: reject
- key: plan
value: enterprise
rate_limit:
requests_per_unit: 6000
unit: minute
action: reject
# Per-endpoint: protect expensive operations
- key: endpoint
value: "POST /api/v1/search"
rate_limit:
requests_per_unit: 20
unit: minute
action: throttle # Queue and delay
- key: endpoint
value: "POST /api/v1/auth/login"
rate_limit:
requests_per_unit: 5
unit: minute
action: reject # Brute-force protection
Rule Matching Priority
When multiple rules match a request, the system evaluates them in priority order:
- Endpoint-specific (most specific) — e.g., POST /api/v1/auth/login for user X
- User-level — e.g., user X on the “free” plan: 60 req/min
- IP-level — e.g., IP 10.0.0.1: 100 req/min
- Global (least specific) — 10,000 req/min total
A request is rejected if it violates any applicable rule, as sketched below. This layered approach means a single user can’t consume the entire global quota.
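A minimal sketch of this layered evaluation (the key extractors and the check_rate_limit helper are placeholders for components defined later in this design):
# Check every applicable layer, most specific first; the request
# passes only if all of them allow it
def is_request_allowed(request, rules, redis):
    checks = [
        ("endpoint", f"rate:ep:{request.method} {request.path}"),
        ("user", f"rate:user:{request.user_id}"),
        ("ip", f"rate:ip:{request.client_ip}"),
        ("global", "rate:global"),
    ]
    for scope, key in checks:
        rule = rules.get(scope)
        if rule is None:
            continue  # no rule configured at this layer
        allowed, _, _ = check_rate_limit(redis, key, rule["limit"], rule["window"])
        if not allowed:
            return False, scope  # report which layer rejected
    return True, None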
Hot-Reloading Configuration
Rules should be reloadable without restarts:
# Approach 1: File watch + signal
# Gateway watches rate-limit-rules.yaml for changes
# On SIGHUP or file change → reload rules into memory
# Approach 2: Configuration service
# Store rules in a database or config service (etcd, Consul)
# Gateway polls every 30 seconds or subscribes to change events
# Approach 3: Redis pub/sub
# Admin updates rules → publishes to "rate_limit_rules" channel
# All gateway instances subscribe and reload
import yaml, hashlib, threading, time
class RulesEngine:
def __init__(self, config_path):
self.config_path = config_path
self.rules = {}
self.config_hash = None
self._load()
self._start_watcher()
def _load(self):
with open(self.config_path) as f:
raw = f.read()
new_hash = hashlib.sha256(raw.encode()).hexdigest()
if new_hash != self.config_hash:
self.rules = yaml.safe_load(raw)
self.config_hash = new_hash
print(f"Reloaded {len(self.rules['descriptors'])} rules")
def _start_watcher(self):
        def watch():
            while True:
                time.sleep(30)  # poll every 30s; SIGHUP/inotify also work
                self._load()
t = threading.Thread(target=watch, daemon=True)
t.start()
    def match(self, request):
        """Return ALL matching rules; the caller evaluates every one."""
        matched = []
        for rule in self.rules['descriptors']:
            # _matches: predicate comparing rule keys (user_id, endpoint,
            # remote_address, ...) against request attributes
            if self._matches(rule, request):
                matched.append(rule)
        # A request is allowed only if every matching rule permits it
        return matched
Rate Limiting Algorithms: Deep Dive
There are five major algorithms. Each has distinct trade-offs in memory, accuracy, and burst handling. Understanding when to use each is critical for interviews.
Algorithm 1: Fixed Window Counter
The simplest approach. Divide time into fixed windows (e.g., each minute), and count requests per key per window.
# Fixed Window Counter
# Key format: rate:fw:{user_id}:{window_number}
import time

def is_allowed_fixed_window(redis, user_id, limit, window_seconds):
"""
Returns True if request is allowed, False if rate-limited.
"""
window = int(time.time()) // window_seconds
key = f"rate:fw:{user_id}:{window}"
# Atomic increment + set expiry
pipe = redis.pipeline()
pipe.incr(key)
pipe.expire(key, window_seconds + 1) # +1 for clock skew
count, _ = pipe.execute()
return count <= limit
Pros: Simple, memory-efficient (one counter per key per window), fast.
Cons: The boundary problem. If the limit is 100 req/min, a client could send 100 requests at 12:00:59 and another 100 at 12:01:00 — 200 requests in 2 seconds, all passing the check. The effective rate can be double the configured limit at window boundaries.
Algorithm 2: Sliding Window Log
Store the timestamp of every request in a sorted set. To check if a request is allowed, remove timestamps older than the window and count the remaining entries.
# Sliding Window Log using Redis Sorted Sets
import time
from uuid import uuid4

def is_allowed_sliding_log(redis, user_id, limit, window_seconds):
    key = f"rate:swl:{user_id}"
    now = time.time()
    window_start = now - window_seconds
    # Unique member per request (score = timestamp)
    member = f"{now}:{uuid4()}"
    pipe = redis.pipeline()
    # Remove timestamps outside the current window
    pipe.zremrangebyscore(key, 0, window_start)
    # Add current request timestamp
    pipe.zadd(key, {member: now})
    # Count entries in the window
    pipe.zcard(key)
    # Set expiry to auto-cleanup idle keys
    pipe.expire(key, window_seconds + 1)
    _, _, count, _ = pipe.execute()
    if count > limit:
        # Remove exactly the entry we just added; a score-range delete
        # could also remove concurrent requests with the same timestamp
        redis.zrem(key, member)
        return False
    return True
Pros: Perfectly accurate — no boundary problem. The window truly slides with each request.
Cons: Memory-intensive. Stores every timestamp. For 1M users × 100 requests each = 100M entries in Redis. Each sorted set member uses ~60 bytes → ~6 GB of memory.
Algorithm 3: Sliding Window Counter
A hybrid that combines the memory efficiency of fixed windows with the accuracy of the sliding log. Use the weighted average of the current and previous window’s counts:
# Sliding Window Counter (weighted average)
# Memory: 2 counters per key (current + previous window)
# Note: the GET-then-INCR below is not atomic; in production, fold the
# whole check into a Lua script as with the other algorithms.
import time

def is_allowed_sliding_counter(redis, user_id, limit, window_seconds):
now = time.time()
current_window = int(now) // window_seconds
previous_window = current_window - 1
elapsed = now - (current_window * window_seconds)
weight = 1 - (elapsed / window_seconds) # % of prev window overlap
current_key = f"rate:swc:{user_id}:{current_window}"
previous_key = f"rate:swc:{user_id}:{previous_window}"
pipe = redis.pipeline()
pipe.get(current_key)
pipe.get(previous_key)
current_count, previous_count = pipe.execute()
current_count = int(current_count or 0)
previous_count = int(previous_count or 0)
# Weighted estimate of requests in the sliding window
estimated = (previous_count * weight) + current_count
if estimated >= limit:
return False
# Allowed — increment current window
pipe2 = redis.pipeline()
pipe2.incr(current_key)
pipe2.expire(current_key, window_seconds * 2)
pipe2.execute()
return True
Pros: Very memory-efficient (2 counters per key), smooth rate limiting, no boundary spikes.
Cons: Approximate (not exact), but Cloudflare reports only ~0.003% of requests are incorrectly allowed or denied.
Algorithm 4: Token Bucket (Most Popular)
The token bucket is the most widely used algorithm. Amazon, Stripe, and most API gateways use it because it naturally supports burst traffic while maintaining a long-term rate.
The concept is simple:
- A bucket holds tokens (maximum = burst capacity).
- Tokens are added at a fixed refill rate (e.g., 10 tokens/second).
- Each request consumes one token. If the bucket is empty, the request is rejected.
- This allows bursts (up to the bucket size) while maintaining a steady-state average rate.
# Token Bucket with Redis (atomic via a Lua script)
import time

def is_allowed_token_bucket(redis, key, max_tokens, refill_rate, refill_interval):
"""
max_tokens: Bucket capacity (e.g., 100)
refill_rate: Tokens added per interval (e.g., 10)
refill_interval: Seconds between refills (e.g., 1.0)
"""
now = time.time()
bucket_key = f"rate:tb:{key}"
# Lua script for atomic check-and-update
lua_script = """
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local refill_interval = tonumber(ARGV[3])
local now = tonumber(ARGV[4])
local ttl = tonumber(ARGV[5])
-- Get current state
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1])
local last_refill = tonumber(data[2])
-- Initialize if first request
if tokens == nil then
tokens = max_tokens
last_refill = now
end
-- Calculate token refill
local elapsed = now - last_refill
local refills = math.floor(elapsed / refill_interval)
if refills > 0 then
tokens = math.min(max_tokens, tokens + (refills * refill_rate))
last_refill = last_refill + (refills * refill_interval)
end
-- Check if request is allowed
local allowed = 0
if tokens >= 1 then
tokens = tokens - 1
allowed = 1
end
-- Save state
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', last_refill)
redis.call('EXPIRE', key, ttl)
return {allowed, tokens}
"""
ttl = int(max_tokens / refill_rate * refill_interval) + 10
result = redis.eval(lua_script, 1, bucket_key,
max_tokens, refill_rate, refill_interval, now, ttl)
allowed = result[0] == 1
remaining = result[1]
return allowed, remaining
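Example usage (the parameter values are illustrative):
# redis: a connected redis-py client
# 100-token bucket refilled at 10 tokens/second: a sustained rate of
# 10 req/s, with bursts of up to 100 requests allowed
allowed, remaining = is_allowed_token_bucket(
    redis, key="user:42", max_tokens=100, refill_rate=10, refill_interval=1.0
)
print("allowed" if allowed else "rejected", f"({remaining} tokens left)")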
Algorithm 5: Leaky Bucket
Similar to token bucket, but instead of tokens, requests are added to a FIFO queue that “leaks” (processes) at a fixed rate. Useful when you want a perfectly smooth output rate.
# Leaky Bucket — conceptual model (single-process, not distributed)
import time

class LeakyBucket:
def __init__(self, capacity, leak_rate):
self.capacity = capacity # Max queue size
self.leak_rate = leak_rate # Requests processed per second
self.water = 0 # Current queue level
self.last_leak = time.time()
def allow(self):
now = time.time()
# Leak (drain) water based on elapsed time
elapsed = now - self.last_leak
leaked = elapsed * self.leak_rate
self.water = max(0, self.water - leaked)
self.last_leak = now
# Try to add new request
if self.water < self.capacity:
self.water += 1
return True
return False # Queue full, reject
Pros: Perfectly smooth output rate; no bursts.
Cons: Doesn’t allow any burst at all; legitimate traffic spikes are penalised. Not suitable for APIs where occasional bursts are acceptable.
Algorithm Comparison
| Algorithm | Memory | Accuracy | Burst | Best For |
|---|---|---|---|---|
| Fixed Window | Very low | Poor (boundary) | 2× at edges | Simple analytics |
| Sliding Log | High | Perfect | None | Audit logging |
| Sliding Counter | Low | Very good | Smooth | General API limiting |
| Token Bucket | Low | Good | Controlled burst | APIs, CDNs, gateways |
| Leaky Bucket | Low | Good | None | Smooth processing (queues) |
Distributed Implementation with Redis
The core challenge of distributed rate limiting: multiple API gateway instances must share the same counters. Redis is the standard solution.
Why Redis?
- In-memory — Sub-millisecond latency for GET/SET/INCR operations.
- Atomic operations — INCR, MULTI/EXEC, and Lua scripts prevent race conditions.
- TTL support — EXPIRE auto-cleans old counters, no garbage collection needed.
- Data structures — Sorted sets (for sliding window log), hashes (for token bucket), strings (for counters).
- Cluster mode — Horizontal scaling across multiple Redis shards for very high throughput.
Approach 1: INCR + EXPIRE (Fixed Window)
# Simple but has a subtle race condition!
def check_rate_limit_naive(redis, key, limit, window):
count = redis.incr(key) # Atomic increment
if count == 1:
redis.expire(key, window) # Set TTL on first request
return count <= limit
# PROBLEM: If the process crashes between INCR and EXPIRE,
# the key persists forever with no TTL → counter never resets!
# FIX: Use a Lua script for atomicity
LUA_FIXED_WINDOW = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local count = redis.call('INCR', key)
if count == 1 then
redis.call('EXPIRE', key, window)
end
if count > limit then
return 0 -- rejected
end
return 1 -- allowed
"""
Approach 2: Sliding Window with Sorted Sets
# Atomic sliding window using Lua
LUA_SLIDING_WINDOW = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local member = ARGV[4]
-- Remove entries outside the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
-- Count current entries
local count = redis.call('ZCARD', key)
if count >= limit then
return {0, count, 0} -- rejected, current count, no retry
end
-- Add new entry and set expiry
redis.call('ZADD', key, now, member)
redis.call('EXPIRE', key, window)
return {1, count + 1, limit - count - 1} -- allowed, count, remaining
"""
Handling Race Conditions
The check-then-act pattern is the root cause of race conditions in rate limiting:
# RACE CONDITION (DON'T DO THIS):
count = redis.get(key) # Thread A reads: 99
# Thread B reads: 99
if count < limit: # Both threads see 99 < 100
redis.incr(key) # Thread A: 100 ✓
# Thread B: 101 ✗ (over limit!)
# FIX 1: Use INCR (returns new value atomically)
count = redis.incr(key) # Atomic: read + increment + return
if count > limit:
return REJECTED
# FIX 2: Lua script (for complex logic)
# Entire read-check-update executes atomically
# FIX 3: Optimistic locking with WATCH/MULTI/EXEC (redis-py)
with redis.pipeline() as pipe:
    pipe.watch(key)              # WATCH must run on the same connection
    count = pipe.get(key)        # executes immediately while watching
    if count and int(count) >= limit:
        pipe.unwatch()
        return REJECTED
    pipe.multi()                 # start queuing the transaction
    pipe.incr(key)
    pipe.execute()               # raises WatchError if key changed since WATCH
Redis Cluster Considerations
When using Redis Cluster with multiple shards:
- Hash tags: Use {user_id} as a hash tag to ensure all keys for a user land on the same shard. E.g., rate:{user123}:fw and rate:{user123}:tb co-locate (illustrated below).
- Lua scripts: In cluster mode, all keys in a Lua script must reside on the same shard. Use hash tags to guarantee this.
- Replication lag: If a replica is promoted during failover, some counters may be slightly behind. Accept this as a trade-off (fail-open for a few seconds).
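A small illustration of the hash-tag key scheme (plain string construction, matching the key names above):
# Redis Cluster hashes only the substring inside {...} when choosing a
# slot, so every key sharing the tag lands on the same shard
def user_keys(user_id):
    tag = "{" + user_id + "}"                  # e.g. "{user123}"
    return f"rate:{tag}:fw", f"rate:{tag}:tb"

print(user_keys("user123"))  # ('rate:{user123}:fw', 'rate:{user123}:tb')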
Token Bucket with Redis: Step-by-Step
[Interactive animation: Token Bucket with Redis — step-by-step token consumption and refill, with an atomic check-and-decrement per request. Bucket capacity=5, refill=1 token/step.]
HTTP Response Headers
A well-designed rate limiter communicates its state to clients through standard HTTP headers. This is critical for a good developer experience.
Standard Headers
# When request is ALLOWED (HTTP 200):
HTTP/1.1 200 OK
X-RateLimit-Limit: 100 # Max requests allowed in window
X-RateLimit-Remaining: 73 # Requests remaining in current window
X-RateLimit-Reset: 1714500060 # Unix timestamp when window resets
X-RateLimit-Policy: 100;w=60 # IETF draft: 100 per 60 seconds
# When request is REJECTED (HTTP 429):
HTTP/1.1 429 Too Many Requests
Retry-After: 37 # Seconds until client should retry
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714500060
Content-Type: application/json
{
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": "Rate limit exceeded. Maximum 100 requests per minute.",
"retry_after": 37,
"limit": 100,
"window": "1m",
"documentation_url": "https://api.example.com/docs/rate-limits"
}
}
Implementation
import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

# redis, rules_engine, extract_rate_limit_key and check_rate_limit are
# the components defined earlier in this design
app = FastAPI()
@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
key = extract_rate_limit_key(request) # IP, user ID, etc.
rule = rules_engine.match(request)
allowed, remaining, reset_at = check_rate_limit(
redis, key, rule.limit, rule.window
)
if not allowed:
response = JSONResponse(
status_code=429,
content={
"error": {
"code": "RATE_LIMIT_EXCEEDED",
"message": f"Rate limit exceeded. Max {rule.limit} "
f"requests per {rule.window}s.",
"retry_after": reset_at - int(time.time()),
}
}
)
else:
response = await call_next(request)
# Always set rate limit headers (even on 429)
response.headers["X-RateLimit-Limit"] = str(rule.limit)
response.headers["X-RateLimit-Remaining"] = str(max(0, remaining))
response.headers["X-RateLimit-Reset"] = str(reset_at)
if not allowed:
response.headers["Retry-After"] = str(
reset_at - int(time.time())
)
return response
Graceful Degradation
What happens when the rate limiter itself fails? There are two philosophies:
Fail-Open (Recommended for Most Cases)
If Redis is unreachable, allow all requests through. The system continues serving traffic, sacrificing rate limiting temporarily. This is the safe default for most applications.
async def check_rate_limit_safe(redis, key, limit, window):
try:
return await check_rate_limit(redis, key, limit, window)
except (ConnectionError, TimeoutError) as e:
# Redis is down — fail open
logger.warning(f"Rate limiter unavailable: {e}. Failing open.")
metrics.increment("rate_limiter.fail_open")
return True, limit, int(time.time()) + window # Allow request
Fail-Closed (Security-Critical)
If Redis is unreachable, reject all requests. Used when rate limiting is a security requirement (e.g., authentication endpoints, payment processing).
async def check_rate_limit_strict(redis, key, limit, window):
try:
return await check_rate_limit(redis, key, limit, window)
except (ConnectionError, TimeoutError) as e:
logger.error(f"Rate limiter unavailable: {e}. Failing closed.")
metrics.increment("rate_limiter.fail_closed")
return False, 0, int(time.time()) + 60 # Reject request
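Between fail-open and fail-closed sits a per-instance, in-memory fallback (mentioned in the FAQ below): keep limiting locally with reduced accuracy while Redis is down. A sketch, assuming the global limit is divided evenly across gateway instances:
import time
from collections import defaultdict

class LocalFallbackLimiter:
    """Fixed-window fallback used only while Redis is unreachable.
    Each instance enforces roughly limit / num_instances, so global
    accuracy degrades but abuse is still capped."""

    def __init__(self, num_instances):
        self.num_instances = num_instances
        self.counters = defaultdict(int)

    def allow(self, key, limit, window_seconds):
        window = int(time.time()) // window_seconds
        bucket = (key, window)
        self.counters[bucket] += 1
        # Stale windows should be evicted periodically in production
        return self.counters[bucket] <= max(1, limit // self.num_instances)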
Soft vs Hard Limits
| Type | Behaviour | Use Case |
|---|---|---|
| Soft limit | Allow requests over the limit but log a warning, add delay, or return a degraded response. | Free-tier users; non-critical endpoints; gradual enforcement. |
| Hard limit | Reject with HTTP 429 immediately once the limit is exceeded. | Paid plans with SLAs; security endpoints; resource-intensive operations. |
# Soft limit with degradation levels
def enforce_limit(count, soft_limit, hard_limit):
if count <= soft_limit:
return "ALLOW" # Full service
elif count <= hard_limit:
return "DEGRADE" # Allow but with reduced functionality
# e.g., return cached results, disable expensive features
else:
return "REJECT" # HTTP 429
# Example: soft=80, hard=100
# At 85 requests → degrade (return cached search results)
# At 101 requests → reject (HTTP 429)
Multi-Tier Rate Limiting
Real-world systems don’t use a single rate limit. They enforce multiple limits simultaneously at different granularities:
# Multi-tier rate limiting — ALL limits must pass
RATE_LIMITS = [
{"window": 1, "limit": 10, "name": "per_second"},
{"window": 60, "limit": 200, "name": "per_minute"},
{"window": 3600, "limit": 5000, "name": "per_hour"},
{"window": 86400, "limit": 50000, "name": "per_day"},
]
def is_allowed_multi_tier(redis, user_id):
"""Check ALL rate limit tiers. Request is rejected if ANY tier is exceeded."""
results = {}
most_restrictive_remaining = float('inf')
most_restrictive_reset = 0
for tier in RATE_LIMITS:
allowed, remaining, reset_at = check_rate_limit(
redis,
f"rate:{user_id}:{tier['name']}",
tier['limit'],
tier['window']
)
results[tier['name']] = {
"allowed": allowed,
"remaining": remaining,
"reset_at": reset_at
}
if remaining < most_restrictive_remaining:
most_restrictive_remaining = remaining
most_restrictive_reset = reset_at
        if not allowed:
            # Fail fast on first violation
            return False, results, remaining, reset_at
    # Surface the tightest tier's quota for X-RateLimit-* response headers
    return True, results, most_restrictive_remaining, most_restrictive_reset
Why multi-tier?
- Per-second limits prevent burst abuse (e.g., automated scripts firing 1000 requests in 1 second).
- Per-minute limits enforce sustained rate control for interactive users.
- Per-hour/day limits enforce quota-based billing and fair usage policies.
Monitoring Rate Limiter Effectiveness
A rate limiter without monitoring is flying blind. You need to know if limits are too strict (blocking legitimate users) or too lax (not preventing abuse).
Key Metrics
# Prometheus metrics for rate limiter monitoring
from prometheus_client import Counter, Histogram, Gauge
# Request outcome
rate_limit_requests = Counter(
'rate_limiter_requests_total',
'Total requests processed by rate limiter',
['result', 'tier', 'endpoint'] # result: allowed|rejected|error
)
# Latency of rate limit check
rate_limit_latency = Histogram(
'rate_limiter_check_duration_seconds',
'Latency of rate limit check',
buckets=[0.0005, 0.001, 0.005, 0.01, 0.05]
)
# Current utilisation per key
rate_limit_utilisation = Gauge(
'rate_limiter_utilisation_ratio',
'Current usage / limit ratio per key',
['user_tier']
)
# Redis health
redis_errors = Counter(
'rate_limiter_redis_errors_total',
'Redis connection errors (fail-open events)',
['error_type']
)
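A sketch of how these metrics could be recorded around the check itself, wrapping the check_rate_limit helper used throughout:
def check_rate_limit_instrumented(redis, key, limit, window, tier, endpoint):
    with rate_limit_latency.time():          # records check duration
        try:
            allowed, remaining, reset_at = check_rate_limit(redis, key, limit, window)
        except (ConnectionError, TimeoutError) as e:
            redis_errors.labels(error_type=type(e).__name__).inc()
            raise
    result = "allowed" if allowed else "rejected"
    rate_limit_requests.labels(result=result, tier=tier, endpoint=endpoint).inc()
    return allowed, remaining, reset_at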
Alerting Rules
# Grafana alert rules (Prometheus query)
# 1. Too many rejections (limits may be too strict)
rate(rate_limiter_requests_total{result="rejected"}[5m])
/ rate(rate_limiter_requests_total[5m]) > 0.1
# Alert: >10% of requests being rejected
# 2. Rate limiter latency spike
histogram_quantile(0.99, rate(rate_limiter_check_duration_seconds_bucket[5m]))
> 0.005
# Alert: p99 latency exceeds 5ms budget
# 3. Redis fail-open events
rate(rate_limiter_redis_errors_total[5m]) > 0
# Alert: Rate limiter is failing open (Redis issues)
# 4. Single user consuming disproportionate quota
topk(10, rate_limiter_utilisation_ratio) > 0.9
# Alert: Top users at >90% of their limit
Dashboard Panels
- Request throughput — Total RPS, split by allowed vs rejected.
- Rejection rate by endpoint — Which endpoints are hitting limits most?
- Top rate-limited users/IPs — Identify abusers or misconfigured clients.
- Redis latency p50/p99 — Ensure rate limit checks stay within budget.
- Fail-open events — How often is Redis unreachable?
- Quota utilisation distribution — Histogram of users by % of quota used.
Interview Framework & Common Questions
The rate limiter is one of the most frequently asked system design questions. Here’s a structured approach:
Step 1: Clarify Requirements (2 min)
- “Is this a client-side or server-side rate limiter?” (Server-side)
- “What should we limit by — user ID, IP, API key?”
- “Do we need to support multiple rate limit tiers (free/pro)?”
- “What happens when a request is rate-limited? Hard reject or queue?”
- “How many servers will this run on?” (Distributed)
- “What’s the expected RPS?”
Step 2: High-Level Design (5 min)
- Draw the architecture: Client → API Gateway → Rate Limiter Middleware → Redis → Backend
- Explain why middleware/gateway placement is preferred
- Mention Redis as the shared counter store (justify: speed, atomic ops, TTL)
Step 3: Algorithm Selection (5 min)
- Present token bucket as the default choice (supports burst)
- Contrast with sliding window counter for simpler use cases
- Explain the boundary problem of fixed window
- Show the Lua script for atomicity
Step 4: Deep Dive (10 min)
- Race conditions and atomic operations
- Redis cluster with hash tags
- HTTP 429 response with proper headers
- Multi-tier limiting
- Graceful degradation (fail-open vs fail-closed)
Step 5: Production Concerns (3 min)
- Monitoring and alerting
- Hot-reloading configuration
- Handling Redis failover
- Client-side retry with exponential backoff (see the sketch below)
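The last point deserves a concrete sketch: a client retry loop that honours Retry-After and otherwise backs off exponentially with jitter (using the requests library; names are illustrative):
import random
import time

import requests

def get_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Prefer the server's Retry-After hint; otherwise exponential
        # backoff with jitter to avoid thundering-herd retries
        retry_after = resp.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else 2 ** attempt + random.random()
        time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_retries} retries")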
Frequently Asked Follow-Up Questions
| Question | Key Points |
|---|---|
| “How do you handle distributed rate limiting?” | Centralised Redis counter shared by all gateway instances. Lua scripts for atomicity. Hash tags for cluster mode. |
| “What if Redis goes down?” | Fail-open by default (allow traffic). Use local in-memory fallback with reduced accuracy. Alert ops team. |
| “How do you prevent race conditions?” | Lua scripts (single atomic execution), INCR (atomic increment), or WATCH/MULTI/EXEC (optimistic locking). |
| “Token bucket vs sliding window?” | Token bucket allows controlled bursts, ideal for APIs. Sliding window is stricter, prevents any burst, good for quotas. |
| “How do you rate limit by IP when behind a CDN?” | Use X-Forwarded-For header. Validate it by allowing only trusted proxy IPs in the chain. Rate limit on the first untrusted IP (sketched below the table). |
| “How do you handle API key rate limiting for webhooks?” | Outbound rate limiting: use a token bucket to pace webhook delivery. Queue excess webhooks and drain at the allowed rate. |
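The X-Forwarded-For answer deserves a sketch, since naively trusting the left-most entry lets clients spoof their address. The trusted proxy ranges below are illustrative assumptions:
import ipaddress

# CIDRs of proxies we control or trust (illustrative values)
TRUSTED_PROXIES = [
    ipaddress.ip_network("10.0.0.0/8"),        # internal load balancers
    ipaddress.ip_network("203.0.113.0/24"),    # hypothetical CDN range
]

def rate_limit_ip(xff_header, peer_ip):
    """Walk the chain right-to-left, skipping trusted proxies; the
    first untrusted hop is the address to rate limit on."""
    hops = [h.strip() for h in xff_header.split(",")] + [peer_ip]
    for hop in reversed(hops):
        if not any(ipaddress.ip_address(hop) in net for net in TRUSTED_PROXIES):
            return hop
    return peer_ip  # entire chain trusted: fall back to the direct peer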
Production Examples
Stripe
Stripe uses a token bucket per API key with generous burst capacity. Their rate limiter runs in the API gateway layer and communicates limits via standard headers. They use separate limits for read vs write operations.
GitHub API
GitHub uses a sliding window approach: 5,000 requests/hour for authenticated users, 60/hour for unauthenticated. They include X-RateLimit-Remaining on every response, and their search API has a separate, lower limit (30 requests/minute).
Cloudflare
Cloudflare implements rate limiting at the edge (CDN layer), before requests even reach the origin server. They use the sliding window counter algorithm (weighted average of current and previous window), reporting only 0.003% error rate.
Discord
Discord uses per-route rate limiting with a token bucket. Each API route has its own bucket, and they use X-RateLimit-Bucket headers to group related endpoints. They also implement a global rate limit (50 requests/second per bot).
Summary
| Decision | Recommendation |
|---|---|
| Placement | API Gateway middleware + per-service fine-grained limits |
| Algorithm | Token bucket (default) or sliding window counter |
| Storage | Redis (in-memory, atomic, TTL, data structures) |
| Atomicity | Lua scripts in Redis (gold standard) |
| Failure mode | Fail-open (most cases) or fail-closed (security) |
| Response | HTTP 429 + Retry-After + X-RateLimit-* headers |
| Configuration | YAML rules with hot-reload (file watch or config service) |
| Monitoring | Rejection rate, latency, Redis health, top consumers |