High Level Design Series · Real-World Designs · Part 45 of 70

Design: Rate Limiter

The Problem: Protecting APIs from Abuse

Every production API faces the same existential threat: uncontrolled traffic. A single misbehaving client—whether a buggy mobile app stuck in a retry loop, a data scraper harvesting your catalogue, or a deliberate DDoS attack—can overwhelm your servers, spike cloud bills, and degrade service for everyone.

A rate limiter is the gatekeeper that controls how many requests a client can make within a time window. It’s the first line of defence and one of the most commonly asked system design interview questions because it touches on distributed systems, data structures, caching, atomicity, and API design all in one problem.

| Goal | What It Prevents |
|---|---|
| Prevent abuse | Brute-force attacks, credential stuffing, API scraping, denial-of-service. |
| Fair usage | One tenant consuming all resources in a multi-tenant system. |
| Cost control | Runaway cloud spend when traffic spikes unexpectedly. |
| Stability | Cascading failures when backend services are overloaded. |
| Compliance | Enforcing contractual SLAs and tier-based API plans. |
Interview tip: When you hear “design a rate limiter,” the interviewer wants to see that you understand where to place it, which algorithm to use, how it works in a distributed environment, and how to handle edge cases like race conditions and clock skew. This post covers all of these in depth.

Requirements

Functional Requirements

  1. Configurable rules — Limit requests by user ID, IP address, API endpoint, or any combination. Rules should be expressible in a declarative format (YAML/JSON).
  2. Multiple algorithms — Support token bucket, sliding window log, sliding window counter, and fixed window counter depending on the use case.
  3. Accurate counting — In a distributed deployment across N servers, the total count for a key must be globally consistent, not per-server.
  4. Informative responses — When a request is throttled, return HTTP 429 with Retry-After and X-RateLimit-* headers so clients can back off intelligently.
  5. Hot-reload rules — Changing rate limit configuration should not require a deployment or server restart.

Non-Functional Requirements

  1. Low latency — The rate limit check must add <5ms to request latency (p99). This rules out round-trips to relational databases for every request.
  2. High availability — If the rate limiter is down, the system should degrade gracefully (fail-open) rather than reject all traffic.
  3. Distributed — Must work across multiple API gateway instances sharing global counters.
  4. Scalable — Handle millions of unique rate-limit keys without memory pressure on any single node.
  5. Minimal false positives — Legitimate users should almost never be incorrectly throttled.

Back-of-the-Envelope Estimation

Let’s size a rate limiter for a mid-scale SaaS API:

| Metric | Value |
|---|---|
| Active users | 1,000,000 |
| Avg requests per user per day | 500 |
| Total requests per day | 500M |
| Average RPS | ~5,800 req/s |
| Peak RPS (3× average) | ~17,400 req/s |
| Rate limit check latency budget | <5 ms p99 |
| Redis keys (per-user counters, 1-min window) | ~1M keys |
| Memory per key (key + counter + TTL) | ~100 bytes → ~100 MB total |
Key insight: A single Redis instance can handle 100K+ ops/sec with sub-millisecond latency. Even at peak, our 17.4K RPS is well within the capacity of a single Redis node. For redundancy, we use a Redis cluster with replication, but a single primary is sufficient for rate limiting throughput.
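
As a quick sanity check, the table's figures can be reproduced with a few lines of arithmetic:

# Back-of-the-envelope arithmetic (matches the table above)
requests_per_day = 1_000_000 * 500        # 1M users × 500 req/day = 500M
avg_rps = requests_per_day / 86_400       # ≈ 5,787 → ~5,800 req/s
peak_rps = 3 * avg_rps                    # ≈ 17,361 → ~17,400 req/s
memory_mb = 1_000_000 * 100 / 1e6         # 1M keys × ~100 bytes ≈ 100 MB
print(f"{avg_rps:,.0f} rps avg, {peak_rps:,.0f} rps peak, {memory_mb:.0f} MB")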

Where to Place the Rate Limiter

There are three possible locations, each with trade-offs:

Option 1: Client-Side

The client throttles itself before making requests. This is unreliable because malicious or buggy clients can bypass the throttle. You cannot trust the client.

Option 2: Server-Side (Application Layer)

Each API server checks a shared counter (Redis) before processing a request. This gives you full control and access to application-level context (authenticated user ID, subscription tier), but it means every service must implement rate limiting logic.

Option 3: Middleware / API Gateway (Recommended)

Place the rate limiter as middleware in an API gateway that sits in front of all backend services. This is the industry-standard approach used by AWS API Gateway, Kong, Nginx, Envoy, and Cloudflare.

| Placement | Pros | Cons |
|---|---|---|
| Client | Reduces server load | Can't be trusted; easily bypassed |
| Server | Full application context; flexible | Logic scattered across services; inconsistent enforcement |
| API Gateway | Centralised; language-agnostic; handled before auth/routing | Less application context; additional hop |
Best practice: Use a two-layer approach. The API gateway enforces coarse-grained limits (per-IP, per-API-key). Individual services enforce fine-grained limits (per-user per-endpoint) with application-level context.
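
A minimal sketch of that two-layer idea, using a toy in-memory fixed-window counter in place of the shared Redis counters discussed later (names and limits are illustrative):

import time

class WindowLimiter:
    """Toy in-memory fixed-window counter (illustrative only)."""
    def __init__(self):
        self.counts = {}  # (key, window_number) -> request count

    def allow(self, key, limit, window):
        bucket = (key, int(time.time()) // window)
        self.counts[bucket] = self.counts.get(bucket, 0) + 1
        return self.counts[bucket] <= limit

gateway = WindowLimiter()   # layer 1: coarse, per-IP, before auth/routing
service = WindowLimiter()   # layer 2: fine, per-user-per-endpoint

def handle(client_ip, user_id, plan, endpoint):
    if not gateway.allow(client_ip, limit=100, window=60):
        return 429                         # coarse gateway limit hit
    limit = 600 if plan == "pro" else 60   # tier-aware service limit
    if not service.allow(f"{user_id}:{endpoint}", limit=limit, window=60):
        return 429
    return 200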

High-Level Architecture

The architecture follows a request’s journey from client to backend, with the rate limiter middleware intercepting every request before it reaches business logic.

[Interactive diagram: Distributed Rate Limiter Architecture — a request's lifecycle: client → API Gateway → Redis counter check → allow or reject (429).]

Component Breakdown

Rules Engine & Configuration

Rate limit rules should be declarative, stored in a configuration file or database, and hot-reloadable. Here’s a YAML-based approach used by systems like Lyft’s Envoy rate limiter:

# rate-limit-rules.yaml
domain: api_gateway

descriptors:
  # Global: 10,000 requests/minute across all clients
  - key: global
    rate_limit:
      requests_per_unit: 10000
      unit: minute

  # Per-IP: 100 requests/minute (unauthenticated)
  - key: remote_address
    rate_limit:
      requests_per_unit: 100
      unit: minute
    action: reject  # HTTP 429

  # Per-user: different limits by subscription tier
  - key: user_id
    descriptors:
      - key: plan
        value: free
        rate_limit:
          requests_per_unit: 60
          unit: minute
        action: reject

      - key: plan
        value: pro
        rate_limit:
          requests_per_unit: 600
          unit: minute
        action: reject

      - key: plan
        value: enterprise
        rate_limit:
          requests_per_unit: 6000
          unit: minute
        action: reject

  # Per-endpoint: protect expensive operations
  - key: endpoint
    value: "POST /api/v1/search"
    rate_limit:
      requests_per_unit: 20
      unit: minute
    action: throttle  # Queue and delay

  - key: endpoint
    value: "POST /api/v1/auth/login"
    rate_limit:
      requests_per_unit: 5
      unit: minute
    action: reject  # Brute-force protection

Rule Matching Priority

When multiple rules match a request, the system evaluates them in priority order:

  1. Endpoint-specific (most specific) — e.g., POST /api/v1/auth/login for user X
  2. User-level — e.g., user X on the “free” plan: 60 req/min
  3. IP-level — e.g., IP 10.0.0.1: 100 req/min
  4. Global (least specific) — 10,000 req/min total

A request is rejected if it violates any applicable rule. This layered approach means a single user can’t consume the entire global quota.
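
A sketch of this evaluation order (the rule shape mirrors the YAML above; the counter lookup is stubbed out with a plain dict):

# Illustrative: evaluate ALL matching rules, most specific first,
# rejecting on the first violation.
SPECIFICITY = {"endpoint": 0, "user_id": 1, "remote_address": 2, "global": 3}

def evaluate(matched_rules, counters):
    """matched_rules: rules whose key matched this request.
    counters: current count per rule key (e.g., fetched from Redis)."""
    for rule in sorted(matched_rules, key=lambda r: SPECIFICITY[r["key"]]):
        if counters[rule["key"]] >= rule["rate_limit"]["requests_per_unit"]:
            return False, rule   # the violated rule drives the 429 response
    return True, None

rules = [
    {"key": "global", "rate_limit": {"requests_per_unit": 10000}},
    {"key": "user_id", "rate_limit": {"requests_per_unit": 60}},
]
print(evaluate(rules, {"global": 42, "user_id": 60}))  # (False, user rule)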

Hot-Reloading Configuration

Rules should be reloadable without restarts:

# Approach 1: File watch + signal
# Gateway watches rate-limit-rules.yaml for changes
# On SIGHUP or file change → reload rules into memory

# Approach 2: Configuration service
# Store rules in a database or config service (etcd, Consul)
# Gateway polls every 30 seconds or subscribes to change events

# Approach 3: Redis pub/sub
# Admin updates rules → publishes to "rate_limit_rules" channel
# All gateway instances subscribe and reload

import hashlib
import threading
import time

import yaml

class RulesEngine:
    def __init__(self, config_path):
        self.config_path = config_path
        self.rules = {}
        self.config_hash = None
        self._load()
        self._start_watcher()

    def _load(self):
        with open(self.config_path) as f:
            raw = f.read()
        # Hash the raw config so we only reload when it actually changed
        new_hash = hashlib.sha256(raw.encode()).hexdigest()
        if new_hash != self.config_hash:
            self.rules = yaml.safe_load(raw)
            self.config_hash = new_hash
            print(f"Reloaded {len(self.rules['descriptors'])} rules")

    def _start_watcher(self):
        # Approach 1 from above: poll the file every 30 seconds
        def watch():
            while True:
                time.sleep(30)
                self._load()
        t = threading.Thread(target=watch, daemon=True)
        t.start()

    def match(self, request):
        """Return ALL rules matching this request; every one is enforced."""
        matched = []
        for rule in self.rules['descriptors']:
            if self._matches(rule, request):  # key/value comparison, elided
                matched.append(rule)
        # Callers check these most specific first (endpoint > user > IP > global)
        return matched

Rate Limiting Algorithms: Deep Dive

There are five major algorithms. Each has distinct trade-offs in memory, accuracy, and burst handling. Understanding when to use each is critical for interviews.

Algorithm 1: Fixed Window Counter

The simplest approach. Divide time into fixed windows (e.g., each minute), and count requests per key per window.

# Fixed Window Counter
# Key format: rate:{user_id}:{window_number}
import time

def is_allowed_fixed_window(redis, user_id, limit, window_seconds):
    """Return True if the request is allowed, False if rate-limited."""
    window = int(time.time()) // window_seconds
    key = f"rate:fw:{user_id}:{window}"

    # Pipelined increment + expiry (see the Lua variant below for why
    # INCR followed by EXPIRE alone is not crash-safe)
    pipe = redis.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_seconds + 1)  # +1s of slack for clock skew
    count, _ = pipe.execute()

    return count <= limit

Pros: Simple, memory-efficient (one counter per key per window), fast.

Cons: The boundary problem. If the limit is 100 req/min, a client could send 100 requests at 12:00:59 and another 100 at 12:01:00 — 200 requests in two seconds, all passing the check. At window boundaries, the effective rate can be double the configured limit.
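
A two-line illustration of why this happens: requests one second apart can land in different windows, so each burst is counted against a fresh counter:

# 12:00:59 and 12:01:00 are one second apart but in different windows
window_seconds = 60
t1, t2 = 59, 60                     # seconds since 12:00:00, for illustration
print(t1 // window_seconds, t2 // window_seconds)  # 0 1 — separate counters,
                                                   # so 100 + 100 requests pass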

Algorithm 2: Sliding Window Log

Store the timestamp of every request in a sorted set. To check if a request is allowed, remove timestamps older than the window and count the remaining entries.

# Sliding Window Log using Redis Sorted Sets
import time
from uuid import uuid4

def is_allowed_sliding_log(redis, user_id, limit, window_seconds):
    key = f"rate:swl:{user_id}"
    now = time.time()
    window_start = now - window_seconds
    member = f"{now}:{uuid4()}"  # unique member; score = timestamp

    pipe = redis.pipeline()
    # Remove timestamps outside the current window
    pipe.zremrangebyscore(key, 0, window_start)
    # Add the current request timestamp
    pipe.zadd(key, {member: now})
    # Count entries in the window (includes this request)
    pipe.zcard(key)
    # Set expiry to auto-cleanup idle keys
    pipe.expire(key, window_seconds + 1)
    _, _, count, _ = pipe.execute()

    if count > limit:
        # Remove the exact entry we just added (request is rejected)
        redis.zrem(key, member)
        return False

    return True

Pros: Perfectly accurate — no boundary problem. The window truly slides with each request.

Cons: Memory-intensive. Stores every timestamp. For 1M users × 100 requests each = 100M entries in Redis. Each sorted set member uses ~60 bytes → ~6 GB of memory.

Algorithm 3: Sliding Window Counter

A hybrid that combines the memory efficiency of fixed windows with the accuracy of the sliding log. Use the weighted average of the current and previous window’s counts:

# Sliding Window Counter (weighted average)
# Memory: 2 counters per key (current + previous window)
import time

def is_allowed_sliding_counter(redis, user_id, limit, window_seconds):
    now = time.time()
    current_window = int(now) // window_seconds
    previous_window = current_window - 1
    elapsed = now - (current_window * window_seconds)
    weight = 1 - (elapsed / window_seconds)  # fraction of prev window overlap

    current_key = f"rate:swc:{user_id}:{current_window}"
    previous_key = f"rate:swc:{user_id}:{previous_window}"

    pipe = redis.pipeline()
    pipe.get(current_key)
    pipe.get(previous_key)
    current_count, previous_count = pipe.execute()

    current_count = int(current_count or 0)
    previous_count = int(previous_count or 0)

    # Weighted estimate of requests in the sliding window
    estimated = (previous_count * weight) + current_count

    if estimated >= limit:
        return False

    # Allowed — increment current window.
    # NOTE: this read-then-increment is not atomic; under high concurrency,
    # wrap the check and increment in a Lua script (see "Handling Race
    # Conditions" below).
    pipe2 = redis.pipeline()
    pipe2.incr(current_key)
    pipe2.expire(current_key, window_seconds * 2)
    pipe2.execute()
    return True

Pros: Very memory-efficient (2 counters per key), smooth rate limiting, no boundary spikes.

Cons: Approximate (not exact), but Cloudflare reports only ~0.003% of requests are incorrectly allowed or denied.

Algorithm 4: Token Bucket (Most Popular)

The token bucket is the most widely used algorithm. Amazon, Stripe, and most API gateways use it because it naturally supports burst traffic while maintaining a long-term rate.

The concept is simple:

  1. A bucket holds tokens (maximum = burst capacity).
  2. Tokens are added at a fixed refill rate (e.g., 10 tokens/second).
  3. Each request consumes one token. If the bucket is empty, the request is rejected.
  4. This allows bursts (up to the bucket size) while maintaining a steady-state average rate.

# Token Bucket with Redis (atomic via a Lua script)
import time

def is_allowed_token_bucket(redis, key, max_tokens, refill_rate, refill_interval):
    """
    max_tokens:      Bucket capacity (e.g., 100)
    refill_rate:     Tokens added per interval (e.g., 10)
    refill_interval: Seconds between refills (e.g., 1.0)
    """
    now = time.time()
    bucket_key = f"rate:tb:{key}"

    # Lua script for atomic check-and-update
    lua_script = """
    local key = KEYS[1]
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local refill_interval = tonumber(ARGV[3])
    local now = tonumber(ARGV[4])
    local ttl = tonumber(ARGV[5])

    -- Get current state
    local data = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1])
    local last_refill = tonumber(data[2])

    -- Initialize if first request
    if tokens == nil then
        tokens = max_tokens
        last_refill = now
    end

    -- Calculate token refill
    local elapsed = now - last_refill
    local refills = math.floor(elapsed / refill_interval)
    if refills > 0 then
        tokens = math.min(max_tokens, tokens + (refills * refill_rate))
        last_refill = last_refill + (refills * refill_interval)
    end

    -- Check if request is allowed
    local allowed = 0
    if tokens >= 1 then
        tokens = tokens - 1
        allowed = 1
    end

    -- Save state
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', last_refill)
    redis.call('EXPIRE', key, ttl)

    return {allowed, tokens}
    """

    ttl = int(max_tokens / refill_rate * refill_interval) + 10
    result = redis.eval(lua_script, 1, bucket_key,
                        max_tokens, refill_rate, refill_interval, now, ttl)
    allowed = result[0] == 1
    remaining = result[1]
    return allowed, remaining
Why Lua? Redis Lua scripts execute atomically — no other command can interleave. This eliminates race conditions where two requests simultaneously read the token count, both see tokens available, and both decrement — allowing one extra request through. With Lua, the read-check-update happens as a single atomic operation.

Algorithm 5: Leaky Bucket

Similar to token bucket, but instead of tokens, requests are added to a FIFO queue that “leaks” (processes) at a fixed rate. Useful when you want a perfectly smooth output rate.

# Leaky Bucket — conceptual model
import time

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity      # Max queue size
        self.leak_rate = leak_rate    # Requests processed per second
        self.water = 0                # Current queue level
        self.last_leak = time.time()

    def allow(self):
        now = time.time()
        # Leak (drain) water based on elapsed time
        elapsed = now - self.last_leak
        leaked = elapsed * self.leak_rate
        self.water = max(0, self.water - leaked)
        self.last_leak = now

        # Try to add new request
        if self.water < self.capacity:
            self.water += 1
            return True
        return False  # Queue full, reject

Pros: Perfectly smooth output rate; no bursts.

Cons: Doesn’t allow any burst at all; legitimate traffic spikes are penalised. Not suitable for APIs where occasional bursts are acceptable.
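
A short usage sketch of the class above, showing how a burst is clipped:

bucket = LeakyBucket(capacity=5, leak_rate=2)   # 5-deep queue, 2 req/s drain
print([bucket.allow() for _ in range(10)])      # first 5 True, rest False
time.sleep(1)                                   # ~2 requests leak out
print(bucket.allow())                           # True — capacity freed up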

Algorithm Comparison

| Algorithm | Memory | Accuracy | Burst | Best For |
|---|---|---|---|---|
| Fixed Window | Very low | Poor (boundary) | 2× at edges | Simple analytics |
| Sliding Log | High | Perfect | None | Audit logging |
| Sliding Counter | Low | Very good | Smooth | General API limiting |
| Token Bucket | Low | Good | Controlled burst | APIs, CDNs, gateways |
| Leaky Bucket | Low | Good | None | Smooth processing (queues) |

Distributed Implementation with Redis

The core challenge of distributed rate limiting: multiple API gateway instances must share the same counters. Redis is the standard solution.

Why Redis?

  - In-memory with sub-millisecond operations, comfortably inside the <5ms p99 latency budget.
  - Atomic primitives (INCR, Lua scripts) for race-free counter updates.
  - Built-in TTLs expire counters automatically, so windows clean themselves up.
  - Data structures (counters, hashes, sorted sets) map directly onto the algorithms above.
  - A single node handles 100K+ ops/sec, well above our ~17.4K peak RPS.

Approach 1: INCR + EXPIRE (Fixed Window)

# Simple, but INCR and EXPIRE are two separate commands — not atomic!
def check_rate_limit_naive(redis, key, limit, window):
    count = redis.incr(key)       # Atomic increment
    if count == 1:
        redis.expire(key, window) # Set TTL on first request
    return count <= limit

# PROBLEM: If the process crashes between INCR and EXPIRE,
# the key persists forever with no TTL → counter never resets!

# FIX: Use a Lua script for atomicity
LUA_FIXED_WINDOW = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local count = redis.call('INCR', key)
if count == 1 then
    redis.call('EXPIRE', key, window)
end

if count > limit then
    return 0  -- rejected
end
return 1  -- allowed
"""

Approach 2: Sliding Window with Sorted Sets

# Atomic sliding window using Lua
LUA_SLIDING_WINDOW = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local member = ARGV[4]

-- Remove entries outside the window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current entries
local count = redis.call('ZCARD', key)

if count >= limit then
    return {0, count, 0}  -- rejected, current count, no retry
end

-- Add new entry and set expiry
redis.call('ZADD', key, now, member)
redis.call('EXPIRE', key, window)

return {1, count + 1, limit - count - 1}  -- allowed, count, remaining
"""

Handling Race Conditions

The check-then-act pattern is the root cause of all race conditions in rate limiting:

# RACE CONDITION (DON'T DO THIS):
count = redis.get(key)         # Thread A reads: 99
                               # Thread B reads: 99
if int(count) < limit:         # Both threads see 99 < 100
    redis.incr(key)            # Thread A: 100 ✓
                               # Thread B: 101 ✗ (over limit!)

# FIX 1: Use INCR (returns the new value atomically)
count = redis.incr(key)        # Atomic: read + increment + return
if count > limit:
    return REJECTED

# FIX 2: Lua script (for complex logic)
# The entire read-check-update executes atomically

# FIX 3: WATCH/MULTI/EXEC (optimistic locking)
pipe = redis.pipeline()
pipe.watch(key)                # pipeline enters immediate-execution mode
count = pipe.get(key)
if count and int(count) >= limit:
    pipe.unwatch()
    return REJECTED
pipe.multi()                   # start buffering the transaction
pipe.incr(key)
pipe.execute()  # raises WatchError if key was modified since WATCH
Important: Even with Redis being single-threaded, race conditions arise because multiple clients execute separate commands. Between a client’s GET and SET, another client’s INCR can interleave. Lua scripts are the gold standard for atomic multi-step operations.

Redis Cluster Considerations

When using Redis Cluster with multiple shards:

  - Per-key counters spread across shards by hash slot, so load distributes naturally.
  - Multi-key commands and Lua scripts require every referenced key to live in the same hash slot — use hash tags so one client's related keys co-locate (see the sketch below).
  - Watch for hot keys: a single global counter always hashes to one shard, making that shard the ceiling for global-limit throughput.

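A small sketch of hash-tagged keys — Redis Cluster hashes only the {...} portion of a key, so all of one user's counters share a slot and can be updated by a single Lua script:

def tiered_keys(user_id):
    # Only the text inside {…} determines the hash slot
    return [
        f"rate:{{{user_id}}}:per_second",
        f"rate:{{{user_id}}}:per_minute",
        f"rate:{{{user_id}}}:per_hour",
    ]

print(tiered_keys("user123"))
# ['rate:{user123}:per_second', 'rate:{user123}:per_minute', 'rate:{user123}:per_hour']
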
Token Bucket with Redis: Step-by-Step

[Interactive animation: Token Bucket with Redis — tokens consumed and refilled step by step; MULTI/EXEC ensures an atomic check-and-decrement. Bucket capacity = 5, refill = 1 token/step.]

HTTP Response Headers

A well-designed rate limiter communicates its state to clients through standard HTTP headers. This is critical for a good developer experience.

Standard Headers

# When request is ALLOWED (HTTP 200):
HTTP/1.1 200 OK
X-RateLimit-Limit: 100          # Max requests allowed in window
X-RateLimit-Remaining: 73       # Requests remaining in current window
X-RateLimit-Reset: 1714500060   # Unix timestamp when window resets
X-RateLimit-Policy: 100;w=60    # IETF draft (RateLimit-Policy): 100 per 60 seconds

# When request is REJECTED (HTTP 429):
HTTP/1.1 429 Too Many Requests
Retry-After: 37                  # Seconds until client should retry
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714500060
Content-Type: application/json

{
  "error": {
    "code": "RATE_LIMIT_EXCEEDED",
    "message": "Rate limit exceeded. Maximum 100 requests per minute.",
    "retry_after": 37,
    "limit": 100,
    "window": "1m",
    "documentation_url": "https://api.example.com/docs/rate-limits"
  }
}

Implementation

import time

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

# Assumes redis, rules_engine, extract_rate_limit_key and check_rate_limit
# are initialised elsewhere (see the previous sections).

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    key = extract_rate_limit_key(request)   # IP, user ID, etc.
    rule = rules_engine.match(request)[0]   # most specific matching rule

    allowed, remaining, reset_at = check_rate_limit(
        redis, key, rule.limit, rule.window
    )

    if not allowed:
        response = JSONResponse(
            status_code=429,
            content={
                "error": {
                    "code": "RATE_LIMIT_EXCEEDED",
                    "message": f"Rate limit exceeded. Max {rule.limit} "
                               f"requests per {rule.window}s.",
                    "retry_after": reset_at - int(time.time()),
                }
            }
        )
    else:
        response = await call_next(request)

    # Always set rate limit headers (even on 429)
    response.headers["X-RateLimit-Limit"] = str(rule.limit)
    response.headers["X-RateLimit-Remaining"] = str(max(0, remaining))
    response.headers["X-RateLimit-Reset"] = str(reset_at)
    if not allowed:
        response.headers["Retry-After"] = str(
            reset_at - int(time.time())
        )

    return response

Graceful Degradation

What happens when the rate limiter itself fails? There are two philosophies:

Fail-Open (Recommended for Most Cases)

If Redis is unreachable, allow all requests through. The system continues serving traffic, sacrificing rate limiting temporarily. This is the safe default for most applications.

async def check_rate_limit_safe(redis, key, limit, window):
    try:
        return await check_rate_limit(redis, key, limit, window)
    except (ConnectionError, TimeoutError) as e:
        # Redis is down — fail open
        logger.warning(f"Rate limiter unavailable: {e}. Failing open.")
        metrics.increment("rate_limiter.fail_open")
        return True, limit, int(time.time()) + window  # Allow request
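
A middle ground (mentioned in the FAQ below) is a local in-memory fallback: if Redis is unreachable, each gateway instance enforces an approximate per-instance limit rather than allowing unlimited traffic. A sketch, assuming the same check_rate_limit and metrics objects as above; with N gateway instances the effective global limit degrades to up to N× the configured value:

import time

local_counters = {}  # (key, window_number) -> count, per gateway process

def local_fallback_allow(key, limit, window):
    bucket = (key, int(time.time()) // window)
    local_counters[bucket] = local_counters.get(bucket, 0) + 1
    return local_counters[bucket] <= limit

async def check_rate_limit_with_fallback(redis, key, limit, window):
    try:
        return await check_rate_limit(redis, key, limit, window)
    except (ConnectionError, TimeoutError):
        metrics.increment("rate_limiter.local_fallback")
        allowed = local_fallback_allow(key, limit, window)
        return allowed, limit, int(time.time()) + window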

Fail-Closed (Security-Critical)

If Redis is unreachable, reject all requests. Used when rate limiting is a security requirement (e.g., authentication endpoints, payment processing).

async def check_rate_limit_strict(redis, key, limit, window):
    try:
        return await check_rate_limit(redis, key, limit, window)
    except (ConnectionError, TimeoutError) as e:
        logger.error(f"Rate limiter unavailable: {e}. Failing closed.")
        metrics.increment("rate_limiter.fail_closed")
        return False, 0, int(time.time()) + 60  # Reject request

Soft vs Hard Limits

| Type | Behaviour | Use Case |
|---|---|---|
| Soft limit | Allow requests over the limit but log a warning, add delay, or return a degraded response. | Free-tier users; non-critical endpoints; gradual enforcement. |
| Hard limit | Reject with HTTP 429 immediately once the limit is exceeded. | Paid plans with SLAs; security endpoints; resource-intensive operations. |
# Soft limit with degradation levels
def enforce_limit(count, soft_limit, hard_limit):
    if count <= soft_limit:
        return "ALLOW"           # Full service
    elif count <= hard_limit:
        return "DEGRADE"         # Allow but with reduced functionality
        # e.g., return cached results, disable expensive features
    else:
        return "REJECT"          # HTTP 429

# Example: soft=80, hard=100
# At 85 requests → degrade (return cached search results)
# At 101 requests → reject (HTTP 429)

Multi-Tier Rate Limiting

Real-world systems don’t use a single rate limit. They enforce multiple limits simultaneously at different granularities:

# Multi-tier rate limiting — ALL limits must pass
RATE_LIMITS = [
    {"window": 1,     "limit": 10,    "name": "per_second"},
    {"window": 60,    "limit": 200,   "name": "per_minute"},
    {"window": 3600,  "limit": 5000,  "name": "per_hour"},
    {"window": 86400, "limit": 50000, "name": "per_day"},
]

def is_allowed_multi_tier(redis, user_id):
    """Check ALL rate limit tiers. Request is rejected if ANY tier is exceeded."""
    results = {}
    # Track the tightest tier — its remaining/reset values populate
    # the X-RateLimit-* response headers
    most_restrictive_remaining = float('inf')
    most_restrictive_reset = 0

    for tier in RATE_LIMITS:
        allowed, remaining, reset_at = check_rate_limit(
            redis,
            f"rate:{user_id}:{tier['name']}",
            tier['limit'],
            tier['window']
        )
        results[tier['name']] = {
            "allowed": allowed,
            "remaining": remaining,
            "reset_at": reset_at
        }

        if remaining < most_restrictive_remaining:
            most_restrictive_remaining = remaining
            most_restrictive_reset = reset_at

        if not allowed:
            return False, results  # Fail fast on first violation

    results["most_restrictive"] = {
        "remaining": most_restrictive_remaining,
        "reset_at": most_restrictive_reset,
    }
    return True, results

Why multi-tier?

A per-second limit alone still permits sustained high volume (10 req/s ≈ 864K requests/day), while a per-day limit alone still permits damaging bursts. Layering windows caps both the burst rate and the total consumption.

Real-world example (Twitter/X API): 300 tweets/3 hours (write), 900 reads/15 minutes per app, 100K reads/24 hours per user. GitHub API: 5000 requests/hour for authenticated users, 60/hour for unauthenticated.

Monitoring Rate Limiter Effectiveness

A rate limiter without monitoring is flying blind. You need to know if limits are too strict (blocking legitimate users) or too lax (not preventing abuse).

Key Metrics

# Prometheus metrics for rate limiter monitoring
from prometheus_client import Counter, Histogram, Gauge

# Request outcome
rate_limit_requests = Counter(
    'rate_limiter_requests_total',
    'Total requests processed by rate limiter',
    ['result', 'tier', 'endpoint']  # result: allowed|rejected|error
)

# Latency of rate limit check
rate_limit_latency = Histogram(
    'rate_limiter_check_duration_seconds',
    'Latency of rate limit check',
    buckets=[0.0005, 0.001, 0.005, 0.01, 0.05]
)

# Current utilisation per key
rate_limit_utilisation = Gauge(
    'rate_limiter_utilisation_ratio',
    'Current usage / limit ratio per key',
    ['user_tier']
)

# Redis health
redis_errors = Counter(
    'rate_limiter_redis_errors_total',
    'Redis connection errors (fail-open events)',
    ['error_type']
)

Alerting Rules

# Grafana alert rules (Prometheus query)

# 1. Too many rejections (limits may be too strict)
rate(rate_limiter_requests_total{result="rejected"}[5m])
  / rate(rate_limiter_requests_total[5m]) > 0.1
# Alert: >10% of requests being rejected

# 2. Rate limiter latency spike
histogram_quantile(0.99, rate(rate_limiter_check_duration_seconds_bucket[5m]))
  > 0.005
# Alert: p99 latency exceeds 5ms budget

# 3. Redis fail-open events
rate(rate_limiter_redis_errors_total[5m]) > 0
# Alert: Rate limiter is failing open (Redis issues)

# 4. Single user consuming disproportionate quota
topk(10, rate_limiter_utilisation_ratio) > 0.9
# Alert: Top users at >90% of their limit

Dashboard Panels

  - Allowed vs rejected requests over time, by tier and endpoint.
  - p99 rate-limit check latency against the 5ms budget.
  - Top consumers by utilisation ratio (who is nearest their limit).
  - Redis errors and fail-open events.

Interview Framework & Common Questions

The rate limiter is one of the most frequently asked system design questions. Here's a structured approach:

Step 1: Clarify Requirements (2 min)

Scope the problem: which keys (user ID, IP, API key)? What scale? Single node or distributed? Fail-open or fail-closed? Hard rejection (429) or throttling?

Step 2: High-Level Design (5 min)

Place the limiter as API gateway middleware backed by shared Redis counters, with a declarative, hot-reloadable rules engine.

Step 3: Algorithm Selection (5 min)

Walk the trade-offs: fixed window (simple, boundary problem), sliding log (exact, memory-heavy), sliding window counter (cheap, approximate), token bucket (controlled bursts, most popular), leaky bucket (smooth output, no bursts).

Step 4: Deep Dive (10 min)

Race conditions and Lua atomicity, cluster hash tags, multi-tier limits, and the 429 / Retry-After / X-RateLimit-* response contract.

Step 5: Production Concerns (3 min)

Graceful degradation (fail-open vs fail-closed), hot-reloading configuration, and monitoring and alerting.

Frequently Asked Follow-Up Questions

| Question | Key Points |
|---|---|
| "How do you handle distributed rate limiting?" | Centralised Redis counter shared by all gateway instances. Lua scripts for atomicity. Hash tags for cluster mode. |
| "What if Redis goes down?" | Fail-open by default (allow traffic). Use a local in-memory fallback with reduced accuracy. Alert the ops team. |
| "How do you prevent race conditions?" | Lua scripts (single atomic execution), INCR (atomic increment), or WATCH/MULTI/EXEC (optimistic locking). |
| "Token bucket vs sliding window?" | Token bucket allows controlled bursts, ideal for APIs. Sliding window is stricter, prevents any burst, good for quotas. |
| "How do you rate limit by IP when behind a CDN?" | Use the X-Forwarded-For header. Validate it by allowing only trusted proxy IPs in the chain. Rate limit on the first untrusted IP. |
| "How do you handle API key rate limiting for webhooks?" | Outbound rate limiting: use a token bucket to pace webhook delivery. Queue excess webhooks and drain at the allowed rate. |
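
For the CDN question, the X-Forwarded-For handling might look like this sketch (the trusted proxy range is illustrative):

import ipaddress

# X-Forwarded-For: "client, proxy1, proxy2" — each proxy appends to the right.
# Walk right to left, skipping proxies we trust, and rate limit on the first
# untrusted address (anything further left is client-controlled and spoofable).
TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/8")]  # illustrative

def rate_limit_ip(xff_header, peer_ip):
    hops = [h.strip() for h in xff_header.split(",")] + [peer_ip]
    for hop in reversed(hops):
        if not any(ipaddress.ip_address(hop) in net for net in TRUSTED_PROXIES):
            return hop          # first untrusted IP from the right
    return hops[0]              # whole chain trusted — use leftmost

print(rate_limit_ip("203.0.113.7, 10.0.0.5", "10.0.0.1"))  # → 203.0.113.7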

Production Examples

Stripe

Stripe uses a token bucket per API key with generous burst capacity. Their rate limiter runs in the API gateway layer and communicates limits via standard headers. They use separate limits for read vs write operations.

GitHub API

GitHub uses a sliding window approach: 5,000 requests/hour for authenticated users, 60/hour for unauthenticated. They include X-RateLimit-Remaining on every response, and their search API has a separate, lower limit (30 requests/minute).

Cloudflare

Cloudflare implements rate limiting at the edge (CDN layer), before requests even reach the origin server. They use the sliding window counter algorithm (weighted average of current and previous window), reporting only 0.003% error rate.

Discord

Discord uses per-route rate limiting with a token bucket. Each API route has its own bucket, and they use X-RateLimit-Bucket headers to group related endpoints. They also implement a global rate limit (50 requests/second per bot).

Summary

| Decision | Recommendation |
|---|---|
| Placement | API Gateway middleware + per-service fine-grained limits |
| Algorithm | Token bucket (default) or sliding window counter |
| Storage | Redis (in-memory, atomic, TTL, data structures) |
| Atomicity | Lua scripts in Redis (gold standard) |
| Failure mode | Fail-open (most cases) or fail-closed (security) |
| Response | HTTP 429 + Retry-After + X-RateLimit-* headers |
| Configuration | YAML rules with hot-reload (file watch or config service) |
| Monitoring | Rejection rate, latency, Redis health, top consumers |