Rate Limiting & Throttling
Why Rate Limit?
Every public-facing API that doesn't enforce rate limits is an incident waiting to happen. Rate limiting is the practice of controlling the number of requests a client can make to a service within a given time window. It sits at the intersection of reliability engineering, security, and cost management.
Core Motivations
| Goal | What It Prevents |
|---|---|
| Prevent Abuse | Malicious actors flooding your API with millions of requests, scraping data, or brute-forcing authentication endpoints. |
| Protect Resources | A single runaway client shouldn't saturate your database connections, exhaust thread pools, or crash downstream services. |
| Ensure Fair Usage | Without limits, one aggressive client can starve all other users of capacity (the "noisy neighbor" problem). |
| DDoS Mitigation | Rate limiting is the first line of defense. While it can't stop a massive volumetric attack, it handles application-layer (L7) floods effectively. |
| Cost Control | Cloud resources cost money. Unbounded traffic means unbounded bills. Rate limits cap the financial exposure. |
Where to Rate Limit
Rate limiting can be applied at multiple layers in your architecture, and in practice you should use multiple layers:
```
Client → CDN/WAF (IP-level)             e.g. Cloudflare, AWS WAF
       → API Gateway (user-level)       e.g. Kong, Envoy, nginx
       → Application (endpoint-level)   e.g. middleware / decorator
       → Database (connection pooling)  e.g. PgBouncer, ProxySQL
```
- API Gateway / Reverse Proxy — The most common place. Catches abuse before it reaches your application code. Configure in Kong, Envoy, nginx, or AWS API Gateway.
- Application Level — Per-endpoint rate limits (e.g., login endpoint gets 5 req/min, search gets 100 req/min). Implemented as middleware or decorators; a minimal sketch follows this list.
- Per-User / Per-API-Key — Identify the caller via API key, JWT, or OAuth token. Essential for multi-tenant SaaS platforms.
- Per-IP — Simplest approach but problematic with NAT (thousands of users behind one IP). Good for anonymous endpoints.
- Global — Hard ceiling on total system throughput to prevent cascading failures.
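To make the application-level option concrete, here is a minimal per-endpoint decorator sketch in Python. It is single-process and in-memory only (production systems back this with Redis, as shown later), and the `rate_limit` decorator and `RateLimitExceeded` exception are illustrative names, not any specific framework's API:

```python
import time
from functools import wraps

class RateLimitExceeded(Exception):
    pass

def rate_limit(max_requests: int, per_seconds: float):
    """Per-endpoint fixed-window limiter, keyed by the decorated function.

    In-memory and not thread-safe — a sketch of the idea only.
    """
    state = {"window": 0, "count": 0}

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            window = int(time.time() // per_seconds)
            if window != state["window"]:
                # New window: reset the counter
                state["window"], state["count"] = window, 0
            state["count"] += 1
            if state["count"] > max_requests:
                raise RateLimitExceeded(
                    f"{fn.__name__}: over {max_requests} req / {per_seconds}s"
                )
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@rate_limit(max_requests=5, per_seconds=60)
def login(username: str, password: str):
    ...
```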
Token Bucket Algorithm
The token bucket is the most widely used rate-limiting algorithm, adopted by AWS, Stripe, and most API gateways. It's elegant because it naturally allows bursts while enforcing a long-term average rate.
How It Works
- A bucket holds tokens. The bucket has a maximum capacity (e.g., 10 tokens).
- Tokens are added at a fixed refill rate (e.g., 1 token per second).
- Each incoming request must consume one token. If a token is available, the request proceeds. If not, the request is rejected (or queued).
- If the bucket is full, new tokens are discarded (the bucket can't overflow).
Parameters
| Parameter | Description | Example |
|---|---|---|
| `bucket_capacity` | Maximum burst size | 10 |
| `refill_rate` | Tokens added per second | 2 tokens/sec |
Implementation
```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity          # start full
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill = now

    def allow_request(self, tokens: int = 1) -> bool:
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

# Usage: 10 requests max burst, 2 requests/sec sustained
limiter = TokenBucket(capacity=10, refill_rate=2)

for i in range(15):
    if limiter.allow_request():
        print(f"Request {i}: ✅ Allowed")
    else:
        print(f"Request {i}: ❌ Rate limited")
```
*Animation: tokens refill, get consumed by requests, and requests are rejected when the bucket is empty.*
Leaky Bucket Algorithm
The leaky bucket is the dual of the token bucket. Instead of tokens controlling admission, a queue (the bucket) holds incoming requests, and they "leak" out at a fixed rate — like water dripping through a hole at the bottom of a bucket.
How It Works
- Incoming requests are added to a FIFO queue (the bucket).
- The queue has a maximum size. If the queue is full, new requests are dropped.
- Requests are processed (leaked) at a constant rate, regardless of how fast they arrive.
Key Difference from Token Bucket
| Aspect | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst handling | Allows bursts (up to bucket capacity) | Smooths out bursts completely |
| Output rate | Variable (matches input up to limit) | Constant (always fixed rate) |
| When bucket is full | Reject immediately | Drop from queue |
| Use case | API rate limiting (most common) | Traffic shaping, network QoS |
Implementation
```python
import time
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests processed per second
        self.queue = deque()
        self.last_leak = time.monotonic()

    def _leak(self):
        now = time.monotonic()
        elapsed = now - self.last_leak
        leaked = int(elapsed * self.leak_rate)
        for _ in range(min(leaked, len(self.queue))):
            self.queue.popleft()
        if leaked > 0:
            self.last_leak = now

    def allow_request(self, request_id: str) -> bool:
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request_id)
            return True   # queued (will be processed at leak rate)
        return False      # queue full, dropped

# Queue up to 5 requests, process 2 per second
limiter = LeakyBucket(capacity=5, leak_rate=2)
```
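A quick usage sketch against the limiter just defined: a burst of 8 simultaneous requests fills the 5-slot queue, and the rest are dropped:

```python
# Burst of 8 requests arrives at once: the first 5 are queued,
# the remaining 3 are dropped (queue is at capacity).
for i in range(8):
    status = "queued" if limiter.allow_request(f"req-{i}") else "dropped"
    print(f"Request {i}: {status}")
```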
Fixed Window Counter
The simplest rate-limiting algorithm: divide time into fixed windows and count requests per window.
How It Works
- Time is divided into fixed windows (e.g., each minute: 12:00:00–12:00:59, 12:01:00–12:01:59).
- Each window has a counter, starting at 0.
- On each request, increment the counter for the current window.
- If the counter exceeds the limit, reject the request.
- When a new window starts, the counter resets to 0.
Implementation with Redis
```python
import redis
import time

r = redis.Redis()

def fixed_window_rate_limit(user_id: str, limit: int, window_seconds: int) -> bool:
    """Returns True if request is allowed, False if rate limited."""
    # Key includes the window timestamp
    window = int(time.time() // window_seconds)
    key = f"rate:{user_id}:{window}"

    # Atomic increment
    current = r.incr(key)

    # Set expiry only on first request of the window
    if current == 1:
        r.expire(key, window_seconds + 1)  # +1 for clock skew

    return current <= limit

# Allow 100 requests per minute
allowed = fixed_window_rate_limit("user_123", limit=100, window_seconds=60)
```
The Boundary Problem
Fixed windows have a critical flaw: a user can make 2× the allowed requests in a short burst spanning the window boundary.
```
Window 1: 12:00:00 ──────────── 12:00:59 │ Window 2: 12:01:00 ──────────── 12:01:59
                                         │
        ... idle ...          100 reqs   │   100 reqs          ... idle ...
                        (12:00:58–59)    │   (12:01:00–01)
                                         │
                → 200 requests in ~2 seconds! (limit was 100/min)
```
Both windows are individually within limits, but the user effectively sent 200 requests in 2 seconds. This is why fixed windows are rarely used in production for strict rate limiting.
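To see the flaw concretely, here is a demonstration sketch against the `fixed_window_rate_limit` function above. It assumes a local Redis and that each batch of 100 calls completes within a couple of seconds; the function and user ID are illustrative:

```python
import time

def demo_boundary_burst():
    """Fire 100 requests just before a window boundary and 100 just after.

    Every one of the 200 is allowed, even though the limit is 100/min.
    """
    window = 60
    # Sleep until ~1 second before the next window boundary
    time.sleep(max(0, window - (time.time() % window) - 1))
    first = sum(fixed_window_rate_limit("burst_demo", 100, window)
                for _ in range(100))
    time.sleep(2)  # cross into the next window
    second = sum(fixed_window_rate_limit("burst_demo", 100, window)
                 for _ in range(100))
    print(f"Allowed {first + second} of 200 requests")  # prints 200
```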
Sliding Window Log
The sliding window log fixes the boundary problem by tracking every request timestamp and counting requests in a trailing window.
How It Works
- For each user, maintain a sorted set of request timestamps.
- On each new request, remove timestamps older than `now - window_size`.
- Count the remaining entries. If the count is below the limit, allow the request and add its timestamp; otherwise, reject it.
Redis Implementation
```python
import redis
import time
import uuid

r = redis.Redis()

def sliding_window_log(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    key = f"rate:swl:{user_id}"
    window_start = now - window_seconds

    # Use a pipeline for atomicity
    pipe = r.pipeline()
    # Remove timestamps outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Count remaining requests in the window
    pipe.zcard(key)
    # Add current request timestamp (score = timestamp, member = unique ID)
    pipe.zadd(key, {str(uuid.uuid4()): now})
    # Set TTL to auto-cleanup
    pipe.expire(key, window_seconds + 1)
    results = pipe.execute()

    request_count = results[1]  # zcard result

    if request_count >= limit:
        # Over the limit: remove the entry we just added
        pipe = r.pipeline()
        pipe.zpopmax(key)
        pipe.execute()
        return False
    return True

# 100 requests per 60-second sliding window
allowed = sliding_window_log("user_123", limit=100, window_seconds=60)
```
Lua Script (Fully Atomic)
```lua
-- sliding_window_log.lua
-- KEYS[1] = rate limit key
-- ARGV[1] = window size in seconds
-- ARGV[2] = max requests
-- ARGV[3] = current timestamp
-- ARGV[4] = unique request ID

local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local req_id = ARGV[4]

-- Remove expired entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current entries
local count = redis.call('ZCARD', key)

if count < limit then
    -- Add new request
    redis.call('ZADD', key, now, req_id)
    redis.call('EXPIRE', key, window + 1)
    return 1  -- allowed
else
    return 0  -- rejected
end
```
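For completeness, a Python caller for this script might look like the sketch below. It assumes redis-py and that the script is saved as `sliding_window_log.lua` next to the application; the helper name `swl_allowed` is illustrative:

```python
import time
import uuid

import redis

r = redis.Redis()
with open("sliding_window_log.lua") as f:
    SWL_SCRIPT = r.register_script(f.read())

def swl_allowed(user_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    # ARGV order matches the script: window, limit, timestamp, request ID
    result = SWL_SCRIPT(
        keys=[f"rate:swl:{user_id}"],
        args=[window_seconds, limit, time.time(), str(uuid.uuid4())],
    )
    return result == 1
```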
Sliding Window Counter
The sliding window counter is the sweet spot: it combines the accuracy of sliding windows with the memory efficiency of fixed windows. This is the algorithm Cloudflare uses at scale.
How It Works
Instead of storing individual timestamps, use a weighted average of the current and previous window's counts:
```
Effective count = (previous_window_count × overlap_percentage) + current_window_count
```
Example:
```
Window size:    60 seconds
Current time:   12:01:15 (15 seconds into the current window)

Previous window (12:00:00–12:00:59): 84 requests
Current window  (12:01:00–12:01:59): 36 requests

Overlap of previous window: (60 − 15) / 60 = 75%
Effective count = 84 × 0.75 + 36 = 63 + 36 = 99

If limit is 100 → this request is allowed (99 < 100)
```
Implementation
```python
import redis
import time

r = redis.Redis()

def sliding_window_counter(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    current_window = int(now // window_seconds)
    previous_window = current_window - 1

    current_key = f"rate:swc:{user_id}:{current_window}"
    previous_key = f"rate:swc:{user_id}:{previous_window}"

    # Get counts for both windows
    pipe = r.pipeline()
    pipe.get(current_key)
    pipe.get(previous_key)
    results = pipe.execute()

    current_count = int(results[0] or 0)
    previous_count = int(results[1] or 0)

    # Calculate elapsed fraction of current window
    elapsed = now - (current_window * window_seconds)
    weight = 1 - (elapsed / window_seconds)

    # Weighted count
    effective_count = previous_count * weight + current_count
    if effective_count >= limit:
        return False

    # Increment current window
    pipe = r.pipeline()
    pipe.incr(current_key)
    pipe.expire(current_key, window_seconds * 2)
    pipe.execute()
    return True

# 100 requests per minute, only 2 Redis keys per user
allowed = sliding_window_counter("user_123", limit=100, window_seconds=60)
```
Why This Is the Best Trade-Off
| Algorithm | Memory per User | Accuracy | Burst Handling |
|---|---|---|---|
| Fixed Window | 8 bytes | Poor (2× at boundary) | Allows boundary bursts |
| Sliding Log | O(limit) entries | Perfect | Precise |
| Sliding Counter | 16 bytes | ~99.7% accurate | Smooth |
| Token Bucket | 16 bytes | Good | Allows controlled bursts |
*Animation: the boundary problem with fixed windows, and how sliding windows solve it.*
Distributed Rate Limiting
In a single-server world, rate limiting is trivial — an in-memory counter works fine. But modern architectures run dozens or hundreds of API servers behind a load balancer. The fundamental challenge: how do all servers agree on the request count?
The Problem
```
                    ┌──────────────┐
Request ──────────▶ │ API Server 1 │  Local counter: 45
                    └──────────────┘
                    ┌──────────────┐
Request ──────────▶ │ API Server 2 │  Local counter: 38
                    └──────────────┘
                    ┌──────────────┐
Request ──────────▶ │ API Server 3 │  Local counter: 42
                    └──────────────┘

Each server thinks the user made ~40 requests,
but the REAL total is 45 + 38 + 42 = 125!
If the limit is 100, we've allowed 25% over-limit.
```
Solution: Centralized Store (Redis)
Use Redis as a shared counter that all API servers query atomically:
```
                    ┌──────────────┐
Request ──────────▶ │ API Server 1 │ ────┐
                    └──────────────┘     │
                    ┌──────────────┐     │
Request ──────────▶ │ API Server 2 │ ────┼──▶ Redis (single counter: 125)
                    └──────────────┘     │
                    ┌──────────────┐     │
Request ──────────▶ │ API Server 3 │ ────┘
                    └──────────────┘
```
Atomic Rate Limiting with Lua
The classic INCR + EXPIRE pattern has a race condition. If the server crashes between INCR and EXPIRE, the key lives forever. Use a Lua script for atomicity:
```lua
-- rate_limit.lua — Atomic fixed-window rate limiter
-- KEYS[1] = rate limit key
-- ARGV[1] = max requests
-- ARGV[2] = window size in seconds

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
    -- First request: set the expiry
    redis.call('EXPIRE', key, window)
end

if current > limit then
    return 0  -- rejected
else
    return 1  -- allowed
end
```
```python
# Python caller
import redis
import time

r = redis.Redis()

# Load the Lua script
RATE_LIMIT_SCRIPT = r.register_script("""
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return 0
else
    return 1
end
""")

def is_allowed(user_id: str, limit: int = 100, window: int = 60) -> bool:
    window_id = int(time.time() // window)
    key = f"rl:{user_id}:{window_id}"
    result = RATE_LIMIT_SCRIPT(keys=[key], args=[limit, window])
    return result == 1
```
Token Bucket with Redis
```lua
-- token_bucket.lua — Distributed token bucket
-- KEYS[1] = bucket key
-- ARGV[1] = capacity
-- ARGV[2] = refill rate (tokens/sec)
-- ARGV[3] = current timestamp (seconds, float)
-- ARGV[4] = tokens to consume

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

-- Get current state
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens
local elapsed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + elapsed * refill_rate)

-- Try to consume
if tokens >= requested then
    tokens = tokens - requested
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return 1  -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return 0  -- rejected
end
```
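A matching Python caller, as a sketch: it assumes redis-py and that the script is saved as `token_bucket.lua`; the key prefix `tb:` and helper name `bucket_allowed` are illustrative:

```python
import time

import redis

r = redis.Redis()
with open("token_bucket.lua") as f:
    TOKEN_BUCKET = r.register_script(f.read())

def bucket_allowed(user_id: str, capacity: int = 10,
                   refill_rate: float = 2.0, tokens: int = 1) -> bool:
    # ARGV order matches the script: capacity, refill rate, timestamp, tokens
    result = TOKEN_BUCKET(
        keys=[f"tb:{user_id}"],
        args=[capacity, refill_rate, time.time(), tokens],
    )
    return result == 1
```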
Race Conditions and Mitigations
| Race Condition | Cause | Solution |
|---|---|---|
| INCR without EXPIRE | Crash between INCR and EXPIRE | Lua script (atomic) |
| GET-then-SET | Two servers read same count, both allow | INCR (atomic) or Lua |
| Clock skew | Servers have different clocks | Use Redis TIME, or NTP sync |
| Redis failover | Replica promoted, counters lost | Accept brief over-limit; use WAIT for sync |
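On the clock-skew row: instead of trusting each application server's clock when passing timestamps into a Lua script, you can read the clock from Redis itself, so every server sees the same time. A minimal redis-py sketch:

```python
import redis

r = redis.Redis()

# TIME is answered by the Redis server, so all app servers share one clock
seconds, microseconds = r.time()
now = seconds + microseconds / 1_000_000
# Pass `now` as the timestamp argument (ARGV[3]) of the token-bucket
# script instead of the local time.time().
```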
*Animation: multiple API servers share a Redis counter — atomic INCR + EXPIRE and 429 rejection.*
Rate Limiting in Practice
HTTP Response Headers
The standard way to communicate rate-limit status to clients. These aren't in the HTTP spec (yet) but are a de facto standard adopted by GitHub, Stripe, Twitter, and others:
```http
HTTP/1.1 200 OK
X-RateLimit-Limit: 100          ← Max requests per window
X-RateLimit-Remaining: 42       ← Requests left in current window
X-RateLimit-Reset: 1714435200   ← Unix timestamp when window resets
```
429 Too Many Requests
```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714435200

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Rate limit exceeded. Try again in 30 seconds.",
    "retry_after": 30,
    "limit": 100,
    "remaining": 0,
    "reset_at": "2024-04-30T00:00:00Z"
  }
}
```
Implementing in Express.js Middleware
```javascript
const Redis = require('ioredis');
const redis = new Redis();

const RATE_LIMIT_LUA = `
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return {0, current, tonumber(redis.call('TTL', KEYS[1]))}
else
    return {1, current, tonumber(redis.call('TTL', KEYS[1]))}
end
`;

function rateLimit({ limit = 100, windowSeconds = 60 } = {}) {
  return async (req, res, next) => {
    const userId = req.user?.id || req.ip;
    const window = Math.floor(Date.now() / 1000 / windowSeconds);
    const key = `rl:${userId}:${window}`;

    const [allowed, count, ttl] = await redis.eval(
      RATE_LIMIT_LUA, 1, key, limit, windowSeconds
    );

    // Always set rate limit headers
    res.set('X-RateLimit-Limit', String(limit));
    res.set('X-RateLimit-Remaining', String(Math.max(0, limit - count)));
    res.set('X-RateLimit-Reset', String(
      Math.floor(Date.now() / 1000) + ttl
    ));

    if (!allowed) {
      res.set('Retry-After', String(ttl));
      return res.status(429).json({
        error: {
          code: 'rate_limit_exceeded',
          message: `Rate limit exceeded. Try again in ${ttl}s.`,
          retry_after: ttl,
        }
      });
    }
    next();
  };
}

// Usage
app.use('/api/', rateLimit({ limit: 100, windowSeconds: 60 }));
app.use('/api/login', rateLimit({ limit: 5, windowSeconds: 300 }));
```
Client-Side: Exponential Backoff with Jitter
When you receive a 429, don't retry immediately — use exponential backoff with jitter to avoid the thundering herd problem:
```python
import time
import random
import requests

def request_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            return response

        # Prefer the Retry-After header if present
        retry_after = response.headers.get('Retry-After')
        if retry_after:
            wait = int(retry_after)
        else:
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s
            base_wait = 2 ** attempt
            # Add jitter: random value between 0 and base_wait
            wait = base_wait + random.uniform(0, base_wait)

        print(f"Rate limited. Retrying in {wait:.1f}s (attempt {attempt + 1})")
        time.sleep(wait)

    raise Exception(f"Failed after {max_retries} retries")
```
```javascript
// JavaScript equivalent
async function fetchWithBackoff(url, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url);
    if (response.status !== 429) return response;

    const retryAfter = response.headers.get('Retry-After');
    const baseWait = retryAfter
      ? parseInt(retryAfter, 10)
      : Math.pow(2, attempt);
    const jitter = Math.random() * baseWait;
    const wait = baseWait + jitter;

    console.log(`Rate limited. Retry in ${wait.toFixed(1)}s`);
    await new Promise(r => setTimeout(r, wait * 1000));
  }
  throw new Error(`Failed after ${maxRetries} retries`);
}
```
Multi-Level Rate Limiting
Production systems rarely use a single rate limit. Instead, they layer multiple limits at different granularities:
Layered Limits
```
┌──────────────────────────────────────────────────────┐
│ Layer 1: Global                                      │
│   All traffic combined: 50,000 req/sec               │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Layer 2: Per-IP                                  │ │
│ │   Anonymous: 60 req/min per IP                   │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ Layer 3: Per-User                            │ │ │
│ │ │   Authenticated: 1,000 req/min per user      │ │ │
│ │ │ ┌──────────────────────────────────────────┐ │ │ │
│ │ │ │ Layer 4: Per-Endpoint                    │ │ │ │
│ │ │ │   POST /login:   5 req/min               │ │ │ │
│ │ │ │   GET /search:   30 req/min              │ │ │ │
│ │ │ │   GET /api/data: 1,000 req/min           │ │ │ │
│ │ │ └──────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘
```
API Tier Limits
Most SaaS APIs offer tiered pricing with different rate limits:
| Tier | Rate Limit | Burst | Daily Quota |
|---|---|---|---|
| Free | 60 req/min | 10 | 1,000 |
| Pro | 600 req/min | 100 | 50,000 |
| Enterprise | 6,000 req/min | 1,000 | Unlimited |
Implementation with Multiple Buckets
```python
import redis
import time

r = redis.Redis()

# Lua script that checks multiple rate limits atomically
MULTI_LIMIT_LUA = """
-- Check all limits; reject if ANY limit is exceeded
for i = 1, #KEYS do
    local current = redis.call('INCR', KEYS[i])
    if current == 1 then
        redis.call('EXPIRE', KEYS[i], ARGV[i * 2])
    end
    if current > tonumber(ARGV[i * 2 - 1]) then
        return {0, i}  -- rejected, which limit was hit
    end
end
return {1, 0}  -- all limits passed
"""

def multi_rate_limit(user_id: str, ip: str, endpoint: str, tier: str):
    now = int(time.time())

    # Define limits per tier
    tier_limits = {
        'free': {'per_min': 60, 'per_hour': 1000},
        'pro': {'per_min': 600, 'per_hour': 50000},
        'enterprise': {'per_min': 6000, 'per_hour': 500000},
    }
    limits = tier_limits.get(tier, tier_limits['free'])

    minute_window = now // 60
    hour_window = now // 3600

    keys = [
        f"rl:ip:{ip}:{minute_window}",                  # Per-IP: 120/min
        f"rl:user:{user_id}:{minute_window}",           # Per-user: tier limit/min
        f"rl:user:{user_id}:h:{hour_window}",           # Per-user: tier limit/hour
        f"rl:ep:{endpoint}:{user_id}:{minute_window}",  # Per-endpoint
    ]
    args = [
        120, 60,                   # IP: 120 req, 60s window
        limits['per_min'], 60,     # User per minute
        limits['per_hour'], 3600,  # User per hour
        30, 60,                    # Endpoint: 30 req/min
    ]

    result = r.eval(MULTI_LIMIT_LUA, len(keys), *keys, *args)
    return result[0] == 1  # True if all limits passed
```
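A hypothetical call site, with illustrative identifiers:

```python
# Hypothetical call site — the user ID, IP, and endpoint key are illustrative
allowed = multi_rate_limit(
    user_id="user_123",
    ip="203.0.113.7",
    endpoint="GET:/search",
    tier="pro",
)
if not allowed:
    ...  # respond with 429 and a Retry-After header
```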
API Gateway Configuration (Kong)
```yaml
# kong.yml — Rate limiting plugin configuration
plugins:
  - name: rate-limiting
    config:
      minute: 100
      hour: 5000
      policy: redis              # Use Redis for distributed counting
      redis_host: redis.internal
      redis_port: 6379
      redis_timeout: 2000
      redis_database: 0
      limit_by: consumer         # Per API key / consumer
      fault_tolerant: true       # Allow traffic if Redis is down
      hide_client_headers: false

  # Stricter limit for auth endpoints
  - name: rate-limiting
    route: login-route
    config:
      minute: 5
      hour: 20
      policy: redis
      redis_host: redis.internal
      limit_by: ip
```
Envoy Proxy Configuration
```yaml
# envoy.yaml — Rate limit filter
http_filters:
  - name: envoy.filters.http.ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
      domain: my_api
      rate_limit_service:
        grpc_service:
          envoy_grpc:
            cluster_name: rate_limit_cluster
        transport_api_version: V3
```

```yaml
# Rate limit service descriptor configuration
domain: my_api
descriptors:
  - key: user_id
    rate_limit:
      unit: minute
      requests_per_unit: 100
  - key: user_id
    value: free_tier
    rate_limit:
      unit: minute
      requests_per_unit: 10
```
One final operational decision for any distributed limiter: if the backing store (Redis, the rate limit service) is unreachable, do you fail open (let traffic through unthrottled) or fail closed (reject everything)? Failing open is the usual choice, since a rate-limiter outage shouldn't become a full API outage — in the Kong configuration above, `fault_tolerant: true` does exactly this.