High Level Design Series · Building Blocks · Part 4

Rate Limiting & Throttling

Why Rate Limit?

Every public-facing API that doesn't enforce rate limits is an incident waiting to happen. Rate limiting is the practice of controlling the number of requests a client can make to a service within a given time window. It sits at the intersection of reliability engineering, security, and cost management.

Core Motivations

Goal              | What It Prevents
------------------|-----------------
Prevent Abuse     | Malicious actors flooding your API with millions of requests, scraping data, or brute-forcing authentication endpoints.
Protect Resources | A single runaway client shouldn't saturate your database connections, exhaust thread pools, or crash downstream services.
Ensure Fair Usage | Without limits, one aggressive client can starve all other users of capacity (the "noisy neighbor" problem).
DDoS Mitigation   | Rate limiting is the first line of defense. While it can't stop a massive volumetric attack, it handles application-layer (L7) floods effectively.
Cost Control      | Cloud resources cost money. Unbounded traffic means unbounded bills. Rate limits cap the financial exposure.

Where to Rate Limit

Rate limiting can be applied at multiple layers in your architecture, and in practice you should use multiple layers:

Client → CDN/WAF (IP-level) → API Gateway (user-level) → Application (endpoint-level) → Database (connection pooling)
         ↑ Cloudflare/AWS WAF    ↑ Kong/Envoy/nginx      ↑ Middleware/decorator      ↑ pg_bouncer/ProxySQL
Rate limiting ≠ throttling. Rate limiting rejects excess requests (HTTP 429). Throttling slows them down (queuing, delayed responses). In practice, the terms are used interchangeably, but the distinction matters when designing your system.
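
To make the distinction concrete, here is a minimal throttling sketch. It assumes the TokenBucket class implemented in the next section; instead of rejecting when no token is available, it blocks until one frees up:

import time

def throttled_call(limiter, handler, *args):
    """Throttle instead of reject: block until the bucket has a token.

    Assumes `limiter` is the TokenBucket class implemented below.
    """
    while not limiter.allow_request():
        # Sleep roughly one refill interval and retry. The request is
        # delayed rather than dropped with a 429.
        time.sleep(1.0 / limiter.refill_rate)
    return handler(*args)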

Token Bucket Algorithm

The token bucket is the most widely used rate-limiting algorithm, adopted by AWS, Stripe, and most API gateways. It's elegant because it naturally allows bursts while enforcing a long-term average rate.

How It Works

  1. A bucket holds tokens. The bucket has a maximum capacity (e.g., 10 tokens).
  2. Tokens are added at a fixed refill rate (e.g., 1 token per second).
  3. Each incoming request must consume one token. If a token is available, the request proceeds. If not, the request is rejected (or queued).
  4. If the bucket is full, new tokens are discarded (the bucket can't overflow).

Parameters

Parameter       | Description             | Example
----------------|-------------------------|--------
bucket_capacity | Maximum burst size      | 10
refill_rate     | Tokens added per second | 2 tokens/sec

Implementation

import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity        # start full
        self.refill_rate = refill_rate # tokens per second
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill = now

    def allow_request(self, tokens: int = 1) -> bool:
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

# Usage: 10 requests max burst, 2 requests/sec sustained
limiter = TokenBucket(capacity=10, refill_rate=2)

for i in range(15):
    if limiter.allow_request():
        print(f"Request {i}: ✅ Allowed")
    else:
        print(f"Request {i}: ❌ Rate limited")
Why token bucket is dominant: It handles bursty traffic gracefully. A user who hasn't made requests for 5 seconds has accumulated tokens and can burst — which matches real user behavior (idle, then active). Fixed window counters penalize this pattern.

[Animation: tokens refill, get consumed by requests, and requests are rejected when the bucket is empty.]

Leaky Bucket Algorithm

The leaky bucket is the dual of the token bucket. Instead of tokens controlling admission, a queue (the bucket) holds incoming requests, and they "leak" out at a fixed rate — like water dripping through a hole at the bottom of a bucket.

How It Works

  1. Incoming requests are added to a FIFO queue (the bucket).
  2. The queue has a maximum size. If the queue is full, new requests are dropped.
  3. Requests are processed (leaked) at a constant rate, regardless of how fast they arrive.

Key Difference from Token Bucket

Aspect              | Token Bucket                          | Leaky Bucket
--------------------|---------------------------------------|-------------
Burst handling      | Allows bursts (up to bucket capacity) | Smooths out bursts completely
Output rate         | Variable (matches input up to limit)  | Constant (always fixed rate)
When bucket is full | Reject immediately                    | Drop from queue
Use case            | API rate limiting (most common)       | Traffic shaping, network QoS

Implementation

import time
from collections import deque

class LeakyBucket:
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # requests processed per second
        self.queue = deque()
        self.last_leak = time.monotonic()

    def _leak(self):
        now = time.monotonic()
        elapsed = now - self.last_leak
        leaked = int(elapsed * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            # Advance the clock by exactly the time those leaks account for,
            # so fractional leak credit isn't lost between calls.
            self.last_leak += leaked / self.leak_rate

    def allow_request(self, request_id: str) -> bool:
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request_id)
            return True  # queued (will be processed at leak rate)
        return False      # queue full, dropped

# Queue up to 5 requests, process 2 per second
limiter = LeakyBucket(capacity=5, leak_rate=2)
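
A quick demo of the queueing behavior (a sketch; the request IDs are invented for illustration):

# A burst of 8 requests hits an empty bucket: the first 5 are queued,
# the remaining 3 are dropped because the queue is full.
for i in range(8):
    status = "queued" if limiter.allow_request(f"req-{i}") else "dropped"
    print(f"req-{i}: {status}")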
When to use leaky bucket: When you need a perfectly smooth output rate — e.g., sending emails (don't want bursts hitting SMTP servers), writing to databases with strict IOPS limits, or network traffic shaping at the packet level. For API rate limiting, token bucket is almost always preferred because users expect burst capability.

Fixed Window Counter

The simplest rate-limiting algorithm: divide time into fixed windows and count requests per window.

How It Works

  1. Time is divided into fixed windows (e.g., each minute: 12:00:00–12:00:59, 12:01:00–12:01:59).
  2. Each window has a counter, starting at 0.
  3. On each request, increment the counter for the current window.
  4. If the counter exceeds the limit, reject the request.
  5. When a new window starts, the counter resets to 0.

Implementation with Redis

import redis
import time

r = redis.Redis()

def fixed_window_rate_limit(user_id: str, limit: int, window_seconds: int) -> bool:
    """Returns True if request is allowed, False if rate limited."""
    # Key includes the window timestamp
    window = int(time.time() // window_seconds)
    key = f"rate:{user_id}:{window}"

    # Atomic increment
    current = r.incr(key)

    # Set expiry only on first request of the window
    if current == 1:
        r.expire(key, window_seconds + 1)  # +1 for clock skew

    return current <= limit

# Allow 100 requests per minute
allowed = fixed_window_rate_limit("user_123", limit=100, window_seconds=60)

The Boundary Problem

Fixed windows have a critical flaw: a user can make 2× the allowed requests in a short burst spanning the window boundary.

Window 1: 12:00:00 ──────────────── 12:00:59 │ Window 2: 12:01:00 ──────────────── 12:01:59
                                             │
                  ... idle ...     100 reqs  │  100 reqs     ... idle ...
                                   ↑ 12:00:58–12:00:59  ↑ 12:01:00–12:01:01
                                             │
                        200 requests in ~2 seconds! (limit was 100/min)

Both windows are individually within limits, but the user effectively sent 200 requests in 2 seconds. This is why fixed windows are rarely used in production for strict rate limiting.
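
A minimal in-memory sketch makes the failure concrete (the class and timestamps below are invented for the demo):

import time

class FixedWindowCounter:
    """Toy in-memory fixed-window counter, for illustration only."""
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window = int(now // self.window_seconds)
        if window != self.window:   # new window: counter resets
            self.window, self.count = window, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

limiter = FixedWindowCounter(limit=100, window_seconds=60)

boundary = 60.0  # the 12:01:00 window boundary, in seconds
allowed = sum(limiter.allow(boundary - 1) for _ in range(100))  # 12:00:59
allowed += sum(limiter.allow(boundary) for _ in range(100))     # 12:01:00
print(f"{allowed} requests allowed in ~2 seconds")  # prints 200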

Fixed window is best for: Approximate limits where precision doesn't matter, internal service rate limits, or as a fast first-pass filter combined with a more accurate algorithm. Its simplicity (single Redis key per window) makes it the cheapest to run.

Sliding Window Log

The sliding window log fixes the boundary problem by tracking every request timestamp and counting requests in a trailing window.

How It Works

  1. For each user, maintain a sorted set of request timestamps.
  2. On each new request, remove timestamps older than now - window_size.
  3. Count remaining entries. If count < limit, allow the request and add its timestamp. Otherwise reject.

Redis Implementation

import redis
import time
import uuid

r = redis.Redis()

def sliding_window_log(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    key = f"rate:swl:{user_id}"
    window_start = now - window_seconds

    # redis-py pipelines default to MULTI/EXEC, so this batch executes
    # atomically. The rollback below is still a separate round trip,
    # which is why the fully atomic Lua version follows.
    pipe = r.pipeline()

    # Remove timestamps outside the window
    pipe.zremrangebyscore(key, 0, window_start)

    # Count remaining requests in the window
    pipe.zcard(key)

    # Add current request timestamp (score = timestamp, member = unique ID)
    pipe.zadd(key, {str(uuid.uuid4()): now})

    # Set TTL to auto-cleanup
    pipe.expire(key, window_seconds + 1)

    results = pipe.execute()
    request_count = results[1]  # zcard result

    if request_count >= limit:
        # Over the limit: roll back the entry we just added. ZPOPMAX removes
        # the highest-scored member (ours, unless a concurrent request added a
        # newer timestamp first); the Lua script below closes this race.
        r.zpopmax(key)
        return False

    return True

# 100 requests per 60-second sliding window
allowed = sliding_window_log("user_123", limit=100, window_seconds=60)

Lua Script (Fully Atomic)

-- sliding_window_log.lua
-- KEYS[1] = rate limit key
-- ARGV[1] = window size in seconds
-- ARGV[2] = max requests
-- ARGV[3] = current timestamp
-- ARGV[4] = unique request ID

local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local req_id = ARGV[4]

-- Remove expired entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current entries
local count = redis.call('ZCARD', key)

if count < limit then
    -- Add new request
    redis.call('ZADD', key, now, req_id)
    redis.call('EXPIRE', key, window + 1)
    return 1  -- allowed
else
    return 0  -- rejected
end
Memory trade-off: If you allow 10,000 requests per hour, each user's sorted set holds up to 10,000 entries. For 1M users, that's 10 billion entries in Redis. This is why sliding window log is impractical at scale — use sliding window counter instead.

Sliding Window Counter

The sliding window counter is the sweet spot: it combines the accuracy of sliding windows with the memory efficiency of fixed windows. This is the algorithm Cloudflare uses at scale.

How It Works

Instead of storing individual timestamps, use a weighted average of the current and previous window's counts:

Effective count = (previous_window_count × overlap_percentage) + current_window_count

Example:
  Window size: 60 seconds
  Current time: 12:01:15 (15 seconds into current window)
  Previous window (12:00:00–12:00:59): 84 requests
  Current window  (12:01:00–12:01:59): 36 requests

  Overlap of previous window: (60 - 15) / 60 = 75%

  Effective count = 84 × 0.75 + 36 = 63 + 36 = 99

  If limit is 100 → this request is allowed (99 < 100)

Implementation

import redis
import time

r = redis.Redis()

def sliding_window_counter(
    user_id: str, limit: int, window_seconds: int
) -> bool:
    now = time.time()
    current_window = int(now // window_seconds)
    previous_window = current_window - 1

    current_key = f"rate:swc:{user_id}:{current_window}"
    previous_key = f"rate:swc:{user_id}:{previous_window}"

    # Read both windows' counts. The two GETs plus the later INCR are not
    # atomic; fold the read-compute-write into a Lua script for strict accuracy.
    pipe = r.pipeline()
    pipe.get(current_key)
    pipe.get(previous_key)
    results = pipe.execute()

    current_count = int(results[0] or 0)
    previous_count = int(results[1] or 0)

    # Calculate elapsed fraction of current window
    elapsed = now - (current_window * window_seconds)
    weight = 1 - (elapsed / window_seconds)

    # Weighted count
    effective_count = previous_count * weight + current_count

    if effective_count >= limit:
        return False

    # Increment current window
    pipe = r.pipeline()
    pipe.incr(current_key)
    pipe.expire(current_key, window_seconds * 2)
    pipe.execute()

    return True

# 100 requests per minute, only 2 Redis keys per user
allowed = sliding_window_counter("user_123", limit=100, window_seconds=60)

Why This Is the Best Trade-Off

Algorithm       | Memory per User  | Accuracy              | Burst Handling
----------------|------------------|-----------------------|---------------
Fixed Window    | 8 bytes          | Poor (2× at boundary) | Allows boundary bursts
Sliding Log     | O(limit) entries | Perfect               | Precise
Sliding Counter | 16 bytes         | ~99.7% accurate       | Smooth
Token Bucket    | 16 bytes         | Good                  | Allows controlled bursts

[Animation: the boundary problem with fixed windows and how sliding windows solve it.]

Distributed Rate Limiting

In a single-server world, rate limiting is trivial — an in-memory counter works fine. But modern architectures run dozens or hundreds of API servers behind a load balancer. The fundamental challenge: how do all servers agree on the request count?

The Problem

                     ┌──────────────┐
  Request ──────────▶│ API Server 1 │──── Local counter: 45
                     └──────────────┘
                     ┌──────────────┐
  Request ──────────▶│ API Server 2 │──── Local counter: 38
                     └──────────────┘
                     ┌──────────────┐
  Request ──────────▶│ API Server 3 │──── Local counter: 42
                     └──────────────┘

  Each server thinks the user made ~40 requests.
  But the REAL total is 45 + 38 + 42 = 125!
  If the limit is 100, we've allowed 25% over-limit.

Solution: Centralized Store (Redis)

Use Redis as a shared counter that all API servers query atomically:

                     ┌──────────────┐
  Request ──────────▶│ API Server 1 │────┐
                     └──────────────┘    │
                     ┌──────────────┐    │
  Request ──────────▶│ API Server 2 │────┼──▶ Redis (single counter: 125)
                     └──────────────┘    │
                     ┌──────────────┐    │
  Request ──────────▶│ API Server 3 │────┘
                     └──────────────┘

Atomic Rate Limiting with Lua

The classic INCR + EXPIRE pattern has a failure window: if the server crashes (or the connection drops) between the INCR and the EXPIRE, the key never expires and the counter sticks around forever. A Lua script makes the pair atomic:

-- rate_limit.lua — Atomic fixed-window rate limiter
-- KEYS[1] = rate limit key
-- ARGV[1] = max requests
-- ARGV[2] = window size in seconds

local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)

if current == 1 then
    -- First request: set the expiry
    redis.call('EXPIRE', key, window)
end

if current > limit then
    return 0  -- rejected
else
    return 1  -- allowed
end

# Python caller
import redis
import time

r = redis.Redis()

# Load the Lua script
RATE_LIMIT_SCRIPT = r.register_script("""
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return 0
else
    return 1
end
""")

def is_allowed(user_id: str, limit: int = 100, window: int = 60) -> bool:
    window_id = int(time.time() // window)
    key = f"rl:{user_id}:{window_id}"
    result = RATE_LIMIT_SCRIPT(keys=[key], args=[limit, window])
    return result == 1

Token Bucket with Redis

-- token_bucket.lua — Distributed token bucket
-- KEYS[1] = bucket key
-- ARGV[1] = capacity
-- ARGV[2] = refill rate (tokens/sec)
-- ARGV[3] = current timestamp (seconds, float)
-- ARGV[4] = tokens to consume

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

-- Get current state
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now

-- Refill tokens
local elapsed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + elapsed * refill_rate)

-- Try to consume
if tokens >= requested then
    tokens = tokens - requested
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return 1  -- allowed
else
    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) + 1)
    return 0  -- rejected
end
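
A possible Python caller for this script, as a sketch: it assumes the Lua body above is inlined as a TOKEN_BUCKET_LUA string (an invented name), and it passes the app server's clock as ARGV[3], which assumes NTP-synced servers (see the clock-skew row in the table below):

import time
import redis

r = redis.Redis()
token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def allow_request(user_id: str, capacity: int = 10,
                  refill_rate: float = 2.0, tokens: int = 1) -> bool:
    # One atomic round trip: refill and consume both happen inside the script.
    return token_bucket(
        keys=[f"tb:{user_id}"],
        args=[capacity, refill_rate, time.time(), tokens],
    ) == 1

# 10-token burst, 2 tokens/sec sustained, shared across all API servers
if allow_request("user_123"):
    ...  # handle the request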

Race Conditions and Mitigations

Race Condition      | Cause                                   | Solution
--------------------|-----------------------------------------|---------
INCR without EXPIRE | Crash between INCR and EXPIRE           | Lua script (atomic)
GET-then-SET        | Two servers read same count, both allow | INCR (atomic) or Lua
Clock skew          | Servers have different clocks           | Use Redis TIME, or NTP sync
Redis failover      | Replica promoted, counters lost         | Accept brief over-limit; use WAIT for sync
Redis Cluster for scale: Partition rate-limit keys by user ID across Redis Cluster shards. Each user's counter lives on one shard. Because keys are spread across the cluster's hash slots, capacity scales nearly linearly with shard count. At Stripe's scale, a single Redis node handles 200K+ rate-limit checks per second.
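
One practical detail: Redis Cluster hashes only the substring inside {braces} when routing a key, so tagging every key with the user ID keeps all of that user's counters (and any multi-key Lua script over them) on one shard. A small illustrative helper, with invented key names:

def user_rate_keys(user_id: str, minute_window: int) -> list[str]:
    # "{user_123}" is a Redis Cluster hash tag: only the braced part is
    # hashed, so every key for this user maps to the same slot/shard.
    tag = "{" + user_id + "}"
    return [
        f"rl:{tag}:min:{minute_window}",
        f"rl:{tag}:hour:{minute_window // 60}",
    ]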

[Animation: multiple API servers share a Redis counter, with atomic INCR + EXPIRE and 429 rejection.]

Rate Limiting in Practice

HTTP Response Headers

The standard way to communicate rate-limit status to clients. These aren't in the HTTP spec (yet) but are a de facto standard adopted by GitHub, Stripe, Twitter, and others:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100        ← Max requests per window
X-RateLimit-Remaining: 42     ← Requests left in current window
X-RateLimit-Reset: 1714435200 ← Unix timestamp when window resets

429 Too Many Requests

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1714435200

{
    "error": {
        "code": "rate_limit_exceeded",
        "message": "Rate limit exceeded. Try again in 30 seconds.",
        "retry_after": 30,
        "limit": 100,
        "remaining": 0,
        "reset_at": "2026-04-30T12:00:00Z"
    }
}

Implementing in Express.js Middleware

const Redis = require('ioredis');
const redis = new Redis();

const RATE_LIMIT_LUA = `
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return {0, current, tonumber(redis.call('TTL', KEYS[1]))}
else
    return {1, current, tonumber(redis.call('TTL', KEYS[1]))}
end
`;

function rateLimit({ limit = 100, windowSeconds = 60 } = {}) {
    return async (req, res, next) => {
        const userId = req.user?.id || req.ip;
        const window = Math.floor(Date.now() / 1000 / windowSeconds);
        const key = `rl:${userId}:${window}`;

        const [allowed, count, ttl] = await redis.eval(
            RATE_LIMIT_LUA, 1, key, limit, windowSeconds
        );

        // Always set rate limit headers
        res.set('X-RateLimit-Limit', String(limit));
        res.set('X-RateLimit-Remaining', String(Math.max(0, limit - count)));
        res.set('X-RateLimit-Reset', String(
            Math.floor(Date.now() / 1000) + ttl
        ));

        if (!allowed) {
            res.set('Retry-After', String(ttl));
            return res.status(429).json({
                error: {
                    code: 'rate_limit_exceeded',
                    message: `Rate limit exceeded. Try again in ${ttl}s.`,
                    retry_after: ttl,
                }
            });
        }

        next();
    };
}

// Usage
app.use('/api/', rateLimit({ limit: 100, windowSeconds: 60 }));
app.use('/api/login', rateLimit({ limit: 5, windowSeconds: 300 }));

Client-Side: Exponential Backoff with Jitter

When you receive a 429, don't retry immediately — use exponential backoff with jitter to avoid the thundering herd problem:

import time
import random
import requests

def request_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    for attempt in range(max_retries):
        response = requests.get(url)

        if response.status_code != 429:
            return response

        # Prefer Retry-After header if present
        retry_after = response.headers.get('Retry-After')
        if retry_after:
            wait = int(retry_after)
        else:
            # Exponential backoff: 1s, 2s, 4s, 8s, 16s
            base_wait = 2 ** attempt
            # Add jitter: random value between 0 and base_wait
            wait = base_wait + random.uniform(0, base_wait)

        print(f"Rate limited. Retrying in {wait:.1f}s (attempt {attempt + 1})")
        time.sleep(wait)

    raise Exception(f"Failed after {max_retries} retries")

// JavaScript equivalent
async function fetchWithBackoff(url, maxRetries = 5) {
    for (let attempt = 0; attempt < maxRetries; attempt++) {
        const response = await fetch(url);

        if (response.status !== 429) return response;

        // Prefer Retry-After if the server sent one; otherwise back off
        // exponentially with jitter, matching the Python version above.
        const retryAfter = response.headers.get('Retry-After');
        let wait;
        if (retryAfter) {
            wait = parseInt(retryAfter, 10);
        } else {
            const baseWait = Math.pow(2, attempt);      // 1s, 2s, 4s, ...
            wait = baseWait + Math.random() * baseWait; // add jitter
        }

        console.log(`Rate limited. Retry in ${wait.toFixed(1)}s`);
        await new Promise(r => setTimeout(r, wait * 1000));
    }

    throw new Error(`Failed after ${maxRetries} retries`);
}
Why jitter matters: Without jitter, all rate-limited clients retry at exactly the same time (2^n seconds), creating a "retry storm" that overwhelms the server again. Jitter spreads retries uniformly across time, preventing synchronized bursts.

Multi-Level Rate Limiting

Production systems rarely use a single rate limit. Instead, they layer multiple limits at different granularities:

Layered Limits

┌──────────────────────────────────────────────────────┐
│ Layer 1: Global                                      │
│   All traffic combined: 50,000 req/sec               │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Layer 2: Per-IP                                  │ │
│ │   Anonymous: 60 req/min per IP                   │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ Layer 3: Per-User                            │ │ │
│ │ │   Authenticated: 1,000 req/min per user      │ │ │
│ │ │ ┌──────────────────────────────────────────┐ │ │ │
│ │ │ │ Layer 4: Per-Endpoint                    │ │ │ │
│ │ │ │   POST /login: 5 req/min                 │ │ │ │
│ │ │ │   GET /search: 30 req/min                │ │ │ │
│ │ │ │   GET /api/data: 1,000 req/min           │ │ │ │
│ │ │ └──────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────┘

API Tier Limits

Most SaaS APIs offer tiered pricing with different rate limits:

Tier       | Rate Limit    | Burst | Daily Quota
-----------|---------------|-------|------------
Free       | 60 req/min    | 10    | 1,000
Pro        | 600 req/min   | 100   | 50,000
Enterprise | 6,000 req/min | 1,000 | Unlimited

Implementation with Multiple Buckets

import redis
import time

r = redis.Redis()

# Lua script that checks multiple rate limits atomically
MULTI_LIMIT_LUA = """
-- Check all limits in order; reject if ANY limit is exceeded.
-- Note: limits checked before the failing one keep their increments.
for i = 1, #KEYS do
    local current = redis.call('INCR', KEYS[i])
    if current == 1 then
        redis.call('EXPIRE', KEYS[i], ARGV[i * 2])
    end
    if current > tonumber(ARGV[i * 2 - 1]) then
        return {0, i}  -- rejected, which limit was hit
    end
end
return {1, 0}  -- all limits passed
"""

def multi_rate_limit(user_id: str, ip: str, endpoint: str, tier: str):
    now = int(time.time())

    # Define limits per tier
    tier_limits = {
        'free':       {'per_min': 60,   'per_hour': 1000},
        'pro':        {'per_min': 600,  'per_hour': 50000},
        'enterprise': {'per_min': 6000, 'per_hour': 500000},
    }

    limits = tier_limits.get(tier, tier_limits['free'])
    minute_window = now // 60
    hour_window = now // 3600

    keys = [
        f"rl:ip:{ip}:{minute_window}",           # Per-IP: 120/min
        f"rl:user:{user_id}:{minute_window}",     # Per-user: tier limit/min
        f"rl:user:{user_id}:h:{hour_window}",     # Per-user: tier limit/hour
        f"rl:ep:{endpoint}:{user_id}:{minute_window}",  # Per-endpoint
    ]

    args = [
        120, 60,                        # IP: 120 req, 60s window
        limits['per_min'], 60,          # User per min
        limits['per_hour'], 3600,       # User per hour
        30, 60,                         # Endpoint: 30 req/min
    ]

    result = r.eval(MULTI_LIMIT_LUA, len(keys), *keys, *args)
    return result[0] == 1  # True if all limits passed

API Gateway Configuration (Kong)

# kong.yml — Rate limiting plugin configuration
plugins:
  - name: rate-limiting
    config:
      minute: 100
      hour: 5000
      policy: redis            # Use Redis for distributed counting
      redis_host: redis.internal
      redis_port: 6379
      redis_timeout: 2000
      redis_database: 0
      limit_by: consumer       # Per API key / consumer
      fault_tolerant: true     # Allow traffic if Redis is down
      hide_client_headers: false

  # Stricter limit for auth endpoints
  - name: rate-limiting
    route: login-route
    config:
      minute: 5
      hour: 20
      policy: redis
      redis_host: redis.internal
      limit_by: ip

Envoy Proxy Configuration

# envoy.yaml — Rate limit filter
http_filters:
  - name: envoy.filters.http.ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
      domain: my_api
      rate_limit_service:
        grpc_service:
          envoy_grpc:
            cluster_name: rate_limit_cluster
        transport_api_version: V3

# Rate limit service descriptor configuration
domain: my_api
descriptors:
  - key: user_id
    rate_limit:
      unit: minute
      requests_per_unit: 100
  - key: user_id
    value: free_tier
    rate_limit:
      unit: minute
      requests_per_unit: 10
Graceful degradation: If Redis is down, should you block all traffic or allow all traffic? Most production systems choose fail-open (allow all) rather than fail-closed, because a few seconds of unthrottled traffic is better than a total outage. Kong's fault_tolerant: true does exactly this.
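
A minimal fail-open wrapper in that spirit (a sketch; `check` can be any of the Redis-backed limiter functions above):

import redis

def is_allowed_fail_open(check, *args, **kwargs) -> bool:
    """If Redis is unreachable, allow the request instead of failing closed."""
    try:
        return check(*args, **kwargs)
    except redis.RedisError:
        # A few seconds of unthrottled traffic beats turning a Redis
        # outage into a full API outage (mirrors Kong's fault_tolerant).
        return True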