High Level Design Series · Foundations · Part 3 of 70

Load Balancing

What Is Load Balancing?

A load balancer is a device or software component that distributes incoming network traffic across a pool of backend servers. Its purpose is deceptively simple: make sure no single server bears too much load. In practice, this one sentence unlocks horizontal scaling, fault tolerance, and predictable latency at every tier of a modern distributed system.

Without a load balancer, your architecture looks like this: a single server handling every request. When that server reaches its capacity—say 5,000 concurrent connections on a typical 4-core, 16 GB application server—new users get timeouts or 503 errors. Worse, if that server crashes, everything goes down. The load balancer solves both problems simultaneously.

Where Load Balancers Sit in Architecture

Load balancers appear at every boundary in a production system: between clients and the web tier, between the web tier and application services, and between services and the databases or caches behind them.

Real-world scale: Netflix runs over 700 microservices, each with multiple instances. Without load balancing at every layer, a single hot service would cascade failures across the entire platform. Their edge gateway, Zuul, routes billions of requests per day.

The Single Point of Failure Problem

If the load balancer itself goes down, the entire system becomes unreachable. This is the classic irony: you added a load balancer to prevent SPOF at the server level, and now the load balancer itself is a SPOF. The solution is high availability (HA) for the load balancer itself:

| Pattern | How It Works | Used By |
| --- | --- | --- |
| Active-Passive | Two load balancers share a virtual IP (VIP) via a protocol like VRRP (Virtual Router Redundancy Protocol). The active node handles all traffic. If it fails, the passive node detects the failure via heartbeat (typically every 1–3 seconds) and takes over the VIP. Failover time: 2–10 seconds. | HAProxy + Keepalived, F5 Big-IP pairs |
| Active-Active | Both load balancers handle traffic simultaneously. DNS returns multiple IPs (or BGP Anycast advertises the same IP from both). If one fails, the other absorbs all traffic. Doubles throughput in normal operation. | AWS NLB (multi-AZ), Cloudflare, Google Cloud LB |

In cloud environments, the HA problem is largely solved by the provider. AWS Network Load Balancer, for instance, runs across multiple Availability Zones and automatically fails over—you never manage the redundancy yourself. This is one of the strongest arguments for managed load balancers.

L4 vs L7 Load Balancing

The two fundamental types of load balancers are defined by which OSI layer they operate at. This single distinction affects performance, features, cost, and when you'd choose one over the other.

Layer 4 (Transport Layer)

An L4 load balancer operates at the TCP/UDP level. It sees source IP, destination IP, source port, and destination port—and nothing else. It cannot read HTTP headers, cookies, or URL paths. It simply forwards the raw TCP connection (or UDP datagram) to a backend server.

How it works internally: When a client opens a TCP connection to the load balancer's VIP, the LB uses Network Address Translation (NAT) to rewrite the destination IP to one of the backend servers. For the lifetime of that TCP connection, all packets go to the same server. Some implementations use Direct Server Return (DSR), where the response goes directly from the backend to the client, bypassing the LB entirely—dramatically reducing the LB's bandwidth requirements.
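The connection-level behavior is easy to see in a toy user-space forwarder (a sketch only; real L4 LBs do this in the kernel, in hardware, or via XDP, and the `BACKENDS` pool here is hypothetical):

```python
import itertools
import socket
import threading

# Hypothetical backend pool (a real L4 LB learns this from config or service discovery).
BACKENDS = [("10.0.1.10", 8080), ("10.0.1.11", 8080)]
_rotation = itertools.cycle(BACKENDS)

def pick_backend():
    """Round-robin over the pool; the LB never looks at payload bytes."""
    return next(_rotation)

def relay(src, dst):
    """Pump raw bytes one way until the peer closes."""
    while chunk := src.recv(4096):
        dst.sendall(chunk)
    dst.close()

def handle(client_sock):
    backend_sock = socket.create_connection(pick_backend())
    # Two one-way pumps: client-to-backend and backend-to-client. Every packet
    # of this TCP connection goes to the same backend, just as NAT-mode L4 does.
    threading.Thread(target=relay, args=(client_sock, backend_sock), daemon=True).start()
    threading.Thread(target=relay, args=(backend_sock, client_sock), daemon=True).start()
```

Note that `relay` copies opaque bytes; there is no HTTP parsing anywhere, which is exactly why L4 balancing is so cheap.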

Typical L4 characteristics: under 1 µs of added latency per packet, 10M+ concurrent connections, and 100 Gbps throughput on hardware appliances.

Layer 7 (Application Layer)

An L7 load balancer terminates the TCP connection, reads the full HTTP request (method, path, headers, cookies, body), and then opens a new connection to the selected backend. This makes it far more powerful but also more resource-intensive.

What L7 can do that L4 cannot:

  • Route by URL path, hostname, header, cookie, or query parameter.
  • Terminate TLS and inspect the decrypted request.
  • Rewrite URLs and inject or strip headers.
  • Cache responses and apply per-route rate limits or authentication.

# Nginx L7 routing example
upstream api_servers {
    server 10.0.1.10:8080 weight=3;
    server 10.0.1.11:8080 weight=2;
    server 10.0.1.12:8080 weight=1;
}

upstream static_servers {
    server 10.0.2.10:80;
    server 10.0.2.11:80;
}

server {
    listen 443 ssl;
    server_name api.example.com;

    ssl_certificate     /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;

    location /api/ {
        proxy_pass http://api_servers;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Request-ID $request_id;
    }

    location /static/ {
        proxy_pass http://static_servers;
        expires 30d;
        add_header Cache-Control "public, immutable";
    }
}

Comparison Table

| Dimension | Layer 4 | Layer 7 |
| --- | --- | --- |
| OSI layer | Transport (TCP/UDP) | Application (HTTP/HTTPS/gRPC) |
| Visibility | IP + port only | Full request: URL, headers, cookies, body |
| Latency overhead | < 1 µs per packet | 0.5–5 ms per request (TLS + parsing) |
| Throughput | Millions of connections/sec | Hundreds of thousands of requests/sec |
| SSL/TLS | Passthrough (traffic stays encrypted) | Termination (decrypt at the LB) |
| Routing intelligence | IP hash, round robin only | URL, header, cookie, query param routing |
| Connection multiplexing | No — 1:1 client-to-server | Yes — many clients share backend connections |
| WebSocket support | Native (it's just TCP) | Requires upgrade handling |
| Cost (cloud) | Lower (AWS NLB: ~$0.006/GB) | Higher (AWS ALB: ~$0.008/GB + $0.40/hr) |
| Best for | High-throughput TCP: databases, game servers, gRPC, IoT | HTTP APIs, microservices, web apps |
When to use each: Use L4 when you need raw throughput and low latency (e.g., a database proxy, a gaming server, or a real-time streaming service). Use L7 when you need content-aware routing (e.g., routing API versions, A/B testing, or multi-tenant SaaS). Many production setups use both: an L4 LB at the edge (for DDoS protection and TLS passthrough) that forwards to L7 LBs internally (for smart routing).

Load Balancing Algorithms

The algorithm is the brain of the load balancer. It decides which backend server receives each request. The right choice depends on your workload characteristics: Are requests homogeneous? Are servers identical? Is session state involved?


Round Robin

The simplest algorithm: requests go to servers in sequential order. Server 1, Server 2, Server 3, Server 1, Server 2, Server 3, and so on. No state is tracked. Each server gets exactly 1/N of the traffic.

# Round Robin pseudocode
servers = [S1, S2, S3, S4]
counter = 0

def get_next_server():
    global counter
    server = servers[counter % len(servers)]
    counter += 1
    return server

When it works well: All servers are identical (same CPU, RAM, network) and all requests are roughly equal in cost. This is common for stateless API servers behind an auto-scaling group.

When it breaks: If Server 3 is processing a long-running report while Server 1 is idle, Round Robin will keep sending requests to Server 3 regardless. It has zero awareness of current load.

Weighted Round Robin

An extension where each server has a weight proportional to its capacity. A server with weight 3 receives three times as many requests as one with weight 1.

# Nginx weighted round robin
upstream backend {
    server 10.0.1.10:8080 weight=5;   # 8-core, 32GB  — gets 5x share
    server 10.0.1.11:8080 weight=3;   # 4-core, 16GB  — gets 3x share
    server 10.0.1.12:8080 weight=1;   # 2-core, 8GB   — gets 1x share
}
# Request distribution: S1 gets 5/9, S2 gets 3/9, S3 gets 1/9

Real use case: During a canary deployment, you might give the new version weight=1 and the stable version weight=99, routing only ~1% of traffic to the canary. As confidence increases, you gradually shift the weights.
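The mechanics are simple enough to sketch in Python. This uses the plain expanded-list form; Nginx itself implements a "smooth" interleaving variant, but the per-server shares over a full cycle come out the same:

```python
import itertools

# Expanded-list weighted round robin: repeat each server `weight` times.
WEIGHTS = {"s1": 5, "s2": 3, "s3": 1}

schedule = [name for name, w in WEIGHTS.items() for _ in range(w)]
rotation = itertools.cycle(schedule)

def get_next_server():
    return next(rotation)

# Over one full cycle of 9 requests: s1 serves 5, s2 serves 3, s3 serves 1.
counts = {name: 0 for name in WEIGHTS}
for _ in range(9):
    counts[get_next_server()] += 1
```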

Least Connections

Route each new request to the server with the fewest active connections. This is the first algorithm that considers actual server load, making it significantly better for heterogeneous workloads.

# HAProxy least connections
backend api_servers
    balance leastconn
    server s1 10.0.1.10:8080 check
    server s2 10.0.1.11:8080 check
    server s3 10.0.1.12:8080 check

Why it matters: Imagine a mix of requests: some take 10 ms (GET user profile), others take 5 seconds (generate PDF report). Round Robin would blindly assign PDF requests to servers that are already overloaded. Least Connections would route new requests to whichever server has freed up capacity.

The Weighted Least Connections variant combines weights with connection counting: score = active_connections / weight. The server with the lowest score wins.
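That scoring rule is a one-liner; the server names, connection counts, and weights below are illustrative:

```python
# Weighted Least Connections: the server with the lowest
# active_connections / weight score receives the next request.
servers = [
    {"name": "s1", "active": 10, "weight": 5},   # score 2.0
    {"name": "s2", "active": 4,  "weight": 2},   # score 2.0
    {"name": "s3", "active": 3,  "weight": 1},   # score 3.0
]

def select(pool):
    return min(pool, key=lambda s: s["active"] / s["weight"])

# min() keeps the first of tied scores, so s1 wins here.
```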

Least Response Time

Route to the server with the lowest average response time and fewest active connections. This uses actual observed performance rather than just connection count, making it more accurate.

Nginx Plus implements this as least_time:

upstream backend {
    least_time header;  # measure time to receive response headers
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;
}

Tradeoff: Requires tracking response times per server (typically using an exponentially weighted moving average with a window of 10–60 seconds). Slightly more overhead than Least Connections, but catches servers that have many connections that are technically "open" but responding slowly due to disk I/O or garbage collection pauses.
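A minimal sketch of that bookkeeping, assuming a simple EWMA with a made-up smoothing factor (real implementations tune this or use a time-windowed average):

```python
# Per-backend EWMA of observed latency. ALPHA is a hypothetical smoothing factor.
ALPHA = 0.3

class Backend:
    def __init__(self, name):
        self.name = name
        self.ewma_ms = None  # no observations yet

    def observe(self, latency_ms):
        if self.ewma_ms is None:
            self.ewma_ms = float(latency_ms)
        else:
            self.ewma_ms = ALPHA * latency_ms + (1 - ALPHA) * self.ewma_ms

def least_time(backends):
    # Backends with no data yet sort first, so they get warmed up.
    return min(backends, key=lambda b: (b.ewma_ms is not None, b.ewma_ms or 0.0))

b1, b2 = Backend("s1"), Backend("s2")
for ms in (100, 100, 100):
    b1.observe(ms)
for ms in (20, 30, 25):
    b2.observe(ms)
# b2's smoothed latency is far lower, so it is preferred.
```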

IP Hash

Compute a hash of the client's IP address and use it to consistently map to the same server: server_index = hash(client_ip) % N. This guarantees that the same client always reaches the same server—useful for session affinity without cookies.

upstream backend {
    ip_hash;
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;
}

Problem: When you add or remove a server, the modulo changes and almost every client gets remapped. If you have 4 servers and add a 5th, approximately 80% of clients shift. This is the exact problem that consistent hashing solves (see next section).
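The scale of the disruption is easy to verify empirically: hash 10,000 synthetic client IPs against 4 servers and again against 5, and count the changed mappings.

```python
import hashlib

def bucket(client_ip: str, n_servers: int) -> int:
    # Stable hash (Python's built-in hash() is randomized per process).
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return int(digest, 16) % n_servers

clients = [f"10.20.{i // 256}.{i % 256}" for i in range(10_000)]
moved = sum(bucket(ip, 4) != bucket(ip, 5) for ip in clients)
fraction = moved / len(clients)
# A client keeps its server only when hash % 4 == hash % 5 (i.e. hash % 20 < 4),
# so roughly 80% of clients are remapped.
```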

Random

Pick a random server for each request. Surprisingly effective at large scale. With 100+ servers and thousands of requests per second, the law of large numbers ensures roughly even distribution. Twitter's Finagle library uses Power of Two Choices (P2C): pick two random servers, then send to the one with fewer connections. This achieves near-optimal distribution with minimal overhead.

# Power of Two Choices
import random

def p2c_select(servers):
    # Sample two distinct servers at random, keep the less loaded one.
    s1, s2 = random.sample(servers, 2)
    return s1 if s1.active_connections <= s2.active_connections else s2

Algorithm Comparison

| Algorithm | State Tracked | Best For | Weakness |
| --- | --- | --- | --- |
| Round Robin | Counter only | Identical servers, uniform requests | Ignores server load |
| Weighted RR | Counter + weights | Mixed server capacities, canary deploys | Static weights, no runtime adaptation |
| Least Connections | Connection count per server | Variable request durations | Doesn't consider server capacity |
| Least Response Time | EWMA latency + connections | Latency-sensitive APIs | More overhead, needs warm-up period |
| IP Hash | None (stateless hash) | Session persistence without cookies | Uneven with few clients; disrupted on scaling |
| Random / P2C | None (or 2 samples) | Large server pools (100+) | Can be uneven with few servers |

Consistent Hashing for Load Balancing

Standard modulo-based hashing (hash(key) % N) has a critical flaw: when N changes (server added or removed), nearly every key gets remapped. For a load balancer routing based on IP or session ID, this means all users lose their server affinity simultaneously.

The Hash Ring

Consistent hashing arranges the hash space as a circular ring (0 to 2³² − 1). Each server is placed at one or more points on the ring (determined by hashing the server's identifier). To route a request, hash the request key and walk clockwise around the ring until you hit a server node.

# Consistent Hashing — conceptual illustration
Hash Ring: 0 -------- S1(45) -------- S3(120) -------- S2(200) -------- 0 (wrap)

Request hashed to 80 → walk clockwise → hits S3(120)
Request hashed to 150 → walk clockwise → hits S2(200)
Request hashed to 210 → walk clockwise → wraps to S1(45)

Key advantage: When you add server S4 at position 170, only requests between 120 and 170 are remapped (from S2 to S4). All other mappings remain unchanged. On average, only K/N keys move (where K is total keys and N is the number of servers). Compare this to modulo hashing where nearly all keys move.

Virtual Nodes

With just 3–5 physical servers on the ring, the distribution is very uneven—one server might own 60% of the hash space. The fix: assign each physical server multiple virtual nodes (typically 100–200) spread around the ring. This smooths out the distribution.

# Virtual nodes for server S1
hash("S1-0")  → position 45
hash("S1-1")  → position 178
hash("S1-2")  → position 312
...
hash("S1-149") → position 4021850321

# Each physical server gets ~150 virtual nodes
# The ring now has 150 × N points, ensuring even distribution
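A minimal ring with virtual nodes, using MD5 for placement (a sketch; real implementations differ in hash choice and data structures):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal hash ring with virtual nodes (a sketch, not production code)."""

    def __init__(self, servers, vnodes=150):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, server)
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server: str):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}-{i}"), server))

    def get(self, key: str) -> str:
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))  # walk clockwise
        return self._ring[idx % len(self._ring)][1]                   # wrap past the top

ring = ConsistentHashRing(["S1", "S2", "S3"])
before = {f"client-{i}": ring.get(f"client-{i}") for i in range(1000)}
ring.add("S4")
moved = sum(1 for key, srv in before.items() if ring.get(key) != srv)
# Only about a quarter of the keys move (all of them to S4); the rest keep
# their original server, unlike modulo hashing where ~80% would move.
```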

Real-world usage: Amazon's Dynamo (the design behind DynamoDB) used consistent hashing with virtual nodes for partition assignment. Cassandra uses it for token allocation on its ring. Envoy supports consistent hashing (ring hash and Maglev policies) for load balancing stateful services.

Deep dive coming: Consistent Hashing gets its own full post later in this series (Post 8: Consistent Hashing), where we'll implement it from scratch and explore Maglev hashing (Google's variant that achieves perfectly uniform distribution).

Health Checks

A load balancer is only as good as its ability to detect unhealthy servers and stop routing traffic to them. Without health checks, the LB happily sends requests to a server that crashed 10 minutes ago.


Active Health Checks

The load balancer proactively sends requests to each backend server at regular intervals to verify it's alive and responding correctly.

| Check Type | How It Works | Checks What |
| --- | --- | --- |
| TCP connect | Open a TCP connection to the server's port. If the 3-way handshake succeeds, it's "up." | Process is running and accepting connections |
| HTTP GET | Send GET /health and check for 200 OK. | Application is running and can handle requests |
| Deep health | Send GET /health/deep — the endpoint checks DB connection, cache, disk, downstream services. | Full dependency chain is healthy |

# AWS ALB health check configuration (Terraform)
resource "aws_lb_target_group" "api" {
  name     = "api-targets"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path                = "/health"
    port                = "traffic-port"
    protocol            = "HTTP"
    healthy_threshold   = 3      # 3 consecutive passes = healthy
    unhealthy_threshold = 2      # 2 consecutive failures = unhealthy
    interval            = 15     # check every 15 seconds
    timeout             = 5      # 5 second timeout per check
    matcher             = "200"  # expect HTTP 200
  }
}

Passive Health Checks

Instead of sending probes, the LB monitors actual traffic responses. If a server returns 5 consecutive 502/503 errors, or if response times spike above a threshold, the LB marks it unhealthy.

# Nginx passive health checks (via max_fails)
upstream backend {
    server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
    # After 3 failures within 30s, server is marked down for 30s
}

Active + Passive together: Most production systems use both. Passive checks catch problems instantly (as real traffic fails). Active checks detect problems even when no traffic is flowing (e.g., a server that crashed at 3 AM when traffic is near zero).

Health Check Intervals and Thresholds

Typical tuning: check interval 5–30 s; 2–3 consecutive failures to mark unhealthy; 3–5 consecutive passes to mark healthy; per-check timeout 2–5 s.

The unhealthy threshold is kept low (2–3) to remove bad servers quickly. The healthy threshold is kept higher (3–5) to prevent "flapping"—a server that's intermittently failing shouldn't keep toggling between healthy and unhealthy.
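These asymmetric thresholds amount to a small state machine, sketched here with the thresholds from the text (2 failures to go down, 3 passes to come back):

```python
# Threshold-based health state machine with anti-flapping asymmetry.
class HealthState:
    def __init__(self, unhealthy_threshold=2, healthy_threshold=3):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._fails = 0
        self._passes = 0

    def record(self, check_passed: bool):
        if check_passed:
            self._fails = 0
            self._passes += 1
            if not self.healthy and self._passes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._passes = 0
            self._fails += 1
            if self.healthy and self._fails >= self.unhealthy_threshold:
                self.healthy = False

s = HealthState()
for ok in (False, False):        # two consecutive failures: marked down
    s.record(ok)
down_after_two_failures = not s.healthy
for ok in (True, True):          # two passes are not enough to recover
    s.record(ok)
still_down = not s.healthy
s.record(True)                   # third consecutive pass: healthy again
```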

Graceful Removal and Addition

When removing a server (for maintenance or deployment), don't just yank it from the pool. Use connection draining: stop sending new requests to it, but let existing requests finish (typically with a 30–300 second timeout). AWS calls this "deregistration delay." This prevents in-flight requests from failing.

# HAProxy graceful server drain
# In the runtime API:
$ echo "set server backend/s1 state drain" | socat stdio /run/haproxy/admin.sock
# s1 stops receiving new connections
# Existing connections complete naturally
# After all connections close:
$ echo "set server backend/s1 state maint" | socat stdio /run/haproxy/admin.sock

Session Persistence (Sticky Sessions)

Some applications store session state locally on the server: shopping cart data, authentication tokens, in-memory caches, or WebSocket connections. If a returning user gets routed to a different server, their session is lost. Sticky sessions (session affinity) ensure the same client always reaches the same backend.

Cookie-Based Affinity

The load balancer inserts a cookie (e.g., SERVERID=s3) in the response. On subsequent requests, the client sends this cookie, and the LB routes to the specified server.

# HAProxy cookie-based stickiness
backend app_servers
    balance roundrobin
    cookie SERVERID insert indirect nocache
    server s1 10.0.1.10:8080 check cookie s1
    server s2 10.0.1.11:8080 check cookie s2
    server s3 10.0.1.12:8080 check cookie s3

Pros: Works reliably regardless of client IP changes (mobile networks, VPNs). Cons: Requires L7 LB (must read HTTP headers). Cookie size overhead (tiny, usually < 50 bytes).

IP-Based Affinity

Use IP Hash (described in algorithms above): hash(client_ip) % N. Simpler than cookies, works at L4, but breaks when clients share IPs (corporate NATs where 10,000 employees appear as one IP) or when IPs change (mobile networks).

Drawbacks of Sticky Sessions

Stickiness has real costs: load skews toward servers that have accumulated long-lived sessions, a server failure destroys every session pinned to it, and autoscaling helps less because newly added servers receive only brand-new clients.

The better approach: externalize session state. Store sessions in a shared store like Redis (with a TTL of 30 minutes) or a database. Now any server can handle any request because the session data is always accessible. This is the standard approach at scale—Netflix, Spotify, and Stripe all use externalized sessions. Sticky sessions become unnecessary, and your load balancer can use any algorithm freely.

// Express.js with Redis session store (no sticky sessions needed)
const express = require('express');
const session = require('express-session');
const RedisStore = require('connect-redis').default;
const { createClient } = require('redis');

const app = express();
const redisClient = createClient({ url: 'redis://session-cache:6379' });
redisClient.connect();

app.use(session({
  store: new RedisStore({ client: redisClient }),
  secret: process.env.SESSION_SECRET,
  resave: false,
  saveUninitialized: false,
  cookie: { maxAge: 30 * 60 * 1000 }  // 30 min TTL
}));

Global Server Load Balancing (GSLB)

Everything we've discussed so far operates within a single data center or region. GSLB distributes traffic across multiple geographic regions—this is how you serve users in Tokyo, Frankfurt, and São Paulo from their nearest data center.

DNS-Based Load Balancing

The simplest form: your DNS server returns different IP addresses for the same domain. A DNS query for api.example.com might return:

; DNS Round Robin
api.example.com.  300  IN  A  52.1.2.3     ; US-East
api.example.com.  300  IN  A  13.4.5.6     ; EU-West
api.example.com.  300  IN  A  103.7.8.9    ; AP-Southeast

; The client's DNS resolver picks one (usually the first)
; TTL of 300 seconds (5 min) means changes propagate slowly

Limitations: DNS doesn't know the client's location (it sees the resolver's IP, not the user's). TTL caching means changes take minutes to propagate. No health checking—DNS happily returns a dead server's IP.

GeoDNS Routing

GeoDNS extends DNS-based balancing by mapping the client's geographic location (via GeoIP databases) to the nearest data center. When a user in Tokyo queries api.example.com, the GeoDNS server returns the IP of the Tokyo data center.

# Route 53 geolocation routing policy (AWS CLI)
aws route53 change-resource-record-sets --hosted-zone-id Z123 \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "asia-pacific",
        "GeoLocation": { "ContinentCode": "AS" },
        "TTL": 60,
        "ResourceRecords": [{ "Value": "103.7.8.9" }]
      }
    }]
  }'

Services like AWS Route 53, Cloudflare DNS, and NS1 support GeoDNS with health checks integrated—if the Tokyo data center goes down, users in Asia are automatically routed to the next nearest healthy region.

Anycast Routing

Anycast advertises the same IP address from multiple data centers worldwide using BGP. The internet's routing infrastructure automatically directs each packet to the nearest data center (by network hops, not geography).

How it works: All your data centers announce the same IP prefix (e.g., 203.0.113.0/24) via BGP. When a client sends a packet to 203.0.113.1, their ISP's router has multiple paths and picks the shortest one.

Advantages over GeoDNS:

  • No TTL caching problem: BGP reconvergence redirects traffic in seconds, not minutes.
  • Routing follows actual network topology rather than a GeoIP database's guess at location.
  • DDoS traffic is diluted across every PoP instead of concentrating on one IP in one region.

Limitation: Anycast works best for short-lived, stateless requests (DNS, CDN). For long-lived TCP connections, a BGP route change can disrupt the connection. This is why Anycast is typically used for the first hop (DNS or CDN edge), with GeoDNS handling the actual application routing.

Multi-Region Failover

A production-grade GSLB setup combines multiple techniques:

  1. Primary routing via GeoDNS: Route users to their nearest region.
  2. Health checks per region: The GSLB monitors each region's health.
  3. Automatic failover: If US-East fails, its users are routed to US-West (the next closest healthy region).
  4. Weighted failover: During degraded performance, gradually shift traffic (e.g., 90% to primary, 10% to secondary) instead of a hard cutover.
# Route 53 failover routing (Terraform)
resource "aws_route53_health_check" "us_east" {
  fqdn              = "us-east.api.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10
}

resource "aws_route53_record" "primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  set_identifier = "primary"
  failover_routing_policy { type = "PRIMARY" }
  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true
  }
  health_check_id = aws_route53_health_check.us_east.id
}

resource "aws_route53_record" "secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  set_identifier = "secondary"
  failover_routing_policy { type = "SECONDARY" }
  alias {
    name                   = aws_lb.eu_west.dns_name
    zone_id                = aws_lb.eu_west.zone_id
    evaluate_target_health = true
  }
}

CDN Integration

CDNs like Cloudflare, Akamai, and AWS CloudFront are essentially massive GSLB systems. They combine Anycast (for DNS), GeoDNS (for edge selection), and L7 load balancing (for origin routing) into one integrated service. When you put your API behind CloudFront, users automatically hit the nearest edge location (300+ globally), which handles TLS termination, caching, and WAF—only cache misses reach your origin servers.

Load Balancers in Practice

Let's look at the actual tools used in production environments, from hardware appliances to cloud-native solutions.

Hardware Load Balancers

F5 Big-IP is the most well-known hardware LB. It's a dedicated appliance with custom ASICs for packet processing. F5 can handle millions of connections with sub-millisecond latency. However, it costs $50,000–$500,000+ per unit, requires dedicated network engineers, and scaling means buying more hardware. It's used primarily by banks, telecoms, and large enterprises with strict compliance requirements.

Software Load Balancers

| Software | Type | Performance | Key Features | Used By |
| --- | --- | --- | --- | --- |
| Nginx | L7 (L4 in stream mode) | ~1M req/sec (HTTP, single instance) | Reverse proxy, SSL termination, caching, rate limiting, WebSocket support. Configuration-driven. | Dropbox, Netflix, WordPress.com |
| HAProxy | L4 and L7 | ~2M req/sec (HTTP), ~3M conn/sec (TCP) | Zero-downtime reloads, runtime API, advanced health checks, stick tables for rate limiting. The gold standard for raw performance. | GitHub, Stack Overflow, Reddit |
| Envoy | L7 (L4 filters) | ~500K req/sec per instance | Built for service meshes. gRPC-native, automatic retries, circuit breaking, distributed tracing (Jaeger/Zipkin), hot restarts. xDS API for dynamic configuration. | Lyft (created it), Airbnb, Uber, Istio sidecar |
| Traefik | L7 | ~200K req/sec | Auto-discovery from Docker/K8s labels, automatic Let's Encrypt certificates, dashboard. Great for container environments. | Small-to-mid Kubernetes deployments |

Cloud Load Balancers

| Service | Provider | Type | Key Characteristics |
| --- | --- | --- | --- |
| NLB (Network LB) | AWS | L4 | Handles millions of req/sec, static IPs, preserves source IP, ultra-low latency. ~$0.006/GB processed. |
| ALB (Application LB) | AWS | L7 | Content-based routing, WebSocket, gRPC, WAF integration, authentication. ~$0.008/GB + $0.40/hr fixed. |
| CLB (Classic LB) | AWS | L4/L7 (legacy) | Deprecated. Exists only for backward compatibility. Don't use for new deployments. |
| Cloud Load Balancing | GCP | L4 and L7 | Global by default (single anycast VIP), auto-scaling, integrated CDN. Premium tier uses Google's private network backbone. |
| Azure Load Balancer / App Gateway | Azure | L4 / L7 | Azure LB for L4 (free basic tier), Application Gateway for L7 with WAF v2. |

Choosing in practice: If you're on AWS, start with an ALB for HTTP workloads and NLB for TCP/gRPC. If you need fine-grained control (custom algorithms, complex health checks), put HAProxy or Envoy behind a cloud NLB. If you're running Kubernetes, use an Ingress Controller (Nginx Ingress, Traefik, or Envoy/Istio) that integrates with your cloud LB.

Common Interview Questions

How would you handle load balancer failure?

Answer framework:

  1. Active-Passive HA: Deploy two LB instances with a shared VIP using VRRP/Keepalived. The passive node takes over within 2–5 seconds if the active fails.
  2. Active-Active: Deploy multiple LBs behind DNS (each with its own IP). If one fails, DNS health checks remove its IP within 30–60 seconds. For faster failover, use Anycast so BGP reconverges in seconds.
  3. Cloud-native: Use managed LBs (AWS ALB/NLB, GCP GLB) that are inherently HA across AZs. The cloud provider handles redundancy—this is the simplest and most reliable approach.
  4. Multi-layer: In critical systems, use DNS-level GSLB → Regional LB → Per-AZ LB. Each layer provides redundancy for the layer below.

Key insight: The LB should never be a single instance. Always design for N+1 redundancy at the LB layer, just as you do for application servers.

How to choose between L4 and L7 load balancing?

Choose L4 when:

  • You need maximum throughput (millions of connections/sec).
  • Your protocol isn't HTTP (raw TCP, UDP, database connections, game servers).
  • You need to preserve the original TCP connection (no termination).
  • Latency requirements are ultra-strict (< 1 ms overhead).

Choose L7 when:

  • You need content-based routing (URL paths, headers, cookies).
  • You want SSL termination at the LB.
  • You need request/response modification (header injection, URL rewriting).
  • You're building microservices that need per-route routing rules.
  • You want integrated WAF, rate limiting, or authentication.

Common pattern: L4 at the edge (NLB) for raw TCP handling + DDoS protection, L7 internally (ALB/Envoy) for smart application routing. This is what AWS recommends for high-traffic architectures.

Design a load balancer that handles 1M requests/sec

Requirements analysis: 1M req/sec = ~86 billion requests/day. Assuming average request size of 2 KB and response of 10 KB, that's ~96 Gbps of throughput.
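The arithmetic can be checked in a few lines (request and response sizes are the assumptions stated above, using decimal kilobytes):

```python
# Back-of-envelope check of the figures above.
req_per_sec = 1_000_000
req_per_day = req_per_sec * 86_400                 # seconds per day → 86.4 billion/day

request_bytes, response_bytes = 2_000, 10_000      # 2 KB request, 10 KB response
bits_per_sec = req_per_sec * (request_bytes + response_bytes) * 8
gbps = bits_per_sec / 1e9                          # → 96.0 Gbps
```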

Architecture:

  1. Layer 1 — DNS + Anycast: Use Anycast to distribute traffic across 4+ PoPs (Points of Presence). Each PoP handles ~250K req/sec.
  2. Layer 2 — L4 LBs per PoP: 2–4 NLBs (or DPDK-based L4 LBs like Katran) per PoP. Use ECMP (Equal-Cost Multi-Path) to distribute packets across LB instances. Each handles 100K+ connections/sec.
  3. Layer 3 — L7 LBs: 10–20 HAProxy/Envoy instances per PoP behind the L4 LBs. Each handles 50K–100K HTTP req/sec. These do SSL termination, routing, and health checks.
  4. Backend servers: 50–200 application servers per PoP, behind the L7 LBs.

Key design decisions:

  • Use consistent hashing at L4 for connection affinity.
  • Use Least Connections at L7 for even request distribution.
  • Health checks: active (every 5s) + passive (mark down after 3 consecutive 5xx errors).
  • Connection draining: 30-second drain period during deploys.
  • Metrics: Track p50/p95/p99 latency per backend, connection counts, error rates via Prometheus + Grafana.

Reference: Facebook's Katran (L4, XDP/eBPF-based) handles 10M+ packets/sec per core. Google's Maglev saturates a 10 Gbps link with small packets on a single machine. These are the architectures behind systems that actually serve 1M+ req/sec.

What happens when you type a URL in the browser? (Load balancing perspective)

Focusing on the load balancing hops:

  1. DNS resolution: Your browser queries DNS for www.example.com. The authoritative DNS server uses GeoDNS to return the IP of the nearest edge/PoP.
  2. CDN edge (Anycast): If a CDN is involved, the IP is an Anycast address. BGP routes you to the nearest CDN edge node. The CDN checks its cache—if it's a hit, it responds directly (no origin needed).
  3. L4 load balancer: For cache misses, the CDN (or direct traffic) hits an L4 LB. It selects a backend using IP hash or ECMP and forwards the TCP connection.
  4. L7 load balancer: The L7 LB terminates TLS, reads the HTTP request, and routes based on path/headers to the appropriate backend service.
  5. Application server: The selected server processes the request, potentially making further load-balanced calls to databases, caches, and microservices.

A single "simple" HTTP request can traverse 3–5 load balancers in a typical production architecture.