Load Balancing
What Is Load Balancing?
A load balancer is a device or software component that distributes incoming network traffic across a pool of backend servers. Its purpose is deceptively simple: make sure no single server bears too much load. In practice, this one sentence unlocks horizontal scaling, fault tolerance, and predictable latency at every tier of a modern distributed system.
Without a load balancer, your architecture looks like this: a single server handling every request. When that server reaches its capacity—say 5,000 concurrent connections on a typical 4-core, 16 GB machine running Nginx—new users get timeouts or 503 errors. Worse, if that server crashes, everything goes down. The load balancer solves both problems simultaneously.
Where Load Balancers Sit in Architecture
Load balancers appear at every boundary in a production system:
- Between clients and web servers — The most common placement. A DNS record points to the load balancer's IP, and it forwards requests to one of N web/application servers. This is what services like api.stripe.com or www.netflix.com do.
- Between web servers and application servers — In a three-tier architecture, the web tier (Nginx) might load-balance to multiple app servers (Node.js, Django, Spring).
- Between application servers and databases — A read-heavy workload can route SELECT queries to read replicas while sending writes to the primary. Tools like ProxySQL or PgBouncer do this.
- Between microservices — Service meshes like Istio and Linkerd perform per-request load balancing between thousands of service instances.
The Single Point of Failure Problem
If the load balancer itself goes down, the entire system becomes unreachable. This is the classic irony: you added a load balancer to eliminate a single point of failure (SPOF) at the server level, and now the load balancer is one. The solution is to make the load balancer itself highly available (HA):
| Pattern | How It Works | Used By |
|---|---|---|
| Active-Passive | Two load balancers share a virtual IP (VIP) via a protocol like VRRP (Virtual Router Redundancy Protocol). The active node handles all traffic. If it fails, the passive node detects the failure via heartbeat (typically every 1–3 seconds) and takes over the VIP. Failover time: 2–10 seconds. | HAProxy + Keepalived, F5 Big-IP pairs |
| Active-Active | Both load balancers handle traffic simultaneously. DNS returns multiple IPs (or BGP Anycast advertises the same IP from both). If one fails, the other absorbs all traffic. Doubles throughput in normal operation. | AWS NLB (multi-AZ), Cloudflare, Google Cloud LB |
In cloud environments, the HA problem is largely solved by the provider. AWS Network Load Balancer, for instance, runs across multiple Availability Zones and automatically fails over—you never manage the redundancy yourself. This is one of the strongest arguments for managed load balancers.
L4 vs L7 Load Balancing
The two fundamental types of load balancers are defined by which OSI layer they operate at. This single distinction affects performance, features, cost, and when you'd choose one over the other.
Layer 4 (Transport Layer)
An L4 load balancer operates at the TCP/UDP level. It sees source IP, destination IP, source port, and destination port—and nothing else. It cannot read HTTP headers, cookies, or URL paths. It simply forwards the raw TCP connection (or UDP datagram) to a backend server.
How it works internally: When a client opens a TCP connection to the load balancer's VIP, the LB uses Network Address Translation (NAT) to rewrite the destination IP to one of the backend servers. For the lifetime of that TCP connection, all packets go to the same server. Some implementations use Direct Server Return (DSR), where the response goes directly from the backend to the client, bypassing the LB entirely—dramatically reducing the LB's bandwidth requirements.
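To make "forwards the raw connection without reading it" concrete, here is a minimal userspace sketch in Python: it accepts a TCP connection, picks a backend round-robin, and copies bytes in both directions without ever parsing HTTP. The backend addresses and listen port are assumptions for illustration; production L4 balancers do this with NAT or DSR in the kernel or hardware, not as a userspace proxy.
# Minimal L4-style TCP forwarder (illustrative userspace sketch)
import asyncio
import itertools

BACKENDS = [("10.0.1.10", 8080), ("10.0.1.11", 8080)]  # hypothetical backends
next_backend = itertools.cycle(BACKENDS)

async def pipe(reader, writer):
    # copy raw bytes one way until the peer closes; no HTTP parsing anywhere
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle_client(client_reader, client_writer):
    host, port = next(next_backend)  # only IPs and ports are visible at this layer
    backend_reader, backend_writer = await asyncio.open_connection(host, port)
    await asyncio.gather(
        pipe(client_reader, backend_writer),  # client -> backend
        pipe(backend_reader, client_writer),  # backend -> client
    )

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 8443)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())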
Layer 7 (Application Layer)
An L7 load balancer terminates the TCP connection, reads the full HTTP request (method, path, headers, cookies, body), and then opens a new connection to the selected backend. This makes it far more powerful but also more resource-intensive.
What L7 can do that L4 cannot:
- Content-based routing: Route /api/* to API servers, /static/* to a CDN origin, /ws/* to WebSocket servers.
- Header inspection: Route based on User-Agent (mobile vs desktop), Accept-Language (localization), or custom headers like X-Tenant-ID (multi-tenancy).
- SSL/TLS termination: Decrypt HTTPS at the LB, forward plain HTTP to backends. This offloads expensive cryptographic operations from every backend server.
- Request modification: Add headers (X-Forwarded-For, X-Request-ID), rewrite URLs, compress responses.
- Web Application Firewall (WAF): Inspect requests for SQL injection, XSS, and other attacks.
- Rate limiting: Throttle requests per user, per API key, or per IP—at the edge before they hit backends.
# Nginx L7 routing example
upstream api_servers {
server 10.0.1.10:8080 weight=3;
server 10.0.1.11:8080 weight=2;
server 10.0.1.12:8080 weight=1;
}
upstream static_servers {
server 10.0.2.10:80;
server 10.0.2.11:80;
}
server {
listen 443 ssl;
server_name api.example.com;
ssl_certificate /etc/ssl/cert.pem;
ssl_certificate_key /etc/ssl/key.pem;
location /api/ {
proxy_pass http://api_servers;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Request-ID $request_id;
}
location /static/ {
proxy_pass http://static_servers;
expires 30d;
add_header Cache-Control "public, immutable";
}
}
Comparison Table
| Dimension | Layer 4 | Layer 7 |
|---|---|---|
| OSI Layer | Transport (TCP/UDP) | Application (HTTP/HTTPS/gRPC) |
| Visibility | IP + Port only | Full request: URL, headers, cookies, body |
| Latency overhead | < 1µs per packet | 0.5–5 ms per request (TLS + parsing) |
| Throughput | Millions of connections/sec | Hundreds of thousands of requests/sec |
| SSL/TLS | Passthrough (encrypted traffic) | Termination (decrypt at LB) |
| Routing intelligence | IP hash, round robin only | URL, header, cookie, query param routing |
| Connection multiplexing | No — 1:1 client-to-server | Yes — many clients share backend connections |
| WebSocket support | Native (it's just TCP) | Requires upgrade handling |
| Cost (cloud) | Lower (AWS NLB: ~$0.006/GB) | Higher (AWS ALB: ~$0.008/GB + $0.40/hr) |
| Best for | High-throughput TCP: databases, game servers, gRPC, IoT | HTTP APIs, microservices, web apps |
Load Balancing Algorithms
The algorithm is the brain of the load balancer. It decides which backend server receives each request. The right choice depends on your workload characteristics: Are requests homogeneous? Are servers identical? Is session state involved?
Round Robin
The simplest algorithm: requests go to servers in sequential order. Server 1, Server 2, Server 3, Server 1, Server 2, Server 3, and so on. No state is tracked. Each server gets exactly 1/N of the traffic.
# Round Robin — runnable version of the pseudocode
servers = ["S1", "S2", "S3", "S4"]
counter = 0

def get_next_server():
    global counter
    server = servers[counter % len(servers)]  # walk the pool in fixed order
    counter += 1
    return server
When it works well: All servers are identical (same CPU, RAM, network) and all requests are roughly equal in cost. This is common for stateless API servers behind an auto-scaling group.
When it breaks: If Server 3 is processing a long-running report while Server 1 is idle, Round Robin will keep sending requests to Server 3 regardless. It has zero awareness of current load.
Weighted Round Robin
An extension where each server has a weight proportional to its capacity. A server with weight 3 receives three times as many requests as one with weight 1.
# Nginx weighted round robin
upstream backend {
server 10.0.1.10:8080 weight=5; # 8-core, 32GB — gets 5x share
server 10.0.1.11:8080 weight=3; # 4-core, 16GB — gets 3x share
server 10.0.1.12:8080 weight=1; # 2-core, 8GB — gets 1x share
}
# Request distribution: S1 gets 5/9, S2 gets 3/9, S3 gets 1/9
Real use case: During a canary deployment, you might give the new version weight=1 and the stable version weight=99, routing only ~1% of traffic to the canary. As confidence increases, you gradually shift the weights.
Least Connections
Route each new request to the server with the fewest active connections. This is the first algorithm that considers actual server load, making it significantly better for heterogeneous workloads.
# HAProxy least connections
backend api_servers
balance leastconn
server s1 10.0.1.10:8080 check
server s2 10.0.1.11:8080 check
server s3 10.0.1.12:8080 check
Why it matters: Imagine a mix of requests: some take 10 ms (GET user profile), others take 5 seconds (generate PDF report). Round Robin would blindly assign PDF requests to servers that are already overloaded. Least Connections would route new requests to whichever server has freed up capacity.
The Weighted Least Connections variant combines weights with connection counting: score = active_connections / weight. The server with the lowest score wins.
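A minimal sketch of that scoring rule, using hypothetical connection counts and weights:
# Weighted Least Connections — pick the lowest active/weight ratio
servers = [
    {"name": "s1", "active": 12, "weight": 5},   # score 2.4
    {"name": "s2", "active": 4,  "weight": 3},   # score ~1.33  <- chosen
    {"name": "s3", "active": 3,  "weight": 1},   # score 3.0
]

def pick(servers):
    return min(servers, key=lambda s: s["active"] / s["weight"])

print(pick(servers)["name"])  # s2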
Least Response Time
Route to the server with the lowest average response time and fewest active connections. This uses actual observed performance rather than just connection count, making it more accurate.
Nginx Plus implements this as least_time:
upstream backend {
least_time header; # measure time to receive response headers
server 10.0.1.10:8080;
server 10.0.1.11:8080;
server 10.0.1.12:8080;
}
Tradeoff: Requires tracking response times per server (typically using an exponentially weighted moving average with a window of 10–60 seconds). Slightly more overhead than Least Connections, but catches servers that have many connections that are technically "open" but responding slowly due to disk I/O or garbage collection pauses.
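A minimal sketch of that bookkeeping; the smoothing factor and structure are assumptions for illustration rather than any specific load balancer's implementation:
# Exponentially weighted moving average (EWMA) of per-backend response time
class BackendStats:
    def __init__(self, alpha=0.2):
        self.alpha = alpha        # weight given to the newest sample
        self.ewma_ms = None       # smoothed response time, None until first sample

    def record(self, latency_ms):
        if self.ewma_ms is None:
            self.ewma_ms = latency_ms
        else:
            self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms

def pick(backends):
    # route to the backend with the lowest smoothed latency;
    # backends with no samples yet are tried first (treated as 0 ms)
    return min(backends, key=lambda b: b.ewma_ms or 0.0)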
IP Hash
Compute a hash of the client's IP address and use it to consistently map to the same server: server_index = hash(client_ip) % N. This guarantees that the same client always reaches the same server—useful for session affinity without cookies.
upstream backend {
ip_hash;
server 10.0.1.10:8080;
server 10.0.1.11:8080;
server 10.0.1.12:8080;
}
Problem: When you add or remove a server, the modulo changes and almost every client gets remapped. If you have 4 servers and add a 5th, approximately 80% of clients shift. This is the exact problem that consistent hashing solves (see next section).
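A quick simulation of that reshuffling, assuming hash(client_ip) % N over 10,000 synthetic client IPs:
# How many clients get remapped when the pool grows from 4 to 5 servers?
import hashlib

def bucket(ip, n):
    return int(hashlib.md5(ip.encode()).hexdigest(), 16) % n

ips = [f"10.{a}.{b}.{c}" for a in range(10) for b in range(10) for c in range(100)]
moved = sum(bucket(ip, 4) != bucket(ip, 5) for ip in ips)
print(f"{moved / len(ips):.0%} of clients changed servers")  # ~80%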
Random
Pick a random server for each request. Surprisingly effective at large scale. With 100+ servers and thousands of requests per second, the law of large numbers ensures roughly even distribution. Twitter's Finagle library uses Power of Two Choices (P2C): pick two random servers, then send to the one with fewer connections. This achieves near-optimal distribution with minimal overhead.
# Power of Two Choices (P2C) — runnable version of the pseudocode
import random

def p2c_select(servers):
    # pick two distinct servers at random, keep the one with fewer active connections
    s1, s2 = random.sample(servers, 2)
    return s1 if s1.active_connections <= s2.active_connections else s2
Algorithm Comparison
| Algorithm | State Tracked | Best For | Weakness |
|---|---|---|---|
| Round Robin | Counter only | Identical servers, uniform requests | Ignores server load |
| Weighted RR | Counter + weights | Mixed server capacities, canary deploys | Static weights, no runtime adaptation |
| Least Connections | Connection count/server | Variable request durations | Doesn't consider server capacity |
| Least Response Time | EWMA latency + connections | Latency-sensitive APIs | More overhead, needs warm-up period |
| IP Hash | None (stateless hash) | Session persistence without cookies | Uneven with few clients; disrupted on scaling |
| Random / P2C | None (or 2 samples) | Large server pools (100+) | Can be uneven with few servers |
Consistent Hashing for Load Balancing
Standard modulo-based hashing (hash(key) % N) has a critical flaw: when N changes (server added or removed), nearly every key gets remapped. For a load balancer routing based on IP or session ID, this means all users lose their server affinity simultaneously.
The Hash Ring
Consistent hashing arranges the hash space as a circular ring (0 to 2^32 − 1). Each server is placed at one or more points on the ring (determined by hashing the server's identifier). To route a request, hash the request key and walk clockwise around the ring until you hit a server node.
# Consistent Hashing — conceptual illustration
Hash Ring: 0 -------- S1(45) -------- S3(120) -------- S2(200) -------- 0 (wrap)
Request hashed to 80 → walk clockwise → hits S3(120)
Request hashed to 150 → walk clockwise → hits S2(200)
Request hashed to 210 → walk clockwise → wraps to S1(45)
Key advantage: When you add server S4 at position 170, only requests between 120 and 170 are remapped (from S2 to S4). All other mappings remain unchanged. On average, only K/N keys move (where K is total keys and N is the number of servers). Compare this to modulo hashing where nearly all keys move.
Virtual Nodes
With just 3–5 physical servers on the ring, the distribution is very uneven—one server might own 60% of the hash space. The fix: assign each physical server multiple virtual nodes (typically 100–200) spread around the ring. This smooths out the distribution.
# Virtual nodes for server S1
hash("S1-0") → position 45
hash("S1-1") → position 178
hash("S1-2") → position 312
...
hash("S1-149") → position 4021850321
# Each physical server gets ~150 virtual nodes
# The ring now has 150 × N points, ensuring even distribution
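Putting the ring and virtual nodes together, a minimal sketch might look like this (the server names, MD5 hash, and 150 virtual nodes are illustrative assumptions):
# Consistent hash ring with virtual nodes
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=150):
        self.vnodes = vnodes
        self.ring = []                     # sorted list of (position, server)
        for server in servers:
            self.add(server)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def add(self, server):
        for i in range(self.vnodes):       # one ring position per virtual node
            bisect.insort(self.ring, (self._hash(f"{server}-{i}"), server))

    def remove(self, server):
        self.ring = [(pos, s) for pos, s in self.ring if s != server]

    def get(self, key):
        idx = bisect.bisect_right(self.ring, (self._hash(key), ""))
        if idx == len(self.ring):
            idx = 0                        # walked past the end: wrap around
        return self.ring[idx][1]

ring = ConsistentHashRing(["S1", "S2", "S3"])
print(ring.get("203.0.113.7"))             # the same key always maps to the same server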
Real-world usage: Amazon DynamoDB uses consistent hashing with virtual nodes for partition assignment. Cassandra uses it for token ring allocation. Envoy proxy supports consistent hashing for load balancing stateful services.
Health Checks
A load balancer is only as good as its ability to detect unhealthy servers and stop routing traffic to them. Without health checks, the LB happily sends requests to a server that crashed 10 minutes ago.
Active Health Checks
The load balancer proactively sends requests to each backend server at regular intervals to verify it's alive and responding correctly.
| Check Type | How It Works | Checks What |
|---|---|---|
| TCP connect | Open a TCP connection to the server's port. If the 3-way handshake succeeds, it's "up." | Process is running and accepting connections |
| HTTP GET | Send GET /health and check for 200 OK. | Application is running and can handle requests |
| Deep health | Send GET /health/deep — the endpoint checks DB connection, cache, disk, downstream services. | Full dependency chain is healthy |
# AWS ALB health check configuration (Terraform)
resource "aws_lb_target_group" "api" {
name = "api-targets"
port = 8080
protocol = "HTTP"
vpc_id = aws_vpc.main.id
health_check {
path = "/health"
port = "traffic-port"
protocol = "HTTP"
healthy_threshold = 3 # 3 consecutive passes = healthy
unhealthy_threshold = 2 # 2 consecutive failures = unhealthy
interval = 15 # check every 15 seconds
timeout = 5 # 5 second timeout per check
matcher = "200" # expect HTTP 200
}
}
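On the application side, the endpoint those checks hit can be shallow or deep, as described in the table above. A rough sketch (Flask and the dependency-probe helpers are assumptions for illustration):
# Shallow vs. deep health endpoints (illustrative sketch)
from flask import Flask, jsonify

app = Flask(__name__)

def check_database():   # hypothetical dependency probes
    return True

def check_cache():
    return True

@app.route("/health")
def health():
    # shallow: the process is up and able to answer a request
    return jsonify(status="ok"), 200

@app.route("/health/deep")
def health_deep():
    # deep: only report healthy if the dependency chain is reachable
    checks = {"database": check_database(), "cache": check_cache()}
    code = 200 if all(checks.values()) else 503
    return jsonify(status="ok" if code == 200 else "degraded", checks=checks), code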
Passive Health Checks
Instead of sending probes, the LB monitors actual traffic responses. If a server returns 5 consecutive 502/503 errors, or if response times spike above a threshold, the LB marks it unhealthy.
# Nginx passive health checks (via max_fails)
upstream backend {
server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
# After 3 failures within 30s, server is marked down for 30s
}
Active + Passive together: Most production systems use both. Passive checks catch problems instantly (as real traffic fails). Active checks detect problems even when no traffic is flowing (e.g., a server that crashed at 3 AM when traffic is near zero).
Health Check Intervals and Thresholds
The unhealthy threshold is kept low (2–3) to remove bad servers quickly. The healthy threshold is kept higher (3–5) to prevent "flapping"—a server that's intermittently failing shouldn't keep toggling between healthy and unhealthy.
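A minimal sketch of that threshold logic, using the example values above (the class itself is illustrative, not any specific load balancer's code):
# Flip health state only after N consecutive results in the same direction
class HealthTracker:
    def __init__(self, unhealthy_threshold=2, healthy_threshold=3):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self.streak = 0            # consecutive results contradicting the current state

    def record(self, check_passed):
        if check_passed == self.healthy:
            self.streak = 0        # result agrees with current state: reset the streak
            return self.healthy
        self.streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self.streak >= needed:
            self.healthy = not self.healthy
            self.streak = 0
        return self.healthy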
Graceful Removal and Addition
When removing a server (for maintenance or deployment), don't just yank it from the pool. Use connection draining: stop sending new requests to it, but let existing requests finish (typically with a 30–300 second timeout). AWS calls this "deregistration delay." This prevents in-flight requests from failing.
# HAProxy graceful server drain
# In the runtime API:
$ echo "set server backend/s1 state drain" | socat stdio /run/haproxy/admin.sock
# s1 stops receiving new connections
# Existing connections complete naturally
# After all connections close:
$ echo "set server backend/s1 state maint" | socat stdio /run/haproxy/admin.sock
Session Persistence (Sticky Sessions)
Some applications store session state locally on the server: shopping cart data, authentication tokens, in-memory caches, or WebSocket connections. If a returning user gets routed to a different server, their session is lost. Sticky sessions (session affinity) ensure the same client always reaches the same backend.
Cookie-Based Affinity
The load balancer inserts a cookie (e.g., SERVERID=s3) in the response. On subsequent requests, the client sends this cookie, and the LB routes to the specified server.
# HAProxy cookie-based stickiness
backend app_servers
balance roundrobin
cookie SERVERID insert indirect nocache
server s1 10.0.1.10:8080 check cookie s1
server s2 10.0.1.11:8080 check cookie s2
server s3 10.0.1.12:8080 check cookie s3
Pros: Works reliably regardless of client IP changes (mobile networks, VPNs). Cons: Requires L7 LB (must read HTTP headers). Cookie size overhead (tiny, usually < 50 bytes).
IP-Based Affinity
Use IP Hash (described in algorithms above): hash(client_ip) % N. Simpler than cookies, works at L4, but breaks when clients share IPs (corporate NATs where 10,000 employees appear as one IP) or when IPs change (mobile networks).
Drawbacks of Sticky Sessions
- Uneven load distribution: Some users generate far more traffic than others. A "sticky" power user pins to one server, overloading it while others sit idle.
- Reduced fault tolerance: If the sticky server dies, the user loses their session. The LB routes them to a new server with no session data.
- Scaling friction: Can't easily add/remove servers without disrupting sessions.
- Cache fragmentation: Each server's in-memory cache holds a different subset of data, reducing hit rates.
The usual alternative is to avoid sticky sessions entirely by moving session state into a shared store such as Redis, so any backend can serve any request:
// Express.js with Redis session store (no sticky sessions needed)
const express = require('express');
const session = require('express-session');
const RedisStore = require('connect-redis').default;
const { createClient } = require('redis');

const app = express();
const redisClient = createClient({ url: 'redis://session-cache:6379' });
redisClient.connect();
app.use(session({
store: new RedisStore({ client: redisClient }),
secret: process.env.SESSION_SECRET,
resave: false,
saveUninitialized: false,
cookie: { maxAge: 30 * 60 * 1000 } // 30 min TTL
}));
Global Server Load Balancing (GSLB)
Everything we've discussed so far operates within a single data center or region. GSLB distributes traffic across multiple geographic regions—this is how you serve users in Tokyo, Frankfurt, and São Paulo from their nearest data center.
DNS-Based Load Balancing
The simplest form: your DNS server returns different IP addresses for the same domain. A DNS query for api.example.com might return:
; DNS Round Robin
api.example.com. 300 IN A 52.1.2.3 ; US-East
api.example.com. 300 IN A 13.4.5.6 ; EU-West
api.example.com. 300 IN A 103.7.8.9 ; AP-Southeast
; The client's DNS resolver picks one (usually the first)
; TTL of 300 seconds (5 min) means changes propagate slowly
Limitations: DNS doesn't know the client's location (it sees the resolver's IP, not the user's). TTL caching means changes take minutes to propagate. No health checking—DNS happily returns a dead server's IP.
GeoDNS Routing
GeoDNS extends DNS-based balancing by mapping the client's geographic location (via GeoIP databases) to the nearest data center. When a user in Tokyo queries api.example.com, the GeoDNS server returns the IP of the Tokyo data center.
# Route 53 geolocation routing policy (AWS CLI)
aws route53 change-resource-record-sets --hosted-zone-id Z123 \
--change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "api.example.com",
"Type": "A",
"SetIdentifier": "asia-pacific",
"GeoLocation": { "ContinentCode": "AS" },
"TTL": 60,
"ResourceRecords": [{ "Value": "103.7.8.9" }]
}
}]
}'
Services like AWS Route 53, Cloudflare DNS, and NS1 support GeoDNS with health checks integrated—if the Tokyo data center goes down, users in Asia are automatically routed to the next nearest healthy region.
Anycast Routing
Anycast advertises the same IP address from multiple data centers worldwide using BGP. The internet's routing infrastructure automatically directs each packet to the nearest data center (by network hops, not geography).
How it works: All your data centers announce the same IP prefix (e.g., 203.0.113.0/24) via BGP. When a client sends a packet to 203.0.113.1, their ISP's router has multiple paths and picks the shortest one.
Advantages over GeoDNS:
- Instant failover: If a data center goes down, BGP routes converge in seconds (vs. DNS TTL minutes).
- No DNS caching issues: Routing happens at the network layer.
- DDoS absorption: Attack traffic is distributed across all data centers. Cloudflare's entire network (300+ cities) runs on Anycast.
Limitation: Anycast works best for short-lived, stateless requests (DNS, CDN). For long-lived TCP connections, a BGP route change can disrupt the connection. This is why Anycast is typically used for the first hop (DNS or CDN edge), with GeoDNS handling the actual application routing.
Multi-Region Failover
A production-grade GSLB setup combines multiple techniques:
- Primary routing via GeoDNS: Route users to their nearest region.
- Health checks per region: The GSLB monitors each region's health.
- Automatic failover: If US-East fails, its users are routed to US-West (the next closest healthy region).
- Weighted failover: During degraded performance, gradually shift traffic (e.g., 90% to primary, 10% to secondary) instead of a hard cutover.
# Route 53 failover routing (Terraform)
resource "aws_route53_health_check" "us_east" {
fqdn = "us-east.api.example.com"
port = 443
type = "HTTPS"
resource_path = "/health"
failure_threshold = 3
request_interval = 10
}
resource "aws_route53_record" "primary" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
set_identifier = "primary"
failover_routing_policy { type = "PRIMARY" }
alias {
name = aws_lb.us_east.dns_name
zone_id = aws_lb.us_east.zone_id
evaluate_target_health = true
}
health_check_id = aws_route53_health_check.us_east.id
}
resource "aws_route53_record" "secondary" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
set_identifier = "secondary"
failover_routing_policy { type = "SECONDARY" }
alias {
name = aws_lb.eu_west.dns_name
zone_id = aws_lb.eu_west.zone_id
evaluate_target_health = true
}
}
CDN Integration
CDNs like Cloudflare, Akamai, and AWS CloudFront are essentially massive GSLB systems. They combine Anycast (for DNS), GeoDNS (for edge selection), and L7 load balancing (for origin routing) into one integrated service. When you put your API behind CloudFront, users automatically hit the nearest edge location (300+ globally), which handles TLS termination, caching, and WAF—only cache misses reach your origin servers.
Load Balancers in Practice
Let's look at the actual tools used in production environments, from hardware appliances to cloud-native solutions.
Hardware Load Balancers
F5 Big-IP is the most well-known hardware LB. It's a dedicated appliance with custom ASICs for packet processing. F5 can handle millions of connections with sub-millisecond latency. However, it costs $50,000–$500,000+ per unit, requires dedicated network engineers, and scaling means buying more hardware. It's used primarily by banks, telecoms, and large enterprises with strict compliance requirements.
Software Load Balancers
| Software | Type | Performance | Key Features | Used By |
|---|---|---|---|---|
| Nginx | L7 (L4 in stream mode) | ~1M req/sec (HTTP, single instance) | Reverse proxy, SSL termination, caching, rate limiting, WebSocket support. Configuration-driven. | Dropbox, Netflix, WordPress.com |
| HAProxy | L4 and L7 | ~2M req/sec (HTTP), ~3M conn/sec (TCP) | Zero-downtime reloads, runtime API, advanced health checks, stick tables for rate limiting. The gold standard for raw performance. | GitHub, Stack Overflow, Reddit |
| Envoy | L7 (L4 filters) | ~500K req/sec per instance | Built for service meshes. gRPC-native, automatic retries, circuit breaking, distributed tracing (Jaeger/Zipkin), hot restarts. xDS API for dynamic configuration. | Lyft (created it), Airbnb, Uber, Istio sidecar |
| Traefik | L7 | ~200K req/sec | Auto-discovery from Docker/K8s labels, automatic Let's Encrypt certificates, dashboard. Great for container environments. | Small-to-mid Kubernetes deployments |
Cloud Load Balancers
| Service | Provider | Type | Key Characteristics |
|---|---|---|---|
| NLB (Network LB) | AWS | L4 | Handles millions of req/sec, static IPs, preserves source IP, ultra-low latency. ~$0.006/GB processed. |
| ALB (Application LB) | AWS | L7 | Content-based routing, WebSocket, gRPC, WAF integration, authentication. ~$0.008/GB + $0.40/hr fixed. |
| CLB (Classic LB) | AWS | L4/L7 (legacy) | Deprecated. Exists only for backward compatibility. Don't use for new deployments. |
| Cloud Load Balancing | GCP | L4 and L7 | Global by default (single anycast VIP), auto-scaling, integrated CDN. Premium tier uses Google's private network backbone. |
| Azure Load Balancer / App Gateway | Azure | L4 / L7 | Azure LB for L4 (free basic tier), Application Gateway for L7 with WAF v2. |
Common Interview Questions
How would you handle load balancer failure?
Answer framework:
- Active-Passive HA: Deploy two LB instances with a shared VIP using VRRP/Keepalived. The passive node takes over within 2–5 seconds if the active fails.
- Active-Active: Deploy multiple LBs behind DNS (each with its own IP). If one fails, DNS health checks remove its IP within 30–60 seconds. For faster failover, use Anycast so BGP reconverges in seconds.
- Cloud-native: Use managed LBs (AWS ALB/NLB, GCP GLB) that are inherently HA across AZs. The cloud provider handles redundancy—this is the simplest and most reliable approach.
- Multi-layer: In critical systems, use DNS-level GSLB → Regional LB → Per-AZ LB. Each layer provides redundancy for the layer below.
Key insight: The LB should never be a single instance. Always design for N+1 redundancy at the LB layer, just as you do for application servers.
How to choose between L4 and L7 load balancing?
Choose L4 when:
- You need maximum throughput (millions of connections/sec).
- Your protocol isn't HTTP (raw TCP, UDP, database connections, game servers).
- You need to preserve the original TCP connection (no termination).
- Latency requirements are ultra-strict (< 1 ms overhead).
Choose L7 when:
- You need content-based routing (URL paths, headers, cookies).
- You want SSL termination at the LB.
- You need request/response modification (header injection, URL rewriting).
- You're building microservices that need per-route routing rules.
- You want integrated WAF, rate limiting, or authentication.
Common pattern: L4 at the edge (NLB) for raw TCP handling + DDoS protection, L7 internally (ALB/Envoy) for smart application routing. This is what AWS recommends for high-traffic architectures.
Design a load balancer that handles 1M requests/sec
Requirements analysis: 1M req/sec = ~86 billion requests/day. Assuming average request size of 2 KB and response of 10 KB, that's ~96 Gbps of throughput.
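A quick back-of-envelope check of those figures (treating 1 KB as 1,000 bytes):
# Back-of-envelope for 1M req/sec
rps = 1_000_000
print(f"{rps * 86_400:,} requests/day")                       # 86,400,000,000 ≈ 86 billion
bits_per_request = (2 + 10) * 1000 * 8                        # 2 KB request + 10 KB response
print(f"{rps * bits_per_request / 1e9:.0f} Gbps sustained")   # 96 Gbps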
Architecture:
- Layer 1 — DNS + Anycast: Use Anycast to distribute traffic across 4+ PoPs (Points of Presence). Each PoP handles ~250K req/sec.
- Layer 2 — L4 LBs per PoP: 2–4 NLBs (or DPDK-based L4 LBs like Katran) per PoP. Use ECMP (Equal-Cost Multi-Path) to distribute packets across LB instances. Each handles 100K+ connections/sec.
- Layer 3 — L7 LBs: 10–20 HAProxy/Envoy instances per PoP behind the L4 LBs. Each handles 50K–100K HTTP req/sec. These do SSL termination, routing, and health checks.
- Backend servers: 50–200 application servers per PoP, behind the L7 LBs.
Key design decisions:
- Use consistent hashing at L4 for connection affinity.
- Use Least Connections at L7 for even request distribution.
- Health checks: active (every 5s) + passive (mark down after 3 consecutive 5xx errors).
- Connection draining: 30-second drain period during deploys.
- Metrics: Track p50/p95/p99 latency per backend, connection counts, error rates via Prometheus + Grafana.
Reference: Facebook's Katran (L4, XDP/eBPF-based) handles 10M+ packets/sec per core. Google's Maglev handles 10M connections/sec per machine. These are the architectures behind systems that actually serve 1M+ req/sec.
What happens when you type a URL in the browser? (Load balancing perspective)
Focusing on the load balancing hops:
- DNS resolution: Your browser queries DNS for www.example.com. The authoritative DNS server uses GeoDNS to return the IP of the nearest edge/PoP.
- CDN edge (Anycast): If a CDN is involved, the IP is an Anycast address. BGP routes you to the nearest CDN edge node. The CDN checks its cache—if it's a hit, it responds directly (no origin needed).
- L4 load balancer: For cache misses, the CDN (or direct traffic) hits an L4 LB. It selects a backend using IP hash or ECMP and forwards the TCP connection.
- L7 load balancer: The L7 LB terminates TLS, reads the HTTP request, and routes based on path/headers to the appropriate backend service.
- Application server: The selected server processes the request, potentially making further load-balanced calls to databases, caches, and microservices.
A single "simple" HTTP request can traverse 3–5 load balancers in a typical production architecture.