Case Study: Netflix Architecture
Netflix at Scale
Netflix operates one of the most demanding distributed systems ever built. As of 2024, the platform serves 230+ million subscribers across 190+ countries, consuming roughly 15% of global downstream internet bandwidth during peak hours. Every evening, tens of millions of concurrent streams fire up — each requiring real-time content discovery, personalization, adaptive bitrate streaming, and sub-second playback start.
| Metric | Scale |
|---|---|
| Global subscribers | 230M+ paid memberships |
| Daily viewing hours | ~500 million hours/day |
| Internet bandwidth | ~15% of global downstream traffic |
| Microservices | 1,000+ microservices in production |
| API requests | Billions of API calls per day |
| Content library | 17,000+ titles, each encoded in 1,200+ streams |
| CDN footprint | Open Connect Appliances in 1,000+ ISP locations |
| AWS regions | 3 active AWS regions (us-east-1, us-west-2, eu-west-1) |
What makes Netflix architecturally fascinating is not just the scale, but how they got there. The story of Netflix's engineering evolution — from a monolithic DVD-rental application to the planet's most sophisticated streaming platform — is a masterclass in incremental migration, build-vs-buy decisions, and embracing failure as a feature.
The Monolith-to-Microservices Migration (2008–2016)
In August 2008, Netflix suffered a major database corruption incident that halted DVD shipments for three days. This event became the catalyst for one of the most famous architecture migrations in software history.
Migration Timeline
Phase 1: The Wake-Up Call (2008–2009)
Netflix's original architecture was a classic monolith: a single Java application backed by an Oracle relational database. The DVD rental service, the nascent streaming service, billing, user profiles, and content metadata all lived in one codebase deployed as a single WAR file. The 2008 database corruption was not a code bug — it was a systemic failure caused by tight coupling. A corruption in the billing tables cascaded into the entire application because everything shared one database.
The engineering team made two pivotal decisions:
- Move to the cloud. Rather than building more data centers, Netflix would run on AWS — becoming AWS's largest customer and effectively betting the company on public cloud infrastructure years before it was common.
- Decompose the monolith. Each business domain would become an independent service with its own data store, communicating via well-defined APIs.
Phase 2: Strangler Fig Pattern (2009–2012)
Netflix didn't rewrite everything at once. They used what Martin Fowler calls the Strangler Fig Pattern: new features were built as microservices, and existing functionality was incrementally extracted from the monolith. An API gateway (the precursor to Zuul) routed traffic — some requests went to the old monolith, others to new services. Over time, the monolith shrank as services grew around it.
Key milestones in this phase:
- 2009: Non-customer-facing systems (movie encoding, log analysis) migrated to AWS first as low-risk experiments.
- 2010: The member-facing website components began migrating. User account management was extracted as a standalone service.
- 2011: Netflix launched Chaos Monkey, signaling a fundamental shift in how they thought about reliability — if you can't prevent failure, embrace it.
- 2012: The streaming API was fully running on AWS microservices. The monolith still handled some billing and DVD operations.
Phase 3: Full Cloud Native (2013–2016)
The final phase involved migrating the remaining monolith components, including the most sensitive system: billing. Billing was the last to move because it directly handles money — any bug means revenue loss. Netflix built an elaborate dual-write system where both the monolith and the new billing microservice processed transactions in parallel, with automated reconciliation to verify consistency before cutting over.
- 2013: Netflix open-sourced the Netflix OSS stack (Eureka, Zuul, Hystrix, Ribbon), giving the world access to battle-tested microservice infrastructure.
- 2014: Internal data platform services migrated. Cassandra replaced Oracle for high-throughput use cases.
- 2015: The last piece of the monolith — the DVD.com billing system — was extracted. The Oracle database was decommissioned.
- 2016: Netflix declared the migration complete. The entire platform ran on AWS with 1,000+ microservices, zero Oracle databases, and a fully cloud-native architecture.
Guiding Principles
Several principles emerged from the migration that shaped Netflix's engineering culture permanently:
- Freedom and Responsibility: Teams own their services end-to-end — design, development, deployment, and on-call. No separate ops team deploys your code.
- Highly Aligned, Loosely Coupled: Teams agree on goals and interfaces but choose their own tech stacks, languages, and release cadences.
- Context, not Control: Engineering leadership sets direction and provides tools, but doesn't mandate implementation details.
- Embrace the Cloud: Treat cloud instances as ephemeral. Any instance can die at any time. Design for it.
Netflix OSS — The Infrastructure Stack
Netflix didn't just build microservices — they built an entire open-source ecosystem to make microservices work at scale. The Netflix OSS stack became the de facto reference architecture for cloud-native systems, and many of its concepts were later absorbed into frameworks like Spring Cloud.
Zuul — API Gateway
Zuul is Netflix's edge service that handles all incoming API traffic. Every request from every Netflix client (TV, phone, browser, game console) passes through Zuul before reaching any backend microservice.
Zuul operates as a series of filter chains (a minimal pre-filter sketch follows the list):
- Pre-filters: Authentication, rate limiting, request decoration (adding device info, A/B test group assignments, routing metadata).
- Routing filters: Determine which backend service handles the request, with support for canary deployments (send 1% of traffic to a new version), regional routing, and dark launching.
- Post-filters: Response decoration, header injection, metrics collection, and response compression.
- Error filters: Fallback responses, error categorization, and alerting triggers.
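To make the filter chain concrete, here is a minimal Zuul 1-style pre-filter that decorates requests. It uses the real `ZuulFilter` extension points, but the header names and the grouping helper are illustrative assumptions, not Netflix's production filters:

```java
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

// Illustrative pre-filter: the header names and A/B helper are assumptions.
public class RequestDecorationFilter extends ZuulFilter {
    @Override public String filterType() { return "pre"; }   // runs before routing
    @Override public int filterOrder() { return 10; }        // position within the chain
    @Override public boolean shouldFilter() { return true; } // applies to every request

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        String device = ctx.getRequest().getHeader("X-Device-Type"); // assumed header
        ctx.addZuulRequestHeader("x-device", device == null ? "unknown" : device);
        ctx.addZuulRequestHeader("x-ab-group", assignGroup(ctx));    // hypothetical helper
        return null; // Zuul ignores the return value
    }

    private String assignGroup(RequestContext ctx) {
        // Hypothetical: a stable hash of the client address buckets traffic for A/B tests
        return (ctx.getRequest().getRemoteAddr().hashCode() % 2 == 0) ? "A" : "B";
    }
}
```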
Zuul 2 (released 2018) moved from a blocking Servlet model to a non-blocking, event-driven architecture built on Netty, handling millions of concurrent connections with far fewer threads. At peak, Zuul handles 300,000+ requests per second per cluster.
Eureka — Service Discovery
With 1,000+ microservices and tens of thousands of running instances, you can't hardcode service locations. Eureka is Netflix's service registry — a REST-based service where every instance registers itself at startup and sends periodic heartbeats.
Key design decisions in Eureka (a client-side cache sketch follows the list):
- AP over CP: Eureka prioritizes availability over consistency (in CAP terms). During a network partition, Eureka servers don't reject registrations — they accept stale data rather than returning errors. Netflix chose this because a slightly stale service registry is far better than no registry at all.
- Client-side caching: Every service caches the full registry locally and refreshes every 30 seconds. If Eureka goes completely down, services continue routing using their local cache — graceful degradation.
- Self-preservation mode: If Eureka detects that too many instances are failing heartbeats simultaneously (suggesting a network issue, not actual failures), it enters self-preservation mode and stops expiring registrations.
- Peer-to-peer replication: Eureka servers replicate registrations to each other asynchronously. No leader election, no consensus protocol — eventual consistency is sufficient for service discovery.
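The client-side caching behavior is worth sketching, because it is what makes Eureka outages survivable. This shows the idea only, not the real eureka-client API (whose registry model is `InstanceInfo` plus delta fetches); `ServiceInstance` and `EurekaHttpApi` here are assumed stand-ins:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

record ServiceInstance(String service, String host, int port) {} // assumed model

interface EurekaHttpApi { // assumed abstraction over Eureka's REST endpoints
    Map<String, List<ServiceInstance>> fetchFullRegistry();
}

class LocalRegistryCache {
    private final Map<String, List<ServiceInstance>> registry = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start(EurekaHttpApi eureka) {
        // Refresh the full registry every 30 seconds, as the Eureka client does
        scheduler.scheduleAtFixedRate(() -> {
            try {
                registry.putAll(eureka.fetchFullRegistry());
            } catch (Exception e) {
                // Eureka unreachable: keep serving the stale local copy rather
                // than failing lookups (graceful degradation)
            }
        }, 0, 30, TimeUnit.SECONDS);
    }

    List<ServiceInstance> instancesOf(String service) {
        return registry.getOrDefault(service, List.of());
    }
}
```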
Ribbon — Client-Side Load Balancing
Unlike traditional load balancing where a central LB (like NGINX or HAProxy) sits between clients and servers, Ribbon pushes load-balancing logic into the client. Each service instance gets the list of available backend instances from Eureka and makes its own routing decisions.
Why client-side LB? It eliminates a network hop. Instead of Client → LB → Server, it's Client → Server directly. At Netflix's scale, removing one hop from billions of daily requests saves measurable latency and eliminates the load balancer as a single point of failure.
Ribbon supports multiple strategies: round-robin, weighted response time (prefer faster servers), availability filtering (skip servers with open circuit breakers), and zone-aware routing (prefer servers in the same AWS availability zone to reduce cross-zone data transfer costs).
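A toy version of the client-side decision, combining zone-aware routing, availability filtering, and round-robin. The `Server` record and the predicate wiring are assumptions for illustration, not Ribbon's actual `IRule` API:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

record Server(String host, int port, String zone) {} // assumed model

class ClientSideLoadBalancer {
    private final AtomicInteger counter = new AtomicInteger();

    // Assumes at least one instance passes the availability filter
    Server choose(List<Server> fromEureka, Predicate<Server> circuitClosed, String myZone) {
        // Zone-aware routing: prefer instances in our own availability zone
        List<Server> candidates = fromEureka.stream()
                .filter(circuitClosed)                  // skip servers with open circuits
                .filter(s -> myZone.equals(s.zone()))
                .toList();
        if (candidates.isEmpty()) {
            // No healthy same-zone instance: fall back to any healthy zone
            candidates = fromEureka.stream().filter(circuitClosed).toList();
        }
        // Round-robin across whatever survived the filters
        return candidates.get(Math.floorMod(counter.getAndIncrement(), candidates.size()));
    }
}
```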
Hystrix — Circuit Breaker
Hystrix is Netflix's implementation of the circuit breaker pattern — arguably their most important resilience tool. In a microservice architecture, a single slow or failing service can cascade failures across the entire system: callers block on the slow dependency, retries multiply the load (a "retry storm"), and the failure propagates upstream.
Hystrix wraps every inter-service call in a command that monitors failure rates in real time:
```java
// Simplified Hystrix command flow:
// 1. Is the circuit open? (failure rate > 50% in a 10s rolling window)
//    → YES: skip the call entirely, return the fallback immediately
//    → NO: proceed to step 2
// 2. Is the thread pool / semaphore full?
//    → YES: reject immediately, return the fallback (bulkhead pattern)
//    → NO: proceed to step 3
// 3. Execute the actual network call with a timeout (e.g., 1000 ms)
//    → Success: record success, return the response
//    → Failure/timeout: record the failure; once the error threshold is
//      exceeded, open the circuit for 5 s, and return the fallback
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import java.util.List;
import java.util.Map;

// Video and RecommendationService stand in for the application's own types.
class GetRecommendationsCommand extends HystrixCommand<List<Video>> {
    private final RecommendationService recommendationService;
    private final Map<String, List<Video>> cachedRecommendations;
    private final List<Video> topTenPopular;
    private final String userId;

    GetRecommendationsCommand(RecommendationService service, Map<String, List<Video>> cache,
                              List<Video> topTen, String userId) {
        super(HystrixCommandGroupKey.Factory.asKey("Recommendations"));
        this.recommendationService = service;
        this.cachedRecommendations = cache;
        this.topTenPopular = topTen;
        this.userId = userId;
    }

    @Override
    protected List<Video> run() {
        // The guarded network call, executed on the command's thread pool
        return recommendationService.getPersonalized(userId);
    }

    @Override
    protected List<Video> getFallback() {
        // Degrade gracefully: cached recommendations, or the top-10 popular titles
        return cachedRecommendations.getOrDefault(userId, topTenPopular);
    }
}
```
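Callers then run the command with `execute()` (or `queue()`/`observe()` for asynchronous variants), and the circuit, bulkhead, and timeout logic above is applied transparently. Given the collaborators from the constructor:

```java
List<Video> recs =
        new GetRecommendationsCommand(service, cache, topTen, userId).execute();
```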
Critical Hystrix concepts:
- Bulkhead pattern: Each dependency gets its own thread pool. If the recommendation service is slow, it saturates its own 20-thread pool — but the user-profile service has its own separate pool and continues working normally.
- Fallbacks everywhere: Netflix requires every Hystrix command to define a fallback. Recommendations fail? Show trending titles. User profile fails? Show cached data. Billing check fails? Allow playback (err on the side of letting users watch).
- Real-time monitoring: Hystrix streams metrics to the Hystrix Dashboard, where engineers see per-command success rates, latencies, thread pool saturation, and circuit states in real time.
EVCache — Distributed Caching
EVCache (Ephemeral Volatile Cache) is Netflix's distributed caching solution built on top of Memcached. It's the primary caching layer for nearly every Netflix service; a sketch of its zone-replication pattern follows the list below.
- Scale: EVCache clusters hold terabytes of data across thousands of Memcached nodes. Netflix processes 30+ million requests per second against EVCache globally.
- Multi-zone replication: Data is replicated across all availability zones within a region. Reads are served from the local zone (fast); writes go to all zones (consistent). If an entire AZ fails, the cache in other zones is immediately warm.
- Use cases: Session data, user profiles, viewing history summaries, recommendation model outputs, content metadata, A/B test assignments — essentially any data that's read far more frequently than it's written.
- Cache warming: When new instances spin up, they "warm" their cache by copying data from existing instances rather than hammering the downstream databases with a flood of cache-miss queries.
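A minimal sketch of the zone-replication pattern described above, assuming a per-AZ `ZoneClient` abstraction (the real open-source EVCache client is considerably more sophisticated):

```java
import java.util.List;
import java.util.Optional;

class ZoneReplicatedCache {
    private final List<ZoneClient> allZones; // one Memcached cluster per AZ
    private final ZoneClient localZone;      // the zone this instance runs in

    ZoneReplicatedCache(List<ZoneClient> allZones, ZoneClient localZone) {
        this.allZones = allZones;
        this.localZone = localZone;
    }

    void set(String key, byte[] value, int ttlSeconds) {
        // Writes fan out to every zone so each AZ's cache stays warm
        for (ZoneClient zone : allZones) zone.set(key, value, ttlSeconds);
    }

    Optional<byte[]> get(String key) {
        // Reads are served from the local zone (lowest latency); on a local
        // miss, try a remote zone before falling through to the database
        Optional<byte[]> hit = localZone.get(key);
        if (hit.isPresent()) return hit;
        for (ZoneClient zone : allZones) {
            if (zone == localZone) continue; // identity check is fine for this sketch
            Optional<byte[]> remote = zone.get(key);
            if (remote.isPresent()) return remote;
        }
        return Optional.empty();
    }

    interface ZoneClient { // assumed per-zone Memcached wrapper
        void set(String key, byte[] value, int ttlSeconds);
        Optional<byte[]> get(String key);
    }
}
```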
Putting the pieces together, a single request flows end to end like this: the user opens the app → Zuul (API gateway) → Eureka resolves backend services → Hystrix-wrapped calls reach those services → EVCache serves hot data → the Open Connect CDN serves the video.
Open Connect — Netflix's Own CDN
While the control plane (APIs, recommendations, user profiles) runs on AWS, the data plane — the actual video bytes — is served by Netflix's proprietary CDN: Open Connect. This is perhaps the single most important piece of Netflix's architecture. It's the reason Netflix can serve 15% of global internet traffic without bankrupting itself on bandwidth costs.
Why Build a CDN?
In the early days, Netflix used third-party CDNs (Akamai, Limelight). But video has a unique characteristic that makes it different from web content: the files are enormous (a 4K movie is ~15 GB), the catalog is known in advance, and popularity follows a power-law distribution (a small fraction of titles account for most viewing hours). Netflix realized they could build a specialized CDN that exploits these properties far more efficiently than a general-purpose CDN.
Open Connect Appliances (OCAs)
An OCA is a custom-built server — essentially a high-density storage box running FreeBSD with a custom TCP stack and NGINX-based content server. Netflix designs the hardware themselves and ships these appliances to ISPs worldwide at no cost to the ISP.
| OCA Generation | Storage | Throughput | Role |
|---|---|---|---|
| Flash OCA | 280 TB NVMe SSD | 100+ Gbps | High-demand ISP sites, popular content |
| Storage OCA | 500+ TB HDD | 40 Gbps | Full catalog storage, large ISPs |
| Combo OCA | HDD + SSD tiered | 100 Gbps | Mixed workload, hot/cold data tiering |
Content Positioning — The Fill Algorithm
Every night during off-peak hours, Netflix's fill algorithm solves a global optimization problem: given the popularity predictions for tomorrow, the storage capacity of each OCA, and the geographic distribution of subscribers, which content should be pre-positioned where? (The tiering rules are sketched in code after the list.)
- Popularity prediction: ML models predict which titles will be watched tomorrow in each region based on trending data, new releases, time-of-week patterns, and recommendation model outputs.
- Tiered caching: The most popular 10% of titles are pushed to every OCA worldwide. The next 30% go to regional OCAs. The long tail lives in Netflix's origin servers in AWS and is pulled on demand.
- Delta fills: Only new or changed content is transferred each night. If a new episode drops, it's pre-positioned to ISP OCAs before the marketing email goes out.
- Encoding variants: Not every OCA stores every encoding. Flash OCAs at busy ISPs might only store the most-requested bitrates for the top titles. Storage OCAs hold the full encoding ladder.
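Reduced to a placement function, the tiering above looks like this; the thresholds follow the percentages in the list, and everything else (the ranking input, the `Tier` names) is assumed for illustration:

```java
enum Tier { EVERY_OCA, REGIONAL_OCA, AWS_ORIGIN }

class FillPlanner {
    // rank: 0 = predicted most-watched title in this region (assumed input)
    Tier tierFor(int rank, int catalogSize) {
        double percentile = (double) rank / catalogSize;
        if (percentile < 0.10) return Tier.EVERY_OCA;    // hottest 10%: push everywhere
        if (percentile < 0.40) return Tier.REGIONAL_OCA; // next 30%: regional sites only
        return Tier.AWS_ORIGIN;                          // long tail: pulled on demand
    }
}
```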
Client Steering
When a subscriber hits play, the Netflix client calls the playback API (running on AWS), which returns an ordered list of OCA URLs to try. The steering algorithm weighs several factors (a toy ranking sketch follows below):
- Network proximity: Prefer OCAs within the subscriber's own ISP (zero peering/transit costs).
- Server load: OCAs report their current throughput and connection counts. Overloaded boxes are ranked lower.
- Historical quality: If a particular OCA consistently delivers poor throughput to a client's network, it's deprioritized.
- Content availability: Obviously, the OCA must have the requested title and encoding variant.
The result: 95%+ of Netflix video bytes are served from OCAs inside the subscriber's own ISP — literally one network hop away. This is why Netflix streams start fast and rarely buffer, even on mediocre internet connections.
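A toy scoring function over those factors; the fields, weights, and formula here are illustrative assumptions, not Netflix's steering logic:

```java
import java.util.Comparator;
import java.util.List;

record Oca(String url, boolean sameIsp, double loadFraction,
           double throughputHistoryMbps, boolean hasTitle) {} // assumed fields

class OcaSteering {
    // Returns candidate OCAs best-first; the client tries them in order
    List<Oca> rank(List<Oca> candidates) {
        return candidates.stream()
                .filter(Oca::hasTitle) // must hold the title and encoding variant
                .sorted(Comparator.comparingDouble(this::score).reversed())
                .toList();
    }

    private double score(Oca oca) {
        double proximity = oca.sameIsp() ? 100.0 : 0.0; // in-ISP means zero transit cost
        double load = -50.0 * oca.loadFraction();       // penalize overloaded boxes
        double history = oca.throughputHistoryMbps();   // reward proven delivery quality
        return proximity + load + history;
    }
}
```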
The CDN operates in two phases: during off-peak hours, content fills from the AWS origin to OCA boxes at ISPs; at playback time, the client is steered to the nearest OCA rather than back to Netflix's servers.
Content Encoding Pipeline
Before a single frame reaches a subscriber, Netflix's encoding pipeline transforms raw studio masters into an astonishing number of optimized streams. A single movie or episode is encoded into 1,200+ different streams — combinations of resolutions, bitrates, codecs, and audio formats.
The Encoding Ladder
Traditionally, video services used a fixed encoding ladder — a static set of bitrate/resolution pairs applied to all content. Netflix revolutionized this with per-title encoding (later evolved to per-shot encoding).
The insight: a slow dialogue scene can look pristine at 500 Kbps, while a fast-action sequence needs 8 Mbps for the same perceived quality. Applying one bitrate ladder to both wastes either bandwidth or quality.
- Per-title optimization (2015): For each title, Netflix runs a brute-force encoding analysis: encode at every resolution/bitrate combination, measure VMAF (Video Multi-Method Assessment Fusion — Netflix's open-source perceptual quality metric), and select the optimal ladder (sketched after this list). A nature documentary gets a very different ladder than an anime series.
- Per-shot optimization (2018): Extends per-title to the shot level. Each scene cut is analyzed independently. The encoding for a quiet conversation scene is different from the following chase scene — within the same episode.
- Per-device profiles: A phone screen doesn't need 4K. Netflix encodes separate streams optimized for specific device categories (mobile, tablet, TV, HDR TV) so smaller devices get high quality at lower bitrates.
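In skeletal form, the brute-force per-title search looks like this; `encodeAndMeasureVmaf` is an assumed stand-in for a real encode plus a VMAF scoring run:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

record LadderPoint(int height, int kbps, double vmaf) {}

class PerTitleLadder {
    // Assumes non-empty heights/bitrates; keeps the best resolution per bitrate
    List<LadderPoint> build(String title, int[] heights, int[] bitratesKbps) {
        List<LadderPoint> ladder = new ArrayList<>();
        for (int kbps : bitratesKbps) {
            LadderPoint best = null;
            for (int height : heights) {
                double vmaf = encodeAndMeasureVmaf(title, height, kbps);
                if (best == null || vmaf > best.vmaf()) {
                    best = new LadderPoint(height, kbps, vmaf);
                }
            }
            ladder.add(best); // best-looking resolution at this bitrate
        }
        ladder.sort(Comparator.comparingInt(LadderPoint::kbps));
        return ladder;
    }

    private double encodeAndMeasureVmaf(String title, int height, int kbps) {
        // Stand-in: a real implementation encodes the title and scores it with VMAF
        throw new UnsupportedOperationException("encode + VMAF scoring goes here");
    }
}
```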
Pipeline Architecture
Studio Master (ProRes/IMF, ~300 GB per hour)
│
├─→ Quality Analysis: Scene detection, complexity scoring, VMAF targets
│
├─→ Encoding Farm (100,000+ cores)
│ ├── H.264/AVC (legacy devices)
│ ├── H.265/HEVC (4K, HDR)
│ ├── VP9 (Chrome, Android)
│ ├── AV1 (next-gen, 20% better compression)
│ ├── Dolby Vision (premium HDR)
│ └── HDR10 / HDR10+
│
├─→ Audio Encoding
│ ├── AAC Stereo (mobile)
│ ├── Dolby Digital 5.1 (TV)
│ ├── Dolby Atmos (premium)
│ └── 30+ language tracks per title
│
├─→ Subtitle/Caption Generation (30+ languages)
│
├─→ Packaging: DASH / HLS manifests, DRM encryption (Widevine, FairPlay, PlayReady)
│
└─→ Distribution: Push to Open Connect origin, trigger OCA fill
The encoding farm is built on Titus — Netflix's open-source container management platform running on AWS EC2. Encoding jobs are massively parallelized: a single movie is split into hundreds of chunks encoded simultaneously, then reassembled. A new title can go from studio master to globally available in under 4 hours.
Chaos Engineering
Netflix doesn't just accept that failures happen — they deliberately inject failures into production systems to ensure resilience. This practice, which Netflix pioneered, is now known as chaos engineering.
The philosophy: if your system can survive random destruction during business hours, it can survive anything that happens at 3 AM on a Saturday.
The Simian Army
| Tool | What It Destroys | Why |
|---|---|---|
| Chaos Monkey | Randomly terminates individual EC2 instances during business hours | Ensures every service handles instance failure gracefully |
| Chaos Gorilla | Simulates an entire Availability Zone failure | Validates AZ-level redundancy and failover |
| Chaos Kong | Simulates an entire AWS region going offline | Tests cross-region failover (the nuclear option) |
| Latency Monkey | Injects artificial latency into RESTful calls | Exposes timeout and retry bugs |
| Conformity Monkey | Finds and shuts down instances that don't conform to best practices | Enforces architectural standards |
| Janitor Monkey | Cleans up unused resources (orphaned instances, unattached volumes) | Reduces cloud waste and cost |
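Chaos Monkey's core loop, in the spirit of the table above, is almost embarrassingly simple; the hard part is building services that survive it. A sketch, with the business-hours window and the `CloudApi` abstraction as illustrative assumptions:

```java
import java.time.LocalTime;
import java.util.List;
import java.util.Random;

class ChaosMonkeySketch {
    private final Random random = new Random();

    void unleash(CloudApi cloud) {
        // Only during business hours, so engineers are around to respond
        LocalTime now = LocalTime.now();
        if (now.isBefore(LocalTime.of(9, 0)) || now.isAfter(LocalTime.of(15, 0))) return;

        for (String group : cloud.autoScalingGroups()) {
            List<String> instances = cloud.instancesOf(group);
            if (instances.isEmpty()) continue;
            // Terminate one random instance per group; the service must survive it
            cloud.terminate(instances.get(random.nextInt(instances.size())));
        }
    }

    interface CloudApi { // assumed wrapper around the cloud provider's API
        List<String> autoScalingGroups();
        List<String> instancesOf(String group);
        void terminate(String instanceId);
    }
}
```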
Chaos Kong — Regional Failover
Chaos Kong is the most impressive (and terrifying) chaos exercise. Netflix runs production traffic in three AWS regions. During a Chaos Kong exercise, they evacuate an entire region — redirecting all traffic to the remaining two. This validates:
- DNS failover: Route 53 health checks detect the "failed" region and update DNS within seconds.
- Stateless services: Since microservices are stateless, traffic can move to another region without session loss.
- Data replication: Cassandra's cross-region replication ensures the destination region has up-to-date data.
- Capacity headroom: Each region must maintain enough spare capacity to absorb the failed region's traffic without degraded performance.
- EVCache warming: The receiving region's caches may not be warm for the new traffic patterns. Cache miss storms must be handled gracefully.
Netflix runs Chaos Kong exercises monthly. The fact that they can evacuate an entire region during peak hours, with subscribers never noticing, is a testament to their architecture's resilience.
ChAP — Chaos Automation Platform
As chaos engineering matured, Netflix built ChAP (Chaos Automation Platform) to make experiments more scientific. ChAP runs controlled experiments: it takes a portion of production traffic, divides it into control and experiment groups, injects a failure into the experiment group, and measures the difference in SPS (Stream Starts Per Second) — Netflix's primary business metric.
If the experiment group shows statistically significant degradation in SPS, the failure mode is flagged for engineering follow-up. This transformed chaos engineering from "break things and see what happens" into a rigorous, data-driven discipline.
Data Architecture
Netflix's data architecture follows a polyglot persistence strategy — each data store is chosen for the specific access patterns and consistency requirements of its use case.
Apache Cassandra — The Primary Workhorse
Cassandra is Netflix's most widely used database, running thousands of clusters with tens of petabytes of data. Netflix chose Cassandra for its:
- Linear horizontal scalability: Add nodes to increase throughput proportionally — no re-sharding, no downtime.
- Multi-region replication: Every write is asynchronously replicated across all three AWS regions. Each region can serve reads locally with single-digit millisecond latency.
- Tunable consistency: Per-query consistency levels. Viewing history uses LOCAL_QUORUM (fast, within-region consistency); billing uses EACH_QUORUM (strong, cross-region consistency). A query sketch follows below.
- Fault tolerance: With a replication factor of 3 per region, Cassandra tolerates multiple node failures without data loss.
Netflix uses Cassandra for: viewing history, playback bookmarks, user profiles, content metadata, A/B test results, and device registrations.
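With the DataStax Java driver (a 4.x-style API), that per-query consistency choice is a one-line setting; the table names and statements here are made up for illustration:

```java
import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

class TunableConsistencyExample {
    void demo(CqlSession session, String userId) {
        // Viewing history: fast, within-region consistency is enough
        SimpleStatement history = SimpleStatement
                .newInstance("SELECT * FROM viewing_history WHERE user_id = ?", userId)
                .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(history);

        // Billing: require a quorum in every region before acknowledging
        SimpleStatement charge = SimpleStatement
                .newInstance("UPDATE billing SET plan = 'premium' WHERE user_id = ?", userId)
                .setConsistencyLevel(ConsistencyLevel.EACH_QUORUM);
        session.execute(charge);
    }
}
```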
MySQL — Billing & Transactions
Billing requires ACID transactions — exactly-once payment processing, idempotent subscription changes, and audit-grade consistency. MySQL (via AWS RDS) handles Netflix's billing database with: read replicas for reporting, automated backups, and point-in-time recovery. Netflix has also explored CockroachDB for globally distributed transactional workloads.
Elasticsearch — Search & Discovery
Netflix's search functionality is powered by Elasticsearch clusters running hundreds of nodes. The search index includes:
- Title metadata (names, descriptions, genres, cast, directors)
- Fuzzy matching and typo tolerance
- Personalized ranking — search results are re-ordered based on the user's viewing history and preferences
- Real-time indexing — when a new title is added, it's searchable within seconds
Apache Kafka — Event Streaming Backbone
Kafka is the central nervous system of Netflix's data architecture. Every significant event — a play button press, a search query, a recommendation impression, a signup, a billing event — is published to Kafka (a minimal producer sketch follows the list).
- Scale: Netflix's Kafka clusters process over 6 petabytes of messages per day across multiple clusters. Peak throughput exceeds 8 million messages per second.
- Keystone pipeline: Netflix's data pipeline (codenamed Keystone) routes Kafka events to various sinks: S3 for batch analytics, Elasticsearch for real-time search indexing, Cassandra for materialized views, and Apache Spark/Flink for stream processing.
- Event sourcing: Many Netflix systems use Kafka as an event log, enabling replay and reprocessing when business logic changes.
- Chukwa → Kafka migration: Netflix originally used Apache Chukwa for data collection but migrated to Kafka for its superior throughput, partitioning, and consumer-group semantics.
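Publishing one such event with the standard `kafka-clients` producer; the topic name, key choice, and payload shape are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

class PlayEventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092"); // assumed address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String event = "{\"type\":\"play\",\"userId\":\"u123\",\"titleId\":\"t456\"}";
            // Keying by user keeps each user's events ordered within one partition
            producer.send(new ProducerRecord<>("playback-events", "u123", event));
        }
    }
}
```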
Netflix Data Flow (simplified):
┌────────────┐
│ User App │──play, search, browse events──→ ┌─────────┐
│ (Device) │ │ Kafka │
└────────────┘ │ Cluster │
└────┬────┘
┌────────────────────────────────────────┼────────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ S3 / Iceberg│ │Elasticsearch │ │ Cassandra │ │ Spark/Flink │
│ (Data Lake) │ │ (Search) │ │ (Views) │ │ (Streaming) │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
│
▼
┌──────────────┐
│ Presto / │ ← Ad-hoc analytics, dashboards, ML training
│ Spark SQL │
└──────────────┘
Apache Iceberg — Data Lake Table Format
Netflix created Apache Iceberg (later donated to the Apache Software Foundation) to overcome the limitations of the Hive table format in their massive data lake. Iceberg provides ACID transactions on petabyte-scale tables, time-travel queries, schema evolution without rewriting data, and efficient partition pruning. Netflix's data lake on S3 stores hundreds of petabytes of analytics, ML training data, and operational logs.
Recommendation Engine
Netflix estimates that their recommendation system saves them $1 billion per year in reduced churn. Over 80% of content watched on Netflix is discovered through recommendations, not search.
Architecture Overview
The recommendation system is a multi-layered ML pipeline (its two-stage core is sketched after this list):
- Candidate generation: Broad models (collaborative filtering, content-based filtering) generate thousands of candidate titles from the catalog. This runs offline in batch on Spark.
- Ranking: A deep learning model ranks candidates by predicted engagement. Features include: viewing history, time of day, device type, genre preferences, "taste communities" (clusters of users with similar preferences), title freshness, and social signals.
- Row generation: The Netflix homepage is organized into rows ("Because you watched X", "Trending Now", "New Releases"). A separate model decides which rows to show, in what order, with which titles — personalized for each subscriber.
- Artwork personalization: Even the thumbnail image shown for each title is personalized. A user who watches lots of comedies sees a funny scene from a drama; a romance fan sees the romantic subplot. Netflix found this increased click-through rates by 20-30%.
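The two-stage shape (broad, cheap candidate generation, then precise, expensive ranking) can be sketched with assumed `CandidateSource` and `Ranker` abstractions, which stand in for Netflix's actual models:

```java
import java.util.Comparator;
import java.util.List;

class TwoStageRecommender {
    private final List<CandidateSource> sources; // e.g., collaborative + content-based
    private final Ranker ranker;                 // learned engagement model

    TwoStageRecommender(List<CandidateSource> sources, Ranker ranker) {
        this.sources = sources;
        this.ranker = ranker;
    }

    List<String> recommend(String userId, int k) {
        // Stage 1: cheap, broad recall from each candidate source
        List<String> candidates = sources.stream()
                .flatMap(s -> s.candidatesFor(userId).stream())
                .distinct()
                .toList();
        // Stage 2: expensive, precise scoring of the shortlist only
        return candidates.stream()
                .sorted(Comparator.comparingDouble(
                        (String titleId) -> ranker.score(userId, titleId)).reversed())
                .limit(k)
                .toList();
    }

    interface CandidateSource { List<String> candidatesFor(String userId); }
    interface Ranker { double score(String userId, String titleId); }
}
```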
ML Infrastructure
- Offline training: Models are trained on Spark clusters processing hundreds of terabytes of viewing data. Training runs daily.
- Online inference: Real-time model serving uses a custom platform that pre-computes recommendations for active users and caches results in EVCache. The recommendation API has a p99 latency target of <100ms.
- A/B testing: Every model change is A/B tested on a fraction of subscribers before rollout. Netflix runs hundreds of simultaneous A/B tests. The key metric is not CTR (click-through rate) but member retention — does this recommendation change make people keep their subscription?
- Feature store: Centralized feature store provides consistent feature computation across training (offline) and serving (online), preventing training-serving skew.
Adaptive Bitrate Streaming (ABR)
Once the video bytes leave an OCA, the Netflix client takes over. The Adaptive Bitrate (ABR) algorithm dynamically adjusts video quality based on real-time network conditions — the reason you rarely see buffering on Netflix.
How ABR Works
- Buffer-based control: The client maintains a playback buffer (typically 30-120 seconds of video). The ABR algorithm monitors the buffer level and available bandwidth.
- Quality selection: If the buffer is healthy (>60s) and bandwidth is stable, the client requests higher-quality chunks. If the buffer is draining, it drops to a lower bitrate immediately — prioritizing continuity over quality (see the sketch after this list).
- Per-title bitrate ladders: Because each title has a custom encoding ladder (from per-title optimization), the ABR algorithm switches between quality levels that are already optimized for that specific content. A nature documentary's "medium" quality might look better than a generic "high" quality.
- Device-aware profiles: A mobile client on a 5-inch screen caps at 1080p (sending 4K would waste bandwidth and battery). A 65-inch TV targets 4K/HDR. The ABR algorithm operates within device-appropriate bounds.
- Predictive buffering: Netflix's client analyzes network patterns (e.g., your WiFi is consistently slower at 8 PM when neighbors are online) and pre-buffers aggressively during good conditions to survive future quality drops.
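A toy version of buffer-based quality selection under those rules; the thresholds and the 25% bandwidth margin are illustrative assumptions, not Netflix's production algorithm:

```java
class AbrController {
    private static final double HEALTHY_BUFFER_S = 60.0;
    private static final double LOW_BUFFER_S = 15.0;

    /**
     * @param ladderKbps     this title's encoding ladder, ascending bitrates
     * @param currentIndex   index of the currently playing bitrate
     * @param bufferSeconds  seconds of video buffered ahead of playback
     * @param throughputKbps recent measured network throughput estimate
     */
    int nextQuality(int[] ladderKbps, int currentIndex,
                    double bufferSeconds, double throughputKbps) {
        if (bufferSeconds < LOW_BUFFER_S) {
            // Buffer draining: drop quality immediately, continuity over fidelity
            return Math.max(0, currentIndex - 1);
        }
        if (bufferSeconds > HEALTHY_BUFFER_S && currentIndex + 1 < ladderKbps.length
                && ladderKbps[currentIndex + 1] * 1.25 < throughputKbps) {
            // Healthy buffer and bandwidth headroom: step up one rung
            return currentIndex + 1;
        }
        return currentIndex; // otherwise hold steady
    }
}
```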
Real-World Impact
Netflix's combination of per-title encoding + Open Connect + ABR achieves remarkable efficiency:
- 1080p quality at an average of ~3 Mbps (vs. 5+ Mbps for competitors using fixed encoding)
- 4K HDR at ~15 Mbps (vs. 25+ Mbps for naive encoding)
- Rebuffer rate below 0.1% across all subscribers globally
- Start time under 2 seconds for 95% of streams
Lessons Learned
Netflix's architecture offers several broadly applicable lessons for distributed systems engineering:
1. Optimize for the Failure Case
Netflix's entire architecture assumes things will fail. Rather than trying to prevent all failures (impossible at scale), they invest in graceful degradation. Every service has a fallback. Every region can absorb another's traffic. Chaos engineering is not optional — it's part of the engineering process.
2. Own Your Critical Path
Netflix uses AWS for compute but built their own CDN. Why? Because video delivery is their core business. Relying on a third-party CDN meant they couldn't optimize for video-specific patterns, couldn't control costs at their scale, and couldn't customize the hardware. The lesson: buy commodity infrastructure, build differentiating infrastructure.
3. Polyglot Persistence is Worth the Complexity
Using Cassandra, MySQL, Elasticsearch, and Kafka (each for its strengths) is operationally complex but delivers far better performance than forcing everything into one database. The key is having strong platform teams that abstract away the operational burden.
4. Migrate Incrementally
The 8-year monolith-to-microservices migration succeeded because each step was independently valuable. There was never an "all or nothing" moment. The strangler fig pattern allowed them to ship features continuously while migrating.
5. Invest in Developer Productivity
Netflix invested heavily in internal developer tools: Spinnaker (continuous delivery), Atlas (metrics), Mantis (stream processing), and centralized logging. When developers can deploy, monitor, and debug independently, velocity compounds. Netflix engineers deploy thousands of times per day across all services.
6. Data-Driven Everything
Every decision — from which thumbnail to show, to which encoding bitrate to use, to which OCA to route to — is driven by data and A/B testing. Netflix's culture of experimentation means that even strongly-held engineering opinions are validated with data before being adopted.
Summary
- Scale: 230M+ subscribers, 15% of global bandwidth, 1,000+ microservices, billions of API calls daily.
- Migration: 8-year incremental journey from monolith+Oracle to microservices+AWS using the strangler fig pattern.
- Netflix OSS: Zuul (API gateway), Eureka (service discovery), Ribbon (client-side LB), Hystrix (circuit breaker), EVCache (distributed cache) — a complete microservices toolkit.
- Open Connect: Netflix's own CDN with custom hardware (OCAs) embedded in ISPs worldwide, serving 95%+ of video bytes one hop from the subscriber.
- Encoding: 1,200+ streams per title via per-title and per-shot optimization, VMAF-guided quality, AV1 next-gen codec adoption.
- Chaos engineering: Chaos Monkey (instance), Gorilla (AZ), Kong (region) — deliberate failure injection in production to build antifragile systems.
- Data: Cassandra for scale, MySQL for ACID, Elasticsearch for search, Kafka for event streaming (6 PB/day), Iceberg for the data lake.
- Recommendations: Multi-stage ML pipeline saving $1B/year in reduced churn, with personalized artwork and row ordering.
- ABR: Adaptive bitrate streaming + per-title encoding achieves 1080p at ~3 Mbps with <0.1% rebuffer rate.