High Level Design Series · Real-World Designs · Part 51 of 70

Design: YouTube / Video Streaming

Requirements

YouTube is the world's second-largest search engine and the dominant video-sharing platform, serving over 2 billion logged-in users per month and streaming more than 1 billion hours of video per day. Designing a system at this scale forces you to confront nearly every distributed-systems challenge simultaneously: massive storage, extreme bandwidth, real-time transcoding, global distribution, and ML-powered recommendations.

Functional Requirements

Upload videos (large files, with resumable uploads); stream videos on any device with adaptive quality; search across titles, descriptions, and captions; like, comment, and subscribe; track views and other engagement; recommend videos on the homepage, sidebar, and end screens; support live streaming.

Non-Functional Requirements

High availability for playback; low startup latency; smooth playback under fluctuating bandwidth (no rebuffering); scalability to exabytes of storage and petabits per second of egress; durability for uploaded originals; cost efficiency, since bandwidth and storage dominate operating expense.

Back-of-the-Envelope Estimation

| Metric | Estimate |
|---|---|
| Videos uploaded per day | ~720,000 (500 hrs/min × 60 min × 24 hrs ≈ 720K hours of video/day, treated as ~720K videos for these estimates) |
| Average original video size | ~500 MB (1080p, 5 min, H.264) |
| Daily raw upload storage | ~360 TB/day (720K × 500 MB) |
| Transcoded storage multiplier | ~5× (multiple resolutions + codecs) |
| Daily total storage | ~1.8 PB/day |
| Daily video watches | ~5 billion |
| Peak concurrent streams | ~100 million |
| Peak egress bandwidth | ~500 Tbps (100M × 5 Mbps avg) |

The read-heavy ratio: YouTube's read:write ratio is approximately 10,000:1. For every video uploaded, there are thousands of views. This fundamentally shapes the architecture — we optimize aggressively for reads (CDN caching, pre-transcoding) and tolerate slower writes (asynchronous processing pipeline).

High-Level Architecture

The system decomposes into several key subsystems, each operating as an independent microservice cluster:

┌──────────────┐     ┌───────────────┐     ┌──────────────────┐
│   Clients    │────▶│  API Gateway  │────▶│  Auth Service    │
│ (Web/Mobile/ │     │  (Rate Limit, │     │  (OAuth, JWT)    │
│  Smart TV)   │     │   Routing)    │     └──────────────────┘
└──────┬───────┘     └───────┬───────┘
       │                     │
       │    ┌────────────────┼────────────────────┐
       │    │                │                     │
       │    ▼                ▼                     ▼
       │  ┌──────────┐  ┌──────────┐      ┌──────────────┐
       │  │ Upload   │  │ Video    │      │ Search       │
       │  │ Service  │  │ Service  │      │ Service      │
       │  └────┬─────┘  └────┬─────┘      └──────┬───────┘
       │       │              │                    │
       │       ▼              ▼                    ▼
       │  ┌──────────┐  ┌──────────┐      ┌──────────────┐
       │  │Transcoder│  │ Metadata │      │Elasticsearch │
       │  │ Workers  │  │    DB    │◀────▶│   Cluster    │
       │  └────┬─────┘  └──────────┘      └──────────────┘
       │       │
       │       ▼
       │  ┌──────────┐     ┌──────────────────────┐
       │  │  Blob    │────▶│   CDN Edge Servers    │
       │  │ Storage  │     │  (Global, 200+ PoPs)  │
       │  └──────────┘     └──────────────────────┘
       │                            │
       └────────────────────────────┘
              (stream from nearest edge)

Async Services:
┌────────────┐  ┌───────────────┐  ┌──────────────┐  ┌────────────┐
│ Comment    │  │ Recommendation│  │ View Counter │  │Notification│
│ Service    │  │ Engine        │  │ Service      │  │ Service    │
└────────────┘  └───────────────┘  └──────────────┘  └────────────┘

Key design principles: separate the write path (upload → transcode) from the read path (streaming); do heavy work asynchronously behind queues; serve nearly all reads from the CDN edge; and accept eventual consistency where exactness doesn't matter (view counts, like counts).

Video Upload Pipeline

The upload pipeline is the most complex write path in the system. It transforms a raw video file from a creator into a fully processed, globally distributed, streamable asset. The pipeline must handle 500+ hours of video per minute reliably.

Step-by-Step Upload Flow

  1. Client-side pre-processing — the client computes a SHA-256 hash of the file for deduplication and integrity verification, then initiates a resumable upload via the Google Resumable Upload Protocol (or tus.io)
  2. Upload to API server — the file is streamed directly to blob storage (bypassing the API server's memory) via a pre-signed upload URL. The API server only handles metadata
  3. Store original in blob storage — the raw file lands in a "raw uploads" bucket in S3/GCS with 3× replication across availability zones
  4. Enqueue transcoding jobs — the upload service publishes a message to Kafka/SQS with the video ID, storage location, and requested output formats
  5. Parallel transcoding — a fleet of transcoding workers (running FFmpeg/libx264/libvpx) picks up the job and produces multiple renditions in parallel
  6. Generate thumbnails — extract key frames at regular intervals (every 2 seconds) for the video scrubber, plus 3 candidate thumbnails for the creator to choose from
  7. Extract metadata — run speech-to-text for auto-captions, detect language, extract audio features for Content ID matching
  8. Update metadata DB — mark the video as "processed" with pointers to all renditions and thumbnails
  9. Push to CDN — proactively push popular-origin content to edge nodes; for long-tail content, CDN pulls on first request
  10. Notify creator — send push notification / email that processing is complete


Resumable Uploads

Uploading a 4 GB video over a flaky mobile connection is unreliable. YouTube uses resumable uploads — the client sends the file in chunks (typically 8–16 MB each), and the server tracks how many bytes have been received. If the connection drops, the client queries the server for the last confirmed byte offset and resumes from there.

POST /upload/resumable HTTP/1.1
Host: upload.youtube.com
Content-Type: application/json
X-Upload-Content-Type: video/mp4
X-Upload-Content-Length: 4294967296

{ "title": "My Video", "description": "...", "privacyStatus": "public" }

--- Response ---
HTTP/1.1 200 OK
Location: https://upload.youtube.com/upload?id=xa298sd_sdlkj2

--- Resume with ---
PUT https://upload.youtube.com/upload?id=xa298sd_sdlkj2
Content-Range: bytes 16777216-33554431/4294967296
Content-Length: 16777216
[binary chunk data]
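
If the connection drops mid-upload, the client first asks the server how many bytes it already has, then continues from that offset. Below is a minimal client-side sketch of that resume loop, assuming the Google-style resumable protocol shown above (a Content-Range: bytes */total probe answered with 308 Resume Incomplete plus a Range header); the session URL, chunk size, and helper names are illustrative.

# Minimal resume-loop sketch for the resumable protocol shown above.
# The session URL, chunk size, and helpers are illustrative assumptions.
import os
import requests

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MB chunks

def query_resume_offset(session_url: str, total_size: int) -> int:
    """Ask the server how many bytes it has already received."""
    resp = requests.put(
        session_url,
        headers={"Content-Length": "0",
                 "Content-Range": f"bytes */{total_size}"},
    )
    if resp.status_code == 308:  # Resume Incomplete
        # e.g. "Range: bytes=0-33554431" -> next byte to send is 33554432
        range_header = resp.headers.get("Range", "")
        return int(range_header.split("-")[-1]) + 1 if range_header else 0
    return total_size  # 200/201 means the upload already completed

def resume_upload(session_url: str, path: str) -> None:
    total = os.path.getsize(path)
    offset = query_resume_offset(session_url, total)
    with open(path, "rb") as f:
        f.seek(offset)
        while offset < total:
            chunk = f.read(CHUNK_SIZE)
            end = offset + len(chunk) - 1
            requests.put(
                session_url,
                data=chunk,
                headers={"Content-Range": f"bytes {offset}-{end}/{total}"},
            )
            offset += len(chunk)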

Transcoding Architecture

Transcoding is the most CPU-intensive operation in the pipeline. Each uploaded video must be converted into multiple resolutions and multiple codecs:

| Resolution | Bitrate (H.264) | Bitrate (VP9) | Bitrate (AV1) | Use Case |
|---|---|---|---|---|
| 240p (426×240) | 300 kbps | 150 kbps | 100 kbps | 2G/slow connections |
| 360p (640×360) | 700 kbps | 350 kbps | 250 kbps | Mobile, low bandwidth |
| 480p (854×480) | 1.5 Mbps | 750 kbps | 500 kbps | Standard mobile |
| 720p (1280×720) | 3 Mbps | 1.5 Mbps | 1 Mbps | HD default |
| 1080p (1920×1080) | 6 Mbps | 3 Mbps | 2 Mbps | Desktop, tablets |
| 1440p (2560×1440) | 10 Mbps | 6 Mbps | 4 Mbps | QHD monitors |
| 4K (3840×2160) | 20 Mbps | 12 Mbps | 8 Mbps | 4K TVs, premium |

Why multiple codecs? H.264 has universal browser/device support but lower compression. VP9 (Google's open codec) offers ~50% better compression than H.264 — it's used for most YouTube playback on Chrome and Android. AV1 (the newest open codec from the Alliance for Open Media) offers another ~30% improvement over VP9 but is much slower to encode. YouTube uses AV1 increasingly for popular videos where the one-time encoding cost is amortized across billions of views.

Parallel Transcoding with DAG Processing

YouTube doesn't process transcoding jobs sequentially. Instead, the pipeline is modeled as a Directed Acyclic Graph (DAG) of tasks:

Original Video (raw upload)
    │
    ├──▶ [Video Split into Segments] ──────────────────┐
    │       (GOP-aligned, ~4s segments)                │
    │                                                  │
    │    ┌──────────── Parallel per segment ───────────┤
    │    │                                             │
    │    ├──▶ Transcode → 240p H.264                   │
    │    ├──▶ Transcode → 360p H.264                   │
    │    ├──▶ Transcode → 480p H.264                   │
    │    ├──▶ Transcode → 720p H.264                   │
    │    ├──▶ Transcode → 1080p H.264                  │
    │    ├──▶ Transcode → 720p VP9                     │
    │    ├──▶ Transcode → 1080p VP9                    │
    │    ├──▶ Transcode → 4K VP9                       │
    │    └──▶ Transcode → 1080p AV1 (if popular)       │
    │                                                  │
    │    └──────────── Merge segments ─────────────────┘
    │                       │
    ├──▶ [Extract Audio] ──▶ AAC 128kbps, Opus 64kbps
    │
    ├──▶ [Generate Thumbnails] ──▶ every 2s + 3 candidates
    │
    ├──▶ [Speech-to-Text] ──▶ auto-captions (VTT/SRT)
    │
    └──▶ [Content ID Fingerprint] ──▶ copyright check

The key optimization is segment-level parallelism. A 10-minute video at 30 fps is split into ~150 four-second segments. Each segment is transcoded independently across the worker fleet. A 10-minute video that would take 45 minutes to transcode sequentially on a single machine can be completed in under 2 minutes with 150 parallel workers.

# FFmpeg command to split a video into segments at GOP boundaries
ffmpeg -i input.mp4 -c copy -f segment \
  -segment_time 4 -reset_timestamps 1 \
  -segment_list segments.csv \
  segment_%04d.mp4

# Transcode a single segment to 720p H.264
ffmpeg -i segment_0042.mp4 \
  -vf "scale=1280:720" \
  -c:v libx264 -preset medium -crf 23 \
  -c:a aac -b:a 128k \
  -movflags +faststart \
  segment_0042_720p.mp4

# Transcode to VP9 (better compression, slower encode)
ffmpeg -i segment_0042.mp4 \
  -vf "scale=1280:720" \
  -c:v libvpx-vp9 -b:v 1500k -minrate 750k -maxrate 2100k \
  -c:a libopus -b:a 64k \
  segment_0042_720p.webm

# Transcode to AV1 (best compression, slowest encode)
ffmpeg -i segment_0042.mp4 \
  -vf "scale=1280:720" \
  -c:v libaom-av1 -crf 30 -b:v 0 -strict experimental \
  -c:a libopus -b:a 64k \
  segment_0042_720p_av1.mp4
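
To make the segment-level parallelism concrete, here is a rough sketch of the fan-out/fan-in step, wrapping per-segment FFmpeg invocations like the ones above in a worker function. A local process pool stands in for a distributed worker fleet, and the rendition list and paths are illustrative assumptions rather than YouTube's actual pipeline.

# Sketch: fan out (segment × rendition) transcode tasks, then fan in.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

RENDITIONS = [  # (name, scale, x264 CRF); illustrative subset
    ("240p", "426:240", 28),
    ("480p", "854:480", 25),
    ("720p", "1280:720", 23),
    ("1080p", "1920:1080", 21),
]

def transcode_segment(segment: Path, name: str, scale: str, crf: int) -> Path:
    out = segment.with_name(f"{segment.stem}_{name}.mp4")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(segment),
         "-vf", f"scale={scale}",
         "-c:v", "libx264", "-preset", "medium", "-crf", str(crf),
         "-c:a", "aac", "-b:a", "128k",
         str(out)],
        check=True,
    )
    return out

def transcode_all(segments: list[Path]) -> list[Path]:
    # Fan out: every (segment, rendition) pair is an independent task
    with ProcessPoolExecutor() as pool:
        futures = [
            pool.submit(transcode_segment, seg, name, scale, crf)
            for seg in segments
            for (name, scale, crf) in RENDITIONS
        ]
        # Fan in: collect results; a real pipeline would then stitch each
        # rendition's segments together (or serve them directly via DASH/HLS)
        return [f.result() for f in futures]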

Adaptive Bitrate Streaming (ABR)

The defining innovation that makes video streaming work over variable-bandwidth internet connections is Adaptive Bitrate Streaming (ABR). Instead of downloading a single file, the client dynamically selects the best quality level for each segment based on current network conditions.

How ABR Works

  1. The client requests a manifest file (playlist) that describes all available quality levels and segment URLs
  2. The client starts playing at a conservative quality (e.g., 480p)
  3. After each segment download, the client measures actual throughput
  4. If bandwidth is sufficient, the next segment is requested at a higher quality
  5. If bandwidth drops, the client immediately switches to a lower quality — seamlessly, mid-stream
  6. The player maintains a buffer of 15–30 seconds ahead to absorb transient network fluctuations

HLS (HTTP Live Streaming)

Apple's HLS is the most widely supported ABR protocol. It uses a master playlist (m3u8) pointing to variant playlists for each quality level, each of which lists the individual segment files (.ts or .fmp4).

#EXTM3U
#EXT-X-VERSION:6

## Master Playlist — lists all available quality levels
## Client picks the highest quality its bandwidth can sustain

#EXT-X-STREAM-INF:BANDWIDTH=300000,RESOLUTION=426x240,CODECS="avc1.42e00a,mp4a.40.2"
https://cdn.example.com/video/abc123/240p/playlist.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=700000,RESOLUTION=640x360,CODECS="avc1.42e01e,mp4a.40.2"
https://cdn.example.com/video/abc123/360p/playlist.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=1500000,RESOLUTION=854x480,CODECS="avc1.4d401f,mp4a.40.2"
https://cdn.example.com/video/abc123/480p/playlist.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720,CODECS="avc1.4d4020,mp4a.40.2"
https://cdn.example.com/video/abc123/720p/playlist.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
https://cdn.example.com/video/abc123/1080p/playlist.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=12000000,RESOLUTION=2560x1440,CODECS="avc1.640032,mp4a.40.2"
https://cdn.example.com/video/abc123/1440p/playlist.m3u8

#EXT-X-STREAM-INF:BANDWIDTH=20000000,RESOLUTION=3840x2160,CODECS="avc1.640033,mp4a.40.2"
https://cdn.example.com/video/abc123/4k/playlist.m3u8

## Variant Playlist for 720p — individual segments
#EXTM3U
#EXT-X-VERSION:6
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-PLAYLIST-TYPE:VOD

#EXTINF:4.000,
https://cdn.example.com/video/abc123/720p/seg_0000.ts
#EXTINF:4.000,
https://cdn.example.com/video/abc123/720p/seg_0001.ts
#EXTINF:4.000,
https://cdn.example.com/video/abc123/720p/seg_0002.ts
#EXTINF:4.000,
https://cdn.example.com/video/abc123/720p/seg_0003.ts
#EXTINF:3.840,
https://cdn.example.com/video/abc123/720p/seg_0004.ts

#EXT-X-ENDLIST

MPEG-DASH (Dynamic Adaptive Streaming over HTTP)

DASH is the open standard alternative to HLS. It uses an XML-based Media Presentation Description (MPD) instead of m3u8 playlists. YouTube uses DASH internally for most playback.

<?xml version="1.0" encoding="UTF-8"?>
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011"
     type="static"
     mediaPresentationDuration="PT10M24S"
     minBufferTime="PT4S">
  <Period duration="PT10M24S">

    <!-- Video Adaptation Set -->
    <AdaptationSet mimeType="video/mp4" segmentAlignment="true">

      <Representation id="240p" width="426" height="240"
                      bandwidth="300000" codecs="avc1.42e00a">
        <SegmentTemplate media="240p/seg_$Number%04d$.m4s"
                         initialization="240p/init.mp4"
                         duration="4000" timescale="1000"/>
      </Representation>

      <Representation id="720p" width="1280" height="720"
                      bandwidth="3000000" codecs="avc1.4d4020">
        <SegmentTemplate media="720p/seg_$Number%04d$.m4s"
                         initialization="720p/init.mp4"
                         duration="4000" timescale="1000"/>
      </Representation>

      <Representation id="1080p" width="1920" height="1080"
                      bandwidth="6000000" codecs="avc1.640028">
        <SegmentTemplate media="1080p/seg_$Number%04d$.m4s"
                         initialization="1080p/init.mp4"
                         duration="4000" timescale="1000"/>
      </Representation>

      <Representation id="4k" width="3840" height="2160"
                      bandwidth="20000000" codecs="avc1.640033">
        <SegmentTemplate media="4k/seg_$Number%04d$.m4s"
                         initialization="4k/init.mp4"
                         duration="4000" timescale="1000"/>
      </Representation>

    </AdaptationSet>

    <!-- Audio Adaptation Set -->
    <AdaptationSet mimeType="audio/mp4" lang="en">
      <Representation id="audio_128k" bandwidth="128000"
                      codecs="mp4a.40.2">
        <SegmentTemplate media="audio/seg_$Number%04d$.m4s"
                         initialization="audio/init.mp4"
                         duration="4000" timescale="1000"/>
      </Representation>
    </AdaptationSet>

  </Period>
</MPD>


ABR Algorithm: Buffer-Based vs. Rate-Based

The client-side ABR algorithm decides which quality to request next. Two dominant approaches exist:

| Algorithm | How It Works | Pros | Cons |
|---|---|---|---|
| Rate-based | Estimate bandwidth from recent segment download times, pick highest quality below estimated bandwidth | Fast adaptation | Oscillation when bandwidth fluctuates |
| Buffer-based (BBA) | Map buffer level to quality: low buffer → low quality, high buffer → high quality | Stable, avoids oscillation | Slower to ramp up |
| Hybrid (MPC) | Model Predictive Control — uses both throughput prediction and buffer state to optimize a lookahead window | Best quality of experience | Complex, higher CPU on client |

# Simplified buffer-based ABR logic (pseudo-Python)
class BufferBasedABR:
    RESERVOIR = 5    # seconds — below this, use lowest quality
    CUSHION = 15     # seconds — linear ramp between reservoir and cushion
    MAX_BUFFER = 30  # seconds — above this, use highest quality

    def __init__(self, quality_levels):
        self.levels = sorted(quality_levels, key=lambda q: q.bitrate)

    def select_quality(self, buffer_level_sec, available_bandwidth_bps):
        if buffer_level_sec < self.RESERVOIR:
            return self.levels[0]  # emergency: lowest quality

        if buffer_level_sec > self.MAX_BUFFER:
            return self.levels[-1]  # buffer is full: max quality

        # Linear ramp: map buffer fill within the cushion to a quality index
        fraction = (buffer_level_sec - self.RESERVOIR) / self.CUSHION
        index = min(int(fraction * (len(self.levels) - 1)),
                    len(self.levels) - 1)

        # Safety check: don't request more than ~80% of measured bandwidth
        for level in reversed(self.levels[:index + 1]):
            if level.bitrate < available_bandwidth_bps * 0.8:
                return level

        return self.levels[0]
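
For contrast with the buffer-based logic above, here is a minimal sketch of the rate-based approach from the table: an exponentially weighted moving average (EWMA) of recent segment throughputs, with the next quality chosen conservatively below the estimate. The smoothing factor and safety margin are illustrative values.

# Minimal rate-based ABR sketch: EWMA of observed segment throughput
class RateBasedABR:
    def __init__(self, quality_levels, alpha=0.3, safety=0.8):
        self.levels = sorted(quality_levels, key=lambda q: q.bitrate)
        self.alpha = alpha          # EWMA smoothing factor
        self.safety = safety        # only budget 80% of the estimate
        self.estimate_bps = None

    def on_segment_downloaded(self, segment_bytes, download_seconds):
        throughput = segment_bytes * 8 / download_seconds  # bits per second
        if self.estimate_bps is None:
            self.estimate_bps = throughput
        else:
            self.estimate_bps = (self.alpha * throughput
                                 + (1 - self.alpha) * self.estimate_bps)

    def select_quality(self):
        if self.estimate_bps is None:
            return self.levels[0]   # no measurement yet: start conservative
        budget = self.estimate_bps * self.safety
        best = self.levels[0]
        for level in self.levels:
            if level.bitrate <= budget:
                best = level
        return best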

Video Storage

YouTube's storage system must manage petabytes of new data per day across two tiers: the original uploads (for potential re-transcoding) and the transcoded renditions (for actual serving).

Blob Storage Architecture

Storage Layout:
gs://youtube-raw/
  └── {video_id}/
      └── original.mp4              (raw upload, immutable)

gs://youtube-transcoded/
  └── {video_id}/
      ├── h264/
      │   ├── 240p/
      │   │   ├── init.mp4
      │   │   ├── seg_0000.m4s
      │   │   ├── seg_0001.m4s
      │   │   └── ...
      │   ├── 480p/
      │   ├── 720p/
      │   ├── 1080p/
      │   └── 4k/
      ├── vp9/
      │   ├── 720p/
      │   ├── 1080p/
      │   └── 4k/
      ├── av1/
      │   └── 1080p/
      ├── audio/
      │   ├── aac_128k/
      │   └── opus_64k/
      ├── thumbnails/
      │   ├── sprite_sheet.jpg       (scrubber thumbnails)
      │   ├── thumb_001.jpg
      │   ├── thumb_002.jpg
      │   └── thumb_003.jpg
      └── captions/
          ├── en_auto.vtt
          └── en_manual.vtt

Storage Tiering

Not all videos are accessed equally. YouTube employs tiered storage to balance cost and performance:

| Tier | Access Pattern | Storage Class | Cost ($/GB/month) |
|---|---|---|---|
| Hot | Viral/trending, >10K views/day | Standard (multi-region) | $0.026 |
| Warm | Active, 100–10K views/day | Nearline | $0.010 |
| Cold | Long-tail, <100 views/day | Coldline | $0.004 |
| Archive | Rarely accessed, raw originals | Archive | $0.0012 |

The long tail dominates: Over 80% of YouTube videos have fewer than 1,000 lifetime views. These long-tail videos consume most of the storage but a tiny fraction of bandwidth. Aggressive tiering moves them to cheap storage while keeping popular content on fast, replicated storage near CDN edges.
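
A hedged sketch of how that tiering decision could be automated: a periodic job inspects each video's recent view rate and moves its renditions to the matching storage class. The thresholds mirror the table above; the storage-class names are GCS's, and the view-count source and storage call are hypothetical placeholders.

# Sketch of a periodic re-tiering job. The analytics source and the
# storage-class mutation are placeholders; thresholds follow the table.
TIER_BY_VIEWS = [
    (10_000, "STANDARD"),   # hot: viral/trending
    (100, "NEARLINE"),      # warm: active
    (0, "COLDLINE"),        # cold: long tail
]
# Raw originals go straight to the Archive class regardless of views.

def choose_tier(daily_views: int) -> str:
    for threshold, storage_class in TIER_BY_VIEWS:
        if daily_views >= threshold:
            return storage_class
    return "COLDLINE"

def retier_video(video_id: str, daily_views: int, current_class: str) -> str:
    target = choose_tier(daily_views)
    if target != current_class:
        # e.g. update the storage class for every object under
        # gs://youtube-transcoded/{video_id}/ (placeholder call)
        # set_storage_class(prefix=f"{video_id}/", storage_class=target)
        pass
    return target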

Metadata & Database Design

While video bytes live in blob storage, all metadata about videos, users, channels, and interactions lives in a relational database (YouTube uses a custom-sharded MySQL system called Vitess).

Core Schema

-- Videos table (sharded by video_id)
CREATE TABLE videos (
    video_id        BIGINT PRIMARY KEY,       -- snowflake ID
    channel_id      BIGINT NOT NULL,
    title           VARCHAR(100) NOT NULL,
    description     TEXT,
    upload_time     TIMESTAMP NOT NULL,
    duration_sec    INT NOT NULL,
    status          ENUM('processing','active','removed','flagged'),
    privacy         ENUM('public','unlisted','private'),
    category_id     SMALLINT,
    language        CHAR(5),
    storage_bucket  VARCHAR(255),             -- GCS path prefix
    thumbnail_url   VARCHAR(512),
    view_count      BIGINT DEFAULT 0,         -- eventually consistent
    like_count      INT DEFAULT 0,
    dislike_count   INT DEFAULT 0,
    comment_count   INT DEFAULT 0,
    INDEX idx_channel (channel_id),
    INDEX idx_upload_time (upload_time),
    INDEX idx_status (status)
) ENGINE=InnoDB;

-- Channels table (sharded by channel_id)
CREATE TABLE channels (
    channel_id      BIGINT PRIMARY KEY,
    user_id         BIGINT NOT NULL,
    name            VARCHAR(100) NOT NULL,
    description     TEXT,
    subscriber_count BIGINT DEFAULT 0,
    created_at      TIMESTAMP NOT NULL,
    avatar_url      VARCHAR(512),
    banner_url      VARCHAR(512),
    country         CHAR(2),
    INDEX idx_user (user_id)
) ENGINE=InnoDB;

-- Video renditions (which transcoded versions exist)
CREATE TABLE video_renditions (
    video_id        BIGINT NOT NULL,
    codec           ENUM('h264','vp9','av1'),
    resolution      ENUM('240p','360p','480p','720p','1080p','1440p','4k'),
    bitrate_kbps    INT NOT NULL,
    segment_count   INT NOT NULL,
    storage_path    VARCHAR(512) NOT NULL,
    file_size_bytes BIGINT NOT NULL,
    PRIMARY KEY (video_id, codec, resolution),
    FOREIGN KEY (video_id) REFERENCES videos(video_id)
) ENGINE=InnoDB;

-- Subscriptions (sharded by subscriber_id for "my subscriptions" queries)
CREATE TABLE subscriptions (
    subscriber_id   BIGINT NOT NULL,
    channel_id      BIGINT NOT NULL,
    subscribed_at   TIMESTAMP NOT NULL,
    notifications   BOOLEAN DEFAULT TRUE,
    PRIMARY KEY (subscriber_id, channel_id),
    INDEX idx_channel_subs (channel_id)
) ENGINE=InnoDB;

-- Watch history (sharded by user_id, for recommendations)
CREATE TABLE watch_history (
    user_id         BIGINT NOT NULL,
    video_id        BIGINT NOT NULL,
    watched_at      TIMESTAMP NOT NULL,
    watch_duration  INT NOT NULL,          -- seconds actually watched
    watch_percent   DECIMAL(5,2),          -- % of video watched
    PRIMARY KEY (user_id, video_id, watched_at),
    INDEX idx_recent (user_id, watched_at DESC)
) ENGINE=InnoDB;

Sharding Strategy

With billions of rows, a single database instance cannot handle the load. YouTube shards using Vitess: videos and video_renditions are sharded by video_id, subscriptions by subscriber_id (so a user's subscription list lives on one shard), and watch_history by user_id, as noted in the schema comments above.

Cross-shard queries (e.g., "all videos by a channel") are handled by maintaining secondary indexes that map channel_id → [video_ids] on the channel's shard, pointing to the actual video data on video shards.
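
To make the cross-shard lookup concrete, here is a simplified sketch of hash-based shard routing plus the channel-side secondary index described above. The hashing scheme, index table name, and db client are illustrative; in practice Vitess vindexes and lookup tables perform this routing transparently.

# Simplified illustration of shard routing + a cross-shard secondary index.
# The `db` client and channel_video_index table are hypothetical.
import hashlib

N_SHARDS = 256

def shard_for(key: int) -> int:
    # Stable hash so the same id always routes to the same shard
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_SHARDS

def videos_for_channel(channel_id: int, db) -> list[dict]:
    # 1) The secondary index lives on the *channel's* shard and maps
    #    channel_id -> [video_id, ...]
    index_shard = shard_for(channel_id)
    video_ids = db.query(
        index_shard,
        "SELECT video_id FROM channel_video_index WHERE channel_id = %s",
        (channel_id,),
    )
    # 2) Each video row lives on the shard derived from its own video_id
    return [
        db.query_one(
            shard_for(vid),
            "SELECT * FROM videos WHERE video_id = %s",
            (vid,),
        )
        for vid in video_ids
    ]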

Search

YouTube search must handle billions of queries per day across a corpus of 800+ million videos. The search system combines traditional information retrieval with machine learning ranking.

Search Architecture

User Query: "how to make pasta"
    │
    ▼
┌────────────────┐
│  Query Parser  │  → tokenize, spell-correct, expand synonyms
└───────┬────────┘
        ▼
┌────────────────┐
│ Elasticsearch  │  → retrieve top ~1000 candidate videos
│ (Inverted Index│     using BM25 on title + description + tags
│  + Sharded)    │     + auto-captions
└───────┬────────┘
        ▼
┌────────────────┐
│  ML Re-Ranker  │  → neural network re-ranks candidates using:
│  (Deep Model)  │     - text relevance score
│                │     - video quality signals (watch time, CTR)
│                │     - freshness / recency
│                │     - user personalization (watch history)
│                │     - engagement metrics
└───────┬────────┘
        ▼
┌────────────────┐
│  Result Blender│  → mix in ads, Knowledge Panels, playlists
└───────┬────────┘
        ▼
    Search Results (typically top 20)

Elasticsearch Index Structure

PUT /youtube_videos
{
  "settings": {
    "number_of_shards": 64,
    "number_of_replicas": 2,
    "analysis": {
      "analyzer": {
        "youtube_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball", "synonym_filter"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "tutorial,howto,guide",
            "vlog,daily vlog,day in my life"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "video_id":      { "type": "keyword" },
      "title":         { "type": "text", "analyzer": "youtube_analyzer",
                         "boost": 3.0 },
      "description":   { "type": "text", "analyzer": "youtube_analyzer" },
      "tags":          { "type": "text", "analyzer": "youtube_analyzer",
                         "boost": 2.0 },
      "captions":      { "type": "text", "analyzer": "youtube_analyzer" },
      "channel_name":  { "type": "text", "boost": 1.5 },
      "category":      { "type": "keyword" },
      "language":      { "type": "keyword" },
      "upload_date":   { "type": "date" },
      "duration_sec":  { "type": "integer" },
      "view_count":    { "type": "long" },
      "like_ratio":    { "type": "float" },
      "avg_watch_pct": { "type": "float" },
      "subscriber_count": { "type": "long" }
    }
  }
}
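
As an illustration of the retrieval stage, a query against this index might combine BM25 relevance across the text fields with a mild popularity boost before handing candidates to the ML re-ranker. The weighting below is an assumption for the sketch, not YouTube's actual formula.

GET /youtube_videos/_search
{
  "size": 1000,
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "how to make pasta",
          "fields": ["title^3", "tags^2", "channel_name^1.5",
                     "description", "captions"]
        }
      },
      "functions": [
        { "field_value_factor": { "field": "view_count",
                                  "modifier": "log1p", "factor": 0.1 } },
        { "field_value_factor": { "field": "avg_watch_pct",
                                  "factor": 1.0, "missing": 0 } }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}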

Recommendation Engine

Recommendations drive over 70% of total watch time on YouTube. The recommendation system is arguably the most valuable component of the entire platform.

Two-Stage Architecture

800+ Million Videos
        │
        ▼
┌───────────────────────┐
│  Candidate Generation │  → narrow to ~1000 candidates
│  (fast, recall-focused)│
│                       │
│  Techniques:          │
│  • Collaborative       │
│    filtering           │
│  • Content-based       │
│    similarity          │
│  • User embedding      │
│    nearest neighbors   │
└───────────┬───────────┘
            ▼
       ~1000 candidates
            │
            ▼
┌───────────────────────┐
│     Ranking Model     │  → score & rank to top ~50
│ (slower, precision-   │
│  focused)             │
│                       │
│  Features:            │
│  • Watch history      │
│  • Time since upload  │
│  • Video length       │
│  • Channel affinity   │
│  • Device type        │
│  • Time of day        │
│  • Language match     │
│  • Past CTR for       │
│    similar videos     │
└───────────┬───────────┘
            ▼
    Ranked Recommendations
    (homepage, sidebar, end screen)

Collaborative Filtering

At its core: "Users who watched X also watched Y." YouTube builds a massive user-video interaction matrix and factors it into user embeddings and video embeddings using matrix factorization:

# Simplified collaborative filtering with implicit feedback
# Using Alternating Least Squares (ALS)

import numpy as np

class ALSRecommender:
    def __init__(self, n_users, n_videos, n_factors=128, reg=0.01):
        self.user_factors = np.random.normal(0, 0.01, (n_users, n_factors))
        self.video_factors = np.random.normal(0, 0.01, (n_videos, n_factors))
        self.reg = reg

    def train(self, interactions, n_iterations=15):
        """
        interactions: sparse matrix (user × video)
        where value = confidence = 1 + α × watch_time
        """
        for iteration in range(n_iterations):
            # Fix video factors, solve for user factors
            self._als_step(interactions, self.user_factors,
                          self.video_factors)
            # Fix user factors, solve for video factors
            self._als_step(interactions.T, self.video_factors,
                          self.user_factors)

    def _als_step(self, interactions, solve_factors, fixed_factors):
        # Simplified, unweighted least-squares update: for every row i,
        # solve (F^T F + reg*I) x_i = F^T r_i while the other side's
        # factors F stay fixed. A production implicit-feedback ALS would
        # additionally weight each observation by its confidence.
        A = (fixed_factors.T @ fixed_factors
             + self.reg * np.eye(fixed_factors.shape[1]))
        B = interactions @ fixed_factors
        solve_factors[:] = np.linalg.solve(A, B.T).T

    def recommend(self, user_id, n=20):
        scores = self.user_factors[user_id] @ self.video_factors.T
        top_indices = np.argsort(scores)[::-1][:n]
        return top_indices, scores[top_indices]

Deep Learning Ranking

The ranking stage uses a deep neural network (YouTube's 2016 paper describes this architecture, and it has evolved significantly since). The model predicts expected watch time rather than click probability — this avoids promoting clickbait:

# Simplified YouTube deep ranking model
# Predicts expected watch time for (user, video) pair

import torch
import torch.nn as nn

class YouTubeRankingModel(nn.Module):
    def __init__(self, n_video_features, n_user_features, embed_dim=64):
        super().__init__()
        # Feature processing
        self.video_mlp = nn.Sequential(
            nn.Linear(n_video_features, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Linear(256, embed_dim)
        )
        self.user_mlp = nn.Sequential(
            nn.Linear(n_user_features, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Linear(256, embed_dim)
        )
        # Ranking head — predicts watch time
        self.ranking_head = nn.Sequential(
            nn.Linear(embed_dim * 2 + 32, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1)  # predicted watch time in seconds
        )
        # Context features (time of day, device, etc.)
        self.context_embed = nn.Linear(8, 32)

    def forward(self, video_features, user_features, context):
        v = self.video_mlp(video_features)
        u = self.user_mlp(user_features)
        c = self.context_embed(context)
        combined = torch.cat([v, u, c], dim=1)
        return self.ranking_head(combined).squeeze()

CDN & Content Delivery

YouTube operates one of the largest CDN networks in the world — Google's Global Cache (GGC) — with cache nodes embedded in thousands of ISP networks across 200+ points of presence worldwide. The CDN is not optional; it is the most critical component for user experience.

CDN Architecture

┌─────────────┐     Cache Miss      ┌──────────────┐
│   Origin    │◀────────────────────│  Regional    │
│   Storage   │                     │  Cache (PoP) │
│  (GCS/S3)   │────────────────────▶│  (L2 Cache)  │
│             │     Fill Response   │              │
└─────────────┘                     └──────┬───────┘
                                           │
                              Cache Miss   │   Cache Hit
                                           │
                                    ┌──────▼───────┐
                                    │   Edge Node  │
                                    │  (L1 Cache)  │
                                    │  ISP-level   │
                                    └──────┬───────┘
                                           │
                                    ┌──────▼───────┐
                                    │    Client    │
                                    │              │
                                    └──────────────┘

Three-tier cache hierarchy:
  L1: Edge (in-ISP) — microseconds latency, small cache, highest hit rate for popular content
  L2: Regional PoP   — 5-20 ms latency, larger cache, catches most L1 misses
  Origin             — 50-200 ms latency, authoritative, only for cold content
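
A minimal read-through sketch of that hierarchy: each tier checks its own cache and falls through to the next on a miss, filling caches on the way back so the next request for the same segment is served locally. The cache clients and origin fetch are hypothetical placeholders.

# Read-through across L1 -> L2 -> origin; cache clients are placeholders.
def get_segment(video_id: str, rendition: str, segment: int) -> bytes:
    key = f"{video_id}/{rendition}/seg_{segment:04d}.m4s"

    data = edge_cache.get(key)            # L1: in-ISP edge, microseconds
    if data is not None:
        return data

    data = regional_cache.get(key)        # L2: regional PoP, 5-20 ms
    if data is None:
        data = fetch_from_origin(key)     # origin blob storage, 50-200 ms
        regional_cache.set(key, data)     # fill L2 on the way back

    edge_cache.set(key, data)             # fill L1 for the next viewer
    return data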

CDN Routing & Edge Selection

When a client requests a video segment, the CDN must route it to the optimal edge server. YouTube uses a combination of signals: the client's (or its DNS resolver's) geolocation, anycast routing, measured round-trip latency, and current edge-server load, so each request is steered to a nearby cache that is healthy and not overloaded.

CDN Cost Analysis

Bandwidth egress is YouTube's single largest operational cost. Let's estimate:

| Cost Component | Calculation | Monthly Cost |
|---|---|---|
| Egress bandwidth | 1 billion hrs/day × avg 3 Mbps = ~40 EB/month at ~$0.02/GB | ~$800M |
| Storage (total ~1 EB) | ~1 EB at blended ~$0.005/GB/month | ~$5M |
| Transcoding compute | 720K videos/day × ~$0.50/video avg (multiple renditions) | ~$10.8M |
| CDN infrastructure | 1000s of edge servers, peering agreements | ~$200M |
| Total estimated | | ~$1B+/month |

Why Google builds its own CDN: At YouTube's scale, using a third-party CDN (Cloudflare, Akamai) would be prohibitively expensive. Google instead deploys its own Google Global Cache (GGC) servers directly inside ISP networks. This is a win-win: the ISP reduces its upstream bandwidth costs, and Google gets sub-millisecond latency to end users. Over 90% of YouTube traffic is served from these ISP-level caches.

View Counting at Scale

Counting views sounds trivial — just increment a counter, right? At YouTube's scale (5+ billion views/day, potential 10 million views/second for viral videos), naive counting breaks down in multiple ways.

Challenges

Naive counting fails in several ways: a viral video concentrates millions of increments on a single hot counter row; the same user may refresh or rewatch and should not be double-counted; bot and purchased views must be filtered out; and updating the database synchronously on every view would overwhelm it. The pipeline below addresses each of these.

Architecture: Eventual Consistency + Batch Aggregation

View Event Flow:
  Client → "view" event → Kafka (partitioned by video_id)
      │
      ▼
  Stream Processor (Flink/Dataflow)
      │
      ├──▶ Deduplication (HyperLogLog per video per hour)
      │    └── If user_id already counted in this window → discard
      │
      ├──▶ Bot Detection (ML model)
      │    └── If bot score > 0.9 → discard
      │
      └──▶ Batch Aggregation (every 5 minutes)
           └── Accumulated delta → UPDATE videos SET view_count =
               view_count + delta WHERE video_id = ?

Near-real-time counter (for display):
  Redis (sharded) — INCRBY video:{id}:views {delta}
  └── Periodically flush to MySQL (every 5 min)
  └── HyperLogLog for unique viewer estimation:
      PFADD video:{id}:viewers:{hour} {user_id}
      PFCOUNT video:{id}:viewers:{hour}  → ~0.81% error
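
A hedged sketch of that near-real-time counter path using redis-py: increments accumulate in Redis, and a periodic job atomically drains each delta into MySQL. The key names follow the scheme above; the MySQL helper is a placeholder.

# Redis delta counters + periodic flush to MySQL.
# `mysql_execute` is a hypothetical placeholder for the DB layer.
import redis

r = redis.Redis()

def record_view(video_id: str, delta: int = 1) -> None:
    r.incrby(f"video:{video_id}:views", delta)

def flush_view_counts(video_ids: list[str]) -> None:
    """Runs every ~5 minutes: drain each delta and apply it to MySQL."""
    for vid in video_ids:
        key = f"video:{vid}:views"
        # GETSET returns the old value and resets the counter atomically
        delta = int(r.getset(key, 0) or 0)
        if delta:
            mysql_execute(
                "UPDATE videos SET view_count = view_count + %s "
                "WHERE video_id = %s",
                (delta, vid),
            )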

Redis HyperLogLog for View Deduplication

HyperLogLog is a probabilistic data structure that estimates cardinality (unique count) using only 12 KB of memory regardless of how many elements are added. YouTube uses it to deduplicate views:

# Redis HyperLogLog for view dedup
# Each video has an hourly HLL

# User watches video — add to HLL
PFADD video:abc123:viewers:2026041215 user_98765
# Returns 1 if this is a new unique viewer, 0 if already seen

# Get approximate unique viewer count
PFCOUNT video:abc123:viewers:2026041215
# Returns ~count with ±0.81% error

# Merge hourly HLLs into daily count
PFMERGE video:abc123:viewers:20260412 \
  video:abc123:viewers:2026041200 \
  video:abc123:viewers:2026041201 \
  ... \
  video:abc123:viewers:2026041223

# Memory: 12 KB per HLL × 24 hours × 800M videos
# = ~230 TB (only for "active" videos — most are inactive)
# In practice: only maintain HLLs for videos with recent views
# Expiry: TTL 48 hours, then merge into permanent count

Comment Service

YouTube comments are a separate microservice with its own database, sharded by video_id. This isolation ensures that comment-related issues (spam waves, heavy moderation) don't affect video streaming.

Comment Schema & Storage

-- Comments table (sharded by video_id)
CREATE TABLE comments (
    comment_id      BIGINT PRIMARY KEY,       -- snowflake ID
    video_id        BIGINT NOT NULL,
    parent_id       BIGINT DEFAULT NULL,      -- NULL for top-level,
                                              -- comment_id for replies
    user_id         BIGINT NOT NULL,
    content         TEXT NOT NULL,
    created_at      TIMESTAMP NOT NULL,
    updated_at      TIMESTAMP,
    like_count      INT DEFAULT 0,
    is_pinned       BOOLEAN DEFAULT FALSE,
    is_hearted      BOOLEAN DEFAULT FALSE,    -- creator "heart"
    status          ENUM('active','removed','spam','held'),
    INDEX idx_video_time (video_id, created_at DESC),
    INDEX idx_video_top (video_id, like_count DESC),
    INDEX idx_parent (parent_id)
) ENGINE=InnoDB;

-- Sorting strategies per video:
-- "Top comments" → ORDER BY like_count DESC, created_at DESC
-- "Newest first" → ORDER BY created_at DESC
-- Pagination via cursor: WHERE video_id = ? AND created_at < ?
--   LIMIT 20

Comment Moderation Pipeline

New Comment
    │
    ├──▶ Spam Detection (ML model, < 10ms)
    │    ├── Score > 0.95 → auto-remove (mark as spam)
    │    ├── Score 0.5–0.95 → hold for review
    │    └── Score < 0.5 → publish immediately
    │
    ├──▶ Toxicity Detection (Perspective API)
    │    └── Score > 0.8 → hold for review
    │
    ├──▶ Regex Filters (channel-specific blocked words)
    │
    └──▶ Rate Limiting (max 10 comments/minute per user)

Published comments are indexed in Elasticsearch for
search within the video's comment section.
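
The thresholds in the pipeline above reduce to a small routing function. The sketch below assumes the spam and toxicity scores arrive from upstream models and simply maps them to the comment statuses used in the schema; the cutoffs are the ones shown in the diagram.

# Decision routing for a new comment, mirroring the thresholds above.
def moderate_comment(spam_score: float, toxicity_score: float,
                     blocked_word_hit: bool) -> str:
    if spam_score > 0.95:
        return "spam"      # auto-remove
    if blocked_word_hit or spam_score >= 0.5 or toxicity_score > 0.8:
        return "held"      # hold for creator / reviewer approval
    return "active"        # publish immediately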

Live Streaming Architecture

Live streaming adds real-time constraints on top of the VOD architecture. The key difference: transcoding must happen in real time, and latency from broadcaster to viewer should be under 5 seconds.

Broadcaster (OBS/phone)
    │
    │  RTMP/SRT ingest (1-2 Mbps uplink)
    ▼
┌────────────────┐
│  Ingest Server │  → receive raw stream, buffer 1-2 GOP
│  (Region-local)│
└───────┬────────┘
        │
        ▼
┌────────────────┐
│  Live Transcoder│  → real-time: 240p, 480p, 720p, 1080p
│  (GPU-accelerated)│     using NVENC / hardware encoders
│                │     latency budget: < 500ms per segment
└───────┬────────┘
        │
        ├──▶ HLS/DASH segments (2s segments for low latency)
        │    pushed to CDN edge every 2 seconds
        │
        ├──▶ LL-HLS (Low-Latency HLS) for < 3s glass-to-glass
        │
        └──▶ DVR buffer (last 4 hours) for rewind/seek

Viewers request the "live edge" of the playlist:
  #EXT-X-SERVER-CONTROL:CAN-BLOCK-RELOAD=YES,PART-HOLD-BACK=1.0
  The server holds the response until a new segment is ready,
  eliminating polling delay (HTTP long-polling for live edge).

Infrastructure Cost Deep Dive

Building a YouTube-scale system forces hard economic trade-offs. Let's analyze the three dominant cost centers:

1. Storage Costs (Petabyte Scale)

Daily upload volume:
  500 hrs/min × 60 min/hr × 24 hrs = 720,000 hrs of video/day

Per video (average 5-min, 1080p original):
  Original: ~500 MB
  Transcoded (7 resolutions × 2 codecs avg): ~2.5 GB total
  Thumbnails + captions: ~5 MB
  Total per video: ~3 GB

Daily new storage: 720K videos × 3 GB = ~2.16 PB/day
Annual new storage: ~790 PB/year
Cumulative (15+ years): ~5-10 EB (Exabytes)

Annual storage cost (blended tiers):
  Hot tier (5%):   50 PB × $0.023/GB × 12 = $13.8M/year
  Warm tier (15%): 150 PB × $0.010/GB × 12 = $18.0M/year
  Cold tier (60%): 600 PB × $0.004/GB × 12 = $28.8M/year
  Archive (20%):   200 PB × $0.001/GB × 12 = $2.4M/year
  ─────────────────────────────────────────
  Total storage:                              ~$63M/year

2. Bandwidth Costs (Egress Is King)

Daily video views: ~5 billion
Average video duration watched: ~7 minutes
Average bitrate: ~3 Mbps (blended across quality levels)

Daily egress:
  5B views × 7 min × 60 sec × 3 Mbps / 8 = ~787 PB/day

Monthly egress: ~23.6 EB/month

Cost at cloud provider rates ($0.05-0.08/GB):
  23.6 EB × $0.05/GB = ~$1.18B/month  ← ASTRONOMICAL

Google's advantage: own fiber network + ISP-level caches
  Effective cost: ~$0.005-0.01/GB (peering + owned infra)
  23.6 EB × $0.008/GB = ~$189M/month  ← still huge but 6x cheaper

3. Compute Costs (Transcoding)

Videos transcoded per day: ~720,000
Average transcoding time per rendition: ~2× real-time
  (5-min video takes ~10 min to transcode per rendition)
Renditions per video: ~12 (7 resolutions × ~2 codecs)

Total transcode-minutes per day:
  720K × 12 renditions × 10 min = 86.4M CPU-minutes/day

At AWS c5.2xlarge ($0.34/hr, 8 vCPUs):
  86.4M min ÷ 8 cores ÷ 60 = ~180K instance-hours/day
  180K × $0.34 = ~$61K/day = ~$1.83M/month

With spot/preemptible instances (~70% discount):
  ~$550K/month for transcoding compute

Note: AV1 encoding is 10-50× slower than H.264,
so it's only used for the most popular videos where
the bandwidth savings justify the compute cost.

Fault Tolerance & Reliability

At YouTube's scale, failures are not exceptions — they are the norm. The system must gracefully handle disk and server failures, transcoding workers dying mid-job, network partitions between regions, overloaded downstream dependencies, and even full-region outages.

Circuit Breaker Pattern

# Circuit breaker for transcoding service calls
import time

class ServiceUnavailableError(Exception):
    """Raised when the circuit is open and calls are being rejected."""

class CircuitBreaker:
    CLOSED = 'closed'      # normal operation
    OPEN = 'open'          # failing, reject requests
    HALF_OPEN = 'half_open' # testing if service recovered

    def __init__(self, failure_threshold=5, timeout_sec=30):
        self.state = self.CLOSED
        self.failure_count = 0
        self.threshold = failure_threshold
        self.timeout = timeout_sec
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time > self.timeout:
                self.state = self.HALF_OPEN
            else:
                raise ServiceUnavailableError("Circuit breaker OPEN")

        try:
            result = func(*args, **kwargs)
            if self.state == self.HALF_OPEN:
                self.state = self.CLOSED
                self.failure_count = 0
            return result
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.threshold:
                self.state = self.OPEN
            raise
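
Usage is a thin wrapper around any flaky dependency call; the example below assumes a hypothetical submit_transcode_job client and retry queue.

# Example usage with a hypothetical transcoding client call
breaker = CircuitBreaker(failure_threshold=5, timeout_sec=30)

def enqueue_transcode(video_id: str) -> None:
    try:
        breaker.call(submit_transcode_job, video_id)
    except ServiceUnavailableError:
        # Fast-fail: requeue instead of piling load onto a sick service
        retry_queue.put(video_id)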

Summary & Interview Tips

| Component | Technology | Key Design Decision |
|---|---|---|
| Upload | Resumable upload → Kafka → worker fleet | Async processing, segment-level parallelism |
| Transcoding | FFmpeg, H.264/VP9/AV1, DAG scheduler | Pre-transcode all formats; AV1 only for popular |
| Streaming | HLS/DASH, adaptive bitrate | Buffer-based ABR, 4-sec segments |
| Storage | GCS/S3, tiered (hot/warm/cold/archive) | Long-tail → cold; popular → multi-region hot |
| Metadata DB | MySQL + Vitess (sharded) | Shard by video_id; secondary indexes for lookups |
| Search | Elasticsearch + ML re-ranker | BM25 retrieval → neural ranking |
| Recommendations | ALS + deep ranking model | Optimize for watch time, not clicks |
| CDN | Google Global Cache (ISP-level) | 3-tier cache hierarchy, own infra to save costs |
| View counting | Kafka → Flink → Redis HLL → MySQL | Eventual consistency, batch aggregation |
| Comments | Separate microservice, sharded by video_id | Isolated blast radius, ML moderation |

Interview key points: (1) Clearly separate the upload (write) and streaming (read) paths — they have completely different requirements. (2) Emphasize CDN as the primary serving layer — origin servers should rarely be hit. (3) Discuss codec trade-offs (H.264 vs VP9 vs AV1). (4) Explain why view counts use eventual consistency. (5) Mention cost — bandwidth egress is the dominant expense. (6) Talk about the recommendation engine as the core business differentiator.