High Level Design Series · Real-World Designs · Post 52 of 70

Design: Google Drive / Dropbox

Cloud file storage is deceptively simple on the surface — "just upload files and sync them" — but behind that simplicity lies one of the most complex distributed systems problems in modern engineering. Google Drive alone serves over 1 billion users, storing trillions of files, syncing changes across billions of devices in near real-time. Dropbox processes over 1.2 billion files per day.

The core challenge: how do you let hundreds of millions of users seamlessly upload, download, share, and collaboratively edit files across every device they own, while keeping storage costs manageable, latency low, and conflicts resolvable? The answer involves file chunking, content-addressable storage, chunk-level deduplication, delta sync, and a carefully orchestrated notification pipeline.

Scope of this design: We focus on the sync engine — the hardest part. Not the WYSIWYG editor (Google Docs), not the spreadsheet engine, not real-time collaborative cursors. Those are separate systems layered on top. Our system handles file storage, sync, sharing, and versioning for arbitrary file types.

Requirements & Scale

Functional Requirements

- Upload and download files of any type and size
- Automatically sync changes across all of a user's devices
- Share files and folders with other users and via links, with permissions
- Keep version history and allow restoring previous versions
- Work offline; reconcile queued changes when connectivity returns

Non-Functional Requirements

- Durability first: an acknowledged upload must never be lost
- High availability for reads and sync
- Low sync latency: changes visible on other devices within seconds
- Bandwidth efficiency: never re-upload bytes the server already has
- Scale to hundreds of millions of users and billions of files

Back-of-Envelope Estimates

Users:               500M registered, 100M DAU
Files per user:      ~200 files average
Total files:         100 billion files
Average file size:   ~500 KB
Total storage:       100B × 500KB = 50 PB (raw), ~150 PB with replication
Daily uploads:       100M users × 3 file changes/day = 300M operations/day
                     = ~3,500 ops/second average, ~10,000 ops/second peak
Daily bandwidth:     300M × 500KB avg change = 150 TB/day upload
Sync notifications:  ~1 billion notifications/day (multi-device fan-out)
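
The arithmetic behind these figures is easy to sanity-check; a quick sketch using the numbers from the table above (decimal units, 1 PB = 1e15 bytes):

```javascript
// Sanity check of the back-of-envelope estimates above.
const totalFiles   = 100e9;                        // 100 billion files
const avgFileBytes = 500e3;                        // ~500 KB average
const rawPB        = totalFiles * avgFileBytes / 1e15;

const opsPerDay = 100e6 * 3;                       // 100M DAU × 3 changes/day
const opsPerSec = opsPerDay / 86400;

console.log(rawPB);                  // 50  (→ ~150 PB with 3× replication)
console.log(Math.round(opsPerSec));  // 3472 ops/s average
```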

High-Level Architecture

The system decomposes into six major components, each independently scalable:

┌───────────────────────────────────────────────────────────────────────┐
│                             CLIENT DEVICE                             │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────┐                │
│  │ File Watcher │  │ Chunker /    │  │ Local DB      │                │
│  │ (inotify /   │  │ Hasher       │  │ (SQLite)      │                │
│  │  FSEvents /  │  │              │  │ chunk_hashes  │                │
│  │  ReadDir)    │  │ 4MB chunks   │  │ sync_state    │                │
│  └──────┬───────┘  └──────┬───────┘  └───────┬───────┘                │
│         │                 │                  │                        │
│         └─────────┬───────┴──────────────────┘                        │
│                   ▼                                                   │
│           ┌───────────────┐                                           │
│           │  Sync Engine  │ ← coordinates upload/download/conflict    │
│           └───────┬───────┘                                           │
└───────────────────┼───────────────────────────────────────────────────┘
                    │ HTTPS / WebSocket
                    ▼
┌──────────────────────────────────────────────────────────────────────┐
│                         API GATEWAY / LB                             │
└──────────────────────────┬───────────────────────────────────────────┘
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────────┐
│  Upload      │  │  Metadata    │  │  Notification    │
│  Service     │  │  Service     │  │  Service         │
│              │  │              │  │  (WebSocket /    │
│  chunk →     │  │  files,      │  │   long-polling)  │
│  block store │  │  folders,    │  │                  │
│              │  │  versions,   │  │  push changes    │
│              │  │  ACLs        │  │  to devices      │
└──────┬───────┘  └──────┬───────┘  └──────────────────┘
       │                 │
       ▼                 ▼
┌──────────────┐  ┌──────────────┐
│  Block Store │  │  Metadata DB │
│  (S3 / GCS)  │  │  (MySQL /    │
│              │  │   PostgreSQL)│
│  content-    │  │              │
│  addressable │  │  + Redis     │
│              │  │    cache     │
└──────────────┘  └──────────────┘
Key insight: We separate data (file chunks — large, binary, immutable) from metadata (file names, folder structure, permissions, versions — small, relational, mutable). This separation is fundamental to every cloud storage system. Data goes to object storage (S3/GCS); metadata goes to a relational database.

File Chunking — The Foundation

Every file storage system at scale uses chunking — splitting files into fixed-size blocks. This single decision enables deduplication, resumable uploads, parallel transfers, and efficient delta sync. Without chunking, changing one byte of a 5 GB video would require re-uploading the entire file.

Choosing the Chunk Size

The chunk size is a critical engineering trade-off:

Chunk Size   Pros                                             Cons
1 MB         Fine-grained dedup, small re-uploads             Too many chunks per file → metadata overhead, S3 PUT cost
4 MB ✓       Good balance: reasonable dedup, manageable       Slightly coarser dedup than 1 MB
             chunk count; good for most file types
8 MB         Fewer chunks, less metadata                      Coarser dedup; small edits waste more bandwidth
64 MB        Minimal metadata (HDFS default)                  Terrible for sync — one byte change re-uploads 64 MB

We choose 4 MB chunks. A 1 GB file becomes ~256 chunks. A 100 KB document is a single chunk. Dropbox uses 4 MB; Google Drive uses variable chunk sizes based on file type (4–8 MB for most).

The Chunking Algorithm

// Fixed-size chunking with SHA-256 hashing
function chunkFile(filePath) {
    const CHUNK_SIZE = 4 * 1024 * 1024;  // 4 MB
    const file = openFile(filePath);
    const chunks = [];
    let offset = 0;
    let index = 0;

    while (offset < file.size) {
        const end = Math.min(offset + CHUNK_SIZE, file.size);
        const data = file.read(offset, end);

        // SHA-256 hash of the raw chunk bytes
        const hash = sha256(data);

        chunks.push({
            index: index,
            offset: offset,
            size: end - offset,
            hash: hash,         // content-addressable key
            data: data           // raw bytes (held in memory briefly)
        });

        offset = end;
        index++;
    }

    return {
        fileName: filePath,
        fileSize: file.size,
        totalChunks: chunks.length,
        fileHash: sha256(chunks.map(c => c.hash).join('')),
        chunks: chunks
    };
}

// Example output for a 14 MB file:
// {
//   fileName: "presentation.pptx",
//   fileSize: 14680064,
//   totalChunks: 4,            // ceil(14MB / 4MB) = 4
//   fileHash: "a3f2c1...",
//   chunks: [
//     { index: 0, offset: 0,        size: 4194304, hash: "e7b3a1..." },
//     { index: 1, offset: 4194304,  size: 4194304, hash: "c2d4f9..." },
//     { index: 2, offset: 8388608,  size: 4194304, hash: "91ab2e..." },
//     { index: 3, offset: 12582912, size: 2097152, hash: "f8c7d3..." }
//   ]
// }
Why SHA-256? SHA-256 produces a 256-bit (32-byte) hash. Collisions are astronomically unlikely: by the birthday bound, you would need on the order of 2^128 hashes before expecting a single collision, which is less likely than a random bit flip in RAM corrupting your data. It's fast enough for client-side computation (~400 MB/s on modern CPUs) and universally supported.

Content-Defined Chunking (CDC) — Advanced

Fixed-size chunking has a weakness: if you insert data at the beginning of a file, every chunk boundary shifts and all chunk hashes change. Content-Defined Chunking (Rabin fingerprinting) solves this by choosing chunk boundaries based on the content:

// Content-Defined Chunking using Rabin fingerprinting
function contentDefinedChunk(filePath) {
    const MIN_CHUNK = 2 * 1024 * 1024;   // 2 MB minimum
    const MAX_CHUNK = 8 * 1024 * 1024;   // 8 MB maximum
    const TARGET    = 4 * 1024 * 1024;   // 4 MB target
    const MASK      = 0x3FFFFF;           // 22 bits set → ~4 MB average

    const file = openFile(filePath);
    const chunks = [];
    let offset = 0;
    let chunkStart = 0;

    while (offset < file.size) {
        const windowSize = 48;  // Rabin window
        const fingerprint = rabinFingerprint(
            file.read(offset, offset + windowSize)
        );

        const chunkLen = offset - chunkStart;

        // Cut when fingerprint matches mask AND chunk >= minimum
        // OR when chunk hits maximum size
        if ((chunkLen >= MIN_CHUNK && (fingerprint & MASK) === 0)
            || chunkLen >= MAX_CHUNK) {

            const data = file.read(chunkStart, offset);
            chunks.push({
                offset: chunkStart,
                size: chunkLen,
                hash: sha256(data)
            });
            chunkStart = offset;
        }

        offset++;
    }

    // Flush remaining bytes as last chunk
    if (chunkStart < file.size) {
        const data = file.read(chunkStart, file.size);
        chunks.push({
            offset: chunkStart,
            size: file.size - chunkStart,
            hash: sha256(data)
        });
    }

    return chunks;
}

// Advantage: Inserting 100 bytes at offset 0 only changes the FIRST chunk.
// All subsequent chunk boundaries stay the same because they're
// content-defined, not position-defined. Massive bandwidth savings!

Dropbox switched from fixed-size to content-defined chunking in 2014, reducing sync bandwidth by ~40% for files with insertions/deletions near the beginning.

Deduplication — Store Once, Reference Many

With content-addressable storage, each chunk is stored using its SHA-256 hash as the key. If two chunks have the same hash, they are (with overwhelming probability) identical. We store the data only once and add a reference.

Deduplication at Multiple Levels

// Level 1: Intra-file dedup
// A 400 MB VM image with large zero-filled regions
// Chunk "0000...0000" (4 MB of zeros) appears 50 times
// → store 1 chunk, reference it 50 times
// Savings: 196 MB

// Level 2: Cross-file dedup (same user)
// User copies presentation.pptx → presentation_v2.pptx
// Makes a small edit → only 1 chunk changes
// → 255 of 256 chunks already exist, upload only 1
// Savings: ~1 GB for a 1 GB file copy

// Level 3: Cross-user dedup
// 10,000 users upload the same company_logo.png (800 KB)
// → store 1 chunk, 10,000 references
// Savings: ~8 GB

// Upload flow with dedup:
async function uploadWithDedup(fileChunks) {
    // Step 1: Send all chunk hashes to server
    const hashList = fileChunks.map(c => c.hash);
    const serverResponse = await api.checkChunks(hashList);
    // Response: { needed: ["e7b3a1...", "f8c7d3..."], existing: ["c2d4f9...", "91ab2e..."] }

    // Step 2: Upload ONLY chunks the server doesn't have
    const newChunks = fileChunks.filter(c => serverResponse.needed.includes(c.hash));
    for (const chunk of newChunks) {
        await api.uploadChunk(chunk.hash, chunk.data);
        // Server stores: S3 key = "chunks/{hash}" → chunk data
    }

    // Step 3: Register file metadata (references existing + new chunks)
    await api.registerFile({
        path: "/documents/report.pdf",
        chunks: hashList,    // ordered list of chunk hashes
        fileHash: sha256(hashList.join(''))
    });

    // Result: only 2 of 4 chunks uploaded (50% bandwidth saved)
}

Deduplication in Practice

Scenario                           Without Dedup        With Dedup                      Savings
Edit 1 page of a 100-page PDF      Re-upload 50 MB      Upload 4 MB (1 chunk)           92%
Copy a 2 GB folder within Drive    Upload 2 GB          Upload 0 bytes (metadata only)  100%
10K users upload same installer    500 GB × 10K = 5 PB  500 GB (once)                   99.99%
Overall platform (Dropbox 2019)    ~500 PB              ~120 PB (reported)              ~76%

Sync Protocol — The State Machine

The sync engine on each device is a state machine that watches the local file system, detects changes, computes diffs at the chunk level, uploads only what's new, and downloads changes made by other devices. It is the most complex component in the entire system.

Sync Engine States

// Sync Engine State Machine
//
//  ┌────────────┐  file change   ┌────────────┐  chunks computed   ┌────────────┐
//  │   IDLE     │ ─────────────→ │  INDEXING   │ ─────────────────→ │  DIFFING   │
//  │            │                │            │                     │            │
//  │  watching  │                │  reading   │                     │  comparing │
//  │  for       │                │  file,     │                     │  local vs  │
//  │  changes   │                │  chunking, │                     │  server    │
//  │            │                │  hashing   │                     │  chunks    │
//  └────────────┘                └────────────┘                     └──────┬─────┘
//       ▲                                                                  │
//       │                                                           has diff?
//       │                                                          ┌───────┴───────┐
//       │                                                     no   │               │ yes
//       │                                                          ▼               ▼
//       │                                                    ┌──────────┐   ┌──────────────┐
//       │                                                    │ UP TO    │   │ UPLOADING    │
//       │◄───────────────────────────────────────────────────│ DATE     │   │              │
//       │                                                    └──────────┘   │ sending new  │
//       │                                                                   │ chunks to    │
//       │                                                                   │ block store  │
//       │                                                                   └──────┬───────┘
//       │                                                                          │
//       │                                                                   all uploaded
//       │                                                                          ▼
//       │                                                                   ┌──────────────┐
//       │                                                                   │ COMMITTING   │
//       │◄──────────────────────────────────────────────────────────────────│              │
//       │                                        done                       │ update       │
//                                                                           │ metadata DB  │
//                                                                           │ + notify     │
//                                                                           └──────────────┘

enum SyncState {
    IDLE,        // Watching file system for changes
    INDEXING,    // Reading changed file, computing chunks + hashes
    DIFFING,     // Comparing local chunk list vs server chunk list
    UPLOADING,   // Uploading new/changed chunks to block store
    COMMITTING,  // Updating metadata DB with new file version
    DOWNLOADING, // Fetching chunks changed by another device
    CONFLICTED,  // Conflicting changes detected
    ERROR        // Retryable error state (exponential backoff)
}
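
One plausible encoding of the legal transitions between these states is a lookup table the engine can validate against. The exact edge set here is an assumption read off the diagram and enum above (with "no diff" mapped to a return to IDLE, since the enum has no UP_TO_DATE state):

```javascript
// Legal sync-state transitions, encoded as adjacency lists.
const TRANSITIONS = {
    IDLE:        ['INDEXING', 'DOWNLOADING'],
    INDEXING:    ['DIFFING', 'ERROR'],
    DIFFING:     ['UPLOADING', 'IDLE', 'CONFLICTED'],  // no diff → back to IDLE
    UPLOADING:   ['COMMITTING', 'ERROR'],
    COMMITTING:  ['IDLE', 'CONFLICTED', 'ERROR'],
    DOWNLOADING: ['IDLE', 'ERROR'],
    CONFLICTED:  ['IDLE'],
    ERROR:       ['IDLE']                              // retry after backoff
};

function canTransition(from, to) {
    return (TRANSITIONS[from] || []).includes(to);
}

console.log(canTransition('IDLE', 'INDEXING'));    // true
console.log(canTransition('IDLE', 'COMMITTING'));  // false — must index first
```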

Change Detection

The client watches the local file system using OS-level APIs:

// Platform-specific file system watchers
// Linux:   inotify (kernel-level, event-driven, ~8K watches default)
// macOS:   FSEvents (coalesced events, efficient for large trees)
// Windows: ReadDirectoryChangesW (per-directory, recursive option)

class FileWatcher {
    constructor(syncRoot) {
        this.syncRoot = syncRoot;
        this.debounceMs = 500;  // Coalesce rapid changes
        this.pendingChanges = new Map();
    }

    onFileChange(event) {
        // event: { type: CREATE|MODIFY|DELETE|RENAME, path: string }

        // Debounce: wait 500ms after last change before processing
        // (editors do save → delete → rename atomically)
        clearTimeout(this.pendingChanges.get(event.path));
        this.pendingChanges.set(event.path, setTimeout(() => {
            this.pendingChanges.delete(event.path);
            this.syncEngine.enqueue(event);
        }, this.debounceMs));
    }
}

// The debounce is critical: text editors often perform multi-step saves:
// 1. Write to temp file (report.pdf.tmp)
// 2. Delete original (report.pdf)
// 3. Rename temp → original (report.pdf.tmp → report.pdf)
// Without debouncing, we'd process each step as a separate change.
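
The debounce itself is only a few lines. This sketch collapses a burst of events for the same path into one sync-engine callback (delay shortened to 50 ms for demonstration):

```javascript
// Collapse rapid change events per path into a single callback that fires
// after the burst settles (50 ms here; ~500 ms in the client above).
function makeDebouncer(onSettled, delayMs = 50) {
    const pending = new Map();               // path → timer handle
    return (path) => {
        clearTimeout(pending.get(path));     // reset the timer on each event
        pending.set(path, setTimeout(() => {
            pending.delete(path);
            onSettled(path);
        }, delayMs));
    };
}

const seen = [];
const onChange = makeDebouncer(p => seen.push(p));

// An editor save fires write-temp / delete / rename in quick succession:
onChange('report.pdf');
onChange('report.pdf');
onChange('report.pdf');

setTimeout(() => console.log(seen), 100);    // [ 'report.pdf' ] — one event
```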

Upload Protocol (Client → Server)

// Complete upload flow
async function syncLocalChange(filePath) {
    // 1. INDEXING: Chunk the file and compute hashes
    const chunks = chunkFile(filePath);
    const localHashes = chunks.map(c => c.hash);

    // 2. DIFFING: Ask server which chunks it already has
    const { needed, serverVersion } = await api.diffChunks({
        filePath: filePath,
        chunkHashes: localHashes
    });
    // Server compares against the latest version's chunk list
    // Returns: needed = hashes not in block store

    // 3. Check for conflicts (server version newer than our base version)
    const localVersion = localDB.getVersion(filePath);
    if (serverVersion > localVersion) {
        return handleConflict(filePath, chunks, serverVersion);
    }

    // 4. UPLOADING: Upload only the chunks the server needs
    const uploadPromises = needed.map(hash => {
        const chunk = chunks.find(c => c.hash === hash);
        return uploadChunkWithRetry(chunk, {
            maxRetries: 3,
            backoff: 'exponential'
        });
    });
    await Promise.all(uploadPromises);  // Parallel upload!

    // 5. COMMITTING: Register new version with metadata service
    const newVersion = await api.commitFileVersion({
        filePath: filePath,
        chunkHashes: localHashes,       // ordered chunk list
        fileSize: chunks.reduce((s, c) => s + c.size, 0),
        checksum: sha256(localHashes.join('')),
        baseVersion: localVersion        // optimistic concurrency
    });

    // 6. Update local DB
    localDB.setVersion(filePath, newVersion.version);
    localDB.setChunkHashes(filePath, localHashes);
}

// Chunk upload with resumability
async function uploadChunkWithRetry(chunk, opts) {
    for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
        try {
            // Use pre-signed S3 URL for direct upload (bypass our servers)
            const { uploadUrl } = await api.getUploadUrl(chunk.hash);
            await httpPut(uploadUrl, chunk.data, {
                headers: {
                    'Content-Type': 'application/octet-stream',
                    'Content-Length': chunk.size,
                    'x-amz-content-sha256': chunk.hash
                }
            });
            return;
        } catch (err) {
            if (attempt === opts.maxRetries) throw err;
            const delay = Math.pow(2, attempt) * 1000;  // 1s, 2s, 4s
            await sleep(delay);
        }
    }
}
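
The retry-with-backoff pattern in step 4 generalizes to any transient failure. A runnable sketch (delays shortened for demonstration; `withRetry` is an illustrative helper, not an API from the design above):

```javascript
// Run an async operation, retrying with exponential backoff on failure.
async function withRetry(fn, { maxRetries = 3, baseMs = 1 } = {}) {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        try {
            return await fn(attempt);
        } catch (err) {
            if (attempt === maxRetries) throw err;      // retries exhausted
            const delay = Math.pow(2, attempt) * baseMs; // 1×, 2×, 4×, ...
            await new Promise(resolve => setTimeout(resolve, delay));
        }
    }
}

// A flaky "upload" that fails twice, then succeeds on the third attempt.
let calls = 0;
withRetry(async () => {
    calls++;
    if (calls < 3) throw new Error('transient network error');
    return 'uploaded';
}).then(result => console.log(result, calls));  // uploaded 3
```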

Download Protocol (Server → Client)

// When the notification service tells us another device changed a file:
async function syncRemoteChange(notification) {
    // notification: { filePath, newVersion, changedChunks, deletedChunks }
    const { filePath, newVersion } = notification;

    // 1. Get the new chunk list from metadata service
    const newMeta = await api.getFileMetadata(notification.filePath, notification.newVersion);
    // Returns: { chunkHashes: [...], fileSize: ..., modifiedAt: ... }

    // 2. Compare with our local chunk list
    const localHashes = localDB.getChunkHashes(notification.filePath);
    const toDownload = newMeta.chunkHashes.filter(h => !localHashes.includes(h));
    const toRemove = localHashes.filter(h => !newMeta.chunkHashes.includes(h));

    // 3. Download new chunks (parallel, from S3 via CDN)
    const downloadedChunks = await Promise.all(
        toDownload.map(hash => downloadChunk(hash))
    );

    // 4. Reassemble the file from chunks
    //    Local cache: keep recently used chunks in ~/.drive/cache/
    const allChunks = newMeta.chunkHashes.map(hash => {
        const downloaded = downloadedChunks.find(c => c.hash === hash);
        if (downloaded) return downloaded.data;
        return localChunkCache.get(hash);  // unchanged chunk from cache
    });

    // 5. Write to temp file, then atomic rename
    const tempPath = filePath + '.drive-tmp';
    writeFile(tempPath, concatenate(allChunks));
    atomicRename(tempPath, filePath);

    // 6. Update local state
    localDB.setVersion(filePath, notification.newVersion);
    localDB.setChunkHashes(filePath, newMeta.chunkHashes);
}
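
The list comparison in step 2 is worth pulling out on its own. Using Sets keeps it linear even for files with thousands of chunks (the `includes` scans in the sketch above are quadratic); `diffChunkLists` is an illustrative helper name:

```javascript
// Given the local and remote ordered chunk-hash lists, compute which
// chunks to fetch and which cached chunks are no longer referenced.
function diffChunkLists(localHashes, remoteHashes) {
    const local  = new Set(localHashes);
    const remote = new Set(remoteHashes);
    return {
        toDownload: remoteHashes.filter(h => !local.has(h)),
        toRemove:   localHashes.filter(h => !remote.has(h))
    };
}

const { toDownload, toRemove } = diffChunkLists(
    ['aaa', 'bbb', 'ccc'],   // local version's chunk hashes
    ['aaa', 'ddd', 'ccc']    // remote version: middle chunk changed
);
console.log(toDownload);  // [ 'ddd' ]
console.log(toRemove);    // [ 'bbb' ]
```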

File Sync Flow

The complete sync flow when a user edits a file on Device A and the change propagates to Device B: edit on Device A → chunk → diff → upload changed chunks → server metadata update → notify Device B → Device B downloads → synced.

Notification Service

After a file version is committed, all other devices with access to that file must be notified. The notification service maintains persistent connections to all online clients:

Approaches Compared

Method           Latency          Connections                  Battery  Use Case
Polling          High (interval)  Reconnects each time         Bad      Simple, rarely used at scale
Long-polling ✓   Low (~instant)   1 persistent per device      Good     Desktop clients, reliable
WebSocket ✓      Lowest           1 persistent, bidirectional  Good     Real-time sync, web clients
Push (APNs/FCM)  Medium           Managed by OS                Best     Mobile (when app is backgrounded)

// Notification Service Architecture
class NotificationService {
    // In-memory: userId → Set of connected WebSocket sessions
    // Backed by Redis Pub/Sub for multi-server fan-out
    connections = new Map();  // userId → [ws1, ws2, ...]

    async notifyFileChange(fileId, changedByDeviceId, newVersion) {
        // 1. Get all users who have access to this file
        const accessList = await metadataDB.getFileAccessList(fileId);
        // Returns: [{ userId, permission, deviceIds }]

        // 2. Get all devices for those users
        const devices = accessList.flatMap(u =>
            u.deviceIds.map(d => ({ userId: u.userId, deviceId: d }))
        ).filter(d =>
            // Don't notify the device that made the change
            d.deviceId !== changedByDeviceId
        );

        // 3. Publish to Redis Pub/Sub (fan-out to notification servers)
        const notification = {
            type: 'FILE_CHANGED',
            fileId: fileId,
            version: newVersion,
            timestamp: Date.now()
        };

        for (const device of devices) {
            await redis.publish(
                `notifications:${device.userId}:${device.deviceId}`,
                JSON.stringify(notification)
            );
        }
    }

    // Each notification server subscribes to channels for connected clients
    onWebSocketConnect(userId, deviceId, ws) {
        redis.subscribe(`notifications:${userId}:${deviceId}`, (msg) => {
            ws.send(msg);
        });
    }
}

// Long-polling fallback (for clients behind strict firewalls):
// Client: GET /api/changes?since=cursor_abc&timeout=60
// Server: Hold request open for up to 60s. Return immediately if
//         there are changes, or empty response on timeout.
// Client: Immediately reconnect with new cursor.
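
The fan-out logic can be demonstrated with an in-process stand-in for the Redis pub/sub layer. All names here are illustrative; the point is that one publish per channel reaches every connected session except the originating device:

```javascript
// Minimal in-process pub/sub: channel name → list of subscriber callbacks.
class PubSub {
    constructor() { this.channels = new Map(); }
    subscribe(channel, cb) {
        if (!this.channels.has(channel)) this.channels.set(channel, []);
        this.channels.get(channel).push(cb);
    }
    publish(channel, msg) {
        (this.channels.get(channel) || []).forEach(cb => cb(msg));
    }
}

const bus = new PubSub();
const delivered = [];

// Two devices of user 42 hold "connections" (callbacks) on their channels.
bus.subscribe('notifications:42:laptop', m => delivered.push(['laptop', m]));
bus.subscribe('notifications:42:phone',  m => delivered.push(['phone', m]));

// The laptop commits a change → fan out to every device except the origin.
const changedByDeviceId = 'laptop';
for (const deviceId of ['laptop', 'phone']) {
    if (deviceId === changedByDeviceId) continue;
    bus.publish(`notifications:42:${deviceId}`,
        JSON.stringify({ type: 'FILE_CHANGED', fileId: 7, version: 2 }));
}

console.log(delivered.length);  // 1 — only the phone was notified
console.log(delivered[0][0]);   // phone
```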

Conflict Resolution

Conflicts occur when two devices edit the same file before either sync completes. This is unavoidable in a system with offline support. We need a strategy that never loses data while keeping the UX simple.

Detecting Conflicts

// Optimistic concurrency control with version vectors
async function commitFileVersion(request) {
    const { filePath, chunkHashes, baseVersion } = request;

    // Atomic check: is the current server version what the client expects?
    const currentVersion = await metadataDB.getCurrentVersion(filePath);

    if (currentVersion !== baseVersion) {
        // CONFLICT: someone else committed a new version while we were editing
        throw new ConflictError({
            clientBaseVersion: baseVersion,
            serverCurrentVersion: currentVersion,
            filePath: filePath
        });
    }

    // No conflict: commit the new version
    const newVersion = currentVersion + 1;
    await metadataDB.insertVersion({
        filePath: filePath,
        version: newVersion,
        chunkHashes: chunkHashes,
        committedAt: Date.now()
    });

    return { version: newVersion };
}
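
An in-memory sketch shows the compare-and-set behavior of this commit path. A commit succeeds only when the caller's base version matches the current server-side version (`tryCommit` is an illustrative stand-in for the metadata transaction above):

```javascript
// filePath → latest committed version (stand-in for the metadata DB).
const currentVersions = new Map();

function tryCommit(filePath, baseVersion) {
    const current = currentVersions.get(filePath) || 0;
    if (current !== baseVersion) {
        // Someone else committed since this client last synced.
        return { ok: false, serverCurrentVersion: current };
    }
    currentVersions.set(filePath, current + 1);
    return { ok: true, version: current + 1 };
}

// Devices A and B both last synced version 0, then edited offline.
const a = tryCommit('/docs/report.pdf', 0);   // A commits first
const b = tryCommit('/docs/report.pdf', 0);   // B's base version is now stale

console.log(a);  // { ok: true, version: 1 }
console.log(b);  // { ok: false, serverCurrentVersion: 1 } → conflict copy
```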

Resolution Strategies

// Strategy 1: Last-Write-Wins (LWW)
// Simple but can lose data. Used for non-critical files.
// The last device to commit "wins" and overwrites.
// Pros: Simple, no user intervention
// Cons: Silent data loss

// Strategy 2: Conflict Copies (Dropbox approach) ✓
// Create a "conflicted copy" with the losing version
async function handleConflict(filePath, localChunks, serverVersion) {
    // 1. Download the server's version (winner)
    await syncRemoteChange({ filePath, newVersion: serverVersion });

    // 2. Save our version as a conflict copy
    const timestamp = formatDate(Date.now());
    const deviceName = getDeviceName();
    const ext = path.extname(filePath);
    const base = path.basename(filePath, ext);

    // "report.pdf" → "report (Neel's Laptop's conflicted copy 2026-04-15).pdf"
    const conflictPath = `${path.dirname(filePath)}/${base} `
        + `(${deviceName}'s conflicted copy ${timestamp})${ext}`;

    await writeFile(conflictPath, assembleFromChunks(localChunks));

    // 3. Notify the user
    notifyUser({
        type: 'CONFLICT',
        message: `Conflicting changes detected for ${filePath}. `
               + `Your changes saved as: ${conflictPath}`,
        originalFile: filePath,
        conflictCopy: conflictPath
    });
}

// Strategy 3: Operational Transform / CRDT
// Used by Google Docs for real-time collaborative editing
// Too complex for file-level sync; only for document editors

// Strategy 4: Three-Way Merge (Git-style)
// Possible for text files: find common ancestor, merge changes
// Complex, error-prone for binary files
// Used by some advanced sync engines for code files
Dropbox's real-world approach: Dropbox uses conflict copies as the primary strategy. The server always has the authoritative version. If a client tries to commit based on a stale base version, its changes are saved as a conflict copy, and the user is notified. This ensures zero data loss at the cost of occasional manual resolution. In practice, conflicts are rare (~0.01% of syncs) because most files are single-user.

Block Storage — Content-Addressable Store

Chunks are stored in object storage (S3 or GCS) using their SHA-256 hash as the key. This is called content-addressable storage (CAS) — the address is the content.

// Block Store Schema (S3/GCS)
// Key:   chunks/{sha256-hash}
// Value: raw chunk bytes (up to 4 MB)
//
// Examples:
// s3://drive-blocks/chunks/e7b3a1f2c4d6e8...
// s3://drive-blocks/chunks/c2d4f91ab2e3f5...

// Properties of content-addressable storage:
// 1. Immutable: a chunk's content never changes (same hash = same data)
// 2. Idempotent: uploading the same chunk twice is safe (PUT is idempotent)
// 3. Naturally deduped: identical content = identical key
// 4. Easy to verify: download and re-hash to confirm integrity

// S3 Configuration (illustrative; uses S3's actual storage class names)
{
    "bucket": "drive-blocks-us-east-1",
    "storageClass": "STANDARD",            // hot data (recent files)
    "replication": "CROSS_REGION",         // us-east-1 → eu-west-1
    "encryption": "SSE-S3 (AES-256)",      // server-side encryption
    "lifecycle": {
        // Move to cheaper storage after 90 days of no access
        "transition": [
            { "days": 90,  "storageClass": "STANDARD_IA" },
            { "days": 365, "storageClass": "GLACIER" }
        ]
    }
}

// Reference counting for garbage collection:
// Each chunk has a reference count in the metadata DB.
// When a file version is deleted, decrement chunk ref counts.
// When ref_count reaches 0, the chunk can be garbage collected.
// Use a background job with delay (to handle race conditions):

CREATE TABLE chunk_refs (
    chunk_hash   CHAR(64) PRIMARY KEY,
    ref_count    INT NOT NULL DEFAULT 1,
    size_bytes   BIGINT NOT NULL,
    created_at   TIMESTAMP DEFAULT NOW(),
    last_ref_at  TIMESTAMP DEFAULT NOW()
);

-- When a new file references a chunk:
INSERT INTO chunk_refs (chunk_hash, ref_count, size_bytes)
VALUES ('e7b3a1...', 1, 4194304)
ON CONFLICT (chunk_hash) DO UPDATE SET
    ref_count = chunk_refs.ref_count + 1,
    last_ref_at = NOW();

-- When a file version is deleted:
UPDATE chunk_refs SET ref_count = ref_count - 1
WHERE chunk_hash = 'e7b3a1...';

-- GC job (runs hourly):
DELETE FROM chunk_refs WHERE ref_count <= 0 AND last_ref_at < NOW() - INTERVAL '7 days';
-- Then delete from S3: aws s3 rm s3://drive-blocks/chunks/{hash}
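
The reference-counting lifecycle in the SQL above can be traced in miniature. This sketch applies the same rule: a chunk becomes collectable only when no file version references it:

```javascript
// chunk hash → reference count (stand-in for the chunk_refs table).
const refs = new Map();

const addRef  = (h) => refs.set(h, (refs.get(h) || 0) + 1);
const dropRef = (h) => refs.set(h, refs.get(h) - 1);
// GC candidates: chunks whose ref_count has fallen to zero.
const collectable = () => [...refs].filter(([, n]) => n <= 0).map(([h]) => h);

// Two file versions share chunk 'aaa'; only version 2 also uses 'bbb'.
['aaa'].forEach(addRef);          // version 1 committed
['aaa', 'bbb'].forEach(addRef);   // version 2 committed

['aaa'].forEach(dropRef);         // version 1 deleted
console.log(collectable());       // [] — 'aaa' still referenced by version 2

['aaa', 'bbb'].forEach(dropRef);  // version 2 deleted
console.log(collectable());       // [ 'aaa', 'bbb' ] — safe to GC from S3
```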

Metadata Service — The Brain

The metadata service is the source of truth for everything except raw file bytes. It manages file trees, sharing permissions, version history, and device sync state.

Database Schema

-- Core tables (PostgreSQL)

-- Users
CREATE TABLE users (
    user_id       BIGINT PRIMARY KEY,
    email         VARCHAR(255) UNIQUE NOT NULL,
    display_name  VARCHAR(255),
    quota_bytes   BIGINT DEFAULT 16106127360,  -- 15 GB free (15 × 1024³)
    used_bytes    BIGINT DEFAULT 0,
    created_at    TIMESTAMP DEFAULT NOW()
);

-- Files and Folders (tree structure using materialized path)
CREATE TABLE file_entries (
    entry_id      BIGINT PRIMARY KEY,
    owner_id      BIGINT REFERENCES users(user_id),
    parent_id     BIGINT REFERENCES file_entries(entry_id),
    name          VARCHAR(255) NOT NULL,
    is_folder     BOOLEAN DEFAULT FALSE,
    current_ver   INT DEFAULT 1,
    size_bytes    BIGINT DEFAULT 0,
    mime_type     VARCHAR(127),
    created_at    TIMESTAMP DEFAULT NOW(),
    modified_at   TIMESTAMP DEFAULT NOW(),
    is_deleted    BOOLEAN DEFAULT FALSE,   -- soft delete (trash)
    deleted_at    TIMESTAMP,

    -- Materialized path for efficient subtree queries
    -- e.g., "/user_123/Documents/Work/report.pdf"
    path          TEXT NOT NULL,

    UNIQUE(parent_id, name)   -- no duplicate names in same folder
);

CREATE INDEX idx_file_entries_owner ON file_entries(owner_id);
CREATE INDEX idx_file_entries_parent ON file_entries(parent_id);
CREATE INDEX idx_file_entries_path ON file_entries(path);

-- File Versions (one row per version)
CREATE TABLE file_versions (
    entry_id      BIGINT REFERENCES file_entries(entry_id),
    version       INT NOT NULL,
    chunk_hashes  TEXT[] NOT NULL,        -- ordered array of SHA-256 hashes
    size_bytes    BIGINT NOT NULL,
    file_hash     CHAR(64) NOT NULL,      -- hash of the complete file
    modified_by   BIGINT REFERENCES users(user_id),
    device_id     VARCHAR(64),
    created_at    TIMESTAMP DEFAULT NOW(),

    PRIMARY KEY (entry_id, version)
);

-- Sharing / ACL
CREATE TABLE sharing_permissions (
    entry_id      BIGINT REFERENCES file_entries(entry_id),
    user_id       BIGINT REFERENCES users(user_id),
    permission    VARCHAR(20) NOT NULL,   -- 'owner', 'editor', 'viewer'
    granted_by    BIGINT REFERENCES users(user_id),
    granted_at    TIMESTAMP DEFAULT NOW(),

    PRIMARY KEY (entry_id, user_id)
);

-- Shared Links
CREATE TABLE shared_links (
    link_id       VARCHAR(32) PRIMARY KEY,  -- random token
    entry_id      BIGINT REFERENCES file_entries(entry_id),
    permission    VARCHAR(20) DEFAULT 'viewer',
    password_hash VARCHAR(255),    -- optional password protection
    expires_at    TIMESTAMP,       -- optional expiry
    created_by    BIGINT REFERENCES users(user_id),
    created_at    TIMESTAMP DEFAULT NOW()
);

-- Device Sync State
CREATE TABLE device_sync_state (
    user_id       BIGINT REFERENCES users(user_id),
    device_id     VARCHAR(64) NOT NULL,
    device_name   VARCHAR(255),
    last_cursor   VARCHAR(255),    -- opaque cursor for incremental sync
    last_sync_at  TIMESTAMP,

    PRIMARY KEY (user_id, device_id)
);

Sharding Strategy

With 500M users and 100B files, a single database won't cut it. We shard by owner_id:

// Sharding by owner_id (consistent hashing)
// - All of a user's files on the same shard
// - Shared files: metadata on owner's shard, ACL lookup cross-shard
// - 256 logical shards, each handling ~2M users

function getShard(ownerId) {
    return consistentHash(ownerId, NUM_SHARDS);
}

// Query routing:
// "List my files"        → route to user's shard (single shard)
// "Files shared with me" → scatter-gather across shards (expensive)
//                          OR use a secondary index table (denormalized)

// Denormalized "shared with me" index:
CREATE TABLE shared_with_me (
    user_id       BIGINT,
    entry_id      BIGINT,
    owner_id      BIGINT,
    entry_name    VARCHAR(255),
    permission    VARCHAR(20),
    shared_at     TIMESTAMP,
    PRIMARY KEY (user_id, entry_id)
);
-- Sharded by user_id → "shared with me" queries are single-shard
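The `consistentHash` helper in `getShard` is left abstract above. One concrete choice — our assumption, not specified in the post — is Lamping & Veach's jump consistent hash, which maps a 64-bit key to one of N buckets and, when N grows, only moves keys into the new bucket:

```javascript
// Jump consistent hash (Lamping & Veach). BigInt keeps the 64-bit
// multiply-with-carry state exact; the quotient fits in a double.
function jumpConsistentHash(key, numBuckets) {
    let k = BigInt(key) & 0xFFFFFFFFFFFFFFFFn;
    let b = -1, j = 0;
    while (j < numBuckets) {
        b = j;
        k = (k * 2862933555777941757n + 1n) & 0xFFFFFFFFFFFFFFFFn;
        // (k >> 33) is at most 31 bits, so Number() conversion is exact
        j = Math.floor((b + 1) * (2 ** 31) / (Number(k >> 33n) + 1));
    }
    return b;  // bucket in [0, numBuckets)
}

const NUM_SHARDS = 256;
function getShard(ownerId) {
    return jumpConsistentHash(ownerId, NUM_SHARDS);
}
```

The useful property for resharding: going from 256 to 257 shards only relocates the keys that land on the new shard, instead of reshuffling ~everything the way `hash % N` would.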

Offline Support

Cloud storage must work seamlessly when the user is disconnected — on a plane, in a subway, or in a building with no signal. The client maintains a local operation log that queues all changes.

// Local SQLite database on the client device
CREATE TABLE pending_ops (
    op_id        INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path    TEXT NOT NULL,
    op_type      TEXT NOT NULL,   -- 'CREATE', 'MODIFY', 'DELETE', 'RENAME', 'MOVE'
    chunk_hashes TEXT,            -- JSON array of chunk hashes (for CREATE/MODIFY)
    old_path     TEXT,            -- for RENAME/MOVE
    created_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status       TEXT DEFAULT 'pending',   -- 'pending', 'uploading', 'committed', 'failed'
    retry_count  INTEGER DEFAULT 0
);

// Offline workflow:
// 1. User edits file while disconnected
// 2. Client detects change, chunks the file, stores chunks in local cache
// 3. Inserts operation into pending_ops
// 4. Displays "Waiting to sync" icon on the file

// When connectivity resumes:
class OfflineSyncManager {
    async onConnectivityRestored() {
        // Process pending ops in order (FIFO)
        const ops = await localDB.query(
            'SELECT * FROM pending_ops WHERE status = ? ORDER BY created_at',
            ['pending']
        );

        for (const op of ops) {
            try {
                await localDB.update(op.op_id, { status: 'uploading' });

                switch (op.op_type) {
                    case 'CREATE':
                    case 'MODIFY':
                        await syncLocalChange(op.file_path);
                        break;
                    case 'DELETE':
                        await api.deleteFile(op.file_path);
                        break;
                    case 'RENAME':
                    case 'MOVE':
                        // both are a path change: old_path → file_path
                        await api.renameFile(op.old_path, op.file_path);
                        break;
                }

                await localDB.update(op.op_id, { status: 'committed' });
            } catch (err) {
                if (err instanceof ConflictError) {
                    await handleConflict(op.file_path);
                    await localDB.update(op.op_id, { status: 'committed' });
                } else {
                    await localDB.update(op.op_id, {
                        status: 'failed',
                        retry_count: op.retry_count + 1
                    });
                }
            }
        }
    }
}
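One refinement the replay loop above omits (our addition, not from the post): coalescing redundant pending ops per path before upload, so a file saved ten times offline is uploaded once. This sketch assumes CREATE/MODIFY uploads are upserts on the server:

```javascript
// Collapse the pending-op queue so each path replays at most one op.
// Assumes ops arrive in created_at order (as the FIFO query returns them).
function coalescePendingOps(ops) {
    const latest = new Map();  // file_path → most recent surviving op
    for (const op of ops) {
        const prev = latest.get(op.file_path);
        if (prev && prev.op_type === 'CREATE' && op.op_type === 'DELETE') {
            // Created and deleted while offline — the server never needs to know
            latest.delete(op.file_path);
        } else {
            latest.set(op.file_path, op);  // last write wins
        }
    }
    return [...latest.values()];
}
```

Renames complicate this (a MODIFY queued under the old path must be re-keyed), which is one reason real clients keep the coalescing logic conservative.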

Sharing & Permissions (ACL)

Sharing is based on Access Control Lists (ACLs): every file/folder carries a list of principals (users, groups), each with a permission level.

// Permission hierarchy:
// Owner   → full control (delete, share, edit, view)
// Editor  → edit and view (cannot delete or change sharing)
// Viewer  → read-only (can download but not modify)

// Permission inheritance:
// Sharing a folder grants the same permission to ALL files/folders inside it
// A file's effective permission = MAX(own permission, inherited from parent folders)

async function getEffectivePermission(userId, entryId) {
    let maxPermission = null;

    // Walk up the folder tree
    let current = entryId;
    while (current !== null) {
        const row = await db.query(
            'SELECT permission FROM sharing_permissions WHERE entry_id = ? AND user_id = ?',
            [current, userId]
        );

        if (row) {
            maxPermission = maxOf(maxPermission, row.permission);
        }

        // Move to parent (null once we reach the root)
        const parent = await db.query('SELECT parent_id FROM file_entries WHERE entry_id = ?', [current]);
        current = parent ? parent.parent_id : null;
    }

    return maxPermission;  // null = no access
}

// Shared link access:
// GET /shared/{link_id}
// Server validates: link exists, not expired, password matches (if set)
// Returns file metadata + download URL (or folder listing)

// Permission check middleware:
async function checkPermission(req, res, next) {
    const { userId } = req.auth;
    const { entryId } = req.params;
    const requiredLevel = req.method === 'GET' ? 'viewer' : 'editor';

    const permission = await getEffectivePermission(userId, entryId);

    if (!permission || permissionLevel(permission) < permissionLevel(requiredLevel)) {
        return res.status(403).json({ error: 'Insufficient permissions' });
    }

    next();
}
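The `permissionLevel()` comparison used by `checkPermission` is never defined above; a minimal version consistent with the owner > editor > viewer hierarchy, including the `maxOf` helper that `getEffectivePermission` relies on:

```javascript
// Numeric ranks for the three roles; higher rank = more access.
const LEVELS = { viewer: 1, editor: 2, owner: 3 };

function permissionLevel(permission) {
    return LEVELS[permission] ?? 0;   // unknown or null → no access
}

// Keep the stronger of two grants (e.g. 'editor' on a folder beats
// 'viewer' on a file inside it).
function maxOf(a, b) {
    if (a === null || a === undefined) return b;
    if (b === null || b === undefined) return a;
    return permissionLevel(a) >= permissionLevel(b) ? a : b;
}
```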

Version History — Space-Efficient

Every file change creates a new version. But because we store chunk references (not copies), version history is extremely space-efficient. Unchanged chunks are shared across versions.

// Version storage: list of chunk hashes per version
//
// Version 1: [A, B, C, D, E]      (5 chunks, 20 MB)
// Version 2: [A, B, C', D, E]     (1 chunk changed: C→C')
// Version 3: [A, B, C', D, E, F]  (1 chunk added: F)
//
// Storage cost:
//   Unique chunks: A, B, C, C', D, E, F = 7 chunks = 28 MB
//   Without versioning: 20 + 20 + 24 = 64 MB
//   With chunk-sharing: 28 MB (56% savings)
//
// For a file with 100 versions where each version changes 1 of 50 chunks:
//   Naive: 100 × 200 MB = 20 GB
//   Chunk-shared: 50 + 100 unique = ~600 MB (97% savings!)

// Restoring a previous version:
async function restoreVersion(entryId, targetVersion) {
    // 1. Get the chunk list (and size/hash) for the target version
    const oldVersion = await db.query(
        'SELECT chunk_hashes, size_bytes, file_hash FROM file_versions WHERE entry_id = ? AND version = ?',
        [entryId, targetVersion]
    );

    // 2. Create a new version that points at the old chunk list
    const { maxVersion } = await db.query(
        'SELECT MAX(version) AS maxVersion FROM file_versions WHERE entry_id = ?',
        [entryId]
    );

    await db.insert('file_versions', {
        entry_id: entryId,
        version: maxVersion + 1,
        chunk_hashes: oldVersion.chunk_hashes,  // same chunks!
        size_bytes: oldVersion.size_bytes,
        file_hash: oldVersion.file_hash,
        modified_by: currentUserId
    });

    // No chunks need to be copied or moved!
    // The restored version just references the same chunks.
    // Cost: one metadata row (~200 bytes) regardless of file size.

    // 3. Notify all devices of the new version
    await notificationService.notifyFileChange(entryId);
}
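The savings arithmetic in the version-storage example boils down to counting unique chunk hashes across all versions. A quick checker reproducing the 64 MB naive vs 28 MB chunk-shared figures:

```javascript
// Compare naive per-version copies against chunk-shared storage.
// Each version is an ordered list of chunk hashes; shared cost is just
// the number of distinct hashes times the chunk size.
function storageCost(versions, chunkSizeMB) {
    const unique = new Set();
    let naive = 0;
    for (const chunks of versions) {
        naive += chunks.length * chunkSizeMB;   // full copy per version
        for (const c of chunks) unique.add(c);  // dedup by content hash
    }
    return { naive, shared: unique.size * chunkSizeMB };
}

const versions = [
    ['A', 'B', 'C',  'D', 'E'],       // v1: 5 chunks
    ['A', 'B', "C'", 'D', 'E'],       // v2: one chunk rewritten
    ['A', 'B', "C'", 'D', 'E', 'F'],  // v3: one chunk appended
];
const cost = storageCost(versions, 4);  // 4 MB chunks
// cost.naive = 64, cost.shared = 28 → the 56% savings quoted above
```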

Cold Storage for Old Versions

// Version lifecycle:
// Days 0-30:   All versions kept in S3 Standard (instant access)
// Days 30-90:  Versions beyond the latest 10 moved to S3 IA (cheaper, 128KB minimum)
// Days 90-365: Non-latest versions moved to S3 Glacier (cents/GB, minutes to retrieve)
// Day 365+:    Versions beyond latest 5 permanently deleted (configurable per plan)

// Cold storage migration job (runs nightly):
async function migrateOldVersions() {
    const files = await db.query(`
        SELECT fv.entry_id, fv.version, fv.chunk_hashes, fv.created_at
        FROM file_versions fv
        JOIN file_entries fe ON fe.entry_id = fv.entry_id
        WHERE fv.created_at < NOW() - INTERVAL '90 days'
          AND fv.version < fe.current_ver - 5
          AND fv.storage_class = 'STANDARD'
    `);

    for (const version of files) {
        // Move each unique chunk to Glacier
        for (const hash of version.chunk_hashes) {
            const refCount = await getActiveRefCount(hash);
            if (refCount.hotRefs === 0) {
                // No hot version references this chunk — safe to move
                await s3.copyObject({
                    Bucket: 'drive-blocks',
                    CopySource: `drive-blocks/chunks/${hash}`,
                    Key: `chunks/${hash}`,
                    StorageClass: 'GLACIER'
                });
            }
        }

        await db.update('file_versions', version, { storage_class: 'GLACIER' });
    }
}

// Restoring a Glacier version:
// 1. Initiate Glacier restore (takes 1-5 minutes for Expedited, 3-5 hours for Standard)
// 2. Once restored, chunks are temporarily available in S3 for 24 hours
// 3. Download and reassemble the file
// 4. Cost: $0.03 per GB + $10 per 1000 requests (Expedited)

Chunking & Deduplication — Animated

This animation demonstrates how file chunking and deduplication work together. A large file is split into chunks, the user modifies part of the file, and only the changed chunk is uploaded.


Watch a file get split into chunks, then see how only the modified chunk is re-uploaded after an edit — hash comparison saves bandwidth.

Security & Encryption

// Encryption layers:
//
// 1. In-Transit: TLS 1.3 for all API calls and chunk transfers
//    - Certificate pinning on mobile clients
//    - HSTS headers on web
//
// 2. At-Rest: AES-256 server-side encryption (SSE-S3 or SSE-KMS)
//    - Each chunk encrypted before writing to S3
//    - Encryption keys managed by AWS KMS or Google Cloud KMS
//    - Key rotation every 90 days
//
// 3. Client-Side Encryption (optional, enterprise feature):
//    - User holds the key; server stores only ciphertext
//    - Breaks server-side dedup (encrypted chunks of same content differ)
//    - Used for compliance: HIPAA, SOX, GDPR

// Chunk integrity verification:
async function downloadAndVerify(chunkHash) {
    const data = await s3.getObject(`chunks/${chunkHash}`);
    const computedHash = sha256(data);

    if (computedHash !== chunkHash) {
        // Data corruption detected!
        // Fetch from replica region
        const replicaData = await s3Replica.getObject(`chunks/${chunkHash}`);
        const replicaHash = sha256(replicaData);

        if (replicaHash === chunkHash) {
            // Repair primary from replica
            await s3.putObject(`chunks/${chunkHash}`, replicaData);
            return replicaData;
        }

        throw new DataCorruptionError(chunkHash);
    }

    return data;
}

Scalability Deep Dive

// Component-level scaling:
//
// Upload Service:      Stateless, horizontally scaled behind ALB
//                      Auto-scale: CPU > 60% → add instances
//                      Each instance handles ~5K concurrent uploads
//
// Metadata Service:    Sharded PostgreSQL (256 shards, by owner_id)
//                      Read replicas for heavy-read queries
//                      Redis cache: file metadata (TTL 5 min)
//                      Connection pooling: PgBouncer
//
// Notification Service: WebSocket servers behind sticky sessions (ALB)
//                       Each server holds ~500K connections
//                       200 servers → 100M concurrent connections
//                       Redis Pub/Sub for cross-server notification fan-out
//
// Block Store (S3):    Virtually unlimited (managed by AWS/GCS)
//                      Multi-region replication for durability
//                      S3 handles ~5,500 PUT/s per prefix (partition by hash prefix)
//
// CDN:                 CloudFront / Cloud CDN for download acceleration
//                      Popular shared files cached at edge locations
//                      Pre-signed URLs with 1-hour expiry

// Rate limiting:
// Per user:  100 API calls/min, 50 uploads/min, 10 GB/hour upload
// Per file:  10 versions/min (prevent rapid-fire saves from consuming resources)
// Global:    Circuit breaker on S3 failures (fail fast, retry after cooldown)
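Per-user limits like "100 API calls/min" are typically enforced with a token bucket. A minimal in-memory sketch (an assumed implementation — a real deployment would keep the bucket state in Redis so all API servers share it):

```javascript
// Token bucket: capacity bounds the burst, refill rate bounds the average.
class TokenBucket {
    constructor(capacity, refillPerSec) {
        this.capacity = capacity;
        this.tokens = capacity;        // start full
        this.refillPerSec = refillPerSec;
        this.lastRefill = Date.now();
    }

    tryConsume(n = 1) {
        // Lazily add tokens accrued since the last call, capped at capacity
        const now = Date.now();
        this.tokens = Math.min(
            this.capacity,
            this.tokens + ((now - this.lastRefill) / 1000) * this.refillPerSec
        );
        this.lastRefill = now;

        if (this.tokens < n) return false;  // caller should return HTTP 429
        this.tokens -= n;
        return true;
    }
}

// "100 API calls/min" → burst of 100, refilling at 100/60 tokens per second
const apiBucket = new TokenBucket(100, 100 / 60);
```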

Real-World Implementations

| Feature           | Google Drive                       | Dropbox               | OneDrive           |
|-------------------|------------------------------------|-----------------------|--------------------|
| Chunk size        | 4-8 MB (variable)                  | 4 MB (CDC)            | ~10 MB             |
| Block store       | Colossus (GFS2)                    | Magic Pocket (custom) | Azure Blob Storage |
| Metadata DB       | Spanner                            | MySQL (Edgestore)     | Cosmos DB          |
| Notification      | gRPC streaming                     | Long-polling          | WebSocket          |
| Sync protocol     | Delta sync (binary diff)           | Chunk-level delta     | Differential sync  |
| Conflict handling | Auto-merge (Docs), copies (Drive)  | Conflict copies       | Conflict copies    |
| Dedup             | Cross-user (global)                | Cross-user (global)   | Per-user only      |

Dropbox's "Magic Pocket": In 2016, Dropbox migrated from AWS S3 to their custom storage system called Magic Pocket. They manage exabytes of data across multiple data centers with custom hardware (low-cost, high-density storage nodes). This saved them an estimated $75M over two years vs S3 pricing. The system uses Reed-Solomon erasure coding (instead of full replication) to achieve 99.999999999% durability at ~1.5x storage overhead (vs 3x for triple replication).

Key Takeaways