Design: Google Drive / Dropbox
Cloud file storage is deceptively simple on the surface — "just upload files and sync them" — but behind that simplicity lies one of the most complex distributed systems problems in modern engineering. Google Drive alone serves over 1 billion users, storing trillions of files, syncing changes across billions of devices in near real-time. Dropbox processes over 1.2 billion files per day.
The core challenge: how do you let hundreds of millions of users seamlessly upload, download, share, and collaboratively edit files across every device they own, while keeping storage costs manageable, latency low, and conflicts resolvable? The answer involves file chunking, content-addressable storage, chunk-level deduplication, delta sync, and a carefully orchestrated notification pipeline.
Requirements & Scale
Functional Requirements
- Upload & download files — any file type, up to 15 GB per file
- Sync across devices — changes on one device appear on all others within seconds
- Share files & folders — with specific users or via shareable links, with granular permissions
- Version history — view and restore previous versions of any file
- Offline support — queue changes locally, sync automatically when connectivity resumes
- Conflict resolution — handle simultaneous edits from multiple devices gracefully
Non-Functional Requirements
- Scale: 500M registered users, 100M DAU
- Storage: Average 2 GB per user → 1 exabyte total
- Upload throughput: ~300M upload operations per day (≈1 billion total file operations per day including downloads and sync checks)
- Sync latency: Changes propagate to connected devices within 5 seconds
- Durability: 99.999999999% (11 nines) — zero data loss, ever
- Availability: 99.99% uptime
- Bandwidth efficiency: Minimize bytes transferred — don't re-upload unchanged data
Back-of-Envelope Estimates
Users: 500M registered, 100M DAU
Files per user: ~200 files average
Total files: 100 billion files
Average file size: ~10 MB (consistent with ~2 GB per user)
Total storage: 100B × 10 MB = 1 EB (raw), ~3 EB with 3× replication
Daily uploads: 100M users × 3 file changes/day = 300M operations/day
= ~3,500 ops/second average, ~10,000 ops/second peak
Daily bandwidth: 300M changes × ~500 KB uploaded per change (chunk-level deltas, not whole files) ≈ 150 TB/day upload
Sync notifications: ~1 billion notifications/day (multi-device fan-out)
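The arithmetic is easy to sanity-check; a few lines of JavaScript (the inputs are just the assumptions listed above, and the exact figures matter less than the orders of magnitude) reproduce the headline numbers:

// Back-of-envelope sanity check — all inputs are the assumptions above
const USERS = 500e6;              // registered users
const DAU = 100e6;                // daily active users
const FILES_PER_USER = 200;
const AVG_FILE_BYTES = 10e6;      // ~10 MB average file
const CHANGES_PER_DAU = 3;
const AVG_CHANGE_BYTES = 500e3;   // ~500 KB uploaded per change (delta)

const totalFiles = USERS * FILES_PER_USER;             // 1e11 files
const rawStorageBytes = totalFiles * AVG_FILE_BYTES;   // ~1e18 B ≈ 1 EB
const dailyOps = DAU * CHANGES_PER_DAU;                // 3e8 ops/day
const opsPerSecond = dailyOps / 86_400;                // ~3,500 ops/s average
const dailyUploadBytes = dailyOps * AVG_CHANGE_BYTES;  // ~1.5e14 B ≈ 150 TB/day

console.log({
  totalFiles: totalFiles.toExponential(1),
  rawStoragePB: Math.round(rawStorageBytes / 1e15),    // ~1000 PB = 1 EB
  opsPerSecond: Math.round(opsPerSecond),
  dailyUploadTB: Math.round(dailyUploadBytes / 1e12)
});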
High-Level Architecture
The system decomposes into six major components, each independently scalable:
┌──────────────────────────────────────────────────────────────────────┐
│ CLIENT DEVICE │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ File Watcher │ │ Chunker / │ │ Local DB │ │
│ │ (inotify / │ │ Hasher │ │ (SQLite) │ │
│ │ FSEvents / │ │ │ │ chunk_hashes │ │
│ │ ReadDir) │ │ 4MB chunks │ │ sync_state │ │
│ └──────┬───────┘ └──────┬───────┘ └───────┬───────┘ │
│ │ │ │ │
│ └─────────┬───────┘───────────────────┘ │
│ ▼ │
│ ┌───────────────┐ │
│ │ Sync Engine │ ← coordinates upload/download/conflict │
│ └───────┬───────┘ │
└───────────────────┼──────────────────────────────────────────────────┘
│ HTTPS / WebSocket
▼
┌──────────────────────────────────────────────────────────────────────┐
│ API GATEWAY / LB │
└──────────────────────────┬───────────────────────────────────────────┘
┌───────────────┼───────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
│ Upload │ │ Metadata │ │ Notification │
│ Service │ │ Service │ │ Service │
│ │ │ │ │ (WebSocket / │
│ chunk → │ │ files, │ │ long-polling) │
│ block store │ │ folders, │ │ │
│ │ │ versions, │ │ push changes │
│ │ │ ACLs │ │ to devices │
└──────┬───────┘ └──────┬───────┘ └──────────────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Block Store │ │ Metadata DB │
│ (S3 / GCS) │ │ (MySQL / │
│ │ │ PostgreSQL)│
│ content- │ │ │
│ addressable │ │ + Redis │
│ │ │ cache │
└──────────────┘ └──────────────┘
File Chunking — The Foundation
Every file storage system at scale uses chunking — splitting files into blocks of a few megabytes. This single decision enables deduplication, resumable uploads, parallel transfers, and efficient delta sync. Without chunking, changing one byte of a 5 GB video would require re-uploading the entire file.
Choosing the Chunk Size
The chunk size is a critical engineering trade-off:
| Chunk Size | Pros | Cons |
|---|---|---|
| 1 MB | Fine-grained dedup, small re-uploads | Too many chunks per file → metadata overhead, S3 PUT cost |
| 4 MB ✓ | Good balance: reasonable dedup, manageable chunk count | Small edits still re-upload a full 4 MB chunk |
| 8 MB | Fewer chunks, less metadata | Coarser dedup; small edits waste more bandwidth |
| 64 MB | Minimal metadata (HDFS default) | Terrible for sync — one byte change re-uploads 64 MB |
We choose 4 MB chunks. A 1 GB file becomes ~256 chunks. A 100 KB document is a single chunk. Dropbox uses 4 MB; Google Drive uses variable chunk sizes based on file type (4–8 MB for most).
The Chunking Algorithm
// Fixed-size chunking with SHA-256 hashing
function chunkFile(filePath) {
const CHUNK_SIZE = 4 * 1024 * 1024; // 4 MB
const file = openFile(filePath);
const chunks = [];
let offset = 0;
let index = 0;
while (offset < file.size) {
const end = Math.min(offset + CHUNK_SIZE, file.size);
const data = file.read(offset, end);
// SHA-256 hash of the raw chunk bytes
const hash = sha256(data);
chunks.push({
index: index,
offset: offset,
size: end - offset,
hash: hash, // content-addressable key
data: data // raw bytes (held in memory briefly)
});
offset = end;
index++;
}
return {
fileName: filePath,
fileSize: file.size,
totalChunks: chunks.length,
fileHash: sha256(chunks.map(c => c.hash).join('')),
chunks: chunks
};
}
// Example output for a 14 MB file:
// {
// fileName: "presentation.pptx",
// fileSize: 14680064,
// totalChunks: 4, // ceil(14MB / 4MB) = 4
// fileHash: "a3f2c1...",
// chunks: [
// { index: 0, offset: 0, size: 4194304, hash: "e7b3a1..." },
// { index: 1, offset: 4194304, size: 4194304, hash: "c2d4f9..." },
// { index: 2, offset: 8388608, size: 4194304, hash: "91ab2e..." },
// { index: 3, offset: 12582912, size: 2097152, hash: "f8c7d3..." }
// ]
// }
Content-Defined Chunking (CDC) — Advanced
Fixed-size chunking has a weakness: if you insert data at the beginning of a file, every chunk boundary shifts and all chunk hashes change. Content-Defined Chunking (Rabin fingerprinting) solves this by choosing chunk boundaries based on the content:
// Content-Defined Chunking using Rabin fingerprinting
function contentDefinedChunk(filePath) {
const MIN_CHUNK = 2 * 1024 * 1024; // 2 MB minimum
const MAX_CHUNK = 8 * 1024 * 1024; // 8 MB maximum
  const TARGET = 4 * 1024 * 1024;    // 4 MB target (implied by MASK below)
  const MASK = 0x3FFFFF;             // 22-bit mask → a boundary every ~4 MB on average
const file = openFile(filePath);
const chunks = [];
let offset = 0;
let chunkStart = 0;
while (offset < file.size) {
    // Rabin fingerprint of a 48-byte window at `offset`
    // (a real implementation rolls the hash forward one byte at a time
    // rather than rehashing the whole window at every offset)
    const windowSize = 48;
    const fingerprint = rabinFingerprint(
      file.read(offset, offset + windowSize)
    );
const chunkLen = offset - chunkStart;
// Cut when fingerprint matches mask AND chunk >= minimum
// OR when chunk hits maximum size
if ((chunkLen >= MIN_CHUNK && (fingerprint & MASK) === 0)
|| chunkLen >= MAX_CHUNK) {
const data = file.read(chunkStart, offset);
chunks.push({
offset: chunkStart,
size: chunkLen,
hash: sha256(data)
});
chunkStart = offset;
}
offset++;
}
// Flush remaining bytes as last chunk
if (chunkStart < file.size) {
const data = file.read(chunkStart, file.size);
chunks.push({
offset: chunkStart,
size: file.size - chunkStart,
hash: sha256(data)
});
}
return chunks;
}
// Advantage: Inserting 100 bytes at offset 0 only changes the FIRST chunk.
// All subsequent chunk boundaries stay the same because they're
// content-defined, not position-defined. Massive bandwidth savings!
Systems that adopt content-defined chunking report sync-bandwidth savings on the order of 30–40% for workloads with insertions and deletions near the beginning of files; backup tools such as restic and Borg use CDC for exactly this reason.
Deduplication — Store Once, Reference Many
With content-addressable storage, each chunk is stored using its SHA-256 hash as the key. If two chunks have the same hash, they are (with overwhelming probability) identical. We store the data only once and add a reference.
Deduplication at Multiple Levels
// Level 1: Intra-file dedup
// A 400 MB VM image with large zero-filled regions
// Chunk "0000...0000" (4 MB of zeros) appears 50 times
// → store 1 chunk, reference it 50 times
// Savings: 196 MB
// Level 2: Cross-file dedup (same user)
// User copies presentation.pptx → presentation_v2.pptx
// Makes a small edit → only 1 chunk changes
// → 255 of 256 chunks already exist, upload only 1
// Savings: ~1 GB for a 1 GB file copy
// Level 3: Cross-user dedup
// 10,000 users upload the same company_logo.png (800 KB)
// → store 1 chunk, 10,000 references
// Savings: ~8 GB
// Upload flow with dedup:
async function uploadWithDedup(fileChunks) {
// Step 1: Send all chunk hashes to server
const hashList = fileChunks.map(c => c.hash);
const serverResponse = await api.checkChunks(hashList);
// Response: { needed: ["e7b3a1...", "f8c7d3..."], existing: ["c2d4f9...", "91ab2e..."] }
// Step 2: Upload ONLY chunks the server doesn't have
const newChunks = fileChunks.filter(c => serverResponse.needed.includes(c.hash));
for (const chunk of newChunks) {
await api.uploadChunk(chunk.hash, chunk.data);
// Server stores: S3 key = "chunks/{hash}" → chunk data
}
// Step 3: Register file metadata (references existing + new chunks)
await api.registerFile({
path: "/documents/report.pdf",
chunks: hashList, // ordered list of chunk hashes
fileHash: sha256(hashList.join(''))
});
// Result: only 2 of 4 chunks uploaded (50% bandwidth saved)
}
Deduplication in Practice
| Scenario | Without Dedup | With Dedup | Savings |
|---|---|---|---|
| Edit 1 page of a 100-page PDF | Re-upload 50 MB | Upload 4 MB (1 chunk) | 92% |
| Copy a 2 GB folder within Drive | Upload 2 GB | Upload 0 bytes (metadata only) | 100% |
| 10K users upload the same 500 MB installer | 500 MB × 10K = 5 TB | 500 MB (once) | 99.99% |
| Overall platform (Dropbox 2019) | ~500 PB | ~120 PB (reported) | ~76% |
Sync Protocol — The State Machine
The sync engine on each device is a state machine that watches the local file system, detects changes, computes diffs at the chunk level, uploads only what's new, and downloads changes made by other devices. It is the most complex component in the entire system.
Sync Engine States
// Sync Engine State Machine
//
// ┌────────────┐ file change ┌────────────┐ chunks computed ┌────────────┐
// │ IDLE │ ─────────────→ │ INDEXING │ ─────────────────→ │ DIFFING │
// │ │ │ │ │ │
// │ watching │ │ reading │ │ comparing │
// │ for │ │ file, │ │ local vs │
// │ changes │ │ chunking, │ │ server │
// │ │ │ hashing │ │ chunks │
// └────────────┘ └────────────┘ └──────┬─────┘
// ▲ │
// │ has diff?
// │ ┌───────┴───────┐
// │ no │ │ yes
// │ ▼ ▼
// │ ┌──────────┐ ┌──────────────┐
// │ │ UP TO │ │ UPLOADING │
// │◄───────────────────────────────────────────────────│ DATE │ │ │
// │ └──────────┘ │ sending new │
// │ │ chunks to │
// │ │ block store │
// │ └──────┬───────┘
// │ │
// │ all uploaded
// │ ▼
// │ ┌──────────────┐
// │ │ COMMITTING │
// │◄──────────────────────────────────────────────────────────────────│ │
// │ done │ update │
// │ metadata DB │
// │ + notify │
// └──────────────┘
enum SyncState {
IDLE, // Watching file system for changes
INDEXING, // Reading changed file, computing chunks + hashes
DIFFING, // Comparing local chunk list vs server chunk list
UPLOADING, // Uploading new/changed chunks to block store
COMMITTING, // Updating metadata DB with new file version
DOWNLOADING, // Fetching chunks changed by another device
CONFLICTED, // Conflicting changes detected
ERROR // Retryable error state (exponential backoff)
}
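Making the transitions explicit keeps the engine honest: every move between states goes through one function that rejects impossible transitions. A minimal sketch — the state names match the enum above, while the event names are illustrative, not from the original design:

// Allowed transitions, keyed by current state (event names are illustrative)
const TRANSITIONS = {
  IDLE:        { FILE_CHANGED: 'INDEXING', REMOTE_CHANGE: 'DOWNLOADING' },
  INDEXING:    { CHUNKS_READY: 'DIFFING', IO_ERROR: 'ERROR' },
  DIFFING:     { NO_DIFF: 'IDLE', DIFF_FOUND: 'UPLOADING', CONFLICT: 'CONFLICTED' },
  UPLOADING:   { ALL_UPLOADED: 'COMMITTING', NETWORK_ERROR: 'ERROR' },
  COMMITTING:  { COMMITTED: 'IDLE', VERSION_MISMATCH: 'CONFLICTED' },
  DOWNLOADING: { APPLIED: 'IDLE', NETWORK_ERROR: 'ERROR' },
  CONFLICTED:  { RESOLVED: 'IDLE' },
  ERROR:       { RETRY: 'IDLE' }   // re-entered after exponential backoff
};

function nextState(current, event) {
  const next = TRANSITIONS[current]?.[event];
  if (!next) throw new Error(`Illegal transition: ${current} + ${event}`);
  return next;
}

// e.g. nextState('IDLE', 'FILE_CHANGED') === 'INDEXING'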
Change Detection
The client watches the local file system using OS-level APIs:
// Platform-specific file system watchers
// Linux: inotify (kernel-level, event-driven, ~8K watches default)
// macOS: FSEvents (coalesced events, efficient for large trees)
// Windows: ReadDirectoryChangesW (per-directory, recursive option)
class FileWatcher {
constructor(syncRoot) {
this.syncRoot = syncRoot;
this.debounceMs = 500; // Coalesce rapid changes
this.pendingChanges = new Map();
}
onFileChange(event) {
// event: { type: CREATE|MODIFY|DELETE|RENAME, path: string }
// Debounce: wait 500ms after last change before processing
// (editors do save → delete → rename atomically)
clearTimeout(this.pendingChanges.get(event.path));
this.pendingChanges.set(event.path, setTimeout(() => {
this.pendingChanges.delete(event.path);
this.syncEngine.enqueue(event);
}, this.debounceMs));
}
}
// The debounce is critical: text editors often perform multi-step saves:
// 1. Write to temp file (report.pdf.tmp)
// 2. Delete original (report.pdf)
// 3. Rename temp → original (report.pdf.tmp → report.pdf)
// Without debouncing, we'd process each step as a separate change.
Upload Protocol (Client → Server)
// Complete upload flow
async function syncLocalChange(filePath) {
// 1. INDEXING: Chunk the file and compute hashes
  const { chunks } = chunkFile(filePath);   // chunkFile returns { chunks, fileHash, ... }
const localHashes = chunks.map(c => c.hash);
// 2. DIFFING: Ask server which chunks it already has
const { needed, serverVersion } = await api.diffChunks({
filePath: filePath,
chunkHashes: localHashes
});
// Server compares against the latest version's chunk list
// Returns: needed = hashes not in block store
// 3. Check for conflicts (server version newer than our base version)
const localVersion = localDB.getVersion(filePath);
if (serverVersion > localVersion) {
    return handleConflict(filePath, chunks, serverVersion);
}
// 4. UPLOADING: Upload only the chunks the server needs
const uploadPromises = needed.map(hash => {
const chunk = chunks.find(c => c.hash === hash);
return uploadChunkWithRetry(chunk, {
maxRetries: 3,
backoff: 'exponential'
});
});
await Promise.all(uploadPromises); // Parallel upload!
// 5. COMMITTING: Register new version with metadata service
const newVersion = await api.commitFileVersion({
filePath: filePath,
chunkHashes: localHashes, // ordered chunk list
fileSize: chunks.reduce((s, c) => s + c.size, 0),
checksum: sha256(localHashes.join('')),
baseVersion: localVersion // optimistic concurrency
});
// 6. Update local DB
localDB.setVersion(filePath, newVersion.version);
localDB.setChunkHashes(filePath, localHashes);
}
// Chunk upload with resumability
async function uploadChunkWithRetry(chunk, opts) {
for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
try {
// Use pre-signed S3 URL for direct upload (bypass our servers)
const { uploadUrl } = await api.getUploadUrl(chunk.hash);
await httpPut(uploadUrl, chunk.data, {
headers: {
'Content-Type': 'application/octet-stream',
'Content-Length': chunk.size,
'x-amz-content-sha256': chunk.hash
}
});
return;
} catch (err) {
if (attempt === opts.maxRetries) throw err;
const delay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s
await sleep(delay);
}
}
}
Download Protocol (Server → Client)
// When the notification service tells us another device changed a file:
async function syncRemoteChange(notification) {
  // notification: { filePath, newVersion, changedChunks, deletedChunks }
  const filePath = notification.filePath;
  // 1. Get the new chunk list from metadata service
  const newMeta = await api.getFileMetadata(filePath, notification.newVersion);
// Returns: { chunkHashes: [...], fileSize: ..., modifiedAt: ... }
// 2. Compare with our local chunk list
const localHashes = localDB.getChunkHashes(notification.filePath);
const toDownload = newMeta.chunkHashes.filter(h => !localHashes.includes(h));
const toRemove = localHashes.filter(h => !newMeta.chunkHashes.includes(h));
// 3. Download new chunks (parallel, from S3 via CDN)
const downloadedChunks = await Promise.all(
toDownload.map(hash => downloadChunk(hash))
);
// 4. Reassemble the file from chunks
// Local cache: keep recently used chunks in ~/.drive/cache/
const allChunks = newMeta.chunkHashes.map(hash => {
const downloaded = downloadedChunks.find(c => c.hash === hash);
if (downloaded) return downloaded.data;
return localChunkCache.get(hash); // unchanged chunk from cache
});
// 5. Write to temp file, then atomic rename
const tempPath = filePath + '.drive-tmp';
writeFile(tempPath, concatenate(allChunks));
atomicRename(tempPath, filePath);
// 6. Update local state
localDB.setVersion(filePath, notification.newVersion);
localDB.setChunkHashes(filePath, newMeta.chunkHashes);
}
File Sync Flow
Putting the pieces together, the complete flow when a user edits a file on Device A and the change propagates to Device B: edit on Device A → chunk → diff → upload changed chunks → server metadata update → notify Device B → Device B downloads → synced.
Notification Service
After a file version is committed, all other devices with access to that file must be notified. The notification service maintains persistent connections to all online clients:
Approaches Compared
| Method | Latency | Connections | Battery | Use Case |
|---|---|---|---|---|
| Polling | High (interval) | Reconnects each time | Bad | Simple, rarely used at scale |
| Long-polling ✓ | Low (~instant) | 1 persistent per device | Good | Desktop clients, reliable |
| WebSocket ✓ | Lowest | 1 persistent, bidirectional | Good | Real-time sync, web clients |
| Push (APNs/FCM) | Medium | Managed by OS | Best | Mobile (when app is backgrounded) |
// Notification Service Architecture
class NotificationService {
// In-memory: userId → Set of connected WebSocket sessions
// Backed by Redis Pub/Sub for multi-server fan-out
connections = new Map(); // userId → [ws1, ws2, ...]
  async notifyFileChange(fileId, originDeviceId, newVersion) {
// 1. Get all users who have access to this file
const accessList = await metadataDB.getFileAccessList(fileId);
// Returns: [{ userId, permission, deviceIds }]
// 2. Get all devices for those users
const devices = accessList.flatMap(u =>
u.deviceIds.map(d => ({ userId: u.userId, deviceId: d }))
).filter(d =>
// Don't notify the device that made the change
      d.deviceId !== originDeviceId
);
// 3. Publish to Redis Pub/Sub (fan-out to notification servers)
const notification = {
type: 'FILE_CHANGED',
fileId: fileId,
version: newVersion,
timestamp: Date.now()
};
for (const device of devices) {
await redis.publish(
`notifications:${device.userId}:${device.deviceId}`,
JSON.stringify(notification)
);
}
}
// Each notification server subscribes to channels for connected clients
onWebSocketConnect(userId, deviceId, ws) {
redis.subscribe(`notifications:${userId}:${deviceId}`, (msg) => {
ws.send(msg);
});
}
}
// Long-polling fallback (for clients behind strict firewalls):
// Client: GET /api/changes?since=cursor_abc&timeout=60
// Server: Hold request open for up to 60s. Return immediately if
// there are changes, or empty response on timeout.
// Client: Immediately reconnect with new cursor.
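The long-polling fallback described above maps to a small client-side loop. A sketch, assuming a hypothetical `/api/changes` endpoint that holds the request open and returns `{ cursor, changes }`:

// Long-polling client loop (sketch — the endpoint and response shape are assumptions)
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function pollForChanges(initialCursor, onChanges) {
  let cursor = initialCursor;
  while (true) {
    try {
      const res = await fetch(`/api/changes?since=${cursor}&timeout=60`);
      if (res.ok) {
        const body = await res.json();          // assumed shape: { cursor, changes: [...] }
        if (body.changes.length > 0) onChanges(body.changes);
        cursor = body.cursor;                   // advance the cursor even when nothing changed
      }
      // On a server timeout the change list is empty; reconnect immediately
    } catch (err) {
      await sleep(5000 + Math.random() * 5000); // jittered backoff on network errors
    }
  }
}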
Conflict Resolution
Conflicts occur when two devices edit the same file before either sync completes. This is unavoidable in a system with offline support. We need a strategy that never loses data while keeping the UX simple.
Detecting Conflicts
// Optimistic concurrency control with per-file version numbers
async function commitFileVersion(request) {
const { filePath, chunkHashes, baseVersion } = request;
  // Atomic check — must execute as a compare-and-swap inside a single transaction
  // (see the SQL sketch below): is the current server version what the client expects?
const currentVersion = await metadataDB.getCurrentVersion(filePath);
if (currentVersion !== baseVersion) {
// CONFLICT: someone else committed a new version while we were editing
throw new ConflictError({
clientBaseVersion: baseVersion,
serverCurrentVersion: currentVersion,
filePath: filePath
});
}
// No conflict: commit the new version
const newVersion = currentVersion + 1;
await metadataDB.insertVersion({
filePath: filePath,
version: newVersion,
chunkHashes: chunkHashes,
committedAt: Date.now()
});
return { version: newVersion };
}
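In a relational metadata DB, the read-then-write above can be collapsed into one conditional UPDATE so that two devices committing at the same moment cannot interleave. A sketch against the `file_entries` / `file_versions` schema defined later in this design — the `transaction()`/`query()` client interface (node-postgres-style, returning `{ rows }`) is an assumption:

// Compare-and-swap commit (sketch; DB client interface is assumed)
async function commitFileVersionAtomic(entryId, baseVersion, chunkHashes, sizeBytes) {
  return metadataDB.transaction(async (tx) => {
    // Bump current_ver only if it still equals the client's base version
    const updated = await tx.query(
      `UPDATE file_entries
          SET current_ver = current_ver + 1, modified_at = NOW()
        WHERE entry_id = ? AND current_ver = ?
        RETURNING current_ver`,
      [entryId, baseVersion]
    );
    if (updated.rows.length === 0) {
      // Another device committed first — same ConflictError as above
      throw new ConflictError({ entryId, clientBaseVersion: baseVersion });
    }
    const newVersion = updated.rows[0].current_ver;
    await tx.query(
      `INSERT INTO file_versions (entry_id, version, chunk_hashes, size_bytes, file_hash)
       VALUES (?, ?, ?, ?, ?)`,
      [entryId, newVersion, chunkHashes, sizeBytes, sha256(chunkHashes.join(''))]
    );
    return { version: newVersion };
  });
}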
Resolution Strategies
// Strategy 1: Last-Write-Wins (LWW)
// Simple but can lose data. Used for non-critical files.
// The last device to commit "wins" and overwrites.
// Pros: Simple, no user intervention
// Cons: Silent data loss
// Strategy 2: Conflict Copies (Dropbox approach) ✓
// Create a "conflicted copy" with the losing version
async function handleConflict(filePath, localChunks, serverVersion) {
// 1. Download the server's version (winner)
await syncRemoteChange({ filePath, newVersion: serverVersion });
// 2. Save our version as a conflict copy
const timestamp = formatDate(Date.now());
const deviceName = getDeviceName();
const ext = path.extname(filePath);
const base = path.basename(filePath, ext);
// "report.pdf" → "report (Neel's Laptop's conflicted copy 2026-04-15).pdf"
const conflictPath = `${path.dirname(filePath)}/${base} `
+ `(${deviceName}'s conflicted copy ${timestamp})${ext}`;
await writeFile(conflictPath, assembleFromChunks(localChunks));
// 3. Notify the user
notifyUser({
type: 'CONFLICT',
message: `Conflicting changes detected for ${filePath}. `
+ `Your changes saved as: ${conflictPath}`,
originalFile: filePath,
conflictCopy: conflictPath
});
}
// Strategy 3: Operational Transform / CRDT
// Used by Google Docs for real-time collaborative editing
// Too complex for file-level sync; only for document editors
// Strategy 4: Three-Way Merge (Git-style)
// Possible for text files: find common ancestor, merge changes
// Complex, error-prone for binary files
// Used by some advanced sync engines for code files
Block Storage — Content-Addressable Store
Chunks are stored in object storage (S3 or GCS) using their SHA-256 hash as the key. This is called content-addressable storage (CAS) — the address is the content.
// Block Store Schema (S3/GCS)
// Key: chunks/{sha256-hash}
// Value: raw chunk bytes (up to 4 MB)
//
// Examples:
// s3://drive-blocks/chunks/e7b3a1f2c4d6e8...
// s3://drive-blocks/chunks/c2d4f91ab2e3f5...
// Properties of content-addressable storage:
// 1. Immutable: a chunk's content never changes (same hash = same data)
// 2. Idempotent: uploading the same chunk twice is safe (PUT is idempotent)
// 3. Naturally deduped: identical content = identical key
// 4. Easy to verify: download and re-hash to confirm integrity
// S3 Configuration
{
"bucket": "drive-blocks-us-east-1",
"storageClass": "S3_STANDARD", // hot data (recent files)
"replication": "CROSS_REGION", // us-east-1 → eu-west-1
"encryption": "AES-256-SSE", // server-side encryption
"lifecycle": {
// Move to cheaper storage after 90 days of no access
"transition": [
{ "days": 90, "storageClass": "S3_INFREQUENT_ACCESS" },
{ "days": 365, "storageClass": "S3_GLACIER" }
]
}
}
// Reference counting for garbage collection:
// Each chunk has a reference count in the metadata DB.
// When a file version is deleted, decrement chunk ref counts.
// When ref_count reaches 0, the chunk can be garbage collected.
// Use a background job with delay (to handle race conditions):
CREATE TABLE chunk_refs (
chunk_hash CHAR(64) PRIMARY KEY,
ref_count INT NOT NULL DEFAULT 1,
size_bytes BIGINT NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
last_ref_at TIMESTAMP DEFAULT NOW()
);
-- When a new file references a chunk:
INSERT INTO chunk_refs (chunk_hash, ref_count, size_bytes)
VALUES ('e7b3a1...', 1, 4194304)
ON CONFLICT (chunk_hash) DO UPDATE SET
ref_count = chunk_refs.ref_count + 1,
last_ref_at = NOW();
-- When a file version is deleted:
UPDATE chunk_refs SET ref_count = ref_count - 1
WHERE chunk_hash = 'e7b3a1...';
-- GC job (runs hourly):
DELETE FROM chunk_refs WHERE ref_count <= 0 AND last_ref_at < NOW() - INTERVAL '7 days';
-- Then delete from S3: aws s3 rm s3://drive-blocks/chunks/{hash}
Metadata Service — The Brain
The metadata service is the source of truth for everything except raw file bytes. It manages file trees, sharing permissions, version history, and device sync state.
Database Schema
-- Core tables (PostgreSQL)
-- Users
CREATE TABLE users (
user_id BIGINT PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
display_name VARCHAR(255),
  quota_bytes BIGINT DEFAULT (15::BIGINT * 1024 * 1024 * 1024), -- 15 GB free (cast avoids int4 overflow)
used_bytes BIGINT DEFAULT 0,
created_at TIMESTAMP DEFAULT NOW()
);
-- Files and Folders (tree structure using materialized path)
CREATE TABLE file_entries (
entry_id BIGINT PRIMARY KEY,
owner_id BIGINT REFERENCES users(user_id),
parent_id BIGINT REFERENCES file_entries(entry_id),
name VARCHAR(255) NOT NULL,
is_folder BOOLEAN DEFAULT FALSE,
current_ver INT DEFAULT 1,
size_bytes BIGINT DEFAULT 0,
mime_type VARCHAR(127),
created_at TIMESTAMP DEFAULT NOW(),
modified_at TIMESTAMP DEFAULT NOW(),
is_deleted BOOLEAN DEFAULT FALSE, -- soft delete (trash)
deleted_at TIMESTAMP,
-- Materialized path for efficient subtree queries
-- e.g., "/user_123/Documents/Work/report.pdf"
path TEXT NOT NULL,
UNIQUE(parent_id, name) -- no duplicate names in same folder
);
CREATE INDEX idx_file_entries_owner ON file_entries(owner_id);
CREATE INDEX idx_file_entries_parent ON file_entries(parent_id);
CREATE INDEX idx_file_entries_path ON file_entries(path);
-- File Versions (one row per version)
CREATE TABLE file_versions (
entry_id BIGINT REFERENCES file_entries(entry_id),
version INT NOT NULL,
chunk_hashes TEXT[] NOT NULL, -- ordered array of SHA-256 hashes
size_bytes BIGINT NOT NULL,
file_hash CHAR(64) NOT NULL, -- hash of the complete file
modified_by BIGINT REFERENCES users(user_id),
device_id VARCHAR(64),
created_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (entry_id, version)
);
-- Sharing / ACL
CREATE TABLE sharing_permissions (
entry_id BIGINT REFERENCES file_entries(entry_id),
user_id BIGINT REFERENCES users(user_id),
permission VARCHAR(20) NOT NULL, -- 'owner', 'editor', 'viewer'
granted_by BIGINT REFERENCES users(user_id),
granted_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (entry_id, user_id)
);
-- Shared Links
CREATE TABLE shared_links (
link_id VARCHAR(32) PRIMARY KEY, -- random token
entry_id BIGINT REFERENCES file_entries(entry_id),
permission VARCHAR(20) DEFAULT 'viewer',
password_hash VARCHAR(255), -- optional password protection
expires_at TIMESTAMP, -- optional expiry
created_by BIGINT REFERENCES users(user_id),
created_at TIMESTAMP DEFAULT NOW()
);
-- Device Sync State
CREATE TABLE device_sync_state (
user_id BIGINT REFERENCES users(user_id),
device_id VARCHAR(64) NOT NULL,
device_name VARCHAR(255),
last_cursor VARCHAR(255), -- opaque cursor for incremental sync
last_sync_at TIMESTAMP,
PRIMARY KEY (user_id, device_id)
);
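Before moving on to sharding, a few illustrative reads against this schema show how the common operations map onto it. The IDs, the example path, and the generic `db.query` helper are placeholders:

// List the contents of a folder (the "My Drive" view)
const children = await db.query(
  `SELECT entry_id, name, is_folder, size_bytes, modified_at
     FROM file_entries
    WHERE parent_id = ? AND is_deleted = FALSE
    ORDER BY is_folder DESC, name`,
  [folderId]
);

// Fetch the chunk list for the latest version of a file
// (this is what the sync client diffs its local chunk hashes against)
const latest = await db.query(
  `SELECT fv.version, fv.chunk_hashes, fv.size_bytes
     FROM file_entries fe
     JOIN file_versions fv
       ON fv.entry_id = fe.entry_id AND fv.version = fe.current_ver
    WHERE fe.entry_id = ?`,
  [entryId]
);

// List everything under a subtree using the materialized path
const subtree = await db.query(
  `SELECT entry_id, path FROM file_entries
    WHERE path LIKE ? AND is_deleted = FALSE`,
  ['/user_123/Documents/Work/%']
);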
Sharding Strategy
With 500M users and 100B files, a single database won't cut it. We shard by owner_id:
// Sharding by owner_id (consistent hashing)
// - All of a user's files on the same shard
// - Shared files: metadata on owner's shard, ACL lookup cross-shard
// - 256 logical shards, each handling ~2M users
function getShard(ownerId) {
return consistentHash(ownerId, NUM_SHARDS);
}
// Query routing:
// "List my files" → route to user's shard (single shard)
// "Files shared with me" → scatter-gather across shards (expensive)
// OR use a secondary index table (denormalized)
// Denormalized "shared with me" index:
CREATE TABLE shared_with_me (
user_id BIGINT,
entry_id BIGINT,
owner_id BIGINT,
entry_name VARCHAR(255),
permission VARCHAR(20),
shared_at TIMESTAMP,
PRIMARY KEY (user_id, entry_id)
);
-- Sharded by user_id → "shared with me" queries are single-shard
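The `consistentHash` call above is left abstract; a minimal hash-ring sketch shows how owner IDs map onto the 256 logical shards without mass reshuffling when shards are added. The virtual-node count and the choice of SHA-256 for ring positions are arbitrary choices here, not part of the original design:

const crypto = require('crypto');

// Minimal consistent-hash ring: each logical shard owns many virtual nodes on a
// 32-bit ring, and a key belongs to the first virtual node clockwise from its hash.
class HashRing {
  constructor(numShards, vnodesPerShard = 100) {
    this.ring = []; // [{ point, shard }]
    for (let shard = 0; shard < numShards; shard++) {
      for (let v = 0; v < vnodesPerShard; v++) {
        this.ring.push({ point: this.hash(`shard-${shard}-vnode-${v}`), shard });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }
  hash(key) {
    // First 8 hex chars of SHA-256 → 32-bit ring position
    const hex = crypto.createHash('sha256').update(String(key)).digest('hex');
    return parseInt(hex.slice(0, 8), 16);
  }
  getShard(key) {
    const h = this.hash(key);
    // First vnode clockwise from h; wrap to the start of the ring if past the end
    const node = this.ring.find(n => n.point >= h) || this.ring[0];
    return node.shard;
  }
}

const ring = new HashRing(256);
// ring.getShard(ownerId) → shard index in [0, 255]; all of a user's files live there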
Offline Support
Cloud storage must work seamlessly when the user is disconnected — on a plane, in a subway, or in a building with no signal. The client maintains a local operation log that queues all changes.
// Local SQLite database on the client device
CREATE TABLE pending_ops (
op_id INTEGER PRIMARY KEY AUTOINCREMENT,
file_path TEXT NOT NULL,
op_type TEXT NOT NULL, -- 'CREATE', 'MODIFY', 'DELETE', 'RENAME', 'MOVE'
chunk_hashes TEXT, -- JSON array of chunk hashes (for CREATE/MODIFY)
old_path TEXT, -- for RENAME/MOVE
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
status TEXT DEFAULT 'pending', -- 'pending', 'uploading', 'committed', 'failed'
retry_count INTEGER DEFAULT 0
);
// Offline workflow:
// 1. User edits file while disconnected
// 2. Client detects change, chunks the file, stores chunks in local cache
// 3. Inserts operation into pending_ops
// 4. Displays "Waiting to sync" icon on the file
// When connectivity resumes:
class OfflineSyncManager {
async onConnectivityRestored() {
// Process pending ops in order (FIFO)
const ops = await localDB.query(
'SELECT * FROM pending_ops WHERE status = ? ORDER BY created_at',
['pending']
);
for (const op of ops) {
try {
await localDB.update(op.op_id, { status: 'uploading' });
switch (op.op_type) {
case 'CREATE':
case 'MODIFY':
await syncLocalChange(op.file_path);
break;
case 'DELETE':
await api.deleteFile(op.file_path);
break;
case 'RENAME':
await api.renameFile(op.old_path, op.file_path);
break;
}
await localDB.update(op.op_id, { status: 'committed' });
} catch (err) {
if (err instanceof ConflictError) {
await handleConflict(op.file_path);
await localDB.update(op.op_id, { status: 'committed' });
} else {
await localDB.update(op.op_id, {
status: 'failed',
retry_count: op.retry_count + 1
});
}
}
}
}
}
Sharing & Permissions (ACL)
Sharing is ACL-based (Access Control List). Every file/folder has a list of principals (users, groups) with their permission level.
// Permission hierarchy:
// Owner → full control (delete, share, edit, view)
// Editor → edit and view (cannot delete or change sharing)
// Viewer → read-only (can download but not modify)
// Permission inheritance:
// Sharing a folder grants the same permission to ALL files/folders inside it
// A file's effective permission = MAX(own permission, inherited from parent folders)
function getEffectivePermission(userId, entryId) {
let maxPermission = null;
// Walk up the folder tree
let current = entryId;
while (current !== null) {
const perm = db.query(
'SELECT permission FROM sharing_permissions WHERE entry_id = ? AND user_id = ?',
[current, userId]
);
if (perm) {
maxPermission = maxOf(maxPermission, perm);
}
// Move to parent
current = db.query('SELECT parent_id FROM file_entries WHERE entry_id = ?', [current]);
}
return maxPermission; // null = no access
}
// Shared link access:
// GET /shared/{link_id}
// Server validates: link exists, not expired, password matches (if set)
// Returns file metadata + download URL (or folder listing)
// Permission check middleware:
async function checkPermission(req, res, next) {
const { userId } = req.auth;
const { entryId } = req.params;
const requiredLevel = req.method === 'GET' ? 'viewer' : 'editor';
const permission = await getEffectivePermission(userId, entryId);
if (!permission || permissionLevel(permission) < permissionLevel(requiredLevel)) {
return res.status(403).json({ error: 'Insufficient permissions' });
}
next();
}
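The shared-link flow described in the comments above fits in a small handler. A sketch assuming an Express-style `req`/`res`, the `shared_links` table from the schema, and hypothetical `db.queryOne`, `verifyPassword`, and `getPresignedDownloadUrl` helpers:

// GET /shared/:linkId — resolve a shareable link (sketch; helper functions are assumptions)
async function handleSharedLink(req, res) {
  const link = await db.queryOne(
    'SELECT * FROM shared_links WHERE link_id = ?', [req.params.linkId]
  );
  if (!link) return res.status(404).json({ error: 'Link not found' });

  if (link.expires_at && new Date(link.expires_at) < new Date()) {
    return res.status(410).json({ error: 'Link expired' });
  }
  if (link.password_hash) {
    const supplied = req.headers['x-link-password'];
    if (!supplied || !(await verifyPassword(supplied, link.password_hash))) {
      return res.status(401).json({ error: 'Password required' });
    }
  }

  const entry = await db.queryOne(
    'SELECT * FROM file_entries WHERE entry_id = ? AND is_deleted = FALSE',
    [link.entry_id]
  );
  if (!entry) return res.status(404).json({ error: 'File no longer exists' });

  // Folders return a listing; files return metadata plus a short-lived download URL
  if (entry.is_folder) {
    const children = await db.query(
      'SELECT entry_id, name, is_folder, size_bytes FROM file_entries WHERE parent_id = ? AND is_deleted = FALSE',
      [entry.entry_id]
    );
    return res.json({ folder: entry.name, children: children.rows, permission: link.permission });
  }
  const downloadUrl = await getPresignedDownloadUrl(entry.entry_id); // expires in ~1 hour
  return res.json({ file: entry.name, size: entry.size_bytes, downloadUrl, permission: link.permission });
}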
Version History — Space-Efficient
Every file change creates a new version. But because we store chunk references (not copies), version history is extremely space-efficient. Unchanged chunks are shared across versions.
// Version storage: list of chunk hashes per version
//
// Version 1: [A, B, C, D, E] (5 chunks, 20 MB)
// Version 2: [A, B, C', D, E] (1 chunk changed: C→C')
// Version 3: [A, B, C', D, E, F] (1 chunk added: F)
//
// Storage cost:
// Unique chunks: A, B, C, C', D, E, F = 7 chunks = 28 MB
// Storing each version in full: 20 + 20 + 24 = 64 MB
// With chunk-sharing: 28 MB (56% savings)
//
// For a file with 100 versions where each version changes 1 of 50 chunks:
// Naive: 100 × 200 MB = 20 GB
// Chunk-shared: 50 original + ~100 changed chunks ≈ 150 chunks ≈ 600 MB (97% savings!)
// Restoring a previous version:
async function restoreVersion(entryId, targetVersion) {
// 1. Get the chunk list for the target version
const oldVersion = await db.query(
    'SELECT chunk_hashes, size_bytes FROM file_versions WHERE entry_id = ? AND version = ?',
[entryId, targetVersion]
);
// 2. Create a new version with the old chunk list
const currentVersion = await db.query(
'SELECT MAX(version) FROM file_versions WHERE entry_id = ?',
[entryId]
);
await db.insert('file_versions', {
entry_id: entryId,
version: currentVersion + 1,
chunk_hashes: oldVersion.chunk_hashes, // same chunks!
size_bytes: oldVersion.size_bytes,
modified_by: currentUserId
});
// No chunks need to be copied or moved!
// The restored version just references the same chunks.
// Cost: one metadata row (~200 bytes) regardless of file size.
// 3. Notify all devices of the new version
await notificationService.notifyFileChange(entryId);
}
Cold Storage for Old Versions
// Version lifecycle:
// Days 0-30: All versions kept in S3 Standard (instant access)
// Days 30-90: Versions beyond the latest 10 moved to S3 IA (cheaper, 128KB minimum)
// Days 90-365: Non-latest versions moved to S3 Glacier (cents/GB, minutes to retrieve)
// Day 365+: Versions beyond latest 5 permanently deleted (configurable per plan)
// Cold storage migration job (runs nightly):
async function migrateOldVersions() {
const files = await db.query(`
SELECT fv.entry_id, fv.version, fv.chunk_hashes, fv.created_at
FROM file_versions fv
JOIN file_entries fe ON fe.entry_id = fv.entry_id
WHERE fv.created_at < NOW() - INTERVAL '90 days'
AND fv.version < fe.current_ver - 5
      AND fv.storage_class = 'STANDARD'  -- assumes a storage_class column added to file_versions
`);
for (const version of files) {
// Move each unique chunk to Glacier
for (const hash of version.chunk_hashes) {
const refCount = await getActiveRefCount(hash);
if (refCount.hotRefs === 0) {
// No hot version references this chunk — safe to move
await s3.copyObject({
Bucket: 'drive-blocks',
CopySource: `drive-blocks/chunks/${hash}`,
Key: `chunks/${hash}`,
StorageClass: 'GLACIER'
});
}
}
await db.update('file_versions', version, { storage_class: 'GLACIER' });
}
}
// Restoring a Glacier version:
// 1. Initiate Glacier restore (takes 1-5 minutes for Expedited, 3-5 hours for Standard)
// 2. Once restored, chunks are temporarily available in S3 for 24 hours
// 3. Download and reassemble the file
// 4. Cost: $0.03 per GB + $10 per 1000 requests (Expedited)
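The restore flow above corresponds roughly to S3's RestoreObject API. A sketch using the AWS SDK for JavaScript (v3); the bucket and key layout match the block-store schema earlier, and the tier/days values are the knobs that trade cost against retrieval speed:

const { S3Client, RestoreObjectCommand, HeadObjectCommand } = require('@aws-sdk/client-s3');
const s3 = new S3Client({ region: 'us-east-1' });

// Kick off a restore for every chunk of an archived version (sketch)
async function restoreArchivedChunks(chunkHashes, tier = 'Standard') {
  for (const hash of chunkHashes) {
    await s3.send(new RestoreObjectCommand({
      Bucket: 'drive-blocks',
      Key: `chunks/${hash}`,
      RestoreRequest: {
        Days: 1,                                // keep the restored copy readable for ~24 hours
        GlacierJobParameters: { Tier: tier }    // 'Expedited' | 'Standard' | 'Bulk'
      }
    }));
  }
}

// Poll until a chunk's restored copy is readable
async function isChunkRestored(hash) {
  const head = await s3.send(new HeadObjectCommand({ Bucket: 'drive-blocks', Key: `chunks/${hash}` }));
  // Restore header looks like: 'ongoing-request="false", expiry-date="..."' once complete
  return head.Restore?.includes('ongoing-request="false"') ?? false;
}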
Chunking & Deduplication — Animated
This animation demonstrates how file chunking and deduplication work together. A large file is split into chunks, the user modifies part of the file, and only the changed chunk is uploaded.
▶ Chunking & Dedup
Watch a file get split into chunks, then see how only the modified chunk is re-uploaded after an edit — hash comparison saves bandwidth.
Security & Encryption
// Encryption layers:
//
// 1. In-Transit: TLS 1.3 for all API calls and chunk transfers
// - Certificate pinning on mobile clients
// - HSTS headers on web
//
// 2. At-Rest: AES-256 server-side encryption (SSE-S3 or SSE-KMS)
// - Each chunk encrypted before writing to S3
// - Encryption keys managed by AWS KMS or Google Cloud KMS
// - Key rotation every 90 days
//
// 3. Client-Side Encryption (optional, enterprise feature):
// - User holds the key; server stores only ciphertext
// - Breaks server-side dedup (encrypted chunks of same content differ)
// - Used for compliance: HIPAA, SOX, GDPR
// Chunk integrity verification:
async function downloadAndVerify(chunkHash) {
const data = await s3.getObject(`chunks/${chunkHash}`);
const computedHash = sha256(data);
if (computedHash !== chunkHash) {
// Data corruption detected!
// Fetch from replica region
const replicaData = await s3Replica.getObject(`chunks/${chunkHash}`);
const replicaHash = sha256(replicaData);
if (replicaHash === chunkHash) {
// Repair primary from replica
await s3.putObject(`chunks/${chunkHash}`, replicaData);
return replicaData;
}
throw new DataCorruptionError(chunkHash);
}
return data;
}
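For the optional client-side layer, a minimal sketch using Node's crypto module (AES-256-GCM with a random IV per chunk) also makes it obvious why it defeats cross-user dedup: the same plaintext chunk encrypted under different keys and IVs yields different ciphertexts, so the content-addressable hashes no longer collide. Key management — the genuinely hard part — is out of scope here:

const crypto = require('crypto');

// Encrypt a chunk client-side before upload (sketch; key management omitted)
function encryptChunk(plainChunk, userKey /* 32-byte Buffer held only by the client */) {
  const iv = crypto.randomBytes(12);                               // fresh IV per chunk
  const cipher = crypto.createCipheriv('aes-256-gcm', userKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plainChunk), cipher.final()]);
  const authTag = cipher.getAuthTag();
  // The server only ever sees this blob; identical plaintext chunks from two users
  // produce different blobs, so cross-user dedup is impossible by design.
  return Buffer.concat([iv, authTag, ciphertext]);
}

function decryptChunk(blob, userKey) {
  const iv = blob.subarray(0, 12);
  const authTag = blob.subarray(12, 28);
  const ciphertext = blob.subarray(28);
  const decipher = crypto.createDecipheriv('aes-256-gcm', userKey, iv);
  decipher.setAuthTag(authTag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}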
Scalability Deep Dive
// Component-level scaling:
//
// Upload Service: Stateless, horizontally scaled behind ALB
// Auto-scale: CPU > 60% → add instances
// Each instance handles ~5K concurrent uploads
//
// Metadata Service: Sharded PostgreSQL (256 shards, by owner_id)
// Read replicas for heavy-read queries
// Redis cache: file metadata (TTL 5 min)
// Connection pooling: PgBouncer
//
// Notification Service: WebSocket servers behind sticky sessions (ALB)
// Each server holds ~500K connections
// 200 servers → 100M concurrent connections
// Redis Pub/Sub for cross-server notification fan-out
//
// Block Store (S3): Virtually unlimited (managed by AWS/GCS)
// Multi-region replication for durability
//     S3 handles ~3,500 PUT/s and ~5,500 GET/s per prefix (partition by hash prefix)
//
// CDN: CloudFront / Cloud CDN for download acceleration
// Popular shared files cached at edge locations
// Pre-signed URLs with 1-hour expiry
// Rate limiting:
// Per user: 100 API calls/min, 50 uploads/min, 10 GB/hour upload
// Per file: 10 versions/min (prevent rapid-fire saves from consuming resources)
// Global: Circuit breaker on S3 failures (fail fast, retry after cooldown)
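The per-user limits above are usually enforced with a token bucket. A minimal in-memory sketch — a production deployment would keep the buckets in Redis so every API server sees the same counters, and the 50-uploads-per-minute figure is just the limit quoted above:

// In-memory token bucket per (userId, action) — sketch only
class TokenBucket {
  constructor(capacity, refillPerSecond) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }
  tryConsume(n = 1) {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens < n) return false;   // caller responds with HTTP 429
    this.tokens -= n;
    return true;
  }
}

const uploadLimits = new Map(); // userId → TokenBucket

function allowUpload(userId) {
  if (!uploadLimits.has(userId)) {
    uploadLimits.set(userId, new TokenBucket(50, 50 / 60)); // ~50 uploads/min per user
  }
  return uploadLimits.get(userId).tryConsume();
}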
Real-World Implementations
| Feature | Google Drive | Dropbox | OneDrive |
|---|---|---|---|
| Chunk size | 4-8 MB (variable) | 4 MB (CDC) | ~10 MB |
| Block store | Colossus (GFS2) | Magic Pocket (custom) | Azure Blob Storage |
| Metadata DB | Spanner | MySQL (Edgestore) | Cosmos DB |
| Notification | gRPC streaming | Long-polling | WebSocket |
| Sync protocol | Delta sync (binary diff) | Chunk-level delta | Differential sync |
| Conflict handling | Auto-merge (Docs), copies (Drive) | Conflict copies | Conflict copies |
| Dedup | Cross-user (global) | Cross-user (global) | Per-user only |
Key Takeaways
- Separate data from metadata. File chunks go to object storage (S3/GCS); file trees, permissions, and versions go to a relational DB. This separation enables independent scaling and optimization.
- Chunk everything. 4 MB fixed-size (or content-defined) chunks enable deduplication, resumable uploads, parallel transfers, and efficient delta sync. One byte change ≠ full file re-upload.
- Content-addressable storage (SHA-256 as key) gives you dedup, integrity verification, and immutability for free.
- The sync engine is a state machine — IDLE → INDEXING → DIFFING → UPLOADING → COMMITTING. Each state has well-defined transitions, error handling, and retry logic.
- Conflict resolution: prefer data preservation. Conflict copies (Dropbox-style) ensure zero data loss. LWW is simpler but lossy. Three-way merge is powerful but complex.
- Notification pipeline (WebSocket + long-polling + push) ensures all connected devices learn about changes within seconds.
- Offline-first: Queue changes locally in SQLite, replay when connectivity resumes. Treat the network as unreliable.
- Version history is cheap with chunk sharing — unchanged chunks are referenced, not copied. Storage overhead for 100 versions of a file that changes slightly each time is minimal.
- Cold storage tiering (Standard → IA → Glacier) keeps costs manageable for version history spanning months or years.