Case Study: Uber Engineering
Uber processes 20+ million rides per day across 70+ countries and 10,000+ cities. At any given second, the platform is matching riders to drivers, computing real-time pricing, tracking millions of moving vehicles, processing payments in dozens of currencies, and running fraud detection models — all with sub-second latency expectations. This scale has forced Uber to invent, open-source, and battle-test an extraordinary collection of infrastructure.
What makes Uber's engineering story uniquely instructive is the domain complexity. Unlike a social network (read-heavy, mostly text) or an e-commerce platform (catalog + checkout), Uber operates a real-time, two-sided marketplace where both supply (drivers) and demand (riders) are physically moving, decisions must be made in milliseconds, and the cost of a wrong decision is immediate (a rider waiting in the rain, a driver sitting idle). Every architectural choice is shaped by this fundamental constraint: the world won't wait.
- 20M+ trips/day across Rides, Eats, Freight, and other verticals
- 6M+ drivers/couriers actively online at peak hours
- 131M+ monthly active users across all platforms
- 5,000+ microservices running in production
- 1 trillion+ Kafka messages/day flowing through data infrastructure
- 100M+ location updates/second from driver apps worldwide
- $138B+ in gross bookings annually
From Monolith to Microservices
Uber's original backend was a monolith: one Node.js codebase that handled dispatch, payments, trip management, user profiles, and everything else. In 2012-2013, when Uber operated in a handful of cities, this was adequate. The entire engineering team could fit in one room, deployments were simple, and the blast radius of bugs was manageable.
By 2014, the monolith was buckling. Growth was exponential — Uber was doubling city count every few months. The problems were textbook:
// The Uber Monolith (~2013) — a simplified view
// dispatch.js, payments.js, trips.js, users.js, surge.js, maps.js
// ALL in one Node.js process, one Git repository, one deployment
// Problem 1: Blast Radius
// A bug in payment processing crashes the entire process
// → dispatch goes down, no one can get rides
// Problem 2: Scaling Mismatch
// Dispatch needs 100x the compute during New Year's Eve
// Payments needs 1x — but you can only scale the whole monolith
// Problem 3: Team Coupling
// 200+ engineers committing to one repo
// Merge conflicts, broken builds, 6-hour deploy queues
// "Who broke staging?" becomes the team's daily ritual
// Problem 4: Technology Lock-in
// Node.js is great for I/O-bound dispatch
// But ML models need Python, low-latency matching needs Go
// Can't use different tech for different problems
Uber's migration wasn't a "big bang" rewrite. It followed a Strangler Fig pattern — new services were built alongside the monolith, and traffic was gradually rerouted. The first services extracted were the ones with the clearest domain boundaries: Dispatch, Payments, and Rider/Driver Profiles.
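The Strangler Fig rerouting step can be sketched in a few lines. This is a minimal illustration with hypothetical route and service names, not Uber's actual gateway: a routing table sends a configurable percentage of traffic for each path prefix to the extracted service, and everything else falls through to the monolith.

```javascript
// Strangler Fig routing sketch (hypothetical names): the monolith keeps
// serving every endpoint until a new service takes over a percentage of
// its traffic, ramped up as confidence grows.
const routes = [
  // extracted service takes 25% of dispatch traffic; monolith keeps the rest
  { prefix: '/dispatch', target: 'dispatch-service', rolloutPercent: 25 },
  // payments extraction is complete: 100% rerouted
  { prefix: '/payments', target: 'payments-service', rolloutPercent: 100 },
];

function routeRequest(path, diceRoll /* 0-99 */) {
  for (const r of routes) {
    if (path.startsWith(r.prefix) && diceRoll < r.rolloutPercent) {
      return r.target;   // strangler: rerouted to the new service
    }
  }
  return 'monolith';     // default: the legacy code path
}
```

Ramping `rolloutPercent` from 1 to 100 while watching error rates is what makes the migration reversible at every step.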
Service-Oriented Architecture (SOA)
By 2016, Uber had evolved into a sprawling service-oriented architecture (SOA). Key principles included:
// Uber's Microservice Principles
//
// 1. SINGLE RESPONSIBILITY
// Each service owns one domain concept
// Trip Service → trip lifecycle (request → match → ride → complete)
// Pricing Service → fare calculation, surge, promotions
// Payment Service → charge authorization, settlement, refunds
//
// 2. DATA OWNERSHIP
// Each service owns its data store — no shared databases
// Trip Service → trip-db (Schemaless/MySQL)
// Pricing Service → pricing-db (Cassandra for time-series pricing)
// Payment Service → payment-db (MySQL with strict ACID)
//
// 3. API CONTRACTS (Thrift → later Protobuf/gRPC)
// service TripService {
// TripResponse requestTrip(1: TripRequest request)
// TripStatus getTrip(1: TripId id)
// void cancelTrip(1: TripId id, 2: CancelReason reason)
// }
//
// 4. LANGUAGE DIVERSITY (right tool for the job)
// Go → networking, dispatch, real-time systems (low latency)
// Java → business logic services, data processing (ecosystem)
// Python → ML models, data science pipelines (libraries)
// Node.js → API gateway, BFF layers (async I/O)
// The result by 2018:
// 2,200+ microservices
// 50+ distinct data stores
// 4,000+ engineers
// ~1,000 deployments per day
Domain-Oriented Microservice Architecture (DOMA)
By 2020, Uber had thousands of microservices, and the original promise of "independent services" had eroded. Services were calling services calling services — creating deep call chains, unpredictable latency, and cascading failures. DOMA was Uber's answer:
// DOMA: Domain-Oriented Microservice Architecture
//
// Core Concept: Group microservices into DOMAINS
// Each domain has:
// - A gateway service (the ONLY entry point from outside)
// - Internal services (hidden from other domains)
// - A clear interface contract
//
// ┌─────────────────────────────────────────────┐
// │ TRIP DOMAIN │
// │ ┌──────────────┐ │
// │ │ Trip Gateway │ ← only external entry │
// │ └──────┬───────┘ │
// │ │ │
// │ ┌──────▼──────┐ ┌──────────┐ ┌─────────┐ │
// │ │ Trip Logic │ │ Trip DB │ │ Trip │ │
// │ │ Service │ │ Service │ │ Events │ │
// │ └─────────────┘ └──────────┘ └─────────┘ │
// └─────────────────────────────────────────────┘
//
// ┌─────────────────────────────────────────────┐
// │ MARKETPLACE DOMAIN │
// │ ┌──────────────────┐ │
// │ │ Marketplace GW │ ← only external entry │
// │ └──────┬───────────┘ │
// │ │ │
// │ ┌──────▼──────┐ ┌──────────┐ ┌─────────┐ │
// │ │ Matching │ │ Pricing │ │ Surge │ │
// │ │ Engine │ │ Engine │ │ Engine │ │
// │ └─────────────┘ └──────────┘ └─────────┘ │
// └─────────────────────────────────────────────┘
//
// Benefits:
// - Reduced cross-domain calls by 60%
// - Clear ownership: each domain has a team
// - Gateway enforces API versioning, auth, rate limiting
// - Internal services can evolve freely
// - Blast radius contained within domain boundaries
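The gateway rule above can be sketched in plain JavaScript. This is an illustrative sketch with hypothetical service names, not Uber's DOMA implementation: internal services live in module scope, and only the gateway object is exposed to other domains.

```javascript
// DOMA-style domain gateway sketch: internal services are not
// addressable from outside; every external call goes through the gateway.
const internal = {
  tripLogic: {
    load(tripId) { return { id: tripId, status: 'completed' }; },
  },
  tripEvents: {
    record(event) { /* append to the domain's event log */ },
  },
};

// The only surface other domains may call
const tripGateway = {
  getTrip(tripId, caller) {
    // the gateway is where auth, versioning, and rate limits live
    if (!caller || !caller.authenticated) throw new Error('unauthenticated');
    internal.tripEvents.record({ type: 'trip_read', tripId });
    return internal.tripLogic.load(tripId);
  },
};

const trip = tripGateway.getTrip('trip-abc-123', { authenticated: true });
// trip.status === 'completed'
```

Because `internal` is never exported, the domain can split, merge, or rewrite its internal services without any other team noticing.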
Ringpop: Consistent Hashing for Service Discovery
Ringpop is Uber's open-source library for building cooperative, application-layer sharding. It combines consistent hashing (to partition work across nodes) with a SWIM gossip protocol (for membership and failure detection) — enabling services to self-organize into a ring without requiring a central coordinator.
The problem Ringpop solves is fundamental: when you have stateful services (like a trip management service that holds in-memory state for active trips), you need request routing that always sends requests for the same trip to the same node, even as nodes join, leave, or fail.
// Ringpop Architecture
//
// ┌─────────────────────────────────────────────────────┐
// │ HASH RING │
// │ │
// │ Node A (owns range 0–1000) │
// │ ╱ ╲ │
// │ ╱ ╲ │
// │ Node D Node B │
// │ (owns 7501–10000) (owns 1001–3500) │
// │ ╲ ╱ │
// │ ╲ ╱ │
// │ Node C (owns 3501–7500) │
// │ │
// │ hash("trip-12345") = 2847 → routes to Node B │
// │ hash("trip-67890") = 5123 → routes to Node C │
// └─────────────────────────────────────────────────────┘
//
// How Ringpop works:
//
// 1. SWIM GOSSIP PROTOCOL (Membership)
// - Each node periodically pings a random peer
// - If no response → indirect ping through K other nodes
// - If still no response → mark as SUSPECT → then FAULTY
// - Membership changes propagate via piggyback gossip
// - Full membership convergence in O(log N) rounds
//
// 2. CONSISTENT HASHING (Request Routing)
// - Each node owns a range on the hash ring
// - Requests are hashed to a position → forwarded to owner
// - When nodes join/leave, only K/N keys are remapped
// - Virtual nodes ensure even distribution
//
// 3. REQUEST FORWARDING
// - Client sends request to ANY Ringpop node
// - That node computes hash(key) → determines owner
// - If it's the owner → handle locally
// - If not → forward to the correct owner
// - Forwarding is transparent to the caller
var TChannel = require('tchannel');
var Ringpop = require('ringpop');

var tchannel = new TChannel();
var ringpop = new Ringpop({
    app: 'trip-service',
    hostPort: '10.0.1.5:3000',
    channel: tchannel.makeSubChannel({ serviceName: 'ringpop' })
});
ringpop.setupChannel();
// Join the ring via a list of seed hosts
ringpop.bootstrap(['10.0.1.5:3000', '10.0.1.6:3000'], function (err) {
    if (err) throw err; // could not join the ring
});

// Request handler: handle locally or forward to the owner
function handleRequest(req, res) {
    var tripId = req.headers.tripId;
    var dest = ringpop.lookup(tripId); // hash → node
    if (dest === ringpop.whoami()) {
        // I own this trip — handle locally
        handleTrip(req, res);
    } else {
        // Forward to the correct node
        ringpop.forward(req, dest, function (err, resp) {
            res.send(resp);
        });
    }
}
// When Node C fails:
// SWIM detects failure in ~200ms (configurable)
// Node C's hash range redistributed to neighbors (B and D)
// Requests for trips in Node C's range are re-routed
// Active trip state on Node C is lost — rebuilt from persistent store
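The "only K/N keys are remapped" property is easy to demonstrate. Below is a self-contained toy ring with virtual nodes; it is an illustration of the technique, not Ringpop's actual implementation. When node C fails, every key C did not own keeps its original owner.

```javascript
// Minimal consistent-hash ring with virtual nodes (illustrative only).
// Each node is hashed onto the ring at many points; a key is owned by
// the first point at or after its hash, wrapping around.
function fnv1a(str) {
  let h = 0x811c9dc5; // FNV-1a 32-bit
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

class HashRing {
  constructor(nodes, vnodes = 100) {
    this.points = [];
    for (const node of nodes) {
      for (let v = 0; v < vnodes; v++) {
        this.points.push({ hash: fnv1a(node + '#' + v), node });
      }
    }
    this.points.sort((a, b) => a.hash - b.hash);
  }
  lookup(key) {
    const h = fnv1a(key);
    for (const p of this.points) if (p.hash >= h) return p.node;
    return this.points[0].node; // wrap around the ring
  }
}

const before = new HashRing(['A', 'B', 'C', 'D']);
const after = new HashRing(['A', 'B', 'D']); // node C fails
let moved = 0;
for (let i = 0; i < 1000; i++) {
  const key = 'trip-' + i;
  if (before.lookup(key) !== 'C' && before.lookup(key) !== after.lookup(key)) moved++;
}
// moved === 0: only keys C owned change hands
```

Virtual nodes (100 per physical node here) are what keep the per-node load even; with one point per node, ranges would be wildly unequal.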
TChannel: Uber's RPC Protocol
Ringpop runs on top of TChannel, Uber's open-source multiplexing RPC protocol. TChannel provides framing, multiplexing (multiple concurrent requests over a single TCP connection), and built-in tracing headers — think of it as a predecessor to gRPC, designed when gRPC was still in its infancy.
// TChannel Frame Format
// ┌─────────────┬───────────┬──────────┬──────────┬──────────────┐
// │ Frame Size  │ Frame Type│ Reserved │ Frame ID │ Payload      │
// │ (2 bytes)   │ (1 byte)  │ (1 byte) │ (4 bytes)│ (variable)   │
// └─────────────┴───────────┴──────────┴──────────┴──────────────┘
//
// Frame Types:
// 0x01 = init req (handshake)
// 0x02 = init res (handshake response)
// 0x03 = call req (RPC request)
// 0x04 = call res (RPC response)
// 0x13 = call req cont (continuation frame for large payloads)
// 0x14 = call res cont (response continuation)
// 0xc0 = cancel (cancel a pending request)
// 0xff = error (error frame)
//
// Key features:
// - Multiplexing: 64K concurrent requests per TCP connection
// - Checksums: CRC-32C for data integrity
// - Tracing: Built-in span ID, trace ID, parent ID headers
// - Streaming: Large payloads split into continuation frames
// - Deadline propagation: TTL headers for timeout cascading
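Framing of this general shape is straightforward to write with Node Buffers. The codec below is a toy sketch of the idea (2-byte size, 1-byte type, 1-byte reserved, 4-byte id), not TChannel's full wire format, which carries additional header fields, checksums, and tracing data.

```javascript
// Toy binary framing sketch: length-prefixed frames with a type byte
// and a 4-byte frame id used to match responses to requests.
const CALL_REQ = 0x03;

function encodeFrame(type, frameId, payload) {
  const buf = Buffer.alloc(8 + payload.length);
  buf.writeUInt16BE(8 + payload.length, 0); // total frame size
  buf.writeUInt8(type, 2);                  // frame type
  buf.writeUInt8(0, 3);                     // reserved
  buf.writeUInt32BE(frameId, 4);            // id for multiplexing
  payload.copy(buf, 8);
  return buf;
}

function decodeFrame(buf) {
  return {
    size: buf.readUInt16BE(0),
    type: buf.readUInt8(2),
    frameId: buf.readUInt32BE(4),
    payload: buf.slice(8),
  };
}

const frame = encodeFrame(CALL_REQ, 42, Buffer.from('getTrip'));
const decoded = decodeFrame(frame);
// decoded.type === 0x03, decoded.frameId === 42
```

The frame id is the key to multiplexing: many requests share one TCP connection, and responses can arrive out of order because each carries the id of the request it answers.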
Schemaless: Uber's MySQL-Backed Data Store
When Uber outgrew PostgreSQL (they published a famous blog post about PostgreSQL's write amplification and replication issues), they didn't switch to Cassandra or DynamoDB. Instead, they built Schemaless — an append-only, immutable datastore built on top of MySQL. The design is deceptively simple but remarkably effective.
The Data Model
// Schemaless Data Model
//
// Core abstraction: CELL
// A cell is identified by (row_key, column_name, ref_key)
//
// ┌──────────────────────────────────────────────────────────────┐
// │ SCHEMALESS CELL │
// │ │
// │ Row Key : UUID (e.g., trip UUID) │
// │ Column Name : string (e.g., "base", "payment", "receipt") │
// │ Ref Key : integer (monotonically increasing version) │
// │ Body : JSON blob (the actual data) │
// │ Created At : timestamp │
// └──────────────────────────────────────────────────────────────┘
//
// KEY PROPERTY: CELLS ARE IMMUTABLE
// You never UPDATE a cell — you INSERT a new cell with a higher ref_key
//
// Example: Trip lifecycle stored in Schemaless
//
// Row Key = "trip-abc-123"
//
// ┌────────────┬─────────┬──────────────────────────────────────┐
// │ Column │ Ref Key │ Body (JSON) │
// ├────────────┼─────────┼──────────────────────────────────────┤
// │ base │ 1 │ {status:"requested", rider:"u1", │
// │ │ │ pickup:{lat:37.77, lng:-122.41}} │
// │ base │ 2 │ {status:"matched", driver:"d5", │
// │ │ │ eta_seconds: 180} │
// │ base │ 3 │ {status:"in_progress", │
// │ │ │ started_at:"2026-04-15T10:32:00Z"} │
// │ base │ 4 │ {status:"completed", │
// │ │ │ ended_at:"2026-04-15T10:47:00Z", │
// │ │ │ distance_miles: 3.2} │
// │ payment │ 1 │ {amount: 14.50, currency: "USD", │
// │ │ │ method: "credit_card", status:"auth"}│
// │ payment │ 2 │ {amount: 14.50, status: "captured", │
// │ │ │ transaction_id: "tx-789"} │
// │ receipt │ 1 │ {fare: 10.00, surge: 1.2, │
// │ │ │ service_fee: 2.30, total: 14.50} │
// └────────────┴─────────┴──────────────────────────────────────┘
//
// Reading: SELECT with highest ref_key for a column
// SELECT body FROM cells
// WHERE row_key = 'trip-abc-123'
// AND column_name = 'base'
// ORDER BY ref_key DESC LIMIT 1;
// → Returns {status:"completed", ...}
//
// Full history: SELECT all ref_keys for a column
// → Returns entire trip lifecycle — perfect for auditing/debugging
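The cell model above can be captured in a small in-memory sketch. This is an illustration of the data model, not Schemaless itself: writes only ever append a new `ref_key`, latest-version reads pick the highest `ref_key`, and history reads return the full lifecycle.

```javascript
// In-memory sketch of the Schemaless cell model: cells are immutable;
// every write appends a cell with the next ref_key.
class CellStore {
  constructor() { this.cells = []; }
  put(rowKey, column, body) {
    const latest = this.get(rowKey, column);
    const refKey = latest ? latest.refKey + 1 : 1; // never UPDATE
    this.cells.push({ rowKey, column, refKey, body });
    return refKey;
  }
  get(rowKey, column) { // latest version wins
    let best = null;
    for (const c of this.cells)
      if (c.rowKey === rowKey && c.column === column &&
          (!best || c.refKey > best.refKey)) best = c;
    return best;
  }
  history(rowKey, column) { // full lifecycle, oldest first
    return this.cells
      .filter(c => c.rowKey === rowKey && c.column === column)
      .sort((a, b) => a.refKey - b.refKey);
  }
}

const store = new CellStore();
store.put('trip-abc-123', 'base', { status: 'requested' });
store.put('trip-abc-123', 'base', { status: 'matched' });
store.put('trip-abc-123', 'base', { status: 'completed' });
// get() → completed; history() → all three transitions
```

Immutability is what makes the audit trail free: a bug investigation can replay every state the trip ever passed through.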
Architecture & Sharding
Schemaless is built on top of stock MySQL instances. It doesn't use MySQL's query optimizer, schema enforcement, or transactions. MySQL is treated as a dumb storage engine — Schemaless handles sharding, routing, and indexing at the application layer.
// Schemaless Physical Architecture
//
// ┌───────────────────────────────────────────────────┐
// │ SCHEMALESS CLIENT │
// │ row_key → shard = hash(row_key) % N_shards │
// │ shard → MySQL instance (via shard map) │
// └──────────────────────┬────────────────────────────┘
// │
// ┌───────────────┼───────────────┐
// ▼ ▼ ▼
// ┌─────────┐ ┌─────────┐ ┌─────────┐
// │ MySQL 0 │ │ MySQL 1 │ │ MySQL 2 │ ... (4096 shards)
// │ Shard 0 │ │ Shard 1 │ │ Shard 2 │
// └─────────┘ └─────────┘ └─────────┘
// │ Primary │ │ Primary │ │ Primary │
// │ Replica │ │ Replica │ │ Replica │
// │ Replica │ │ Replica │ │ Replica │
// └─────────┘ └─────────┘ └─────────┘
//
// Each MySQL instance runs a SINGLE table:
//
// CREATE TABLE cells (
// added_id BIGINT AUTO_INCREMENT PRIMARY KEY,
// row_key VARBINARY(36),
// column_name VARBINARY(64),
// ref_key INT,
// body MEDIUMBLOB, -- JSON, compressed with zlib
// created_at DATETIME(6),
// UNIQUE KEY (row_key, column_name, ref_key)
// );
//
// The auto_increment `added_id` is crucial:
// - Provides a total ordering of all writes to a shard
// - Enables change-data-capture: poll for added_id > last_seen
// - Used by secondary indexes to tail the write log
// SECONDARY INDEXES:
// Schemaless supports async secondary indexes:
// 1. A trigger process tails each shard (by added_id)
// 2. For each new cell, it extracts indexed fields from the JSON body
// 3. Writes index entries to a separate index shard
//
// Example: index on (status, city) for trips
// Query: "all in-progress trips in San Francisco"
// → hits the secondary index shard
// → returns list of row_keys
// → client fetches full cells from primary shards
//
// Index entries are EVENTUALLY CONSISTENT
// (typically < 1 second lag)
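The `added_id` tailing loop that feeds these indexes can be sketched in a few lines. Field names here are illustrative, not Uber's: a trigger process remembers the last `added_id` it saw and, on each pass, indexes only the cells written since.

```javascript
// Sketch of change-data-capture by added_id: a simulated shard table
// and an async secondary index on the "status" field.
const shard = [];          // simulated cells table
let nextAddedId = 1;
function insertCell(rowKey, body) {
  shard.push({ addedId: nextAddedId++, rowKey, body });
}

const statusIndex = {};    // status → [rowKey, ...]
let lastSeen = 0;
function tailOnce() {      // one polling pass of the trigger process
  for (const cell of shard) {
    if (cell.addedId <= lastSeen) continue;  // already indexed
    const status = cell.body.status;
    (statusIndex[status] = statusIndex[status] || []).push(cell.rowKey);
    lastSeen = cell.addedId;
  }
}

insertCell('trip-1', { status: 'in_progress', city: 'SF' });
insertCell('trip-2', { status: 'completed', city: 'SF' });
tailOnce();
insertCell('trip-3', { status: 'in_progress', city: 'NYC' });
tailOnce();                // incremental: only trip-3 is new
// statusIndex.in_progress → ['trip-1', 'trip-3']
```

The gap between `insertCell` and the next `tailOnce` is exactly the "eventually consistent, typically < 1 second" lag described above.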
H3: Hexagonal Hierarchical Geospatial Indexing System
H3 is arguably Uber's most impactful open-source contribution to the broader tech community. It's a hierarchical geospatial indexing system that divides the entire Earth's surface into hexagonal cells at multiple resolutions. H3 powers surge pricing, supply positioning, ETA calculations, marketplace optimization, and dozens of other geospatial features at Uber.
Why Hexagons?
The choice of hexagons over squares or triangles is not arbitrary — it's mathematically optimal for spatial analysis:
// Why Hexagons beat Squares and Triangles
//
// PROPERTY 1: UNIFORM ADJACENCY
// Squares have TWO types of neighbors:
// - Edge neighbors (4): distance = d
// - Corner neighbors (4): distance = d × √2 ≈ 1.41d
// This creates directional bias in distance calculations
//
// Hexagons have ONE type of neighbor:
// - All 6 neighbors share an edge
// - All 6 neighbors are equidistant from center
// - No directional bias!
//
// ┌─────────────────────────────────────────────┐
// │ Square Grid: Hex Grid: │
// │ ┌───┬───┬───┐ ╱╲ ╱╲ │
// │ │ C │ E │ C │ │ │ │ │ │
// │ ├───┼───┼───┤ ╱╲ ╱╲╱╲ ╱╲ │
// │ │ E │ ● │ E │ │ │ ● │ │ │
// │ ├───┼───┼───┤ ╲╱ ╲╱╲╱ ╲╱ │
// │ │ C │ E │ C │ │ │ │ │ │
// │ └───┴───┴───┘ ╲╱ ╲╱ │
// │ E=edge C=corner All 6 = edge │
// │ d ≠ d√2 d = d = d │
// └─────────────────────────────────────────────┘
//
// PROPERTY 2: BEST APPROXIMATION OF CIRCLES
// When you "grow" a region outward by 1 step:
// - Squares → diamond/cross shape
// - Hexagons → approximate circle (best discrete approximation)
// This matters for "find all drivers within 500m"
//
// PROPERTY 3: MINIMAL EDGE-TO-AREA RATIO
// Hexagons have the lowest perimeter-to-area ratio of any
// regular polygon that can tile a plane.
// → Fewer boundary effects when aggregating data per cell
//
// PROPERTY 4: EFFICIENT TESSELLATION
// Hexagons tile the plane with no gaps or overlaps
// (unlike circles, which leave gaps or overlap)
Resolution Levels
H3 defines 16 resolution levels (0–15), from global-scale cells covering ~4.3 million km² down to sub-meter cells. Each resolution is approximately 7× finer than the previous one (each hex contains ~7 children).
// H3 Resolution Levels
//
// Res │ Avg Hex Area │ Avg Edge Length │ Use Case
// ────┼───────────────────┼────────────────┼──────────────────────
// 0 │ 4,357,449 km² │ 1,108 km │ Global regions
// 1 │ 609,788 km² │ 419 km │ Continental zones
// 2 │ 86,801 km² │ 158 km │ Country sub-regions
// 3 │ 12,393 km² │ 59 km │ Metro areas
// 4 │ 1,770 km² │ 22 km │ City zones
// 5 │ 252 km² │ 8.5 km │ City districts
// 6 │ 36 km² │ 3.2 km │ Neighborhoods
// 7 │ 5.16 km² │ 1.2 km │ ★ SURGE PRICING
// 8 │ 0.74 km² │ 461 m │ ★ SUPPLY POSITIONING
// 9 │ 0.105 km² │ 174 m │ ★ ETA/MATCHING
// 10 │ 0.015 km² │ 65 m │ Street-level
// 11 │ 0.002 km² │ 25 m │ Building-level
// 12 │ ~329 m² │ 9.4 m │ Parking spots
// 13 │ ~47 m² │ 3.6 m │ Sub-building
// 14 │ ~6.7 m² │ 1.3 m │ Room-level
// 15 │ ~0.9 m² │ 0.5 m │ Sub-meter precision
//
// Uber primarily uses resolutions 7-9 for ride operations:
//
// Resolution 7 (~5 km²): Surge pricing zones
// - Aggregate demand and supply per hex
// - Compute surge multiplier per cell
// - Update every 30-60 seconds
//
// Resolution 8 (~0.7 km²): Supply positioning
// - Predict demand per hex for next 15 minutes
// - Nudge idle drivers toward high-demand cells
//
// Resolution 9 (~100,000 m²): Matching & ETA
// - Find nearest available drivers
// - Compute pickup ETA with street-level precision
// H3 Index encoding (64-bit integer):
// ┌──────┬─────────┬──────────────────────────────────────────┐
// │ Mode │ Res (4b)│ Base Cell (7b) + Digit Path (3b × res) │
// │ (4b) │ │ │
// └──────┴─────────┴──────────────────────────────────────────┘
//
// Example: h3.geoToH3(37.7749, -122.4194, 9)
// → 0x8928308280fffff (San Francisco, res 9)
//
// Operations:
// h3.h3ToGeo(index) → [lat, lng] center point
// h3.h3ToGeoBoundary(index) → [[lat,lng]...] hex vertices
// h3.kRing(index, k) → all hexes within k steps
// h3.h3ToParent(index, res) → parent at coarser resolution
// h3.h3ToChildren(index, res) → children at finer resolution
// h3.h3Distance(a, b) → grid distance between two hexes
// h3.polyfill(polygon, res) → all hexes covering a polygon
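The 64-bit encoding above can be unpacked directly with BigInt arithmetic. This sketch follows H3's published cell-index bit layout (4 mode bits at positions 59-62, 4 resolution bits at 52-55, 7 base-cell bits at 45-51, then 3 bits per resolution digit):

```javascript
// Decode the fixed fields of an H3 cell index from its bit layout.
function decodeH3(index) {
  return {
    mode: Number((index >> 59n) & 0xFn),      // 1 = cell index
    resolution: Number((index >> 52n) & 0xFn),
    baseCell: Number((index >> 45n) & 0x7Fn), // one of 122 base cells
  };
}

const sf = 0x8928308280fffffn; // the San Francisco example above
const fields = decodeH3(sf);
// fields.mode === 1, fields.resolution === 9
```

Because resolution lives in fixed bits, operations like `h3ToParent` reduce to masking off trailing digit bits and rewriting the resolution field, with no coordinate math at all.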
H3 for Surge Pricing
Surge pricing is where H3 truly shines. Every 30-60 seconds, Uber's pricing engine computes a surge multiplier for every H3 cell at resolution 7 in every active city:
// Surge Pricing Pipeline (simplified)
//
// STEP 1: Aggregate demand per H3 cell (res 7)
// - Count ride requests in each cell over last 2 minutes
// - Decay older requests with exponential weighting
//
// STEP 2: Aggregate supply per H3 cell
// - Count available drivers in each cell
// - Include drivers in neighboring cells (kRing, k=1)
// weighted by expected pickup time
//
// STEP 3: Compute supply/demand ratio per cell
// ratio = available_drivers / active_requests
//
// STEP 4: Map ratio to surge multiplier
// if ratio > 1.5: surge = 1.0x (plenty of drivers)
// if ratio = 1.0: surge = 1.0x (balanced)
// if ratio = 0.5: surge = 1.5x (mild shortage)
// if ratio = 0.2: surge = 2.5x (significant shortage)
// if ratio < 0.1: surge = 4.0x+ (extreme shortage)
//
// STEP 5: Spatial smoothing
// Raw per-cell surge creates jarring price cliffs
// Smooth using weighted average of neighboring cells:
//
// smoothed_surge(cell) =
// 0.6 × surge(cell) +
// 0.4 × avg(surge(neighbor) for neighbor in kRing(cell, 1))
//
// This prevents a rider from walking 50m to halve their fare
// Illustrative constants — not Uber's actual values
const WINDOW_2MIN = 120;   // seconds of demand history
const MAX_SURGE = 4.0;     // cap on the multiplier
const clamp = (x, lo, hi) => Math.min(Math.max(x, lo), hi);

function computeSurgeForCity(cityPolygon) {
  const cells = h3.polyfill(cityPolygon, 7);
  const surgeMap = {};
  for (const cell of cells) {
    const demand = getRecentRequests(cell, WINDOW_2MIN);
    const neighbors = h3.kRing(cell, 1); // includes the cell itself
    const supply = neighbors.reduce((sum, n) => {
      const weight = n === cell ? 1.0 : 0.5; // discount neighbors
      return sum + getAvailableDrivers(n) * weight;
    }, 0);
    const ratio = supply / Math.max(demand, 1);
    surgeMap[cell] = clamp(surgeFunction(ratio), 1.0, MAX_SURGE);
  }
  return spatialSmoothing(surgeMap);
}
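The smoothing step referenced in the pipeline can be sketched concretely. This implements the 0.6/0.4 weighted average from the formula above; a neighbors function is injected so the sketch runs without the h3 library, and the cell labels are purely illustrative.

```javascript
// Sketch of spatial smoothing: smoothed = 0.6·own + 0.4·mean(neighbors).
function spatialSmoothing(surgeMap, neighborsOf) {
  const smoothed = {};
  for (const cell of Object.keys(surgeMap)) {
    const nbrs = neighborsOf(cell).filter(n => n in surgeMap && n !== cell);
    const avg = nbrs.length
      ? nbrs.reduce((s, n) => s + surgeMap[n], 0) / nbrs.length
      : surgeMap[cell]; // isolated cell: keep its own value
    smoothed[cell] = 0.6 * surgeMap[cell] + 0.4 * avg;
  }
  return smoothed;
}

// Toy neighborhood: three mutually adjacent cells
const surge = { a: 2.0, b: 1.0, c: 1.0 };
const neighbors = { a: ['b', 'c'], b: ['a', 'c'], c: ['a', 'b'] };
const out = spatialSmoothing(surge, cell => neighbors[cell]);
// out.a = 0.6·2.0 + 0.4·1.0 = 1.6 — the 2.0x spike is pulled down
```

Note the effect: the hot cell's 2.0x drops to 1.6x while its neighbors rise to 1.2x, flattening the price cliff a rider could otherwise step across.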
Geofence Service
Every time a rider opens the Uber app, the system needs to determine: What city is this user in? What zone? What products are available? This is the job of the Geofence Service — a point-in-polygon engine that handles billions of lookups per day.
// Geofence Service Architecture
//
// Problem: Given a (lat, lng), determine which of ~100,000 polygons
// (cities, zones, airports, surge areas, regulatory regions) contain it
//
// Naive approach: Test every polygon → O(N × P) where N = polygons,
// P = vertices per polygon. At 100K polygons × 50 vertices = 5M
// operations per lookup. At 100K QPS = 500 billion ops/second. No.
//
// Uber's approach: H3-BASED SPATIAL INDEX
//
// 1. OFFLINE PREPROCESSING:
// For each polygon, compute the set of H3 cells (res 12) it covers:
// cells = h3.polyfill(polygon, 12)
// Store: cell → [polygon_id_1, polygon_id_2, ...]
// This is an inverted index from H3 cells to polygons.
//
// 2. ONLINE LOOKUP:
// Given (lat, lng):
// a) Compute H3 cell: cell = h3.geoToH3(lat, lng, 12)
// b) Look up candidate polygons: candidates = index[cell]
// c) For each candidate, do precise point-in-polygon test
// d) Return matching polygons with metadata
//
// Result: Average lookup checks 1-3 polygons instead of 100,000
// Latency: p50 < 1ms, p99 < 5ms
//
// ┌─────────────┐ ┌──────────────┐ ┌──────────────┐
// │ (37.7, -122)│────▶│ H3 Cell Idx │────▶│ Candidates: │
// │ lat, lng │ │ res 12 lookup│ │ SF, Zone 3, │
// └─────────────┘ └──────────────┘ │ Airport Zone │
// └──────┬───────┘
// │
// ┌──────▼───────┐
// │ Point-in-Poly│
// │ ✓ SF │
// │ ✓ Zone 3 │
// │ ✗ Airport │
// └──────────────┘
// Geofence types at Uber:
// - City boundaries (which city are you in?)
// - Product zones (UberX available here, but not Uber Lux)
// - Surge zones (different surge multipliers per zone)
// - Airport geofences (special pickup/dropoff rules)
// - Regulatory regions (different pricing rules, tax rates)
// - Dynamic event zones (concerts, sports events → temporary surge)
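The two-stage lookup can be demonstrated end to end. In this self-contained sketch a toy lat/lng grid snap stands in for `h3.geoToH3`, and the polygon and cell labels are hypothetical; the structure (coarse inverted index, then exact ray-casting test) matches the steps above.

```javascript
// Toy cell function: snap coordinates to a 0.1-degree grid cell id.
const cellOf = (x, y) => `${Math.floor(x * 10)}:${Math.floor(y * 10)}`;

// Exact test: ray casting (count edge crossings). poly is [[x, y], ...].
function pointInPolygon([x, y], poly) {
  let inside = false;
  for (let i = 0, j = poly.length - 1; i < poly.length; j = i++) {
    const [xi, yi] = poly[i], [xj, yj] = poly[j];
    if ((yi > y) !== (yj > y) &&
        x < ((xj - xi) * (y - yi)) / (yj - yi) + xi) inside = !inside;
  }
  return inside;
}

const polygons = { // hypothetical geofence: lng/lat rectangle
  sf_downtown: [[-122.42, 37.77], [-122.40, 37.77],
                [-122.40, 37.79], [-122.42, 37.79]],
};

// OFFLINE: index each polygon under every cell its bounding box touches
const index = {};
for (const [id, poly] of Object.entries(polygons)) {
  const xs = poly.map(p => p[0]), ys = poly.map(p => p[1]);
  for (let cx = Math.floor(Math.min(...xs) * 10); cx <= Math.floor(Math.max(...xs) * 10); cx++)
    for (let cy = Math.floor(Math.min(...ys) * 10); cy <= Math.floor(Math.max(...ys) * 10); cy++)
      (index[`${cx}:${cy}`] = index[`${cx}:${cy}`] || []).push(id);
}

// ONLINE: cheap cell lookup narrows candidates, exact test confirms
function locate(x, y) {
  const candidates = index[cellOf(x, y)] || [];
  return candidates.filter(id => pointInPolygon([x, y], polygons[id]));
}
// locate(-122.41, 37.78) → ['sf_downtown']
```

The index does all the work: a point outside every indexed cell returns immediately with zero polygon tests, which is how the real service averages 1-3 exact tests per lookup.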
MAST: Multi-Tenant Storage
MAST (Multi-tenancy Abstraction for Storage) is Uber's storage abstraction layer that sits between application services and the underlying storage engines (MySQL/Schemaless, Cassandra, Redis). It provides multi-tenancy, quota management, traffic shaping, and storage engine abstraction.
// MAST Architecture
//
// ┌──────────┐ ┌──────────┐ ┌──────────┐
// │ Trip Svc │ │ User Svc │ │ Pay Svc │
// └─────┬────┘ └─────┬────┘ └─────┬────┘
// │ │ │
// └────────────┼────────────┘
// │
// ┌─────────▼──────────┐
// │ MAST │
// │ ┌──────────────┐ │
// │ │ Tenant Mgmt │ │ ← Quota per service
// │ │ Rate Limiter │ │ ← Prevent noisy neighbors
// │ │ Query Router │ │ ← Route to correct engine
// │ │ Schema Cache │ │ ← Validate requests
// │ └──────────────┘ │
// └────────┬───────────┘
// │
// ┌───────────┼───────────┐
// ▼ ▼ ▼
// ┌──────────┐ ┌─────────┐ ┌───────┐
// │Schemaless│ │Cassandra│ │ Redis │
// │ (MySQL) │ │ │ │ │
// └──────────┘ └─────────┘ └───────┘
//
// Key features:
// - Each service gets a tenant with allocated IOPS quota
// - Noisy neighbor protection: if Trip Service spikes writes,
// it doesn't degrade Payment Service
// - Transparent failover between storage engines
// - Schema evolution management across storage backends
// - Metrics per tenant: latency, throughput, error rates
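The noisy-neighbor protection above is typically built from per-tenant token buckets. The sketch below is a generic illustration of that technique (the tenant names and quota numbers are invented), not MAST's actual rate limiter.

```javascript
// Per-tenant token-bucket limiter: each tenant refills at its own rate
// up to a burst cap, so one tenant's spike cannot consume another's budget.
class TenantLimiter {
  constructor() { this.buckets = new Map(); }
  register(tenant, ratePerSec, burst) {
    this.buckets.set(tenant, { tokens: burst, ratePerSec, burst, last: 0 });
  }
  allow(tenant, nowMs) {
    const b = this.buckets.get(tenant);
    if (!b) return false;                 // unknown tenant
    const elapsed = (nowMs - b.last) / 1000;
    b.tokens = Math.min(b.burst, b.tokens + elapsed * b.ratePerSec);
    b.last = nowMs;
    if (b.tokens < 1) return false;       // over quota: shed the request
    b.tokens -= 1;
    return true;
  }
}

const limiter = new TenantLimiter();
limiter.register('trip-service', 100, 2);    // 100 req/s, burst of 2
limiter.register('payment-service', 100, 2);

limiter.allow('trip-service', 0);    // true
limiter.allow('trip-service', 0);    // true
limiter.allow('trip-service', 0);    // false — burst exhausted
limiter.allow('payment-service', 0); // true — isolated quota
```

The isolation property is the point: `trip-service` burning its bucket has no effect on `payment-service`'s bucket.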
Marketplace: Real-Time Supply/Demand Balancing
The Marketplace platform is the brain of Uber — it decides who gets matched with whom, what the price should be, and where drivers should be positioned. It's a real-time optimization engine running on top of H3, processing millions of decisions per minute.
The Matching Engine
// Matching Algorithm (simplified)
//
// Input:
// - Ride request: {rider_id, pickup_lat, pickup_lng, product_type}
// - Available drivers in area: [{driver_id, lat, lng, rating, ...}]
//
// Uber's matching has evolved through several generations:
//
// GEN 1: NEAREST DRIVER (2012-2014)
// Sort drivers by distance to pickup → assign closest
// Problem: globally suboptimal. Driver A is closest to Rider 1
// but even closer to Rider 2 who will request in 10 seconds.
//
// GEN 2: BATCHED MATCHING (2015-2017)
// Collect requests in a small window (1-2 seconds)
// Solve assignment as a bipartite matching problem
// Minimize total ETA across all rider-driver pairs
// Uses Hungarian Algorithm variant → O(n³) but n is small per batch
//
// GEN 3: FORWARD-LOOKING MATCHING (2018+)
// Incorporate predicted future demand
// A driver completing a trip near a high-demand area
// might be "reserved" for an upcoming request there
// rather than dispatched to a distant current request
//
// ┌────────────────────────────────────────────────────┐
// │ BATCHED MATCHING EXAMPLE                           │
// │                                                    │
// │ ETA matrix (minutes):    D1    D2    D3            │
// │   R1                      3     8    25            │
// │   R2                      2    15     5            │
// │   R3                     12     4    10            │
// │                                                    │
// │ Greedy nearest-driver (R2's request lands first):  │
// │   R2→D1 (2), R1→D2 (8), R3→D3 (10)                 │
// │   Total ETA = 2 + 8 + 10 = 20 min                  │
// │                                                    │
// │ Optimal batched assignment:                        │
// │   R1→D1 (3), R2→D3 (5), R3→D2 (4)                  │
// │   Total ETA = 3 + 5 + 4 = 12 min ✓                 │
// └────────────────────────────────────────────────────┘
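For a batch this small, the optimal assignment can be found by brute force. This sketch enumerates all rider→driver permutations over a 3×3 ETA matrix and picks the minimum total; production systems use Hungarian-style solvers instead, but the objective is the same.

```javascript
// Brute-force batched matching: minimize total ETA over all assignments.
const eta = [
  [3, 8, 25],   // R1 → D1, D2, D3 (minutes)
  [2, 15, 5],   // R2
  [12, 4, 10],  // R3
];

function permutations(arr) {
  if (arr.length <= 1) return [arr];
  return arr.flatMap((d, i) =>
    permutations([...arr.slice(0, i), ...arr.slice(i + 1)]).map(p => [d, ...p]));
}

function bestAssignment(matrix) {
  let best = null;
  for (const perm of permutations([...matrix.keys()])) {
    // perm[rider] = assigned driver index
    const total = perm.reduce((sum, driver, rider) => sum + matrix[rider][driver], 0);
    if (!best || total < best.total) best = { perm, total };
  }
  return best;
}

const optimal = bestAssignment(eta);
// optimal.total === 12 via R1→D1, R2→D3, R3→D2 (perm [0, 2, 1])
```

Brute force is O(n!) and only viable because each batch window contains a handful of riders per area; the Hungarian algorithm brings the same optimum down to O(n³).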
// Matching score function (multi-objective):
function matchScore(rider, driver) {
const eta = computeETA(driver.location, rider.pickup);
const driverRating = driver.rating;
const driverCompletionRate = driver.completionRate;
const predictedDemand = getFutureDemand(driver.location, 15); // 15 min
return (
-0.6 * normalize(eta) + // minimize ETA (highest weight)
0.15 * normalize(driverRating) + // prefer higher-rated drivers
0.10 * normalize(driverCompletionRate) +
-0.15 * normalize(predictedDemand) // don't pull drivers from
// soon-to-be-busy areas
);
}
Dynamic Pricing Algorithm
// Uber's Pricing Pipeline
//
// fare = baseFare + (perMile × distance) + (perMinute × duration)
// fare = fare × surgeMultiplier
// fare = max(fare, minimumFare)
// fare = fare + tolls + surcharges + bookingFee
//
// But the REAL complexity is in upfront pricing:
//
// UPFRONT PRICING (shown to rider before requesting):
// 1. Route prediction: ML model predicts likely route
// 2. Duration prediction: ML model factors in real-time traffic
// 3. Surge computation: H3-based supply/demand ratio
// 4. Price compilation: base + distance + time + surge
// 5. Price locking: rider sees and agrees to a price
// 6. Settlement: actual fare = upfront price (recomputed only if the
//    route or duration deviates substantially from the prediction)
//
// This is a COMMITMENT — Uber bears the risk of:
// - Traffic being worse than predicted (rider still pays quoted price)
// - Driver taking a longer route (driver gets paid, Uber absorbs diff)
// - Surge changing during the ride (locked at request time)
//
// The ML models for upfront pricing are trained on billions of
// historical trips and continuously updated via Michelangelo.
// Surge pricing economics:
// Surge is NOT just "charge more when busy"
// It's a two-sided market equilibrium mechanism:
//
// When demand > supply in an area:
// 1. Higher price → some riders delay/cancel (demand reduces)
// 2. Higher earnings → nearby drivers move toward surge area
// 3. Higher earnings → offline drivers come online
// 4. Equilibrium: supply meets reduced demand at surge price
//
// Without surge: riders wait 20+ minutes, drivers don't move
// With surge: riders who value the ride pay more, get it in 5 min
// Other riders wait for surge to drop (or take transit)
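The fare formula from the pipeline above can be written as a function. All rates here are illustrative placeholders, not Uber's actual pricing parameters.

```javascript
// The fare pipeline as code: base + distance + time, then surge,
// then minimum-fare floor, then fixed add-ons, rounded to cents.
function computeFare(trip, rates) {
  let fare = rates.baseFare +
             rates.perMile * trip.distanceMiles +
             rates.perMinute * trip.durationMinutes;
  fare *= trip.surgeMultiplier;                // surge applies to the metered part
  fare = Math.max(fare, rates.minimumFare);    // floor before add-ons
  fare += trip.tolls + rates.bookingFee;       // pass-throughs and fees
  return Math.round(fare * 100) / 100;         // round to cents
}

const fare = computeFare(
  { distanceMiles: 3.2, durationMinutes: 15, surgeMultiplier: 1.2, tolls: 0 },
  { baseFare: 2.0, perMile: 1.5, perMinute: 0.4, minimumFare: 7.0, bookingFee: 2.3 }
);
// 2.00 + 4.80 + 6.00 = 12.80 → ×1.2 = 15.36 → + 2.30 = 17.66
```

Note the ordering choices encoded here (surge before the minimum-fare floor, fees after both): in upfront pricing, that ordering is part of the price the rider locks in.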
Peloton: Resource Scheduling
Peloton is Uber's unified resource scheduler, managing millions of containers across multiple data centers. Built on top of Apache Mesos, it sits in the same problem space as Kubernetes, but was purpose-built for Uber's specific workload patterns: a mix of stateless microservices, stateful databases, batch jobs, and ML training workloads.
// Peloton Architecture
//
// ┌──────────────────────────────────────────────────┐
// │ PELOTON │
// │ │
// │ ┌────────────┐ ┌──────────────┐ ┌──────────┐ │
// │ │ Resource │ │ Placement │ │ Job │ │
// │ │ Manager │ │ Engine │ │ Manager │ │
// │ │ │ │ │ │ │ │
// │ │ Tracks all │ │ Bin-packing │ │ Lifecycle │ │
// │ │ host │ │ & scheduling │ │ of jobs │ │
// │ │ resources │ │ strategies │ │ & tasks │ │
// │ └────────────┘ └──────────────┘ └──────────┘ │
// │ │
// │ ┌────────────┐ ┌──────────────┐ │
// │ │ Host │ │ Volume │ │
// │ │ Manager │ │ Manager │ │
// │ │ │ │ │ │
// │ │ Manages │ │ Persistent │ │
// │ │ Mesos │ │ storage │ │
// │ │ agents │ │ for stateful │ │
// │ └────────────┘ └──────────────┘ │
// └──────────────────────────────────────────────────┘
//
// Key innovations:
//
// 1. ELASTIC RESOURCE PARTITIONING
// Resources divided into pools: stateless, stateful, batch
// Pools can borrow from each other when idle
// Batch jobs run on resources temporarily unused by services
//
// 2. PREEMPTION
// Priority-based: production services > batch jobs > experiments
// If a production service needs more resources,
// batch jobs are gracefully preempted (checkpointed + resumed)
//
// 3. PLACEMENT STRATEGIES
// - Spread: distribute across failure domains (racks, DCs)
// - Pack: bin-pack for efficiency (batch workloads)
// - Affinity: co-locate services that communicate heavily
// - Anti-affinity: separate replicas across hosts
//
// 4. SLA-BASED ADMISSION
// Each job declares an SLA: "99.9% of tasks running within 5s"
// Peloton tracks SLA compliance and prioritizes placement
Data Infrastructure at Uber Scale
Uber's data infrastructure processes over 1 trillion Kafka messages per day. The data platform team manages one of the largest Kafka deployments in the world, alongside massive Hadoop, Hive, Presto, and Spark clusters.
Apache Kafka: The Central Nervous System
// Kafka at Uber — The Numbers
//
// - 1 trillion+ messages per day
// - Petabytes of data per day in transit
// - Multiple Kafka clusters per data center
// - 10,000+ topics across all clusters
// - Avg message size: ~500 bytes to 10KB
// - Peak throughput: tens of GB/s per cluster
//
// Uber's Kafka Use Cases:
//
// 1. EVENT SOURCING
// Trip state changes → trip-events topic
// Every status transition is an immutable event
// Consumers: analytics, ML features, billing, safety
//
// 2. LOG AGGREGATION
// All service logs → centralized via Kafka
// 5,000+ services × thousands of instances = massive log volume
// Consumers: ELK stack, anomaly detection, debugging tools
//
// 3. REAL-TIME ANALYTICS
// Driver locations → location-events topic (100M+/sec)
// Consumed by: ETA engine, surge calculator, supply heatmaps
//
// 4. CHANGE DATA CAPTURE
// MySQL binlog → Kafka → downstream services
// Schemaless added_id tailing → Kafka → secondary indexes
//
// 5. INTER-SERVICE COMMUNICATION
// Async communication between microservices
// Preferred over synchronous RPC for non-latency-critical paths
// Uber's Kafka Enhancements:
//
// uREPLICATOR:
// - Cross-datacenter Kafka replication
// - Active-active setup: produce in one DC, consume in both
// - Handles topic-level routing and filtering
//
// DEAD LETTER QUEUE (DLQ):
// - Failed messages automatically routed to DLQ topics
// - Dashboard for inspecting and replaying failed messages
// - Per-consumer retry policies with exponential backoff
//
// CLUSTER MANAGEMENT:
// - Automated partition rebalancing
// - Online broker decommissioning
// - Dynamic topic configuration (retention, replication factor)
// - Quota management per producer/consumer
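The retry-then-DLQ policy described above can be sketched with an in-memory list standing in for a real DLQ topic. All names are illustrative; the injectable `sleep` is just to keep the sketch testable where a real consumer would block.

```python
def consume(message, handler, dlq, max_retries=3, base_delay=0.01,
            sleep=lambda s: None):
    """Try a handler with exponential backoff; after max_retries the
    message is routed to the DLQ instead of blocking the partition."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return handler(message)
        except Exception:
            if attempt == max_retries:
                dlq.append(message)  # stand-in for producing to a DLQ topic
                return None
            sleep(delay)
            delay *= 2               # backoff: base, 2x base, 4x base, ...

dlq = []
attempts = {"n": 0}

def flaky(msg):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return msg.upper()

ok = consume("hello", flaky, dlq)      # succeeds on the 3rd attempt
consume("bad", lambda m: 1 / 0, dlq)   # always fails, lands in the DLQ
```

The crucial property is the last branch: a poison message is parked for later inspection and replay rather than stalling every message behind it in the partition.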
// Kafka Consumer Groups at Scale
//
// Challenge: 10,000+ consumer groups × 10,000+ topics
// Kafka's consumer group protocol doesn't scale well:
// - Rebalancing takes O(consumers × partitions) time
// - A single slow consumer triggers rebalance for ALL consumers
//
// Uber's Solution: Consumer Proxy
// - Thin proxy layer between Kafka and consumers
// - Manages partition assignment outside Kafka's protocol
// - Incremental rebalancing (only reassign affected partitions)
// - Consumer health monitoring + automatic eviction
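The difference between Kafka's classic eager rebalance and the proxy's incremental reassignment can be shown with a toy assignment model. This is a sketch of the idea only; the real Consumer Proxy protocol is considerably more involved.

```python
def eager_rebalance(partitions, consumers):
    """Kafka's classic protocol: on any membership change, ALL partitions
    are revoked and reassigned, pausing every consumer in the group."""
    members = sorted(consumers)
    return {p: members[i % len(members)]
            for i, p in enumerate(sorted(partitions))}

def incremental_rebalance(assignment, departed):
    """Consumer Proxy style: only the departed member's partitions move;
    the survivors keep their assignments and never stop consuming."""
    survivors = sorted(set(assignment.values()) - {departed})
    new = dict(assignment)
    orphaned = sorted(p for p, c in assignment.items() if c == departed)
    for i, p in enumerate(orphaned):
        new[p] = survivors[i % len(survivors)]
    return new

before = eager_rebalance(range(6), ["c1", "c2", "c3"])
after = incremental_rebalance(before, "c2")
moved = [p for p in before if before[p] != after[p]]  # only c2's partitions
```

With 10,000+ consumer groups, the difference between "everything moves" and "only the affected partitions move" is the difference between a platform-wide pause and a localized blip.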
Data Lake & Analytics
// Uber's Data Analytics Stack
//
// ┌──────────────────────────────────────────────────┐
// │                   DATA SOURCES                   │
// │  Kafka │ Schemaless │ MySQL │ Cassandra │ APIs   │
// └────────┬───────────────────────────────┬─────────┘
//          │                               │
//          ▼                               ▼
// ┌───────────────────┐           ┌───────────────────┐
// │    Apache Hive    │           │     Real-Time     │
// │    (Data Lake)    │           │    Processing     │
// │                   │           │  (Apache Flink)   │
// │   Petabytes of    │           │                   │
// │  historical data  │           │ Stream analytics  │
// │   Parquet format  │           │   Real-time ETL   │
// └────────┬──────────┘           └────────┬──────────┘
//          │                               │
//          ▼                               ▼
// ┌───────────────────┐           ┌───────────────────┐
// │      Presto       │           │   Apache Spark    │
// │   (Interactive)   │           │    (Batch ETL)    │
// │                   │           │                   │
// │    Ad-hoc SQL     │           │   Feature eng.    │
// │    Dashboards     │           │  Model training   │
// │   < 30s queries   │           │  Large-scale ETL  │
// └───────────────────┘           └───────────────────┘
//
// Uber's Data Quality:
// - Automated data quality checks on every ETL pipeline
// - Schema registry enforces backward compatibility
// - Data lineage tracking: trace any metric to its source
// - SLA monitoring: alert if a pipeline is late
//
// Scale:
// - 100+ Petabytes in the Hive data lake
// - 100,000+ Presto queries per day
// - 10,000+ scheduled Spark jobs
// - 500+ data engineers and scientists
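The backward-compatibility rule the schema registry enforces boils down to: a new reader schema may drop fields or add fields with defaults, but must never add a required field. The dict-based schema representation below is a simplification of Avro-style rules, not Uber's actual registry API.

```python
def is_backward_compatible(old_fields, new_fields):
    """True if records written with old_fields can still be read with
    new_fields. Each schema maps field name -> spec dict; a spec may
    carry a "default" key."""
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False  # new required field: old records undecodable
    return True

v1 = {"trip_id": {}, "fare": {}}
v2_ok = {"trip_id": {}, "fare": {}, "tip": {"default": 0}}  # added w/ default
v2_bad = {"trip_id": {}, "fare": {}, "currency": {}}        # added, required
```

Running this check in CI before a producer ships a schema change is what lets thousands of downstream pipelines consume topics without coordinating deployments.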
Michelangelo: Uber's ML Platform
Michelangelo is Uber's end-to-end machine learning platform. Named after the Renaissance artist who saw the sculpture within the marble, Michelangelo aims to make building, deploying, and monitoring ML models as easy as building a CRUD app. It handles everything from feature engineering to model serving at massive scale.
// Michelangelo ML Pipeline
//
// ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
// │ Feature │───▶│ Model │───▶│ Model │───▶│ Model │
// │ Store │ │ Training │ │ Evaluate │ │ Serving │
// └──────────┘ └──────────┘ └──────────┘ └──────────┘
//
// ═══════════════════════════════════════════════════════════════
//
// 1. FEATURE STORE (Palette)
// ┌─────────────────────────────────────────────────────┐
// │ Central repository of curated, reusable features │
// │ │
// │ Offline Features (Hive): │
// │ - rider_avg_trip_distance_30d │
// │ - driver_rating_last_500_trips │
// │ - city_demand_hour_of_week │
// │ │
// │ Online Features (Cassandra): │
// │ - rider_last_trip_time (real-time) │
// │ - driver_current_location (real-time) │
// │ - surge_current_h3_cell (real-time) │
// │ │
// │ Feature pipelines compute and store features │
// │ Same features used in training AND serving │
// │ Eliminates training/serving skew! │
// └─────────────────────────────────────────────────────┘
//
// 2. MODEL TRAINING
// - Supports: XGBoost, TensorFlow, PyTorch, LightGBM
// - Distributed training on Peloton-managed GPU clusters
// - Hyperparameter tuning via Bayesian optimization
// - Automated feature selection
// - Training jobs: define features + labels + algorithm
// → Michelangelo handles data loading, splitting, training
//
// 3. MODEL EVALUATION
// - Automatic A/B testing framework
// - Canary deployment: 1% → 5% → 25% → 100%
// - Statistical significance testing before full rollout
// - Model comparison: accuracy, latency, feature importance
// - Bias detection: check for disparate impact across demographics
//
// 4. MODEL SERVING
// - Online serving: gRPC endpoint, p99 < 10ms
// - Batch prediction: scheduled via Spark
// - Feature fetching: automatic from Feature Store
// - Model versioning: rollback in < 1 minute
// - Shadow mode: run new model alongside old, compare results
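The skew-elimination idea behind the Feature Store can be made concrete: register each feature's computation exactly once, then have both the offline training path and the online serving path call the same function. The registry shape and feature name below are hypothetical, not Palette's real API.

```python
FEATURES = {}

def feature(name):
    """Register a feature computation exactly once."""
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register

@feature("rider_avg_trip_distance_30d")
def avg_trip_distance(trips):
    return sum(t["km"] for t in trips) / len(trips) if trips else 0.0

def training_row(trips):
    # offline path: a batch job over the data lake calls the registry
    return {name: fn(trips) for name, fn in FEATURES.items()}

def serving_row(trips):
    # online path: the model server calls the SAME registry at request time
    return {name: fn(trips) for name, fn in FEATURES.items()}

trips = [{"km": 4.0}, {"km": 6.0}]
```

Training/serving skew usually creeps in when the batch pipeline and the serving code reimplement "the same" feature independently; a single registered implementation removes that failure mode by construction.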
// ML Models at Uber (selected):
//
// ETA Prediction:
// Input: route, traffic, weather, time-of-day, driver
// Output: estimated arrival time (minutes)
// Impact: shown to riders, affects matching decisions
// Architecture: Gradient boosted trees + deep learning ensemble
//
// Upfront Pricing:
// Input: route, ETA, demand/supply, rider history
// Output: fare estimate (locked price)
// Impact: $138B+ in bookings priced by this model
//
// Fraud Detection:
// Input: trip patterns, payment history, device fingerprint
// Output: fraud probability (0-1)
// Impact: saves hundreds of millions in fraud annually
// Architecture: Real-time scoring + offline batch analysis
//
// One-Click Chat (NLP):
// Input: rider/driver support message
// Output: suggested response or automated resolution
// Impact: resolves 30%+ of support tickets automatically
Observability: Jaeger & M3
With 5,000+ microservices, observability is not optional — it's survival. Uber built and open-sourced two critical observability systems: Jaeger for distributed tracing and M3 for metrics.
Jaeger: Distributed Tracing
When a ride request traverses 20+ services, how do you debug latency? Jaeger (named after the German word for "hunter") was created at Uber and later donated to the CNCF (Cloud Native Computing Foundation), where it became a graduated project.
// Jaeger Architecture
//
//  ┌──────────────┐
//  │  Service A   │
//  │ (instrumented│
//  │   with SDK)  │
//  └──────┬───────┘
//         │ spans
//         ▼
//  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
//  │ Jaeger Agent │────▶│    Jaeger    │────▶│    Jaeger    │
//  │  (sidecar,   │     │  Collector   │     │    Query     │
//  │   UDP recv)  │     │  (validates, │     │    (API +    │
//  └──────────────┘     │   indexes)   │     │  Jaeger UI)  │
//                       └──────┬───────┘     └──────────────┘
//                              │
//                       ┌──────▼─────────┐
//                       │    Storage     │
//                       │  (Cassandra /  │
//                       │ Elasticsearch) │
//                       └────────────────┘
//
// Span: A unit of work within a trace
// Trace: A tree of spans representing an end-to-end request
//
// Example Trace: Ride Request
// ┌─────────────────────────────────────────────────┐
// │ Trace ID: abc-123 │
// │ │
// │ [API Gateway ] 45ms │
// │ ├─[Auth Service ] 3ms │
// │ ├─[Trip Service ] 38ms │
// │ │ ├─[Pricing Svc ] 12ms │
// │ │ │ └─[Surge Svc] 5ms │
// │ │ ├─[Location Svc ] 8ms │
// │ │ └─[Matching Svc ] 15ms │
// │ │ ├─[Driver DB ] 4ms │
// │ │ └─[ETA Svc ] 9ms │
// │ └─[Geofence Svc ] 2ms │
// └─────────────────────────────────────────────────┘
//
// Jaeger Key Features:
// - OpenTracing compatible (now OpenTelemetry)
// - Adaptive sampling: high-volume services → lower sample rate
// Error/slow requests → always sampled (100%)
// - Service dependency graph auto-generated from traces
// - Latency histograms per service per operation
// - Compare traces: "why was this request slow vs. normal?"
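The "service dependency graph auto-generated from traces" feature reduces to walking span parent links. A minimal sketch over a trace shaped like the example above; the span dicts are a simplification of Jaeger's actual data model.

```python
def dependency_edges(spans):
    """Derive caller -> callee service edges from parent/child spans.
    An edge is emitted only when a child span runs in a different
    service than its parent (same-service spans are internal work)."""
    by_id = {s["id"]: s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get(s.get("parent"))
        if parent and parent["service"] != s["service"]:
            edges.add((parent["service"], s["service"]))
    return edges

trace = [
    {"id": 1, "parent": None, "service": "api-gateway"},
    {"id": 2, "parent": 1, "service": "auth"},
    {"id": 3, "parent": 1, "service": "trip"},
    {"id": 4, "parent": 3, "service": "pricing"},
    {"id": 5, "parent": 3, "service": "matching"},
]
edges = dependency_edges(trace)
```

Aggregated over millions of traces, these edges yield a live architecture diagram that no hand-maintained wiki page can match.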
// Sampling Strategies:
//
// CONST: Sample everything (low traffic) or nothing
// sampler: { type: "const", param: 1 } // sample 100%
//
// PROBABILISTIC: Sample X% of traces
// sampler: { type: "probabilistic", param: 0.01 } // sample 1%
//
// RATE LIMITING: Sample N traces per second
// sampler: { type: "rateLimiting", param: 10 } // 10/sec
//
// ADAPTIVE (Uber's default):
// Automatically adjusts sampling rate per endpoint
// High-traffic endpoint → lower rate (enough traces to analyze)
// Low-traffic endpoint → higher rate (don't miss rare events)
// Target: consistent traces/second per endpoint regardless of QPS
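The core arithmetic of adaptive sampling is simple: pick a per-endpoint probability so that sampled traces per second stay near a fixed target, and always keep errors. A sketch with parameter names of our own choosing:

```python
def sampling_probability(target_traces_per_sec, observed_qps):
    """Probability that yields ~target sampled traces/sec at this QPS."""
    if observed_qps <= 0:
        return 1.0
    return min(1.0, target_traces_per_sec / observed_qps)

def should_sample(is_error, probability, coin):
    """Errors are always kept; otherwise flip the weighted coin
    (coin is a uniform random draw in [0, 1))."""
    return is_error or coin < probability

hot = sampling_probability(5, 10_000)  # busy endpoint: tiny sample rate
cold = sampling_probability(5, 2)      # quiet endpoint: sample everything
```

Recomputing the probability periodically from observed QPS is what makes the scheme adaptive: as an endpoint's traffic grows 100x, its sample rate automatically shrinks 100x and storage cost stays flat.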
M3: Metrics at Scale
// M3 — Uber's Metrics Platform
//
// M3 was built because existing solutions (Graphite, InfluxDB,
// Prometheus) couldn't handle Uber's scale:
// - Billions of time series
// - Hundreds of millions of datapoints per second
// - Sub-second query latency on dashboards
// - Multi-datacenter aggregation
//
// ┌─────────────────────────────────────────────────────┐
// │ M3 STACK │
// │ │
// │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
// │ │ M3 │ │ M3 │ │ M3 │ │
// │ │ Collector │ │ Aggregator│ │ Query │ │
// │ │ │ │ │ │ │ │
// │ │ Receives │ │ Rollups │ │ PromQL │ │
// │ │ metrics │ │ Downsampl │ │ compatible│ │
// │ │ from apps │ │ Cross-DC │ │ Grafana │ │
// │ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ │
// │ │ │ │ │
// │ └──────────────┼──────────────┘ │
// │ │ │
// │ ┌───────▼────────┐ │
// │ │ M3DB │ │
// │ │ │ │
// │ │ Distributed │ │
// │ │ time-series DB │ │
// │ │ Built-in │ │
// │ │ compression │ │
// │ │ (Gorilla- │ │
// │ │ encoding) │ │
// │ └────────────────┘ │
// └─────────────────────────────────────────────────────┘
//
// M3DB Key Design Decisions:
//
// 1. GORILLA COMPRESSION
// Time-series data is highly compressible:
// - Timestamps: delta-of-delta encoding (~1 bit/sample)
// - Values: XOR encoding (~0.5 bits/sample for similar values)
// Result: 12x compression vs. naive float64 storage
//
// 2. INVERTED INDEX
// Fast tag-based queries: service=trip-svc AND env=prod
// Built on top of Roaring Bitmaps for set operations
//
// 3. WRITE-AHEAD LOG + LSM
// Writes go to WAL → memtable → flush to compressed blocks
// Optimized for write-heavy workloads
//
// 4. MULTI-DC REPLICATION
// Writes replicated across DCs
// Reads can query any DC (eventual consistency OK for metrics)
//
// M3 at Uber:
// - 6.6 billion+ unique time series stored
// - Hundreds of millions of datapoints ingested per second
// - Dashboard query p99 < 2 seconds
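Both compression tricks from the M3DB section can be demonstrated in a few lines. This is a sketch of the ideas only; real Gorilla/M3DB encoders then bit-pack these streams with variable-length codes.

```python
import struct

def delta_of_delta(timestamps):
    """Regularly spaced timestamps make this stream almost all zeros,
    and zeros bit-pack down to ~1 bit per sample."""
    out, prev_ts, prev_delta = [], None, None
    for ts in timestamps:
        if prev_ts is None:
            out.append(ts)                           # first timestamp raw
        elif prev_delta is None:
            out.append(ts - prev_ts)                 # first delta raw
        else:
            out.append((ts - prev_ts) - prev_delta)  # usually 0
        if prev_ts is not None:
            prev_delta = ts - prev_ts
        prev_ts = ts
    return out

def xor_bits(a, b):
    """XOR of consecutive float64 bit patterns: identical values give 0,
    similar values give mostly-zero words that compress well."""
    ua = struct.unpack("<Q", struct.pack("<d", a))[0]
    ub = struct.unpack("<Q", struct.pack("<d", b))[0]
    return ua ^ ub

# metrics reported every ~10s: the delta-of-delta stream collapses
dod = delta_of_delta([1000, 1010, 1020, 1030, 1041])
```

Both encodings exploit the same regularity: monitoring data arrives on a schedule and changes slowly, so what you actually store is "same as last time" almost everywhere.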
Lessons Learned
Uber's engineering journey offers several powerful lessons for system designers:
1. Build vs. Buy at Scale
Uber built Schemaless (instead of using Cassandra), Ringpop (instead of ZooKeeper), Peloton (instead of Kubernetes), M3 (instead of Prometheus), and Michelangelo (instead of an off-the-shelf ML platform). At Uber's scale, the operational cost of adapting a general-purpose tool often exceeds the cost of building a purpose-built one. But this holds only beyond a certain scale: startups should almost always buy or adopt off-the-shelf tools.
2. Domain-Driven Infrastructure
H3 exists because Uber's domain is fundamentally spatial. Schemaless exists because Uber's data is append-heavy, event-sourced trip data. The best infrastructure isn't generic — it's shaped by the domain it serves. When designing systems, start with the domain constraints, not the technology catalog.
3. Two-Sided Marketplaces Are Hard
Unlike typical web services where you optimize for one user, Uber must simultaneously optimize for riders (minimize wait time and price), drivers (maximize earnings and utilization), and the platform (maximize completed trips and revenue). Every decision is a multi-objective optimization with real-time constraints.
4. Observability Is a Feature, Not an Afterthought
Jaeger and M3 weren't add-ons — they were critical infrastructure that enabled the microservice migration. Without distributed tracing, debugging a 20-service call chain is impossible. Without metrics at scale, you can't detect regressions across 5,000+ services. Uber invested in observability before the migration, not after.
5. Incremental Migration Over Big Rewrites
The monolith-to-microservices migration took years, not months. Uber used the Strangler Fig pattern, migrating one service at a time, running both systems in parallel, and gradually shifting traffic. This minimized risk but required discipline — maintaining two systems simultaneously is expensive.
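The traffic-shifting half of the Strangler Fig pattern can be sketched as a deterministic percentage router. This is an illustration under our own naming, not Uber's actual migration tooling, which has not been published in this form.

```python
import hashlib

def route(request_id, rollout_pct):
    """Send rollout_pct% of traffic to the extracted service and the
    rest to the monolith. Hashing the request id keeps a given request
    on the same side across retries while the percentage is ratcheted
    from 1% toward 100% over the course of the migration."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "new-service" if bucket < rollout_pct else "monolith"

# at a 25% rollout, both systems serve live traffic simultaneously
targets = {route(f"req-{i}", 25) for i in range(1000)}
```

The deterministic hash matters more than it looks: it makes rollbacks instant (set the percentage to zero) and keeps per-request behavior reproducible while the two systems run in parallel.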
// Summary: Uber's Tech Stack
//
// ┌─────────────────────────────────────────────────────┐
// │ LAYER │ TECHNOLOGY │
// ├────────────────────┼─────────────────────────────────┤
// │ Mobile Apps │ iOS (Swift), Android (Kotlin) │
// │ API Gateway │ Custom (Go) │
// │ Service Mesh │ Custom (built on Envoy) │
// │ RPC Framework │ TChannel → gRPC (Protobuf) │
// │ Service Discovery │ Ringpop (consistent hashing) │
// │ Languages │ Go, Java, Python, Node.js │
// │ Primary Storage │ Schemaless (MySQL-backed) │
// │ Cache │ Redis, Memcached │
// │ Time-Series Store │ M3DB (custom) │
// │ Search │ Elasticsearch │
// │ Messaging │ Apache Kafka │
// │ Stream Processing │ Apache Flink │
// │ Batch Processing │ Apache Spark, Hive, Presto │
// │ Geospatial │ H3 (hexagonal index) │
// │ ML Platform │ Michelangelo (custom) │
// │ Resource Scheduler │ Peloton (custom, Mesos-based) │
// │ Tracing │ Jaeger (CNCF graduated) │
// │ Metrics │ M3 (custom) │
// │ CI/CD │ Custom (Monorepo tooling) │
// └────────────────────┴─────────────────────────────────┘
Key Takeaways
- Ringpop combines SWIM gossip with consistent hashing for application-layer sharding — no external coordinator needed. Use when you need stateful request routing with automatic failure detection.
- Schemaless proves that MySQL can be a building block for a planet-scale data store. Append-only, immutable cells with secondary indexes provide auditability, scalability, and simplicity.
- H3 is the gold standard for geospatial indexing. Hexagons provide uniform adjacency, and hierarchical resolutions enable analysis from city-level down to street-level. Use for any geospatial aggregation, proximity search, or spatial analysis.
- DOMA (Domain-Oriented Microservice Architecture) tames microservice sprawl by grouping services into domains with gateway interfaces — reducing cross-domain coupling.
- Michelangelo's Feature Store solves training/serving skew — the silent killer of ML systems. Centralize feature computation so training and serving use identical logic.
- M3 and Jaeger demonstrate that observability at scale requires purpose-built infrastructure. Generic tools break down at billions of time series and millions of traces per day.
- Real-time marketplaces demand end-to-end optimization: batched matching, forward-looking supply positioning, and dynamic pricing that balances multiple objectives simultaneously.