High Level Design Series · Real-World Designs

Design: Nearby Friends

Problem Statement

Almost every modern social app — Facebook, Snapchat, WhatsApp — offers a "Nearby Friends" feature that lets users see which of their friends are physically close. Unlike a Proximity Service that indexes static businesses and points of interest, Nearby Friends deals with constantly moving users whose locations change in real time. This makes the problem fundamentally different — and significantly harder.

The challenge boils down to a single question: How do you continuously match 100 million moving users against each other's friend lists, in real time, at planetary scale, without draining their phone batteries?

Key Difference from Proximity Service: A proximity service indexes static entities (restaurants, gas stations) into a spatial data structure and answers point queries ("what's near me?"). Nearby Friends must handle bidirectional, real-time location streams between pairs of users — fundamentally a pub/sub problem, not a spatial indexing problem.

Requirements

Functional Requirements

  • See which friends are within a 5-mile radius, with distance and last-updated time
  • Friend locations refresh roughly every 30 seconds while the feature is active
  • Strictly opt-in: users can pause sharing (ghost mode) or hide from specific friends
  • Friends who stop sharing disappear from the map within ~10 minutes

Non-Functional Requirements

  • Low latency: updates reach nearby friends in well under a second (p99 < 500 ms)
  • Scale: 100M DAU, ~10M concurrent users, ~334K location updates per second
  • Reliability: degrade gracefully (e.g., fall back to polling) rather than fail hard
  • Efficiency: minimal battery drain; location privacy enforced at every layer

Capacity Estimates

Concurrent users:     10M (10% of 100M DAU)
Update frequency:     1 update per 30 seconds per user
Updates per second:   10,000,000 / 30 ≈ 334,000 updates/sec

Average friends:      ~400 per user
Friends online:       ~10% of 400 = ~40 concurrent friends
Nearby friends:       ~10% of 40 = ~4 friends within 5 miles

Fan-out per update:   Each update goes to ~40 online friends
Total fan-out:        334K × 40 = ~13.4M messages/sec delivered
                      (each message is small: ~100 bytes)

Location data size:   ~100 bytes per entry (user_id, lat, lng, timestamp)
Cache size (Redis):   10M × 100 bytes ≈ 1 GB (fits in memory easily)

WebSocket connections: 10M simultaneous (need ~50–100 WS servers
                       at 100K–200K connections each)
The 13.4M messages/sec fan-out is the real engineering challenge. The location ingestion is relatively easy at 334K/s — it's the fan-out to friends that drives the architecture.

Proximity Service vs. Nearby Friends

Let's understand why we can't simply reuse the Proximity Service design with its geospatial index:

| Aspect              | Proximity Service                   | Nearby Friends                            |
|---------------------|-------------------------------------|-------------------------------------------|
| Data                | Static businesses (updated rarely)  | Moving users (updated every 30s)          |
| Query pattern       | "What businesses are near lat/lng?" | "Which of MY FRIENDS are near me?"        |
| Index freshness     | Rebuilt hourly or daily             | Must be real-time (<1s stale)             |
| Relationship filter | None — return all nearby results    | Social graph filter — only show friends   |
| Direction           | Unidirectional (user → service)     | Bidirectional (server pushes to friends)  |
| Communication       | Request/Response (HTTP)             | Persistent connections (WebSocket)        |
| Core mechanism      | Spatial index (QuadTree, Geohash)   | Pub/Sub + social graph                    |

A naive approach — "store every user's location in a QuadTree and query it every 30 seconds for each user's friends" — would require ~334K spatial queries per second (10M users / 30s), each intersected with a ~400-entry friend list. That is prohibitively expensive at scale. Instead, we flip the model: rather than querying for nearby friends, we push location updates to friends and compute distances on the receiving side.

High-Level Architecture

The core insight is that this is a pub/sub problem, not a search problem. Each user who opts into Nearby Friends effectively publishes their location to a channel, and their online friends subscribe to that channel. When a user moves, all their subscribing friends are immediately notified.

┌─────────────┐     WebSocket      ┌───────────────────┐
│  Mobile App │◄──────────────────►│  WebSocket Server │
│  (Client)   │  persistent conn   │  (Stateful)       │
└─────────────┘                    └─────────┬─────────┘
                                             │
                   ┌─────────────────────────┼─────────────────────────┐
                   │                         │                         │
            ┌──────▼──────┐        ┌─────────▼─────────┐      ┌────────▼────────┐
            │  Location   │        │   Redis Pub/Sub   │      │  Friend Service │
            │  Cache      │        │   (Fan-out)       │      │  (Social Graph) │
            │  (Redis)    │        │                   │      │                 │
            └─────────────┘        └───────────────────┘      └─────────────────┘
                   │                         │
             user_id → {lat,          Location channels:
             lng, timestamp}          user:{id}:location
             TTL: 10 minutes          (per-user channels)

Component Overview

| Component           | Role                                                                  | Technology                                    |
|---------------------|-----------------------------------------------------------------------|-----------------------------------------------|
| Mobile App          | Collects GPS location, sends updates, displays friends on map        | iOS Core Location / Android Fused Location    |
| WebSocket Servers   | Maintain persistent connections with clients, relay location updates | Go/Node.js, 100K–200K connections per server  |
| Location Cache      | Store latest location for each active user with auto-expiry          | Redis (key-value with TTL)                    |
| Redis Pub/Sub       | Fan out location updates to subscribing friends in real time         | Redis Pub/Sub or Redis Streams                |
| Friend Service      | Provide the social graph (who is friends with whom)                  | Existing microservice / Graph DB              |
| Location History DB | Persist location trail for features like "timeline" (optional)       | Cassandra / DynamoDB (append-only writes)     |

Detailed Design

Location Update Flow

Let's trace exactly what happens when User A moves and sends a location update:

Location Update Flow (Step by Step)

  1. Mobile app obtains GPS coordinates via OS location API (every 30 seconds)
  2. App sends update over the existing WebSocket connection: {user_id: "A", lat: 37.7749, lng: -122.4194, ts: 1712345678}
  3. WebSocket server receives the update and:
    • Writes to Location Cache (Redis): SET user:A:loc "{lat:37.7749,lng:-122.4194,ts:1712345678}" EX 600
    • Publishes to Redis Pub/Sub channel: PUBLISH user:A:location "{lat:37.7749,lng:-122.4194,ts:1712345678}"
  4. Redis Pub/Sub delivers the message to all servers subscribed to user:A:location
  5. Subscribing WebSocket servers receive the message. For each local client subscribed to A's channel, they:
    • Check if the subscriber has A as a friend (already verified at subscription time)
    • Compute distance between subscriber's last known location and A's new location
    • If within radius (5 miles), push the update to the subscriber's WebSocket
  6. Friend's mobile app receives the update and renders A's pin on the map
// Redis commands for location update (pseudocode)

// Step 1: Cache user's location with 10-minute TTL
SET user:{user_id}:loc
    "{\"lat\":37.7749,\"lng\":-122.4194,\"ts\":1712345678}"
    EX 600

// Step 2: Publish to user's location channel
PUBLISH user:{user_id}:location
    "{\"uid\":\"user_id\",\"lat\":37.7749,\"lng\":-122.4194,\"ts\":1712345678}"

// On the receiving WS server (subscribed to user:{user_id}:location):
// For each local subscriber of this channel:
//   1. Look up subscriber's location from local cache
//   2. Calculate haversine distance
//   3. If distance < 5 miles → push via WebSocket
//   4. If distance >= 5 miles → silently drop

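For concreteness, here is what that haversine check could look like. A minimal TypeScript sketch, assuming miles as the unit; the function and constant names are ours, not part of the original design:

// Haversine great-circle distance between two lat/lng points, in miles
const EARTH_RADIUS_MILES = 3958.8;

function haversineMiles(lat1: number, lng1: number,
                        lat2: number, lng2: number): number {
    const toRad = (deg: number) => (deg * Math.PI) / 180;
    const dLat = toRad(lat2 - lat1);
    const dLng = toRad(lng2 - lng1);
    const a = Math.sin(dLat / 2) ** 2 +
              Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) *
              Math.sin(dLng / 2) ** 2;
    return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
}

// Example: is the subscriber within the 5-mile radius?
const d = haversineMiles(37.7749, -122.4194, 37.7812, -122.4098);
console.log(d.toFixed(2), d <= 5);  // ≈ 0.68 miles → true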

WebSocket Connection Initialization

When a user opens the app and enables Nearby Friends, here's the initialization sequence:

// Client opens WebSocket connection
ws = new WebSocket("wss://nearby.example.com/ws?token=JWT_TOKEN")

// Server-side initialization handler:
function onConnect(ws, user_id):
    // 1. Authenticate via JWT
    user = verifyJWT(ws.token)
    if not user: ws.close(4001, "Unauthorized"); return

    // 2. Fetch friend list from Friend Service
    friends = friendService.getFriends(user_id)

    // 3. Filter to friends who have Nearby Friends enabled
    eligible = friends.filter(f => privacyService.isNearbyEnabled(f.id))

    // 4. Subscribe to each eligible friend's location channel
    for friend in eligible:
        redis.subscribe("user:{friend.id}:location", handler)

    // 5. Send initial snapshot of online friends' locations
    snapshot = []
    for friend in eligible:
        loc = redis.get("user:{friend.id}:loc")
        if loc and isWithinRadius(user.lastLoc, loc, 5_MILES):
            snapshot.append({friend_id: friend.id, ...loc})
    ws.send(JSON.stringify({type: "init", friends: snapshot}))

    // 6. Register this connection in the connection registry
    connectionRegistry.register(user_id, ws_server_id, ws)

    // 7. Subscribe to own channel so other servers
    //    can forward messages to this user
    redis.subscribe("user:{user_id}:incoming", incomingHandler)
Channel count: Each online user subscribes to ~40 friend channels (10% of 400 friends online). With 10M concurrent users, that's 400M Redis subscriptions. A single Redis instance supports ~1M channels with active subscribers, so we need a Redis Pub/Sub cluster (sharded by user ID hash) with ~400+ shards — or use a dedicated message broker like Kafka.

Redis Pub/Sub Deep Dive

Redis Pub/Sub is the backbone of the fan-out mechanism. Let's examine why it works well for this use case — and where it falls short:

Why Redis Pub/Sub Works

  • Sub-millisecond delivery, ideal for real-time location fan-out
  • Channels are essentially free: an idle channel with no subscribers consumes no memory, so one channel per user is cheap
  • Fire-and-forget semantics fit the data: a lost update is superseded by the next one 30 seconds later
  • Publishers are decoupled from consumers: a publishing server doesn't need to know which WebSocket servers host a user's friends

Redis Pub/Sub Limitations

  • No durability: messages sent while a subscriber is disconnected are lost, with no replay
  • No backpressure: a slow subscriber is disconnected once it exceeds the client output buffer limit (see the config below)
  • A single instance tops out around ~1M active channels and ~100K messages/sec, which forces sharding

# Redis Pub/Sub configuration for Nearby Friends

# Client output buffer for pubsub subscribers
# Hard limit: 256MB, Soft limit: 64MB for 60 seconds
client-output-buffer-limit pubsub 256mb 64mb 60

# Typical channel pattern:
#   user:{user_id}:location  — one channel per user
#
# Example message flow:
#   PUBLISH user:alice:location '{"lat":37.77,"lng":-122.42,"ts":1712345678}'
#
# All WebSocket servers with subscribers to Alice's channel receive this.
# Each server then checks which of its LOCAL connections are:
#   (a) friends with Alice
#   (b) within 5 miles of Alice's new location
# and forwards the update to matching clients.

# Scaling: Shard Redis Pub/Sub across N instances
# Shard key: hash(user_id) % N
# This ensures all publishes and subscribes for a given user
# go to the same Redis shard.

Redis Pub/Sub Cluster Sharding

Since a single Redis instance can't handle 400M subscriptions and 13.4M messages/sec, we shard across multiple instances:

// Sharding strategy for Redis Pub/Sub
const NUM_SHARDS = 400;

function getShardForUser(userId) {
    return consistentHash(userId) % NUM_SHARDS;
}

// When user A publishes a location update:
function publishLocation(userId, location) {
    const shard = getShardForUser(userId);
    redisShardsPool[shard].publish(
        `user:${userId}:location`,
        JSON.stringify(location)
    );
}

// When user B wants to subscribe to friend A's updates:
function subscribeToFriend(friendId, handler) {
    const shard = getShardForUser(friendId);
    redisShardsPool[shard].subscribe(
        `user:${friendId}:location`,
        handler
    );
}

// Each WebSocket server maintains connections to ALL 400 Redis shards
// (since its local users' friends are distributed across all shards).
// This means each WS server has 400 Redis connections — manageable.
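
One caveat: consistentHash(userId) % NUM_SHARDS as written is plain hash partitioning, so changing NUM_SHARDS remaps almost every user. A true consistent-hash ring with virtual nodes limits remapping to roughly 1/N of the keys when a shard is added or removed. A minimal sketch; FNV-1a and all names here are our illustrative choices:

// Minimal consistent-hash ring with virtual nodes.
// Adding/removing a shard only remaps ~1/N of the keys.
function fnv1a(s: string): number {
    let h = 0x811c9dc5;
    for (let i = 0; i < s.length; i++) {
        h ^= s.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h >>> 0;
}

class HashRing {
    private ring: { point: number; shard: string }[] = [];

    constructor(shards: string[], vnodes = 100) {
        for (const shard of shards)
            for (let v = 0; v < vnodes; v++)
                this.ring.push({ point: fnv1a(`${shard}#${v}`), shard });
        this.ring.sort((a, b) => a.point - b.point);
    }

    getShard(userId: string): string {
        const h = fnv1a(userId);
        // First virtual node clockwise from the key's hash (binary search,
        // wrapping around to ring[0] if h is past the last point)
        let lo = 0, hi = this.ring.length - 1, ans = 0;
        while (lo <= hi) {
            const mid = (lo + hi) >> 1;
            if (this.ring[mid].point >= h) { ans = mid; hi = mid - 1; }
            else lo = mid + 1;
        }
        return this.ring[ans].shard;
    }
}

const ring = new HashRing(["redis-0", "redis-1", "redis-2"]);
console.log(ring.getShard("alice"));  // stable as shards come and go
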
Alternative: Redis Streams — If you need message durability (e.g., to replay the last few locations for a user who reconnects), consider XADD/XREAD with Redis Streams. The trade-off is higher memory usage and slightly more complex consumer management. For Nearby Friends, the fire-and-forget nature of Pub/Sub is usually sufficient.
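
A rough sketch of that Streams variant using the ioredis client; the library choice, key names, and the cap of ~10 entries are our assumptions, not part of the original design:

import Redis from "ioredis";

const redis = new Redis();  // assumes a local Redis; point at your cluster in practice

// Publish: append to a capped per-user stream instead of PUBLISH.
// MAXLEN ~ 10 keeps roughly the 10 most recent locations per user.
async function publishLocation(userId: string, lat: number, lng: number) {
    await redis.xadd(
        `user:${userId}:locstream`,
        "MAXLEN", "~", 10,
        "*",
        "lat", String(lat), "lng", String(lng), "ts", String(Date.now())
    );
}

// Replay on reconnect: fetch the most recent entries, newest first
async function replayRecent(userId: string, count = 5) {
    return redis.xrevrange(`user:${userId}:locstream`, "+", "-", "COUNT", count);
}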

WebSocket Server Design

WebSocket servers are the stateful glue between clients and Redis. They're the most operationally complex component.

Connection Management

// In-memory state on each WebSocket server
struct ServerState {
    // Map: user_id → WebSocket connection
    connections: HashMap<UserId, WebSocket>,

    // Map: user_id → set of friend_ids this user subscribes to
    subscriptions: HashMap<UserId, HashSet<FriendId>>,

    // Map: redis_channel → set of local user_ids listening
    channelListeners: HashMap<Channel, HashSet<UserId>>,

    // Map: user_id → last known location (local cache)
    locationCache: HashMap<UserId, Location>,
}

// When Redis delivers a message on channel "user:alice:location":
func onRedisMessage(channel, message) {
    userId = extractUserId(channel)  // "alice"
    location = parseLocation(message)

    // Find all local users subscribed to Alice's channel
    listeners = serverState.channelListeners[channel]

    for listenerId in listeners {
        listenerLoc = serverState.locationCache[listenerId]
        if listenerLoc == nil { continue }

        distance = haversine(listenerLoc, location)
        if distance <= 5.0 {  // 5 miles
            ws = serverState.connections[listenerId]
            ws.send(JSON.stringify({
                type: "friend_location",
                friend_id: userId,
                lat: location.lat,
                lng: location.lng,
                distance: round(distance, 1),
                ts: location.ts
            }))
        }
    }
}

Scaling WebSocket Servers

| Challenge                         | Solution                                                                                                   |
|-----------------------------------|------------------------------------------------------------------------------------------------------------|
| 10M concurrent connections        | 50–100 servers, each handling 100K–200K connections (Go with epoll/kqueue)                                  |
| Load balancing sticky connections | L4 load balancer (HAProxy/Envoy) with connection pinning via user_id hash                                   |
| Graceful server restarts          | Drain connections with a Close frame (1001 Going Away); clients reconnect to a different server within 5s   |
| Uneven connection distribution    | Consistent hashing with virtual nodes; rebalance by draining overloaded servers                             |
| Server failure detection          | Health checks every 5s; failed servers removed from LB pool within 15s                                      |
| Cross-server communication        | Not needed — Redis Pub/Sub handles fan-out across servers transparently                                     |
// WebSocket server sizing calculation

Connections per server:     200,000
Memory per connection:      ~20 KB (buffers + metadata)
Memory for connections:     200K × 20 KB = 4 GB

Redis subscriptions:        200K users × 40 friends = 8M subscriptions
Redis connections per server: 400 (one per Redis shard)

CPU per server:             8-16 cores (goroutine-per-connection model)
Network:                    ~2 Gbps (200K users × ~1 KB/s bidirectional)

Total servers needed:       10M / 200K = 50 servers
With 50% headroom:          75 servers

Subscription Management

Subscription lifecycle management is critical for correctness and resource efficiency:

// Subscription lifecycle events

// 1. User A comes online → subscribe to all online friends' channels
function onUserOnline(userId):
    friends = friendService.getFriends(userId)
    onlineFriends = filterOnline(friends)

    for friend in onlineFriends:
        shard = getShardForUser(friend.id)
        redis[shard].subscribe(`user:${friend.id}:location`)
        // Also notify friend that userId is now online
        // (friend's WS server subscribes to userId's channel)
        notifyFriendOnline(friend.id, userId)

// 2. User A goes offline → unsubscribe from all channels
function onUserOffline(userId):
    for channel in userSubscriptions[userId]:
        shard = getShardFromChannel(channel)
        redis[shard].unsubscribe(channel)

    // Remove from location cache (or let TTL expire)
    redis.del(`user:${userId}:loc`)

    // Notify friends that userId went offline
    onlineFriends = filterOnline(friendService.getFriends(userId))
    for friend in onlineFriends:
        notifyFriendOffline(friend.id, userId)

// 3. Friend B comes online → all of B's online friends
//    subscribe to B's channel
function notifyFriendOnline(currentUserId, newFriendId):
    shard = getShardForUser(newFriendId)
    redis[shard].subscribe(`user:${newFriendId}:location`)

// 4. Friend B goes offline → unsubscribe from B's channel
function notifyFriendOffline(currentUserId, offlineFriendId):
    shard = getShardForUser(offlineFriendId)
    redis[shard].unsubscribe(`user:${offlineFriendId}:location`)

// 5. New friendship created (A befriends B while both online)
function onNewFriendship(userA, userB):
    // A subscribes to B's channel and vice versa
    subscribeToFriend(userA, userB)
    subscribeToFriend(userB, userA)

// 6. Friendship removed or privacy change
function onUnfriend(userA, userB):
    unsubscribeFromFriend(userA, userB)
    unsubscribeFromFriend(userB, userA)

Geohash Optimization

The naive approach — subscribing to all online friends' channels — works, but wastes bandwidth. If you have 40 friends online but only 4 are within 5 miles, 90% of location updates you receive are from distant friends. You still compute the haversine distance and drop them. At scale, this wasted computation and network traffic adds up.

The optimization: use geohash cells to filter subscriptions. Only subscribe to friends who are in your geohash cell or its 8 neighbors.

Geohash Refresher

// Geohash divides the world into a grid of cells
// Precision determines cell size:
//
// Precision 4: ~39km × 20km  (too coarse)
// Precision 5: ~4.9km × 4.9km (≈ 3 miles — good for 5-mile radius)
// Precision 6: ~1.2km × 0.6km (too fine, too many cells)
//
// We use precision 5 → each cell is ~5km × 5km
// A 5-mile (8km) radius is approximately covered by the
// current cell + 8 neighbors = 9 cells (see the edge-case note below)

import geohash

lat, lng = 37.7749, -122.4194
cell = geohash.encode(lat, lng, precision=5)  # "9q8yy"
neighbors = geohash.neighbors(cell)           # 8 surrounding cells
relevant = [cell] + neighbors                 # 9 total cells

Geohash Subscription Filter

// Enhanced subscription with geohash filtering

function updateSubscriptions(userId, newLat, newLng):
    newCell = geohash.encode(newLat, newLng, precision=5)
    newCells = [newCell] + geohash.neighbors(newCell)  // 9 cells

    oldCells = userGeohashCells[userId] || []

    // Find friends in the new relevant cells
    newRelevantFriends = Set()
    for friend in onlineFriends[userId]:
        friendLoc = getLocation(friend.id)
        if friendLoc:
            friendCell = geohash.encode(friendLoc.lat, friendLoc.lng, 5)
            if friendCell in newCells:
                newRelevantFriends.add(friend.id)

    oldRelevantFriends = userRelevantFriends[userId] || Set()

    // Subscribe to newly relevant friends
    toSubscribe = newRelevantFriends - oldRelevantFriends
    for friendId in toSubscribe:
        redis.subscribe(`user:${friendId}:location`)

    // Unsubscribe from no-longer-relevant friends
    toUnsubscribe = oldRelevantFriends - newRelevantFriends
    for friendId in toUnsubscribe:
        redis.unsubscribe(`user:${friendId}:location`)

    userGeohashCells[userId] = newCells
    userRelevantFriends[userId] = newRelevantFriends

// Benefit: Instead of subscribing to ~40 friend channels,
// we subscribe to ~4–8 (only those in nearby cells).
// Reduces Redis fan-out by ~80–90%!


Geohash edge case: Two users can be very close but sit in different geohash cells (at cell boundaries). Including the 8 neighboring cells handles this boundary problem: coverage is guaranteed out to one full cell width (~4.9 km at precision 5) in every direction. Note that this is slightly under the 5-mile (8 km) radius, so a friend near the fringe of the radius can occasionally fall outside the 3×3 block until their next update. The trade-offs (a few extra subscriptions at cell borders, the occasional miss at the fringe) are acceptable for the bandwidth saved.

Impact of Geohash Optimization

| Metric                                             | Without Geohash          | With Geohash                               | Reduction    |
|----------------------------------------------------|--------------------------|--------------------------------------------|--------------|
| Subscriptions per user                             | ~40 (all online friends) | ~4–8 (nearby friends only)                 | 80–90%       |
| Total Redis subscriptions                          | 400M                     | 40–80M                                     | 80–90%       |
| Messages delivered/sec                             | 13.4M                    | 1.3–2.7M                                   | 80–90%       |
| Distance computations/sec                          | 13.4M                    | 1.3–2.7M                                   | 80–90%       |
| Redis Pub/Sub shards needed                        | ~400                     | ~40–80                                     | 80–90%       |
| Subscription churn (re-subscriptions on movement)  | None                     | ~5–10% of users change cells per 30s cycle | New overhead |

Privacy Design

Location data is among the most sensitive personal information. The privacy design must be robust and defense-in-depth:

User-Facing Privacy Controls

  • Strictly opt-in: Nearby Friends is off by default and requires explicit consent
  • Ghost mode: temporarily pause sharing while still seeing your own location on the map
  • Per-friend hiding: exclude specific friends via a hidden_friends list
  • Approximate mode: share a location rounded to ~1 km precision instead of exact coordinates

Technical Privacy Measures

// Privacy enforcement at every layer

// 1. Client-side: Don't collect location if feature is disabled
if (!userSettings.nearbyFriendsEnabled) {
    locationManager.stopUpdates()
    return
}

// 2. WebSocket server: Enforce privacy before publishing
function onLocationUpdate(userId, location):
    privacySettings = getPrivacySettings(userId)

    if privacySettings.ghostMode:
        // Cache location (for user's own map) but don't publish
        redis.set(`user:${userId}:loc`, location, "EX", 600)
        return

    if privacySettings.approximateMode:
        // Round to ~1km precision
        location.lat = round(location.lat, 2)  // ~1.1km precision
        location.lng = round(location.lng, 2)

    // Apply per-friend blocklist in subscription management
    // (blocked friends are never subscribed to this channel)

    redis.set(`user:${userId}:loc`, location, "EX", 600)
    redis.publish(`user:${userId}:location`, location)

// 3. Location TTL: Locations auto-expire after 10 minutes
//    If a user stops sending updates (closes app), they
//    disappear from friends' maps within 10 minutes.

// 4. No persistent storage by default
//    Location data lives only in Redis (volatile memory).
//    Only opt-in "Location History" feature writes to durable storage.

// 5. Data minimization: Location messages contain only
//    {user_id, lat, lng, timestamp} — no device info,
//    speed, heading, or other metadata.
GDPR & Privacy Laws: Location data is "personal data" under GDPR, "sensitive data" under CCPA. You must: (1) obtain explicit opt-in consent, (2) allow data deletion on request, (3) implement data minimization, (4) log all access for audit trails, (5) encrypt location data in transit (TLS) and at rest (encrypted Redis). Never store precise location history without explicit, granular consent.

Battery Optimization

GPS is one of the most power-hungry sensors on a mobile device. Naive location polling every 30 seconds can drain a battery from 100% to 0% in under 5 hours. Battery optimization isn't optional — it's a launch blocker.

Adaptive Update Frequency

// Dynamic update frequency based on movement and context

class LocationUpdateManager {
    var baseInterval = 30  // seconds
    var currentInterval = 30

    func adjustFrequency(speed: Double, context: Context) {  // speed in m/s
        if speed < 0.5 {
            // Stationary: slow down to every 5 minutes
            currentInterval = 300
        } else if speed < 2.0 {
            // Walking (~4 mph): every 60 seconds
            currentInterval = 60
        } else if speed < 15.0 {
            // Cycling/jogging: every 30 seconds
            currentInterval = 30
        } else {
            // Driving (>30 mph): every 15 seconds
            // Moving fast = location changes quickly
            currentInterval = 15
        }

        // Battery level adjustment
        if context.batteryLevel < 0.20 {
            currentInterval = max(currentInterval * 3, 300)
        } else if context.batteryLevel < 0.50 {
            currentInterval = currentInterval * 1.5
        }

        // Screen state
        if !context.screenOn {
            // App in background: use significant location
            // change API (iOS) or passive location (Android)
            currentInterval = max(currentInterval, 300)
        }

        // Thermal state
        if context.thermalState == .critical {
            currentInterval = 600  // Reduce to 10 minutes
        }
    }
}

Platform-Specific Location APIs

| API                         | Platform | Battery Impact | Use Case                                                                        |
|-----------------------------|----------|----------------|----------------------------------------------------------------------------------|
| Fused Location Provider     | Android  | Low–Medium     | Combines GPS, Wi-Fi, cell tower. Use PRIORITY_BALANCED_POWER_ACCURACY            |
| Significant Location Change | iOS      | Very Low       | Only fires when user moves ~500m. Uses cell tower triangulation, not GPS.        |
| Geofencing API              | Both     | Very Low       | Set geofence around current cell; when user exits, wake app for precise update   |
| Activity Recognition        | Both     | Negligible     | Detect if user is still/walking/driving/cycling. Adjust update frequency.        |
| Passive Location            | Android  | Zero           | Piggyback on location requests from other apps. No additional battery cost.      |

Comprehensive Battery Strategy

// Multi-tier battery optimization strategy

Tier 1: App in Foreground (map visible)
  ├── GPS: High accuracy, every 15–30 seconds
  ├── Activity Recognition: Adjust interval based on speed
  ├── WebSocket: Keep-alive with 60s heartbeat
  └── Expected drain: ~5% per hour

Tier 2: App in Background (recently used)
  ├── Significant Location Change API (iOS)
  ├── Fused Location: PRIORITY_LOW_POWER (Android)
  ├── Update interval: 1–5 minutes
  ├── WebSocket: Maintained, reduced heartbeat (5 min)
  └── Expected drain: ~1% per hour

Tier 3: App Suspended (long time in background)
  ├── Geofence around last known location (500m radius)
  ├── Only wake app when user leaves geofence
  ├── WebSocket: Disconnected, use push notification to wake
  └── Expected drain: ~0.1% per hour

Tier 4: Battery Saver / Low Battery (<20%)
  ├── GPS: Disabled entirely
  ├── Location: Cell tower only (passive)
  ├── Update interval: 10 minutes or on significant change
  ├── WebSocket: Disconnected, periodic HTTP poll every 5 min
  └── Expected drain: ~0.05% per hour
The WebSocket battery tax: Maintaining a WebSocket connection itself consumes battery (keeping the radio awake for heartbeats). On Android, use AlarmManager with setExactAndAllowWhileIdle() for heartbeats. On iOS, use URLSessionWebSocketTask which integrates with the system's connection coalescing. Consider switching to HTTP long-polling or silent push notifications when the app is backgrounded.

Handling Edge Cases

Thundering Herd at Peak Times

When a major event ends (concert, sports game), thousands of users in one area simultaneously open the app. All of them send location updates and subscribe to friends' channels at once.

// Mitigations for thundering herd:

// 1. Jittered connection backoff
function connectWithJitter():
    delay = random(0, 5000)  // 0–5 seconds random delay
    setTimeout(() => websocket.connect(), delay)

// 2. Staggered subscription loading
function loadSubscriptions(friends):
    // Don't subscribe to all 40 friends at once
    // Subscribe in batches of 5 with 100ms delay
    for batch in chunk(friends, 5):
        for friend in batch:
            redis.subscribe(`user:${friend.id}:location`)
        await sleep(100)

// 3. Rate limiting on location cache writes
// Use Redis pipeline to batch SET operations
pipeline = redis.pipeline()
for update in batchedUpdates:
    pipeline.set(`user:${update.userId}:loc`, update.data, "EX", 600)
pipeline.execute()

// 4. Circuit breaker on Redis Pub/Sub
// If a Redis shard is overloaded (>90% CPU), temporarily
// switch to polling mode for users on that shard
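
Mitigation 4 is only sketched as a comment above; a minimal circuit breaker could look like this (the thresholds, cooldown, and all names are illustrative, not part of the original design):

// Minimal circuit breaker: open after N consecutive failures,
// route traffic to a fallback, and retry after a cooldown
class CircuitBreaker {
    private failures = 0;
    private openedAt = 0;

    constructor(private threshold = 3, private cooldownMs = 10_000) {}

    async call<T>(op: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
        const open = this.failures >= this.threshold &&
                     Date.now() - this.openedAt < this.cooldownMs;
        if (open) return fallback();  // breaker open: degraded mode, e.g. polling

        try {
            const result = await op();
            this.failures = 0;  // success closes the breaker
            return result;
        } catch (err) {
            this.failures++;
            if (this.failures === this.threshold) this.openedAt = Date.now();
            return fallback();
        }
    }
}

// Usage sketch: wrap publishes to one Redis shard
// await breaker.call(
//     () => shard.publish(channel, payload),
//     () => enqueueForPolling(channel, payload)  // hypothetical fallback
// );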

WebSocket Reconnection

// Client-side reconnection with exponential backoff

class WebSocketManager {
    var retryCount = 0
    var maxRetries = 10

    func onDisconnect(code, reason):
        if code == 4001:  // Auth failure
            refreshToken()

        delay = min(1000 * pow(2, retryCount), 30000)  // Max 30s
        delay += random(0, 1000)  // Jitter

        setTimeout(() => {
            ws = new WebSocket(url)
            ws.onopen = () => {
                retryCount = 0
                // Request snapshot of friends' current locations
                ws.send({type: "sync", since: lastUpdateTimestamp})
            }
            retryCount++
        }, delay)
}

Users with Too Many Friends

// Problem: Celebrity user with 5000 friends
// → subscribing to 500 channels, receiving 500 updates/30s

// Solution 1: Cap nearby friends subscriptions
const MAX_NEARBY_SUBSCRIPTIONS = 100

function selectFriendsToSubscribe(allFriends):
    // Priority: close friends > recent interactions > random
    scored = allFriends.map(f => ({
        id: f.id,
        score: f.closenessFactor * 10 +
               f.recentInteractions * 5 +
               random() * 2
    }))
    return scored.sort(byScore).slice(0, MAX_NEARBY_SUBSCRIPTIONS)

// Solution 2: Fan-out-on-read for high-fan-out users
// Instead of pushing to all 5000 friends,
// let friends poll the celebrity's location every 30s
// (HTTP GET instead of Pub/Sub push)

Data Model

Redis Data Structures

// 1. Location Cache (Redis String with TTL)
Key:    user:{user_id}:loc
Value:  {"lat": 37.7749, "lng": -122.4194, "ts": 1712345678}
TTL:    600 seconds (10 minutes)

// 2. User Online Status (Redis Set)
Key:    nearby:online
Value:  Set of user_ids currently using Nearby Friends
Ops:    SADD nearby:online user_123
        SREM nearby:online user_123
        SISMEMBER nearby:online user_123

// 3. User Geohash Cell (Redis String)
Key:    user:{user_id}:geohash
Value:  "9q8yy"
TTL:    600 seconds

// 4. Privacy Settings (Redis Hash)
Key:    user:{user_id}:privacy
Fields: enabled (1/0), ghost_until (timestamp),
        blocked_friends (comma-separated IDs),
        approximate_mode (1/0)

// 5. Connection Registry (Redis Hash)
Key:    user:{user_id}:conn
Fields: ws_server_id, connected_at, last_heartbeat
TTL:    120 seconds (refreshed on heartbeat)
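
For concreteness, a few of these operations via the ioredis client. A sketch under the same key names; markOnline is our helper, not part of the design:

import Redis from "ioredis";

const redis = new Redis();

async function markOnline(userId: string, wsServerId: string) {
    // 2. Online status set
    await redis.sadd("nearby:online", userId);

    // 5. Connection registry hash, refreshed on heartbeat
    await redis.hset(`user:${userId}:conn`, {
        ws_server_id: wsServerId,
        connected_at: Date.now(),
        last_heartbeat: Date.now(),
    });
    await redis.expire(`user:${userId}:conn`, 120);
}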

Persistent Storage (Optional)

// Cassandra table for location history (opt-in only)
CREATE TABLE location_history (
    user_id     UUID,
    timestamp   TIMESTAMP,
    lat         DOUBLE,
    lng         DOUBLE,
    accuracy    FLOAT,
    speed       FLOAT,
    PRIMARY KEY (user_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
  AND default_time_to_live = 2592000;  -- 30-day retention

// This is append-only, write-heavy, time-series data
// → perfect fit for Cassandra's write-optimized LSM tree

API Design

WebSocket Message Protocol

// Client → Server messages

// 1. Location update
{
    "type": "location_update",
    "lat": 37.7749,
    "lng": -122.4194,
    "accuracy": 15.0,
    "speed": 1.2,
    "ts": 1712345678
}

// 2. Request friends snapshot
{
    "type": "sync",
    "since": 1712345600  // only friends updated after this timestamp
}

// 3. Privacy toggle
{
    "type": "privacy_update",
    "enabled": true,
    "ghost_mode": false,
    "hidden_friends": ["friend_456"]
}

// Server → Client messages

// 1. Initial snapshot (on connect)
{
    "type": "init",
    "friends": [
        {"id": "friend_123", "lat": 37.78, "lng": -122.41,
         "distance": 0.8, "ts": 1712345650},
        {"id": "friend_456", "lat": 37.77, "lng": -122.43,
         "distance": 1.2, "ts": 1712345640}
    ]
}

// 2. Friend location update (real-time push)
{
    "type": "friend_location",
    "friend_id": "friend_123",
    "lat": 37.7812,
    "lng": -122.4098,
    "distance": 0.6,
    "ts": 1712345678
}

// 3. Friend went offline
{
    "type": "friend_offline",
    "friend_id": "friend_123"
}

// 4. Friend came online nearby
{
    "type": "friend_online",
    "friend_id": "friend_789",
    "lat": 37.76,
    "lng": -122.44,
    "distance": 2.1,
    "ts": 1712345700
}
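
On the client, these messages can be dispatched with a simple switch on type. A TypeScript sketch; the map renderer and its methods are hypothetical stand-ins:

interface FriendLocation { id: string; lat: number; lng: number; distance: number; ts: number }

type ServerMessage =
    | { type: "init"; friends: FriendLocation[] }
    | { type: "friend_location"; friend_id: string; lat: number;
        lng: number; distance: number; ts: number }
    | { type: "friend_offline"; friend_id: string }
    | { type: "friend_online"; friend_id: string; lat: number;
        lng: number; distance: number; ts: number };

declare const ws: WebSocket;   // connection from the init sequence
declare const map: {           // hypothetical map renderer
    renderAll(friends: FriendLocation[]): void;
    upsertPin(id: string, lat: number, lng: number, distance: number): void;
    removePin(id: string): void;
};

ws.onmessage = (event: MessageEvent) => {
    const msg = JSON.parse(event.data) as ServerMessage;
    switch (msg.type) {
        case "init":            map.renderAll(msg.friends); break;
        case "friend_location":
        case "friend_online":   map.upsertPin(msg.friend_id, msg.lat, msg.lng, msg.distance); break;
        case "friend_offline":  map.removePin(msg.friend_id); break;
    }
};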

REST API (Fallback & Management)

// For when WebSocket isn't available (battery saver mode, etc.)

GET /api/v1/nearby/friends?lat=37.7749&lng=-122.4194&radius=5
Authorization: Bearer {jwt}
Response: {
    "friends": [
        {"id": "f123", "lat": 37.78, "lng": -122.41, "distance": 0.8,
         "last_seen": "2025-04-01T10:30:00Z"}
    ],
    "next_poll": 30  // suggested poll interval in seconds
}

PUT /api/v1/nearby/settings
Authorization: Bearer {jwt}
Body: {
    "enabled": true,
    "ghost_mode": false,
    "approximate_location": false,
    "hidden_friends": ["user_456"]
}

POST /api/v1/nearby/location
Authorization: Bearer {jwt}
Body: {
    "lat": 37.7749,
    "lng": -122.4194,
    "accuracy": 15.0,
    "ts": 1712345678
}
// Used when WebSocket is disconnected (background HTTP fallback)

Scaling & Reliability

Redis Pub/Sub Scaling

// Redis Pub/Sub cluster topology

┌───────────────┐  ┌───────────────┐      ┌───────────────┐
│ Redis Shard 0 │  │ Redis Shard 1 │ ...  │ Redis Shard N │
│ (users 0-99)  │  │(users 100-199)│      │               │
│   Pub/Sub     │  │   Pub/Sub     │      │   Pub/Sub     │
└───────┬───────┘  └───────┬───────┘      └───────┬───────┘
        │                  │                      │
   ┌────▼────┐        ┌────▼────┐            ┌────▼────┐
   │WS Server│        │WS Server│            │WS Server│
   │    1    │        │    2    │    ...     │    M    │
   └─────────┘        └─────────┘            └─────────┘

Each WS server connects to ALL Redis shards (it has users
whose friends span all shards).

Sharding key: consistent_hash(user_id) % N
All PUBLISH and SUBSCRIBE operations for a user's channel
go to the same shard.

// Capacity per Redis shard (empirical):
//   ~1M active channels
//   ~100K messages/sec throughput
//   ~50K subscribers

// With geohash optimization (40-80M total subscriptions):
//   N = 80 shards (each handling ~500K-1M subscriptions)
//   Each shard: ~16K-34K messages/sec (1.3-2.7M / 80), well within limits
//   (Without geohash: 13.4M / 80 ≈ 167K/sec per shard, above the
//    ~100K/sec ceiling, which is exactly why the optimization matters)

Failure Modes & Mitigation

| Failure                        | Impact                                                        | Mitigation                                                                                                                                        |
|--------------------------------|---------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
| WebSocket server crash         | 100K–200K users disconnected                                  | Clients auto-reconnect (with jitter) to a different server. Initial snapshot restores state within 5s.                                              |
| Redis Pub/Sub shard down       | Users whose channels are on this shard stop receiving updates | Redis Sentinel auto-failover in <30s. During failover, affected users miss 1 update cycle. Clients see a "location data may be stale" warning.      |
| Location cache (Redis) OOM     | New location writes fail                                      | TTL-based eviction ensures old entries expire. Set maxmemory-policy allkeys-lru as a safety net. Monitor memory usage with alerts at 80%.           |
| Friend Service unavailable     | New connections can't load friend lists                       | Cache friend lists on WebSocket servers (TTL: 5 min). On a cache miss while the service is down, use stale data or show "temporarily unavailable".  |
| Network partition (WS ↔ Redis) | Location updates not fanned out                               | Circuit breaker pattern: after 3 failed Redis operations, switch to degraded mode (direct WS-to-WS fan-out for local connections).                  |

Key Metrics to Monitor

// Dashboard: Nearby Friends Health

// Throughput
  location_updates_per_sec          (target: ~334K)
  pubsub_messages_delivered_per_sec (target: ~2M with geohash)
  websocket_messages_sent_per_sec   (target: ~2M)

// Latency
  location_update_e2e_latency_p50   (target: <200ms)
  location_update_e2e_latency_p99   (target: <500ms)
  redis_pubsub_latency_p99          (target: <10ms)

// Connections
  active_websocket_connections      (target: ~10M)
  connections_per_ws_server          (target: <200K)
  redis_pubsub_subscriptions_total  (target: ~50M)

// Errors
  websocket_disconnect_rate         (alert: >5% in 5 min)
  redis_pubsub_message_drop_rate    (alert: >0.1%)
  location_update_failure_rate      (alert: >1%)

// Battery proxy
  avg_location_updates_per_user_per_hour (target: 60–120)
  users_on_battery_saver_mode       (informational)

Alternative Approaches

Kafka Instead of Redis Pub/Sub

// Kafka-based architecture for higher durability

// Topic per user: user.{user_id}.location
// Problem: 10M topics is impractical for Kafka
//          (ZooKeeper/KRaft can't handle that many partitions)

// Better: Topic per geohash cell
// Topic: location.{geohash_prefix}  (e.g., location.9q8yy)
// ~100K–1M topics (number of active geohash cells)
// Each WS server is a consumer group member,
// consuming only the geohash cells relevant to its local users.

// Tradeoffs vs Redis Pub/Sub:
// ✅ Message durability (can replay on reconnect)
// ✅ Better backpressure handling
// ✅ Natural partitioning by geohash
// ❌ Higher latency (10–50ms vs <1ms)
// ❌ More complex consumer management
// ❌ Higher infrastructure cost
// ❌ Overkill for ephemeral location data
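
A rough sketch of the geohash-topic producer with kafkajs and ngeohash; the broker addresses, topic naming, and helper names are our assumptions, and the topics must exist or auto-creation must be enabled:

import { Kafka } from "kafkajs";
import * as ngeohash from "ngeohash";

const kafka = new Kafka({ clientId: "nearby-friends", brokers: ["kafka-1:9092"] });
const producer = kafka.producer();

// Publish a location update to the topic for the user's geohash cell
async function publishToCell(userId: string, lat: number, lng: number) {
    const cell = ngeohash.encode(lat, lng, 5);  // precision 5, e.g. "9q8yy"
    await producer.send({
        topic: `location.${cell}`,
        messages: [{
            key: userId,  // keyed by user for per-user ordering within the cell
            value: JSON.stringify({ userId, lat, lng, ts: Date.now() }),
        }],
    });
}

async function main() {
    await producer.connect();  // connect once at startup
    await publishToCell("alice", 37.7749, -122.4194);
}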

Pull-Based (Polling) Architecture

// Simpler architecture: clients poll every 30s

GET /api/nearby/friends?lat=37.77&lng=-122.42

Server-side:
1. Fetch user's friend list (cached)
2. For each friend, check location cache
3. Compute distances, filter by radius
4. Return nearby friends

// Tradeoffs vs Push (Pub/Sub):
// ✅ Much simpler architecture (stateless servers)
// ✅ No WebSocket complexity
// ✅ Easy to scale horizontally
// ❌ 30-second update delay (not real-time)
// ❌ Wasted requests when no friends are nearby
// ❌ With 10M users polling every 30s = 334K req/sec
//    (same load, but server does more work per request)
// ❌ Higher battery drain (frequent HTTP requests vs idle WebSocket)

// Verdict: Good for MVP, replace with push for production

Hybrid Push/Pull Architecture

// Best of both worlds:
// - Push (WebSocket) when app is in foreground
// - Pull (HTTP) when app is in background
// - Silent push notification to wake app for significant changes

State Machine:
  FOREGROUND → WebSocket active, GPS every 30s
  BACKGROUND → WebSocket disconnected, HTTP poll every 5 min
  SUSPENDED  → No polling, push notification if friend nearby
  TERMINATED → Silent push notification to wake app

// This is reportedly close to what Facebook's implementation uses.
// The hybrid approach reduces WebSocket server load by ~70%
// (since only ~30% of DAU has the app in foreground at any time).
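
A sketch of that client-side state machine in TypeScript; transition triggers are simplified, and the transport objects are hypothetical stand-ins:

type AppState = "FOREGROUND" | "BACKGROUND" | "SUSPENDED" | "TERMINATED";

interface Transport { connect(): void; disconnect(): void }
interface Poller { start(intervalMs: number): void; stop(): void }
interface Gps { setUpdateInterval(ms: number): void }

declare const socket: Transport, poller: Poller, gps: Gps;

// Pick the transport for each app state, mirroring the table above
function onStateChange(state: AppState) {
    switch (state) {
        case "FOREGROUND":
            poller.stop();
            socket.connect();               // WebSocket active
            gps.setUpdateInterval(30_000);  // GPS every 30s
            break;
        case "BACKGROUND":
            socket.disconnect();
            poller.start(5 * 60_000);       // HTTP poll every 5 min
            break;
        case "SUSPENDED":
        case "TERMINATED":
            socket.disconnect();
            poller.stop();                  // rely on silent push to wake the app
            break;
    }
}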

Complete Architecture

                            ┌────────────────────────────┐
                            │     Load Balancer (L4)     │
                            │    (HAProxy / AWS NLB)     │
                            └─────────────┬──────────────┘
                                          │
                    ┌─────────────────────┼─────────────────────┐
                    │                     │                     │
              ┌─────▼──────┐        ┌─────▼──────┐        ┌─────▼──────┐
              │ WS Server 1│        │ WS Server 2│  ...   │ WS Server N│
              │ (200K conn)│        │ (200K conn)│        │ (200K conn)│
              └──┬──┬──┬───┘        └──┬──┬──┬───┘        └──┬──┬──┬───┘
                 │  │  │               │  │  │               │  │  │
    ┌────────────┘  │  └────────┐      │  │  │               │  │  │
    │               │           │      │  │  │               │  │  │
┌───▼────┐   ┌──────▼──────┐ ┌──▼──────▼──▼──▼───────────────▼──▼──▼─┐
│Location│   │   Friend    │ │        Redis Pub/Sub Cluster          │
│ Cache  │   │  Service    │ │  (80 shards, consistent hash)         │
│(Redis) │   │  (gRPC)     │ │  Channels: user:{id}:location         │
│        │   │             │ │  ~50M active subscriptions            │
│10M keys│   │ Social graph│ └───────────────────────────────────────┘
│ ~1 GB  │   │ 400 avg     │
│TTL:10m │   │ friends/user│
└────────┘   └─────────────┘

          ┌─────────────────────────┐
          │   Optional Components   │
          ├─────────────────────────┤
          │ • Location History DB   │  Cassandra (opt-in, 30-day TTL)
          │ • Analytics Pipeline    │  Kafka → Spark (aggregate stats)
          │ • Privacy Service       │  Settings DB (PostgreSQL)
          │ • Push Notification     │  APNs/FCM (background wake)
          │ • HTTP Fallback API     │  Stateless REST servers
          └─────────────────────────┘

Interview Tips

The #1 interviewer trap: Don't start with "I'll use a QuadTree/Geohash to find nearby users." That's a proximity service for static data. This problem is fundamentally about real-time pub/sub between friends. Start with the social graph constraint and work toward the pub/sub architecture.

Key Talking Points

  • Frame the problem as pub/sub filtered by the social graph, not as spatial search
  • The bottleneck is the fan-out (~13.4M messages/sec), not ingestion (~334K updates/sec)
  • Geohash-based subscription filtering cuts fan-out and subscription counts by ~80–90%
  • Battery and privacy are first-class requirements, each worth its own design discussion
  • Know the fallbacks: polling for an MVP or backgrounded clients, hybrid push/pull for production

Common Follow-Up Questions

| Question                                                | Key Points                                                                                                                               |
|---------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
| "How do you handle a user with 10K friends?"            | Cap subscriptions at 100, prioritize by closeness/recency. Use fan-out-on-read for celebrity users.                                         |
| "What if Redis Pub/Sub is a bottleneck?"                | Shard by user_id. If still insufficient, use Kafka with geohash-based topics.                                                               |
| "How do you handle location accuracy?"                  | GPS accuracy metadata included in updates. Client shows accuracy circle on map. Don't display friends if both users have poor accuracy (>500m). |
| "Can you add 'friend is heading toward you' feature?"   | Include speed and heading in location updates. Client computes trajectory. Show ETA if friend is moving toward you.                         |
| "How do you handle different timezones?"                | All timestamps are UTC. Client converts for display. Geohash is timezone-agnostic.                                                          |

Summary

Nearby Friends — Architecture at a Glance

| Aspect            | Choice                                                                         |
|-------------------|----------------------------------------------------------------------------------|
| Core pattern      | Pub/Sub with social graph filtering                                              |
| Real-time channel | WebSocket (foreground), HTTP poll (background), push notification (suspended)   |
| Location cache    | Redis key-value with 10-minute TTL                                              |
| Fan-out           | Redis Pub/Sub, sharded by user_id (80 shards)                                   |
| Optimization      | Geohash-based subscription filtering (~80–90% reduction)                        |
| Privacy           | Opt-in, ghost mode, per-friend hide, TTL auto-expiry                            |
| Battery           | 4-tier adaptive strategy (foreground → background → suspended → low battery)    |
| Scale             | 10M concurrent users, 334K updates/s, ~2M fan-out messages/s                    |

The key insight of this design is recognizing that Nearby Friends is not a spatial search problem — it's a real-time pub/sub problem filtered by social graph. Once you frame it correctly, the architecture flows naturally: WebSockets for persistent connections, Redis Pub/Sub for fan-out, geohash for optimization, and a multi-tier battery strategy for mobile-friendliness.

In the next post, we'll tackle Google Maps — a system that combines spatial indexing, tile rendering, routing algorithms, and real-time traffic data at truly planetary scale.