Design: Nearby Friends
Problem Statement
Almost every modern social app — Facebook, Snapchat, WhatsApp — offers a "Nearby Friends" feature that lets users see which of their friends are physically close. Unlike a Proximity Service that indexes static businesses and points of interest, Nearby Friends deals with constantly moving users whose locations change in real time. This makes the problem fundamentally different — and significantly harder.
The challenge boils down to a single question: How do you continuously match 100 million moving users against each other's friend lists, in real time, at planetary scale, without draining their phone batteries?
Requirements
Functional Requirements
- Real-time friend location: Show friends within a configurable radius (default 5 miles / 8 km) on a map, updated in near real time
- Location freshness: Location updates every ~30 seconds for active users; stale locations (>10 minutes old) disappear from the map
- Privacy controls: Users can opt in/out of sharing their location entirely, or hide from specific friends
- Distance display: Show the approximate distance and last-updated timestamp for each nearby friend
- Bidirectional visibility: If Alice can see Bob, Bob can also see Alice (both must have the feature enabled)
Non-Functional Requirements
- Scale: 100M DAU with 10% concurrent = ~10M simultaneous active users
- Latency: Location updates delivered to friends within <500ms of being sent
- Throughput: Each user sends 1 update every 30s → ~334K updates/sec at peak
- Battery: Minimal battery drain — must work with OS-level battery optimization (Doze mode, app standby)
- Availability: High availability (best-effort delivery is acceptable — missing one update out of 20 is fine)
- Consistency: Eventual consistency is acceptable — a few seconds of stale data is tolerable
Capacity Estimates
Concurrent users: 10M (10% of 100M DAU)
Update frequency: 1 update per 30 seconds per user
Updates per second: 10,000,000 / 30 ≈ 333,333 → ~334K updates/sec
Average friends: ~400 per user
Friends online: ~10% of 400 = ~40 concurrent friends
Nearby friends: ~10% of 40 = ~4 friends within 5 miles
Fan-out per update: Each update goes to ~40 online friends
Total fan-out: 334K × 40 = ~13.4M messages/sec delivered
(each message is small: ~100 bytes)
Location data size: ~100 bytes per entry (user_id, lat, lng, timestamp)
Cache size (Redis): 10M × 100 bytes ≈ 1 GB (fits in memory easily)
WebSocket connections: 10M simultaneous (need ~50–100 WS servers at 100K–200K connections each)
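As a sanity check, the arithmetic above fits in a few lines. A throwaway TypeScript sketch, with constants taken straight from the estimates and rounding noted inline:

// Sanity-checking the capacity estimates (runnable with ts-node)
const DAU = 100_000_000;
const concurrent = DAU * 0.10;                 // 10M simultaneous users
const updatesPerSec = concurrent / 30;         // ≈ 333,333 (rounded up to ~334K)
const onlineFriends = 400 * 0.10;              // ~40 online friends per user
const fanoutPerSec = updatesPerSec * onlineFriends; // ≈ 13.3M msg/s (~13.4M in the text)
const cacheGB = (concurrent * 100) / 1e9;      // 100-byte entries ≈ 1 GB
console.log({ updatesPerSec, fanoutPerSec, cacheGB });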
Proximity Service vs. Nearby Friends
Let's understand why we can't simply reuse the Proximity Service design with its geospatial index:
| Aspect | Proximity Service | Nearby Friends |
|---|---|---|
| Data | Static businesses (updated rarely) | Moving users (updated every 30s) |
| Query pattern | "What businesses are near lat/lng?" | "Which of MY FRIENDS are near me?" |
| Index freshness | Rebuilt hourly or daily | Must be real-time (<1s stale) |
| Relationship filter | None — return all nearby results | Social graph filter — only show friends |
| Direction | Unidirectional (user → service) | Bidirectional (server pushes to friends) |
| Communication | Request/Response (HTTP) | Persistent connections (WebSocket) |
| Core mechanism | Spatial index (QuadTree, Geohash) | Pub/Sub + social graph |
A naive approach — "store every user's location in a QuadTree and query it every 30 seconds for each user's friends" — would require 10M spatial queries every 30 seconds, each checking ~400 friend IDs. This is computationally infeasible at scale. Instead, we flip the model: rather than querying for nearby friends, we push location updates to friends and let the client compute distances.
High-Level Architecture
The core insight is that this is a pub/sub problem, not a search problem. Each user who opts into Nearby Friends effectively publishes their location to a channel, and their online friends subscribe to that channel. When a user moves, all their subscribing friends are immediately notified.
┌─────────────┐ WebSocket ┌──────────────────┐
│ Mobile App │◄──────────────────►│ WebSocket Server │
│ (Client) │ persistent conn │ (Stateful) │
└─────────────┘ └────────┬─────────┘
│
┌──────────────────────┼──────────────────────┐
│ │ │
┌─────▼──────┐ ┌─────────▼────────┐ ┌────────▼────────┐
│ Location │ │ Redis Pub/Sub │ │ Friend Service │
│ Cache │ │ (Fan-out) │ │ (Social Graph) │
│ (Redis) │ │ │ │ │
└─────────────┘ └───────────────────┘ └─────────────────┘
│ │
user_id → {lat, Location channels:
lng, timestamp} user:{id}:location
TTL: 10 minutes (per-user channels)
Component Overview
| Component | Role | Technology |
|---|---|---|
| Mobile App | Collects GPS location, sends updates, displays friends on map | iOS Core Location / Android Fused Location |
| WebSocket Servers | Maintain persistent connections with clients, relay location updates | Go/Node.js, 100K–200K connections per server |
| Location Cache | Store latest location for each active user with auto-expiry | Redis (key-value with TTL) |
| Redis Pub/Sub | Fan out location updates to subscribing friends in real time | Redis Pub/Sub or Redis Streams |
| Friend Service | Provide the social graph (who is friends with whom) | Existing microservice / Graph DB |
| Location History DB | Persist location trail for features like "timeline" (optional) | Cassandra / DynamoDB (append-only writes) |
Detailed Design
Location Update Flow
Let's trace exactly what happens when User A moves and sends a location update:
Location Update Flow (Step by Step)
- Mobile app obtains GPS coordinates via OS location API (every 30 seconds)
- App sends update over the existing WebSocket connection: {user_id: "A", lat: 37.7749, lng: -122.4194, ts: 1712345678}
- WebSocket server receives the update and:
  - Writes to Location Cache (Redis): SET user:A:loc "{lat:37.7749,lng:-122.4194,ts:1712345678}" EX 600
  - Publishes to Redis Pub/Sub channel: PUBLISH user:A:location "{lat:37.7749,lng:-122.4194,ts:1712345678}"
- Redis Pub/Sub delivers the message to all servers subscribed to user:A:location
- Subscribing WebSocket servers receive the message. For each local client subscribed to A's channel, they:
  - Check if the subscriber has A as a friend (already verified at subscription time)
  - Compute distance between subscriber's last known location and A's new location
  - If within radius (5 miles), push the update to the subscriber's WebSocket
- Friend's mobile app receives the update and renders A's pin on the map
// Redis commands for location update (pseudocode)
// Step 1: Cache user's location with 10-minute TTL
SET user:{user_id}:loc
"{\"lat\":37.7749,\"lng\":-122.4194,\"ts\":1712345678}"
EX 600
// Step 2: Publish to user's location channel
PUBLISH user:{user_id}:location
"{\"uid\":\"user_id\",\"lat\":37.7749,\"lng\":-122.4194,\"ts\":1712345678}"
// On the receiving WS server (subscribed to user:{user_id}:location):
// For each local subscriber of this channel:
// 1. Look up subscriber's location from local cache
// 2. Calculate haversine distance
// 3. If distance < 5 miles → push via WebSocket
// 4. If distance >= 5 miles → silently drop
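The distance check in step 2 above is the haversine formula, which this section references but never spells out. A minimal TypeScript sketch (function and constant names are ours):

// Great-circle distance in miles between two lat/lng points (haversine).
// Used on the receiving WS server to enforce the 5-mile radius.
const EARTH_RADIUS_MILES = 3959;

function haversineMiles(lat1: number, lng1: number, lat2: number, lng2: number): number {
  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
}

// haversineMiles(37.7749, -122.4194, 37.7812, -122.4098) ≈ 0.68 miles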
WebSocket Connection Initialization
When a user opens the app and enables Nearby Friends, here's the initialization sequence:
// Client opens WebSocket connection
ws = new WebSocket("wss://nearby.example.com/ws?token=JWT_TOKEN")
// Server-side initialization handler:
function onConnect(ws, user_id):
// 1. Authenticate via JWT
user = verifyJWT(ws.token)
if not user: ws.close(4001, "Unauthorized"); return
// 2. Fetch friend list from Friend Service
friends = friendService.getFriends(user_id)
// 3. Filter to friends who have Nearby Friends enabled
eligible = friends.filter(f => privacyService.isNearbyEnabled(f.id))
// 4. Subscribe to each eligible friend's location channel
for friend in eligible:
redis.subscribe("user:{friend.id}:location", handler)
// 5. Send initial snapshot of online friends' locations
snapshot = []
for friend in eligible:
loc = redis.get("user:{friend.id}:loc")
if loc and isWithinRadius(user.lastLoc, loc, 5_MILES):
snapshot.append({friend_id: friend.id, ...loc})
ws.send(JSON.stringify({type: "init", friends: snapshot}))
// 6. Register this connection in the connection registry
connectionRegistry.register(user_id, ws_server_id, ws)
// 7. Subscribe to own channel so other servers
// can forward messages to this user
redis.subscribe("user:{user_id}:incoming", incomingHandler)
Redis Pub/Sub Deep Dive
Redis Pub/Sub is the backbone of the fan-out mechanism. Let's examine why it works well for this use case — and where it falls short:
Why Redis Pub/Sub Works
- Fire-and-forget semantics: Location updates are ephemeral. If a subscriber misses one update (30s), the next update arrives in 30s — no need for message durability or acknowledgment.
- Extremely low latency: Redis Pub/Sub delivers messages in <1ms within a single instance. Even with network hops, end-to-end delivery stays under 10ms.
- Per-user channels: Creating millions of channels in Redis is essentially free — channels are just keys in a hash table, consuming negligible memory when idle.
- Simple subscribe/unsubscribe: When a friend goes offline, their channel simply stops receiving publishes. No cleanup needed.
Redis Pub/Sub Limitations
- No persistence: If a subscriber is disconnected when a message is published, it's lost forever. For our use case, this is acceptable (the next update comes in 30s).
- No backpressure: A slow subscriber gets messages queued in its output buffer. If the buffer exceeds Redis's client-output-buffer-limit, the client is disconnected.
- Single-threaded: Each Redis instance handles pub/sub on a single thread. With very high message rates, a single instance can become a bottleneck.
- No message filtering: All subscribers on a channel receive every message. Distance filtering must happen on the WebSocket server side.
# Redis Pub/Sub configuration for Nearby Friends
# Client output buffer for pubsub subscribers
# Hard limit: 256MB, Soft limit: 64MB for 60 seconds
client-output-buffer-limit pubsub 256mb 64mb 60
# Typical channel pattern:
# user:{user_id}:location — one channel per user
#
# Example message flow:
# PUBLISH user:alice:location '{"lat":37.77,"lng":-122.42,"ts":1712345678}'
#
# All WebSocket servers with subscribers to Alice's channel receive this.
# Each server then checks which of its LOCAL connections are:
# (a) friends with Alice
# (b) within 5 miles of Alice's new location
# and forwards the update to matching clients.
# Scaling: Shard Redis Pub/Sub across N instances
# Shard key: hash(user_id) % N
# This ensures all publishes and subscribes for a given user
# go to the same Redis shard.
Redis Pub/Sub Cluster Sharding
Since a single Redis instance can't handle 400M subscriptions and 13.4M messages/sec, we shard across multiple instances:
// Sharding strategy for Redis Pub/Sub
const NUM_SHARDS = 400;
function getShardForUser(userId) {
  // Plain stable hash + modulo, matching the sharding rule above
  // (a consistent-hash ring only pays off if N changes often)
  return hash(userId) % NUM_SHARDS;
}
// When user A publishes a location update:
function publishLocation(userId, location) {
const shard = getShardForUser(userId);
redisShardsPool[shard].publish(
`user:${userId}:location`,
JSON.stringify(location)
);
}
// When user B wants to subscribe to friend A's updates:
function subscribeToFriend(friendId, handler) {
const shard = getShardForUser(friendId);
redisShardsPool[shard].subscribe(
`user:${friendId}:location`,
handler
);
}
// Each WebSocket server maintains connections to ALL 400 Redis shards
// (since its local users' friends are distributed across all shards).
// This means each WS server has 400 Redis connections — manageable.
If durability matters, Redis Streams can replace Pub/Sub: producers append with XADD and consumers catch up with XREAD, so a briefly disconnected subscriber can replay what it missed. The trade-off is higher memory usage and slightly more complex consumer management. For Nearby Friends, the fire-and-forget nature of Pub/Sub is usually sufficient.
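For reference, a command-level sketch of that Streams variant, reusing the channel-naming scheme above; the MAXLEN cap is an assumption to bound per-user memory:

# Durable alternative using Redis Streams (sketch, not the primary design)
# Producer: append the update, keeping only the ~100 most recent entries
XADD user:alice:location MAXLEN ~ 100 * lat 37.77 lng -122.42 ts 1712345678

# Consumer: block up to 5s for entries newer than the last-seen ID ("$" = only new)
XREAD BLOCK 5000 STREAMS user:alice:location $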
WebSocket Server Design
WebSocket servers are the stateful glue between clients and Redis. They're the most operationally complex component.
Connection Management
// In-memory state on each WebSocket server
struct ServerState {
// Map: user_id → WebSocket connection
connections: HashMap<UserId, WebSocket>,
// Map: user_id → set of friend_ids this user subscribes to
subscriptions: HashMap<UserId, HashSet<FriendId>>,
// Map: redis_channel → set of local user_ids listening
channelListeners: HashMap<Channel, HashSet<UserId>>,
// Map: user_id → last known location (local cache)
locationCache: HashMap<UserId, Location>,
}
// When Redis delivers a message on channel "user:alice:location":
func onRedisMessage(channel, message) {
userId = extractUserId(channel) // "alice"
location = parseLocation(message)
// Find all local users subscribed to Alice's channel
listeners = serverState.channelListeners[channel]
for listenerId in listeners {
listenerLoc = serverState.locationCache[listenerId]
if listenerLoc == nil { continue }
distance = haversine(listenerLoc, location)
if distance <= 5.0 { // 5 miles
ws = serverState.connections[listenerId]
if ws == nil { continue } // connection may have closed mid-delivery
ws.send(JSON.stringify({
type: "friend_location",
friend_id: userId,
lat: location.lat,
lng: location.lng,
distance: round(distance, 1),
ts: location.ts
}))
}
}
}
Scaling WebSocket Servers
| Challenge | Solution |
|---|---|
| 10M concurrent connections | 50–100 servers, each handling 100K–200K connections (Go with epoll/kqueue) |
| Load balancing sticky connections | L4 load balancer (HAProxy/Envoy) with connection pinning via user_id hash |
| Graceful server restarts | Drain connections with a close frame (1001 Going Away); clients reconnect to a different server within 5s |
| Uneven connection distribution | Consistent hashing with virtual nodes; rebalance by draining overloaded servers |
| Server failure detection | Health checks every 5s; failed servers removed from LB pool within 15s |
| Cross-server communication | Not needed — Redis Pub/Sub handles fan-out across servers transparently |
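The "consistent hashing with virtual nodes" row can be made concrete. Below is a minimal TypeScript sketch; the vnode count, hash choice, and server names are illustrative assumptions, not prescriptions:

import { createHash } from "crypto";

// Minimal consistent-hash ring with virtual nodes for pinning users to
// WS servers; 100 vnodes per server smooths the distribution.
class HashRing {
  private ring: { point: number; server: string }[] = [];

  constructor(servers: string[], vnodes = 100) {
    for (const server of servers) {
      for (let v = 0; v < vnodes; v++) {
        this.ring.push({ point: this.hash(`${server}#${v}`), server });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  // First 8 hex chars of md5 → a 32-bit point on the ring
  private hash(key: string): number {
    return parseInt(createHash("md5").update(key).digest("hex").slice(0, 8), 16);
  }

  // The first virtual node clockwise from the key's hash owns it
  serverFor(userId: string): string {
    const h = this.hash(userId);
    const node = this.ring.find((n) => n.point >= h) ?? this.ring[0];
    return node.server;
  }
}

// Usage: new HashRing(["ws-1", "ws-2", "ws-3"]).serverFor("user_123")
// is stable across reconnects; adding a server remaps only ~1/N of users.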
// WebSocket server sizing calculation
Connections per server: 200,000
Memory per connection: ~20 KB (buffers + metadata)
Memory for connections: 200K × 20 KB = 4 GB
Redis subscriptions: 200K users × 40 friends = 8M subscriptions
Redis connections per server: 400 (one per Redis shard)
CPU per server: 8-16 cores (goroutine-per-connection model)
Network: ~2 Gbps (200K users × ~1 KB/s bidirectional)
Total servers needed: 10M / 200K = 50 servers
With 50% headroom: 75 servers
Subscription Management
Subscription lifecycle management is critical for correctness and resource efficiency:
// Subscription lifecycle events
// 1. User A comes online → subscribe to all online friends' channels
function onUserOnline(userId):
friends = friendService.getFriends(userId)
onlineFriends = filterOnline(friends)
for friend in onlineFriends:
shard = getShardForUser(friend.id)
redis[shard].subscribe(`user:${friend.id}:location`)
// Also notify friend that userId is now online
// (friend's WS server subscribes to userId's channel)
notifyFriendOnline(friend.id, userId)
// 2. User A goes offline → unsubscribe from all channels
function onUserOffline(userId):
for channel in userSubscriptions[userId]:
shard = getShardFromChannel(channel)
redis[shard].unsubscribe(channel)
// Remove from location cache (or let TTL expire)
redis.del(`user:${userId}:loc`)
// Notify friends that userId went offline
for friend in onlineFriends:
notifyFriendOffline(friend.id, userId)
// 3. Friend B comes online → all of B's online friends
// subscribe to B's channel
function notifyFriendOnline(currentUserId, newFriendId):
shard = getShardForUser(newFriendId)
redis[shard].subscribe(`user:${newFriendId}:location`)
// 4. Friend B goes offline → unsubscribe from B's channel
function notifyFriendOffline(currentUserId, offlineFriendId):
shard = getShardForUser(offlineFriendId)
redis[shard].unsubscribe(`user:${offlineFriendId}:location`)
// 5. New friendship created (A befriends B while both online)
function onNewFriendship(userA, userB):
// A subscribes to B's channel and vice versa
subscribeToFriend(userA, userB)
subscribeToFriend(userB, userA)
// 6. Friendship removed or privacy change
function onUnfriend(userA, userB):
unsubscribeFromFriend(userA, userB)
unsubscribeFromFriend(userB, userA)
Geohash Optimization
The naive approach — subscribing to all online friends' channels — works, but wastes bandwidth. If you have 40 friends online but only 4 are within 5 miles, 90% of location updates you receive are from distant friends. You still compute the haversine distance and drop them. At scale, this wasted computation and network traffic adds up.
The optimization: use geohash cells to filter subscriptions. Only subscribe to friends who are in your geohash cell or its 8 neighbors.
Geohash Refresher
// Geohash divides the world into a grid of cells
// Precision determines cell size:
//
// Precision 4: ~39km × 20km (too coarse)
// Precision 5: ~4.9km × 4.9km (≈ 3 miles — good for 5-mile radius)
// Precision 6: ~1.2km × 0.6km (too fine, too many cells)
//
// We use precision 5 → each cell is ~5km × 5km
// A 5-mile (8km) radius spans the current cell + 8 neighbors = 9 cells
import geohash
lat, lng = 37.7749, -122.4194
cell = geohash.encode(lat, lng, precision=5) # "9q8yy"
neighbors = geohash.neighbors(cell) # 8 surrounding cells
relevant = [cell] + neighbors # 9 total cells
Geohash Subscription Filter
// Enhanced subscription with geohash filtering
function updateSubscriptions(userId, newLat, newLng):
newCell = geohash.encode(newLat, newLng, precision=5)
newCells = [newCell] + geohash.neighbors(newCell) // 9 cells
oldCells = userGeohashCells[userId] || []
// Find friends in the new relevant cells
newRelevantFriends = Set()
for friend in onlineFriends[userId]:
friendLoc = getLocation(friend.id)
if friendLoc:
friendCell = geohash.encode(friendLoc.lat, friendLoc.lng, 5)
if friendCell in newCells:
newRelevantFriends.add(friend.id)
oldRelevantFriends = userRelevantFriends[userId] || Set()
// Subscribe to newly relevant friends
toSubscribe = newRelevantFriends - oldRelevantFriends
for friendId in toSubscribe:
redis.subscribe(`user:${friendId}:location`)
// Unsubscribe from no-longer-relevant friends
toUnsubscribe = oldRelevantFriends - newRelevantFriends
for friendId in toUnsubscribe:
redis.unsubscribe(`user:${friendId}:location`)
userGeohashCells[userId] = newCells
userRelevantFriends[userId] = newRelevantFriends
// Benefit: Instead of subscribing to ~40 friend channels,
// we subscribe to ~4–8 (only those in nearby cells).
// Reduces Redis fan-out by ~80–90%!
Impact of Geohash Optimization
| Metric | Without Geohash | With Geohash | Reduction |
|---|---|---|---|
| Subscriptions per user | ~40 (all online friends) | ~4–8 (nearby friends only) | 80–90% |
| Total Redis subscriptions | 400M | 40–80M | 80–90% |
| Messages delivered/sec | 13.4M | 1.3–2.7M | 80–90% |
| Distance computations/sec | 13.4M | 1.3–2.7M | 80–90% |
| Redis Pub/Sub shards needed | ~400 | ~40–80 | 80–90% |
| Subscription churn (re-subscriptions on movement) | None | ~5–10% of users per 30s cycle change cells | New overhead |
Privacy Design
Location data is among the most sensitive personal information. The privacy design must be robust and defense-in-depth:
User-Facing Privacy Controls
- Global toggle: Enable/disable Nearby Friends entirely. When disabled, the user's location is never sent to the server, and they disappear from all friends' maps.
- Per-friend hide: Hide your location from specific friends. Implemented as a server-side blocklist — the server simply does not subscribe that friend to your channel.
- Ghost mode: Temporarily invisible for a set duration (1h, 8h, until turned off). Server sets a flag that prevents publishing to your channel.
- Precision control: Show approximate location (city-level) instead of exact location. Server rounds coordinates to ~1km precision before publishing.
Technical Privacy Measures
// Privacy enforcement at every layer
// 1. Client-side: Don't collect location if feature is disabled
if (!userSettings.nearbyFriendsEnabled) {
locationManager.stopUpdates()
return
}
// 2. WebSocket server: Enforce privacy before publishing
function onLocationUpdate(userId, location):
privacySettings = getPrivacySettings(userId)
if privacySettings.ghostMode:
// Cache location (for user's own map) but don't publish
redis.set(`user:${userId}:loc`, location, "EX", 600)
return
if privacySettings.approximateMode:
// Round to ~1km precision
location.lat = round(location.lat, 2) // ~1.1km precision
location.lng = round(location.lng, 2)
// Apply per-friend blocklist in subscription management
// (blocked friends are never subscribed to this channel)
redis.set(`user:${userId}:loc`, location, "EX", 600)
redis.publish(`user:${userId}:location`, location)
// 3. Location TTL: Locations auto-expire after 10 minutes
// If a user stops sending updates (closes app), they
// disappear from friends' maps within 10 minutes.
// 4. No persistent storage by default
// Location data lives only in Redis (volatile memory).
// Only opt-in "Location History" feature writes to durable storage.
// 5. Data minimization: Location messages contain only
// {user_id, lat, lng, timestamp} — no device info,
// speed, heading, or other metadata.
Battery Optimization
GPS is one of the most power-hungry sensors on a mobile device. Naive location polling every 30 seconds can drain a battery from 100% to 0% in under 5 hours. Battery optimization isn't optional — it's a launch blocker.
Adaptive Update Frequency
// Dynamic update frequency based on movement and context
class LocationUpdateManager {
var baseInterval = 30 // seconds
var currentInterval = 30
func adjustFrequency(speed: Double /* meters per second */, context: Context) {
if speed < 0.5 {
// Stationary: slow down to every 5 minutes
currentInterval = 300
} else if speed < 2.0 {
// Walking (~4 mph): every 60 seconds
currentInterval = 60
} else if speed < 15.0 {
// Cycling/jogging: every 30 seconds
currentInterval = 30
} else {
// Driving (>30 mph): every 15 seconds
// Moving fast = location changes quickly
currentInterval = 15
}
// Battery level adjustment
if context.batteryLevel < 0.20 {
currentInterval = max(currentInterval * 3, 300)
} else if context.batteryLevel < 0.50 {
currentInterval = currentInterval * 1.5
}
// Screen state
if !context.screenOn {
// App in background: use significant location
// change API (iOS) or passive location (Android)
currentInterval = max(currentInterval, 300)
}
// Thermal state
if context.thermalState == .critical {
currentInterval = 600 // Reduce to 10 minutes
}
}
}
Platform-Specific Location APIs
| API | Platform | Battery Impact | Use Case |
|---|---|---|---|
| Fused Location Provider | Android | Low–Medium | Combines GPS, Wi-Fi, cell tower. Use PRIORITY_BALANCED_POWER_ACCURACY |
| Significant Location Change | iOS | Very Low | Only fires when user moves ~500m. Uses cell tower triangulation, not GPS. |
| Geofencing API | Both | Very Low | Set geofence around current cell; when user exits, wake app for precise update |
| Activity Recognition | Both | Negligible | Detect if user is still/walking/driving/cycling. Adjust update frequency. |
| Passive Location | Android | Zero | Piggyback on location requests from other apps. No additional battery cost. |
Comprehensive Battery Strategy
// Multi-tier battery optimization strategy
Tier 1: App in Foreground (map visible)
├── GPS: High accuracy, every 15–30 seconds
├── Activity Recognition: Adjust interval based on speed
├── WebSocket: Keep-alive with 60s heartbeat
└── Expected drain: ~5% per hour
Tier 2: App in Background (recently used)
├── Significant Location Change API (iOS)
├── Fused Location: PRIORITY_LOW_POWER (Android)
├── Update interval: 1–5 minutes
├── WebSocket: Maintained, reduced heartbeat (5 min)
└── Expected drain: ~1% per hour
Tier 3: App Suspended (long time in background)
├── Geofence around last known location (500m radius)
├── Only wake app when user leaves geofence
├── WebSocket: Disconnected, use push notification to wake
└── Expected drain: ~0.1% per hour
Tier 4: Battery Saver / Low Battery (<20%)
├── GPS: Disabled entirely
├── Location: Cell tower only (passive)
├── Update interval: 10 minutes or on significant change
├── WebSocket: Disconnected, periodic HTTP poll every 5 min
└── Expected drain: ~0.05% per hour
On Android, schedule heartbeats with AlarmManager's setExactAndAllowWhileIdle() so they still fire under Doze. On iOS, use URLSessionWebSocketTask, which integrates with the system's connection coalescing. Consider switching to HTTP long-polling or silent push notifications when the app is backgrounded.
Handling Edge Cases
Thundering Herd at Peak Times
When a major event ends (concert, sports game), thousands of users in one area simultaneously open the app. All of them send location updates and subscribe to friends' channels at once.
// Mitigations for thundering herd:
// 1. Jittered connection backoff
function connectWithJitter():
delay = random(0, 5000) // 0–5 seconds random delay
setTimeout(() => websocket.connect(), delay)
// 2. Staggered subscription loading
function loadSubscriptions(friends):
// Don't subscribe to all 40 friends at once
// Subscribe in batches of 5 with 100ms delay
for batch in chunk(friends, 5):
for friend in batch:
redis.subscribe(`user:${friend.id}:location`)
await sleep(100)
// 3. Rate limiting on location cache writes
// Use Redis pipeline to batch SET operations
pipeline = redis.pipeline()
for update in batchedUpdates:
pipeline.set(`user:${update.userId}:loc`, update.data, "EX", 600)
pipeline.execute()
// 4. Circuit breaker on Redis Pub/Sub
// If a Redis shard is overloaded (>90% CPU), temporarily
// switch to polling mode for users on that shard
WebSocket Reconnection
// Client-side reconnection with exponential backoff
class WebSocketManager {
var retryCount = 0
var maxRetries = 10
func onDisconnect(code, reason):
if code == 4001: // Auth failure
refreshToken()
delay = min(1000 * pow(2, retryCount), 30000) // Max 30s
delay += random(0, 1000) // Jitter
setTimeout(() => {
ws = new WebSocket(url)
ws.onopen = () => {
retryCount = 0
// Request snapshot of friends' current locations
ws.send({type: "sync", since: lastUpdateTimestamp})
}
retryCount++
}, delay)
}
Users with Too Many Friends
// Problem: Celebrity user with 5000 friends
// → subscribing to 500 channels, receiving 500 updates/30s
// Solution 1: Cap nearby friends subscriptions
const MAX_NEARBY_SUBSCRIPTIONS = 100
function selectFriendsToSubscribe(allFriends):
// Priority: close friends > recent interactions > random
scored = allFriends.map(f => ({
id: f.id,
score: f.closenessFactor * 10 +
f.recentInteractions * 5 +
random() * 2
}))
return scored.sort(byScore).slice(0, MAX_NEARBY_SUBSCRIPTIONS)
// Solution 2: Fan-out-on-read for high-fan-out users
// Instead of pushing to all 5000 friends,
// let friends poll the celebrity's location every 30s
// (HTTP GET instead of Pub/Sub push)
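A sketch of what Solution 2's read path could look like, assuming the ioredis client and the cache keys defined earlier; the function name is hypothetical:

import Redis from "ioredis";
const redis = new Redis();

// Fan-out-on-read: the follower polls the celebrity's cached location
// directly, so the celebrity's updates never enter Pub/Sub at all.
async function pollFriendLocation(celebrityId: string) {
  const raw = await redis.get(`user:${celebrityId}:loc`);
  if (!raw) return null; // offline, or the 10-minute TTL already expired
  return JSON.parse(raw) as { lat: number; lng: number; ts: number };
}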
Data Model
Redis Data Structures
// 1. Location Cache (Redis String with TTL)
Key: user:{user_id}:loc
Value: {"lat": 37.7749, "lng": -122.4194, "ts": 1712345678}
TTL: 600 seconds (10 minutes)
// 2. User Online Status (Redis Set)
Key: nearby:online
Value: Set of user_ids currently using Nearby Friends
Ops: SADD nearby:online user_123
SREM nearby:online user_123
SISMEMBER nearby:online user_123
// 3. User Geohash Cell (Redis String)
Key: user:{user_id}:geohash
Value: "9q8yy"
TTL: 600 seconds
// 4. Privacy Settings (Redis Hash)
Key: user:{user_id}:privacy
Fields: enabled (1/0), ghost_until (timestamp),
blocked_friends (comma-separated IDs),
approximate_mode (1/0)
// 5. Connection Registry (Redis Hash)
Key: user:{user_id}:conn
Fields: ws_server_id, connected_at, last_heartbeat
TTL: 120 seconds (refreshed on heartbeat)
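The registry's 120-second TTL implies each WS server runs a heartbeat refresher. A TypeScript sketch assuming ioredis (function name ours):

import Redis from "ioredis";
const redis = new Redis();

// Refresh the registry entry every 60s; if this WS server dies, the entry
// expires within the 120s TTL and the user counts as disconnected.
function startRegistryHeartbeat(userId: string, wsServerId: string) {
  return setInterval(async () => {
    await redis.hset(`user:${userId}:conn`,
      "ws_server_id", wsServerId,
      "last_heartbeat", Date.now().toString());
    await redis.expire(`user:${userId}:conn`, 120); // matches the TTL above
  }, 60_000);
}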
Persistent Storage (Optional)
// Cassandra table for location history (opt-in only)
CREATE TABLE location_history (
user_id UUID,
timestamp TIMESTAMP,
lat DOUBLE,
lng DOUBLE,
accuracy FLOAT,
speed FLOAT,
PRIMARY KEY (user_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
AND default_time_to_live = 2592000; -- 30-day retention
// This is append-only, write-heavy, time-series data
// → perfect fit for Cassandra's write-optimized LSM tree
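For illustration, the read patterns this schema serves; the DESC clustering order makes newest-first reads cheap (these queries are sketches, not part of the original spec):

-- Latest known point for a user (first row, thanks to DESC ordering):
SELECT lat, lng, timestamp FROM location_history
WHERE user_id = ? LIMIT 1;

-- Full trail for a time window, newest first:
SELECT lat, lng, timestamp FROM location_history
WHERE user_id = ? AND timestamp >= ? AND timestamp < ?;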
API Design
WebSocket Message Protocol
// Client → Server messages
// 1. Location update
{
"type": "location_update",
"lat": 37.7749,
"lng": -122.4194,
"accuracy": 15.0,
"speed": 1.2,
"ts": 1712345678
}
// 2. Request friends snapshot
{
"type": "sync",
"since": 1712345600 // only friends updated after this timestamp
}
// 3. Privacy toggle
{
"type": "privacy_update",
"enabled": true,
"ghost_mode": false,
"hidden_friends": ["friend_456"]
}
// Server → Client messages
// 1. Initial snapshot (on connect)
{
"type": "init",
"friends": [
{"id": "friend_123", "lat": 37.78, "lng": -122.41,
"distance": 0.8, "ts": 1712345650},
{"id": "friend_456", "lat": 37.77, "lng": -122.43,
"distance": 1.2, "ts": 1712345640}
]
}
// 2. Friend location update (real-time push)
{
"type": "friend_location",
"friend_id": "friend_123",
"lat": 37.7812,
"lng": -122.4098,
"distance": 0.6,
"ts": 1712345678
}
// 3. Friend went offline
{
"type": "friend_offline",
"friend_id": "friend_123"
}
// 4. Friend came online nearby
{
"type": "friend_online",
"friend_id": "friend_789",
"lat": 37.76,
"lng": -122.44,
"distance": 2.1,
"ts": 1712345700
}
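On the client, these server-to-client messages reduce to a small dispatch. A TypeScript sketch; the map rendering helpers are hypothetical placeholders:

interface FriendPin { id: string; lat: number; lng: number; distance: number; ts: number }

type ServerMessage =
  | { type: "init"; friends: FriendPin[] }
  | { type: "friend_location" | "friend_online"; friend_id: string;
      lat: number; lng: number; distance: number; ts: number }
  | { type: "friend_offline"; friend_id: string };

declare const ws: WebSocket;   // the connection opened during initialization
declare const map: {
  renderPins(pins: FriendPin[]): void;
  upsertPin(update: object): void;
  removePin(id: string): void;
};

ws.onmessage = (event: MessageEvent) => {
  const msg: ServerMessage = JSON.parse(event.data);
  switch (msg.type) {
    case "init":            map.renderPins(msg.friends); break;  // full snapshot
    case "friend_location":                                      // friend moved
    case "friend_online":   map.upsertPin(msg); break;           // appeared nearby
    case "friend_offline":  map.removePin(msg.friend_id); break; // drop the pin
  }
};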
REST API (Fallback & Management)
// For when WebSocket isn't available (battery saver mode, etc.)
GET /api/v1/nearby/friends?lat=37.7749&lng=-122.4194&radius=5
Authorization: Bearer {jwt}
Response: {
"friends": [
{"id": "f123", "lat": 37.78, "lng": -122.41, "distance": 0.8,
"last_seen": "2025-04-01T10:30:00Z"}
],
"next_poll": 30 // suggested poll interval in seconds
}
PUT /api/v1/nearby/settings
Authorization: Bearer {jwt}
Body: {
"enabled": true,
"ghost_mode": false,
"approximate_location": false,
"hidden_friends": ["user_456"]
}
POST /api/v1/nearby/location
Authorization: Bearer {jwt}
Body: {
"lat": 37.7749,
"lng": -122.4194,
"accuracy": 15.0,
"ts": 1712345678
}
// Used when WebSocket is disconnected (background HTTP fallback)
Scaling & Reliability
Redis Pub/Sub Scaling
// Redis Pub/Sub cluster topology
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Redis Shard 0│ │ Redis Shard 1│ ... │Redis Shard N │
│ (users 0-99) │ │(users 100-199)│ │ │
│ Pub/Sub │ │ Pub/Sub │ │ Pub/Sub │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│WS Server│ │WS Server│ │WS Server│
│ 1 │ │ 2 │ ... │ M │
└─────────┘ └─────────┘ └─────────┘
Each WS server connects to ALL Redis shards (it has users
whose friends span all shards).
Sharding key: hash(user_id) % N
All PUBLISH and SUBSCRIBE operations for a user's channel
go to the same shard.
// Capacity per Redis shard (empirical):
// ~1M active channels
// ~100K messages/sec throughput
// ~50K subscriber connections
// With geohash optimization (40–80M total subscriptions):
// N = 80 shards (each holding ~500K–1M channel subscriptions)
// Each shard: ~16K–34K messages/sec, well within limits
// (without geohash it would be 13.4M / 80 ≈ 167K/sec, over budget)
Failure Modes & Mitigation
| Failure | Impact | Mitigation |
|---|---|---|
| WebSocket server crash | 100K–200K users disconnected | Clients auto-reconnect (with jitter) to a different server. Initial snapshot restores state within 5s. |
| Redis Pub/Sub shard down | Users whose channels are on this shard stop receiving updates | Redis Sentinel auto-failover in <30s. During failover, affected users miss 1 update cycle. Clients see "location data may be stale" warning. |
| Location cache (Redis) OOM | New location writes fail | TTL-based eviction ensures old entries expire. Set maxmemory-policy allkeys-lru as safety net. Monitor memory usage with alerts at 80%. |
| Friend Service unavailable | New connections can't load friend lists | Cache friend lists on WebSocket servers (TTL: 5 min). If cache miss and service is down, use stale data or show "temporarily unavailable". |
| Network partition (WS ↔ Redis) | Location updates not fanned out | Circuit breaker pattern: after 3 failed Redis operations, switch to degraded mode (direct WS-to-WS fan-out for local connections). |
Key Metrics to Monitor
// Dashboard: Nearby Friends Health
// Throughput
location_updates_per_sec (target: ~334K)
pubsub_messages_delivered_per_sec (target: ~2M with geohash)
websocket_messages_sent_per_sec (target: ~2M)
// Latency
location_update_e2e_latency_p50 (target: <200ms)
location_update_e2e_latency_p99 (target: <500ms)
redis_pubsub_latency_p99 (target: <10ms)
// Connections
active_websocket_connections (target: ~10M)
connections_per_ws_server (target: <200K)
redis_pubsub_subscriptions_total (target: ~50M)
// Errors
websocket_disconnect_rate (alert: >5% in 5 min)
redis_pubsub_message_drop_rate (alert: >0.1%)
location_update_failure_rate (alert: >1%)
// Battery proxy
avg_location_updates_per_user_per_hour (target: 60–120)
users_on_battery_saver_mode (informational)
Alternative Approaches
Kafka Instead of Redis Pub/Sub
// Kafka-based architecture for higher durability
// Topic per user: user.{user_id}.location
// Problem: 10M topics is impractical for Kafka
// (ZooKeeper/KRaft can't handle that many partitions)
// Better: Topic per geohash cell
// Topic: location.{geohash_prefix} (e.g., location.9q8yy)
// ~100K–1M topics (number of active geohash cells)
// Each WS server is a consumer group member,
// consuming only the geohash cells relevant to its local users.
// Tradeoffs vs Redis Pub/Sub:
// ✅ Message durability (can replay on reconnect)
// ✅ Better backpressure handling
// ✅ Natural partitioning by geohash
// ❌ Higher latency (10–50ms vs <1ms)
// ❌ More complex consumer management
// ❌ Higher infrastructure cost
// ❌ Overkill for ephemeral location data
Pull-Based (Polling) Architecture
// Simpler architecture: clients poll every 30s
GET /api/nearby/friends?lat=37.77&lng=-122.42
Server-side:
1. Fetch user's friend list (cached)
2. For each friend, check location cache
3. Compute distances, filter by radius
4. Return nearby friends
// Tradeoffs vs Push (Pub/Sub):
// ✅ Much simpler architecture (stateless servers)
// ✅ No WebSocket complexity
// ✅ Easy to scale horizontally
// ❌ 30-second update delay (not real-time)
// ❌ Wasted requests when no friends are nearby
// ❌ With 10M users polling every 30s = 334K req/sec
// (same load, but server does more work per request)
// ❌ Higher battery drain (frequent HTTP requests vs idle WebSocket)
// Verdict: Good for MVP, replace with push for production
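The four server-side steps translate almost line-for-line into a handler. A TypeScript sketch assuming ioredis plus the friendService and haversineMiles helpers sketched earlier:

import Redis from "ioredis";
const redis = new Redis();

// Helpers assumed from earlier sections of this post:
declare const friendService: { getFriends(userId: string): Promise<{ id: string }[]> };
declare function haversineMiles(lat1: number, lng1: number, lat2: number, lng2: number): number;

// Stateless poll handler: one MGET fetches every friend's cached location.
async function getNearbyFriends(userId: string, lat: number, lng: number) {
  const friends = await friendService.getFriends(userId);                   // step 1 (cached)
  const locs = await redis.mget(...friends.map((f) => `user:${f.id}:loc`)); // step 2
  const nearby = [];
  for (let i = 0; i < friends.length; i++) {
    if (!locs[i]) continue;                                 // offline or TTL-expired
    const loc = JSON.parse(locs[i] as string);
    const distance = haversineMiles(lat, lng, loc.lat, loc.lng);            // step 3
    if (distance <= 5) nearby.push({ id: friends[i].id, ...loc, distance }); // step 4
  }
  return nearby;
}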
Hybrid Push/Pull Architecture
// Best of both worlds:
// - Push (WebSocket) when app is in foreground
// - Pull (HTTP) when app is in background
// - Silent push notification to wake app for significant changes
State Machine:
FOREGROUND → WebSocket active, GPS every 30s
BACKGROUND → WebSocket disconnected, HTTP poll every 5 min
SUSPENDED → No polling, push notification if friend nearby
TERMINATED → Silent push notification to wake app
// This hybrid approach is reportedly close to what Facebook's implementation uses.
// The hybrid approach reduces WebSocket server load by ~70%
// (since only ~30% of DAU has the app in foreground at any time).
Complete Architecture
┌────────────────────────────┐
│ Load Balancer (L4) │
│ (HAProxy / AWS NLB) │
└─────────────┬──────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
┌─────▼──────┐ ┌─────▼──────┐ ┌─────▼──────┐
│ WS Server 1│ │ WS Server 2│ ... │ WS Server N│
│ (200K conn)│ │ (200K conn)│ │ (200K conn)│
└──┬──┬──┬───┘ └──┬──┬──┬───┘ └──┬──┬──┬───┘
│ │ │ │ │ │ │ │ │
┌────────────┘ │ └────────┐ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │
┌───▼────┐ ┌──────▼──────┐ ┌─▼──────▼──▼──▼───────────────▼──▼──▼─┐
│Location│ │ Friend │ │ Redis Pub/Sub Cluster │
│ Cache │ │ Service │ │ (80 shards, consistent hash) │
│(Redis) │ │ (gRPC) │ │ Channels: user:{id}:location │
│ │ │ │ │ ~50M active subscriptions │
│10M keys│ │ Social graph│ └───────────────────────────────────────┘
│ ~1 GB │ │ 400 avg │
│TTL:10m │ │ friends/user│
└────────┘ └─────────────┘
┌─────────────────────────┐
│ Optional Components │
├─────────────────────────┤
│ • Location History DB │ Cassandra (opt-in, 30-day TTL)
│ • Analytics Pipeline │ Kafka → Spark (aggregate stats)
│ • Privacy Service │ Settings DB (PostgreSQL)
│ • Push Notification │ APNs/FCM (background wake)
│ • HTTP Fallback API │ Stateless REST servers
└─────────────────────────┘
Interview Tips
Key Talking Points
- Frame the problem correctly: "This is a social-graph-filtered, real-time location broadcasting problem, not a spatial search problem."
- Start with the math: Back-of-envelope calculation shows the fan-out (13M msg/s) is the bottleneck, not ingestion (334K/s).
- Introduce pub/sub early: "Each user has a location channel. Friends subscribe to each other's channels."
- Mention geohash optimization proactively: "We can reduce fan-out by 80% by only subscribing to friends in nearby geohash cells."
- Battery is a first-class concern: Interviewers love hearing about adaptive update frequency, significant location change APIs, and the foreground/background tier system.
- Privacy as a design pillar: Not an afterthought. TTL-based expiry, ghost mode, per-friend controls.
- Acknowledge trade-offs: Redis Pub/Sub is fire-and-forget (no durability). Explain why that's acceptable here.
Common Follow-Up Questions
| Question | Key Points |
|---|---|
| "How do you handle a user with 10K friends?" | Cap subscriptions at 100, prioritize by closeness/recency. Use fan-out-on-read for celebrity users. |
| "What if Redis Pub/Sub is a bottleneck?" | Shard by user_id. If still insufficient, use Kafka with geohash-based topics. |
| "How do you handle location accuracy?" | GPS accuracy metadata included in updates. Client shows accuracy circle on map. Don't display friends if both users have poor accuracy (>500m). |
| "Can you add 'friend is heading toward you' feature?" | Include speed and heading in location updates. Client computes trajectory. Show ETA if friend is moving toward you. |
| "How do you handle different timezones?" | All timestamps are UTC. Client converts for display. Geohash is timezone-agnostic. |
Summary
Nearby Friends — Architecture at a Glance
| Dimension | Design |
|---|---|
| Core pattern | Pub/Sub with social graph filtering |
| Real-time channel | WebSocket (foreground), HTTP poll (background), push notification (suspended) |
| Location cache | Redis key-value with 10-minute TTL |
| Fan-out | Redis Pub/Sub, sharded by user_id (80 shards) |
| Optimization | Geohash-based subscription filtering (~80–90% reduction) |
| Privacy | Opt-in, ghost mode, per-friend hide, TTL auto-expiry |
| Battery | 4-tier adaptive strategy (foreground → background → suspended → low battery) |
| Scale | 10M concurrent users, 334K updates/s, ~2M fan-out messages/s |
The key insight of this design is recognizing that Nearby Friends is not a spatial search problem — it's a real-time pub/sub problem filtered by social graph. Once you frame it correctly, the architecture flows naturally: WebSockets for persistent connections, Redis Pub/Sub for fan-out, geohash for optimization, and a multi-tier battery strategy for mobile-friendliness.
In the next post, we'll tackle Google Maps — a system that combines spatial indexing, tile rendering, routing algorithms, and real-time traffic data at truly planetary scale.