Consistent Hashing
1 · The Problem with Simple Hashing
The most natural way to distribute keys across N servers is modular hashing:

server = hash(key) % N
For a fixed N this works perfectly: keys are spread roughly uniformly (assuming a decent hash function), and lookups are O(1). The problem explodes the moment N changes.
What Happens When N Changes?
Suppose we have 3 servers and 12 keys distributed as:
| key | hash(key) | hash % 3 | hash % 4 | remapped? |
|---|---|---|---|---|
| k0 | 0 | 0 | 0 | same |
| k1 | 17 | 2 | 1 | ✗ moved |
| k2 | 42 | 0 | 2 | ✗ moved |
| k3 | 3 | 0 | 3 | ✗ moved |
| k4 | 25 | 1 | 1 | same |
| k5 | 11 | 2 | 3 | ✗ moved |
| k6 | 36 | 0 | 0 | same |
| k7 | 19 | 1 | 3 | ✗ moved |
| k8 | 44 | 2 | 0 | ✗ moved |
| k9 | 9 | 0 | 1 | ✗ moved |
| k10 | 31 | 1 | 3 | ✗ moved |
| k11 | 22 | 1 | 2 | ✗ moved |
Adding just one server (3→4) remapped 9 out of 12 keys (75%). This is not a fluke. Mathematically, when we go from N to N+1 servers, a key keeps its assignment only if hash(key) % N == hash(key) % (N+1), which for uniformly distributed hashes happens with probability just 1/(N+1). So roughly N/(N+1) of all keys are remapped: 75% for N=3.
The same problem occurs when removing a server (e.g., server crash). With modular hashing, nearly all keys must be reassigned. We need an approach where adding or removing a server only displaces K/N keys (where K is the total number of keys), not nearly all of them.
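As a quick empirical check, here is a minimal Python sketch (the key names and counts are illustrative) that measures the remap fraction when going from 3 to 4 servers:

```python
# Fraction of keys remapped when the server count goes from 3 to 4
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"key-{i}" for i in range(100_000)]
moved = sum(1 for k in keys if h(k) % 3 != h(k) % 4)
print(f"{moved / len(keys):.1%}")  # ~75.0%, matching N/(N+1) for N=3
```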
2 · Consistent Hashing Concept
Consistent hashing was introduced by Karger et al. at MIT in their 1997 paper "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web", originally to solve the problem of distributed web caching.
The core idea is elegant: instead of mapping keys to servers using modular arithmetic, we map both keys and servers onto a circular ring (hash space), and each key is served by the first server encountered when walking clockwise from the key's position.
The Hash Ring
Imagine a circle representing the entire hash space, from 0 to 2^32 − 1 (or 2^128 − 1, or 2^256 − 1, depending on the hash function). The circle wraps around: position 2^32 − 1 is adjacent to position 0.
- Place servers on the ring: hash each server identifier (e.g., IP address, hostname) to get a position on the ring: pos(S_i) = hash(server_id_i)
- Place keys on the ring: hash each key to get its position: pos(K_j) = hash(key_j)
- Assignment rule: each key is assigned to the first server encountered clockwise from the key's position on the ring.
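To make the rule concrete, here is a toy example with made-up server positions on a tiny 0–999 ring:

```python
# Toy ring with illustrative positions: the owner of a key is the first
# server position >= the key's position, wrapping past the top.
import bisect

server_positions = sorted([100, 400, 800])  # pos(S_i) = hash(server_id_i)

def owner(key_pos: int) -> int:
    i = bisect.bisect_left(server_positions, key_pos)
    return server_positions[i % len(server_positions)]

print(owner(250))  # 400: first server clockwise from 250
print(owner(950))  # 100: wraps around past position 999
```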
Why This Works
When a new server Snew is added to the ring, it only takes over keys that were previously assigned to its immediate clockwise successor. All other key→server mappings remain unchanged. Similarly, when a server is removed, only its keys are reassigned—to its clockwise successor.
3 · Hash Ring in Detail
▶ Interactive Hash Ring (demo): step through adding and removing servers on a consistent hash ring and watch how only neighboring keys are affected.
Implementation: Sorted Array + Binary Search
The ring is implemented as a sorted array of server hashes. To find which server owns a key:
- Compute h = hash(key)
- Binary search the sorted array for the smallest server hash ≥ h
- If no server hash is ≥ h, wrap around to the first server in the array (index 0)
#include <algorithm>
#include <cstdint>
#include <vector>

// Finding the responsible server: O(log N) with binary search
int findServer(uint32_t keyHash, const std::vector<uint32_t>& ring) {
    // Smallest server hash >= keyHash, i.e., the first server clockwise
    auto it = std::lower_bound(ring.begin(), ring.end(), keyHash);
    if (it == ring.end())
        return 0; // wrap around to the first server
    return static_cast<int>(std::distance(ring.begin(), it));
}
Hash Function Choice
| Hash Function | Output Size | Speed | Distribution | Notes |
|---|---|---|---|---|
| MD5 | 128-bit | Moderate | Excellent | Widely used (Ketama). Cryptographic but slower. |
| SHA-1 | 160-bit | Moderate | Excellent | Used in Chord-style DHTs. Cryptographic but slower. |
| xxHash | 32/64/128-bit | Very fast | Excellent | Non-cryptographic. Great for hash rings. |
| MurmurHash3 | 32/128-bit | Very fast | Excellent | Used in many distributed systems. |
| CRC32 | 32-bit | Very fast | Good | Small output. Faster but less uniform. |
Ring Size Considerations
A 32-bit hash gives 2^32 (≈4.3 billion) positions on the ring, which is more than sufficient for most systems. With virtual nodes (covered next), even a 100-server cluster with 200 virtual nodes each yields only 20,000 positions to store, which is trivially small.
4 · Virtual Nodes
With only a few physical servers, the hash ring can become severely unbalanced. If servers are spaced unevenly on the ring, one server might own 60% of the key space while another owns only 10%.
The Problem: Uneven Distribution
With N servers placed at random on the ring, each server's arc length approximately follows an exponential distribution, and an exponential's standard deviation equals its mean. The per-server load therefore has a relative standard deviation of roughly 100%: σ ≈ μ = K/N.
The Solution: Virtual Nodes (Vnodes)
Instead of placing each physical server at one position, we place it at V positions (virtual nodes). Each physical server S_i is mapped to virtual nodes hash("S_i#0"), hash("S_i#1"), ..., hash("S_i#V-1").
#include <cstdint>
#include <string>
#include <vector>

// Generate virtual node positions for a server.
// hash() stands in for any 32-bit hash function (e.g., MurmurHash3, xxHash).
std::vector<uint32_t> virtualNodes(const std::string& serverId, int V) {
    std::vector<uint32_t> positions;
    positions.reserve(V);
    for (int i = 0; i < V; i++) {
        std::string vnode = serverId + "#" + std::to_string(i);
        positions.push_back(hash(vnode));
    }
    return positions;
}
How Many Virtual Nodes?
| Virtual Nodes (V) | Std Dev of Load | Max/Avg Load Ratio | Memory per Server |
|---|---|---|---|
| 1 (no vnodes) | ~100% | ~3× | 4 bytes |
| 10 | ~31% | ~1.6× | 40 bytes |
| 50 | ~14% | ~1.3× | 200 bytes |
| 100 | ~10% | ~1.2× | 400 bytes |
| 150 | ~8% | ~1.15× | 600 bytes |
| 200 | ~7% | ~1.12× | 800 bytes |
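The figures above can be reproduced with a small simulation. Here is a rough Python sketch (server and key names are made up; exact numbers vary with the hash function and run):

```python
# Relative std dev of per-server load for a given vnode count
import bisect, hashlib, statistics

def h32(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

def load_spread(num_servers: int = 10, vnodes: int = 1, num_keys: int = 100_000) -> float:
    ring = sorted((h32(f"server-{s}#{v}"), s)
                  for s in range(num_servers) for v in range(vnodes))
    hashes = [pos for pos, _ in ring]
    counts = [0] * num_servers
    for i in range(num_keys):
        idx = bisect.bisect_left(hashes, h32(f"key-{i}")) % len(ring)
        counts[ring[idx][1]] += 1
    return statistics.pstdev(counts) / (num_keys / num_servers)

print(f"V=1:   ~{load_spread(vnodes=1):.0%} relative std dev")
print(f"V=150: ~{load_spread(vnodes=150):.0%}")
```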
Benefits Beyond Load Balancing
- Heterogeneous servers: Give more powerful servers more virtual nodes (proportional to capacity). A server with 2× RAM gets 2× vnodes (see the sketch after this list).
- Gradual scaling: When adding a new server, its virtual nodes are scattered across the ring, stealing small portions from many existing servers rather than a large chunk from one.
- Graceful removal: When a server leaves, its load is spread across many successors rather than dumped onto one.
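One way to implement capacity weighting, as a minimal sketch (the function name, baseline, and numbers are illustrative, not from any particular system):

```python
# Capacity-weighted vnodes: a server's vnode count scales with its capacity
def vnode_count(base_vnodes: int, capacity_gb: float, baseline_gb: float = 16.0) -> int:
    return max(1, round(base_vnodes * capacity_gb / baseline_gb))

print(vnode_count(150, 16))  # 150 vnodes for a baseline server
print(vnode_count(150, 32))  # 300 vnodes for a server with 2x the RAM
```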
Trade-offs
- Memory: Ring lookup table grows by V×. For V=200 and N=1000, that is 200K entries (typically <2 MB)—negligible.
- Lookup time: Binary search on a larger ring: O(log(N×V)) instead of O(log N). For 200K entries, that is ~17 comparisons vs ~10. Also negligible.
- Rebalancing complexity: Adding a server requires computing V hash values and inserting V entries into the sorted ring. Still very fast.
5 · Adding and Removing Nodes
Adding a Node
When server S_new joins the ring:
- Compute all V virtual node positions for S_new
- For each virtual node position p:
  - Find the current owner of position p (the next clockwise server)
  - Transfer keys in the range (predecessor(p), p] from the current owner to S_new
- Insert all virtual node positions into the ring
Removing a Node
When server S_old leaves the ring (planned shutdown or crash):
- For each of S_old's V virtual node positions:
  - Find the successor server (next clockwise)
  - Transfer keys owned by this virtual node to the successor
- Remove all V virtual node positions from the ring
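In both procedures, the key range affected by a single vnode follows directly from the clockwise rule. A minimal sketch (the position values are illustrative):

```python
# Keys a new vnode at position p takes over: the range (predecessor(p), p],
# previously owned by p's clockwise successor.
import bisect

def stolen_range(ring_positions: list, p: int) -> tuple:
    ring = sorted(ring_positions)
    i = bisect.bisect_left(ring, p)
    predecessor = ring[i - 1]  # i == 0 wraps to the last position
    return (predecessor, p)    # (exclusive, inclusive]

print(stolen_range([100, 400, 800], 600))  # (400, 600]: taken from 800's owner
```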
Graceful Migration Strategy
In practice, data migration happens in the background while the system continues to serve requests. Common strategies include:
- Double-read: During migration, read from both old and new owner. Return whichever has data.
- Forwarding: Old owner forwards requests for migrated keys to the new owner.
- Hinted handoff: If the new owner is unavailable, a neighbor temporarily accepts writes and delivers them later (used by Dynamo and Cassandra).
Modular vs. Consistent: Side-by-Side
▶ Simple Hash vs. Consistent Hash (demo): when a 4th server is added, modular hashing remaps nearly everything, while consistent hashing remaps only one key.
6 · Bounded Load Consistent Hashing
Standard consistent hashing provides good average distribution but does not give hard guarantees on maximum load. In 2016, Google published "Consistent Hashing with Bounded Loads" (Mirrokni, Thorup, Zadimoghaddam) to address this.
The Idea
Each server has a capacity cap of (1 + ε) × (K/N) where ε is a small constant (e.g., 0.25). When a key hashes to a server that is already at capacity, the key spills to the next server clockwise that has available capacity.
Algorithm
// Pseudocode sketch: Server, Ring, and totalKeys are assumed abstractions
Server findBoundedServer(Key key, Ring ring, double epsilon) {
    double avgLoad = double(totalKeys) / ring.numServers();
    int cap = ceil((1 + epsilon) * avgLoad);
    uint32_t h = hash(key);
    Server s = ring.nextClockwise(h);
    // Spill clockwise past servers already at capacity. With epsilon > 0,
    // at least one server is always below cap, so this terminates.
    while (s.currentLoad() >= cap) {
        s = ring.nextClockwise(s.position());
    }
    return s;
}
Properties
- Hard upper bound: No server ever exceeds (1+ε) × average load.
- Tunable: When ε is large, behavior approaches standard consistent hashing. When ε=0, load is forced to be perfectly even, essentially round-robin.
- Practical ε values: 0.1–0.5 in production. Lower ε gives tighter bounds but more spillover.
- Dynamic: As load changes, key assignments can shift. This means lookups must check capacity in real time.
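A small Python sketch makes the bound concrete (server and key names are illustrative; this is a simplified model of the algorithm, not the paper's exact code):

```python
# Bounded-load sketch: spill clockwise once a server holds ceil((1+eps)*avg) keys
import bisect, hashlib, math

def h32(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

servers = [f"server-{i}" for i in range(5)]
ring = sorted((h32(s), s) for s in servers)
hashes = [pos for pos, _ in ring]

keys = [f"key-{i}" for i in range(10_000)]
eps = 0.25
cap = math.ceil((1 + eps) * len(keys) / len(servers))
load = {s: 0 for s in servers}
for k in keys:
    i = bisect.bisect_left(hashes, h32(k)) % len(ring)
    while load[ring[i][1]] >= cap:   # spill to next clockwise server
        i = (i + 1) % len(ring)
    load[ring[i][1]] += 1

print(max(load.values()) / (len(keys) / len(servers)))  # never exceeds 1 + eps
```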
7 · Jump Consistent Hashing
In 2014, Google engineers Lamping and Veach published "A Fast, Minimal Memory, Consistent Hash Algorithm" describing jump consistent hashing—an algorithm that is:
- O(1) memory (no ring to store)
- O(ln n) time per lookup
- Perfectly uniform distribution
- Minimal disruption: only the minimum number of keys move, K/(N+1) when growing from N to N+1 buckets and K/N when shrinking to N−1
The Algorithm
The entire algorithm fits in about 10 lines:
int32_t jumpConsistentHash(uint64_t key, int32_t numBuckets) {
    int64_t b = -1, j = 0;
    while (j < numBuckets) {
        b = j;
        // Advance a 64-bit linear congruential generator seeded by the key
        key = key * 2862933555777941757ULL + 1;
        // Jump forward; expected jump length grows as bucket indexes rise
        j = (int64_t)((b + 1) * (double(1LL << 31) /
                                 double((key >> 33) + 1)));
    }
    return (int32_t)b;
}
How It Works (Intuition)
Think of it as a sequence of “jumps” through bucket assignments. As the bucket count grows from n to n+1, each key jumps to the new bucket with probability 1/(n+1) and stays put otherwise. This ensures that exactly a 1/(N+1) fraction of keys move when bucket N+1 is added, and that every bucket receives the same expected share.
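A Python port of the algorithm makes the disruption property easy to verify (raw sequential integers serve as keys here; in practice you would hash arbitrary keys to 64 bits first):

```python
def jump_hash(key: int, num_buckets: int) -> int:
    # Same LCG constant as the C++ version; mask keeps key at 64 bits
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b

moved = sum(1 for k in range(100_000) if jump_hash(k, 10) != jump_hash(k, 11))
print(f"{moved / 100_000:.1%}")  # ~9.1%, i.e., 1/11 of keys move
```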
Jump Hash vs. Ring Hash
| Property | Ring Hash | Jump Hash |
|---|---|---|
| Memory | O(N × V) | O(1) |
| Lookup Time | O(log(N × V)) | O(ln N) |
| Distribution | Near-uniform (with vnodes) | Perfectly uniform |
| Arbitrary node removal | ✓ Yes | ✗ No (end only) |
| Named nodes | ✓ Yes (IP, hostname) | ✗ No (0..n-1) |
| Dynamic topology | ✓ Flexible | ✗ Linear only |
8 · Rendezvous Hashing (HRW)
Highest Random Weight (HRW) hashing was developed independently by Thaler and Ravishankar in 1998. Unlike consistent hashing, which uses a ring, rendezvous hashing takes a direct approach.
Algorithm
For each key, compute a score for every server and pick the server with the highest score:
// Sketch: Key, Server, and hash() are assumed abstractions.
// The score mixes the key with each server's id.
Server findServer(Key key, const vector<Server>& servers) {
    Server best = servers[0];
    uint64_t bestScore = 0;
    for (const auto& s : servers) {
        uint64_t score = hash(key + s.id());
        if (score > bestScore) {
            bestScore = score;
            best = s;
        }
    }
    return best;
}
Properties
- O(N) per lookup: Must evaluate all servers. This is fine for small N (5–50 servers) but becomes expensive for large N.
- No data structure needed: No ring, no sorted array. Just iterate over the server list.
- Minimal disruption: Removing one server moves only K/N keys (the keys that chose the removed server). Adding one server moves only K/(N+1) keys.
- Simple implementation: Trivially correct. No virtual nodes needed for good distribution.
- Naturally uniform: Each key is equally likely to choose any server (assuming a good hash function).
When to Use Rendezvous vs. Ring
Use Rendezvous When:
- Small number of servers (<50)
- Simplicity matters more than lookup speed
- Need perfect distribution without tuning vnodes
- Need k-of-n server selection (e.g., top-3 servers for replication; see the sketch after these lists)
Use Ring Hash When:
- Large number of servers (100+)
- Need O(log N) lookups
- Topology changes frequently
- Already have vnodes infrastructure
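The k-of-n selection mentioned above falls out naturally: rank all servers by their per-key score and take the top k. A minimal Python sketch (server names are illustrative):

```python
# Rendezvous (HRW): rank servers by a per-(key, server) score, take the top k
import hashlib

def hrw_top_k(key: str, servers: list, k: int = 3) -> list:
    def score(server: str) -> int:
        return int(hashlib.md5(f"{key}:{server}".encode()).hexdigest(), 16)
    return sorted(servers, key=score, reverse=True)[:k]

servers = ["server-A", "server-B", "server-C", "server-D", "server-E"]
print(hrw_top_k("user:1001", servers))  # 3 highest-scoring servers for this key
```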
9 · Real-World Usage
Amazon DynamoDB
DynamoDB (and its predecessor Dynamo, described in the seminal 2007 paper) pioneered consistent hashing in production. Each node is assigned multiple virtual nodes on an MD5-based hash ring. Key features:
- Data is replicated to N successor nodes on the ring (default N=3)
- Uses hinted handoff for temporary failures
- Uses vector clocks for conflict resolution
- Virtual nodes allow heterogeneous hardware (more tokens for bigger machines)
Apache Cassandra
Cassandra uses consistent hashing with 256 virtual nodes (tokens) per physical node by default:
- Murmur3Partitioner: Default partitioner using MurmurHash3 on a −2^63 to 2^63−1 ring
- RandomPartitioner: Alternative using MD5 on a 0 to 2^127−1 ring
- Token ranges are used to determine which node owns which data
- Adding a node automatically streams data from neighboring token ranges
Akamai CDN
Akamai (whose founders include co-authors of the original consistent hashing paper) uses consistent hashing to assign web content to edge servers. When a user requests content:
- The URL is hashed to find the responsible edge server
- Server failures are handled transparently: requests route to the next server on the ring
- New servers are added incrementally, gradually absorbing load without cache stampedes
Discord
Discord uses consistent hashing to route messages to the correct guild (server) shard:
- Guild ID is hashed to determine the owning database shard
- Consistent hashing ensures that resharding (adding shards) only moves a fraction of guilds
- Combined with replication for high availability
Memcached (Ketama)
The Ketama library is the de facto standard for consistent hashing in Memcached clients:
- Uses MD5 to generate 4 virtual nodes per hash (using the 4 × 32-bit segments of the 128-bit MD5 output)
- Default: 160 virtual nodes per server (40 hashes × 4 points each)
- Originally developed at Last.fm, now used in libmemcached and most Memcached clients
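The four-points-per-hash scheme can be sketched in a few lines of Python. This is an illustration of the idea described above, not libketama's exact code; the key format and byte order differ between client libraries:

```python
# Ketama-style points: split one 128-bit MD5 digest into four 32-bit positions
import hashlib

def ketama_points(server: str, hash_index: int) -> list:
    digest = hashlib.md5(f"{server}-{hash_index}".encode()).digest()
    return [int.from_bytes(digest[i * 4:(i + 1) * 4], "little") for i in range(4)]

print(ketama_points("10.0.0.1:11211", 0))  # four ring positions from one hash
```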
| System | Hash Function | Virtual Nodes | Ring Size | Replication |
|---|---|---|---|---|
| DynamoDB | MD5 | Variable | 2^128 | N=3 successors |
| Cassandra | Murmur3 | 256 default | 2^64 | Configurable RF |
| Akamai CDN | Proprietary | Yes | 2^32 | Multi-tier |
| Memcached (Ketama) | MD5 | 160 | 2^32 | None (cache) |
| Redis Cluster | CRC16 | 16384 slots | 16384 | Replica shards |
10 · Implementation
Here is a complete C++ implementation of consistent hashing with virtual nodes:
#include <map>
#include <string>
#include <vector>
#include <functional>
#include <stdexcept>
#include <sstream>
#include <cstdint>
class ConsistentHash {
private:
int numVnodes;
std::map<uint32_t, std::string> ring; // hash → server id
std::hash<std::string> hasher;
uint32_t hashKey(const std::string& key) const {
// In production, use MurmurHash3 or xxHash.
// std::hash is used here for simplicity.
return static_cast<uint32_t>(hasher(key));
}
public:
explicit ConsistentHash(int vnodes = 150)
: numVnodes(vnodes) {}
// Add a physical server with V virtual nodes
void addServer(const std::string& serverId) {
for (int i = 0; i < numVnodes; i++) {
std::string vnode = serverId + "#" + std::to_string(i);
uint32_t h = hashKey(vnode);
ring[h] = serverId;
}
}
// Remove a physical server and all its virtual nodes
void removeServer(const std::string& serverId) {
for (int i = 0; i < numVnodes; i++) {
std::string vnode = serverId + "#" + std::to_string(i);
uint32_t h = hashKey(vnode);
ring.erase(h);
}
}
// Find the server responsible for a key
std::string getServer(const std::string& key) const {
if (ring.empty())
throw std::runtime_error("Empty ring");
uint32_t h = hashKey(key);
// Find the first server position >= h (clockwise search)
auto it = ring.lower_bound(h);
if (it == ring.end())
it = ring.begin(); // wrap around to 0
return it->second;
}
// Get N distinct servers for replication
std::vector<std::string> getServers(
const std::string& key, int count) const {
if (ring.empty())
throw std::runtime_error("Empty ring");
std::vector<std::string> result;
uint32_t h = hashKey(key);
auto it = ring.lower_bound(h);
        // Walk clockwise, visiting each ring entry at most once
        for (size_t steps = 0; steps < ring.size(); ++steps) {
            if (it == ring.end())
                it = ring.begin(); // wrap around
            // Skip virtual nodes of servers we already picked
            bool found = false;
            for (const auto& s : result) {
                if (s == it->second) { found = true; break; }
            }
            if (!found) {
                result.push_back(it->second);
                if (result.size() == static_cast<size_t>(count))
                    break;
            }
            ++it;
        }
return result;
}
size_t ringSize() const { return ring.size(); }
};
Python Implementation
import hashlib
import bisect
class ConsistentHash:
def __init__(self, vnodes=150):
self.vnodes = vnodes
self.ring = {} # hash → server_id
self.sorted_keys = [] # sorted list of hashes
def _hash(self, key: str) -> int:
"""Generate a 32-bit hash for a given key."""
return int(hashlib.md5(
key.encode()).hexdigest(), 16) % (2**32)
def add_server(self, server_id: str):
"""Add a server with V virtual nodes to the ring."""
for i in range(self.vnodes):
vnode_key = f"{server_id}#{i}"
h = self._hash(vnode_key)
self.ring[h] = server_id
bisect.insort(self.sorted_keys, h)
def remove_server(self, server_id: str):
"""Remove a server and all its virtual nodes."""
for i in range(self.vnodes):
vnode_key = f"{server_id}#{i}"
h = self._hash(vnode_key)
if h in self.ring:
del self.ring[h]
self.sorted_keys.remove(h)
def get_server(self, key: str) -> str:
"""Find the responsible server for a key."""
if not self.ring:
raise ValueError("Empty ring")
h = self._hash(key)
# Binary search for next clockwise server
idx = bisect.bisect_left(self.sorted_keys, h)
if idx == len(self.sorted_keys):
idx = 0 # wrap around
return self.ring[self.sorted_keys[idx]]
def get_servers(self, key: str, count: int) -> list:
"""Get 'count' distinct servers for replication."""
if not self.ring:
raise ValueError("Empty ring")
        result = []
        seen = set()
        h = self._hash(key)
        start = bisect.bisect_left(self.sorted_keys, h)
        # Walk clockwise, visiting each ring position at most once
        for step in range(len(self.sorted_keys)):
            idx = (start + step) % len(self.sorted_keys)
            server = self.ring[self.sorted_keys[idx]]
            if server not in seen:
                result.append(server)
                seen.add(server)
                if len(result) == count:
                    break
        return result
# ---- Usage example ----
ch = ConsistentHash(vnodes=150)
for s in ["server-A", "server-B", "server-C"]:
ch.add_server(s)
print(ch.get_server("user:1001")) # → one of A/B/C
print(ch.get_server("session:xyz")) # → one of A/B/C
# Add server D — only ~25% of keys move
ch.add_server("server-D")
# Replication: get top-3 servers for a key
print(ch.get_servers("user:1001", 3)) # → 3 distinct servers
Verifying Minimal Disruption
# Measure how many keys move when adding a server
ch = ConsistentHash(vnodes=150)
servers = [f"server-{i}" for i in range(10)]
for s in servers:
ch.add_server(s)
# Map 10,000 keys
keys = [f"key-{i}" for i in range(10000)]
before = {k: ch.get_server(k) for k in keys}
# Add server-10
ch.add_server("server-10")
after = {k: ch.get_server(k) for k in keys}
moved = sum(1 for k in keys if before[k] != after[k])
print(f"Keys moved: {moved}/{len(keys)}")
print(f"Percentage: {moved/len(keys)*100:.1f}%")
print(f"Expected: {100/11:.1f}%")
# Output: ~9.1% (close to the theoretical 1/11 = 9.09%)
Summary & Cheat Sheet
| Algorithm | Memory | Lookup | Distribution | Arbitrary Remove | Best For |
|---|---|---|---|---|---|
| Modular Hash (% N) | O(1) | O(1) | Uniform | ✗ (remaps all) | Static N only |
| Ring Hash + Vnodes | O(N×V) | O(log(N×V)) | Near-uniform | ✓ | General purpose |
| Bounded Load Ring | O(N×V) | O(log(N×V)) | Bounded | ✓ | Hot-spot prevention |
| Jump Hash | O(1) | O(ln N) | Perfect | ✗ (end only) | Static partitions |
| Rendezvous (HRW) | O(N) | O(N) | Perfect | ✓ | Small N, replication |