
Consistent Hashing

1 · The Problem with Simple Hashing

The most natural way to distribute keys across N servers is modular hashing:

server = hash(key) % N

For a fixed N this works perfectly: keys are spread roughly uniformly (assuming a decent hash function), and lookups are O(1). The problem explodes the moment N changes.

What Happens When N Changes?

Suppose we have 3 servers and 12 keys distributed as:

key  hash(key)  hash%3  hash%4  (remapped?)
──────────────────────────────────────────
k0       0        0       0      same
k1      17        2       1      ✗ moved
k2      42        0       2      ✗ moved
k3       3        0       3      ✗ moved
k4      25        1       1      same
k5      11        2       3      ✗ moved
k6      36        0       0      same
k7      19        1       3      ✗ moved
k8      44        2       0      ✗ moved
k9       9        0       1      ✗ moved
k10     31        1       3      ✗ moved
k11     22        1       2      ✗ moved

Adding just one server (3→4) remapped 9 out of 12 keys (75%). This is not a fluke. Mathematically, when we go from N to N+1 servers:

Expected fraction of keys that move = 1 − 1/(N+1) = N/(N+1)

For N = 3:   1 − 1/4   = 75% of keys remap
For N = 10:  1 − 1/11  ≈ 91% of keys remap
For N = 100: 1 − 1/101 ≈ 99% of keys remap
⚠ Catastrophic for Caches: If you have a Memcached cluster with 100 servers caching millions of objects, adding server 101 invalidates ~99% of the cache. Every request becomes a cache miss, hammering the backend database. This is called a cache stampede or thundering herd.

The same problem occurs when removing a server (e.g., after a crash). With modular hashing, nearly all keys must be reassigned. We need an approach where adding or removing a server displaces only about K/N keys (where K is the total number of keys), not nearly all of them.
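
This is easy to verify empirically. Below is a small simulation (a sketch; the key names and the use of Python's built-in string hash are arbitrary choices) that remaps 100,000 keys when going from 10 to 11 servers:

# Empirical check of the modular-hash remap fraction.
# Both mappings are computed in the same process, so Python's hash
# randomization does not affect the comparison.
N = 10
keys = [f"key-{i}" for i in range(100_000)]

before = {k: hash(k) % N for k in keys}
after  = {k: hash(k) % (N + 1) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"moved {moved / len(keys):.1%}, expected ~{N / (N + 1):.1%}")
# → roughly 91% of keys change servers when going from 10 to 11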

2 · Consistent Hashing Concept

Consistent hashing was introduced by Karger et al. in their 1997 paper "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web" at MIT, originally to solve the problem of distributed web caching.

The core idea is elegant: instead of mapping keys to servers using modular arithmetic, we map both keys and servers onto a circular ring (hash space), and each key is served by the first server encountered when walking clockwise from the key's position.

The Hash Ring

Imagine a circle representing the entire hash space, from 0 to 2³² − 1 (or 2¹²⁸ − 1, or 2²⁵⁶ − 1, depending on the hash function). The circle wraps around: position 2³² − 1 is adjacent to position 0.

  1. Place servers on the ring: Hash each server identifier (e.g., IP address, hostname) to get a position on the ring. pos(S_i) = hash(server_id_i)
  2. Place keys on the ring: Hash each key to get its position. pos(K_j) = hash(key_j)
  3. Assignment rule: Each key is assigned to the first server encountered clockwise from the key's position on the ring.
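
A minimal sketch of these three steps (MD5 for the positions and the server addresses are just placeholder choices; later sections show fuller implementations):

import hashlib
from bisect import bisect_left

def pos(s: str) -> int:
    """Map a string to a position on a 32-bit ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]          # step 1: place servers
ring = sorted((pos(s), s) for s in servers)

def owner(key: str) -> str:
    h = pos(key)                                        # step 2: place the key
    i = bisect_left(ring, (h, ""))                      # step 3: first server clockwise
    return ring[i % len(ring)][1]                       #         (wrap past the top)

print(owner("user:1001"), owner("user:1002"))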

Why This Works

When a new server Snew is added to the ring, it only takes over keys that were previously assigned to its immediate clockwise successor. All other key→server mappings remain unchanged. Similarly, when a server is removed, only its keys are reassigned—to its clockwise successor.

Keys moved when adding 1 server to N servers:      Expected = K / (N + 1)
Keys moved when removing 1 server from N servers:  Expected = K / N
Compare with modular hashing:                      Expected ≈ K × N / (N + 1)  ← almost ALL keys!
💡 Key Insight: With consistent hashing, adding a server to a 100-node cluster moves only ~1% of keys. With modular hashing, it would move ~99% of keys. This is the fundamental reason distributed systems use consistent hashing.

3 · Hash Ring in Detail

▶ Interactive Hash Ring

Step through adding and removing servers on a consistent hash ring. Watch how only neighboring keys are affected.


Implementation: Sorted Array + Binary Search

The ring is implemented as a sorted array of server hashes. To find which server owns a key:

  1. Compute h = hash(key)
  2. Binary search the sorted array for the smallest server hash ≥ h
  3. If no server hash is ≥ h, wrap around to the first server in the array (index 0)
// Finding the responsible server: O(log N) with binary search
int findServer(uint32_t keyHash, vector<uint32_t>& ring) {
    auto it = lower_bound(ring.begin(), ring.end(), keyHash);
    if (it == ring.end())
        return 0;                     // wrap around
    return distance(ring.begin(), it);
}

Hash Function Choice

Hash Function   Output Size     Speed       Distribution   Notes
──────────────────────────────────────────────────────────────────────────────
MD5             128-bit         Moderate    Excellent      Widely used (Ketama). Cryptographic but slower.
SHA-1           160-bit         Moderate    Excellent      Used in Chord and other DHTs.
xxHash          32/64/128-bit   Very fast   Excellent      Non-cryptographic. Great for hash rings.
MurmurHash3     32/128-bit      Very fast   Excellent      Used in many distributed systems (Cassandra's default partitioner).
CRC32           32-bit          Very fast   Good           Small output. Faster but less uniform.
💡 Practical Note: For consistent hashing, you do not need a cryptographic hash. xxHash or MurmurHash3 are typically the best choices: extremely fast with excellent distribution. The original Ketama library used MD5, which is overkill for this purpose.

Ring Size Considerations

A 32-bit hash gives 2³² (≈4.3 billion) positions on the ring, which is more than sufficient for most systems. With virtual nodes (covered next), even a 100-server cluster with 200 virtual nodes each yields only 20,000 positions to store—trivially small.

4 · Virtual Nodes

With only a few physical servers, the hash ring can become severely unbalanced. If servers are spaced unevenly on the ring, one server might own 60% of the key space while another owns only 10%.

The Problem: Uneven Distribution

With N servers placed at random on the ring, each server's share of the key space is approximately exponentially distributed with mean 1/N, so a server's load has a standard deviation roughly equal to its mean:

σ ≈ average load (≈100% of the average, regardless of N)
Expected maximum share ≈ (ln N) × average load

In other words, some servers end up owning several times their fair share of the key space while others own almost nothing — catastrophically uneven, and adding more servers does not fix it.

The Solution: Virtual Nodes (Vnodes)

Instead of placing each physical server at one position, we place it at V positions (virtual nodes). Each physical server S_i is mapped to virtual nodes hash("S_i#0"), hash("S_i#1"), ..., hash("S_i#V−1").

// Generate virtual node positions for a server
vector<uint32_t> virtualNodes(const string& serverId, int V) {
    vector<uint32_t> positions;
    for (int i = 0; i < V; i++) {
        string vnode = serverId + "#" + to_string(i);
        positions.push_back(hash(vnode));
    }
    return positions;
}

How Many Virtual Nodes?

Virtual Nodes (V)   Std Dev of Load   Max/Avg Load Ratio   Memory per Server
─────────────────────────────────────────────────────────────────────────────
1 (no vnodes)       ~100%             ~3×                  4 bytes
10                  ~31%              ~1.6×                40 bytes
50                  ~14%              ~1.3×                200 bytes
100                 ~10%              ~1.2×                400 bytes
150                 ~8%               ~1.15×               600 bytes
200                 ~7%               ~1.12×               800 bytes
💡 Industry Standard: Most production systems use 100–200 virtual nodes per physical server. Cassandra defaults to 256 tokens (virtual nodes) per node. This provides near-uniform distribution while keeping the ring lookup table small. With 1000 servers × 200 vnodes = 200,000 entries, the ring still fits in under 1 MB.
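
The trend in the table is easy to reproduce with a short Monte Carlo simulation (a sketch assuming 100 servers and random 32-bit ring positions):

import random
import statistics

def load_stddev(num_servers=100, vnodes=1, trials=20):
    """Relative std dev of per-server key-space share (Monte Carlo)."""
    rel = []
    for _ in range(trials):
        points = []                          # (ring position, owning server)
        for s in range(num_servers):
            for _ in range(vnodes):
                points.append((random.getrandbits(32), s))
        points.sort()
        share = [0] * num_servers
        for i, (p, s) in enumerate(points):
            prev = points[i - 1][0]          # i == 0 wraps to the last point
            share[s] += (p - prev) % 2**32   # this vnode owns the arc (prev, p]
        mean = sum(share) / num_servers
        rel.append(statistics.pstdev(share) / mean)
    return sum(rel) / trials

for v in (1, 10, 100, 200):
    print(f"V = {v:3d}  →  relative std dev ≈ {load_stddev(vnodes=v):.0%}")
# roughly 100%, 31%, 10%, 7% — in line with the table above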

Benefits Beyond Load Balancing

  • Faster recovery: when a physical server fails, its V ranges are picked up by many different successors, so the rebuild load is spread across the cluster instead of hitting one neighbor.
  • Heterogeneous hardware: a more powerful machine can simply be assigned more virtual nodes and will receive a proportionally larger share of keys.
  • Smoother scaling: a new server pulls many small ranges from many existing servers rather than one large range from a single neighbor.

Trade-offs

  • More ring entries to store and keep in sync (N × V positions instead of N).
  • Slightly slower lookups and larger membership/gossip metadata.
  • Each physical server owns many small, scattered ranges, which makes range-oriented operations (repair, backup, range scans) more fragmented.

5 · Adding and Removing Nodes

Adding a Node

When server S_new joins the ring:

  1. Compute all V virtual node positions for S_new
  2. For each virtual node position p:
    • Find the current owner of position p (the next clockwise server)
    • Transfer keys in the range (predecessor(p), p] from the current owner to S_new
  3. Insert all virtual node positions into the ring
Keys transferred = K / (N + 1)

With V virtual nodes each, we transfer V small ranges from up to V different successor servers, totaling K/(N+1) keys.
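
A sketch of step 2, assuming the ring is kept as a sorted list of (position, server) pairs; for each new virtual-node position it reports which range leaves which server:

import bisect

def ranges_to_transfer(ring, new_positions):
    """For each new vnode position p, find (old_owner, (pred, p]) that moves.

    ring: sorted list of (position, server_id) pairs for the existing cluster.
    Simplification: each new position is treated independently; if two new
    vnodes land in the same gap, their ranges would overlap slightly.
    """
    positions = [p for p, _ in ring]
    transfers = []
    for p in sorted(new_positions):
        i = bisect.bisect_left(positions, p)        # current clockwise owner of p
        old_owner = ring[i % len(ring)][1]
        pred = positions[i - 1]                     # counter-clockwise neighbor (wraps)
        transfers.append((old_owner, (pred, p)))    # keys in (pred, p] move to the new node
    return transfers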

Removing a Node

When server S_old leaves the ring (planned or crash):

  1. For each of S_old's V virtual node positions:
    • Find the successor server (next clockwise)
    • Transfer keys owned by this virtual node to the successor
  2. Remove all V virtual node positions from the ring

Graceful Migration Strategy

In practice, data migration happens in the background while the system continues to serve requests. Common strategies include:

  • Lazy migration (caches): let keys fault in on the new owner and simply expire on the old one; no explicit copy is required.
  • Throttled background transfer: stream the affected ranges to the new owner while the old owner keeps serving reads, then cut over.
  • Dual ownership during the move: write to (or read-repair between) both the old and new owners until the transfer completes, so neither side serves stale data.

Modular vs. Consistent: Side-by-Side

▶ Simple Hash vs. Consistent Hash

See the dramatic difference when adding a 4th server. Left: modular hash remaps nearly everything. Right: consistent hash remaps only one key.

6 · Bounded Load Consistent Hashing

Standard consistent hashing provides good average distribution but does not give hard guarantees on maximum load. In 2016, Google published "Consistent Hashing with Bounded Loads" (Mirrokni, Thorup, Zadimoghaddam) to address this.

The Idea

Each server has a capacity cap of (1 + ε) × (K/N) where ε is a small constant (e.g., 0.25). When a key hashes to a server that is already at capacity, the key spills to the next server clockwise that has available capacity.

capacity(server) = ceil((1 + ε) × total_keys / num_servers)

For ε = 0.25 with 1000 keys and 10 servers:
  Average load = 100 keys/server
  Max allowed  = ceil(1.25 × 100) = 125 keys/server

Algorithm

// Pseudocode — assumes 'ring' tracks per-server load and 'totalKeys' is known.
// The walk terminates because total capacity (1 + ε)·K exceeds the key count K.
Server findBoundedServer(Key key, Ring ring, double epsilon) {
    int cap = ceil((1 + epsilon) * totalKeys / ring.numServers());

    uint32_t h = hash(key);
    Server s = ring.nextClockwise(h);

    while (s.currentLoad() >= cap) {
        s = ring.nextClockwise(s.position());
    }
    return s;
}
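
The same walk as a self-contained Python sketch (a simplification: positions is a sorted list of vnode positions, owner maps each position to a server id, and load tracks how many keys each server has already taken — all names here are assumptions, not a fixed API):

import bisect
import math

def bounded_assign(key_hash, positions, owner, load, total_keys, eps=0.25):
    """Place one key: walk clockwise past servers that are already at capacity."""
    num_servers = len(set(owner.values()))
    cap = math.ceil((1 + eps) * total_keys / num_servers)
    i = bisect.bisect_left(positions, key_hash)
    for step in range(len(positions)):                 # at most one full lap
        server = owner[positions[(i + step) % len(positions)]]
        if load.get(server, 0) < cap:
            load[server] = load.get(server, 0) + 1
            return server
    raise RuntimeError("every server is at capacity")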

Properties

  • Hard guarantee: no server ever holds more than ⌈(1 + ε) × K/N⌉ keys — a bound, not just an expectation.
  • Tunable: a smaller ε gives tighter balance but more "spilled" keys that live on a server other than their natural ring successor.
  • Still incremental: adding or removing a server moves only a small fraction of keys, preserving the core property of consistent hashing.

⚠ Trade-off: Bounded-load consistent hashing sacrifices the nice property that a key always maps to the same server. Under load pressure, keys can spill to different servers. This complicates caching since a key might be on different servers at different times. Vimeo contributed an implementation of this approach to HAProxy's consistent-hash load balancing.

7 · Jump Consistent Hashing

In 2014, Google engineers Lamping and Veach published "A Fast, Minimal Memory, Consistent Hash Algorithm" describing jump consistent hashing—an algorithm that is:

  • O(1) in memory: no ring, no lookup table, nothing stored per bucket
  • fast: expected O(ln N) iterations per lookup with a tiny constant factor
  • perfectly uniform: keys split evenly across buckets without virtual nodes
  • consistent: growing from N to N+1 buckets moves only ~1/(N+1) of the keys

The Algorithm

The entire algorithm fits in about 10 lines:

int32_t jumpConsistentHash(uint64_t key, int32_t numBuckets) {
    int64_t b = -1, j = 0;
    while (j < numBuckets) {
        b = j;
        key = key * 2862933555777941757ULL + 1;
        j = (int64_t)((b + 1) * (double(1LL << 31) /
            double((key >> 33) + 1)));
    }
    return (int32_t)b;
}

How It Works (Intuition)

Think of it as a sequence of “jumps” through bucket assignments. For each bucket count from 1 to N, the algorithm decides whether the key should “jump” to the newly added bucket or stay put. When the bucket count grows from n to n+1, a key jumps to the new bucket n with probability 1/(n+1), which ensures that exactly 1/(N+1) of keys move (in expectation) when bucket N+1 is added.
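
A direct Python port of the routine above (the 64-bit mask stands in for C's unsigned overflow) makes the 1/(N+1) property easy to verify:

def jump_hash(key: int, num_buckets: int) -> int:
    """Python port of the C++ routine above."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b

keys = range(100_000)
moved = sum(1 for k in keys if jump_hash(k, 10) != jump_hash(k, 11))
print(f"moved {moved / len(keys):.1%}  (expected ~{1/11:.1%})")
# → ~9.1% of keys move when the bucket count grows from 10 to 11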

Jump Hash vs. Ring Hash

Property                 Ring Hash                    Jump Hash
───────────────────────────────────────────────────────────────────
Memory                   O(N × V)                     O(1)
Lookup Time              O(log(N × V))                O(ln N)
Distribution             Near-uniform (with vnodes)   Perfectly uniform
Arbitrary node removal   ✓ Yes                        ✗ No (end only)
Named nodes              ✓ Yes (IP, hostname)         ✗ No (0..n−1)
Dynamic topology         ✓ Flexible                   ✗ Linear only
⚠ Critical Limitation: Jump hash can only handle numbered buckets 0 to n−1. You can add bucket n or remove bucket n−1, but you cannot remove an arbitrary bucket in the middle. This makes it unsuitable for server failure scenarios. It works well for statically sized partitions (like sharded databases where shard count is fixed).

8 · Rendezvous Hashing (HRW)

Highest Random Weight (HRW) hashing was independently developed by Thaler and Ravishankar in 1998. Unlike consistent hashing, which uses a ring, rendezvous hashing takes a direct approach.

Algorithm

For each key, compute a score for every server and pick the server with the highest score:

Server findServer(Key key, vector<Server>& servers) {
    Server best;
    uint64_t bestScore = 0;

    for (auto& s : servers) {
        uint64_t score = hash(key + s.id());
        if (score > bestScore) {
            bestScore = score;
            best = s;
        }
    }
    return best;
}

Properties

  • No state beyond the server list: nothing to sort, store, or rebalance.
  • Perfectly uniform: for any key, each server is equally likely to have the highest score.
  • Minimal disruption: removing a server reassigns only that server's keys (each moves to its next-highest scorer); adding a server takes over ~1/(N+1) of the keys.
  • Lookup is O(N) per key, since every server must be scored.

When to Use Rendezvous vs. Ring

Use Rendezvous When:

  • Small number of servers (<50)
  • Simplicity matters more than lookup speed
  • Need perfect distribution without tuning vnodes
  • Need k-of-n server selection (e.g., top-3 servers for replication)

Use Ring Hash When:

  • Large number of servers (100+)
  • Need O(log N) lookups
  • Topology changes frequently
  • Already have vnodes infrastructure
💡 Replication Bonus: Rendezvous hashing naturally supports replication. To replicate a key to k servers, pick the top-k scoring servers. If one server fails, the key is already on k−1 others, and the replacement is deterministic (the (k+1)-th highest scoring server).
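
A sketch of that k-of-n selection (blake2b as the scoring hash is an arbitrary choice; any well-mixed hash works):

import hashlib

def top_k_servers(key: str, servers, k: int):
    """Rank servers by their rendezvous score for this key; take the top k."""
    def score(server: str) -> int:
        digest = hashlib.blake2b(f"{key}|{server}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return sorted(servers, key=score, reverse=True)[:k]

servers = ["cache-1", "cache-2", "cache-3", "cache-4", "cache-5"]
print(top_k_servers("user:1001", servers, 3))   # 3 deterministic replica targets
# Removing any one server leaves the remaining replicas unchanged and simply
# promotes the next-highest scorer — the behavior described above.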

9 · Real-World Usage

Amazon DynamoDB

DynamoDB (and its predecessor Dynamo, described in the seminal 2007 paper) pioneered consistent hashing in production. Each node is assigned multiple virtual nodes on an MD5-based hash ring. Key features:

  • The ring spans the 128-bit MD5 hash space (see the comparison table below).
  • Each key is stored on the N=3 next distinct physical nodes clockwise from its position — the "preference list".
  • Virtual nodes let heterogeneous hardware take on a share of the key space proportional to its capacity.

Apache Cassandra

Cassandra uses consistent hashing with 256 virtual nodes (tokens) per physical node by default:

  • Partition keys are hashed with Murmur3 onto a 64-bit token ring.
  • The replication factor is set per keyspace; replicas are the next distinct nodes clockwise, with topology-aware strategies spreading them across racks and data centers.
  • Because each node owns many small token ranges, a failed node is rebuilt by streaming from many peers in parallel.

Akamai CDN

Akamai (co-founded by Karger, one of the consistent hashing inventors) uses consistent hashing to assign web content to edge servers, so repeated requests for the same object land on the same (already warm) cache within a region.

Discord

Discord uses consistent hashing to route messages to the correct guild (server) shard: each guild ID is hashed onto the ring, so every message and event for that guild is handled by the same node.

Memcached (Ketama)

The Ketama library is the de facto standard for consistent hashing in Memcached clients: it places each server at 160 points on an MD5-based ring (see the table below) and performs the lookup entirely on the client side, so the memcached servers themselves need no changes.

System               Hash Function   Virtual Nodes   Ring Size   Replication
──────────────────────────────────────────────────────────────────────────────
DynamoDB             MD5             Variable        2¹²⁸        N=3 successors
Cassandra            Murmur3         256 default     2⁶⁴         Configurable RF
Akamai CDN           Proprietary     Yes             2³²         Multi-tier
Memcached (Ketama)   MD5             160             2³²         None (cache)
Redis Cluster        CRC16           16384 slots     16384       Replica shards

10 · Implementation

Here is a complete C++ implementation of consistent hashing with virtual nodes:

#include <map>
#include <string>
#include <vector>
#include <functional>
#include <stdexcept>
#include <sstream>
#include <cstdint>

class ConsistentHash {
private:
    int numVnodes;
    std::map<uint32_t, std::string> ring;       // hash → server id
    std::hash<std::string> hasher;

    uint32_t hashKey(const std::string& key) const {
        // In production, use MurmurHash3 or xxHash.
        // std::hash is used here for simplicity.
        return static_cast<uint32_t>(hasher(key));
    }

public:
    explicit ConsistentHash(int vnodes = 150)
        : numVnodes(vnodes) {}

    // Add a physical server with V virtual nodes
    void addServer(const std::string& serverId) {
        for (int i = 0; i < numVnodes; i++) {
            std::string vnode = serverId + "#" + std::to_string(i);
            uint32_t h = hashKey(vnode);
            ring[h] = serverId;
        }
    }

    // Remove a physical server and all its virtual nodes
    void removeServer(const std::string& serverId) {
        for (int i = 0; i < numVnodes; i++) {
            std::string vnode = serverId + "#" + std::to_string(i);
            uint32_t h = hashKey(vnode);
            ring.erase(h);
        }
    }

    // Find the server responsible for a key
    std::string getServer(const std::string& key) const {
        if (ring.empty())
            throw std::runtime_error("Empty ring");

        uint32_t h = hashKey(key);
        // Find the first server position >= h (clockwise search)
        auto it = ring.lower_bound(h);
        if (it == ring.end())
            it = ring.begin();   // wrap around to 0
        return it->second;
    }

    // Get N distinct servers for replication
    std::vector<std::string> getServers(
            const std::string& key, int count) const {
        if (ring.empty())
            throw std::runtime_error("Empty ring");

        std::vector<std::string> result;
        uint32_t h = hashKey(key);
        auto it = ring.lower_bound(h);

        // Walk clockwise, visiting at most ring.size() positions so the
        // loop terminates even if count exceeds the number of distinct servers.
        for (size_t visited = 0;
             visited < ring.size() &&
                 result.size() < static_cast<size_t>(count);
             ++visited) {
            if (it == ring.end())
                it = ring.begin();   // wrap around

            // Skip duplicate physical servers
            bool found = false;
            for (auto& s : result) {
                if (s == it->second) { found = true; break; }
            }
            if (!found)
                result.push_back(it->second);

            ++it;
        }
        return result;
    }

    size_t ringSize() const { return ring.size(); }
};

Python Implementation

import hashlib
import bisect

class ConsistentHash:
    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self.ring = {}          # hash → server_id
        self.sorted_keys = []   # sorted list of hashes

    def _hash(self, key: str) -> int:
        """Generate a 32-bit hash for a given key."""
        return int(hashlib.md5(
            key.encode()).hexdigest(), 16) % (2**32)

    def add_server(self, server_id: str):
        """Add a server with V virtual nodes to the ring."""
        for i in range(self.vnodes):
            vnode_key = f"{server_id}#{i}"
            h = self._hash(vnode_key)
            self.ring[h] = server_id
            bisect.insort(self.sorted_keys, h)

    def remove_server(self, server_id: str):
        """Remove a server and all its virtual nodes."""
        for i in range(self.vnodes):
            vnode_key = f"{server_id}#{i}"
            h = self._hash(vnode_key)
            if h in self.ring:
                del self.ring[h]
                self.sorted_keys.remove(h)

    def get_server(self, key: str) -> str:
        """Find the responsible server for a key."""
        if not self.ring:
            raise ValueError("Empty ring")
        h = self._hash(key)
        # Binary search for next clockwise server
        idx = bisect.bisect_left(self.sorted_keys, h)
        if idx == len(self.sorted_keys):
            idx = 0   # wrap around
        return self.ring[self.sorted_keys[idx]]

    def get_servers(self, key: str, count: int) -> list:
        """Get 'count' distinct servers for replication."""
        if not self.ring:
            raise ValueError("Empty ring")
        result = []
        h = self._hash(key)
        idx = bisect.bisect_left(self.sorted_keys, h)
        seen = set()
        # Visit at most len(sorted_keys) positions so the walk always
        # terminates, even if count exceeds the number of distinct servers.
        for _ in range(len(self.sorted_keys)):
            if len(result) >= count:
                break
            if idx == len(self.sorted_keys):
                idx = 0   # wrap around
            server = self.ring[self.sorted_keys[idx]]
            if server not in seen:
                result.append(server)
                seen.add(server)
            idx += 1
        return result


# ---- Usage example ----
ch = ConsistentHash(vnodes=150)
for s in ["server-A", "server-B", "server-C"]:
    ch.add_server(s)

print(ch.get_server("user:1001"))   # → one of A/B/C
print(ch.get_server("session:xyz")) # → one of A/B/C

# Add server D — only ~25% of keys move
ch.add_server("server-D")

# Replication: get top-3 servers for a key
print(ch.get_servers("user:1001", 3))  # → 3 distinct servers

Verifying Minimal Disruption

# Measure how many keys move when adding a server
import random

ch = ConsistentHash(vnodes=150)
servers = [f"server-{i}" for i in range(10)]
for s in servers:
    ch.add_server(s)

# Map 10,000 random keys
keys = [f"key-{i}" for i in range(10000)]
before = {k: ch.get_server(k) for k in keys}

# Add server-10
ch.add_server("server-10")
after = {k: ch.get_server(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"Keys moved: {moved}/{len(keys)}")
print(f"Percentage: {moved/len(keys)*100:.1f}%")
print(f"Expected:   {100/11:.1f}%")
# Output: ~9.1% (close to the theoretical 1/11 = 9.09%)

Summary & Cheat Sheet

Algorithm            Memory    Lookup          Distribution   Arbitrary Remove   Best For
──────────────────────────────────────────────────────────────────────────────────────────
Modular Hash (% N)   O(1)      O(1)            Uniform        ✗ (remaps all)     Static N only
Ring Hash + Vnodes   O(N×V)    O(log(N×V))     Near-uniform   ✓                  General purpose
Bounded Load Ring    O(N×V)    O(log(N×V))     Bounded        ✓                  Hot-spot prevention
Jump Hash            O(1)      O(ln N)         Perfect        ✗ (end only)       Static partitions
Rendezvous (HRW)     O(N)      O(N)            Perfect        ✓                  Small N, replication
💡 Interview Tip: In system design interviews, when asked about consistent hashing, mention the ring, virtual nodes, and at least one real-world system (DynamoDB or Cassandra). If you can also mention bounded-load hashing or rendezvous hashing as alternatives, you'll stand out. The key numbers to remember: adding a server to N nodes moves only ~1/(N+1) of the keys, and 100–200 virtual nodes provide near-uniform distribution.