High Level Design Series · Building Blocks · Part 8 of 70

Distributed Locking

Why Distributed Locks?

In a single-process application, a mutex or semaphore is enough to coordinate access to shared memory. But the moment your system spans multiple servers, containers, or even data centers, local mutexes are useless—they cannot cross process boundaries. That is where distributed locks come in: a mechanism that allows multiple independent processes to agree that exactly one of them holds exclusive access to a shared resource at any given time.

Real-World Scenarios

Typical cases where a distributed lock earns its keep:

  • Scheduled jobs: a cron job deployed on several servers must run exactly once per tick, not once per server.
  • Inventory and payments: two concurrent checkout requests must not both decrement the last unit of stock or charge the same card twice.
  • Leader election: a cluster of workers must agree on a single coordinator.

Key insight: Distributed locks are not about performance—they are about correctness. If two processes both believe they hold the lock, data corruption or business-logic violations can follow.

Properties of a Good Distributed Lock

Not all distributed locks are created equal. A robust implementation must satisfy several properties, formalized in the literature as safety and liveness guarantees:

  • Mutual Exclusion (Safety): At most one client can hold the lock at any given time. This is the fundamental guarantee. If this breaks, the lock is useless.
  • Deadlock-Free (Liveness): Even if a client crashes while holding the lock, the lock must eventually be released. Typically enforced via TTL (time-to-live) or session expiration.
  • Fault Tolerance (Availability): The lock service must remain operational even if some nodes fail. A single-node lock server is a single point of failure.
  • Fencing Tokens (Safety, extended): Each lock acquisition produces a monotonically increasing token. Resource servers reject stale tokens to prevent delayed writes from a previous lock holder.

The first three are widely agreed upon. The fourth—fencing tokens—was highlighted by Martin Kleppmann as essential for correctness in asynchronous systems. We will explore it in depth in a dedicated section.

Redis-Based Locking

Redis is the most popular choice for distributed locking because of its low latency and simplicity. The basic pattern uses a single Redis instance:

Acquiring the Lock

# SET key value NX EX ttl
# NX  = only set if key does NOT exist (mutual exclusion)
# EX  = set TTL in seconds (deadlock-freedom)
# value = a unique random string identifying THIS client

SET my_lock "client-a-uuid-123" NX EX 30

# Returns OK if the lock was acquired
# Returns (nil) if someone else already holds it

The NX flag ensures atomicity—only one client wins the race. The EX 30 sets a 30-second TTL so that if the client crashes, the lock auto-releases. The random value identifies the lock owner, preventing accidental release by another client.

Releasing the Lock

You must not simply DEL my_lock. If the TTL expired and another client acquired the lock, you would delete their lock. Instead, use a Lua script for atomic check-and-delete:

-- Lua script: release only if the value matches our unique ID
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end

-- Execute from client:
EVAL "if redis.call('GET',KEYS[1])==ARGV[1] then return redis.call('DEL',KEYS[1]) else return 0 end" 1 my_lock "client-a-uuid-123"
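Putting acquire and release together, the application-side flow looks like the sketch below. To keep it runnable without a server, it uses a hypothetical in-memory stand-in for a single Redis instance (`FakeRedis` is invented for this demo); with the real redis-py client, `r.set(key, value, nx=True, ex=30)` and `r.eval(script, 1, key, value)` play the same roles.

```python
import time
import uuid

class FakeRedis:
    """Hypothetical in-memory stand-in for one Redis instance (demo only)."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl):
        """SET key value NX EX ttl: succeed only if absent or expired."""
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            return False                           # held by someone else
        self.store[key] = (value, time.monotonic() + ttl)
        return True

    def compare_and_delete(self, key, value):
        """The Lua script's semantics: delete only if we are the owner."""
        entry = self.store.get(key)
        if entry and entry[0] == value:
            del self.store[key]
            return True
        return False

def acquire(r, key, ttl=30):
    token = str(uuid.uuid4())                      # unique per acquisition
    return token if r.set_nx_ex(key, token, ttl) else None

def release(r, key, token):
    return r.compare_and_delete(key, token)

r = FakeRedis()
a = acquire(r, "my_lock")        # client A wins the race
b = acquire(r, "my_lock")        # client B is refused (b is None)
release(r, "my_lock", "wrong")   # a non-owner cannot release A's lock
release(r, "my_lock", a)         # only A's token releases it
```

The owner check on release is the whole point of the random value: without it, a slow client whose TTL already expired could delete the next holder's lock.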

Problems with Single-Instance Redis Locks

  • Single point of failure: if the one Redis node goes down, no client can acquire or release anything.
  • Unsafe failover: Redis replication is asynchronous. If the primary fails right after granting a lock, the promoted replica may never have seen it and will grant the same lock to a second client.
  • TTL expiry during pauses: if the holder is delayed past the TTL (GC pause, network stall), the lock expires and two clients proceed at once.

Bottom line: A single-Redis-instance lock is fine for efficiency purposes (avoiding duplicate work) but is NOT safe for correctness purposes (preventing data corruption). For correctness, you need either Redlock or a consensus-based system.

Redlock Algorithm

Redlock, proposed by Salvatore Sanfilippo (Antirez), addresses the single-instance weakness by using N independent Redis instances (typically 5). The algorithm:

Step-by-Step

  1. Record the start time (T1) in milliseconds.
  2. Sequentially try to acquire the lock on all N Redis instances using the same key, same random value, and the same TTL. Use a small per-instance timeout (e.g., 5–10ms) so that a down instance doesn't slow you down.
  3. Calculate elapsed time: elapsed = now() - T1.
  4. The lock is acquired if and only if:
    • The client acquired the lock on a majority of instances (at least ⌊N/2⌋ + 1, so 3 out of 5).
    • The total elapsed time is less than the lock TTL. The effective lock validity is TTL - elapsed.
  5. If the lock was NOT acquired (fewer than majority or timeout exceeded), release the lock on ALL instances (even the ones where it was acquired) to avoid deadlocks.

# Pseudocode for Redlock acquisition
import time, uuid, redis

LOCK_KEY = "my_resource"
LOCK_TTL = 30000  # 30 seconds in ms
QUORUM = 3        # majority of 5
RETRY_DELAY = 200 # ms between retries
REDIS_HOSTS = ["redis1", "redis2", "redis3", "redis4", "redis5"]
instances = [redis.Redis(host=h) for h in REDIS_HOSTS]  # 5 independent instances

def acquire_lock():
    value = str(uuid.uuid4())
    start = time.monotonic_ns() // 1_000_000  # ms

    acquired = 0
    for r in instances:
        try:
            if r.set(LOCK_KEY, value, nx=True, px=LOCK_TTL):
                acquired += 1
        except redis.ConnectionError:
            pass  # instance down, skip

    elapsed = (time.monotonic_ns() // 1_000_000) - start
    validity = LOCK_TTL - elapsed

    if acquired >= QUORUM and validity > 0:
        return value, validity  # lock acquired!
    else:
        # Release on ALL instances
        for r in instances:
            try:
                release_lock(r, value)
            except Exception:
                pass  # best-effort release; instance may be down
        return None, 0

def release_lock(r, value):
    lua = """
    if redis.call("GET",KEYS[1])==ARGV[1] then
        return redis.call("DEL",KEYS[1])
    else return 0 end
    """
    r.eval(lua, 1, LOCK_KEY, value)


The Kleppmann vs Antirez Debate

In 2016, Martin Kleppmann published a widely-read analysis titled "How to do distributed locking" that criticized Redlock. Salvatore Sanfilippo (creator of Redis) responded with a rebuttal. This debate remains one of the most important discussions in distributed systems:

Kleppmann’s criticisms, with Antirez’s rebuttals:

  • GC pauses break safety. A client acquires the lock, then a GC pause longer than the TTL causes the lock to expire. Another client acquires it. The first client resumes and operates on shared data, believing it still holds the lock. Two clients now have "the lock."
    Rebuttal: The client checks the remaining lock validity after acquiring it; if the GC pause happened during acquisition, the validity check catches it. (Kleppmann counters that the pause can equally happen after the check.)
  • Clock jumps break safety. Redlock relies on wall-clock time for TTL calculations. If an NTP correction jumps a Redis server’s clock forward, the lock expires prematurely.
    Rebuttal: Use monotonic clocks where possible. Redis uses gettimeofday(), which can be affected by NTP, but in practice large jumps are rare with well-configured NTP (smearing, not stepping).
  • No fencing tokens. Redlock does not produce monotonically increasing tokens. Without fencing, even correctly implemented locks cannot prevent split-brain writes.
    Rebuttal: If you need fencing tokens, you need a consensus system anyway. Redlock is designed for a different (valid) use case where strong fencing isn’t required.
  • Use a proper consensus system. If you need a lock for correctness, use ZooKeeper, etcd, or Chubby—systems designed with formal correctness proofs and linearizable semantics.
    Rebuttal: Redlock provides a practical middle ground. Not everyone needs the operational complexity of ZooKeeper; for many use cases, Redlock’s guarantees are sufficient.

Verdict: Kleppmann’s argument is widely considered technically correct: for correctness-critical locks, Redlock is insufficient because it lacks fencing tokens and relies on timing assumptions. Use Redlock for efficiency (preventing duplicate work where occasional double-execution is tolerable). For correctness, use a consensus-based system with fencing.

ZooKeeper Locks

Apache ZooKeeper provides a more reliable locking primitive built on top of its consensus protocol (ZAB — ZooKeeper Atomic Broadcast). Unlike Redis, ZooKeeper was designed for coordination tasks.

How It Works: Ephemeral Sequential Znodes

  1. Create an ephemeral sequential znode under a lock path:
    # Client creates an ephemeral sequential node
    create -e -s /locks/my_resource/lock-
    # ZooKeeper returns: /locks/my_resource/lock-0000000001

    -e = ephemeral (auto-deleted when session ends), -s = sequential (ZooKeeper appends a monotonically increasing counter).

  2. List all children of the lock path:
    ls /locks/my_resource
    # Returns: [lock-0000000001, lock-0000000003, lock-0000000005]
  3. Check if you have the lowest sequence number:
    • Yes: You hold the lock. Proceed with your critical section.
    • No: Set a watch on the znode with the next-lower sequence number (not the lowest — this avoids the "herd effect" where all waiters wake up simultaneously).
  4. When the watched znode is deleted (previous holder releases or crashes), you are notified. Re-check if you now have the lowest number.
  5. Release: Simply delete your znode, or let your session expire (ZooKeeper automatically deletes ephemeral nodes).
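The predecessor-watch choice in steps 3–4 is what avoids the herd effect. A minimal sketch of just that decision, as pure Python over a list of znode names (no real ZooKeeper involved):

```python
def lock_decision(children, my_node):
    """Given the children of the lock path and our own znode name, decide
    whether we hold the lock and, if not, which single znode to watch.
    Returns (have_lock, node_to_watch)."""
    # Zero-padded sequence numbers sort correctly as strings.
    ordered = sorted(children)
    idx = ordered.index(my_node)
    if idx == 0:
        return True, None              # lowest sequence number: we hold the lock
    # Watch only the immediate predecessor, not the lowest node, so that
    # a release or crash wakes exactly one waiter instead of all of them.
    return False, ordered[idx - 1]

children = ["lock-0000000001", "lock-0000000003", "lock-0000000005"]
lock_decision(children, "lock-0000000001")   # → (True, None)
lock_decision(children, "lock-0000000005")   # → (False, "lock-0000000003")
```

In a real client this runs after every watch notification, since the set of children may have changed again in the meantime.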
# Pseudocode: ZooKeeper distributed lock (using Kazoo Python client)
from kazoo.client import KazooClient
from kazoo.recipe.lock import Lock

zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
zk.start()

lock = Lock(zk, "/locks/my_resource")

# Blocking acquire
lock.acquire(timeout=30)
try:
    # Critical section: only one client executes this at a time
    process_shared_resource()
finally:
    lock.release()

# Or use as context manager
with Lock(zk, "/locks/my_resource"):
    process_shared_resource()

Advantages Over Redis

  • Ephemeral nodes are tied to the client session, so a crashed holder’s lock is released automatically without guessing a TTL.
  • Watches push release notifications to waiters instead of forcing them to poll, and the predecessor-watch pattern avoids the herd effect.
  • Sequential znode numbers (and zxids) double as fencing tokens.
  • Locks sit on top of ZAB consensus, so mutual exclusion survives node failures as long as a quorum is up.

Drawbacks

  • Operationally heavier: an ensemble (typically 3–5 JVM nodes) to deploy, tune, and monitor.
  • Higher latency and lower lock throughput than Redis, since every write goes through consensus.
  • Session liveness still depends on timeouts and heartbeats, so timing assumptions are reduced, not eliminated.

etcd Locks

etcd uses the Raft consensus algorithm and provides distributed locking through leases and transactions. It is the coordination backbone of Kubernetes.

Lease-Based Locks

# Grant a lease (TTL = 30 seconds)
etcdctl lease grant 30
# Output: lease 694d81417acd0b17 granted with TTL(30s)

# Put a key with the lease attached
etcdctl put --lease=694d81417acd0b17 /locks/my_resource "client-a"

# The key auto-deletes when the lease expires
# Keep the lease alive with periodic keep-alives:
etcdctl lease keep-alive 694d81417acd0b17

Transactions for Atomic Lock Acquisition

# Atomic compare-and-set: only create the lock if it doesn't exist
etcdctl txn --interactive
# compares:
create("/locks/my_resource") = "0"
# success requests:
put /locks/my_resource "client-a" --lease=694d81417acd0b17
# failure requests:
get /locks/my_resource

The transaction checks that the key’s create_revision is 0 (key doesn’t exist). If true, it creates the key with the lease. If false, it returns the current holder’s value. This is atomic and linearizable.
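To make those semantics concrete without an etcd cluster, here is a tiny in-memory model of the transaction (the `MiniKV` class is invented for this demo): acquisition succeeds only when the key’s create revision is 0, i.e. the key does not exist, and the revision counter it returns already behaves like a fencing token.

```python
class MiniKV:
    """Hypothetical in-memory model of etcd's create_revision compare-and-put."""
    def __init__(self):
        self.data = {}       # key -> (value, create_revision)
        self.revision = 0    # global, monotonically increasing

    def txn_acquire(self, key, value):
        """Atomic: if create_revision(key) == 0, put the key (success branch);
        otherwise return the current holder (failure branch)."""
        if key not in self.data:                  # create_revision == 0
            self.revision += 1
            self.data[key] = (value, self.revision)
            return True, self.revision            # revision doubles as fencing token
        return False, self.data[key][0]           # lock held: report the holder

kv = MiniKV()
kv.txn_acquire("/locks/my_resource", "client-a")   # → (True, 1)
kv.txn_acquire("/locks/my_resource", "client-b")   # → (False, "client-a")
```

Real etcd adds leases on top of this, so the key (and with it the lock) disappears automatically if the holder stops sending keep-alives.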

Using etcd’s Built-in Lock Command

# etcd provides a high-level lock command
etcdctl lock /locks/my_resource
# Blocks until the lock is acquired
# Holds the lock until the process exits or is interrupted (Ctrl+C)

# Execute a command while holding the lock
etcdctl lock /locks/my_resource -- echo "I have the lock"

Programmatic Example (Go)

import (
    "context"
    "fmt"

    clientv3 "go.etcd.io/etcd/client/v3"
    "go.etcd.io/etcd/client/v3/concurrency"
)

cli, _ := clientv3.New(clientv3.Config{
    Endpoints: []string{"etcd1:2379", "etcd2:2379", "etcd3:2379"},
})
defer cli.Close()

// Create a session (lease-based)
session, _ := concurrency.NewSession(cli, concurrency.WithTTL(30))
defer session.Close()

// Create a mutex
mu := concurrency.NewMutex(session, "/locks/my_resource")

// Acquire
mu.Lock(context.TODO())
defer mu.Unlock(context.TODO())

// Critical section
processSharedResource()

// The lock's revision can serve as a fencing token:
fmt.Println("Fencing token:", mu.Header().Revision)

The mu.Header().Revision is a monotonically increasing number from etcd’s Raft log. This makes etcd locks compatible with fencing token requirements out of the box.

Fencing Tokens

Fencing tokens are the critical concept that separates a distributed lock that is merely "good enough" from one that is provably correct. This idea, popularized by Martin Kleppmann, addresses a fundamental problem in asynchronous distributed systems.

The Problem: Zombie Lock Holders

Consider this scenario without fencing:

  1. Client A acquires the lock with TTL=30s.
  2. Client A begins processing, but hits a long GC pause (or network delay).
  3. The lock’s TTL expires after 30s. Client A is still paused.
  4. Client B acquires the lock (legitimately).
  5. Client B writes to the shared resource.
  6. Client A’s GC pause ends. It still believes it holds the lock.
  7. Client A writes to the shared resource, overwriting Client B’s data.

The lock’s mutual exclusion guarantee was violated. Both clients wrote to the resource "at the same time" from the resource’s perspective.

The Solution: Monotonically Increasing Tokens

Each time a lock is acquired, the lock service issues a fencing token—a monotonically increasing number (e.g., ZooKeeper’s zxid, etcd’s revision). The resource server (database, storage, API) validates this token:

# Lock service issues tokens:
# Client A acquires lock → token = 33
# Client B acquires lock → token = 34

# Resource server tracks the highest token it has seen:
highest_token_seen = 0

def write(data, token):
    global highest_token_seen  # module-level state in this sketch
    if token <= highest_token_seen:
        raise RejectedError(f"Stale token {token} <= {highest_token_seen}")
    highest_token_seen = token
    actually_write(data)

# Client A (token=33) writes first  → accepted, highest=33
# Client B (token=34) writes        → accepted, highest=34
# Client A (delayed, token=33) tries → REJECTED (33 < 34)


Where Fencing Tokens Come From

  • ZooKeeper: sequential znode number or zxid (transaction ID)
  • etcd: revision from the Raft log (returned by mu.Header().Revision)
  • Chubby (Google): lock sequencer (built-in)
  • Redis / Redlock: not natively supported; requires external sequencing

Fencing requires cooperation from the resource server: it must understand and enforce token ordering. This is straightforward for databases (WHERE token > last_seen_token) but harder for arbitrary APIs or third-party services.
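For a SQL-backed resource, that check is a one-line guard on the write. A runnable sketch using SQLite (the table and column names are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource (id INTEGER PRIMARY KEY, data TEXT, last_token INTEGER)"
)
conn.execute("INSERT INTO resource VALUES (42, 'initial', 0)")

def fenced_write(conn, rid, data, token):
    """Apply the write only if this fencing token is newer than any seen so far."""
    cur = conn.execute(
        "UPDATE resource SET data = ?, last_token = ? "
        "WHERE id = ? AND last_token < ?",
        (data, token, rid, token),
    )
    return cur.rowcount == 1   # 0 rows → stale token, write rejected

fenced_write(conn, 42, "from A", 33)     # accepted, last_token = 33
fenced_write(conn, 42, "from B", 34)     # accepted, last_token = 34
fenced_write(conn, 42, "zombie A", 33)   # rejected: 33 is not > 34
```

Because the comparison and the write happen in one atomic UPDATE, a delayed write from a previous lock holder can never overwrite the current holder’s data.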

Database-Based Locks

If your application already uses a relational database, you can leverage it as a lock service. This avoids adding a new infrastructure dependency, though it comes with trade-offs.

SELECT FOR UPDATE (Row-Level Locking)

-- Create a locks table
CREATE TABLE distributed_locks (
    lock_name   VARCHAR(255) PRIMARY KEY,
    locked_by   VARCHAR(255),
    locked_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expires_at  TIMESTAMP
);

-- Attempt to acquire (within a transaction)
BEGIN;
SELECT * FROM distributed_locks
    WHERE lock_name = 'my_resource'
    FOR UPDATE SKIP LOCKED;

-- If the row is returned, check expiration and update
UPDATE distributed_locks
    SET locked_by = 'client-a',
        locked_at = NOW(),
        expires_at = NOW() + INTERVAL '30 seconds'
    WHERE lock_name = 'my_resource'
      AND (expires_at IS NULL OR expires_at < NOW());
COMMIT;

-- Release
UPDATE distributed_locks
    SET locked_by = NULL, expires_at = NULL
    WHERE lock_name = 'my_resource'
      AND locked_by = 'client-a';

PostgreSQL Advisory Locks

PostgreSQL provides dedicated advisory lock functions that don’t require a table:

-- Session-level advisory lock (released when session ends)
SELECT pg_advisory_lock(12345);
-- ... critical section ...
SELECT pg_advisory_unlock(12345);

-- Try without blocking (returns true/false)
SELECT pg_try_advisory_lock(12345);

-- Transaction-level advisory lock (released at COMMIT/ROLLBACK)
SELECT pg_advisory_xact_lock(12345);

-- Using two integers for namespacing
SELECT pg_advisory_lock(1, 42);  -- namespace=1, resource=42

MySQL GET_LOCK

-- Acquire (timeout 30 seconds)
SELECT GET_LOCK('my_resource', 30);
-- Returns 1 if acquired, 0 if timed out, NULL on error

-- Release
SELECT RELEASE_LOCK('my_resource');

-- Check without acquiring
SELECT IS_FREE_LOCK('my_resource');

Pros and Cons

Pros:
  • No additional infrastructure
  • Transactional semantics (ACID)
  • Works with your existing HA setup

Cons:
  • Database becomes a bottleneck at scale
  • Lock contention increases query latency
  • Not designed for high-throughput locking
  • Connection pool exhaustion under heavy locking

Lock-Free Alternatives

Sometimes the best distributed lock is no lock at all. Several patterns avoid the need for explicit locking:

Optimistic Locking (Version Numbers)

Instead of preventing concurrent access, allow it but detect conflicts at write time:

-- Read the current version
SELECT data, version FROM resources WHERE id = 42;
-- Returns: data='old_value', version=7

-- Update only if the version hasn't changed
UPDATE resources
    SET data = 'new_value', version = version + 1
    WHERE id = 42 AND version = 7;

-- Check affected rows:
-- 1 row  → success, your write went through
-- 0 rows → conflict! Someone else updated first. Retry.

Optimistic locking is one of the most widely used concurrency-control mechanisms in web applications; ORMs such as Hibernate (@Version) and ActiveRecord (lock_version) support it out of the box. For most low-contention workloads it avoids the overhead and complexity of distributed locks entirely.
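The read–compare–retry loop around that UPDATE can be sketched end to end with SQLite (the schema and names are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resources (id INTEGER PRIMARY KEY, data TEXT, version INTEGER)"
)
conn.execute("INSERT INTO resources VALUES (42, 'old_value', 7)")

def optimistic_update(conn, rid, new_data, max_retries=3):
    """Read the current version, write conditionally, retry on conflict."""
    for _ in range(max_retries):
        (version,) = conn.execute(
            "SELECT version FROM resources WHERE id = ?", (rid,)
        ).fetchone()
        cur = conn.execute(
            "UPDATE resources SET data = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_data, rid, version),
        )
        if cur.rowcount == 1:
            return True   # our write went through
        # 0 rows: someone else bumped the version first; re-read and retry
    return False

optimistic_update(conn, 42, "new_value")   # version 7 → 8
```

No lock is ever held across the read and the write; the version column turns the conflict into a detectable, retryable event instead of silent lost data.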

Compare-and-Swap (CAS)

CAS is the atomic primitive underlying optimistic locking. Many distributed stores support it natively:

# etcd: atomic CAS using transactions
etcdctl txn --interactive
# compares:
value("/config/setting") = "old_value"
# success:
put /config/setting "new_value"
# failure:
get /config/setting

# DynamoDB: conditional write
table.put_item(
    Item={'id': '42', 'data': 'new_value', 'version': 8},
    ConditionExpression='version = :v',
    ExpressionAttributeValues={':v': 7}
)
# Raises ConditionalCheckFailedException if version != 7

CRDTs (Conflict-Free Replicated Data Types)

CRDTs are data structures that can be independently updated on different nodes and merged without conflicts. They eliminate the need for locking by making all operations commutative:

When to Avoid Locks

Use locks when:
  • Strict mutual exclusion is required
  • Operations are not idempotent
  • External side effects (sending email, charging a card)
  • Leader election

Avoid locks when:
  • Operations are idempotent (safe to retry)
  • Optimistic concurrency works (low contention)
  • CRDTs can model your data
  • Eventual consistency is acceptable

Common Pitfalls

1. Lock Holder Dies Without Releasing

This is the most common pitfall. A client acquires the lock, crashes (segfault, OOM, power failure), and never releases it.

Mitigation: Never grant a lock without an expiry. Use a TTL (Redis), a lease (etcd), or an ephemeral session (ZooKeeper) so the lock is released automatically when its holder dies.

2. Clock Skew Between Nodes

If your lock’s correctness depends on time (TTL, lease duration), clock differences between machines can cause problems.

Mitigation: Use monotonic clocks for duration calculations (clock_gettime(CLOCK_MONOTONIC) or time.monotonic()). Keep NTP well-configured with slew mode (gradual adjustment) rather than step mode (sudden jumps).
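In Python that means measuring lock validity with `time.monotonic()`, which cannot jump backwards or forwards when NTP adjusts the wall clock. A minimal sketch:

```python
import time

LOCK_TTL = 30.0  # seconds

# Record the acquisition instant on the monotonic clock, not time.time():
acquired_at = time.monotonic()

def remaining_validity():
    """Seconds of lock validity left; immune to wall-clock (NTP) jumps."""
    return LOCK_TTL - (time.monotonic() - acquired_at)

# Check validity immediately before the critical operation:
if remaining_validity() > 0:
    pass  # safe to proceed (subject to the GC-pause caveat discussed below)
```

The same measurement done with `time.time()` could report a negative or inflated validity the moment NTP steps the clock.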

3. Network Partitions During Lock Hold

A client acquires the lock, then a network partition isolates it from the lock server.

Mitigation: Fencing tokens. The resource server will reject the original client’s stale token.

4. GC Pauses Extending Lock Hold Time

This is particularly insidious in JVM-based systems. A stop-the-world GC pause can last seconds or even minutes in pathological cases:

// Timeline:
t=0s    Client A acquires lock (TTL=30s, token=33)
t=1s    Client A starts processing
t=2s    Client A hits a full GC pause ← FROZEN
t=32s   Lock expires (30s TTL)
t=33s   Client B acquires lock (token=34)
t=34s   Client B writes to resource
t=45s   Client A's GC ends ← RESUMES, thinks it has the lock!
t=46s   Client A writes to resource with token=33
        → MUST BE REJECTED by resource server

Mitigation: Fencing tokens (again). Also: tune GC to minimize pause times (G1GC, ZGC, Shenandoah), use shorter TTLs with frequent renewals, and always check lock validity before performing the critical operation.

5. Forgetting to Release the Lock

# BAD: Exception between acquire and release = lock held until TTL
lock.acquire()
process()        # ← if this throws, release() is never called
lock.release()

# GOOD: Use try/finally or context managers
lock.acquire()
try:
    process()
finally:
    lock.release()

# BEST: Context manager (Python) or defer (Go)
with lock:
    process()

6. Retry Storms

If all clients retry immediately after failing to acquire the lock, they create a thundering herd effect that overwhelms the lock server.

Mitigation: Use exponential backoff with jitter:

import random, time

def acquire_with_backoff(lock, max_retries=10):
    for attempt in range(max_retries):
        if lock.acquire(blocking=False):
            return True
        # Exponential backoff with jitter
        wait = min(2 ** attempt * 0.1, 5.0)  # cap at 5s
        wait *= (0.5 + random.random())       # add jitter
        time.sleep(wait)
    return False

Summary: Choosing a Distributed Lock

Technology          Best For                          Fencing?          Complexity
Redis (single)      Efficiency, deduplication         ✗                 Low
Redlock             Better efficiency, no SPOF        ✗                 Medium
ZooKeeper           Correctness-critical locks        ✔                 High
etcd                Kubernetes-native, correctness    ✔                 Medium–High
Database            Simple apps, existing DB          ✔ (via version)   Low
No lock (CAS/CRDT)  High throughput, idempotent ops   N/A               Varies