Distributed Locking
Why Distributed Locks?
In a single-process application, a mutex or semaphore is enough to coordinate access to shared memory. But the moment your system spans multiple servers, containers, or even data centers, local mutexes are useless—they cannot cross process boundaries. That is where distributed locks come in: a mechanism that allows multiple independent processes to agree that exactly one of them holds exclusive access to a shared resource at any given time.
Real-World Scenarios
- Prevent double booking: Two users simultaneously try to reserve the last seat on a flight. Without a distributed lock on that seat inventory row, both could succeed, over-selling the flight.
- Single-writer guarantee: A distributed cache rebuilder should only be executed by one node at a time. If two nodes simultaneously rebuild, they might write partial or conflicting data.
- Cron job deduplication: In a Kubernetes deployment with N replicas, a scheduled task (send daily emails, archive old records) must run on exactly one pod. A distributed lock ensures only the winner executes (see the sketch after this list).
- Leader election: In a replicated database or streaming system, nodes compete for leadership. The winner writes; followers replicate. A distributed lock (or lease) is the foundation of this pattern.
- Rate-limited API calls: A third-party API allows one request per second. Multiple workers must coordinate so only one sends a request in each time window.
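To make the cron-deduplication scenario concrete, here is a minimal sketch using a single Redis instance and the SET NX EX pattern covered below. The key name, one-hour TTL, and the send_daily_emails job body are illustrative assumptions, not part of any particular framework:

# Minimal sketch: cron deduplication across N replicas via a Redis lock
import uuid
import redis

r = redis.Redis(host="redis")  # placeholder host

def run_daily_job():
    token = str(uuid.uuid4())
    # nx=True: only one replica can create the key; ex=3600: auto-release
    # after an hour in case the winning pod dies mid-job.
    if not r.set("cron:daily-emails", token, nx=True, ex=3600):
        return  # another replica won; do nothing
    send_daily_emails()  # hypothetical job body

The lock is deliberately left to expire rather than released, so the job cannot run a second time within the same window even if the winner finishes quickly.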
Properties of a Good Distributed Lock
Not all distributed locks are created equal. A robust implementation must satisfy several properties, formalized in the literature as safety and liveness guarantees:
| Property | Type | Description |
|---|---|---|
| Mutual Exclusion | Safety | At most one client can hold the lock at any given time. This is the fundamental guarantee. If this breaks, the lock is useless. |
| Deadlock-Free | Liveness | Even if a client crashes while holding the lock, the lock must eventually be released. Typically enforced via TTL (time-to-live) or session expiration. |
| Fault Tolerance | Availability | The lock service must remain operational even if some nodes fail. A single-node lock server is a single point of failure. |
| Fencing Tokens | Safety (extended) | Each lock acquisition produces a monotonically increasing token. Resource servers reject stale tokens to prevent delayed writes from a previous lock holder. |
The first three are widely agreed upon. The fourth—fencing tokens—was highlighted by Martin Kleppmann as essential for correctness in asynchronous systems. We will explore it in depth in a dedicated section.
Redis-Based Locking
Redis is the most popular choice for distributed locking because of its low latency and simplicity. The basic pattern uses a single Redis instance:
Acquiring the Lock
# SET key value NX EX ttl
# NX = only set if key does NOT exist (mutual exclusion)
# EX = set TTL in seconds (deadlock-freedom)
# value = a unique random string identifying THIS client
SET my_lock "client-a-uuid-123" NX EX 30
# Returns OK if the lock was acquired
# Returns (nil) if someone else already holds it
The NX flag ensures atomicity—only one client wins the race. The EX 30 sets a 30-second TTL so that if the client crashes, the lock auto-releases. The random value identifies the lock owner, preventing accidental release by another client.
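In application code this is a single call. A minimal sketch with the redis-py client, using the same key and TTL as above:

import uuid
import redis

r = redis.Redis()
token = str(uuid.uuid4())  # unique value identifying THIS client

# Equivalent of: SET my_lock <token> NX EX 30
if r.set("my_lock", token, nx=True, ex=30):
    print("lock acquired, owner token:", token)
else:
    print("lock is held by someone else")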
Releasing the Lock
You must not simply DEL my_lock. If the TTL expired and another client acquired the lock, you would delete their lock. Instead, use a Lua script for atomic check-and-delete:
-- Lua script: release only if the value matches our unique ID
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
-- Execute from client:
EVAL "if redis.call('GET',KEYS[1])==ARGV[1] then return redis.call('DEL',KEYS[1]) else return 0 end" 1 my_lock "client-a-uuid-123"
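From Python, the same script can be registered once with redis-py and then invoked atomically. A minimal sketch; the token passed in args must be the one used at acquisition:

import redis

r = redis.Redis()

RELEASE_LUA = """
if redis.call("GET", KEYS[1]) == ARGV[1] then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""
release = r.register_script(RELEASE_LUA)

# Returns 1 if we deleted our own lock, 0 if it was no longer ours.
released = release(keys=["my_lock"], args=["client-a-uuid-123"])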
Problems with Single-Instance Redis Locks
- Single point of failure (SPOF): If the Redis instance goes down, all locks are lost. If it comes back with a stale RDB snapshot, it might have "forgotten" about active locks, allowing two clients to believe they hold the same lock.
- Replication lag: If using Redis replication, the master might accept a SET NX and crash before replicating to the replica. The replica is promoted, and another client acquires the "same" lock. Now two clients hold the lock.
- Clock drift: TTL-based expiration depends on the server clock. Significant clock skew between Redis and clients can cause premature or delayed expiration.
- Long GC pauses: A client acquires the lock, then its JVM garbage collector pauses for 30 seconds. The TTL expires, another client acquires the lock, and when the original client resumes, it still thinks it holds the lock.
Redlock Algorithm
Redlock, proposed by Salvatore Sanfilippo (Antirez), addresses the single-instance weakness by using N independent Redis instances (typically 5). The algorithm:
Step-by-Step
1. Record the start time (T1) in milliseconds.
2. Sequentially try to acquire the lock on all N Redis instances using the same key, same random value, and the same TTL. Use a small per-instance timeout (e.g., 5–10ms) so that a down instance doesn't slow you down.
3. Calculate elapsed time: elapsed = now() - T1.
4. The lock is acquired if and only if:
   - The client acquired the lock on a majority of instances (at least ⌊N/2⌋ + 1, so 3 out of 5).
   - The total elapsed time is less than the lock TTL. The effective lock validity is TTL - elapsed.
5. If the lock was NOT acquired (fewer than majority or timeout exceeded), release the lock on ALL instances (even the ones where it was acquired) to avoid deadlocks.
# Pseudocode for Redlock acquisition
import time, uuid, redis

REDIS_HOSTS = ["redis1", "redis2", "redis3", "redis4", "redis5"]  # illustrative hosts

LOCK_KEY = "my_resource"
LOCK_TTL = 30000   # 30 seconds in ms
QUORUM = 3         # majority of 5
RETRY_DELAY = 200  # ms between retries

instances = [redis.Redis(host=h) for h in REDIS_HOSTS]  # 5 independent instances

def release_lock(r, value):
    lua = """
    if redis.call("GET", KEYS[1]) == ARGV[1] then
        return redis.call("DEL", KEYS[1])
    else
        return 0
    end
    """
    r.eval(lua, 1, LOCK_KEY, value)

def acquire_lock():
    value = str(uuid.uuid4())
    start = time.monotonic_ns() // 1_000_000  # ms
    acquired = 0
    for r in instances:
        try:
            if r.set(LOCK_KEY, value, nx=True, px=LOCK_TTL):
                acquired += 1
        except redis.ConnectionError:
            pass  # instance down, skip it
    elapsed = (time.monotonic_ns() // 1_000_000) - start
    validity = LOCK_TTL - elapsed
    if acquired >= QUORUM and validity > 0:
        return value, validity  # lock acquired!
    # Failed: release on ALL instances (even those where SET succeeded)
    for r in instances:
        try:
            release_lock(r, value)
        except redis.ConnectionError:
            pass
    return None, 0
[Interactive demo: Redlock Algorithm. Step through lock acquisition across 5 independent Redis nodes; majority (3/5) required.]
The Kleppmann vs Antirez Debate
In 2016, Martin Kleppmann published a widely read analysis titled "How to do distributed locking" that criticized Redlock. Salvatore Sanfilippo (creator of Redis) responded with a rebuttal. This debate remains one of the most important discussions in distributed systems:
| Kleppmann’s Criticisms | Antirez’s Rebuttals |
|---|---|
| GC pauses break safety. A client acquires the lock, then a GC pause of >TTL causes the lock to expire. Another client acquires it. The first client resumes and operates on shared data, believing it still holds the lock. Two clients now have "the lock." | The client checks the remaining lock validity after acquiring it. If the GC pause happened during acquisition, the validity check catches it. However, Kleppmann argues the pause can happen after the check. |
| Clock jumps break safety. Redlock relies on wall-clock time for TTL calculations. If an NTP correction jumps a Redis server’s clock forward, the lock expires prematurely. | Use monotonic clocks where possible. Redis uses gettimeofday(), which can be affected by NTP, but in practice large jumps are rare with well-configured NTP (smearing, not stepping). |
| No fencing tokens. Redlock does not produce monotonically increasing tokens. Without fencing, even correctly implemented locks cannot prevent split-brain writes. | If you need fencing tokens, you need a consensus system anyway. Redlock is designed for a different (valid) use case where strong fencing isn’t required. |
| Use a proper consensus system. If you need a lock for correctness, use ZooKeeper, etcd, or Chubby—systems designed with formal correctness proofs and linearizable semantics. | Redlock provides a practical middle ground. Not everyone needs the operational complexity of ZooKeeper. For many use cases, Redlock’s guarantees are sufficient. |
ZooKeeper Locks
Apache ZooKeeper provides a more reliable locking primitive built on top of its consensus protocol (ZAB — ZooKeeper Atomic Broadcast). Unlike Redis, ZooKeeper was designed for coordination tasks.
How It Works: Ephemeral Sequential Znodes
1. Create an ephemeral sequential znode under a lock path:

   # Client creates an ephemeral sequential node
   create -e -s /locks/my_resource/lock-
   # ZooKeeper returns: /locks/my_resource/lock-0000000001

   -e = ephemeral (auto-deleted when the session ends), -s = sequential (ZooKeeper appends a monotonically increasing counter).
2. List all children of the lock path:

   ls /locks/my_resource
   # Returns: [lock-0000000001, lock-0000000003, lock-0000000005]

3. Check if you have the lowest sequence number:
   - Yes: You hold the lock. Proceed with your critical section.
   - No: Set a watch on the znode with the next-lower sequence number (not the lowest — this avoids the "herd effect" where all waiters wake up simultaneously).
4. When the watched znode is deleted (previous holder releases or crashes), you are notified. Re-check if you now have the lowest number.
5. Release: Simply delete your znode, or let your session expire (ZooKeeper automatically deletes ephemeral nodes).
# Pseudocode: ZooKeeper distributed lock (using Kazoo Python client)
from kazoo.client import KazooClient
from kazoo.recipe.lock import Lock
zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
zk.start()
lock = Lock(zk, "/locks/my_resource")
# Blocking acquire
lock.acquire(timeout=30)
try:
# Critical section: only one client executes this at a time
process_shared_resource()
finally:
lock.release()
# Or use as context manager
with Lock(zk, "/locks/my_resource"):
process_shared_resource()
Advantages Over Redis
- Consensus-backed: ZAB protocol ensures all nodes agree on the state of the lock. No split-brain from replication lag.
- Automatic cleanup: Ephemeral nodes are deleted when the client session dies. No TTL guessing game.
- Monotonic ordering: Sequential znodes provide a natural ordering that can be used as fencing tokens (see the sketch after this list).
- No clock dependency: Lock correctness does not depend on synchronized clocks.
- Watch mechanism: Clients are notified when the lock becomes available, rather than polling.
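A minimal sketch of deriving a fencing token directly from the sequential znode name, using the same Kazoo client and illustrative lock path as the example above:

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Create our place in the queue: an ephemeral, sequential znode.
path = zk.create("/locks/my_resource/lock-",
                 ephemeral=True, sequence=True, makepath=True)
# e.g. "/locks/my_resource/lock-0000000007"

# The zero-padded suffix increases monotonically across acquisitions,
# so it can be handed to the resource server as a fencing token.
fencing_token = int(path.rsplit("-", 1)[1])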
Drawbacks
- Operational complexity: Running a ZooKeeper ensemble (3 or 5 nodes) requires significant operational expertise.
- Higher latency: Consensus requires quorum writes, which are slower than a single Redis SET.
- Session management: If a client’s session expires due to a temporary network partition, the lock is released even though the client is still running (and might still be accessing the shared resource). A session listener can at least detect this (see the sketch below).
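One partial mitigation is to listen for session state changes and stop touching the shared resource the moment the session is suspended or lost. A minimal Kazoo sketch; stop_work is an assumed application hook:

from kazoo.client import KazooClient, KazooState

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")

def on_state_change(state):
    if state in (KazooState.LOST, KazooState.SUSPENDED):
        # Our ephemeral lock node may already be gone; stop the critical
        # section rather than keep writing with a lock we may not hold.
        stop_work()  # hypothetical application hook

zk.add_listener(on_state_change)
zk.start()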
etcd Locks
etcd uses the Raft consensus algorithm and provides distributed locking through leases and transactions. It is the coordination backbone of Kubernetes.
Lease-Based Locks
# Grant a lease (TTL = 30 seconds)
etcdctl lease grant 30
# Output: lease 694d81417acd0b17 granted with TTL(30s)
# Put a key with the lease attached
etcdctl put --lease=694d81417acd0b17 /locks/my_resource "client-a"
# The key auto-deletes when the lease expires
# Keep the lease alive with periodic keep-alives:
etcdctl lease keep-alive 694d81417acd0b17
Transactions for Atomic Lock Acquisition
# Atomic compare-and-set: only create the lock if it doesn't exist
etcdctl txn --interactive
# compares:
create("/locks/my_resource") = "0"
# success requests:
put /locks/my_resource "client-a" --lease=694d81417acd0b17
# failure requests:
get /locks/my_resource
The transaction checks that the key’s create_revision is 0 (key doesn’t exist). If true, it creates the key with the lease. If false, it returns the current holder’s value. This is atomic and linearizable.
Using etcd’s Built-in Lock Command
# etcd provides a high-level lock command
etcdctl lock /locks/my_resource
# Blocks until the lock is acquired
# Holds the lock until the process exits or is interrupted (Ctrl+C)
# Execute a command while holding the lock
etcdctl lock /locks/my_resource -- echo "I have the lock"
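The same lock is available programmatically from Python via the third-party python-etcd3 package; a minimal sketch under that package's API, where the endpoint and the do_critical_work function are illustrative assumptions:

import etcd3  # third-party package: python-etcd3

client = etcd3.client(host="etcd1", port=2379)  # placeholder endpoint

# The lock is backed by a lease, so it auto-releases if this
# process dies and the 30-second TTL lapses.
with client.lock("my_resource", ttl=30):
    do_critical_work()  # hypothetical critical section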
Programmatic Example (Go)
import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

cli, _ := clientv3.New(clientv3.Config{
	Endpoints: []string{"etcd1:2379", "etcd2:2379", "etcd3:2379"},
})
defer cli.Close()

// Create a session (lease-based; TTL in seconds)
session, _ := concurrency.NewSession(cli, concurrency.WithTTL(30))
defer session.Close()

// Create a mutex keyed on the lock path
mu := concurrency.NewMutex(session, "/locks/my_resource")

// Acquire (blocks until the lock is held or the context is cancelled)
mu.Lock(context.TODO())
defer mu.Unlock(context.TODO())

// Critical section
processSharedResource()

// The lock's revision can serve as a fencing token:
fmt.Println("Fencing token:", mu.Header().Revision)
The mu.Header().Revision is a monotonically increasing number from etcd’s Raft log. This makes etcd locks compatible with fencing token requirements out of the box.
Fencing Tokens
Fencing tokens are the critical concept that separates a distributed lock that is merely "good enough" from one that is provably correct. This idea, popularized by Martin Kleppmann, addresses a fundamental problem in asynchronous distributed systems.
The Problem: Zombie Lock Holders
Consider this scenario without fencing:
- Client A acquires the lock with TTL=30s.
- Client A begins processing, but hits a long GC pause (or network delay).
- The lock’s TTL expires after 30s. Client A is still paused.
- Client B acquires the lock (legitimately).
- Client B writes to the shared resource.
- Client A’s GC pause ends. It still believes it holds the lock.
- Client A writes to the shared resource, overwriting Client B’s data.
The lock’s mutual exclusion guarantee was violated. Both clients wrote to the resource "at the same time" from the resource’s perspective.
The Solution: Monotonically Increasing Tokens
Each time a lock is acquired, the lock service issues a fencing token—a monotonically increasing number (e.g., ZooKeeper’s zxid, etcd’s revision). The resource server (database, storage, API) validates this token:
# Lock service issues tokens:
# Client A acquires lock → token = 33
# Client B acquires lock → token = 34
# Resource server tracks the highest token it has seen:
class RejectedError(Exception):
    pass

highest_token_seen = 0

def write(data, token):
    global highest_token_seen
    if token <= highest_token_seen:
        raise RejectedError(f"Stale token {token} <= {highest_token_seen}")
    highest_token_seen = token
    actually_write(data)  # the real storage write
# Client A (token=33) writes first → accepted, highest=33
# Client B (token=34) writes → accepted, highest=34
# Client A (delayed, token=33) tries → REJECTED (33 < 34)
[Interactive demo: Distributed Lock with Fencing. Watch how fencing tokens prevent a zombie lock holder from corrupting data after a GC pause.]
Where Fencing Tokens Come From
| Lock Service | Fencing Token Source |
|---|---|
| ZooKeeper | Sequential znode number or zxid (transaction ID) |
| etcd | revision from the Raft log (returned by mu.Header().Revision) |
| Chubby (Google) | Lock sequencer (built-in) |
| Redis / Redlock | ⚠ Not natively supported. Requires external sequencing. |
Enforcing the check is straightforward when the resource is a database (e.g., a conditional UPDATE ... WHERE token > last_seen_token) but harder for arbitrary APIs or third-party services.
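As a sketch of the database case, the token check can ride along atomically in the write itself. This uses Python's built-in sqlite3 so it runs standalone; the resources table and last_token column are illustrative assumptions:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resources (id INTEGER PRIMARY KEY, data TEXT, last_token INTEGER)"
)
conn.execute("INSERT INTO resources VALUES (1, '', 0)")

def fenced_write(resource_id, data, token):
    # The WHERE clause rejects stale tokens atomically with the write.
    cur = conn.execute(
        "UPDATE resources SET data = ?, last_token = ? "
        "WHERE id = ? AND last_token < ?",
        (data, token, resource_id, token),
    )
    conn.commit()
    if cur.rowcount == 0:
        raise RuntimeError(f"rejected stale fencing token {token}")

fenced_write(1, "written by B", token=34)  # accepted
fenced_write(1, "written by A", token=33)  # raises: 33 <= 34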
Database-Based Locks
If your application already uses a relational database, you can leverage it as a lock service. This avoids adding a new infrastructure dependency, though it comes with trade-offs.
SELECT FOR UPDATE (Row-Level Locking)
-- Create a locks table
CREATE TABLE distributed_locks (
lock_name VARCHAR(255) PRIMARY KEY,
locked_by VARCHAR(255),
locked_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
expires_at TIMESTAMP
);
-- Attempt to acquire (within a transaction)
BEGIN;
SELECT * FROM distributed_locks
WHERE lock_name = 'my_resource'
FOR UPDATE SKIP LOCKED;
-- If the row is returned, check expiration and update
UPDATE distributed_locks
SET locked_by = 'client-a',
locked_at = NOW(),
expires_at = NOW() + INTERVAL '30 seconds'
WHERE lock_name = 'my_resource'
AND (expires_at IS NULL OR expires_at < NOW());
COMMIT;
-- Release
UPDATE distributed_locks
SET locked_by = NULL, expires_at = NULL
WHERE lock_name = 'my_resource'
AND locked_by = 'client-a';
PostgreSQL Advisory Locks
PostgreSQL provides dedicated advisory lock functions that don’t require a table:
-- Session-level advisory lock (released when session ends)
SELECT pg_advisory_lock(12345);
-- ... critical section ...
SELECT pg_advisory_unlock(12345);
-- Try without blocking (returns true/false)
SELECT pg_try_advisory_lock(12345);
-- Transaction-level advisory lock (released at COMMIT/ROLLBACK)
SELECT pg_advisory_xact_lock(12345);
-- Using two integers for namespacing
SELECT pg_advisory_lock(1, 42); -- namespace=1, resource=42
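From application code these are ordinary SQL calls; a minimal sketch with psycopg2, where the connection DSN and the lock key 12345 are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # placeholder DSN
cur = conn.cursor()

# Non-blocking session-level lock: held until explicitly unlocked
# or until this database session ends.
cur.execute("SELECT pg_try_advisory_lock(%s)", (12345,))
(got_lock,) = cur.fetchone()
if got_lock:
    try:
        pass  # critical section
    finally:
        cur.execute("SELECT pg_advisory_unlock(%s)", (12345,))
conn.close()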
MySQL GET_LOCK
-- Acquire (timeout 30 seconds)
SELECT GET_LOCK('my_resource', 30);
-- Returns 1 if acquired, 0 if timed out, NULL on error
-- Release
SELECT RELEASE_LOCK('my_resource');
-- Check without acquiring
SELECT IS_FREE_LOCK('my_resource');
Pros and Cons
| Pros | Cons |
|---|---|
| No new infrastructure: reuses the database you already operate, with transactional acquire/release. A version or expires_at column can double as a fencing token. | The database becomes a lock bottleneck and a single point of failure; throughput and latency trail dedicated lock services. Without an expires_at column (or advisory-lock sessions), a dead client holds the lock forever. |
Lock-Free Alternatives
Sometimes the best distributed lock is no lock at all. Several patterns avoid the need for explicit locking:
Optimistic Locking (Version Numbers)
Instead of preventing concurrent access, allow it but detect conflicts at write time:
-- Read the current version
SELECT data, version FROM resources WHERE id = 42;
-- Returns: data='old_value', version=7
-- Update only if the version hasn't changed
UPDATE resources
SET data = 'new_value', version = version + 1
WHERE id = 42 AND version = 7;
-- Check affected rows:
-- 1 row → success, your write went through
-- 0 rows → conflict! Someone else updated first. Retry.
This is one of the most widely used concurrency-control mechanisms in web applications, and it avoids the overhead and complexity of distributed locks entirely for most use cases. In practice the caller wraps the read-compare-update cycle in a small retry loop, as sketched below.
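A minimal sketch of that retry loop, using Python's built-in sqlite3 so it runs standalone; the resources table mirrors the SQL above:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, data TEXT, version INTEGER)")
conn.execute("INSERT INTO resources VALUES (42, 'old_value', 7)")

def update_with_retry(resource_id, new_data, max_attempts=5):
    for _ in range(max_attempts):
        row = conn.execute(
            "SELECT version FROM resources WHERE id = ?", (resource_id,)
        ).fetchone()
        if row is None:
            raise KeyError(resource_id)
        (version,) = row
        cur = conn.execute(
            "UPDATE resources SET data = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_data, resource_id, version),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True  # our write won
        # 0 rows: a concurrent writer bumped the version; re-read and retry
    return False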
Compare-and-Swap (CAS)
CAS is the atomic primitive underlying optimistic locking. Many distributed stores support it natively:
# etcd: atomic CAS using transactions
etcdctl txn --interactive
# compares:
value("/config/setting") = "old_value"
# success:
put /config/setting "new_value"
# failure:
get /config/setting
# DynamoDB: conditional write (boto3; the table name is illustrative)
import boto3

table = boto3.resource("dynamodb").Table("resources")
table.put_item(
    Item={"id": "42", "data": "new_value", "version": 8},
    ConditionExpression="version = :v",
    ExpressionAttributeValues={":v": 7},
)
# Raises a ClientError (ConditionalCheckFailedException) if version != 7
CRDTs (Conflict-Free Replicated Data Types)
CRDTs are data structures that can be independently updated on different nodes and merged without conflicts. They eliminate the need for locking by making all operations commutative:
- G-Counter: Each node maintains its own counter. The global count is the sum. Increments on any node are always safe (see the sketch after this list).
- OR-Set (Observed-Remove Set): Elements can be added/removed concurrently. Conflicts resolve deterministically.
- LWW-Register (Last-Writer-Wins): Each write is timestamped. The latest timestamp wins on merge.
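A minimal G-Counter sketch; node IDs are arbitrary strings, and this is the textbook structure rather than any particular library's API:

class GCounter:
    """Grow-only counter: one slot per node, merge = element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> count

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Commutative, associative, idempotent: safe to apply in any order.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two nodes increment independently, then exchange state:
a, b = GCounter("a"), GCounter("b")
a.increment(3); b.increment(2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5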
When to Avoid Locks
| Use Locks When | Avoid Locks When |
|---|---|
| Side effects must happen at most once (emails, charges, cron jobs); the resource cannot detect or reject conflicting writes itself; you need a single writer or leader for an extended period. | Conflicts can be detected at write time (optimistic locking, CAS, conditional writes); operations are commutative or idempotent (CRDTs, counters); throughput matters more than strict mutual exclusion. |
Common Pitfalls
1. Lock Holder Dies Without Releasing
This is the most common pitfall. A client acquires the lock, crashes (segfault, OOM, power failure), and never releases it.
- Mitigation: Always use TTLs (Redis), leases (etcd), or ephemeral nodes (ZooKeeper). Never use a lock without an automatic expiration mechanism.
- Anti-pattern: Using a database row as a lock without an expires_at column. If the process dies, the lock is held forever until manual intervention.
2. Clock Skew Between Nodes
If your lock’s correctness depends on time (TTL, lease duration), clock differences between machines can cause problems:
- A Redis server’s clock jumps forward by 10s → lock expires 10s early.
- A client’s clock is 5s behind → it calculates lock validity incorrectly.
Mitigation: Use monotonic clocks for duration calculations (clock_gettime(CLOCK_MONOTONIC) or time.monotonic()). Keep NTP well-configured with slew mode (gradual adjustment) rather than step mode (sudden jumps).
3. Network Partitions During Lock Hold
A client acquires the lock, then a network partition isolates it from the lock server:
- The lock server considers the client dead and releases the lock.
- Another client on the other side of the partition acquires the lock.
- The original client is still running and still thinks it holds the lock.
Mitigation: Fencing tokens. The resource server will reject the original client’s stale token.
4. GC Pauses Extending Lock Hold Time
This is particularly insidious in JVM-based systems. A stop-the-world GC pause can last seconds or even minutes in pathological cases:
// Timeline:
t=0s Client A acquires lock (TTL=30s, token=33)
t=1s Client A starts processing
t=2s Client A hits a full GC pause ← FROZEN
t=32s Lock expires (30s TTL)
t=33s Client B acquires lock (token=34)
t=34s Client B writes to resource
t=45s Client A's GC ends ← RESUMES, thinks it has the lock!
t=46s Client A writes to resource with token=33
→ MUST BE REJECTED by resource server
Mitigation: Fencing tokens (again). Also: tune GC to minimize pause times (G1GC, ZGC, Shenandoah), use shorter TTLs with frequent renewals, and always check lock validity before performing the critical operation.
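A minimal sketch of that validity check, measured on Python's monotonic clock so wall-clock jumps cannot skew it; the validity value would come from an acquisition API like the Redlock pseudocode above:

import time

def run_critical_section(lock_validity_ms, do_work):
    # Deadline measured on the monotonic clock, not wall time.
    deadline = time.monotonic() + lock_validity_ms / 1000.0

    # Leave a safety margin: don't start work we can't finish in time.
    margin = 1.0  # seconds; tune to the operation's worst case
    if time.monotonic() + margin >= deadline:
        raise TimeoutError("not enough lock validity left to start")
    do_work()
    if time.monotonic() >= deadline:
        # The lock may have expired mid-operation; treat the result as suspect.
        raise TimeoutError("lock validity expired during the operation")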
5. Forgetting to Release the Lock
# BAD: Exception between acquire and release = lock held until TTL
lock.acquire()
process() # ← if this throws, release() is never called
lock.release()
# GOOD: Use try/finally or context managers
lock.acquire()
try:
process()
finally:
lock.release()
# BEST: Context manager (Python) or defer (Go)
with lock:
process()
6. Retry Storms
If all clients retry immediately after failing to acquire the lock, they create a thundering herd effect that overwhelms the lock server.
Mitigation: Use exponential backoff with jitter:
import random, time
def acquire_with_backoff(lock, max_retries=10):
for attempt in range(max_retries):
if lock.acquire(blocking=False):
return True
# Exponential backoff with jitter
wait = min(2 ** attempt * 0.1, 5.0) # cap at 5s
wait *= (0.5 + random.random()) # add jitter
time.sleep(wait)
return False
Summary: Choosing a Distributed Lock
| Technology | Best For | Fencing? | Complexity |
|---|---|---|---|
| Redis (single) | Efficiency, deduplication | ✘ | Low |
| Redlock | Better efficiency, no SPOF | ✘ | Medium |
| ZooKeeper | Correctness-critical locks | ✔ | High |
| etcd | Kubernetes-native, correctness | ✔ | Medium–High |
| Database | Simple apps, existing DB | ✔ (via version) | Low |
| No lock (CAS/CRDT) | High throughput, idempotent ops | N/A | Varies |