Bloom Filters & Probabilistic Data Structures
The Membership Problem
One of the most fundamental operations in computing is the membership query: "Is element x in set S?" You encounter it everywhere — checking whether a username is taken, whether a URL has already been crawled, whether a cache key exists, or whether a request has been seen before.
The naive solution is a hash set. Insert every element, then check membership in O(1). That works beautifully when the set fits in memory. But what happens when the set has billions of elements?
| Data Structure | Space (1 billion strings, avg 50 bytes) | False Positives? |
|---|---|---|
| HashSet | ~50 GB + overhead | No |
| Sorted Array + Binary Search | ~50 GB | No |
| Bloom Filter (1% FPR) | ~1.2 GB | Yes (~1%) |
| Bloom Filter (0.1% FPR) | ~1.8 GB | Yes (~0.1%) |
That's a 40× reduction in memory by accepting a tiny probability of false positives. This is the fundamental trade-off of probabilistic data structures: sacrifice perfect accuracy for dramatic improvements in space (and often time).
The family of probabilistic data structures includes:
- Bloom Filter — approximate set membership ("is x in S?")
- Counting Bloom Filter — membership with deletion support
- Cuckoo Filter — improved Bloom filter alternative
- HyperLogLog — cardinality estimation ("how many distinct elements?")
- Count-Min Sketch — frequency estimation ("how many times has x appeared?")
Bloom Filter
Invented by Burton Howard Bloom in 1970, the Bloom filter is the most widely used probabilistic data structure. It answers membership queries with two possible responses:
- "Definitely NOT in the set" — guaranteed correct, zero false negatives
- "Probably in the set" — might be a false positive
Structure
A Bloom filter consists of two components:
- A bit array of m bits, all initially set to 0
- k independent hash functions, each mapping an element to one of the m positions uniformly at random
Insert Operation
To insert element x:
- Compute h₁(x), h₂(x), …, hₖ(x) — k hash values
- Set bits at positions h₁(x) mod m, h₂(x) mod m, …, hₖ(x) mod m to 1
Lookup Operation
To check if element x is in the set:
- Compute h₁(x), h₂(x), …, hₖ(x)
- Check if ALL bits at those positions are 1
- If any bit is 0 → element is definitely NOT in the set
- If all bits are 1 → element is probably in the set (could be a false positive)
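The insert and lookup procedures above fit in a few lines of Python. This is a toy sketch (a tiny 64-bit array, with the k hashes derived by salting SHA-256 with the hash index), not a tuned implementation:

```python
import hashlib

M, K = 64, 3        # toy parameters: 64-bit array, 3 hash functions
bits = [0] * M

def positions(x: str) -> list:
    # i-th hash = SHA-256 of the item salted with the hash index i
    return [int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % M
            for i in range(K)]

def insert(x: str) -> None:
    for pos in positions(x):
        bits[pos] = 1

def lookup(x: str) -> bool:
    # Any unset bit -> definitely absent; all set -> probably present
    return all(bits[pos] for pos in positions(x))

insert("alice")
print(lookup("alice"))   # True — inserted elements are always found
```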
Why False Positives Happen
When you insert multiple elements, their hash functions set various bits to 1. Over time, more and more bits become 1. When you check a new element that was never inserted, all k of its hash positions might happen to be set to 1 by other elements — a false positive.
Why False Negatives Are Impossible
Once a bit is set to 1, it is never set back to 0. So if element x was inserted, all k of its bit positions are guaranteed to be 1. The lookup will always succeed for elements that were actually inserted.
▶ Bloom Filter: Insert & Lookup — demo of elements being inserted and queried in a 20-bit Bloom filter with k = 3 hash functions.
Space Efficiency
A well-tuned Bloom filter uses about 10 bits per element for a 1% false positive rate, regardless of the element size. Compare:
- Storing 1 million email addresses (avg 25 bytes each) in a hash set: ~25 MB + pointers
- Bloom filter for 1 million elements at 1% FPR: ~1.2 MB (10 million bits)
- That's roughly a 20× memory savings
Math Behind Bloom Filters
Understanding the math lets you size Bloom filters correctly for your requirements.
Parameters
- n = number of elements to insert
- m = size of the bit array (number of bits)
- k = number of hash functions
- p = desired false positive rate
False Positive Probability
After inserting n elements into a Bloom filter of size m with k hash functions, the probability that a specific bit is still 0 is:
P(bit is 0) = (1 - 1/m)^(kn)
For large m, this approximates to:
P(bit is 0) ≈ e^(-kn/m)
The false positive probability — the chance that all k bit positions for a non-member are 1 — is:
p = (1 - e^(-kn/m))^k
Optimal Number of Hash Functions
The value of k that minimizes the false positive rate for given m and n is:
k_optimal = (m / n) × ln(2) ≈ 0.693 × (m / n)
At this optimal k, about half the bits in the array end up set to 1 — the optimal fill ratio.
Sizing Formula
Given n elements and a desired false positive rate p, the required number of bits is:
m = -(n × ln(p)) / (ln(2))²
m ≈ -1.44 × n × log₂(p)
And the optimal number of hash functions:
k = (m / n) × ln(2)
k = -log₂(p) (when m is optimally sized)
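These two formulas translate directly into a small helper (the function name is illustrative, not from any library):

```python
import math

def bloom_size(n: int, p: float) -> tuple:
    """Optimal bit-array size m and hash count k for n items at FPR p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

m, k = bloom_size(1_000_000, 0.01)
print(m, k)  # ~9.59 million bits (about 1.2 MB) and 7 hash functions
```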
Worked Examples
Example 1: URL De-duplication
A web crawler wants to track 100 million URLs with a 1% false positive rate.
n = 100,000,000 (100 million URLs)
p = 0.01 (1% false positive rate)
m = -(100,000,000 × ln(0.01)) / (ln(2))²
m = -(100,000,000 × (-4.6052)) / (0.4805)
m = 460,520,000 / 0.4805
m ≈ 958,505,837 bits
m ≈ 958.5 million bits ≈ 114.5 MB
k = (m / n) × ln(2) = 9.585 × 0.693 ≈ 6.64 → k = 7
Bits per element: m / n ≈ 9.58 bits
So about 115 MB and 7 hash functions — compared to storing the actual URLs which might use 5–10 GB.
Example 2: Cache Membership Check
A CDN wants to check if a resource is cached across 10 million objects with 0.1% FPR.
n = 10,000,000 (10 million objects)
p = 0.001 (0.1% false positive rate)
m = -(10,000,000 × ln(0.001)) / (ln(2))²
m = -(10,000,000 × (-6.9078)) / (0.4805)
m ≈ 143,775,882 bits ≈ 17.1 MB
k = -log₂(0.001) ≈ 9.97 → k = 10
Bits per element: m / n ≈ 14.38 bits
Only 17 MB and 10 hash functions for 10 million objects at 0.1% false positive rate.
Quick Reference Table
| FPR (p) | Bits per element (m/n) | Optimal k |
|---|---|---|
| 50% | 1.44 | 1 |
| 10% | 4.79 | 3 |
| 1% | 9.58 | 7 |
| 0.1% | 14.38 | 10 |
| 0.01% | 19.17 | 13 |
| 0.001% | 23.96 | 17 |
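The rows of this table come straight from the sizing formulas, and a short script (just a sanity check, not part of any library) reproduces them:

```python
import math

# Regenerate the bits-per-element and optimal-k columns for each FPR
for p in (0.5, 0.1, 0.01, 0.001, 0.0001, 0.00001):
    bits_per_elem = -math.log(p) / math.log(2) ** 2   # m / n
    k = max(1, round(bits_per_elem * math.log(2)))    # optimal k, rounded
    print(f"FPR {p:<8} m/n = {bits_per_elem:6.3f}  k = {k}")
```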
Counting Bloom Filter
The standard Bloom filter's biggest limitation is that you cannot delete elements. If you set a bit back to 0, you might break the membership test for other elements that also hash to that position.
The Counting Bloom Filter (Fan et al., 2000) replaces each bit with a small counter (typically 3–4 bits):
Operations
- Insert: Increment counters at positions h₁(x), h₂(x), …, hₖ(x)
- Delete: Decrement counters at positions h₁(x), h₂(x), …, hₖ(x)
- Lookup: Check if all counters at those positions are > 0
Trade-offs
| Aspect | Standard Bloom | Counting Bloom |
|---|---|---|
| Storage per position | 1 bit | 3–4 bits (counter) |
| Total memory | m bits | 3–4× m bits |
| Deletion | Not supported | Supported |
| Counter overflow risk | N/A | Possible (use 4 bits → max 15) |
| False negatives | Impossible | Possible (if you delete an element that was never inserted) |
# Counting Bloom Filter — runnable sketch
# (hash_i is derived by salting SHA-256 with the hash index i)
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k):
        self.counters = [0] * m  # array of m small counters (not bits)
        self.m = m
        self.k = k

    def _positions(self, x):
        return [int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % self.m
                for i in range(self.k)]

    def insert(self, x):
        for pos in self._positions(x):
            self.counters[pos] += 1  # increment instead of set-to-1

    def delete(self, x):
        # Only delete elements that were actually inserted,
        # otherwise false negatives become possible
        for pos in self._positions(x):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def lookup(self, x):
        return all(self.counters[pos] > 0 for pos in self._positions(x))
Cuckoo Filter
The Cuckoo Filter (Fan et al., 2014) is a modern alternative to Bloom filters that supports deletion and offers better space efficiency for false positive rates below ~3%.
How It Works
A cuckoo filter uses a hash table with cuckoo hashing, but instead of storing the full element, it stores a compact fingerprint (e.g., 8–12 bits):
- Compute fingerprint f = fingerprint(x)
- Compute two candidate bucket positions:
  i₁ = hash(x)
  i₂ = i₁ ⊕ hash(f)   // XOR with the hash of the fingerprint
- If either bucket has an empty slot, store f there
- If both are full, evict an existing fingerprint (cuckoo displacement) and relocate it to its alternate bucket
Lookup
Check if fingerprint f exists in either bucket i₁ or i₂.
Deletion
Find fingerprint f in bucket i₁ or i₂, remove it. This works because the XOR-based alternate bucket calculation is reversible: given any bucket i and fingerprint f, you can always compute the alternate bucket i ⊕ hash(f).
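The mechanics above can be sketched as follows. This is a toy version (power-of-two bucket count so the XOR stays in range, 4 slots per bucket, SHA-256-derived hashes); the reference implementation from Fan et al. differs in details such as fingerprint packing:

```python
import hashlib
import random

def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

class TinyCuckooFilter:
    def __init__(self, num_buckets: int = 128, slots: int = 4, max_kicks: int = 500):
        assert num_buckets & (num_buckets - 1) == 0, "need a power of two"
        self.n = num_buckets
        self.slots = slots
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, x: str) -> int:
        return _h(b"fp:" + x.encode()) % 255 + 1   # 8-bit fingerprint, never 0

    def _alt(self, i: int, f: int) -> int:
        # Partial-key cuckoo hashing: alternate bucket via XOR with hash(f)
        return i ^ (_h(bytes([f])) % self.n)

    def _indices(self, x: str):
        f = self._fingerprint(x)
        i1 = _h(x.encode()) % self.n
        return f, i1, self._alt(i1, f)

    def insert(self, x: str) -> bool:
        f, i1, i2 = self._indices(x)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.slots:
                self.buckets[i].append(f)
                return True
        # Both buckets full: evict a random resident and relocate it
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(self.slots)
            f, self.buckets[i][j] = self.buckets[i][j], f
            i = self._alt(i, f)               # evicted fingerprint's other bucket
            if len(self.buckets[i]) < self.slots:
                self.buckets[i].append(f)
                return True
        return False  # table effectively full

    def contains(self, x: str) -> bool:
        f, i1, i2 = self._indices(x)
        return f in self.buckets[i1] or f in self.buckets[i2]

    def delete(self, x: str) -> bool:
        f, i1, i2 = self._indices(x)
        for i in (i1, i2):
            if f in self.buckets[i]:
                self.buckets[i].remove(f)
                return True
        return False
```

Because `_alt` is an XOR involution, a relocated fingerprint can always be found from either of its two buckets, which is what makes both lookup and deletion work.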
Cuckoo vs Bloom Comparison
| Feature | Bloom Filter | Cuckoo Filter |
|---|---|---|
| Deletion | Not supported | Supported |
| Space (FPR < 3%) | Higher | Lower |
| Space (FPR > 3%) | Lower | Higher |
| Lookup speed | k memory accesses | 2 memory accesses (constant) |
| Insert speed | k memory accesses | Amortized O(1), worst-case may require relocations |
| Implementation complexity | Simple | More complex |
HyperLogLog
While Bloom filters answer "is x in the set?", HyperLogLog (HLL) answers a different question: "How many distinct elements have I seen?" — the cardinality estimation problem.
The Problem
Counting exact distinct elements requires storing every element you've seen (a hash set). For a stream of billions of events, that's impractical. HyperLogLog estimates the count using only ~12 KB of memory, with a standard error of about 0.81%.
Core Intuition
Consider hashing each element to a uniformly random binary string. The maximum number of leading zeros seen across all hashed values carries information about the cardinality:
- If the maximum is 1 leading zero, you've probably seen ~2 distinct values
- If the maximum is 2 leading zeros, you've probably seen ~4 distinct values
- If the maximum is k leading zeros, you've probably seen ~2^k distinct values
This is because the probability of seeing exactly k leading zeros in a random hash is 1/2^(k+1), so you need roughly 2^(k+1) distinct elements before you're likely to see one.
The HyperLogLog Algorithm
- Partition into buckets: Use the first b bits of the hash to select one of m = 2^b buckets (registers)
- Count leading zeros: In the remaining hash bits, count the number of leading zeros + 1. Store the maximum value seen for each bucket
- Estimate: The cardinality estimate is:
  E = α_m × m² × (Σⱼ 2^(−M[j]))⁻¹
  where M[j] is the max leading-zero count in register j, and α_m is a bias correction constant (≈0.7213 for large m)
Sizing
Standard HLL uses m = 2^14 = 16,384 registers
Each register stores a value 0-63 → 6 bits each
Total memory: 16,384 × 6 bits = 12 KB
Standard error: 1.04 / √m = 1.04 / √16,384 = 0.0081 = 0.81%
This 12 KB structure can estimate cardinalities up to billions of elements with ~0.81% accuracy!
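The algorithm fits in a short Python sketch. This version uses only the raw estimator (no small- or large-range corrections) and assumes SHA-256 as the hash; all names here are illustrative:

```python
import hashlib

B = 14                               # index bits
M = 1 << B                           # 16,384 registers
ALPHA = 0.7213 / (1 + 1.079 / M)     # bias correction for large M
registers = [0] * M

def add(item: str) -> None:
    h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
    j = h >> (64 - B)                          # first B bits pick a register
    rest = h & ((1 << (64 - B)) - 1)           # remaining 50 bits
    rank = (64 - B) - rest.bit_length() + 1    # leading zeros + 1
    registers[j] = max(registers[j], rank)

def estimate() -> float:
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

for i in range(100_000):
    add(f"user_{i}")
print(round(estimate()))   # close to the true count of 100,000
```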
Redis HyperLogLog
Redis provides native HyperLogLog support with three simple commands:
# Add elements
PFADD visitors "user123" "user456" "user789"
# Get estimated cardinality
PFCOUNT visitors
→ (integer) 3
# Merge multiple HLLs (e.g., combine daily counts)
PFMERGE weekly_visitors monday_visitors tuesday_visitors ...
# Each HLL key uses at most 12 KB of memory
# Can count 2^64 distinct elements
▶ HyperLogLog: Cardinality Estimation — demo of how leading zeros in hash values estimate cardinality (4 buckets for simplicity).
Count-Min Sketch
The Count-Min Sketch (Cormode & Muthukrishnan, 2005) estimates the frequency of elements in a data stream: "How many times has element x appeared?"
Structure
A 2D array of counters with d rows and w columns, plus d independent hash functions (one per row):
col 0 col 1 col 2 col 3 ... col w-1
row 0 [ 0 | 0 | 0 | 0 | ... | 0 ] ← h₁
row 1 [ 0 | 0 | 0 | 0 | ... | 0 ] ← h₂
row 2 [ 0 | 0 | 0 | 0 | ... | 0 ] ← h₃
...
row d-1[ 0 | 0 | 0 | 0 | ... | 0 ] ← h_d
Update (Increment)
To record an occurrence of element x:
for each row i from 0 to d-1:
col = h_i(x) % w
counters[i][col] += 1
Point Query (Estimate Frequency)
To estimate the count of element x:
estimate = min(counters[i][h_i(x) % w] for i in 0..d-1)
Take the minimum across all d rows. This is because collisions only cause overestimates (other elements hash to the same column, inflating the count). Taking the minimum gives the tightest estimate.
Error Bounds
For a sketch with w columns and d rows processing a stream of N total events:
The estimated count f̂(x) satisfies:
f(x) ≤ f̂(x) ≤ f(x) + ε × N
with probability at least 1 - δ, where:
w = ⌈e / ε⌉ (e = 2.718...)
d = ⌈ln(1 / δ)⌉
Example: ε = 0.001, δ = 0.01
w = ⌈2.718 / 0.001⌉ = 2719 columns
d = ⌈ln(100)⌉ = 5 rows
Total counters: 2719 × 5 = 13,595
Memory: ~54 KB (4-byte counters)
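Deriving w and d from a target ε and δ is mechanical (the helper name is illustrative):

```python
import math

def cms_dimensions(epsilon: float, delta: float) -> tuple:
    """Width w and depth d for additive error ≤ ε·N with probability ≥ 1 − δ."""
    w = math.ceil(math.e / epsilon)
    d = math.ceil(math.log(1 / delta))
    return w, d

print(cms_dimensions(0.001, 0.01))  # (2719, 5), matching the example above
```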
Use Cases
- Network traffic analysis: Find heavy-hitter IP addresses without storing per-IP counters
- Search engine query frequency: Estimate how often each query term appears
- Database query optimization: Approximate frequency histograms for query planning
- Rate limiting: Track approximate request counts per API key
# Count-Min Sketch — runnable sketch (row hashes salted with the row index)
import hashlib

class CountMinSketch:
    def __init__(self, w, d):
        self.w = w
        self.d = d
        self.table = [[0] * w for _ in range(d)]

    def _col(self, i, x):
        # Row-specific hash: SHA-256 salted with the row index i
        return int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % self.w

    def update(self, x, count=1):
        for i in range(self.d):
            self.table[i][self._col(i, x)] += count

    def estimate(self, x):
        # Minimum across rows — collisions only inflate counters
        return min(self.table[i][self._col(i, x)] for i in range(self.d))
Real-World Usage
Google Bigtable — SSTable Bloom Filters
Bigtable stores data in SSTables (Sorted String Tables) on GFS. When a read request arrives, the system needs to determine which SSTables might contain the requested row key. Without Bloom filters, this requires reading each SSTable's index from disk.
Bigtable attaches a Bloom filter to each SSTable. Before performing a disk seek, it checks the Bloom filter in memory:
- If the Bloom filter says "not present" → skip this SSTable entirely (guaranteed correct)
- If the Bloom filter says "possibly present" → read the SSTable index and do the actual lookup
This dramatically reduces the number of disk reads, especially for rows that don't exist.
Apache Cassandra — Per-SSTable Bloom Filters
Cassandra uses a similar approach. Each SSTable has an associated Bloom filter that is loaded into memory. Configuration options include:
# cassandra.yaml
bloom_filter_fp_chance: 0.01 # 1% FPR (default)
# Lower FPR → more memory, fewer disk reads
# Higher FPR → less memory, more disk reads
# Per-table override:
CREATE TABLE users (
id UUID PRIMARY KEY,
name text
) WITH bloom_filter_fp_chance = 0.001; -- 0.1% FPR for hot table
CDN Cache Membership
Content Delivery Networks use Bloom filters to quickly determine if a piece of content is cached at a particular edge node. Instead of querying the cache backend (which might be slow), a Bloom filter check in memory can immediately tell you "definitely not cached here — route to origin." This avoids unnecessary cache lookups and reduces latency.
Chrome Safe Browsing
Google Chrome uses a local Bloom filter (now upgraded to more advanced structures) to check if a URL might be in the list of known malicious websites. If the local filter says "maybe malicious," Chrome contacts Google's servers for a definitive check. This avoids sending every URL you visit to Google while still protecting against known threats.
Spam Detection
Email systems use Bloom filters to quickly check if a message hash matches known spam signatures. The Bloom filter sits in memory for O(1) lookups, and only potential matches (including false positives) trigger a more expensive database lookup.
URL Shortener — Collision Avoidance
When generating short codes, you need to check if a code already exists. A Bloom filter provides a fast first check:
# Generate a short code for a new URL
def generate_short_code(url):
while True:
code = random_alphanumeric(7)
# Fast check: is this code already used?
if not bloom_filter.might_contain(code):
# Definitely not used — safe to proceed
bloom_filter.add(code)
db.insert(code, url)
return code
# Bloom says "maybe used" — check database to be sure
if not db.exists(code):
# False positive — code is actually free
bloom_filter.add(code)
db.insert(code, url)
return code
# Collision — try again
continue
Medium — Recommendation Dedup
Medium uses Bloom filters to avoid recommending articles a user has already seen. Each user has a Bloom filter tracking read articles. Before showing a recommendation, check the filter — a false positive just means skipping one extra article, which is harmless.
Implementation
Here is a complete, production-quality Bloom filter implementation in Python:
import math
import mmh3 # MurmurHash3 — fast, high-quality non-cryptographic hash
class BloomFilter:
"""
Space-efficient probabilistic set membership test.
Supports insert and lookup (no deletion).
"""
def __init__(self, expected_items: int, fp_rate: float = 0.01):
"""
Args:
expected_items: Expected number of elements (n)
fp_rate: Desired false positive rate (p), default 1%
"""
self.n = expected_items
self.p = fp_rate
# Calculate optimal size: m = -(n × ln(p)) / (ln(2))²
self.m = self._optimal_m(expected_items, fp_rate)
# Calculate optimal hash count: k = (m / n) × ln(2)
self.k = self._optimal_k(self.m, expected_items)
# Bit array (using bytearray for efficiency)
self.bit_array = bytearray(math.ceil(self.m / 8))
self.count = 0 # number of elements inserted
@staticmethod
def _optimal_m(n: int, p: float) -> int:
"""Compute optimal bit array size."""
m = -(n * math.log(p)) / (math.log(2) ** 2)
return int(math.ceil(m))
@staticmethod
def _optimal_k(m: int, n: int) -> int:
"""Compute optimal number of hash functions."""
k = (m / n) * math.log(2)
return max(1, int(round(k)))
    def _get_hash_values(self, item: str) -> list:
        """
        Generate k hash values using double hashing:
            h_i(x) = h1(x) + i × h2(x)
        Kirsch & Mitzenmacher (2006) showed this preserves the
        asymptotic false positive rate of k independent hashes.
        """
        h1 = mmh3.hash(item, seed=0) % self.m
        h2 = mmh3.hash(item, seed=42) % self.m
        if h2 == 0:
            h2 = 1  # avoid all k positions collapsing onto h1
        return [(h1 + i * h2) % self.m for i in range(self.k)]
def _set_bit(self, pos: int):
byte_idx = pos // 8
bit_idx = pos % 8
self.bit_array[byte_idx] |= (1 << bit_idx)
def _get_bit(self, pos: int) -> bool:
byte_idx = pos // 8
bit_idx = pos % 8
return bool(self.bit_array[byte_idx] & (1 << bit_idx))
def add(self, item: str):
"""Insert an item into the Bloom filter."""
for pos in self._get_hash_values(item):
self._set_bit(pos)
self.count += 1
def might_contain(self, item: str) -> bool:
"""
Check if item might be in the set.
Returns:
False → item is DEFINITELY NOT in the set
True → item is PROBABLY in the set (FPR ≈ self.p)
"""
return all(self._get_bit(pos)
for pos in self._get_hash_values(item))
def estimated_fpr(self) -> float:
"""Estimate current false positive rate."""
exponent = -self.k * self.count / self.m
return (1 - math.exp(exponent)) ** self.k
def fill_ratio(self) -> float:
"""Fraction of bits that are set to 1."""
set_bits = sum(bin(byte).count('1')
for byte in self.bit_array)
return set_bits / self.m
def __len__(self):
return self.count
def __repr__(self):
return (f"BloomFilter(n={self.n}, m={self.m}, k={self.k}, "
f"items={self.count}, est_fpr={self.estimated_fpr():.4%})")
# === Usage ===
bf = BloomFilter(expected_items=1_000_000, fp_rate=0.01)
print(bf)
# BloomFilter(n=1000000, m=9585059, k=7, items=0, est_fpr=0.0000%)
# Insert elements
for i in range(1_000_000):
bf.add(f"user_{i}")
print(bf)
# BloomFilter(n=1000000, m=9585059, k=7, items=1000000, est_fpr=1.0039%)
# Lookup — known members
print(bf.might_contain("user_42")) # True (correct)
print(bf.might_contain("user_999999")) # True (correct)
# Lookup — non-members
# Some will be False (correct), some True (false positive)
false_positives = sum(
bf.might_contain(f"stranger_{i}")
for i in range(100_000)
)
print(f"False positives: {false_positives} / 100,000 "
f"= {false_positives/1000:.2f}%")
# Expected: ~1% ≈ 1000 false positives
print(f"Fill ratio: {bf.fill_ratio():.2%}")
# Expected: ~50% (optimal fill at optimal k)
Summary
| Structure | Question Answered | Space | Error Type |
|---|---|---|---|
| Bloom Filter | Is x in S? | ~10 bits/element | False positives only |
| Cuckoo Filter | Is x in S? (+ delete) | ~12 bits/element | False positives only |
| HyperLogLog | How many distinct x? | ~12 KB | ±0.81% error |
| Count-Min Sketch | How often has x appeared? | Configurable | Overestimates only |