High Level Design Series · Building Blocks · Part 7 of 70

Bloom Filters & Probabilistic Data Structures

The Membership Problem

One of the most fundamental operations in computing is the membership query: "Is element x in set S?" You encounter it everywhere — checking whether a username is taken, whether a URL has already been crawled, whether a cache key exists, or whether a request has been seen before.

The naive solution is a hash set. Insert every element, then check membership in O(1). That works beautifully when the set fits in memory. But what happens when the set has billions of elements?

| Data Structure | Space (1 billion strings, avg 50 bytes) | False Positives? |
|---|---|---|
| HashSet | ~50 GB + overhead | No |
| Sorted Array + Binary Search | ~50 GB | No |
| Bloom Filter (1% FPR) | ~1.2 GB | Yes (~1%) |
| Bloom Filter (0.1% FPR) | ~1.8 GB | Yes (~0.1%) |

That's a 40× reduction in memory by accepting a tiny probability of false positives. This is the fundamental trade-off of probabilistic data structures: sacrifice perfect accuracy for dramatic improvements in space (and often time).

Key insight: In many real-world systems, a 1% false positive rate is perfectly acceptable. If your Bloom filter says "this URL might have been crawled," you just re-crawl it — a tiny waste. But if it says "this URL has definitely NOT been crawled," you can trust it completely.

The family of probabilistic data structures covered below includes the Bloom filter, the counting Bloom filter, the cuckoo filter, HyperLogLog, and the Count-Min sketch.

Bloom Filter

Invented by Burton Howard Bloom in 1970, the Bloom filter is the most widely used probabilistic data structure. It answers membership queries with two possible responses:

  1. "Definitely not in the set" (always correct)
  2. "Probably in the set" (correct except for a small false positive probability)

Structure

A Bloom filter consists of two components:

  1. A bit array of m bits, all initially set to 0
  2. k independent hash functions, each mapping an element to one of the m positions uniformly at random

Insert Operation

To insert element x:

  1. Compute the k hash values h₁(x), h₂(x), …, hₖ(x)
  2. Set bits at positions h₁(x) mod m, h₂(x) mod m, …, hₖ(x) mod m to 1

Lookup Operation

To check if element x is in the set:

  1. Compute h₁(x), h₂(x), …, hₖ(x)
  2. Check if ALL bits at those positions are 1
  3. If any bit is 0 → element is definitely NOT in the set
  4. If all bits are 1 → element is probably in the set (could be a false positive)
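The two operations above can be sketched in a few lines of Python. This is a toy illustration, not a library: a 64-bit array, k = 3, and seeded SHA-256 standing in for independent hash functions.

```python
import hashlib

M, K = 64, 3                 # toy sizes: 64-bit array, 3 hash functions
bits = [0] * M

def positions(item: str):
    """K positions, derived by hashing the item with K different seeds."""
    return [int(hashlib.sha256(f"{seed}:{item}".encode()).hexdigest(), 16) % M
            for seed in range(K)]

def insert(item: str):
    for pos in positions(item):
        bits[pos] = 1        # set-only: bits never go back to 0

def lookup(item: str) -> bool:
    # Any 0 bit means definitely absent; all 1 means probably present
    return all(bits[pos] for pos in positions(item))

insert("apple")
insert("banana")
print(lookup("apple"))       # True: inserted items are never missed
```

Note that a lookup for an item never inserted will usually return False, but can return True if its three positions all happen to be set by other items: the false positive.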

Why False Positives Happen

When you insert multiple elements, their hash functions set various bits to 1. Over time, more and more bits become 1. When you check a new element that was never inserted, all k of its hash positions might happen to be set to 1 by other elements — a false positive.

Why False Negatives Are Impossible

Once a bit is set to 1, it is never set back to 0. So if element x was inserted, all k of its bit positions are guaranteed to be 1. The lookup will always succeed for elements that were actually inserted.

No deletion. You cannot remove elements from a standard Bloom filter. Setting a bit back to 0 could affect other elements that also hash to that position. (Counting Bloom filters solve this — see below.)

▶ Bloom Filter: Insert & Lookup

Watch how elements are inserted and queried in a 20-bit Bloom filter with k=3 hash functions.

Space Efficiency

A well-tuned Bloom filter uses about 10 bits per element for a 1% false positive rate, regardless of the element size. Compare that with the 400 bits (50 bytes) per element needed to store the raw strings from the table above.

Math Behind Bloom Filters

Understanding the math lets you size Bloom filters correctly for your requirements.

Parameters

  n = number of elements to be inserted
  m = number of bits in the array
  k = number of hash functions
  p = false positive probability

False Positive Probability

After inserting n elements into a Bloom filter of size m with k hash functions, the probability that a specific bit is still 0 is:

P(bit is 0) = (1 - 1/m)^(kn)

For large m, this approximates to:

P(bit is 0) ≈ e^(-kn/m)

The false positive probability — the chance that all k bit positions for a non-member are 1 — is:

p = (1 - e^(-kn/m))^k

Optimal Number of Hash Functions

The value of k that minimizes the false positive rate for given m and n is:

k_optimal = (m / n) × ln(2) ≈ 0.693 × (m / n)

At this optimal k, the expected fraction of bits still 0 is e^(-kn/m) = e^(-ln 2) = 1/2, so about half the bits in the array are set to 1: the optimal fill ratio.

Sizing Formula

Given n elements and a desired false positive rate p, the required number of bits is:

m = -(n × ln(p)) / (ln(2))²
m ≈ -1.44 × n × log₂(p)

And the optimal number of hash functions:

k = (m / n) × ln(2)
k = -log₂(p)        (when m is optimally sized)
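These two formulas translate directly into a small sizing helper (a sketch; `bloom_sizing` is an illustrative name, not a library function):

```python
import math

def bloom_sizing(n: int, p: float):
    """Return (m bits, k hash functions) for n items at false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

m, k = bloom_sizing(100_000_000, 0.01)   # the worked example below
print(m, k)   # m ≈ 958.5 million bits, k = 7
```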

Worked Examples

Example 1: URL De-duplication

A web crawler wants to track 100 million URLs with a 1% false positive rate.

n = 100,000,000  (100 million URLs)
p = 0.01         (1% false positive rate)

m = -(100,000,000 × ln(0.01)) / (ln(2))²
m = -(100,000,000 × (-4.6052)) / (0.4805)
m = 460,520,000 / 0.4805
m ≈ 958,505,837 bits
m ≈ 958.5 million bits ≈ 114.5 MB

k = (m / n) × ln(2) = 9.585 × 0.693 ≈ 6.64 → k = 7

Bits per element: m / n ≈ 9.58 bits

So about 115 MB and 7 hash functions — compared to storing the actual URLs which might use 5–10 GB.

Example 2: Cache Membership Check

A CDN wants to check if a resource is cached across 10 million objects with 0.1% FPR.

n = 10,000,000   (10 million objects)
p = 0.001        (0.1% false positive rate)

m = -(10,000,000 × ln(0.001)) / (ln(2))²
m = -(10,000,000 × (-6.9078)) / (0.4805)
m ≈ 143,775,882 bits ≈ 17.1 MB

k = -log₂(0.001) ≈ 9.97 → k = 10

Bits per element: m / n ≈ 14.38 bits

Only 17 MB and 10 hash functions for 10 million objects at 0.1% false positive rate.

Quick Reference Table

| FPR (p) | Bits per element (m/n) | Optimal k |
|---|---|---|
| 50% | 1.44 | 1 |
| 10% | 4.79 | 3 |
| 1% | 9.58 | 7 |
| 0.1% | 14.38 | 10 |
| 0.01% | 19.17 | 13 |
| 0.001% | 23.96 | 17 |

Rule of thumb: Each additional 4.8 bits per element reduces the false positive rate by 10×. Going from 1% to 0.1% costs about 4.8 extra bits per element.

Counting Bloom Filter

The standard Bloom filter's biggest limitation is that you cannot delete elements. If you set a bit back to 0, you might break the membership test for other elements that also hash to that position.

The Counting Bloom Filter (Fan et al., 2000) replaces each bit with a small counter (typically 3–4 bits):

Operations

  1. Insert: increment the k counters at positions h₁(x) mod m, …, hₖ(x) mod m
  2. Delete: decrement those same k counters
  3. Lookup: check that all k counters are greater than 0

Trade-offs

| Aspect | Standard Bloom | Counting Bloom |
|---|---|---|
| Storage per position | 1 bit | 3–4 bits (counter) |
| Total memory | m bits | 3–4× m bits |
| Deletion | Not supported | Supported |
| Counter overflow risk | N/A | Possible (use 4 bits → max 15) |
| False negatives | Impossible | Possible (if you delete an element that was never inserted) |

# Counting Bloom Filter — pseudocode; hash(x, seed=i) denotes the i-th hash function
class CountingBloomFilter:
    def __init__(self, m, k):
        self.counters = [0] * m      # array of m counters (not bits)
        self.m = m
        self.k = k

    def insert(self, x):
        for i in range(self.k):
            pos = hash(x, seed=i) % self.m
            self.counters[pos] += 1   # increment instead of set-to-1

    def delete(self, x):
        for i in range(self.k):
            pos = hash(x, seed=i) % self.m
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def lookup(self, x):
        return all(
            self.counters[hash(x, seed=i) % self.m] > 0
            for i in range(self.k)
        )

Practical note: 4-bit counters (max value 15) are sufficient for most use cases. For a filter at its optimal fill ratio, the probability that any counter overflows the 4-bit limit is at most about 1.37 × 10⁻¹⁵ × m, which is negligible in practice.

Cuckoo Filter

The Cuckoo Filter (Fan et al., 2014) is a modern alternative to Bloom filters that supports deletion and offers better space efficiency for false positive rates below ~3%.

How It Works

A cuckoo filter uses a hash table with cuckoo hashing, but instead of storing the full element, it stores a compact fingerprint (e.g., 8–12 bits):

  1. Compute fingerprint f = fingerprint(x)
  2. Compute two candidate bucket positions:
    i₁ = hash(x)
    i₂ = i₁ ⊕ hash(f)    // XOR with hash of fingerprint
  3. If either bucket has an empty slot, store f there
  4. If both are full, evict an existing fingerprint (cuckoo displacement) and relocate it to its alternate bucket

Lookup

Check if fingerprint f exists in either bucket i₁ or i₂.

Deletion

Find fingerprint f in bucket i₁ or i₂, remove it. This works because the XOR-based alternate bucket calculation is reversible: given any bucket i and fingerprint f, you can always compute the alternate bucket i ⊕ hash(f).
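These mechanics can be sketched as follows, assuming 8-bit fingerprints, 256 buckets of 2 slots each, and SHA-256 standing in for the hash functions. This is a toy illustration, not the paper's reference implementation; the bucket count must be a power of two so the XOR stays in range.

```python
import hashlib
import random

NUM_BUCKETS = 256            # power of two, so XOR of indices stays in range
SLOTS_PER_BUCKET = 2
MAX_KICKS = 500
buckets = [[] for _ in range(NUM_BUCKETS)]

def _h(s: str) -> int:
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

def fingerprint(x: str) -> int:
    return _h("fp:" + x) % 255 + 1           # 8-bit fingerprint, never 0

def alt_index(i: int, f: int) -> int:
    # Self-inverse: alt_index(alt_index(i, f), f) == i
    return i ^ (_h(str(f)) % NUM_BUCKETS)

def insert(x: str) -> bool:
    f = fingerprint(x)
    i1 = _h(x) % NUM_BUCKETS
    i2 = alt_index(i1, f)
    for i in (i1, i2):
        if len(buckets[i]) < SLOTS_PER_BUCKET:
            buckets[i].append(f)
            return True
    # Both buckets full: evict a victim and relocate it (cuckoo displacement)
    i = random.choice((i1, i2))
    for _ in range(MAX_KICKS):
        victim = random.randrange(len(buckets[i]))
        f, buckets[i][victim] = buckets[i][victim], f
        i = alt_index(i, f)                  # victim moves to its alternate bucket
        if len(buckets[i]) < SLOTS_PER_BUCKET:
            buckets[i].append(f)
            return True
    return False                             # filter is too full

def lookup(x: str) -> bool:
    f = fingerprint(x)
    i1 = _h(x) % NUM_BUCKETS
    return f in buckets[i1] or f in buckets[alt_index(i1, f)]

def delete(x: str) -> bool:
    f = fingerprint(x)
    i1 = _h(x) % NUM_BUCKETS
    for i in (i1, alt_index(i1, f)):
        if f in buckets[i]:
            buckets[i].remove(f)
            return True
    return False
```

The `alt_index` helper being self-inverse is exactly why deletion is safe: from whichever bucket a fingerprint currently sits in, its other bucket is always computable.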

Cuckoo vs Bloom Comparison

| Feature | Bloom Filter | Cuckoo Filter |
|---|---|---|
| Deletion | Not supported | Supported |
| Space (FPR < 3%) | Higher | Lower |
| Space (FPR > 3%) | Lower | Higher |
| Lookup speed | k memory accesses | 2 memory accesses (constant) |
| Insert speed | k memory accesses | Amortized O(1), worst-case may require relocations |
| Implementation complexity | Simple | More complex |

When to choose Cuckoo over Bloom: If you need deletion support, target a false positive rate below 3%, or need constant-time lookups (exactly 2 memory accesses), use a Cuckoo filter. Otherwise, Bloom filters are simpler and well-understood.

HyperLogLog

While Bloom filters answer "is x in the set?", HyperLogLog (HLL) answers a different question: "How many distinct elements have I seen?" — the cardinality estimation problem.

The Problem

Counting exact distinct elements requires storing every element you've seen (a hash set). For a stream of billions of events, that's impractical. HyperLogLog estimates the count using only ~12 KB of memory, with a standard error of about 0.81%.

Core Intuition

Consider hashing each element to a uniformly random binary string. The maximum number of leading zeros seen across all hashed values carries information about the cardinality: long runs of leading zeros are rare, so observing one suggests that many distinct elements have been hashed.

This is because the probability that a random hash value has exactly k leading zeros is 1/2^(k+1), so you need roughly 2^(k+1) distinct elements before you're likely to see one.

The HyperLogLog Algorithm

  1. Partition into buckets: Use the first b bits of the hash to select one of m = 2^b buckets (registers)
  2. Count leading zeros: In the remaining hash bits, count the number of leading zeros + 1. Store the maximum value seen for each bucket
  3. Estimate: The cardinality estimate is:
    E = α_m × m² × (Σ 2^(-M[j]))⁻¹
    where M[j] is the max leading-zero count in register j, and α_m is a bias correction constant (~0.7213 for large m)
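A compact sketch of these three steps, using 2^10 = 1,024 registers and SHA-256 truncated to 64 bits as the stand-in hash. This is illustrative only; the raw estimator below omits HLL's small- and large-range corrections.

```python
import hashlib

B = 10                               # bucket-index bits
M = 1 << B                           # m = 1024 registers
ALPHA = 0.7213 / (1 + 1.079 / M)     # bias correction constant, valid for large m
registers = [0] * M

def add(item: str):
    h = int(hashlib.sha256(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
    j = h >> (64 - B)                          # first B bits pick a register
    rest = h & ((1 << (64 - B)) - 1)           # remaining 54 bits
    rank = (64 - B) - rest.bit_length() + 1    # leading zeros + 1
    registers[j] = max(registers[j], rank)

def estimate() -> float:
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

for i in range(100_000):
    add(f"user_{i}")
print(round(estimate()))   # close to 100,000 (standard error ≈ 1.04/√1024 ≈ 3.3%)
```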

Sizing

Standard HLL uses m = 2^14 = 16,384 registers
Each register stores a value 0-63 → 6 bits each
Total memory: 16,384 × 6 bits = 12 KB

Standard error: 1.04 / √m = 1.04 / √16,384 = 0.0081 = 0.81%

This 12 KB structure can estimate cardinalities
up to billions with ~0.81% accuracy!

Redis HyperLogLog

Redis provides native HyperLogLog support with three simple commands:

# Add elements
PFADD visitors "user123" "user456" "user789"

# Get estimated cardinality
PFCOUNT visitors
→ (integer) 3

# Merge multiple HLLs (e.g., combine daily counts)
PFMERGE weekly_visitors monday_visitors tuesday_visitors ...

# Each HLL key uses at most 12 KB of memory
# Can count 2^64 distinct elements
Use case: Counting unique visitors to a website. Instead of storing every visitor ID (which could be millions), a single HyperLogLog counter in Redis uses 12 KB and gives you a count accurate to within 0.81%. For daily unique visitors, weekly merges, etc., this is perfect.

▶ HyperLogLog: Cardinality Estimation

Watch how HyperLogLog uses leading zeros in hash values to estimate cardinality (4 buckets for simplicity).

Count-Min Sketch

The Count-Min Sketch (Cormode & Muthukrishnan, 2005) estimates the frequency of elements in a data stream: "How many times has element x appeared?"

Structure

A 2D array of counters with d rows and w columns, plus d independent hash functions (one per row):

         col 0   col 1   col 2   col 3  ...  col w-1
row 0  [  0   |  0   |  0   |  0   | ... |   0   ]   ← h₁
row 1  [  0   |  0   |  0   |  0   | ... |   0   ]   ← h₂
row 2  [  0   |  0   |  0   |  0   | ... |   0   ]   ← h₃
 ...
row d-1[  0   |  0   |  0   |  0   | ... |   0   ]   ← h_d

Update (Increment)

To record an occurrence of element x:

for each row i from 0 to d-1:
    col = h_i(x) % w
    counters[i][col] += 1

Point Query (Estimate Frequency)

To estimate the count of element x:

estimate = min(counters[i][h_i(x) % w] for i in 0..d-1)

Take the minimum across all d rows. This is because collisions only cause overestimates (other elements hash to the same column, inflating the count). Taking the minimum gives the tightest estimate.

Error Bounds

For a sketch with w columns and d rows processing a stream of N total events:

The estimated count f̂(x) satisfies:
  f(x) ≤ f̂(x) ≤ f(x) + ε × N

with probability at least 1 - δ, where:
  w = ⌈e / ε⌉       (e = 2.718...)
  d = ⌈ln(1 / δ)⌉

Example: ε = 0.001, δ = 0.01
  w = ⌈2.718 / 0.001⌉ = 2719 columns
  d = ⌈ln(100)⌉ = 5 rows
  Total counters: 2719 × 5 = 13,595
  Memory: ~54 KB (4-byte counters)
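These bounds turn into a tiny sizing helper (`cms_dimensions` is an illustrative name, not a library function):

```python
import math

def cms_dimensions(eps: float, delta: float):
    """Width w and depth d so that overestimation <= eps*N with probability >= 1 - delta."""
    w = math.ceil(math.e / eps)
    d = math.ceil(math.log(1 / delta))
    return w, d

w, d = cms_dimensions(0.001, 0.01)
print(w, d, w * d * 4)   # 2719 5 54380 (about 54 KB of 4-byte counters)
```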

Use Cases

Count-Min sketches are a natural fit for finding heavy hitters (the most frequent items in a stream), spotting trending queries or topics, and approximate per-key rate limiting: anywhere a small overestimate of a count is acceptable. A minimal implementation:

import mmh3  # seeded MurmurHash3; the i-th hash function is mmh3.hash(x, seed=i)

class CountMinSketch:
    def __init__(self, w, d):
        self.w = w                   # columns per row
        self.d = d                   # rows (one hash function each)
        self.table = [[0] * w for _ in range(d)]

    def update(self, x, count=1):
        for i in range(self.d):
            col = mmh3.hash(x, seed=i) % self.w
            self.table[i][col] += count

    def estimate(self, x):
        return min(
            self.table[i][mmh3.hash(x, seed=i) % self.w]
            for i in range(self.d)
        )

Real-World Usage

Google Bigtable — SSTable Bloom Filters

Bigtable stores data in SSTables (Sorted String Tables) on GFS. When a read request arrives, the system needs to determine which SSTables might contain the requested row key. Without Bloom filters, this requires reading each SSTable's index from disk.

Bigtable attaches a Bloom filter to each SSTable. Before performing a disk seek, it checks the Bloom filter in memory: if the filter reports that the row key is definitely not in that SSTable, the disk read is skipped entirely; only a "maybe present" answer triggers the actual read.

This dramatically reduces the number of disk reads, especially for rows that don't exist.

Apache Cassandra — Per-SSTable Bloom Filters

Cassandra uses a similar approach. Each SSTable has an associated Bloom filter that is loaded into memory. Configuration options include:

# cassandra.yaml
bloom_filter_fp_chance: 0.01    # 1% FPR (default)
# Lower FPR → more memory, fewer disk reads
# Higher FPR → less memory, more disk reads

# Per-table override:
CREATE TABLE users (
    id UUID PRIMARY KEY,
    name text
) WITH bloom_filter_fp_chance = 0.001;  -- 0.1% FPR for hot table

CDN Cache Membership

Content Delivery Networks use Bloom filters to quickly determine if a piece of content is cached at a particular edge node. Instead of querying the cache backend (which might be slow), a Bloom filter check in memory can immediately tell you "definitely not cached here — route to origin." This avoids unnecessary cache lookups and reduces latency.

Chrome Safe Browsing

Google Chrome uses a local Bloom filter (now upgraded to more advanced structures) to check if a URL might be in the list of known malicious websites. If the local filter says "maybe malicious," Chrome contacts Google's servers for a definitive check. This avoids sending every URL you visit to Google while still protecting against known threats.

Spam Detection

Email systems use Bloom filters to quickly check if a message hash matches known spam signatures. The Bloom filter sits in memory for O(1) lookups, and only potential matches (including false positives) trigger a more expensive database lookup.

URL Shortener — Collision Avoidance

When generating short codes, you need to check if a code already exists. A Bloom filter provides a fast first check:

# Generate a short code for a new URL
def generate_short_code(url):
    while True:
        code = random_alphanumeric(7)

        # Fast check: is this code already used?
        if not bloom_filter.might_contain(code):
            # Definitely not used — safe to proceed
            bloom_filter.add(code)
            db.insert(code, url)
            return code

        # Bloom says "maybe used" — check database to be sure
        if not db.exists(code):
            # False positive — code is actually free
            bloom_filter.add(code)
            db.insert(code, url)
            return code

        # Collision — try again
        continue

Medium — Recommendation Dedup

Medium uses Bloom filters to avoid recommending articles a user has already seen. Each user has a Bloom filter tracking read articles. Before showing a recommendation, check the filter — a false positive just means skipping one extra article, which is harmless.

Implementation

Here is a complete, production-quality Bloom filter implementation in Python:

import math
import mmh3  # MurmurHash3 — fast, high-quality non-cryptographic hash


class BloomFilter:
    """
    Space-efficient probabilistic set membership test.
    Supports insert and lookup (no deletion).
    """

    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        """
        Args:
            expected_items: Expected number of elements (n)
            fp_rate: Desired false positive rate (p), default 1%
        """
        self.n = expected_items
        self.p = fp_rate

        # Calculate optimal size: m = -(n × ln(p)) / (ln(2))²
        self.m = self._optimal_m(expected_items, fp_rate)

        # Calculate optimal hash count: k = (m / n) × ln(2)
        self.k = self._optimal_k(self.m, expected_items)

        # Bit array (using bytearray for efficiency)
        self.bit_array = bytearray(math.ceil(self.m / 8))
        self.count = 0  # number of elements inserted

    @staticmethod
    def _optimal_m(n: int, p: float) -> int:
        """Compute optimal bit array size."""
        m = -(n * math.log(p)) / (math.log(2) ** 2)
        return int(math.ceil(m))

    @staticmethod
    def _optimal_k(m: int, n: int) -> int:
        """Compute optimal number of hash functions."""
        k = (m / n) * math.log(2)
        return max(1, int(round(k)))

    def _get_hash_values(self, item: str) -> list:
        """
        Generate k hash values using double hashing:
          h_i(x) = h1(x) + i × h2(x)
        This is proven equivalent to k independent hashes.
        """
        h1 = mmh3.hash(item, seed=0) % self.m
        h2 = mmh3.hash(item, seed=42) % self.m
        h2 = h2 or 1  # guard: h2 == 0 would collapse all k positions onto h1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def _set_bit(self, pos: int):
        byte_idx = pos // 8
        bit_idx = pos % 8
        self.bit_array[byte_idx] |= (1 << bit_idx)

    def _get_bit(self, pos: int) -> bool:
        byte_idx = pos // 8
        bit_idx = pos % 8
        return bool(self.bit_array[byte_idx] & (1 << bit_idx))

    def add(self, item: str):
        """Insert an item into the Bloom filter."""
        for pos in self._get_hash_values(item):
            self._set_bit(pos)
        self.count += 1

    def might_contain(self, item: str) -> bool:
        """
        Check if item might be in the set.
        Returns:
          False → item is DEFINITELY NOT in the set
          True  → item is PROBABLY in the set (FPR ≈ self.p)
        """
        return all(self._get_bit(pos)
                    for pos in self._get_hash_values(item))

    def estimated_fpr(self) -> float:
        """Estimate current false positive rate."""
        exponent = -self.k * self.count / self.m
        return (1 - math.exp(exponent)) ** self.k

    def fill_ratio(self) -> float:
        """Fraction of bits that are set to 1."""
        set_bits = sum(bin(byte).count('1')
                       for byte in self.bit_array)
        return set_bits / self.m

    def __len__(self):
        return self.count

    def __repr__(self):
        return (f"BloomFilter(n={self.n}, m={self.m}, k={self.k}, "
                f"items={self.count}, est_fpr={self.estimated_fpr():.4%})")


# === Usage ===
bf = BloomFilter(expected_items=1_000_000, fp_rate=0.01)
print(bf)
# BloomFilter(n=1000000, m=9585059, k=7, items=0, est_fpr=0.0000%)

# Insert elements
for i in range(1_000_000):
    bf.add(f"user_{i}")

print(bf)
# BloomFilter(n=1000000, m=9585059, k=7, items=1000000, est_fpr=1.0039%)

# Lookup — known members
print(bf.might_contain("user_42"))       # True (correct)
print(bf.might_contain("user_999999"))   # True (correct)

# Lookup — non-members
# Some will be False (correct), some True (false positive)
false_positives = sum(
    bf.might_contain(f"stranger_{i}")
    for i in range(100_000)
)
print(f"False positives: {false_positives} / 100,000 "
      f"= {false_positives/1000:.2f}%")
# Expected: ~1% ≈ 1000 false positives

print(f"Fill ratio: {bf.fill_ratio():.2%}")
# Expected: ~52% (slightly above the optimal 50%, since k was rounded up from 6.64 to 7)

Double hashing trick: Instead of using k truly independent hash functions (hard to construct in practice), use just two hashes h₁ and h₂ and derive the rest as hᵢ(x) = h₁(x) + i × h₂(x). Kirsch & Mitzenmacher (2006) proved this preserves the asymptotic false positive guarantees of k fully independent hashes.

Summary

| Structure | Question Answered | Space | Error Type |
|---|---|---|---|
| Bloom Filter | Is x in S? | ~10 bits/element | False positives only |
| Cuckoo Filter | Is x in S? (+ delete) | ~12 bits/element | False positives only |
| HyperLogLog | How many distinct x? | ~12 KB | ±0.81% error |
| Count-Min Sketch | How often has x appeared? | Configurable | Overestimates only |