High Level Design Series · Real-World Designs · Part 50 of 70

Design: Search Autocomplete

Problem Statement

When a user starts typing a query into a search box, the system should return the top 5 most relevant suggestions almost instantaneously. This feature — search autocomplete (also called typeahead) — is one of the most visible and latency-sensitive components in any search engine.

Think about typing "how to" in Google. Before you finish typing the third character, five completions appear. Those suggestions are drawn from billions of historical queries, ranked by frequency (and personalization), and delivered in under 100 milliseconds.

The challenge: how do we build a system that indexes billions of queries and returns prefix-matched results in sub-100ms — at the scale of 10 billion queries per day?

Requirements

Functional Requirements

Non-Functional Requirements

Back-of-the-Envelope Estimation

Metric                             Estimate
Daily queries                      10 billion
Unique queries                     ~1 billion (high duplication)
Average QPS                        ~115K
Peak QPS                           ~400K (3.5× average)
Autocomplete requests per query    ~4 (avg 4 chars typed before selection)
Autocomplete QPS                   ~460K avg, ~1.6M peak
Avg query length                   ~20 characters (4-5 words)
Storage for unique queries         ~20 GB (1B × 20 bytes avg)
Key insight: The autocomplete request volume is a multiple of the search volume because each search generates multiple prefix requests. But the good news is that prefix queries are extremely cacheable — millions of users type the same prefixes.
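These figures can be sanity-checked with a few lines of arithmetic:

```python
# Back-of-the-envelope check of the estimates above (pure arithmetic).
SECONDS_PER_DAY = 86_400
daily_queries = 10_000_000_000

avg_qps = daily_queries / SECONDS_PER_DAY    # ~115.7K average search QPS
peak_qps = avg_qps * 3.5                     # ~405K peak search QPS
autocomplete_avg_qps = avg_qps * 4           # ~463K (4 prefix requests per search)
autocomplete_peak_qps = peak_qps * 4         # ~1.62M at peak
storage_gb = 1_000_000_000 * 20 / 1e9        # 20 GB for unique query strings
```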

High-Level Architecture

The system naturally splits into two major subsystems:

  1. Data Collection & Aggregation Service — collects raw search queries, aggregates frequencies, and builds the suggestion index offline
  2. Serving Service — receives a prefix from the user and returns top-5 suggestions in real time using a precomputed trie
┌──────────────┐     ┌───────────────────┐     ┌──────────────────┐
│  User types  │────▶│  API Gateway /    │────▶│  Autocomplete    │
│  "tre..."    │     │  Load Balancer    │     │  Service (Trie)  │
└──────────────┘     └───────────────────┘     └──────────────────┘
                                                        │
                                                        ▼
                                               ┌──────────────────┐
                                               │  Trie Cluster    │
                                               │  (in-memory)     │
                                               └──────────────────┘

┌──────────────┐     ┌───────────────────┐     ┌──────────────────┐
│  Search Logs │────▶│  Aggregation      │────▶│  Trie Builder    │
│  (raw)       │     │  (MapReduce/Spark)│     │  (offline)       │
└──────────────┘     └───────────────────┘     └──────────────────┘

The key design principle: separate the read path from the write path. The serving layer is optimized for ultra-fast reads. The data pipeline updates the trie periodically (not in real-time) to avoid the complexity and latency of online updates.

The Trie Data Structure

A trie (prefix tree) is the foundational data structure for autocomplete. Each node represents a character, and paths from root to leaf (or marked internal node) represent complete strings. Crucially, all strings sharing a common prefix share the same path in the trie.

Basic Trie Structure

class TrieNode:
    def __init__(self):
        self.children = {}          # char → TrieNode
        self.is_end_of_word = False
        self.frequency = 0          # search frequency (only at word-end nodes)

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str, freq: int):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end_of_word = True
        node.frequency = freq

    def search_prefix(self, prefix: str) -> TrieNode:
        """Navigate to the node representing the prefix."""
        node = self.root
        for char in prefix:
            if char not in node.children:
                return None
            node = node.children[char]
        return node

Problem with Naive Trie Search

To find the top-5 suggestions for prefix "tre", the naive approach traverses to node t → r → e, then performs a DFS of the entire subtree to collect all complete words and sort by frequency. This is O(n) where n is the number of strings with that prefix — potentially millions for popular prefixes.

def naive_top_k(self, prefix: str, k: int = 5):
    """O(n) approach — TOO SLOW for production."""
    node = self.search_prefix(prefix)
    if not node:
        return []
    results = []
    self._dfs_collect(node, prefix, results)
    results.sort(key=lambda x: -x[1])  # Sort by frequency desc
    return results[:k]

def _dfs_collect(self, node, current, results):
    if node.is_end_of_word:
        results.append((current, node.frequency))
    for char, child in node.children.items():
        self._dfs_collect(child, current + char, results)
Why naive search fails: The prefix "th" could match millions of queries (the, that, this, then, there, ...). We cannot afford to traverse millions of nodes on every keystroke at <100ms latency.

Top-K Optimization

The critical optimization: precompute and cache the top-K results at every trie node. Instead of traversing the subtree at query time, each node stores a sorted list of the K most frequent completions reachable from that node.

Enhanced Trie Node

class OptimizedTrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False
        self.frequency = 0
        self.top_k = []  # List of (query_string, frequency) — precomputed!

class OptimizedTrie:
    K = 5  # top-K suggestions

    def __init__(self):
        self.root = OptimizedTrieNode()

    def build_top_k(self, node, current_prefix=""):
        """
        Bottom-up: collect top-K from all children,
        merge with self (if end-of-word), keep top K.
        """
        candidates = []

        # If this node marks a complete query, include it
        if node.is_end_of_word:
            candidates.append((current_prefix, node.frequency))

        # Collect top-K from each child subtree
        for char, child in node.children.items():
            self.build_top_k(child, current_prefix + char)
            candidates.extend(child.top_k)

        # Sort by frequency (descending) and keep top K
        candidates.sort(key=lambda x: -x[1])
        node.top_k = candidates[:self.K]

    def autocomplete(self, prefix: str):
        """O(L) lookup where L = prefix length. Constant time after navigation!"""
        node = self.root
        for char in prefix:
            if char not in node.children:
                return []
            node = node.children[char]
        return node.top_k  # Already precomputed!

Time Complexity Comparison

Approach              Query Time                          Space
Naive DFS             O(L + n log n), n = subtree size    O(total chars)
Precomputed Top-K     O(L), L = prefix length             O(total chars × K)
Precomputed + Cache   O(1) on cache hit                   O(total chars × K + cache)

The space increase is modest. Each node stores K pointers (or short strings). For K=5, that's 5 extra references per node. The query speedup — from potentially millions of nodes traversed to just following L character edges — is the difference between 500ms and 0.5ms.
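A self-contained sketch (restating the classes above in compact form) exercises the whole flow — insert, the bottom-up top-K pass, then an O(L) lookup:

```python
# Compact, self-contained version of the precomputed-top-K trie.
class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.is_end = False
        self.freq = 0
        self.top_k = []      # precomputed (query, freq) list

def insert(root, word, freq):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.is_end = True
    node.freq = freq

def build_top_k(node, prefix="", k=5):
    """Post-order DFS: merge children's top-K lists, keep the best k."""
    candidates = [(prefix, node.freq)] if node.is_end else []
    for ch, child in node.children.items():
        build_top_k(child, prefix + ch, k)
        candidates.extend(child.top_k)
    candidates.sort(key=lambda x: -x[1])
    node.top_k = candidates[:k]

def autocomplete(root, prefix):
    """O(L) walk to the prefix node; results are already precomputed."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return node.top_k

root = Node()
for q, f in {"tree": 60, "trend": 55, "trek": 18, "treasure": 12,
             "tremendous": 8, "true": 40}.items():
    insert(root, q, f)
build_top_k(root)
autocomplete(root, "tre")
# [('tree', 60), ('trend', 55), ('trek', 18), ('treasure', 12), ('tremendous', 8)]
```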


Trie Storage & Serialization

For production, we need to persist the trie so it can be loaded into memory on server startup, replicated across nodes, and backed up. There are several strategies:

Option 1: Serialize to Binary File

Flatten the trie into a compact binary format using level-order (BFS) or pre-order (DFS) traversal. Each node is stored as:

struct SerializedNode {
    char character;        // 1 byte
    uint8_t num_children;  // 1 byte
    bool is_end;           // 1 bit (packed)
    uint32_t frequency;    // 4 bytes (if is_end)
    TopK top_k[5];         // 5 × (string_offset + frequency) = 5 × 8 = 40 bytes
}
// Total per node: ~46 bytes + variable-length top-K strings

For 1 billion unique queries with an average length of 20 characters, the trie has roughly 5–10 billion nodes — far fewer than the 20 billion raw characters, thanks to shared prefixes. At ~50 bytes/node, that's 250–500 GB — too large for a single machine's memory.

Option 2: Key-Value Store (Distributed)

Each trie node is serialized and stored in a distributed key-value store (e.g., Redis, DynamoDB). The key is the prefix string; the value is the top-K list for that prefix:

# Key-Value Store Schema
Key: prefix string (e.g., "tre")
Value: JSON array of top-K suggestions

Example entries:
"t"   → [("the", 980M), ("to", 720M), ("this", 650M), ("that", 600M), ("time", 450M)]
"tr"  → [("trump", 85M), ("translate", 70M), ("tree", 60M), ("trend", 55M), ("travel", 50M)]
"tre" → [("tree", 60M), ("trend", 55M), ("trek", 18M), ("treasure", 12M), ("tremendous", 8M)]
Practical insight: We don't actually store the full trie in a KV store — we precompute the result for every possible prefix (up to some max length) and store it as a flat lookup table. This is essentially "unrolling" the trie into a hash map. For short prefixes (1–3 chars), this is a small number of keys. For longer prefixes, most are pruned because they have too few queries.
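A toy version of this unrolling — turning a frequency map into a flat prefix → top-K table — can be sketched as follows (parameter names are illustrative):

```python
from collections import defaultdict

def unroll_to_kv(freqs, k=5, max_prefix_len=20):
    """Precompute prefix -> top-k completions as a flat dict (the 'unrolled' trie)."""
    table = defaultdict(list)
    for query, freq in freqs.items():
        for i in range(1, min(len(query), max_prefix_len) + 1):
            table[query[:i]].append((query, freq))
    # Sort each bucket by frequency and keep only the top k
    return {p: sorted(cands, key=lambda x: -x[1])[:k]
            for p, cands in table.items()}

kv = unroll_to_kv({"tree": 60, "trend": 55, "trek": 18})
kv["tre"]   # [('tree', 60), ('trend', 55), ('trek', 18)]
```

In production this table would be written to the KV store shard by shard; prefixes whose total frequency falls below a cutoff would be pruned before upload.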

Option 3: Compact Trie Representations

Several compressed trie variants reduce memory dramatically:

# Patricia Trie: Merging single-child chains
# Before (standard trie):
#   root → t → r → e → e (4 nodes for "tree")
#                    → n → d (5 nodes for "trend")
#
# After (Patricia trie):
#   root → "tr" → "ee" (leaf: "tree")
#                → "end" (leaf: "trend")
#
# Node count: 4 vs 7 — nearly 50% reduction

class PatriciaNode:
    def __init__(self):
        self.edge_label = ""        # compressed edge (multiple chars)
        self.children = {}          # first char of edge → PatriciaNode
        self.is_end = False
        self.frequency = 0
        self.top_k = []
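A minimal path-compression sketch over a dict-based trie. Note that full compression can split edges differently from the figure (producing "tre" rather than "tr" + "e"), but the node count comes out the same:

```python
def build(words):
    """Plain trie as nested dicts; '$' marks end of word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def compress(node):
    """Patricia-style: merge single-child chains into multi-char edge labels."""
    out = {}
    if "$" in node:
        out["$"] = True
    for ch, child in node.items():
        if ch == "$":
            continue
        label, cur = ch, child
        # Extend the label while the chain has exactly one child and no end marker
        while len(cur) == 1 and "$" not in cur:
            (next_ch, next_node), = cur.items()
            label += next_ch
            cur = next_node
        out[label] = compress(cur)
    return out

def count_nodes(node):
    return 1 + sum(count_nodes(c) for k, c in node.items() if k != "$")

plain = build(["tree", "trend"])
patricia = compress(plain)
count_nodes(plain), count_nodes(patricia)   # (7, 4)
```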

Data Collection Service

The data collection service is the offline pipeline that transforms raw search logs into the trie structure served in production.

Step 1: Log Collection

Every search query is logged with metadata:

{
  "query": "tree traversal algorithm",
  "timestamp": "2026-04-15T14:32:01Z",
  "user_id": "u_abc123",        // for personalization (optional)
  "locale": "en-US",
  "device": "mobile",
  "result_clicked": true         // did user click a result?
}

Logs are written to a distributed log store (Kafka, Kinesis) in real-time. Daily volume: ~10 billion events × ~200 bytes = ~2 TB/day of raw logs.

Step 2: Cleaning & Normalization

Step 3: Frequency Aggregation (MapReduce)

The core aggregation job counts query frequencies over a time window. Here's the MapReduce design:

# === MapReduce Job: Query Frequency Aggregation ===

# MAPPER: Emit (query, dedup_key) for each valid search log
def mapper(log_line):
    record = parse_json(log_line)
    query = normalize(record["query"])

    # Filter: skip if the query is too short, offensive, or from a bot
    if len(query) < 2 or is_offensive(query) or is_bot(record["user_id"]):
        return

    # Deduplicate per (user, query, hour) — the same user issuing the same
    # query counts once per hour
    dedup_key = f"{record['user_id']}:{query}:{record['timestamp'][:13]}"
    emit(query, dedup_key)

# COMBINER (local aggregation on the mapper node). Note: this dedup is
# approximate — duplicates split across mappers are counted twice; exact
# dedup would need a prior pass keyed on dedup_key.
def combiner(query, dedup_keys):
    unique_keys = set(dedup_keys)
    emit(query, len(unique_keys))

# REDUCER: Sum frequencies across all mappers
def reducer(query, counts):
    total = sum(counts)
    if total >= MIN_FREQUENCY_THRESHOLD:  # e.g., 100
        emit(query, total)
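The dedup-then-count logic can be simulated in-process (a toy stand-in for the distributed job; field names are illustrative):

```python
from collections import defaultdict

logs = [
    {"user": "u1", "query": "tree", "hour": "2026-04-15T14"},
    {"user": "u1", "query": "tree", "hour": "2026-04-15T14"},  # dup: same user/query/hour
    {"user": "u2", "query": "tree", "hour": "2026-04-15T14"},
    {"user": "u1", "query": "trend", "hour": "2026-04-15T15"},
]

# Map + combine: query -> set of dedup keys
mapped = defaultdict(set)
for r in logs:
    mapped[r["query"]].add(f"{r['user']}:{r['query']}:{r['hour']}")

# Reduce: count unique (user, query, hour) occurrences
counts = {q: len(keys) for q, keys in mapped.items()}
counts   # {'tree': 2, 'trend': 1}
```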

Time-Weighted Aggregation

Queries from last week matter more than queries from last year. We use exponential decay weighting:

# Time-weighted frequency with exponential decay
#
# weight(age_in_weeks) = decay_factor ^ age_in_weeks
#
# Example with decay_factor = 0.85:
#   This week:  weight = 0.85^0 = 1.00
#   Last week:  weight = 0.85^1 = 0.85
#   2 weeks ago: weight = 0.85^2 = 0.72
#   4 weeks ago: weight = 0.85^4 = 0.52
#   8 weeks ago: weight = 0.85^8 = 0.27
#
# weighted_freq(query) = Σ count_per_week × weight(age)

def compute_weighted_frequency(query, weekly_counts):
    """weekly_counts: [(week_offset, count), ...]"""
    DECAY = 0.85
    total = 0.0
    for week_offset, count in weekly_counts:
        weight = DECAY ** week_offset
        total += count * weight
    return total
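Plugging in sample weekly counts (illustrative numbers) shows the decay at work:

```python
def compute_weighted_frequency(weekly_counts, decay=0.85):
    # weekly_counts: [(week_offset, count), ...]; week 0 = current week
    return sum(count * decay ** week for week, count in weekly_counts)

# 1000 searches this week, 800 last week, 500 four weeks ago
total = compute_weighted_frequency([(0, 1000), (1, 800), (4, 500)])
# total ≈ 1000 + 680 + 261 ≈ 1941
```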

Trie Build Pipeline

The complete offline pipeline transforms raw search logs into a production-ready trie. This pipeline runs on a schedule — typically weekly for a full rebuild, with daily incremental updates for trending queries.

Pipeline Stages

Stage 1: Raw Logs (Kafka/S3)
    ↓
Stage 2: Clean & Normalize (Spark job)
    ↓
Stage 3: Aggregate Frequencies (MapReduce/Spark)
    ↓
Stage 4: Merge with Historical Data (weighted)
    ↓
Stage 5: Build Trie + Precompute Top-K
    ↓
Stage 6: Serialize & Upload to Blob Store (S3)
    ↓
Stage 7: Deploy to Serving Cluster (rolling update)

Stage 5: Trie Construction in Detail

# Build the optimized trie from frequency data
from typing import Dict

def build_trie(query_frequencies: Dict[str, float]) -> OptimizedTrie:
    """
    Input: {"tree": 60_000_000, "trend": 55_000_000, ...}
    Output: Trie with precomputed top-K at every node
    """
    trie = OptimizedTrie()

    # Step 1: Insert all queries
    for query, freq in query_frequencies.items():
        node = trie.root
        for char in query:
            if char not in node.children:
                node.children[char] = OptimizedTrieNode()
            node = node.children[char]
        node.is_end_of_word = True
        node.frequency = freq

    # Step 2: Bottom-up pass to compute top-K at every node
    trie.build_top_k(trie.root, "")

    return trie

# Build top-K bottom-up (post-order DFS)
def build_top_k(node, prefix, K=5):
    candidates = []

    if node.is_end_of_word:
        candidates.append((prefix, node.frequency))

    for char, child in sorted(node.children.items()):
        build_top_k(child, prefix + char, K)
        candidates.extend(child.top_k)

    # Keep only top K by frequency
    candidates.sort(key=lambda x: -x[1])
    node.top_k = candidates[:K]

Stage 6–7: Deployment Strategy

The new trie is serialized and uploaded to a blob store (S3). Serving nodes perform a rolling update:

  1. A new trie version is uploaded to S3 with a version identifier
  2. A control plane signals serving nodes to download the new version
  3. Each serving node downloads the trie in the background into a shadow buffer
  4. Once loaded, the node atomically swaps the trie pointer (old → new)
  5. No downtime: the old trie serves traffic until the swap completes
# Atomic trie swap on serving node
class AutocompleteServer:
    def __init__(self):
        self.current_trie = None       # Active trie
        self.loading_trie = None       # Being loaded in background

    def load_new_trie(self, s3_path):
        """Background thread loads new trie."""
        self.loading_trie = deserialize_trie(download(s3_path))

        # Atomic swap (rebinding an attribute is atomic in CPython)
        old = self.current_trie
        self.current_trie = self.loading_trie
        self.loading_trie = None
        del old  # Drop the last reference; the GC reclaims the old trie

    def query(self, prefix):
        """Reads from current_trie — never blocked by loading."""
        return self.current_trie.autocomplete(prefix)


Caching Strategy

Caching is the single most important optimization in the autocomplete system. Prefix queries are extraordinarily cacheable: millions of users type the same prefixes (especially short ones like "wh", "ho", "the").

Layer 1: Browser Cache (Most Important)

Use aggressive HTTP caching headers to avoid even hitting the network for repeated prefixes:

# HTTP Response Headers for autocomplete API
Cache-Control: public, max-age=3600        # Cache for 1 hour
ETag: "trie-v2026041500"                   # Version-based ETag
Vary: Accept-Encoding

# Example API request and response:
# GET /api/autocomplete?q=tre
# Response:
# {
#   "suggestions": [
#     {"text": "tree", "score": 60000000},
#     {"text": "trend", "score": 55000000},
#     {"text": "trek", "score": 18000000},
#     {"text": "treasure", "score": 12000000},
#     {"text": "tremendous", "score": 8000000}
#   ]
# }

With 1-hour TTL, a user who types "tre" and then re-searches something starting with "tre" later will get an instant cache hit. Even across page refreshes, the browser cache serves results in 0ms.

Layer 2: CDN Cache

Place a CDN (CloudFront, Fastly) in front of the autocomplete API. Short prefixes (1–3 characters) have extremely high cache hit rates because they're shared by virtually all users:

Prefix Length    Unique Prefixes    CDN Hit Rate
1 character      26                 ~99.9%
2 characters     ~676               ~99%
3 characters     ~17,576            ~95%
4 characters     ~100K active       ~80%
5+ characters    ~1M+ active        ~50-70%
Cache math: With an average 85% CDN hit rate, the 1.6M peak QPS to the autocomplete API becomes only ~240K QPS reaching the actual serving cluster. The CDN absorbs the vast majority of traffic.
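The arithmetic behind that claim:

```python
peak_autocomplete_qps = 1_600_000
cdn_hit_rate = 0.85

# Only cache misses reach the origin serving cluster
origin_qps = peak_autocomplete_qps * (1 - cdn_hit_rate)   # ~240K QPS at peak
```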

Layer 3: Application-Level Cache

On the serving nodes themselves, an in-memory LRU cache stores recently-queried prefixes to avoid repeated trie traversal:

class CachedAutocompleteServer:
    def __init__(self, trie):
        self.trie = trie
        self.cache = LRUCache(capacity=1_000_000)  # ~1M entries

    def query(self, prefix):
        # L1: Check application cache
        cached = self.cache.get(prefix)
        if cached is not None:
            return cached

        # L2: Traverse trie
        result = self.trie.autocomplete(prefix)
        self.cache.put(prefix, result)
        return result
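The LRUCache used above is assumed rather than shown; a minimal sketch with collections.OrderedDict:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```

In practice a production server might instead use functools.lru_cache or an off-the-shelf cache library, but the eviction behavior is the same.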

Client-Side Optimization: Debouncing & Local Filtering

The client (browser/mobile) further reduces requests by debouncing keystrokes and by filtering locally from results already cached for a shorter prefix:

// Client-side: local filtering from cached results
class AutocompleteClient {
    constructor() {
        this.cache = new Map(); // prefix → suggestions
    }

    async getSuggestions(prefix) {
        // Try to filter from a shorter cached prefix
        for (let i = prefix.length - 1; i >= 1; i--) {
            const shorter = prefix.substring(0, i);
            const cached = this.cache.get(shorter);
            if (cached) {
                const filtered = cached.filter(s =>
                    s.text.startsWith(prefix)
                );
                if (filtered.length >= 5) {
                    return filtered.slice(0, 5); // No network call!
                }
            }
        }

        // Cache miss — fetch from server
        const response = await fetch(`/api/autocomplete?q=${encodeURIComponent(prefix)}`);
        const data = await response.json();
        this.cache.set(prefix, data.suggestions);
        return data.suggestions.slice(0, 5);
    }
}

Scaling the Serving Layer

At 10 billion queries/day, the serving infrastructure must be horizontally scalable. The primary strategy is trie sharding combined with read replication.

Sharding Strategy: By First Character(s)

Partition the trie by the first one or two characters of the prefix:

# Shard by first character (26 shards for a-z)
Shard 0 (a): handles prefixes "a", "ab", "abc", ...
Shard 1 (b): handles prefixes "b", "ba", "bar", ...
...
Shard 25 (z): handles prefixes "z", "za", "zoo", ...

# Problem: Shards are very unbalanced!
# "t" shard handles: the, this, that, to, time, they, than, ...
# "x" shard handles: xbox, xray, xml, ... (tiny)

# Better: Shard by first TWO characters (676 shards)
Shard "th": handles "th", "the", "this", "that", "then", "there", ...
Shard "tr": handles "tr", "tree", "trend", "travel", ...

# Even better: Consistent hashing on 2-char prefixes → N shards
# Automatically balances hot and cold prefixes across machines
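A simple hash-mod router on 2-character prefixes illustrates the co-location idea (true consistent hashing would place shards on a ring so that adding machines remaps only a fraction of prefixes; this sketch shows only the routing):

```python
import hashlib

NUM_SHARDS = 64

def shard_for(prefix: str) -> int:
    """Route by hashing the first two characters — all 'th*' prefixes co-locate."""
    key = prefix[:2].lower()
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

shard_for("the") == shard_for("this")   # True — both route via "th"
```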

Dealing with Hot Shards

Even with 2-character sharding, some shards are far hotter than others. The prefix "th" generates vastly more queries than "zx". Solutions:

# Shard routing table
{
  "th": {"shard_id": 42, "replicas": ["node-42a", "node-42b", ..., "node-42j"]},
  "tr": {"shard_id": 43, "replicas": ["node-43a", "node-43b", "node-43c"]},
  "zx": {"shard_id": 200, "replicas": ["node-200a", "node-200b"]},
  ...
}

# Each serving node holds ONE shard of the trie in memory
# Load balancer routes requests based on prefix → shard → round-robin replica

Replication for Reads

Since the trie is read-only (it's rebuilt offline and swapped atomically), replication is straightforward:

  1. The trie build pipeline produces one serialized trie per shard
  2. Each shard blob is uploaded to S3
  3. All replicas for that shard download and load the new blob
  4. No write conflicts, no consensus protocols, no leader election
Architecture beauty: Because the trie is immutable (rebuilt, not mutated), we get the simplicity of a read-only replicated cache with the consistency of a batch pipeline. There's no need for distributed locks, version vectors, or conflict resolution.

Filtering Offensive Suggestions

An autocomplete system that suggests offensive, dangerous, or inappropriate content is a reputational catastrophe. Filtering must be multi-layered:

Layer 1: Blocklist (Offline)

Maintain a curated blocklist of terms that must never appear in suggestions. This list is applied during the trie build pipeline:

BLOCKLIST = load_blocklist("s3://safety/blocklist.txt")              # ~100K exact terms
BLOCKLIST_SUBSTRINGS = load_blocklist("s3://safety/substrings.txt")  # curated substring patterns (illustrative path)

def is_safe(query):
    """Check query against blocklist and pattern rules."""
    normalized = query.lower().strip()

    # Exact match
    if normalized in BLOCKLIST:
        return False

    # Substring match (catch variations)
    for blocked_term in BLOCKLIST_SUBSTRINGS:
        if blocked_term in normalized:
            return False

    return True

# Applied during trie construction:
for query, freq in query_frequencies.items():
    if is_safe(query):
        trie.insert(query, freq)

Layer 2: ML Classifier (Offline)

A trained classifier scores queries for toxicity, NSFW content, and violence. Queries scoring above a threshold are excluded:

# During the aggregation pipeline
toxicity_model = load_model("toxicity-classifier-v3")

for query, freq in query_frequencies.items():
    score = toxicity_model.predict(query)
    if score > HARD_BLOCK_THRESHOLD:    # e.g., 0.95 — block outright
        block(query)
    elif score > TOXICITY_THRESHOLD:    # e.g., 0.7 — queue for human review
        mark_for_review(query)

Layer 3: Real-Time Override (Online)

For emergencies (e.g., a breaking news event makes a previously-safe query offensive), a real-time override service can instantly suppress specific suggestions without waiting for a trie rebuild:

class AutocompleteWithOverrides:
    def query(self, prefix):
        suggestions = self.trie.autocomplete(prefix)

        # Apply real-time overrides (loaded from a fast config service)
        blocked = self.override_service.get_blocked_queries()
        filtered = [s for s in suggestions if s.text not in blocked]

        return filtered[:5]

Trending Queries

A purely frequency-based system would always show the same historical top queries. But when a major event happens (e.g., a new product launch, breaking news), users want to see those emerging queries immediately.

Detecting Trends

A query is "trending" if its recent frequency significantly exceeds its historical baseline:

# Trend detection formula
trending_score(query) = recent_freq / (historical_avg_freq + epsilon)

# Example:
# "eclipse 2026": historical avg = 500/day, today = 50,000/day
# trending_score = 50000 / (500 + 1) = 99.8 (very trending!)
#
# "weather today": historical avg = 200,000/day, today = 220,000/day
# trending_score = 220000 / (200000 + 1) = 1.1 (not trending)
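As a function, with the two examples above:

```python
def trending_score(recent_freq, historical_avg, epsilon=1.0):
    # Ratio of today's volume to the historical baseline
    return recent_freq / (historical_avg + epsilon)

round(trending_score(50_000, 500), 1)       # 99.8 — very trending
round(trending_score(220_000, 200_000), 1)  # 1.1  — not trending
```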

Architecture for Trending

┌─────────────────┐     ┌──────────────────────┐
│  Real-time log  │────▶│  Stream processor     │
│  (Kafka)        │     │  (Flink / Spark       │
└─────────────────┘     │   Streaming)          │
                        └──────────┬───────────┘
                                   │ trending queries
                                   ▼
                        ┌──────────────────────┐
                        │  Trending cache       │
                        │  (Redis / Memcached)  │
                        └──────────┬───────────┘
                                   │
                    ┌──────────────┴──────────────┐
                    │  Autocomplete serving node   │
                    │  trie.query() + trending_mix │
                    └─────────────────────────────┘

The serving node merges results from the precomputed trie and the trending cache:

def query_with_trending(self, prefix, k=5):
    # Get standard top-K from trie
    trie_results = self.trie.autocomplete(prefix)

    # Get trending queries matching this prefix
    trending = self.trending_cache.get_matching(prefix)

    # Merge: trending queries get a boost
    TRENDING_BOOST = 2.0
    merged = []
    for q, score in trie_results:
        merged.append((q, score))
    for q, trend_score in trending:
        if q.startswith(prefix):
            merged.append((q, trend_score * TRENDING_BOOST))

    # Deduplicate and sort
    seen = set()
    unique = []
    for q, score in sorted(merged, key=lambda x: -x[1]):
        if q not in seen:
            seen.add(q)
            unique.append((q, score))
    return unique[:k]

Personalization

Different users searching the same prefix should see different suggestions based on their search history. A developer typing "py" probably wants "python" while a finance analyst wants "pypl stock".

User Profile Model

# Per-user search history (stored in user profile service)
{
  "user_id": "u_abc123",
  "recent_queries": [
    {"query": "python async await", "timestamp": "2026-04-15T14:00:00Z"},
    {"query": "python fastapi tutorial", "timestamp": "2026-04-15T13:30:00Z"},
    {"query": "python type hints", "timestamp": "2026-04-14T10:00:00Z"}
  ],
  "topic_affinities": {
    "programming": 0.85,
    "python": 0.72,
    "finance": 0.05,
    "sports": 0.15
  }
}

Personalized Ranking

def personalized_query(self, prefix, user_id, k=5):
    # Get global top-K
    global_results = self.trie.autocomplete(prefix)

    # Get the user's recent queries matching this prefix
    # (assumes the user service returns (query, frequency) pairs)
    user_history = self.user_service.get_recent_queries(user_id)
    personal = [(q, freq) for q, freq in user_history
                if q.startswith(prefix)]

    # Combine: user history gets a significant boost
    PERSONAL_BOOST = 5.0
    combined = {}
    for q, score in global_results:
        combined[q] = score
    for q, freq in personal:
        combined[q] = combined.get(q, 0) + freq * PERSONAL_BOOST

    sorted_results = sorted(combined.items(), key=lambda x: -x[1])
    return sorted_results[:k]
Privacy consideration: Personalization requires storing user search history. This must be opt-in, with clear data retention policies. Many users prefer to search without their history influencing suggestions. Always provide a way to disable personalization.

Multi-Language Support

Supporting multiple languages introduces challenges beyond just character encoding:

Character Set Differences

Architecture for Multi-Language

# Separate trie per language/locale
tries = {
    "en": load_trie("s3://tries/en/latest.bin"),
    "es": load_trie("s3://tries/es/latest.bin"),
    "zh": load_trie("s3://tries/zh/latest.bin"),
    "ja": load_trie("s3://tries/ja/latest.bin"),
    # ...
}

def query(prefix, locale="en"):
    trie = tries.get(locale, tries["en"])
    return trie.autocomplete(prefix)

For CJK languages (Chinese, Japanese, Korean), the trie approach is less effective because words aren't space-delimited. Alternative approaches include indexing romanized input (e.g., pinyin for Chinese), running a word segmenter before trie insertion, and n-gram-based matching.

API Design

# REST API
GET /v1/autocomplete?q={prefix}&locale={locale}&limit={k}

# Request:
GET /v1/autocomplete?q=tre&locale=en&limit=5

# Response:
{
  "prefix": "tre",
  "suggestions": [
    {"text": "tree", "score": 60000000},
    {"text": "trend", "score": 55000000},
    {"text": "trek", "score": 18000000},
    {"text": "treasure", "score": 12000000},
    {"text": "tremendous", "score": 8000000}
  ],
  "cached": true,
  "version": "2026041500"
}

# Headers:
# Cache-Control: public, max-age=3600
# X-Trie-Version: 2026041500

Error Handling & Graceful Degradation

Complete System Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                          CLIENT (Browser/App)                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────────────┐             │
│  │ Debounce │→│ Local     │→│ HTTP Request              │             │
│  │ (50ms)   │  │ Filter   │  │ GET /autocomplete?q=tre  │             │
│  └──────────┘  └──────────┘  └────────────┬─────────────┘             │
└───────────────────────────────────────────┼───────────────────────────┘
                                            │
                            ┌───────────────▼───────────────┐
                            │           CDN                  │
                            │  (CloudFront / Fastly)         │
                            │  Cache hit rate: ~85%          │
                            └───────────────┬───────────────┘
                                            │ cache miss
                            ┌───────────────▼───────────────┐
                            │       Load Balancer            │
                            │  (prefix → shard routing)      │
                            └───────────────┬───────────────┘
                                            │
                    ┌───────────────┬───────┴───────┬───────────────┐
                    ▼               ▼               ▼               ▼
            ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
            │ Shard:   │    │ Shard:   │    │ Shard:   │    │ Shard:   │
            │ "a"-"f"  │    │ "g"-"m"  │    │ "n"-"s"  │    │ "t"-"z"  │
            │ (×3 repl)│    │ (×3 repl)│    │ (×3 repl)│    │ (×5 repl)│
            └──────────┘    └──────────┘    └──────────┘    └──────────┘
                    │               │               │               │
                    └───────────────┴───────┬───────┴───────────────┘
                                            │
                            ┌───────────────▼───────────────┐
                            │       Trending Cache           │
                            │       (Redis Cluster)          │
                            └───────────────────────────────┘

═══════════════════════ OFFLINE PIPELINE ═══════════════════════════════

┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│ Search   │────▶│ Spark    │────▶│ Trie     │────▶│ S3       │
│ Logs     │     │ Aggregate│     │ Builder  │     │ (deploy) │
│ (Kafka)  │     │ Job      │     │          │     │          │
└──────────┘     └──────────┘     └──────────┘     └──────────┘

Interview Tips

Key Points to Emphasize

  1. Trie + precomputed top-K is the core insight — O(L) query time vs O(n) for naive
  2. Read/write separation: The serving layer is read-only; updates happen offline via a batch pipeline
  3. Multi-layer caching (browser → CDN → application → trie) reduces effective QPS by 85–95%
  4. Sharding by prefix allows horizontal scaling; hot prefixes get more replicas
  5. Safety filtering is non-negotiable — always mention it proactively
  6. Trending injection adds freshness without compromising the batch pipeline's simplicity

Common Follow-Up Questions

Q: How do you handle typos?
   Spell correction layer before trie lookup; or store common misspellings as aliases.

Q: What if the trie doesn't fit in memory?
   Shard the trie across machines; or use a Patricia trie / succinct representation.

Q: How fresh are suggestions?
   Full rebuild weekly; trending layer adds hourly freshness; real-time overrides for emergencies.

Q: What about mobile?
   Same API; client-side debouncing is more aggressive on mobile (200ms) to save battery/bandwidth.

Q: How do you test this system?
   A/B test suggestion quality via click-through rate on suggestions; shadow traffic for new trie versions.

Bonus: Space Estimation for Production Trie

Let's estimate the memory footprint of a production trie:

# Assumptions:
# - 1 billion unique queries
# - Average query length: 20 characters
# - Patricia trie compression: ~60% node reduction
# - Each node: 50 bytes (char + pointers + top-5 array)

Total characters:   1B × 20 = 20B characters
Trie nodes (naive): ~5B (shared prefixes reduce total)
Patricia nodes:     ~2B (after compression)

Memory per node:    ~50 bytes
  - edge_label:     8 bytes (pointer to string)
  - children map:   16 bytes (small map avg 2 entries)
  - is_end + freq:  5 bytes
  - top_k array:    5 × 4 = 20 bytes (5 pointers)
  - overhead:       ~1 byte

Total memory: 2B × 50 bytes = ~100 GB
Per shard (26 shards): ~4 GB — fits in a single machine!
Per shard (676 shards): ~150 MB — very comfortable

With 676 two-character shards, each shard holds ~150 MB. A machine with 64 GB RAM can easily hold multiple shards plus the application cache.
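The shard-size arithmetic, spelled out:

```python
patricia_nodes = 2_000_000_000   # after ~60% compression of ~5B naive nodes
bytes_per_node = 50

total_gb = patricia_nodes * bytes_per_node / 1e9    # 100 GB for the full trie
per_shard_gb_26 = total_gb / 26                     # ~3.8 GB with 26 shards
per_shard_mb_676 = total_gb * 1000 / 676            # ~148 MB with 676 shards
```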