Design: Search Autocomplete
Problem Statement
When a user starts typing a query into a search box, the system should return the top 5 most relevant suggestions almost instantaneously. This feature — search autocomplete (also called typeahead) — is one of the most visible and latency-sensitive components in any search engine.
Think about typing "how to" in Google. Before you finish typing the third character, five completions appear. Those suggestions are drawn from billions of historical queries, ranked by frequency (and personalization), and delivered in under 100 milliseconds.
The challenge: how do we build a system that indexes billions of queries and returns prefix-matched results in sub-100ms — at the scale of 10 billion queries per day?
Requirements
Functional Requirements
- Given a prefix string typed by the user, return the top 5 most frequently searched completions
- Suggestions must update with each keystroke (character-by-character)
- Support English language initially (multi-language as extension)
- Only suggest queries that have appeared with sufficient frequency (no one-off typos)
Non-Functional Requirements
- Latency: < 100ms end-to-end for suggestion retrieval
- Scale: Handle ~10 billion search queries per day (≈ 115,000 QPS average, ~400K QPS peak)
- Availability: 99.99% uptime — degraded autocomplete is better than no autocomplete
- Freshness: Incorporate trending queries within hours, not days
Back-of-the-Envelope Estimation
| Metric | Estimate |
|---|---|
| Daily queries | 10 billion |
| Unique queries | ~1 billion (high duplication) |
| Average QPS | ~115K |
| Peak QPS | ~400K (3.5× average) |
| Autocomplete requests per query | ~4 (avg 4 chars typed before selection) |
| Autocomplete QPS | ~460K avg, ~1.6M peak |
| Avg query length | ~20 characters (4-5 words) |
| Storage for unique queries | ~20 GB (1B × 20 bytes avg) |
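These figures follow directly from the assumptions; a quick arithmetic sketch (estimates, not measurements):

# Back-of-the-envelope arithmetic for the table above
daily_queries = 10_000_000_000
avg_qps = daily_queries / 86_400       # ≈ 115,741 → ~115K
peak_qps = avg_qps * 3.5               # ≈ 405K → ~400K
ac_avg_qps = avg_qps * 4               # ~4 autocomplete calls per query → ~463K
ac_peak_qps = peak_qps * 4             # → ~1.6M
storage_bytes = 1_000_000_000 * 20     # 1B unique queries × 20 bytes ≈ 20 GB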
High-Level Architecture
The system naturally splits into two major subsystems:
- Data Collection & Aggregation Service — collects raw search queries, aggregates frequencies, and builds the suggestion index offline
- Serving Service — receives a prefix from the user and returns top-5 suggestions in real time using a precomputed trie
┌──────────────┐ ┌───────────────────┐ ┌──────────────────┐
│ User types │────▶│ API Gateway / │────▶│ Autocomplete │
│ "tre..." │ │ Load Balancer │ │ Service (Trie) │
└──────────────┘ └───────────────────┘ └──────────────────┘
│
▼
┌──────────────────┐
│ Trie Cluster │
│ (in-memory) │
└──────────────────┘
┌──────────────┐ ┌───────────────────┐ ┌──────────────────┐
│ Search Logs │────▶│ Aggregation │────▶│ Trie Builder │
│ (raw) │ │ (MapReduce/Spark)│ │ (offline) │
└──────────────┘ └───────────────────┘ └──────────────────┘
The key design principle: separate the read path from the write path. The serving layer is optimized for ultra-fast reads. The data pipeline updates the trie periodically (not in real-time) to avoid the complexity and latency of online updates.
The Trie Data Structure
A trie (prefix tree) is the foundational data structure for autocomplete. Each node represents a character, and paths from root to leaf (or marked internal node) represent complete strings. Crucially, all strings sharing a common prefix share the same path in the trie.
Basic Trie Structure
class TrieNode:
def __init__(self):
self.children = {} # char → TrieNode
self.is_end_of_word = False
self.frequency = 0 # search frequency (only at word-end nodes)
class Trie:
def __init__(self):
self.root = TrieNode()
def insert(self, word: str, freq: int):
node = self.root
for char in word:
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
node.is_end_of_word = True
node.frequency = freq
    def search_prefix(self, prefix: str) -> "TrieNode | None":
"""Navigate to the node representing the prefix."""
node = self.root
for char in prefix:
if char not in node.children:
return None
node = node.children[char]
return node
Problem with Naive Trie Search
To find the top-5 suggestions for prefix "tre", the naive approach traverses to node t → r → e, then performs a DFS of the entire subtree to collect all complete words and sorts them by frequency. Collecting costs O(n) and sorting O(n log n), where n is the number of completions sharing that prefix — potentially millions for popular prefixes.
def naive_top_k(self, prefix: str, k: int = 5):
"""O(n) approach — TOO SLOW for production."""
node = self.search_prefix(prefix)
if not node:
return []
results = []
self._dfs_collect(node, prefix, results)
results.sort(key=lambda x: -x[1]) # Sort by frequency desc
return results[:k]
def _dfs_collect(self, node, current, results):
if node.is_end_of_word:
results.append((current, node.frequency))
for char, child in node.children.items():
self._dfs_collect(child, current + char, results)
Top-K Optimization
The critical optimization: precompute and cache the top-K results at every trie node. Instead of traversing the subtree at query time, each node stores a sorted list of the K most frequent completions reachable from that node.
Enhanced Trie Node
class OptimizedTrieNode:
def __init__(self):
self.children = {}
self.is_end_of_word = False
self.frequency = 0
self.top_k = [] # List of (query_string, frequency) — precomputed!
class OptimizedTrie:
    K = 5  # top-K suggestions

    def __init__(self):
        self.root = OptimizedTrieNode()

    def insert(self, word: str, freq: int):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = OptimizedTrieNode()
            node = node.children[char]
        node.is_end_of_word = True
        node.frequency = freq
def build_top_k(self, node, current_prefix=""):
"""
Bottom-up: collect top-K from all children,
merge with self (if end-of-word), keep top K.
"""
candidates = []
# If this node marks a complete query, include it
if node.is_end_of_word:
candidates.append((current_prefix, node.frequency))
# Collect top-K from each child subtree
for char, child in node.children.items():
self.build_top_k(child, current_prefix + char)
candidates.extend(child.top_k)
# Sort by frequency (descending) and keep top K
candidates.sort(key=lambda x: -x[1])
node.top_k = candidates[:self.K]
def autocomplete(self, prefix: str):
"""O(L) lookup where L = prefix length. Constant time after navigation!"""
node = self.root
for char in prefix:
if char not in node.children:
return []
node = node.children[char]
return node.top_k # Already precomputed!
Time Complexity Comparison
| Approach | Query Time | Space |
|---|---|---|
| Naive DFS | O(L + n log n) where n = subtree size | O(total chars) |
| Precomputed Top-K | O(L) where L = prefix length | O(total chars × K) |
| Precomputed + Cache | O(1) — cache hit | O(total chars × K + cache) |
The space increase is modest. Each node stores K pointers (or short strings). For K=5, that's 5 extra references per node. The query speedup — from potentially millions of nodes traversed to just following L character edges — is the difference between 500ms and 0.5ms.
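To make this concrete, a minimal usage sketch of the optimized trie above (frequencies are illustrative):

# Build a tiny trie, precompute top-K once, then query in O(L)
trie = OptimizedTrie()
for query, freq in [("tree", 60), ("trend", 55), ("trek", 18)]:
    trie.insert(query, freq)
trie.build_top_k(trie.root)  # single bottom-up pass after all inserts

print(trie.autocomplete("tre"))
# [('tree', 60), ('trend', 55), ('trek', 18)]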
Trie Storage & Serialization
For production, we need to persist the trie so it can be loaded into memory on server startup, replicated across nodes, and backed up. There are several strategies:
Option 1: Serialize to Binary File
Flatten the trie into a compact binary format using level-order (BFS) or pre-order (DFS) traversal. Each node is stored as:
struct SerializedNode {
char character; // 1 byte
uint8_t num_children; // 1 byte
bool is_end; // 1 bit (packed)
uint32_t frequency; // 4 bytes (if is_end)
TopK top_k[5]; // 5 × (string_offset + frequency) = 5 × 8 = 40 bytes
}
// Total per node: ~46 bytes + variable-length top-K strings
For 1 billion unique queries with average length 20, the trie has roughly 5–10 billion nodes (due to shared prefixes). At ~50 bytes/node, that's 250–500 GB — too large for a single machine's memory.
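For illustration, a minimal pre-order serializer sketch in Python; it writes only the fixed-width header per node and assumes the top-K strings go to a separate string table (omitted here):

import struct

def serialize(node, out, char=" "):
    # Per-node header: character, child count, end-of-word flag, frequency
    out.write(struct.pack("<cB?I", char.encode("ascii"), len(node.children),
                          node.is_end_of_word, int(node.frequency)))
    for ch, child in sorted(node.children.items()):
        serialize(child, out, ch)

# with open("trie.bin", "wb") as f:
#     serialize(trie.root, f)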
Option 2: Key-Value Store (Distributed)
Each trie node is serialized and stored in a distributed key-value store (e.g., Redis, DynamoDB). The key is the prefix string; the value is the top-K list for that prefix:
# Key-Value Store Schema
Key: prefix string (e.g., "tre")
Value: JSON array of top-K suggestions
Example entries:
"t" → [("the", 980M), ("to", 720M), ("this", 650M), ("that", 600M), ("time", 450M)]
"tr" → [("trump", 85M), ("translate", 70M), ("tree", 60M), ("trend", 55M), ("travel", 50M)]
"tre" → [("tree", 60M), ("trend", 55M), ("trek", 18M), ("treasure", 12M), ("tremendous", 8M)]
Option 3: Compact Trie Representations
Several compressed trie variants reduce memory dramatically:
- Patricia Trie (Radix Tree): Merge chains of single-child nodes into single edges. "t-r-e-e" becomes one edge "tree". Reduces node count by 50–80%.
- Double-Array Trie: Two parallel arrays (base[] and check[]) encode the trie compactly. Used in many production systems (e.g., libdatrie).
- Succinct Trie (LOUDS): Encode the trie structure in ~2 bits per node using the Level-Order Unary Degree Sequence. Extremely space-efficient.
# Patricia Trie: Merging single-child chains
# Before (standard trie):
# root → t → r → e → e    ("tree")
#              └→ n → d   ("trend" shares the t-r-e chain)
# 7 nodes total (root + 6)
#
# After (Patricia trie):
# root → "tre" → "e"      (leaf: "tree")
#             └→ "nd"     (leaf: "trend")
#
# Node count: 4 (root + 3) vs 7 — nearly 50% reduction
class PatriciaNode:
def __init__(self):
self.edge_label = "" # compressed edge (multiple chars)
self.children = {} # first char of edge → PatriciaNode
self.is_end = False
self.frequency = 0
self.top_k = []
Data Collection Service
The data collection service is the offline pipeline that transforms raw search logs into the trie structure served in production.
Step 1: Log Collection
Every search query is logged with metadata:
{
"query": "tree traversal algorithm",
"timestamp": "2026-04-15T14:32:01Z",
"user_id": "u_abc123", // for personalization (optional)
"locale": "en-US",
"device": "mobile",
"result_clicked": true // did user click a result?
}
Logs are written to a distributed log store (Kafka, Kinesis) in real time. Daily volume: ~10 billion events × ~200 bytes = ~2 TB/day of raw logs.
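A sketch of the producer side with kafka-python; the search-logs topic name and broker address are assumptions:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def log_search(event):
    # Fire-and-forget append to the raw search-log topic
    producer.send("search-logs", event)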
Step 2: Cleaning & Normalization
- Lowercase: "Tree" → "tree"
- Strip special characters: Remove non-alphanumeric except spaces (see the normalize() sketch after this list)
- Deduplicate: Same user searching "tree" 50 times in a session counts as 1
- Filter: Remove queries with < N occurrences (noise/typos)
- Filter offensive content: Blocklist + ML classifier
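A minimal sketch of the normalize() helper the aggregation job below relies on, assuming English-only, space-delimited queries:

import re

def normalize(query):
    q = query.lower().strip()            # "Tree" → "tree"
    q = re.sub(r"[^a-z0-9 ]+", "", q)    # strip special characters
    q = re.sub(r"\s+", " ", q)           # collapse runs of whitespace
    return q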
Step 3: Frequency Aggregation (MapReduce)
The core aggregation job counts query frequencies over a time window. Here's the MapReduce design:
# === MapReduce Job: Query Frequency Aggregation ===
# MAPPER: Emit (query, dedup_key) for each valid search log
def mapper(log_line):
record = parse_json(log_line)
query = normalize(record["query"])
# Filter: skip if query is too short, offensive, or a bot
if len(query) < 2 or is_offensive(query) or is_bot(record["user_id"]):
return
# Deduplicate per (user, query, hour) — same user same query counts once per hour
dedup_key = f"{record['user_id']}:{query}:{record['timestamp'][:13]}"
emit(query, dedup_key) # emit query as key
# COMBINER (local aggregation on mapper node): drop duplicate dedup keys
# early, but re-emit the keys themselves so the reducer can dedup globally
def combiner(query, dedup_keys):
    for key in set(dedup_keys):
        emit(query, key)
# REDUCER: count unique (user, query, hour) keys across all mappers
def reducer(query, dedup_keys):
    total = len(set(dedup_keys))
    if total >= MIN_FREQUENCY_THRESHOLD:  # e.g., 100
        emit(query, total)
Time-Weighted Aggregation
Queries from last week matter more than queries from last year. We use exponential decay weighting:
# Time-weighted frequency with exponential decay
#
# weight(age_in_weeks) = decay_factor ^ age_in_weeks
#
# Example with decay_factor = 0.85:
# This week: weight = 0.85^0 = 1.00
# Last week: weight = 0.85^1 = 0.85
# 2 weeks ago: weight = 0.85^2 = 0.72
# 4 weeks ago: weight = 0.85^4 = 0.52
# 8 weeks ago: weight = 0.85^8 = 0.27
#
# weighted_freq(query) = Σ count_per_week × weight(age)
def compute_weighted_frequency(query, weekly_counts):
"""weekly_counts: [(week_offset, count), ...]"""
DECAY = 0.85
total = 0.0
for week_offset, count in weekly_counts:
weight = DECAY ** week_offset
total += count * weight
return total
Trie Build Pipeline
The complete offline pipeline transforms raw search logs into a production-ready trie. This pipeline runs on a schedule — typically weekly for a full rebuild, with daily incremental updates for trending queries.
Pipeline Stages
Stage 1: Raw Logs (Kafka/S3)
↓
Stage 2: Clean & Normalize (Spark job)
↓
Stage 3: Aggregate Frequencies (MapReduce/Spark)
↓
Stage 4: Merge with Historical Data (weighted)
↓
Stage 5: Build Trie + Precompute Top-K
↓
Stage 6: Serialize & Upload to Blob Store (S3)
↓
Stage 7: Deploy to Serving Cluster (rolling update)
Stage 5: Trie Construction in Detail
# Build the optimized trie from frequency data
from typing import Dict

def build_trie(query_frequencies: Dict[str, float]) -> OptimizedTrie:
"""
Input: {"tree": 60_000_000, "trend": 55_000_000, ...}
Output: Trie with precomputed top-K at every node
"""
trie = OptimizedTrie()
# Step 1: Insert all queries
for query, freq in query_frequencies.items():
node = trie.root
for char in query:
if char not in node.children:
node.children[char] = OptimizedTrieNode()
node = node.children[char]
node.is_end_of_word = True
node.frequency = freq
# Step 2: Bottom-up pass to compute top-K at every node
trie.build_top_k(trie.root, "")
return trie
# Build top-K bottom-up (post-order DFS)
def build_top_k(node, prefix, K=5):
candidates = []
if node.is_end_of_word:
candidates.append((prefix, node.frequency))
for char, child in sorted(node.children.items()):
build_top_k(child, prefix + char, K)
candidates.extend(child.top_k)
# Keep only top K by frequency
candidates.sort(key=lambda x: -x[1])
node.top_k = candidates[:K]
Stage 6–7: Deployment Strategy
The new trie is serialized and uploaded to a blob store (S3). Serving nodes perform a rolling update:
- A new trie version is uploaded to S3 with a version identifier
- A control plane signals serving nodes to download the new version
- Each serving node downloads the trie in the background into a shadow buffer
- Once loaded, the node atomically swaps the trie pointer (old → new)
- No downtime: the old trie serves traffic until the swap completes
# Atomic trie swap on serving node
class AutocompleteServer:
def __init__(self):
self.current_trie = None # Active trie
self.loading_trie = None # Being loaded in background
def load_new_trie(self, s3_path):
"""Background thread loads new trie."""
self.loading_trie = deserialize_trie(download(s3_path))
        # Atomic swap: a single reference assignment (atomic under CPython's GIL)
        old = self.current_trie
        self.current_trie = self.loading_trie
        self.loading_trie = None
        del old  # drop the last reference so the old trie can be garbage-collected
def query(self, prefix):
"""Reads from current_trie — never blocked by loading."""
return self.current_trie.autocomplete(prefix)
Caching Strategy
Caching is the single most important optimization in the autocomplete system. Prefix queries are extraordinarily cacheable: millions of users type the same prefixes (especially short ones like "wh", "ho", "the").
Layer 1: Browser Cache (Most Important)
Use aggressive HTTP caching headers to avoid even hitting the network for repeated prefixes:
# HTTP Response Headers for autocomplete API
Cache-Control: public, max-age=3600 # Cache for 1 hour
ETag: "trie-v2026041500" # Version-based ETag
Vary: Accept-Encoding
# Example API request and response:
# GET /api/autocomplete?q=tre
# Response:
# {
# "suggestions": [
# {"text": "tree", "score": 60000000},
# {"text": "trend", "score": 55000000},
# {"text": "trek", "score": 18000000},
# {"text": "treasure", "score": 12000000},
# {"text": "tremendous", "score": 8000000}
# ]
# }
With a 1-hour TTL, a user who types "tre" and later searches something else starting with "tre" gets an instant cache hit. Even across page refreshes, the browser cache serves results with effectively zero latency.
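A sketch of a handler that emits these headers, using Flask for illustration; the server object (exposing query(prefix)) is a stand-in for the serving layer described below, and the version handling is an assumption:

from flask import Flask, jsonify, request

app = Flask(__name__)
TRIE_VERSION = "2026041500"  # set when the trie is loaded

@app.route("/api/autocomplete")
def autocomplete_handler():
    prefix = request.args.get("q", "")
    resp = jsonify({"suggestions": server.query(prefix)})
    resp.headers["Cache-Control"] = "public, max-age=3600"
    resp.headers["ETag"] = f'"trie-v{TRIE_VERSION}"'
    return resp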
Layer 2: CDN Cache
Place a CDN (CloudFront, Fastly) in front of the autocomplete API. Short prefixes (1–3 characters) have extremely high cache hit rates because they're shared by virtually all users:
| Prefix Length | Unique Prefixes | CDN Hit Rate |
|---|---|---|
| 1 character | 26 | ~99.9% |
| 2 characters | ~676 | ~99% |
| 3 characters | ~17,576 | ~95% |
| 4 characters | ~100K active | ~80% |
| 5+ characters | ~1M+ active | ~50-70% |
Layer 3: Application-Level Cache
On the serving nodes themselves, an in-memory LRU cache stores recently-queried prefixes to avoid repeated trie traversal:
from cachetools import LRUCache  # pip install cachetools

class CachedAutocompleteServer:
    def __init__(self, trie):
        self.trie = trie
        self.cache = LRUCache(maxsize=1_000_000)  # ~1M prefix entries

    def query(self, prefix):
        # L1: Check application cache
        cached = self.cache.get(prefix)
        if cached is not None:
            return cached
        # L2: Traverse trie and cache the result
        result = self.trie.autocomplete(prefix)
        self.cache[prefix] = result
        return result
Client-Side Optimization: Debouncing & Prefetching
The client (browser/mobile) further reduces requests:
- Debounce: Wait 50–100ms after the last keystroke before sending the request. Fast typists generate fewer requests.
- Local prefix filtering: If the user types "tr" and receives ["tree", "trend", "trek", "treasure", "tremendous"], the client can filter these locally when the user types "tre" — no network request needed. (This is a heuristic: a query that ranks in the top 5 for "tre" but not for "tr" would be missed.)
- Prefetch: When results for "tr" arrive, speculatively prefetch "tra", "tre", "tri" etc. in the background.
// Client-side: local filtering from cached results
class AutocompleteClient {
constructor() {
this.cache = new Map(); // prefix → suggestions
}
async getSuggestions(prefix) {
// Try to filter from a shorter cached prefix
for (let i = prefix.length - 1; i >= 1; i--) {
const shorter = prefix.substring(0, i);
const cached = this.cache.get(shorter);
if (cached) {
const filtered = cached.filter(s =>
s.text.startsWith(prefix)
);
if (filtered.length >= 5) {
return filtered.slice(0, 5); // No network call!
}
}
}
    // Cache miss — fetch from server (encode the prefix for the URL)
    const response = await fetch(`/api/autocomplete?q=${encodeURIComponent(prefix)}`);
const data = await response.json();
this.cache.set(prefix, data.suggestions);
return data.suggestions.slice(0, 5);
}
}
Scaling the Serving Layer
At 10 billion queries/day, the serving infrastructure must be horizontally scalable. The primary strategy is trie sharding combined with read replication.
Sharding Strategy: By First Character(s)
Partition the trie by the first one or two characters of the prefix:
# Shard by first character (26 shards for a-z)
Shard 0 (a): handles prefixes "a", "ab", "abc", ...
Shard 1 (b): handles prefixes "b", "ba", "bar", ...
...
Shard 25 (z): handles prefixes "z", "za", "zoo", ...
# Problem: Shards are very unbalanced!
# "t" shard handles: the, this, that, to, time, they, than, ...
# "x" shard handles: xbox, xray, xml, ... (tiny)
# Better: Shard by first TWO characters (676 shards)
Shard "th": handles "th", "the", "this", "that", "then", "there", ...
Shard "tr": handles "tr", "tree", "trend", "travel", ...
# Even better: Consistent hashing on 2-char prefixes → N shards
# spreads prefixes evenly across machines. Note it balances key count,
# not load: hot prefixes still need extra replicas (next section)
Dealing with Hot Shards
Even with 2-character sharding, some shards are much hotter. The prefix "th" generates vastly more queries than "zx". Solutions:
- Unequal shard sizing: Hot prefixes get their own shard; cold prefixes are grouped (e.g., "xa" through "xz" share one shard)
- Read replicas: Hot shards get more replicas. The "th" shard might have 10 replicas while "zx" has 2
- Load balancer routes based on prefix: A routing table maps prefix → shard → replica set
# Shard routing table
{
"th": {"shard_id": 42, "replicas": ["node-42a", "node-42b", ..., "node-42j"]},
"tr": {"shard_id": 43, "replicas": ["node-43a", "node-43b", "node-43c"]},
"zx": {"shard_id": 200, "replicas": ["node-200a", "node-200b"]},
...
}
# Each serving node holds ONE shard of the trie in memory
# Load balancer routes requests based on prefix → shard → round-robin replica
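A sketch of the router itself, with round-robin across a shard's replicas (table entries and node names are illustrative):

import itertools

ROUTING_TABLE = {
    "th": {"shard_id": 42, "replicas": ["node-42a", "node-42b", "node-42j"]},
    "tr": {"shard_id": 43, "replicas": ["node-43a", "node-43b", "node-43c"]},
    "zx": {"shard_id": 200, "replicas": ["node-200a", "node-200b"]},
}
# One round-robin iterator per prefix bucket
_rr = {p: itertools.cycle(e["replicas"]) for p, e in ROUTING_TABLE.items()}

def route(prefix):
    rr = _rr.get(prefix[:2])
    if rr is None:
        return "node-cold"  # catch-all shard for unlisted (cold) prefixes
    return next(rr)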
Replication for Reads
Since the trie is read-only (it's rebuilt offline and swapped atomically), replication is straightforward:
- The trie build pipeline produces one serialized trie per shard
- Each shard blob is uploaded to S3
- All replicas for that shard download and load the new blob
- No write conflicts, no consensus protocols, no leader election
Filtering Offensive Suggestions
An autocomplete system that suggests offensive, dangerous, or inappropriate content is a reputational catastrophe. Filtering must be multi-layered:
Layer 1: Blocklist (Offline)
Maintain a curated blocklist of terms that must never appear in suggestions. This list is applied during the trie build pipeline:
BLOCKLIST = load_blocklist("s3://safety/blocklist.txt")              # ~100K exact terms
BLOCKLIST_SUBSTRINGS = load_blocklist("s3://safety/substrings.txt")  # terms matched anywhere (illustrative path)
def is_safe(query):
"""Check query against blocklist and pattern rules."""
normalized = query.lower().strip()
# Exact match
if normalized in BLOCKLIST:
return False
# Substring match (catch variations)
for blocked_term in BLOCKLIST_SUBSTRINGS:
if blocked_term in normalized:
return False
return True
# Applied during trie construction:
for query, freq in query_frequencies.items():
if is_safe(query):
trie.insert(query, freq)
Layer 2: ML Classifier (Offline)
A trained classifier scores queries for toxicity, NSFW content, and violence. Queries scoring above a threshold are excluded:
# During the aggregation pipeline
toxicity_model = load_model("toxicity-classifier-v3")
for query, freq in query_frequencies.items():
    score = toxicity_model.predict(query)
    # Check the stricter threshold first; otherwise block() is unreachable
    if score > HARD_BLOCK_THRESHOLD:     # e.g., 0.95: exclude outright
        block(query)
    elif score > TOXICITY_THRESHOLD:     # e.g., 0.7: queue for human review
        mark_for_review(query)
Layer 3: Real-Time Override (Online)
For emergencies (e.g., a breaking news event makes a previously-safe query offensive), a real-time override service can instantly suppress specific suggestions without waiting for a trie rebuild:
class AutocompleteWithOverrides:
def query(self, prefix):
suggestions = self.trie.autocomplete(prefix)
# Apply real-time overrides (loaded from a fast config service)
blocked = self.override_service.get_blocked_queries()
filtered = [s for s in suggestions if s.text not in blocked]
return filtered[:5]
Trending Queries
A purely frequency-based system would always show the same historical top queries. But when a major event happens (e.g., a new product launch, breaking news), users want to see those emerging queries immediately.
Detecting Trends
A query is "trending" if its recent frequency significantly exceeds its historical baseline:
# Trend detection formula
trending_score(query) = recent_freq / (historical_avg_freq + epsilon)
# Example:
# "eclipse 2026": historical avg = 500/day, today = 50,000/day
# trending_score = 50000 / (500 + 1) = 99.8 (very trending!)
#
# "weather today": historical avg = 200,000/day, today = 220,000/day
# trending_score = 220000 / (200000 + 1) = 1.1 (not trending)
Architecture for Trending
┌─────────────────┐ ┌──────────────────────┐
│ Real-time log │────▶│ Stream processor │
│ (Kafka) │ │ (Flink / Spark │
└─────────────────┘ │ Streaming) │
└──────────┬───────────┘
│ trending queries
▼
┌──────────────────────┐
│ Trending cache │
│ (Redis / Memcached) │
└──────────┬───────────┘
│
┌──────────────┴──────────────┐
│ Autocomplete serving node │
│ trie.query() + trending_mix │
└─────────────────────────────┘
The serving node merges results from the precomputed trie and the trending cache:
def query_with_trending(self, prefix, k=5):
# Get standard top-K from trie
trie_results = self.trie.autocomplete(prefix)
# Get trending queries matching this prefix
trending = self.trending_cache.get_matching(prefix)
# Merge: trending queries get a boost
TRENDING_BOOST = 2.0
merged = []
for q, score in trie_results:
merged.append((q, score))
for q, trend_score in trending:
if q.startswith(prefix):
merged.append((q, trend_score * TRENDING_BOOST))
# Deduplicate and sort
seen = set()
unique = []
for q, score in sorted(merged, key=lambda x: -x[1]):
if q not in seen:
seen.add(q)
unique.append((q, score))
return unique[:k]
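One way to back get_matching() is a Redis sorted set per 2-character prefix bucket, written by the stream processor; the trend: key naming is an assumption:

import redis

r = redis.Redis()

def add_trending(query, score):
    # Stream processor writes: bucket trending queries by 2-char prefix
    r.zadd("trend:" + query[:2], {query: score})

def get_matching(prefix, limit=20):
    # Fetch top candidates for the bucket, then filter by the full prefix
    candidates = r.zrevrange("trend:" + prefix[:2], 0, limit - 1, withscores=True)
    return [(q.decode(), s) for q, s in candidates if q.decode().startswith(prefix)]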
Personalization
Different users searching the same prefix should see different suggestions based on their search history. A developer typing "py" probably wants "python" while a finance analyst wants "pypl stock".
User Profile Model
# Per-user search history (stored in user profile service)
{
"user_id": "u_abc123",
"recent_queries": [
{"query": "python async await", "timestamp": "2026-04-15T14:00:00Z"},
{"query": "python fastapi tutorial", "timestamp": "2026-04-15T13:30:00Z"},
{"query": "python type hints", "timestamp": "2026-04-14T10:00:00Z"}
],
"topic_affinities": {
"programming": 0.85,
"python": 0.72,
"finance": 0.05,
"sports": 0.15
}
}
Personalized Ranking
def personalized_query(self, prefix, user_id, k=5):
# Get global top-K
global_results = self.trie.autocomplete(prefix)
    # Get user's recent queries matching this prefix
    # (assume get_recent_queries returns (query, count) pairs)
    user_history = self.user_service.get_recent_queries(user_id)
personal = [(q, freq) for q, freq in user_history
if q.startswith(prefix)]
# Combine: user history gets a significant boost
PERSONAL_BOOST = 5.0
combined = {}
for q, score in global_results:
combined[q] = score
for q, freq in personal:
combined[q] = combined.get(q, 0) + freq * PERSONAL_BOOST
sorted_results = sorted(combined.items(), key=lambda x: -x[1])
return sorted_results[:k]
Multi-Language Support
Supporting multiple languages introduces challenges beyond just character encoding:
Character Set Differences
- English: 26 letters — trie nodes have max 26 children
- Chinese: tens of thousands of characters (a few thousand in common use) — trie branching factor is enormous
- Japanese: Mix of hiragana, katakana, kanji — multiple input methods
- Arabic/Hebrew: Right-to-left scripts with complex ligatures
Architecture for Multi-Language
# Separate trie per language/locale
tries = {
"en": load_trie("s3://tries/en/latest.bin"),
"es": load_trie("s3://tries/es/latest.bin"),
"zh": load_trie("s3://tries/zh/latest.bin"),
"ja": load_trie("s3://tries/ja/latest.bin"),
# ...
}
def query(prefix, locale="en"):
trie = tries.get(locale, tries["en"])
return trie.autocomplete(prefix)
For CJK languages (Chinese, Japanese, Korean), the trie approach is less effective because words aren't space-delimited. Alternative approaches include:
- n-gram indexing: Index all substrings of length n (see the sketch after this list)
- Pinyin-based trie: For Chinese, build a trie on pinyin (romanized) input and map to Chinese characters
- Inverted index with prefix matching: More flexible for ideographic scripts
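A toy sketch of the n-gram option: index every length-n substring so a query fragment can match anywhere in a stored query, not just at the start (purely illustrative):

from collections import defaultdict

def build_ngram_index(queries, n=2):
    # Map each length-n substring to the set of queries containing it
    index = defaultdict(set)
    for q in queries:
        for i in range(len(q) - n + 1):
            index[q[i:i + n]].add(q)
    return index

def ngram_lookup(index, fragment, n=2):
    # Intersect the candidate sets of each n-gram, then verify the match
    grams = [fragment[i:i + n] for i in range(len(fragment) - n + 1)]
    if not grams:
        return set()
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return {q for q in candidates if fragment in q}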
API Design
# REST API
GET /v1/autocomplete?q={prefix}&locale={locale}&limit={k}
# Request:
GET /v1/autocomplete?q=tre&locale=en&limit=5
# Response:
{
"prefix": "tre",
"suggestions": [
{"text": "tree", "score": 60000000},
{"text": "trend", "score": 55000000},
{"text": "trek", "score": 18000000},
{"text": "treasure", "score": 12000000},
{"text": "tremendous", "score": 8000000}
],
"cached": true,
"version": "2026041500"
}
# Headers:
# Cache-Control: public, max-age=3600
# X-Trie-Version: 2026041500
Error Handling & Graceful Degradation
- If the trie service is down, return an empty suggestions list — never block the search bar (see the sketch after this list)
- If latency exceeds 100ms, return whatever results are available (even partial)
- Rate limit per IP/user to prevent abuse (but generous — autocomplete is expected to be fast & frequent)
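A sketch of the fail-open wrapper, assuming the serving object from earlier sections; the 100 ms budget mirrors the latency requirement:

from concurrent.futures import ThreadPoolExecutor, TimeoutError

_pool = ThreadPoolExecutor(max_workers=32)

def safe_autocomplete(server, prefix, timeout_s=0.1):
    # Never block the search bar: empty suggestions on error or timeout
    try:
        future = _pool.submit(server.query, prefix)
        return future.result(timeout=timeout_s)
    except TimeoutError:
        return []   # partial/empty beats blocking the UI
    except Exception:
        return []   # trie service down → degrade gracefully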
Complete System Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│ CLIENT (Browser/App) │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────────────┐ │
│ │ Debounce │→│ Local │→│ HTTP Request │ │
│ │ (50ms) │ │ Filter │ │ GET /autocomplete?q=tre │ │
│ └──────────┘ └──────────┘ └────────────┬─────────────┘ │
└───────────────────────────────────────────┼───────────────────────────┘
│
┌───────────────▼───────────────┐
│ CDN │
│ (CloudFront / Fastly) │
│ Cache hit rate: ~85% │
└───────────────┬───────────────┘
│ cache miss
┌───────────────▼───────────────┐
│ Load Balancer │
│ (prefix → shard routing) │
└───────────────┬───────────────┘
│
┌───────────────┬───────┴───────┬───────────────┐
▼ ▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Shard: │ │ Shard: │ │ Shard: │ │ Shard: │
│ "a"-"f" │ │ "g"-"m" │ │ "n"-"s" │ │ "t"-"z" │
│ (×3 repl)│ │ (×3 repl)│ │ (×3 repl)│ │ (×5 repl)│
└──────────┘ └──────────┘ └──────────┘ └──────────┘
│ │ │ │
└───────────────┴───────┬───────┴───────────────┘
│
┌───────────────▼───────────────┐
│ Trending Cache │
│ (Redis Cluster) │
└───────────────────────────────┘
═══════════════════════ OFFLINE PIPELINE ═══════════════════════════════
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Search │────▶│ Spark │────▶│ Trie │────▶│ S3 │
│ Logs │ │ Aggregate│ │ Builder │ │ (deploy) │
│ (Kafka) │ │ Job │ │ │ │ │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
Interview Tips
Key Points to Emphasize
- Trie + precomputed top-K is the core insight — O(L) query time vs O(n) for naive
- Read/write separation: The serving layer is read-only; updates happen offline via a batch pipeline
- Multi-layer caching (browser → CDN → application → trie) reduces effective QPS by 85–95%
- Sharding by prefix allows horizontal scaling; hot prefixes get more replicas
- Safety filtering is non-negotiable — always mention it proactively
- Trending injection adds freshness without compromising the batch pipeline's simplicity
Common Follow-Up Questions
| Question | Answer Direction |
|---|---|
| How do you handle typos? | Spell correction layer before trie lookup; or store common misspellings as aliases |
| What if the trie doesn't fit in memory? | Shard the trie across machines; or use a Patricia trie / succinct representation |
| How fresh are suggestions? | Full rebuild weekly; trending layer adds hourly freshness; real-time overrides for emergencies |
| What about mobile? | Same API; client-side debouncing is more aggressive on mobile (200ms) to save battery/bandwidth |
| How do you test this system? | A/B test suggestion quality via click-through rate on suggestions; shadow traffic for new trie versions |
Bonus: Space Estimation for Production Trie
Let's estimate the memory footprint of a production trie:
# Assumptions:
# - 1 billion unique queries
# - Average query length: 20 characters
# - Patricia trie compression: ~60% node reduction
# - Each node: 50 bytes (char + pointers + top-5 array)
Total characters: 1B × 20 = 20B characters
Trie nodes (naive): ~5B (shared prefixes reduce total)
Patricia nodes: ~2B (after compression)
Memory per node: ~50 bytes
- edge_label: 8 bytes (pointer to string)
- children map: 16 bytes (small map avg 2 entries)
- is_end + freq: 5 bytes
- top_k array: 5 × 4 = 20 bytes (5 pointers)
- overhead: ~1 byte
Total memory: 2B × 50 bytes = ~100 GB
Per shard (26 shards): ~4 GB — fits in a single machine!
Per shard (676 shards): ~150 MB — very comfortable
With 676 two-character shards, each shard holds ~150 MB. A machine with 64 GB RAM can easily hold multiple shards plus the application cache.