Search Engines (Elasticsearch)
The Full-Text Search Problem
Relational databases are built for structured data — rows and columns, exact matches, range filters. But the moment you need to answer questions like "find all products matching wireless noise-cancelling headphones," traditional WHERE name LIKE '%wireless%' queries collapse under their own weight.
Why SQL LIKE Fails at Scale
Consider a product catalog with 50 million documents. A LIKE '%wireless%' query forces a full table scan — every single row is inspected character by character. The problems compound rapidly:
- Performance: A full table scan on 50M rows with average 2KB documents means reading ~100 GB of data. Even with NVMe SSDs at 3 GB/s sequential reads, that's 30+ seconds per query — unacceptable for user-facing search.
- No relevance ranking: LIKE returns Boolean results (match or no match). It can't tell you that "Sony WH-1000XM5 Wireless Noise-Cancelling" is a better match than "Wireless Mouse Pad" for the query "wireless noise cancelling."
- No linguistic awareness: Searching for "running" won't find documents containing "ran" or "runner." Searching for "café" won't match "cafe." Users expect intelligence.
- No typo tolerance: Users misspell. "elaticsearch" should still find "Elasticsearch." SQL offers no fuzzy matching out of the box.
- No tokenization: LIKE '%noise cancelling%' requires an exact substring match. "cancelling noise" or "noise-canceling" (American spelling) won't match.
What Modern Users Expect from Search
Every user has been trained by Google. They expect:
- Sub-100ms response times — even across billions of documents
- Relevance ranking — the best result first, not just any result
- Typo tolerance — "elsticsearch" → "Elasticsearch"
- Synonym awareness — "laptop" matches "notebook computer"
- Autocomplete/suggestions — results appear as you type
- Faceted filtering — narrow by category, price range, brand
- Highlighting — matched terms are visually emphasized in results
Meeting these expectations at scale is the domain of dedicated search engines — and Elasticsearch is the most widely deployed.
The Inverted Index
The inverted index is the foundational data structure behind every search engine. It's conceptually simple but extraordinarily powerful — a mapping from every unique term in a corpus to the list of documents that contain that term.
Forward Index vs Inverted Index
A forward index maps documents to terms (what a database row looks like). An inverted index flips the relationship — it maps terms to documents:
Forward Index (what a database stores):
Doc 1 → ["elasticsearch", "is", "a", "search", "engine"]
Doc 2 → ["search", "engines", "use", "inverted", "indexes"]
Doc 3 → ["elasticsearch", "supports", "full", "text", "search"]
Inverted Index (what a search engine builds):
"elasticsearch" → [Doc 1, Doc 3]
"search" → [Doc 1, Doc 2, Doc 3]
"engine" → [Doc 1]
"engines" → [Doc 2]
"inverted" → [Doc 2]
"indexes" → [Doc 2]
"supports" → [Doc 3]
"full" → [Doc 3]
"text" → [Doc 3]
...
Now, to find all documents containing "search," the engine performs a single dictionary lookup — O(1) instead of scanning every document. For a query like "elasticsearch search," it computes the intersection of [Doc 1, Doc 3] and [Doc 1, Doc 2, Doc 3] = [Doc 1, Doc 3].
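To make the mechanics concrete, here is a toy Python sketch — illustrative only, nothing like Lucene's real on-disk structures — that builds an inverted index from the three documents above and answers a multi-term query by intersecting posting lists:

# Toy inverted index: term -> set of doc IDs. Illustrative only.
from collections import defaultdict

docs = {
    1: "elasticsearch is a search engine",
    2: "search engines use inverted indexes",
    3: "elasticsearch supports full text search",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():          # naive tokenizer: whitespace only, text already lowercase
        inverted[term].add(doc_id)

def search_all(*terms):
    """Doc IDs containing every query term (AND semantics)."""
    postings = [inverted.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search_all("search"))                   # [1, 2, 3]
print(search_all("elasticsearch", "search"))  # [1, 3]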
The Tokenization Pipeline
Before a document enters the inverted index, it passes through a multi-stage analysis pipeline:
Raw text: "Elasticsearch is a Powerful Search-Engine!"
│
┌─────────▼──────────┐
│ Character Filters │ Strip HTML, normalize Unicode
└─────────┬──────────┘
│
"elasticsearch is a powerful search-engine!"
│
┌─────────▼──────────┐
│ Tokenizer │ Split into tokens
└─────────┬──────────┘
│
["elasticsearch", "is", "a", "powerful", "search", "engine"]
│
┌─────────▼──────────┐
│ Token Filters │ Lowercase, stemming, stop words
└─────────┬──────────┘
│
["elasticsearch", "powerful", "search", "engine"]
│
┌─────────▼──────────┐
│ Inverted Index │ Store term → doc_id mapping
└────────────────────┘
Each stage serves a specific purpose:
- Character filters — Transform raw characters before tokenization: HTML stripping (<b>bold</b> → bold), pattern replacement, Unicode normalization.
- Tokenizer — Splits text into individual tokens. The standard tokenizer splits on whitespace and punctuation, so "search-engine" becomes ["search", "engine"].
- Token filters — Transform individual tokens: lowercasing, stemming ("running" → "run"), stop word removal ("is", "a", "the"), synonym expansion. A toy version of the whole pipeline is sketched below.
Posting Lists and Term Frequency
Each entry in the inverted index doesn't just store a document ID — it stores a posting list with rich metadata:
Term: "search"
Posting List:
┌──────────────────────────────────────────────────────────┐
│ Doc ID │ Term Freq │ Positions │ Offsets (start:end) │
├────────┼───────────┼────────────┼────────────────────────┤
│ Doc 1 │ 1 │ [3] │ [(15:21)] │
│ Doc 2 │ 1 │ [0] │ [(0:6)] │
│ Doc 3 │ 2 │ [4, 7] │ [(25:31), (42:48)] │
└──────────────────────────────────────────────────────────┘
- Term Frequency (TF): How many times the term appears in this document. More occurrences suggest higher relevance.
- Positions: The word positions within the document — essential for phrase queries like "full text search", which require the terms to appear consecutively (see the sketch below).
- Offsets: Character positions in the original text — used for highlighting matched terms in search results.
▶ Interactive demo: Inverted Index Build — step through indexing 3 documents, building the inverted index, then querying it.
Lucene Segments
Elasticsearch is built on top of Apache Lucene, the battle-tested search library that also powers Solr. Understanding Lucene's segment architecture is key to understanding Elasticsearch's performance characteristics.
Immutable Segment Architecture
When documents are indexed, Lucene doesn't modify existing data structures. Instead, it writes new immutable segments:
Lucene Index Structure:
┌─────────────────────────────────────────────────────┐
│ Lucene Index │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ Segment 0 │ │ Segment 1 │ │ Segment 2 │ │
│ │ (50K docs) │ │ (30K docs) │ │ (8K docs) │ │
│ │ committed │ │ committed │ │ committed │ │
│ └──────────────┘ └──────────────┘ └───────────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ In-Memory Buffer │ │
│ │ (new docs not yet committed) │ │
│ └──────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Transaction Log (Translog) │ │
│ │ (durability for in-memory buffer) │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Segment Lifecycle
- In-memory buffer: New documents are first written to an in-memory buffer and the transaction log (translog) for durability.
- Refresh (every 1 second by default): The in-memory buffer is written to a new Lucene segment in the filesystem cache. The segment is now searchable but not yet fsynced to disk. This is why Elasticsearch is called "near real-time" — there's a ~1 second delay between indexing and searchability.
- Flush (every 30 minutes or when translog is full): Segments in the filesystem cache are fsynced to disk, and the translog is cleared. Data is now durable.
- Merge: Background process that combines smaller segments into larger ones, reclaiming space from deleted documents.
Why Immutability?
| Benefit | Explanation |
|---|---|
| No locking | Immutable segments can be read by multiple threads concurrently without locks — critical for search throughput. |
| Caching | The OS filesystem cache can aggressively cache segment files since they never change. |
| Compression | Known-at-write-time data can be optimally compressed (delta encoding for doc IDs, variable-byte encoding). |
| Predictable I/O | Sequential writes to new segments rather than random updates to existing files. |
Segment Merging
Over time, many small segments accumulate. Each search must check every segment, so too many segments hurt performance. Lucene runs a merge policy in the background:
Before merge: [Seg0: 100K] [Seg1: 50K] [Seg2: 50K] [Seg3: 20K] [Seg4: 10K] [Seg5: 5K]
After merge: [Seg0: 100K] [Seg1_2: 100K] [Seg3_4_5: 35K]
Merge process:
1. Select candidate segments (tiered merge policy picks similar-sized segments)
2. Read all documents from selected segments
3. Skip documents marked as deleted
4. Write a new, larger segment
5. Atomically swap old segments for the new one
6. Delete old segment files
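A toy sketch of the core idea — copy live documents out of old segments, skip the tombstoned ones, write a fresh segment — under the obvious simplification that the real tiered policy also handles size-based selection, I/O throttling, and the atomic swap:

# Toy segment merge: combine immutable segments, dropping documents marked deleted.
segments = [
    {"docs": {1: "doc one", 2: "doc two", 3: "doc three"}, "deleted": {2}},
    {"docs": {4: "doc four", 5: "doc five"},               "deleted": set()},
]

def merge(segs):
    merged = {}
    for seg in segs:
        for doc_id, source in seg["docs"].items():
            if doc_id not in seg["deleted"]:    # space from deleted docs is reclaimed here
                merged[doc_id] = source
    return {"docs": merged, "deleted": set()}   # old segment files are removed afterwards

print(sorted(merge(segments)["docs"]))  # [1, 3, 4, 5] -- doc 2 is gone for good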
POST /my-index/_forcemerge?max_num_segments=1 merges everything into one segment. This is extremely expensive on large indices and blocks further merging. Only use on read-only indices (like time-based log indices that have been rolled over).
Elasticsearch Architecture
Elasticsearch wraps Lucene with a distributed layer that handles clustering, replication, routing, and coordination. Understanding this architecture is essential for capacity planning and troubleshooting.
Node Types
| Node Type | Role | Typical Config |
|---|---|---|
| Master-eligible | Manages cluster state (index creation, shard allocation, node tracking). The elected master performs lightweight coordination — no data queries. | 3 dedicated nodes (for quorum). Low CPU/RAM, high availability. |
| Data | Stores shards and executes search/index operations. The workhorses of the cluster. | Scale based on data volume. High RAM (≥64 GB), fast SSDs, moderate CPU. |
| Coordinating | Routes requests, scatters queries to data nodes, gathers and merges results. Acts as a smart load balancer. | Moderate RAM for sorting/aggregation. CPU for merge operations. |
| Ingest | Pre-processes documents before indexing (pipelines: grok parsing, GeoIP enrichment, date parsing). | CPU-heavy for transformation. Can colocate with coordinating nodes in smaller clusters. |
| Machine Learning | Runs anomaly detection and inference jobs (Elastic ML). Isolated to prevent ML workloads from impacting search. | High CPU/RAM, optional GPU. |
Shards and Replicas
An Elasticsearch index is divided into shards — each shard is a complete Lucene index. Shards provide two critical capabilities:
- Primary shards — The authoritative copy. Writes always go to the primary shard first. The number of primary shards is set at index creation and cannot be changed (without reindexing).
- Replica shards — Exact copies of primaries, distributed across different nodes. Serve read requests (search) for throughput. Provide fault tolerance if a node fails.
Index: "products" (3 primaries, 1 replica)
Node 1 Node 2 Node 3
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ P0 │ │ P1 │ │ P2 │
│ R2 │ │ R0 │ │ R1 │
└──────────────┘ └──────────────┘ └──────────────┘
P0, P1, P2 = Primary shards (writes go here first)
R0, R1, R2 = Replica shards (copies, serve reads)
If Node 2 fails:
- P1 is lost, but R1 on Node 3 is promoted to primary
- R0 on Node 2 is lost, ES allocates a new R0 on Node 1 or 3
- Zero data loss, zero downtime
Shard Sizing Guidelines
Getting shard count right is one of the most impactful performance decisions:
| Guideline | Recommendation |
|---|---|
| Shard size | Target 10–50 GB per shard. Under 10 GB wastes overhead; over 50 GB makes recovery slow. |
| Shards per node | Keep below 20 shards per GB of heap. A 30 GB heap node should have <600 shards total. |
| Shard count formula | num_primary_shards = ceil(expected_data_GB / 30) — a reasonable starting point. |
| Over-sharding | Each shard has fixed overhead (~10 MB heap). 1,000 tiny shards waste ~10 GB of heap for cluster state alone. |
// Example: E-commerce product catalog
// 200 GB of product data, moderate search load
PUT /products
{
"settings": {
"number_of_shards": 7, // ~28 GB per shard
"number_of_replicas": 1, // 1 replica = 14 total shards
"refresh_interval": "5s", // slightly relaxed for write perf
"codec": "best_compression" // zstd compression for cold data
}
}
Write Path: How Documents Get Indexed
The journey of a document from client to searchable state:
- Client sends index request to any node (the coordinating node for this request).
- Routing: The coordinating node computes shard = hash(_routing) % number_of_primary_shards (a sketch of this follows the list). The default _routing value is the document ID.
- Forward to primary: The request is forwarded to the node holding the target primary shard.
- Primary indexes: The document is written to the in-memory buffer and the translog.
- Replicate: The primary forwards the operation to all replica shards in parallel.
- Acknowledge: Once the primary and all in-sync replicas confirm, the coordinating node returns success to the client.
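Routing is deterministic, which is exactly why the primary shard count can't change after index creation. A minimal sketch of the idea, with MD5 standing in for the Murmur3 hash Elasticsearch actually uses:

# Illustrative document routing: the same routing value always lands on the same shard.
import hashlib

NUM_PRIMARY_SHARDS = 3

def route(routing_value: str) -> int:
    digest = hashlib.md5(routing_value.encode()).hexdigest()  # stand-in for murmur3
    return int(digest, 16) % NUM_PRIMARY_SHARDS

for doc_id in ["product-1", "product-2", "product-3"]:        # hypothetical document IDs
    print(doc_id, "-> shard", route(doc_id))

# Changing NUM_PRIMARY_SHARDS would re-route existing documents to different shards,
# which is why the primary shard count is fixed at index creation (reindex to change it).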
Search Path: Scatter-Gather
Search uses a two-phase scatter-gather pattern:
Phase 1: QUERY (Scatter)
Client → Coordinating Node
│
├─→ Shard 0 (or its replica) → returns top N doc IDs + scores
├─→ Shard 1 (or its replica) → returns top N doc IDs + scores
└─→ Shard 2 (or its replica) → returns top N doc IDs + scores
Phase 2: FETCH (Gather)
Coordinating Node merges results, picks global top N
│
├─→ Shard X → return full doc for doc_id 42
├─→ Shard Y → return full doc for doc_id 17
└─→ Shard Z → return full doc for doc_id 91
│
└─→ Client receives final results
This two-phase approach avoids transferring full documents during the query phase — only lightweight IDs and scores cross the network until the final set is determined.
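A toy sketch of the query-phase merge — each shard contributes its local top-N of (doc ID, score) pairs, and the coordinating node keeps only the global top-N before fetching full documents (the scores here are made up for illustration):

# Toy scatter-gather merge performed by the coordinating node.
import heapq

shard_results = [
    [(42, 9.1), (17, 7.4), (8, 5.0)],    # shard 0: local top 3 as (doc_id, score)
    [(91, 8.2), (55, 6.9), (3, 2.1)],    # shard 1
    [(77, 7.9), (12, 4.4), (60, 3.3)],   # shard 2
]

def global_top_n(results, n=3):
    merged = [hit for shard in results for hit in shard]
    return heapq.nlargest(n, merged, key=lambda hit: hit[1])

print(global_top_n(shard_results))
# [(42, 9.1), (91, 8.2), (77, 7.9)] -> only these doc IDs are fetched in phase 2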
▶ Interactive demo: Elasticsearch Cluster — see write routing and search scatter-gather across 3 nodes with 3 primary shards + 1 replica each.
Near Real-Time Search
Elasticsearch is often described as "near real-time" (NRT). The gap between indexing a document and being able to search it is controlled by the refresh interval — 1 second by default.
The Refresh Interval
Timeline of a document's journey to searchability:
t=0.000s Document indexed (in-memory buffer + translog)
t=0.000s Document is NOT searchable
...
t=1.000s Refresh fires: buffer → new Lucene segment (filesystem cache)
t=1.001s Document is NOW searchable
...
t=1800s Flush: segment fsynced to disk, translog cleared
t=1800s Document is NOW durable on disk
Tuning Refresh for Different Workloads
| Scenario | refresh_interval | Rationale |
|---|---|---|
| User-facing product search | 1s (default) | New products should appear quickly. 1s latency is imperceptible. |
| Log ingestion (high throughput) | 30s | Logs don't need instant searchability. Reducing refresh from 1s to 30s can increase indexing throughput 30–50%. |
| Bulk reindexing | -1 (disabled) | Disable refresh entirely during bulk loads. Re-enable after completion. Avoids creating thousands of tiny segments. |
| Real-time alerting | 1s or 500ms | Security alerting needs minimal delay. Can push down to 500ms at the cost of more segments and merging. |
// Disable refresh during bulk indexing
PUT /logs-2026.04/_settings
{ "index.refresh_interval": "-1" }
// Bulk index millions of documents...
POST /_bulk
{ "index": { "_index": "logs-2026.04" } }
{ "timestamp": "2026-04-15T10:30:00Z", "message": "..." }
...
// Re-enable refresh and force one
PUT /logs-2026.04/_settings
{ "index.refresh_interval": "30s" }
POST /logs-2026.04/_refresh
Relevance Scoring: TF-IDF and BM25
When multiple documents match a query, the search engine must rank them by relevance. This ranking is computed using mathematical scoring models.
TF-IDF: The Classic Model
TF-IDF (Term Frequency–Inverse Document Frequency) was the standard scoring model before Elasticsearch 5.0. It combines two intuitions:
- TF (Term Frequency): A term that appears more often in a document is more relevant to that document. If "elasticsearch" appears 5 times in Doc A and once in Doc B, Doc A is probably more about Elasticsearch.
- IDF (Inverse Document Frequency): A term that appears in fewer documents is more discriminating. The word "the" appears in every document (low IDF, not useful for ranking). The word "elasticsearch" appears in only 0.1% of documents (high IDF, very useful).
TF-IDF Formula:
score(t, d) = TF(t, d) × IDF(t)
TF(t, d) = √(frequency of term t in document d)
IDF(t) = log(total_docs / docs_containing_t) + 1
Example:
Corpus: 1,000,000 documents
Query: "elasticsearch"
Doc A: contains "elasticsearch" 5 times
Doc B: contains "elasticsearch" 1 time
"elasticsearch" appears in 1,000 documents
IDF("elasticsearch") = log(1,000,000 / 1,000) + 1 = log(1000) + 1 = 4.0
score(Doc A) = √5 × 4.0 = 2.236 × 4.0 = 8.94
score(Doc B) = √1 × 4.0 = 1.0 × 4.0 = 4.0
Doc A ranks higher ✓
BM25: The Modern Standard
Since Elasticsearch 5.0, the default scoring algorithm is BM25 (Best Matching 25). It addresses two key weaknesses of TF-IDF:
- Term frequency saturation: In TF-IDF, doubling the term frequency always increases the score. BM25 applies a saturation curve — after a certain point, additional occurrences contribute diminishing returns. A document with "search" 50 times isn't 50× more relevant than one with "search" once.
- Document length normalization: A 10,000-word document naturally contains more term occurrences than a 100-word document. BM25 normalizes by document length, so long documents aren't unfairly favored.
BM25 Formula:
score(t, d) = IDF(t) × (TF(t,d) × (k1 + 1)) / (TF(t,d) + k1 × (1 - b + b × |d|/avgdl))
Parameters:
k1 = 1.2 (term frequency saturation — higher = slower saturation)
b = 0.75 (document length normalization — 0 = none, 1 = full)
|d| = length of document d (in terms)
avgdl = average document length across the corpus
Key behavior:
- When TF is low: score grows almost linearly with TF
- When TF is high: score plateaus (saturation)
- When b = 0: no length normalization (all docs treated equally)
- When b = 1: full length normalization (long docs penalized)
Practical effect:
k1=1.2, b=0.75 work well for ~90% of use cases.
For short documents (titles, tags): lower b (0.3–0.5)
For long documents (articles, books): higher b (0.75–1.0)
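A small sketch of the per-term BM25 score (the formula above, not Elasticsearch's internal code) makes the saturation behavior visible — IDF is held constant at 4.0 to match the earlier example, and the document is assumed to be exactly average length:

# BM25 score for a single term in a single document, showing TF saturation.
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)

for tf in [1, 2, 5, 10, 50]:
    print(tf, round(bm25_term_score(tf, doc_len=100, avg_doc_len=100, idf=4.0), 2))
# tf=1 -> 4.0, tf=2 -> 5.5, tf=5 -> 7.1, tf=10 -> 7.86, tf=50 -> 8.59
# 50x the occurrences buys barely 2x the score -- that's the saturation curve.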
// Customize BM25 parameters per field
PUT /products
{
"settings": {
"similarity": {
"custom_bm25": {
"type": "BM25",
"k1": 1.5,
"b": 0.3
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"similarity": "custom_bm25"
}
}
}
}
Analyzers: Text Processing Pipeline
Analyzers define how text is processed before entering the inverted index (at index time) and how query text is processed (at search time). Getting analyzers right is often the difference between a good and great search experience.
Anatomy of an Analyzer
Analyzer = Character Filters → Tokenizer → Token Filters
Built-in Analyzers:
┌──────────────┬──────────────────┬──────────────────────────────────┐
│ Analyzer │ Tokenizer │ Token Filters │
├──────────────┼──────────────────┼──────────────────────────────────┤
│ standard │ standard │ lowercase │
│ simple │ letter │ lowercase │
│ whitespace │ whitespace │ (none) │
│ keyword │ keyword (no-op) │ (none) — entire input = 1 token │
│ english │ standard │ english_possessive_stemmer, │
│ │ │ lowercase, english_stop, │
│ │ │ english_stemmer │
└──────────────┴──────────────────┴──────────────────────────────────┘
Building a Custom Analyzer
Real-world search usually requires a custom analyzer. Here's a production-grade example for an e-commerce product search:
PUT /products
{
"settings": {
"analysis": {
"char_filter": {
"strip_html": {
"type": "html_strip"
},
"normalize_hyphens": {
"type": "pattern_replace",
"pattern": "-",
"replacement": " "
}
},
"tokenizer": {
"product_tokenizer": {
"type": "standard",
"max_token_length": 255
}
},
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"product_synonyms": {
"type": "synonym_graph",
"synonyms": [
"laptop, notebook, portable computer",
"phone, mobile, smartphone, cell phone",
"tv, television, telly",
"headphones, earphones, earbuds"
]
},
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 15
}
},
"analyzer": {
"product_analyzer": {
"type": "custom",
"char_filter": ["strip_html", "normalize_hyphens"],
"tokenizer": "product_tokenizer",
"filter": [
"lowercase",
"product_synonyms",
"english_stop",
"english_stemmer"
]
},
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "product_analyzer",
"search_analyzer": "standard",
"fields": {
"autocomplete": {
"type": "text",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "standard"
},
"keyword": {
"type": "keyword"
}
}
},
"description": {
"type": "text",
"analyzer": "product_analyzer"
},
"price": { "type": "float" },
"category": { "type": "keyword" },
"brand": { "type": "keyword" },
"created_at": { "type": "date" },
"in_stock": { "type": "boolean" }
}
}
}
Testing Analyzers with the _analyze API
// Test how text is tokenized
POST /products/_analyze
{
"analyzer": "product_analyzer",
"text": "Sony WH-1000XM5 Wireless Noise-Cancelling Headphones"
}
// Response:
{
"tokens": [
{ "token": "sony", "position": 0 },
{ "token": "wh", "position": 1 },
{ "token": "1000xm5", "position": 2 },
{ "token": "wireless", "position": 3 },
{ "token": "nois", "position": 4 }, // stemmed
{ "token": "cancel", "position": 5 }, // stemmed
{ "token": "headphon", "position": 6 }, // stemmed
{ "token": "earphon", "position": 6 }, // synonym
{ "token": "earbud", "position": 6 } // synonym
]
}
Query DSL
Elasticsearch's Query DSL (Domain Specific Language) is a powerful JSON-based query language that supports everything from simple term lookups to complex multi-clause boolean queries with boosting and scoring functions.
Core Query Types
Match Query — Full-Text Search
// The workhorse of full-text search
// Analyzes the query text, then finds matching documents
GET /products/_search
{
"query": {
"match": {
"name": {
"query": "wireless headphones",
"operator": "and", // both terms must match (default: "or")
"fuzziness": "AUTO", // typo tolerance: 0 edits for 1-2 chars,
// 1 edit for 3-5 chars, 2 edits for 6+ chars
"minimum_should_match": "75%"
}
}
}
}
Term Query — Exact Match (No Analysis)
// For keyword fields — no analysis applied to query text
// NEVER use term query on "text" fields — the indexed tokens are analyzed (lowercased, stemmed) but the term value is not, so matches silently fail
GET /products/_search
{
"query": {
"term": {
"category": {
"value": "electronics"
}
}
}
}
Bool Query — Combine Multiple Conditions
// The most powerful query type — combine must, should, must_not, filter
GET /products/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "wireless headphones"
}
}
],
"filter": [
{ "term": { "brand": "sony" } },
{ "range": { "price": { "gte": 50, "lte": 300 } } },
{ "term": { "in_stock": true } }
],
"should": [
{
"match": {
"description": {
"query": "noise cancelling",
"boost": 1.5
}
}
}
],
"must_not": [
{ "term": { "category": "refurbished" } }
],
"minimum_should_match": 1
}
}
}
// must: Required, contributes to score
// filter: Required, does NOT contribute to score (cacheable!)
// should: Optional, boosts score if matched
// must_not: Excludes documents, does NOT contribute to score
Range Query — Numeric and Date Ranges
GET /logs/_search
{
"query": {
"bool": {
"filter": [
{
"range": {
"timestamp": {
"gte": "2026-04-01T00:00:00Z",
"lt": "2026-04-15T00:00:00Z",
"format": "strict_date_optional_time"
}
}
},
{
"range": {
"response_time_ms": {
"gte": 500
}
}
}
]
}
}
}
Multi-Match Query — Search Across Multiple Fields
// Search across name, description, and brand with different weights
GET /products/_search
{
"query": {
"multi_match": {
"query": "sony noise cancelling",
"fields": ["name^3", "brand^2", "description"],
"type": "best_fields", // use the best matching field's score
"tie_breaker": 0.3, // add 30% of other fields' scores
"fuzziness": "AUTO"
}
}
}
// type options:
// best_fields — score from best matching field (default)
// most_fields — sum of scores from all matching fields
// cross_fields — treat all fields as one big field (for name = first + last)
// phrase — run a match_phrase on each field
Phrase and Proximity Queries
// Exact phrase match — terms must appear in exact order
GET /products/_search
{
"query": {
"match_phrase": {
"description": {
"query": "noise cancelling technology",
"slop": 2 // allow up to 2 words between terms
}
}
}
}
// "advanced noise cancelling technology" → matches (slop 1)
// "noise reduction and cancelling technology" → matches (slop 2)
// "technology for cancelling noise" → does NOT match (wrong order + slop)
Aggregations
Aggregations are Elasticsearch's analytics engine — they let you compute metrics, build histograms, group data, and create complex analytics on top of your search results. Think of them as SQL's GROUP BY on steroids.
Metric Aggregations
// Compute statistics over numeric fields
GET /products/_search
{
"size": 0, // we only want aggregation results, not hits
"aggs": {
"avg_price": { "avg": { "field": "price" } },
"max_price": { "max": { "field": "price" } },
"min_price": { "min": { "field": "price" } },
"price_stats": {
"extended_stats": { "field": "price" }
// Returns: count, min, max, avg, sum, variance, std_deviation
},
"price_percentiles": {
"percentiles": {
"field": "price",
"percents": [50, 75, 90, 95, 99]
}
},
"unique_brands": {
"cardinality": {
"field": "brand",
"precision_threshold": 1000 // HyperLogLog precision
}
}
}
}
Bucket Aggregations
// Group documents into buckets (like GROUP BY)
GET /products/_search
{
"size": 0,
"query": {
"match": { "name": "headphones" }
},
"aggs": {
"by_brand": {
"terms": {
"field": "brand",
"size": 20, // top 20 brands
"order": { "_count": "desc" }
},
"aggs": {
"avg_price": { "avg": { "field": "price" } },
"price_ranges": {
"range": {
"field": "price",
"ranges": [
{ "key": "budget", "to": 50 },
{ "key": "mid", "from": 50, "to": 150 },
{ "key": "premium", "from": 150, "to": 300 },
{ "key": "luxury", "from": 300 }
]
}
}
}
},
"price_histogram": {
"histogram": {
"field": "price",
"interval": 25,
"min_doc_count": 1
}
},
"by_date": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "month",
"format": "yyyy-MM"
}
}
}
}
// Response structure:
// "by_brand": {
// "buckets": [
// { "key": "sony", "doc_count": 42,
// "avg_price": { "value": 189.50 },
// "price_ranges": { "buckets": [...] }
// },
// { "key": "bose", "doc_count": 38,
// "avg_price": { "value": 215.00 }, ...
// }
// ]
// }
Pipeline Aggregations
// Compute metrics on the results of other aggregations
GET /orders/_search
{
"size": 0,
"aggs": {
"monthly_sales": {
"date_histogram": {
"field": "order_date",
"calendar_interval": "month"
},
"aggs": {
"revenue": { "sum": { "field": "total_amount" } }
}
},
"max_monthly_revenue": {
"max_bucket": {
"buckets_path": "monthly_sales>revenue"
}
},
"avg_monthly_revenue": {
"avg_bucket": {
"buckets_path": "monthly_sales>revenue"
}
},
"revenue_moving_avg": {
"moving_avg": {
"buckets_path": "monthly_sales>revenue",
"window": 3
}
}
}
}
Index Lifecycle Management (ILM)
Time-series data (logs, metrics, events) grows continuously and has different access patterns over time. Recent data is queried frequently and needs fast performance. Older data is rarely accessed but must be retained. ILM automates the transition between these phases.
ILM Phases
Hot → Warm → Cold → Frozen → Delete
┌──────────┬─────────────┬──────────────┬────────────┬──────────┐
│ Phase │ Duration │ Hardware │ Replicas │ Purpose │
├──────────┼─────────────┼──────────────┼────────────┼──────────┤
│ Hot │ 0–7 days │ Fast SSDs │ 1–2 │ Active │
│ │ │ High RAM │ │ writes │
│ │ │ │ │ & reads │
├──────────┼─────────────┼──────────────┼────────────┼──────────┤
│ Warm │ 7–30 days │ Standard SSD │ 1 │ Read- │
│ │ │ Moderate RAM │ │ only │
├──────────┼─────────────┼──────────────┼────────────┼──────────┤
│ Cold │ 30–90 days │ HDD │ 0 │ Rare │
│ │ │ Low RAM │ │ access │
├──────────┼─────────────┼──────────────┼────────────┼──────────┤
│ Frozen │ 90–365 days │ Object store │ 0 │ Archive │
│ │ │ (S3/GCS) │ │ │
├──────────┼─────────────┼──────────────┼────────────┼──────────┤
│ Delete │ After 365d │ — │ — │ Removed │
└──────────┴─────────────┴──────────────┴────────────┴──────────┘
Defining an ILM Policy
PUT _ilm/policy/logs-lifecycle
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_primary_shard_size": "50gb",
"max_age": "7d",
"max_docs": 100000000
},
"set_priority": { "priority": 100 }
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 },
"allocate": {
"require": { "data": "warm" }
},
"set_priority": { "priority": 50 }
}
},
"cold": {
"min_age": "30d",
"actions": {
"allocate": {
"number_of_replicas": 0,
"require": { "data": "cold" }
},
"set_priority": { "priority": 0 }
}
},
"frozen": {
"min_age": "90d",
"actions": {
"searchable_snapshot": {
"snapshot_repository": "my-s3-repo"
}
}
},
"delete": {
"min_age": "365d",
"actions": {
"delete": {}
}
}
}
}
}
// Apply the policy to an index template
PUT _index_template/logs-template
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"index.lifecycle.name": "logs-lifecycle",
"index.lifecycle.rollover_alias": "logs"
}
}
}
Relevance Tuning
Out-of-the-box BM25 scoring is a solid baseline, but production search always requires tuning. Users have implicit expectations about result ordering that pure text matching can't satisfy — a product that's popular, highly rated, or recently added should rank higher than an obscure match.
Field Boosting
// Matches in the title are 3× more important than in description
GET /products/_search
{
"query": {
"multi_match": {
"query": "noise cancelling",
"fields": [
"name^3", // 3× boost
"brand^2", // 2× boost
"description", // 1× (default)
"tags^1.5" // 1.5× boost
]
}
}
}
Function Score Query
The function_score query lets you modify scores using custom functions — essential for incorporating business logic into relevance.
GET /products/_search
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "wireless headphones",
"fields": ["name^3", "description"]
}
},
"functions": [
{
"filter": { "term": { "featured": true } },
"weight": 2.0
},
{
"field_value_factor": {
"field": "rating",
"factor": 1.2,
"modifier": "log1p", // score *= log(1 + 1.2 * rating)
"missing": 3.0 // default if field missing
}
},
{
"gauss": {
"created_at": {
"origin": "now",
"scale": "30d", // half-life: 30 days
"decay": 0.5 // at 30 days, score multiplied by 0.5
}
}
},
{
"script_score": {
"script": {
"source": "Math.log(2 + doc['sales_count'].value)"
}
}
}
],
"score_mode": "multiply", // multiply all function scores together
"boost_mode": "multiply" // multiply function result with query score
}
}
}
// score_mode options: multiply, sum, avg, first, max, min
// boost_mode options: multiply, replace, sum, avg, max, min
// The final score combines:
// 1. Text relevance (BM25 from multi_match)
// 2. Featured product boost (2×)
// 3. Rating factor (log-scaled)
// 4. Recency decay (Gaussian)
// 5. Popularity signal (sales count)
Debugging Relevance with _explain
// Understand why a specific document got its score
GET /products/_explain/product_42
{
"query": {
"multi_match": {
"query": "wireless headphones",
"fields": ["name^3", "description"]
}
}
}
// Returns a detailed breakdown:
// {
// "_explanation": {
// "value": 12.45,
// "description": "sum of:",
// "details": [
// {
// "value": 9.23,
// "description": "weight(name:wireless in 42) [BM25]",
// "details": [
// { "value": 3.14, "description": "idf, computed as ..." },
// { "value": 2.94, "description": "tf, computed as ..." }
// ]
// },
// ...
// ]
// }
// }
Capacity Planning
Sizing an Elasticsearch cluster correctly is both art and science. Under-provisioned clusters suffer from slow queries and rejected indexing requests. Over-provisioned clusters waste money. Here's a systematic approach.
Key Factors
| Factor | Metric | Impact |
|---|---|---|
| Data volume | GB on disk after indexing | Determines total disk and number of data nodes |
| Indexing rate | Documents/sec or MB/sec | Determines CPU and I/O on primary shards |
| Query rate | Queries/sec (QPS) | Determines CPU and the need for replicas |
| Query complexity | Simple match vs heavy aggs | Aggregations need RAM; complex queries need CPU |
| Retention period | Days/months of data kept | Total data = daily_volume × retention_days |
| Latency SLA | p50, p95, p99 targets | Stricter SLAs need more replicas and faster hardware |
Sizing Walkthrough
Example: E-commerce search platform
- 50M products, ~2 KB each raw JSON = 100 GB raw
- Elasticsearch indexing overhead ≈ 10–15% → ~115 GB stored
- 1 replica → 230 GB total disk
- Target shard size: 30 GB → 4 primary shards
- 4 primary + 4 replica = 8 total shards
Search load: 500 QPS, p99 < 200ms
Indexing: ~1000 docs/sec (product updates)
Hardware per data node:
- RAM: 32 GB (give 16 GB to ES heap, 16 GB for filesystem cache)
- Disk: 500 GB NVMe SSD (115 GB data × 2 for headroom + merging)
- CPU: 8 cores (search is CPU-intensive for scoring + aggs)
Cluster layout:
┌────────────────────────────────────────────────────┐
│ 3× Master nodes: 4 CPU, 8 GB RAM (lightweight) │
│ 3× Data nodes: 8 CPU, 32 GB RAM, 500 GB SSD │
│ 2× Coord nodes: 4 CPU, 16 GB RAM (merge/sort) │
│ │
│ Total: 8 nodes │
│ Monthly cost (cloud): ~$2,500–4,000 │
└────────────────────────────────────────────────────┘
Rule of thumb for heap:
- Never exceed 50% of physical RAM for ES heap
- Never exceed 32 GB heap (beyond 32 GB, JVM loses compressed oops)
- Leave remaining RAM for filesystem cache (Lucene segments love it)
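The same arithmetic can be captured in a quick back-of-the-envelope helper — the numbers below mirror the example above and are assumptions to adjust for your own workload:

# Back-of-the-envelope sizing from the rules of thumb above (all inputs are estimates).
import math

raw_gb = 100                # raw source data
index_overhead = 1.15       # ~10-15% Elasticsearch/Lucene overhead
replicas = 1
target_shard_gb = 30

stored_gb = raw_gb * index_overhead                      # ~115 GB on disk per copy
total_disk_gb = stored_gb * (1 + replicas)               # ~230 GB including 1 replica
primary_shards = math.ceil(stored_gb / target_shard_gb)  # 4 primaries
total_shards = primary_shards * (1 + replicas)           # 8 shards overall

print(f"stored: {stored_gb:.0f} GB, disk: {total_disk_gb:.0f} GB, "
      f"primaries: {primary_shards}, total shards: {total_shards}")
# stored: 115 GB, disk: 230 GB, primaries: 4, total shards: 8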
Key Metrics to Monitor
- Cluster health: Green (all shards allocated), Yellow (replicas missing), Red (primary shards missing — data loss risk)
- JVM heap usage: Should stay below 75%. GC pauses above 1s indicate heap pressure.
- Search latency: _nodes/stats reports query and fetch phase timings. Alert on p95 > SLA.
- Indexing rate and rejections: Bulk thread pool rejections mean the cluster can't keep up with writes.
- Segment count: High segment counts (>50 per shard) increase search latency and heap usage.
- Disk watermarks: ES stops allocating shards at 85% disk (low watermark) and relocates at 90% (high watermark).
// Essential monitoring queries
GET _cluster/health?pretty
GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,disk.used_percent
GET _cat/indices?v&h=index,health,docs.count,store.size,pri.store.size&s=store.size:desc
GET _cat/thread_pool?v&h=name,active,rejected,completed&s=rejected:desc
GET _nodes/stats/jvm?filter_path=nodes.*.jvm.mem
Use Cases
Product Search
The canonical Elasticsearch use case. E-commerce platforms like eBay, Etsy, and Shopify use Elasticsearch to power product search with faceted navigation, autocomplete, and personalized ranking.
// Complete product search with facets and highlighting
GET /products/_search
{
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "wireless headphones",
"fields": ["name^3", "brand^2", "description"],
"fuzziness": "AUTO"
}
}
],
"filter": [
{ "term": { "in_stock": true } },
{ "range": { "price": { "gte": 20, "lte": 300 } } }
]
}
},
"functions": [
{ "field_value_factor": { "field": "rating", "modifier": "log1p" } }
]
}
},
"highlight": {
"fields": {
"name": { "pre_tags": ["<mark>"], "post_tags": ["</mark>"] },
"description": { "fragment_size": 150, "number_of_fragments": 2 }
}
},
"aggs": {
"brands": { "terms": { "field": "brand", "size": 20 } },
"categories": { "terms": { "field": "category", "size": 15 } },
"price_ranges": {
"range": {
"field": "price",
"ranges": [
{ "key": "Under $50", "to": 50 },
{ "key": "$50–$100", "from": 50, "to": 100 },
{ "key": "$100–$200", "from": 100, "to": 200 },
{ "key": "$200+", "from": 200 }
]
}
}
},
"from": 0,
"size": 20
}
Log Analysis (ELK Stack)
The ELK Stack (Elasticsearch, Logstash, Kibana) — or the modern Elastic Stack — is the most popular open-source log analytics platform. Organizations ingest billions of log events per day for operational monitoring, troubleshooting, and security analytics.
Architecture:
Applications → Filebeat (lightweight shipper)
→ Logstash (parse, transform, enrich)
→ Elasticsearch (store + index)
→ Kibana (visualize + dashboard)
# Logstash pipeline for nginx access logs
input {
beats { port => 5044 }
}
filter {
grok {
match => {
"message" => '%{IPORHOST:client_ip} - %{DATA:user} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:bytes}'
}
}
date {
match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
target => "@timestamp"
}
geoip { source => "client_ip" }
}
output {
elasticsearch {
hosts => ["https://es-cluster:9200"]
index => "nginx-logs-%{+yyyy.MM.dd}"
}
}
// Query: Find 500 errors in the last hour
GET /nginx-logs-*/_search
{
"query": {
"bool": {
"filter": [
{ "range": { "@timestamp": { "gte": "now-1h" } } },
{ "range": { "status": { "gte": 500, "lt": 600 } } }
]
}
},
"aggs": {
"errors_over_time": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "5m"
}
},
"top_error_paths": {
"terms": { "field": "request.keyword", "size": 10 }
}
},
"sort": [{ "@timestamp": "desc" }],
"size": 50
}
Autocomplete / Search-as-You-Type
Autocomplete requires special indexing strategies because the user is typing partial words. Two main approaches:
// Approach 1: Edge N-grams (index-time prefix generation)
// "headphones" → ["he", "hea", "head", "headp", "headph", ...]
PUT /autocomplete
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 15
}
},
"analyzer": {
"autocomplete_index": {
"tokenizer": "standard",
"filter": ["lowercase", "autocomplete_filter"]
},
"autocomplete_search": {
"tokenizer": "standard",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"properties": {
"suggest": {
"type": "text",
"analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
}
}
}
}
// Approach 2: Completion Suggester (optimized in-memory FST)
PUT /products
{
"mappings": {
"properties": {
"name_suggest": {
"type": "completion",
"contexts": [
{ "name": "category", "type": "category" }
]
}
}
}
}
// Index with suggestion
POST /products/_doc
{
"name": "Sony WH-1000XM5",
"name_suggest": {
"input": ["Sony WH-1000XM5", "WH-1000XM5", "Sony headphones"],
"weight": 42,
"contexts": { "category": "headphones" }
}
}
// Query suggestions (< 5ms response time)
POST /products/_search
{
"suggest": {
"product_suggest": {
"prefix": "son",
"completion": {
"field": "name_suggest",
"size": 5,
"fuzzy": { "fuzziness": 1 },
"contexts": { "category": "headphones" }
}
}
}
}
Summary and Best Practices
Do's and Don'ts
| ✅ Do | ❌ Don't |
|---|---|
| Use keyword type for exact-match fields (status, category) | Use term query on text fields (case mismatch) |
| Put filters in filter context (cacheable, no scoring) | Put non-scoring conditions in must (wastes CPU on scoring) |
| Use bulk API for indexing (1,000–5,000 docs per batch) | Index documents one at a time (massive overhead per request) |
| Set explicit mappings before indexing | Rely on dynamic mapping in production (type mismatches) |
| Use aliases for zero-downtime reindexing | Point applications directly at index names |
| Monitor JVM heap, keep <75% | Set heap >32 GB (loses compressed oops) |
| Use ILM for time-series data | Keep all logs in hot storage forever |
| Test analyzers with _analyze API before production | Deploy analyzer changes without verification |
When to Use Elasticsearch (and When Not To)
| Use Elasticsearch For | Don't Use Elasticsearch For |
|---|---|
| Full-text search with relevance ranking | Primary data store (no ACID transactions) |
| Log and event analytics | Strict relational joins across entities |
| Autocomplete and search suggestions | Write-heavy OLTP workloads |
| Geospatial search | Financial transactions requiring strong consistency |
| Real-time dashboards and metrics | Blob/binary storage |
| Security analytics (SIEM) | Graph traversal (use Neo4j) |
Key Numbers to Remember
- Shard size: 10–50 GB per shard
- Heap: ≤50% of RAM, ≤32 GB
- Refresh interval: 1s default (NRT delay)
- Bulk size: 5–15 MB per request optimal
- Shards per node: <20 per GB heap
- Disk watermark: 85% low, 90% high, 95% flood
- BM25 defaults: k1=1.2, b=0.75
- Cluster minimum for HA: 3 master-eligible nodes
In the next post, we'll explore Object Storage — how systems like Amazon S3 store and serve unstructured data at planetary scale, and how they complement search engines for media-heavy applications.