Search Engines (Elasticsearch)
The Full-Text Search Problem
Relational databases are built for structured data — rows and columns, exact matches, range filters. But the moment you need to answer questions like "find all products matching wireless noise-cancelling headphones," traditional WHERE name LIKE '%wireless%' queries collapse under their own weight.
Why SQL LIKE Fails at Scale
Consider a product catalog with 50 million documents. A LIKE '%wireless%' query forces a full table scan — every single row is inspected character by character. The problems compound rapidly:
- Performance: A full table scan on 50M rows with average 2KB documents means reading ~100 GB of data. Even with NVMe SSDs at 3 GB/s sequential reads, that's 30+ seconds per query — unacceptable for user-facing search.
- No relevance ranking: LIKE returns Boolean results (match or no match). It can't tell you that "Sony WH-1000XM5 Wireless Noise-Cancelling" is a better match than "Wireless Mouse Pad" for the query "wireless noise cancelling."
- No linguistic awareness: Searching for "running" won't find documents containing "ran" or "runner." Searching for "café" won't match "cafe." Users expect intelligence.
- No typo tolerance: Users misspell. "elaticsearch" should still find "Elasticsearch." SQL offers no fuzzy matching out of the box.
- No tokenization: LIKE '%noise cancelling%' requires an exact substring match. "cancelling noise" or "noise-canceling" (American spelling) won't match.
What Modern Users Expect from Search
Every user has been trained by Google. They expect:
- Sub-100ms response times — even across billions of documents
- Relevance ranking — the best result first, not just any result
- Typo tolerance — "elsticsearch" → "Elasticsearch"
- Synonym awareness — "laptop" matches "notebook computer"
- Autocomplete/suggestions — results appear as you type
- Faceted filtering — narrow by category, price range, brand
- Highlighting — matched terms are visually emphasized in results
Meeting these expectations at scale is the domain of dedicated search engines — and Elasticsearch is the most widely deployed.
The Inverted Index
The inverted index is the foundational data structure behind every search engine. It's conceptually simple but extraordinarily powerful — a mapping from every unique term in a corpus to the list of documents that contain that term.
Forward Index vs Inverted Index
A forward index maps documents to terms (what a database row looks like). An inverted index flips the relationship — it maps terms to documents:
Forward Index (what a database stores):
Doc 1 → ["elasticsearch", "is", "a", "search", "engine"]
Doc 2 → ["search", "engines", "use", "inverted", "indexes"]
Doc 3 → ["elasticsearch", "supports", "full", "text", "search"]
Inverted Index (what a search engine builds):
"elasticsearch" → [Doc 1, Doc 3]
"search" → [Doc 1, Doc 2, Doc 3]
"engine" → [Doc 1]
"engines" → [Doc 2]
"inverted" → [Doc 2]
"indexes" → [Doc 2]
"supports" → [Doc 3]
"full" → [Doc 3]
"text" → [Doc 3]
...
Now, to find all documents containing "search," the engine performs a single dictionary lookup — O(1) instead of scanning every document. For a query like "elasticsearch search," it computes the intersection of [Doc 1, Doc 3] and [Doc 1, Doc 2, Doc 3] = [Doc 1, Doc 3].
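To make the mechanics concrete, here is a toy Python sketch — illustrative only, nothing like Lucene's real on-disk structures — that builds an inverted index from the three documents above and answers a multi-term query by intersecting posting lists:

# Toy inverted index: term -> set of doc IDs. Illustrative only.
from collections import defaultdict

docs = {
    1: "elasticsearch is a search engine",
    2: "search engines use inverted indexes",
    3: "elasticsearch supports full text search",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():          # naive tokenizer: whitespace only, text already lowercase
        inverted[term].add(doc_id)

def search_all(*terms):
    """Doc IDs containing every query term (AND semantics)."""
    postings = [inverted.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search_all("search"))                   # [1, 2, 3]
print(search_all("elasticsearch", "search"))  # [1, 3]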
The Tokenization Pipeline
Before a document enters the inverted index, it passes through a multi-stage analysis pipeline:
Raw text: "Elasticsearch is a Powerful Search-Engine!"
│
┌─────────▼──────────┐
│ Character Filters │ Strip HTML, normalize Unicode
└─────────┬──────────┘
│
"elasticsearch is a powerful search-engine!"
│
┌─────────▼──────────┐
│ Tokenizer │ Split into tokens
└─────────┬──────────┘
│
["elasticsearch", "is", "a", "powerful", "search", "engine"]
│
┌─────────▼──────────┐
│ Token Filters │ Lowercase, stemming, stop words
└─────────┬──────────┘
│
["elasticsearch", "powerful", "search", "engine"]
│
┌─────────▼──────────┐
│ Inverted Index │ Store term → doc_id mapping
└────────────────────┘
Each stage serves a specific purpose:
- Character filters — Transform raw characters before tokenization: HTML stripping (<b>bold</b> → bold), pattern replacement, Unicode normalization.
- Tokenizer — Splits text into individual tokens. The standard tokenizer splits on whitespace and punctuation, so "search-engine" becomes ["search", "engine"].
- Token filters — Transform individual tokens: lowercasing, stemming ("running" → "run"), stop word removal ("is", "a", "the"), synonym expansion. A toy version of the whole pipeline is sketched below.
Posting Lists and Term Frequency
Each entry in the inverted index doesn't just store a document ID — it stores a posting list with rich metadata:
Term: "search"
Posting List:
┌──────────────────────────────────────────────────────────┐
│ Doc ID │ Term Freq │ Positions │ Offsets (start:end) │
├────────┼───────────┼────────────┼────────────────────────┤
│ Doc 1 │ 1 │ [3] │ [(15:21)] │
│ Doc 2 │ 1 │ [0] │ [(0:6)] │
│ Doc 3 │ 2 │ [4, 7] │ [(25:31), (42:48)] │
└──────────────────────────────────────────────────────────┘
- Term Frequency (TF): How many times the term appears in this document. More occurrences suggest higher relevance.
- Positions: The word positions within the document — essential for phrase queries like "full text search", which require the terms to appear consecutively (see the sketch below).
- Offsets: Character positions in the original text — used for highlighting matched terms in search results.
▶ Interactive demo: Inverted Index Build — step through indexing 3 documents, building the inverted index, then querying it.
Lucene Segments
Elasticsearch is built on top of Apache Lucene, the battle-tested search library that also powers Solr. Understanding Lucene's segment architecture is key to understanding Elasticsearch's performance characteristics.
Immutable Segment Architecture
When documents are indexed, Lucene doesn't modify existing data structures. Instead, it writes new immutable segments:
Lucene Index Structure:
┌─────────────────────────────────────────────────────┐
│ Lucene Index │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌───────────┐ │
│ │ Segment 0 │ │ Segment 1 │ │ Segment 2 │ │
│ │ (50K docs) │ │ (30K docs) │ │ (8K docs) │ │
│ │ committed │ │ committed │ │ committed │ │
│ └──────────────┘ └──────────────┘ └───────────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ In-Memory Buffer │ │
│ │ (new docs not yet committed) │ │
│ └──────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Transaction Log (Translog) │ │
│ │ (durability for in-memory buffer) │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Segment Lifecycle
- In-memory buffer: New documents are first written to an in-memory buffer and the transaction log (translog) for durability.
- Refresh (every 1 second by default): The in-memory buffer is written to a new Lucene segment in the filesystem cache. The segment is now searchable but not yet fsynced to disk. This is why Elasticsearch is called "near real-time" — there's a ~1 second delay between indexing and searchability.
- Flush (every 30 minutes or when translog is full): Segments in the filesystem cache are fsynced to disk, and the translog is cleared. Data is now durable.
- Merge: Background process that combines smaller segments into larger ones, reclaiming space from deleted documents.
Why Immutability?
| Benefit | Explanation |
|---|---|
| No locking | Immutable segments can be read by multiple threads concurrently without locks — critical for search throughput. |
| Caching | The OS filesystem cache can aggressively cache segment files since they never change. |
| Compression | Known-at-write-time data can be optimally compressed (delta encoding for doc IDs, variable-byte encoding). |
| Predictable I/O | Sequential writes to new segments rather than random updates to existing files. |
Segment Merging
Over time, many small segments accumulate. Each search must check every segment, so too many segments hurt performance. Lucene runs a merge policy in the background:
Before merge: [Seg0: 100K] [Seg1: 50K] [Seg2: 50K] [Seg3: 20K] [Seg4: 10K] [Seg5: 5K]
After merge: [Seg0: 100K] [Seg1_2: 100K] [Seg3_4_5: 35K]
Merge process:
1. Select candidate segments (tiered merge policy picks similar-sized segments)
2. Read all documents from selected segments
3. Skip documents marked as deleted
4. Write a new, larger segment
5. Atomically swap old segments for the new one
6. Delete old segment files
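A toy sketch of the core idea — copy live documents out of old segments, skip the tombstoned ones, write a fresh segment — under the obvious simplification that the real tiered policy also handles size-based selection, I/O throttling, and the atomic swap:

# Toy segment merge: combine immutable segments, dropping documents marked deleted.
segments = [
    {"docs": {1: "doc one", 2: "doc two", 3: "doc three"}, "deleted": {2}},
    {"docs": {4: "doc four", 5: "doc five"},               "deleted": set()},
]

def merge(segs):
    merged = {}
    for seg in segs:
        for doc_id, source in seg["docs"].items():
            if doc_id not in seg["deleted"]:    # space from deleted docs is reclaimed here
                merged[doc_id] = source
    return {"docs": merged, "deleted": set()}   # old segment files are removed afterwards

print(sorted(merge(segments)["docs"]))  # [1, 3, 4, 5] -- doc 2 is gone for good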
POST /my-index/_forcemerge?max_num_segments=1 merges everything into one segment. This is extremely expensive on large indices and blocks further merging. Only use on read-only indices (like time-based log indices that have been rolled over).
Elasticsearch Architecture
Elasticsearch wraps Lucene with a distributed layer that handles clustering, replication, routing, and coordination. Understanding this architecture is essential for capacity planning and troubleshooting.
Node Types
| Node Type | Role | Typical Config |
|---|---|---|
| Master-eligible | Manages cluster state (index creation, shard allocation, node tracking). The elected master performs lightweight coordination — no data queries. | 3 dedicated nodes (for quorum). Low CPU/RAM, high availability. |
| Data | Stores shards and executes search/index operations. The workhorses of the cluster. | Scale based on data volume. High RAM (≥64 GB), fast SSDs, moderate CPU. |
| Coordinating | Routes requests, scatters queries to data nodes, gathers and merges results. Acts as a smart load balancer. | Moderate RAM for sorting/aggregation. CPU for merge operations. |
| Ingest | Pre-processes documents before indexing (pipelines: grok parsing, GeoIP enrichment, date parsing). | CPU-heavy for transformation. Can colocate with coordinating nodes in smaller clusters. |
| Machine Learning | Runs anomaly detection and inference jobs (Elastic ML). Isolated to prevent ML workloads from impacting search. | High CPU/RAM, optional GPU. |
Shards and Replicas
An Elasticsearch index is divided into shards — each shard is a complete Lucene index. Shards provide two critical capabilities:
- Primary shards — The authoritative copy. Writes always go to the primary shard first. The number of primary shards is set at index creation and cannot be changed (without reindexing).
- Replica shards — Exact copies of primaries, distributed across different nodes. Serve read requests (search) for throughput. Provide fault tolerance if a node fails.
Index: "products" (3 primaries, 1 replica)
Node 1 Node 2 Node 3
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ P0 │ │ P1 │ │ P2 │
│ R2 │ │ R0 │ │ R1 │
└──────────────┘ └──────────────┘ └──────────────┘
P0, P1, P2 = Primary shards (writes go here first)
R0, R1, R2 = Replica shards (copies, serve reads)
If Node 2 fails:
- P1 is lost, but R1 on Node 3 is promoted to primary
- R0 on Node 2 is lost, ES allocates a new R0 on Node 1 or 3
- Zero data loss, zero downtime
Shard Sizing Guidelines
Getting shard count right is one of the most impactful performance decisions:
| Guideline | Recommendation |
|---|---|
| Shard size | Target 10–50 GB per shard. Under 10 GB wastes overhead; over 50 GB makes recovery slow. |
| Shards per node | Keep below 20 shards per GB of heap. A 30 GB heap node should have <600 shards total. |
| Shard count formula | num_primary_shards = ceil(expected_data_GB / 30) — a reasonable starting point. |
| Over-sharding | Each shard has fixed overhead (~10 MB heap). 1,000 tiny shards waste ~10 GB of heap for cluster state alone. |
// Example: E-commerce product catalog
// 200 GB of product data, moderate search load
PUT /products
{
"settings": {
"number_of_shards": 7, // ~28 GB per shard
"number_of_replicas": 1, // 1 replica = 14 total shards
"refresh_interval": "5s", // slightly relaxed for write perf
"codec": "best_compression" // zstd compression for cold data
}
}
Write Path: How Documents Get Indexed
The journey of a document from client to searchable state:
- Client sends index request to any node (the coordinating node for this request).
- Routing: The coordinating node computes shard = hash(_routing) % number_of_primary_shards (a sketch of this follows the list). The default _routing value is the document ID.
- Forward to primary: The request is forwarded to the node holding the target primary shard.
- Primary indexes: The document is written to the in-memory buffer and the translog.
- Replicate: The primary forwards the operation to all replica shards in parallel.
- Acknowledge: Once the primary and all in-sync replicas confirm, the coordinating node returns success to the client.
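Routing is deterministic, which is exactly why the primary shard count can't change after index creation. A minimal sketch of the idea, with MD5 standing in for the Murmur3 hash Elasticsearch actually uses:

# Illustrative document routing: the same routing value always lands on the same shard.
import hashlib

NUM_PRIMARY_SHARDS = 3

def route(routing_value: str) -> int:
    digest = hashlib.md5(routing_value.encode()).hexdigest()  # stand-in for murmur3
    return int(digest, 16) % NUM_PRIMARY_SHARDS

for doc_id in ["product-1", "product-2", "product-3"]:        # hypothetical document IDs
    print(doc_id, "-> shard", route(doc_id))

# Changing NUM_PRIMARY_SHARDS would re-route existing documents to different shards,
# which is why the primary shard count is fixed at index creation (reindex to change it).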
Search Path: Scatter-Gather
Search uses a two-phase scatter-gather pattern:
Phase 1: QUERY (Scatter)
Client → Coordinating Node
│
├─→ Shard 0 (or its replica) → returns top N doc IDs + scores
├─→ Shard 1 (or its replica) → returns top N doc IDs + scores
└─→ Shard 2 (or its replica) → returns top N doc IDs + scores
Phase 2: FETCH (Gather)
Coordinating Node merges results, picks global top N
│
├─→ Shard X → return full doc for doc_id 42
├─→ Shard Y → return full doc for doc_id 17
└─→ Shard Z → return full doc for doc_id 91
│
└─→ Client receives final results
This two-phase approach avoids transferring full documents during the query phase — only lightweight IDs and scores cross the network until the final set is determined.
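A toy sketch of the query-phase merge — each shard contributes its local top-N of (doc ID, score) pairs, and the coordinating node keeps only the global top-N before fetching full documents (the scores here are made up for illustration):

# Toy scatter-gather merge performed by the coordinating node.
import heapq

shard_results = [
    [(42, 9.1), (17, 7.4), (8, 5.0)],    # shard 0: local top 3 as (doc_id, score)
    [(91, 8.2), (55, 6.9), (3, 2.1)],    # shard 1
    [(77, 7.9), (12, 4.4), (60, 3.3)],   # shard 2
]

def global_top_n(results, n=3):
    merged = [hit for shard in results for hit in shard]
    return heapq.nlargest(n, merged, key=lambda hit: hit[1])

print(global_top_n(shard_results))
# [(42, 9.1), (91, 8.2), (77, 7.9)] -> only these doc IDs are fetched in phase 2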
▶ Interactive demo: Elasticsearch Cluster — see write routing and search scatter-gather across 3 nodes with 3 primary shards + 1 replica each.
Near Real-Time Search
Elasticsearch is often described as "near real-time" (NRT). The gap between indexing a document and being able to search it is controlled by the refresh interval — 1 second by default.
The Refresh Interval
Timeline of a document's journey to searchability:
t=0.000s Document indexed (in-memory buffer + translog)
t=0.000s Document is NOT searchable
...
t=1.000s Refresh fires: buffer → new Lucene segment (filesystem cache)
t=1.001s Document is NOW searchable
...
t=1800s Flush: segment fsynced to disk, translog cleared
t=1800s Document is NOW durable on disk
Tuning Refresh for Different Workloads
| Scenario | refresh_interval | Rationale |
|---|---|---|
| User-facing product search | 1s (default) | New products should appear quickly. 1s latency is imperceptible. |
| Log ingestion (high throughput) | 30s | Logs don't need instant searchability. Reducing refresh from 1s to 30s can increase indexing throughput 30–50%. |
| Bulk reindexing | -1 (disabled) | Disable refresh entirely during bulk loads. Re-enable after completion. Avoids creating thousands of tiny segments. |
| Real-time alerting | 1s or 500ms | Security alerting needs minimal delay. Can push down to 500ms at the cost of more segments and merging. |
// Disable refresh during bulk indexing
PUT /logs-2026.04/_settings
{ "index.refresh_interval": "-1" }
// Bulk index millions of documents...
POST /_bulk
{ "index": { "_index": "logs-2026.04" } }
{ "timestamp": "2026-04-15T10:30:00Z", "message": "..." }
...
// Re-enable refresh and force one
PUT /logs-2026.04/_settings
{ "index.refresh_interval": "30s" }
POST /logs-2026.04/_refresh
Relevance Scoring: TF-IDF and BM25
When multiple documents match a query, the search engine must rank them by relevance. This ranking is computed using mathematical scoring models.
TF-IDF: The Classic Model
TF-IDF (Term Frequency–Inverse Document Frequency) was the standard scoring model before Elasticsearch 5.0. It combines two intuitions:
- TF (Term Frequency): A term that appears more often in a document is more relevant to that document. If "elasticsearch" appears 5 times in Doc A and once in Doc B, Doc A is probably more about Elasticsearch.
- IDF (Inverse Document Frequency): A term that appears in fewer documents is more discriminating. The word "the" appears in every document (low IDF, not useful for ranking). The word "elasticsearch" appears in only 0.1% of documents (high IDF, very useful).
TF-IDF Formula:
score(t, d) = TF(t, d) × IDF(t)
TF(t, d) = √(frequency of term t in document d)
IDF(t) = log(total_docs / docs_containing_t) + 1
Example:
Corpus: 1,000,000 documents
Query: "elasticsearch"
Doc A: contains "elasticsearch" 5 times
Doc B: contains "elasticsearch" 1 time
"elasticsearch" appears in 1,000 documents
IDF("elasticsearch") = log(1,000,000 / 1,000) + 1 = log(1000) + 1 = 4.0
score(Doc A) = √5 × 4.0 = 2.236 × 4.0 = 8.94
score(Doc B) = √1 × 4.0 = 1.0 × 4.0 = 4.0
Doc A ranks higher ✓
BM25: The Modern Standard
Since Elasticsearch 5.0, the default scoring algorithm is BM25 (Best Matching 25). It addresses two key weaknesses of TF-IDF:
- Term frequency saturation: In TF-IDF, doubling the term frequency always increases the score. BM25 applies a saturation curve — after a certain point, additional occurrences contribute diminishing returns. A document with "search" 50 times isn't 50× more relevant than one with "search" once.
- Document length normalization: A 10,000-word document naturally contains more term occurrences than a 100-word document. BM25 normalizes by document length, so long documents aren't unfairly favored.
BM25 Formula:
score(t, d) = IDF(t) × (TF(t,d) × (k1 + 1)) / (TF(t,d) + k1 × (1 - b + b × |d|/avgdl))
Parameters:
k1 = 1.2 (term frequency saturation — higher = slower saturation)
b = 0.75 (document length normalization — 0 = none, 1 = full)
|d| = length of document d (in terms)
avgdl = average document length across the corpus
Key behavior:
- When TF is low: score grows almost linearly with TF
- When TF is high: score plateaus (saturation)
- When b = 0: no length normalization (all docs treated equally)
- When b = 1: full length normalization (long docs penalized)
Practical effect:
k1=1.2, b=0.75 work well for ~90% of use cases.
For short documents (titles, tags): lower b (0.3–0.5)
For long documents (articles, books): higher b (0.75–1.0)
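A small sketch of the per-term BM25 score (the formula above, not Elasticsearch's internal code) makes the saturation behavior visible — IDF is held constant at 4.0 to match the earlier example, and the document is assumed to be exactly average length:

# BM25 score for a single term in a single document, showing TF saturation.
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)

for tf in [1, 2, 5, 10, 50]:
    print(tf, round(bm25_term_score(tf, doc_len=100, avg_doc_len=100, idf=4.0), 2))
# tf=1 -> 4.0, tf=2 -> 5.5, tf=5 -> 7.1, tf=10 -> 7.86, tf=50 -> 8.59
# 50x the occurrences buys barely 2x the score -- that's the saturation curve.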
// Customize BM25 parameters per field
PUT /products
{
"settings": {
"similarity": {
"custom_bm25": {
"type": "BM25",
"k1": 1.5,
"b": 0.3
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"similarity": "custom_bm25"
}
}
}
}
Analyzers: Text Processing Pipeline
Analyzers define how text is processed before entering the inverted index (at index time) and how query text is processed (at search time). Getting analyzers right is often the difference between a good and great search experience.
Anatomy of an Analyzer
Analyzer = Character Filters → Tokenizer → Token Filters
Built-in Analyzers:
┌──────────────┬──────────────────┬──────────────────────────────────┐
│ Analyzer │ Tokenizer │ Token Filters │
├──────────────┼──────────────────┼──────────────────────────────────┤
│ standard │ standard │ lowercase │
│ simple │ letter │ lowercase │
│ whitespace │ whitespace │ (none) │
│ keyword │ keyword (no-op) │ (none) — entire input = 1 token │
│ english │ standard │ english_possessive_stemmer, │
│ │ │ lowercase, english_stop, │
│ │ │ english_stemmer │
└──────────────┴──────────────────┴──────────────────────────────────┘
Building a Custom Analyzer
Real-world search usually requires a custom analyzer. Here's a production-grade example for an e-commerce product search:
PUT /products
{
"settings": {
"analysis": {
"char_filter": {
"strip_html": {
"type": "html_strip"
},
"normalize_hyphens": {
"type": "pattern_replace",
"pattern": "-",
"replacement": " "
}
},
"tokenizer": {
"product_tokenizer": {
"type": "standard",
"max_token_length": 255
}
},
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"product_synonyms": {
"type": "synonym_graph",
"synonyms": [
"laptop, notebook, portable computer",
"phone, mobile, smartphone, cell phone",
"tv, television, telly",
"headphones, earphones, earbuds"
]
},
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 15
}
},
"analyzer": {
"product_analyzer": {
"type": "custom",
"char_filter": ["strip_html", "normalize_hyphens"],
"tokenizer": "product_tokenizer",
"filter": [
"lowercase",
"product_synonyms",
"english_stop",
"english_stemmer"
]
},
"autocomplete_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "product_analyzer",
"search_analyzer": "standard",
"fields": {
"autocomplete": {
"type": "text",
"analyzer": "autocomplete_analyzer",
"search_analyzer": "standard"
},
"keyword": {
"type": "keyword"
}
}
},
"description": {
"type": "text",
"analyzer": "product_analyzer"
},
"price": { "type": "float" },
"category": { "type": "keyword" },
"brand": { "type": "keyword" },
"created_at": { "type": "date" },
"in_stock": { "type": "boolean" }
}
}
}
Testing Analyzers with the _analyze API
// Test how text is tokenized
POST /products/_analyze
{
"analyzer": "product_analyzer",
"text": "Sony WH-1000XM5 Wireless Noise-Cancelling Headphones"
}
// Response:
{
"tokens": [
{ "token": "sony", "position": 0 },
{ "token": "wh", "position": 1 },
{ "token": "1000xm5", "position": 2 },
{ "token": "wireless", "position": 3 },
{ "token": "nois", "position": 4 }, // stemmed
{ "token": "cancel", "position": 5 }, // stemmed
{ "token": "headphon", "position": 6 }, // stemmed
{ "token": "earphon", "position": 6 }, // synonym
{ "token": "earbud", "position": 6 } // synonym
]
}
Query DSL
Elasticsearch's Query DSL (Domain Specific Language) is a powerful JSON-based query language that supports everything from simple term lookups to complex multi-clause boolean queries with boosting and scoring functions.
Core Query Types
Match Query — Full-Text Search
// The workhorse of full-text search
// Analyzes the query text, then finds matching documents
GET /products/_search
{
"query": {
"match": {
"name": {
"query": "wireless headphones",
"operator": "and", // both terms must match (default: "or")
"fuzziness": "AUTO", // typo tolerance: 0 edits for 1-2 chars,
// 1 edit for 3-5 chars, 2 edits for 6+ chars
"minimum_should_match": "75%"
}
}
}
}
Term Query — Exact Match (No Analysis)
// For keyword fields — no analysis applied to query text
// NEVER use term query on "text" fields — the indexed tokens are analyzed (lowercased, stemmed) but the term value is not, so matches silently fail
GET /products/_search
{
"query": {
"term": {
"category": {
"value": "electronics"
}
}
}
}
Bool Query — Combine Multiple Conditions
// The most powerful query type — combine must, should, must_not, filter
GET /products/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "wireless headphones"
}
}
],
"filter": [
{ "term": { "brand": "sony" } },
{ "range": { "price": { "gte": 50, "lte": 300 } } },
{ "term": { "in_stock": true } }
],
"should": [
{
"match": {
"description": {
"query": "noise cancelling",
"boost": 1.5
}
}
}
],
"must_not": [
{ "term": { "category": "refurbished" } }
],
"minimum_should_match": 1
}
}
}
// must: Required, contributes to score
// filter: Required, does NOT contribute to score (cacheable!)
// should: Optional, boosts score if matched
// must_not: Excludes documents, does NOT contribute to score
Range Query — Numeric and Date Ranges
GET /logs/_search
{
"query": {
"bool": {
"filter": [
{
"range": {
"timestamp": {
"gte": "2026-04-01T00:00:00Z",
"lt": "2026-04-15T00:00:00Z",
"format": "strict_date_optional_time"
}
}
},
{
"range": {
"response_time_ms": {
"gte": 500
}
}
}
]
}
}
}
Multi-Match Query — Search Across Multiple Fields
// Search across name, description, and brand with different weights
GET /products/_search
{
"query": {
"multi_match": {
"query": "sony noise cancelling",
"fields": ["name^3", "brand^2", "description"],
"type": "best_fields", // use the best matching field's score
"tie_breaker": 0.3, // add 30% of other fields' scores
"fuzziness": "AUTO"
}
}
}
// type options:
// best_fields — score from best matching field (default)
// most_fields — sum of scores from all matching fields
// cross_fields — treat all fields as one big field (for name = first + last)
// phrase — run a match_phrase on each field
Phrase and Proximity Queries
// Exact phrase match — terms must appear in exact order
GET /products/_search
{
"query": {
"match_phrase": {
"description": {
"query": "noise cancelling technology",
"slop": 2 // allow up to 2 words between terms
}
}
}
}
// "advanced noise cancelling technology" → matches (slop 1)
// "noise reduction and cancelling technology" → matches (slop 2)
// "technology for cancelling noise" → does NOT match (wrong order + slop)
Aggregations
Aggregations are Elasticsearch's analytics engine — they let you compute metrics, build histograms, group data, and create complex analytics on top of your search results. Think of them as SQL's GROUP BY on steroids.
Metric Aggregations
// Compute statistics over numeric fields
GET /products/_search
{
"size": 0, // we only want aggregation results, not hits
"aggs": {
"avg_price": { "avg": { "field": "price" } },
"max_price": { "max": { "field": "price" } },
"min_price": { "min": { "field": "price" } },
"price_stats": {
"extended_stats": { "field": "price" }
// Returns: count, min, max, avg, sum, variance, std_deviation
},
"price_percentiles": {
"percentiles": {
"field": "price",
"percents": [50, 75, 90, 95, 99]
}
},
"unique_brands": {
"cardinality": {
"field": "brand",
"precision_threshold": 1000 // HyperLogLog precision
}
}
}
}
Bucket Aggregations
// Group documents into buckets (like GROUP BY)
GET /products/_search
{
"size": 0,
"query": {
"match": { "name": "headphones" }
},
"aggs": {
"by_brand": {
"terms": {
"field": "brand",
"size": 20, // top 20 brands
"order": { "_count": "desc" }
},
"aggs": {
"avg_price": { "avg": { "field": "price" } },
"price_ranges": {
"range": {
"field": "price",
"ranges": [
{ "key": "budget", "to": 50 },
{ "key": "mid", "from": 50, "to": 150 },
{ "key": "premium", "from": 150, "to": 300 },
{ "key": "luxury", "from": 300 }
]
}
}
}
},
"price_histogram": {
"histogram": {
"field": "price",
"interval": 25,
"min_doc_count": 1
}
},
"by_date": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "month",
"format": "yyyy-MM"
}
}
}
}
// Response structure:
// "by_brand": {
// "buckets": [
// { "key": "sony", "doc_count": 42,
// "avg_price": { "value": 189.50 },
// "price_ranges": { "buckets": [...] }
// },
// { "key": "bose", "doc_count": 38,
// "avg_price": { "value": 215.00 }, ...
// }
// ]
// }
Pipeline Aggregations
// Compute metrics on the results of other aggregations
GET /orders/_search
{
"size": 0,
"aggs": {
"monthly_sales": {
"date_histogram": {
"field": "order_date",
"calendar_interval": "month"
},
"aggs": {
"revenue": { "sum": { "field": "total_amount" } }
}
},
"max_monthly_revenue": {
"max_bucket": {
"buckets_path": "monthly_sales>revenue"
}
},
"avg_monthly_revenue": {
"avg_bucket": {
"buckets_path": "monthly_sales>revenue"
}
},
"revenue_moving_avg": {
"moving_avg": {
"buckets_path": "monthly_sales>revenue",
"window": 3
}
}
}
}
Index Lifecycle Management (ILM)
Time-series data (logs, metrics, events) grows continuously and has different access patterns over time. Recent data is queried frequently and needs fast performance. Older data is rarely accessed but must be retained. ILM automates the transition between these phases.
ILM Phases
Hot → Warm → Cold → Frozen → Delete
┌──────────┬─────────────┬──────────────┬────────────┬──────────┐
│ Phase │ Duration │ Hardware │ Replicas │ Purpose │
├──────────┼─────────────┼──────────────┼────────────┼──────────┤
│ Hot │ 0–7 days │ Fast SSDs │ 1–2 │ Active │
│ │ │ High RAM │ │ writes │
│ │ │ │ │ & reads │
├──────────┼─────────────┼──────────────┼────────────┼──────────┤
│ Warm │ 7–30 days │ Standard SSD │ 1 │ Read- │
│ │ │ Moderate RAM │ │ only │
├──────────┼─────────────┼──────────────┼────────────┼──────────┤
│ Cold │ 30–90 days │ HDD │ 0 │ Rare │
│ │ │ Low RAM │ │ access │
├──────────┼─────────────┼──────────────┼────────────┼──────────┤
│ Frozen │ 90–365 days │ Object store │ 0 │ Archive │
│ │ │ (S3/GCS) │ │ │
├──────────┼─────────────┼──────────────┼────────────┼──────────┤
│ Delete │ After 365d │ — │ — │ Removed │
└──────────┴─────────────┴──────────────┴────────────┴──────────┘
Defining an ILM Policy
PUT _ilm/policy/logs-lifecycle
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_primary_shard_size": "50gb",
"max_age": "7d",
"max_docs": 100000000
},
"set_priority": { "priority": 100 }
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 },
"allocate": {
"require": { "data": "warm" }
},
"set_priority": { "priority": 50 }
}
},
"cold": {
"min_age": "30d",
"actions": {
"allocate": {
"number_of_replicas": 0,
"require": { "data": "cold" }
},
"set_priority": { "priority": 0 }
}
},
"frozen": {
"min_age": "90d",
"actions": {
"searchable_snapshot": {
"snapshot_repository": "my-s3-repo"
}
}
},
"delete": {
"min_age": "365d",
"actions": {
"delete": {}
}
}
}
}
}
// Apply the policy to an index template
PUT _index_template/logs-template
{
"index_patterns": ["logs-*"],
"template": {
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"index.lifecycle.name": "logs-lifecycle",
"index.lifecycle.rollover_alias": "logs"
}
}
}
Relevance Tuning
Out-of-the-box BM25 scoring is a solid baseline, but production search always requires tuning. Users have implicit expectations about result ordering that pure text matching can't satisfy — a product that's popular, highly rated, or recently added should rank higher than an obscure match.
Field Boosting
// Matches in the title are 3× more important than in description
GET /products/_search
{
"query": {
"multi_match": {
"query": "noise cancelling",
"fields": [
"name^3", // 3× boost
"brand^2", // 2× boost
"description", // 1× (default)
"tags^1.5" // 1.5× boost
]
}
}
}
Function Score Query
The function_score query lets you modify scores using custom functions — essential for incorporating business logic into relevance.
GET /products/_search
{
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "wireless headphones",
"fields": ["name^3", "description"]
}
},
"functions": [
{
"filter": { "term": { "featured": true } },
"weight": 2.0
},
{
"field_value_factor": {
"field": "rating",
"factor": 1.2,
"modifier": "log1p", // score *= log(1 + 1.2 * rating)
"missing": 3.0 // default if field missing
}
},
{
"gauss": {
"created_at": {
"origin": "now",
"scale": "30d", // half-life: 30 days
"decay": 0.5 // at 30 days, score multiplied by 0.5
}
}
},
{
"script_score": {
"script": {
"source": "Math.log(2 + doc['sales_count'].value)"
}
}
}
],
"score_mode": "multiply", // multiply all function scores together
"boost_mode": "multiply" // multiply function result with query score
}
}
}
// score_mode options: multiply, sum, avg, first, max, min
// boost_mode options: multiply, replace, sum, avg, max, min
// The final score combines:
// 1. Text relevance (BM25 from multi_match)
// 2. Featured product boost (2×)
// 3. Rating factor (log-scaled)
// 4. Recency decay (Gaussian)
// 5. Popularity signal (sales count)
Debugging Relevance with _explain
// Understand why a specific document got its score
GET /products/_explain/product_42
{
"query": {
"multi_match": {
"query": "wireless headphones",
"fields": ["name^3", "description"]
}
}
}
// Returns a detailed breakdown:
// {
// "_explanation": {
// "value": 12.45,
// "description": "sum of:",
// "details": [
// {
// "value": 9.23,
// "description": "weight(name:wireless in 42) [BM25]",
// "details": [
// { "value": 3.14, "description": "idf, computed as ..." },
// { "value": 2.94, "description": "tf, computed as ..." }
// ]
// },
// ...
// ]
// }
// }
Capacity Planning
Sizing an Elasticsearch cluster correctly is both art and science. Under-provisioned clusters suffer from slow queries and rejected indexing requests. Over-provisioned clusters waste money. Here's a systematic approach.
Key Factors
| Factor | Metric | Impact |
|---|---|---|
| Data volume | GB on disk after indexing | Determines total disk and number of data nodes |
| Indexing rate | Documents/sec or MB/sec | Determines CPU and I/O on primary shards |
| Query rate | Queries/sec (QPS) | Determines CPU and the need for replicas |
| Query complexity | Simple match vs heavy aggs | Aggregations need RAM; complex queries need CPU |
| Retention period | Days/months of data kept | Total data = daily_volume × retention_days |
| Latency SLA | p50, p95, p99 targets | Stricter SLAs need more replicas and faster hardware |
Sizing Walkthrough
Example: E-commerce search platform
- 50M products, ~2 KB each raw JSON = 100 GB raw
- Elasticsearch indexing overhead ≈ 10–15% → ~115 GB stored
- 1 replica → 230 GB total disk
- Target shard size: 30 GB → 4 primary shards
- 4 primary + 4 replica = 8 total shards
Search load: 500 QPS, p99 < 200ms
Indexing: ~1000 docs/sec (product updates)
Hardware per data node:
- RAM: 32 GB (give 16 GB to ES heap, 16 GB for filesystem cache)
- Disk: 500 GB NVMe SSD (115 GB data × 2 for headroom + merging)
- CPU: 8 cores (search is CPU-intensive for scoring + aggs)
Cluster layout:
┌────────────────────────────────────────────────────┐
│ 3× Master nodes: 4 CPU, 8 GB RAM (lightweight) │
│ 3× Data nodes: 8 CPU, 32 GB RAM, 500 GB SSD │
│ 2× Coord nodes: 4 CPU, 16 GB RAM (merge/sort) │
│ │
│ Total: 8 nodes │
│ Monthly cost (cloud): ~$2,500–4,000 │
└────────────────────────────────────────────────────┘
Rule of thumb for heap:
- Never exceed 50% of physical RAM for ES heap
- Never exceed 32 GB heap (beyond 32 GB, JVM loses compressed oops)
- Leave remaining RAM for filesystem cache (Lucene segments love it)
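The same arithmetic can be captured in a quick back-of-the-envelope helper — the numbers below mirror the example above and are assumptions to adjust for your own workload:

# Back-of-the-envelope sizing from the rules of thumb above (all inputs are estimates).
import math

raw_gb = 100                # raw source data
index_overhead = 1.15       # ~10-15% Elasticsearch/Lucene overhead
replicas = 1
target_shard_gb = 30

stored_gb = raw_gb * index_overhead                      # ~115 GB on disk per copy
total_disk_gb = stored_gb * (1 + replicas)               # ~230 GB including 1 replica
primary_shards = math.ceil(stored_gb / target_shard_gb)  # 4 primaries
total_shards = primary_shards * (1 + replicas)           # 8 shards overall

print(f"stored: {stored_gb:.0f} GB, disk: {total_disk_gb:.0f} GB, "
      f"primaries: {primary_shards}, total shards: {total_shards}")
# stored: 115 GB, disk: 230 GB, primaries: 4, total shards: 8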
Key Metrics to Monitor
- Cluster health: Green (all shards allocated), Yellow (replicas missing), Red (primary shards missing — data loss risk)
- JVM heap usage: Should stay below 75%. GC pauses above 1s indicate heap pressure.
- Search latency: _nodes/stats reports query and fetch phase timings. Alert on p95 > SLA.
- Indexing rate and rejections: Bulk thread pool rejections mean the cluster can't keep up with writes.
- Segment count: High segment counts (>50 per shard) increase search latency and heap usage.
- Disk watermarks: ES stops allocating shards at 85% disk (low watermark) and relocates at 90% (high watermark).
// Essential monitoring queries
GET _cluster/health?pretty
GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,disk.used_percent
GET _cat/indices?v&h=index,health,docs.count,store.size,pri.store.size&s=store.size:desc
GET _cat/thread_pool?v&h=name,active,rejected,completed&s=rejected:desc
GET _nodes/stats/jvm?filter_path=nodes.*.jvm.mem
Use Cases
Product Search
The canonical Elasticsearch use case. E-commerce platforms like eBay, Etsy, and Shopify use Elasticsearch to power product search with faceted navigation, autocomplete, and personalized ranking.
// Complete product search with facets and highlighting
GET /products/_search
{
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "wireless headphones",
"fields": ["name^3", "brand^2", "description"],
"fuzziness": "AUTO"
}
}
],
"filter": [
{ "term": { "in_stock": true } },
{ "range": { "price": { "gte": 20, "lte": 300 } } }
]
}
},
"functions": [
{ "field_value_factor": { "field": "rating", "modifier": "log1p" } }
]
}
},
"highlight": {
"fields": {
"name": { "pre_tags": ["<mark>"], "post_tags": ["</mark>"] },
"description": { "fragment_size": 150, "number_of_fragments": 2 }
}
},
"aggs": {
"brands": { "terms": { "field": "brand", "size": 20 } },
"categories": { "terms": { "field": "category", "size": 15 } },
"price_ranges": {
"range": {
"field": "price",
"ranges": [
{ "key": "Under $50", "to": 50 },
{ "key": "$50–$100", "from": 50, "to": 100 },
{ "key": "$100–$200", "from": 100, "to": 200 },
{ "key": "$200+", "from": 200 }
]
}
}
},
"from": 0,
"size": 20
}
Log Analysis (ELK Stack)
The ELK Stack (Elasticsearch, Logstash, Kibana) — or the modern Elastic Stack — is the most popular open-source log analytics platform. Organizations ingest billions of log events per day for operational monitoring, troubleshooting, and security analytics.
Architecture:
Applications → Filebeat (lightweight shipper)
→ Logstash (parse, transform, enrich)
→ Elasticsearch (store + index)
→ Kibana (visualize + dashboard)
# Logstash pipeline for nginx access logs
input {
beats { port => 5044 }
}
filter {
grok {
match => {
"message" => '%{IPORHOST:client_ip} - %{DATA:user} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:bytes}'
}
}
date {
match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
target => "@timestamp"
}
geoip { source => "client_ip" }
}
output {
elasticsearch {
hosts => ["https://es-cluster:9200"]
index => "nginx-logs-%{+yyyy.MM.dd}"
}
}
// Query: Find 500 errors in the last hour
GET /nginx-logs-*/_search
{
"query": {
"bool": {
"filter": [
{ "range": { "@timestamp": { "gte": "now-1h" } } },
{ "range": { "status": { "gte": 500, "lt": 600 } } }
]
}
},
"aggs": {
"errors_over_time": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "5m"
}
},
"top_error_paths": {
"terms": { "field": "request.keyword", "size": 10 }
}
},
"sort": [{ "@timestamp": "desc" }],
"size": 50
}
Autocomplete / Search-as-You-Type
Autocomplete requires special indexing strategies because the user is typing partial words. Two main approaches:
// Approach 1: Edge N-grams (index-time prefix generation)
// "headphones" → ["he", "hea", "head", "headp", "headph", ...]
PUT /autocomplete
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 15
}
},
"analyzer": {
"autocomplete_index": {
"tokenizer": "standard",
"filter": ["lowercase", "autocomplete_filter"]
},
"autocomplete_search": {
"tokenizer": "standard",
"filter": ["lowercase"]
}
}
}
},
"mappings": {
"properties": {
"suggest": {
"type": "text",
"analyzer": "autocomplete_index",
"search_analyzer": "autocomplete_search"
}
}
}
}
// Approach 2: Completion Suggester (optimized in-memory FST)
PUT /products
{
"mappings": {
"properties": {
"name_suggest": {
"type": "completion",
"contexts": [
{ "name": "category", "type": "category" }
]
}
}
}
}
// Index with suggestion
POST /products/_doc
{
"name": "Sony WH-1000XM5",
"name_suggest": {
"input": ["Sony WH-1000XM5", "WH-1000XM5", "Sony headphones"],
"weight": 42,
"contexts": { "category": "headphones" }
}
}
// Query suggestions (< 5ms response time)
POST /products/_search
{
"suggest": {
"product_suggest": {
"prefix": "son",
"completion": {
"field": "name_suggest",
"size": 5,
"fuzzy": { "fuzziness": 1 },
"contexts": { "category": "headphones" }
}
}
}
}
Summary and Best Practices
Do's and Don'ts
| ✅ Do | ❌ Don't |
|---|---|
| Use keyword type for exact-match fields (status, category) | Use term query on text fields (case mismatch) |
| Put filters in filter context (cacheable, no scoring) | Put non-scoring conditions in must (wastes CPU on scoring) |
| Use bulk API for indexing (1,000–5,000 docs per batch) | Index documents one at a time (massive overhead per request) |
| Set explicit mappings before indexing | Rely on dynamic mapping in production (type mismatches) |
| Use aliases for zero-downtime reindexing | Point applications directly at index names |
| Monitor JVM heap, keep <75% | Set heap >32 GB (loses compressed oops) |
| Use ILM for time-series data | Keep all logs in hot storage forever |
| Test analyzers with _analyze API before production | Deploy analyzer changes without verification |
When to Use Elasticsearch (and When Not To)
| Use Elasticsearch For | Don't Use Elasticsearch For |
|---|---|
| Full-text search with relevance ranking | Primary data store (no ACID transactions) |
| Log and event analytics | Strict relational joins across entities |
| Autocomplete and search suggestions | Write-heavy OLTP workloads |
| Geospatial search | Financial transactions requiring strong consistency |
| Real-time dashboards and metrics | Blob/binary storage |
| Security analytics (SIEM) | Graph traversal (use Neo4j) |
Key Numbers to Remember
- Shard size: 10–50 GB per shard
- Heap: ≤50% of RAM, ≤32 GB
- Refresh interval: 1s default (NRT delay)
- Bulk size: 5–15 MB per request optimal
- Shards per node: <20 per GB heap
- Disk watermark: 85% low, 90% high, 95% flood
- BM25 defaults: k1=1.2, b=0.75
- Cluster minimum for HA: 3 master-eligible nodes
In the next post, we'll explore Object Storage — how systems like Amazon S3 store and serve unstructured data at planetary scale, and how they complement search engines for media-heavy applications.