PUT /my-index { "mappings": { "properties": { "content": { "type": "text", "analyzer": "standard" } } } }

| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Crawler / Ingestor | Fetches documents from sources (web, DB, file system) | Scrapy, Apache Nutch, custom HTTP workers, Kafka consumers | Horizontal -- add workers; rate-limit per domain; priority queues |
| Document Processor | Normalizes raw documents: HTML stripping, language detection, deduplication | Apache Tika, custom ETL, LangChain document loaders | Horizontal -- stateless workers behind a message queue |
| Tokenizer / Analyzer | Splits text into tokens, stemming, stop-word removal, synonyms | Lucene analyzers, ICU tokenizer, custom language-specific | Co-located with indexer; language-specific chains |
| Inverted Index | Maps terms to document IDs + positions for fast full-text lookup | Elasticsearch, OpenSearch, Solr, Lucene, Tantivy (Rust) | Shard by document ID (hash-based); replicate for read throughput |
| Vector Index | Stores dense embeddings for semantic/kNN search | FAISS, HNSW (Lucene 9+), Milvus, Pinecone, Weaviate, Qdrant | Partition by vector space; requires GPU or high-RAM nodes |
| Query Parser | Interprets user query: tokenization, spell correction, expansion | Elasticsearch Query DSL, custom NLP pipeline, LLM-based | Stateless -- horizontal scale behind load balancer |
| Query Router | Distributes query to shards and merges partial results (scatter-gather) | ES coordinating node, custom gRPC fan-out | Dedicated coordinating nodes; increase for fan-out width |
| Ranking Engine | Scores and orders results by relevance (BM25, TF-IDF, learned) | BM25 (default), LTR plugins, RankNet, LambdaMART | CPU-bound -- scale vertically or offload to ranking service |
| Result Cache | Caches frequent query results to reduce compute | Redis, Memcached, Elasticsearch request cache | TTL-based invalidation; shard-level caching |
| Autocomplete / Suggest | Provides typeahead suggestions as user types | ES completion suggester, Trie-based, prefix index | Separate lightweight index; <50ms p99 target |
| Relevance Feedback Loop | Collects click signals, A/B test results to tune ranking | Kafka + ClickHouse, ES LTR, custom analytics | Event streaming -- scale consumers independently |
| Index Manager | Handles index lifecycle: creation, aliasing, reindexing, retention | Elasticsearch ILM, custom cron, Curator | Automate with ILM policies; alias-based zero-downtime reindex |
| Monitoring | Tracks query latency, indexing throughput, shard health | Prometheus + Grafana, Elastic APM, Datadog | Alert on p99 latency, indexing lag, shard imbalance |
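The Query Router row above describes scatter-gather: fan the query out to every shard, then merge the per-shard top-k lists into a global ranking. A minimal merge sketch (`scatter_gather` is a hypothetical helper; it assumes each shard returns its hits already sorted by descending score):

```python
import heapq
import itertools

def scatter_gather(shard_results, size=10):
    """Merge per-shard top-k lists of (score, doc_id) into a global top-k.

    Each shard's list is assumed already sorted by descending score,
    so a k-way merge recovers the global order without a full re-sort.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(itertools.islice(merged, size))
```

In a real deployment the fan-out is an RPC (e.g. gRPC) and each shard returns only its local top `size` hits, which bounds merge work to shards × size.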
START
├── Expected corpus size?
│ ├── <1M documents, <100 QPS
│ │ ├── Full-text only → Single-node Elasticsearch or Meilisearch/Typesense
│ │ └── Semantic search → Single-node with HNSW (Elasticsearch 8.x kNN)
│ ├── 1M-100M documents, 100-10K QPS
│ │ ├── Keyword dominant → Multi-node Elasticsearch (3-10 nodes)
│ │ ├── Hybrid needed → ES 8.x with kNN + BM25 fusion
│ │ └── Real-time indexing → Dedicated ingest nodes + hot-warm architecture
│ └── >100M documents, >10K QPS
│ ├── Web-scale → Custom: Kafka → Spark → Lucene shards (Google-style)
│ ├── E-commerce → ES + Learning to Rank + dedicated vector index
│ └── Log/event → OpenSearch with hot-warm-cold tiers + rollover
├── Latency requirement?
│ ├── <50ms (autocomplete) → Dedicated completion index, prefix queries
│ ├── <200ms (standard) → Standard ES with result caching
│ └── <1s (analytics) → Aggregation-heavy, consider materialized views
└── DEFAULT → Managed Elasticsearch (AWS OpenSearch / Elastic Cloud)
Design your index mapping with explicit field types. Letting Elasticsearch auto-detect types leads to suboptimal mappings. [src2]
PUT /products
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"analysis": {
"analyzer": {
"content_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "stop", "snowball"]
}
}
}
},
"mappings": {
"properties": {
"title": { "type": "text", "analyzer": "content_analyzer", "boost": 2.0 },
"description": { "type": "text", "analyzer": "content_analyzer" },
"category": { "type": "keyword" },
"price": { "type": "float" },
"created_at": { "type": "date" },
"embedding": { "type": "dense_vector", "dims": 768, "index": true, "similarity": "cosine" }
}
}
}
Verify: GET /products/_mapping -- expected: mapping with all fields as defined.
Create a document processing pipeline that normalizes, deduplicates, and enriches documents before indexing. [src3]
from elasticsearch import Elasticsearch, helpers
import hashlib

es = Elasticsearch("http://localhost:9200")

def process_document(raw_doc):
    content = raw_doc["content"].strip()
    return {
        "_index": "products",
        "_id": hashlib.sha256(raw_doc["url"].encode()).hexdigest()[:16],  # deterministic ID dedups by URL
        "title": raw_doc["title"],
        "description": content,
        "category": raw_doc.get("category", "uncategorized"),
        "created_at": raw_doc["timestamp"],
    }

def bulk_index(documents, chunk_size=500):
    actions = [process_document(doc) for doc in documents]
    success, errors = helpers.bulk(es, actions, chunk_size=chunk_size, raise_on_error=False)
    return errors
Verify: GET /products/_count -- expected: count matching ingested documents.
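The pipeline above deduplicates only by URL hash, so near-identical pages reachable at different URLs still get indexed twice. A content-fingerprint pass is one way to catch those (a sketch; `dedup_by_content` is a hypothetical helper, and exact-hash matching misses near-duplicates, for which techniques like MinHash are typically used):

```python
import hashlib

def dedup_by_content(docs):
    """Keep only the first document seen for each distinct content fingerprint.

    The fingerprint is a SHA-256 of whitespace-normalized, lowercased content,
    so trivial formatting differences do not defeat deduplication.
    """
    seen, unique = set(), []
    for doc in docs:
        normalized = " ".join(doc["content"].split()).lower()
        fingerprint = hashlib.sha256(normalized.encode()).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(doc)
    return unique
```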
Transform raw user queries into structured search requests; fuzziness: "AUTO" provides typo tolerance (a lightweight stand-in for spell correction), and synonym expansion can be added via a query-time analyzer. [src5]
def build_search_query(user_query, filters=None, page=0, size=10):
    must_clauses = [{
        "multi_match": {
            "query": user_query,
            "fields": ["title^3", "description"],
            "type": "best_fields",
            "fuzziness": "AUTO",
            "prefix_length": 2
        }
    }]
    filter_clauses = []
    if filters:
        if "category" in filters:
            filter_clauses.append({"term": {"category": filters["category"]}})
        if "price_max" in filters:
            filter_clauses.append({"range": {"price": {"lte": filters["price_max"]}}})
    return {
        "query": {"bool": {"must": must_clauses, "filter": filter_clauses}},
        "from": page * size, "size": size,
        "highlight": {"fields": {"title": {}, "description": {"fragment_size": 150}}}
    }
Verify: POST /products/_search with query body -- expected: results with _score and highlight.
Combine keyword search (BM25) with vector search (kNN) for hybrid relevance. [src6]
def hybrid_search(user_query, query_vector, keyword_weight=0.7, vector_weight=0.3, size=10):
    return {
        "query": {"bool": {"should": [{
            "multi_match": {"query": user_query, "fields": ["title^3", "description"], "boost": keyword_weight}
        }]}},
        "knn": {
            "field": "embedding", "query_vector": query_vector,
            "k": size, "num_candidates": size * 10, "boost": vector_weight
        },
        "size": size
    }
Verify: Results include semantically similar documents, not just exact keyword matches.
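An alternative to the weighted-boost fusion above is reciprocal rank fusion (RRF), which combines the two rankings without comparing raw BM25 and cosine scores (they live on incommensurable scales). A client-side sketch; `rrf_fuse` is a hypothetical helper, and newer Elasticsearch releases also ship a native RRF retriever:

```python
def rrf_fuse(keyword_ids, vector_ids, k=60, size=10):
    """Fuse two ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF paper. Documents ranked
    highly in both lists accumulate the largest fused score.
    """
    scores = {}
    for ranked in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return [doc_id for doc_id, _ in sorted(scores.items(), key=lambda x: -x[1])][:size]
```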
Set up cluster topology and index lifecycle for the expected load; the ILM policy below rolls indices through hot, warm, and delete phases. [src2] [src7]
PUT _ilm/policy/search-lifecycle
{
"policy": {
"phases": {
"hot": { "actions": { "rollover": { "max_size": "50gb", "max_age": "7d" } } },
"warm": { "min_age": "30d", "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } } },
"delete": { "min_age": "90d", "actions": { "delete": {} } }
}
}
}
Verify: GET _cat/shards/products?v -- expected: shards distributed with status STARTED.
Add caching for popular queries and monitoring for operational visibility. [src3]
import redis, json, hashlib, time

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 300  # 5 minutes

def cached_search(es, query_body, index="products"):
    cache_key = f"search:{hashlib.md5(json.dumps(query_body, sort_keys=True).encode()).hexdigest()}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    start = time.monotonic()
    results = es.search(index=index, body=query_body)
    latency_ms = (time.monotonic() - start) * 1000  # report to your metrics system
    if results["hits"]["total"]["value"] > 0:  # only cache non-empty result sets
        cache.setex(cache_key, CACHE_TTL, json.dumps(dict(results)))
    return results
Verify: Second identical query returns in <1ms (cache hit).
# Input: List of (doc_id, text) tuples
# Output: Inverted index mapping terms to doc_ids with positions
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Build a simple inverted index with term positions."""
    index = defaultdict(list)
    for doc_id, text in documents:
        tokens = re.findall(r'\w+', text.lower())
        term_positions = defaultdict(list)
        for pos, token in enumerate(tokens):
            term_positions[token].append(pos)
        for term, positions in term_positions.items():
            index[term].append((doc_id, positions))
    return dict(index)

def search(index, query):
    """Search the inverted index, return doc_ids ranked by term frequency."""
    tokens = re.findall(r'\w+', query.lower())
    doc_scores = defaultdict(int)
    for token in tokens:
        for doc_id, positions in index.get(token, []):
            doc_scores[doc_id] += len(positions)
    return sorted(doc_scores.items(), key=lambda x: -x[1])

# Usage
docs = [(1, "search engine design"), (2, "search architecture patterns"), (3, "database engine internals")]
idx = build_inverted_index(docs)
results = search(idx, "search engine")
# Returns: [(1, 2), (2, 1), (3, 1)]
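The index above stores term positions, but search() never uses them. Positions are what make phrase queries possible: a phrase matches when each token appears at consecutive offsets. A sketch (`phrase_search` is a hypothetical helper using the same term → [(doc_id, positions)] layout as build_inverted_index):

```python
import re

def phrase_search(index, phrase):
    """Return doc_ids containing the phrase's tokens at consecutive positions."""
    tokens = re.findall(r'\w+', phrase.lower())
    if not tokens:
        return []
    # Candidate start positions per doc, seeded from the first token's postings.
    candidates = {doc_id: set(positions) for doc_id, positions in index.get(tokens[0], [])}
    for offset, token in enumerate(tokens[1:], start=1):
        postings = {doc_id: set(p) for doc_id, p in index.get(token, [])}
        # Keep only starts where this token appears exactly `offset` positions later.
        candidates = {
            doc_id: {start for start in starts if start + offset in postings[doc_id]}
            for doc_id, starts in candidates.items()
            if doc_id in postings
        }
    return [doc_id for doc_id, starts in candidates.items() if starts]
```

This is the same positional-intersection idea Lucene uses for match_phrase queries, minus slop and scoring.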
// Input: User search query string
// Output: Ranked results with highlighted snippets
POST /products/_search
{
"query": {
"bool": {
"must": [{
"multi_match": {
"query": "wireless noise cancelling",
"fields": ["title^3", "description^1", "category^0.5"],
"type": "best_fields",
"fuzziness": "AUTO"
}
}],
"filter": [
{ "range": { "price": { "lte": 300 } } },
{ "term": { "in_stock": true } }
]
}
},
"highlight": {
"fields": {
"title": { "number_of_fragments": 1 },
"description": { "fragment_size": 120, "number_of_fragments": 3 }
}
},
"size": 10, "from": 0
}
// BAD -- one shard with 500GB of data
PUT /my-index { "settings": { "number_of_shards": 1 } }
// Query latency degrades to seconds as shard grows beyond 50GB
// GOOD -- target 10-50GB per shard [src2]
PUT /my-index {
"settings": { "number_of_shards": 5, "number_of_replicas": 1 }
}
// 250GB total / 50GB target = 5 shards; use rollover for time-series
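The sizing arithmetic above generalizes to a one-liner (a sketch; the 10-50GB-per-shard target is the guideline cited in [src2]):

```python
import math

def shard_count(total_index_gb, target_shard_gb=50):
    """Primary shard count keeping each shard at or under the target size."""
    return max(1, math.ceil(total_index_gb / target_shard_gb))
```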
# BAD -- fetches ALL documents then filters in application code
results = es.search(index="products", body={"query": {"match_all": {}}, "size": 10000})
filtered = [r for r in results["hits"]["hits"] if "headphones" in r["_source"]["title"]]
# GOOD -- Elasticsearch uses inverted index for efficient filtering [src6]
results = es.search(index="products", body={
"query": {"bool": {
"must": [{"match": {"title": "headphones"}}],
"filter": [{"range": {"price": {"lte": 200}}}]
}}, "size": 10
})
# BAD -- heavy bulk indexing causes GC pauses that spike query latency
# Both indexing and search compete for same heap, I/O, and CPU
node.roles: [master, data, ingest] # single node does everything
# GOOD -- isolate workloads [src7]
# Ingest node:
node.roles: [ingest, data_hot]
# Coordinating-only node (query routing only; signaled by an empty roles list):
node.roles: [ ]
# Data nodes (serve queries from local shards):
node.roles: [data_hot]
// BAD -- no fault tolerance AND lower query throughput
PUT /my-index/_settings { "index": { "number_of_replicas": 0 } }
// One node failure = data loss
// GOOD -- replicas serve reads and provide fault tolerance [src2]
PUT /my-index/_settings { "index": { "number_of_replicas": 1 } }
// ES load-balances reads across primary + replica; increase for read-heavy
- Field indexed with the standard analyzer but queried as keyword yields zero results. Fix: verify with GET /_analyze that both index time and query time produce the expected tokens. [src4]
- Deep pagination with from/size degrades sharply at high offsets. Fix: use search_after with a sort tiebreaker for deep pagination. [src6]
- Default refresh settings throttle bulk indexing. Fix: set refresh_interval: 30s or -1 during bulk, then restore. [src4]
- Uncontrolled dynamic mapping causes mapping explosion. Fix: set "dynamic": "strict" and explicitly map fields. [src2]
- Heavy boosting of title causes short-title docs to dominate. Fix: use multi_match with type: "cross_fields" to balance scoring. [src5]
- Shards pile up on a subset of nodes. Fix: check _cat/allocation and enable zone-based shard allocation awareness. [src2]

# Check cluster health and shard allocation
curl -s localhost:9200/_cluster/health?pretty
curl -s "localhost:9200/_cat/shards?v&s=store:desc"
# Check index mapping and settings
curl -s localhost:9200/my-index/_mapping?pretty
curl -s localhost:9200/my-index/_settings?pretty
# Analyze how text is tokenized
curl -s -XPOST localhost:9200/my-index/_analyze -H 'Content-Type: application/json' \
-d '{"analyzer":"standard","text":"search engine architecture"}'
# Monitor indexing throughput and latency
curl -s localhost:9200/_nodes/stats/indices?pretty | grep -E "indexing|search"
# Check segment count (too many = needs merge)
curl -s "localhost:9200/_cat/segments/my-index?v&s=size:desc"
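The search_after fix for deep pagination can be sketched as a client-side loop. `search_fn` here is an injected wrapper around `es.search` that returns the `hits["hits"]` list, an assumption made so the loop stays self-contained; the sort must end in a unique tiebreaker field:

```python
def paginate(search_fn, sort, page_size=100):
    """Yield every hit by walking pages with search_after.

    search_fn takes a query body and returns the list of hit dicts;
    each hit carries a "sort" key whose value seeds the next page.
    """
    search_after = None
    while True:
        body = {"size": page_size, "sort": sort}
        if search_after is not None:
            body["search_after"] = search_after
        hits = search_fn(body)
        if not hits:
            return
        yield from hits
        search_after = hits[-1]["sort"]
```

Unlike from/size, each page costs the same regardless of depth, because shards only collect documents after the sort cursor rather than scoring everything up to the offset.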
| Version | Status | Breaking Changes | Migration Notes |
|---|---|---|---|
| Elasticsearch 8.x | Current (2022-present) | Security on by default; Lucene 9.x; native kNN | Enable security from day 1; vector search via dense_vector |
| Elasticsearch 7.x | Maintenance (2019-2024) | Type removal; single type per index | Remove type from API calls; upgrade for vector search |
| OpenSearch 2.x | Current (AWS fork) | Forked from ES 7.10; neural search plugin | Use OpenSearch-specific APIs; not fully ES 8.x compatible |
| Apache Solr 9.x | Current (2022-present) | Module system; requires Java 11+ | Migrate plugins to module architecture |
| Lucene 9.x | Current (2021-present) | New HNSW vector search; removed deprecated APIs | Most users access through ES/Solr, not directly |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Full-text search across large document corpus (>10K docs) | Exact key-value lookups on structured data | PostgreSQL with B-tree indexes or Redis |
| Relevance-ranked results needed (not just filtering) | Simple LIKE/regex search on <10K rows | SQL LIKE or ILIKE queries |
| Sub-200ms search latency at scale | Write-heavy, read-light workload | OLTP database (PostgreSQL, MySQL) |
| Faceted search with aggregations (e-commerce) | Single-user, single-machine application | SQLite FTS5 or in-process Lucene |
| Autocomplete / typeahead functionality | Only need vector similarity search | Dedicated vector DB (Pinecone, Weaviate) |
| Log search and analytics (ELK stack) | ACID transactional consistency required | PostgreSQL -- search engines are eventually consistent |