Search Engine Architecture: System Design Guide

Type: Software Reference | Confidence: 0.91 | Sources: 7 | Verified: 2026-02-23 | Freshness: 2026-02-23

TL;DR

Constraints

Quick Reference

| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Crawler / Ingestor | Fetches documents from sources (web, DB, file system) | Scrapy, Apache Nutch, custom HTTP workers, Kafka consumers | Horizontal -- add workers; rate-limit per domain; priority queues |
| Document Processor | Normalizes raw documents: HTML stripping, language detection, deduplication | Apache Tika, custom ETL, LangChain document loaders | Horizontal -- stateless workers behind a message queue |
| Tokenizer / Analyzer | Splits text into tokens, stemming, stop-word removal, synonyms | Lucene analyzers, ICU tokenizer, custom language-specific | Co-located with indexer; language-specific chains |
| Inverted Index | Maps terms to document IDs + positions for fast full-text lookup | Elasticsearch, OpenSearch, Solr, Lucene, Tantivy (Rust) | Shard by document ID (hash-based); replicate for read throughput |
| Vector Index | Stores dense embeddings for semantic/kNN search | FAISS, HNSW (Lucene 9+), Milvus, Pinecone, Weaviate, Qdrant | Partition by vector space; requires GPU or high-RAM nodes |
| Query Parser | Interprets user query: tokenization, spell correction, expansion | Elasticsearch Query DSL, custom NLP pipeline, LLM-based | Stateless -- horizontal scale behind load balancer |
| Query Router | Distributes query to shards and merges partial results (scatter-gather) | ES coordinating node, custom gRPC fan-out | Dedicated coordinating nodes; increase for fan-out width |
| Ranking Engine | Scores and orders results by relevance (BM25, TF-IDF, learned) | BM25 (default), LTR plugins, RankNet, LambdaMART | CPU-bound -- scale vertically or offload to ranking service |
| Result Cache | Caches frequent query results to reduce compute | Redis, Memcached, Elasticsearch request cache | TTL-based invalidation; shard-level caching |
| Autocomplete / Suggest | Provides typeahead suggestions as user types | ES completion suggester, Trie-based, prefix index | Separate lightweight index; <50ms p99 target |
| Relevance Feedback Loop | Collects click signals, A/B test results to tune ranking | Kafka + ClickHouse, ES LTR, custom analytics | Event streaming -- scale consumers independently |
| Index Manager | Handles index lifecycle: creation, aliasing, reindexing, retention | Elasticsearch ILM, custom cron, Curator | Automate with ILM policies; alias-based zero-downtime reindex |
| Monitoring | Tracks query latency, indexing throughput, shard health | Prometheus + Grafana, Elastic APM, Datadog | Alert on p99 latency, indexing lag, shard imbalance |
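The scatter-gather pattern in the Query Router row can be sketched as a top-k merge: each shard returns its own score-sorted partial result list, and a coordinating process merges them with a heap. A toy Python sketch (the shard contents and scores below are made up for illustration):

```python
import heapq

def merge_shard_results(shard_results, k=10):
    """Merge per-shard (score, doc_id) lists, each sorted by score descending,
    into a single global top-k -- what a coordinating node does after fan-out."""
    # heapq.merge expects ascending inputs, so merge on negated scores
    merged = heapq.merge(*[[(-score, doc) for score, doc in shard] for shard in shard_results])
    return [(-neg_score, doc) for neg_score, doc in list(merged)[:k]]

# Usage: three shards, each already sorted by score descending
shards = [
    [(0.92, "doc-a"), (0.40, "doc-b")],
    [(0.88, "doc-c"), (0.75, "doc-d")],
    [(0.95, "doc-e")],
]
print(merge_shard_results(shards, k=3))
# [(0.95, 'doc-e'), (0.92, 'doc-a'), (0.88, 'doc-c')]
```

Because each shard's list is already sorted, the coordinator never has to re-sort everything; fan-out width (number of shards queried) is the main cost lever, which is why the table recommends dedicated coordinating nodes.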

Decision Tree

START
├── Expected corpus size?
│   ├── <1M documents, <100 QPS
│   │   ├── Full-text only → Single-node Elasticsearch or Meilisearch/Typesense
│   │   └── Semantic search → Single-node with HNSW (Elasticsearch 8.x kNN)
│   ├── 1M-100M documents, 100-10K QPS
│   │   ├── Keyword dominant → Multi-node Elasticsearch (3-10 nodes)
│   │   ├── Hybrid needed → ES 8.x with kNN + BM25 fusion
│   │   └── Real-time indexing → Dedicated ingest nodes + hot-warm architecture
│   └── >100M documents, >10K QPS
│       ├── Web-scale → Custom: Kafka → Spark → Lucene shards (Google-style)
│       ├── E-commerce → ES + Learning to Rank + dedicated vector index
│       └── Log/event → OpenSearch with hot-warm-cold tiers + rollover
├── Latency requirement?
│   ├── <50ms (autocomplete) → Dedicated completion index, prefix queries
│   ├── <200ms (standard) → Standard ES with result caching
│   └── <1s (analytics) → Aggregation-heavy, consider materialized views
└── DEFAULT → Managed Elasticsearch (AWS OpenSearch / Elastic Cloud)
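The autocomplete branch above relies on prefix lookup. The core data structure behind a dedicated completion index can be sketched as a character trie; this is a toy stand-in, not how the ES completion suggester is implemented internally:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

class PrefixTrie:
    """Toy prefix index: insert completed queries, suggest by prefix."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase):
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def suggest(self, prefix, limit=5):
        # Walk down to the prefix node, then DFS in lexicographic order
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results, stack = [], [(node, prefix)]
        while stack and len(results) < limit:
            cur, text = stack.pop()
            if cur.is_word:
                results.append(text)
            for ch in sorted(cur.children, reverse=True):
                stack.append((cur.children[ch], text + ch))
        return results

trie = PrefixTrie()
for q in ["search engine", "search index", "shard", "semantic search"]:
    trie.insert(q)
print(trie.suggest("se"))  # ['search engine', 'search index', 'semantic search']
```

Lookups cost O(prefix length) plus the suggestion walk, which is why a separate lightweight index can meet the <50ms p99 target without touching the main search cluster.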

Step-by-Step Guide

1. Define the document schema and mapping

Design your index mapping with explicit field types. Relying on dynamic mapping lets Elasticsearch guess types from the first document it sees, which often produces suboptimal mappings. Note that index-time field boosts in mappings were removed in Elasticsearch 8.x; apply boosts at query time instead (e.g., "title^3" in a multi_match). [src2]

PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "content_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title":       { "type": "text", "analyzer": "content_analyzer", "boost": 2.0 },
      "description": { "type": "text", "analyzer": "content_analyzer" },
      "category":    { "type": "keyword" },
      "price":       { "type": "float" },
      "created_at":  { "type": "date" },
      "embedding":   { "type": "dense_vector", "dims": 768, "index": true, "similarity": "cosine" }
    }
  }
}

Verify: GET /products/_mapping -- expected: mapping with all fields as defined.

2. Build the ingestion pipeline

Create a document processing pipeline that normalizes, deduplicates, and enriches documents before indexing. [src3]

from elasticsearch import Elasticsearch, helpers
import hashlib

es = Elasticsearch("http://localhost:9200")

def process_document(raw_doc):
    content = raw_doc["content"].strip()
    return {
        "_index": "products",
        "_id": hashlib.sha256(raw_doc["url"].encode()).hexdigest()[:16],
        "title": raw_doc["title"],
        "description": content,
        "category": raw_doc.get("category", "uncategorized"),
        "created_at": raw_doc["timestamp"],
    }

def bulk_index(documents, chunk_size=500):
    actions = [process_document(doc) for doc in documents]
    success, errors = helpers.bulk(es, actions, chunk_size=chunk_size, raise_on_error=False)
    return errors

Verify: GET /products/_count -- expected: count matching ingested documents.
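The Quick Reference lists deduplication as a Document Processor concern, but process_document above dedupes by URL hash only, so the same body republished under a different URL is indexed twice. A content-hash pre-filter catches that case; this is a sketch, and the whitespace/case normalization rule is an assumption to adapt to your corpus:

```python
import hashlib

def dedup_by_content(documents, seen_hashes=None):
    """Drop documents whose normalized body was already seen,
    even when they arrive under different URLs."""
    seen_hashes = seen_hashes if seen_hashes is not None else set()
    unique = []
    for doc in documents:
        # Normalize before hashing so trivial whitespace/case diffs don't defeat dedup
        normalized = " ".join(doc["content"].lower().split())
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique.append(doc)
    return unique

docs = [
    {"url": "a.com/1", "content": "Search Engine  Design"},
    {"url": "b.com/2", "content": "search engine design"},  # same body, new URL
]
print(len(dedup_by_content(docs)))  # 1
```

Passing a shared seen_hashes set (backed by Redis in production, for example) lets multiple stateless workers dedupe across batches.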

3. Implement query parsing and expansion

Transform raw user queries into structured search requests with spell correction and synonym expansion. [src5]

def build_search_query(user_query, filters=None, page=0, size=10):
    must_clauses = [{
        "multi_match": {
            "query": user_query,
            "fields": ["title^3", "description"],
            "type": "best_fields",
            "fuzziness": "AUTO",
            "prefix_length": 2
        }
    }]
    filter_clauses = []
    if filters:
        if "category" in filters:
            filter_clauses.append({"term": {"category": filters["category"]}})
        if "price_max" in filters:
            filter_clauses.append({"range": {"price": {"lte": filters["price_max"]}}})
    return {
        "query": { "bool": { "must": must_clauses, "filter": filter_clauses } },
        "from": page * size, "size": size,
        "highlight": { "fields": {"title": {}, "description": {"fragment_size": 150}} }
    }

Verify: POST /products/_search with query body -- expected: results with _score and highlight.
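The step mentions synonym expansion, but build_search_query only applies fuzziness. One way to layer expansion in front of it is a dictionary pass over the query string before it reaches the multi_match; the synonym map below is hypothetical, and in practice synonyms are often handled by an analyzer-side synonym filter instead:

```python
# Hypothetical, domain-specific synonym map
SYNONYMS = {
    "laptop": ["notebook"],
    "tv": ["television"],
}

def expand_query(user_query):
    """Append dictionary synonyms so the downstream query also matches synonym terms."""
    tokens = user_query.lower().split()
    expanded = list(tokens)
    for token in tokens:
        for synonym in SYNONYMS.get(token, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return " ".join(expanded)

print(expand_query("cheap laptop"))  # cheap laptop notebook
```

Query-time expansion like this is easy to change without reindexing; index-time synonym filters avoid per-query overhead but require a reindex when the map changes.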

4. Add vector search for semantic retrieval

Combine keyword search (BM25) with vector search (kNN) for hybrid relevance. [src6]

def hybrid_search(user_query, query_vector, keyword_weight=0.7, vector_weight=0.3, size=10):
    return {
        "query": { "bool": { "should": [{
            "multi_match": { "query": user_query, "fields": ["title^3", "description"], "boost": keyword_weight }
        }]}},
        "knn": {
            "field": "embedding", "query_vector": query_vector,
            "k": size, "num_candidates": size * 10, "boost": vector_weight
        },
        "size": size
    }

Verify: Results include semantically similar documents, not just exact keyword matches.
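The weighted-boost fusion above requires BM25 and vector scores to be on comparable scales. An alternative that sidesteps score calibration entirely is Reciprocal Rank Fusion, which recent Elasticsearch 8.x releases also expose natively; a from-scratch sketch of the merge:

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_n=10):
    """Merge ranked doc-id lists: each doc scores the sum of 1/(k + rank)
    over every list it appears in. k=60 is the conventional default."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage: merge a BM25 result list with a kNN result list
bm25_hits = ["d1", "d2", "d3"]
knn_hits = ["d1", "d3", "d4"]
print(reciprocal_rank_fusion([bm25_hits, knn_hits], top_n=3))
# ['d1', 'd3', 'd2']
```

Because only ranks matter, RRF is robust when one retriever's raw scores dwarf the other's, at the cost of losing score-magnitude information.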

5. Configure sharding and replication

Set up cluster topology based on expected load. [src2] [src7]

PUT _ilm/policy/search-lifecycle
{
  "policy": {
    "phases": {
      "hot":  { "actions": { "rollover": { "max_size": "50gb", "max_age": "7d" } } },
      "warm": { "min_age": "30d", "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}

Verify: GET _cat/shards/products?v -- expected: shards distributed with status STARTED.
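The 10-50GB per-shard target cited in the anti-patterns section can be turned into a small sizing helper. The 40GB target and 1.5x growth factor below are illustrative assumptions to adjust for your workload:

```python
import math

def recommend_shards(corpus_gb, growth_factor=1.5, target_gb_per_shard=40):
    """Primary shard count sized to the projected corpus at ~40GB per shard."""
    projected_gb = corpus_gb * growth_factor
    return max(1, math.ceil(projected_gb / target_gb_per_shard))

print(recommend_shards(250))  # 250GB * 1.5 growth = 375GB -> 10 primary shards
```

Shard count is fixed at index creation (short of reindexing or splitting), so sizing against projected rather than current data avoids an early reindex; for time-series data, rollover via the ILM policy above makes the projection unnecessary.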

6. Implement result caching and monitoring

Add caching for popular queries and monitoring for operational visibility. [src3]

import redis, json, hashlib, time

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 300  # 5 minutes

def cached_search(es, query_body, index="products"):
    cache_key = f"search:{hashlib.md5(json.dumps(query_body, sort_keys=True).encode()).hexdigest()}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    start = time.monotonic()
    results = es.search(index=index, body=query_body)
    latency_ms = (time.monotonic() - start) * 1000  # report to your metrics system
    if results["hits"]["total"]["value"] > 0:
        cache.setex(cache_key, CACHE_TTL, json.dumps(results))
    return results

Verify: Second identical query returns in <1ms (cache hit).

Code Examples

Python: Inverted Index from Scratch

# Input:  List of (doc_id, text) tuples
# Output: Inverted index mapping terms to doc_ids with positions

import re
from collections import defaultdict

def build_inverted_index(documents):
    """Build a simple inverted index with term positions."""
    index = defaultdict(list)
    for doc_id, text in documents:
        tokens = re.findall(r'\w+', text.lower())
        term_positions = defaultdict(list)
        for pos, token in enumerate(tokens):
            term_positions[token].append(pos)
        for term, positions in term_positions.items():
            index[term].append((doc_id, positions))
    return dict(index)

def search(index, query):
    """Search the inverted index, return doc_ids ranked by term frequency."""
    tokens = re.findall(r'\w+', query.lower())
    doc_scores = defaultdict(int)
    for token in tokens:
        for doc_id, positions in index.get(token, []):
            doc_scores[doc_id] += len(positions)
    return sorted(doc_scores.items(), key=lambda x: -x[1])

# Usage
docs = [(1, "search engine design"), (2, "search architecture patterns"), (3, "database engine internals")]
idx = build_inverted_index(docs)
results = search(idx, "search engine")
# Returns: [(1, 2), (2, 1), (3, 1)]
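The raw term-frequency ranking above ignores both term rarity and document length. BM25, the default scorer named in the Ranking Engine row, corrects for both; a from-scratch sketch over the same corpus, using Lucene's non-negative idf variant:

```python
import math
import re
from collections import Counter

def bm25_scores(documents, query, k1=1.2, b=0.75):
    """Score each (doc_id, text) document against the query with BM25."""
    tokenized = {doc_id: re.findall(r"\w+", text.lower()) for doc_id, text in documents}
    n = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized.values()) / n
    df = Counter()  # document frequency per term
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in re.findall(r"\w+", query.lower()):
            if term not in tf:
                continue
            # Rare terms get higher idf; long documents get a length penalty via b
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            )
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda x: -x[1])

docs = [(1, "search engine design"), (2, "search architecture patterns"), (3, "database engine internals")]
print(bm25_scores(docs, "search engine")[0][0])  # doc 1 ranks first: it matches both terms
```

k1 controls term-frequency saturation and b controls length normalization; the defaults shown are the common Lucene/Elasticsearch defaults.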

Elasticsearch: Query DSL Multi-Match with Highlighting

// Input:  User search query string
// Output: Ranked results with highlighted snippets

POST /products/_search
{
  "query": {
    "bool": {
      "must": [{
        "multi_match": {
          "query": "wireless noise cancelling",
          "fields": ["title^3", "description^1", "category^0.5"],
          "type": "best_fields",
          "fuzziness": "AUTO"
        }
      }],
      "filter": [
        { "range": { "price": { "lte": 300 } } },
        { "term": { "in_stock": true } }
      ]
    }
  },
  "highlight": {
    "fields": {
      "title": { "number_of_fragments": 1 },
      "description": { "fragment_size": 120, "number_of_fragments": 3 }
    }
  },
  "size": 10, "from": 0
}

Anti-Patterns

Wrong: Single giant shard for all documents

// BAD -- one shard with 500GB of data
PUT /my-index { "settings": { "number_of_shards": 1 } }
// Query latency degrades to seconds as shard grows beyond 50GB

Correct: Right-size shards with growth projection

// GOOD -- target 10-50GB per shard [src2]
PUT /my-index {
  "settings": { "number_of_shards": 5, "number_of_replicas": 1 }
}
// 250GB total / 50GB target = 5 shards; use rollover for time-series

Wrong: Using match_all with client-side filtering

# BAD -- fetches ALL documents then filters in application code
results = es.search(index="products", body={"query": {"match_all": {}}, "size": 10000})
filtered = [r for r in results["hits"]["hits"] if "headphones" in r["_source"]["title"]]

Correct: Push filtering to the query engine

# GOOD -- Elasticsearch uses inverted index for efficient filtering [src6]
results = es.search(index="products", body={
    "query": {"bool": {
        "must": [{"match": {"title": "headphones"}}],
        "filter": [{"range": {"price": {"lte": 200}}}]
    }}, "size": 10
})

Wrong: Colocating indexing and search on same nodes

# BAD -- heavy bulk indexing causes GC pauses that spike query latency
# Both indexing and search compete for same heap, I/O, and CPU
node.roles: [master, data, ingest]  # single node does everything

Correct: Separate ingest and query node roles

# GOOD -- isolate workloads [src7]
# Ingest node:
node.roles: [ingest, data_hot]
# Coordinating node (query routing only; an empty roles list means coordinating-only):
node.roles: [ ]
# Data nodes (serve queries from local shards):
node.roles: [data_hot]

Wrong: Disabling replicas to save resources

// BAD -- no fault tolerance AND lower query throughput
PUT /my-index/_settings { "index": { "number_of_replicas": 0 } }
// One node failure = data loss

Correct: Use replicas for HA and read throughput

// GOOD -- replicas serve reads and provide fault tolerance [src2]
PUT /my-index/_settings { "index": { "number_of_replicas": 1 } }
// ES load-balances reads across primary + replica; increase for read-heavy

Common Pitfalls

Diagnostic Commands

# Check cluster health and shard allocation
curl -s localhost:9200/_cluster/health?pretty
curl -s 'localhost:9200/_cat/shards?v&s=store:desc'

# Check index mapping and settings
curl -s localhost:9200/my-index/_mapping?pretty
curl -s localhost:9200/my-index/_settings?pretty

# Analyze how text is tokenized
curl -s -XPOST localhost:9200/my-index/_analyze -H 'Content-Type: application/json' \
  -d '{"analyzer":"standard","text":"search engine architecture"}'

# Monitor indexing throughput and latency
curl -s localhost:9200/_nodes/stats/indices?pretty | grep -E "indexing|search"

# Check segment count (too many = needs merge)
curl -s 'localhost:9200/_cat/segments/my-index?v&s=size:desc'

Version History & Compatibility

| Version | Status | Breaking Changes | Migration Notes |
|---|---|---|---|
| Elasticsearch 8.x | Current (2022-present) | Security on by default; Lucene 9.x; native kNN | Enable security from day 1; vector search via dense_vector |
| Elasticsearch 7.x | Maintenance (2019-2024) | Type removal; single type per index | Remove type from API calls; upgrade for vector search |
| OpenSearch 2.x | Current (AWS fork) | Forked from ES 7.10; neural search plugin | Use OpenSearch-specific APIs; not fully ES 8.x compatible |
| Apache Solr 9.x | Current (2022-present) | Module system; requires Java 11+ | Migrate plugins to module architecture |
| Lucene 9.x | Current (2021-present) | New HNSW vector search; removed deprecated APIs | Most users access through ES/Solr, not directly |

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|---|---|---|
| Full-text search across large document corpus (>10K docs) | Exact key-value lookups on structured data | PostgreSQL with B-tree indexes or Redis |
| Relevance-ranked results needed (not just filtering) | Simple LIKE/regex search on <10K rows | SQL LIKE or ILIKE queries |
| Sub-200ms search latency at scale | Write-heavy, read-light workload | OLTP database (PostgreSQL, MySQL) |
| Faceted search with aggregations (e-commerce) | Single-user, single-machine application | SQLite FTS5 or in-process Lucene |
| Autocomplete / typeahead functionality | Only need vector similarity search | Dedicated vector DB (Pinecone, Weaviate) |
| Log search and analytics (ELK stack) | ACID transactional consistency required | PostgreSQL -- search engines are eventually consistent |

Important Caveats

Related Units