PUT /my-index { "mappings": { "properties": { "content": { "type": "text", "analyzer": "standard" } } } }

| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Crawler / Ingestor | Fetches documents from sources (web, DB, file system) | Scrapy, Apache Nutch, custom HTTP workers, Kafka consumers | Horizontal -- add workers; rate-limit per domain; priority queues |
| Document Processor | Normalizes raw documents: HTML stripping, language detection, deduplication | Apache Tika, custom ETL, LangChain document loaders | Horizontal -- stateless workers behind a message queue |
| Tokenizer / Analyzer | Splits text into tokens, stemming, stop-word removal, synonyms | Lucene analyzers, ICU tokenizer, custom language-specific | Co-located with indexer; language-specific chains |
| Inverted Index | Maps terms to document IDs + positions for fast full-text lookup | Elasticsearch, OpenSearch, Solr, Lucene, Tantivy (Rust) | Shard by document ID (hash-based); replicate for read throughput |
| Vector Index | Stores dense embeddings for semantic/kNN search | FAISS, HNSW (Lucene 9+), Milvus, Pinecone, Weaviate, Qdrant | Partition by vector space; requires GPU or high-RAM nodes |
| Query Parser | Interprets user query: tokenization, spell correction, expansion | Elasticsearch Query DSL, custom NLP pipeline, LLM-based | Stateless -- horizontal scale behind load balancer |
| Query Router | Distributes query to shards and merges partial results (scatter-gather) | ES coordinating node, custom gRPC fan-out | Dedicated coordinating nodes; increase for fan-out width |
| Ranking Engine | Scores and orders results by relevance (BM25, TF-IDF, learned) | BM25 (default), LTR plugins, RankNet, LambdaMART | CPU-bound -- scale vertically or offload to ranking service |
| Result Cache | Caches frequent query results to reduce compute | Redis, Memcached, Elasticsearch request cache | TTL-based invalidation; shard-level caching |
| Autocomplete / Suggest | Provides typeahead suggestions as user types | ES completion suggester, Trie-based, prefix index | Separate lightweight index; <50ms p99 target |
| Relevance Feedback Loop | Collects click signals, A/B test results to tune ranking | Kafka + ClickHouse, ES LTR, custom analytics | Event streaming -- scale consumers independently |
| Index Manager | Handles index lifecycle: creation, aliasing, reindexing, retention | Elasticsearch ILM, custom cron, Curator | Automate with ILM policies; alias-based zero-downtime reindex |
| Monitoring | Tracks query latency, indexing throughput, shard health | Prometheus + Grafana, Elastic APM, Datadog | Alert on p99 latency, indexing lag, shard imbalance |
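The Query Router row above describes scatter-gather: fan the query out to every shard, then merge the per-shard top-k lists into a global ranking. A minimal merge sketch (`scatter_gather` is a hypothetical helper; it assumes each shard returns its hits already sorted by descending score):

```python
import heapq
import itertools

def scatter_gather(shard_results, size=10):
    """Merge per-shard top-k lists of (score, doc_id) into a global top-k.

    Each shard's list is assumed already sorted by descending score,
    so a k-way merge recovers the global order without a full re-sort.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(itertools.islice(merged, size))
```

In a real deployment the fan-out is an RPC (e.g. gRPC) and each shard returns only its local top `size` hits, which bounds merge work to shards × size.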
START
├── Expected corpus size?
│ ├── <1M documents, <100 QPS
│ │ ├── Full-text only → Single-node Elasticsearch or Meilisearch/Typesense
│ │ └── Semantic search → Single-node with HNSW (Elasticsearch 8.x kNN)
│ ├── 1M-100M documents, 100-10K QPS
│ │ ├── Keyword dominant → Multi-node Elasticsearch (3-10 nodes)
│ │ ├── Hybrid needed → ES 8.x with kNN + BM25 fusion
│ │ └── Real-time indexing → Dedicated ingest nodes + hot-warm architecture
│ └── >100M documents, >10K QPS
│ ├── Web-scale → Custom: Kafka → Spark → Lucene shards (Google-style)
│ ├── E-commerce → ES + Learning to Rank + dedicated vector index
│ └── Log/event → OpenSearch with hot-warm-cold tiers + rollover
├── Latency requirement?
│ ├── <50ms (autocomplete) → Dedicated completion index, prefix queries
│ ├── <200ms (standard) → Standard ES with result caching
│ └── <1s (analytics) → Aggregation-heavy, consider materialized views
└── DEFAULT → Managed Elasticsearch (AWS OpenSearch / Elastic Cloud)
Design your index mapping with explicit field types. Letting Elasticsearch auto-detect types leads to suboptimal mappings. [src2]
PUT /products
{
"settings": {
"number_of_shards": 3,
"number_of_replicas": 1,
"analysis": {
"analyzer": {
"content_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "stop", "snowball"]
}
}
}
},
"mappings": {
"properties": {
"title": { "type": "text", "analyzer": "content_analyzer", "boost": 2.0 },
"description": { "type": "text", "analyzer": "content_analyzer" },
"category": { "type": "keyword" },
"price": { "type": "float" },
"created_at": { "type": "date" },
"embedding": { "type": "dense_vector", "dims": 768, "index": true, "similarity": "cosine" }
}
}
}
Verify: GET /products/_mapping -- expected: mapping with all fields as defined.
Create a document processing pipeline that normalizes, deduplicates, and enriches documents before indexing. [src3]
from elasticsearch import Elasticsearch, helpers
import hashlib

es = Elasticsearch("http://localhost:9200")

def process_document(raw_doc):
    content = raw_doc["content"].strip()
    return {
        "_index": "products",
        "_id": hashlib.sha256(raw_doc["url"].encode()).hexdigest()[:16],  # deterministic ID dedups by URL
        "title": raw_doc["title"],
        "description": content,
        "category": raw_doc.get("category", "uncategorized"),
        "created_at": raw_doc["timestamp"],
    }

def bulk_index(documents, chunk_size=500):
    actions = [process_document(doc) for doc in documents]
    success, errors = helpers.bulk(es, actions, chunk_size=chunk_size, raise_on_error=False)
    return errors
Verify: GET /products/_count -- expected: count matching ingested documents.
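The pipeline above deduplicates only by URL hash, so near-identical pages reachable at different URLs still get indexed twice. A content-fingerprint pass is one way to catch those (a sketch; `dedup_by_content` is a hypothetical helper, and exact-hash matching misses near-duplicates, for which techniques like MinHash are typically used):

```python
import hashlib

def dedup_by_content(docs):
    """Keep only the first document seen for each distinct content fingerprint.

    The fingerprint is a SHA-256 of whitespace-normalized, lowercased content,
    so trivial formatting differences do not defeat deduplication.
    """
    seen, unique = set(), []
    for doc in docs:
        normalized = " ".join(doc["content"].split()).lower()
        fingerprint = hashlib.sha256(normalized.encode()).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(doc)
    return unique
```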
Transform raw user queries into structured search requests; fuzziness: "AUTO" provides typo tolerance (a lightweight stand-in for spell correction), and synonym expansion can be added via a query-time analyzer. [src5]
def build_search_query(user_query, filters=None, page=0, size=10):
    must_clauses = [{
        "multi_match": {
            "query": user_query,
            "fields": ["title^3", "description"],
            "type": "best_fields",
            "fuzziness": "AUTO",
            "prefix_length": 2
        }
    }]
    filter_clauses = []
    if filters:
        if "category" in filters:
            filter_clauses.append({"term": {"category": filters["category"]}})
        if "price_max" in filters:
            filter_clauses.append({"range": {"price": {"lte": filters["price_max"]}}})
    return {
        "query": {"bool": {"must": must_clauses, "filter": filter_clauses}},
        "from": page * size, "size": size,
        "highlight": {"fields": {"title": {}, "description": {"fragment_size": 150}}}
    }
Verify: POST /products/_search with query body -- expected: results with _score and highlight.
Combine keyword search (BM25) with vector search (kNN) for hybrid relevance. [src6]
def hybrid_search(user_query, query_vector, keyword_weight=0.7, vector_weight=0.3, size=10):
    return {
        "query": {"bool": {"should": [{
            "multi_match": {"query": user_query, "fields": ["title^3", "description"], "boost": keyword_weight}
        }]}},
        "knn": {
            "field": "embedding", "query_vector": query_vector,
            "k": size, "num_candidates": size * 10, "boost": vector_weight
        },
        "size": size
    }
Verify: Results include semantically similar documents, not just exact keyword matches.
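An alternative to the weighted-boost fusion above is reciprocal rank fusion (RRF), which combines the two rankings without comparing raw BM25 and cosine scores (they live on incommensurable scales). A client-side sketch; `rrf_fuse` is a hypothetical helper, and newer Elasticsearch releases also ship a native RRF retriever:

```python
def rrf_fuse(keyword_ids, vector_ids, k=60, size=10):
    """Fuse two ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF paper. Documents ranked
    highly in both lists accumulate the largest fused score.
    """
    scores = {}
    for ranked in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return [doc_id for doc_id, _ in sorted(scores.items(), key=lambda x: -x[1])][:size]
```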
Set up cluster topology and index lifecycle for the expected load; the ILM policy below rolls indices through hot, warm, and delete phases. [src2] [src7]
PUT _ilm/policy/search-lifecycle
{
"policy": {
"phases": {
"hot": { "actions": { "rollover": { "max_size": "50gb", "max_age": "7d" } } },
"warm": { "min_age": "30d", "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 } } },
"delete": { "min_age": "90d", "actions": { "delete": {} } }
}
}
}
Verify: GET _cat/shards/products?v -- expected: shards distributed with status STARTED.
Add caching for popular queries and monitoring for operational visibility. [src3]
import redis, json, hashlib, time

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 300  # 5 minutes

def cached_search(es, query_body, index="products"):
    cache_key = f"search:{hashlib.md5(json.dumps(query_body, sort_keys=True).encode()).hexdigest()}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    start = time.monotonic()
    results = es.search(index=index, body=query_body)
    latency_ms = (time.monotonic() - start) * 1000  # report to your metrics system
    if results["hits"]["total"]["value"] > 0:  # only cache non-empty result sets
        cache.setex(cache_key, CACHE_TTL, json.dumps(dict(results)))
    return results
Verify: Second identical query returns in <1ms (cache hit).
# Input: List of (doc_id, text) tuples
# Output: Inverted index mapping terms to doc_ids with positions
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Build a simple inverted index with term positions."""
    index = defaultdict(list)
    for doc_id, text in documents:
        tokens = re.findall(r'\w+', text.lower())
        term_positions = defaultdict(list)
        for pos, token in enumerate(tokens):
            term_positions[token].append(pos)
        for term, positions in term_positions.items():
            index[term].append((doc_id, positions))
    return dict(index)

def search(index, query):
    """Search the inverted index, return doc_ids ranked by term frequency."""
    tokens = re.findall(r'\w+', query.lower())
    doc_scores = defaultdict(int)
    for token in tokens:
        for doc_id, positions in index.get(token, []):
            doc_scores[doc_id] += len(positions)
    return sorted(doc_scores.items(), key=lambda x: -x[1])

# Usage
docs = [(1, "search engine design"), (2, "search architecture patterns"), (3, "database engine internals")]
idx = build_inverted_index(docs)
results = search(idx, "search engine")
# Returns: [(1, 2), (2, 1), (3, 1)]
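The index above stores term positions, but search() never uses them. Positions are what make phrase queries possible: a phrase matches when each token appears at consecutive offsets. A sketch (`phrase_search` is a hypothetical helper using the same term → [(doc_id, positions)] layout as build_inverted_index):

```python
import re

def phrase_search(index, phrase):
    """Return doc_ids containing the phrase's tokens at consecutive positions."""
    tokens = re.findall(r'\w+', phrase.lower())
    if not tokens:
        return []
    # Candidate start positions per doc, seeded from the first token's postings.
    candidates = {doc_id: set(positions) for doc_id, positions in index.get(tokens[0], [])}
    for offset, token in enumerate(tokens[1:], start=1):
        postings = {doc_id: set(p) for doc_id, p in index.get(token, [])}
        # Keep only starts where this token appears exactly `offset` positions later.
        candidates = {
            doc_id: {start for start in starts if start + offset in postings[doc_id]}
            for doc_id, starts in candidates.items()
            if doc_id in postings
        }
    return [doc_id for doc_id, starts in candidates.items() if starts]
```

This is the same positional-intersection idea Lucene uses for match_phrase queries, minus slop and scoring.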
// Input: User search query string
// Output: Ranked results with highlighted snippets
POST /products/_search
{
"query": {
"bool": {
"must": [{
"multi_match": {
"query": "wireless noise cancelling",
"fields": ["title^3", "description^1", "category^0.5"],
"type": "best_fields",
"fuzziness": "AUTO"
}
}],
"filter": [
{ "range": { "price": { "lte": 300 } } },
{ "term": { "in_stock": true } }
]
}
},
"highlight": {
"fields": {
"title": { "number_of_fragments": 1 },
"description": { "fragment_size": 120, "number_of_fragments": 3 }
}
},
"size": 10, "from": 0
}
// BAD -- one shard with 500GB of data
PUT /my-index { "settings": { "number_of_shards": 1 } }
// Query latency degrades to seconds as shard grows beyond 50GB
// GOOD -- target 10-50GB per shard [src2]
PUT /my-index {
"settings": { "number_of_shards": 5, "number_of_replicas": 1 }
}
// 250GB total / 50GB target = 5 shards; use rollover for time-series
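The sizing arithmetic above generalizes to a one-liner (a sketch; the 10-50GB-per-shard target is the guideline cited in [src2]):

```python
import math

def shard_count(total_index_gb, target_shard_gb=50):
    """Primary shard count keeping each shard at or under the target size."""
    return max(1, math.ceil(total_index_gb / target_shard_gb))
```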
# BAD -- fetches ALL documents then filters in application code
results = es.search(index="products", body={"query": {"match_all": {}}, "size": 10000})
filtered = [r for r in results["hits"]["hits"] if "headphones" in r["_source"]["title"]]
# GOOD -- Elasticsearch uses inverted index for efficient filtering [src6]
results = es.search(index="products", body={
"query": {"bool": {
"must": [{"match": {"title": "headphones"}}],
"filter": [{"range": {"price": {"lte": 200}}}]
}}, "size": 10
})
# BAD -- heavy bulk indexing causes GC pauses that spike query latency
# Both indexing and search compete for same heap, I/O, and CPU
node.roles: [master, data, ingest] # single node does everything
# GOOD -- isolate workloads [src7]
# Ingest node:
node.roles: [ingest, data_hot]
# Coordinating-only node (query routing only; signaled by an empty roles list):
node.roles: [ ]
# Data nodes (serve queries from local shards):
node.roles: [data_hot]
// BAD -- no fault tolerance AND lower query throughput
PUT /my-index/_settings { "index": { "number_of_replicas": 0 } }
// One node failure = data loss
// GOOD -- replicas serve reads and provide fault tolerance [src2]
PUT /my-index/_settings { "index": { "number_of_replicas": 1 } }
// ES load-balances reads across primary + replica; increase for read-heavy
- Field indexed with the standard analyzer but queried as keyword yields zero results. Fix: verify with GET /_analyze that both index time and query time produce the expected tokens. [src4]
- Deep pagination with from/size degrades sharply at high offsets. Fix: use search_after with a sort tiebreaker for deep pagination. [src6]
- Default refresh settings throttle bulk indexing. Fix: set refresh_interval: 30s or -1 during bulk, then restore. [src4]
- Uncontrolled dynamic mapping causes mapping explosion. Fix: set "dynamic": "strict" and explicitly map fields. [src2]
- Heavy boosting of title causes short-title docs to dominate. Fix: use multi_match with type: "cross_fields" to balance scoring. [src5]
- Shards pile up on a subset of nodes. Fix: check _cat/allocation and enable zone-based shard allocation awareness. [src2]

# Check cluster health and shard allocation
curl -s localhost:9200/_cluster/health?pretty
curl -s "localhost:9200/_cat/shards?v&s=store:desc"
# Check index mapping and settings
curl -s localhost:9200/my-index/_mapping?pretty
curl -s localhost:9200/my-index/_settings?pretty
# Analyze how text is tokenized
curl -s -XPOST localhost:9200/my-index/_analyze -H 'Content-Type: application/json' \
-d '{"analyzer":"standard","text":"search engine architecture"}'
# Monitor indexing throughput and latency
curl -s localhost:9200/_nodes/stats/indices?pretty | grep -E "indexing|search"
# Check segment count (too many = needs merge)
curl -s "localhost:9200/_cat/segments/my-index?v&s=size:desc"
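The search_after fix for deep pagination can be sketched as a client-side loop. `search_fn` here is an injected wrapper around `es.search` that returns the `hits["hits"]` list, an assumption made so the loop stays self-contained; the sort must end in a unique tiebreaker field:

```python
def paginate(search_fn, sort, page_size=100):
    """Yield every hit by walking pages with search_after.

    search_fn takes a query body and returns the list of hit dicts;
    each hit carries a "sort" key whose value seeds the next page.
    """
    search_after = None
    while True:
        body = {"size": page_size, "sort": sort}
        if search_after is not None:
            body["search_after"] = search_after
        hits = search_fn(body)
        if not hits:
            return
        yield from hits
        search_after = hits[-1]["sort"]
```

Unlike from/size, each page costs the same regardless of depth, because shards only collect documents after the sort cursor rather than scoring everything up to the offset.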
| Version | Status | Breaking Changes | Migration Notes |
|---|---|---|---|
| Elasticsearch 8.x | Current (2022-present) | Security on by default; Lucene 9.x; native kNN | Enable security from day 1; vector search via dense_vector |
| Elasticsearch 7.x | Maintenance (2019-2024) | Type removal; single type per index | Remove type from API calls; upgrade for vector search |
| OpenSearch 2.x | Current (AWS fork) | Forked from ES 7.10; neural search plugin | Use OpenSearch-specific APIs; not fully ES 8.x compatible |
| Apache Solr 9.x | Current (2022-present) | Module system; requires Java 11+ | Migrate plugins to module architecture |
| Lucene 9.x | Current (2021-present) | New HNSW vector search; removed deprecated APIs | Most users access through ES/Solr, not directly |
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Full-text search across large document corpus (>10K docs) | Exact key-value lookups on structured data | PostgreSQL with B-tree indexes or Redis |
| Relevance-ranked results needed (not just filtering) | Simple LIKE/regex search on <10K rows | SQL LIKE or ILIKE queries |
| Sub-200ms search latency at scale | Write-heavy, read-light workload | OLTP database (PostgreSQL, MySQL) |
| Faceted search with aggregations (e-commerce) | Single-user, single-machine application | SQLite FTS5 or in-process Lucene |
| Autocomplete / typeahead functionality | Only need vector similarity search | Dedicated vector DB (Pinecone, Weaviate) |
| Log search and analytics (ELK stack) | ACID transactional consistency required | PostgreSQL -- search engines are eventually consistent |