RAG System Architecture: System Design Guide
How do I design a Retrieval-Augmented Generation (RAG) system?
TL;DR
- Bottom line: A RAG system is a three-stage pipeline — ingestion (chunk documents, generate embeddings, store in vector DB), retrieval (embed query, similarity search, optional reranking), and generation (inject retrieved context into LLM prompt) — with each stage independently optimizable and scalable.
- Key tool/command: vectorstore.similarity_search(query, k=5) followed by llm.invoke(prompt_with_context) (LangChain pattern)
- Watch out for: Naive fixed-size chunking destroys context boundaries; 80% of RAG failures trace back to chunking, not retrieval or generation. [src4]
- Works with: LangChain 0.3+, LlamaIndex 0.10+, any vector DB (Pinecone, Weaviate, Qdrant, Chroma, pgvector), any LLM (OpenAI, Anthropic, open-source).
Constraints
- Embedding model dimensions are fixed at index time — changing models requires full re-indexing of all documents [src2]
- Chunk size directly bounds retrieval quality: chunks that are too small lose context (faithfulness 0.47-0.51), while semantic chunking reaches 0.79-0.82 [src4]
- Context window limits are hard — total retrieved chunks + system prompt + user query + generation budget must fit within the LLM’s context window [src1]
- Embedding models have maximum input token limits (512-8192 tokens): text exceeding the limit is rejected by some APIs and silently truncated by others, and truncation corrupts the vector representation [src2]
- Hybrid search (vector + BM25) requires two index structures, roughly doubling storage and indexing cost [src1] [src6]
- Never evaluate RAG quality subjectively — use automated metrics (faithfulness, answer relevancy, context precision) or regressions will go undetected [src7]
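The context-window constraint above is simple arithmetic, and it can help to make it explicit. The following is a minimal sketch, not part of any library: the function name `max_chunks_that_fit` and all the numbers in the usage example are hypothetical, and token counts are assumed to come from your model's tokenizer.

```python
def max_chunks_that_fit(
    context_window: int,
    system_prompt_tokens: int,
    query_tokens: int,
    generation_budget: int,
    chunk_token_counts: list[int],
) -> int:
    """Return how many ranked chunks fit in the remaining token budget."""
    remaining = context_window - system_prompt_tokens - query_tokens - generation_budget
    fitted = 0
    for tokens in chunk_token_counts:  # chunks are assumed to be ordered best-first
        if tokens > remaining:
            break  # stop at the first chunk that would overflow the window
        remaining -= tokens
        fitted += 1
    return fitted


# Hypothetical numbers: 8,000-token window, 300-token system prompt,
# 50-token query, 1,000 tokens reserved for the answer, 512-token chunks.
n = max_chunks_that_fit(8000, 300, 50, 1000, [512] * 20)
print(f"{n} chunks fit in the budget")
```

If the result is smaller than your configured k, either lower k or shrink the chunks before sending the prompt.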
Quick Reference
| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Document Loader | Ingests raw documents (PDF, HTML, Markdown, DB rows) into the pipeline | LangChain document loaders, LlamaIndex readers, Unstructured.io, Apache Tika | Horizontal — stateless workers behind a queue |
| Text Splitter / Chunker | Splits documents into semantically coherent chunks with metadata | RecursiveCharacterTextSplitter (400-512 tokens), semantic chunking, sentence-window | CPU-bound — parallelize across documents |
| Embedding Model | Converts text chunks and queries into dense vector representations | OpenAI text-embedding-3-small/large, Voyage AI voyage-3, Cohere embed-v3, BGE-large, E5 | Batch API calls; GPU for self-hosted models |
| Vector Store | Stores and indexes embeddings for fast approximate nearest neighbor (ANN) search | Pinecone, Weaviate, Qdrant, Chroma, pgvector, Milvus, FAISS | Shard by namespace/collection; replicate for reads |
| Sparse Index (BM25) | Keyword-based retrieval for hybrid search — catches exact terms that vectors miss | Elasticsearch, OpenSearch, Pinecone sparse vectors, Weaviate BM25 | Shard by document ID; standard search scaling |
| Query Transformer | Rewrites, expands, or decomposes user queries before retrieval | HyDE (hypothetical document), multi-query, sub-question decomposition | Stateless — LLM call per query |
| Retriever | Executes similarity search against vector store, returns top-k relevant chunks | Vector search (cosine/dot-product), hybrid (vector + BM25), metadata filters | Tune k, use metadata filters to reduce search space |
| Reranker | Re-scores retrieved chunks by relevance using a cross-encoder model | Cohere Rerank, Jina Reranker, bge-reranker, FlashRank | GPU for cross-encoder; retrieve 3-5x final k, rerank to k |
| Context Assembler | Formats retrieved chunks into a prompt template with citations and instructions | LangChain prompt templates, LlamaIndex response synthesizer | Stateless — string formatting |
| LLM Generator | Produces the final answer grounded in retrieved context | GPT-4o, Claude 3.5/Opus, Llama 3, Mistral, Gemini | Scale via API rate limits or self-hosted replicas |
| Evaluation Framework | Measures retrieval and generation quality with automated metrics | RAGAS (faithfulness, relevancy, precision), DeepEval, TruLens | Run offline on test sets; CI integration |
| Observability | Traces retrieval, reranking, and generation steps for debugging and cost tracking | LangSmith, Arize Phoenix, Weights & Biases, OpenTelemetry | Event streaming — scale consumers independently |
Decision Tree
START
├── Corpus size and complexity?
│ ├── <10K documents, single domain
│ │ ├── Simple Q&A → Naive RAG: chunk + embed + retrieve + generate
│ │ └── Need citations → Naive RAG + source tracking in metadata
│ ├── 10K-1M documents, multiple domains
│ │ ├── Keyword matches important → Hybrid search (vector + BM25) [src1]
│ │ ├── Mixed doc types (PDF, code, tables) → Semantic chunking + metadata filters
│ │ └── Need high precision → Add reranker (retrieve 20, rerank to 5)
│ └── >1M documents, enterprise-scale
│ ├── Multi-tenant → Namespace/collection per tenant in vector DB
│ ├── Latency-sensitive → Cache frequent queries + pre-compute popular embeddings
│ └── Complex multi-hop questions → Agentic RAG with LangGraph [src5]
├── Query complexity?
│ ├── Single-hop factual → Standard retrieval with k=3-5
│ ├── Multi-hop reasoning → Sub-question decomposition or iterative retrieval
│ └── Conversational (follow-ups) → Conversation-aware retrieval with history condensation
├── Vector DB choice?
│ ├── Managed, zero-ops → Pinecone (serverless)
│ ├── Open-source, self-hosted → Qdrant, Weaviate, or Milvus
│ ├── Already using PostgreSQL → pgvector extension
│ └── Prototyping/local dev → Chroma (in-memory) or FAISS
└── DEFAULT → Start with Naive RAG (LangChain + Chroma), add reranking when precision matters
Step-by-Step Guide
1. Prepare and chunk your documents
Split documents into semantically coherent chunks. Use recursive character splitting as a baseline (400-512 tokens, 10-20% overlap), then upgrade to semantic chunking if quality metrics demand it. [src4]
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
# Load documents from a directory
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
raw_docs = loader.load()
# Chunk with overlap to preserve context at boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,  # Counted in characters here because length_function=len; pass a token counter to size chunks in tokens
    chunk_overlap=64,  # ~12% overlap preserves cross-boundary context
    separators=["\n\n", "\n", ". ", " ", ""],  # Respect natural boundaries
    length_function=len,
)
chunks = splitter.split_documents(raw_docs)
print(f"Split {len(raw_docs)} documents into {len(chunks)} chunks")
Verify: len(chunks) is 5-20x the number of source documents. Spot-check 10 random chunks for coherence.
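To see what overlap buys you, here is a minimal sketch of fixed-window splitting with overlap. It is an illustration only and not how RecursiveCharacterTextSplitter works internally: whitespace-separated words stand in for tokens, and `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size words, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


words = "The Eiffel Tower was built in 1889 by Gustave Eiffel".split()
chunks = split_with_overlap(words, chunk_size=6, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

With an overlap of two words, the phrase "built in 1889" appears intact in the second chunk instead of being cut at the boundary.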
2. Generate embeddings and store in a vector database
Embed all chunks and upsert into your vector store. Pin the embedding model — changing it later requires full re-indexing. [src2]
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
import os
# Pin embedding model version — changing requires full re-index
embeddings = OpenAIEmbeddings(
model="text-embedding-3-small", # 1536 dims, $0.02/1M tokens
openai_api_key=os.environ["OPENAI_API_KEY"],
)
# Create vector store and upsert chunks
vectorstore = PineconeVectorStore.from_documents(
documents=chunks,
embedding=embeddings,
index_name="rag-index",
namespace="production",
)
print(f"Indexed {len(chunks)} chunks into Pinecone")
Verify: vectorstore.similarity_search("test query", k=3) returns relevant chunks, not random results.
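Conceptually, a similarity search compares the query vector with every stored vector and returns the closest ones. The sketch below shows that idea with brute-force cosine similarity over a toy in-memory index. Real vector stores use approximate nearest neighbor indexes instead; the three-dimensional vectors and the names `cosine_similarity` and `top_k` are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        # Mirrors the constraint that index and query embeddings must share dimensions
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index with made-up 3-dimensional embeddings
toy_index = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.1, 0.0], toy_index, k=2)
print(hits)
```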
3. Implement retrieval with hybrid search
Combine dense vector search with sparse keyword search (BM25) for higher recall. Anthropic's contextual retrieval research reports that combining contextual embeddings with contextual BM25 reduces the top-20 retrieval failure rate by 49%. [src1] [src6]
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# Dense retriever — semantic similarity
dense_retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 10}, # Over-retrieve for reranking
)
# Sparse retriever — keyword matching (catches exact terms vectors miss)
bm25_retriever = BM25Retriever.from_documents(chunks, k=10)
# Hybrid: combine dense + sparse with weighted fusion
hybrid_retriever = EnsembleRetriever(
retrievers=[dense_retriever, bm25_retriever],
weights=[0.6, 0.4], # Tune based on your query types
)
results = hybrid_retriever.invoke("How does contextual retrieval work?")
Verify: Compare results from dense-only vs. hybrid on 20 test queries — hybrid should improve recall on keyword-heavy queries.
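One common way to merge a dense ranking and a sparse ranking is weighted reciprocal rank fusion (RRF), which rewards documents that rank highly in either list. The sketch below is a standalone illustration under assumptions: the constant `c=60` is the value commonly used in the RRF literature, the document IDs are made up, and `weighted_rrf` is a hypothetical helper rather than a LangChain API.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds weight / (c + rank) to a document's score."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["doc-a", "doc-b", "doc-c"]   # hypothetical dense (vector) ranking
sparse = ["doc-c", "doc-a", "doc-d"]  # hypothetical sparse (BM25) ranking
fused = weighted_rrf([dense, sparse], weights=[0.6, 0.4])
print(fused)
```

Documents that appear in both lists (doc-a and doc-c here) rise above documents found by only one retriever, which is the behavior you want from hybrid search.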
4. Add reranking for precision
Over-retrieve (3-5x final k), then rerank with a cross-encoder to maximize precision. This reduces noise in the LLM context. [src1] [src3]
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
# Reranker: scores the hybrid candidates (up to 20: 10 dense + 10 sparse) and keeps the top 5
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5,  # Final number of chunks passed to LLM
)
compression_retriever = ContextualCompressionRetriever(
base_compressor=reranker,
base_retriever=hybrid_retriever, # From step 3
)
reranked_results = compression_retriever.invoke("What embedding model should I use?")
# Now only the 5 most relevant chunks proceed to generation
Verify: Spot-check reranked results — the top chunk should directly answer the query, not just be tangentially related.
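The over-retrieve-then-rerank pattern is independent of any particular reranking service. As a rough sketch, the function below takes any scoring function and keeps the top results. The word-overlap scorer is only a stand-in so the example runs without a model; a real cross-encoder would replace it. Both `rerank` and `word_overlap` are hypothetical names.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int,
) -> list[str]:
    """Re-score every candidate against the query and keep the top_n best."""
    ordered = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ordered[:top_n]


def word_overlap(query: str, text: str) -> float:
    """Toy relevance score: Jaccard overlap of lowercase words (NOT a cross-encoder)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0


candidates = [
    "Pinecone is a managed vector database",
    "Choose an embedding model before building the index",
    "BM25 is a keyword ranking function",
]
top = rerank("which embedding model should I choose", candidates, word_overlap, top_n=1)
print(top)
```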
5. Build the generation chain with citations
Assemble retrieved context into a prompt template and pass to the LLM. Include source metadata for citation traceability. [src5]
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_template("""
Answer the question based ONLY on the following context. If the context
does not contain enough information, say "I don't have enough information."
Cite sources using [Source: filename] format.
Context:
{context}
Question: {question}
Answer:""")
def format_docs(docs):
    return "\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )
# LCEL chain: retrieve → format → prompt → generate
rag_chain = (
{"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
answer = rag_chain.invoke("How do I design a RAG system?")
print(answer)
Verify: Answer references specific source documents. Asking about topics not in the corpus returns “I don’t have enough information.”
6. Evaluate with RAGAS metrics
Measure retrieval and generation quality using automated evaluation. Run on a test set of 50-100 question-answer pairs. [src7]
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset
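# Prepare evaluation dataset. answer_1/answer_2 and retrieved_contexts_1/retrieved_contexts_2 are
# placeholders for your pipeline's outputs; each contexts entry is a list of chunk strings.
eval_data = {
    "question": ["What is contextual retrieval?", "How does hybrid search work?"],
    "answer": [answer_1, answer_2],  # RAG-generated answers
    "contexts": [retrieved_contexts_1, retrieved_contexts_2],
    "ground_truth": ["Contextual retrieval prepends...", "Hybrid search combines..."],
}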
result = evaluate(
Dataset.from_dict(eval_data),
metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
# Target: faithfulness > 0.85, answer_relevancy > 0.80, context_precision > 0.75
Verify: All metrics above threshold. If faithfulness < 0.85, improve chunking or retrieval. If relevancy < 0.80, improve prompt template.
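The threshold check in the Verify step can be automated with a small gate that reports which metrics fall short. This is a minimal sketch operating on a plain dictionary of scores: the function name `failed_metrics` and the example scores are hypothetical, and the thresholds are the targets listed above.

```python
from typing import Optional

THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80, "context_precision": 0.75}


def failed_metrics(
    scores: dict[str, float], thresholds: dict[str, float]
) -> dict[str, tuple[Optional[float], float]]:
    """Return {metric: (actual, required)} for every metric not strictly above its threshold."""
    failures: dict[str, tuple[Optional[float], float]] = {}
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None or value <= minimum:  # a missing metric counts as a failure
            failures[name] = (value, minimum)
    return failures


# Hypothetical scores from an evaluation run
scores = {"faithfulness": 0.91, "answer_relevancy": 0.78, "context_precision": 0.80}
failures = failed_metrics(scores, THRESHOLDS)
for name, (actual, required) in failures.items():
    print(f"{name}: {actual} does not exceed {required}")
```

In CI, a non-empty result would fail the build, so that a chunking or prompt change that lowers quality is caught before deployment.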
Code Examples
Python/LangChain: Complete Naive RAG Pipeline
# Input: Directory of documents + user query
# Output: LLM answer grounded in retrieved context with sources
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# 1. Load and chunk
docs = DirectoryLoader("./docs", glob="**/*.md").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(docs)
# 2. Embed and store
db = Chroma.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))
retriever = db.as_retriever(search_kwargs={"k": 5})
# 3. Generate
prompt = ChatPromptTemplate.from_template(
"Answer based on context:\n{context}\n\nQuestion: {question}"
)
chain = (
{"context": retriever | (lambda docs: "\n".join(d.page_content for d in docs)),
"question": RunnablePassthrough()}
| prompt | ChatOpenAI(model="gpt-4o", temperature=0) | StrOutputParser()
)
print(chain.invoke("How does RAG work?"))
Python/LlamaIndex: RAG with Sentence Window Retrieval
# Input: PDF documents + user query
# Output: Answer using sentence-window retrieval for fine-grained context
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
# Parse with sentence windows — embeds single sentences, retrieves surrounding window
node_parser = SentenceWindowNodeParser.from_defaults(
window_size=3, # 3 sentences before + after the matched sentence
)
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents, transformations=[node_parser])  # LlamaIndex 0.10+ takes parsers via transformations
# At query time, replace sentence with full window for generation
query_engine = index.as_query_engine(
similarity_top_k=5,
node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
response = query_engine.query("What are RAG best practices?")
print(response)
TypeScript/LangChain: RAG with Pinecone
// Input: Array of text documents + user query string
// Output: LLM-generated answer grounded in retrieved context
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence, RunnablePassthrough } from "@langchain/core/runnables";
const pinecone = new Pinecone();
const index = pinecone.Index("rag-index");
const vectorStore = await PineconeStore.fromExistingIndex(
new OpenAIEmbeddings({ modelName: "text-embedding-3-small" }),
{ pineconeIndex: index }
);
const retriever = vectorStore.asRetriever({ k: 5 });
const llm = new ChatOpenAI({ modelName: "gpt-4o", temperature: 0 });
const prompt = ChatPromptTemplate.fromTemplate(
"Answer based on context:\n{context}\n\nQuestion: {question}"
);
const chain = RunnableSequence.from([
{ context: retriever.pipe((docs) => docs.map((d) => d.pageContent).join("\n")),
question: new RunnablePassthrough() },
prompt, llm, new StringOutputParser(),
]);
const answer = await chain.invoke("How does RAG work?");
Anti-Patterns
Wrong: Stuffing entire documents into the context window
# BAD — sending full documents instead of relevant chunks
# Wastes tokens, dilutes relevance, hits context window limits
def naive_answer(question, documents):
    full_text = "\n".join(doc.page_content for doc in documents)  # Could be 500K+ tokens
    return llm.invoke(f"Context: {full_text}\n\nQuestion: {question}")
# Result: exceeds context window, LLM ignores middle content ("lost in the middle" problem)
Correct: Retrieve only relevant chunks with bounded k
# GOOD: retrieve top-k relevant chunks, respecting token budget [src2]
def num_tokens(text):
    return len(text) // 4  # Rough estimate (~4 characters per token); swap in your model's tokenizer

def rag_answer(question, retriever, llm, max_context_tokens=3000):
    chunks = retriever.invoke(question)  # Returns the k most relevant chunks, best first
    context = "\n\n".join(c.page_content for c in chunks)
    while len(chunks) > 1 and num_tokens(context) > max_context_tokens:
        chunks = chunks[:-1]  # Drop the lowest-ranked chunk until the context fits
        context = "\n\n".join(c.page_content for c in chunks)
    return llm.invoke(f"Context: {context}\n\nQuestion: {question}")
Wrong: Using fixed-size chunking without overlap
# BAD — hard splits at character boundaries, no overlap
chunks = [text[i:i+1000] for i in range(0, len(text), 1000)]
# "The Eiffel Tower was built in" | "1889 by Gustave Eiffel"
# Neither chunk contains the complete fact — retrieval fails for both
Correct: Recursive chunking with overlap and boundary awareness
# GOOD — respects natural boundaries, preserves cross-boundary context [src4]
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=64, # 12% overlap captures cross-boundary content
separators=["\n\n", "\n", ". ", " ", ""], # Split at natural boundaries first
)
chunks = splitter.split_documents(documents)
Wrong: Embedding queries and documents with different models
# BAD: mismatched embedding models produce incomparable vector spaces
doc_embeddings = model_a.encode(documents) # Model A for docs
query_embedding = model_b.encode(user_query) # Model B for queries
# Cosine similarity between different vector spaces is meaningless
Correct: Use the same embedding model for both documents and queries
# GOOD — same model ensures vectors live in the same space [src2]
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
# Used at index time
vectorstore = Chroma.from_documents(chunks, embedding_model)
# Used at query time (automatically by the retriever)
retriever = vectorstore.as_retriever()
Wrong: No evaluation — “it seems to work”
# BAD — no metrics, no test set, deploying based on vibes
chain = build_rag_chain()
answer = chain.invoke("test question")
print(answer) # "Looks good to me!" — ships to production
# Silent regressions when you change chunking, models, or prompts
Correct: Automated evaluation with RAGAS metrics
# GOOD — measurable quality gates before deployment [src7]
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
result = evaluate(test_dataset, metrics=[faithfulness, answer_relevancy, context_precision])
assert result["faithfulness"] > 0.85, f"Faithfulness too low: {result['faithfulness']}"
assert result["answer_relevancy"] > 0.80, f"Relevancy too low: {result['answer_relevancy']}"
# Gate deployments on metric thresholds in CI/CD
Wrong: Retrieving without reranking for precision-critical applications
# BAD — top-5 from vector similarity often includes tangentially related noise
results = vectorstore.similarity_search(query, k=5)
# Result 1: Highly relevant
# Result 2: Same topic, wrong subtopic
# Result 3: Tangentially related
# Results 4-5: Noise that confuses the LLM
Correct: Over-retrieve and rerank
# GOOD — retrieve 20, rerank to 5 for high precision [src1] [src3]
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)
retriever = ContextualCompressionRetriever(
base_compressor=reranker, base_retriever=base_retriever
)
# Cross-encoders score each query-chunk pair jointly, which typically ranks more precisely than vector similarity alone, at the cost of extra latency
Common Pitfalls
- Choosing chunk size without testing: Optimal chunk size depends on document type and query patterns. Factoid queries work best with 256-512 tokens; analytical queries need 1024+. Fix: benchmark 3-4 chunk sizes on your test set using RAGAS context_precision. [src4]
- Ignoring metadata filters: Retrieving across all documents when the query clearly targets a specific domain/date/category wastes context tokens on irrelevant chunks. Fix: attach metadata (source, date, category) at indexing time and filter at retrieval time. [src6]
- Not handling the “no relevant results” case: When the vector store returns low-similarity results, the LLM hallucinates rather than admitting ignorance. Fix: set a similarity threshold (e.g., cosine > 0.7) and return “I don’t have information on this” when no chunk passes. [src7]
- Updating documents without re-indexing: Editing source documents without re-embedding and upserting leads to stale retrieval results. Fix: build an incremental indexing pipeline that detects changed files and re-indexes only those. [src2]
- “Lost in the middle” effect: LLMs attend more to the beginning and end of long contexts, underweighting middle chunks. Fix: limit retrieved chunks to 3-5, or place the most relevant chunks at the start and end of the context and the weakest in the middle. [src3]
- Missing contextual information in chunks: A chunk saying “the company reported $4.2B revenue” without identifying which company is useless. Fix: use Anthropic’s contextual retrieval — prepend chunk-specific context. [src1]
- Skipping hybrid search for technical content: Pure vector search misses exact API names, error codes, and version numbers that BM25 catches trivially. Fix: always use hybrid search (vector + BM25) for technical documentation. [src1] [src6]
- Treating all queries the same: Simple factual queries and complex multi-step questions need different retrieval strategies. Fix: implement query routing — classify queries and route to appropriate retriever (simple vs. multi-hop vs. agentic). [src5]
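Two of the fixes above are easy to sketch in plain Python: filtering out low-similarity results before generation, and reordering chunks to counter the “lost in the middle” effect. The helper names, the example scores, and the 0.7 cutoff are illustrative assumptions; calibrate the threshold for your own embedding model.

```python
NO_ANSWER = "I don't have information on this"


def filter_by_score(scored_chunks: list[tuple[str, float]], min_score: float) -> list[tuple[str, float]]:
    """Keep only chunks whose similarity score is strictly above min_score."""
    return [(text, score) for text, score in scored_chunks if score > min_score]


def reorder_for_long_context(chunks_best_first: list[str]) -> list[str]:
    """Put the strongest chunks at the start and end, the weakest in the middle."""
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


# Hypothetical (chunk, cosine similarity) pairs, best first
scored = [("chunk A", 0.82), ("chunk B", 0.74), ("chunk C", 0.55)]
kept = filter_by_score(scored, min_score=0.7)
if not kept:
    print(NO_ANSWER)  # skip the LLM call entirely when nothing passes
else:
    ordered = reorder_for_long_context([text for text, _ in kept])
    print(ordered)
```

For five chunks ranked 1 to 5, the reordering yields 1, 3, 5, 4, 2: the best chunk opens the context, the second best closes it, and the weakest sits in the middle.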
Diagnostic Commands
# Check embedding dimensions match between index and query
python -c "from langchain_openai import OpenAIEmbeddings; e=OpenAIEmbeddings(model='text-embedding-3-small'); print(len(e.embed_query('test')))"
# Count chunks in vector store (Pinecone)
python -c "from pinecone import Pinecone; pc=Pinecone(); print(pc.Index('rag-index').describe_index_stats())"
# The checks below use objects from the pipeline above (retriever, chunks, dataset),
# so run them in the Python session or script where those are defined; a fresh python -c has none of them.
# Test retrieval quality: does the top result answer the query?
results = retriever.invoke('your test query'); print(results[0].page_content[:200])
# Measure chunk size distribution
import statistics; sizes=[len(c.page_content) for c in chunks]; print(f'mean={statistics.mean(sizes):.0f}, median={statistics.median(sizes):.0f}, std={statistics.stdev(sizes):.0f}')
# Check for duplicate chunks (common after re-indexing)
contents=[c.page_content for c in chunks]; dupes=len(contents)-len(set(contents)); print(f'{dupes} duplicate chunks')
# Run RAGAS evaluation on test set
from ragas import evaluate; from ragas.metrics import faithfulness; print(evaluate(dataset, metrics=[faithfulness]))
Version History & Compatibility
| Version | Status | Breaking Changes | Migration Notes |
|---|---|---|---|
| LangChain 0.3 (2024-09) | Current | LCEL-first API; deprecated legacy chains (LLMChain, RetrievalQA) | Replace RetrievalQA.from_chain_type() with LCEL chain composition |
| LangChain 0.2 (2024-05) | Maintenance | Community package split; langchain-community separate | Move imports from langchain to langchain-community or partner packages |
| LlamaIndex 0.10+ (2024-03) | Current | New module structure; llama-index-core + integration packages | Update imports; install integration packages separately |
| OpenAI Embeddings v3 (2024-01) | Current | New models: text-embedding-3-small (1536d), text-embedding-3-large (3072d) | Re-index with new model for better quality; supports Matryoshka dimensionality reduction |
| Pinecone Serverless (2024-01) | Current | New serverless architecture; pod-based deprecated for new indexes | Migrate pod indexes to serverless; use namespaces for multi-tenancy |
| Weaviate 1.25+ (2024-06) | Current | Named vectors; multi-modal modules | Use named vectors for hybrid (dense + sparse) in single collection |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Knowledge base changes frequently (weekly/monthly) | Data is static and small enough to fit in prompt | Place all context directly in the prompt (system message with all context) |
| Need grounded, cited answers from specific documents | Questions require real-time web data | Web search + LLM (Perplexity-style) |
| Corpus exceeds LLM context window (>100K tokens) | Corpus fits in a single context window (<50K tokens) | Long-context LLM without retrieval |
| Need to reduce hallucinations on domain-specific topics | Creative writing or open-ended generation | Standard LLM prompting |
| Multi-tenant — different users access different document sets | All users access the same small knowledge base | Fine-tuned LLM or prompt engineering |
| Must attribute answers to specific source documents | Attribution not required | Fine-tuning (bakes knowledge into weights) |
| Budget-conscious: retrieve only relevant context per query | Token budget is unlimited and added latency is acceptable | Send entire knowledge base in prompt |
Important Caveats
- Embedding model choice locks your index: switching from text-embedding-3-small (1536 dims) to text-embedding-3-large (3072 dims) requires full re-indexing. Choose the embedding model before building the index. [src2]
- RAG does not eliminate hallucinations; it reduces them. The LLM can still generate content not present in retrieved chunks, especially when instructions say “answer the question” without “based ONLY on the context.” Always include strict grounding instructions in the prompt. [src7]
- Vector similarity != relevance — cosine similarity of 0.85 does not mean 85% relevant. Thresholds vary by embedding model and domain. Calibrate with labeled data. [src2]
- Latency adds up: embedding (50-200ms) + vector search (20-100ms) + reranking (100-500ms) + LLM generation (500-3000ms) = 670-3800ms total. Budget each component and parallelize where possible. [src7]
- Cost scales with queries, not corpus size — embedding the corpus is a one-time cost; the recurring cost is per-query embedding + LLM generation. At 10K queries/day with GPT-4o, expect $50-200/day in LLM costs alone. [src2]
- Frameworks change rapidly: LangChain, LlamaIndex, and vector DB APIs ship breaking changes quarterly. Pin dependency versions and test upgrades in staging. [src5]
- Hybrid search is not always better — for purely semantic queries (e.g., “explain how transformers work”), BM25 adds no value and may introduce noise. Profile your query distribution before adding hybrid search complexity. [src1] [src6]
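The per-stage latency figures in the caveats above can be summed mechanically to get an end-to-end budget and to see which stage dominates. This is a small sketch that reuses those ranges as-is and assumes the stages run sequentially; `total_latency_range` is a hypothetical helper.

```python
# (best case, worst case) in milliseconds, taken from the latency caveat above
STAGE_LATENCY_MS = {
    "query embedding": (50, 200),
    "vector search": (20, 100),
    "reranking": (100, 500),
    "LLM generation": (500, 3000),
}


def total_latency_range(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum best-case and worst-case latencies, assuming stages run one after another."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = total_latency_range(STAGE_LATENCY_MS)
slowest = max(STAGE_LATENCY_MS, key=lambda name: STAGE_LATENCY_MS[name][1])
print(f"end-to-end: {best}-{worst} ms, dominated by {slowest}")
```

Because generation dominates the worst case, streaming the LLM output or choosing a faster model usually helps more than tuning the vector search.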