RAG System Architecture: System Design Guide

How do I design a Retrieval-Augmented Generation (RAG) system?

TL;DR

Bottom line: A RAG system is a three-stage pipeline — ingestion (chunk documents, generate embeddings, store in vector DB), retrieval (embed query, similarity search, optional reranking), and generation (inject retrieved context into LLM prompt) — with each stage independently optimizable and scalable.
Key tool/command: vectorstore.similarity_search(query, k=5) followed by llm.invoke(prompt_with_context) (LangChain pattern)
Watch out for: Naive fixed-size chunking destroys context boundaries — 80% of RAG failures trace back to chunking, not retrieval or generation. [src4]
Works with: LangChain 0.3+, LlamaIndex 0.10+, any vector DB (Pinecone, Weaviate, Qdrant, Chroma, pgvector), any LLM (OpenAI, Anthropic, open-source).

Constraints

Embedding model dimensions are fixed at index time — changing models requires full re-indexing of all documents [src2]
Chunk size directly bounds retrieval quality: chunks that are too small lose context (reported faithfulness 0.47-0.51), while semantic chunking reaches 0.79-0.82 [src4]
Context window limits are hard — total retrieved chunks + system prompt + user query + generation budget must fit within the LLM’s context window [src1]
Embedding models have maximum input token limits (512-8192 tokens, depending on the model). Text over the limit is either rejected or silently truncated, and truncation corrupts the vector representation [src2]
Hybrid search (vector + BM25) requires two index structures, roughly doubling storage and indexing cost [src1] [src6]
Never evaluate RAG quality subjectively — use automated metrics (faithfulness, answer relevancy, context precision) or regressions will go undetected [src7]
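
The context-window constraint above is plain arithmetic, and it can help to make it explicit before tuning k. The sketch below is illustrative only: the function name and all token counts are hypothetical, and real token counts should come from your model's tokenizer.

```python
def max_chunks_that_fit(
    context_window: int,
    system_prompt_tokens: int,
    query_tokens: int,
    generation_budget: int,
    tokens_per_chunk: int,
) -> int:
    """Return how many retrieved chunks fit in what is left of the context window."""
    if tokens_per_chunk <= 0:
        raise ValueError("tokens_per_chunk must be positive")
    remaining = context_window - system_prompt_tokens - query_tokens - generation_budget
    return max(0, remaining // tokens_per_chunk)


# Hypothetical numbers: 8,192-token window, 300-token system prompt,
# 50-token query, 1,024 tokens reserved for the answer, 512-token chunks.
print(max_chunks_that_fit(8192, 300, 50, 1024, 512))
```

If the result is smaller than the k you planned to retrieve, reduce k, shrink the chunks, or rerank down to fewer chunks.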

Quick Reference

Component	Role	Technology Options	Scaling Strategy
Document Loader	Ingests raw documents (PDF, HTML, Markdown, DB rows) into the pipeline	LangChain document loaders, LlamaIndex readers, Unstructured.io, Apache Tika	Horizontal — stateless workers behind a queue
Text Splitter / Chunker	Splits documents into semantically coherent chunks with metadata	RecursiveCharacterTextSplitter (400-512 tokens), semantic chunking, sentence-window	CPU-bound — parallelize across documents
Embedding Model	Converts text chunks and queries into dense vector representations	OpenAI text-embedding-3-small/large, Voyage AI voyage-3, Cohere embed-v3, BGE-large, E5	Batch API calls; GPU for self-hosted models
Vector Store	Stores and indexes embeddings for fast approximate nearest neighbor (ANN) search	Pinecone, Weaviate, Qdrant, Chroma, pgvector, Milvus, FAISS	Shard by namespace/collection; replicate for reads
Sparse Index (BM25)	Keyword-based retrieval for hybrid search — catches exact terms that vectors miss	Elasticsearch, OpenSearch, Pinecone sparse vectors, Weaviate BM25	Shard by document ID; standard search scaling
Query Transformer	Rewrites, expands, or decomposes user queries before retrieval	HyDE (hypothetical document), multi-query, sub-question decomposition	Stateless — LLM call per query
Retriever	Executes similarity search against vector store, returns top-k relevant chunks	Vector search (cosine/dot-product), hybrid (vector + BM25), metadata filters	Tune k, use metadata filters to reduce search space
Reranker	Re-scores retrieved chunks by relevance using a cross-encoder model	Cohere Rerank, Jina Reranker, bge-reranker, FlashRank	GPU for cross-encoder; retrieve 3-5x final k, rerank to k
Context Assembler	Formats retrieved chunks into a prompt template with citations and instructions	LangChain prompt templates, LlamaIndex response synthesizer	Stateless — string formatting
LLM Generator	Produces the final answer grounded in retrieved context	GPT-4o, Claude 3.5/Opus, Llama 3, Mistral, Gemini	Scale via API rate limits or self-hosted replicas
Evaluation Framework	Measures retrieval and generation quality with automated metrics	RAGAS (faithfulness, relevancy, precision), DeepEval, TruLens	Run offline on test sets; CI integration
Observability	Traces retrieval, reranking, and generation steps for debugging and cost tracking	LangSmith, Arize Phoenix, Weights & Biases, OpenTelemetry	Event streaming — scale consumers independently

Decision Tree

START
├── Corpus size and complexity?
│   ├── <10K documents, single domain
│   │   ├── Simple Q&A → Naive RAG: chunk + embed + retrieve + generate
│   │   └── Need citations → Naive RAG + source tracking in metadata
│   ├── 10K-1M documents, multiple domains
│   │   ├── Keyword matches important → Hybrid search (vector + BM25) [src1]
│   │   ├── Mixed doc types (PDF, code, tables) → Semantic chunking + metadata filters
│   │   └── Need high precision → Add reranker (retrieve 20, rerank to 5)
│   └── >1M documents, enterprise-scale
│       ├── Multi-tenant → Namespace/collection per tenant in vector DB
│       ├── Latency-sensitive → Cache frequent queries + pre-compute popular embeddings
│       └── Complex multi-hop questions → Agentic RAG with LangGraph [src5]
├── Query complexity?
│   ├── Single-hop factual → Standard retrieval with k=3-5
│   ├── Multi-hop reasoning → Sub-question decomposition or iterative retrieval
│   └── Conversational (follow-ups) → Conversation-aware retrieval with history condensation
├── Vector DB choice?
│   ├── Managed, zero-ops → Pinecone (serverless)
│   ├── Open-source, self-hosted → Qdrant, Weaviate, or Milvus
│   ├── Already using PostgreSQL → pgvector extension
│   └── Prototyping/local dev → Chroma (in-memory) or FAISS
└── DEFAULT → Start with Naive RAG (LangChain + Chroma), add reranking when precision matters

Step-by-Step Guide

1. Prepare and chunk your documents

Split documents into semantically coherent chunks. Use recursive character splitting as a baseline (400-512 tokens, 10-20% overlap), then upgrade to semantic chunking if quality metrics demand it. [src4]

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Load documents from a directory
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
raw_docs = loader.load()

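# Chunk with overlap to preserve context at boundaries.
# from_tiktoken_encoder measures chunk_size in tokens (requires the tiktoken package);
# with length_function=len the sizes would be counted in characters instead.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,       # Upper end of the 400-512 token baseline
    chunk_overlap=64,     # ~12% overlap preserves cross-boundary context
    separators=["\n\n", "\n", ". ", " ", ""],  # Respect natural boundaries
)
chunks = splitter.split_documents(raw_docs)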
print(f"Split {len(raw_docs)} documents into {len(chunks)} chunks")

Verify: len(chunks) is 5-20x the number of source documents. Spot-check 10 random chunks for coherence.
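
The verification step can be scripted rather than eyeballed. The helper below is a minimal sketch with a hypothetical name (chunk_report); it works on plain strings so it runs without LangChain, and the toy data stands in for real chunks.

```python
import random
import statistics


def chunk_report(num_source_docs, chunk_texts, sample_size=10, seed=0):
    """Summarize a chunking run and pick random chunks for manual review."""
    if num_source_docs <= 0 or not chunk_texts:
        raise ValueError("need at least one document and one chunk")
    lengths = [len(text) for text in chunk_texts]
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    sample = rng.sample(chunk_texts, min(sample_size, len(chunk_texts)))
    return {
        "chunks_per_doc": len(chunk_texts) / num_source_docs,
        "mean_chars": statistics.mean(lengths),
        "max_chars": max(lengths),
        "sample": sample,
    }


# Toy data standing in for real chunks
report = chunk_report(2, ["a" * 10, "b" * 20, "c" * 30, "d" * 40])
print(report["chunks_per_doc"], report["mean_chars"], report["max_chars"])
```

With the pipeline from this step, the call would look like chunk_report(len(raw_docs), [c.page_content for c in chunks]).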

2. Generate embeddings and store in a vector database

Embed all chunks and upsert into your vector store. Pin the embedding model — changing it later requires full re-indexing. [src2]

from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
import os

# Pin embedding model version — changing requires full re-index
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # 1536 dims, $0.02/1M tokens
    openai_api_key=os.environ["OPENAI_API_KEY"],
)

# Create vector store and upsert chunks
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name="rag-index",
    namespace="production",
)
print(f"Indexed {len(chunks)} chunks into Pinecone")

Verify: vectorstore.similarity_search("test query", k=3) returns relevant chunks, not random results.
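
To make clear what a similarity search computes, here is a brute-force sketch in plain Python. It is not how Pinecone or any other vector DB works internally (they use approximate nearest neighbor indexes), and the two-dimensional vectors are toy values chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """indexed is a list of (chunk_id, vector); return the k best (chunk_id, score) pairs."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in indexed]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: real embeddings have hundreds or thousands of dimensions
index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
for chunk_id, score in top_k([1.0, 0.1], index, k=2):
    print(chunk_id, round(score, 3))
```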

3. Implement retrieval with hybrid search

Combine dense vector search with sparse keyword search (BM25) for higher recall. Anthropic’s contextual retrieval research reports that combining contextual embeddings with contextual BM25 reduces top-20 retrieval failures by 49%. [src1] [src6]

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Dense retriever — semantic similarity
dense_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10},  # Over-retrieve for reranking
)

# Sparse retriever — keyword matching (catches exact terms vectors miss)
bm25_retriever = BM25Retriever.from_documents(chunks, k=10)

# Hybrid: combine dense + sparse with weighted fusion
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4],  # Tune based on your query types
)
results = hybrid_retriever.invoke("How does contextual retrieval work?")

Verify: Compare results from dense-only vs. hybrid on 20 test queries — hybrid should improve recall on keyword-heavy queries.
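
EnsembleRetriever merges the ranked lists with a weighted form of Reciprocal Rank Fusion. The sketch below shows the idea in plain Python; it is not the library's code, and the document IDs, the function name, and the constant c=60 (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of doc IDs: each list adds weight / (c + rank) to a doc's score."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["d1", "d2", "d3"]   # hypothetical dense (vector) ranking
sparse = ["d3", "d4", "d1"]  # hypothetical BM25 ranking
print(weighted_rrf([dense, sparse], weights=[0.6, 0.4]))
```

Because only ranks are used, the fusion does not need dense and BM25 scores to be on the same scale, which is the main reason rank fusion is a common choice for hybrid search.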

4. Add reranking for precision

Over-retrieve (3-5x final k), then rerank with a cross-encoder to maximize precision. This reduces noise in the LLM context. [src1] [src3]

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Reranker: takes the merged hybrid results (up to 20 chunks) and keeps the top 5
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5,  # Final number of chunks passed to LLM
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever,  # From step 3
)

reranked_results = compression_retriever.invoke("What embedding model should I use?")
# Now only the 5 most relevant chunks proceed to generation

Verify: Spot-check reranked results — the top chunk should directly answer the query, not just be tangentially related.

5. Build the generation chain with citations

Assemble retrieved context into a prompt template and pass to the LLM. Include source metadata for citation traceability. [src5]

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_template("""
Answer the question based ONLY on the following context. If the context
does not contain enough information, say "I don't have enough information."
Cite sources using [Source: filename] format.

Context:
{context}

Question: {question}

Answer:""")

def format_docs(docs):
    return "\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )

# LCEL chain: retrieve → format → prompt → generate
rag_chain = (
    {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("How do I design a RAG system?")
print(answer)

Verify: Answer references specific source documents. Asking about topics not in the corpus returns “I don’t have enough information.”
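
Prompt instructions alone may not be enough to get the fallback answer reliably. One option is to skip the LLM call when no retrieved chunk is similar enough to the query. The sketch below assumes you already have (text, similarity) pairs with higher meaning more similar; the 0.7 threshold and the function names are illustrative, and score semantics differ between vector stores (some return distances), so calibrate on labeled data.

```python
FALLBACK = "I don't have enough information."


def select_grounded_chunks(scored_chunks, min_score=0.7):
    """Keep the text of chunks whose similarity score exceeds the threshold."""
    return [text for text, score in scored_chunks if score > min_score]


def answer_or_fallback(scored_chunks, generate, min_score=0.7):
    """Call generate(context) only when at least one chunk passes the threshold."""
    kept = select_grounded_chunks(scored_chunks, min_score)
    if not kept:
        return FALLBACK  # no grounded context, so do not call the LLM at all
    return generate("\n\n".join(kept))


# Stand-in for the real chain: echoes the context it was given
fake_generate = lambda context: f"ANSWER[{context}]"
print(answer_or_fallback([("chunk about RAG", 0.91), ("unrelated chunk", 0.22)], fake_generate))
print(answer_or_fallback([("unrelated chunk", 0.22)], fake_generate))
```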

6. Evaluate with RAGAS metrics

Measure retrieval and generation quality using automated evaluation. Run on a test set of 50-100 question-answer pairs. [src7]

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["What is contextual retrieval?", "How does hybrid search work?"],
    "answer": [answer_1, answer_2],  # Placeholders: RAG-generated answers from your chain
    "contexts": [retrieved_contexts_1, retrieved_contexts_2],  # Placeholders: one list of chunk texts per question
    "ground_truth": ["Contextual retrieval prepends...", "Hybrid search combines..."],
}

result = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
# Target: faithfulness > 0.85, answer_relevancy > 0.80, context_precision > 0.75

Verify: All metrics above threshold. If faithfulness < 0.85, improve chunking or retrieval. If relevancy < 0.80, improve prompt template.
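
To build intuition for what context_precision rewards, the sketch below computes a rank-weighted precision from hand-labeled relevance flags. It follows the metric as it is commonly described (precision at each relevant rank, averaged over the relevant chunks). RAGAS derives the flags with an LLM judge, so treat this as an illustration, not the library's implementation.

```python
def context_precision(relevance):
    """relevance holds one 0/1 flag per retrieved chunk, in rank order."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / k  # precision@k, counted only at relevant ranks
    return score / total_relevant


print(context_precision([1, 1, 0]))  # relevant chunks ranked first
print(context_precision([0, 0, 1]))  # the only relevant chunk ranked last
```

The same set of chunks scores higher when the relevant ones are ranked first, which is why reranking tends to raise this metric even when recall is unchanged.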

Code Examples

Python/LangChain: Complete Naive RAG Pipeline

# Input:  Directory of documents + user query
# Output: LLM answer grounded in retrieved context with sources

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# 1. Load and chunk
docs = DirectoryLoader("./docs", glob="**/*.md").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(docs)

# 2. Embed and store
db = Chroma.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))
retriever = db.as_retriever(search_kwargs={"k": 5})

# 3. Generate
prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n{context}\n\nQuestion: {question}"
)
chain = (
    {"context": retriever | (lambda docs: "\n".join(d.page_content for d in docs)),
     "question": RunnablePassthrough()}
    | prompt | ChatOpenAI(model="gpt-4o", temperature=0) | StrOutputParser()
)
print(chain.invoke("How does RAG work?"))

Python/LlamaIndex: RAG with Sentence Window Retrieval

# Input:  PDF documents + user query
# Output: Answer using sentence-window retrieval for fine-grained context

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Parse with sentence windows — embeds single sentences, retrieves surrounding window
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # 3 sentences before + after the matched sentence
)

documents = SimpleDirectoryReader("./docs").load_data()
# Pass the parser via transformations so it is applied when building the index
index = VectorStoreIndex.from_documents(documents, transformations=[node_parser])

# At query time, replace sentence with full window for generation
query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
response = query_engine.query("What are RAG best practices?")
print(response)

TypeScript/LangChain: RAG with Pinecone

// Input:  Array of text documents + user query string
// Output: LLM-generated answer grounded in retrieved context

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence, RunnablePassthrough } from "@langchain/core/runnables";

const pinecone = new Pinecone();
const index = pinecone.Index("rag-index");

const vectorStore = await PineconeStore.fromExistingIndex(
  new OpenAIEmbeddings({ modelName: "text-embedding-3-small" }),
  { pineconeIndex: index }
);

const retriever = vectorStore.asRetriever({ k: 5 });
const llm = new ChatOpenAI({ modelName: "gpt-4o", temperature: 0 });
const prompt = ChatPromptTemplate.fromTemplate(
  "Answer based on context:\n{context}\n\nQuestion: {question}"
);

const chain = RunnableSequence.from([
  { context: retriever.pipe((docs) => docs.map((d) => d.pageContent).join("\n")),
    question: new RunnablePassthrough() },
  prompt, llm, new StringOutputParser(),
]);
const answer = await chain.invoke("How does RAG work?");

Anti-Patterns

Wrong: Stuffing entire documents into the context window

# BAD — sending full documents instead of relevant chunks
# Wastes tokens, dilutes relevance, hits context window limits
def naive_answer(question, documents):
    full_text = "\n".join(doc.page_content for doc in documents)  # Could be 500K+ tokens
    return llm.invoke(f"Context: {full_text}\n\nQuestion: {question}")
# Result: exceeds context window, LLM ignores middle content ("lost in the middle" problem)

Correct: Retrieve only relevant chunks with bounded k

# GOOD: retrieve top-k chunks, then keep the best-ranked ones that fit the token budget [src2]
import tiktoken

def rag_answer(question, retriever, llm, max_context_tokens=3000):
    enc = tiktoken.get_encoding("cl100k_base")  # Tokenizer choice is an assumption; match your LLM
    kept, used = [], 0
    for chunk in retriever.invoke(question):  # Ranked best-first
        n = len(enc.encode(chunk.page_content))
        if used + n > max_context_tokens:
            break  # Budget spent: drop this and all lower-ranked chunks
        kept.append(chunk.page_content)
        used += n
    context = "\n\n".join(kept)
    return llm.invoke(f"Context: {context}\n\nQuestion: {question}")

Wrong: Using fixed-size chunking without overlap

# BAD — hard splits at character boundaries, no overlap
chunks = [text[i:i+1000] for i in range(0, len(text), 1000)]
# "The Eiffel Tower was built in" | "1889 by Gustave Eiffel"
# Neither chunk contains the complete fact — retrieval fails for both

Correct: Recursive chunking with overlap and boundary awareness

# GOOD — respects natural boundaries, preserves cross-boundary context [src4]
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,  # 12% overlap captures cross-boundary content
    separators=["\n\n", "\n", ". ", " ", ""],  # Split at natural boundaries first
)
chunks = splitter.split_documents(documents)

Wrong: Embedding queries and documents with different models

# BAD: mismatched embedding models produce incomparable vector spaces
doc_embeddings = model_a.encode(documents)     # Model A for docs
query_embedding = model_b.encode(user_query)   # Model B for queries
# Cosine similarity between different vector spaces is meaningless

Correct: Use the same embedding model for both documents and queries

# GOOD: same model ensures vectors live in the same space [src2]
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
# Used at index time
vectorstore = Chroma.from_documents(chunks, embedding_model)
# Used at query time (automatically by the retriever)
retriever = vectorstore.as_retriever()

Wrong: No evaluation — “it seems to work”

# BAD — no metrics, no test set, deploying based on vibes
chain = build_rag_chain()
answer = chain.invoke("test question")
print(answer)  # "Looks good to me!" — ships to production
# Silent regressions when you change chunking, models, or prompts

Correct: Automated evaluation with RAGAS metrics

# GOOD — measurable quality gates before deployment [src7]
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
result = evaluate(test_dataset, metrics=[faithfulness, answer_relevancy, context_precision])
assert result["faithfulness"] > 0.85, f"Faithfulness too low: {result['faithfulness']}"
assert result["answer_relevancy"] > 0.80, f"Relevancy too low: {result['answer_relevancy']}"
# Gate deployments on metric thresholds in CI/CD

Wrong: Retrieving without reranking for precision-critical applications

# BAD — top-5 from vector similarity often includes tangentially related noise
results = vectorstore.similarity_search(query, k=5)
# Result 1: Highly relevant
# Result 2: Same topic, wrong subtopic
# Result 3: Tangentially related
# Results 4-5: Noise that confuses the LLM

Correct: Over-retrieve and rerank

# GOOD — retrieve 20, rerank to 5 for high precision [src1] [src3]
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever
)
# A cross-encoder scores each query-chunk pair jointly, which is typically more precise
# than bi-encoder vector similarity alone; measure the gain on your own test set

Common Pitfalls

Choosing chunk size without testing: Optimal chunk size depends on document type and query patterns. Factoid queries work best with 256-512 tokens; analytical queries need 1024+. Fix: benchmark 3-4 chunk sizes on your test set using RAGAS context_precision. [src4]
Ignoring metadata filters: Retrieving across all documents when the query clearly targets a specific domain/date/category wastes context tokens on irrelevant chunks. Fix: attach metadata (source, date, category) at indexing time and filter at retrieval time. [src6]
Not handling the “no relevant results” case: When the vector store returns low-similarity results, the LLM hallucinates rather than admitting ignorance. Fix: set a similarity threshold (e.g., cosine > 0.7) and return “I don’t have information on this” when no chunk passes. [src7]
Updating documents without re-indexing: Editing source documents without re-embedding and upserting leads to stale retrieval results. Fix: build an incremental indexing pipeline that detects changed files and re-indexes only those. [src2]
“Lost in the middle” effect: LLMs attend more to the beginning and end of long contexts and underweight the middle. Fix: limit retrieved chunks to 3-5, or order the context so the most relevant chunks sit at the start and end rather than in the middle. [src3]
Missing contextual information in chunks: A chunk saying “the company reported $4.2B revenue” without identifying which company is useless. Fix: use Anthropic’s contextual retrieval — prepend chunk-specific context. [src1]
Skipping hybrid search for technical content: Pure vector search misses exact API names, error codes, and version numbers that BM25 catches trivially. Fix: always use hybrid search (vector + BM25) for technical documentation. [src1] [src6]
Treating all queries the same: Simple factual queries and complex multi-step questions need different retrieval strategies. Fix: implement query routing — classify queries and route to appropriate retriever (simple vs. multi-hop vs. agentic). [src5]
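
The reordering fix for the “lost in the middle” pitfall can be sketched in a few lines. This is one possible ordering, assuming the input list is sorted best-first; the function name is hypothetical, and LangChain ships a comparable document transformer if you prefer not to write your own.

```python
def reorder_for_long_context(chunks_best_first):
    """Put the strongest chunks at the edges of the context and the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        if i % 2 == 0:
            front.append(chunk)  # ranks 1, 3, 5, ... fill the start
        else:
            back.append(chunk)   # ranks 2, 4, ... fill the end, in reverse
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
print(reorder_for_long_context(ranked))
```

The best chunk lands first, the second best lands last, and the least relevant chunk ends up in the middle, where the model is least likely to rely on it.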

Diagnostic Commands

# Check embedding dimensions match between index and query
python -c "from langchain_openai import OpenAIEmbeddings; e=OpenAIEmbeddings(model='text-embedding-3-small'); print(len(e.embed_query('test')))"

# Count chunks in vector store (Pinecone)
python -c "from pinecone import Pinecone; pc=Pinecone(); print(pc.Index('rag-index').describe_index_stats())"

# The checks below assume retriever, chunks, and dataset already exist in your Python
# session (REPL or notebook); run as standalone python -c one-liners they raise NameError.

# Test retrieval quality: does the top result answer the query?
results = retriever.invoke("your test query"); print(results[0].page_content[:200])

# Measure chunk size distribution (in characters)
import statistics; sizes = [len(c.page_content) for c in chunks]; print(f"mean={statistics.mean(sizes):.0f}, median={statistics.median(sizes):.0f}, std={statistics.stdev(sizes):.0f}")

# Check for duplicate chunks (common after re-indexing)
contents = [c.page_content for c in chunks]; dupes = len(contents) - len(set(contents)); print(f"{dupes} duplicate chunks")

# Run RAGAS evaluation on test set
from ragas import evaluate; from ragas.metrics import faithfulness; print(evaluate(dataset, metrics=[faithfulness]))

Version History & Compatibility

Version	Status	Breaking Changes	Migration Notes
LangChain 0.3 (2024-09)	Current	Pydantic 2 used internally; Python 3.8 support dropped; legacy chains (LLMChain, RetrievalQA) remain deprecated in favor of LCEL	Replace `RetrievalQA.from_chain_type()` with LCEL chain composition
LangChain 0.2 (2024-05)	Maintenance	Community package split; `langchain-community` separate	Move imports from `langchain` to `langchain-community` or partner packages
LlamaIndex 0.10+ (2024-03)	Current	New module structure; `llama-index-core` + integration packages	Update imports; install integration packages separately
OpenAI Embeddings v3 (2024-01)	Current	New models: text-embedding-3-small (1536d), text-embedding-3-large (3072d)	Re-index with new model for better quality; supports Matryoshka dimensionality reduction
Pinecone Serverless (2024-01)	Current	New serverless architecture; pod-based deprecated for new indexes	Migrate pod indexes to serverless; use namespaces for multi-tenancy
Weaviate 1.25+ (2024-06)	Current	Named vectors; multi-modal modules	Use named vectors for hybrid (dense + sparse) in single collection

When to Use / When Not to Use

Use When	Don't Use When	Use Instead
Knowledge base changes frequently (weekly/monthly)	Data is static and small enough to fit in prompt	Place all context directly in the prompt (e.g., in the system message)
Need grounded, cited answers from specific documents	Questions require real-time web data	Web search + LLM (Perplexity-style)
Corpus exceeds LLM context window (>100K tokens)	Corpus fits in a single context window (<50K tokens)	Long-context LLM without retrieval
Need to reduce hallucinations on domain-specific topics	Creative writing or open-ended generation	Standard LLM prompting
Multi-tenant — different users access different document sets	All users access the same small knowledge base	Fine-tuned LLM or prompt engineering
Must attribute answers to specific source documents	Attribution not required	Fine-tuning bakes knowledge into weights
Budget-conscious: retrieve only relevant context per query	Token budget is effectively unlimited and added latency is acceptable	Send entire knowledge base in prompt

Important Caveats

Embedding model choice locks your index — switching from text-embedding-3-small (1536 dims) to text-embedding-3-large (3072 dims) requires full re-indexing. Choose embedding model before building the index. [src2]
RAG does not eliminate hallucinations — it reduces them. The LLM can still generate content not present in retrieved chunks, especially when instructions say “answer the question” without “based ONLY on the context.” Always include strict grounding instructions in the prompt. [src7]
Vector similarity != relevance — cosine similarity of 0.85 does not mean 85% relevant. Thresholds vary by embedding model and domain. Calibrate with labeled data. [src2]
Latency adds up: embedding (50-200ms) + vector search (20-100ms) + reranking (100-500ms) + LLM generation (500-3000ms) = roughly 670-3800ms total. Budget each component and parallelize where possible. [src7]
Cost scales with queries, not corpus size — embedding the corpus is a one-time cost; the recurring cost is per-query embedding + LLM generation. At 10K queries/day with GPT-4o, expect $50-200/day in LLM costs alone. [src2]
Frameworks change rapidly: LangChain, LlamaIndex, and vector DB APIs can ship breaking changes as often as quarterly. Pin dependency versions and test upgrades in staging. [src5]
Hybrid search is not always better — for purely semantic queries (e.g., “explain how transformers work”), BM25 adds no value and may introduce noise. Profile your query distribution before adding hybrid search complexity. [src1] [src6]
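
The latency caveat is easier to act on with the per-stage ranges in one place. The sketch below sums the ranges quoted above and reports each stage's share of the worst case; the dictionary keys and function names are illustrative, and the numbers should be replaced with measurements from your own traces.

```python
# Per-stage latency ranges in milliseconds, taken from the caveat above
STAGE_LATENCY_MS = {
    "query_embedding": (50, 200),
    "vector_search": (20, 100),
    "reranking": (100, 500),
    "llm_generation": (500, 3000),
}


def total_latency(stages):
    """Return (best_case_ms, worst_case_ms) for stages that run sequentially."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


def worst_case_share(stages):
    """Fraction of the worst-case total that each stage accounts for."""
    _, worst = total_latency(stages)
    return {name: high / worst for name, (_, high) in stages.items()}


print(total_latency(STAGE_LATENCY_MS))
for name, share in worst_case_share(STAGE_LATENCY_MS).items():
    print(f"{name}: {share:.0%}")
```

With these ranges, generation dominates the worst case, so streaming the answer or choosing a faster model usually does more for perceived latency than tuning the vector search.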