vectorstore.similarity_search(query, k=5) followed by llm.invoke(prompt_with_context) (LangChain pattern)

| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Document Loader | Ingests raw documents (PDF, HTML, Markdown, DB rows) into the pipeline | LangChain document loaders, LlamaIndex readers, Unstructured.io, Apache Tika | Horizontal — stateless workers behind a queue |
| Text Splitter / Chunker | Splits documents into semantically coherent chunks with metadata | RecursiveCharacterTextSplitter (400-512 tokens), semantic chunking, sentence-window | CPU-bound — parallelize across documents |
| Embedding Model | Converts text chunks and queries into dense vector representations | OpenAI text-embedding-3-small/large, Voyage AI voyage-3, Cohere embed-v3, BGE-large, E5 | Batch API calls; GPU for self-hosted models |
| Vector Store | Stores and indexes embeddings for fast approximate nearest neighbor (ANN) search | Pinecone, Weaviate, Qdrant, Chroma, pgvector, Milvus, FAISS | Shard by namespace/collection; replicate for reads |
| Sparse Index (BM25) | Keyword-based retrieval for hybrid search — catches exact terms that vectors miss | Elasticsearch, OpenSearch, Pinecone sparse vectors, Weaviate BM25 | Shard by document ID; standard search scaling |
| Query Transformer | Rewrites, expands, or decomposes user queries before retrieval | HyDE (hypothetical document), multi-query, sub-question decomposition | Stateless — LLM call per query |
| Retriever | Executes similarity search against vector store, returns top-k relevant chunks | Vector search (cosine/dot-product), hybrid (vector + BM25), metadata filters | Tune k, use metadata filters to reduce search space |
| Reranker | Re-scores retrieved chunks by relevance using a cross-encoder model | Cohere Rerank, Jina Reranker, bge-reranker, FlashRank | GPU for cross-encoder; retrieve 3-5x final k, rerank to k |
| Context Assembler | Formats retrieved chunks into a prompt template with citations and instructions | LangChain prompt templates, LlamaIndex response synthesizer | Stateless — string formatting |
| LLM Generator | Produces the final answer grounded in retrieved context | GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Llama 3, Mistral, Gemini | Scale via API rate limits or self-hosted replicas |
| Evaluation Framework | Measures retrieval and generation quality with automated metrics | RAGAS (faithfulness, relevancy, precision), DeepEval, TruLens | Run offline on test sets; CI integration |
| Observability | Traces retrieval, reranking, and generation steps for debugging and cost tracking | LangSmith, Arize Phoenix, Weights & Biases, OpenTelemetry | Event streaming — scale consumers independently |
START
├── Corpus size and complexity?
│   ├── <10K documents, single domain
│   │   ├── Simple Q&A → Naive RAG: chunk + embed + retrieve + generate
│   │   └── Need citations → Naive RAG + source tracking in metadata
│   ├── 10K-1M documents, multiple domains
│   │   ├── Keyword matches important → Hybrid search (vector + BM25) [src1]
│   │   ├── Mixed doc types (PDF, code, tables) → Semantic chunking + metadata filters
│   │   └── Need high precision → Add reranker (retrieve 20, rerank to 5)
│   └── >1M documents, enterprise-scale
│       ├── Multi-tenant → Namespace/collection per tenant in vector DB
│       ├── Latency-sensitive → Cache frequent queries + pre-compute popular embeddings
│       └── Complex multi-hop questions → Agentic RAG with LangGraph [src5]
├── Query complexity?
│   ├── Single-hop factual → Standard retrieval with k=3-5
│   ├── Multi-hop reasoning → Sub-question decomposition or iterative retrieval (see sketch below)
│   └── Conversational (follow-ups) → Conversation-aware retrieval with history condensation
├── Vector DB choice?
│   ├── Managed, zero-ops → Pinecone (serverless)
│   ├── Open-source, self-hosted → Qdrant, Weaviate, or Milvus
│   ├── Already using PostgreSQL → pgvector extension
│   └── Prototyping/local dev → Chroma (in-memory) or FAISS
└── DEFAULT → Start with Naive RAG (LangChain + Chroma), add reranking when precision matters
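The multi-query / sub-question branch above can be prototyped with LangChain's MultiQueryRetriever, which has an LLM generate several rephrasings of the query and returns the deduplicated union of results. A minimal sketch, assuming a vectorstore and llm like the ones built in the steps below:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
# Each query is rewritten into several variants; retrieval runs once per
# variant and the union of unique documents is returned
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
)
docs = multi_query_retriever.invoke("How do chunk size and overlap affect retrieval quality?")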
Split documents into semantically coherent chunks. Use recursive character splitting as a baseline (400-512 tokens, 10-20% overlap), then upgrade to semantic chunking if quality metrics demand it. [src4]
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
# Load documents from a directory
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
raw_docs = loader.load()
# Chunk with overlap to preserve context at boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # measured in characters here (length_function=len); pass a token counter to enforce token budgets
    chunk_overlap=64,  # ~12% overlap preserves cross-boundary context
    separators=["\n\n", "\n", ". ", " ", ""],  # Respect natural boundaries
    length_function=len,
)
chunks = splitter.split_documents(raw_docs)
print(f"Split {len(raw_docs)} documents into {len(chunks)} chunks")
Verify: len(chunks) is 5-20x the number of source documents. Spot-check 10 random chunks for coherence.
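A quick spot-check helper for that second criterion (a minimal sketch; chunks is the list produced above):
import random
# Print 10 random chunks to eyeball coherence — each should read as a
# self-contained passage, not a mid-sentence fragment
for chunk in random.sample(chunks, min(10, len(chunks))):
    print(f"--- {chunk.metadata.get('source', 'unknown')} ({len(chunk.page_content)} chars)")
    print(chunk.page_content[:300], "\n")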
Embed all chunks and upsert into your vector store. Pin the embedding model — changing it later requires full re-indexing. [src2]
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
import os
# Pin embedding model version — changing requires full re-index
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # 1536 dims, $0.02/1M tokens
    openai_api_key=os.environ["OPENAI_API_KEY"],
)
# Create vector store and upsert chunks
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name="rag-index",
    namespace="production",
)
print(f"Indexed {len(chunks)} chunks into Pinecone")
Verify: vectorstore.similarity_search("test query", k=3) returns relevant chunks, not random results.
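For example, a quick eyeball test against the index just built:
for doc in vectorstore.similarity_search("test query", k=3):
    # Each hit should come from a plausible source and read on-topic
    print(doc.metadata.get("source", "unknown"), "→", doc.page_content[:100])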
Combine dense vector search with sparse keyword search (BM25) for higher recall. Anthropic’s research shows hybrid retrieval with contextual embeddings reduces failures by 49%. [src1] [src6]
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
# Dense retriever — semantic similarity
dense_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10},  # Over-retrieve for reranking
)
# Sparse retriever — keyword matching (catches exact terms vectors miss)
bm25_retriever = BM25Retriever.from_documents(chunks, k=10)
# Hybrid: combine dense + sparse with weighted fusion
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4],  # Tune based on your query types
)
results = hybrid_retriever.invoke("How does contextual retrieval work?")
Verify: Compare results from dense-only vs. hybrid on 20 test queries — hybrid should improve recall on keyword-heavy queries.
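One way to run that comparison is a small recall harness over a hand-curated list of (query, expected source file) pairs; the queries and filenames below are hypothetical placeholders:
# Hypothetical ground truth: each query should retrieve a chunk from the named file
test_queries = [
    ("What does reciprocal rank fusion do?", "fusion.md"),
    ("Default HNSW index parameters", "vector-db.md"),
]

def recall_at_k(retriever, tests):
    # Fraction of queries whose expected source appears anywhere in the retrieved set
    hits = sum(
        any(expected in d.metadata.get("source", "") for d in retriever.invoke(query))
        for query, expected in tests
    )
    return hits / len(tests)

print(f"dense-only recall: {recall_at_k(dense_retriever, test_queries):.2f}")
print(f"hybrid recall:     {recall_at_k(hybrid_retriever, test_queries):.2f}")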
Over-retrieve (3-5x final k), then rerank with a cross-encoder to maximize precision. This reduces noise in the LLM context. [src1] [src3]
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
# Reranker: retrieves 20 chunks, reranks to top 5
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5,  # Final number of chunks passed to LLM
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever,  # From step 3
)
reranked_results = compression_retriever.invoke("What embedding model should I use?")
# Now only the 5 most relevant chunks proceed to generation
Verify: Spot-check reranked results — the top chunk should directly answer the query, not just be tangentially related.
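The reranker attaches a relevance score to each document's metadata (in langchain_cohere's implementation), which makes that spot-check concrete:
for doc in compression_retriever.invoke("What embedding model should I use?"):
    # relevance_score is written into metadata by the reranking step
    print(f"{doc.metadata.get('relevance_score', 0):.3f}  {doc.page_content[:80]}")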
Assemble retrieved context into a prompt template and pass to the LLM. Include source metadata for citation traceability. [src5]
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_template("""
Answer the question based ONLY on the following context. If the context
does not contain enough information, say "I don't have enough information."
Cite sources using [Source: filename] format.
Context:
{context}
Question: {question}
Answer:""")
def format_docs(docs):
    return "\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )

# LCEL chain: retrieve → format → prompt → generate
rag_chain = (
    {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
answer = rag_chain.invoke("How do I design a RAG system?")
print(answer)
Verify: Answer references specific source documents. Asking about topics not in the corpus returns “I don’t have enough information.”
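A minimal negative test for the second condition, assuming a topic you know is absent from the corpus (exact-string assertions on LLM output are brittle, so treat this as a smoke test):
# The chain should refuse rather than hallucinate on out-of-corpus topics
off_topic = rag_chain.invoke("What is the capital of Mars?")
assert "don't have enough information" in off_topic.lower(), off_topic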
Measure retrieval and generation quality using automated evaluation. Run on a test set of 50-100 question-answer pairs. [src7]
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset
# Prepare evaluation dataset
eval_data = {
    "question": ["What is contextual retrieval?", "How does hybrid search work?"],
    "answer": [answer_1, answer_2],  # RAG-generated answers
    "contexts": [retrieved_contexts_1, retrieved_contexts_2],
    "ground_truth": ["Contextual retrieval prepends...", "Hybrid search combines..."],
}
result = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
# Target: faithfulness > 0.85, answer_relevancy > 0.80, context_precision > 0.75
Verify: All metrics above threshold. If faithfulness < 0.85, improve chunking or retrieval. If relevancy < 0.80, improve prompt template.
# Input: Directory of documents + user query
# Output: LLM answer grounded in retrieved context with sources
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# 1. Load and chunk
docs = DirectoryLoader("./docs", glob="**/*.md").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(docs)
# 2. Embed and store
db = Chroma.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))
retriever = db.as_retriever(search_kwargs={"k": 5})
# 3. Generate
prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n{context}\n\nQuestion: {question}"
)
chain = (
    {"context": retriever | (lambda docs: "\n".join(d.page_content for d in docs)),
     "question": RunnablePassthrough()}
    | prompt | ChatOpenAI(model="gpt-4o", temperature=0) | StrOutputParser()
)
print(chain.invoke("How does RAG work?"))
# Input: PDF documents + user query
# Output: Answer using sentence-window retrieval for fine-grained context
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
# Parse with sentence windows — embeds single sentences, retrieves surrounding window
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # 3 sentences before + after the matched sentence
)
documents = SimpleDirectoryReader("./docs").load_data()
# LlamaIndex 0.10+ takes node parsers via the transformations argument
index = VectorStoreIndex.from_documents(documents, transformations=[node_parser])
# At query time, replace the matched sentence with its full window for generation
query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
response = query_engine.query("What are RAG best practices?")
print(response)
// Input: Array of text documents + user query string
// Output: LLM-generated answer grounded in retrieved context
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence, RunnablePassthrough } from "@langchain/core/runnables";
const pinecone = new Pinecone();
const index = pinecone.Index("rag-index");
const vectorStore = await PineconeStore.fromExistingIndex(
  new OpenAIEmbeddings({ modelName: "text-embedding-3-small" }),
  { pineconeIndex: index }
);
const retriever = vectorStore.asRetriever({ k: 5 });
const llm = new ChatOpenAI({ modelName: "gpt-4o", temperature: 0 });
const prompt = ChatPromptTemplate.fromTemplate(
  "Answer based on context:\n{context}\n\nQuestion: {question}"
);
const chain = RunnableSequence.from([
  {
    context: retriever.pipe((docs) => docs.map((d) => d.pageContent).join("\n")),
    question: new RunnablePassthrough(),
  },
  prompt,
  llm,
  new StringOutputParser(),
]);
const answer = await chain.invoke("How does RAG work?");
console.log(answer);
# BAD — sending full documents instead of relevant chunks
# Wastes tokens, dilutes relevance, hits context window limits
def naive_answer(question, documents):
    full_text = "\n".join(doc.page_content for doc in documents)  # Could be 500K+ tokens
    return llm.invoke(f"Context: {full_text}\n\nQuestion: {question}")
# Result: exceeds context window, LLM ignores middle content ("lost in the middle" problem)
# GOOD — retrieve top-k relevant chunks, respecting token budget [src2]
import tiktoken

def num_tokens(text, model="gpt-4o"):
    # Count tokens with the model's own tokenizer
    return len(tiktoken.encoding_for_model(model).encode(text))

def rag_answer(question, retriever, llm, max_context_tokens=3000):
    chunks = retriever.invoke(question)  # Returns k most relevant chunks
    context = "\n\n".join(c.page_content for c in chunks)
    if num_tokens(context) > max_context_tokens:
        chunks = chunks[:len(chunks) // 2]  # Trim to fit
        context = "\n\n".join(c.page_content for c in chunks)
    return llm.invoke(f"Context: {context}\n\nQuestion: {question}")
# BAD — hard splits at character boundaries, no overlap
chunks = [text[i:i+1000] for i in range(0, len(text), 1000)]
# "The Eiffel Tower was built in" | "1889 by Gustave Eiffel"
# Neither chunk contains the complete fact — retrieval fails for both
# GOOD — respects natural boundaries, preserves cross-boundary context [src4]
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,  # 12% overlap captures cross-boundary content
    separators=["\n\n", "\n", ". ", " ", ""],  # Split at natural boundaries first
)
chunks = splitter.split_documents(documents)
# BAD — asymmetric embedding models produce incomparable vector spaces
doc_embeddings = model_a.encode(documents) # Model A for docs
query_embedding = model_b.encode(user_query) # Model B for queries
# Cosine similarity between different vector spaces is meaningless
# GOOD — same model ensures vectors live in the same space [src2]
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
# Used at index time
vectorstore = Chroma.from_documents(chunks, embedding_model)
# Used at query time (automatically by the retriever)
retriever = vectorstore.as_retriever()
# BAD — no metrics, no test set, deploying based on vibes
chain = build_rag_chain()
answer = chain.invoke("test question")
print(answer) # "Looks good to me!" — ships to production
# Silent regressions when you change chunking, models, or prompts
# GOOD — measurable quality gates before deployment [src7]
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
result = evaluate(test_dataset, metrics=[faithfulness, answer_relevancy, context_precision])
assert result["faithfulness"] > 0.85, f"Faithfulness too low: {result['faithfulness']}"
assert result["answer_relevancy"] > 0.80, f"Relevancy too low: {result['answer_relevancy']}"
# Gate deployments on metric thresholds in CI/CD
# BAD — top-5 from vector similarity often includes tangentially related noise
results = vectorstore.similarity_search(query, k=5)
# Result 1: Highly relevant
# Result 2: Same topic, wrong subtopic
# Result 3: Tangentially related
# Results 4-5: Noise that confuses the LLM
# GOOD — retrieve 20, rerank to 5 for high precision [src1] [src3]
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever
)
# Cross-encoders score each (query, chunk) pair jointly, yielding far higher precision than bi-encoder similarity alone
# Check embedding dimensions match between index and query
python -c "from langchain_openai import OpenAIEmbeddings; e=OpenAIEmbeddings(model='text-embedding-3-small'); print(len(e.embed_query('test')))"
# Count chunks in vector store (Pinecone)
python -c "from pinecone import Pinecone; pc=Pinecone(); print(pc.Index('rag-index').describe_index_stats())"
# The checks below reference live objects (retriever, chunks, dataset), so run them
# in a REPL or notebook where the pipeline is loaded, not via python -c:
# Test retrieval quality — does the top result answer the query?
print(retriever.invoke("your test query")[0].page_content[:200])
# Measure chunk size distribution
import statistics; sizes = [len(c.page_content) for c in chunks]; print(f"mean={statistics.mean(sizes):.0f}, median={statistics.median(sizes):.0f}, std={statistics.stdev(sizes):.0f}")
# Check for duplicate chunks (common after re-indexing)
contents = [c.page_content for c in chunks]; print(f"{len(contents) - len(set(contents))} duplicate chunks")
# Run RAGAS evaluation on a prepared test dataset
from ragas import evaluate; from ragas.metrics import faithfulness; print(evaluate(dataset, metrics=[faithfulness]))
| Version | Status | Breaking Changes | Migration Notes |
|---|---|---|---|
| LangChain 0.3 (2024-09) | Current | LCEL-first API; deprecated legacy chains (LLMChain, RetrievalQA) | Replace RetrievalQA.from_chain_type() with LCEL chain composition |
| LangChain 0.2 (2024-05) | Maintenance | Community package split; langchain-community separate | Move imports from langchain to langchain-community or partner packages |
| LlamaIndex 0.10+ (2024-03) | Current | New module structure; llama-index-core + integration packages | Update imports; install integration packages separately |
| OpenAI Embeddings v3 (2024-01) | Current | New models: text-embedding-3-small (1536d), text-embedding-3-large (3072d) | Re-index with new model for better quality; supports Matryoshka dimensionality reduction |
| Pinecone Serverless (2024-01) | Current | New serverless architecture; pod-based deprecated for new indexes | Migrate pod indexes to serverless; use namespaces for multi-tenancy |
| Weaviate 1.25+ (2024-06) | Current | Named vectors; multi-modal modules | Use named vectors for hybrid (dense + sparse) in single collection |
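For the LangChain 0.3 row, the shape of the RetrievalQA-to-LCEL migration looks roughly like this (a sketch reusing retriever, format_docs, prompt, and llm from the pipeline above):
# Before (deprecated since 0.3):
#   qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
#   answer = qa.run(question)
# After (explicit LCEL composition):
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
answer = chain.invoke(question)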
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Knowledge base changes frequently (weekly/monthly) | Data is static and small enough to fit in prompt | Direct prompt injection (system message with all context) |
| Need grounded, cited answers from specific documents | Questions require real-time web data | Web search + LLM (Perplexity-style) |
| Corpus exceeds LLM context window (>100K tokens) | Corpus fits in a single context window (<50K tokens) | Long-context LLM without retrieval |
| Need to reduce hallucinations on domain-specific topics | Creative writing or open-ended generation | Standard LLM prompting |
| Multi-tenant — different users access different document sets | All users access the same small knowledge base | Fine-tuned LLM or prompt engineering |
| Must attribute answers to specific source documents | Attribution not required | Fine-tuning bakes knowledge into weights |
| Budget-conscious — retrieve only relevant context per query | Unlimited token budget and low latency tolerance | Send entire knowledge base in prompt |
Migrating from text-embedding-3-small (1536 dims) to text-embedding-3-large (3072 dims) requires full re-indexing, so choose the embedding model before building the index. [src2]
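A practical way to enforce that choice is a single embedding config used at both index and query time; text-embedding-3 models also accept a dimensions parameter (Matryoshka truncation) that must likewise be pinned. A minimal sketch:
from langchain_openai import OpenAIEmbeddings

# Single source of truth — import this wherever embeddings are created
EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_DIMS = 1024  # Matryoshka truncation: cheaper storage, modest quality trade-off

def get_embeddings() -> OpenAIEmbeddings:
    return OpenAIEmbeddings(model=EMBEDDING_MODEL, dimensions=EMBEDDING_DIMS)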