RAG System Architecture: System Design Guide

Type: Software Reference | Confidence: 0.90 | Sources: 7 | Verified: 2026-02-23 | Freshness: monthly

TL;DR

Constraints

Quick Reference

| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| Document Loader | Ingests raw documents (PDF, HTML, Markdown, DB rows) into the pipeline | LangChain document loaders, LlamaIndex readers, Unstructured.io, Apache Tika | Horizontal — stateless workers behind a queue |
| Text Splitter / Chunker | Splits documents into semantically coherent chunks with metadata | RecursiveCharacterTextSplitter (400-512 tokens), semantic chunking, sentence-window | CPU-bound — parallelize across documents |
| Embedding Model | Converts text chunks and queries into dense vector representations | OpenAI text-embedding-3-small/large, Voyage AI voyage-3, Cohere embed-v3, BGE-large, E5 | Batch API calls; GPU for self-hosted models |
| Vector Store | Stores and indexes embeddings for fast approximate nearest neighbor (ANN) search | Pinecone, Weaviate, Qdrant, Chroma, pgvector, Milvus, FAISS | Shard by namespace/collection; replicate for reads |
| Sparse Index (BM25) | Keyword-based retrieval for hybrid search — catches exact terms that vectors miss | Elasticsearch, OpenSearch, Pinecone sparse vectors, Weaviate BM25 | Shard by document ID; standard search scaling |
| Query Transformer | Rewrites, expands, or decomposes user queries before retrieval | HyDE (hypothetical document), multi-query, sub-question decomposition | Stateless — LLM call per query |
| Retriever | Executes similarity search against vector store, returns top-k relevant chunks | Vector search (cosine/dot-product), hybrid (vector + BM25), metadata filters | Tune k, use metadata filters to reduce search space |
| Reranker | Re-scores retrieved chunks by relevance using a cross-encoder model | Cohere Rerank, Jina Reranker, bge-reranker, FlashRank | GPU for cross-encoder; retrieve 3-5x final k, rerank to k |
| Context Assembler | Formats retrieved chunks into a prompt template with citations and instructions | LangChain prompt templates, LlamaIndex response synthesizer | Stateless — string formatting |
| LLM Generator | Produces the final answer grounded in retrieved context | GPT-4o, Claude 3.5/Opus, Llama 3, Mistral, Gemini | Scale via API rate limits or self-hosted replicas |
| Evaluation Framework | Measures retrieval and generation quality with automated metrics | RAGAS (faithfulness, relevancy, precision), DeepEval, TruLens | Run offline on test sets; CI integration |
| Observability | Traces retrieval, reranking, and generation steps for debugging and cost tracking | LangSmith, Arize Phoenix, Weights & Biases, OpenTelemetry | Event streaming — scale consumers independently |

Decision Tree

START
├── Corpus size and complexity?
│   ├── <10K documents, single domain
│   │   ├── Simple Q&A → Naive RAG: chunk + embed + retrieve + generate
│   │   └── Need citations → Naive RAG + source tracking in metadata
│   ├── 10K-1M documents, multiple domains
│   │   ├── Keyword matches important → Hybrid search (vector + BM25) [src1]
│   │   ├── Mixed doc types (PDF, code, tables) → Semantic chunking + metadata filters
│   │   └── Need high precision → Add reranker (retrieve 20, rerank to 5)
│   └── >1M documents, enterprise-scale
│       ├── Multi-tenant → Namespace/collection per tenant in vector DB
│       ├── Latency-sensitive → Cache frequent queries + pre-compute popular embeddings
│       └── Complex multi-hop questions → Agentic RAG with LangGraph [src5]
├── Query complexity?
│   ├── Single-hop factual → Standard retrieval with k=3-5
│   ├── Multi-hop reasoning → Sub-question decomposition or iterative retrieval (see the query-transformation sketch below this tree)
│   └── Conversational (follow-ups) → Conversation-aware retrieval with history condensation
├── Vector DB choice?
│   ├── Managed, zero-ops → Pinecone (serverless)
│   ├── Open-source, self-hosted → Qdrant, Weaviate, or Milvus
│   ├── Already using PostgreSQL → pgvector extension
│   └── Prototyping/local dev → Chroma (in-memory) or FAISS
└── DEFAULT → Start with Naive RAG (LangChain + Chroma), add reranking when precision matters
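
The query-transformation options named above (multi-query, HyDE, sub-question decomposition) are all an LLM call placed in front of the retriever. Below is a minimal sketch of the multi-query variant using LangChain's MultiQueryRetriever, assuming the vectorstore built in step 2 of the guide that follows; HyDE and sub-question decomposition follow the same shape with different prompts.

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

# The LLM rewrites the user query into several phrasings; results are retrieved
# for each variant and deduplicated before being returned
query_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=query_llm,
)
docs = multi_query_retriever.invoke("Why did retrieval quality drop after re-indexing?")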

Step-by-Step Guide

1. Prepare and chunk your documents

Split documents into semantically coherent chunks. Use recursive character splitting as a baseline (400-512 tokens, 10-20% overlap), then upgrade to semantic chunking if quality metrics demand it. [src4]

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Load documents from a directory
loader = DirectoryLoader("./docs", glob="**/*.pdf", loader_cls=PyPDFLoader)
raw_docs = loader.load()

# Chunk with overlap to preserve context at boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # Counted in characters here (length_function=len); use from_tiktoken_encoder for a 400-512 token target
    chunk_overlap=64,     # ~12% overlap preserves cross-boundary context
    separators=["\n\n", "\n", ". ", " ", ""],  # Respect natural boundaries
    length_function=len,
)
chunks = splitter.split_documents(raw_docs)
print(f"Split {len(raw_docs)} documents into {len(chunks)} chunks")

Verify: len(chunks) is 5-20x the number of source documents. Spot-check 10 random chunks for coherence.
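
A quick way to run that spot-check, assuming the chunks list from the snippet above:

import random

# Print 10 random chunks with their source for a manual coherence check
for c in random.sample(chunks, k=min(10, len(chunks))):
    print(c.metadata.get("source", "unknown"), "|", c.page_content[:150], "\n---")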

2. Generate embeddings and store in a vector database

Embed all chunks and upsert into your vector store. Pin the embedding model — changing it later requires full re-indexing. [src2]

from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
import os

# Pin embedding model version — changing requires full re-index
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # 1536 dims, $0.02/1M tokens
    openai_api_key=os.environ["OPENAI_API_KEY"],
)

# Create vector store and upsert chunks
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name="rag-index",
    namespace="production",
)
print(f"Indexed {len(chunks)} chunks into Pinecone")

Verify: vectorstore.similarity_search("test query", k=3) returns relevant chunks, not random results.

3. Implement retrieval with hybrid search

Combine dense vector search with sparse keyword search (BM25) for higher recall. Anthropic’s research shows that hybrid retrieval combined with contextual embeddings cuts the retrieval failure rate by 49%. [src1] [src6]

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Dense retriever — semantic similarity
dense_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 10},  # Over-retrieve for reranking
)

# Sparse retriever — keyword matching (catches exact terms vectors miss)
bm25_retriever = BM25Retriever.from_documents(chunks, k=10)

# Hybrid: combine dense + sparse with weighted fusion
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4],  # Tune based on your query types
)
results = hybrid_retriever.invoke("How does contextual retrieval work?")

Verify: Compare results from dense-only vs. hybrid on 20 test queries — hybrid should improve recall on keyword-heavy queries.
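
The contextual-embeddings half of that result comes from prepending a short, LLM-generated note describing where each chunk sits in its source document before embedding it. A rough sketch of the idea, assuming the chunks from step 1 and any chat model; the prompt wording, truncation, and lack of batching are illustrative, not prescriptive:

from langchain_openai import ChatOpenAI

context_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def contextualize(chunk, full_document_text):
    # Ask for 1-2 sentences situating this chunk within the whole document,
    # then prepend them so the embedding carries that context
    situating = context_llm.invoke(
        "Here is a document:\n" + full_document_text[:8000]
        + "\n\nHere is a chunk from it:\n" + chunk.page_content
        + "\n\nWrite 1-2 sentences situating this chunk within the document."
    ).content
    chunk.page_content = situating + "\n\n" + chunk.page_content
    return chunk

# Apply per chunk before the embed-and-upsert step in step 2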

4. Add reranking for precision

Over-retrieve (3-5x final k), then rerank with a cross-encoder to maximize precision. This reduces noise in the LLM context. [src1] [src3]

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Reranker: retrieves 20 chunks, reranks to top 5
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=5,  # Final number of chunks passed to LLM
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid_retriever,  # From step 3
)

reranked_results = compression_retriever.invoke("What embedding model should I use?")
# Now only the 5 most relevant chunks proceed to generation

Verify: Spot-check reranked results — the top chunk should directly answer the query, not just be tangentially related.
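
One way to run that spot-check: the Cohere reranker stores its score in each document's metadata under relevance_score, so you can print score and text together (assumes reranked_results from the snippet above):

# Scores come back highest-first; the top chunk should directly address the query
for d in reranked_results:
    score = d.metadata.get("relevance_score")
    print(f"{score:.3f}" if score is not None else "  n/a", "|", d.page_content[:120])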

5. Build the generation chain with citations

Assemble retrieved context into a prompt template and pass to the LLM. Include source metadata for citation traceability. [src5]

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_template("""
Answer the question based ONLY on the following context. If the context
does not contain enough information, say "I don't have enough information."
Cite sources using [Source: filename] format.

Context:
{context}

Question: {question}

Answer:""")

def format_docs(docs):
    return "\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n{d.page_content}"
        for d in docs
    )

# LCEL chain: retrieve → format → prompt → generate
rag_chain = (
    {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("How do I design a RAG system?")
print(answer)

Verify: Answer references specific source documents. Asking about topics not in the corpus returns “I don’t have enough information.”
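
A quick check for the second half of that verification, using a question you know the corpus cannot answer (the query below is just an example):

# Should return the refusal phrase from the prompt, not a hallucinated answer
print(rag_chain.invoke("What was Acme Corp's revenue in 1872?"))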

6. Evaluate with RAGAS metrics

Measure retrieval and generation quality using automated evaluation. Run on a test set of 50-100 question-answer pairs. [src7]

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": ["What is contextual retrieval?", "How does hybrid search work?"],
    "answer": [answer_1, answer_2],  # RAG-generated answers
    "contexts": [retrieved_contexts_1, retrieved_contexts_2],
    "ground_truth": ["Contextual retrieval prepends...", "Hybrid search combines..."],
}

result = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
# Target: faithfulness > 0.85, answer_relevancy > 0.80, context_precision > 0.75

Verify: All metrics above threshold. If faithfulness < 0.85, improve chunking or retrieval. If relevancy < 0.80, improve prompt template.
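
The answer_1 / retrieved_contexts_1 placeholders above come from running your own pipeline over the test questions. A minimal way to collect them, assuming the rag_chain and compression_retriever from steps 4-5:

questions = ["What is contextual retrieval?", "How does hybrid search work?"]
ground_truths = ["Contextual retrieval prepends...", "Hybrid search combines..."]

answers, contexts = [], []
for q in questions:
    contexts.append([d.page_content for d in compression_retriever.invoke(q)])
    answers.append(rag_chain.invoke(q))

eval_data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truths,
}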

Code Examples

Python/LangChain: Complete Naive RAG Pipeline

# Input:  Directory of documents + user query
# Output: LLM answer grounded in retrieved context with sources

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# 1. Load and chunk
docs = DirectoryLoader("./docs", glob="**/*.md").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(docs)

# 2. Embed and store
db = Chroma.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))
retriever = db.as_retriever(search_kwargs={"k": 5})

# 3. Generate
prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n{context}\n\nQuestion: {question}"
)
chain = (
    {"context": retriever | (lambda docs: "\n".join(d.page_content for d in docs)),
     "question": RunnablePassthrough()}
    | prompt | ChatOpenAI(model="gpt-4o", temperature=0) | StrOutputParser()
)
print(chain.invoke("How does RAG work?"))

Python/LlamaIndex: RAG with Sentence Window Retrieval

# Input:  PDF documents + user query
# Output: Answer using sentence-window retrieval for fine-grained context

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Parse with sentence windows — embeds single sentences, retrieves surrounding window
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # 3 sentences before + after the matched sentence
)

documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents, transformations=[node_parser])

# At query time, replace sentence with full window for generation
query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)
response = query_engine.query("What are RAG best practices?")
print(response)

TypeScript/LangChain: RAG with Pinecone

// Input:  Array of text documents + user query string
// Output: LLM-generated answer grounded in retrieved context

import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence, RunnablePassthrough } from "@langchain/core/runnables";

const pinecone = new Pinecone();
const index = pinecone.Index("rag-index");

const vectorStore = await PineconeStore.fromExistingIndex(
  new OpenAIEmbeddings({ modelName: "text-embedding-3-small" }),
  { pineconeIndex: index }
);

const retriever = vectorStore.asRetriever({ k: 5 });
const llm = new ChatOpenAI({ modelName: "gpt-4o", temperature: 0 });
const prompt = ChatPromptTemplate.fromTemplate(
  "Answer based on context:\n{context}\n\nQuestion: {question}"
);

const chain = RunnableSequence.from([
  { context: retriever.pipe((docs) => docs.map((d) => d.pageContent).join("\n")),
    question: new RunnablePassthrough() },
  prompt, llm, new StringOutputParser(),
]);
const answer = await chain.invoke("How does RAG work?");

Anti-Patterns

Wrong: Stuffing entire documents into the context window

# BAD — sending full documents instead of relevant chunks
# Wastes tokens, dilutes relevance, hits context window limits
def naive_answer(question, documents):
    full_text = "\n".join(doc.page_content for doc in documents)  # Could be 500K+ tokens
    return llm.invoke(f"Context: {full_text}\n\nQuestion: {question}")
# Result: exceeds context window, LLM ignores middle content ("lost in the middle" problem)

Correct: Retrieve only relevant chunks with bounded k

# GOOD — retrieve top-k relevant chunks, respecting token budget [src2]
import tiktoken

def num_tokens(text, model="gpt-4o"):
    return len(tiktoken.encoding_for_model(model).encode(text))  # one way to count tokens

def rag_answer(question, retriever, llm, max_context_tokens=3000):
    chunks = retriever.invoke(question)  # Returns k most relevant chunks
    context = "\n\n".join(c.page_content for c in chunks)
    if num_tokens(context) > max_context_tokens:
        chunks = chunks[:len(chunks) // 2]  # Trim to fit
        context = "\n\n".join(c.page_content for c in chunks)
    return llm.invoke(f"Context: {context}\n\nQuestion: {question}")

Wrong: Using fixed-size chunking without overlap

# BAD — hard splits at character boundaries, no overlap
chunks = [text[i:i+1000] for i in range(0, len(text), 1000)]
# "The Eiffel Tower was built in" | "1889 by Gustave Eiffel"
# Neither chunk contains the complete fact — retrieval fails for both

Correct: Recursive chunking with overlap and boundary awareness

# GOOD — respects natural boundaries, preserves cross-boundary context [src4]
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,  # 12% overlap captures cross-boundary content
    separators=["\n\n", "\n", ". ", " ", ""],  # Split at natural boundaries first
)
chunks = splitter.split_documents(documents)

Wrong: Embedding queries and documents with different models

# BAD — asymmetric embedding models produce incomparable vector spaces
doc_embeddings = model_a.encode(documents)     # Model A for docs
query_embedding = model_b.encode(user_query)   # Model B for queries
# Cosine similarity between different vector spaces is meaningless

Correct: Use the same embedding model for both documents and queries

# GOOD — same model ensures vectors live in the same space [src2]
embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")
# Used at index time
vectorstore = Chroma.from_documents(chunks, embedding_model)
# Used at query time (automatically by the retriever)
retriever = vectorstore.as_retriever()

Wrong: No evaluation — “it seems to work”

# BAD — no metrics, no test set, deploying based on vibes
chain = build_rag_chain()
answer = chain.invoke("test question")
print(answer)  # "Looks good to me!" — ships to production
# Silent regressions when you change chunking, models, or prompts

Correct: Automated evaluation with RAGAS metrics

# GOOD — measurable quality gates before deployment [src7]
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
result = evaluate(test_dataset, metrics=[faithfulness, answer_relevancy, context_precision])
assert result["faithfulness"] > 0.85, f"Faithfulness too low: {result['faithfulness']}"
assert result["answer_relevancy"] > 0.80, f"Relevancy too low: {result['answer_relevancy']}"
# Gate deployments on metric thresholds in CI/CD

Wrong: Retrieving without reranking for precision-critical applications

# BAD — top-5 from vector similarity often includes tangentially related noise
results = vectorstore.similarity_search(query, k=5)
# Result 1: Highly relevant
# Result 2: Same topic, wrong subtopic
# Result 3: Tangentially related
# Results 4-5: Noise that confuses the LLM

Correct: Over-retrieve and rerank

# GOOD — retrieve 20, rerank to 5 for high precision [src1] [src3]
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)
retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever
)
# Cross-encoder reranking scores each query-chunk pair jointly, which typically gives a large precision boost over bi-encoder similarity alone

Common Pitfalls

Diagnostic Commands

# Check embedding dimensions match between index and query
python -c "from langchain_openai import OpenAIEmbeddings; e=OpenAIEmbeddings(model='text-embedding-3-small'); print(len(e.embed_query('test')))"

# Count chunks in vector store (Pinecone)
python -c "from pinecone import Pinecone; pc=Pinecone(); print(pc.Index('rag-index').describe_index_stats())"

# The remaining checks reference objects built in the pipeline (retriever, chunks, dataset),
# so run them in a Python session after the pipeline is loaded, not via python -c

# Test retrieval quality — does top result answer the query?
results = retriever.invoke("your test query"); print(results[0].page_content[:200])

# Measure chunk size distribution
import statistics; sizes = [len(c.page_content) for c in chunks]; print(f"mean={statistics.mean(sizes):.0f}, median={statistics.median(sizes):.0f}, std={statistics.stdev(sizes):.0f}")

# Check for duplicate chunks (common after re-indexing)
contents = [c.page_content for c in chunks]; dupes = len(contents) - len(set(contents)); print(f"{dupes} duplicate chunks")

# Run RAGAS evaluation on test set
from ragas import evaluate; from ragas.metrics import faithfulness; print(evaluate(dataset, metrics=[faithfulness]))

Version History & Compatibility

| Version | Status | Breaking Changes | Migration Notes |
|---|---|---|---|
| LangChain 0.3 (2024-09) | Current | LCEL-first API; deprecated legacy chains (LLMChain, RetrievalQA) | Replace RetrievalQA.from_chain_type() with LCEL chain composition |
| LangChain 0.2 (2024-05) | Maintenance | Community package split; langchain-community separate | Move imports from langchain to langchain-community or partner packages |
| LlamaIndex 0.10+ (2024-03) | Current | New module structure; llama-index-core + integration packages | Update imports; install integration packages separately |
| OpenAI Embeddings v3 (2024-01) | Current | New models: text-embedding-3-small (1536d), text-embedding-3-large (3072d) | Re-index with new model for better quality; supports Matryoshka dimensionality reduction |
| Pinecone Serverless (2024-01) | Current | New serverless architecture; pod-based deprecated for new indexes | Migrate pod indexes to serverless; use namespaces for multi-tenancy |
| Weaviate 1.25+ (2024-06) | Current | Named vectors; multi-modal modules | Use named vectors for hybrid (dense + sparse) in single collection |
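
For the LangChain 0.3 row above, the legacy call and its LCEL replacement look roughly like this; the names (llm, retriever, prompt, format_docs) are the ones built in step 5, and the sketch is illustrative rather than a drop-in migration:

# Legacy (deprecated):
# from langchain.chains import RetrievalQA
# qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
# answer = qa.run(question)

# LCEL replacement: compose retriever, prompt, and model explicitly
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
answer = chain.invoke(question)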

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|---|---|---|
| Knowledge base changes frequently (weekly/monthly) | Data is static and small enough to fit in prompt | Direct prompt injection (system message with all context) |
| Need grounded, cited answers from specific documents | Questions require real-time web data | Web search + LLM (Perplexity-style) |
| Corpus exceeds LLM context window (>100K tokens) | Corpus fits in a single context window (<50K tokens) | Long-context LLM without retrieval |
| Need to reduce hallucinations on domain-specific topics | Creative writing or open-ended generation | Standard LLM prompting |
| Multi-tenant — different users access different document sets | All users access the same small knowledge base | Fine-tuned LLM or prompt engineering |
| Must attribute answers to specific source documents | Attribution not required | Fine-tuning bakes knowledge into weights |
| Budget-conscious — retrieve only relevant context per query | Unlimited token budget and low latency tolerance | Send entire knowledge base in prompt |

Important Caveats

Related Units