How to Design a Recommendation Engine

How do I design a recommendation engine?

TL;DR

Bottom line: A production recommendation engine is a multi-stage pipeline -- candidate retrieval (fast, coarse) followed by ranking (slow, precise) -- backed by a feature store, an event stream, and an A/B testing framework.
Key tool/command: Two-tower embedding model for retrieval + gradient-boosted or deep ranking model + FAISS/ScaNN ANN index
Watch out for: Training-serving skew -- features computed differently in training vs. serving silently degrade recommendation quality with no error signal.
Works with: Python (TensorFlow Recommenders, PyTorch, LightFM, Surprise), Spark, Kafka, Redis, FAISS, Feast, any cloud ML platform.

Constraints

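Never train on raw PII without anonymization or pseudonymization -- GDPR requires a lawful basis for profiling, and Article 22 restricts solely automated decisions that have legal or similarly significant effects; CCPA/CPRA gives users opt-out rights over the sale or sharing of their data
Always separate candidate retrieval from ranking -- scoring every item with a single heavy model typically cannot meet latency SLAs over catalogs >100K items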
Feature store must guarantee training-serving consistency -- use the same feature computation code for both paths
Never deploy a model without an A/B test against the production baseline -- offline metric improvements (NDCG, recall@K) do not reliably predict online metric lifts (CTR, revenue)
Always implement a fallback for cold-start users/items -- popularity-based or content-based defaults, never empty results

Quick Reference

Component	Role	Technology Options	Scaling Strategy
Event Ingestion	Capture user interactions (clicks, views, purchases) in real-time	Kafka, AWS Kinesis, Google Pub/Sub	Partition by user_id; horizontal scaling
Event Store	Durable log of all interaction events for replay and retraining	Kafka (retention), S3/GCS Parquet, Delta Lake	Tiered storage (hot/warm/cold)
Feature Store	Serve consistent features for training and inference	Feast, Tecton, Vertex AI Feature Store	Online store (Redis/DynamoDB) + offline store (BigQuery/S3)
Embedding Service	Generate user and item embeddings via two-tower model	TensorFlow Recommenders, PyTorch, custom	GPU cluster for training; CPU/GPU for inference
ANN Index	Fast approximate nearest-neighbor retrieval over item embeddings	FAISS, ScaNN, Milvus, Pinecone, Weaviate	Shard by embedding space; replicate for read throughput
Candidate Retrieval	Narrow millions of items to hundreds of candidates in <10ms	Two-tower model + ANN, co-occurrence, popularity	Multiple retrieval sources merged; each independently scalable
Ranking Model	Score and re-rank candidates using rich cross-features	XGBoost, LambdaMART, deep ranking network (DCN, DLRM)	Model server (TF Serving, Triton); batch + online
Business Rules Engine	Apply hard filters (age-gating, geo-restrictions, already-seen)	Custom service, Drools, OPA	Stateless; horizontal scaling
Re-ranking / Diversity	Ensure result diversity, freshness, and business objectives	MMR, DPP, slot-based allocation	Lightweight post-processing; CPU-only
A/B Testing Framework	Measure online impact of model changes on business KPIs	Optimizely, Statsig, custom (hashing-based)	Consistent hashing for user bucketing
Model Training Pipeline	Retrain models on fresh data (daily or continuous)	Airflow/Kubeflow + Spark + GPU training	Scheduled DAGs; spot/preemptible instances
Model Registry	Version, track, and deploy trained models	MLflow, Vertex AI Model Registry, SageMaker	Centralized; blue-green deployment
Monitoring & Observability	Track model performance, data drift, and system health	Prometheus, Grafana, custom dashboards, Evidently AI	Alert on metric decay, feature drift, latency spikes

Decision Tree

START
+-- Catalog size >100K items?
|   +-- YES -> Multi-stage pipeline required (retrieval + ranking)
|   |   +-- Have rich item metadata (text, images, categories)?
|   |   |   +-- YES -> Hybrid: content-based retrieval + collaborative ranking
|   |   |   +-- NO -> Pure collaborative filtering (two-tower + interaction data)
|   |   +-- Cold-start items common (>20% of catalog)?
|   |       +-- YES -> Content-based tower mandatory for item embedding
|   |       +-- NO -> ID-based embeddings sufficient for item tower
|   +-- NO -> Single-stage model feasible
|       +-- Have explicit ratings (1-5 stars)?
|       |   +-- YES -> Matrix factorization (ALS/SVD) or LightFM
|       |   +-- NO -> Implicit feedback model (BPR, WARP loss)
|       +-- Need real-time updates?
|           +-- YES -> Online learning or session-based model (GRU4Rec)
|           +-- NO -> Batch retrained daily
+-- Latency budget <50ms?
|   +-- YES -> Pre-compute recommendations; serve from cache/KV store
|   +-- NO -> Online inference acceptable with feature store lookup
+-- DEFAULT -> Start with popularity + simple collaborative filtering, iterate toward hybrid

Step-by-Step Guide

1. Define interaction schema and ingest events

Design your event schema to capture every user-item interaction with context. This is the foundation -- bad event data means bad recommendations regardless of model sophistication. [src1]

# Event schema (Avro/Protobuf recommended for production)
interaction_event = {
    "user_id": "u_abc123",          # anonymized user identifier
    "item_id": "item_789",          # catalog item identifier
    "event_type": "click",          # click | view | purchase | add_to_cart | skip
    "timestamp": "2026-02-23T10:30:00Z",
    "context": {
        "device": "mobile",
        "session_id": "sess_456",
        "page": "home_feed",
        "position": 3               # position in the list (for position bias correction)
    }
}

Verify: Check Kafka consumer lag stays <1s and event count matches expected traffic within 5%.
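
A minimal validation sketch for the event schema above, meant to run at the ingestion boundary before events reach the stream. The function name validate_event, the set of required fields, and the assumption that position is 1-based are illustrative choices, not part of any library.

```python
from datetime import datetime

ALLOWED_EVENT_TYPES = {"click", "view", "purchase", "add_to_cart", "skip"}
REQUIRED_FIELDS = ("user_id", "item_id", "event_type", "timestamp")


def validate_event(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event is usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not event.get(field):
            problems.append(f"missing field: {field}")

    event_type = event.get("event_type")
    if event_type and event_type not in ALLOWED_EVENT_TYPES:
        problems.append(f"unknown event_type: {event_type!r}")

    timestamp = event.get("timestamp")
    if isinstance(timestamp, str):
        try:
            # fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
            # so normalise it to an explicit UTC offset first.
            datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"bad timestamp: {timestamp!r}")

    # Assumption: list positions are 1-based integers.
    position = (event.get("context") or {}).get("position")
    if position is not None and (not isinstance(position, int) or position < 1):
        problems.append(f"bad position: {position!r}")
    return problems
```

Rejected events are usually worth routing to a dead-letter topic rather than dropping, so that schema drift in a client shows up as a countable signal.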

2. Build the feature store

Materialize user features (historical interaction aggregates, demographics) and item features (metadata, popularity scores, freshness) into both offline and online stores. Serving both paths from the same feature definitions is the main guard against training-serving skew. [src2]

# Feast feature definition (feast_repo/features.py)
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64, String
from datetime import timedelta

user = Entity(name="user_id", join_keys=["user_id"])
user_features = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(hours=24),
    schema=[
        Field(name="total_clicks_7d", dtype=Int64),
        Field(name="avg_session_duration", dtype=Float32),
        Field(name="top_category", dtype=String),
        Field(name="interaction_count_30d", dtype=Int64),
    ],
    source=FileSource(path="s3://features/user_features.parquet"),
)

Verify: The online feature server (started with feast serve) responds in <5ms for user feature lookups; validate feature values match a manual SQL query on 10 random users.

3. Train the candidate retrieval model (two-tower)

Train a two-tower model where the user tower and item tower produce embeddings in the same vector space. Similarity between embeddings determines relevance. [src4] [src5]

# Two-tower model with TensorFlow Recommenders
import tensorflow as tf
import tensorflow_recommenders as tfrs

class TwoTowerModel(tfrs.Model):
    def __init__(self, user_model, item_model, task):
        super().__init__()
        self.user_model = user_model
        self.item_model = item_model
        self.task = task

    def compute_loss(self, features, training=False):
        user_embeddings = self.user_model(features["user_id"])
        item_embeddings = self.item_model(features["item_id"])
        return self.task(user_embeddings, item_embeddings)

Verify: Evaluate recall@100 on held-out test set; target >0.25 for initial model.
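
The recall@K check can be computed without any framework. The sketch below assumes you already have, per user, the ranked list of retrieved item IDs and the set of held-out items the user actually interacted with; the function names and dict-based inputs are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of a user's held-out relevant items found in the top-k retrieved list."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(retrieved_by_user: dict, relevant_by_user: dict, k: int) -> float:
    """Average recall@k over users that have at least one held-out item."""
    scores = [
        recall_at_k(retrieved_by_user.get(user, []), relevant, k)
        for user, relevant in relevant_by_user.items()
        if relevant
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Users with no retrieved list count as recall 0 here, which keeps retrieval gaps from silently inflating the average.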

4. Build the ANN index for fast retrieval

Export item embeddings and build an approximate nearest-neighbor index so top-K candidates can be retrieved from the full catalog in a few milliseconds. [src7]

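import faiss
import numpy as np

# item_embeddings_np: (num_items, embedding_dim) array exported from the item tower.
# FAISS expects C-contiguous float32 input.
item_embeddings_np = np.ascontiguousarray(item_embeddings_np, dtype=np.float32)

embedding_dim = 32
nlist = 256  # number of clusters
quantizer = faiss.IndexFlatIP(embedding_dim)
index = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(item_embeddings_np)  # learn the cluster centroids
index.add(item_embeddings_np)
index.nprobe = 16  # search 16 clusters (speed vs recall trade-off)

# Query: get top 200 candidates for a user
# user_embedding must be a float32 array of shape (1, embedding_dim)
scores, indices = index.search(user_embedding, 200)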

Verify: Measure recall@200 vs. brute-force search; target >0.95. Query latency target: <5ms for 1M items.
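
To measure ANN recall against brute force, compare the IDs returned by the index with an exact top-K computed over the same embeddings. The sketch below is pure Python for clarity and only practical on a small sample of queries; exact_top_k and ann_recall are illustrative names.

```python
def exact_top_k(query: list[float], items: list[list[float]], k: int) -> list[int]:
    """Brute-force top-k item indices by inner product (the ground truth)."""
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    return sorted(range(len(items)), key=lambda i: scores[i], reverse=True)[:k]


def ann_recall(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Share of the exact top-k that the approximate index also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

FAISS pads results with -1 when fewer than k neighbors are found, so filter those out of approx_ids before calling ann_recall. If recall is below target, raising nprobe is usually the first knob to try, at the cost of latency.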

5. Build the ranking model

Train a ranking model that takes the retrieved candidates and produces a fine-grained relevance score using rich cross-features between user and item. [src2] [src3]

import xgboost as xgb

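dtrain = xgb.DMatrix(train_features, label=train_labels)
# Ranking objectives need query groups: rows must be ordered so that each
# request's candidates are contiguous, and group_sizes lists how many rows
# belong to each request (its sum must equal the number of rows).
dtrain.set_group(group_sizes)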
params = {
    "objective": "rank:pairwise",
    "eval_metric": "ndcg@10",
    "max_depth": 6,
    "eta": 0.1,
}
ranker = xgb.train(params, dtrain, num_boost_round=200)

Verify: Evaluate NDCG@10 on held-out set; target >0.35. Compare against popularity-only baseline.
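
For reference, NDCG@K can be computed by hand as below. This uses one common formulation (exponential gain, log2 discount); libraries differ in their defaults, so treat it as a sanity check rather than a replacement for the ranker's own eval metric. The input is the list of true relevance labels in the order the model ranked the candidates.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain: sum of (2^rel - 1) / log2(rank + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: list[float], k: int) -> float:
    """DCG of the model's ordering divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Average ndcg_at_k over requests, and run the same calculation on a popularity-sorted ordering to get the baseline mentioned above.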

6. Apply business rules and diversity re-ranking

After scoring, apply hard business rules (geo-restrictions, already-consumed filtering) and diversity logic to avoid filter bubbles. [src6]

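# Maximal Marginal Relevance (MMR) for diversity
# remaining: list of scored candidate dicts; max_similarity(c, selected) returns
# the highest similarity between c and any selected item (0.0 if selected is empty)
selected = []
lambda_param = 0.7  # 0=max diversity, 1=max relevance
while remaining and len(selected) < num_results:
    best = max(remaining, key=lambda c:
        lambda_param * c["score"] -
        (1 - lambda_param) * max_similarity(c, selected))
    selected.append(best)
    remaining.remove(best)  # without this the same item is picked every round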
Verify: Check no duplicate categories in top-3 and no already-consumed items appear.
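
The hard-filter step and the verification check can be sketched as two small functions. The candidate shape (dicts with item_id and category), the blocked_regions_by_item mapping, and both function names are assumptions made for illustration.

```python
def apply_hard_filters(candidates, consumed_ids, blocked_regions_by_item, user_region):
    """Drop candidates the user already consumed or that are blocked in their region."""
    return [
        c for c in candidates
        if c["item_id"] not in consumed_ids
        and user_region not in blocked_regions_by_item.get(c["item_id"], set())
    ]


def check_slate(slate, consumed_ids, top_n=3):
    """Return True when the slate passes both verification checks."""
    top_categories = [c["category"] for c in slate[:top_n]]
    no_duplicate_categories = len(top_categories) == len(set(top_categories))
    no_consumed = all(c["item_id"] not in consumed_ids for c in slate)
    return no_duplicate_categories and no_consumed
```

Running check_slate on a sample of served slates in production is a cheap way to catch a rules-engine regression before it shows up in business metrics.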

7. Deploy with A/B testing and monitoring

Serve the pipeline behind an A/B testing framework. Monitor model metrics (NDCG, coverage, diversity), system metrics (latency, throughput), and business metrics (CTR, conversion, revenue). [src1]

Verify: Confirm consistent user bucketing (same user always sees same variant). Check metric tracking fires correctly.
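
Consistent bucketing is commonly done by hashing the user ID together with an experiment name. The sketch below is one way to do it with the standard library; assign_variant, the experiment name, and the 10% default share are illustrative.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.10) -> str:
    """Deterministically map a user to 'treatment' or 'control' for one experiment."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # First 8 hex characters -> integer in [0, 2^32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"
```

Because the assignment depends only on the inputs, the same user sees the same variant on every request and on every server, with no shared state to keep in sync.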

Code Examples

Python: Collaborative Filtering with Explicit Ratings

# Input:  CSV of (user_id, item_id, rating) tuples
# Output: Top-N recommendations for a given user

from surprise import SVD, Dataset, Reader
import pandas as pd

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(
    pd.read_csv("ratings.csv")[["user_id", "item_id", "rating"]], reader
)
algo = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02)
trainset = data.build_full_trainset()
algo.fit(trainset)

anti_testset = trainset.build_anti_testset()
predictions = algo.test([x for x in anti_testset if x[0] == "u_123"])
top_10 = sorted(predictions, key=lambda x: x.est, reverse=True)[:10]

Python: Feature Store Pattern with Feast

# Input:  user_id at serving time
# Output: consistent feature vector for model inference

from feast import FeatureStore

store = FeatureStore(repo_path="feast_repo/")
features = store.get_online_features(
    entity_rows=[{"user_id": "u_abc123"}],
    features=[
        "user_features:total_clicks_7d",
        "user_features:avg_session_duration",
        "user_features:top_category",
    ],
).to_dict()
# Pass features to ranking model -- same feature definitions as training, which guards against training-serving skew

Anti-Patterns

Wrong: Training on all interactions equally

# BAD -- treating all events as equal positive signals
train_data = df[["user_id", "item_id"]]  # clicks, views, purchases all weight=1
model.fit(train_data)

Correct: Weight interactions by signal strength

# GOOD -- stronger signals get higher weight
weights = {"purchase": 5.0, "add_to_cart": 3.0, "click": 1.0, "view": 0.3}
df["weight"] = df["event_type"].map(weights)
model.fit(df[["user_id", "item_id"]], sample_weight=df["weight"])

Wrong: Using cross-features for retrieval

# BAD -- retrieval model uses cross-features (cannot scale to full catalog)
retrieval_features = ["user_age", "item_category", "user_x_item_category_affinity"]
# Requires computing features for ALL items per request -- O(N) is too slow

Correct: Independent towers for retrieval, cross-features for ranking

# GOOD -- retrieval uses independent towers (pre-computed, ANN lookup)
user_embedding = user_tower(user_features)           # O(1)
candidates = ann_index.search(user_embedding, k=200) # sublinear in N (approximate)
ranked = ranking_model.predict(cross_features(user, candidates))  # O(200)

Wrong: Deploying based solely on offline metrics

# BAD -- no A/B test, no business metric validation
if new_model_ndcg > old_model_ndcg:
    deploy(new_model)

Correct: A/B test with business metrics as primary KPI

# GOOD -- offline metrics gate launch; online metrics gate full rollout
if new_model_ndcg > old_model_ndcg * 1.02:  # >2% offline lift
    launch_ab_test(new_model, traffic=0.10)  # expose 10% of users
    # Only promote to 100% if revenue_per_user improves with p<0.05
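
The p<0.05 gate needs an actual significance test. As a simple illustration, the sketch below runs a two-sided two-proportion z-test on a binary metric such as conversion; this is a swapped-in example, since a continuous metric like revenue per user would call for a different test (for example a t-test or bootstrap). The function name is illustrative.

```python
import math


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both variants have the same conversion rate."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))
```

Fix the sample size and the decision rule before the test starts; checking the p-value repeatedly and stopping at the first value below 0.05 inflates the false-positive rate.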

Wrong: No position bias correction

# BAD -- position 1 gets 10x CTR regardless of relevance
# Training on this data reinforces existing ranking (feedback loop)
train_data = raw_click_log

Correct: Apply inverse propensity weighting

# GOOD -- correct for position bias in training data
position_bias = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.35, 5: 0.25}
df["corrected_weight"] = df["clicked"] / df["position"].map(position_bias)
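
One practical detail with inverse propensity weighting is that very small or missing propensities produce huge or undefined weights. A common remedy is to clip the propensity at a floor. The sketch below reuses the position-bias table above; the floor of 0.1 and the function name are assumptions, and real propensities should be estimated from randomized traffic.

```python
POSITION_PROPENSITY = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.35, 5: 0.25}
MIN_PROPENSITY = 0.1  # assumed floor; also used for positions missing from the table


def ipw_weight(clicked: bool, position: int) -> float:
    """Inverse-propensity weight for one impression; non-clicks get weight 0."""
    if not clicked:
        return 0.0
    propensity = max(POSITION_PROPENSITY.get(position, MIN_PROPENSITY), MIN_PROPENSITY)
    return 1.0 / propensity
```

Clipping trades a little bias for much lower variance: a click at position 9 gets weight 10 instead of an unbounded or NaN value.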

Common Pitfalls

Training-serving skew: Features computed differently in batch training vs. online serving cause silent quality degradation. Fix: Use a feature store (Feast/Tecton) that serves the exact same feature computation for both paths. [src2]
Popularity bias amplification: Models trained on click data over-recommend popular items, creating a filter bubble. Fix: Add exploration (epsilon-greedy or Thompson sampling) and diversity re-ranking (MMR with lambda=0.7). [src6]
Cold-start item neglect: New items receive zero impressions because they have no interaction data. Fix: Use content-based features in the item tower so new items get meaningful embeddings from day one. [src4]
Ignoring position bias: Items at top positions get disproportionate clicks, creating a feedback loop. Fix: Apply inverse propensity scoring or include position as a training feature (set to zero at inference). [src2]
Stale embeddings: Item embeddings computed once and never refreshed ignore new interactions and catalog changes. Fix: Rebuild ANN index at least daily; retrain embeddings weekly or continuously. [src7]
Over-engineering for small catalogs: Building a multi-stage pipeline for <10K items wastes engineering effort. Fix: Use matrix factorization (SVD/ALS) or LightFM with brute-force scoring. [src3]
Neglecting implicit negative signals: Treating only clicks as positives ignores skips and short view durations. Fix: Use weighted implicit feedback (view <5s = negative, view >30s = weak positive, click = positive, purchase = strong positive). [src1]
No fallback for empty results: When retrieval returns zero candidates, the system returns an empty page. Fix: Always merge retrieval sources (personalized + popularity + trending). [src6]
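
The fallback described in the last pitfall can be as simple as filling slots from retrieval sources in priority order. In the sketch below, each source is a ranked list of item IDs (for example personalized, then trending, then popular); merge_with_fallback is an illustrative name.

```python
def merge_with_fallback(sources: list[list[str]], k: int) -> list[str]:
    """Fill up to k slots from ranked sources in priority order, skipping duplicates."""
    if k <= 0:
        return []
    merged, seen = [], set()
    for source in sources:
        for item_id in source:
            if item_id not in seen:
                seen.add(item_id)
                merged.append(item_id)
                if len(merged) == k:
                    return merged
    return merged
```

For a brand-new user the personalized list is empty, so the result is drawn entirely from the trending and popular lists instead of coming back empty.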

When to Use / When Not to Use

Use When	Don't Use When	Use Instead
Catalog >1K items and user interaction data available	Catalog <100 items	Manual curation or simple rule-based sorting
Personalization drives business KPI (engagement, revenue)	All users should see the same content (editorial, news homepage)	Content management system with editorial ranking
Sufficient interaction volume (>10K interactions/day)	Very sparse data (<1K total interactions)	Content-based filtering only, or popularity-based
Need to surface long-tail items users would not find via search	Users know exactly what they want (transactional search)	Search engine with relevance ranking
Real-time personalization matters (feeds, homepages)	Batch recommendations suffice (weekly email digest)	Simpler batch job with matrix factorization

Important Caveats

Offline metric improvements (NDCG, recall@K) frequently do not translate to online business metric gains -- always A/B test before full rollout
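GDPR Article 22 restricts solely automated decisions with legal or similarly significant effects, and Articles 13-15 require meaningful information about the logic involved; whether routine recommendations fall in scope depends on their impact -- design for explainability from day one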
Recommendation quality degrades rapidly when interaction data is stale -- budget for daily retraining pipelines and real-time feature updates
Two-tower models trade off cross-feature expressiveness for retrieval speed -- the ranking stage must compensate with richer feature interactions
Position bias in training data creates a rich-get-richer feedback loop -- invest in randomized exploration early to collect unbiased signal