Scalable E-Commerce Platform Architecture
How do I design a scalable e-commerce platform architecture?
TL;DR
- Bottom line: A scalable e-commerce platform decomposes into 8-12 bounded-context services (catalog, cart, order, payment, inventory, user, search, notification) communicating via async events, each owning its database, behind an API gateway with CDN caching.
- Key tool/command: docker-compose up with separate containers per service, or Kubernetes with Helm charts for production-grade orchestration.
- Watch out for: Distributed transactions across services (especially inventory + payment). Use the Saga pattern with compensating transactions, never two-phase commit.
- Works with: Any cloud provider (AWS, GCP, Azure); language-agnostic services; PostgreSQL, MongoDB, Redis, Elasticsearch, Kafka/RabbitMQ.
Constraints
- Never process or store raw credit card numbers in your own system — always use PCI DSS-compliant payment gateways (Stripe, Adyen, Braintree) with tokenized payment methods
- Inventory must be reserved (soft lock) before payment processing to prevent overselling — use optimistic locking or distributed locks, never rely on application-level checks alone
- Each bounded context (catalog, cart, order, payment, inventory) must own its own database — shared databases create distributed monoliths that are harder to scale than the original monolith
- Shopping cart must survive server restarts and session expiry — persist carts server-side (Redis or database), never rely solely on client-side storage for cart state
- All inter-service communication for order processing must be idempotent — network failures will cause retries, and duplicate order creation or double-charging is unacceptable
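The idempotency constraint above can be sketched with an idempotency key: a retry that carries the same key gets the stored result back instead of repeating the side effect. This is a minimal in-memory illustration, not a production design. `IdempotentExecutor`, `charge_once`, and the `order-42` key are hypothetical names, and a real service would keep the key in a durable store with an atomic insert (for example a unique constraint).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IdempotentExecutor:
    """Runs an operation at most once per idempotency key (in-memory sketch)."""
    _results: Dict[str, dict] = field(default_factory=dict)

    def run(self, key: str, operation: Callable[[], dict]) -> dict:
        # A retry with a known key returns the stored result
        # instead of executing the side effect again.
        if key in self._results:
            return self._results[key]
        result = operation()
        self._results[key] = result
        return result


# Stands in for calls to the payment gateway (hypothetical).
charges: List[int] = []


def charge_once(executor: IdempotentExecutor, key: str, amount_cents: int) -> dict:
    def do_charge() -> dict:
        charges.append(amount_cents)
        return {"status": "charged", "amount_cents": amount_cents}
    return executor.run(key, do_charge)


executor = IdempotentExecutor()
first = charge_once(executor, "order-42", 1999)
retry = charge_once(executor, "order-42", 1999)  # simulated network retry
```

After the retry, `charges` still holds a single entry, so the customer is charged once even though the request was sent twice.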
Quick Reference
| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| API Gateway | Route requests, rate limit, auth, SSL termination | Kong, AWS API Gateway, NGINX, Envoy | Horizontal — stateless, add instances behind LB |
| Product Catalog Service | CRUD products, categories, attributes, pricing | Node.js/Python + PostgreSQL or MongoDB | Read replicas + CDN cache; write sharding by category |
| Search Service | Full-text search, faceted filtering, autocomplete | Elasticsearch, OpenSearch, Typesense | Horizontal sharding by index; read replicas |
| User/Auth Service | Registration, login, JWT/OAuth, profiles | Node.js/Go + PostgreSQL + Redis (sessions) | Horizontal — stateless with token-based auth |
| Cart Service | Add/remove items, persist cart state, price calc | Node.js/Python + Redis (primary) + PostgreSQL (backup) | Horizontal — partition by user ID in Redis Cluster |
| Order Service | Order creation, lifecycle management, history | Python/Java + PostgreSQL (ACID) | Shard by order ID; archive old orders to cold storage |
| Payment Service | Gateway integration, tokenization, refunds | Node.js/Go + PostgreSQL + external gateway (Stripe) | Horizontal — idempotency keys prevent duplicates |
| Inventory Service | Stock levels, reservations, warehouse sync | Go/Java + PostgreSQL + Redis (hot counts) | Optimistic locking; shard by SKU range |
| Notification Service | Email, SMS, push notifications, webhooks | Node.js/Python + queue consumer (SQS/RabbitMQ) | Scale consumers independently based on queue depth |
| Recommendation Service | Personalized suggestions, "also bought" | Python (ML) + Redis (feature store) + Spark | Precompute offline; serve from cache; scale reads |
| CDN / Edge | Static assets, image delivery, edge caching | CloudFront, Cloudflare, Fastly | Automatic — scales with traffic globally |
| Message Broker | Async inter-service events, order saga coordination | Kafka, RabbitMQ, AWS SQS/SNS | Kafka: add partitions; RabbitMQ: add consumers |
| Monitoring & Observability | Distributed tracing, metrics, alerting, logging | Datadog, Grafana+Prometheus, Jaeger, ELK Stack | Scale collectors; sample traces at high volume |
Decision Tree
START
├── Expected daily orders < 100 and products < 1K?
│ ├── YES → Use managed platform (Shopify/WooCommerce)
│ └── NO ↓
├── Team size < 5 backend engineers?
│ ├── YES → Modular monolith (single deploy, domain modules, shared DB with schema separation)
│ └── NO ↓
├── < 1K concurrent users?
│ ├── YES → Modular monolith with clear domain boundaries, prepare for future extraction
│ └── NO ↓
├── 1K–50K concurrent users?
│ ├── YES → Extract high-load services first (search, catalog, cart) as microservices
│ └── NO ↓
├── 50K–500K concurrent users?
│ ├── YES → Full microservices with Kafka event bus, database-per-service, Kubernetes
│ └── NO ↓
├── > 500K concurrent users?
│ ├── YES → Microservices + CQRS/Event Sourcing, multi-region, database sharding
│ └── NO ↓
└── DEFAULT → Start with modular monolith, extract services as bottlenecks emerge
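The decision tree can also be read as a function, which makes the order of the checks explicit. This is only a sketch: the function name, parameter names, and return strings are illustrative, and the handling of the exact boundary values (50K and 500K are treated as inclusive upper bounds) is an assumption the tree does not spell out.

```python
def recommend_architecture(daily_orders: int, products: int,
                           backend_engineers: int, concurrent_users: int) -> str:
    """Walk the decision tree top to bottom and return the first match."""
    if daily_orders < 100 and products < 1_000:
        return "managed platform"
    if backend_engineers < 5:
        return "modular monolith"
    if concurrent_users < 1_000:
        return "modular monolith"
    if concurrent_users <= 50_000:
        return "extract high-load services"
    if concurrent_users <= 500_000:
        return "full microservices"
    return "microservices + CQRS/event sourcing, multi-region"


# A 10-engineer team with 20K concurrent users
print(recommend_architecture(5_000, 20_000, 10, 20_000))
```

Note that team size is checked before traffic: a small team lands on the modular monolith regardless of concurrency.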
Step-by-Step Guide
1. Define bounded contexts and data ownership
Map your e-commerce domain into distinct bounded contexts using Domain-Driven Design (DDD). Each context becomes a service boundary with its own database. The critical contexts are: Product Catalog, Shopping Cart, Order Management, Payment, Inventory, User/Auth, Search, and Notifications. [src3]
Bounded Contexts:
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│    Catalog    │  │     Cart      │  │     Order     │
│   (MongoDB)   │  │    (Redis)    │  │ (PostgreSQL)  │
└───────┬───────┘  └───────┬───────┘  └───────┬───────┘
        │                  │                  │
        └──────────── API Gateway ────────────┘
                           │
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│    Payment    │  │   Inventory   │  │    Search     │
│ (PostgreSQL)  │  │ (PostgreSQL)  │  │(Elasticsearch)│
└───────────────┘  └───────────────┘  └───────────────┘
Verify: Each service can be deployed and tested independently — no compile-time dependencies between services.
2. Design the API gateway and routing layer
Place an API gateway in front of all services to handle authentication, rate limiting, request routing, and SSL termination. Use path-based routing. [src1]
# Kong or AWS API Gateway route config (conceptual)
routes:
  - path: /api/v1/products
    service: catalog-service
    methods: [GET]
    plugins: [rate-limit, jwt-auth, response-cache]
  - path: /api/v1/cart
    service: cart-service
    methods: [GET, POST, PUT, DELETE]
    plugins: [rate-limit, jwt-auth]
  - path: /api/v1/orders
    service: order-service
    methods: [GET, POST]
    plugins: [rate-limit, jwt-auth]
  - path: /api/v1/checkout
    service: payment-service
    methods: [POST]
    plugins: [rate-limit, jwt-auth, idempotency]
Verify: curl -H "Authorization: Bearer <token>" https://api.example.com/api/v1/products → returns product list with 200 OK.
3. Implement the product catalog with search indexing
The catalog service stores products in a primary database and syncs changes to Elasticsearch for full-text search. Use Change Data Capture (CDC) or event publishing to keep the search index in sync. [src6]
# catalog_service/events.py: publish product changes to the message broker
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def publish_product_event(event_type: str, product: dict):
    producer.send("product-events", value={"event": event_type, "product": product})
    # Block until the broker has received the event
    producer.flush()
Verify: Create a product via API → curl localhost:9200/products/_search?q=<name> → product appears within 2 seconds.
4. Build the cart service with Redis persistence
Use Redis as the primary store for shopping carts with sub-millisecond reads and built-in TTL for cart expiry. Back up cart data to PostgreSQL for carts older than 30 minutes. [src2]
# cart_service/cart.py: Redis-backed cart
import redis

r = redis.Redis(host="redis", port=6379, db=0, decode_responses=True)
CART_TTL = 86400 * 7  # 7 days


def add_to_cart(user_id: str, product_id: str, quantity: int):
    cart_key = f"cart:{user_id}"
    r.hset(cart_key, product_id, quantity)
    # Refresh the expiry on every write so active carts stay alive
    r.expire(cart_key, CART_TTL)


def get_cart(user_id: str) -> dict:
    return {pid: int(qty) for pid, qty in r.hgetall(f"cart:{user_id}").items()}
Verify: add_to_cart("user123", "SKU-001", 2) then get_cart("user123") → {"SKU-001": 2}.
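Because the cart above stores only product IDs and quantities, prices have to be looked up again when the cart is totalled. A minimal sketch of that re-validation follows. It assumes the client sends back the prices it last displayed (`quoted_prices`) and that current prices come from the catalog service. Both dictionaries and the function name `price_cart` are hypothetical.

```python
from decimal import Decimal
from typing import Dict, Tuple


def price_cart(
    cart: Dict[str, int],
    current_prices: Dict[str, Decimal],
    quoted_prices: Dict[str, Decimal],
) -> Tuple[Decimal, Dict[str, Tuple[Decimal, Decimal]]]:
    """Total the cart at current prices and report items whose price changed.

    Returns (total, changed) where changed maps product_id -> (quoted, current).
    """
    total = Decimal("0")
    changed: Dict[str, Tuple[Decimal, Decimal]] = {}
    for product_id, qty in cart.items():
        current = current_prices[product_id]
        total += current * qty
        quoted = quoted_prices.get(product_id)
        if quoted is not None and quoted != current:
            changed[product_id] = (quoted, current)
    return total, changed


cart = {"SKU-001": 2, "SKU-002": 1}
current = {"SKU-001": Decimal("9.99"), "SKU-002": Decimal("5.00")}
quoted = {"SKU-001": Decimal("8.99"), "SKU-002": Decimal("5.00")}
total, changed = price_cart(cart, current, quoted)
```

`Decimal` is used instead of `float` so that totals do not pick up binary rounding errors. A non-empty `changed` result is the point at which the UI would show a price change warning.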
5. Implement checkout with the Saga pattern
The checkout flow spans multiple services and cannot use a single database transaction. Use the Saga pattern: reserve inventory → process payment → create order. If payment fails, release the inventory reservation. [src3]
Checkout Saga Flow:
1. Cart Service → Validate cart items and prices
2. Inventory Svc → Reserve stock (soft lock with TTL)
3. Payment Service → Charge customer via gateway
├── SUCCESS → 4. Order Service → Create order record
│ 5. Inventory Svc → Confirm reservation
│ 6. Cart Service → Clear cart
│ 7. Notification → Send confirmation email
└── FAILURE → Compensate:
- Inventory Svc → Release reservation
- Notification → Send failure notice
Verify: Place test order → inventory decremented, payment captured, order record exists, cart cleared. Simulate payment failure → inventory restored.
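Step 2 of the flow, the soft lock with TTL, can be illustrated with an in-memory reservation book. This is a sketch under simplifying assumptions: one SKU per saga, a single process, and the caller passing the current time explicitly (for example `time.monotonic()`). `ReservationBook` and its method names are hypothetical. A real inventory service would do this in the database, as in the Node.js example in this document.

```python
from typing import Dict, Tuple


class ReservationBook:
    """Tracks stock plus time-limited holds keyed by saga_id."""

    def __init__(self, stock: Dict[str, int], ttl_seconds: float = 600.0):
        self.stock = dict(stock)
        self.ttl_seconds = ttl_seconds
        # saga_id -> (sku, qty, expires_at)
        self.holds: Dict[str, Tuple[str, int, float]] = {}

    def _drop_expired(self, now: float) -> None:
        for saga_id, (_sku, _qty, expires_at) in list(self.holds.items()):
            if expires_at <= now:
                del self.holds[saga_id]

    def available(self, sku: str, now: float) -> int:
        self._drop_expired(now)
        held = sum(q for s, q, _ in self.holds.values() if s == sku)
        return self.stock.get(sku, 0) - held

    def reserve(self, saga_id: str, sku: str, qty: int, now: float) -> bool:
        self._drop_expired(now)
        if saga_id in self.holds:  # retry of the same saga: already held
            return True
        if self.available(sku, now) < qty:
            return False
        self.holds[saga_id] = (sku, qty, now + self.ttl_seconds)
        return True

    def release(self, saga_id: str) -> None:
        # Compensation: safe to call even if the hold is already gone
        self.holds.pop(saga_id, None)

    def confirm(self, saga_id: str) -> bool:
        # Turn the hold into a permanent stock decrement after payment
        hold = self.holds.pop(saga_id, None)
        if hold is None:
            return False
        sku, qty, _ = hold
        self.stock[sku] -= qty
        return True


book = ReservationBook({"SKU-001": 3}, ttl_seconds=10)
```

The TTL is what keeps an abandoned checkout from locking stock forever: once the hold expires, the units count as available again without any explicit release.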
6. Set up event-driven communication
Use Apache Kafka as the central event bus. Services publish domain events (OrderCreated, PaymentProcessed, InventoryReserved) and other services subscribe to react asynchronously. [src2]
# order_service/events.py: consume payment events
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "payment-events",
    bootstrap_servers=["kafka:9092"],
    group_id="order-service",
    enable_auto_commit=False,  # commit offsets manually, only after handling
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

# create_order and release_inventory are the service's own handlers,
# defined elsewhere in order_service.
for message in consumer:
    event = message.value
    if event["type"] == "PaymentSucceeded":
        create_order(event["order_id"], event["items"], event["total"])
        consumer.commit()
    elif event["type"] == "PaymentFailed":
        release_inventory(event["order_id"], event["items"])
        consumer.commit()
Verify: Publish a PaymentSucceeded event → order record appears in database within 5 seconds.
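Committing offsets after handling gives at-least-once delivery, so the same event can arrive twice. One way to make handlers safe is to route events through a dispatcher that remembers which event IDs it has processed. This is a sketch only: the `event_id` field is an assumption (the events above carry `type`, `order_id`, `items`, and `total`), `EventDispatcher` is a hypothetical helper, and the seen-set would need to be durable in a real service.

```python
from typing import Callable, Dict, Set


class EventDispatcher:
    """Routes events by type and skips event IDs that were already handled."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], None]] = {}
        self._seen: Set[str] = set()

    def on(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type] = handler

    def dispatch(self, event: dict) -> bool:
        """Return True if a handler ran, False if the event was skipped."""
        event_id = event["event_id"]
        if event_id in self._seen:
            return False  # duplicate delivery
        handler = self._handlers.get(event["type"])
        if handler is None:
            return False  # no subscriber for this event type
        handler(event)
        # Mark as seen only after the handler succeeded, so a crash
        # mid-handler leads to a retry rather than a lost event.
        self._seen.add(event_id)
        return True


created_orders = []
dispatcher = EventDispatcher()
dispatcher.on("PaymentSucceeded", lambda e: created_orders.append(e["order_id"]))
```

Dispatching the same `PaymentSucceeded` event twice appends the order ID once; the second call returns `False`.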
7. Deploy with container orchestration
Package each service as a Docker container and orchestrate with Kubernetes. Use Horizontal Pod Autoscalers (HPA) to scale based on CPU/memory or custom metrics. [src4]
# k8s/catalog-service.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalog-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: catalog-service
  template:
    metadata:
      labels:
        app: catalog-service  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: catalog
          image: ecommerce/catalog-service:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests: { memory: "256Mi", cpu: "250m" }
            limits: { memory: "512Mi", cpu: "500m" }
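Verify: kubectl get pods -l app=catalog-service → three pods are Running. If a HorizontalPodAutoscaler named catalog-hpa (not shown here) is also applied: kubectl get hpa → shows catalog-hpa, and under load the pod count increases.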
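To reason about how far an HPA will scale, it helps to work through the replica calculation by hand. The Kubernetes documentation gives the core formula as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies it and clamps to the configured bounds. It deliberately ignores the controller's tolerance band and stabilization window, and the function and parameter names are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Apply the HPA scaling formula, then clamp to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


# 3 pods at 90% average CPU against a 60% target
print(desired_replicas(3, 90, 60, min_replicas=3, max_replicas=10))
```

With 3 pods at 90% CPU against a 60% target, the raw value is 4.5, which rounds up to 5 pods.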
Code Examples
Python: Order Service with Saga Orchestrator
# order_service/saga.py: checkout saga orchestrator
# Input: cart contents (user_id, items), payment method token
# Output: order confirmation or compensated failure
import uuid

import httpx

INVENTORY_URL = "http://inventory-service:8080"
PAYMENT_URL = "http://payment-service:8080"


async def checkout_saga(user_id: str, items: list, payment_token: str):
    # One saga_id ties the reservation, the charge, and any compensation together
    saga_id = str(uuid.uuid4())
    # Reuse a single client so connections are pooled and closed on exit
    async with httpx.AsyncClient() as client:
        # Step 1: Reserve inventory
        res = await client.post(
            f"{INVENTORY_URL}/reserve", json={"saga_id": saga_id, "items": items})
        if res.status_code != 200:
            return {"status": "failed", "reason": "inventory_unavailable"}
        # Step 2: Process payment
        res = await client.post(f"{PAYMENT_URL}/charge", json={
            "saga_id": saga_id, "token": payment_token,
            "amount": sum(i["price"] * i["qty"] for i in items)})
        if res.status_code != 200:
            # Compensate: release the reservation made in step 1
            await client.post(
                f"{INVENTORY_URL}/release", json={"saga_id": saga_id})
            return {"status": "failed", "reason": "payment_declined"}
        return {"status": "confirmed", "order_id": res.json()["order_id"]}
Node.js: Inventory Service with Optimistic Locking
// inventory_service/reserve.js: atomic inventory reservation
// Input: saga_id, items [{sku, qty}]
// Output: reservation confirmation or rejection
const { Pool } = require("pg"); // node-postgres
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function reserveInventory(sagaId, items) {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    for (const item of items) {
      // The WHERE clause makes check-and-reserve a single atomic statement
      const result = await client.query(
        `UPDATE inventory SET reserved = reserved + $1, updated_at = NOW()
         WHERE sku = $2 AND (stock - reserved) >= $1 RETURNING sku`,
        [item.qty, item.sku]);
      if (result.rowCount === 0) {
        await client.query("ROLLBACK");
        return { success: false, reason: `Insufficient stock: ${item.sku}` };
      }
    }
    await client.query(
      `INSERT INTO reservations (saga_id, items, status) VALUES ($1, $2, 'reserved')`,
      [sagaId, JSON.stringify(items)]);
    await client.query("COMMIT");
    return { success: true, saga_id: sagaId };
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
Anti-Patterns
Wrong: Shared database across services
// BAD — All services read/write the same database
// Creates coupling: schema changes break all services,
// impossible to scale services independently,
// single point of failure
Correct: Database-per-service with event sync
// GOOD — Each service owns its data, syncs via events
// Catalog (MongoDB), Order (PostgreSQL), Inventory (PostgreSQL)
// Connected via Kafka Event Bus
// Independent scaling, deployment, and schema evolution
Wrong: Synchronous checkout chain
# BAD: blocking HTTP calls chained during checkout
def checkout(cart):
    inventory = requests.post("/inventory/reserve", json=cart)  # blocks
    payment = requests.post("/payment/charge", json=cart)       # blocks
    order = requests.post("/orders/create", json=cart)          # blocks
    # If payment service is slow, entire checkout hangs.
    # If order service fails after payment, no compensation.
    return order
Correct: Saga pattern with async compensation
# GOOD: saga orchestrator with compensation on failure
async def checkout_saga(cart, payment_token):
    saga_id = uuid.uuid4()
    try:
        await reserve_inventory(saga_id, cart.items)
        payment = await process_payment(saga_id, payment_token, cart.total)
        order = await create_order(saga_id, cart, payment.id)
        return {"status": "confirmed", "order_id": order.id}
    except PaymentFailedError:
        await release_inventory(saga_id)  # compensating transaction
        return {"status": "failed", "reason": "payment_declined"}
Wrong: Client-side only cart storage
// BAD — Cart only in browser localStorage
localStorage.setItem("cart", JSON.stringify(cartItems));
// Lost on device switch, browser clear, or incognito.
// No server validation of prices. No abandoned cart analytics.
Correct: Server-side cart with client cache
// GOOD: server-side cart (Redis) with client-side sync
async function addToCart(productId, qty) {
  const res = await fetch("/api/cart", {
    method: "POST",
    body: JSON.stringify({ product_id: productId, qty }),
    headers: {
      "Authorization": `Bearer ${token}`,
      "Content-Type": "application/json", // body is JSON
    },
  });
  const cart = await res.json();
  // Cache the server's view of the cart; the server remains the source of truth
  sessionStorage.setItem("cart_cache", JSON.stringify(cart));
  return cart;
}
Common Pitfalls
- Overselling during flash sales: Inventory checks pass at application level but concurrent requests create a race condition. Fix: UPDATE ... WHERE stock >= qty with row-level locking in PostgreSQL. [src2]
- Cart price drift: Product prices change between add-to-cart and checkout. Fix: Re-validate all prices at checkout time; show price change warnings. [src6]
- Distributed transaction failures: Using 2PC across microservices causes tight coupling. Fix: Replace with Saga pattern using compensating transactions and idempotency keys. [src3]
- Search index lag: Products updated in catalog don't appear in search for minutes. Fix: Use CDC with Debezium or event publishing with <2s latency. [src1]
- Session stickiness dependency: Relying on sticky sessions for cart state means losing carts on server failure. Fix: Store cart in Redis Cluster keyed by user ID. [src2]
- Payment webhook idempotency: Payment gateway sends duplicate webhooks. Fix: Store processed webhook ID in database; use unique constraints on payment_intent_id. [src7]
- N+1 queries on product listing: Loading product list that fetches category, images, reviews per product. Fix: Use batch loading (DataLoader pattern) or materialized views. [src6]
- Missing circuit breakers: One slow downstream service cascades failures. Fix: Implement circuit breakers with fallback responses and retry budgets. [src4]
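The circuit breaker fix from the last pitfall can be sketched in a few lines. This is an illustrative, single-threaded version with hypothetical names (`CircuitBreaker`, `CircuitOpenError`). The injected `clock` parameter exists only to make the behaviour testable. Production systems usually rely on a maintained library or a service mesh rather than hand-rolled code.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open)
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_recommendations, user_id)` with a hypothetical `fetch_recommendations`, and serve a fallback response when `CircuitOpenError` is raised.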
Diagnostic Commands
# Check service health across all microservices
for svc in catalog cart order payment inventory search; do
  curl -s "http://${svc}-service:8080/health" | jq '.status'
done
# Monitor Kafka consumer lag (detect processing bottlenecks)
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group order-service
# Check PostgreSQL active connections and locks
psql -c "SELECT pid, state, query FROM pg_stat_activity WHERE state != 'idle';"
# Redis memory and cart key count
redis-cli INFO memory | grep used_memory_human
redis-cli DBSIZE
# Elasticsearch cluster health
curl -s localhost:9200/_cluster/health | jq '.status,.active_shards'
# Kubernetes pod status
kubectl get pods -n ecommerce -o wide
kubectl get events -n ecommerce --sort-by='.lastTimestamp' | tail -20
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Building a custom e-commerce platform with >1K daily orders | Selling <100 products with simple needs | Shopify, WooCommerce, or BigCommerce |
| Team has 5+ backend engineers and DevOps capability | Solo developer or small team without Kubernetes experience | Modular monolith or managed platform |
| Need independent scaling of catalog, search, and checkout | All components have similar load patterns | Modular monolith with domain modules |
| Regulatory requirements demand service isolation (PCI scope reduction) | No compliance requirements and simple payment flow | Monolith with Stripe Checkout |
| Multi-region deployment required for <100ms latency globally | Single-region audience with acceptable latency | Single-region deployment with CDN |
| Flash sales or highly variable traffic patterns | Steady, predictable traffic with no spikes | Fixed-size deployment with load balancer |
Important Caveats
- Microservices architecture adds significant operational complexity — distributed tracing, service mesh, and container orchestration are prerequisites, not nice-to-haves. Teams without this expertise should start with a modular monolith.
- Eventual consistency between services means users may briefly see stale data. Design the UI to account for this, for example with loading indicators and optimistic updates that are reconciled against the server response.
- Shopify processes 20+ TB/minute on a modular monolith (Ruby on Rails with Packwerk). Do not default to microservices — many successful e-commerce platforms at significant scale use modular monoliths.
- Database-per-service means no JOINs across services. Cross-service queries require API composition at the gateway level or materialized read models (CQRS). Budget extra time for reporting and analytics.
- Payment service architecture should change rarely due to PCI DSS compliance burden. Isolate it behind a stable API contract and use feature flags for payment method additions.
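The API composition mentioned in the database-per-service caveat replaces a SQL JOIN with two service calls and an in-memory merge. The sketch below assumes hypothetical `fetch_orders` and `fetch_products` callables standing in for HTTP calls to the order and catalog services. The field names are illustrative.

```python
from typing import Callable, Dict, List


def compose_order_history(
    user_id: str,
    fetch_orders: Callable[[str], List[dict]],
    fetch_products: Callable[[List[str]], Dict[str, dict]],
) -> List[dict]:
    """Join a user's orders with product names without a cross-service JOIN."""
    orders = fetch_orders(user_id)
    # Collect every product ID once and fetch them in a single batched call,
    # which avoids one catalog request per order line (the N+1 pattern).
    product_ids = sorted({i["product_id"] for o in orders for i in o["items"]})
    products = fetch_products(product_ids)
    return [
        {
            "order_id": o["order_id"],
            "items": [
                {
                    "product_id": i["product_id"],
                    "qty": i["qty"],
                    # Catalog may lag or the product may be gone; degrade gracefully
                    "name": products.get(i["product_id"], {}).get("name", "(unavailable)"),
                }
                for i in o["items"]
            ],
        }
        for o in orders
    ]
```

The composer tolerates missing products because each service owns its data and the two views can disagree for a short time.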