Multi-Tenant SaaS Architecture Design Guide
How do I design a multi-tenant SaaS architecture?
TL;DR
- Bottom line: Choose between shared-database (pool), schema-per-tenant, or database-per-tenant isolation based on your tenant count, compliance needs, and cost tolerance -- most SaaS products start with shared-database + RLS and graduate enterprise tenants to dedicated databases.
- Key tool/command:
ALTER TABLE tenants ENABLE ROW LEVEL SECURITY; CREATE POLICY tenant_isolation ... - Watch out for: Forgetting to set tenant context on every request -- a single missed middleware path leaks cross-tenant data.
- Works with: PostgreSQL 12+, MySQL 8.0+, any cloud (AWS/Azure/GCP), any language with middleware support.
Constraints
- Tenant isolation is a security requirement, not optional -- a single leaked tenant_id in a query can expose all tenant data
- Row-Level Security (RLS) policies must be enabled AND enforced -- enabling RLS without creating policies silently blocks all rows
- Schema-per-tenant hits PostgreSQL catalog limits at ~5,000-10,000 schemas -- plan migration path before that threshold
- JWT tenant claims must be validated server-side -- never trust client-supplied tenant_id from headers or query params alone
- Database connection pooling (PgBouncer) in transaction mode breaks
SET app.current_tenant-- use session mode or pass tenant context per query
Quick Reference
| Component | Role | Technology Options | Scaling Strategy |
|---|---|---|---|
| API Gateway | Route requests, inject tenant context | Kong, AWS API Gateway, Envoy, Nginx | Horizontal (stateless) |
| Auth / Identity | Authenticate users, issue tenant-scoped JWTs | Auth0, AWS Cognito, Keycloak, Azure AD B2C | Managed service / federation |
| Tenant Resolver | Extract tenant_id from JWT/subdomain/header | Custom middleware (Express, FastAPI, Spring) | Stateless, per-instance |
| Application Layer | Business logic with tenant-scoped data access | Node.js, Python, Go, Java microservices | Horizontal auto-scaling |
| Shared Database (Pool) | Single DB, tenant_id column + RLS | PostgreSQL + RLS, Azure SQL elastic pools | Vertical + read replicas |
| Schema-per-Tenant | Separate schema per tenant, shared DB server | PostgreSQL schemas, MySQL databases | Vertical + schema sharding |
| Database-per-Tenant (Silo) | Dedicated DB instance per tenant | RDS per tenant, Aurora, Azure SQL per DB | Independent scaling per tenant |
| Caching Layer | Tenant-prefixed keys, prevent cross-tenant leaks | Redis, Memcached, ElastiCache | Cluster mode with key namespacing |
| Background Jobs | Tenant-aware queue processing, fair scheduling | SQS + tenant tags, Celery, BullMQ | Per-tenant rate limiting |
| Billing / Metering | Track per-tenant usage for tiered pricing | Stripe, AWS Marketplace, custom metering | Event-driven aggregation |
| Tenant Onboarding | Provision resources, seed data, configure DNS | Custom pipeline (Terraform, Pulumi, scripts) | Automated, idempotent |
| Monitoring | Per-tenant metrics, noisy neighbor detection | Datadog, Prometheus + tenant labels, CloudWatch | Tag-based tenant dashboards |
| Config / Feature Flags | Per-tenant feature toggles, tier-based limits | LaunchDarkly, Unleash, custom tenant_config table | Cached, per-request lookup |
Decision Tree
START: How many tenants and what isolation level?
|
+-- Expected tenants > 10,000?
| +-- YES --> Shared database + RLS (pool model)
| | Cost-effective, simplest ops, use tenant_id column everywhere
| +-- NO |
| v
+-- Compliance requires dedicated data stores (HIPAA, FedRAMP, data residency)?
| +-- YES --> Database-per-tenant (silo model)
| | Maximum isolation, per-tenant backup/restore, highest cost
| +-- NO |
| v
+-- Tenants < 5,000 and want schema-level isolation without separate DBs?
| +-- YES --> Schema-per-tenant (bridge model)
| | Good isolation, shared infra, watch for catalog bloat
| +-- NO |
| v
+-- Mixed tenant sizes (SMB + enterprise)?
| +-- YES --> Hybrid: pool for SMB tenants, silo for enterprise tenants
| | Route based on tenant tier; automate promotion workflow
| +-- NO |
| v
+-- DEFAULT --> Start with shared database + RLS
Easiest to build, migrate to hybrid when needed
Step-by-Step Guide
1. Define your tenant isolation model
Evaluate your requirements across four dimensions: security/compliance, tenant count, operational complexity, and cost. The three fundamental models are pool (shared everything), bridge (schema-per-tenant), and silo (database-per-tenant). Most teams should start with pool + RLS and create a promotion path to silo for enterprise customers. [src1]
Verify: Document your choice in an Architecture Decision Record (ADR) with trade-offs listed.
2. Implement tenant-aware authentication
Every authenticated request must carry a trusted tenant_id. Embed it as a custom claim in JWTs during login. Never accept tenant_id from request headers or URL parameters without validation against the token. [src7]
# Python / FastAPI -- extract tenant from JWT
from fastapi import Request, HTTPException
from jose import jwt, JWTError
async def get_tenant_id(request: Request) -> str:
token = request.headers.get("Authorization", "").replace("Bearer ", "")
try:
payload = jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
tenant_id = payload.get("tenant_id")
if not tenant_id:
raise HTTPException(status_code=403, detail="Missing tenant claim")
return tenant_id
except JWTError:
raise HTTPException(status_code=401, detail="Invalid token")
Verify: Decode a sample JWT and confirm tenant_id claim is present.
3. Build tenant context middleware
Create middleware that runs on every request, extracts tenant_id, and makes it available to all downstream code. This is the single most critical piece -- if any route bypasses this middleware, you have a data leak. [src1] [src7]
// Node.js / Express -- tenant context middleware
const { AsyncLocalStorage } = require('node:async_hooks');
const tenantStore = new AsyncLocalStorage();
function tenantMiddleware(req, res, next) {
const tenantId = req.auth?.tenant_id;
if (!tenantId) return res.status(403).json({ error: 'No tenant context' });
tenantStore.run({ tenantId }, () => next());
}
function getCurrentTenant() {
return tenantStore.getStore()?.tenantId;
}
module.exports = { tenantMiddleware, getCurrentTenant, tenantStore };
Verify: Add a test route that returns getCurrentTenant() -- confirm it matches the JWT's tenant_id.
4. Configure database-level isolation (RLS)
For the shared-database model, enable PostgreSQL Row-Level Security so the database itself enforces tenant boundaries. This is your safety net -- even if application code has a bug, the database will not return another tenant's data. [src3]
-- PostgreSQL RLS setup for multi-tenant table
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON orders
USING (tenant_id = current_setting('app.current_tenant')::uuid);
-- Set tenant context per request (call from app middleware)
SET LOCAL app.current_tenant = 'tenant-uuid-here';
Verify: SELECT * FROM orders; as a role with RLS active but no app.current_tenant set -- should return 0 rows.
5. Wire tenant context to database queries
Every database connection must set the tenant context before executing queries. With connection pooling, set it at the start of each transaction, not per-connection. [src3]
# Python / SQLAlchemy -- set tenant context per request
from sqlalchemy import event, text
from contextvars import ContextVar
current_tenant: ContextVar[str] = ContextVar('current_tenant')
@event.listens_for(engine, "after_begin")
def set_tenant_context(session, transaction, connection):
tenant_id = current_tenant.get(None)
if tenant_id:
connection.execute(text(
"SET LOCAL app.current_tenant = :tid"
), {"tid": tenant_id})
Verify: Run SHOW app.current_tenant; inside a test transaction -- should show the expected tenant UUID.
6. Implement noisy neighbor protection
Add per-tenant rate limiting, resource quotas, and monitoring. Without these, a single tenant's traffic spike degrades all tenants. [src4] [src5]
# Per-tenant rate limiting with Redis
import redis, time
r = redis.Redis()
def check_rate_limit(tenant_id: str, max_rpm: int = 1000) -> bool:
key = f"ratelimit:{tenant_id}:{int(time.time()) // 60}"
count = r.incr(key)
if count == 1:
r.expire(key, 120)
return count <= max_rpm
Verify: Fire max_rpm + 1 requests for one tenant -- confirm the last request is rejected while another tenant's requests succeed.
7. Set up tenant-aware monitoring
Tag all metrics and logs with tenant_id. Create dashboards that show per-tenant resource consumption, error rates, and latency percentiles. Alert on tenants exceeding 2x the median resource usage. [src4]
Verify: Query your monitoring system for a specific tenant_id -- confirm metrics are segmented correctly.
Code Examples
Python / FastAPI: Complete Tenant Middleware
# Input: HTTP request with Bearer JWT containing tenant_id claim
# Output: tenant_id available via ContextVar for all downstream code
from contextvars import ContextVar
from fastapi import FastAPI, Request, HTTPException
from starlette.middleware.base import BaseHTTPMiddleware
tenant_ctx: ContextVar[str] = ContextVar("tenant_ctx", default="")
class TenantMiddleware(BaseHTTPMiddleware):
async def dispatch(self, request: Request, call_next):
tid = getattr(request.state, "tenant_id", None)
if not tid:
raise HTTPException(403, "Missing tenant context")
token = tenant_ctx.set(tid)
response = await call_next(request)
tenant_ctx.reset(token)
return response
Node.js / Express: Tenant-Scoped Database Query
// Input: Express request with req.auth.tenant_id set by JWT middleware
// Output: query results scoped to the authenticated tenant only
const { Pool } = require('pg');
const pool = new Pool();
async function tenantQuery(tenantId, sql, params = []) {
const client = await pool.connect();
try {
await client.query('BEGIN');
await client.query("SET LOCAL app.current_tenant = $1", [tenantId]);
const result = await client.query(sql, params);
await client.query('COMMIT');
return result.rows;
} catch (err) {
await client.query('ROLLBACK');
throw err;
} finally {
client.release();
}
}
PostgreSQL: Full RLS Setup with Indexes
-- Input: Multi-tenant orders table
-- Output: Row-level isolation enforced by the database engine
-- 1. Add tenant column and index
ALTER TABLE orders ADD COLUMN tenant_id uuid NOT NULL;
CREATE INDEX idx_orders_tenant ON orders (tenant_id);
-- 2. Enable and force RLS
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;
-- 3. Create policies for each operation
CREATE POLICY tenant_select ON orders FOR SELECT
USING (tenant_id = current_setting('app.current_tenant')::uuid);
CREATE POLICY tenant_insert ON orders FOR INSERT
WITH CHECK (tenant_id = current_setting('app.current_tenant')::uuid);
CREATE POLICY tenant_update ON orders FOR UPDATE
USING (tenant_id = current_setting('app.current_tenant')::uuid)
WITH CHECK (tenant_id = current_setting('app.current_tenant')::uuid);
CREATE POLICY tenant_delete ON orders FOR DELETE
USING (tenant_id = current_setting('app.current_tenant')::uuid);
Anti-Patterns
Wrong: Trusting client-supplied tenant_id
# BAD -- tenant_id from request body, not from verified JWT
@app.post("/orders")
async def create_order(request: Request):
body = await request.json()
tenant_id = body["tenant_id"] # attacker can set any tenant_id
db.execute("INSERT INTO orders (tenant_id, ...) VALUES (%s, ...)", tenant_id)
Correct: Extract tenant_id from verified JWT only
# GOOD -- tenant_id from server-verified JWT claim
@app.post("/orders")
async def create_order(request: Request):
tenant_id = request.state.tenant_id # set by auth middleware from JWT
db.execute("INSERT INTO orders (tenant_id, ...) VALUES (%s, ...)", tenant_id)
Wrong: No RLS -- relying only on application WHERE clauses
# BAD -- single missed WHERE clause = full data leak
def get_orders(tenant_id):
return db.execute("SELECT * FROM orders WHERE tenant_id = %s", tenant_id)
# What if another query forgets the WHERE clause?
Correct: Database-enforced RLS as safety net
# GOOD -- RLS enforces isolation even if app code is buggy
def get_orders():
# tenant_id filter applied automatically by PostgreSQL RLS policy
# SET LOCAL app.current_tenant was called by middleware
return db.execute("SELECT * FROM orders")
Wrong: Global cache keys without tenant prefix
# BAD -- cache key collision between tenants
cache.set("user:123:profile", user_data) # which tenant's user 123?
Correct: Tenant-namespaced cache keys
# GOOD -- cache key includes tenant_id
cache.set(f"tenant:{tenant_id}:user:123:profile", user_data)
Wrong: Single connection pool shared without tenant context reset
# BAD -- tenant context leaks between requests in a connection pool
conn = pool.get_connection()
conn.execute("SET app.current_tenant = %s", [tenant_a])
conn.execute("SELECT * FROM orders") # returns tenant_a's data
pool.return_connection(conn) # context still set to tenant_a!
Correct: SET LOCAL within transaction boundaries
# GOOD -- SET LOCAL scoped to transaction, auto-resets on COMMIT/ROLLBACK
conn = pool.get_connection()
conn.execute("BEGIN")
conn.execute("SET LOCAL app.current_tenant = %s", [tenant_b])
conn.execute("SELECT * FROM orders") # scoped to tenant_b
conn.execute("COMMIT") # context automatically cleared
pool.return_connection(conn)
Common Pitfalls
- Noisy neighbor degradation: One tenant's traffic spike consumes shared database connections, CPU, or memory, degrading all tenants. Fix: implement per-tenant rate limiting, connection quotas, and auto-scaling triggers based on per-tenant metrics. [src4]
- Connection pool exhaustion with schema-per-tenant: Each schema requires its own search_path, breaking connection pooling. Fix: use
SET LOCAL search_pathwithin transactions, or use a pooler that supports schema switching. [src6] - Migration drift across tenant schemas: Schema-per-tenant makes DDL migrations N times harder -- one failure leaves schemas inconsistent. Fix: use idempotent migrations (
IF NOT EXISTS), run in parallel with rollback capability. [src6] - Tenant context not set on background jobs: Async workers (Celery, BullMQ, SQS) often miss the middleware that sets tenant context, causing queries to fail or return empty results. Fix: serialize tenant_id into every job payload and set context at worker start. [src7]
- Backup/restore restores wrong tenant's data: In shared-database models, pg_restore restores all tenants. Fix: use logical backups with tenant_id filters for single-tenant recovery. [src5]
- PgBouncer transaction pooling breaks SET commands:
SET app.current_tenantpersists across transactions in transaction mode. Fix: useSET LOCAL(transaction-scoped) or switch to session pooling. [src3] - Tenant data leaks in error messages: Stack traces or error responses may include data from the wrong tenant. Fix: sanitize all error responses and never include raw database errors in API responses. [src2]
- Overcounting resource usage in shared metrics: Without tenant-tagged metrics, you cannot identify which tenant causes spikes. Fix: tag every metric, log entry, and trace span with tenant_id from the start. [src4]
Diagnostic Commands
-- Check RLS policies on a table
SELECT schemaname, tablename, policyname, permissive, roles, cmd, qual
FROM pg_policies WHERE tablename = 'orders';
-- Verify RLS is enabled and forced
SELECT relname, relrowsecurity, relforcerowsecurity
FROM pg_class WHERE relname = 'orders';
-- Check current tenant context in session
SHOW app.current_tenant;
-- Count rows per tenant (detect data skew / noisy tenants)
SELECT tenant_id, COUNT(*) as row_count
FROM orders GROUP BY tenant_id ORDER BY row_count DESC LIMIT 20;
-- Monitor active connections per tenant
SELECT usename, client_addr, COUNT(*) as conn_count
FROM pg_stat_activity WHERE state = 'active'
GROUP BY usename, client_addr ORDER BY conn_count DESC;
-- Check for missing tenant_id indexes
SELECT t.tablename, i.indexname
FROM pg_tables t
LEFT JOIN pg_indexes i ON t.tablename = i.tablename AND i.indexdef LIKE '%tenant_id%'
WHERE t.schemaname = 'public' AND i.indexname IS NULL
AND EXISTS (
SELECT 1 FROM information_schema.columns c
WHERE c.table_name = t.tablename AND c.column_name = 'tenant_id'
);
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Building a B2B SaaS product serving multiple organizations | Building a consumer app where each user is independent | Standard user-based auth with row-level permissions |
| Need to amortize infrastructure costs across many customers | Tenants require physically separate infrastructure (FedRAMP High) | Account-per-tenant with AWS Organizations / Azure Management Groups |
| Tenants share the same application version and feature set | Each customer needs heavy customization of business logic | Plugin architecture or customer-deployed instances |
| 100+ tenants with similar data volumes and usage patterns | Fewer than 5 large customers with vastly different SLAs | Dedicated deployment per customer |
| Offering tiered plans (free, pro, enterprise) with different limits | Real-time systems where microsecond latency matters | Direct table access with compile-time tenant verification |
Important Caveats
- RLS adds 1-5% query overhead on PostgreSQL for simple queries. For complex queries with many JOINs, benchmark before assuming negligible impact.
- The hybrid model (pool + silo) doubles your operational complexity -- you need separate deployment, monitoring, and migration pipelines for each model. Only adopt when customer contracts require it.
- Schema-per-tenant in PostgreSQL degrades pg_catalog performance beyond ~5,000 schemas. This is a known limitation that requires migrating to database-per-tenant or consolidating to RLS.
- Multi-region deployments multiply the isolation challenge -- a tenant's data may need to stay in a specific region (GDPR), requiring region-aware routing at the API gateway level.
- Most ORMs (Django, Rails, Hibernate) do not natively support
SET LOCALfor tenant context -- you must add hooks/events to inject it, and these hooks must be tested rigorously for every query path.