Multi-Tenant SaaS Architecture Design Guide

Type: Software Reference Confidence: 0.92 Sources: 7 Verified: 2026-02-23 Freshness: 2026-02-23

TL;DR

Constraints

Quick Reference

ComponentRoleTechnology OptionsScaling Strategy
API GatewayRoute requests, inject tenant contextKong, AWS API Gateway, Envoy, NginxHorizontal (stateless)
Auth / IdentityAuthenticate users, issue tenant-scoped JWTsAuth0, AWS Cognito, Keycloak, Azure AD B2CManaged service / federation
Tenant ResolverExtract tenant_id from JWT/subdomain/headerCustom middleware (Express, FastAPI, Spring)Stateless, per-instance
Application LayerBusiness logic with tenant-scoped data accessNode.js, Python, Go, Java microservicesHorizontal auto-scaling
Shared Database (Pool)Single DB, tenant_id column + RLSPostgreSQL + RLS, Azure SQL elastic poolsVertical + read replicas
Schema-per-TenantSeparate schema per tenant, shared DB serverPostgreSQL schemas, MySQL databasesVertical + schema sharding
Database-per-Tenant (Silo)Dedicated DB instance per tenantRDS per tenant, Aurora, Azure SQL per DBIndependent scaling per tenant
Caching LayerTenant-prefixed keys, prevent cross-tenant leaksRedis, Memcached, ElastiCacheCluster mode with key namespacing
Background JobsTenant-aware queue processing, fair schedulingSQS + tenant tags, Celery, BullMQPer-tenant rate limiting
Billing / MeteringTrack per-tenant usage for tiered pricingStripe, AWS Marketplace, custom meteringEvent-driven aggregation
Tenant OnboardingProvision resources, seed data, configure DNSCustom pipeline (Terraform, Pulumi, scripts)Automated, idempotent
MonitoringPer-tenant metrics, noisy neighbor detectionDatadog, Prometheus + tenant labels, CloudWatchTag-based tenant dashboards
Config / Feature FlagsPer-tenant feature toggles, tier-based limitsLaunchDarkly, Unleash, custom tenant_config tableCached, per-request lookup

Decision Tree

START: How many tenants and what isolation level?
|
+-- Expected tenants > 10,000?
|   +-- YES --> Shared database + RLS (pool model)
|   |           Cost-effective, simplest ops, use tenant_id column everywhere
|   +-- NO  |
|           v
+-- Compliance requires dedicated data stores (HIPAA, FedRAMP, data residency)?
|   +-- YES --> Database-per-tenant (silo model)
|   |           Maximum isolation, per-tenant backup/restore, highest cost
|   +-- NO  |
|           v
+-- Tenants < 5,000 and want schema-level isolation without separate DBs?
|   +-- YES --> Schema-per-tenant (bridge model)
|   |           Good isolation, shared infra, watch for catalog bloat
|   +-- NO  |
|           v
+-- Mixed tenant sizes (SMB + enterprise)?
|   +-- YES --> Hybrid: pool for SMB tenants, silo for enterprise tenants
|   |           Route based on tenant tier; automate promotion workflow
|   +-- NO  |
|           v
+-- DEFAULT --> Start with shared database + RLS
                Easiest to build, migrate to hybrid when needed

Step-by-Step Guide

1. Define your tenant isolation model

Evaluate your requirements across four dimensions: security/compliance, tenant count, operational complexity, and cost. The three fundamental models are pool (shared everything), bridge (schema-per-tenant), and silo (database-per-tenant). Most teams should start with pool + RLS and create a promotion path to silo for enterprise customers. [src1]

Verify: Document your choice in an Architecture Decision Record (ADR) with trade-offs listed.

2. Implement tenant-aware authentication

Every authenticated request must carry a trusted tenant_id. Embed it as a custom claim in JWTs during login. Never accept tenant_id from request headers or URL parameters without validation against the token. [src7]

# Python / FastAPI -- extract tenant from JWT
from fastapi import Request, HTTPException
from jose import jwt, JWTError

async def get_tenant_id(request: Request) -> str:
    token = request.headers.get("Authorization", "").replace("Bearer ", "")
    try:
        payload = jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
        tenant_id = payload.get("tenant_id")
        if not tenant_id:
            raise HTTPException(status_code=403, detail="Missing tenant claim")
        return tenant_id
    except JWTError:
        raise HTTPException(status_code=401, detail="Invalid token")

Verify: Decode a sample JWT and confirm tenant_id claim is present.

3. Build tenant context middleware

Create middleware that runs on every request, extracts tenant_id, and makes it available to all downstream code. This is the single most critical piece -- if any route bypasses this middleware, you have a data leak. [src1] [src7]

// Node.js / Express -- tenant context middleware
const { AsyncLocalStorage } = require('node:async_hooks');
const tenantStore = new AsyncLocalStorage();

function tenantMiddleware(req, res, next) {
  const tenantId = req.auth?.tenant_id;
  if (!tenantId) return res.status(403).json({ error: 'No tenant context' });
  tenantStore.run({ tenantId }, () => next());
}

function getCurrentTenant() {
  return tenantStore.getStore()?.tenantId;
}

module.exports = { tenantMiddleware, getCurrentTenant, tenantStore };

Verify: Add a test route that returns getCurrentTenant() -- confirm it matches the JWT's tenant_id.

4. Configure database-level isolation (RLS)

For the shared-database model, enable PostgreSQL Row-Level Security so the database itself enforces tenant boundaries. This is your safety net -- even if application code has a bug, the database will not return another tenant's data. [src3]

-- PostgreSQL RLS setup for multi-tenant table
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.current_tenant')::uuid);

-- Set tenant context per request (call from app middleware)
SET LOCAL app.current_tenant = 'tenant-uuid-here';

Verify: SELECT * FROM orders; as a role with RLS active but no app.current_tenant set -- should return 0 rows.

5. Wire tenant context to database queries

Every database connection must set the tenant context before executing queries. With connection pooling, set it at the start of each transaction, not per-connection. [src3]

# Python / SQLAlchemy -- set tenant context per request
from sqlalchemy import event, text
from contextvars import ContextVar

current_tenant: ContextVar[str] = ContextVar('current_tenant')

@event.listens_for(engine, "after_begin")
def set_tenant_context(session, transaction, connection):
    tenant_id = current_tenant.get(None)
    if tenant_id:
        connection.execute(text(
            "SET LOCAL app.current_tenant = :tid"
        ), {"tid": tenant_id})

Verify: Run SHOW app.current_tenant; inside a test transaction -- should show the expected tenant UUID.

6. Implement noisy neighbor protection

Add per-tenant rate limiting, resource quotas, and monitoring. Without these, a single tenant's traffic spike degrades all tenants. [src4] [src5]

# Per-tenant rate limiting with Redis
import redis, time

r = redis.Redis()

def check_rate_limit(tenant_id: str, max_rpm: int = 1000) -> bool:
    key = f"ratelimit:{tenant_id}:{int(time.time()) // 60}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 120)
    return count <= max_rpm

Verify: Fire max_rpm + 1 requests for one tenant -- confirm the last request is rejected while another tenant's requests succeed.

7. Set up tenant-aware monitoring

Tag all metrics and logs with tenant_id. Create dashboards that show per-tenant resource consumption, error rates, and latency percentiles. Alert on tenants exceeding 2x the median resource usage. [src4]

Verify: Query your monitoring system for a specific tenant_id -- confirm metrics are segmented correctly.

Code Examples

Python / FastAPI: Complete Tenant Middleware

# Input:  HTTP request with Bearer JWT containing tenant_id claim
# Output: tenant_id available via ContextVar for all downstream code

from contextvars import ContextVar
from fastapi import FastAPI, Request, HTTPException
from starlette.middleware.base import BaseHTTPMiddleware

tenant_ctx: ContextVar[str] = ContextVar("tenant_ctx", default="")

class TenantMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        tid = getattr(request.state, "tenant_id", None)
        if not tid:
            raise HTTPException(403, "Missing tenant context")
        token = tenant_ctx.set(tid)
        response = await call_next(request)
        tenant_ctx.reset(token)
        return response

Node.js / Express: Tenant-Scoped Database Query

// Input:  Express request with req.auth.tenant_id set by JWT middleware
// Output: query results scoped to the authenticated tenant only

const { Pool } = require('pg');
const pool = new Pool();

async function tenantQuery(tenantId, sql, params = []) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query("SET LOCAL app.current_tenant = $1", [tenantId]);
    const result = await client.query(sql, params);
    await client.query('COMMIT');
    return result.rows;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}

PostgreSQL: Full RLS Setup with Indexes

-- Input:  Multi-tenant orders table
-- Output: Row-level isolation enforced by the database engine

-- 1. Add tenant column and index
ALTER TABLE orders ADD COLUMN tenant_id uuid NOT NULL;
CREATE INDEX idx_orders_tenant ON orders (tenant_id);

-- 2. Enable and force RLS
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;

-- 3. Create policies for each operation
CREATE POLICY tenant_select ON orders FOR SELECT
  USING (tenant_id = current_setting('app.current_tenant')::uuid);

CREATE POLICY tenant_insert ON orders FOR INSERT
  WITH CHECK (tenant_id = current_setting('app.current_tenant')::uuid);

CREATE POLICY tenant_update ON orders FOR UPDATE
  USING (tenant_id = current_setting('app.current_tenant')::uuid)
  WITH CHECK (tenant_id = current_setting('app.current_tenant')::uuid);

CREATE POLICY tenant_delete ON orders FOR DELETE
  USING (tenant_id = current_setting('app.current_tenant')::uuid);

Anti-Patterns

Wrong: Trusting client-supplied tenant_id

# BAD -- tenant_id from request body, not from verified JWT
@app.post("/orders")
async def create_order(request: Request):
    body = await request.json()
    tenant_id = body["tenant_id"]  # attacker can set any tenant_id
    db.execute("INSERT INTO orders (tenant_id, ...) VALUES (%s, ...)", tenant_id)

Correct: Extract tenant_id from verified JWT only

# GOOD -- tenant_id from server-verified JWT claim
@app.post("/orders")
async def create_order(request: Request):
    tenant_id = request.state.tenant_id  # set by auth middleware from JWT
    db.execute("INSERT INTO orders (tenant_id, ...) VALUES (%s, ...)", tenant_id)

Wrong: No RLS -- relying only on application WHERE clauses

# BAD -- single missed WHERE clause = full data leak
def get_orders(tenant_id):
    return db.execute("SELECT * FROM orders WHERE tenant_id = %s", tenant_id)
    # What if another query forgets the WHERE clause?

Correct: Database-enforced RLS as safety net

# GOOD -- RLS enforces isolation even if app code is buggy
def get_orders():
    # tenant_id filter applied automatically by PostgreSQL RLS policy
    # SET LOCAL app.current_tenant was called by middleware
    return db.execute("SELECT * FROM orders")

Wrong: Global cache keys without tenant prefix

# BAD -- cache key collision between tenants
cache.set("user:123:profile", user_data)  # which tenant's user 123?

Correct: Tenant-namespaced cache keys

# GOOD -- cache key includes tenant_id
cache.set(f"tenant:{tenant_id}:user:123:profile", user_data)

Wrong: Single connection pool shared without tenant context reset

# BAD -- tenant context leaks between requests in a connection pool
conn = pool.get_connection()
conn.execute("SET app.current_tenant = %s", [tenant_a])
conn.execute("SELECT * FROM orders")  # returns tenant_a's data
pool.return_connection(conn)  # context still set to tenant_a!

Correct: SET LOCAL within transaction boundaries

# GOOD -- SET LOCAL scoped to transaction, auto-resets on COMMIT/ROLLBACK
conn = pool.get_connection()
conn.execute("BEGIN")
conn.execute("SET LOCAL app.current_tenant = %s", [tenant_b])
conn.execute("SELECT * FROM orders")  # scoped to tenant_b
conn.execute("COMMIT")  # context automatically cleared
pool.return_connection(conn)

Common Pitfalls

Diagnostic Commands

-- Check RLS policies on a table
SELECT schemaname, tablename, policyname, permissive, roles, cmd, qual
FROM pg_policies WHERE tablename = 'orders';

-- Verify RLS is enabled and forced
SELECT relname, relrowsecurity, relforcerowsecurity
FROM pg_class WHERE relname = 'orders';

-- Check current tenant context in session
SHOW app.current_tenant;

-- Count rows per tenant (detect data skew / noisy tenants)
SELECT tenant_id, COUNT(*) as row_count
FROM orders GROUP BY tenant_id ORDER BY row_count DESC LIMIT 20;

-- Monitor active connections per tenant
SELECT usename, client_addr, COUNT(*) as conn_count
FROM pg_stat_activity WHERE state = 'active'
GROUP BY usename, client_addr ORDER BY conn_count DESC;

-- Check for missing tenant_id indexes
SELECT t.tablename, i.indexname
FROM pg_tables t
LEFT JOIN pg_indexes i ON t.tablename = i.tablename AND i.indexdef LIKE '%tenant_id%'
WHERE t.schemaname = 'public' AND i.indexname IS NULL
  AND EXISTS (
    SELECT 1 FROM information_schema.columns c
    WHERE c.table_name = t.tablename AND c.column_name = 'tenant_id'
  );

When to Use / When Not to Use

Use WhenDon't Use WhenUse Instead
Building a B2B SaaS product serving multiple organizationsBuilding a consumer app where each user is independentStandard user-based auth with row-level permissions
Need to amortize infrastructure costs across many customersTenants require physically separate infrastructure (FedRAMP High)Account-per-tenant with AWS Organizations / Azure Management Groups
Tenants share the same application version and feature setEach customer needs heavy customization of business logicPlugin architecture or customer-deployed instances
100+ tenants with similar data volumes and usage patternsFewer than 5 large customers with vastly different SLAsDedicated deployment per customer
Offering tiered plans (free, pro, enterprise) with different limitsReal-time systems where microsecond latency mattersDirect table access with compile-time tenant verification

Important Caveats

Related Units