Database Sharding Strategies: A Complete System Design Guide
What are the best database sharding strategies?
TL;DR
- Bottom line: Choose hash-based sharding for uniform distribution, range-based for time-series/ordered queries, geo-based for multi-region latency, or directory-based for maximum flexibility -- the shard key determines everything.
- Key tool/command:
SELECT create_distributed_table('orders', 'tenant_id');(Citus) orPARTITION BY HASH(Vitess vschema) - Watch out for: Choosing a low-cardinality shard key (e.g., country code or status enum) -- creates hot shards and uneven distribution that cannot be fixed without full migration.
- Works with: PostgreSQL + Citus, MySQL + Vitess/PlanetScale, MongoDB native sharding, CockroachDB, Cassandra/ScyllaDB, YugabyteDB, TiDB.
Constraints
- Never shard before exhausting vertical scaling, read replicas, caching, and query optimization -- sharding adds permanent operational complexity
- Shard key is immutable after data is distributed -- changing it requires full data migration across all shards
- Cross-shard JOINs and transactions are either unsupported or orders of magnitude slower -- design schema to keep related data on the same shard
- Auto-increment primary keys are not globally unique across shards -- use UUIDs, ULIDs, or Snowflake IDs
- Resharding (adding/removing shards) requires a migration plan -- consistent hashing minimizes data movement but does not eliminate it
- Unique constraints cannot span shards -- globally unique columns must include the shard key or use an external uniqueness service
Quick Reference
| Strategy | How It Works | Distribution | Range Queries | Resharding Cost | Best For | Worst For |
|---|---|---|---|---|---|---|
| Hash-based | shard = hash(key) % N |
Uniform (even) | Poor (scatter-gather) | High (rehash all data) | High-write workloads, random access by ID | Time-series, range scans |
| Consistent hashing | Hash ring with virtual nodes | Uniform | Poor | Low (only K/N keys move) | Elastic scaling, frequent shard add/remove | Range queries, ordered access |
| Range-based | shard = key_range(value) |
Depends on data | Excellent (single shard) | Medium (split/merge ranges) | Time-series, ordered data, date partitions | Monotonic keys (hot last shard) |
| Geo-based | shard = region(location) |
Uneven (population) | Good within region | Low (region reassignment) | Multi-region apps, data residency compliance | Global queries, small datasets |
| Directory-based | Lookup table maps key to shard | Configurable | Depends on mapping | Low (update directory) | Complex routing, tenant isolation | High throughput (directory = bottleneck) |
| Hash + range (compound) | Hash on tenant, range on time | Balanced | Good within tenant | Medium | Multi-tenant SaaS with time-series | Simple applications |
| Schema-based | Each tenant gets own schema | Isolated | Full SQL within tenant | Low (move schema) | Strong tenant isolation, compliance | Large tenant counts (>10K schemas) |
| Algorithmic (jump hash) | jump_consistent_hash(key, N) |
Near-perfect | Poor | Very low (minimal movement) | Caching layers, stateless routing | Dynamic shard topology |
Decision Tree
START -- What is your primary access pattern?
|
+-- Single-entity lookup (user_id, order_id, tenant_id)?
| +-- Need elastic scaling (frequently add/remove shards)?
| | +-- YES --> Consistent hashing (virtual nodes)
| | +-- NO --> Hash-based sharding (simple modulo or jump hash)
| |
+-- Range scans (time-series, date ranges, sequential IDs)?
| +-- Data is append-only (logs, events, IoT)?
| | +-- YES --> Range-based with time buckets (e.g., monthly shards)
| | +-- NO --> Range-based with dynamic range splitting
| |
+-- Geographic queries (user location, data residency)?
| +-- YES --> Geo-based sharding (shard per region/country)
| |
+-- Multi-tenant SaaS with mixed access patterns?
| +-- <1K tenants with strong isolation needs?
| | +-- YES --> Schema-based sharding (Citus schema-based)
| +-- >1K tenants?
| +-- YES --> Hash on tenant_id (co-locate all tenant data)
|
+-- Unknown or complex routing needs?
+-- YES --> Directory-based (lookup table with flexibility)
+-- DEFAULT --> Start with hash-based on most-queried entity ID
Step-by-Step Guide
1. Identify your shard key candidates
Analyze your query patterns to find the column that appears in the WHERE clause of 80%+ of your queries. The shard key must have high cardinality (thousands+ distinct values) and appear in most queries to avoid scatter-gather operations. [src7]
-- PostgreSQL: Find most common WHERE clause columns
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY calls DESC LIMIT 20;
Verify: The top 5 queries all include your candidate shard key in their WHERE clause.
2. Evaluate shard key quality
A good shard key has high cardinality, high frequency in queries, and low monotonicity. Test distribution before committing. [src3]
-- Check cardinality: should be >10x your planned shard count
SELECT COUNT(DISTINCT tenant_id) AS cardinality FROM orders;
-- Check distribution: no single value should exceed 5% of total
SELECT tenant_id, COUNT(*) AS row_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pct
FROM orders GROUP BY tenant_id ORDER BY row_count DESC LIMIT 20;
Verify: cardinality > 10 * planned_shards and max(pct) < 5%.
3. Choose your sharding strategy
Based on the decision tree, select hash, range, geo, or directory-based sharding. For most OLTP applications, hash-based on tenant_id or user_id is the safest default. [src6]
Verify: Your chosen strategy handles your top 5 queries without cross-shard operations.
4. Implement sharding with your chosen technology
Deploy using a proven sharding middleware or native distributed database rather than building custom sharding logic. [src1]
-- PostgreSQL + Citus (hash distribution)
SELECT create_distributed_table('orders', 'tenant_id');
SELECT create_distributed_table('order_items', 'tenant_id',
colocate_with => 'orders');
Verify: SELECT COUNT(*) FROM citus_shards; shows data distributed across all nodes.
5. Set up monitoring and rebalancing
Monitor shard distribution, query routing, and hotspots. Set up alerts for shard imbalance exceeding 20%. [src5]
-- Citus: Check shard sizes and distribution
SELECT nodename, COUNT(*) AS shard_count,
pg_size_pretty(SUM(shard_size)) AS total_size
FROM citus_shards GROUP BY nodename;
Verify: Shard sizes are within 20% of each other. No single shard handles >30% of total queries.
Code Examples
Python: Consistent Hashing Ring Implementation
# Input: List of shard nodes + data keys to route
# Output: Shard assignment for each key with minimal disruption on node changes
import hashlib
from bisect import bisect_right
class ConsistentHashRing:
def __init__(self, nodes=None, virtual_nodes=150):
self.ring, self.sorted_keys = {}, []
self.virtual_nodes = virtual_nodes
for node in (nodes or []): self.add_node(node)
def _hash(self, key): return int(hashlib.md5(key.encode()).hexdigest(), 16)
def add_node(self, node):
for i in range(self.virtual_nodes):
h = self._hash(f"{node}:vn{i}")
self.ring[h] = node; self.sorted_keys.append(h)
self.sorted_keys.sort()
Full script: python-consistent-hashing-ring-implementation.py (42 lines)
SQL/Vitess: VSchema Configuration for Hash-Based Sharding
// Vitess vschema -- defines sharding strategy for MySQL
{
"sharded": true,
"vindexes": { "hash_tenant": { "type": "hash" } },
"tables": {
"orders": {
"column_vindexes": [{ "column": "tenant_id", "name": "hash_tenant" }]
}
}
}
Full config: vschema-configuration.json (31 lines)
Go: Application-Level Shard Router
// Input: Shard key (tenant_id) + database connections
// Output: Routes queries to correct shard based on hash
func (r *ShardRouter) GetShard(tenantID string) *sql.DB {
h := fnv.New32a()
h.Write([]byte(tenantID))
idx := h.Sum32() % r.count
return r.shards[idx]
}
Full script: go-application-level-shard-router.go (34 lines)
Anti-Patterns
Wrong: Sharding by auto-increment ID with modulo
-- BAD -- Modulo on auto-increment creates hot shard for recent data.
-- All new inserts go to shard = (max_id + 1) % N, hammering one shard.
INSERT INTO orders_shard_%s VALUES (...); -- shard = order_id % 4
Correct: Hash the shard key with consistent hashing
-- GOOD -- Hash distributes evenly regardless of key ordering.
-- Adding a 5th shard moves only ~20% of data (1/N).
SELECT create_distributed_table('orders', 'tenant_id'); -- Citus handles hashing
Wrong: Using a low-cardinality column as shard key
# BAD -- country_code has ~200 values. With 16 shards, some shards
# get USA (40% of data) while others get Liechtenstein (0.001%).
shard = hash(row["country_code"]) % 16 # massive skew
Correct: Use a high-cardinality key, add geo-routing at application level
# GOOD -- Shard by tenant_id (millions of values = even distribution).
# Handle geo-routing in the application layer, not the shard key.
shard = consistent_hash(row["tenant_id"])
cluster = geo_directory[row["region"]]
Wrong: Cross-shard JOINs in hot-path queries
-- BAD -- This JOIN hits every shard, aggregates at coordinator.
-- 16 shards = 16 network round trips + merge. Latency: 50-500ms.
SELECT o.*, u.name FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE o.created_at > NOW() - INTERVAL '1 hour';
Correct: Co-locate related tables on the same shard key
-- GOOD -- Both tables sharded by tenant_id = JOIN stays on one shard.
SELECT create_distributed_table('orders', 'tenant_id');
SELECT create_distributed_table('users', 'tenant_id',
colocate_with => 'orders');
-- Query stays on one shard:
SELECT o.*, u.name FROM orders o
JOIN users u ON o.tenant_id = u.tenant_id AND o.user_id = u.user_id
WHERE o.tenant_id = 'abc123';
Wrong: Building custom sharding logic in application code
# BAD -- Homegrown shard routing = bugs, no rebalancing, no failover.
def get_connection(user_id):
shard_id = hash(user_id) % len(shards)
return psycopg2.connect(shards[shard_id])
# What about transactions? Failover? Schema migrations?
Correct: Use a proven sharding middleware
# GOOD -- Let Citus, Vitess, or ProxySQL handle routing, rebalancing,
# failover, and schema migrations. You write normal SQL.
conn = psycopg2.connect("host=citus-coordinator dbname=myapp")
cur = conn.cursor()
cur.execute("SELECT * FROM orders WHERE tenant_id = %s", ("abc123",))
Common Pitfalls
- Hot shards from monotonic keys: Sharding by auto-increment ID, timestamp, or ObjectId concentrates all recent writes on one shard. Fix: use hash-based sharding or add a random prefix. [src2]
- Cross-shard query explosion: Queries without the shard key in the WHERE clause hit all shards (scatter-gather). Fix: always include the shard key in queries. Denormalize data to avoid cross-shard JOINs. [src1]
- Unbalanced shard sizes over time: Data growth is rarely uniform -- some tenants grow 100x faster. Fix: implement shard splitting or tenant migration. Citus
rebalance_table_shards()handles this automatically. [src3] - Schema migrations across shards: DDL changes must be applied to every shard atomically. Fix: use middleware's built-in migration tool (Vitess
ApplySchema, Citus DDL propagation). [src1] - Connection pool exhaustion: N shards x M app servers = N*M connections. With 16 shards and 50 app servers, that is 800 connections per shard. Fix: use PgBouncer per shard or a connection-aware proxy. [src3]
- Distributed transaction overhead: Two-phase commit (2PC) across shards adds 2-10x latency. Fix: design for eventual consistency where possible. Use saga pattern for multi-shard workflows. [src4]
- Backup and restore complexity: Each shard must be backed up independently but restored to a consistent point-in-time. Fix: use coordinated snapshots or logical replication with consistent cutover. [src5]
- Testing with fewer shards than production: Bugs that only manifest with >2 shards go undetected. Fix: run CI tests with production shard count or at minimum 4 shards. [src6]
Diagnostic Commands
# PostgreSQL + Citus: Check shard distribution
psql -c "SELECT nodename, COUNT(*) AS shards, pg_size_pretty(SUM(shard_size))
FROM citus_shards GROUP BY nodename;"
# Citus: Find scatter-gather queries (hitting multiple shards)
psql -c "SELECT query, calls, mean_exec_time
FROM citus_stat_statements
WHERE partition_count > 1 ORDER BY calls DESC LIMIT 10;"
# MongoDB: Check balancer status and chunk distribution
mongosh --eval 'sh.status()'
mongosh --eval 'db.adminCommand({ balancerStatus: 1 })'
# Vitess: List tablets and their health
vtctlclient ListAllTablets | grep -v healthy
# General: Check shard key cardinality
psql -c "SELECT COUNT(DISTINCT tenant_id) AS cardinality,
COUNT(*) AS total_rows FROM orders;"
Version History & Compatibility
| Technology | Version | Sharding Features | Notes |
|---|---|---|---|
| PostgreSQL + Citus | 13.0 (2025) | Schema-based sharding, shard rebalancer, CDC | Citus 12 introduced schema-based sharding (2023) |
| Vitess | 21.x (2025) | VReplication, online resharding, VDiff | Used by Slack, GitHub, PlanetScale. CNCF graduated |
| MongoDB | 8.0 (2024) | Hashed + ranged sharding, zone sharding | Online resharding added in 6.0 (2022) |
| CockroachDB | 24.x (2025) | Automatic range splitting, hash-sharded indexes | Transparent sharding -- no manual shard management |
| TiDB | 8.x (2025) | Auto-sharding, TiFlash columnar, placement rules | MySQL-compatible, auto-splits at 96MB regions |
| YugabyteDB | 2024.x | Hash + range sharding, tablespace geo-partitioning | PostgreSQL-compatible, automatic tablet splitting |
| Cassandra | 5.0 (2024) | Consistent hashing (Murmur3), virtual nodes | Always sharded -- no single-node mode |
| ScyllaDB | 6.x (2025) | Tablets (replacing vnodes), Raft-based topology | Cassandra-compatible, 10x lower latency |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Single database exceeds 1-5TB and vertical scaling is maxed | Database is <500GB with room to scale vertically | Add more RAM/CPU, optimize queries, add read replicas |
| Write throughput exceeds single-node capacity (>50K writes/sec) | Read-heavy workload with low write volume | Read replicas with connection routing |
| Data residency requirements mandate geographic isolation | All users are in one region | Single-region deployment with edge caching |
| Multi-tenant SaaS needs tenant-level isolation and scaling | <100 tenants with similar data volumes | Schema-per-tenant or row-level security on single DB |
| Need to scale horizontally beyond single-machine limits | Application can tolerate latency from unoptimized indexes | Index optimization and query tuning first |
| Compliance requires data in specific jurisdictions (GDPR, CCPA) | No regulatory data residency requirements | Single deployment with encryption at rest |
Important Caveats
- Sharding is a one-way door operationally -- once data is distributed, unsharding (merging back) requires a full migration project that can take weeks to months depending on data volume
- Choosing the wrong shard key is the most expensive mistake: it requires migrating all data to a new distribution, which means double storage and extended downtime or complex dual-write periods
- Distributed databases (CockroachDB, TiDB, YugabyteDB) handle sharding transparently but add latency for cross-range transactions -- benchmark your specific workload before assuming they eliminate sharding complexity
- Schema-based sharding (Citus 12+) is simpler for SaaS but breaks down beyond ~10K tenants due to PostgreSQL catalog overhead per schema
- Consistent hashing solves resharding data movement but does not solve cross-shard queries, distributed transactions, or global secondary indexes
- Cloud-managed sharding services (PlanetScale, Amazon Aurora, Azure Cosmos DB) reduce operational burden but introduce vendor lock-in and higher per-query costs at scale