Database Indexing Strategies
What are the best database indexing strategies?
TL;DR
- Bottom line: Choose index types based on your query patterns — B-tree for range/sort queries, hash for equality-only, GIN for JSONB/arrays, BRIN for time-series, and partial indexes to reduce storage on filtered workloads.
- Key tool/command:
CREATE INDEX idx_name ON table (col1, col2) WHERE condition; - Watch out for: Over-indexing — every unused index wastes storage and slows writes with zero read benefit.
- Works with: PostgreSQL 13+, MySQL 8.0+, SQL Server 2019+, SQLite 3.35+ (syntax varies per engine).
Constraints
- Every index adds write overhead — INSERT/UPDATE/DELETE operations slow down per additional index. Benchmark write-heavy workloads before adding indexes.
- Composite index column order must follow the leftmost prefix rule: a
(a, b, c)index serves queries on(a),(a, b), and(a, b, c)but NOT(b, c)alone. - Hash indexes do not support range queries (
>,<,BETWEEN,ORDER BY) — only exact equality (=). Use B-tree for anything else. - Never apply functions to indexed columns in WHERE clauses (
WHERE UPPER(name) = 'FOO') unless you create an expression/functional index. - BRIN indexes require physically sorted (correlated) data to be effective. Random inserts destroy correlation and make BRIN useless.
- Partial indexes must have a WHERE clause that matches your query's filter exactly — the planner won't use a partial index if the query's condition doesn't match.
Quick Reference
| Index Type | Best For | Lookup Speed | Write Overhead | Storage | Engine Support |
|---|---|---|---|---|---|
| B-tree | Range queries, sorting, equality | O(log n) | Moderate | Moderate | All engines |
| Hash | Equality-only lookups | O(1) amortized | Low | Low | PG 10+, MySQL (Memory only) |
| GIN | JSONB, arrays, full-text, trgm | O(1) per element | High (batch-friendly) | High | PostgreSQL only |
| GiST | Geospatial, range types, nearest-neighbor | O(log n) | Moderate | Moderate | PostgreSQL only |
| SP-GiST | Non-balanced structures (quad-trees, k-d trees) | O(log n) | Moderate | Moderate | PostgreSQL only |
| BRIN | Time-series, append-only sorted data | O(n) worst case | Very low | Very low | PostgreSQL 9.5+ |
| Partial Index | Filtered subsets (active rows, non-null) | Same as base type | Lower (fewer rows) | Lower | PG, SQL Server (filtered), SQLite |
| Covering (INCLUDE) | Index-only scans avoiding heap lookups | O(log n) | Moderate-high | Higher | PG 11+, SQL Server, MySQL 8.0+ |
| Composite Index | Multi-column filters and sorts | O(log n) | Moderate | Moderate | All engines |
| Columnstore | Analytics, aggregation on large tables | Scan-optimized | High (batch insert) | Very high (compressed) | SQL Server 2012+ |
| Expression/Functional | Queries on computed values (LOWER(col)) | O(log n) | Moderate | Moderate | PG, MySQL 8.0+ |
Decision Tree
START
├── Query uses only equality (=)?
│ ├── YES → Single column?
│ │ ├── YES → Hash index (PG) or B-tree (MySQL/SQL Server)
│ │ └── NO → Composite B-tree index (columns in selectivity order)
│ └── NO ↓
├── Query uses range (>, <, BETWEEN) or ORDER BY?
│ ├── YES → B-tree index (range column last in composite)
│ └── NO ↓
├── Column is JSONB, array, or tsvector?
│ ├── YES → GIN index (PostgreSQL)
│ └── NO ↓
├── Column is geometry/geography or range type?
│ ├── YES → GiST index (PostgreSQL) or spatial index (MySQL/SQL Server)
│ └── NO ↓
├── Data is append-only / time-series with physical sort order?
│ ├── YES → BRIN index (PostgreSQL) — 1000x smaller than B-tree
│ └── NO ↓
├── Only a small fraction of rows match your filter?
│ ├── YES → Partial index (WHERE active = true)
│ └── NO ↓
├── Query reads many columns but only filters on few?
│ ├── YES → Covering index (INCLUDE extra columns)
│ └── NO ↓
├── Workload is analytical (aggregations on millions of rows)?
│ ├── YES → Columnstore index (SQL Server)
│ └── NO ↓
└── DEFAULT → B-tree index on the filtered/sorted column(s)
Step-by-Step Guide
1. Identify slow queries with EXPLAIN
Run EXPLAIN (or EXPLAIN ANALYZE in PostgreSQL) to see if the query uses a sequential scan instead of an index scan. [src1]
-- PostgreSQL
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;
-- MySQL
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- SQL Server
SET STATISTICS IO ON;
SELECT * FROM orders WHERE customer_id = 42;
Verify: Look for Seq Scan (PG), ALL in type column (MySQL), or Table Scan (SQL Server) — these indicate missing indexes.
2. Create a B-tree index on filtered columns
Start with a standard B-tree index on the column(s) appearing in WHERE clauses. [src2]
-- PostgreSQL / MySQL
CREATE INDEX idx_orders_customer ON orders (customer_id);
-- SQL Server
CREATE NONCLUSTERED INDEX idx_orders_customer ON orders (customer_id);
Verify: Re-run EXPLAIN — should now show Index Scan (PG), ref (MySQL), or Index Seek (SQL Server).
3. Add composite indexes for multi-column queries
For queries filtering on multiple columns, create a composite index with equality columns first, range columns last. [src3]
-- Query: WHERE status = 'active' AND created_at > '2026-01-01'
-- Equality columns first, range columns last
CREATE INDEX idx_orders_status_created ON orders (status, created_at);
Verify: EXPLAIN should show the composite index being used with both conditions.
4. Use partial indexes to reduce index size
If queries consistently filter on a condition, create a partial index. [src2]
-- Only 5% of orders are 'pending' — index just those
CREATE INDEX idx_orders_pending ON orders (created_at)
WHERE status = 'pending';
Verify: SELECT pg_size_pretty(pg_relation_size('idx_orders_pending')); — should be much smaller than a full index.
5. Create covering indexes to avoid table lookups
Include extra columns so the query can be answered entirely from the index. [src4]
-- PostgreSQL 11+
CREATE INDEX idx_orders_cover ON orders (customer_id)
INCLUDE (total, created_at);
-- SQL Server
CREATE NONCLUSTERED INDEX idx_orders_cover ON orders (customer_id)
INCLUDE (total, created_at);
-- MySQL (covering via composite — no INCLUDE keyword)
CREATE INDEX idx_orders_cover ON orders (customer_id, total, created_at);
Verify: EXPLAIN shows Index Only Scan (PG) or Covering Index Scan (SQL Server).
6. Monitor and remove unused indexes
Indexes that are never used waste storage and slow writes. Query system catalogs regularly. [src1]
-- PostgreSQL: find unused indexes
SELECT indexrelname, idx_scan, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
-- Drop unused indexes
DROP INDEX idx_unused_index;
Verify: After dropping, monitor write performance — INSERT throughput should improve.
Code Examples
PostgreSQL: GIN Index for JSONB Queries
-- Input: Table with JSONB column storing product attributes
-- Output: Fast lookups on any key inside the JSONB document
-- Create a GIN index for containment queries (@>)
CREATE INDEX idx_products_attrs ON products USING gin (attributes);
-- Query: find products where attributes contain {"color": "red"}
SELECT id, name FROM products
WHERE attributes @> '{"color": "red"}'::jsonb;
-- For specific key lookups, use jsonb_path_ops (smaller, faster)
CREATE INDEX idx_products_attrs_path ON products
USING gin (attributes jsonb_path_ops);
PostgreSQL: BRIN Index for Time-Series Data
-- Input: Append-only events table with timestamp column
-- Output: Extremely small index for range queries on time
-- BRIN stores min/max per block range (128 pages default)
CREATE INDEX idx_events_ts ON events USING brin (created_at);
-- Query: filter by time range — BRIN eliminates block ranges
SELECT * FROM events
WHERE created_at BETWEEN '2026-01-01' AND '2026-01-31';
-- Check physical correlation (must be > 0.9 for BRIN to help)
SELECT correlation FROM pg_stats
WHERE tablename = 'events' AND attname = 'created_at';
MySQL: Composite Index with Sort Optimization
-- Input: Orders table queried by status + date range + sort
-- Output: Single index serving filter, range, and ORDER BY
-- Equality columns first, then range, then sort direction
CREATE INDEX idx_orders_status_date ON orders (status, created_at DESC);
-- This query uses the index for all three operations:
SELECT id, total, created_at FROM orders
WHERE status = 'shipped'
AND created_at > '2026-01-01'
ORDER BY created_at DESC
LIMIT 20;
SQL Server: Filtered Index + Columnstore for Mixed Workloads
-- Input: Large orders table with OLTP + reporting queries
-- Output: Filtered rowstore index for OLTP, columnstore for analytics
-- Filtered index: only active orders (OLTP fast path)
CREATE NONCLUSTERED INDEX idx_orders_active
ON orders (customer_id, created_at)
WHERE status = 'active';
-- Nonclustered columnstore: analytics on historical data
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_orders_analytics
ON orders (customer_id, total, product_id, created_at)
WHERE status != 'active';
Anti-Patterns
Wrong: Indexing every column individually
-- BAD — creates N separate indexes that mostly go unused
-- The optimizer typically picks only ONE index per table scan
CREATE INDEX idx_col1 ON orders (customer_id);
CREATE INDEX idx_col2 ON orders (status);
CREATE INDEX idx_col3 ON orders (created_at);
CREATE INDEX idx_col4 ON orders (total);
CREATE INDEX idx_col5 ON orders (product_id);
Correct: Create targeted composite indexes matching query patterns
-- GOOD — one composite index serves the most common query
CREATE INDEX idx_orders_main ON orders (customer_id, status, created_at);
-- Serves: WHERE customer_id = ?, WHERE customer_id = ? AND status = ?,
-- and WHERE customer_id = ? AND status = ? AND created_at > ?
Wrong: Range column before equality column in composite index
-- BAD — range column first breaks index efficiency for equality filter
CREATE INDEX idx_orders_bad ON orders (created_at, status);
-- This query cannot efficiently use the index for status filtering:
SELECT * FROM orders WHERE status = 'active' AND created_at > '2026-01-01';
Correct: Equality columns first, range columns last
-- GOOD — equality columns narrow the search before range scan
CREATE INDEX idx_orders_good ON orders (status, created_at);
-- Now the index seeks to status='active' then scans the date range
SELECT * FROM orders WHERE status = 'active' AND created_at > '2026-01-01';
Wrong: Using functions on indexed columns
-- BAD — wrapping the column in a function prevents index usage
SELECT * FROM users WHERE LOWER(email) = '[email protected]';
SELECT * FROM orders WHERE YEAR(created_at) = 2026;
Correct: Use expression indexes or rewrite the query
-- GOOD (PostgreSQL) — expression index on the function result
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
SELECT * FROM users WHERE LOWER(email) = '[email protected]';
-- GOOD (MySQL) — rewrite to avoid function on column
SELECT * FROM orders
WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01';
Wrong: Indexing low-cardinality boolean columns alone
-- BAD — a boolean column has only 2 values, full index has no selectivity
CREATE INDEX idx_users_active ON users (is_active);
-- The optimizer may ignore this index entirely (table scan is cheaper)
Correct: Use a partial index on the rare value
-- GOOD — partial index on the minority value is tiny and targeted
CREATE INDEX idx_users_active ON users (id) WHERE is_active = true;
-- Only indexes the ~5% of active users, not the entire table
Wrong: Never checking if indexes are actually used
-- BAD — creating indexes without monitoring leads to index bloat
-- Common in ORMs that auto-generate indexes for every foreign key
Correct: Regularly audit index usage and drop dead indexes
-- GOOD — query pg_stat_user_indexes to find unused indexes
SELECT indexrelname AS index_name,
idx_scan AS times_used,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;
Common Pitfalls
- Index bloat in PostgreSQL: Dead tuples from UPDATE/DELETE inflate index size. Fix:
REINDEX CONCURRENTLY idx_name;or tuneautovacuum_vacuum_scale_factor. [src2] - Implicit type casting defeats indexes:
WHERE int_column = '42'forces a cast, bypassing the index. Fix: ensure parameter types match column types. [src1] - MySQL FORCE INDEX causes plan regression: Optimizer hints become stale as data changes. Fix: remove hints, use
ANALYZE TABLEto refresh statistics. [src3] - Partial index not used due to mismatched WHERE clause: The planner only uses a partial index when your query's condition logically implies the index's predicate. Fix: make conditions match. [src2]
- BRIN on randomly inserted data is useless: BRIN relies on physical row order correlation. Fix: check
pg_stats.correlation— value must be close to 1.0 or -1.0. [src6] - Covering index INCLUDE columns are not searchable: Columns in INCLUDE are stored in leaf pages only. Fix: put searchable columns in the key, non-searchable in INCLUDE. [src4]
- Too many indexes on write-heavy tables: Each INSERT updates every index. Fix: keep indexes under 5-7 per table on write-heavy workloads. [src1]
Diagnostic Commands
-- PostgreSQL: check index usage statistics
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
-- PostgreSQL: check table index sizes
SELECT indexname, pg_size_pretty(pg_relation_size(indexname::regclass)) AS size
FROM pg_indexes WHERE tablename = 'your_table';
-- PostgreSQL: check physical correlation for BRIN suitability
SELECT attname, correlation
FROM pg_stats WHERE tablename = 'your_table';
-- PostgreSQL: find duplicate indexes
SELECT indrelid::regclass AS table_name,
array_agg(indexrelid::regclass) AS duplicate_indexes
FROM pg_index
GROUP BY indrelid, indkey
HAVING COUNT(*) > 1;
-- MySQL: check index cardinality and usage
SHOW INDEX FROM your_table;
SELECT * FROM sys.schema_index_statistics WHERE table_name = 'your_table';
-- MySQL: find redundant indexes
SELECT * FROM sys.schema_redundant_indexes;
-- SQL Server: find missing indexes suggested by the optimizer
SELECT mid.statement, mid.equality_columns, mid.inequality_columns,
mid.included_columns, migs.avg_user_impact
FROM sys.dm_db_missing_index_details mid
JOIN sys.dm_db_missing_index_groups mig ON mid.index_handle = mig.index_handle
JOIN sys.dm_db_missing_index_group_stats migs ON mig.index_group_handle = migs.group_handle
ORDER BY migs.avg_user_impact DESC;
-- SQL Server: find unused indexes
SELECT OBJECT_NAME(i.object_id) AS table_name, i.name AS index_name,
ius.user_seeks, ius.user_scans, ius.user_updates
FROM sys.indexes i
LEFT JOIN sys.dm_db_index_usage_stats ius
ON i.object_id = ius.object_id AND i.index_id = ius.index_id
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
AND (ius.user_seeks + ius.user_scans) = 0
ORDER BY ius.user_updates DESC;
Version History & Compatibility
| Feature | PostgreSQL | MySQL | SQL Server | Notes |
|---|---|---|---|---|
| B-tree | All versions | All versions | All versions | Universal default |
| Hash index | 10+ (WAL-logged) | Memory engine only | N/A | PG hash crash-safe since v10 |
| GIN | 8.2+ | N/A | N/A | PostgreSQL-specific |
| GiST | 7.0+ | N/A | N/A | PostgreSQL-specific |
| BRIN | 9.5+ | N/A | N/A | PostgreSQL-specific |
| Partial/Filtered | All versions | N/A | 2008+ | MySQL lacks partial indexes |
| Covering (INCLUDE) | 11+ | 8.0+ (implicit) | 2005+ | MySQL uses composite for covering |
| Expression/Functional | All versions | 8.0+ | Via computed columns | MySQL added in 8.0 |
| Columnstore | N/A | N/A | 2012+ | SQL Server-specific |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Queries filter on specific columns repeatedly | Table has < 1000 rows | Full table scan (faster for tiny tables) |
| Read/query workload dominates | Write throughput is the bottleneck | Batch inserts, then add indexes after bulk load |
| Query needs sorted output (ORDER BY) | Data is write-once, read-never (audit logs) | Append-only table, no indexes |
| Covering index eliminates heap lookups | You're indexing every column "just in case" | Targeted indexes based on actual query patterns |
| JSONB containment queries are slow | JSON values are small and rarely queried | No index; sequential scan on small data |
| Time-series data with range filters | Data is inserted in random order | B-tree (BRIN requires physical correlation) |
Important Caveats
- PostgreSQL hash indexes were not WAL-logged before version 10 — they were unsafe for production. Always use PG 10+ for hash indexes.
- MySQL InnoDB stores the primary key at every secondary index leaf node. Very wide primary keys (e.g., UUID) inflate all secondary indexes. Prefer integer auto-increment PKs on InnoDB.
- SQL Server clustered index physically reorders table data — you get exactly one per table. Choose it carefully (usually the primary key or most common range query column).
- Index maintenance (REINDEX, OPTIMIZE TABLE, ALTER INDEX REBUILD) can lock the table. Use CONCURRENTLY (PG) or online rebuild (SQL Server Enterprise) in production.
- Statistics staleness causes bad plans. PostgreSQL auto-analyzes, but after bulk loads run
ANALYZE table_name;manually. MySQL:ANALYZE TABLE. SQL Server:UPDATE STATISTICS. - Multicolumn GIN indexes in PostgreSQL do not support the
<@(contained-by) operator — only@>,?,?|,?&. Check operator class support before creating.