SQL Query Optimization: Complete Reference
What are the best SQL query optimization techniques?
TL;DR
- Bottom line: Most slow SQL queries are fixed by proper indexing, avoiding full table scans, and rewriting queries to let the optimizer choose efficient plans -- use EXPLAIN ANALYZE to diagnose before guessing.
- Key tool/command:
EXPLAIN ANALYZE(PostgreSQL/MySQL) orSET STATISTICS IO ON+ actual execution plan (SQL Server) - Watch out for: Adding indexes without checking if the optimizer actually uses them -- unused indexes waste write performance and disk space.
- Works with: PostgreSQL 13+, MySQL 8.0+, SQL Server 2019+, Oracle 19c+, SQLite 3.35+ (core principles are engine-agnostic).
Constraints
- EXPLAIN ANALYZE executes the query -- never run it on UPDATE/DELETE/INSERT without wrapping in BEGIN/ROLLBACK
- Index creation on production tables can lock writes -- use CREATE INDEX CONCURRENTLY (PostgreSQL) or ALGORITHM=INPLACE (MySQL) for zero-downtime deployments
- work_mem (PostgreSQL) and sort_buffer_size (MySQL) are per-operation, not per-session -- setting too high causes OOM under concurrent load
- Query hints and optimizer directives are engine-specific -- PostgreSQL has no native hint syntax; MySQL uses optimizer_switch; SQL Server uses query hints
- Statistics staleness causes bad plans -- ensure ANALYZE / ANALYZE TABLE / UPDATE STATISTICS runs regularly or via autovacuum/auto-stats
Quick Reference
| Technique | When to Use | Impact | Complexity | Trade-off |
|---|---|---|---|---|
| Add covering index | Queries doing index lookups + heap fetches | Very high | Low | Extra write overhead + disk for each INSERT/UPDATE |
| Rewrite SELECT * to specific columns | Any query returning unnecessary data | High | Trivial | None -- strictly better |
| Add WHERE clause indexes | Sequential scans on filtered columns | Very high | Low | Index maintenance cost on writes |
| Replace correlated subquery with JOIN | Subquery executes once per outer row | Very high | Medium | JOIN may use more memory for hash/merge |
| Use EXISTS instead of IN for subqueries | Large subquery result sets | High | Low | Semantics differ with NULLs -- test carefully |
| Partition large tables | Tables > 100M rows with date/range filters | High | High | Query routing complexity, cross-partition JOINs slower |
| Materialize expensive CTEs / views | Repeated expensive aggregations | Very high | Medium | Stale data if not refreshed; storage cost |
| Use LIMIT with ORDER BY + index | Pagination queries | High | Low | Requires index matching ORDER BY columns |
| Batch INSERT/UPDATE operations | Bulk writes causing lock contention | High | Medium | Application-level batching logic needed |
| Add composite indexes | Multi-column WHERE/ORDER BY | Very high | Medium | Column order matters -- most selective first |
| Avoid implicit type conversions | WHERE varchar_col = 12345 (int) | High | Trivial | Requires schema awareness |
| Use query plan caching | Repeated parameterized queries | Medium | Low | Prepared statements may choose suboptimal generic plan |
| Update statistics | Bad plans after large data changes | High | Low | Brief CPU spike during ANALYZE |
| Denormalize hot paths | Read-heavy queries joining 5+ tables | Very high | High | Data duplication, consistency burden |
Decision Tree
START: Query is slow
├── Have you run EXPLAIN ANALYZE?
│ ├── NO → Run EXPLAIN ANALYZE first (see Step 1 below)
│ └── YES ↓
├── Is there a Sequential Scan / Full Table Scan on a large table?
│ ├── YES → Is there a WHERE clause filtering rows?
│ │ ├── YES → Add index on WHERE columns (see Step 2)
│ │ └── NO → Query legitimately needs all rows -- optimize with LIMIT, pagination, or materialized view
│ └── NO ↓
├── Is the plan doing Index Scan but still slow?
│ ├── YES → Check if it's fetching too many columns (heap fetches) → Create covering index
│ └── NO ↓
├── Is there a Sort or Hash operation consuming >50% cost?
│ ├── YES → Add index matching ORDER BY / GROUP BY columns → Or increase work_mem
│ └── NO ↓
├── Is there a Nested Loop with high row estimates?
│ ├── YES → Replace with Hash Join (rewrite query) or add join column indexes
│ └── NO ↓
├── Are estimated rows very different from actual rows?
│ ├── YES → Run ANALYZE to update statistics → Check for data skew
│ └── NO ↓
├── Is the query using correlated subqueries?
│ ├── YES → Rewrite as JOIN or lateral join (see Anti-Patterns)
│ └── NO ↓
└── DEFAULT → Profile at application level -- problem may be N+1 queries, connection overhead, or network latency
Step-by-Step Guide
1. Profile the slow query with EXPLAIN ANALYZE
Before changing anything, get the actual execution plan. The plan shows exactly where time is spent. [src1]
-- PostgreSQL: EXPLAIN ANALYZE executes the query -- wrap DML in a transaction
BEGIN;
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at > '2025-01-01'
ORDER BY o.total DESC
LIMIT 100;
ROLLBACK;
-- MySQL:
EXPLAIN ANALYZE
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at > '2025-01-01'
ORDER BY o.total DESC
LIMIT 100;
Verify: Look for Seq Scan (PostgreSQL), type: ALL (MySQL), or Table Scan (SQL Server) on large tables -- these are the primary targets.
2. Identify and add missing indexes
The execution plan reveals which table accesses lack indexes. Add targeted indexes for WHERE, JOIN, and ORDER BY columns. [src3]
-- PostgreSQL: Create index without locking writes
CREATE INDEX CONCURRENTLY idx_orders_created_at
ON orders (created_at DESC);
-- MySQL: Online DDL (InnoDB)
ALTER TABLE orders
ADD INDEX idx_orders_created_at (created_at DESC),
ALGORITHM=INPLACE, LOCK=NONE;
-- SQL Server: Online index creation
CREATE NONCLUSTERED INDEX idx_orders_created_at
ON orders (created_at DESC)
WITH (ONLINE = ON);
Verify: Re-run EXPLAIN ANALYZE -- you should see Index Scan or Index Only Scan replacing Seq Scan.
3. Eliminate unnecessary columns and rows
Fetch only what you need. SELECT * forces the engine to read every column from heap pages even when an index covers the query. [src3]
-- Before: fetches all columns, prevents index-only scan
SELECT * FROM orders WHERE customer_id = 42;
-- After: fetches only needed columns, enables covering index
SELECT id, total, status FROM orders WHERE customer_id = 42;
Verify: EXPLAIN output should show Index Only Scan (PostgreSQL) or Using index (MySQL) when all selected columns are in the index.
4. Rewrite subqueries as JOINs
Correlated subqueries execute once per outer row. JOINs allow the optimizer to choose hash or merge strategies. [src5]
-- Before: correlated subquery (O(n*m) execution)
SELECT c.name, (
SELECT SUM(o.total) FROM orders o WHERE o.customer_id = c.id
) AS total_spent
FROM customers c;
-- After: JOIN with aggregation (single pass)
SELECT c.name, COALESCE(SUM(o.total), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name;
Verify: EXPLAIN should show Hash Aggregate or GroupAggregate instead of SubPlan nodes.
5. Create composite and covering indexes
Multi-column indexes must follow the leftmost-prefix rule. A covering index includes all columns the query needs, eliminating heap fetches entirely. [src3]
-- Composite index: most selective column first
CREATE INDEX idx_orders_cust_date
ON orders (customer_id, created_at DESC);
-- Covering index (PostgreSQL INCLUDE syntax)
CREATE INDEX idx_orders_covering
ON orders (customer_id, created_at DESC)
INCLUDE (total, status);
-- MySQL covering index (all columns in the index)
CREATE INDEX idx_orders_covering
ON orders (customer_id, created_at, total, status);
Verify: EXPLAIN shows Index Only Scan (PostgreSQL) or Using index in Extra column (MySQL).
6. Optimize aggregation queries
GROUP BY performance depends on whether the optimizer can use an index for grouping or must sort/hash. [src1]
-- Create index matching GROUP BY + aggregate
CREATE INDEX idx_orders_status_total
ON orders (status, total);
-- Query that benefits from the index
SELECT status, SUM(total), COUNT(*)
FROM orders
GROUP BY status;
Verify: EXPLAIN should show GroupAggregate using an index, not HashAggregate with a Sort node.
7. Update statistics after large data changes
The query planner relies on table statistics to estimate row counts. After bulk loads or deletes, statistics become stale and cause bad plans. [src1]
-- PostgreSQL
ANALYZE orders;
-- MySQL
ANALYZE TABLE orders;
-- SQL Server
UPDATE STATISTICS orders;
-- Or with full scan for accuracy
UPDATE STATISTICS orders WITH FULLSCAN;
Verify: Re-run EXPLAIN ANALYZE -- estimated rows should be close to actual rows (within 10x).
Code Examples
PostgreSQL: Find missing indexes with pg_stat_user_tables
-- Input: connected to PostgreSQL database
-- Output: tables with high sequential scan ratios (index candidates)
SELECT
schemaname,
relname AS table_name,
seq_scan,
idx_scan,
CASE WHEN seq_scan + idx_scan > 0
THEN ROUND(100.0 * seq_scan / (seq_scan + idx_scan), 1)
ELSE 0
END AS seq_scan_pct,
n_live_tup AS row_count
FROM pg_stat_user_tables
WHERE n_live_tup > 10000
AND seq_scan > idx_scan
ORDER BY seq_scan_pct DESC, n_live_tup DESC
LIMIT 20;
MySQL: Enable and query the slow query log
-- Input: MySQL 8.0+ with admin privileges
-- Output: slow queries captured to file for analysis
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
SET GLOBAL log_queries_not_using_indexes = 'ON';
-- Check current setting
SHOW VARIABLES LIKE 'slow_query%';
-- After collecting data, analyze with pt-query-digest:
-- pt-query-digest /var/lib/mysql/hostname-slow.log
SQL Server: Find top resource-consuming queries
-- Input: SQL Server 2019+ with VIEW SERVER STATE permission
-- Output: top 20 queries by total elapsed time
SELECT TOP 20
qs.total_elapsed_time / qs.execution_count AS avg_elapsed_ms,
qs.execution_count,
qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
SUBSTRING(qt.text, (qs.statement_start_offset / 2) + 1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(qt.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset) / 2) + 1) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
ORDER BY qs.total_elapsed_time / qs.execution_count DESC;
Anti-Patterns
Wrong: Using SELECT * in production queries
-- BAD -- fetches all columns, prevents index-only scans,
-- breaks when schema changes, transfers unnecessary data
SELECT * FROM orders WHERE customer_id = 42;
Correct: Select only needed columns
-- GOOD -- enables covering indexes, reduces I/O and network transfer
SELECT id, total, status, created_at
FROM orders
WHERE customer_id = 42;
Wrong: N+1 query pattern in application code
-- BAD -- executes 1 query + N queries (one per customer)
-- Application loop:
-- customers = SELECT * FROM customers;
-- for each customer:
-- orders = SELECT * FROM orders WHERE customer_id = ?;
Correct: Single JOIN query
-- GOOD -- single query, database optimizes the join
SELECT c.id, c.name, o.id AS order_id, o.total
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE c.active = true;
Wrong: Using functions on indexed columns in WHERE
-- BAD -- wrapping indexed column in function prevents index usage
SELECT * FROM orders
WHERE YEAR(created_at) = 2025;
-- BAD -- implicit type conversion disables index
SELECT * FROM users
WHERE phone = 5551234567; -- phone is VARCHAR, literal is INT
Correct: Rewrite to keep indexed column bare
-- GOOD -- sargable: index on created_at is used
SELECT * FROM orders
WHERE created_at >= '2025-01-01'
AND created_at < '2026-01-01';
-- GOOD -- matching types
SELECT * FROM users
WHERE phone = '5551234567';
Wrong: Using OR conditions that prevent index usage
-- BAD -- OR on different columns forces full table scan
SELECT * FROM products
WHERE category_id = 5 OR supplier_id = 12;
Correct: Use UNION ALL for OR on different indexed columns
-- GOOD -- each branch can use its own index
SELECT * FROM products WHERE category_id = 5
UNION ALL
SELECT * FROM products WHERE supplier_id = 12
AND category_id <> 5;
Wrong: Unbounded queries without LIMIT
-- BAD -- returns potentially millions of rows
SELECT id, name, email FROM users
WHERE created_at > '2020-01-01'
ORDER BY created_at;
Correct: Always paginate large result sets
-- GOOD -- keyset pagination (fastest for large offsets)
SELECT id, name, email FROM users
WHERE created_at > '2020-01-01'
AND id > :last_seen_id
ORDER BY id
LIMIT 50;
Common Pitfalls
- Stale statistics after bulk operations: EXPLAIN shows good plan but actual performance is terrible because row estimates are 100x off. Fix: Run
ANALYZE(PostgreSQL),ANALYZE TABLE(MySQL), orUPDATE STATISTICS(SQL Server) after large INSERT/DELETE/UPDATE batches. [src1] - Over-indexing write-heavy tables: Every index slows down INSERT/UPDATE/DELETE. Tables with 20+ indexes often have slower write performance than query gains. Fix: Use
pg_stat_user_indexes(PostgreSQL) orsys.dm_db_index_usage_stats(SQL Server) to find and drop unused indexes. [src3] - Ignoring the buffer cache / I/O distinction: A query may be fast when data is cached but slow on cold start. Fix: Use
EXPLAIN (ANALYZE, BUFFERS)in PostgreSQL to see shared hit vs read ratios. [src5] - OFFSET pagination on large tables:
OFFSET 100000 LIMIT 50still scans and discards 100,000 rows. Fix: Use keyset/cursor pagination:WHERE id > :last_id ORDER BY id LIMIT 50. [src3] - Premature denormalization: Duplicating data across tables before profiling actual bottlenecks creates consistency bugs. Fix: Profile first with EXPLAIN ANALYZE, add indexes, then denormalize only proven hot paths. [src7]
- Relying on query cache (MySQL < 8.0): MySQL removed the query cache in 8.0 because it caused contention. Fix: Use application-level caching (Redis, Memcached) instead. [src2]
- Not testing with production-like data volume: A query that is fast on 1,000 rows may be catastrophically slow on 10 million. Fix: Test optimization changes against a staging environment with production data volume and distribution. [src6]
- Missing VACUUM on PostgreSQL: Dead tuples cause table bloat and slow sequential scans. Fix: Ensure autovacuum is running and tuned (
autovacuum_vacuum_scale_factor,autovacuum_analyze_scale_factor). [src1]
Diagnostic Commands
-- PostgreSQL: Show execution plan with timing and buffer info
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ... ;
-- PostgreSQL: Find queries consuming the most time (requires pg_stat_statements)
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
-- PostgreSQL: Check for unused indexes
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
-- PostgreSQL: Check table bloat and dead tuples
SELECT relname, n_dead_tup, n_live_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;
-- MySQL: Show execution plan
EXPLAIN ANALYZE SELECT ... ;
-- MySQL: Check slow query log status
SHOW VARIABLES LIKE 'slow_query%';
SHOW VARIABLES LIKE 'long_query_time';
-- MySQL: Show table index usage
SHOW INDEX FROM orders;
-- SQL Server: Show I/O and time statistics
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT ... ;
-- SQL Server: Show estimated execution plan
SET SHOWPLAN_XML ON;
GO
SELECT ... ;
GO
SET SHOWPLAN_XML OFF;
Version History & Compatibility
| Engine | Version | Key Optimization Features | Notes |
|---|---|---|---|
| PostgreSQL 18 | Current (2025) | EXPLAIN BUFFERS by default, incremental sort improvements | Parallel query improvements |
| PostgreSQL 16 | Stable | Parallel FULL OUTER JOIN, improved DISTINCT | Recommended minimum for modern features |
| PostgreSQL 14 | LTS-like | Multirange types, query pipeline mode | Extended statistics improvements |
| MySQL 9.x | Current (2025) | Improved hash joins, parallel query | Innovation release track |
| MySQL 8.4 | LTS (2024) | Window functions, CTEs, EXPLAIN ANALYZE | First LTS release |
| MySQL 8.0 | EOL (2026-04) | Query cache removed, hash joins added | Upgrade to 8.4+ recommended |
| SQL Server 2022 | Current | Intelligent Query Processing, Parameter Sensitivity Plan | Adaptive joins, batch mode on rowstore |
| SQL Server 2019 | Mainstream support | Adaptive memory grants, batch mode on rowstore | Table variable deferred compilation |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Individual queries are slow (> 100ms for OLTP) | All queries degraded equally | Check server resources: CPU, memory, I/O, connections |
| EXPLAIN shows sequential scans on large tables | Table is small (< 10,000 rows) | Sequential scan is often faster than index for small tables |
| Application has identified slow endpoints via APM | Problem is at the ORM/application layer (N+1) | Fix N+1 at application level first, not database level |
| Read-heavy workload with predictable query patterns | Write-heavy workload with many indexes already | Reduce index count, use batch writes, consider async processing |
| Query patterns are stable and well-understood | Ad-hoc analytical queries on large datasets | Use OLAP-optimized engines (ClickHouse, DuckDB, BigQuery) |
Important Caveats
- Optimization techniques are engine-specific -- an approach that works on PostgreSQL (e.g., partial indexes, INCLUDE columns) may not exist or behave differently on MySQL or SQL Server
- Index recommendations from automated tools (pg_stat_statements, Query Store, MySQL Performance Schema) should be validated with EXPLAIN before deploying -- they may suggest redundant or conflicting indexes
- Cloud-managed databases (AWS RDS, Azure SQL, Cloud SQL) have restricted access to some diagnostic tools and configuration parameters -- check your provider's documentation for available optimization levers
- Premature optimization without profiling wastes engineering time -- always measure with EXPLAIN ANALYZE first, then target the specific bottleneck