Database Migration Strategies
What are the best database migration strategies?
TL;DR
- Bottom line: Use the expand-contract pattern for zero-downtime schema migrations; pair with versioned migration files and CI validation for safety and reproducibility.
- Key tool/command:
flyway migrate/alembic upgrade head/knex migrate:latest - Watch out for: Long-running ALTER TABLE locks on large tables -- use online schema change tools (gh-ost, pt-online-schema-change) for tables over 10M rows.
- Works with: PostgreSQL, MySQL, SQL Server, SQLite, and all major ORMs; patterns are engine-agnostic.
Constraints
- Never run DDL migrations without a tested rollback script or compensating migration
- Always test migrations against production-sized datasets before deploying -- a migration that takes 2 seconds on dev can lock a table for 30 minutes in production
- Never mix schema changes (DDL) and data changes (DML) in a single migration file -- they have different rollback semantics
- Never drop columns or tables that are still referenced by running application code -- deploy application changes first
- For MySQL tables >10M rows, always use gh-ost or pt-online-schema-change instead of direct ALTER TABLE
- In PostgreSQL, always use
CREATE INDEX CONCURRENTLYinstead ofCREATE INDEXto avoid blocking writes
Quick Reference
| Strategy | Downtime | Reversibility | Complexity | Best For |
|---|---|---|---|---|
| Versioned migrations (Flyway/Alembic) | Depends on DDL | Forward-only (compensating migration) | Low | Most projects; sequential, auditable changes |
| State-based (Redgate, SchemaSync) | Depends on diff | Snapshot rollback | Medium | Teams preferring declarative schema definitions |
| Expand-contract | Zero | Full (old schema remains until contract) | Medium-High | Renaming columns, changing types, splitting tables |
| Dual-write | Zero | Switch reads back to old table | High | Cross-database migrations, service extractions |
| Blue-green database | Near-zero | Switch back to blue environment | High | Major version upgrades, engine migrations |
| Shadow tables | Zero | Drop shadow; keep original | High | Validating new schema under real traffic before cutover |
| Online schema change (gh-ost) | Near-zero (brief metadata lock) | Drop ghost table | Medium | Large MySQL tables (>10M rows) |
| pg_repack / pgroll | Zero | Revert migration | Medium | Large PostgreSQL tables needing VACUUM FULL or schema changes |
Decision Tree
START
|-- Can you tolerate any downtime?
| |-- YES (maintenance window available)
| | |-- Simple column add/drop?
| | | |-- YES -> Direct ALTER TABLE in a versioned migration
| | | +-- NO -> State-based diff tool or multi-step versioned migrations
| +-- NO (zero-downtime required) v
|-- Is the table large (>10M rows)?
| |-- YES
| | |-- MySQL?
| | | |-- YES -> gh-ost or pt-online-schema-change
| | | +-- NO (PostgreSQL) -> pg_repack, pgroll, or CREATE INDEX CONCURRENTLY
| +-- NO (<10M rows) v
|-- What type of schema change?
| |-- Adding a nullable column -> Direct ALTER TABLE (fast, no lock in PostgreSQL)
| |-- Renaming a column -> Expand-contract pattern (3-phase)
| |-- Changing column type -> Expand-contract pattern
| |-- Dropping a column -> Deploy app changes first, then drop in separate migration
| |-- Splitting/merging tables -> Dual-write pattern
| +-- Full engine migration -> Blue-green database deployment
+-- DEFAULT -> Versioned migration file with CI validation
Step-by-Step Guide
1. Set up versioned migration infrastructure
Choose a migration tool that matches your stack and initialize it. Every migration gets a monotonically increasing version number and lives in version control alongside application code. [src5]
# Flyway (Java/.NET/any)
flyway init
# Alembic (Python/SQLAlchemy)
alembic init migrations
alembic revision --autogenerate -m "baseline"
# Knex (Node.js)
npx knex init
npx knex migrate:make baseline
Verify: flyway info / alembic history / npx knex migrate:status -- shows the baseline migration as pending.
2. Write forward-only migration files
Each migration file contains the SQL or code to move the schema forward. Never edit a migration that has already been applied to any shared environment. [src7]
-- V2__add_email_verified_column.sql (Flyway naming convention)
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT FALSE;
CREATE INDEX CONCURRENTLY idx_users_email_verified ON users (email_verified);
Verify: Apply to a staging environment with production-sized data and measure execution time.
3. Implement expand-contract for breaking changes
When renaming a column or changing a type, use three separate deployments to avoid downtime. [src1]
-- Migration 1: EXPAND -- add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Migration 2: MIGRATE -- backfill (run in batches for large tables)
UPDATE users SET full_name = first_name || ' ' || last_name
WHERE full_name IS NULL
LIMIT 10000; -- repeat in batches
-- Migration 3: CONTRACT -- drop old columns (only after all app code updated)
ALTER TABLE users DROP COLUMN first_name;
ALTER TABLE users DROP COLUMN last_name;
Verify: Between each phase, confirm SELECT COUNT(*) FROM users WHERE full_name IS NULL returns 0 before proceeding to CONTRACT.
4. Use online schema change for large tables
For MySQL tables with millions of rows, direct ALTER TABLE acquires locks that block reads and writes. Use gh-ost or pt-online-schema-change instead. [src3] [src4]
# gh-ost (GitHub's online schema change -- triggerless, uses binlog)
gh-ost \
--host=db-primary.example.com \
--database=myapp \
--table=orders \
--alter="ADD COLUMN shipping_status VARCHAR(50) DEFAULT 'pending'" \
--execute
# pt-online-schema-change (Percona -- trigger-based)
pt-online-schema-change \
--alter="ADD COLUMN shipping_status VARCHAR(50) DEFAULT 'pending'" \
D=myapp,t=orders \
--execute
Verify: SHOW CREATE TABLE orders; -- confirm the new column exists and no locks are held.
5. Validate migrations in CI before production
Add migration validation to your CI pipeline to catch errors before they reach production. [src7]
# In CI pipeline: spin up a fresh DB, apply all migrations, verify
docker run -d --name test-db -e POSTGRES_PASSWORD=test -p 5432:5432 postgres:16
sleep 5
flyway -url=jdbc:postgresql://localhost:5432/postgres -user=postgres -password=test migrate
flyway -url=jdbc:postgresql://localhost:5432/postgres -user=postgres -password=test validate
Verify: CI exits with code 0 and flyway validate reports no pending or failed migrations.
Code Examples
SQL: Zero-downtime column rename (expand-contract)
-- Input: Table 'products' with column 'price' that needs renaming to 'unit_price'
-- Output: Column renamed with zero application downtime
-- Step 1: EXPAND (Migration V10)
ALTER TABLE products ADD COLUMN unit_price NUMERIC(10,2);
-- Step 2: Dual-write trigger (temporary, remove at CONTRACT)
CREATE OR REPLACE FUNCTION sync_price_columns()
RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
NEW.unit_price := COALESCE(NEW.unit_price, NEW.price);
NEW.price := COALESCE(NEW.price, NEW.unit_price);
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_sync_price
BEFORE INSERT OR UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION sync_price_columns();
-- Step 3: MIGRATE -- backfill existing rows
UPDATE products SET unit_price = price WHERE unit_price IS NULL;
-- Step 4: CONTRACT (Migration V12, after app code is updated)
DROP TRIGGER trg_sync_price ON products;
DROP FUNCTION sync_price_columns();
ALTER TABLE products DROP COLUMN price;
Node.js (Knex): Migration file with rollback
// Input: Knex migration system configured
// Output: Creates 'audit_logs' table with indexes
// migrations/20260224_create_audit_logs.js
exports.up = function(knex) {
return knex.schema.createTable('audit_logs', (table) => {
table.bigIncrements('id').primary();
table.uuid('user_id').notNullable().references('id').inTable('users');
table.string('action', 100).notNullable();
table.jsonb('payload').defaultTo('{}');
table.timestamp('created_at').defaultTo(knex.fn.now());
table.index(['user_id', 'created_at'], 'idx_audit_user_time');
table.index(['action'], 'idx_audit_action');
});
};
exports.down = function(knex) {
return knex.schema.dropTableIfExists('audit_logs');
};
Python (Alembic): Batched data migration
# Input: Alembic environment configured with SQLAlchemy
# Output: Safely backfills 'full_name' from 'first_name' + 'last_name'
"""backfill full_name column"""
revision = 'a1b2c3d4e5f6'
down_revision = '9z8y7x6w5v4u'
from alembic import op
import sqlalchemy as sa
BATCH_SIZE = 5000
def upgrade():
conn = op.get_bind()
while True:
result = conn.execute(sa.text(
"UPDATE users SET full_name = first_name || ' ' || last_name "
"WHERE full_name IS NULL LIMIT :batch"
), {"batch": BATCH_SIZE})
if result.rowcount == 0:
break
def downgrade():
op.execute("UPDATE users SET full_name = NULL")
Go (goose): Migration with transaction control
// Input: goose migration framework configured
// Output: Adds 'status' enum column to orders table
// migrations/20260224_add_order_status.go
package migrations
import (
"database/sql"
"github.com/pressly/goose/v3"
)
func init() {
goose.AddMigration(upAddOrderStatus, downAddOrderStatus)
}
func upAddOrderStatus(tx *sql.Tx) error {
_, err := tx.Exec(`
CREATE TYPE order_status AS ENUM ('pending','processing','shipped','delivered');
ALTER TABLE orders ADD COLUMN status order_status DEFAULT 'pending';
`)
return err
}
func downAddOrderStatus(tx *sql.Tx) error {
_, err := tx.Exec(`
ALTER TABLE orders DROP COLUMN status;
DROP TYPE order_status;
`)
return err
}
Anti-Patterns
Wrong: Running manual DDL directly in production
-- BAD -- ad-hoc DDL with no version tracking, no rollback, no audit trail
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- "I'll just run this real quick in the production console..."
Correct: Use versioned migration files
-- GOOD -- tracked in VCS, reviewed in PR, tested in CI, applied by migration tool
-- V14__add_phone_to_users.sql
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- Applied via: flyway migrate (with CI validation gate)
Wrong: Big-bang migration (everything in one file)
-- BAD -- 500-line migration that renames, restructures, backfills, and drops
-- If it fails halfway, you're in an inconsistent state
ALTER TABLE orders RENAME COLUMN price TO unit_price;
ALTER TABLE orders ADD COLUMN currency VARCHAR(3);
UPDATE orders SET currency = 'USD'; -- locks table for 45 minutes on 50M rows
ALTER TABLE orders ALTER COLUMN currency SET NOT NULL;
DROP TABLE legacy_orders;
Correct: Break into small, independent migrations
-- GOOD -- each step is a separate, independently reversible migration
-- V20__add_currency_column.sql
ALTER TABLE orders ADD COLUMN currency VARCHAR(3) DEFAULT 'USD';
-- V21__backfill_currency.sql (run in batches)
-- UPDATE orders SET currency = 'USD' WHERE currency IS NULL LIMIT 10000;
-- V22__set_currency_not_null.sql (only after backfill complete)
ALTER TABLE orders ALTER COLUMN currency SET NOT NULL;
-- V23__drop_legacy_orders.sql (separate deployment)
DROP TABLE IF EXISTS legacy_orders;
Wrong: No rollback plan
# BAD -- forward-only with no escape hatch
def upgrade():
op.drop_column('users', 'legacy_role') # data is gone forever
Correct: Always preserve data until verified
# GOOD -- rename instead of drop; keep data recoverable
def upgrade():
op.alter_column('users', 'legacy_role',
new_column_name='_legacy_role_deprecated')
# Drop in a future migration after 2 release cycles
def downgrade():
op.alter_column('users', '_legacy_role_deprecated',
new_column_name='legacy_role')
Common Pitfalls
- Table locking on large tables:
ALTER TABLE ... ADD COLUMN ... DEFAULT valueacquires an ACCESS EXCLUSIVE lock in PostgreSQL <11 and MySQL. Fix: In PostgreSQL 11+,ADD COLUMN ... DEFAULTis instant. In MySQL, usegh-ostorpt-online-schema-change. [src3] - Data backfill ordering with foreign keys: Backfilling a column that references another table can fail if the referenced rows don't exist yet. Fix: Disable foreign key checks during backfill or backfill parent tables first. [src2]
- Foreign key constraints blocking online schema changes: Both gh-ost and pt-osc have limited support for tables with foreign keys. Fix: Drop FK constraints before running online schema change, then re-add after. [src4]
- Migration order conflicts in team environments: Two developers create migrations with the same version number. Fix: Use timestamp-based naming (
20260224143000_add_column.sql) instead of sequential integers. [src5] - Running migrations during peak traffic: Even "fast" DDL can cause metadata lock waits if long-running queries are active. Fix: Schedule migrations during low-traffic windows; kill long-running queries before applying DDL. [src7]
- Not testing on production-sized data: A migration that runs in 50ms on a dev database with 100 rows can take 40 minutes on a production table with 50M rows. Fix: Maintain a staging environment with production-scale data. [src7]
- Mixing DDL and DML in transactions: In MySQL, DDL statements cause an implicit COMMIT of any open transaction, breaking atomicity. Fix: Separate schema changes and data changes into distinct migration files. [src6]
Diagnostic Commands
# Check Flyway migration status
flyway info
# Check Alembic migration history
alembic history --verbose
alembic current
# Check Knex migration status
npx knex migrate:status
# Check for active locks (PostgreSQL)
SELECT pid, relation::regclass, mode, granted
FROM pg_locks WHERE NOT granted;
# Check for long-running queries that may block DDL (PostgreSQL)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '30 seconds';
# Check table size before deciding migration strategy (PostgreSQL)
SELECT pg_size_pretty(pg_total_relation_size('orders'));
# Verify MySQL table structure after migration
SHOW CREATE TABLE orders;
# Check for pending migrations in Flyway schema history
SELECT * FROM flyway_schema_history WHERE success = false;
Version History & Compatibility
| Tool | Current Version | Key Feature | Migration Notes |
|---|---|---|---|
| Flyway | 10.x | Baseline on migrate, undo (Teams) | Free tier: no undo; use compensating migrations |
| Alembic | 1.13.x | Autogenerate, branching | Pair with SQLAlchemy 2.0+ for async support |
| Knex | 3.x | Promise-based, multi-dialect | Knex 3.x dropped callback API; use async/await |
| goose | 3.x | Go + SQL migrations, no-transaction mode | v3 dropped GOOSE_DRIVER env; use --driver flag |
| gh-ost | 1.1.x | Triggerless, throttle-aware | Requires binlog access (ROW format) |
| pt-osc | 3.6.x | Trigger-based, mature | Conflicts with tables that already have triggers |
| pgroll | 0.x | PostgreSQL-native, versioned rollback | Still pre-1.0; API may change |
| Liquibase | 4.x | Multi-format (XML/YAML/SQL), rollback | Free tier: limited rollback; Pro for full undo |
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Evolving schema in a running application | One-time data import/export | ETL tools (dbt, Airbyte) |
| Team collaboration requiring auditable changes | Prototyping with throwaway databases | Direct DDL in dev environment |
| Zero-downtime deployments with expand-contract | Migrating between different database engines | AWS DMS, pgloader |
| Large table alterations (>10M rows) | Small tables in low-traffic applications | Direct ALTER TABLE |
| CI/CD pipeline integration | Schema managed by ORM auto-sync in dev | ORM auto-migrate (dev only) |
Important Caveats
- PostgreSQL 11+ handles
ADD COLUMN ... DEFAULTwithout a full table rewrite, but older versions lock the entire table -- always check your PostgreSQL version before assuming instant DDL - MySQL's InnoDB online DDL supports some ALTER TABLE operations without locking, but not all --
CHANGE COLUMNtype changes still require a table copy - In MySQL, DDL statements cause an implicit COMMIT of any open transaction -- never place DDL inside an explicit transaction expecting rollback
- gh-ost requires binlog in ROW format and
SUPERorREPLICATION SLAVEprivileges -- verify prerequisites before planning a migration window CREATE INDEX CONCURRENTLYin PostgreSQL cannot run inside a transaction block -- migration tools must be configured to run this statement outside of a transaction