SQL Temporal Queries: Gaps and Islands Patterns
How do I write SQL temporal queries for gaps and islands problems?
TL;DR
- Bottom line: Gaps and islands problems are solved by computing a grouping key from the difference between a data value and a window function rank, then aggregating per group.
- Key tool/command:
ROW_NUMBER() OVER (ORDER BY col) - colfor integer islands;LAG()+ runningSUM()for date/timestamp islands. - Watch out for: Forgetting to PARTITION BY your grouping column; this merges islands across unrelated groups.
- Works with: PostgreSQL 9.4+, MySQL 8.0+, SQL Server 2012+, Oracle 12c+, SQLite 3.25+ (any database with window function support).
Constraints
- Window functions require SQL:2003+ compliant databases; MySQL < 8.0 and SQLite < 3.25 do not support them
- ORDER BY in window functions must be deterministic; add a tiebreaker column (e.g., primary key) if the sequence column has duplicates
- NULL values in LAG/LEAD comparisons evaluate to NULL, not FALSE; use
COALESCE()orIS DISTINCT FROMto handle NULLs - PARTITION BY must match your logical entity grouping (e.g., per-user, per-device); omitting it merges all rows into one partition
- For overlapping date ranges, standard gap detection fails; use the running-maximum technique instead
Quick Reference
| Scenario | Pattern | Window Functions | Time | Space | Trade-off |
|---|---|---|---|---|---|
| Integer gaps (missing IDs) | LEAD difference | LEAD() | O(n) | O(1) | Simplest; integers only |
| Integer islands (consecutive IDs) | ROW_NUMBER difference | ROW_NUMBER() | O(n) | O(1) | Fast; fails with duplicates |
| Integer islands with duplicates | DENSE_RANK difference | DENSE_RANK() | O(n) | O(1) | Handles duplicates correctly |
| Date gaps (missing days) | LAG + date arithmetic | LAG() | O(n) | O(1) | Requires known interval |
| Date islands (consecutive days) | LAG + running SUM | LAG(), SUM() | O(n) | O(n) | Most flexible for dates |
| Timestamp islands (activity sessions) | LAG + threshold + SUM | LAG(), SUM() | O(n) | O(n) | Configurable gap threshold |
| Overlapping date range merging | Running MAX end date | MAX(), LAG() | O(n) | O(n) | Handles overlaps correctly |
| Partitioned islands (per-entity) | Any pattern + PARTITION BY | varies | O(n) | O(n) | Add PARTITION BY to all window fns |
| Islands with row numbering | No-sort optimized | ROW_NUMBER(), LAG(), MAX() | O(n) | O(n) | Best serial performance |
| Gaps in date series | Generate series + LEFT JOIN | generate_series() | O(n) | O(n) | PostgreSQL-specific; clearest |
| Multi-column islands | Composite LAG comparison | LAG() per column | O(n) | O(n) | Compare all grouping columns |
Decision Tree
START
|-- Is the sequence numeric integers (IDs)?
| |-- YES --> Are there duplicate values?
| | |-- YES --> Use DENSE_RANK difference technique
| | +-- NO --> Use ROW_NUMBER difference technique
| +-- NO (dates/timestamps) |
|-- Are date ranges overlapping?
| |-- YES --> Use running MAX end-date technique (overlapping merge)
| +-- NO |
|-- Do you need gaps (missing) or islands (present)?
| |-- GAPS --> Is the database PostgreSQL?
| | |-- YES --> Use generate_series + LEFT JOIN
| | +-- NO --> Use LEAD/LAG gap detection
| +-- ISLANDS |
|-- Is the interval fixed (1 day, 1 hour)?
| |-- YES --> Use LAG + running SUM with fixed threshold
| +-- NO --> Use LAG + running SUM with variable threshold
+-- DEFAULT --> LAG + running SUM (most general technique)
Step-by-Step Guide
1. Identify your sequence type and grouping
Determine whether your data is integer-based (IDs, sequence numbers) or temporal (dates, timestamps), and identify the PARTITION BY columns. [src1]
-- Inspect your data shape
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'your_table'
AND column_name IN ('id', 'event_date', 'user_id', 'status');
Verify: Confirm the sequence column type is integer, date, or timestamp.
2. Add the grouping key using window functions
For integer sequences, subtract ROW_NUMBER from the value. For date sequences, compare with LAG and accumulate island IDs with running SUM. [src2]
-- Integer islands: the ROW_NUMBER difference technique
SELECT
val,
val - ROW_NUMBER() OVER (ORDER BY val) AS grp
FROM your_table;
-- Rows in the same island will have the same grp value
Verify: SELECT DISTINCT grp FROM cte should return one value per island.
3. Aggregate to get island boundaries
Group by the computed key to get the start and end of each island. [src1]
WITH islands AS (
SELECT
val,
val - ROW_NUMBER() OVER (ORDER BY val) AS grp
FROM your_table
)
SELECT
MIN(val) AS island_start,
MAX(val) AS island_end,
COUNT(*) AS island_length
FROM islands
GROUP BY grp
ORDER BY island_start;
Verify: SUM(island_length) should equal COUNT(*) from the source table.
4. Validate edge cases
Check for NULLs, duplicates, and partition boundaries. [src5]
-- Check for NULLs in sequence column
SELECT COUNT(*) AS null_count
FROM your_table
WHERE val IS NULL;
-- Check for duplicates
SELECT val, COUNT(*) AS cnt
FROM your_table
GROUP BY val
HAVING COUNT(*) > 1;
Verify: If null_count > 0, add WHERE val IS NOT NULL or switch to DENSE_RANK. If duplicates exist, use DENSE_RANK instead of ROW_NUMBER.
Code Examples
PostgreSQL: Detect gaps in a date sequence
-- Input: Table 'user_logins' with columns (user_id, login_date)
-- Output: Missing date ranges per user
WITH date_range AS (
SELECT user_id,
login_date,
LEAD(login_date) OVER (
PARTITION BY user_id ORDER BY login_date
) AS next_login
FROM user_logins
)
SELECT user_id,
login_date + INTERVAL '1 day' AS gap_start,
next_login - INTERVAL '1 day' AS gap_end,
(next_login - login_date - 1) AS days_missing
FROM date_range
WHERE next_login - login_date > 1
ORDER BY user_id, gap_start;
PostgreSQL: Detect islands of consecutive dates
Full script: postgresql-detect-islands-of-consecutive-dates.sql
-- Input: Table 'daily_activity' with columns (user_id, activity_date)
-- Output: Consecutive date ranges (islands) per user
WITH flagged AS (
SELECT user_id, activity_date,
CASE WHEN activity_date - LAG(activity_date) OVER (
PARTITION BY user_id ORDER BY activity_date
) = 1 THEN 0 ELSE 1 END AS new_island
FROM daily_activity
),
grouped AS (
SELECT user_id, activity_date,
SUM(new_island) OVER (
PARTITION BY user_id ORDER BY activity_date
ROWS UNBOUNDED PRECEDING
) AS island_id
FROM flagged
)
SELECT user_id,
MIN(activity_date) AS island_start,
MAX(activity_date) AS island_end,
COUNT(*) AS consecutive_days
FROM grouped
GROUP BY user_id, island_id
ORDER BY user_id, island_start;
PostgreSQL: Merge overlapping date ranges
Full script: postgresql-merge-overlapping-date-ranges.sql
-- Input: Table 'subscriptions' with columns (user_id, start_date, end_date)
-- Output: Merged non-overlapping date ranges per user
WITH ordered AS (
SELECT user_id, start_date, end_date,
MAX(end_date) OVER (
PARTITION BY user_id ORDER BY start_date
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
) AS max_prev_end
FROM subscriptions
),
flagged AS (
SELECT *,
CASE WHEN start_date <= max_prev_end THEN 0 ELSE 1 END AS new_group
FROM ordered
),
grouped AS (
SELECT *,
SUM(new_group) OVER (
PARTITION BY user_id ORDER BY start_date
ROWS UNBOUNDED PRECEDING
) AS grp
FROM flagged
)
SELECT user_id,
MIN(start_date) AS merged_start,
MAX(end_date) AS merged_end
FROM grouped
GROUP BY user_id, grp
ORDER BY user_id, merged_start;
SQL Server: Integer gaps with LEAD
-- Input: Table 'invoice_numbers' with column (invoice_no INT)
-- Output: Missing invoice number ranges
WITH gaps AS (
SELECT invoice_no,
LEAD(invoice_no) OVER (ORDER BY invoice_no) AS next_no
FROM invoice_numbers
)
SELECT invoice_no + 1 AS gap_start,
next_no - 1 AS gap_end,
next_no - invoice_no - 1 AS missing_count
FROM gaps
WHERE next_no - invoice_no > 1
ORDER BY gap_start;
Anti-Patterns
Wrong: Using ROW_NUMBER with duplicate values
-- BAD -- ROW_NUMBER assigns unique numbers, splitting identical values into separate islands
WITH islands AS (
SELECT status, event_date,
event_date - INTERVAL '1 day' * ROW_NUMBER() OVER (
PARTITION BY status ORDER BY event_date
) AS grp
FROM events
)
SELECT status, MIN(event_date), MAX(event_date)
FROM islands GROUP BY status, grp;
Correct: Use DENSE_RANK for data with duplicates
-- GOOD -- DENSE_RANK handles duplicates by assigning the same rank to ties
WITH islands AS (
SELECT status, event_date,
event_date - INTERVAL '1 day' * DENSE_RANK() OVER (
PARTITION BY status ORDER BY event_date
) AS grp
FROM events
)
SELECT status, MIN(event_date), MAX(event_date)
FROM islands GROUP BY status, grp;
Wrong: Omitting PARTITION BY when data has multiple entities
-- BAD -- Without PARTITION BY, islands from different users get merged
WITH flagged AS (
SELECT user_id, login_date,
CASE WHEN login_date - LAG(login_date) OVER (
ORDER BY login_date -- missing PARTITION BY user_id!
) = 1 THEN 0 ELSE 1 END AS new_island
FROM user_logins
)
Correct: Always PARTITION BY the entity column
-- GOOD -- Each user's islands are computed independently
WITH flagged AS (
SELECT user_id, login_date,
CASE WHEN login_date - LAG(login_date) OVER (
PARTITION BY user_id ORDER BY login_date
) = 1 THEN 0 ELSE 1 END AS new_island
FROM user_logins
)
Wrong: Using self-join for gap detection
-- BAD -- O(n^2) self-join; catastrophic on large tables
SELECT a.val + 1 AS gap_start,
MIN(b.val) - 1 AS gap_end
FROM sequence_table a
INNER JOIN sequence_table b ON b.val > a.val
WHERE NOT EXISTS (
SELECT 1 FROM sequence_table c
WHERE c.val = a.val + 1
)
GROUP BY a.val;
Correct: Use LEAD window function for O(n) gap detection
-- GOOD -- Single pass with LEAD, O(n) with proper index
WITH gaps AS (
SELECT val,
LEAD(val) OVER (ORDER BY val) AS next_val
FROM sequence_table
)
SELECT val + 1 AS gap_start,
next_val - 1 AS gap_end
FROM gaps
WHERE next_val - val > 1;
Common Pitfalls
- Non-deterministic ORDER BY: If your ORDER BY column has ties, ROW_NUMBER assigns arbitrary numbers, producing inconsistent island boundaries across runs. Fix: add a tiebreaker column, e.g.,
ORDER BY event_date, id. [src1] - NULL propagation in LAG/LEAD:
LAG(val)returns NULL for the first row, andNULL - dateis NULL, not a flag. Fix:COALESCE(LAG(val) OVER (...), val)or handle the first-row case explicitly. [src4] - Off-by-one in date arithmetic:
date2 - date1 = 1means consecutive days, buttimestamp2 - timestamp1 = INTERVAL '1 day'may miss times. Fix: cast toDATEor useDATE_TRUNC('day', ts)before comparison. [src5] - Forgetting ROWS UNBOUNDED PRECEDING: The default window frame for
SUM() OVER (ORDER BY ...)isRANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which groups equal values. Fix: always specifyROWS UNBOUNDED PRECEDINGfor running sums. [src2] - Overlapping ranges treated as gaps: Standard gap detection assumes non-overlapping intervals. If ranges overlap, use the running-MAX technique to merge before detecting. Fix: pre-merge overlapping ranges first. [src7]
- Performance on unsorted data: Window functions require sorted input. Without an index on the ORDER BY column, the planner adds an expensive Sort operator. Fix: create an index on
(partition_col, order_col). [src3]
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Finding missing values in a known sequence (invoice numbers, dates) | Bucketing data into fixed time intervals (hourly, daily) | DATE_TRUNC / time_bucket aggregation |
| Identifying consecutive activity streaks (login streaks, uptime) | Detecting simple duplicates in a column | GROUP BY + HAVING COUNT > 1 |
| Merging overlapping date ranges (subscriptions, schedules) | Joining two tables on date ranges (interval overlap) | Range join with daterange / OVERLAPS operator |
| Computing session duration from event timestamps | Computing running totals or moving averages | Standard window function aggregates |
| Building slowly changing dimensions from event logs | Data has no natural ordering or sequence | Graph traversal / recursive CTE |
Important Caveats
- The ROW_NUMBER difference technique only works for sequences with a known, constant step size (e.g., +1 for integers, +1 day for dates); variable-step sequences require the LAG/SUM approach
- MySQL 8.0 supports window functions but lacks
generate_series(); use a recursive CTE or a numbers table for gap detection with date series - Performance varies significantly by engine: SQL Server's batch mode processing can achieve 5x speedup on islands queries with serial execution plans when window functions share the same PARTITION BY and ORDER BY
- SQLite window function support (3.25+) is functional but lacks some advanced framing options; test complex ROWS/RANGE frames carefully
- For very large tables (>10M rows), ensure a covering index exists on
(partition_col, order_col, value_col)to avoid table scans