String Matching Algorithms: KMP, Rabin-Karp & Beyond
How do string matching algorithms work (KMP, Rabin-Karp)?
TL;DR
- Bottom line: Use KMP for guaranteed O(n+m) single-pattern matching; use Rabin-Karp for multi-pattern search or when hashing is natural; use your language's built-in
find()/indexOf()for production single-pattern search. - Key tool/command: Failure function (KMP) / rolling hash (Rabin-Karp) — the preprocessing step that makes each algorithm efficient.
- Watch out for: Hash collisions in Rabin-Karp degrade worst case to O(nm); always verify hash matches with full character comparison.
- Works with: Any language; algorithms are language-agnostic. Standard in competitive programming, text editors, bioinformatics, and plagiarism detection.
Constraints
- KMP failure function array must be 0-indexed with
lps[0] = 0; 1-indexed implementations cause off-by-one errors - Rabin-Karp hash modulus must be a large prime (e.g., 109+7); small moduli cause excessive spurious hits
- Rolling hash arithmetic must use modular arithmetic consistently to avoid integer overflow in Java, C++, and JavaScript
- Rabin-Karp worst case is O(nm) when all hash values collide; always verify hash matches with full string comparison
- Multi-pattern search requires Aho-Corasick or Rabin-Karp with multiple hashes; KMP handles only single patterns
Quick Reference
| Algorithm | Preprocessing | Search Time | Space | Multi-pattern | Best For |
|---|---|---|---|---|---|
| Naive (brute force) | None | O(nm) | O(1) | No | Short patterns, small texts |
| KMP | O(m) | O(n) | O(m) | No | Single pattern, guaranteed linear |
| Rabin-Karp | O(m) | O(n) avg, O(nm) worst | O(1) | Yes | Multi-pattern, plagiarism detection |
| Boyer-Moore | O(m + σ) | O(n/m) best, O(nm) worst | O(m + σ) | No | Long patterns, large alphabets |
| Aho-Corasick | O(Σmi) | O(n + matches) | O(Σmi · σ) | Yes | Dictionary matching, many patterns |
| Z-Algorithm | O(n+m) | O(n+m) | O(n+m) | No | Simpler alternative to KMP |
Where: n = text length, m = pattern length, σ = alphabet size. [src1] [src6]
Decision Tree
START
├── Single pattern or multiple patterns?
│ ├── MULTIPLE PATTERNS ↓
│ │ ├── Fixed dictionary of patterns?
│ │ │ ├── YES → Aho-Corasick (O(n + total_matches))
│ │ │ └── NO (patterns change) → Rabin-Karp with multiple hashes
│ │ └── END
│ └── SINGLE PATTERN ↓
├── Need guaranteed worst-case O(n+m)?
│ ├── YES → KMP or Z-Algorithm
│ └── NO (average case OK) ↓
├── Pattern length > 10 and large alphabet (ASCII/Unicode)?
│ ├── YES → Boyer-Moore (sublinear average case)
│ └── NO ↓
├── Production code (not competitive programming)?
│ ├── YES → Use built-in: str.find() / indexOf() / strings.Contains()
│ └── NO ↓
└── DEFAULT → KMP (most versatile, easiest to implement correctly)
Step-by-Step Guide
1. Build the KMP failure function (LPS array)
The Longest Proper Prefix which is also Suffix (LPS) array is the core of KMP. For each position i in the pattern, lps[i] stores the length of the longest proper prefix of pattern[0..i] that is also a suffix. This tells KMP how far to shift after a mismatch. [src1]
def build_lps(pattern):
m = len(pattern)
lps = [0] * m
length = 0 # length of previous longest prefix suffix
i = 1
while i < m:
if pattern[i] == pattern[length]:
length += 1
lps[i] = length
i += 1
else:
if length != 0:
length = lps[length - 1] # don't increment i
else:
lps[i] = 0
i += 1
return lps
Verify: build_lps("ABABAC") → expected: [0, 0, 1, 2, 3, 0]
2. Run KMP search using the LPS array
With the LPS array built, scan the text. On a mismatch at position j in the pattern, jump to lps[j-1] instead of restarting from 0. Each text character is examined at most twice, guaranteeing O(n+m) total. [src1] [src4]
def kmp_search(text, pattern):
n, m = len(text), len(pattern)
lps = build_lps(pattern)
matches = []
i = j = 0
while i < n:
if text[i] == pattern[j]:
i += 1
j += 1
if j == m:
matches.append(i - j)
j = lps[j - 1]
elif i < n and text[i] != pattern[j]:
if j != 0:
j = lps[j - 1]
else:
i += 1
return matches
Verify: kmp_search("ABABDABABCABABABABCABAB", "ABABC") → expected: [5]
3. Implement Rabin-Karp rolling hash
Choose a base (e.g., 256 for ASCII) and a large prime modulus. Compute the hash of the pattern and the first window. For each subsequent position, update the hash in O(1) by removing the leading character and adding the trailing character. [src2] [src3]
def rabin_karp(text, pattern, base=256, mod=10**9 + 7):
n, m = len(text), len(pattern)
if m > n:
return []
h = pow(base, m - 1, mod)
p_hash = t_hash = 0
for i in range(m):
p_hash = (p_hash * base + ord(pattern[i])) % mod
t_hash = (t_hash * base + ord(text[i])) % mod
matches = []
for i in range(n - m + 1):
if p_hash == t_hash:
if text[i:i + m] == pattern:
matches.append(i)
if i < n - m:
t_hash = (t_hash - ord(text[i]) * h) % mod
t_hash = (t_hash * base + ord(text[i + m])) % mod
return matches
Verify: rabin_karp("ABABDABABCABABABABCABAB", "ABABC") → expected: [5]
4. Implement Z-Algorithm as a simpler KMP alternative
The Z-array stores at each position the length of the longest substring starting there that matches a prefix of the string. Concatenate pattern + "$" + text and find positions where z[i] == m. [src6]
def z_function(s):
n = len(s)
z = [0] * n
z[0] = n
l = r = 0
for i in range(1, n):
if i < r:
z[i] = min(r - i, z[i - l])
while i + z[i] < n and s[z[i]] == s[i + z[i]]:
z[i] += 1
if i + z[i] > r:
l, r = i, i + z[i]
return z
def z_search(text, pattern):
concat = pattern + "$" + text
z = z_function(concat)
m = len(pattern)
return [i - m - 1 for i in range(m + 1, len(concat)) if z[i] == m]
Verify: z_search("ABABDABABCABABABABCABAB", "ABABC") → expected: [5]
Code Examples
Python: KMP with all match positions
# Input: text string and pattern string
# Output: list of all starting indices where pattern occurs
def kmp_find_all(text, pattern):
"""KMP string matching - O(n+m) guaranteed."""
n, m = len(text), len(pattern)
if m == 0:
return []
lps = [0] * m
k = 0
for i in range(1, m):
while k > 0 and pattern[k] != pattern[i]:
k = lps[k - 1]
if pattern[k] == pattern[i]:
k += 1
lps[i] = k
matches, j = [], 0
for i in range(n):
while j > 0 and pattern[j] != text[i]:
j = lps[j - 1]
if pattern[j] == text[i]:
j += 1
if j == m:
matches.append(i - m + 1)
j = lps[j - 1]
return matches
JavaScript: Rabin-Karp with BigInt for safe modular arithmetic
// Input: text (string), pattern (string)
// Output: array of starting indices of all matches
function rabinKarp(text, pattern) {
const n = text.length, m = pattern.length;
if (m > n) return [];
const BASE = 256n, MOD = 1000000007n;
let h = 1n;
for (let i = 0; i < m - 1; i++) h = (h * BASE) % MOD;
let pHash = 0n, tHash = 0n;
for (let i = 0; i < m; i++) {
pHash = (pHash * BASE + BigInt(pattern.charCodeAt(i))) % MOD;
tHash = (tHash * BASE + BigInt(text.charCodeAt(i))) % MOD;
}
const matches = [];
for (let i = 0; i <= n - m; i++) {
if (pHash === tHash && text.slice(i, i + m) === pattern) {
matches.push(i);
}
if (i < n - m) {
tHash = ((tHash - BigInt(text.charCodeAt(i)) * h) * BASE
+ BigInt(text.charCodeAt(i + m))) % MOD;
if (tHash < 0n) tHash += MOD;
}
}
return matches;
}
Java: KMP implementation
// Input: text (String), pattern (String)
// Output: List<Integer> of all starting indices
import java.util.ArrayList;
import java.util.List;
public static List<Integer> kmpSearch(String text, String pattern) {
int n = text.length(), m = pattern.length();
int[] lps = new int[m];
for (int i = 1, len = 0; i < m; ) {
if (pattern.charAt(i) == pattern.charAt(len)) {
lps[i++] = ++len;
} else if (len > 0) {
len = lps[len - 1];
} else {
lps[i++] = 0;
}
}
List<Integer> matches = new ArrayList<>();
for (int i = 0, j = 0; i < n; ) {
if (text.charAt(i) == pattern.charAt(j)) { i++; j++; }
if (j == m) { matches.add(i - j); j = lps[j - 1]; }
else if (i < n && text.charAt(i) != pattern.charAt(j)) {
if (j > 0) j = lps[j - 1];
else i++;
}
}
return matches;
}
Anti-Patterns
Wrong: Restarting pattern comparison from scratch on mismatch
# BAD -- O(nm) naive approach re-examines characters unnecessarily
def naive_search(text, pattern):
matches = []
for i in range(len(text) - len(pattern) + 1):
if text[i:i+len(pattern)] == pattern:
matches.append(i)
return matches
Correct: Use KMP to avoid redundant comparisons
# GOOD -- O(n+m) with failure function, no backtracking in text
def kmp_search(text, pattern):
lps = build_lps(pattern)
i = j = 0
matches = []
while i < len(text):
if text[i] == pattern[j]:
i += 1; j += 1
if j == len(pattern):
matches.append(i - j)
j = lps[j - 1]
elif i < len(text) and text[i] != pattern[j]:
j = lps[j - 1] if j else 0
if j == 0 and text[i] != pattern[0]:
i += 1
return matches
Wrong: Rabin-Karp without verifying hash matches
# BAD -- trusts hash equality without character comparison
if p_hash == t_hash:
matches.append(i) # WRONG: hash collision = wrong match
Correct: Always verify hash matches with full comparison
# GOOD -- hash is a filter, not proof; always confirm
if p_hash == t_hash:
if text[i:i+m] == pattern: # O(m) verification on hash match
matches.append(i)
Wrong: Using small modulus in Rabin-Karp
# BAD -- mod=101 causes frequent collisions on real text
mod = 101
t_hash = (t_hash * base + ord(text[i])) % mod
Correct: Use a large prime modulus
# GOOD -- 10^9+7 is standard; fits in 64-bit integers
mod = 10**9 + 7
t_hash = (t_hash * base + ord(text[i])) % mod
Common Pitfalls
- Hash collisions degrade Rabin-Karp to O(nm): When many substrings produce the same hash, every position triggers a full comparison. Fix: use double hashing (two independent hash functions with different primes). [src2]
- 1-indexed vs 0-indexed LPS array: Many textbook descriptions use 1-indexed arrays; translating to 0-indexed code causes off-by-one bugs. Fix: always initialize
lps[0] = 0and start build loop ati = 1. [src1] - Integer overflow in rolling hash: In Java/C++/JavaScript,
base^(m-1)overflows 64-bit integers for long patterns. Fix: apply% modat every multiplication step, or use BigInt. [src3] - Negative modular arithmetic: In languages where
%can return negative values, the rolling hash subtraction produces negative hashes. Fix:t_hash = ((t_hash - c * h) % mod + mod) % mod. [src3] - Forgetting to handle empty pattern: Both algorithms can infinite-loop on empty input. Fix: check
if len(pattern) == 0: return []at the start. [src4] - Using KMP for multi-pattern search: Running KMP k times for k patterns gives O(k*(n+m)). Fix: use Aho-Corasick for O(n + total_pattern_length + matches). [src5]
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Need guaranteed O(n+m) worst case for single pattern | Production code where stdlib is available | str.find(), indexOf(), strings.Contains() |
| Searching multiple patterns simultaneously | Pattern includes wildcards or regex features | Regex engine or NFA/DFA-based matcher |
| Streaming text (can't backtrack) | Text fits in memory and pattern is short (<5 chars) | Naive search or stdlib |
| Competitive programming / interviews | Need approximate/fuzzy matching | Edit distance, BK-trees, fuzzy matching |
| Text editor find/replace on long documents | Searching in sorted data | Binary search |
| Plagiarism detection across many documents | Need only first occurrence in small text | str.find() / indexOf() |
Important Caveats
- Standard library string search functions (
str.find(),String.indexOf(),strings.Contains()) typically use optimized Boyer-Moore or two-way algorithm variants and are faster than hand-rolled KMP for most practical inputs. - Rabin-Karp's practical speed depends heavily on the hash function quality. For adversarial inputs (competitive programming with hacking), use double hashing or randomize the base.
- Boyer-Moore often outperforms KMP in practice for large alphabets because it achieves sublinear O(n/m) average case by skipping characters. However, its worst case is O(nm) without the Galil rule.
- The Z-algorithm produces identical results to KMP with a simpler implementation; prefer it when code clarity matters more than constant-factor performance.