KMP algorithm explained

- Bottom line: Use KMP for guaranteed O(n+m) single-pattern matching; use Rabin-Karp for multi-pattern search or when hashing is natural; use your language's built-in find()/indexOf() for production single-pattern search.

Rabin-Karp string search

- Bottom line: Use KMP for guaranteed O(n+m) single-pattern matching; use Rabin-Karp for multi-pattern search or when hashing is natural; use your language's built-in find()/indexOf() for production single-pattern search.

string pattern matching algorithms comparison

- Bottom line: Use KMP for guaranteed O(n+m) single-pattern matching; use Rabin-Karp for multi-pattern search or when hashing is natural; use your language's built-in find()/indexOf() for production single-pattern search.

failure function prefix table KMP

- Bottom line: Use KMP for guaranteed O(n+m) single-pattern matching; use Rabin-Karp for multi-pattern search or when hashing is natural; use your language's built-in find()/indexOf() for production single-pattern search.

rolling hash string matching

- Bottom line: Use KMP for guaranteed O(n+m) single-pattern matching; use Rabin-Karp for multi-pattern search or when hashing is natural; use your language's built-in find()/indexOf() for production single-pattern search.

String Matching Algorithms: KMP, Rabin-Karp & Beyond

How do string matching algorithms work (KMP, Rabin-Karp)?

TL;DR

Bottom line: Use KMP for guaranteed O(n+m) single-pattern matching; use Rabin-Karp for multi-pattern search or when hashing is natural; use your language's built-in find()/indexOf() for production single-pattern search.
Key tool/command: Failure function (KMP) / rolling hash (Rabin-Karp) — the preprocessing step that makes each algorithm efficient.
Watch out for: Hash collisions in Rabin-Karp degrade worst case to O(nm); always verify hash matches with full character comparison.
Works with: Any language; algorithms are language-agnostic. Standard in competitive programming, text editors, bioinformatics, and plagiarism detection.

Constraints

KMP failure function array must be 0-indexed with lps[0] = 0; 1-indexed implementations cause off-by-one errors
Rabin-Karp hash modulus must be a large prime (e.g., 10⁹+7); small moduli cause excessive spurious hits
Rolling hash arithmetic must use modular arithmetic consistently to avoid integer overflow in Java, C++, and JavaScript
Rabin-Karp worst case is O(nm) when all hash values collide; always verify hash matches with full string comparison
Multi-pattern search requires Aho-Corasick or Rabin-Karp with multiple hashes; KMP handles only single patterns

Quick Reference

Algorithm	Preprocessing	Search Time	Space	Multi-pattern	Best For
Naive (brute force)	None	O(nm)	O(1)	No	Short patterns, small texts
KMP	O(m)	O(n)	O(m)	No	Single pattern, guaranteed linear
Rabin-Karp	O(m)	O(n) avg, O(nm) worst	O(1)	Yes	Multi-pattern, plagiarism detection
Boyer-Moore	O(m + σ)	O(n/m) best, O(nm) worst	O(m + σ)	No	Long patterns, large alphabets
Aho-Corasick	O(Σm_i)	O(n + matches)	O(Σm_i · σ)	Yes	Dictionary matching, many patterns
Z-Algorithm	O(n+m)	O(n+m)	O(n+m)	No	Simpler alternative to KMP

Where: n = text length, m = pattern length, σ = alphabet size. [src1] [src6]

Decision Tree

START
├── Single pattern or multiple patterns?
│   ├── MULTIPLE PATTERNS ↓
│   │   ├── Fixed dictionary of patterns?
│   │   │   ├── YES → Aho-Corasick (O(n + total_matches))
│   │   │   └── NO (patterns change) → Rabin-Karp with multiple hashes
│   │   └── END
│   └── SINGLE PATTERN ↓
├── Need guaranteed worst-case O(n+m)?
│   ├── YES → KMP or Z-Algorithm
│   └── NO (average case OK) ↓
├── Pattern length > 10 and large alphabet (ASCII/Unicode)?
│   ├── YES → Boyer-Moore (sublinear average case)
│   └── NO ↓
├── Production code (not competitive programming)?
│   ├── YES → Use built-in: str.find() / indexOf() / strings.Contains()
│   └── NO ↓
└── DEFAULT → KMP (most versatile, easiest to implement correctly)

Step-by-Step Guide

1. Build the KMP failure function (LPS array)

The Longest Proper Prefix which is also Suffix (LPS) array is the core of KMP. For each position i in the pattern, lps[i] stores the length of the longest proper prefix of pattern[0..i] that is also a suffix. This tells KMP how far to shift after a mismatch. [src1]

def build_lps(pattern):
    m = len(pattern)
    lps = [0] * m
    length = 0  # length of previous longest prefix suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]  # don't increment i
            else:
                lps[i] = 0
                i += 1
    return lps

Verify: build_lps("ABABAC") → expected: [0, 0, 1, 2, 3, 0]

2. Run KMP search using the LPS array

With the LPS array built, scan the text. On a mismatch at position j in the pattern, jump to lps[j-1] instead of restarting from 0. Each text character is examined at most twice, guaranteeing O(n+m) total. [src1] [src4]

def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    lps = build_lps(pattern)
    matches = []
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
        if j == m:
            matches.append(i - j)
            j = lps[j - 1]
        elif i < n and text[i] != pattern[j]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return matches

Verify: kmp_search("ABABDABABCABABABABCABAB", "ABABC") → expected: [5]

3. Implement Rabin-Karp rolling hash

Choose a base (e.g., 256 for ASCII) and a large prime modulus. Compute the hash of the pattern and the first window. For each subsequent position, update the hash in O(1) by removing the leading character and adding the trailing character. [src2] [src3]

def rabin_karp(text, pattern, base=256, mod=10**9 + 7):
    n, m = len(text), len(pattern)
    if m > n:
        return []
    h = pow(base, m - 1, mod)
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        if p_hash == t_hash:
            if text[i:i + m] == pattern:
                matches.append(i)
        if i < n - m:
            t_hash = (t_hash - ord(text[i]) * h) % mod
            t_hash = (t_hash * base + ord(text[i + m])) % mod
    return matches

Verify: rabin_karp("ABABDABABCABABABABCABAB", "ABABC") → expected: [5]

4. Implement Z-Algorithm as a simpler KMP alternative

The Z-array stores at each position the length of the longest substring starting there that matches a prefix of the string. Concatenate pattern + "$" + text and find positions where z[i] == m. [src6]

def z_function(s):
    n = len(s)
    z = [0] * n
    z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def z_search(text, pattern):
    concat = pattern + "$" + text
    z = z_function(concat)
    m = len(pattern)
    return [i - m - 1 for i in range(m + 1, len(concat)) if z[i] == m]

Verify: z_search("ABABDABABCABABABABCABAB", "ABABC") → expected: [5]

Code Examples

Python: KMP with all match positions

# Input:  text string and pattern string
# Output: list of all starting indices where pattern occurs

def kmp_find_all(text, pattern):
    """KMP string matching - O(n+m) guaranteed."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    lps = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[k] != pattern[i]:
            k = lps[k - 1]
        if pattern[k] == pattern[i]:
            k += 1
        lps[i] = k
    matches, j = [], 0
    for i in range(n):
        while j > 0 and pattern[j] != text[i]:
            j = lps[j - 1]
        if pattern[j] == text[i]:
            j += 1
        if j == m:
            matches.append(i - m + 1)
            j = lps[j - 1]
    return matches

JavaScript: Rabin-Karp with BigInt for safe modular arithmetic

// Input:  text (string), pattern (string)
// Output: array of starting indices of all matches

function rabinKarp(text, pattern) {
  const n = text.length, m = pattern.length;
  if (m > n) return [];
  const BASE = 256n, MOD = 1000000007n;
  let h = 1n;
  for (let i = 0; i < m - 1; i++) h = (h * BASE) % MOD;

  let pHash = 0n, tHash = 0n;
  for (let i = 0; i < m; i++) {
    pHash = (pHash * BASE + BigInt(pattern.charCodeAt(i))) % MOD;
    tHash = (tHash * BASE + BigInt(text.charCodeAt(i))) % MOD;
  }
  const matches = [];
  for (let i = 0; i <= n - m; i++) {
    if (pHash === tHash && text.slice(i, i + m) === pattern) {
      matches.push(i);
    }
    if (i < n - m) {
      tHash = ((tHash - BigInt(text.charCodeAt(i)) * h) * BASE
        + BigInt(text.charCodeAt(i + m))) % MOD;
      if (tHash < 0n) tHash += MOD;
    }
  }
  return matches;
}

Java: KMP implementation

// Input:  text (String), pattern (String)
// Output: List<Integer> of all starting indices

import java.util.ArrayList;
import java.util.List;

public static List<Integer> kmpSearch(String text, String pattern) {
    int n = text.length(), m = pattern.length();
    int[] lps = new int[m];
    for (int i = 1, len = 0; i < m; ) {
        if (pattern.charAt(i) == pattern.charAt(len)) {
            lps[i++] = ++len;
        } else if (len > 0) {
            len = lps[len - 1];
        } else {
            lps[i++] = 0;
        }
    }
    List<Integer> matches = new ArrayList<>();
    for (int i = 0, j = 0; i < n; ) {
        if (text.charAt(i) == pattern.charAt(j)) { i++; j++; }
        if (j == m) { matches.add(i - j); j = lps[j - 1]; }
        else if (i < n && text.charAt(i) != pattern.charAt(j)) {
            if (j > 0) j = lps[j - 1];
            else i++;
        }
    }
    return matches;
}

Anti-Patterns

Wrong: Restarting pattern comparison from scratch on mismatch

# BAD -- O(nm) naive approach re-examines characters unnecessarily
def naive_search(text, pattern):
    matches = []
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i+len(pattern)] == pattern:
            matches.append(i)
    return matches

Correct: Use KMP to avoid redundant comparisons

# GOOD -- O(n+m) with failure function, no backtracking in text
def kmp_search(text, pattern):
    lps = build_lps(pattern)
    i = j = 0
    matches = []
    while i < len(text):
        if text[i] == pattern[j]:
            i += 1; j += 1
        if j == len(pattern):
            matches.append(i - j)
            j = lps[j - 1]
        elif i < len(text) and text[i] != pattern[j]:
            j = lps[j - 1] if j else 0
            if j == 0 and text[i] != pattern[0]:
                i += 1
    return matches

Wrong: Rabin-Karp without verifying hash matches

# BAD -- trusts hash equality without character comparison
if p_hash == t_hash:
    matches.append(i)  # WRONG: hash collision = wrong match

Correct: Always verify hash matches with full comparison

# GOOD -- hash is a filter, not proof; always confirm
if p_hash == t_hash:
    if text[i:i+m] == pattern:  # O(m) verification on hash match
        matches.append(i)

Wrong: Using small modulus in Rabin-Karp

# BAD -- mod=101 causes frequent collisions on real text
mod = 101
t_hash = (t_hash * base + ord(text[i])) % mod

Correct: Use a large prime modulus

# GOOD -- 10^9+7 is standard; fits in 64-bit integers
mod = 10**9 + 7
t_hash = (t_hash * base + ord(text[i])) % mod

Common Pitfalls

Hash collisions degrade Rabin-Karp to O(nm): When many substrings produce the same hash, every position triggers a full comparison. Fix: use double hashing (two independent hash functions with different primes). [src2]
1-indexed vs 0-indexed LPS array: Many textbook descriptions use 1-indexed arrays; translating to 0-indexed code causes off-by-one bugs. Fix: always initialize lps[0] = 0 and start build loop at i = 1. [src1]
Integer overflow in rolling hash: In Java/C++/JavaScript, base^(m-1) overflows 64-bit integers for long patterns. Fix: apply % mod at every multiplication step, or use BigInt. [src3]
Negative modular arithmetic: In languages where % can return negative values, the rolling hash subtraction produces negative hashes. Fix: t_hash = ((t_hash - c * h) % mod + mod) % mod. [src3]
Forgetting to handle empty pattern: Both algorithms can infinite-loop on empty input. Fix: check if len(pattern) == 0: return [] at the start. [src4]
Using KMP for multi-pattern search: Running KMP k times for k patterns gives O(k*(n+m)). Fix: use Aho-Corasick for O(n + total_pattern_length + matches). [src5]

When to Use / When Not to Use

Use When	Don't Use When	Use Instead
Need guaranteed O(n+m) worst case for single pattern	Production code where stdlib is available	`str.find()`, `indexOf()`, `strings.Contains()`
Searching multiple patterns simultaneously	Pattern includes wildcards or regex features	Regex engine or NFA/DFA-based matcher
Streaming text (can't backtrack)	Text fits in memory and pattern is short (<5 chars)	Naive search or stdlib
Competitive programming / interviews	Need approximate/fuzzy matching	Edit distance, BK-trees, fuzzy matching
Text editor find/replace on long documents	Searching in sorted data	Binary search
Plagiarism detection across many documents	Need only first occurrence in small text	`str.find()` / `indexOf()`

Important Caveats

Standard library string search functions (str.find(), String.indexOf(), strings.Contains()) typically use optimized Boyer-Moore or two-way algorithm variants and are faster than hand-rolled KMP for most practical inputs.
Rabin-Karp's practical speed depends heavily on the hash function quality. For adversarial inputs (competitive programming with hacking), use double hashing or randomize the base.
Boyer-Moore often outperforms KMP in practice for large alphabets because it achieves sublinear O(n/m) average case by skipping characters. However, its worst case is O(nm) without the Galil rule.
The Z-algorithm produces identical results to KMP with a simpler implementation; prefer it when code clarity matters more than constant-factor performance.