Bloom filter tutorial

- Bottom line: Bloom filters are space-efficient probabilistic data structures that answer "is X in the set?" with configurable false positive rates and zero false negatives, using a bit array and k independent hash functions.

probabilistic set membership

- Bottom line: Bloom filters are space-efficient probabilistic data structures that answer "is X in the set?" with configurable false positive rates and zero false negatives, using a bit array and k independent hash functions.

space-efficient set lookup

- Bottom line: Bloom filters are space-efficient probabilistic data structures that answer "is X in the set?" with configurable false positive rates and zero false negatives, using a bit array and k independent hash functions.

Bloom filter false positive rate

- Bottom line: Bloom filters are space-efficient probabilistic data structures that answer "is X in the set?" with configurable false positive rates and zero false negatives, using a bit array and k independent hash functions.

Counting Bloom filter vs Cuckoo filter

- Bottom line: Bloom filters are space-efficient probabilistic data structures that answer "is X in the set?" with configurable false positive rates and zero false negatives, using a bit array and k independent hash functions.

Bloom Filters & Probabilistic Data Structures: Complete Reference

How do Bloom filters and probabilistic data structures work?

TL;DR

Bottom line: Bloom filters are space-efficient probabilistic data structures that answer "is X in the set?" with configurable false positive rates and zero false negatives, using a bit array and k independent hash functions.
Key tool/command: m = -n * ln(p) / (ln(2))^2 (optimal filter size: n = elements, p = target FP rate)
Watch out for: Standard Bloom filters do not support deletion — removing a bit corrupts other entries. Use Counting Bloom or Cuckoo filters if deletion is needed.
Works with: Any language/platform. Common libraries: pybloom_live (Python), bloom-filters (npm), bits-and-blooms/bloom (Go).

Constraints

Standard Bloom filters do NOT support element deletion — removing bits may corrupt other entries
False positive rate increases as filter fills — never exceed designed capacity (n)
Hash functions must be independent and uniformly distributed — cryptographic hashes (SHA-256) are too slow; use murmur3, xxHash, or FNV
A Bloom filter can never return false negatives — "not in set" is always correct
Filter size (m) and hash count (k) are fixed at creation and cannot be changed after insertions begin
Thread safety is NOT built into most implementations — add synchronization for concurrent access

Quick Reference

Probabilistic Data Structure Comparison

Structure	Space (bits/element)	Insert	Query	Delete	FP Rate	Best For
Standard Bloom Filter	9.6 (at 1% FP)	O(k)	O(k)	No	Configurable	Set membership, dedup
Counting Bloom Filter	3-4x standard	O(k)	O(k)	Yes	Same as standard	Membership with deletion
Cuckoo Filter	~12 bits (at 3% FP)	O(1) amortized	O(1)	Yes	Lower at <3% FP	Membership + deletion
HyperLogLog	~1.5 KB fixed	O(1)	O(1)	No	~2% std error	Cardinality estimation
Count-Min Sketch	O(1/e * ln(1/d))	O(d)	O(d)	No	Over-counts only	Frequency estimation
Quotient Filter	10-25% more than Bloom	O(1) amortized	O(1) amortized	Yes	Configurable	Cache-friendly membership

Key Formulas (Standard Bloom Filter)

Parameter	Formula	Description
Optimal bit array size (m)	`m = -n * ln(p) / (ln(2))^2`	n = expected elements, p = target FP rate
Optimal hash count (k)	`k = (m / n) * ln(2)`	Minimizes FP for given m and n
False positive probability	`p = (1 - e^(-kn/m))^k`	Actual FP rate after n insertions
Bits per element (1% FP)	~9.6 bits	Independent of element size
Bits per element (0.1% FP)	~14.4 bits	Each 10x reduction costs ~4.8 bits

Decision Tree

START: What problem are you solving?
|
+-- Set membership ("is X in the set?")?
|   |
|   +-- Need deletion?
|   |   +-- YES --> Counting Bloom Filter (if space OK) or Cuckoo Filter (if space-constrained)
|   |   +-- NO --> Standard Bloom Filter (most space-efficient)
|   |
|   +-- Need exact membership (zero false positives)?
|       +-- YES --> Use a hash set instead (not a probabilistic structure)
|       +-- NO --> Bloom Filter family (see above)
|
+-- Cardinality estimation ("how many distinct elements?")?
|   +-- YES --> HyperLogLog (~1.5 KB, ~2% error)
|
+-- Frequency estimation ("how often does X appear?")?
|   +-- YES --> Count-Min Sketch (over-counts, never under-counts)
|
+-- Need cache-friendly disk operations?
|   +-- YES --> Quotient Filter (better locality than Bloom)
|
+-- DEFAULT --> Standard Bloom Filter (simplest, most battle-tested)

Step-by-Step Guide

1. Calculate optimal filter parameters

Determine the bit array size (m) and number of hash functions (k) based on your expected element count (n) and desired false positive rate (p). [src7]

import math

def bloom_filter_params(n, p):
    """Calculate optimal Bloom filter parameters.
    Args:
        n: Expected number of elements
        p: Desired false positive rate (e.g., 0.01 for 1%)
    Returns:
        (m, k): bit array size, number of hash functions
    """
    m = -n * math.log(p) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return int(math.ceil(m)), int(math.ceil(k))

# Example: 1 million elements, 1% false positive rate
m, k = bloom_filter_params(1_000_000, 0.01)
print(f"Bit array size: {m:,} bits ({m / 8 / 1024:.1f} KB)")
print(f"Hash functions: {k}")
# Output: Bit array size: 9,585,059 bits (1170.5 KB), Hash functions: 7

Verify: Run with n=1000000, p=0.01 → expect m ~ 9.6M bits, k = 7

2. Implement the Bloom filter

Create a bit array of size m and implement add/query operations using k hash functions. Use double hashing to simulate k independent hash functions efficiently. [src2]

import mmh3  # pip install mmh3==4.1.0
from bitarray import bitarray  # pip install bitarray==2.9.2

class BloomFilter:
    def __init__(self, n, p=0.01):
        self.n = n
        self.p = p
        self.m = int(-n * math.log(p) / (math.log(2) ** 2))
        self.k = int((self.m / n) * math.log(2))
        self.bits = bitarray(self.m)
        self.bits.setall(0)
        self.count = 0

    def _hashes(self, item):
        h1 = mmh3.hash(str(item), 0) % self.m
        h2 = mmh3.hash(str(item), 1) % self.m
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits[pos] = 1
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._hashes(item))

Verify: Insert 1000 elements, test 10000 non-existent → FP count ~ 1% of 10000

3. Test actual false positive rate

Always validate your filter against the theoretical FP rate after deployment. [src1]

import random, string

def test_fp_rate(n=100_000, p=0.01):
    bf = BloomFilter(n, p)
    inserted = set()
    for _ in range(n):
        item = ''.join(random.choices(string.ascii_letters, k=12))
        bf.add(item)
        inserted.add(item)

    false_positives = 0
    test_count = 100_000
    for _ in range(test_count):
        item = ''.join(random.choices(string.ascii_letters, k=12))
        if item not in inserted and item in bf:
            false_positives += 1

    actual_fp = false_positives / test_count
    print(f"Target FP: {p:.4f}, Actual FP: {actual_fp:.4f}")

test_fp_rate()
# Expected: Target FP: 0.0100, Actual FP: ~0.0095-0.0105

Verify: Actual FP rate should be within 10% of target p

4. Choose the right variant for your use case

If deletion is required, use a Counting Bloom Filter or Cuckoo Filter. [src3]

# Counting Bloom Filter — supports deletion via counters
class CountingBloomFilter:
    def __init__(self, n, p=0.01, counter_bits=4):
        self.m = int(-n * math.log(p) / (math.log(2) ** 2))
        self.k = int((self.m / n) * math.log(2))
        self.max_count = (1 << counter_bits) - 1
        self.counters = [0] * self.m

    def _hashes(self, item):
        h1 = mmh3.hash(str(item), 0) % self.m
        h2 = mmh3.hash(str(item), 1) % self.m
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item):
        for pos in self._hashes(item):
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1

    def remove(self, item):
        if item not in self:
            return False
        for pos in self._hashes(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1
        return True

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._hashes(item))

Verify: Add "hello", check (True), remove "hello", check again (False)

Code Examples

Python: Standard Bloom Filter with mmh3

# Input:  list of items to insert, items to query
# Output: membership test results (True = possibly in set, False = definitely not)

import mmh3  # pip install mmh3==4.1.0
from bitarray import bitarray  # pip install bitarray==2.9.2
import math

class BloomFilter:
    def __init__(self, expected_items, fp_rate=0.01):
        self.m = int(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, int((self.m / expected_items) * math.log(2)))
        self.bits = bitarray(self.m)
        self.bits.setall(0)

    def add(self, item):
        for seed in range(self.k):
            idx = mmh3.hash(str(item), seed) % self.m
            self.bits[idx] = 1

    def __contains__(self, item):
        return all(
            self.bits[mmh3.hash(str(item), seed) % self.m]
            for seed in range(self.k)
        )

bf = BloomFilter(1_000_000, fp_rate=0.001)
bf.add("user:12345")
print("user:12345" in bf)  # True
print("user:99999" in bf)  # False

JavaScript: Bloom Filter for Browser/Node.js

// Input:  items to add, items to query
// Output: boolean membership test results

class BloomFilter {
  constructor(expectedItems, fpRate = 0.01) {
    this.m = Math.ceil(-expectedItems * Math.log(fpRate) / (Math.log(2) ** 2));
    this.k = Math.max(1, Math.round((this.m / expectedItems) * Math.log(2)));
    this.bits = new Uint8Array(Math.ceil(this.m / 8));
  }

  _hash(str, seed) {
    let h = 0x811c9dc5 ^ seed;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = (h * 0x01000193) >>> 0;
    }
    return h % this.m;
  }

  add(item) {
    const s = String(item);
    for (let i = 0; i < this.k; i++) {
      const idx = this._hash(s, i);
      this.bits[idx >> 3] |= (1 << (idx & 7));
    }
  }

  has(item) {
    const s = String(item);
    for (let i = 0; i < this.k; i++) {
      const idx = this._hash(s, i);
      if (!(this.bits[idx >> 3] & (1 << (idx & 7)))) return false;
    }
    return true;
  }
}

const bf = new BloomFilter(100000, 0.01);
bf.add("session:abc123");
console.log(bf.has("session:abc123")); // true
console.log(bf.has("session:xyz789")); // false

Go: Bloom Filter using bits-and-blooms

// Input:  items to add, items to query
// Output: boolean membership test results

package main

import (
	"fmt"
	"github.com/bits-and-blooms/bloom/v3" // v3.7.0
)

func main() {
	filter := bloom.NewWithEstimates(1_000_000, 0.001)

	filter.AddString("user:12345")
	filter.AddString("user:67890")

	fmt.Println(filter.TestString("user:12345")) // true
	fmt.Println(filter.TestString("user:99999")) // false

	fmt.Printf("Bit array size: %d bits\n", filter.Cap())
	fmt.Printf("Hash functions: %d\n", filter.K())
}

Anti-Patterns

Wrong: Using a single hash function

# BAD -- single hash function causes excessive collisions and high FP rate
import hashlib

class BadBloomFilter:
    def __init__(self, size):
        self.bits = [False] * size

    def add(self, item):
        h = int(hashlib.md5(item.encode()).hexdigest(), 16) % len(self.bits)
        self.bits[h] = True  # Only one bit set per element

Correct: Using k independent hash functions with double hashing

# GOOD -- k hash functions spread bits optimally, matching theoretical FP rate
import mmh3

class GoodBloomFilter:
    def __init__(self, expected, fp_rate=0.01):
        self.m = int(-expected * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, int((self.m / expected) * math.log(2)))
        self.bits = bitarray(self.m)
        self.bits.setall(0)

    def add(self, item):
        for seed in range(self.k):
            self.bits[mmh3.hash(str(item), seed) % self.m] = 1

Wrong: Not sizing the filter before use

# BAD -- arbitrary size without considering element count or FP rate
bf_bits = [0] * 1000  # Way too small for 100K elements
# After inserting 100K items, nearly all bits are 1 -- FP rate approaches 100%

Correct: Calculate parameters from requirements

# GOOD -- size derived from actual requirements
n = 100_000     # expected elements
p = 0.01        # 1% false positive rate
bf = BloomFilter(n, p)
# m ~ 958,506 bits (117 KB), k = 7 -- FP rate stays at ~1%

Wrong: Trying to delete from a standard Bloom filter

# BAD -- clearing bits may corrupt other entries
def bad_remove(bf, item):
    for seed in range(bf.k):
        idx = mmh3.hash(str(item), seed) % bf.m
        bf.bits[idx] = 0  # This may unset bits shared with other elements!

Correct: Use Counting Bloom Filter for deletion

# GOOD -- counters allow safe deletion
cbf = CountingBloomFilter(100_000, p=0.01, counter_bits=4)
cbf.add("item_a")
cbf.add("item_b")
cbf.remove("item_a")  # Safe -- decrements counters, does not affect item_b
print("item_a" in cbf)  # False
print("item_b" in cbf)  # True

Common Pitfalls

Undersized filter: Creating a filter too small for actual data volume causes FP rate to skyrocket as bits saturate. Fix: Always calculate m from m = -n * ln(p) / (ln(2))^2. [src1]
Wrong number of hash functions: Using too many or too few hash functions relative to m/n ratio wastes space or increases FP. Fix: Use k = (m/n) * ln(2) and round to nearest integer. [src7]
Cryptographic hash functions: Using SHA-256 or MD5 is orders of magnitude slower than necessary. Fix: Use murmur3, xxHash, or FNV-1a. [src2]
No thread safety: Most Bloom filter implementations are not thread-safe. Fix: Wrap add/query in a mutex or use a concurrent-safe implementation. [src6]
Unbounded growth: Continuously inserting beyond capacity n without monitoring FP rate. Fix: Track insertion count and create a new, larger filter when approaching n. [src1]
Confusing false positive with false negative: Bloom filters never produce false negatives. If "not in set" is returned, the element is definitely absent. Fix: Only "possibly in set" needs verification against the source of truth. [src5]

When to Use / When Not to Use

Use When	Don't Use When	Use Instead
Checking if an element was seen before (dedup)	You need exact set membership with zero errors	Hash set / hash table
Pre-filtering expensive disk/network lookups	You need to enumerate or iterate set members	Sorted set or B-tree
Spell-checking dictionaries	You need to delete elements frequently	Cuckoo filter or hash set
Network packet routing (is destination in cache?)	Dataset fits comfortably in memory as a hash set	Hash set (simpler, deterministic)
Distributed systems: reducing cross-node queries	You need element frequency counts	Count-Min Sketch
Database: avoiding unnecessary disk reads (LSM trees)	You need cardinality estimation (count distinct)	HyperLogLog

Important Caveats

The theoretical FP formula p = (1 - e^(-kn/m))^k is an approximation. Actual FP rate is slightly higher due to hash collisions and non-perfect independence. Always test empirically.
Scalable Bloom Filters (SBF) allow growth by adding new filter layers but increase query time linearly with the number of layers. Consider pre-sizing if possible.
Counting Bloom Filters use 3-4x more memory than standard Bloom filters. Cuckoo filters are more space-efficient when deletion is needed and FP rate target is below 3%.
Bloom filters serialized across systems must agree on hash function, seed, bit array encoding, and endianness. Always use the same library version on both ends.
In distributed databases (Cassandra, HBase), Bloom filters are used per-SSTable. Misconfigured bloom_filter_fp_chance causes excessive disk reads.