Bloom Filters & Probabilistic Data Structures: Complete Reference
How do Bloom filters and probabilistic data structures work?
TL;DR
- Bottom line: Bloom filters are space-efficient probabilistic data structures that answer "is X in the set?" with configurable false positive rates and zero false negatives, using a bit array and k independent hash functions.
- Key tool/command:
m = -n * ln(p) / (ln(2))^2(optimal filter size: n = elements, p = target FP rate) - Watch out for: Standard Bloom filters do not support deletion — removing a bit corrupts other entries. Use Counting Bloom or Cuckoo filters if deletion is needed.
- Works with: Any language/platform. Common libraries:
pybloom_live(Python),bloom-filters(npm),bits-and-blooms/bloom(Go).
Constraints
- Standard Bloom filters do NOT support element deletion — removing bits may corrupt other entries
- False positive rate increases as filter fills — never exceed designed capacity (n)
- Hash functions must be independent and uniformly distributed — cryptographic hashes (SHA-256) are too slow; use murmur3, xxHash, or FNV
- A Bloom filter can never return false negatives — "not in set" is always correct
- Filter size (m) and hash count (k) are fixed at creation and cannot be changed after insertions begin
- Thread safety is NOT built into most implementations — add synchronization for concurrent access
Quick Reference
Probabilistic Data Structure Comparison
| Structure | Space (bits/element) | Insert | Query | Delete | FP Rate | Best For |
|---|---|---|---|---|---|---|
| Standard Bloom Filter | 9.6 (at 1% FP) | O(k) | O(k) | No | Configurable | Set membership, dedup |
| Counting Bloom Filter | 3-4x standard | O(k) | O(k) | Yes | Same as standard | Membership with deletion |
| Cuckoo Filter | ~12 bits (at 3% FP) | O(1) amortized | O(1) | Yes | Lower at <3% FP | Membership + deletion |
| HyperLogLog | ~1.5 KB fixed | O(1) | O(1) | No | ~2% std error | Cardinality estimation |
| Count-Min Sketch | O(1/e * ln(1/d)) | O(d) | O(d) | No | Over-counts only | Frequency estimation |
| Quotient Filter | 10-25% more than Bloom | O(1) amortized | O(1) amortized | Yes | Configurable | Cache-friendly membership |
Key Formulas (Standard Bloom Filter)
| Parameter | Formula | Description |
|---|---|---|
| Optimal bit array size (m) | m = -n * ln(p) / (ln(2))^2 | n = expected elements, p = target FP rate |
| Optimal hash count (k) | k = (m / n) * ln(2) | Minimizes FP for given m and n |
| False positive probability | p = (1 - e^(-kn/m))^k | Actual FP rate after n insertions |
| Bits per element (1% FP) | ~9.6 bits | Independent of element size |
| Bits per element (0.1% FP) | ~14.4 bits | Each 10x reduction costs ~4.8 bits |
Decision Tree
START: What problem are you solving?
|
+-- Set membership ("is X in the set?")?
| |
| +-- Need deletion?
| | +-- YES --> Counting Bloom Filter (if space OK) or Cuckoo Filter (if space-constrained)
| | +-- NO --> Standard Bloom Filter (most space-efficient)
| |
| +-- Need exact membership (zero false positives)?
| +-- YES --> Use a hash set instead (not a probabilistic structure)
| +-- NO --> Bloom Filter family (see above)
|
+-- Cardinality estimation ("how many distinct elements?")?
| +-- YES --> HyperLogLog (~1.5 KB, ~2% error)
|
+-- Frequency estimation ("how often does X appear?")?
| +-- YES --> Count-Min Sketch (over-counts, never under-counts)
|
+-- Need cache-friendly disk operations?
| +-- YES --> Quotient Filter (better locality than Bloom)
|
+-- DEFAULT --> Standard Bloom Filter (simplest, most battle-tested)
Step-by-Step Guide
1. Calculate optimal filter parameters
Determine the bit array size (m) and number of hash functions (k) based on your expected element count (n) and desired false positive rate (p). [src7]
import math
def bloom_filter_params(n, p):
"""Calculate optimal Bloom filter parameters.
Args:
n: Expected number of elements
p: Desired false positive rate (e.g., 0.01 for 1%)
Returns:
(m, k): bit array size, number of hash functions
"""
m = -n * math.log(p) / (math.log(2) ** 2)
k = (m / n) * math.log(2)
return int(math.ceil(m)), int(math.ceil(k))
# Example: 1 million elements, 1% false positive rate
m, k = bloom_filter_params(1_000_000, 0.01)
print(f"Bit array size: {m:,} bits ({m / 8 / 1024:.1f} KB)")
print(f"Hash functions: {k}")
# Output: Bit array size: 9,585,059 bits (1170.5 KB), Hash functions: 7
Verify: Run with n=1000000, p=0.01 → expect m ~ 9.6M bits, k = 7
2. Implement the Bloom filter
Create a bit array of size m and implement add/query operations using k hash functions. Use double hashing to simulate k independent hash functions efficiently. [src2]
import mmh3 # pip install mmh3==4.1.0
from bitarray import bitarray # pip install bitarray==2.9.2
class BloomFilter:
def __init__(self, n, p=0.01):
self.n = n
self.p = p
self.m = int(-n * math.log(p) / (math.log(2) ** 2))
self.k = int((self.m / n) * math.log(2))
self.bits = bitarray(self.m)
self.bits.setall(0)
self.count = 0
def _hashes(self, item):
h1 = mmh3.hash(str(item), 0) % self.m
h2 = mmh3.hash(str(item), 1) % self.m
for i in range(self.k):
yield (h1 + i * h2) % self.m
def add(self, item):
for pos in self._hashes(item):
self.bits[pos] = 1
self.count += 1
def __contains__(self, item):
return all(self.bits[pos] for pos in self._hashes(item))
Verify: Insert 1000 elements, test 10000 non-existent → FP count ~ 1% of 10000
3. Test actual false positive rate
Always validate your filter against the theoretical FP rate after deployment. [src1]
import random, string
def test_fp_rate(n=100_000, p=0.01):
bf = BloomFilter(n, p)
inserted = set()
for _ in range(n):
item = ''.join(random.choices(string.ascii_letters, k=12))
bf.add(item)
inserted.add(item)
false_positives = 0
test_count = 100_000
for _ in range(test_count):
item = ''.join(random.choices(string.ascii_letters, k=12))
if item not in inserted and item in bf:
false_positives += 1
actual_fp = false_positives / test_count
print(f"Target FP: {p:.4f}, Actual FP: {actual_fp:.4f}")
test_fp_rate()
# Expected: Target FP: 0.0100, Actual FP: ~0.0095-0.0105
Verify: Actual FP rate should be within 10% of target p
4. Choose the right variant for your use case
If deletion is required, use a Counting Bloom Filter or Cuckoo Filter. [src3]
# Counting Bloom Filter — supports deletion via counters
class CountingBloomFilter:
def __init__(self, n, p=0.01, counter_bits=4):
self.m = int(-n * math.log(p) / (math.log(2) ** 2))
self.k = int((self.m / n) * math.log(2))
self.max_count = (1 << counter_bits) - 1
self.counters = [0] * self.m
def _hashes(self, item):
h1 = mmh3.hash(str(item), 0) % self.m
h2 = mmh3.hash(str(item), 1) % self.m
for i in range(self.k):
yield (h1 + i * h2) % self.m
def add(self, item):
for pos in self._hashes(item):
if self.counters[pos] < self.max_count:
self.counters[pos] += 1
def remove(self, item):
if item not in self:
return False
for pos in self._hashes(item):
if self.counters[pos] > 0:
self.counters[pos] -= 1
return True
def __contains__(self, item):
return all(self.counters[pos] > 0 for pos in self._hashes(item))
Verify: Add "hello", check (True), remove "hello", check again (False)
Code Examples
Python: Standard Bloom Filter with mmh3
# Input: list of items to insert, items to query
# Output: membership test results (True = possibly in set, False = definitely not)
import mmh3 # pip install mmh3==4.1.0
from bitarray import bitarray # pip install bitarray==2.9.2
import math
class BloomFilter:
def __init__(self, expected_items, fp_rate=0.01):
self.m = int(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))
self.k = max(1, int((self.m / expected_items) * math.log(2)))
self.bits = bitarray(self.m)
self.bits.setall(0)
def add(self, item):
for seed in range(self.k):
idx = mmh3.hash(str(item), seed) % self.m
self.bits[idx] = 1
def __contains__(self, item):
return all(
self.bits[mmh3.hash(str(item), seed) % self.m]
for seed in range(self.k)
)
bf = BloomFilter(1_000_000, fp_rate=0.001)
bf.add("user:12345")
print("user:12345" in bf) # True
print("user:99999" in bf) # False
JavaScript: Bloom Filter for Browser/Node.js
// Input: items to add, items to query
// Output: boolean membership test results
class BloomFilter {
constructor(expectedItems, fpRate = 0.01) {
this.m = Math.ceil(-expectedItems * Math.log(fpRate) / (Math.log(2) ** 2));
this.k = Math.max(1, Math.round((this.m / expectedItems) * Math.log(2)));
this.bits = new Uint8Array(Math.ceil(this.m / 8));
}
_hash(str, seed) {
let h = 0x811c9dc5 ^ seed;
for (let i = 0; i < str.length; i++) {
h ^= str.charCodeAt(i);
h = (h * 0x01000193) >>> 0;
}
return h % this.m;
}
add(item) {
const s = String(item);
for (let i = 0; i < this.k; i++) {
const idx = this._hash(s, i);
this.bits[idx >> 3] |= (1 << (idx & 7));
}
}
has(item) {
const s = String(item);
for (let i = 0; i < this.k; i++) {
const idx = this._hash(s, i);
if (!(this.bits[idx >> 3] & (1 << (idx & 7)))) return false;
}
return true;
}
}
const bf = new BloomFilter(100000, 0.01);
bf.add("session:abc123");
console.log(bf.has("session:abc123")); // true
console.log(bf.has("session:xyz789")); // false
Go: Bloom Filter using bits-and-blooms
// Input: items to add, items to query
// Output: boolean membership test results
package main
import (
"fmt"
"github.com/bits-and-blooms/bloom/v3" // v3.7.0
)
func main() {
filter := bloom.NewWithEstimates(1_000_000, 0.001)
filter.AddString("user:12345")
filter.AddString("user:67890")
fmt.Println(filter.TestString("user:12345")) // true
fmt.Println(filter.TestString("user:99999")) // false
fmt.Printf("Bit array size: %d bits\n", filter.Cap())
fmt.Printf("Hash functions: %d\n", filter.K())
}
Anti-Patterns
Wrong: Using a single hash function
# BAD -- single hash function causes excessive collisions and high FP rate
import hashlib
class BadBloomFilter:
def __init__(self, size):
self.bits = [False] * size
def add(self, item):
h = int(hashlib.md5(item.encode()).hexdigest(), 16) % len(self.bits)
self.bits[h] = True # Only one bit set per element
Correct: Using k independent hash functions with double hashing
# GOOD -- k hash functions spread bits optimally, matching theoretical FP rate
import mmh3
class GoodBloomFilter:
def __init__(self, expected, fp_rate=0.01):
self.m = int(-expected * math.log(fp_rate) / (math.log(2) ** 2))
self.k = max(1, int((self.m / expected) * math.log(2)))
self.bits = bitarray(self.m)
self.bits.setall(0)
def add(self, item):
for seed in range(self.k):
self.bits[mmh3.hash(str(item), seed) % self.m] = 1
Wrong: Not sizing the filter before use
# BAD -- arbitrary size without considering element count or FP rate
bf_bits = [0] * 1000 # Way too small for 100K elements
# After inserting 100K items, nearly all bits are 1 -- FP rate approaches 100%
Correct: Calculate parameters from requirements
# GOOD -- size derived from actual requirements
n = 100_000 # expected elements
p = 0.01 # 1% false positive rate
bf = BloomFilter(n, p)
# m ~ 958,506 bits (117 KB), k = 7 -- FP rate stays at ~1%
Wrong: Trying to delete from a standard Bloom filter
# BAD -- clearing bits may corrupt other entries
def bad_remove(bf, item):
for seed in range(bf.k):
idx = mmh3.hash(str(item), seed) % bf.m
bf.bits[idx] = 0 # This may unset bits shared with other elements!
Correct: Use Counting Bloom Filter for deletion
# GOOD -- counters allow safe deletion
cbf = CountingBloomFilter(100_000, p=0.01, counter_bits=4)
cbf.add("item_a")
cbf.add("item_b")
cbf.remove("item_a") # Safe -- decrements counters, does not affect item_b
print("item_a" in cbf) # False
print("item_b" in cbf) # True
Common Pitfalls
- Undersized filter: Creating a filter too small for actual data volume causes FP rate to skyrocket as bits saturate. Fix: Always calculate m from
m = -n * ln(p) / (ln(2))^2. [src1] - Wrong number of hash functions: Using too many or too few hash functions relative to m/n ratio wastes space or increases FP. Fix: Use
k = (m/n) * ln(2)and round to nearest integer. [src7] - Cryptographic hash functions: Using SHA-256 or MD5 is orders of magnitude slower than necessary. Fix: Use murmur3, xxHash, or FNV-1a. [src2]
- No thread safety: Most Bloom filter implementations are not thread-safe. Fix: Wrap add/query in a mutex or use a concurrent-safe implementation. [src6]
- Unbounded growth: Continuously inserting beyond capacity n without monitoring FP rate. Fix: Track insertion count and create a new, larger filter when approaching n. [src1]
- Confusing false positive with false negative: Bloom filters never produce false negatives. If "not in set" is returned, the element is definitely absent. Fix: Only "possibly in set" needs verification against the source of truth. [src5]
When to Use / When Not to Use
| Use When | Don't Use When | Use Instead |
|---|---|---|
| Checking if an element was seen before (dedup) | You need exact set membership with zero errors | Hash set / hash table |
| Pre-filtering expensive disk/network lookups | You need to enumerate or iterate set members | Sorted set or B-tree |
| Spell-checking dictionaries | You need to delete elements frequently | Cuckoo filter or hash set |
| Network packet routing (is destination in cache?) | Dataset fits comfortably in memory as a hash set | Hash set (simpler, deterministic) |
| Distributed systems: reducing cross-node queries | You need element frequency counts | Count-Min Sketch |
| Database: avoiding unnecessary disk reads (LSM trees) | You need cardinality estimation (count distinct) | HyperLogLog |
Important Caveats
- The theoretical FP formula
p = (1 - e^(-kn/m))^kis an approximation. Actual FP rate is slightly higher due to hash collisions and non-perfect independence. Always test empirically. - Scalable Bloom Filters (SBF) allow growth by adding new filter layers but increase query time linearly with the number of layers. Consider pre-sizing if possible.
- Counting Bloom Filters use 3-4x more memory than standard Bloom filters. Cuckoo filters are more space-efficient when deletion is needed and FP rate target is below 3%.
- Bloom filters serialized across systems must agree on hash function, seed, bit array encoding, and endianness. Always use the same library version on both ends.
- In distributed databases (Cassandra, HBase), Bloom filters are used per-SSTable. Misconfigured bloom_filter_fp_chance causes excessive disk reads.