Optimal filter size: m = -n * ln(p) / (ln(2))^2, where n = expected elements and p = target FP rate.

Libraries: pybloom_live (Python), bloom-filters (npm), bits-and-blooms/bloom (Go).

| Structure | Space (bits/element) | Insert | Query | Delete | FP Rate | Best For |
|---|---|---|---|---|---|---|
| Standard Bloom Filter | 9.6 (at 1% FP) | O(k) | O(k) | No | Configurable | Set membership, dedup |
| Counting Bloom Filter | 3-4x standard | O(k) | O(k) | Yes | Same as standard | Membership with deletion |
| Cuckoo Filter | ~12 (at 3% FP) | O(1) amortized | O(1) | Yes | Configurable; more space-efficient than Bloom below ~3% | Membership + deletion |
| HyperLogLog | ~1.5 KB total (not per element) | O(1) | O(1) | No | ~2% std error | Cardinality estimation |
| Count-Min Sketch | O(1/ε * ln(1/δ)) counters | O(ln(1/δ)) | O(ln(1/δ)) | No | Over-counts only | Frequency estimation |
| Quotient Filter | 10-25% more than Bloom | O(1) amortized | O(1) amortized | Yes | Configurable | Cache-friendly membership |
| Parameter | Formula | Description |
|---|---|---|
| Optimal bit array size (m) | m = -n * ln(p) / (ln(2))^2 | n = expected elements, p = target FP rate |
| Optimal hash count (k) | k = (m / n) * ln(2) | Minimizes FP for given m and n |
| False positive probability | p = (1 - e^(-kn/m))^k | Actual FP rate after n insertions |
| Bits per element (1% FP) | ~9.6 bits | Independent of element size |
| Bits per element (0.1% FP) | ~14.4 bits | Each 10x reduction costs ~4.8 bits |
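The two bits-per-element rows follow directly from the sizing formula; a quick sanity check:

```python
import math

# Bits per element: m/n = -ln(p) / (ln 2)^2 -- independent of n
for p in (0.01, 0.001, 0.0001):
    print(f"p = {p}: {-math.log(p) / math.log(2) ** 2:.1f} bits/element")
# p = 0.01: 9.6 bits/element
# p = 0.001: 14.4 bits/element
# p = 0.0001: 19.2 bits/element
```

Each 10x tightening of p adds ln(10)/(ln 2)^2 ~ 4.8 bits/element, matching the table.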
START: What problem are you solving?
|
+-- Set membership ("is X in the set?")?
| |
| +-- Need deletion?
| | +-- YES --> Counting Bloom Filter (if space OK) or Cuckoo Filter (if space-constrained)
| | +-- NO --> Standard Bloom Filter (most space-efficient)
| |
| +-- Need exact membership (zero false positives)?
| +-- YES --> Use a hash set instead (not a probabilistic structure)
| +-- NO --> Bloom Filter family (see above)
|
+-- Cardinality estimation ("how many distinct elements?")?
| +-- YES --> HyperLogLog (~1.5 KB, ~2% error)
|
+-- Frequency estimation ("how often does X appear?")?
| +-- YES --> Count-Min Sketch (over-counts, never under-counts)
|
+-- Need cache-friendly disk operations?
| +-- YES --> Quotient Filter (better locality than Bloom)
|
+-- DEFAULT --> Standard Bloom Filter (simplest, most battle-tested)
Determine the bit array size (m) and number of hash functions (k) based on your expected element count (n) and desired false positive rate (p). [src7]
import math
def bloom_filter_params(n, p):
"""Calculate optimal Bloom filter parameters.
Args:
n: Expected number of elements
p: Desired false positive rate (e.g., 0.01 for 1%)
Returns:
(m, k): bit array size, number of hash functions
"""
m = -n * math.log(p) / (math.log(2) ** 2)
k = (m / n) * math.log(2)
return int(math.ceil(m)), int(math.ceil(k))
# Example: 1 million elements, 1% false positive rate
m, k = bloom_filter_params(1_000_000, 0.01)
print(f"Bit array size: {m:,} bits ({m / 8 / 1024:.1f} KB)")
print(f"Hash functions: {k}")
# Output: Bit array size: 9,585,059 bits (1170.1 KB), Hash functions: 7
Verify: Run with n=1000000, p=0.01 → expect m ~ 9.6M bits, k = 7
Create a bit array of size m and implement add/query operations using k hash functions. Use double hashing to simulate k independent hash functions efficiently. [src2]
import math
import mmh3                    # pip install mmh3==4.1.0
from bitarray import bitarray  # pip install bitarray==2.9.2
class BloomFilter:
def __init__(self, n, p=0.01):
self.n = n
self.p = p
        self.m = int(math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.k = max(1, round((self.m / n) * math.log(2)))
self.bits = bitarray(self.m)
self.bits.setall(0)
self.count = 0
    def _hashes(self, item):
        # Double hashing (Kirsch-Mitzenmacher): two base hashes simulate k functions
        h1 = mmh3.hash(str(item), 0) % self.m
        h2 = mmh3.hash(str(item), 1) % self.m or 1  # avoid degenerate h2 == 0
        for i in range(self.k):
            yield (h1 + i * h2) % self.m
def add(self, item):
for pos in self._hashes(item):
self.bits[pos] = 1
self.count += 1
def __contains__(self, item):
return all(self.bits[pos] for pos in self._hashes(item))
Verify: Insert 1000 elements, test 10000 non-existent → FP count ~ 1% of 10000
Always validate your filter against the theoretical FP rate after deployment. [src1]
import random, string
def test_fp_rate(n=100_000, p=0.01):
bf = BloomFilter(n, p)
inserted = set()
for _ in range(n):
item = ''.join(random.choices(string.ascii_letters, k=12))
bf.add(item)
inserted.add(item)
false_positives = 0
test_count = 100_000
for _ in range(test_count):
item = ''.join(random.choices(string.ascii_letters, k=12))
if item not in inserted and item in bf:
false_positives += 1
actual_fp = false_positives / test_count
print(f"Target FP: {p:.4f}, Actual FP: {actual_fp:.4f}")
test_fp_rate()
# Expected: Target FP: 0.0100, Actual FP: ~0.0095-0.0105
Verify: Actual FP rate should be within 10% of target p
If deletion is required, use a Counting Bloom Filter or Cuckoo Filter. [src3]
# Counting Bloom Filter — supports deletion via counters
class CountingBloomFilter:
    def __init__(self, n, p=0.01, counter_bits=4):
        self.m = int(math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.k = max(1, round((self.m / n) * math.log(2)))
        self.max_count = (1 << counter_bits) - 1  # counters saturate here
        self.counters = [0] * self.m
    def _hashes(self, item):
        # Double hashing: two base hashes simulate k functions
        h1 = mmh3.hash(str(item), 0) % self.m
        h2 = mmh3.hash(str(item), 1) % self.m or 1  # avoid degenerate h2 == 0
        for i in range(self.k):
            yield (h1 + i * h2) % self.m
def add(self, item):
for pos in self._hashes(item):
if self.counters[pos] < self.max_count:
self.counters[pos] += 1
    def remove(self, item):
        if item not in self:
            return False
        for pos in self._hashes(item):
            # Skip saturated counters: their true count is unknown, and
            # decrementing them could introduce false negatives.
            if 0 < self.counters[pos] < self.max_count:
                self.counters[pos] -= 1
        return True
def __contains__(self, item):
return all(self.counters[pos] > 0 for pos in self._hashes(item))
Verify: Add "hello", check (True), remove "hello", check again (False)
# Input: list of items to insert, items to query
# Output: membership test results (True = possibly in set, False = definitely not)
import mmh3 # pip install mmh3==4.1.0
from bitarray import bitarray # pip install bitarray==2.9.2
import math
class BloomFilter:
def __init__(self, expected_items, fp_rate=0.01):
self.m = int(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))
self.k = max(1, int((self.m / expected_items) * math.log(2)))
self.bits = bitarray(self.m)
self.bits.setall(0)
def add(self, item):
for seed in range(self.k):
idx = mmh3.hash(str(item), seed) % self.m
self.bits[idx] = 1
def __contains__(self, item):
return all(
self.bits[mmh3.hash(str(item), seed) % self.m]
for seed in range(self.k)
)
bf = BloomFilter(1_000_000, fp_rate=0.001)
bf.add("user:12345")
print("user:12345" in bf) # True
print("user:99999" in bf) # False
// Input: items to add, items to query
// Output: boolean membership test results
class BloomFilter {
constructor(expectedItems, fpRate = 0.01) {
this.m = Math.ceil(-expectedItems * Math.log(fpRate) / (Math.log(2) ** 2));
this.k = Math.max(1, Math.round((this.m / expectedItems) * Math.log(2)));
this.bits = new Uint8Array(Math.ceil(this.m / 8));
}
  _hash(str, seed) {
    // FNV-1a with the offset basis perturbed by the seed to derive k hash functions
    let h = 0x811c9dc5 ^ seed;
for (let i = 0; i < str.length; i++) {
h ^= str.charCodeAt(i);
h = (h * 0x01000193) >>> 0;
}
return h % this.m;
}
add(item) {
const s = String(item);
for (let i = 0; i < this.k; i++) {
const idx = this._hash(s, i);
this.bits[idx >> 3] |= (1 << (idx & 7));
}
}
has(item) {
const s = String(item);
for (let i = 0; i < this.k; i++) {
const idx = this._hash(s, i);
if (!(this.bits[idx >> 3] & (1 << (idx & 7)))) return false;
}
return true;
}
}
const bf = new BloomFilter(100000, 0.01);
bf.add("session:abc123");
console.log(bf.has("session:abc123")); // true
console.log(bf.has("session:xyz789")); // false
// Input: items to add, items to query
// Output: boolean membership test results
package main
import (
"fmt"
"github.com/bits-and-blooms/bloom/v3" // v3.7.0
)
func main() {
filter := bloom.NewWithEstimates(1_000_000, 0.001)
filter.AddString("user:12345")
filter.AddString("user:67890")
fmt.Println(filter.TestString("user:12345")) // true
fmt.Println(filter.TestString("user:99999")) // false
fmt.Printf("Bit array size: %d bits\n", filter.Cap())
fmt.Printf("Hash functions: %d\n", filter.K())
}
# BAD -- single hash function causes excessive collisions and high FP rate
import hashlib
class BadBloomFilter:
def __init__(self, size):
self.bits = [False] * size
def add(self, item):
h = int(hashlib.md5(item.encode()).hexdigest(), 16) % len(self.bits)
self.bits[h] = True # Only one bit set per element
# GOOD -- k hash functions spread bits optimally, matching theoretical FP rate
import math
import mmh3
from bitarray import bitarray
class GoodBloomFilter:
def __init__(self, expected, fp_rate=0.01):
self.m = int(-expected * math.log(fp_rate) / (math.log(2) ** 2))
self.k = max(1, int((self.m / expected) * math.log(2)))
self.bits = bitarray(self.m)
self.bits.setall(0)
def add(self, item):
for seed in range(self.k):
self.bits[mmh3.hash(str(item), seed) % self.m] = 1
# BAD -- arbitrary size without considering element count or FP rate
bf_bits = [0] * 1000 # Way too small for 100K elements
# After inserting 100K items, nearly all bits are 1 -- FP rate approaches 100%
# GOOD -- size derived from actual requirements
n = 100_000 # expected elements
p = 0.01 # 1% false positive rate
bf = BloomFilter(n, p)
# m ~ 958,506 bits (117 KB), k = 7 -- FP rate stays at ~1%
# BAD -- clearing bits may corrupt other entries
def bad_remove(bf, item):
for seed in range(bf.k):
idx = mmh3.hash(str(item), seed) % bf.m
bf.bits[idx] = 0 # This may unset bits shared with other elements!
# GOOD -- counters allow safe deletion
cbf = CountingBloomFilter(100_000, p=0.01, counter_bits=4)
cbf.add("item_a")
cbf.add("item_b")
cbf.remove("item_a") # Safe -- decrements counters, does not affect item_b
print("item_a" in cbf) # False
print("item_b" in cbf) # True
Sizing: m = -n * ln(p) / (ln(2))^2. [src1] Hash count: k = (m/n) * ln(2), rounded to the nearest integer. [src7]

| Use When | Don't Use When | Use Instead |
|---|---|---|
| Checking if an element was seen before (dedup) | You need exact set membership with zero errors | Hash set / hash table |
| Pre-filtering expensive disk/network lookups | You need to enumerate or iterate set members | Sorted set or B-tree |
| Spell-checking dictionaries | You need to delete elements frequently | Cuckoo filter or hash set |
| Network packet routing (is destination in cache?) | Dataset fits comfortably in memory as a hash set | Hash set (simpler, deterministic) |
| Distributed systems: reducing cross-node queries | You need element frequency counts | Count-Min Sketch |
| Database: avoiding unnecessary disk reads (LSM trees) | You need cardinality estimation (count distinct) | HyperLogLog |
p = (1 - e^(-kn/m))^k is an approximation. The actual FP rate runs slightly higher because real hash functions are neither perfect nor fully independent. Always validate empirically.