Bloom Filters & Probabilistic Data Structures: Complete Reference

Type: Software Reference | Confidence: 0.92 | Sources: 7 | Verified: 2026-02-24 | Freshness: 2026-02-24

TL;DR

Constraints

Quick Reference

Probabilistic Data Structure Comparison

| Structure | Space (bits/element) | Insert | Query | Delete | FP Rate | Best For |
|---|---|---|---|---|---|---|
| Standard Bloom Filter | 9.6 (at 1% FP) | O(k) | O(k) | No | Configurable | Set membership, dedup |
| Counting Bloom Filter | 3-4x standard | O(k) | O(k) | Yes | Same as standard | Membership with deletion |
| Cuckoo Filter | ~12 bits (at 3% FP) | O(1) amortized | O(1) | Yes | Lower than Bloom below 3% FP | Membership + deletion |
| HyperLogLog | ~1.5 KB fixed (total) | O(1) | O(1) | No | ~2% std error | Cardinality estimation |
| Count-Min Sketch | O((1/ε) * ln(1/δ)) total | O(ln(1/δ)) | O(ln(1/δ)) | No | Over-counts only | Frequency estimation |
| Quotient Filter | 10-25% more than Bloom | O(1) amortized | O(1) amortized | Yes | Configurable | Cache-friendly membership |

Key Formulas (Standard Bloom Filter)

| Parameter | Formula | Description |
|---|---|---|
| Optimal bit array size (m) | m = -n * ln(p) / (ln(2))^2 | n = expected elements, p = target FP rate |
| Optimal hash count (k) | k = (m / n) * ln(2) | Minimizes FP rate for given m and n |
| False positive probability | p = (1 - e^(-kn/m))^k | Actual FP rate after n insertions |
| Bits per element (1% FP) | ~9.6 bits | Independent of element size |
| Bits per element (0.1% FP) | ~14.4 bits | Each 10x FP reduction costs ~4.8 more bits |
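The false positive formula can be sanity-checked numerically: plugging the optimal m and k back into p = (1 - e^(-kn/m))^k should reproduce the target rate. A quick stdlib-only check (function name is illustrative):

```python
import math

def actual_fp_rate(n, m, k):
    """Theoretical FP rate after n insertions into an m-bit filter with k hashes."""
    return (1 - math.exp(-k * n / m)) ** k

n, p = 1_000_000, 0.01
m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # 9,585,059 bits
k = math.ceil((m / n) * math.log(2))                  # 7 hash functions
print(f"{actual_fp_rate(n, m, k):.4f}")               # prints 0.0100
```

The round-trip lands slightly above 1% because k is an integer, not the real-valued optimum.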

Decision Tree

START: What problem are you solving?
|
+-- Set membership ("is X in the set?")?
|   |
|   +-- Need deletion?
|   |   +-- YES --> Counting Bloom Filter (if space OK) or Cuckoo Filter (if space-constrained)
|   |   +-- NO --> Standard Bloom Filter (most space-efficient)
|   |
|   +-- Need exact membership (zero false positives)?
|       +-- YES --> Use a hash set instead (not a probabilistic structure)
|       +-- NO --> Bloom Filter family (see above)
|
+-- Cardinality estimation ("how many distinct elements?")?
|   +-- YES --> HyperLogLog (~1.5 KB, ~2% error)
|
+-- Frequency estimation ("how often does X appear?")?
|   +-- YES --> Count-Min Sketch (over-counts, never under-counts)
|
+-- Need cache-friendly disk operations?
|   +-- YES --> Quotient Filter (better locality than Bloom)
|
+-- DEFAULT --> Standard Bloom Filter (simplest, most battle-tested)
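For the cardinality branch, HyperLogLog needs none of the Bloom machinery: it keeps only the maximum leading-zero run seen per register. A minimal sketch, not a production implementation; with b=11 (2048 six-bit registers, ~1.5 KB) the standard error is roughly 1.04/sqrt(2048) ≈ 2.3%, matching the figures in the comparison table. The hash choice and correction threshold follow common defaults:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: 2^b registers, ~1.04/sqrt(2^b) relative error."""
    def __init__(self, b=11):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction, valid for m >= 128

    def add(self, item):
        # 64-bit hash: first b bits pick a register, the rest feed rho
        x = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = x >> (64 - self.b)
        w = x & ((1 << (64 - self.b)) - 1)
        rho = (64 - self.b) - w.bit_length() + 1  # position of leftmost 1-bit
        self.registers[j] = max(self.registers[j], rho)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * self.m:  # small-range correction: fall back to linear counting
            zeros = self.registers.count(0)
            if zeros:
                raw = self.m * math.log(self.m / zeros)
        return int(raw)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user:{i}")
print(hll.count())  # close to 100,000 (within a few percent)
```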

Step-by-Step Guide

1. Calculate optimal filter parameters

Determine the bit array size (m) and number of hash functions (k) based on your expected element count (n) and desired false positive rate (p). [src7]

import math

def bloom_filter_params(n, p):
    """Calculate optimal Bloom filter parameters.
    Args:
        n: Expected number of elements
        p: Desired false positive rate (e.g., 0.01 for 1%)
    Returns:
        (m, k): bit array size, number of hash functions
    """
    m = -n * math.log(p) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return int(math.ceil(m)), int(math.ceil(k))

# Example: 1 million elements, 1% false positive rate
m, k = bloom_filter_params(1_000_000, 0.01)
print(f"Bit array size: {m:,} bits ({m / 8 / 1024:.1f} KB)")
print(f"Hash functions: {k}")
# Output: Bit array size: 9,585,059 bits (1170.1 KB), Hash functions: 7

Verify: Run with n=1000000, p=0.01 → expect m ~ 9.6M bits, k = 7

2. Implement the Bloom filter

Create a bit array of size m and implement add/query operations using k hash functions. Use double hashing to simulate k independent hash functions efficiently. [src2]

import math
import mmh3  # pip install mmh3==4.1.0
from bitarray import bitarray  # pip install bitarray==2.9.2

class BloomFilter:
    def __init__(self, n, p=0.01):
        self.n = n
        self.p = p
        self.m = int(-n * math.log(p) / (math.log(2) ** 2))
        self.k = max(1, int((self.m / n) * math.log(2)))
        self.bits = bitarray(self.m)
        self.bits.setall(0)
        self.count = 0

    def _hashes(self, item):
        h1 = mmh3.hash(str(item), 0) % self.m
        h2 = mmh3.hash(str(item), 1) % self.m
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item):
        for pos in self._hashes(item):
            self.bits[pos] = 1
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._hashes(item))

Verify: Insert 1000 elements, test 10000 non-existent → FP count ~ 1% of 10000

3. Test actual false positive rate

Always validate the filter's measured FP rate against the theoretical target before relying on it in production. [src1]

import random, string

def test_fp_rate(n=100_000, p=0.01):
    bf = BloomFilter(n, p)
    inserted = set()
    for _ in range(n):
        item = ''.join(random.choices(string.ascii_letters, k=12))
        bf.add(item)
        inserted.add(item)

    false_positives = 0
    test_count = 100_000
    for _ in range(test_count):
        item = ''.join(random.choices(string.ascii_letters, k=12))
        if item not in inserted and item in bf:
            false_positives += 1

    actual_fp = false_positives / test_count
    print(f"Target FP: {p:.4f}, Actual FP: {actual_fp:.4f}")

test_fp_rate()
# Expected: Target FP: 0.0100, Actual FP: ~0.0095-0.0105

Verify: Actual FP rate should be within 10% of target p

4. Choose the right variant for your use case

If deletion is required, use a Counting Bloom Filter or Cuckoo Filter. [src3]

# Counting Bloom Filter — supports deletion via counters
class CountingBloomFilter:
    def __init__(self, n, p=0.01, counter_bits=4):
        self.m = int(-n * math.log(p) / (math.log(2) ** 2))
        self.k = int((self.m / n) * math.log(2))
        self.max_count = (1 << counter_bits) - 1
        self.counters = [0] * self.m

    def _hashes(self, item):
        h1 = mmh3.hash(str(item), 0) % self.m
        h2 = mmh3.hash(str(item), 1) % self.m
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item):
        for pos in self._hashes(item):
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1

    def remove(self, item):
        # Caveat: a counter stuck at max_count has lost its true count,
        # so removals touching saturated counters can be inexact.
        if item not in self:
            return False
        for pos in self._hashes(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1
        return True

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._hashes(item))

Verify: Add "hello", check (True), remove "hello", check again (False)

Code Examples

Python: Standard Bloom Filter with mmh3

# Input:  list of items to insert, items to query
# Output: membership test results (True = possibly in set, False = definitely not)

import mmh3  # pip install mmh3==4.1.0
from bitarray import bitarray  # pip install bitarray==2.9.2
import math

class BloomFilter:
    def __init__(self, expected_items, fp_rate=0.01):
        self.m = int(-expected_items * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, int((self.m / expected_items) * math.log(2)))
        self.bits = bitarray(self.m)
        self.bits.setall(0)

    def add(self, item):
        for seed in range(self.k):
            idx = mmh3.hash(str(item), seed) % self.m
            self.bits[idx] = 1

    def __contains__(self, item):
        return all(
            self.bits[mmh3.hash(str(item), seed) % self.m]
            for seed in range(self.k)
        )

bf = BloomFilter(1_000_000, fp_rate=0.001)
bf.add("user:12345")
print("user:12345" in bf)  # True
print("user:99999" in bf)  # False (almost certainly; FP chance ~0.1%)

JavaScript: Bloom Filter for Browser/Node.js

// Input:  items to add, items to query
// Output: boolean membership test results

class BloomFilter {
  constructor(expectedItems, fpRate = 0.01) {
    this.m = Math.ceil(-expectedItems * Math.log(fpRate) / (Math.log(2) ** 2));
    this.k = Math.max(1, Math.round((this.m / expectedItems) * Math.log(2)));
    this.bits = new Uint8Array(Math.ceil(this.m / 8));
  }

  _hash(str, seed) {
    let h = 0x811c9dc5 ^ seed;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = (h * 0x01000193) >>> 0;
    }
    return h % this.m;
  }

  add(item) {
    const s = String(item);
    for (let i = 0; i < this.k; i++) {
      const idx = this._hash(s, i);
      this.bits[idx >> 3] |= (1 << (idx & 7));
    }
  }

  has(item) {
    const s = String(item);
    for (let i = 0; i < this.k; i++) {
      const idx = this._hash(s, i);
      if (!(this.bits[idx >> 3] & (1 << (idx & 7)))) return false;
    }
    return true;
  }
}

const bf = new BloomFilter(100000, 0.01);
bf.add("session:abc123");
console.log(bf.has("session:abc123")); // true
console.log(bf.has("session:xyz789")); // false

Go: Bloom Filter using bits-and-blooms

// Input:  items to add, items to query
// Output: boolean membership test results

package main

import (
	"fmt"
	"github.com/bits-and-blooms/bloom/v3" // v3.7.0
)

func main() {
	filter := bloom.NewWithEstimates(1_000_000, 0.001)

	filter.AddString("user:12345")
	filter.AddString("user:67890")

	fmt.Println(filter.TestString("user:12345")) // true
	fmt.Println(filter.TestString("user:99999")) // false

	fmt.Printf("Bit array size: %d bits\n", filter.Cap())
	fmt.Printf("Hash functions: %d\n", filter.K())
}

Anti-Patterns

Wrong: Using a single hash function

# BAD -- single hash function causes excessive collisions and high FP rate
import hashlib

class BadBloomFilter:
    def __init__(self, size):
        self.bits = [False] * size

    def add(self, item):
        h = int(hashlib.md5(item.encode()).hexdigest(), 16) % len(self.bits)
        self.bits[h] = True  # Only one bit set per element

Correct: Using k independent hash functions with double hashing

# GOOD -- k hash functions spread bits optimally, matching theoretical FP rate
import math
import mmh3
from bitarray import bitarray

class GoodBloomFilter:
    def __init__(self, expected, fp_rate=0.01):
        self.m = int(-expected * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, int((self.m / expected) * math.log(2)))
        self.bits = bitarray(self.m)
        self.bits.setall(0)

    def add(self, item):
        for seed in range(self.k):
            self.bits[mmh3.hash(str(item), seed) % self.m] = 1

Wrong: Not sizing the filter before use

# BAD -- arbitrary size without considering element count or FP rate
bf_bits = [0] * 1000  # Way too small for 100K elements
# After inserting 100K items, nearly all bits are 1 -- FP rate approaches 100%

Correct: Calculate parameters from requirements

# GOOD -- size derived from actual requirements
n = 100_000     # expected elements
p = 0.01        # 1% false positive rate
bf = BloomFilter(n, p)
# m ~ 958,506 bits (117 KB), k = 7 -- FP rate stays at ~1%

Wrong: Trying to delete from a standard Bloom filter

# BAD -- clearing bits may corrupt other entries
def bad_remove(bf, item):
    for seed in range(bf.k):
        idx = mmh3.hash(str(item), seed) % bf.m
        bf.bits[idx] = 0  # This may unset bits shared with other elements!

Correct: Use Counting Bloom Filter for deletion

# GOOD -- counters allow safe deletion
cbf = CountingBloomFilter(100_000, p=0.01, counter_bits=4)
cbf.add("item_a")
cbf.add("item_b")
cbf.remove("item_a")  # Safe -- decrements counters, does not affect item_b
print("item_a" in cbf)  # False
print("item_b" in cbf)  # True

Common Pitfalls

When to Use / When Not to Use

| Use When | Don't Use When | Use Instead |
|---|---|---|
| Checking if an element was seen before (dedup) | You need exact set membership with zero errors | Hash set / hash table |
| Pre-filtering expensive disk/network lookups | You need to enumerate or iterate set members | Sorted set or B-tree |
| Spell-checking dictionaries | You need to delete elements frequently | Cuckoo filter or hash set |
| Network packet routing (is destination in cache?) | Dataset fits comfortably in memory as a hash set | Hash set (simpler, deterministic) |
| Distributed systems: reducing cross-node queries | You need element frequency counts | Count-Min Sketch |
| Database: avoiding unnecessary disk reads (LSM trees) | You need cardinality estimation (count distinct) | HyperLogLog |

Important Caveats

Related Units