What a Bloom Filter Is, and Where You've Already Used One
A Bloom filter answers set-membership with 'probably yes' or 'definitely no' in a fraction of the space a hash set needs. Here's how it works and where it's quietly running in tools you already use.
A Bloom filter is a small data structure that answers one question — is this item in the set? — and gives one of two replies: “definitely not” or “probably yes.” It never says “definitely yes,” and it never wrongly says “definitely not.” That lopsidedness is the entire reason it exists, and it’s why the structure has been quietly sitting inside databases, browsers, and CDNs you use every day since Burton Howard Bloom described it in 1970.
How it actually works
Start with a bit array of m bits, all set to zero, and pick k independent hash functions. Each hash function maps any input to one position in the array.
To add an element, run it through all k hash functions, take the k positions you get back, and set those bits to 1. That’s the whole insertion.
To query whether an element is present, hash it the same way and look at the same k positions. If any of those bits is 0, the element was definitely never added — because adding it would have forced all k bits to 1. If all k bits are 1, the element is probably present. “Probably,” because those bits could have been set to 1 by other elements that happened to hash to the same positions. That collision is a false positive.
The asymmetry falls straight out of the mechanism:
- A false negative is impossible. If an item was added, its bits are 1, full stop.
- A false positive is possible. Enough unrelated insertions can light up all
kof an absent item’s bits by coincidence.
So a Bloom filter is a fast, cheap pre-check. “Definitely not” is a final answer you can trust. “Probably yes” means go do the expensive real lookup to be sure.
The math is friendlier than it looks
The false-positive rate depends on three things: the number of bits m, the number of inserted elements n, and the number of hash functions k. For a chosen target rate, there’s an optimal number of hash functions, roughly k = (m/n) × ln 2. Use too few and individual collisions matter too much; use too many and you saturate the array with 1s.
The space numbers are the part worth memorizing, because they’re what make the structure worth reaching for:
- About 9.6 bits per element gets you a 1% false-positive rate.
- About 14.4 bits per element gets you 0.1%.
- Every additional 10× reduction in the false-positive rate costs roughly 4.8 more bits per element.
Notice what’s missing from those figures: the size of the elements. Whether you’re storing 12-character usernames or 200-byte URLs, the cost is the same handful of bits each, because only the hashes are stored. A hash set holding a million 64-byte keys needs tens of megabytes; a Bloom filter answering the same membership question at 1% error needs a little over a megabyte. That’s the pitch.
Where you’ve already used one
You’ve almost certainly hit a Bloom filter today without knowing it.
Storage engines. LSM-tree databases — Cassandra, HBase, Google’s Bigtable, and the RocksDB/LevelDB engines underneath countless other systems — keep data in sorted files on disk called SSTables. Looking up a key that doesn’t exist would mean touching several files, each a potential disk read. So each SSTable carries a Bloom filter of its keys. Before reading the file, the engine asks the filter: could this key be here? A “definitely not” skips the disk entirely. Reads for nonexistent keys are common (cache misses, first-time writes checking for duplicates), so this saves an enormous amount of I/O.
CDNs and caches. A large fraction of objects a CDN sees are “one-hit wonders” — requested exactly once, never again. Caching them wastes space. A common pattern is to admit an object to the cache only on its second sighting, and a Bloom filter is the cheap way to remember “have I seen this URL before?” across an enormous request stream.
Browsers. Earlier versions of Chrome’s Safe Browsing used a Bloom filter of malicious URL patterns as a local first pass. “Definitely not on the list” let the page load with zero network round-trips; a “probably” triggered a confirming lookup against Google’s servers. Most sites are safe, so most checks ended at the local filter.
Apps and protocols. Medium has described using Bloom filters to avoid recommending articles you’ve already read. Bitcoin’s BIP-37 lightweight clients used them so a node could request only transactions relevant to its wallet without revealing exactly which addresses it owned. Spell checkers used them for decades to test “is this a real word?” in a tiny memory footprint.
The common thread: a slow or expensive definitive check (disk, network, full database) sitting behind a fast “is it even worth checking?” gate, where the answer is usually no.
FAQ
How is a Bloom filter different from a hash set?
Can I delete items from a Bloom filter?
What false-positive rate should I pick?
Related reading
2026-06-22
TCP vs UDP, Explained Through What Breaks When You Pick Wrong
TCP and UDP aren't interchangeable. We walk through the exact failure modes — head-of-line blocking, silent packet loss, Nagle delays — that show up when you pick the wrong transport.
2026-06-22
Write-Ahead Logging: How Databases Survive a Power Cut
How write-ahead logging keeps your data intact when the machine dies mid-write — the log-first rule, fsync, checkpoints, and why PostgreSQL and SQLite both rely on it.
2026-06-22
Backpressure, Explained Through a Queue That Won't Fall Over
What backpressure actually is, why an unbounded queue is a memory leak in disguise, and the four strategies a producer can take when a consumer falls behind.
2026-06-22
What a Bloom Filter Actually Saves You (and When It Lies)
A bloom filter trades a small false-positive rate for big memory savings. Here is the math behind the trade, where it pays off, and the failure mode that bites people.
2026-06-22
Idempotency, Explained Through the Retry That Doesn't Double-Charge
A practical look at idempotency keys: why a retried payment request shouldn't charge a card twice, how the pattern works, and where it quietly breaks in production.
Get the best tools, weekly
One email every Friday. No spam, unsubscribe anytime.