pickuma.
Dev Knowledge

What a Bloom Filter Is, and Where You've Already Used One

A Bloom filter answers set-membership with 'probably yes' or 'definitely no' in a fraction of the space a hash set needs. Here's how it works and where it's quietly running in tools you already use.

6 min read

A Bloom filter is a small data structure that answers one question — is this item in the set? — and gives one of two replies: “definitely not” or “probably yes.” It never says “definitely yes,” and it never wrongly says “definitely not.” That lopsidedness is the entire reason it exists, and it’s why the structure has been quietly sitting inside databases, browsers, and CDNs you use every day since Burton Howard Bloom described it in 1970.

How it actually works

Start with a bit array of m bits, all set to zero, and pick k independent hash functions. Each hash function maps any input to one position in the array.

To add an element, run it through all k hash functions, take the k positions you get back, and set those bits to 1. That’s the whole insertion.

To query whether an element is present, hash it the same way and look at the same k positions. If any of those bits is 0, the element was definitely never added — because adding it would have forced all k bits to 1. If all k bits are 1, the element is probably present. “Probably,” because those bits could have been set to 1 by other elements that happened to hash to the same positions. That collision is a false positive.

The asymmetry falls straight out of the mechanism:

  • A false negative is impossible. If an item was added, its bits are 1, full stop.
  • A false positive is possible. Enough unrelated insertions can light up all k of an absent item’s bits by coincidence.

So a Bloom filter is a fast, cheap pre-check. “Definitely not” is a final answer you can trust. “Probably yes” means go do the expensive real lookup to be sure.

The math is friendlier than it looks

The false-positive rate depends on three things: the number of bits m, the number of inserted elements n, and the number of hash functions k. For a chosen target rate, there’s an optimal number of hash functions, roughly k = (m/n) × ln 2. Use too few and individual collisions matter too much; use too many and you saturate the array with 1s.

The space numbers are the part worth memorizing, because they’re what make the structure worth reaching for:

  • About 9.6 bits per element gets you a 1% false-positive rate.
  • About 14.4 bits per element gets you 0.1%.
  • Every additional 10× reduction in the false-positive rate costs roughly 4.8 more bits per element.

Notice what’s missing from those figures: the size of the elements. Whether you’re storing 12-character usernames or 200-byte URLs, the cost is the same handful of bits each, because only the hashes are stored. A hash set holding a million 64-byte keys needs tens of megabytes; a Bloom filter answering the same membership question at 1% error needs a little over a megabyte. That’s the pitch.

Where you’ve already used one

You’ve almost certainly hit a Bloom filter today without knowing it.

Storage engines. LSM-tree databases — Cassandra, HBase, Google’s Bigtable, and the RocksDB/LevelDB engines underneath countless other systems — keep data in sorted files on disk called SSTables. Looking up a key that doesn’t exist would mean touching several files, each a potential disk read. So each SSTable carries a Bloom filter of its keys. Before reading the file, the engine asks the filter: could this key be here? A “definitely not” skips the disk entirely. Reads for nonexistent keys are common (cache misses, first-time writes checking for duplicates), so this saves an enormous amount of I/O.

CDNs and caches. A large fraction of objects a CDN sees are “one-hit wonders” — requested exactly once, never again. Caching them wastes space. A common pattern is to admit an object to the cache only on its second sighting, and a Bloom filter is the cheap way to remember “have I seen this URL before?” across an enormous request stream.

Browsers. Earlier versions of Chrome’s Safe Browsing used a Bloom filter of malicious URL patterns as a local first pass. “Definitely not on the list” let the page load with zero network round-trips; a “probably” triggered a confirming lookup against Google’s servers. Most sites are safe, so most checks ended at the local filter.

Apps and protocols. Medium has described using Bloom filters to avoid recommending articles you’ve already read. Bitcoin’s BIP-37 lightweight clients used them so a node could request only transactions relevant to its wallet without revealing exactly which addresses it owned. Spell checkers used them for decades to test “is this a real word?” in a tiny memory footprint.

The common thread: a slow or expensive definitive check (disk, network, full database) sitting behind a fast “is it even worth checking?” gate, where the answer is usually no.

FAQ

How is a Bloom filter different from a hash set?
A hash set stores the actual elements and gives exact answers, but its memory grows with both the count and the size of the elements. A Bloom filter stores only bits set by hashing, uses a small constant number of bits per element regardless of element size, and accepts a tunable rate of false positives in exchange. Use a hash set when you need exact membership or need to retrieve elements; use a Bloom filter as a space-cheap pre-filter in front of an expensive lookup.
Can I delete items from a Bloom filter?
Not from a standard one — bits are shared between elements, so clearing an element's bits could introduce false negatives. A counting Bloom filter supports deletion by storing a small counter per slot instead of a single bit, at higher memory cost. There are also newer variants like cuckoo filters that support deletion more compactly.
What false-positive rate should I pick?
Work backward from the cost of a false positive. If a false positive just means doing the real lookup you'd have done anyway, 1% (about 9.6 bits per element) is a common, cheap choice. If a false positive triggers something expensive or user-visible, drop to 0.1% (about 14.4 bits per element) — each 10x reduction costs roughly 4.8 more bits per element.

Related reading

See all Dev Knowledge articles →

Get the best tools, weekly

One email every Friday. No spam, unsubscribe anytime.