What a Bloom Filter Actually Saves You (and When It Lies)

You reach for a HashSet when you need to ask “have I seen this before?” It works, it is exact, and it is fine until the set gets large enough that holding every element in memory becomes the bottleneck. A bloom filter is the answer to a narrower question: “is it safe to skip the expensive lookup?” It answers that in a fraction of the space, and the price is that it occasionally answers wrong in exactly one direction.

That one-directional error is the whole point, and it is also where people get burned. So let’s be precise about what you are buying.

The trade, in actual numbers

A bloom filter is a bit array plus a handful of hash functions. To add an element, you hash it k times, map each hash to a position in an m-bit array, and set those bits to 1. To test membership, you hash the same way and check whether all k bits are already set. If any bit is 0, the element is definitely not in the set. If all bits are 1, the element is probably in the set.

That asymmetry is the contract:

No false negatives. If the filter says “not present,” it is telling the truth. Always.
False positives are possible. If the filter says “present,” it might be lying, because some other elements happened to set the same bits.

The false-positive rate is tunable, and the math is friendlier than you’d guess. For a target rate p, you need roughly -ln(p) / (ln 2)^2 bits per element. Run the numbers:

A 1% false-positive rate costs about 9.6 bits per element — call it 1.2 bytes.
A 0.1% rate costs about 14.4 bits per element — under 1.8 bytes.

Compare that to storing the elements themselves. Even if you only kept a 64-bit hash of each item in a HashSet, that’s 8 bytes per element before container overhead, and real hash tables carry pointers, load-factor slack, and bucket metadata that push the true cost well past that. A bloom filter at 1% gives you membership testing for roughly one-seventh the bytes of bare hashes, and it does not grow with the size of the elements — a 2 KB URL and a 4-byte integer both cost the same ~9.6 bits.

Where it actually pays off

A bloom filter earns its place when a “no” lets you skip something genuinely expensive — a disk seek, a network round-trip, a cold cache read. The filter sits in front of the slow path and absorbs the majority of negative lookups in memory.

The canonical example lives inside the databases you already use. LSM-tree storage engines — RocksDB, Cassandra, LevelDB — store data across many on-disk sorted files (SSTables). Without help, a lookup for a missing key would have to check several files on disk. Each SSTable carries a bloom filter, so the engine asks the filter first: “could this key be in this file?” A “no” skips the disk read entirely. With a 1% false-positive rate, you avoid roughly 99% of the pointless disk seeks for keys that aren’t there, at a memory cost small enough to keep the filters resident.

The pattern generalizes anywhere the negative case dominates and is cheap to short-circuit:

Caching layers checking “have we definitely never cached this?” before hitting the origin.
Crawlers and queues deduping URLs or job IDs without keeping the full visited set in RAM.
Write paths that want to skip a uniqueness check against a remote store when the key is obviously new.

The common thread: the filter only has to be right about “no.” A false positive just means you fall through to the exact check you would have done anyway. You pay a little wasted work, not a wrong answer.

The lies, and the ones people forget

The false positive is the famous failure mode, but two others bite harder because they’re quieter.

You can’t delete from a standard bloom filter. Clearing the bits for one element would also clear bits shared with others, reintroducing false negatives and breaking the one guarantee you cared about. If your set shrinks over time, a plain bloom filter is the wrong tool. A counting bloom filter (each slot is a small counter instead of a single bit) supports deletion, at several times the memory. If churn is heavy, you often just rebuild the filter periodically instead.

The false-positive rate is a promise about a specific size. You size the filter for n elements. Push past n and the bit array saturates — more bits flip to 1, and the actual false-positive rate climbs well above your target. A filter built for a million entries and fed ten million isn’t “a bit worse,” it’s approaching uselessly noisy. If your data volume is unknown or unbounded, look at a scalable bloom filter that chains progressively larger filters as it fills.

None of this requires deep math to use in practice — the libraries handle the bit twiddling. What it requires is matching the structure to a workload where “definitely no” is the valuable answer and a rare false “yes” is harmless.

Cursor

When you're wiring a bloom filter into a hot path, an AI editor that reads your whole repo helps you spot whether the false-positive fall-through actually hits an exact check — or silently trusts the filter.

Free tier; Pro from $20/mo

Try Cursor

Affiliate link · We earn a commission at no cost to you.

The mental model to keep: a bloom filter doesn’t store your data, it stores a compressed, lossy hint about your data. The compression is the gift. The lossiness is the bill. As long as your code reads the hint as “safe to skip” rather than “known fact,” the bill stays small.

FAQ

Why can't a bloom filter have false negatives?

Adding an element only ever sets bits to 1, never back to 0. So if all of an element's bits were set when it was added, they are still set at query time. A 'not present' answer means at least one required bit is 0, which can only happen if the element was never added.

How is a bloom filter different from a hash table?

A hash table stores the actual keys (or pointers to them) and gives exact answers, including handling deletions. A bloom filter stores only a bit pattern, uses far less memory, and trades exactness for a tunable false-positive rate. Use a hash table when you need certainty or the keys themselves; use a bloom filter when you only need a cheap, memory-light 'definitely not present' signal in front of a slower exact check.

What false-positive rate should I pick?

It depends on the cost of a false positive in your system. If a miss just triggers a disk read or a remote lookup you'd otherwise do anyway, 1% (about 9.6 bits per element) is a common, cheap default. If each false positive is more costly, drop to 0.1% for roughly 14.4 bits per element. Going much lower hits diminishing returns fast — every additional order of magnitude costs roughly another 4.8 bits per element.

What a Bloom Filter Actually Saves You (and When It Lies)

The trade, in actual numbers

Where it actually pays off

The lies, and the ones people forget

Cursor

FAQ

TCP vs UDP, Explained Through What Breaks When You Pick Wrong

Write-Ahead Logging: How Databases Survive a Power Cut

Backpressure, Explained Through a Queue That Won't Fall Over

Idempotency, Explained Through the Retry That Doesn't Double-Charge

Git Plumbing in Practice: How CI, Review Tools, and AI Agents Build on Git's Primitives

Get the best tools, weekly