pickuma.
Dev Knowledge

LSM-Trees vs B-Trees: The Write-Optimized Database Tradeoff

Why some databases append writes and reconcile later while others edit in place — and how that one choice shapes write throughput, read latency, and disk usage.

7 min read

Every database answers one question before it stores a single byte: when you write a row, do you change the data where it already lives, or do you append the change somewhere new and reconcile later? B-trees pick the first answer. LSM-trees pick the second. That one decision ripples outward into write throughput, read latency, disk usage, and the shape of the workloads each engine handles well. If you have ever wondered why Postgres and Cassandra feel so different under load, this is the root of it.

How each structure actually writes

A B-tree — really a B+tree in almost every production database — keeps keys sorted inside fixed-size pages: 8KB in PostgreSQL, 16KB in MySQL’s InnoDB. To update a row, the engine walks from the root to the leaf page holding that key, then modifies that page in place. The walk is shallow (a tree holding millions of rows is usually three or four levels deep), but the final write lands wherever that page already sits on disk. That means random I/O, and it means write amplification: change 50 bytes and the engine still rewrites the whole 8KB page, plus the write-ahead log entry that protects it.

An LSM-tree (log-structured merge tree) refuses to touch existing files. A write first appends to a write-ahead log, then lands in an in-memory sorted structure called the memtable. When the memtable fills, it is flushed to disk as an immutable, sorted file called an SSTable. Because SSTables are never modified, every flush is a large sequential write — the kind of I/O that both spinning disks and SSDs handle far faster than scattered random writes. Updates and deletes do not overwrite anything; a delete writes a small marker called a tombstone, and the newest value simply shadows older ones.

The catch is that those immutable files pile up. A background process called compaction continuously merges SSTables, drops shadowed values, and reclaims space. Compaction is where the LSM-tree pays its bill: the same logical row may be rewritten several times over its life as it moves through compaction levels.

Where the tradeoffs bite

Reads. A B-tree read is predictable: walk root to leaf, read one page, done. An LSM read may have to check the memtable and then several SSTables across levels, because the key could live in any of them. To avoid touching files that cannot contain the key, each SSTable carries a Bloom filter — a probabilistic index that answers “definitely not here” cheaply, with a tunable false-positive rate around 1%. Bloom filters make point lookups close to a single disk read in practice, but range scans still have to merge results from multiple files, so they remain the B-tree’s stronger suit.

Space. B-trees fragment. Pages split when they fill and rarely sit perfectly packed, so a B-tree often carries meaningful slack on disk. LSM-trees hold obsolete versions and tombstones until compaction clears them, which means transient bloat that depends on how far behind compaction is running. Leveled compaction keeps steady-state overhead low — often around 10% extra — at the cost of more rewriting; size-tiered compaction rewrites less but can temporarily hold multiple full copies during a merge.

Concurrency. Appending to a log and a memtable serializes cleanly, so LSM-trees absorb bursts of concurrent writes gracefully. B-trees must latch the pages they modify, and hot pages can become contention points under heavy concurrent writes.

A useful frame here is the RUM conjecture: across Read overhead, Update overhead, and Memory (space) overhead, you can optimize for two at the expense of the third. B-trees favor reads and space and pay on random updates. LSM-trees favor updates and space efficiency under compaction and pay on read and rewrite amplification. There is no structure that wins all three — only structures that pick a side.

DimensionB-treeLSM-tree
Write patternIn-place, random I/OAppend-only, sequential I/O
Point readOne root-to-leaf walkMemtable + SSTables, cut by Bloom filters
Range scanStrong, keys laid out in orderWeaker, merges across files
Main costPage write amplificationCompaction rewrite amplification
Best fitRead-heavy, mixed OLTPWrite-heavy, high ingest

Which one is in the database you already use

You are almost certainly running both, depending on the project. PostgreSQL, MySQL’s InnoDB, and SQLite are B-tree engines — which is why they shine on transactional, read-heavy work with predictable latency. Cassandra, ScyllaDB, HBase, and the embedded engines LevelDB and RocksDB are LSM-trees, built for heavy ingest and write throughput. The line blurs further inside modern systems: CockroachDB and TiKV store their data in RocksDB-derived LSM engines (Pebble and RocksDB respectively), and MyRocks puts an LSM storage engine under a MySQL front end.

The fastest way to internalize this is to read the source. Open the SSTable flush path in RocksDB, or the page-split logic in Postgres’s nbtree, and the abstract tradeoff turns concrete.

Cursor

An AI code editor with whole-repo context, handy when you are spelunking through a storage engine like RocksDB or Postgres to see how compaction, page splits, and Bloom filters work in real source rather than in diagrams.

Free tier; Pro plan $20/mo

Try Cursor

Affiliate link · We earn a commission at no cost to you.

FAQ

Are LSM-trees always faster for writes?+
Not in every case. They convert random writes into sequential writes, which is a large win on write-heavy ingest, but compaction adds background rewrite work that competes with foreground traffic. On a read-heavy workload a B-tree often delivers lower and steadier overall latency.
Why do LSM databases use Bloom filters?+
Because a key can live in any of several immutable SSTables, a naive point read would have to check all of them. A Bloom filter per SSTable answers definitely-not-present cheaply, so the engine skips files that cannot contain the key and usually touches just one.
Is a B+tree different from a B-tree?+
Yes. A B+tree stores all values in the leaf nodes and links those leaves together, which makes range scans efficient. Nearly every database that says B-tree is actually using a B+tree for exactly that reason.

Related reading

See all Dev Knowledge articles →

Get the best tools, weekly

One email every Friday. No spam, unsubscribe anytime.