LSM-Trees vs B-Trees: The Write-Optimized Database Tradeoff
Why some databases append writes and reconcile later while others edit in place — and how that one choice shapes write throughput, read latency, and disk usage.
Every database answers one question before it stores a single byte: when you write a row, do you change the data where it already lives, or do you append the change somewhere new and reconcile later? B-trees pick the first answer. LSM-trees pick the second. That one decision ripples outward into write throughput, read latency, disk usage, and the shape of the workloads each engine handles well. If you have ever wondered why Postgres and Cassandra feel so different under load, this is the root of it.
How each structure actually writes
A B-tree — really a B+tree in almost every production database — keeps keys sorted inside fixed-size pages: 8KB in PostgreSQL, 16KB in MySQL’s InnoDB. To update a row, the engine walks from the root to the leaf page holding that key, then modifies that page in place. The walk is shallow (a tree holding millions of rows is usually three or four levels deep), but the final write lands wherever that page already sits on disk. That means random I/O, and it means write amplification: change 50 bytes and the engine still rewrites the whole 8KB page, plus the write-ahead log entry that protects it.
An LSM-tree (log-structured merge tree) refuses to touch existing files. A write first appends to a write-ahead log, then lands in an in-memory sorted structure called the memtable. When the memtable fills, it is flushed to disk as an immutable, sorted file called an SSTable. Because SSTables are never modified, every flush is a large sequential write — the kind of I/O that both spinning disks and SSDs handle far faster than scattered random writes. Updates and deletes do not overwrite anything; a delete writes a small marker called a tombstone, and the newest value simply shadows older ones.
The catch is that those immutable files pile up. A background process called compaction continuously merges SSTables, drops shadowed values, and reclaims space. Compaction is where the LSM-tree pays its bill: the same logical row may be rewritten several times over its life as it moves through compaction levels.
Where the tradeoffs bite
Reads. A B-tree read is predictable: walk root to leaf, read one page, done. An LSM read may have to check the memtable and then several SSTables across levels, because the key could live in any of them. To avoid touching files that cannot contain the key, each SSTable carries a Bloom filter — a probabilistic index that answers “definitely not here” cheaply, with a tunable false-positive rate around 1%. Bloom filters make point lookups close to a single disk read in practice, but range scans still have to merge results from multiple files, so they remain the B-tree’s stronger suit.
Space. B-trees fragment. Pages split when they fill and rarely sit perfectly packed, so a B-tree often carries meaningful slack on disk. LSM-trees hold obsolete versions and tombstones until compaction clears them, which means transient bloat that depends on how far behind compaction is running. Leveled compaction keeps steady-state overhead low — often around 10% extra — at the cost of more rewriting; size-tiered compaction rewrites less but can temporarily hold multiple full copies during a merge.
Concurrency. Appending to a log and a memtable serializes cleanly, so LSM-trees absorb bursts of concurrent writes gracefully. B-trees must latch the pages they modify, and hot pages can become contention points under heavy concurrent writes.
A useful frame here is the RUM conjecture: across Read overhead, Update overhead, and Memory (space) overhead, you can optimize for two at the expense of the third. B-trees favor reads and space and pay on random updates. LSM-trees favor updates and space efficiency under compaction and pay on read and rewrite amplification. There is no structure that wins all three — only structures that pick a side.
| Dimension | B-tree | LSM-tree |
|---|---|---|
| Write pattern | In-place, random I/O | Append-only, sequential I/O |
| Point read | One root-to-leaf walk | Memtable + SSTables, cut by Bloom filters |
| Range scan | Strong, keys laid out in order | Weaker, merges across files |
| Main cost | Page write amplification | Compaction rewrite amplification |
| Best fit | Read-heavy, mixed OLTP | Write-heavy, high ingest |
Which one is in the database you already use
You are almost certainly running both, depending on the project. PostgreSQL, MySQL’s InnoDB, and SQLite are B-tree engines — which is why they shine on transactional, read-heavy work with predictable latency. Cassandra, ScyllaDB, HBase, and the embedded engines LevelDB and RocksDB are LSM-trees, built for heavy ingest and write throughput. The line blurs further inside modern systems: CockroachDB and TiKV store their data in RocksDB-derived LSM engines (Pebble and RocksDB respectively), and MyRocks puts an LSM storage engine under a MySQL front end.
The fastest way to internalize this is to read the source. Open the SSTable flush path in RocksDB, or the page-split logic in Postgres’s nbtree, and the abstract tradeoff turns concrete.
Cursor
An AI code editor with whole-repo context, handy when you are spelunking through a storage engine like RocksDB or Postgres to see how compaction, page splits, and Bloom filters work in real source rather than in diagrams.
Free tier; Pro plan $20/mo
Affiliate link · We earn a commission at no cost to you.
FAQ
Are LSM-trees always faster for writes?+
Why do LSM databases use Bloom filters?+
Is a B+tree different from a B-tree?+
Related reading
2026-06-10
Copy-on-Write, Explained Through fork() and Snapshots
How copy-on-write defers copying until a write actually happens — the mechanism behind fast fork(), filesystem snapshots, and database MVCC, explained with page tables and page faults.
2026-06-10
A Coroutine Is Not a Thread: What Suspends, What Gets Scheduled, and Why It Matters
A coroutine suspends and resumes cooperatively; a thread is preempted by the OS. Here is the real difference in scheduling, memory, and parallelism — and when each one wins.
2026-06-10
Two's Complement: How Computers Represent Negative Numbers
How two's complement encodes negative integers, why CPUs run signed and unsigned math on one adder, and the edge cases — INT_MIN, overflow, sign extension — that cause real bugs.
2026-06-10
What MVCC Is, and How Databases Let Readers and Writers Coexist
MVCC keeps multiple versions of every row so reads never block writes. Here's how Postgres implements it with xmin/xmax, why your tables bloat, and where snapshot isolation bites.
2026-06-09
What a Merkle Tree Is, and Where You've Already Seen One
A Merkle tree hashes data into a single fingerprint so you can verify any piece without downloading the whole set. Here's how it works and where it already runs in your stack.
Get the best tools, weekly
One email every Friday. No spam, unsubscribe anytime.