What a Write-Ahead Log Is, and Why Databases Trust It
A practical look at the write-ahead log: the durability trick behind Postgres, SQLite, and most databases, and what it means when a server loses power mid-write.
A database has one promise that everything else depends on: when it tells you a write committed, that write survives a crash. Lose power one millisecond after the COMMIT returns, and the row has to still be there when the machine boots back up. The mechanism that makes this promise keepable is older and simpler than most people expect. It is the write-ahead log, and once you see how it works, a lot of database behavior stops looking like magic.
The problem a write-ahead log solves
Imagine a database without a log. You run UPDATE accounts SET balance = balance - 100 WHERE id = 7, and the engine walks to the right page on disk — say an 8KB block — modifies the bytes, and writes the block back. Simple. Now imagine the power dies halfway through that 8KB write. Disks do not write 8KB atomically. You can end up with the first 4KB updated and the last 4KB still holding the old data. That page is now torn: not the old value, not the new value, just corruption. There is no way to recover, because nothing recorded what the page was supposed to become.
The naive fix is to write everything twice in some clever order, but that runs into the same atomicity problem at every layer. The actual fix flips the order of operations. Before touching the real data pages, the database first appends a small record to a separate file describing the change it is about to make. Only after that intent record is safely on disk does it modify the actual table. This is the write-ahead rule, and the name is literal: the log is written ahead of the data.
The key insight is that the log is append-only and sequential. You are never overwriting existing log bytes in place — you are adding to the end. A sequential append is far easier to make durable than a scattered set of random page updates, and on spinning disks it was also dramatically faster because the head never has to seek. The database turns a hard problem (atomically updating data scattered across a file) into an easier one (atomically appending a record to the end of a log).
How the log earns the database’s trust
The protocol has three moving parts: append, flush, and checkpoint. Understanding how they interact is what makes the rest intuitive.
When a transaction commits, the database writes its log records to the WAL and then issues an fsync (or the platform equivalent) to force the operating system to push those bytes past its own write cache and onto durable storage. The COMMIT does not return to your application until that flush completes. This is the moment durability is established — not when the table page is updated, which may not happen for seconds. That single fsync is the hinge the entire durability guarantee swings on, which is also why it is the most common place people accidentally trade away safety for speed.
The actual data pages get updated lazily, in memory, and written back to disk later in batches. If the server crashes before those dirty pages reach disk, no data is lost, because the log already has every committed change. On restart, the database performs recovery: it reads the WAL forward from the last known-good point and replays — “redoes” — every committed change against the data files, and discards any half-finished transaction that never reached a commit record. After replay, the data files match exactly what committed before the crash.
That replay would take longer and longer as the log grew, so databases periodically run a checkpoint: they flush all currently-dirty pages to the data files and record a marker saying “everything before this point is now safely in the main files.” Log written before a checkpoint can be recycled or archived, because recovery never needs to start earlier than the last checkpoint. The checkpoint is the garbage-collection valve that keeps the log bounded and keeps recovery time predictable.
There is one more subtlety worth knowing. Because a single page write isn’t atomic, PostgreSQL by default writes a full copy of each page to the WAL the first time it is modified after a checkpoint — the full_page_writes setting. That copy lets recovery reconstruct a torn page completely rather than trying to patch a half-written one. It costs log volume but closes the torn-page hole entirely. SQLite’s WAL mode solves the same class of problem differently, by keeping new versions of pages in the WAL file and only folding them back into the main database during a checkpoint.
What this means for you when things go wrong
The WAL is not just an internal detail — it surfaces in knobs you will eventually touch. In PostgreSQL, synchronous_commit = off tells the server to return from COMMIT without waiting for the WAL flush. Writes get noticeably faster, and you accept a small window — typically a fraction of a second — where a crash can lose the most recent committed transactions. The data files never corrupt; you just lose the tail. For analytics or recoverable event ingestion that trade-off can be correct. For a payments ledger it is not. The WAL is also the foundation of replication: streaming the log to a replica is how Postgres keeps a standby in sync, and shipping archived WAL segments is how point-in-time recovery rewinds a database to any second in its history.
When you understand the log, a lot of operational behavior becomes legible. Disk filling up with pg_wal segments usually means archiving or a replication slot has stalled and the log can’t be recycled. A long crash-recovery time after an unclean shutdown means a lot of WAL accumulated since the last checkpoint. A replica falling behind is the log not being applied fast enough. These are not separate mysteries; they are all the same log, seen from different angles.
Cursor
Reading a real database engine's recovery code is the fastest way to make the WAL concrete. Cursor's codebase chat lets you open the PostgreSQL or SQLite source, ask how xlog replay works, and trace the commit path without grepping blind.
Free tier; Pro at $20/mo
Affiliate link · We earn a commission at no cost to you.
The write-ahead log is a small idea with an outsized payoff: record your intent durably before you act on it, and you can always recover to a consistent state by replaying intent. That is why nearly every serious database, and a growing number of distributed systems, puts a log at its center.
FAQ
Is the WAL the same thing as a binlog or replication log?+
Does a write-ahead log make my database slower?+
What actually happens to an uncommitted transaction after a crash?+
Related reading
2026-06-09
What a Merkle Tree Is, and Where You've Already Seen One
A Merkle tree hashes data into a single fingerprint so you can verify any piece without downloading the whole set. Here's how it works and where it already runs in your stack.
2026-06-09
Consistent Hashing, Explained Through the Problem It Actually Solves
Why hash(key) % N falls apart when you add a server, how the hash ring fixes it, and what virtual nodes do — a practical walkthrough for developers.
2026-06-08
Database Isolation Levels Explained: The Anomalies Each One Prevents
A practical guide to the four SQL isolation levels, the concurrency anomalies they forbid, and how PostgreSQL and MySQL actually behave at the same level name.
2026-06-08
What the CAP Theorem Actually Means for Your Application
The CAP theorem isn't 'pick two of three.' It's a rule about what happens during a network partition — and most of the time, no partition is happening at all.
2026-06-09
The Circuit Breaker Pattern, Explained for Resilient Systems
How the circuit breaker pattern stops one slow dependency from taking down your whole service — states, thresholds, and the defaults real libraries ship with.
Get the best tools, weekly
One email every Friday. No spam, unsubscribe anytime.