pickuma.
Dev Knowledge

What a Write-Ahead Log Is, and Why Databases Trust It

A practical look at the write-ahead log: the durability trick behind Postgres, SQLite, and most databases, and what it means when a server loses power mid-write.

7 min read

A database has one promise that everything else depends on: when it tells you a write committed, that write survives a crash. Lose power one millisecond after the COMMIT returns, and the row has to still be there when the machine boots back up. The mechanism that makes this promise keepable is older and simpler than most people expect. It is the write-ahead log, and once you see how it works, a lot of database behavior stops looking like magic.

The problem a write-ahead log solves

Imagine a database without a log. You run UPDATE accounts SET balance = balance - 100 WHERE id = 7, and the engine walks to the right page on disk — say an 8KB block — modifies the bytes, and writes the block back. Simple. Now imagine the power dies halfway through that 8KB write. Disks do not write 8KB atomically. You can end up with the first 4KB updated and the last 4KB still holding the old data. That page is now torn: not the old value, not the new value, just corruption. There is no way to recover, because nothing recorded what the page was supposed to become.

The naive fix is to write everything twice in some clever order, but that runs into the same atomicity problem at every layer. The actual fix flips the order of operations. Before touching the real data pages, the database first appends a small record to a separate file describing the change it is about to make. Only after that intent record is safely on disk does it modify the actual table. This is the write-ahead rule, and the name is literal: the log is written ahead of the data.

The key insight is that the log is append-only and sequential. You are never overwriting existing log bytes in place — you are adding to the end. A sequential append is far easier to make durable than a scattered set of random page updates, and on spinning disks it was also dramatically faster because the head never has to seek. The database turns a hard problem (atomically updating data scattered across a file) into an easier one (atomically appending a record to the end of a log).

How the log earns the database’s trust

The protocol has three moving parts: append, flush, and checkpoint. Understanding how they interact is what makes the rest intuitive.

When a transaction commits, the database writes its log records to the WAL and then issues an fsync (or the platform equivalent) to force the operating system to push those bytes past its own write cache and onto durable storage. The COMMIT does not return to your application until that flush completes. This is the moment durability is established — not when the table page is updated, which may not happen for seconds. That single fsync is the hinge the entire durability guarantee swings on, which is also why it is the most common place people accidentally trade away safety for speed.

The actual data pages get updated lazily, in memory, and written back to disk later in batches. If the server crashes before those dirty pages reach disk, no data is lost, because the log already has every committed change. On restart, the database performs recovery: it reads the WAL forward from the last known-good point and replays — “redoes” — every committed change against the data files, and discards any half-finished transaction that never reached a commit record. After replay, the data files match exactly what committed before the crash.

That replay would take longer and longer as the log grew, so databases periodically run a checkpoint: they flush all currently-dirty pages to the data files and record a marker saying “everything before this point is now safely in the main files.” Log written before a checkpoint can be recycled or archived, because recovery never needs to start earlier than the last checkpoint. The checkpoint is the garbage-collection valve that keeps the log bounded and keeps recovery time predictable.

There is one more subtlety worth knowing. Because a single page write isn’t atomic, PostgreSQL by default writes a full copy of each page to the WAL the first time it is modified after a checkpoint — the full_page_writes setting. That copy lets recovery reconstruct a torn page completely rather than trying to patch a half-written one. It costs log volume but closes the torn-page hole entirely. SQLite’s WAL mode solves the same class of problem differently, by keeping new versions of pages in the WAL file and only folding them back into the main database during a checkpoint.

What this means for you when things go wrong

The WAL is not just an internal detail — it surfaces in knobs you will eventually touch. In PostgreSQL, synchronous_commit = off tells the server to return from COMMIT without waiting for the WAL flush. Writes get noticeably faster, and you accept a small window — typically a fraction of a second — where a crash can lose the most recent committed transactions. The data files never corrupt; you just lose the tail. For analytics or recoverable event ingestion that trade-off can be correct. For a payments ledger it is not. The WAL is also the foundation of replication: streaming the log to a replica is how Postgres keeps a standby in sync, and shipping archived WAL segments is how point-in-time recovery rewinds a database to any second in its history.

When you understand the log, a lot of operational behavior becomes legible. Disk filling up with pg_wal segments usually means archiving or a replication slot has stalled and the log can’t be recycled. A long crash-recovery time after an unclean shutdown means a lot of WAL accumulated since the last checkpoint. A replica falling behind is the log not being applied fast enough. These are not separate mysteries; they are all the same log, seen from different angles.

Cursor

Reading a real database engine's recovery code is the fastest way to make the WAL concrete. Cursor's codebase chat lets you open the PostgreSQL or SQLite source, ask how xlog replay works, and trace the commit path without grepping blind.

Free tier; Pro at $20/mo

Try Cursor

Affiliate link · We earn a commission at no cost to you.

The write-ahead log is a small idea with an outsized payoff: record your intent durably before you act on it, and you can always recover to a consistent state by replaying intent. That is why nearly every serious database, and a growing number of distributed systems, puts a log at its center.

FAQ

Is the WAL the same thing as a binlog or replication log?+
They overlap but aren't identical. PostgreSQL's WAL is both the durability log and the replication stream. MySQL separates them: the InnoDB redo log handles crash recovery, while the binary log (binlog) handles replication and point-in-time recovery. Same underlying principle, different division of labor.
Does a write-ahead log make my database slower?+
It adds one sequential append and a flush per commit, but it usually makes the system faster overall by turning random data-page writes into deferred batches and a single sequential log write. The cost lands on commit latency, which is why batching transactions and tuning the flush behavior matter more than the log's raw existence.
What actually happens to an uncommitted transaction after a crash?+
During recovery the database replays committed changes and ignores any transaction whose commit record never made it to the log. Its partial changes either were never applied to the data files or get rolled back, so the database comes up as if that transaction never ran.

Related reading

See all Dev Knowledge articles →

Get the best tools, weekly

One email every Friday. No spam, unsubscribe anytime.