Write-Ahead Logging: How Databases Survive a Power Cut

A database commits a transaction, returns OK, and a half-second later someone trips over the power cord. The machine is dead. When it boots back up, the row you just inserted is still there. That is not luck, and it is not magic. It is write-ahead logging doing the one job it exists to do: making a promise survive a crash.

The naive way to store data is to write it straight into the data file at the right offset. The problem is that a single logical change often touches several disk pages — an index entry here, a row there, a free-space map update somewhere else. If the power dies after page one and before page three, you are left with a data file that is internally inconsistent: an index that points at a row that was never written. There is no way to tell, on reboot, whether that file is whole or torn. You have lost the ability to trust your own storage.

The log-first rule

Write-ahead logging fixes this by inverting the order of operations. Before any change is applied to the actual data pages, the database first writes a description of that change to a separate, append-only file: the log. Only after that log record is safely on disk does the database touch the real data — and crucially, it can defer touching the real data for a long time.

The rule is in the name. The log is written ahead of the data. A transaction is considered durable the moment its commit record reaches stable storage in the log, not when the data pages are updated. This is the D in ACID — durability — and the log is where it lives.
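To make the ordering concrete, here is a minimal Python sketch of a toy key-value store that obeys the rule. The TinyWAL class, its JSON-lines log format and the record fields are all hypothetical and chosen for readability; real systems use compact binary records. What matters is the order: the change records and the commit record are forced to disk with os.fsync before the in-memory "data pages" are touched.

```python
import json
import os
import tempfile


class TinyWAL:
    """Hypothetical toy key-value store that follows the log-first rule."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.data = {}  # stands in for the data pages; allowed to lag behind the log

    def commit(self, txid: int, writes: dict) -> None:
        records = [{"tx": txid, "op": "set", "key": k, "value": v} for k, v in writes.items()]
        records.append({"tx": txid, "op": "commit"})  # the commit record goes last
        with open(self.log_path, "a", encoding="utf-8") as log:
            for rec in records:
                log.write(json.dumps(rec) + "\n")
            log.flush()             # Python's buffer -> operating system
            os.fsync(log.fileno())  # operating system -> stable storage (if the disk honors it)
        # Durability point: the commit record is on disk. Only now is it safe to
        # report success and update the "data pages", which could also wait much longer.
        self.data.update(writes)


if __name__ == "__main__":
    wal = TinyWAL(os.path.join(tempfile.mkdtemp(), "wal.log"))
    wal.commit(1, {"a": 1, "b": 2})
    print(wal.data)
```

Writing all of a transaction's records and syncing once is deliberate: durability hinges on the commit record, which is written last. Real logs also checksum each record, so a batch that was only partly persisted when the power died can be detected during recovery.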

The payoff shows up at recovery time. After a crash, the database reads the log forward from the last known-good checkpoint. For every committed transaction whose changes might not have made it into the data files, it replays the log records and reapplies the changes. This is the redo pass. A transaction that was still in flight when the lights went out has log records but no matching commit record. Because the database is allowed to flush dirty pages before a transaction commits, some of those changes may already sit in the data files, so recovery rolls them back. This is the undo pass. The canonical formulation of this redo/undo dance is the ARIES algorithm, and most production databases are a variation on its themes.
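Here is the redo/undo idea as a small Python sketch. It is deliberately simpler than ARIES, and the record format (a before-image "old", an after-image "new", separate commit records) is an assumption made for illustration. It also shows why undo is needed at all: a page holding an uncommitted change may already have been flushed.

```python
def recover(pages: dict, log: list) -> dict:
    """Toy redo/undo recovery over a key-value 'page' store.

    Assumptions (hypothetical format): each "set" record carries the key, its
    before-image "old" and after-image "new"; commits are separate records;
    every key existed before the logged change; no two in-flight transactions
    wrote the same key (as under strict two-phase locking). Real ARIES adds an
    analysis pass, redoes all history, and uses page LSNs to skip work.
    """
    committed = {r["tx"] for r in log if r["op"] == "commit"}

    # Redo pass: walk forward, reapplying every committed change in log order.
    for r in log:
        if r["op"] == "set" and r["tx"] in committed:
            pages[r["key"]] = r["new"]

    # Undo pass: walk backward, restoring before-images of in-flight transactions.
    for r in reversed(log):
        if r["op"] == "set" and r["tx"] not in committed:
            pages[r["key"]] = r["old"]
    return pages


# Crash state: T1 committed but its write to x never reached the data file;
# T2 never committed, yet its write to y was already flushed.
crash_log = [
    {"tx": 1, "op": "set", "key": "x", "old": 0, "new": 1},
    {"tx": 2, "op": "set", "key": "y", "old": 5, "new": 9},
    {"tx": 1, "op": "commit"},
]
print(recover({"x": 0, "y": 9}, crash_log))  # {'x': 1, 'y': 5}
```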

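Why is replaying the log safe when writing the data directly was not? Because the log is append-only and each record is self-contained. You are never half-updating a structure; you are reading a sequence of “this happened, then this happened” entries and applying them in order. A crash can still tear the record that was being written at that moment, but records already on disk are never overwritten in place, and each record (or frame, in SQLite) carries a checksum. Recovery reads forward until it hits a record that is incomplete or fails its checksum, treats that point as the end of the log, and loses nothing that had been acknowledged.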
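One way to make records self-checking is to frame each one with its length and a checksum, then have recovery stop at the first frame that does not verify. This sketch uses Python's zlib.crc32 (plain CRC-32) as a stand-in; PostgreSQL uses CRC-32C and SQLite has its own checksum scheme, and the 8-byte header layout here is hypothetical.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # hypothetical header: payload length, CRC-32 of payload


def encode(payload: bytes) -> bytes:
    """Frame one log record as header + payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_all(buf: bytes) -> list:
    """Return every intact record, stopping at the first torn or corrupt one."""
    records, pos = [], 0
    while pos + HEADER.size <= len(buf):
        length, crc = HEADER.unpack_from(buf, pos)
        start = pos + HEADER.size
        end = start + length
        if end > len(buf):
            break  # record cut off mid-write: this is the torn tail
        payload = buf[start:end]
        if zlib.crc32(payload) != crc:
            break  # garbage or partial write: treat as end of log
        records.append(payload)
        pos = end
    return records


good = encode(b"set a=1") + encode(b"commit 1")
torn = good + encode(b"set b=2")[:5]  # crash mid-append: only 5 bytes landed
print(decode_all(torn))  # [b'set a=1', b'commit 1']
```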

What this looks like in PostgreSQL and SQLite

The concept is universal, but the two databases most developers actually touch implement it in instructively different ways.

PostgreSQL keeps its WAL as a stream of segment files (16 MB each by default) under pg_wal/. Every change generates a WAL record stamped with a Log Sequence Number (LSN), a monotonically increasing position in the log. Periodically the database runs a checkpoint: it flushes all the dirty data pages that the log has been describing out to the main data files, then records that the log up to a certain LSN is now fully reflected on disk. WAL segments before that point are no longer needed for crash recovery and can be recycled. The synchronous_commit setting controls whether a commit waits for its WAL records to be flushed before reporting success. Turn it off and a crash can lose the last few transactions that were reported as committed, though it cannot leave the database inconsistent; trading that small durability window for throughput is a legitimate choice for data you can afford to lose.
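LSNs are easy to work with once you see that the text form PostgreSQL prints, such as 0/16B3740, is a 64-bit byte position split into two hexadecimal halves. Here is a small Python sketch, assuming the default 16 MB segment size. On a live server, pg_wal_lsn_diff() and pg_walfile_name() do this properly; real segment file names also encode the timeline, so segment_index below is only a position within the stream, not a file name.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # PostgreSQL's default; can be changed at initdb time


def parse_lsn(text: str) -> int:
    """Turn PostgreSQL's 'X/Y' LSN text (high and low 32 bits, in hex) into one integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_bytes_between(newer: str, older: str) -> int:
    """Bytes of WAL between two LSNs; the same idea as pg_wal_lsn_diff()."""
    return parse_lsn(newer) - parse_lsn(older)


def segment_index(lsn_text: str) -> int:
    """Which 16 MB slice of the log stream an LSN falls into (assumes the default size)."""
    return parse_lsn(lsn_text) // WAL_SEGMENT_SIZE


print(wal_bytes_between("0/3000000", "0/1000000"))  # 33554432, i.e. 32 MB of WAL
print(segment_index("0/3000000"))  # 3
```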

SQLite ships with WAL mode as an opt-in, switched on with PRAGMA journal_mode=WAL;. By default SQLite uses a rollback journal instead, which works the other way around: it copies the original pages out before overwriting them, so it can put them back after a crash. WAL mode flips this: new changes go to a -wal sidecar file and the main database stays untouched until a checkpoint folds them in. The practical reason to switch is concurrency. In WAL mode, readers do not block the writer and the writer does not block readers. Each reader notes how far the WAL extends when its read transaction starts and ignores frames appended after that, so it sees a consistent snapshot of the main file plus the log up to that point while new writes pile up behind it. SQLite checkpoints automatically once the WAL file grows past roughly 1000 pages (the default threshold), and you can trigger one yourself with PRAGMA wal_checkpoint.
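The snapshot behavior is easy to see with Python's built-in sqlite3 module. This sketch opens two connections to a throwaway file, holds a read transaction open on one, and commits a write on the other; the table name t and the temporary path are arbitrary. It ends with a manual PRAGMA wal_checkpoint(TRUNCATE), which folds the log into the main file and truncates the -wal file.

```python
import os
import sqlite3
import tempfile


def count_rows(conn: sqlite3.Connection) -> int:
    # fetchall() runs the statement to completion so it does not linger open.
    return conn.execute("SELECT COUNT(*) FROM t").fetchall()[0][0]


def wal_snapshot_demo():
    # WAL needs a real file; in-memory databases cannot use it.
    path = os.path.join(tempfile.mkdtemp(), "demo.db")

    # isolation_level=None means autocommit; BEGIN/COMMIT are issued explicitly.
    writer = sqlite3.connect(path, isolation_level=None)
    mode = writer.execute("PRAGMA journal_mode=WAL;").fetchall()[0][0]
    writer.execute("CREATE TABLE t (x INTEGER)")
    writer.execute("INSERT INTO t VALUES (1)")

    reader = sqlite3.connect(path, isolation_level=None)
    reader.execute("BEGIN")
    before = count_rows(reader)  # the read transaction's snapshot is fixed here

    writer.execute("INSERT INTO t VALUES (2)")  # commits despite the open reader
    during = count_rows(reader)  # still the old snapshot
    reader.execute("COMMIT")
    after = count_rows(reader)   # a fresh read sees the new row

    reader.close()
    # Fold the WAL back into the main database file by hand.
    writer.execute("PRAGMA wal_checkpoint(TRUNCATE);").fetchall()
    writer.close()
    return mode, before, during, after


print(wal_snapshot_demo())
```

The call should return ('wal', 1, 1, 2): the reader keeps seeing one row until it ends its transaction, even though the writer committed a second row in the meantime.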

The shared idea across both: writes are cheap and sequential because they go to the log; the expensive, random-access work of updating the real data structures is batched up and done later, in bulk, when it is convenient.
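A toy model makes the split visible: writes append to the log and touch only in-memory pages, and a checkpoint later flushes every dirty page in one batch and drops the log prefix it no longer needs. Everything here (the ToyEngine class, treating the log list as if it survived a crash) is a hypothetical simplification, not how PostgreSQL or SQLite lay out their state.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEngine:
    """Hypothetical model: the log list is treated as durable, the cache is not."""
    disk: dict = field(default_factory=dict)   # main data file, updated only at checkpoints
    cache: dict = field(default_factory=dict)  # in-memory pages, possibly newer than disk
    dirty: set = field(default_factory=set)    # pages changed since the last checkpoint
    log: list = field(default_factory=list)    # (lsn, page, value) records, append-only
    next_lsn: int = 1
    checkpoint_lsn: int = 0

    def write(self, page: str, value: str) -> None:
        self.log.append((self.next_lsn, page, value))  # cheap sequential append
        self.next_lsn += 1
        self.cache[page] = value
        self.dirty.add(page)

    def checkpoint(self) -> None:
        # The expensive, random-access part, done once for the whole batch.
        for page in sorted(self.dirty):
            self.disk[page] = self.cache[page]
        self.dirty.clear()
        self.checkpoint_lsn = self.next_lsn - 1
        # Everything up to checkpoint_lsn is now reflected on disk: recycle it.
        self.log = [r for r in self.log if r[0] > self.checkpoint_lsn]

    def crash_and_recover(self) -> None:
        self.cache = dict(self.disk)  # memory is gone; start from the data file
        self.dirty.clear()
        for _lsn, page, value in self.log:  # replay only what followed the checkpoint
            self.cache[page] = value
            self.dirty.add(page)


engine = ToyEngine()
engine.write("p1", "a")
engine.write("p2", "b")
engine.write("p1", "c")  # p1 changed twice but is flushed once
engine.checkpoint()
engine.write("p2", "d")
engine.crash_and_recover()
print(engine.disk)   # {'p1': 'c', 'p2': 'b'}
print(engine.cache)  # {'p1': 'c', 'p2': 'd'}
```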

There is a cost to all this, and it is worth naming. Every committed change is written at least twice: once to the log, once to the data file at checkpoint time. This is write amplification, and it is the price of durability. Databases claw some of it back with group commit, batching the fsyncs of several concurrent transactions into a single disk flush, so ten commits arriving at once might cost one physical sync rather than ten. Because log writes are sequential and group commit amortizes the syncs, the overhead is usually modest next to the safety it buys.
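Group commit is easier to believe once you see the leader/follower shape it usually takes. In this sketch, the first committer to find no sync in progress becomes the leader, releases the lock, and issues one os.fsync that covers every record appended so far; the others wait on a condition variable and return once their record is covered. GroupCommitLog, its counters and run_demo are hypothetical, and a real engine tracks flush positions as LSNs rather than a record count.

```python
import os
import tempfile
import threading


class GroupCommitLog:
    """Hypothetical toy: one fsync can make many concurrent commits durable."""

    def __init__(self, path: str):
        self._f = open(path, "ab")
        self._cond = threading.Condition()
        self._written = 0       # commit records handed to the log so far
        self._durable = 0       # records 1.._durable are known to be fsynced
        self._flushing = False  # True while a leader thread is inside fsync
        self.fsync_count = 0

    def commit(self, record: bytes) -> None:
        """Append a commit record and return only once it is durable."""
        with self._cond:
            self._f.write(record + b"\n")
            self._written += 1
            mine = self._written
            while self._durable < mine:
                if self._flushing:
                    self._cond.wait()  # another thread's fsync may cover us
                    continue
                # Become the leader: sync everything written so far, including
                # records from threads that are still waiting.
                self._flushing = True
                target = self._written
                self._f.flush()        # Python's buffer -> operating system
                self._cond.release()   # let others append while we sync
                ok = False
                try:
                    os.fsync(self._f.fileno())
                    ok = True
                finally:
                    self._cond.acquire()
                    self._flushing = False
                    if ok:
                        self._durable = target
                        self.fsync_count += 1
                    self._cond.notify_all()

    def close(self) -> None:
        self._f.close()


def run_demo(n_threads: int = 50):
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    log = GroupCommitLog(path)
    threads = [threading.Thread(target=log.commit, args=(f"tx{i}".encode(),))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    log.close()
    with open(path, "rb") as f:
        return len(f.read().splitlines()), log.fsync_count
```

How many fsyncs run_demo() performs depends on thread scheduling: it is at most one per commit and often far fewer, because commits that arrive while a sync is in flight share the next one.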

The mental model to keep: the log is the source of truth about what happened, and the data files are a cache of where things currently stand, one that can always be brought up to date by replaying the log from the last checkpoint. Get that backwards and crash recovery stops making sense. Get it right and the power cord becomes a non-event.

FAQ

Is write-ahead logging the same as a transaction log or a redo log?

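They're the same family. 'Transaction log' is the generic term (and SQL Server's name for it), 'redo log' is Oracle's name, and 'WAL' is the term PostgreSQL and SQLite use. The core mechanism, writing the change description before changing the data and replaying it on recovery, is the same everywhere; the details differ. Oracle, for example, keeps the information needed to roll back uncommitted changes in separate undo segments rather than in the redo log itself.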

Does WAL slow down my writes?

It adds write amplification because each change is written to the log and later to the data file. But the log write is sequential and fast, and group commit batches the expensive fsyncs together. For most workloads the durability is worth far more than the modest overhead, and disabling the safety only makes sense for data you can afford to lose.

Why does my database still lose data on a power cut sometimes?

Almost always because a disk lied about flushing. WAL guarantees durability only if fsync actually forces data onto stable storage. Drives with volatile write caches that acknowledge flushes early break that guarantee. Verify your hardware honors flushes before blaming the database.
