Copy-on-Write, Explained Through fork() and Snapshots
How copy-on-write defers copying until a write actually happens — the mechanism behind fast fork(), filesystem snapshots, and database MVCC, explained with page tables and page faults.
Copying data is expensive. Copying data you never modify is wasted work. Copy-on-write (CoW) is the trick that resolves that tension: you hand out what looks like an independent copy, but no bytes move until someone actually writes. Until that first write, every “copy” is the same physical data, shared and marked read-only. The copy happens lazily, per unit, only when a writer forces it.
That single idea shows up in three places most developers touch every week: the fork() system call, filesystem snapshots on ZFS and Btrfs, and the multi-version concurrency control inside Postgres. They look unrelated until you see they’re the same mechanism applied at different granularities.
The mechanism: share read-only, copy on the fault
The unit of sharing on a modern CPU is the page — 4 KiB on x86-64 by default. Your process doesn’t address physical memory directly; it addresses virtual pages, and the page table maps each virtual page to a physical frame, plus permission bits like read, write, and execute.
Copy-on-write works by lying about those permission bits. When you want two logical copies of a region, you don’t duplicate the underlying frames. You point both page tables at the same physical frames and clear the write bit on both. Reads go straight through and cost nothing extra. The moment either side issues a write, the CPU’s memory management unit sees the cleared write bit and raises a page fault.
The kernel’s fault handler is where the actual copy happens. It allocates a fresh frame, copies the 4 KiB of contents, repoints the faulting process’s page table entry at the new frame, restores the write bit, and resumes the instruction. The writer never knows it faulted. The other side still references the original, untouched. A reference count on each shared frame tells the kernel whether a copy is even necessary — if the count is already 1, there’s no one to protect, so it just flips the write bit back on instead of copying.
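The fault path above can be sketched as a toy model. This is a simulation in plain Python, not kernel code: the names `Frame`, `AddressSpace`, `cow_clone`, and `_write_fault` are invented for illustration, and the "page fault" is just a branch taken when the write bit is clear.

```python
PAGE_SIZE = 4096


class Frame:
    """A physical frame: page-sized contents plus a count of mappings."""

    def __init__(self, data=None):
        self.data = bytearray(data) if data is not None else bytearray(PAGE_SIZE)
        self.refcount = 0


class AddressSpace:
    """Maps virtual page numbers to [frame, writable] entries (hypothetical model)."""

    def __init__(self):
        self.table = {}
        self.copies = 0  # how many real page copies this address space paid for

    def map_new(self, vpn):
        frame = Frame()
        frame.refcount = 1
        self.table[vpn] = [frame, True]

    def cow_clone(self):
        """Share every frame with a new address space and clear both write bits."""
        child = AddressSpace()
        for vpn, entry in self.table.items():
            frame = entry[0]
            frame.refcount += 1
            entry[1] = False                  # parent loses write permission too
            child.table[vpn] = [frame, False]
        return child

    def read(self, vpn, offset):
        return self.table[vpn][0].data[offset]

    def write(self, vpn, offset, value):
        entry = self.table[vpn]
        if not entry[1]:
            self._write_fault(entry)          # stands in for the MMU trap
        entry[0].data[offset] = value

    def _write_fault(self, entry):
        frame = entry[0]
        if frame.refcount > 1:
            # Still shared: copy the contents into a private frame.
            private = Frame(frame.data)
            private.refcount = 1
            frame.refcount -= 1
            entry[0] = private
            self.copies += 1
        # Sole owner, or freshly copied: restore the write bit.
        entry[1] = True


parent = AddressSpace()
parent.map_new(0)
parent.write(0, 0, 7)
child = parent.cow_clone()
child.write(0, 0, 9)     # faults and copies the page
parent.write(0, 0, 8)    # faults, but the refcount is back to 1, so no copy
print(parent.read(0, 0), child.read(0, 0), parent.copies, child.copies)
```

The second write is the interesting one: the parent still faults, but because the child already took its private copy, the handler only flips the write bit back on.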
fork() is the textbook case
When a Unix process calls fork(), the kernel needs to produce a child with an identical address space. Eagerly duplicating every page would make fork() scale with the parent’s memory footprint, which is brutal for a large process — and pointless, because the overwhelmingly common next move is execve(), which throws the whole address space away and loads a new program.
So fork() copies the page tables, not the page contents, and marks every private writable page read-only in both parent and child (mappings created as shared stay shared and writable). Both processes share physical memory. Execution continues until one of them writes to a shared page; that page, and only that page, gets duplicated by the fault handler. If the child immediately calls execve(), almost nothing was ever copied.
This is also why fork()-based snapshotting works. Redis takes a point-in-time RDB snapshot by calling fork() and letting the child serialize memory to disk while the parent keeps serving traffic. The child sees a frozen view: any key the parent mutates after the fork triggers a CoW page copy, so the child keeps reading the pre-fork bytes. The catch is memory pressure — if the parent writes heavily during the save, copied pages accumulate, and a write-heavy Redis can transiently approach double its resident size during a background save.
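A minimal sketch of that snapshot pattern, assuming a Unix system where `os.fork()` is available. It is not how Redis is implemented: the function name `snapshot_via_fork`, the JSON format, and the pipe standing in for the RDB file are all choices made for this example.

```python
import json
import os


def snapshot_via_fork(store, mutate):
    """Fork, let the child serialize `store`, and mutate it in the parent meanwhile.

    Returns the child's view of the store as a dict. Unix-only.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: its address space is the store as it was at fork() time.
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "w") as out:
                json.dump(store, out)
        finally:
            os._exit(0)   # leave without running the parent's cleanup handlers

    # Parent: keep "serving traffic" by mutating the live store.
    os.close(write_fd)
    mutate(store)
    with os.fdopen(read_fd) as src:
        frozen = json.load(src)   # reads until the child closes its end
    os.waitpid(pid, 0)
    return frozen


store = {"a": 1, "b": 2}
frozen = snapshot_via_fork(store, lambda s: s.update(a=100, c=3))
print("snapshot:", frozen)
print("live:    ", store)
```

The child reports the pre-fork values no matter how the two processes are scheduled, because the parent's writes land on its own copies of the affected pages. One caveat specific to this demo: CPython updates reference counts inside objects even on reads, so a forked Python process dirties shared pages more than an equivalent C program would.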
Snapshots: the same idea, larger units
Filesystem and database snapshots apply copy-on-write above the page level, to disk blocks and row versions.
A CoW filesystem like ZFS or Btrfs never overwrites a live block in place. When you modify a file, it writes the new data to a free block and updates the metadata to point there, leaving the old block intact. A snapshot is then almost free: you record the current root of the tree and stop reclaiming the blocks it references. The live filesystem keeps moving forward onto new blocks; the snapshot keeps pointing at the old ones. Blocks are shared between the live view and the snapshot until a write diverges them. It is exactly the page-fault dance, just with the storage allocator playing the role of the fault handler. The space a snapshot holds on its own is only the old blocks the live filesystem has since replaced or freed, so a snapshot of data that never changes costs almost nothing.
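The block-sharing behavior can be modeled in a few lines. This is a deliberately flat toy, assuming one block per file and a per-block reference count; `CowVolume` and its methods are hypothetical names, and real filesystems such as ZFS and Btrfs use trees of blocks and their own bookkeeping rather than this structure.

```python
class CowVolume:
    """Toy block store: a write never overwrites a block that is still referenced."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.refs = {}        # block id -> number of roots pointing at it
        self.live = {}        # file name -> block id (the live "root")
        self.snapshots = {}   # snapshot name -> frozen copy of a root
        self._next_id = 0

    def _alloc(self, data):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.refs[block_id] = 1
        return block_id

    def _release(self, block_id):
        self.refs[block_id] -= 1
        if self.refs[block_id] == 0:
            del self.refs[block_id]
            del self.blocks[block_id]

    def write(self, name, data):
        old = self.live.get(name)
        self.live[name] = self._alloc(data)   # new data goes to a fresh block
        if old is not None:
            self._release(old)                # freed only if no snapshot pins it

    def read(self, name, snapshot=None):
        root = self.live if snapshot is None else self.snapshots[snapshot]
        return self.blocks[root[name]]

    def snapshot(self, snap_name):
        # Copy the root mapping only; the data blocks themselves are shared.
        root = dict(self.live)
        for block_id in root.values():
            self.refs[block_id] += 1
        self.snapshots[snap_name] = root

    def delete_snapshot(self, snap_name):
        for block_id in self.snapshots.pop(snap_name).values():
            self._release(block_id)


vol = CowVolume()
vol.write("a.txt", b"v1")
vol.write("b.txt", b"unchanged")
vol.snapshot("monday")
vol.write("a.txt", b"v2")
print(vol.read("a.txt"), vol.read("a.txt", snapshot="monday"), len(vol.blocks))
vol.delete_snapshot("monday")
print(len(vol.blocks))
```

Taking the snapshot allocates no data blocks. The only extra block on the volume is the old `a.txt`, and it disappears as soon as the snapshot that pins it is deleted.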
Database MVCC is the row-level version. Instead of locking a row so readers and writers take turns, Postgres writes a new version of the row on update and leaves the old version in place. Each query reads whichever version is visible to its snapshot, which Postgres takes per statement at the default READ COMMITTED level and once per transaction under REPEATABLE READ, so a long-running read never blocks a concurrent write and vice versa. Old versions are shared by every snapshot old enough to see them, and vacuum only cleans them up once no transaction can reference them anymore. The reference-count idea returns as visibility bookkeeping.
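The visibility bookkeeping can be sketched with two fields per row version. This is a heavily simplified model, assuming a single thread and writes that commit immediately; the field names `xmin` and `xmax` are borrowed from Postgres, but its real visibility rules also have to handle in-progress and aborted transactions, which this toy ignores. `ToyMVCC` and its methods are hypothetical.

```python
import itertools


class Version:
    def __init__(self, value, xmin):
        self.value = value
        self.xmin = xmin      # id of the transaction that created this version
        self.xmax = None      # id of the transaction that superseded it, if any


class ToyMVCC:
    """Single-threaded model; every write commits immediately (an assumption)."""

    def __init__(self):
        self._txid = itertools.count(1)
        self.rows = {}            # key -> list of Version, oldest first
        self.last_committed = 0

    def update(self, key, value):
        txid = next(self._txid)
        versions = self.rows.setdefault(key, [])
        if versions and versions[-1].xmax is None:
            versions[-1].xmax = txid          # old version stays in place
        versions.append(Version(value, txid))
        self.last_committed = txid
        return txid

    def snapshot(self):
        # A reader remembers the newest transaction it is allowed to see.
        return self.last_committed

    def read(self, key, snap):
        for v in self.rows.get(key, []):
            created_before = v.xmin <= snap
            not_yet_replaced = v.xmax is None or v.xmax > snap
            if created_before and not_yet_replaced:
                return v.value
        return None

    def vacuum(self, oldest_active_snapshot):
        """Drop versions that no snapshot at or after the given one can see."""
        removed = 0
        for key, versions in self.rows.items():
            keep = [v for v in versions
                    if v.xmax is None or v.xmax > oldest_active_snapshot]
            removed += len(versions) - len(keep)
            self.rows[key] = keep
        return removed


db = ToyMVCC()
db.update("balance", 100)
old_snap = db.snapshot()
db.update("balance", 250)
print(db.read("balance", old_snap), db.read("balance", db.snapshot()))
print(db.vacuum(oldest_active_snapshot=old_snap))       # old reader still active
print(db.vacuum(oldest_active_snapshot=db.snapshot()))  # old reader gone
```

The first vacuum call removes nothing because the old snapshot can still see the superseded version. That is the toy equivalent of a long-running transaction holding back cleanup and letting a table bloat.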
| Layer | Unit shared | What triggers the copy | Who reclaims |
|---|---|---|---|
| fork() | 4 KiB page | Write page fault | Process exit |
| ZFS / Btrfs snapshot | Disk block | Block overwrite | Snapshot deletion |
| Postgres MVCC | Row version (tuple) | UPDATE / DELETE | VACUUM |
Reading kernel and database source is the fastest way to make this concrete — the do_wp_page fault handler in the Linux mm code, or the tuple visibility checks in Postgres, are short and surprisingly readable once you know what you’re looking for. A capable editor that can jump across a large C codebase and answer “who clears this write bit” without you grepping by hand earns its keep here.
The payoff of seeing these three as one mechanism is practical. When a forked worker pool balloons in memory, you know it’s CoW pages diverging under write pressure, not a leak. When a Postgres table bloats, you know dead row versions are accumulating faster than vacuum reclaims them. When a snapshot you forgot about quietly consumes a disk, you know it’s pinning blocks the live filesystem has long since moved past. Same lazy copy, same reference counting, same failure mode: copies you stopped tracking.