pickuma.

Copy-on-Write, Explained Through fork() and Snapshots

How copy-on-write defers copying until a write actually happens — the mechanism behind fast fork(), filesystem snapshots, and database MVCC, explained with page tables and page faults.

Copying data is expensive. Copying data you never modify is wasted work. Copy-on-write (CoW) is the trick that resolves that tension: you hand out what looks like an independent copy, but no bytes move until someone actually writes. Until that first write, every “copy” is the same physical data, shared and marked read-only. The copy happens lazily, per unit, only when a writer forces it.

That single idea shows up in three places most developers touch every week: the fork() system call, filesystem snapshots on ZFS and Btrfs, and the multi-version concurrency control inside Postgres. They look unrelated until you see they’re the same mechanism applied at different granularities.

The mechanism: share read-only, copy on the fault

The unit of sharing on a modern CPU is the page — 4 KiB on x86-64 by default. Your process doesn’t address physical memory directly; it addresses virtual pages, and the page table maps each virtual page to a physical frame, plus permission bits like read, write, and execute.

Copy-on-write works by lying about those permission bits. When you want two logical copies of a region, you don’t duplicate the underlying frames. You point both page tables at the same physical frames and clear the write bit on both. Reads go straight through and cost nothing extra. The moment either side issues a write, the CPU’s memory management unit sees the cleared write bit and raises a page fault.

The kernel’s fault handler is where the actual copy happens. It allocates a fresh frame, copies the 4 KiB of contents, repoints the faulting process’s page table entry at the new frame, restores the write bit, and resumes the instruction. The writer never knows it faulted. The other side still references the original, untouched. A reference count on each shared frame tells the kernel whether a copy is even necessary — if the count is already 1, there’s no one to protect, so it just flips the write bit back on instead of copying.
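The fault path described above can be modeled in a few lines of Python. This is a toy simulation, not kernel code: `Frame`, `AddressSpace`, and their methods are hypothetical names invented for illustration, and a Python dict stands in for the hardware page table.

```python
PAGE_SIZE = 4096  # 4 KiB, the default page size on x86-64


class Frame:
    """A physical frame: page contents plus a count of mappings sharing it."""

    def __init__(self, data):
        self.data = bytearray(data)  # always a private copy of the bytes
        self.refcount = 1


class AddressSpace:
    """Toy page table: virtual page number -> [frame, writable bit]."""

    def __init__(self):
        self.table = {}
        self.copies = 0  # frames this address space had to duplicate

    def map_new(self, vpn, data=bytes(PAGE_SIZE)):
        self.table[vpn] = [Frame(data), True]

    def cow_clone(self):
        """Share every frame with a new address space; clear the write bit on both."""
        clone = AddressSpace()
        for vpn, entry in self.table.items():
            entry[0].refcount += 1
            entry[1] = False
            clone.table[vpn] = [entry[0], False]
        return clone

    def read(self, vpn, offset):
        return self.table[vpn][0].data[offset]

    def write(self, vpn, offset, value):
        entry = self.table[vpn]
        if not entry[1]:
            self._write_fault(entry)  # stands in for the MMU raising a fault
        entry[0].data[offset] = value

    def _write_fault(self, entry):
        frame = entry[0]
        if frame.refcount > 1:
            # Still shared: copy into a fresh frame and repoint this mapping.
            frame.refcount -= 1
            entry[0] = Frame(frame.data)
            self.copies += 1
        # Either we now own a private copy, or we were already the last sharer.
        entry[1] = True


if __name__ == "__main__":
    parent = AddressSpace()
    parent.map_new(0)
    child = parent.cow_clone()
    child.write(0, 0, 42)   # shared frame: the child gets a private copy
    parent.write(0, 0, 7)   # refcount is back to 1: write bit restored, no copy
    print(child.copies, parent.copies, child.read(0, 0), parent.read(0, 0))
```

The second write is the interesting one: the parent faults too, but because the child has already moved to its own frame, the handler only restores the write bit.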

fork() is the textbook case

When a Unix process calls fork(), the kernel needs to produce a child with an identical address space. Eagerly duplicating every page would make fork() scale with the parent’s memory footprint, which is brutal for a large process — and pointless, because the overwhelmingly common next move is execve(), which throws the whole address space away and loads a new program.

So fork() copies the page tables, not the page contents, and marks every writable page read-only in both parent and child. Both processes share physical memory. Execution continues until one of them writes to a shared page; that page, and only that page, gets duplicated by the fault handler. If the child immediately calls execve(), almost nothing was ever copied.

This is also why fork()-based snapshotting works. Redis takes a point-in-time RDB snapshot by calling fork() and letting the child serialize memory to disk while the parent keeps serving traffic. The child sees a frozen view: any key the parent mutates after the fork triggers a CoW page copy, so the child keeps reading the pre-fork bytes. The catch is memory pressure — if the parent writes heavily during the save, copied pages accumulate, and a write-heavy Redis can transiently approach double its resident size during a background save.
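A minimal sketch of the same pattern using Python's `os.fork()` on a Unix-like system. The function name `snapshot_via_fork` and the JSON-over-a-pipe transport are assumptions made for this example; Redis itself writes an RDB file from C code.

```python
import json
import os


def snapshot_via_fork(state):
    """Fork, let the child serialize `state`, and mutate it in the parent meanwhile.

    Returns (what the child saw, what the parent ended up with).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: its address space is a copy-on-write view frozen at fork time.
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "w") as out:
                json.dump(state, out)
        finally:
            os._exit(0)  # leave immediately, skipping the parent's cleanup handlers

    # Parent: keep "serving traffic" by mutating state while the child saves.
    os.close(write_fd)
    state["counter"] += 1
    state["new_key"] = "written after fork"
    with os.fdopen(read_fd) as src:
        child_view = json.load(src)
    os.waitpid(pid, 0)
    return child_view, state


if __name__ == "__main__":
    child_view, parent_view = snapshot_via_fork({"counter": 1})
    print("child saw:", child_view)
    print("parent has:", parent_view)
```

However the two processes are scheduled, the child serializes the pre-fork contents. One caveat specific to this sketch: CPython updates reference counts stored inside objects, so a Python child dirties pages even when it only reads. The example shows the isolation guarantee, not the memory savings a C program would get.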

Snapshots: the same idea, larger units

Filesystem and database snapshots apply copy-on-write above the page level, to disk blocks and row versions.

A CoW filesystem like ZFS or Btrfs never overwrites a live block in place. When you modify a file, it writes the new data to a free block and updates the metadata to point there, leaving the old block intact. A snapshot is then almost free: you record the current root of the tree and stop reclaiming the blocks it references. The live filesystem keeps moving forward onto new blocks; the snapshot keeps pointing at the old ones. Blocks are shared between the live view and the snapshot until a write diverges them — exactly the page-fault dance, just with the storage allocator playing the role of the fault handler. A snapshot’s size on disk is only the blocks that changed since it was taken.
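The block-level version can be sketched as a toy in-memory store. `CowStore` and its methods are hypothetical names, one block per file is a deliberate simplification, and none of this reflects the actual on-disk layout of ZFS or Btrfs.

```python
from typing import Optional


class CowStore:
    """Toy copy-on-write block store: a live block is never overwritten in place."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes
        self.live = {}       # file name -> block id (the live "root")
        self.snapshots = {}  # snapshot name -> frozen copy of a root
        self._next_id = 0

    def write(self, name: str, data: bytes) -> None:
        # Allocate a free block, then repoint the metadata at it.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.live[name] = block_id
        self._reclaim()

    def snapshot(self, snap: str) -> None:
        # Record the current root. This copies pointers, not data.
        self.snapshots[snap] = dict(self.live)

    def delete_snapshot(self, snap: str) -> None:
        del self.snapshots[snap]
        self._reclaim()

    def read(self, name: str, snap: Optional[str] = None) -> bytes:
        root = self.live if snap is None else self.snapshots[snap]
        return self.blocks[root[name]]

    def pinned_bytes(self, snap: str) -> int:
        """Bytes the snapshot references that the live view has moved past."""
        only_in_snap = set(self.snapshots[snap].values()) - set(self.live.values())
        return sum(len(self.blocks[b]) for b in only_in_snap)

    def _reclaim(self) -> None:
        # Free every block that no root (live or snapshot) still references.
        referenced = set(self.live.values())
        for root in self.snapshots.values():
            referenced.update(root.values())
        for block_id in list(self.blocks):
            if block_id not in referenced:
                del self.blocks[block_id]


if __name__ == "__main__":
    store = CowStore()
    store.write("a.txt", b"v1")
    store.write("b.txt", b"unchanged")
    store.snapshot("monday")
    store.write("a.txt", b"v2-longer")
    print(store.read("a.txt"), store.read("a.txt", "monday"))
    print(store.pinned_bytes("monday"))
```

Taking the snapshot copies only the pointer map. The block for `b.txt` stays shared, and the old block for `a.txt` is freed only when the snapshot that pins it is deleted.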

Database MVCC is the row-level version. Instead of locking a row so readers and writers take turns, Postgres writes a new version of the row on update and leaves the old version in place. Each query reads the versions visible to its snapshot, which Postgres takes per statement under the default READ COMMITTED isolation level and once per transaction under REPEATABLE READ, so a long-running read never blocks a concurrent write and vice versa. Old versions are shared by every snapshot old enough to see them and are only cleaned up, by VACUUM, once no transaction can reference them anymore. The reference-count idea returns as visibility bookkeeping.
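A rough model of that bookkeeping, borrowing the `xmin` and `xmax` names from Postgres's tuple header but nothing else. `ToyMVCC` is a hypothetical class, and it assumes every transaction commits instantly, so visibility reduces to comparing integer ids. Real visibility checks also have to handle in-progress and aborted transactions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    value: str
    xmin: int                   # id of the transaction that created this version
    xmax: Optional[int] = None  # id of the transaction that superseded it, if any


class ToyMVCC:
    """Row versions with snapshot visibility (assumes instant commits)."""

    def __init__(self):
        self.versions = {}        # key -> list of RowVersion, oldest first
        self.next_xid = 1
        self.open_snapshots = []  # snapshots still in use by some reader

    def begin_snapshot(self) -> int:
        """Return a snapshot that sees every transaction committed so far."""
        snap = self.next_xid - 1
        self.open_snapshots.append(snap)
        return snap

    def release(self, snap: int) -> None:
        self.open_snapshots.remove(snap)

    def update(self, key: str, value: str) -> None:
        xid = self.next_xid
        self.next_xid += 1
        chain = self.versions.setdefault(key, [])
        if chain:
            chain[-1].xmax = xid  # old version stays in place, marked superseded
        chain.append(RowVersion(value, xmin=xid))

    def read(self, key: str, snap: int) -> Optional[str]:
        for v in self.versions.get(key, []):
            if v.xmin <= snap and (v.xmax is None or v.xmax > snap):
                return v.value
        return None

    def vacuum(self) -> int:
        """Drop versions no open or future snapshot can see; return how many."""
        horizon = min(self.open_snapshots, default=self.next_xid - 1)
        removed = 0
        for key, chain in self.versions.items():
            kept = [v for v in chain if v.xmax is None or v.xmax > horizon]
            removed += len(chain) - len(kept)
            self.versions[key] = kept
        return removed


if __name__ == "__main__":
    db = ToyMVCC()
    db.update("k", "v1")
    old = db.begin_snapshot()
    db.update("k", "v2")
    print(db.read("k", old))  # the old snapshot still reads the old version
    print(db.vacuum())        # nothing reclaimable while `old` is open
    db.release(old)
    print(db.vacuum())        # the superseded version can now go
```

The `horizon` in `vacuum` plays the role the reference count plays for pages: a dead version survives exactly as long as some open snapshot could still read it.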

| Layer | Unit shared | What triggers the copy | Reclaimed by |
| --- | --- | --- | --- |
| fork() | 4 KiB page | Write page fault | Process exit |
| ZFS / Btrfs snapshot | Disk block | Block overwrite | Snapshot deletion |
| Postgres MVCC | Row version (tuple) | UPDATE / DELETE | VACUUM |

Reading kernel and database source is the fastest way to make this concrete — the do_wp_page fault handler in the Linux mm code, or the tuple visibility checks in Postgres, are short and surprisingly readable once you know what you’re looking for. A capable editor that can jump across a large C codebase and answer “who clears this write bit” without you grepping by hand earns its keep here.

The payoff of seeing these three as one mechanism is practical. When a forked worker pool balloons in memory, you know it’s CoW pages diverging under write pressure, not a leak. When a Postgres table bloats, you know dead row versions are accumulating faster than vacuum reclaims them. When a snapshot you forgot about quietly consumes a disk, you know it’s pinning blocks the live filesystem has long since moved past. Same lazy copy, same reference counting, same failure mode: copies you stopped tracking.

FAQ

Is copy-on-write always faster than just copying?
No. It's faster when the copy is mostly read afterward, which is the common case. If the writer modifies most of the data, you pay the per-unit copy cost plus the overhead of page faults and bookkeeping — slightly more total work than an eager copy, just deferred. CoW optimizes for the read-heavy case at a small penalty in the write-heavy one.
Why does my forked process show high memory usage when it shares most pages?
Standard resident-set-size accounting counts each shared copy-on-write page against every process that maps it, so the same physical page is counted multiple times. Use proportional set size (PSS) from /proc/[pid]/smaps on Linux, which divides shared pages among their sharers, to see actual physical consumption.
How is MVCC related to copy-on-write?
MVCC is copy-on-write at the row level. Instead of overwriting a row in place, the database writes a new version and keeps the old one so in-flight readers still see a consistent snapshot. Old versions are shared until no transaction can see them, then reclaimed — the same share-until-write, reference-count-then-collect pattern as fork() and filesystem snapshots.
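For the memory-accounting question above, here is a small sketch that totals the `Rss` and `Pss` fields from smaps output so the two can be compared side by side. The helper names `sum_smaps_kib` and `process_memory_kib` are made up for this example. It is Linux only, and reading another user's process requires the right privileges.

```python
def sum_smaps_kib(smaps_text: str) -> dict:
    """Total the Rss and Pss fields (reported in kB) across all mappings."""
    totals = {"Rss": 0, "Pss": 0}
    for line in smaps_text.splitlines():
        field, _, rest = line.partition(":")
        if field in totals:  # exact match, so Pss_Anon and friends are skipped
            totals[field] += int(rest.split()[0])
    return totals


def process_memory_kib(pid: int) -> dict:
    # Linux only: each mapping in /proc/<pid>/smaps has its own Rss/Pss lines.
    with open(f"/proc/{pid}/smaps") as f:
        return sum_smaps_kib(f.read())
```

Calling `process_memory_kib(os.getpid())` in a forked worker and in its parent should show the `Pss` totals adding up to something close to real physical use, while the `Rss` totals count the shared pages once per process.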
