What MVCC Is, and How Databases Let Readers and Writers Coexist
MVCC keeps multiple versions of every row so reads never block writes. Here's how Postgres implements it with xmin/xmax, why your tables bloat, and where snapshot isolation bites.
You run a SELECT that takes 400ms to scan a table. Halfway through, another connection runs an UPDATE on a row you haven’t reached yet. What should your query see — the old value or the new one? And should your read have blocked that write, or the write blocked your read?
Multi-Version Concurrency Control (MVCC) is the answer most production databases settled on: neither one waits. Postgres, MySQL’s InnoDB engine, Oracle, and SQL Server’s snapshot mode all keep more than one copy of a row at a time. A reader gets a consistent point-in-time view; a writer creates a new version alongside the old one. The two coexist instead of fighting over a lock.
The problem MVCC is solving
Without versioning, a database has two blunt options. It can take a read lock so writers wait until every reader finishes — correct, but readers and writers now serialize, and a long analytical query can stall writes for its entire duration. Or it can let reads see in-progress writes, which gives you dirty reads: you return a value that the writing transaction later rolls back, so you reported data that never officially existed.
MVCC sidesteps both. The core rule is the one you’ll see repeated everywhere: readers don’t block writers, and writers don’t block readers. When a transaction modifies a row, the database doesn’t overwrite the bytes in place. It writes a new version and leaves the old one in place until no running transaction can still need it. Every transaction reads against a snapshot — a definition of which versions are visible to it — so your 400ms scan sees one coherent state of the table even while other connections are busy changing it.
Writers still block other writers on the same row. Two transactions can’t both update row 17 concurrently and pretend they didn’t see each other; one waits. But that’s the only contention MVCC doesn’t remove, and it’s the one you usually can’t avoid anyway.
How Postgres actually stores the versions
Postgres is the cleanest system to reason about because the versioning lives directly in the table heap. Every row version (a tuple) carries two hidden system columns: xmin, the ID of the transaction that created it, and xmax, the ID of the transaction that deleted or superseded it. You can see them:
SELECT xmin, xmax, * FROM accounts WHERE id = 17;
Visibility is a comparison. A tuple is visible to your transaction if its xmin committed before your snapshot was taken and its xmax is either empty or belongs to a transaction your snapshot doesn’t consider committed. An UPDATE is mechanically an insert plus a stamp: Postgres sets the old tuple’s xmax to the updating transaction’s ID and inserts a brand-new tuple with a fresh xmin. A DELETE just stamps xmax and inserts nothing.
The consequence is the thing that surprises people: deleted and updated rows don’t free space immediately. The old versions — dead tuples — sit in the table until cleanup runs. That cleanup is VACUUM, which reclaims space from tuples no remaining transaction can see, and autovacuum runs it for you in the background. This is why a table you only ever DELETE from can keep growing on disk, and why a heavily-updated table develops bloat that slows sequential scans.
InnoDB and Oracle reach the same outcome by a different route. Instead of keeping old versions in the main table, they keep the current row in place and store enough undo information in a separate area (the undo log in InnoDB, rollback/undo segments in Oracle) to reconstruct older versions on demand. The trade-off flips: the main table stays compact, but a long-running read has to walk undo records to rebuild the snapshot it needs, and an undo log that fills up because of an old open transaction is InnoDB’s version of the same operational headache.
Snapshot isolation and the edge it hides
MVCC gives you snapshots cheaply, but when the snapshot is taken determines what anomalies you can still hit. That’s the isolation level.
In Postgres, the default is Read Committed: every individual statement gets a fresh snapshot. So two SELECTs in the same transaction can return different data if another transaction committed in between. Repeatable Read takes one snapshot at the transaction’s first statement and holds it — Postgres implements this as true snapshot isolation, so the whole transaction sees a frozen view.
Snapshot isolation feels like it should be airtight, and it nearly is, which is what makes its failure mode sneaky: write skew. Two transactions each read an overlapping set of rows, each checks a condition that’s still true in its own snapshot, and each writes a different row. Neither sees the other’s write because both snapshots predate it, so a constraint that depends on the combination — “at least one doctor must stay on call” — gets violated even though each transaction looked correct in isolation.
The practical takeaway is to know your database’s default. Read Committed is fine for most CRUD work. The moment you’re reading a value, making a decision, and writing based on it — balance checks, inventory decrements, allocation logic — you need to think about whether a concurrent transaction could invalidate that decision, and reach for a higher isolation level or an explicit SELECT ... FOR UPDATE lock.
Cursor
When you're tracing a visibility bug across schema, query, and transaction-wrapper code, an editor that can read the whole repo at once is the difference between an afternoon and ten minutes. Cursor's codebase-aware chat is genuinely useful for 'where does this transaction's isolation level get set?' questions.
Free tier available; Pro is $20/month
Affiliate link · We earn a commission at no cost to you.
MVCC isn’t free — bloat, vacuum tuning, and undo-log pressure are the rent you pay for it. But the alternative, a database where every analytics query freezes your writes, is worse for almost every workload. Understanding the version mechanics underneath turns “why is this table 40GB when it has 2 million rows” from a mystery into a vacuum-tuning task.
FAQ
Does MVCC mean I never need locks?+
Why does my Postgres table grow on disk even after I delete rows?+
Is MVCC the same thing as snapshot isolation?+
Related reading
2026-06-10
LSM-Trees vs B-Trees: The Write-Optimized Database Tradeoff
Why some databases append writes and reconcile later while others edit in place — and how that one choice shapes write throughput, read latency, and disk usage.
2026-06-10
Copy-on-Write, Explained Through fork() and Snapshots
How copy-on-write defers copying until a write actually happens — the mechanism behind fast fork(), filesystem snapshots, and database MVCC, explained with page tables and page faults.
2026-06-10
A Coroutine Is Not a Thread: What Suspends, What Gets Scheduled, and Why It Matters
A coroutine suspends and resumes cooperatively; a thread is preempted by the OS. Here is the real difference in scheduling, memory, and parallelism — and when each one wins.
2026-06-10
Two's Complement: How Computers Represent Negative Numbers
How two's complement encodes negative integers, why CPUs run signed and unsigned math on one adder, and the edge cases — INT_MIN, overflow, sign extension — that cause real bugs.
2026-06-09
What a Merkle Tree Is, and Where You've Already Seen One
A Merkle tree hashes data into a single fingerprint so you can verify any piece without downloading the whole set. Here's how it works and where it already runs in your stack.
Get the best tools, weekly
One email every Friday. No spam, unsubscribe anytime.