pickuma.
Dev Knowledge

What MVCC Is, and How Databases Let Readers and Writers Coexist

MVCC keeps multiple versions of every row so reads never block writes. Here's how Postgres implements it with xmin/xmax, why your tables bloat, and where snapshot isolation bites.

8 min read

You run a SELECT that takes 400ms to scan a table. Halfway through, another connection runs an UPDATE on a row you haven’t reached yet. What should your query see — the old value or the new one? And should your read have blocked that write, or the write blocked your read?

Multi-Version Concurrency Control (MVCC) is the answer most production databases settled on: neither one waits. Postgres, MySQL’s InnoDB engine, Oracle, and SQL Server’s snapshot mode all keep more than one copy of a row at a time. A reader gets a consistent point-in-time view; a writer creates a new version alongside the old one. The two coexist instead of fighting over a lock.

The problem MVCC is solving

Without versioning, a database has two blunt options. It can take a read lock so writers wait until every reader finishes — correct, but readers and writers now serialize, and a long analytical query can stall writes for its entire duration. Or it can let reads see in-progress writes, which gives you dirty reads: you return a value that the writing transaction later rolls back, so you reported data that never officially existed.

MVCC sidesteps both. The core rule is the one you’ll see repeated everywhere: readers don’t block writers, and writers don’t block readers. When a transaction modifies a row, the database doesn’t overwrite the bytes in place. It writes a new version and leaves the old one in place until no running transaction can still need it. Every transaction reads against a snapshot — a definition of which versions are visible to it — so your 400ms scan sees one coherent state of the table even while other connections are busy changing it.

Writers still block other writers on the same row. Two transactions can’t both update row 17 concurrently and pretend they didn’t see each other; one waits. But that’s the only contention MVCC doesn’t remove, and it’s the one you usually can’t avoid anyway.

How Postgres actually stores the versions

Postgres is the cleanest system to reason about because the versioning lives directly in the table heap. Every row version (a tuple) carries two hidden system columns: xmin, the ID of the transaction that created it, and xmax, the ID of the transaction that deleted or superseded it. You can see them:

SELECT xmin, xmax, * FROM accounts WHERE id = 17;

Visibility is a comparison. A tuple is visible to your transaction if its xmin committed before your snapshot was taken and its xmax is either empty or belongs to a transaction your snapshot doesn’t consider committed. An UPDATE is mechanically an insert plus a stamp: Postgres sets the old tuple’s xmax to the updating transaction’s ID and inserts a brand-new tuple with a fresh xmin. A DELETE just stamps xmax and inserts nothing.

The consequence is the thing that surprises people: deleted and updated rows don’t free space immediately. The old versions — dead tuples — sit in the table until cleanup runs. That cleanup is VACUUM, which reclaims space from tuples no remaining transaction can see, and autovacuum runs it for you in the background. This is why a table you only ever DELETE from can keep growing on disk, and why a heavily-updated table develops bloat that slows sequential scans.

InnoDB and Oracle reach the same outcome by a different route. Instead of keeping old versions in the main table, they keep the current row in place and store enough undo information in a separate area (the undo log in InnoDB, rollback/undo segments in Oracle) to reconstruct older versions on demand. The trade-off flips: the main table stays compact, but a long-running read has to walk undo records to rebuild the snapshot it needs, and an undo log that fills up because of an old open transaction is InnoDB’s version of the same operational headache.

Snapshot isolation and the edge it hides

MVCC gives you snapshots cheaply, but when the snapshot is taken determines what anomalies you can still hit. That’s the isolation level.

In Postgres, the default is Read Committed: every individual statement gets a fresh snapshot. So two SELECTs in the same transaction can return different data if another transaction committed in between. Repeatable Read takes one snapshot at the transaction’s first statement and holds it — Postgres implements this as true snapshot isolation, so the whole transaction sees a frozen view.

Snapshot isolation feels like it should be airtight, and it nearly is, which is what makes its failure mode sneaky: write skew. Two transactions each read an overlapping set of rows, each checks a condition that’s still true in its own snapshot, and each writes a different row. Neither sees the other’s write because both snapshots predate it, so a constraint that depends on the combination — “at least one doctor must stay on call” — gets violated even though each transaction looked correct in isolation.

The practical takeaway is to know your database’s default. Read Committed is fine for most CRUD work. The moment you’re reading a value, making a decision, and writing based on it — balance checks, inventory decrements, allocation logic — you need to think about whether a concurrent transaction could invalidate that decision, and reach for a higher isolation level or an explicit SELECT ... FOR UPDATE lock.

Cursor

When you're tracing a visibility bug across schema, query, and transaction-wrapper code, an editor that can read the whole repo at once is the difference between an afternoon and ten minutes. Cursor's codebase-aware chat is genuinely useful for 'where does this transaction's isolation level get set?' questions.

Free tier available; Pro is $20/month

Try Cursor

Affiliate link · We earn a commission at no cost to you.

MVCC isn’t free — bloat, vacuum tuning, and undo-log pressure are the rent you pay for it. But the alternative, a database where every analytics query freezes your writes, is worse for almost every workload. Understanding the version mechanics underneath turns “why is this table 40GB when it has 2 million rows” from a mystery into a vacuum-tuning task.

FAQ

Does MVCC mean I never need locks?+
No. MVCC removes read/write contention, but two transactions writing the same row still serialize, and any read-then-write decision (check a balance, then debit it) can still be wrong under concurrency. Use explicit locks like SELECT ... FOR UPDATE or a higher isolation level when correctness depends on the combination of what you read and what you write.
Why does my Postgres table grow on disk even after I delete rows?+
A DELETE only stamps the row version as dead by setting its xmax; it doesn't reclaim the space. VACUUM (run automatically by autovacuum) reclaims dead tuples so the space can be reused. If autovacuum is falling behind — often because a long-running or idle-in-transaction connection is holding back the cleanup horizon — dead tuples accumulate as bloat.
Is MVCC the same thing as snapshot isolation?+
They're related but not identical. MVCC is the storage mechanism — keeping multiple row versions. Snapshot isolation is an isolation level you can build on top of it, where a transaction reads from a single consistent snapshot. In Postgres, Repeatable Read is implemented as snapshot isolation; Read Committed also uses MVCC but takes a fresh snapshot per statement.

Related reading

See all Dev Knowledge articles →

Get the best tools, weekly

One email every Friday. No spam, unsubscribe anytime.