Understanding MVCC

Why Handle Concurrency

Concurrency helps deal with race conditions. Types of race conditions in databases are -

  1. Dirty Read :: One transaction reads uncommitted work from another transaction
  2. Non-Repeatable Read :: One transaction gets two different results for the same read, because data changed in between.
  3. Phantom Read :: A transaction sees a row in a second read that wasn’t reported in the first read because the data was deleted in another transaction.
  4. Seralization Anomaly :: Concurrent execution of transactions are nondeterministic or arbitrary.

Concurrency Control options in Databases

Two major lines of thought.


This is where we want to lock everything. Conflicts are expected. This is the normal case.

Two phase Locking

This is the original concurrency control for databases. Readers acquire a shared lock. Writers acquire an exclusive lock. Locks build up during the “growing” phase and then are released during the “releasing” phase in reverse order. This approach isn’t really optimized for high throughput. Writers block readers, and readers block writers.


Here we don’t expect conflicts. We accept everything, but fail to commit if there’s a conflict.

Timestamp Ordering

This is a “lock-free” concurrency control. The requirement is that the “timestamp” must be monotonically increasing. Timestamps are used to evaluate or detect conflicts.

The Postgres Approach to MVCC


Postgres makes multiple versions or “snapshots” of the data for each write transaction. This is copy-on-write. Reading never blocks writing. Writing never blocks reading.

A Walkthrough

  • At t=0 (epoch)

    • We insert a record.
    • MVCC stores the row with some additional context per version. This context is a tuple - the begin timestamp and end timestamp.
    • Initially, the end timestamp will represent infinity, because its the most current version. Begin timestamp will be 0, since we’re at ecoch.
  • At t=2

    • We insert a new record.
    • MVCC adds another version row into the internal representation with additional contecxt.
    • The end timestamp of the first row is atomically changed from infinity to 2 (since that’s when the next version was created.)
    • The end timestamp of the second row is changed to infinity.
  • At t=3

    • We try to read the row.
    • At t=3, we pick the valid row, which will be reading the second version.
  • At t=4

    • We try to delete the row.
    • The end timestamp of the second version row is changed from infinity to 4.
  • At t=5

    • We try to read the row again.
    • But no valid versions exist, so no results returned.

Now what happens when an old transaction opened at t=1 tries to read again. This reads the version of the row where transaction start time (t=1) is between version start time and version end time. So we’ll get a “snapshot” of the row as of transaction open time.

How does postgres model the Version Chain

  1. Every transaction has an xid which is transaction Id, an xmin which is begin “timestamp” and an xmax which is end “timestamp”.
  2. However instead of an actual timestamp, x is just a monotonically increasing integer (int32).

In postgres, you can actually select xmin and xmax for every row from the table. Infinity is represented by the value of 0 for xmax.

SELECT xmin, xmax, * FROM t

What are Postgres Vacuums?

You need VACUUMs because of MVCC. VACUUMs do the following -

  1. Find the lowest open transaction id (xid)
  2. Delete all versions where this tx.xmax < this xid

If you don’t vacuum -

  1. You could run out of diskspace.
  2. You could run into wierd stuff many queries later.

The transaction ID will wrap around when it reaches the max value. Plenty of outage reports online where companies had trouble due to xid wraparound. Postgres will try to do a VACUUM. For this to succeed, you have two options -

  1. Stop the world, terminate all connections, and complete the vacuum before wrapping around.
  2. Do not stop the world, accept inconsistency/data loss due to the fact that we lost the monotonic property of xid.

If you try and open up psql when you have very high xid and try to run queries, you will receive warnings that wraparound is imminent.

Postgres Transaction Guarantees

MVCC by enough is not enough for guarantees. It also depends on the transactions isolation level.

ANSI SQL has 4 levels of SQL Transaction isolation levels -

Read Uncommitted

  • This really means no isolation at all.
  • Postgres does not allow for this isolation level.

Read Committed

  • This is the default and lowest isolation in Postgres.
  • Each statement in the transaction (instead of the transaction itself) gets a certain snapshot of the database, based on the startime of those statements. So every statement gets to seee the latest committed snapshot.
  • This allows non-repeatable reads, phantom reads, and other classes of serialization problems.

Repeatable Read

  • The snapshot for all statements in a transaction is determined before the transaction begins. This is how most people think about transactions.
  • The minimum transaction id is determined by when the transaction started, and not when each statement begins.
  • This still allows for serialization anomalies.

An example of the serialization anomaly is as follows -

  • Imagine that a bank needs to verify that the sum of all your accounts' balances is over 0.
  • tx1 reads both accounts and updates account 1 to make it zero.
  • tx2 reads both accounts and updates account 2 to make it zero..
  • Both update different rows so Postgres doesn’t see a conflict. But the application requirement that sum of both balances is zero is no longer valid.

This can be solved by a few different ways -

  • You can lock the row. But this defeats concurrency.
  • You can handle this in application logic.
  • Or enable Serializable


  • Strictest isolation level. It is entirely deterministic.
  • Guesses write dependencies based on orders of reads and writes in a transaction.
  • Result serially and concurrently should be identical
  • Never sacrifices isolation for anomalies.
  • More frequently aborts transactions due to “serialization failures”. You should never consider data good until committed, because your transaction could frequently aborted.

If txn1 reads two rows and writes 1 row, and txn2 reads 2 rows and writes to a different row. Postgres needs to understand what is being read/common across transactions, and guess at write dependencies. This is why transactions get aborted.

The reason this is not default in Postgres is because of frequent serialization failures when a dependency is violated, and possibly performance due to all this dependency tracking.

Locks in Postgres?

While MVCC is lock-free concurrency, many operations still require light locks under the hood. Look at pg_stat_activity and pg_locks to diagnose any issues. Postgre does periodic checks (with configurable timestamps) that builds a dependency graph between locks and checks for deadlocks.

How does Oracle handle MVCC?

Oracle avoids copying rows, wherever possible. When a row is updated, the current version of the row is updated in-place and the UNDO log will store (in a different place) the change vectors that can be used to rebuild a previous version of the block. This means indexes do not need to be updated. In contrast, with Postgres, because the whole row is copied to the new version (which could be in a new page), all indexes must also be updated to point to the new location. This also generates a high volume of WAL because many blocks are touched when a tuple is moved to another place.

With Oracle, even if the row size increases, and the row no longer fits in the block, Oracle will keep a pointer to the new row (chained rows) so that index entries continue to be valid.

The advantage with this approach is that there is no additional work needed to keep predictable performance on queries. The table blocks are clean and the undo blocks will just be reused later. Index blocks are also versioned in the same way, which means that a query can still do a true Index Only scan, even when there are concurrent changes. Oracle is versioning the whole blocks, all datafile blocks and a query just builds the consistent version of the blocks when reading them from the buffer cache.


  • Tejas Manohar’s excellent PostgresOpen 2017 talk. I cannot recommend this highly enough.
  • Comparison of Oracle’s and Postgre’s approach to MVCC.
comments powered by Disqus