Seriously friends he has gotten so large. He knows his name and comes when called, like, 80% of the time. He turned four months old today, just in time to celebrate National Puppy Day!
This week is a bit different (insofar as we have a “normal” over here). I spent a week or so digging into a PostgreSQL performance issue last week, and I’ve decided it was worth writing up the things I learned while digging into the cause, in case they are ever helpful for anyone in the future.
I find these kinds of writeups interesting because I love learning how software systems of all sorts work, and comparing and contrasting them with other implementations and exploring the implications and interactions between various different design choices.
However, if you don’t care much about Postgres implementation details, you can jump ahead to the “Takeaways” section, where I have some musings about my broader lessons and conclusions from this incident.
Under load, we saw our query latency to our Postgres database spiking precipitously, and our throughput plummeting virtually to zero; the database would almost completely lock up and make virtually no progress at all for a while. During these lockups, we would dig into `pg_stat_activity` and other tools, and see a large number of queries blocked on a number of apparently-related `LWLock`s. Some googling quickly revealed that `LWLock`s are internal mutexes used to protect shared data structures inside the Postgres implementation, but we had much less luck understanding what those specific locks meant. The documentation contained only the sparest descriptions of these locks, which did not really help us understand why we were so contended on them.
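For reference, a query along the following lines is one way to surface such waits; the `wait_event` columns have been part of `pg_stat_activity` since Postgres 9.6:

```sql
-- Show backends currently waiting on an LWLock, and which one.
SELECT pid, wait_event_type, wait_event, state, query
  FROM pg_stat_activity
 WHERE wait_event_type = 'LWLock';
```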
We eventually came to understand what we were seeing, by way of digging into a lot of Postgres implementation details, which I will now review for your benefit.
Postgres’ internals documentation in the source code are actually generally quite good. Much of the following is adapted from the documentation on tuple-locking; head there if you want to read along.
Postgres’ MVCC concurrency model allows for high concurrency for many operations without the need for locking records. However, Postgres also implements row-level locking, which is used to support explicit row-level locking via `SELECT FOR UPDATE`, and also internally to serialize write operations.
As a design goal, Postgres needs to support locking arbitrarily-many rows at a time. This implies potentially locking more rows than fit into memory (e.g. `SELECT * from [some_massive_table] WHERE [some condition matching half the records]` should succeed, and should also not degrade to a full-table lock). As a result, row locks cannot be stored exclusively in memory, but must be (potentially) persisted to disk. Thus, Postgres actually stores row-level locks in the header of the on-disk tuple that stores a table row. The “xmax” field is repurposed to store the transaction ID of the locking transaction, and a bit in the “infomask” flag word denotes that this transaction ID is a lock-holder.
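As a concrete example (using a hypothetical `accounts` table), an explicit row lock looks like:

```sql
BEGIN;
-- Takes an exclusive row lock: under the hood, Postgres writes our
-- transaction ID into the tuple's xmax field on disk, and sets an
-- infomask bit marking that xmax as a locker rather than a deleter.
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
COMMIT;
```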
One consequence of this design is that `SELECT FOR UPDATE` actually dirties on-disk data; for this reason, the performance of `SELECT FOR UPDATE` — even uncontended, and even without later performing an `UPDATE` — is much closer to that of an `UPDATE` on the locked row than it is to a bare `SELECT`! If you’re locking a row because you intend to update it anyway, this typically doesn’t matter all that much, but it’s good to know if you were considering `SELECT FOR UPDATE` for some other purpose.
In addition to `SELECT FOR UPDATE`, Postgres supports `SELECT FOR SHARE`, which implements reader/writer locks: any number of transactions can `SELECT FOR SHARE` the same row concurrently, but no row can be selected `FOR SHARE` and `FOR UPDATE` concurrently.
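Concretely (again with a hypothetical `accounts` table), the sharing rules play out like this across two sessions:

```sql
-- Session 1:
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR SHARE;   -- acquires a shared lock

-- Session 2, concurrently:
BEGIN;
SELECT * FROM accounts WHERE id = 1 FOR SHARE;   -- also succeeds
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;  -- blocks until session 1 commits or aborts
```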
This poses a challenge for the lock implementation: with `SELECT FOR SHARE`, an arbitrary number of transactions may have the same tuple locked simultaneously. How do we represent this? We’ve already seen that row locks are stored in the tuple header, but the tuple header is a fixed size, and is by design quite compact.
Postgres solves this by introducing a concept called MultiXact IDs. If multiple transactions lock the same row, the transaction ID in the “xmax” field is replaced with a separate type of ID, known as a MultiXact ID. A MultiXact ID represents a set of locking transaction IDs (and some flags for each); the locking engine maintains, in a separate store, a lookup table mapping each MultiXact ID to the corresponding set of transaction IDs.
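Postgres ships a debugging function, `pg_get_multixact_members` (available since 9.3), for looking up this mapping directly. If you know a tuple’s xmax holds a MultiXact ID (say, 1), you can list its members like so:

```sql
-- Returns one (xid, mode) row per member transaction of MultiXact ID 1,
-- where mode indicates the lock strength each member holds.
SELECT * FROM pg_get_multixact_members('1');
```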
The MultiXact mapping is stored in a dedicated directory (`pg_multixact`) inside Postgres’ data directory, using a pair of “SLRU” (“Simple LRU”) stores. One store (the “members” store) is a tightly-packed flat array of transaction IDs and flag bytes, and the other (the “offsets” store) maps MultiXact IDs to ranges of transaction IDs in the members store.
One fun fact about the MultiXact store is that entries are immutable; if a new transaction locks a row that is already locked, a new MultiXact ID must be created, containing all the old transaction IDs as well as the new ID. This means that having N processes locking a row concurrently requires potentially quadratic work, since the list of transaction IDs must be copied in full, each time! This behavior is usually fine, since it’s rare to have many transactions locking the same row at once, but already suggests that `SELECT FOR SHARE` has some potentially-disastrous performance cliffs.
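To make the copying concrete, here’s a sketch (with made-up IDs) of what happens as three sessions each run the same share-lock statement against one row:

```sql
-- Session 1 locks the row: xmax = 101 (a plain transaction ID, no MultiXact).
-- Session 2 locks it too:  xmax = MultiXact 1 -> {101, 102}
-- Session 3 joins:         xmax = MultiXact 2 -> {101, 102, 103}
--   (MultiXact 1's members are copied in full into the new MultiXact 2)
SELECT * FROM accounts WHERE id = 1 FOR SHARE;  -- run in each session
```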
Looking at our blocking data from earlier, we were blocked on the following four LWLocks:
We can now explain what each of those is: each SLRU region protects its metadata with a “Control” lock, and protects the actual page data using a set of per-page LWLocks, which are the lower-cased locks we see.
The Control locks, in particular, must be taken for any access to the MultiXact store! So, under heavy access to the MultiXact store — which, for whatever reason, our application code was triggering — all queries essentially end up contending on a single global mutex!
Another phenomenon we noticed in our application was that, even when the application was not under heavy load, performance would steadily degrade over time, and we would see more and more of the above locks showing up in wait profiles.
It turns out that, when a row has been locked by a MultiXact ID, that ID is not synchronously removed; even if all locking transactions have completed, the ID stays around, and future transactions must consult the MultiXact store in order to determine that the lock has expired. Thus, as our application accrued these MultiXact locks on more and more rows, over time more and more transactions would have to refer to the MultiXact store, and would become contended on the above locks.
These IDs are periodically cleaned up by Postgres’ VACUUM process; the Postgres VACUUM documentation explains its role in cleaning up MultiXact IDs pretty well. However, the first rule of PostgreSQL tuning is that you aren’t VACUUMing enough, and indeed our database’s autovacuum process was not configured aggressively enough to fully keep up with all the MultiXact traffic that our application was apparently generating. So that explained one mystery of our performance troubles.
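If you suspect MultiXact buildup, you can watch each table’s oldest-unfrozen MultiXact age and tune the relevant freeze thresholds. The specific values below are illustrative, not recommendations:

```sql
-- How old is the oldest MultiXact ID still referenced by each table?
SELECT relname, mxid_age(relminmxid) AS multixact_age
  FROM pg_class
 WHERE relminmxid <> '0'
 ORDER BY multixact_age DESC;

-- Make (auto)vacuum more eager about freezing away old MultiXacts.
ALTER SYSTEM SET vacuum_multixact_freeze_min_age = 1000000;
ALTER SYSTEM SET autovacuum_multixact_freeze_max_age = 100000000;
SELECT pg_reload_conf();
```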
At this point in our investigation, though, we weren’t clear why we were generating MultiXact IDs at all. Our application did make use of explicit row-level locks using `SELECT FOR UPDATE`, but we were definitely not using `SELECT FOR SHARE`. MultiXact IDs only come into play if multiple transactions actually hold a lock concurrently, but `SELECT FOR UPDATE` is an exclusive lock, meaning only one transaction can hold the lock at a time!
The tuple locking documentation includes this tantalizing hint, though:
In earlier PostgreSQL releases, a MultiXact always meant that the tuple was locked in shared mode by multiple transactions. This is no longer the case; a MultiXact may contain an update or delete Xid.
But it did not elaborate on precisely when that could happen.
After a bunch of experimentation, though, we eventually found the culprit. Our application ended up inadvertently containing recursive calls to Django’s `transaction.atomic`, which Django implements by emitting the SQL `SAVEPOINT` statement, which Postgres in turn implements using subtransactions. We discovered that the following sequence suffices to make Postgres generate a MultiXact lock:

```sql
SELECT [some row] FOR UPDATE;
SAVEPOINT save;
UPDATE [the same row];
```
I am not completely clear on the details, but locking a row explicitly and updating a row both take out locks, which must be accounted for slightly differently. If both locks are held by the same transaction this can be managed by just updating the tuple flags, but when the update is performed in a subtransaction, Postgres degrades to storing the two locks separately using a MultiXact ID.
This was, in fact, the sequence our application was using. We changed the inner `transaction.atomic` to pass `savepoint=False`, which turns it into a no-op if already inside an open transaction, and our problems immediately disappeared.
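In terms of the SQL Django emits, the difference looks roughly like this (statements elided for brevity):

```sql
-- Nested transaction.atomic with the default savepoint=True:
BEGIN;
SELECT ... FOR UPDATE;
SAVEPOINT s1;            -- inner atomic block opens a subtransaction
UPDATE ...;
RELEASE SAVEPOINT s1;
COMMIT;

-- With savepoint=False on the inner block, no subtransaction is created:
BEGIN;
SELECT ... FOR UPDATE;
UPDATE ...;
COMMIT;
```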
How did we determine which sequences do or don’t produce MultiXact IDs? It turns out Postgres comes with the very useful `pageinspect` extension, which allows easy inspection of the actual on-disk format of a table. We can observe the above sequence creating a MultiXact ID in a test database like so:
```
postgres=# create extension pageinspect;
CREATE EXTENSION
postgres=# create table test(i int);
CREATE TABLE
postgres=# insert into test (i) values (1);
INSERT 0 1
postgres=# select lp, t_xmin, t_xmax, t_ctid, t_infomask,
           (t_infomask&4096)!=0 as is_multixact
           from heap_page_items(get_raw_page('test', 0));
 lp | t_xmin | t_xmax | t_ctid | t_infomask | is_multixact
----+--------+--------+--------+------------+--------------
  1 |    488 |      0 | (0,1)  |       2048 | f
(1 row)

postgres=# begin;
BEGIN
postgres=*# select * from test where i = 1 for update;
 i
---
 1
(1 row)

postgres=*# savepoint a;
SAVEPOINT
postgres=*# update test set i = i+1 where i = 1;
UPDATE 1
postgres=*# commit;
COMMIT
postgres=# select lp, t_xmin, t_xmax, t_ctid, t_infomask,
           (t_infomask&4096)!=0 as is_multixact
           from heap_page_items(get_raw_page('test', 0));
 lp | t_xmin | t_xmax | t_ctid | t_infomask | is_multixact
----+--------+--------+--------+------------+--------------
  1 |    488 |      1 | (0,2)  |       4416 | t
  2 |    490 |    489 | (0,2)  |       8336 | f
(2 rows)
```
`heap_page_items(get_raw_page(…))` lets us inspect the actual on-disk tuples in a physical page of a Postgres table. I won’t explain all the fields here, but I will note that bit `0x1000` (4096) in `t_infomask` records whether a row has a MultiXact ID, which lets us select `(t_infomask&4096)!=0 as is_multixact` to determine which rows have been locked using MultiXact IDs. We can also see that `t_xmax` on the MultiXact-locked row is much smaller than the other `t_xmax` values, because it comes from the MultiXact namespace, and not the normal XID namespace.
Well, that was a lot of diving deep into PostgreSQL implementation details that I hope most of you will never have to care about. I want to wrap with some high-level conclusions I take away from this deep dive.
This experience really reinforced my pre-existing impression that Postgres is enormously complex and full of poorly-labelled operational landmines. Postgres is incredibly powerful and sophisticated, but it also contains a million and one ways to accidentally tank your performance, lock up your database for extended periods, and even leave your entire database nearly-unrecoverably messed up. And when you do encounter these failure modes, there may or may not be any warning that you are approaching, or have passed, the point of disaster — or any indication of why.
It is entirely possible to operate Postgres safely and with high performance and throughput at scale … but essentially the only way to do it is to have ready access to deep PostgreSQL experience and expertise on your team; people who Just Know where the landmines are because they’ve seen them before. The only alternative is to repeatedly run into these issues, experience painful outages, and spend time and money recovering and learning the lessons yourself.
If you don’t have an expert DBA on your team, you probably need a contact at, and a contract with, an expert Postgres consultancy. PGExperts and 2ndQuadrant both come well recommended to me, although I haven’t worked with either one personally.
A “performance cliff” is any time a small change to a workload or application can result in a dramatic performance degradation. They are often introduced as a result of optimization: If you optimize a program for some common case, you may introduce a performance cliff for future programs that stray off of that garden path.
I believe we consistently fail to take performance cliffs seriously enough. In some ways, it’s worse to have a performance cliff — with a fast sweet spot, and slow performance outside of it — than it is to be uniformly slow. If a program is uniformly slow, then its behavior is predictable and no one will depend on it being fast. If there’s a fast sweet spot, users will build architecture relying on that performance, and are now at risk of incidents if they inadvertently stray off of the fast path.
Performance cliffs are probably unavoidable in modern software; our very hardware contains steep performance cliffs like exceeding the size of the L3 cache or the TLB. However, I think we can mitigate them and reduce their risk in a few ways:
`SELECT FOR SHARE` or subtransactions and have low throughput needs could still explicitly opt in to the present behavior.
More tactically, if you do run Postgres, here are a few conclusions I take away about designing your application and working with Postgres.
This experience has made me very wary of `SELECT FOR UPDATE`, and extremely skittish about `SELECT FOR SHARE`. As mentioned earlier, the implementation of both locking mechanisms makes them comparably expensive to an `UPDATE` in latency — even when uncontended — and they both enable scary performance cliffs in the wrong circumstances.

I will never write code using `SELECT FOR SHARE`, and I will be very cautious about reaching for `SELECT FOR UPDATE` in the future, preferring to find some other means of concurrency control.
I told a version of this story to a friend with years more Postgres experience at scale than I will ever have, and he commented:
Subtransactions are basically cursed. Rip em out.
While they are technically supported, they definitely seem to have very scary performance behaviors in some cases. Just don’t. I’ll add them to the list of features I wish I could hard-disable in database configuration.
I found the official Postgres docs almost useless for understanding what behavior we were running into. However, once I started digging into the source to try to understand the implementation, I was quite pleased by the amount of very readable documentation I found. Most directories have `README` files containing notes on the implementation, and most files have good overall comment blocks explaining their function and purpose. If you do run Postgres, don’t fear the source! Get familiar with it, and it will serve you well in understanding your application’s weird performance issues.