I just this week finished reading Kuhn’s The Structure of Scientific Revolutions. I’d encountered many or most of the ideas in the book by reference, but this was my first time reading through the original work. I’d recommend it — it’s fairly short and pretty readable, and — among other topics — full of fun anecdotes about early scientific development.
In any case, thinking about the ideas in that book definitely contributed to the thoughts in today’s newsletter.
What is it that we are doing when we benchmark? When we time the execution of some programs and report their results?
I think it’s important to observe that when we report a performance measurement, we rarely care about the recorded numbers per se. If a given process took 317ms in a benchmark, we rarely care that that particular execution took 317 milliseconds. Instead, we care because we believe it implies that similar processes in the future will also take about 317 milliseconds, or because some alternate process took 423 milliseconds, and we want to conclude that the first approach will be consistently faster on future inputs than the second.
When we benchmark, therefore, fundamentally what we are doing is an act of prediction or generalization — we are trying to take some concrete finite set of observations, and use them to make predictions about future situations.
Benchmarking, therefore, requires a theory of performance. We need to have some model that tells us in which ways our tests are representative, and in what sense a future use case needs to be similar to our tests in order for our predictions to be useful.
In simple cases, our theories are relatively straightforward and may even appear trivial; when we apply a small patch and run a program before and after the patch, we are behaving according to a theory that says something like:
Our patch is the only relevant difference between runs; any other effects that might impact performance between our runs are small relative to the change we expect to see from this patch
Even spelling out that theory makes several assumptions explicit, and makes clear one important observation about benchmarking changes: we need different tools to observe performance improvements with a large percentage impact than to observe small ones. This fact is likely obvious to anyone who has done much performance engineering, but it is also sometimes forgotten. I have submitted pull requests implementing a 1000x speedup on some core piece of code and received pushback for not implementing more sophisticated statistical benchmarking. I’m not opposed to precision in benchmarking, but it seems to me that if a patch makes something run in one thousandth of the time, the precise question of whether it’s 700x or 1500x faster is rarely the most important one.
The above theory mentions other sources of performance differences between runs. We might try to quantify these, to gain confidence in our results, by running one side of the benchmark repeatedly and observing the noise. However, far from removing an assumption, this merely transforms it. If we want to observe the performance change from the patch — including the results of the patch when applied to future versions of the software, not just on this precise revision — we need to believe something like:
The “noise” we sample from by running a given binary repeatedly is similar for each version; our patch does not change the noise distribution in ways unrelated to its primary function.
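As a concrete sketch, we can probe that assumption by timing both versions repeatedly and reporting the spread, not just a single number. The workloads and run count below are invented purely for illustration:

```python
import statistics
import time

def time_runs(fn, runs=20):
    """Time fn repeatedly and return the list of wall-clock samples."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples

# Hypothetical "before" and "after" versions of the code under test.
def before():
    sum(i * i for i in range(10_000))

def after():
    sum(i * i for i in range(5_000))

base = time_runs(before)
patched = time_runs(after)

# Reporting the spread, not just a point estimate, lets us eyeball
# whether both versions seem to draw from a similar noise distribution.
for name, s in (("before", base), ("after", patched)):
    print(f"{name}: mean={statistics.mean(s) * 1e3:.3f}ms "
          f"stdev={statistics.stdev(s) * 1e3:.3f}ms")
```

If the two standard deviations differ wildly, that is itself a signal that the theory above does not hold for this patch.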
As I discussed last time, this assumption doesn’t always hold. The Stabilizer paper does an excellent job of demonstrating that small differences in object addresses can affect performance — perhaps our change shifts alignment so that two key data structures now map to the same cache set where previously they did not — but we can find similar effects at higher levels. Perhaps our garbage collector runs a collection after every N bytes of allocation, and our patch pushes the test case just over the N-byte threshold. We might measure the time spent garbage collecting in our benchmark and report a significant degradation, when in fact every allocation in the program contributed to this threshold, and we “should” only blame the patch for the new bytes it allocated, not the entire cost of crossing that threshold.
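A toy model makes the threshold effect easy to see. The collector below, with a made-up 1000-byte threshold, ends up blaming the final allocation for a collection that every prior allocation helped cause:

```python
# Toy model of a GC that collects after every N bytes allocated. A patch
# that adds a few bytes can push a run over the threshold and get charged
# for the entire collection, even though every allocation contributed.
GC_THRESHOLD = 1000  # hypothetical: collect after every 1000 bytes

def gc_collections(allocations):
    """Count collections triggered by a sequence of allocation sizes."""
    allocated, collections = 0, 0
    for size in allocations:
        allocated += size
        if allocated >= GC_THRESHOLD:
            collections += 1
            allocated -= GC_THRESHOLD
    return collections

base = [100] * 9        # 900 bytes total: no collection
patched = base + [150]  # the patch allocates 150 more bytes: one collection
print(gc_collections(base), gc_collections(patched))  # → 0 1
```

The patched run pays the full collection cost, but attributing that cost entirely to the patch would be the measurement error described above.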
Benchmarking whole systems
That touches on some of the complexities of benchmarking individual changes to a system, which I talked about a lot last week.
The problems get much harder when we’re evaluating a complex system for some purpose. Suppose we are building a web service, we have some estimate of the load we need to serve, and we want to answer the question: can MySQL on a single EC2 instance of a certain size support our load, or will we need some other solution?
We might be tempted to reach for a benchmark to answer the question; we can write a load-generator that simulates the load profile we expect, and see how MySQL performs.
On the surface, it seems like such a test ought to be authoritative: surely running an actual test with a good approximation of our actual load will give us truer results than, say, a back-of-the-envelope estimate or a comparison with someone else’s benchmarks for a different use case. However, such an experiment once again relies on some unstated premises. We are assuming that our test harness is a “fair” test in some sense: that it is making efficient use of MySQL, and that MySQL has been reasonably well-tuned to the test in question.
At the most extreme, imagine that we wrote a benchmark that did not create any indexes, and then repeatedly made point queries on some field. Looking only at our performance numbers, we might conclude that MySQL was hopelessly slow and that we definitely need to look elsewhere. If our goal was to answer “Can MySQL work here?”, that conclusion is completely wrong; we’ve instead answered “Will MySQL work here if we use it badly?” We’ve given a true answer to that question, perhaps, but that’s hardly what we were hoping to ask.
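To make that failure mode concrete, here is a sketch using Python’s built-in SQLite as a stand-in for any SQL database (the table, row count, and query are all invented). The same point query runs very differently depending on whether the benchmark author remembered the index:

```python
import sqlite3
import time

# In-memory SQLite as a stand-in for MySQL; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, f"user{i}@example.com") for i in range(100_000)],
)

def point_query():
    """Time a single point lookup by email."""
    start = time.perf_counter()
    conn.execute(
        "SELECT id FROM users WHERE email = ?",
        ("user99999@example.com",),
    ).fetchone()
    return time.perf_counter() - start

no_index = point_query()  # full table scan
conn.execute("CREATE INDEX idx_email ON users (email)")
with_index = point_query()  # B-tree lookup
print(f"scan: {no_index * 1e3:.2f}ms  index: {with_index * 1e3:.2f}ms")
```

A benchmark built on the first number answers a question about our harness, not about the database.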
For the case of SQL indexes, perhaps few experienced engineers would ever make such a mistake. But for more exotic systems, or for subtler questions of tuning (limiting MySQL to a tiny fraction of your RAM for its cache is another classic), it’s all too easy to mistake an answer to one question for an answer to another. Of course, it’s also possible to take this line of thought too far and always blame our users for “holding it wrong”; the question “How well does a tool perform in the ways a typical user will try to use it?” is also an important one, and distinct from “How well does the tool perform with a full-time operator who is expert in all of its tuning parameters?”
We run into similar problems when we try to compare, say, two algorithmic approaches to a problem. If we want to compare two algorithms, but we compare a highly-optimized implementation of one with the first-draft version of the other, it’s hard to know whether the results come down to a superior algorithm or just to the nitty-gritty of the optimization work. This can pose a real challenge when we’re working on a problem with highly mature existing solutions. If we think we have a clever new approach, simply implementing it and benchmarking it may not tell us whether we’re on the right path; our new solution may be slower only because it lacks all the optimizations built into the mature system. To know whether the approach is worth pursuing, we either need to do all the work to grind out the low-level performance gains, or else find some other metric that suggests it has promise.
Theories of performance
These latter questions get to why I think of benchmarking as being about a “theory of performance.” In my earlier examples, about benchmarking the effect of a single patch, I outlined the implicit beliefs driving the tests we might run, but those read more like simple lists of assumptions than something obviously worth the name “theory.”
When we talk about benchmarking a database system for a given use case, or comparing algorithms, though, I think we are at our strongest when we have a fairly rich model of the performance of the system. If we have some model of what the slow operations and bottlenecks are, and how we expect them to change with the input, we can use benchmarks to refine and stress-test our model, and to populate unknown parameters; we have a rich structure we can use to cross-check and interpret the results of our experiment, instead of interpreting them in isolation. For instance, if our model predicts performance should be linear in some factor, we can vary that factor and test that correlation. If it holds, we have a more powerful tool for making predictions, and if it fails, we know our understanding was wrong and we need to dig deeper.
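As a sketch of that kind of cross-check, we can vary the factor our model says matters, fit a straight line, and see how much of the variance the line explains. The workload and input sizes here are stand-ins for the real system:

```python
import statistics
import time

def bench(n):
    """Median of several timings of a hypothetical linear-time operation."""
    samples = []
    for _ in range(5):
        start = time.perf_counter()
        sum(range(n))  # stand-in for the operation under test
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

sizes = [10_000, 20_000, 40_000, 80_000, 160_000]
times = [bench(n) for n in sizes]

# Least-squares fit of time = a*n + b, with R^2 as a crude linearity check.
N = len(sizes)
mx, my = sum(sizes) / N, sum(times) / N
a = (sum((x - mx) * (y - my) for x, y in zip(sizes, times))
     / sum((x - mx) ** 2 for x in sizes))
b = my - a * mx
ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(sizes, times))
ss_tot = sum((y - my) ** 2 for y in times)
r_squared = 1 - ss_res / ss_tot
print(f"slope={a:.3e} s/element  R^2={r_squared:.3f}")
```

An R² near 1 supports the linear model; a poor fit tells us our understanding was wrong before we ever rely on the numbers for prediction.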
Back to our MySQL database: If I had cause to benchmark a SQL database for some task, I would start with a simplified model of its performance based on my understanding of the underlying B-Trees, WAL, I/O system, and hardware. I would compare the results of my benchmarks to what my model suggests they should be, including how they vary with different hardware or input size. If the results didn’t match up with my model, I would investigate, and revise the model or improve the benchmark until they did. That would be no guarantee that I was using MySQL optimally — perhaps there are some optimizations that exist entirely outside of my mental model — but it would give me some confidence that I was in a reasonable ballpark, since I could relate the numbers I was seeing to expectations grounded in hardware capabilities and algorithmic limits.
I think this gets at another reason I like to relate performance to hardware capacity. Doing so helps ground your performance models in the very concrete, and helps prevent benchmarking mistakes where you take some egregious performance hit and never even notice.
That said, we can have theories of performance that don’t go down to the concrete hardware level. The Universal Scalability Law is one such model, which attempts to characterize the scalability of just about any system using a small number of parameters. Like any good theory, using it tells you what to look for when benchmarking, and tells you (some of) what to expect; if your experiments match those expectations, then you can — with some confidence — use the law to make predictions about the system across a broader domain.
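For reference, the USL models relative throughput at concurrency N as N / (1 + σ(N − 1) + κN(N − 1)), where σ captures contention and κ captures coherency (crosstalk) costs. A small sketch, with parameter values invented purely for illustration:

```python
# Universal Scalability Law: relative throughput at concurrency n, where
# sigma models contention and kappa models coherency (crosstalk) costs.
def usl_throughput(n, sigma, kappa):
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# These parameter values are made up for illustration; in practice you
# would fit them to measured throughput at several concurrency levels.
sigma, kappa = 0.05, 0.0002

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"N={n:4d}  relative throughput={usl_throughput(n, sigma, kappa):7.2f}")

# With kappa > 0 the model predicts a throughput peak near
# sqrt((1 - sigma) / kappa), after which adding concurrency hurts.
peak = ((1 - sigma) / kappa) ** 0.5
print(f"predicted peak concurrency ~ {peak:.0f}")
```

Fitting σ and κ from a handful of measurements, then checking further measurements against the curve, is exactly the refine-and-stress-test loop described above.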
It’s also important to note that an incorrect theory of performance can lead to seriously erroneous conclusions. One common piece of benchmarking advice is to run an experiment multiple times and take the minimum time, on a theory that says (something like) “measured performance = true performance + noise,” where “true performance” is relatively constant, and “noise” is random and non-negative. In another recent blog post, Laurence Tratt does an excellent job of showing the dangers of trying to apply this model.
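A quick simulation shows the failure: if the “noise” is not a small non-negative offset but, say, an occasional large pause that is part of the real cost, the minimum quietly discards it. The distribution below is made up for illustration:

```python
import random
import statistics

random.seed(0)  # deterministic for reproducibility

# Hypothetical workload: most runs take ~10ms, but one run in five pays
# a ~15ms pause (say, a garbage collection). The pause is part of the
# real cost, yet taking the minimum discards it entirely.
samples = [
    10 + random.gauss(0, 0.2) + (15 if random.random() < 0.2 else 0)
    for _ in range(1000)
]

print(f"min  = {min(samples):.2f}ms")              # looks like ~10ms
print(f"mean = {statistics.mean(samples):.2f}ms")  # closer to ~13ms
```

A prediction built on the minimum would underestimate the cost of every fifth request by a wide margin, precisely because the underlying theory of noise was wrong.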
With this perspective, I tend to think it’s useful to view even the simple “benchmark a patch” experiments we opened with as driven, not just by a list of assumptions, but by a theory of performance that tells us how a system’s performance varies, what experiments we can perform, and what results to expect.
In short, the connection here to Structure is that I think benchmarking can be thought of as closely analogous to Kuhn’s “normal science.” Much as Kuhn tells us that normal science cannot proceed without a paradigm to tell us which questions are worth asking, and what kinds of answers to expect, we cannot productively benchmark without a theory of performance that similarly guides our experiments and our interpretation of their results.