Performance engineering requires stable benchmarks

drastic

                        November 12, 2020

                Performance engineering requires stable benchmarks

                        Reproducible benchmarks are essential to doing performance engineering. In order to know if a change had an impact on performance, you have to be able to measure performance, and that measurement has to be reasonably consistent across time, so that “before” and “after” measurements are comparable.
Unfortunately, modern software systems are incredibly difficult to benchmark, for two broad reasons:

Noise — I’m defining noise as effects external to the system being benchmarked which can influence its performance. This definition includes sources as obvious as other processes running on the same CPU (or different CPUs with shared caches or memory buses), but also more exotic processes like kernel timer interrupts, CPU throttling due to increased ambient temperatures, and explicit sources of randomness like address-space layout randomization.
Performance instability — By instability, I mean performance that has subtle (but technically deterministic) dependencies on small details of the system under test. Probably the most common source of performance instability is CPU caches, whose behavior can change drastically depending on small changes in memory layout or alignment; this classic paper reports consistent ±5% performance changes and sometimes much more drastic results just from changes in the link order of object files in their tests. However, interactions with heuristics or thresholds inside a VM or runtime can also have this effect, as can other interactions.

Performance instability is closely related to the idea of performance cliffs, which I define as cases where small changes in a program result in disproportionately large changes in performance. If a JIT only inlines methods below N bytes, a change which happens to grow some hot method above N bytes may exhibit a drastic performance degradation; I would class this as an instance of both a performance cliff, and of performance instability. However, I also want to include smaller variances in performance in my definition of "instability"; memory alignment rarely results in a drastic performance shift, but can consistently result in a measurable one.
Performance instability is in many ways much more pernicious than true noise. Noise that is truly exogenous to the system under test can be reduced by controlling the benchmark environment — techniques like disabling ASLR, isolating CPU cores, pinning CPU frequency, and so on are fairly well-known — and, to the extent that it is uncorrelated with program behavior, can be addressed using statistical techniques like averaging multiple runs. Performance instability, precisely because it comes from inside the system being tested, is much harder to eliminate or average out. Performance instability can lead to results like a change that is, say, on average, a 5% speedup — if applied to a large set of related programs — but which manifests as a deterministic 3% slowdown in a specific situation, not because of the change itself, but because it happens to shuffle data layout of some unrelated data in a way that happens to decrease CPU cache effectiveness in this specific example. Because this 3% slowdown is deterministic, no number of re-runs will show us the “true” speedup; to do that, we need some way to sample from some set of “related programs” that laid out data a bit differently. The Stabilizer project is one attempt to do precisely such a task, by way of a tool that can independently re-randomize memory layout in a fine-grained way, in order to sample “related” executions and essentially turn instability into mere noise that can be averaged over.
Is this why we’re slow?
I’m increasingly suspicious that the difficulty of benchmarking modern software systems is a major driver behind the observed phenomenon that, even as computer hardware continues to get faster, software gets slower at least as quickly. If consistent benchmarks require heroic acts of binary engineering and statistical analysis to measure performance, or a customized bare-metal setup with a custom kernel and hardware configuration, then most developers simply won’t bother.
One might optimistically hope that noise and performance-cliff-driven-instability only impact our ability to measure changes of the same order of magnitude as that noise, and that we can still make good progress with the “easy” benchmarks, and still do most of the performance work we’d like to. While I think this is sometimes true, I think these challenges have cascading impacts in various ways, that make them even worse than they might appear:

Performance improvements and regressions are cumulative. I love to tell the story of how SQLite achieved a 50% speedup by stacking sub-percentage optimizations, but even in less extreme cases, 1-2% changes over time really add up.
The noise is often shockingly bad. For e.g. the Ruby systems I worked on at Stripe, it was not uncommon to see 5-10% variance when attempting benchmarks. 10% noise is enough to make all but the very clearest-cut performance improvements challenging to demonstrate.
As mentioned above, the nature of unstable performance means that statistical analyses alone can’t help us here. When developers realize their benchmarks are noisy, almost invariably the first thing they try is running them N times and taking some summary statistic; but if the variance is endogenous to the system under test itself, this doesn’t actually help much.
Noisy and unreliable benchmarks provide plausibly deniability to ignore regressions as they happen. Most programmers are inclined — not out of malice but just out of the incentives of their situation — not to want to notice regressions. Noticing a regression creates work and delays a feature from shipping while it is investigated. If your benchmarks are known to be noisy, it’s easy to motivated-reason your way to a belief that your change didn’t really regress performance — you probably just ran on a particularly noisy piece of hardware.

Peak performance vs performance stability
I find it interesting to note that many or most of the sources of unstable performance to come to mind are themselves systems designed to improve performance:

Modern garbage collectors are marvels of sophistication and vastly outperform the GC of old, but at a cost of often-unpredictable behavior and lots of heuristics and tuning thresholds that can make small changes impact performance.
Similarly, modern JITs can achieve impressive speedups, but exhibit incredibly harder-to-understand performance behaviors, even sometimes varying continually over time¹ on the same static workload!
Caches — both CPU caches and OS-level and application-level — are essential to performance of pretty much any system these days, but are huge sources of performance instability depending on exactly how effective the cache ends up being on a given workload.
Modern superscalar speculative processors continue to defy physics and eke out incremental single-thread performance gains, but also increasingly decouple the details of the actual performance of software from the input machine code.
Multi-core and distributed systems can result in drastic increases in throughput or decreases in latency for a given application, but introduce novel sources of noise into performance measurements that are often impossible to eliminate.

This suggests to me that our very efforts to improve performance at the systems level are actually direct contributors to our struggles to maintain high performance at the application level! All of these innovations improve benchmarks and peak performance figures, and simultaneously making it harder and harder for application developers to do performance engineering, and less and less likely that developers actually do. I definitely don’t claim this is at all the full explanation for our slow software, but I think I’ve convinced myself it’s an under-recognized contributor.
What is to be done?
What do we do about this? I wish I had a neat answer.
I certainly don’t want your takeaway from this note to be that I oppose JITs or caches or other instances of sophisticated engineering meant to improve performance. On the contrary, I love those kinds of engineering, and love learning about them and how they work and contributing to them.
On the flip side, though, I do think there are some fundamental tradeoffs between such systems and having systems with easily-understood performance, and I think we should take those tradeoffs more seriously. We shouldn’t throw out the very idea of caching because it makes performance harder to reason about; but perhaps we should try 20% harder to implement a solution that is always fast, first, before we reach to add caches to our system.
Finally, I think we need drastically better tooling, and it needs to be much more broadly accessible.
There does exist some really incredible work to try to combat this problem and find ways to get reliable performance numbers out of noisy and unstable systems. To pick a few I’ve seen recently:

The Rust team has done an amazing writeup on their efforts to get deterministic results out of hardware performance counters in their benchmarking suites.
Stabilizer is a really interesting effort to, effectively, take a program and sample “related” programs in order to turn performance instability into genuinely random noise, which in turn allows you to use statistical methods to summarize it and compare performance numbers.
SQLite does all of their CPU performance work inside cachegrind, simulating an entire memory hierarchy, in order to get consistent performance numbers over time. I find this work particularly interesting because their resulting numbers don’t actually reflect performance on any real piece of hardware; but they do correlate well with real performance, and, because they are so deterministic, allow the authors to implement and test optimizations with very fine precision.
Laurence Tratt’s recent series (part 2) is a really great dive into VM performance that talks about how challenging these problems are, as well as some of the incredible tooling his team built to try to actually measure some of these effects.

Have you ever worked on a project that had really clever solutions to get more reliable benchmarks? Are there approaches or schools of thought I’ve missed? Drop me a note!

I really recommend this paper. It’s an incredible writeup of a truly heroic benchmarking effort to quantify VM performance, and surfaces some remarkably anomalies that do a great job showing just how challenging it is to truly characterize the performance of modern software systems. ↩

                        Don't miss what's next. Subscribe to Musing in Computer Systems: