Reproducible benchmarks are essential to performance engineering. To know whether a change had an impact on performance, you have to be able to measure performance, and that measurement has to be reasonably consistent over time, so that “before” and “after” measurements are comparable.
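To make that concrete, here is a minimal sketch of the bare-minimum “before”/“after” comparison: time the same workload many times and look at the spread as well as the central tendency, since a claimed speedup only means something if it is large relative to that spread. The `workload()` function is a hypothetical stand-in for whatever code path you are changing, not anything from a real project.

```python
# Minimal "before"/"after" measurement sketch. `workload()` is a hypothetical
# stand-in for the code under test; the point is to report spread, not just a
# single number.
import statistics
import time

def workload():
    # hypothetical code under test
    return sum(i * i for i in range(100_000))

def measure(runs=20):
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples

samples = measure()
print(f"median {statistics.median(samples) * 1e3:.2f} ms, "
      f"min {min(samples) * 1e3:.2f} ms, "
      f"max {max(samples) * 1e3:.2f} ms")
```

Run this before and after your change; if the two medians differ by less than the min-to-max spread, you have not really measured anything yet.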
Unfortunately, modern software systems are incredibly difficult to benchmark, for two broad reasons:
Performance instability is closely related to the idea of performance cliffs, which I define as cases where small changes in a program result in disproportionately large changes in performance. If a JIT only inlines methods below N bytes, a change that happens to grow some hot method above N bytes may cause a drastic performance degradation; I would class this as an instance of both a performance cliff and of performance instability. However, I also want to include smaller variances in performance in my definition of “instability”; memory alignment rarely results in a drastic performance shift, but can consistently result in a measurable one.
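A JIT cliff is hard to demonstrate portably, but the layout side of this claim is easy to poke at. The sketch below (plain CPython, no dependencies) sums the same float objects twice: once in allocation order, once in a shuffled order that turns the traversal into random pointer-chasing. This is about access locality rather than alignment per se, and the magnitude depends on your allocator and cache sizes, so treat it as an illustration only; but the gap it shows is typically consistent across runs rather than run-to-run noise.

```python
# Sketch: same values, same algorithm, different memory-access order.
# Iterating the shuffled list chases pointers to the same float objects in a
# cache-unfriendly order, which usually shows up as a consistent (not noisy)
# slowdown. Magnitudes vary by machine; this is an illustration only.
import random
import time

N = 2_000_000
contiguous = [i + 0.1 for i in range(N)]  # floats allocated back-to-back
shuffled = contiguous[:]
random.seed(0)
random.shuffle(shuffled)                  # same objects, different traversal order

def time_sum(values):
    start = time.perf_counter()
    total = sum(values)
    return time.perf_counter() - start, total

t_contig, _ = time_sum(contiguous)
t_shuf, _ = time_sum(shuffled)
print(f"allocation order: {t_contig:.3f}s   shuffled order: {t_shuf:.3f}s")
```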
Performance instability is in many ways much more pernicious than true noise. Noise that is truly exogenous to the system under test can be reduced by controlling the benchmark environment (techniques like disabling ASLR, isolating CPU cores, and pinning CPU frequency are fairly well known), and, to the extent that it is uncorrelated with program behavior, it can be addressed with statistical techniques like averaging over multiple runs. Performance instability, precisely because it comes from inside the system being tested, is much harder to eliminate or average out. Performance instability can lead to results like a change that is, say, a 5% speedup on average across a large set of related programs, but which manifests as a deterministic 3% slowdown in one specific situation, not because of the change itself, but because it happens to shuffle the layout of some unrelated data in a way that decreases CPU cache effectiveness in that particular example. Because this 3% slowdown is deterministic, no number of re-runs will show us the “true” speedup; to see it, we need some way to sample from a set of “related programs” that laid out their data a bit differently. The Stabilizer project is one attempt to do exactly this, by way of a tool that can independently re-randomize memory layout in a fine-grained way, in order to sample “related” executions and essentially turn instability into mere noise that can be averaged over.
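As a crude sketch of that last idea: instead of re-running the benchmark with the same layout over and over, re-run it in fresh processes whose initial layout differs slightly, and aggregate across them. The snippet below perturbs only one layout dimension, the size of the environment block (which shifts the new process’s initial stack addresses on typical Linux systems); that is far coarser than what Stabilizer actually does, and it assumes a hypothetical `bench.py` that prints a single elapsed-seconds value on stdout.

```python
# Sketch: sample "related" executions by re-running the benchmark in fresh
# processes with a randomly sized dummy environment variable, then aggregate.
# This perturbs only the environment block (hence the initial stack offset on
# typical Linux systems); Stabilizer re-randomizes code, stack, and heap layout
# far more thoroughly. `bench.py` is hypothetical and is assumed to print a
# single elapsed-seconds value on stdout.
import os
import random
import statistics
import subprocess
import sys

RUNS = 30

def one_run():
    env = dict(os.environ)
    env["BENCH_PADDING"] = "x" * random.randrange(0, 4096)
    out = subprocess.run(
        [sys.executable, "bench.py"],  # hypothetical benchmark script
        env=env, capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

samples = [one_run() for _ in range(RUNS)]
print(f"median {statistics.median(samples):.4f}s  "
      f"mean {statistics.mean(samples):.4f}s  "
      f"stdev {statistics.stdev(samples):.4f}s")
```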
I’m increasingly suspicious that the difficulty of benchmarking modern software systems is a major driver behind the observed phenomenon that, even as computer hardware continues to get faster, software gets slower at least as quickly. If consistent benchmarks require heroic acts of binary engineering and statistical analysis to measure performance, or a customized bare-metal setup with a custom kernel and hardware configuration, then most developers simply won’t bother.
One might optimistically hope that noise and performance-cliff-driven instability only impair our ability to measure changes of the same order of magnitude as the noise itself, and that we can still make good progress with the “easy” benchmarks and still do most of the performance work we’d like to. While I think this is sometimes true, these challenges have cascading impacts that make them even worse than they might appear:
I find it interesting to note that many or most of the sources of unstable performance that come to mind are themselves systems designed to improve performance:
This suggests to me that our very efforts to improve performance at the systems level are actually direct contributors to our struggles to maintain high performance at the application level! All of these innovations improve benchmarks and peak performance figures, while simultaneously making it harder and harder for application developers to do performance engineering, and less and less likely that they actually do. I definitely don’t claim this is anything like the full explanation for our slow software, but I think I’ve convinced myself it’s an under-recognized contributor.
What do we do about this? I wish I had a neat answer.
I certainly don’t want your takeaway from this note to be that I oppose JITs or caches or other sophisticated engineering meant to improve performance. On the contrary, I love that kind of engineering; I love learning about how these systems work, and contributing to them.
On the flip side, though, I do think there are some fundamental tradeoffs between such systems and systems whose performance is easy to understand, and I think we should take those tradeoffs more seriously. We shouldn’t throw out the very idea of caching just because it makes performance harder to reason about; but perhaps we should try 20% harder to implement a solution that is always fast, first, before we reach for caches.
Finally, I think we need drastically better tooling, and it needs to be much more broadly accessible.
There is some really incredible work out there that tries to combat this problem and find ways to get reliable performance numbers out of noisy, unstable systems. To pick a few examples I’ve seen recently:
Have you ever worked on a project that had really clever solutions to get more reliable benchmarks? Are there approaches or schools of thought I’ve missed? Drop me a note!
I really recommend this paper. It’s an incredible writeup of a truly heroic benchmarking effort to quantify VM performance, and it surfaces some remarkable anomalies that do a great job of showing just how challenging it is to truly characterize the performance of modern software systems. ↩