Broken links in last week’s email
Many people reported that all the links in last week’s post were dead. Thanks for letting me know, and I’m sorry about that. It turns out Buttondown’s link tracking doesn’t play well with custom domains that require HSTS on all subdomains (like my own nelhage.com). I’ve disabled link tracking, since I dislike that feature anyway.
You can find last week’s post, with functioning links, here: https://buttondown.email/nelhage/archive/welcome-to-my-newsletter-and-performance-as/
On to this week’s topic.
What’s worth optimizing?
Everyone loves to talk about the evils of “premature optimization,” but what counts as “premature”?
More broadly, how do we decide which pieces of code or subsystems are worth optimizing? The answer always depends on context; we need to understand the performance of our system, what we want to use the system for, and the available resources (both hardware resources and engineering time).
Even though this question is complicated and lacks easy answers, I think it’s a useful one to try to form some frameworks for. Without that, it’s too easy to get caught in arguments that just reduce to “That optimization is premature!” vs “No I’m not!”
Let’s look at some frameworks I’ve seen for answering this question, and explore what some other ones might look like.
Optimize the slowest operation category
In this framework, we identify the slowest family of operations in the program, and focus on reducing their cost or frequency. The rest of the system can be mostly ignored.
I’m being deliberately general about “types” of operations here. Most commonly, you’ll see this framework instantiated with the slow operation type being “I/O,” or perhaps specifically “network operations.” For instance, it’s often stated as a truism that web applications are I/O bound, and so (the implication goes) the only optimizations worth considering are those that reduce database calls (e.g. ORM optimizations, caching, indexes, schema changes, etc), and not those that reduce CPU time.
However, I want to encompass other classes of operations, like situations where a program makes calls into some CPU-intensive kernel, such as a SAT solver. In such an instance, this framework would tell us to invest effort in producing fewer or smaller SAT instances, but not to optimize the code actually generating those instances.
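To make the framework concrete, here’s a toy sketch (the “database” is a fake class I made up purely to count round-trips, not any real client): under this framework, the win comes from issuing fewer of the slow operations, and the surrounding code is ignored.

```python
# A fake "database" that counts round-trips, to illustrate why this
# framework favors batching over optimizing the surrounding code.
class FakeDB:
    def __init__(self, rows):
        self.rows = rows
        self.round_trips = 0

    def get(self, user_id):
        self.round_trips += 1          # one slow network hop
        return self.rows[user_id]

    def get_many(self, user_ids):
        self.round_trips += 1          # still only one slow network hop
        return [self.rows[u] for u in user_ids]

db = FakeDB({1: "alice", 2: "bob", 3: "carol"})

one_by_one = [db.get(u) for u in (1, 2, 3)]   # 3 round-trips
batched = db.get_many((1, 2, 3))              # 1 round-trip, same data
```

The framework says: spend your effort turning the first pattern into the second, and don’t worry about the Python in between.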
The biggest advantage of this framework is that it is both simple and simplifying. It’s simple to understand in a given domain, and it simplifies your world view, giving you permission not to worry about performance most of the time. Depending on your goals — e.g. if you just need to ship a prototype — that can be very valuable.
There are at least two substantial pitfalls with this mode of analysis.
First, you need to be right about what the slowest operation is! I cited the “truism” above that web applications are database-bound. It’s possible this was broadly true in the days before ubiquitous SSDs and when RAM was much more expensive, but in my more-recent experience it’s often extremely incorrect. Especially in dynamic languages like Ruby and Python with relatively complex frameworks powering application code, it’s very common to see 80% or more of endpoint runtime spent in CPU time inside application code. If you’re that badly wrong about what operations are slow, then optimizing the slow operations is not going to do you any good.
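Amdahl’s law makes the cost of being wrong easy to quantify. A quick back-of-envelope (the 20% figure is just the complement of the 80% CPU-time observation above, not a measurement):

```python
def max_speedup(optimized_fraction, factor):
    """Amdahl's law: overall speedup when `optimized_fraction` of runtime
    is sped up by `factor` (factor = inf means eliminated entirely)."""
    return 1.0 / ((1.0 - optimized_fraction) + optimized_fraction / factor)

# If only 20% of request time is actually database I/O, then even an
# infinitely fast database caps your overall speedup at 1.25x.
db_only = max_speedup(0.20, float("inf"))
```

If you believed the truism and spent a quarter on query optimization, a 1.25x ceiling is the best possible outcome.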
A while ago, I fixed a bug in Ruby’s rest-client library that caused 7 milliseconds of wasted CPU time on every HTTP call. I can’t say for sure why this wasn’t found, but I speculate that — insofar as the developers thought about performance — they figured that the whole point of this library was to make network calls, and surely CPU time was “free” in comparison. Over the public internet, that might even be right. But we were trying to use rest-client with internal microservices that could return responses — including network latency — in under a millisecond. Performance assumptions that might hold in one context can be utterly obliterated in another.
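The arithmetic makes the context-dependence vivid (the 100ms public-internet latency is an illustrative assumption; the 7ms and sub-millisecond figures are from the story above):

```python
# The same fixed 7ms of per-call CPU overhead is noise over the public
# internet, but dominates a fast internal microservice call.
OVERHEAD_MS = 7.0

def overhead_fraction(network_ms):
    """Fraction of total call time spent in the fixed CPU overhead."""
    return OVERHEAD_MS / (OVERHEAD_MS + network_ms)

internet = overhead_fraction(100.0)  # ~6.5% of a 100ms internet call
internal = overhead_fraction(1.0)    # 87.5% of a 1ms internal call
```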
Second, this framework can lead to death by a thousand papercuts. Even if a piece of code takes only 10% of the time spent in I/O, five or ten such pieces add up. And with the implicit mantra of “code performance doesn’t matter” that this framework often encourages, that happens more often than you might think. I’ve lost track of the number of times I’ve profiled a tool and found that, even if most of its time was spent “where you’d expect,” there was a cheap fifteen or twenty percent speedup from a simple optimization that no one had ever gone looking for.
Profile and optimize the hot spots
This next framework is the one that I see most commonly presented as the “right” way to do performance engineering. Instead of focusing solely on one class of operation, in this framework we use a profiler, and focus on functions or subsystems that consume a substantial fraction of the total execution time.
This general approach certainly has a lot to recommend it. Because we’re actually measuring, we’re less likely to make mistakes about where to focus. And since we focus on a broader window of slow operations, we’re a bit more resilient to deaths by a thousand papercuts.
This approach struggles to fully solve the long-tail problem, though. Often, we will end up with a long tail of operations in a profile that don’t take much time individually, but add up to far more time than seems justified. I’ve heard such profiles described as “it seems we’re just running too much code.” What are we to do in this case?
Profilers also present challenges around how we group events in our analysis. Should you look at “exclusive” time or “inclusive” time? If a function is inlined at a lot of call sites, does your profiler just include its time in the caller, or is it able to recover the fact that it is a single function?
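Python’s built-in profiler exposes this distinction directly: cProfile reports “tottime” (exclusive: time in the function itself) and “cumtime” (inclusive: time in the function plus everything it calls). A minimal sketch, using made-up `caller`/`helper` functions:

```python
import cProfile
import io
import pstats

def helper():
    # All the real work happens here.
    return sum(range(200_000))

def caller():
    # Tiny exclusive time, large inclusive time.
    return [helper() for _ in range(5)]

profiler = cProfile.Profile()
profiler.enable()
caller()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
report = out.getvalue()
# In the report, `caller` shows large cumtime but near-zero tottime;
# sort by "tottime" instead and `helper` rises to the top.
```

Which sort order you reach for first is exactly the grouping question above, and the “right” answer differs between hunting a slow entry point and hunting a hot inner loop.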
In general, I find that profiler-driven optimization is great for picking off low-hanging fruit, but often loses steam once you are out of low-hanging fruit but still want or need even faster performance.
All that said, an even more fundamental problem when profiling is the question of what workload to use. If we’re an application developer, we can hopefully benchmark on a representative use case or data set (but what if performance is heavily data-set dependent?). But if we’re developing a library — or a network API or an OS kernel — what application patterns do we use to drive our system? We can make (hopefully, informed) guesses about how our users will use our system, but what about users we don’t know about? Or future users?
To pick a simplistic example: suppose we are writing a library to load some image format. If we profile an application that loads a few images at startup and then renders them repeatedly, we might find our code is a tiny fraction of the total time, and conclude that we don’t need to worry about performance. But if another user attempts to write a batch job that loads thousands of different images at a time in order to produce thumbnails, we might find our runtime is a much larger percentage of the total! Which profile(s) should we use to decide what a “substantial fraction of total time” means?
[Insert better answer here]
At this point, I get a bit fuzzy, and this is why this is a newsletter instead of a blog post.
I mentioned — above and elsewhere — that the “build a slow app, and then profile hotspots” approach is usually insufficient to build truly fast software, but I struggle to clearly articulate what paradigm I find more effective. A lot of my answers seem to boil down to “Have a very small team of very experienced people and do everything right,” which, while great if it’s available, isn’t particularly informative.
However, I think there are a few common threads here I can call out:
Care about overall performance, not just hot spots
Profiling for and optimizing hot spots can be a useful tool, but it’s not the end of the performance game. Projects that care heavily about performance track end-to-end performance, and investigate regressions or changes, even apparently-small ones.
Set a priori performance goals
This LWN article from over a decade ago is one of the pieces of writing that has really stuck with me over my career. The article covers the efforts of a team to build a Linux system that could boot in under five seconds. This passage in particular is the one I keep coming back to:
Arjan said it starts with the right attitude. “It’s not about booting faster, it’s about booting in 5 seconds.” Instead of saving a second here and there, set a time budget for the whole system, and make each step of the boot finish in its allotted time.
The kernel gets one second to start, including all modules. “Early boot” including init scripts and background tasks, gets another second. X gets another second, and the desktop environment gets two.
When I read this article as a much younger engineer, this struck me as nonsense. “Everyone knows,” I thought, that you have to profile before you do performance work, so how can you set time budgets at the start of a performance project, before you even know where the time is going???
With more experience, I appreciate two key facts:
- These time budgets weren’t picked from thin air; they were set, no doubt, by experienced engineers with knowledge about the system, drawing on intuition, some initial profiling, and some theoretical analysis (e.g. an estimate of their working set size, divided by the speed of their disk). Being empirical and driven by available data is not at odds with setting budgets; they can work together very effectively.
- Setting budgets is a really powerful technique for achieving an absolute performance goal, not just an incremental “faster” goal. It lets you subdivide the work, and it also forces you to think creatively and work hard to figure out where your time is going and potentially investigate drastic changes to hit your budget; if you’re dying a death of a thousand papercuts, maybe you need to throw out the entire stack of paper.
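The budget idea from the quote above can be sketched as a trivially simple check (the phase names and numbers are the article’s 5-second boot split; the measured times below are hypothetical):

```python
# The 5-second boot budget from the LWN article, as a checkable table.
BUDGET_SECONDS = {
    "kernel": 1.0,       # including all modules
    "early boot": 1.0,   # init scripts and background tasks
    "X": 1.0,
    "desktop": 2.0,
}

def over_budget(measured):
    """Return the phases that blew their allotted time."""
    return [phase for phase, limit in BUDGET_SECONDS.items()
            if measured.get(phase, 0.0) > limit]

# A hypothetical measurement run: only the desktop phase is over budget,
# so that's where the team knows to focus next.
blown = over_budget({"kernel": 0.9, "early boot": 0.8,
                     "X": 0.8, "desktop": 2.4})
```

The point isn’t the code, which is trivial; it’s that a budget turns “make it faster” into a list of concrete, assignable failures.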
Another variant of a priori performance goals that many projects have is a “no regressions” goal, supported by a strong benchmark suite; if you require that no change ever reduce performance, that provides another strong counter to the problem of continuous slow degradation.
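The core of such a gate is small; here is a minimal sketch of the comparison step (the 2% noise tolerance is an assumption I picked for illustration, and real suites need careful handling of benchmark variance):

```python
def passes_no_regression_gate(baseline_s, current_s, tolerance=0.02):
    """Accept a change only if the benchmark got no more than
    `tolerance` (default 2%) slower than the stored baseline."""
    return current_s <= baseline_s * (1 + tolerance)

assert passes_no_regression_gate(1.00, 1.01)       # 1% slower: within noise
assert not passes_no_regression_gate(1.00, 1.10)   # 10% slower: blocked
```

Run against a stored baseline in CI, a check like this converts slow continuous degradation from something you discover a year later into something each individual change has to answer for.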
Think about efficiency of resource usage
Touching on the theme of last time, you can analyze components and operations not just relative to total runtime, but relative to how fast they could be, based on usage of the underlying resources — be those the CPU, memory, I/O, or even “calls into an underlying library.” If we can turn two filesystem accesses into one, let’s do it, even if it’s a small fraction of our total time. If we can do something in three memory lookups, instead of four, that’s worth it. If we can shrink our pointer sizes — and thus cache footprint — by half, it’s worth it. On the flip side, if a routine is 20% of our runtime but it’s running a highly-optimized SIMD compute core, maybe we can set it aside and focus on the long tail of the other 80% instead.
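The “two filesystem accesses into one” case has a classic concrete form in Python: checking existence and then asking for the size issues two `stat` syscalls, while a single `os.stat` answers both questions.

```python
import os

def size_if_exists_two_calls(path):
    if os.path.exists(path):            # stat() #1
        return os.path.getsize(path)    # stat() #2
    return None

def size_if_exists_one_call(path):
    try:
        return os.stat(path).st_size    # one stat(), same answer
    except FileNotFoundError:
        return None
```

Neither version will ever dominate a profile, which is exactly the point: a resource-efficiency mindset says the second is still worth writing, because it does the same work with half the syscalls (and, as a bonus, avoids the exists-then-deleted race).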
Do you have, or have you seen, a real playbook for producing high-performance software? It seems to be something that’s really hard to teach or articulate, especially in our modern world of hyper-complex stacks that build on top of millions of lines of other people’s code. At the same time, it’s very clear that we have so much room for improvement over the kinds of software we currently manage to create.
It seems likely to me that the first step in such a playbook is “Measure performance from day one, and make it easy to benchmark every change,” but beyond that it runs the risk of turning into an incredible mass of domain-specific expertise and gotchas. Can we do better, in terms of advice we give to new engineers and to teams?