A few weeks back, I talked about gg, a tool for outsourcing computation to Amazon Lambda, and about llama, my own experiments in replicating some gg-inspired functionality. This week, while exploring llama’s performance, I found a surprising performance result I wanted to share.
The result is: Using HTTP pipelining is worth a 30-40% performance improvement when downloading large numbers of small files from S3, even from inside Amazon Lambda, which is only about a millisecond away from the S3 API endpoint.
You can find the details of my result in my benchmark repository.
This result is surprising to me for a few reasons:

- HTTP pipelining is all but dead in practice: browsers ship with it disabled or removed, and HTTP/2 dropped it entirely in favor of full multiplexing.
- Inside Lambda, we’re only about a millisecond away from the S3 endpoint, so I wouldn’t have guessed that per-request round-trip latency still mattered this much.
gg and HTTP pipelining

While doing performance work on llama this week, I discovered that the runtime of my lambda executions was substantially spent fetching dependencies from S3. The average file in the Linux kernel build I was testing depends on over 500 header files, and fetching all 500+ headers took on average over 700ms. By comparison, the average gcc execution, to actually build the source file, took only 500ms! Llama already fetched dependencies from S3 in parallel across 32 concurrent workers, and scaling up to more concurrency did not seem to help much, if at all.
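For concreteness, that baseline strategy looks roughly like the sketch below. This is not llama’s actual code: the bucket and key names are hypothetical stand-ins, and it uses the stock aws-sdk-go client, which issues each GET as an independent request.

```go
// A minimal sketch (not llama's actual code) of the baseline strategy:
// fetch many small objects from S3 in parallel with a fixed worker pool.
// The bucket and key names are hypothetical.
package main

import (
	"fmt"
	"io"
	"sync"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func fetchAll(svc *s3.S3, bucket string, keys []string, workers int) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range jobs {
				out, err := svc.GetObject(&s3.GetObjectInput{
					Bucket: aws.String(bucket),
					Key:    aws.String(key),
				})
				if err != nil {
					continue // a real implementation would surface the error
				}
				io.Copy(io.Discard, out.Body) // real code writes the file to disk
				out.Body.Close()
			}
		}()
	}
	for _, k := range keys {
		jobs <- k
	}
	close(jobs)
	wg.Wait()
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	keys := make([]string, 500)
	for i := range keys {
		keys[i] = fmt.Sprintf("headers/%d.h", i) // stand-ins for real header paths
	}
	start := time.Now()
	fetchAll(svc, "my-deps-bucket", keys, 32)
	fmt.Println("fetched", len(keys), "objects in", time.Since(start))
}
```

Each worker here still pays a full round-trip per object; adding workers past a point just adds more connections waiting on the same latency, which matches what I observed.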
Knowing that gg also uses an S3 file per dependency, I decided to dig in and see if they were doing anything non-obvious¹. I found their download code, and learned two surprising pieces of information:

- gg does not use an off-the-shelf S3 library; it implements its own S3 client, and even its own HTTP client underneath it.
- That client pipelines its requests, writing many GETs down a single connection without waiting for the corresponding responses (something that, notably, Go’s net/http has no support for).
Somewhat flabbergasted by both facts, I hacked up my own pipelined implementation, which necessitated punching violently through a number of Go’s abstraction layers (as alluded to above, Go’s net/http definitely does not support sending pipelined requests). With that available for a side-by-side test, I was able to demonstrate the 30-40% speedup cited earlier!
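The core of that hack is roughly the following sketch: open a single connection, write all of the requests back-to-back, and then read the responses in order. To be clear, this is a simplified illustration rather than llama’s actual client: it assumes unsigned (publicly readable) objects, and a real implementation would need SigV4 request signing plus retry logic for connections the server drops mid-stream.

```go
// A minimal sketch of HTTP pipelining against S3, assuming public
// (unsigned) objects. Not llama's actual client: a real one needs
// SigV4 signing and must retry requests left unanswered on a reset.
package main

import (
	"bufio"
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func pipelinedGet(host string, paths []string) error {
	conn, err := tls.Dial("tcp", host+":443", nil)
	if err != nil {
		return err
	}
	defer conn.Close()

	// Phase 1: write every request back-to-back, without waiting for replies.
	reqs := make([]*http.Request, 0, len(paths))
	for _, p := range paths {
		req, err := http.NewRequest("GET", "https://"+host+p, nil)
		if err != nil {
			return err
		}
		if err := req.Write(conn); err != nil {
			return err
		}
		reqs = append(reqs, req)
	}

	// Phase 2: read the responses, which HTTP/1.1 guarantees arrive in
	// the same order the requests were sent.
	br := bufio.NewReader(conn)
	for _, req := range reqs {
		resp, err := http.ReadResponse(br, req)
		if err != nil {
			return err
		}
		n, _ := io.Copy(io.Discard, resp.Body) // drain fully before the next read
		resp.Body.Close()
		fmt.Println(req.URL.Path, resp.Status, n, "bytes")
	}
	return nil
}

func main() {
	// Hypothetical bucket and keys, for illustration only.
	err := pipelinedGet("my-bucket.s3.amazonaws.com", []string{
		"/include/a.h", "/include/b.h", "/include/c.h",
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

The win comes from the first loop: every request is in flight before we block on a single response, so the whole batch pays roughly one round-trip of latency instead of one per object per worker slot.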
In addition to “Huh, a real live use case for HTTP pipelining,” I thought this discovery actually surfaced some interesting lessons about systems and performance engineering, as well as about systems research.
This instance felt like another really good example of performance engineering crossing abstraction layers. From the perspective of most S3 libraries, the fact that the underlying transport is HTTP is largely hidden, and I doubt that any of them support HTTP pipelining natively. This is a 30% performance improvement that is only accessible by understanding several layers of the stack and being comfortable thinking across them.
As a corollary, it drove home to me just how much performance we lose every day to the stack of abstractions we live atop, and missed opportunities for layer-piercing optimizations. I don’t say this to condemn abstractions or the modern stack — the things that are possible and even easy these days, building on top of existing tools, are truly wonderful — but as a moment of reflection on the costs we pay for it, and the opportunity to do much better when we need to.
On that note: when I first realized that gg uses its own S3 client and even its own HTTP client, I will admit my response was somewhere between “horror” and “sneering.” I temperamentally lean towards reimplementing the wheel, but even I wouldn’t reimplement something so foundational and fiddly as an HTTP client.

However, once I realized that their tool outperformed mine precisely because it had its own HTTP stack, it shifted my thinking a bit. I think I take away two related lessons:
First: if performance is a core part of your value proposition — and, in some sense, the entire point of gg is to be fast — then there are real advantages to controlling the entire stack. Even if this Stanford group made different tradeoffs than I might, I can acknowledge that there is at least some value to their tradeoffs, and also reflect on the larger phenomenon that controlling your dependencies has real power, especially when working on problems that cross abstraction layers. Furthermore, even aside from performance, there can be real advantages to it being easy to evolve the interface boundary between layers of your stack, or even just to debug cross-layer issues.

Second: the gg paper describes the high-level design and architecture of gg, and some of the tests they performed to evaluate it. However, the concrete behavior of gg, including the details of its performance — which, again, is practically the entire point — depends on hundreds or thousands of implementation details, tuning parameters, and other choices, including the one outlined in this writeup. Since the authors did release the source code, we are free to examine it, benchmark it independently, discover all of these details, and figure out which ones are important and why. Without that, reimplementing the paper in a way that achieved the same performance characteristics would be a nightmare. It is very easy to imagine ending up with a system with the same high-level design described in the paper, but completely different behavior in practice, because all the details matter so much. Once again, this isn’t a new observation by any means, but this anecdote really drove it home for me.
¹ gg also has a mode in which workers cache some files locally to reduce the need to download dependencies repeatedly. This optimization — which I have not yet copied in my own system — is, in practice, probably more significant than the one discussed here. But I still figured they had probably done some work to tune their download implementation.
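For illustration, a cache like that can be as simple as keying downloaded files by content hash in the worker’s local scratch space. The sketch below is my own guess at the shape, not gg’s actual implementation; the bucket, key, and hash are hypothetical.

```go
// A sketch (not gg's implementation) of a content-addressed local cache
// in Lambda's writable /tmp, checked before falling back to S3.
package main

import (
	"io"
	"os"
	"path/filepath"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// fetchCached returns a local path for the object, downloading it only
// if a previous invocation on this worker hasn't already done so.
func fetchCached(svc *s3.S3, bucket, key, contentHash string) (string, error) {
	path := filepath.Join("/tmp/obj-cache", contentHash)
	if _, err := os.Stat(path); err == nil {
		return path, nil // cache hit: no download needed
	}
	out, err := svc.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return "", err
	}
	defer out.Body.Close()
	if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
		return "", err
	}
	f, err := os.Create(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	if _, err := io.Copy(f, out.Body); err != nil {
		os.Remove(path) // don't leave a truncated cache entry behind
		return "", err
	}
	return path, nil
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	fetchCached(svc, "my-deps-bucket", "include/stdio.h", "sha256-abc123")
}
```

Because Lambda reuses warm workers across invocations, even a cache this naive can eliminate most downloads for hot files like common headers.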