HTTP Pipelining, S3, and gg

even from inside Amazon Lambda

                            February 22, 2021

                HTTP Pipelining, S3, and gg

                        Ranger update
22lbs at last weigh-in! His favorite place continues to be "curled up at our feet chewing his favorite bone":

S3 and pipelining
A few weeks back, I talked about gg, a tool for outsourcing computation to Amazon Lambda, and about llama, my own experiments in replicating some gg-inspired functionality. This week, while exploring llama’s performance, I found a surprising performance result I wanted to share.
The result is: Using HTTP pipelining is worth a 30-40% performance improvement when downloading large numbers of small files from S3, even from inside Amazon Lambda, which is only about a millisecond away from the S3 API endpoint.
You can find the details of my result in my benchmark repository.
This result is surprising to me for a few reasons:

The usually quoted advantage of pipelining is to remove network roundtrips. For an operation where we’re so close to the server, I was surprised to see it matter so much.
HTTP pipelining is considered … cursed, is perhaps the best word … by people who work closely with HTTP. The MDN link I linked above has a good summary of the problems it has; for these reasons and more essentially no one uses HTTP pipelining in the wild. Most modern HTTP client libraries I am aware of don’t even support pipelining.
I had assumed that S3 would support HTTP/2 by now, which gives all the advantages of pipelining, and more, while being substantially less cursed. This turns out not to be true.
Finally, and one of the real reasons I want to talk about this, was the way I discovered this oddity.

gg and HTTP pipelining
While doing performance work on llama this week, I discovered that the runtime of my lambda executions was substantially spent fetching dependencies from S3. The average file in the Linux kernel build I was testing depends on over 500 header files, and fetching all 500+ headers took on average over 700ms. By comparison, the average gcc execution, to actually build the source file, only took 500ms! Llama already fetched dependencies from S3 in parallel across 32 concurrent workers, and scaling up to more concurrency did not seem to help much or at all.
Knowing that gg also uses an s3 file per dependency, I decided to dig in and see if they were doing anything non-obvious¹. I found their download code, and learned two surprising pieces of information:

They have their own S3 client and even their own HTTP client — they use a bare minimum of external libraries, but have rolled their own stack to a very low layer. They do use an external TLS library, at least (OpenSSL).
The pipeline their requests! Coincidentally, they also use 32 concurrent workers by default, but each worker sends up to 32 requests out on the wire all at once, and then reads all 32 responses, in order.

Somewhat flabbergasted by both facts, I hacked up my own pipelined implementation, which necessitated punching violently through a number of Go’s abstraction layers (as alluded to above, Go net/http definitely does not support sending pipelined requests). With that available to do a side-by-side test, I was able to demonstrate the 30-40% speedup cited earlier!
Some more thoughts
In addition to “Huh, a real live use case for HTTP pipelining,” I thought this discovery actually surfaced some interesting lessons about systems and performance engineering, as well as about systems research.
Performance engineering spans abstraction layers
This instance felt like another really good example of performance engineering crossing abstraction layers. From the perspective of most S3 libraries, the fact that the underlying transport is HTTP is largely hidden, and I doubt that any support doing HTTP pipelining natively. This is a 30% performance improvement that is only accessible by understanding several layers of the stack and being comfortable thinking between them.
As a corollary, it drove home to me just how much performance we lose every day to the stack of abstractions we live atop, and missed opportunities for layer-piercing optimizations. I don’t say this to condemn abstractions or the modern stack — the things that are possible and even easy these days, building on top of existing tools, are truly wonderful — but as a moment of reflection on the costs we pay for it, and the opportunity to do much better when we need to.
On that note,
There is value in controlling your entire stack
When I first realized that gg uses its own S3 client and even HTTP client, I will admit my response was somewhere between “horror” and “sneering.” I temperamentally lean towards reimplementing the wheel, but even I wouldn’t reimplement something so foundational and fiddly as an HTTP client.
However, once I realized that their tool outperformed mine precisely because it had its own HTTP stack it shifted my thinking a bit. I think I take away two related lessons:

As mentioned in the previous point, performance spans abstraction layers. If your goal is performance — and the whole point of gg is to be fast — then there are real advantages to controlling the entire stack. Even if this Stanford group made different tradeoffs than I might, I can acknowledge that there is at least some value to their tradeoffs, and also reflect on the larger phenomenon that controlling your dependencies has real power, especially when working with topics that cross abstraction layers. Furthermore, even aside from performance, there can be real advantages to it being easy to evolve the interface boundary between layers of your stack, or even just to make it easier to debug cross-layer issues.
Secondly, and relatedly, this code is a research project, coming out of a research group. Their very job is to stretch the bounds of the possible, and to try things that haven’t been thought of or tried or realized before. By its nature, that job is going to sometimes require pushing existing systems in novel ways; perhaps it’s genuinely useful for engineers operating in that mode to experience the entire stack as malleable in their hands and open to experimentation. Perhaps it is not only advantageous but necessary that they be unconstrained by “how everyone else thinks HTTP works,” and free to experiment with any part of their system just about as easily as any other part. This can, of course, be true at the same time as it would make me hesitant to run their systems as-is in production; there is perhaps some analogy here to the different manufacturing techniques you use for prototyping vs for production at scale.

Publishing source is so important
The gg paper describes the high-level design and architecture of gg, and some of the tests they performed to evaluate it. However, the concrete behavior of gg, including the details of its performance — which, again, is practically the entire point — depend on hundreds or thousands of implementation details, tuning parameters, and other choices, including the one outlined in this writeup. Since they authors did release the source code, we are free to examine it, benchmark it independently, and discover all of these details, and figure out which ones are important and why. Without that, reimplementing the paper in a way that achieved the same performance characteristics would be a nightmare. It is be very easy to imagine ending up with a system with the same high-level designs described in the paper, but completely different behavior in practice, because all the details matter so much. Once again, this isn’t a new observation by any means, but this anecdote really drove it home for me.

gg also has a mode in which workers cache some files locally to reduce the need to download dependencies repeatedly. This optimization — which I have not yet copied in my own system — is, in practice, probably more significant than the one discussed here. But I still figured they had probably done some work to tune their download implementation. ↩

                            Don't miss what's next. Subscribe to Musing in Computer Systems: