Papers I love: gg

outputs

                            December 4, 2020

                Papers I love: gg

                        I’ve been playing with Amazon Lambda the last few weeks for a side project, and during the process I’ve gone from kinda infuriated with and baffled by Lambda to quite a fan. I hope to write more about that journey in a future letter, but I decided that today I first want to share the paper that got me thinking about Lambda in the first place. I really enjoy it, and want to share some of the ideas I love from it.
gg
The paper is “From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers,” by a number of researchers primarily at Stanford (including Keith Winstein, who I worked with at MIT and at Ksplice). They present a tool called gg, which is designed to allow users to use cloud computation — including “function-as-a-service” systems like Amazon Lambda — to outsource everyday computation (such as software builds) from their laptops to the cloud, without requiring users to provision or manage a standing compute cluster.
First off, I love the vision of gg. It promises a vision of the “best of both worlds” for local and cloud computing: You get the low latency and full control of keeping source control and data files locally, and running a local editor, filesystem, terminal, and such, combined with the full compute power of the cloud so that your builds or test runs are lightning fast, and don’t destroy your battery or make your fans sound like a jet aircraft.
At Stripe, when I worked on developer tools, we built a version of this vision for Stripe engineers: git checkouts lived on developer laptops and developers used their local editor, but we build a system that transparently synced their checkout to a dedicated per-use development VM. We also built a UX called the pay command that let developers run commands like pay test path/to/my/file… locally. This would transparently RPC to their dev VM to invoke the code remotely — using the compute resources of their VM, and in an environment that could be centrally managed by the development tools team — and shuttled output back to the laptop. This developer experience was quite successful and popular, and preserved most of the advantages of letting developers use the tools they preferred, while centralizing management of the execution environment, runtime dependencies, and giving developers access to more RAM and CPU than their laptop had.
However, building and maintaining this system required an entire team of engineers — more-or-less in perpetuity — to build and maintain the infrastructure, and also required a warm EC2 VM for each engineer. For a large organization like Stripe, both costs were well worth it (A single VM is cheap compared to an engineer’s salary!), but they aren’t feasible for smaller organizations, open-source projects, or just engineers’ personal projects. If the gg model works, it promises to exploit Amazon Lambda to achieve a similar UX, in a generic project-agnostic way, with no standing infrastructure, and which only charges for precisely the compute used!
In addition to loving the gg vision of the future, I find several of the ideas and tricks in the paper very interesting, and so I would like to take this opportunity to share them with you.
The compute IR
Arguably the core contribution of gg is the gg IR, an abstraction for representing arbitrary computations in a way that can be scheduled onto remote executors. The core concepts of the IR are fairly well-trodden, but gg composes them in interesting ways.
gg models computation as a graph of “thunks,” which represent individual steps of computation. Thunks describe the execution of a specific binary on a set of inputs, producing one or more outputs. Thunks are entirely self-contained; any dependencies of the invoked binary must also be explicitly named as dependencies.
Inputs to a thunk — including the invoked binary itself — are encoded using a content-addressed scheme, where they are named by their sha256 hash, so they can be looked up in a content-addressed store shared between clients and different executors. Importantly, inputs can take two forms:

Either a primitive value, representing a concrete set of bytes stored into the object store, or
As the output of a named thunk (referred to using a hash of the serialized thunk object), which must be first executed to produce its output(s).

This latter ability, to refer to the output of an earlier thunk, is what turns the IR into a computation graph. You can encode, for instance, “Link(Compile(Preprocess(foo.cc)))” as a sequence of three thunks; the Preprocess thunk would refer to a primitive value containing the contents of foo.cc, and Link and Compile would refer to the previous thunk to get their input.
While the details matter, this broad shape of IR, consisting of compute nodes linked together by dataflow edges, will likely be fairly familiar to anyone who’s worked on a build system or data pipeline. In particular, it reminded me a lot of Bazel’s build descriptions, which also make extensive use of content-addressed storage for caching and remote execution. This is a powerful model for decoupling the description of a computation process from the execution or analysis of that process. 
However, gg’s IR adds one additional feature, which is unusual in my experience for such graphs, although I don’t know if it’s unique to gg:
Tail recursion
gg thunks can refer both to other thunks, or to primitive values, as their inputs. Importantly, however, the same is also true of their outputs! A gg thunk can return either one or more primitive values, or it can return a new thunk, with its own set of inputs. The runtime will then execute this new thunk until it finally resolves into a primitive object, in a way the gg authors characterize as akin to tail-recursion.
This capability means that gg compute graphs do not need to be known in their entirety at construction time; instead, they can dynamically “unfurl” themselves, allocating and scheduling new work as they are evaluated. This flexibility is an important feature, allowing for dynamic discovery of work and dependencies. 
As an example, suppose we want to map some computation over records in some large file where we don’t know the record boundaries ahead of time. In gg, we could write a single thunk that takes the file as input and

scans the file to detect record boundaries
divides it into N chunks, each with an appropriate amount of work
emits N new thunks, each one encoding the computation for a single input chunk
returns a new thunk which depends on each of the individual chunk computations, and merges the results back together

In systems without gg’s tail recursion, we often have to have some mechanism to specify the number of map workers a priori, or even to scan the file beforehand so that we can hard-code appropriate segmentation into the graph. gg’s tail recursion lets us write job descriptions that dynamically schedule computation as they discover the work to be done, in a very flexible manner that is completely agnostic to the specific data patterns or languages encoded in the IR. 
Bazel’s compute IR, by contrast lacks this feature or anything comparable This a deliberate decision on their part in order to make bazel builds more analyzable — it’s harder to analyze build if you can’t even know the full compute graph without executing the build — but it’s a lack that is sorely felt, and bazel builds often have to very painfully work around the limitation.
gg Executors
gg uses its IR to abstract between describing and executing computation. This distinction allows gg to implement multiple execution substrates that all operate on the same IR, using a variety of computation substrates and implementations of the content-addressed object store. Their most impressive results come from their Amazon Lambda-based runtime, but they implemented four other compatible engines, to demonstrate the generality:

Local execution, allowing gg to operate with no external resources at a ll
Cluster execution, allowing gg to make use of VMs provisioned out of band
Google Cloud Functions
IBM Cloud Functions

Compiling C and C++
The paper presents several applications of gg, including several different approaches for generating jobs in their IR. While they are all interesting, I found their C/C++ build system the most fascinating, since they came up with some (as far as I know) novel solutions to long-standing problems in analyzing C/C++ builds, which I think may have application outside of gg.
I also like these tricks because they’re in some sense entirely separate from the core contribution of the paper — the gg  IR and executors — but they’re also very clever, and they end up being hugely valuable for actually demonstrating that the elegant IR model is practical for real applications. I appreciate that the authors went out of their way to do this fiddly integration work to actually make their tool workable for existing legacy uses. And, finally, as a long-time C++ developer who is all too familiar with the hell that is C++ build systems, I just find them really clever and fun to read.
Model substitution
Given an existing C/C++ build — using, say, make or cmake — the first question that arises for executing it in gg is how to extract the actual compilation commands so that gg can interpret them. This is a classic problem faced by any system — for instance, IDEs or static analyzers — that wishes to analyze a build. The state-of-the-art in general-purpose solutions the one taken by bear, which actually runs the build, instrumented in some way (using e.g. LD_PRELOAD or ptrace) to record all the commands which are actually executed during the build.
However, if our goal is to distribute the build, this approach is of limited use; If we have to first run the normal local build before we can run the distributed one, we’ve pretty much lost the whole point.
Instead, gg uses a technique they call “model substitution.” gg contains a “stub” implementation of all the common commands in a C/C++ build pipeline — gcc, strip, ld, etc — which understand just enough to parse their command line and, instead of actually compiling/linking/etc, write out a gg thunk describing the work they would have done, referring to their inputs (which, if they were generated by other stubs, will themselves be more thunks!). gg then runs the real build system with a PATH pointing at these stubs, which will run quickly (the stubs are much faster than actually doing the work), in order to produce a full description of the build system.
This approach has some limitations to be sure, including the need to model every program involved in the build, but it seems nonetheless quite flexible and general. I would love to see a version of this strategy adapted into something like bear, in order to produce a compile_commands.json without having to actually run the build. I personally would find this quite useful, as I often want to generate a compile_commands.json for an open source project so that I can use clangd, but would rather not wait on the full build, or have to chase down all of its dependencies!
Handling the preprocessor
A classic challenge in C and C++ build systems is that there is no way, in general, to figure out which header files a given source file requires without actually running the C compiler (technically, just the preprocessor) over them. This makes it challenging to handle incremental rebuilds, and also represents a challenge for distributed builds: You can’t know which dependencies a file has without running the preprocessor, but you also can’t run the preprocessor on a remote node without knowing which dependencies to ship over! There are two straightforward solutions that distributed build systems (like distcc) have used historically, both of which have serious weaknesses:

Originally, distcc would run the C preprocessor (cpp) on the local machine, and then send preprocessed source code (which has no dependencies) to the compilation farm. Because preprocessing is only a fraction of the total compile time, this can achieve substantial speedups, and means that remote nodes can be very simple functions of a single input file to a single output file. However, this approach breaks down at large scale; the gg paper notes that merely running cpp over every source file in Chromium takes nearly 30 minutes on their 4-core laptop!
Alternately, we can make every header file available to all of our build nodes, either using a network filesystem or by packaging them all up in some form. Distributed filesystems are slow and fiddly, though, and passing around all header files can be quite expensive in terms of bandwidth and storage, or even infeasible — lambda jobs have a limit of only 500M of local disk, for instance.

Here, again, gg has a clever solution.
gg‘s adapter for C/C++ builds first prepares (locally) a stripped-down version of every header file — including system headers — which contains only the preprocessor directives (#include, #define, etc) in each file. Since these directives are a small fraction of the text of most headers, it is feasible to package all of these stripped headers together and ship them to workers. Then, for each source file, emit a thunk that use these stripped-down headers to run dependency inference inside of cloud workers. These jobs depend on the full set of stripped-down headers, run the cpp preprocessor on the source file in the context of these stub headers, and record which header files were actually accessed. They then use gg’s tail-recursion feature to output a new thunk, which depends on the full version of only the headers that were actually needed, and which does the real compilation and emits the output file. This use-case is another example of how gg  uses its tail-recursion feature to enable thunks to dynamically discover dependencies and schedule appropriate work.
Again, the contrast to Bazel is stark, for me. Bazel’s C and C++ build rules — at least last I checked — are actually implemented in Java inside the Bazel core (as opposed to in Bazel’s extension language), and violate Bazel’s usual invariants to modify the dependency graph after running the compiler, in order to record the discovered header dependencies, which is something that “normal” build plugins aren’t allowed to do. Again, Bazel has deliberate reasons for making this tradeoff, but I’m still struck but how clever and general-purpose gg’s tail recursion mechanism ends up being.
Conclusion
I will admit that I haven’t actually gone to the work of trying to get gg running so I can play with it. It definitely smells of “researchware” to me — It’s a massive autotools C++ project with relative minimal documentation, and I have some skepticism of how well it would actually work outside of the researchers’ machines. However, this letter has persuaded me I should give it another look — I do think there are applications I might use it for!
Nonetheless, I’m very charmed by it on a number of levels, and I think it should be more widely known. I really love its goals, the elegance of its IR, the power of the tail-recursion technique in compute graphs, and the fiddly domain-specific engineering hacking the authors did to make it work for C++ builds. It’s also an inspiration for my hacking over the last week or so, where I’ve successfully used Lambda to support bursty computation from my laptop for a personal project. Hopefully I’ll write more about that soon!

                            Don't miss what's next. Subscribe to Musing in Computer Systems: