A few weeks back I wrote about nondeterministic performance and the problems it poses for benchmarking. That reminded me that I’ve got a bunch of thoughts about regular old determinism that I’ve been meaning to write up, and so that’s today’s topic.
I describe a program’s behavior as “deterministic” if running it multiple times on the same inputs will reliably produce the same outputs. I expect this to be a non-controversial definition, although there is room for nuance in defining exactly what a program’s “inputs” are or what the “same outputs” are — in some contexts we might allow an output to contain a timestamp and still call that determinism, but in others that might be a problem.
Most of the time, when I talk about determinism, I will use it to refer to bit-identical outputs given bit-identical inputs, but I will exclude filesystem metadata. So a program doesn’t need to call utime on all its outputs for me to consider it deterministic, but if it gzips the output it needs to make sure to fix the gzip mtime header field.
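As a concrete illustration in Python (the same idea applies to any gzip implementation), pinning the mtime field makes the compressed output depend only on the input bytes:

```python
import gzip
import io

def gzip_bytes(data: bytes) -> bytes:
    """Compress data with gzip, pinning the mtime header field so the
    output depends only on the input bytes."""
    buf = io.BytesIO()
    # mtime=0 overrides the default of time.time(), which would otherwise
    # make two compressions of identical data differ byte-for-byte.
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as f:
        f.write(data)
    return buf.getvalue()

# Two runs on the same input now produce bit-identical output.
assert gzip_bytes(b"hello, determinism") == gzip_bytes(b"hello, determinism")
```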
One consistent theme of my experiences as a software engineer has been that, over time, I appreciate the value of determinism in software more and more. A big part of what I want to convey in this letter is how valuable I find deterministic software, and to exhort you to put more effort into making your software more deterministic by default.
With that preface, let’s jump into some places where determinism comes in handy:
It’s widely appreciated that bugs are easier to find, debug, and fix when they can be reliably reproduced. The more deterministic a piece of software is, the more likely it is that bugs will happen consistently, instead of intermittently. And when we do manage to catch a bug, it is invaluable to be able to re-run our software on the same inputs with additional instrumentation — be that with logging enabled, a debugger attached, or even just additional printlns — and be confident we will see the bug again. And if we have a minimal pair — an input with the bug, and a very similar input without a bug — deterministic behavior helps us trace through behavior side-by-side and see where they diverge, and have confidence that the divergence is somehow related to the difference in input, and not random environmental factors.
I’ll have an anecdote later on about the challenges of debugging nondeterministic software, and some conclusions I take away.
I’ve written previously about record/replay testing, sometimes also called “golden master” testing, where you save the result of running one version of your program, and then compare future executions against that output to check for unintentional divergences.
As I mentioned in that piece, this technique is much more powerful the more deterministic your software is. If the output varies with fixed inputs, then you will need to figure out how to handle that variation in your test harness, introducing complexity and points of potential error. Even without a full testing harness, it’s valuable to be able to use diff directly to compare outputs or log traces, where possible.
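As a sketch of the record/replay idea, here is a minimal golden-master check in Python; the helper name and file layout are made up for illustration, and it only works cleanly against a deterministic program:

```python
import difflib
import pathlib

def check_against_golden(output: str, golden_path: pathlib.Path) -> None:
    """Compare a program's output against a saved 'golden' copy.

    On the first run the golden file doesn't exist yet, so we record it;
    afterwards, any divergence fails with a unified diff."""
    if not golden_path.exists():
        golden_path.write_text(output)
        return
    golden = golden_path.read_text()
    if output != golden:
        diff = "\n".join(difflib.unified_diff(
            golden.splitlines(), output.splitlines(),
            fromfile="golden", tofile="current", lineterm=""))
        raise AssertionError("output diverged from golden master:\n" + diff)
```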
Coverage-guided fuzzing (as pioneered by AFL) is one of the absolute coolest computing innovations of the last decade. Fuzzers based on this lineage have been responsible for stunning numbers of bugs in C and C++ code, with lesser but still notable success in other languages. It’s also been adopted for property-based testing tools in Python and other languages, with the aim of allowing them to explore their input space much more effectively.
One major assumption, however, of coverage-guided fuzzers is that the target has not only deterministic outputs, but deterministic execution patterns. A target that executes different lines of code or in different orders based on wall-clock time or random factors creates a lot of noise that makes it much harder for these fuzzers to explore their targets. Especially if you’re implementing a system in C or C++, AFL and libfuzzer alone should be motivation enough to make your tool as deterministic as you can, the better to effectively fuzz it.
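To illustrate what “deterministic execution patterns” means, here is a toy Python sketch that records the line-level trace of a target function, a crude stand-in for the coverage a fuzzer collects. A fuzzer-friendly target produces the same trace for the same input every time:

```python
import sys

def trace_lines(fn, *args):
    """Record the sequence of (function, line) events executed by fn,
    a crude stand-in for the edge coverage a fuzzer collects."""
    events = []
    def tracer(frame, event, arg):
        if event == "line":
            events.append((frame.f_code.co_name, frame.f_lineno))
        return tracer
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return events

def parse(data: bytes) -> int:
    # A deterministic toy target: the same input always takes
    # the same path through the code.
    total = 0
    for b in data:
        if b % 2:
            total += b
    return total

# A coverage-guided fuzzer assumes this property holds for every input:
assert trace_lines(parse, b"abc") == trace_lines(parse, b"abc")
```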
One trend that I’ve witnessed gaining steam in the last ten or twenty years is that of reproducible builds. You can read more about that effort elsewhere, but the important observation for this letter is that reproducible builds require deterministic build systems. And as build systems become more complex and run more and more software inside them, this ends up requiring a lot of software to become deterministic. If reproducible builds are a worthy goal — and I agree they are — then this is a strong motivation for more software to support a fully-deterministic mode, at a minimum as an option.
The Bazel build system encourages deterministic builds, and tries hard to support and encourage third-party build definitions and addons to be deterministic wherever possible. This has many benefits, including reproducible builds as cited previously, but it’s also a key enabler for Bazel’s caching strategy.
Bazel supports caching arbitrary build steps, including in a networked cache interface shared by multiple machines — e.g. shared by multiple developer laptops or even between development laptops and CI. Bazel caches individual steps in the build using a content-addressed store; it essentially just hashes all inputs to a step (e.g. the command-line it would run, the compiler itself, and the contents of all input files), down to a single hash value, and looks up that hash in the cache store.
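A minimal sketch of that hashing scheme in Python — loosely modeled on a content-addressed cache key, and deliberately simplified (this is not Bazel’s actual wire format, which hashes a structured action description):

```python
import hashlib
import pathlib

def action_cache_key(command: list[str],
                     input_paths: list[pathlib.Path]) -> str:
    """Hash everything that can affect a build step (the command line and
    the contents of every input file) down to one cache key. Simplified:
    inputs are identified by base name only."""
    h = hashlib.sha256()
    for arg in command:
        h.update(arg.encode())
        h.update(b"\0")              # unambiguous separator between args
    for path in sorted(input_paths):  # sort so discovery order doesn't matter
        h.update(path.name.encode())
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()
```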
However, in order for this approach to successfully cache complex builds, it requires that all intermediate steps be fully deterministic! Consider a simple C build process, in which a .c file is first compiled into a .o object, and then converted into an executable by linking the .o file with the necessary libraries.
If two developers have the same .c source file, a shared cache will always allow one to reuse the other developer’s compilation step. However, suppose that for whatever reason, we do not cache the compile step, so that each developer builds their own .o file. If the compiler is deterministic, then both developers end up with identical .o files, and we can continue benefiting from the shared cache — perhaps one can re-use the other’s linker result. If, however, the compiler is subtly non-deterministic — perhaps it embeds a build timestamp in the object file — then, as soon as each machine runs the compiler separately, our builds will diverge forever, and we will gain no additional benefit from the shared cache! Deterministic builds are a key ingredient in making this caching strategy effective. In real build pipelines, furthermore, we have many more than two sequential steps, which exposes many more opportunities to benefit from caching, but also many more opportunities for the builds to diverge.
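The divergence is easy to see in miniature. In this Python sketch, a pretend compiler that embeds a timestamp produces outputs whose hashes never match, so every downstream cache key built on top of them diverges too:

```python
import hashlib
import time

def compile_deterministic(source: str) -> bytes:
    # A pretend compiler whose output depends only on its input.
    return b"OBJ:" + source.encode()

def compile_with_timestamp(source: str) -> bytes:
    # A pretend buggy compiler that embeds a build timestamp.
    return b"OBJ:" + source.encode() + repr(time.time()).encode()

def key(obj: bytes) -> str:
    # The cache key for any step that consumes this object.
    return hashlib.sha256(obj).hexdigest()

src = "int main(void) { return 0; }"
# Deterministic compiler: both developers' .o files hash identically,
# so a shared cache can still serve the downstream link step.
assert key(compile_deterministic(src)) == key(compile_deterministic(src))
# With the timestamped compiler, two separate runs hash differently,
# and every downstream cache lookup that includes that hash misses.
```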
We could imagine caching strategies that are more robust to non-determinism. Perhaps instead of acting a step at a time, we could hash the entire build DAG leading up to an object, or use symbolic names for objects — “the .o generated by compiling the .c file with hash <…>” — instead of their content. However, the Bazel approach has the benefit of being extremely simple, extremely scalable, and almost trivially correct. Given the other advantages of having deterministic builds anyway, this seems like a sensible tradeoff to me!
Computers are mostly-deterministic beasts at heart; executing the same sequence of CPU instructions on the same data will yield the same behavior and results every time. However, there are a large number of ways that nondeterminism leaks into our software systems. Here are some of the more commonly encountered ones:
One situation that arises fairly often in software programs is the need to iterate over a data structure that is logically a set — it has well-defined membership, but not a well-defined ordering. Many languages choose to iterate over their Set type, or over entries in a Map type, in hash-table implementation order. Because hash values often depend on pointer addresses, or are even explicitly randomized for security purposes, this is a common source of nondeterminism, as mentioned above. I want to say a few more thoughts about this specific situation.
First, an anecdote: I first “got religion” about determinism while working on a compiler in an undergraduate compilers class. We had used Java’s HashMap class ubiquitously for symbol tables and other maps, and routinely iterated over those maps in our compiler. This served us well for most of the class, but near the end of the semester, as our compiler got more complex and our bugs more subtle, we started running into every engineer’s nightmare: unreproducible bugs that would occur only sometimes, even when running on an identical test case. It turned out that we had some number of bugs that manifested only if we iterated over, say, local variables, in some orders but not others. Attempts to track down these bugs were nightmarish, since we couldn’t even compare debug logs from two buggy runs, as they might execute on different “bad” orderings.
As soon as we realized what was happening, we quickly implemented a global switch to LinkedHashMaps, which order elements in insertion order. With the compiler now much more deterministic, our bugs fell with great ease.
The astute reader, at this point, might ask whether the random ordering wasn’t beneficial. Didn’t it represent a (limited, accidental) form of fuzzing, by making sure that we were resilient to arbitrary orders? After all, our algorithms — on paper — should work for any iteration order, and so perhaps our deterministic iteration is just hiding bugs that other source programs might manifest.
For our undergraduate compiler, we didn’t worry too hard about this possibility, and just worked until it passed the required test cases and called it victory.
When working on Sorbet, however, we encountered exactly this problem. Sorbet aims to be agnostic to source file ordering; unlike Ruby itself, it aims to produce the same results no matter what order it considers files. However, in the pursuit of deterministic results during development, we had always been sorting input files in our wrapper script, and it turned out that we had all kinds of ordering bugs when we passed files in different orders.
When we realized this, we started routinely testing with random orderings of input files. Importantly, though, we did this by randomizing the list ahead of time, storing the order, and then calling Sorbet. This meant that we knew the ordering that had triggered any bug, and made it easy to reproduce and investigate.
This points me to a general lesson I’ve taken away from these experiences: when iterating over a set, the most robust design is to iterate in a deterministic order by default, but to support deliberately randomizing that order under a recorded seed, so that ordering assumptions get exercised and any failure can still be reproduced.
In Sorbet, we added randomness by shuffling the input list and saving a copy. A more general strategy might look like a --shuffle-seed flag, which seeds a deterministic PRNG, which is used to shuffle the elements of any set that we iterate over. If this shuffle is a performance problem (due to e.g. constructing temporary lists to shuffle), it could be disabled in optimized builds. CI — or, better, a dedicated fuzz job — can then pass random seed values, which should have the effect of testing random orderings, while also making it easy to reproduce behavior.
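A minimal sketch of that seeded-shuffle mechanism in Python; all names here are made up for illustration:

```python
import random

def shuffled_inputs(files, seed=None):
    """Shuffle the input list under a recorded seed (a sketch of a
    hypothetical --shuffle-seed flag). Returning the seed alongside
    the order means any bug a run triggers can be reproduced exactly."""
    if seed is None:
        seed = random.randrange(2**32)   # pick, and *remember*, a fresh seed
    order = list(files)
    random.Random(seed).shuffle(order)   # deterministic PRNG, seeded once
    return seed, order

seed, order = shuffled_inputs(["a.rb", "b.rb", "c.rb"])
# Re-running with the same seed reproduces the same ordering:
assert shuffled_inputs(["a.rb", "b.rb", "c.rb"], seed) == (seed, order)
```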
I’ve also come to believe, for this reason, that general-purpose hash tables should be insertion-ordered by default. It turns out that with some clever tricks, maintaining insertion order isn’t much more expensive than not doing so, and doing so both addresses a common source of accidental nondeterminism, and can even close some security holes, where hash table order can leak information about memory addresses.
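Python’s built-in dict is a mainstream example of this design: since Python 3.7 it guarantees insertion-order iteration, so the following holds on every run, regardless of hash randomization:

```python
# dict iteration follows insertion order, not hash order, since Python 3.7.
d = {}
for name in ["gamma", "alpha", "beta"]:
    d[name] = len(name)

# The order keys come back out is the order they went in, on every run,
# independent of each key's hash value or PYTHONHASHSEED.
assert list(d) == ["gamma", "alpha", "beta"]
```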
Determinism is one of those topics that seems simple and that we all perhaps think we understand, but where I find there’s a lot of depth and a lot to think about if you dig in. If you have any other thoughts on this topic, or places where deterministic software has been unexpectedly relevant or powerful in a system you work on, I’d love to hear about it.