Determinism in software engineering

metadata

                            November 27, 2020

                Determinism in software engineering

                        A few weeks back I wrote about nondeterministic performance and the problems it poses for benchmarking. That reminded me that I’ve got a bunch of thoughts about regular old determinism that I’ve been meaning to write up, and so that’s today’s topic.
Determinism
I describe a program’s behavior as “deterministic” if running it multiple times on the same inputs will reliably produce the same outputs. I expect this to be a non-controversial definition, although there is room for nuance in defining exactly what a program’s “inputs” are or what the “same outputs are” — in some contexts we might allow an output to contain a timestamp and still call that determinism, but in others that might be a problem.
Most of the time, when I talk about determinism, I will use it to refer to bit-identical outputs given bit-identical inputs, but I will exclude filesystem metadata. So a program doesn’t need to call utime on all its outputs for me to consider it deterministic, but if gzips the output it needs to make sure to fix the gzip mtime header field.
Why do we care about determinism?
One consistent theme of my experiences as a software engineer has been that, over time, I appreciate the value of determinism in software more and more. A big part of what I want to convey in this letter is how valuable I find deterministic software, and to exhort you to put more effort into making your software more deterministic by default.
With that preface, let’s jump into some places where determinism comes in handy:
Debugging and reproducibility
It’s widely appreciated that bugs are easier to find, debug, and fix when they can be reliably reproduced. The more deterministic a piece of software is, the more likely it is that bugs will happen consistently, instead of intermittently. And when we do manage to catch a bug, it is invaluable to be able to re-run our software on the same inputs with additional instrumentation — be that with logging enabled, a debugger attached, or even just additional printlns — and be confident we will see the bug again. And if we have a minimal pair — an input with the bug, and a very similar input without a bug — deterministic behavior helps us trace through behavior side-by-side and see where they diverge, and have confidence that the divergence is somehow related to the difference in input, and not random environmental factors.
I’ll have an anecdote later on about the challenges of debugging nondeterministic software, and some conclusions I take away.
Record/replay testing
I’ve written previously about record/replay testing, sometimes also called “golden master” testing, where you save the result of running one version of your program, and then compare future executions against that output to check for unintentional divergences.
As I mentioned in that piece, this technique is much more powerful the more deterministic your software is. If the output varies with fixed inputs, then you will need to figure out how to handle that variation in your test harness, introducing complexity and points of potential error. Even without a full testing harness, it’s valuable to be able to use diff directly to compare outputs or log traces, where possible.
Coverage-guided fuzzing
Coverage-guided fuzzing (as pioneered by AFL) is one of the absolute coolest computing innovations of the last decade. Fuzzers based on this lineage have been responsible for stunning numbers of bugs in C and C++ code, with lesser but still notable success in other languages. It’s also been adopted for property-based testing tools in Python and other languages, with the aim of allowing them to explore their input space much more effectively.
One major assumption, however, of coverage-guided fuzzers is that the target has not only deterministic outputs, but deterministic execution patterns. A target that executes different lines of code or in different orders based on wall-clock time or random factors creates a lot of noise that makes it much harder for these fuzzers to explore their targets. Especially if you’re implementing a system in C or C++, AFL and libfuzzer alone should be motivation enough to make your tool as deterministic as you can, the better to effectively fuzz it.
Reproducible builds
One trend that I’ve witnessed gaining steam in the last ten or twenty years is that of reproducible builds. You can read more about the effort that URL, but the important observation for this letter is that reproducible builds require deterministic build systems. And as build systems become more complex and feature more and more software that runs inside them, this ends up requiring a lot of software to become deterministic. If reproducible builds are a worthy goal — and I agree they are — then this is a strong motivation for more software to support a fully-deterministic mode, at a minimum as an option.
Content-addressed stores and caching
The Bazel build system encourages deterministic builds, and tries hard to support and encourage third-party build definitions and addons to be deterministic wherever possible. This has many benefits, including reproducible builds as cited previously, but it’s also a key enabler for Bazel’s caching strategy.
Bazel supports caching arbitrary build steps, including in a networked cache interface shared by multiple machines — e.g. shared by multiple developer laptops or even between development laptops and CI. Bazel caches individual steps in the build using a content-addressed store; it essentially just hashes all inputs to a step (e.g. the command-line it would run, the compiler itself, and the contents of all input files), down to a single hash value, and looks up that hash in the cache store.
However, in order for this approach to successfully cache complex builds, it requires that all intermediate steps be fully deterministic! Consider a simple C build process, in which a .c  file is first compiled into a .o object, and then converted into an executable by linking the .o file with the necessary libraries.
If two developers have the same .c source file, a shared cache will always allow one to reuse the other developer’s compilation step. However, suppose that for whatever reason, we do not cache the compile step, so that each developer builds their own .o file. If the compiler is deterministic, then both developers end up with identical .o files, and we can continue benefiting from the shared cache — perhaps one can re-use the other’s linker result. If, however, the compiler is subtly non-deterministic — perhaps it embeds a build timestamp in the object file — then, as soon as each machine runs the compiler separately, our builds will diverge forever, and we will gain no additional benefit from the shared cache! Deterministic builds are a key ingredient in making this caching strategy effective. In real build pipelines, furthermore, we have many more than two sequential steps, which exposes both many more possibilities to benefit from caching, but also many more opportunities for the builds to diverge.
We could imagine caching strategies that are more robust to non-determinism. Perhaps instead of acting a step at a time, we could hash the entire build DAG leading up to an object, or use symbolic names for objects — “the .o generated by compiling the .c file with hash <…>” — instead of their content. However, the bazel approach has the benefit of being extremely simple, extremely scalable, and almost trivially correct. Given the other advantages of having deterministic builds, anyways, this seems like a sensible tradeoff to me!
Where does nondeterminism come from?
Computers are mostly-deterministic beasts at heart; executing the same sequence of CPU instructions on the same data will yield the same behavior and results every time. However, there are a large number of ways that nondeterminism leaks into our software systems. Here are some of the more commonly encountered ones:

Explicit use of randomness. Any code that explicitly uses a random-number generator will have behavior that changes between runs, unless it deliberately uses a deterministic generator with a fixed seed.
Dependency on the current time. This manifests most directly as timestamps included in output files, but can have more indirect effects if we e.g. sort a set of records by timestamp. Keep in mind that, without some care, timestamps can move backwards during a program’s execution — even UTC moves backwards during the rare leap second! — so even the ordering of sequential timestamps isn’t necessarily stable. Pseudorandom number generators seeded off the current timestamp are a case of both this bullet and the previous one. 
Dependency on memory addresses. ASLR and other effects tend to make memory addresses nondeterministic between runs of programs, so this is a common source of entropy.
Iterating over hash tables with nondeterministic order. Hash tables often yield elements in nondeterministic order; this can either happen because they are explicitly randomized or because of reliance on pointer addresses in hash functions.
Concurrency. If a program runs multiple threads or performs multiple I/O or network operations concurrently, its behavior may vary depending on their relative completion or the interactions between the concurrent operations, which can vary depending on all kinds of factors.
Filesystem iteration order. Filesystems do not tend to guarantee the order in which they return results when iterating over a directory. Programs that consume a directory in filesystem order may result in different behaviors even on two directories containing otherwise-identical sets of files, if they are read in different orders. I once spent most of a day tracking down a bug where approximately half of the nodes in our Hadoop cluster were misbehaving, that ended up tracing down to half the nodes having files in one critical directory in a different order, as a result of a quirk in the bootstrap process.

Iteration in arbitrary order
One situation that arises fairly often in software programs is the need to iterate over a data structure that is logically a set — it has well-defined membership, but not a well-defined ordering. Many languages choose to iterate over their Set type, or over entries in a Map type, in hash-table implementation order. Because hash values often depend on pointer addresses, or are even explicitly randomized for security purposes, this is a common source of nondeterminism, as mentioned above. I want to say a few more thoughts about this specific situation.
First, an anecdote: I first “got religion” about determinism while working on a compiler in an undergraduate compilers class. We had used java’s HashMap class ubiquitously for symbol tables and other maps, and routinely iterated over those maps in our compiler. This served us well for most of the class, but near the end of the semester, as our compiler got more complex and our bugs more subtle, we started running into every engineer’s nightmare: Unreproducible bugs that would only occur sometimes, even running on an identical test case. It turned out that we had some numbers of bugs that only manifested if we iterated over, say, local variables, in some orders, but not others. Attempts to trace down this bugs were nightmarish, since we couldn’t even compare debug logs from two buggy runs, as they might execute on different “bad” orderings.
As soon as we realized what was happening, we quickly implemented a global switch to LinkedHashMaps, which order elements in insertion order. With the compiler now much more deterministic, our bugs fell with great ease.
The acute reader, at this point, might ask whether the random ordering wasn’t beneficial. Didn’t it represent a (limited, accidental) form of fuzzing, by making sure that we were resilient to arbitrary orders? After all, our algorithms — on paper — should work for any iteration order, and so perhaps our deterministic iteration is just hiding bugs that other source programs might manifest.
For our undergraduate compiler, we didn’t worry too hard about this possibility, and just worked until it passed the required test cases and called it victory.
When working on Sorbet, however, we encountered exactly this problem. Sorbet aims to be agnostic to source file ordering; unlike Ruby itself, it aims to produce the same results no matter what order it considers files. However, in the pursuit of deterministic results during development, we had always been sorting input files in our wrapper script, and it turned out that we had all kinds of ordering bugs when we passed files in different orders.
When we realized this, we started routinely testing with random orderings of input files. Importantly, though, we did this by randomizing the list ahead of time, storing the order, and then calling Sorbet. This meant that we knew the ordering that had triggered any bug, and made it easy to reproduce and investigate. 
This points me to a general lesson I’ve taken away from these experiences: When iterating over a set, the most robust design is to:

Deliberately remove randomness from the order, by sorting elements in some known way, and then
Deliberately re-add randomness, but in a controlled way, so that bugs remain reproducible

In Sorbet, we added randomness this by shuffling the input list and saving a copy. A more general strategy might look like a --shuffle-seed flag, which seeds an deterministic PRNG, which is used to shuffle elements of any set that we iterate over. If this shuffle is a performance problem (due to e.g. constructing temporary lists to shuffle), it could be disabled in optimized builds. CI — or, better, a dedicated fuzz job — can then pass random seed values, which should have the effect of testing random orderings, while also making it easy to reproduce behavior.
I’ve also come to believe, for this reason, that general-purpose hash tables should be insertion-ordered by default. It turns out that with some clever tricks, maintaining insertion order isn’t much more expensive than not doing so, and doing so both addresses a common source of accidental nondeterminism, and can even close some security holes, where hash table order can leak information about memory addresses.
Conclusions
Determinism is one of those topics that seems simple and that we all perhaps think we understand, but where I find there’s a lot of depth and a lot to think about if you dig in. If you have any other thoughts on this topic, or places where deterministic software has been unexpectedly relevant or powerful in a system you work on, I’d love to hear about it.

                            Don't miss what's next. Subscribe to Musing in Computer Systems: