Test size and scope

Software Engineering at Google

                        November 5, 2020


                        It’s been A Week
I have a 60%-written post on software performance I was hoping to send this week, but in retrospect it was pretty foolish to expect I would do any writing at all after about noon on Tuesday. So here’s a short piece from my prepared backlog about testing instead.
Describing different sizes of tests
I’ve been intermittently working through Software Engineering at Google, a book by several experienced Google engineers that attempts to summarize what they’ve learned about developing and maintaining software at scale and over extended periods of time. I find it at intervals both dry and annoyingly condescending in a way I’ve come to associate with some Google engineers, but it’s also full of a lot of good insight and hard-earned lessons. It’s been particularly interesting to compare their advice with my own experiences and lessons learned working on developer tooling and practices at smaller but still sizeable scale.
Today, I want to talk about one concept from Chapter 11 (“Testing Overview”) that I found quite helpful.
Scope vs size
Many teams categorize tests as “unit,” “integration,” “functional,” “system,” or similar, based on some concept of the size of the system being tested. In my experience, nearly every organization that uses these words uses them in different ways, with different connotations. As extreme examples, I have seen “unit test” used to mean both “a test that tests a single source file in a single process with no external dependencies at all” and “any test that runs in a fully automated fashion in CI” (as opposed to forms of manual testing). This confusion has led me to avoid these terms whenever possible, in favor of more descriptive ones.
Google, to my delight, has not just mostly set these terms aside, but has come up with a better scheme for test classification! Importantly, they classify all tests along two related but distinct axes, not just one:

A test’s size refers to the resources that are used to run a test. By proxy, this also characterizes the types of dependencies a test is allowed to have — for instance, is a test allowed to access an external database process?
A test’s scope refers to how much code is actually being validated by a test — are we trying to test only the contents of a single method or class, or are we deliberately exercising and testing interactions between multiple components or subsystems?

I love this distinction because it captures the difference between the size of “the system under test” — the component I am trying to gain confidence about by writing this test — and the size of the dependencies that I need to (or am choosing to) execute in order to gain that confidence.
A classic example for me is attempting to write tests for a single model class implemented on top of something like the Django ORM. On the one hand, such a test feels very “unit test”-y: I’m trying to exercise and make assertions about the behavior of a single class in a single file. On the other hand, in most ORMs I’ve worked with, the easiest way to test such a model involves actually writing data to the database and reading it back. In many environments, “unit” tests aren’t permitted to touch the database. So do we call these tests “unit” tests or not?
Separating our taxonomy into these two axes neatly gives us vocabulary to talk about such tests. A test that exercises logic in a single model, but which talks to a database to do so, has a “small” scope (it is validating a single module) but, in Google’s terms, a medium size (because it executes over multiple processes: the test process and the database).
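To make the two-axis idea concrete, here is a minimal sketch of labelling tests with a scope and a size independently. The `Scope` and `Size` enums, the `classify` decorator, the `select` helper, and the invoice test names are all hypothetical, invented for illustration; nothing here is taken from the book or from any real test framework.

```python
from enum import Enum


class Scope(Enum):
    """How much code the test is trying to validate."""
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


class Size(Enum):
    """What resources the test needs in order to run."""
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"


def classify(scope: Scope, size: Size):
    """Attach scope and size labels to a test function (hypothetical helper)."""
    def decorator(test_func):
        test_func.scope = scope
        test_func.size = size
        return test_func
    return decorator


@classify(scope=Scope.SMALL, size=Size.MEDIUM)
def test_invoice_model_round_trip():
    # Placeholder: would save one model instance to a real database
    # process and read it back.
    pass


@classify(scope=Scope.SMALL, size=Size.SMALL)
def test_invoice_total_is_sum_of_lines():
    # Placeholder: pure in-memory logic on the same model, no database.
    pass


ALL_TESTS = [test_invoice_model_round_trip, test_invoice_total_is_sum_of_lines]


def select(tests, *, scope=None, size=None):
    """Return the names of tests matching the requested scope and/or size."""
    return [
        t.__name__
        for t in tests
        if (scope is None or t.scope is scope)
        and (size is None or t.size is size)
    ]
```

Both tests share a scope, since each validates a single model, but only one of them needs the database. Filtering on size alone (say, to run everything that needs no external process) therefore separates them, while filtering on scope alone does not.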
Definitions
Google uses strict definitions for test size, based on the environment where the test executes code:

A small test runs within a single process (depending on the language, a single thread), and can’t access the filesystem, perform I/O, or for the most part, otherwise interact outside of that process.
A medium test runs within a single machine. It can run multiple threads and processes (for instance, multiple services from a codebase, or an external database process), and it can access the filesystem and communicate over localhost.
A large test has almost no restrictions, and can talk over the network, potentially including talking to external services running outside of the test setup. A test that accesses a third party’s test-mode environment, for instance, would have to be large.
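One way to read the three definitions above is as a rule for the smallest size a test can claim, given what it touches. The sketch below encodes my paraphrase of those definitions, not Google's actual tooling; the `ResourceUse` fields and the `minimum_size` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceUse:
    """What a test touches while it runs (hypothetical description)."""
    extra_processes: bool = False    # e.g. a local database or sibling service
    filesystem: bool = False
    localhost_network: bool = False
    external_network: bool = False   # anything beyond the local machine


def minimum_size(use: ResourceUse) -> str:
    """Return the smallest size label that permits the given resource use."""
    if use.external_network:
        return "large"
    if use.extra_processes or use.filesystem or use.localhost_network:
        return "medium"
    return "small"
```

Under these assumptions, a model test that talks to a local database process would be described as `ResourceUse(extra_processes=True, localhost_network=True)` and come out as medium, while a test hitting a third party's test-mode environment would set `external_network=True` and come out as large.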

These feel like useful least-common-denominator definitions, especially in an environment like Google, where the testing infrastructure has to support an incredible diversity of languages, types and sizes of systems, and use cases.
However, in many non-Google environments, I think the merit of these axes is in creating clear communication, rather than enforcing specific limits, so I would feel free to adopt my own definitions for them. I would try to ground these in the specific concepts of the system being tested; if I were still working on the systems I worked on at Stripe, I might push to categorize tests along both size and scope using a scale something like:

module — a test validates behavior in a single module (scope), or only runs code in a module without importing any dependencies (size)
service — a test validates the behavior of an entire service (scope), or runs code in a service without making any network calls to other services (size)
database — a test runs in a context where it has access to a database (only makes sense as a size)
system — a test attempts to validate interactions between multiple services (scope), or runs code in a way that makes network calls to external services, be they other Stripe-authored services, or other systems like Redis.
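As a rough sketch, the four-level scale above could be written down as a single set of labels used on both axes, with a check for the one level that only makes sense as a size. The `Level` enum, `VALID_SCOPES`, and `describe` are hypothetical names for illustration, and this is just one possible encoding of the thought experiment, not something that was used in practice.

```python
from enum import Enum


class Level(Enum):
    """The four levels from the list above."""
    MODULE = "module"
    SERVICE = "service"
    DATABASE = "database"
    SYSTEM = "system"


# "database" only makes sense as a size, so it is left out of the valid scopes.
VALID_SCOPES = {Level.MODULE, Level.SERVICE, Level.SYSTEM}


def describe(scope: Level, size: Level) -> str:
    """Validate a (scope, size) pair and render it as a short label."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"{scope.value} is not a valid scope")
    return f"scope={scope.value}, size={size.value}"
```

With this encoding, the ORM model test from earlier would be labelled `describe(Level.MODULE, Level.DATABASE)`: it validates one module but needs a database to run.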

This is a thought experiment, so I’m not sure I’d settle on precisely those definitions, but I do feel confident that splitting test classification into two axes and thinking of them in that way would be a huge improvement over what we actually used.
Closing notes
I have often thought of test size by categorizing the code executed during a test into two pieces: the system being tested — which is to say, the actual code I want to make assertions about — and its dependencies, which are all the code that I’ve chosen to execute in order to support the test, but which — for the purposes of this test — I’m assuming is behaving correctly. I was very pleased to find that Google has codified a variant of this notion into a somewhat-consistent terminology for talking about types of test, and I’m going to experiment with using this kind of a framework going forward.
Have you come across any classification systems for the size or scope of a test, or of a system under test, that you’ve found particularly informative or helpful in designing, operating, or working within a testing environment? I’d love to hear about it if you have.
