I have a 60%-written post on software performance I was hoping to send this week, but in retrospect it was pretty foolish to expect I would do any writing at all after about noon on Tuesday. So here’s a short piece from my prepared backlog about testing instead.
I’ve been intermittently working through Software Engineering at Google, a book by several experienced Google engineers that attempts to summarize what they’ve learned about developing and maintaining software at scale and over extended periods of time. I find it at intervals both dry and annoyingly condescending in a way I’ve come to associate with some Google engineers, but it’s also full of a lot of good insight and hard-earned lessons. It’s been particularly interesting to compare their advice with my own experiences and lessons learned working on developer tooling and practices at smaller but still sizeable scale.
Today, I want to talk about one concept from Chapter 11 (“Testing Overview”) that I found quite helpful.
Many teams categorize tests as “unit,” “integration,” “functional,” “system,” or similar, based on some concept of the size of the system being tested. In my experience, nearly every organization that uses these words uses them in different ways, with different connotations. As extreme examples, I have seen “unit test” used to mean both “a test that tests a single source file in a single process with no external dependencies at all” as well as “any test that runs in a fully automated fashion in CI” (as opposed to forms of manual testing). This confusion has led me to avoid using these terms whenever possible, in favor of looking for more descriptive terms.
Google, to my delight, has not just mostly set these terms aside, but has come up with a better scheme for test classification! Importantly, they classify all tests along two related but distinct axes, not just one:

- Size: how many resources the test consumes and what it is permitted to do — whether it runs in a single process, talks to other processes, or spans multiple machines.
- Scope: how much code the test is intended to verify — a single class or function, a module, or an entire system.
I love this distinction because it captures the difference between the size of “the system under test” — the component I am trying to gain confidence about by writing this test — and the size of the dependencies that I need to (or am choosing to) execute in order to gain that confidence.
A classic example for me is attempting to write tests for a single model class implemented on top of something like the Django ORM. On the one hand, such a test feels very “unit test”-y — I’m trying to exercise and make assertions about the behavior of a single class in a single file. On the other hand, in most ORMs I’ve worked with, the easiest way to test such a model involves actually writing data to the database and reading it back. In many environments, “unit” tests aren’t permitted to touch the database. So do we call these tests “unit” tests or not?
Separating our taxonomy into these two axes of organization neatly gives us vocabulary to talk about such tests. A test that exercises logic in a single model, but which talks to a database to do so, has a “small” scope (it is validating a single module), but (in Google’s terms) a medium size (because it executes over multiple processes — the test process, and the database).
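To make that concrete, here’s a minimal Python sketch of a small-scope test against an invented `User` model whose persistence goes through a real database. (The model and schema are hypothetical, and I’m using an in-process sqlite database to keep the example self-contained — against a separate database server, this would be a medium-size test in Google’s terms, since the database runs in its own process.)

```python
import sqlite3


class User:
    """A tiny model whose persistence goes through a real database."""

    def __init__(self, first, last):
        self.first = first
        self.last = last

    @property
    def display_name(self):
        # The logic this test actually wants to validate.
        return f"{self.first} {self.last}".strip()

    def save(self, conn):
        conn.execute(
            "INSERT INTO users (first, last) VALUES (?, ?)",
            (self.first, self.last),
        )

    @classmethod
    def load(cls, conn, user_id):
        row = conn.execute(
            "SELECT first, last FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return cls(*row)


def test_display_name_round_trips():
    # Small scope: we only care about User.display_name. But the easiest
    # way to gain confidence is to round-trip through the database, which
    # pulls the storage layer into the test as a dependency.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, first TEXT, last TEXT)"
    )
    User("Ada", "Lovelace").save(conn)
    loaded = User.load(conn, 1)
    assert loaded.display_name == "Ada Lovelace"
```

The two-axis vocabulary lets us say precisely what this is: the scope is one model’s logic, while the size is determined by how much machinery (here, a database) we execute to check it.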
Google uses strict definitions for test size, based on the environment where the test executes code:

- Small tests must run in a single process, and may not sleep, perform I/O, or make other blocking calls.
- Medium tests may span multiple processes and make blocking calls, but must run on a single machine, with network access limited to localhost.
- Large tests have no such restrictions, and may span multiple machines.
These feel like useful least-common-denominator definitions, especially in an environment like Google’s, where the testing infrastructure has to support an incredible diversity of languages, types and sizes of systems, and use cases.
However, in many non-Google environments, I think the merit of these axes lies in creating clear communication, rather than in enforcing specific limits, so I would feel free to adopt our own definitions for these different axes. I would try to ground them in the specific concepts of the system being tested; if I were still working on the systems I worked on at Stripe, I might push to categorize tests along both size and scope using a scale something like:
This is a thought experiment so I’m not sure I’d settle on precisely those definitions, but I do feel confident that splitting test classification into two axes and thinking of them in that way would be a huge improvement over what we did use.
I have often thought of test size by categorizing the code executed during a test into two pieces: the system being tested — which is to say, the actual code I want to make assertions about — and its dependencies, which are all the code that I’ve chosen to execute in order to support the test, but which — for the purposes of this test — I’m assuming is behaving correctly. I was very pleased to find that Google has codified a variant of this notion into a somewhat-consistent terminology for talking about types of tests, and I’m going to experiment with using this kind of framework going forward.
Have you found any classification systems for the size or scope of a test — or of a system under test — that you’ve found particularly informative or helpful in designing, operating, or working within a testing environment? I’d love to hear about it, if you have.