I’m a big advocate of Empirical Software Engineering. I wrote a talk on it. I wrote a 6000-word post covering one controversy. I spend a lot of time reading papers and talking to software researchers. ESE matters a lot to me.
I’m also a big advocate of formal methods (FM). I wrote a book on it, I’m helping run a conference on it, I professionally teach it for a living. There’s almost no empirical evidence that FM helps us deliver software cheaper, because it’s such a niche field and nobody’s really studied it. But we can study a simpler claim: does catching software defects earlier in the project life cycle reduce the cost of fixing bugs? Someone asked me just that.
Which meant I’d have to actually dive into the research.
I’ve been dreading this. As much as I value empirical evidence, software research is also a train wreck where both trains were carrying napalm and tires.
Common Knowledge is Wrong
If you google “cost of a software bug” you will get tons of articles that say “bugs found in requirements are 100x cheaper than bugs found in implementations.” They all use this chart from the “IBM Systems Sciences Institute”:
There’s one tiny problem with the IBM Systems Sciences Institute study: it doesn’t exist. Laurent Bossavit did an exhaustive trawl and found that the ISSI, if it did exist, was an internal training program and not a research institute. As far as anybody knows, that chart is completely made up.
You also find a lot by Barry Boehm and COCOMO, which is based on research from the 1970s, and corruptions of Barry Boehm, which Bossavit tears down in his book on bad research. You also get lots of people who just make up hypothetical numbers, then other people citing those numbers as fact.
It’s a standard problem with secondary sources: most of them aren’t very good. They corrupt the actual primary information to advance their own agendas. If you want to get the lay of what the research actually says, you need to read the primary sources, meaning the papers that everybody’s butchering.
But first you gotta find the primary sources.
Finding things is pain
The usual problem people raise with research is the cost: if you don’t have an institutional subscription to a journal, reading a single paper can cost 40 bucks. If you’re skimming dozens of papers, you’re suddenly paying in the thousands just to learn “is planning good”. Fortunately you can get around the paywalls with things like sci-hub. Alexandra Elbakyan has done more for society than the entire FSF. YEAH I WENT THERE
The bigger problem is finding the papers to read. General search engines have too much noise, academic search engines are terrible or siloed across a million different journals or both, and you don’t know what to search. Like are you searching bug? Hah, newbie mistake! A good two-thirds of the papers are about defects. What’s the difference between a “bug” and “defect”? Well, one’s a bug and the other’s a defect, duh!
I’m sure this is slightly easier if you’re deeply embedded in academia. As an outsider, it feels like I’m trying to learn the intricacies of Byzantine fault tolerance without having ever touched a computer. Here’s the only technique I’ve found that works, which I call scrobbling even though that means something totally different:
- Search seed terms you know, like “cost of bugs”, in an appropriate journal (here’s one).
- Find papers that look kinda relevant, skim their abstracts and conclusions.
- Make a list of all the papers that either cite or are cited by these papers and repeat.
- Find more useful terms and repeat.
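That loop is essentially a breadth-first walk of the citation graph. Here’s a minimal sketch of the idea; the graph, the paper names, and the `chase_citations` helper are all invented for illustration:

```python
from collections import deque

# Hypothetical citation graph: paper -> papers it cites (all titles invented).
CITES = {
    "cost-of-bugs-survey": ["boehm-1976", "tsp-case-study"],
    "tsp-case-study": ["defect-taxonomy"],
    "boehm-1976": [],
    "defect-taxonomy": ["boehm-1976"],
}

def chase_citations(seeds, max_papers=50):
    """Walk outward from seed papers, collecting everything that
    cites or is cited by papers we've already found."""
    # Build the reverse ("cited by") edges too.
    cited_by = {p: [] for p in CITES}
    for paper, refs in CITES.items():
        for ref in refs:
            cited_by.setdefault(ref, []).append(paper)

    seen, queue = set(), deque(seeds)
    while queue and len(seen) < max_papers:
        paper = queue.popleft()
        if paper in seen:
            continue
        seen.add(paper)
        # Follow both citation directions and repeat.
        queue.extend(CITES.get(paper, []))
        queue.extend(cited_by.get(paper, []))
    return seen

found = chase_citations(["cost-of-bugs-survey"])
```

In practice the “graph” lives in your notes and a pile of browser tabs rather than a dict, but the traversal is the same: every paper you keep opens up both its reference list and the papers citing it.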
Over time you slowly build out a list of good “node papers” (mostly literature reviews) and useful search terms to speed up this process, but it’s always gonna be super time-consuming. Eventually you’ll have a big messy mass of papers, most of which are entirely irrelevant, some of which are mostly irrelevant, and a precious few that are actually topical. Unfortunately, the only way to know which is which is to grind through them.
Most Papers are Useless
A lot will be from before 2000, before we had things like “Agile” and “unit tests” and “widespread version control”, so you can’t extrapolate any of their conclusions to what we’re doing. As a rule of thumb I try to keep to papers after 2010.
Not that more recent papers are necessarily good! I mentioned earlier that most secondary sources are garbage. So are most primary sources. Doing science is hard and we’re not very good at it! There are lots of ways to make a paper useless for our purposes.
- Calculating bugs in changed files, but measuring program metrics across the whole project
- Basing a lot of their calculations on sources that are complete garbage
- Never clearly mentioning that all of their data comes exclusively from complex cyber-physical systems that 99.9999% of developers will never see
- Accidentally including beginner test repos in their GitHub mining
- Screwing up the highly error-prone statistical analysis, and then seventeenuple-counting the same codebase for good measure
This doesn’t even begin to cover the number of ways a paper can go wrong. Trust me, there are a lot. And the best part is that most of these errors are very subtle and the only way you can notice them is by carefully reading the paper. In some cases, you need a second team of researchers to find the errors the first team made. That’s what happened in the last reference, where chronicling the whole affair took me several months. Just chronicling it, after the dust had settled. I can’t imagine how much effort went into actually finding the errors.
Oh, and even if nobody can find any errors, the work might not replicate. Odds are you’ll never find out, because the academic-industrial complex is set up to discourage replication studies. Long story short, if you want to use a cite as evidence, you need to carefully read it to make sure it’s actually saying something you want.
Good papers are useless too
Well here’s a paper that says inspection finds defects more easily in earlier phases! Except it doesn’t distinguish between defect severity, so we have no idea if finding defects earlier is cheaper. But this paper measures cost-to-fix too, and finds there’s no additional cost to fixing defects later! But all the projects in it used the heavyweight Team Software Process (TSP). And it contradicts this paper, which finds that design-level reviews find many more bugs than code-level reviews… in a classroom setting.
Did I mention that all three of those papers use different definitions of “defect”? Could mean “something that causes the program to diverge from the necessary behavior”, could mean a “misspelled word” (Vitharana, 2015). So even good papers are working with small datasets, over narrow scopes, with conflicting results, and they can’t even agree on what the words mean.
I normally call this “nuance”. That is a very optimistic term. The pessimistic term is “a giant incoherent mess of sadness.”
Nobody believes research anyway
The average developer thinks empirical software engineering is a waste of time. How can you possibly study something as complex as software engineering?! You’ve got different languages and projects and teams and experience levels and problem domains and constraints and timelines and everything else. Why should they believe your giant incoherent mess of sadness over their personal experience or their favorite speaker’s logical arguments?
You don’t study the research to convince others. You study the research because you’d rather be technically correct than happy.
Well, first of all, sometimes there is stuff that we can all agree on. Empirical research overwhelmingly shows that code review is a good way to find software bugs and spread software knowledge. It also shows that shorter iteration cycles and feedback loops lead to higher quality software than long lead times. Given how hedged and halting most empirical claims are, when everybody agrees on something, we should pay attention. Code review is good, fast feedback is good, sleep is good.
Second, there’s a difference between ESE as a concept and ESE as practiced. I’m a big proponent of ESE, but I also believe that the academic incentive structures are not aligned in a way that would give industry actionable information. There’s much more incentive to create new models and introduce new innovations than to do the necessary “gruntwork” that would be most useful: participant observation, manual compilation and classification of bugs, detailed case studies, etc. This is an example of the kind of research I think is more useful. A team of researchers followed a single software team for three years and sat in on all of their sprint retrospectives. Even if the numbers don’t translate to another organization, the general ideas are worth reading and reflecting on.
(Of course academia doesn’t exist just to serve industry, and having cross-purpose incentives isn’t necessarily a bad thing. But academics should at least make a conscious choice to do work that won’t help the industry, as opposed to thinking their work is critical and wondering why nobody pays attention.)
Finally, even if the research is a giant incoherent mess of sadness, it’s still possible to glean insights, as long as you accept that they’ll be 30% research and 70% opinion. You can make inferences based on lots of small bits of indirect information, none of which is meaningful by itself but which together paint a picture.
Are Late-Stage Bugs More Expensive?
Oh yeah, the original question I was trying to answer. Kinda forgot about it. While there’s no smoking gun, I think the body of research so far tentatively points in that direction, depending on how you interpret “late-stage”, “bugs”, and “more expensive”. This is a newsletter, not a research paper, so I’ll keep it all handwavey. Here’s the rough approach I took to reach that conclusion:
Some bugs are more expensive than others. You can sort of imagine it being a Gaussian, or maybe a power law: most bugs are relatively cheap, a few are relatively expensive. We’d mine existing projects to create bug classifications, or we’d interview software developers to learn their experiences. Dewayne Perry did one of these analyses and found the bugs that took longest to fix (6 or more days) were things like feature interaction bugs and unacceptable global performance, in general stuff that’s easier to catch in requirements and software modeling than in implementation.
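To make that intuition concrete, here’s a toy simulation assuming a heavy-tailed (lognormal) cost distribution. Every number below is invented for illustration, not taken from any study:

```python
import random

random.seed(0)

# Hypothetical fix costs in work-hours, drawn from a heavy-tailed lognormal:
# most bugs are cheap, a few are very expensive. Parameters are made up.
costs = [random.lognormvariate(1.0, 1.2) for _ in range(10_000)]

# Fraction fixed within one workday vs. fraction taking 6+ workdays
# (roughly 48 work-hours).
cheap = sum(c < 8 for c in costs) / len(costs)
expensive = sum(c > 48 for c in costs) / len(costs)

print(f"{cheap:.0%} of simulated bugs cost under a day; "
      f"{expensive:.1%} cost six-plus days")
```

Under these made-up parameters the bulk of bugs land well under a workday while a small tail drags on for a week or more, which is the shape the mining and interview studies are trying to pin down.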
I’ve checked a few other papers and think I’m tentatively confident in this line of reasoning: certain bugs take more time to fix (and cause more damage) than others, and said bugs tend to be issues in the design. I haven’t vetted these papers for major mistakes, though. Vetting is more time-consuming when you’re trying to synthesize a stance from lots of indirect pieces of evidence: you’re using a lot more papers but doing a lot less with each one, so the cost gets higher.
Anyway, I’m now 2000 words in and need to do other things today. tl;dr science is hard and researching it is hard and I really value empiricism while also recognizing how little quality work we’ve actually done in it.
TLA+ Conf schedule posted
Here! Even if you’re not attending in person, we’re looking to stream it live, so def check it out! I’ll be giving a talk on tips and tricks to make spec writing easier. And there might be just a bit of showboating ;)
Update because this is popular
This was sent as part of an email newsletter; you can subscribe here. Common topics are software history, formal methods, the theory of software engineering, and silly research dives. I also have a website where I put my more polished and heavily-edited writing; newsletter is more for off-the-cuff writing. Updates are at least 1x a week.
I’m seeing a lot of people read this and conclude that “software engineering” is inherently nonsensical and has nothing to do with engineering at all. Earlier this year I finished a large journalism project where I interviewed 17 “crossovers” who worked professionally as both “real” engineers and software developers. You can read the series here!