Science Turf War
I'm speaking at YOW! I'll be presenting Designing Distributed Systems with TLA+ in Sydney, Melbourne, AND Brisbane. If you're coming to one of those conferences, come say hi!¹ Unfortunately, I won't have my usual sack of chocolates on me (international travel), but still. Say hi.
Speaking of which...
No Newsletter Next Two Weeks
Because I'll be at conferences. There might be a bonus or two, but I don't want you expecting the usual once-a-week newsletters if I can't guarantee I'll write them.
Okay, now for the main course.
A Brief History of the Current Empirical Software Engineering (ESE) Science Turf War
About a week ago, someone I knew retweeted this:²
Happy to announce our rebuttal of Berger et al., TOPLAS 2019 paper, available on Medium https://link.medium.com/3dQgOK3QJ1. Full, gory details are on ArXiv https://arxiv.org/abs/1911.07393. Tl;dr: our results hold, they reproduced them, their study has many issues.
And I knew I had a long day ahead of me. I used both these people's original papers, and the TOPLAS replication, as part of my talk What We Know We Don't Know. I used it to show the dangers of data mining and why scientific replication was so, so important. And now they were saying that the replication itself had many issues. If true, I'd need to rethink a lot of my positions about ESE. And rewrite a bunch of the talk. So I needed to carefully read the rebuttal, which meant carefully rereading the replication, which meant carefully rereading the original paper.
And you have to read my summaries. You joined the newsletter. You have nobody to blame but yourself.
Disclosure: I'm firmly on the side of the replicators here, and have been in communication with them about their results.
Premkumar Devanbu, Vladimir Filkov, and others publish Large Scale Study of Programming Languages and Code Quality in Github at the Foundations of Software Engineering conference. Because that's really long and going to cause naming conflicts later, we'll refer to it as the FSE paper.
In FSE, the authors studied 729 GitHub projects, making up over 80 million lines of code, to see if some languages lead to more defects than others. They also dropped a bombshell in the abstract:
Most notably, it does appear that strong typing is modestly better than weak typing, and among functional languages, static typing is also somewhat better than dynamic typing. We also find that functional languages are somewhat better than procedural languages
While they cautioned that this is "overwhelmingly dominated by [...] project size, team size, and commit size", and added caveats that it might be due to other effects, the evidence was clear: typed functional languages produced higher quality code at a statistically significant level. While they also cautioned about the effect size being minimal, their data showed it was actually pretty strong. According to their analysis, if the "average language had four defective commits", then the equivalent codebase in C++ would have 5 and the Haskell codebase would have just over 3. TypeScript, by comparison, would have about 2.6.
Dan Luu, in his phenomenal piece Static v. dynamic languages, looked at various studies done on type systems.³ In that process, he found the FSE paper, and immediately raised some serious issues. First of all, their way of tracking bugs was odd: they were searching commit messages for keywords that corresponded to buggy claims. What if different projects had different commit practices? If they had a different bug reporting culture? But the much bigger issue is this:
As for TypeScript, the three projects they list as example TypeScript projects (bitcoin, litecoin, and qBittorrent) are C++ projects. So the intermediate result appears to not be that TypeScript is reliable, but that projects mis-identified as TypeScript are reliable. Those projects are reliable because Qt translation files are identified as TypeScript and it turns out that, per line of code, giant dumps of config files from another project don't cause a lot of bugs. It's like saying that a project has few bugs per line of code because it has a giant README. This is the most blatant classification error, but it's far from the only one. Since this study uses Github's notoriously inaccurate code classification system to classify repos, it is, at best, a series of correlations with factors that are themselves only loosely correlated with actual language usage.
So the FSE data collection processes were suspect. They misidentified projects, meaning they could not draw accurate conclusions about the relative quality of the projects.
He also found other issues, but that was big enough to throw doubt on the entire project.
Premkumar Devanbu, Vladimir Filkov, and others publish Large Scale Study of Programming Languages and Code Quality in Github in the Communications of the ACM magazine. Because that's really long and has naming conflicts from earlier, we'll refer to it as the CACM paper.
For the CACM paper, they reused the same dataset but redid the analysis, getting different numbers that still supported their main thesis. They also stopped classifying Bitcoin as a TypeScript project, and got that it led to more defective commits, not fewer. Also, the impact of other factors besides language is greatly reduced. Doubling the age of the project used to matter a lot more than the language you chose; now it matters less. They did not address this discrepancy in their paper, and they still claim that language factors are "overwhelmingly dominated" by other factors.
So we have again that typed functional languages are better. According to Google, the preprint and final versions have been cited over 200 times, making it a pretty influential paper on the academic discourse. It also frontpaged Hacker News twice, which we all know is the one true indicator of programmer thought leadership.
Emery Berger, Jan Vitek, and others publish On the Impact of Programming Languages on Code Quality: A Reproduction Study in the Transactions on Programming Languages and Systems journal. Because that's really long, we'll refer to it as the TOPLAS replication. They found a number of problems:
- 2% of the original data set were duplicate commits for forks, merges, etc. It also missed about 20% of the commits in those projects, and had a number of commits that weren't in the project histories. TOPLAS speculated this was because of
- The discrepancy between FSE and CACM in the values for the impacts of commit size, team size, etc, can be traced back to them mixing log10 in a bunch of calculations.
- They were using a p-value threshold of 0.05, but were testing 17 different languages. It was likely that at least one of their results was a false positive, and likely more. If you test multiple hypotheses, you need to correct for that.
- A third of the defect commits... weren't.
The review suggested a false-positive rate of 36%; i.e., 36% of the commits that the original study considered as bug-fixing were in fact not. The false-negative rate was 11%.
FSE was using string searches on the commit messages to find defects. TOPLAS got three developers to manually check a subset of the commits. Many of them had nothing to do with bugs! One, for example, had the message "Add lazyness to infix operators."
Infix has nothing to do with
After correcting for all of these, TOPLAS found that only four of the 17 languages had any statistically significant difference at all, and even those were suspect. The paper was basically unsalvageable.
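To make the "infix" example concrete, here's a minimal sketch of how a plain substring search on commit messages can flag a commit that has nothing to do with bugs. The keyword list and function names are my own, made up for illustration; I'm not claiming this is how either paper implemented its matching.

```python
import re

# Hypothetical keyword list, for illustration only (not taken from any paper).
KEYWORDS = ["bug", "fix", "error", "defect"]


def naive_is_bugfix(message: str) -> bool:
    """Flag a commit if a keyword appears anywhere, even inside another word."""
    lowered = message.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def word_is_bugfix(message: str) -> bool:
    """Flag a commit only if a keyword starts a word ('fixes' and 'bugs' still count)."""
    pattern = r"\b(?:" + "|".join(map(re.escape, KEYWORDS)) + r")"
    return re.search(pattern, message.lower()) is not None


messages = [
    "Add lazyness to infix operators.",    # the message quoted in the replication
    "Fix crash when parsing empty input",  # made-up example
    "Rename suffix helper",                # made-up example
]

for message in messages:
    print(f"{message!r}: naive={naive_is_bugfix(message)}, word={word_is_bugfix(message)}")
```

The substring version flags all three messages, because "infix" and "suffix" both contain "fix". The word-start version only flags the second one. Even the stricter version can't tell whether a commit that mentions a fix is really fixing a defect, which is why a manual check is still the better yardstick.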
Premkumar Devanbu, Vladimir Filkov, and others publish Rebuttal to Berger et al., TOPLAS 2019 on the ArXiv. They also publish a companion article on Medium, called On a Reproduction Study Chock-Full of Problems. Because that's really long and there's two of them, we'll refer to them collectively as the rebuttal.
The rebuttal, needless to say, is pretty unhappy with TOPLAS. They identify several flaws in the replication. In particular:
- TOPLAS was examining data from the 2014 FSE paper, not the 2017 CACM paper. The latter is the definitive version, and has a bunch of cleanup the FSE paper does not have. So they're looking at outdated data and deceptively telling people the modern version is wrong.
- The TOPLAS results agree in degree of impact with the CACM results.
- They corrected for multiple hypotheses with the Bonferroni correction, which is too conservative. The FDR correction is more accurate, but they used Bonferroni in their analysis.
- The manual process for identifying defect commits isn't actually much better than the automated process. When Devanbu and Filkov went over the list of buggy commits, they selected 12 commits at random and found that eleven of them were actually true positives.
So the replication is itself invalid, and the original points stand.
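For anyone who wants the mechanics behind the correction argument, here's a rough sketch of the two corrections being argued over, plus the chance of a false positive with no correction at all. The p-values are made up, the function names are mine, the first function assumes the tests are independent, and I'm using the Benjamini-Hochberg procedure as the FDR correction, which is an assumption about which variant is meant. None of this is the papers' actual analysis code.

```python
def familywise_error_rate(alpha: float, tests: int) -> float:
    """Chance of at least one false positive across independent tests with no real effect."""
    return 1 - (1 - alpha) ** tests


def bonferroni(p_values, alpha=0.05):
    """Reject only the p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Find the largest rank k with p_(k) <= (k / m) * alpha, then reject ranks 1..k."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[index] = True
    return rejected


# 17 languages tested at 0.05 each, no correction.
print(round(familywise_error_rate(0.05, 17), 2))  # 0.58

# Made-up p-values, purely to show the two corrections disagreeing.
p_values = [0.001, 0.004, 0.012, 0.03, 0.2]
print(bonferroni(p_values))          # [True, True, False, False, False]
print(benjamini_hochberg(p_values))  # [True, True, True, True, False]
```

On these made-up numbers Bonferroni keeps two results and Benjamini-Hochberg keeps four, which is the sense in which Bonferroni is the more conservative of the two. Whether that difference matters depends on the real p-values, not on this toy list.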
Hillel Wayne publishes "Science Turf War" in the "Computer Things" newsletter. Because that's really long and totally irreverent, we'll refer to it as "my thoughts".
I reviewed every relevant artifact in this story. First of all, I found an error in TOPLAS that nobody has talked about yet: they say that FSE's threshold for significance was p < 0.005, not p < 0.05. This affects one of their tables. I messaged the authors for their thoughts and they said they'd look into it.
Now a confession: I don't know anything about statistics. It's a really big gap in my knowledge that I really should get around to fixing one of these days. So I can't evaluate any of the quantitative claims, about the relative merits of Bonferroni vs FDR, the importance of bootstrapping, p-values versus prediction intervals, anything interesting like that. But I can look into the rebuttal's qualitative claims. And most of them, in my opinion, don't hold up:
- The replicators couldn't replicate any of the analyses on the CACM data, because they only received the FSE data! This is explicitly stated in both TOPLAS and the rebuttal.
- They agree in degree of impact, but not in p-value. The replicators aren't saying the effect is minimal, they are saying that the effect is statistically insignificant.
- The replication used both Bonferroni and FDR and got almost-identical results (only Ruby differed). The replication also used both in their conclusions. CACM, on the other hand, used no correction at all.
- Okay this one's fun.
In the Rebuttal paper, they list eleven of the true positives that TOPLAS got wrong. I'm assuming they got them from the list of buggy commits that the replicators published on GitHub. I linked it earlier but here it is again. I really recommend checking it out. See anything weird about it?
There's a third column for whether or not that commit is actually a bug. I've tried three different computers and on every one, GitHub's CSV viewport hides that column. Without seeing that, it's reasonable to assume that all of the rows are supposed to be false positives, when it's really just the rows marked with a
I checked all of the listed true positives and, in all but one case, they corresponded to rows marked 1. The rebutters, in trying to prove that the set of false positives were actually true, picked commits that the replicators agreed were true positives.
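If a web viewer hides a column, reading the raw file sidesteps the problem. Here's a minimal sketch that splits rows on the last column. The sample rows are made up, and the column layout (identifier, project, label) is my assumption for illustration; the real file may be laid out differently. The only thing taken from the story above is that a row marked 1 is one the replicators agreed was a real bug fix.

```python
import csv
import io

# Made-up rows standing in for the real file. The layout (identifier, project, label)
# is an assumption; only "1 means confirmed bug fix" comes from the text above.
SAMPLE = """abc123,project-a,1
def456,project-b,0
789abc,project-c,1
"""


def split_by_label(text: str):
    """Return (confirmed_bugfixes, everything_else), keyed on the last column."""
    confirmed, others = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        target = confirmed if row[-1].strip() == "1" else others
        target.append(row[0])
    return confirmed, others


confirmed, others = split_by_label(SAMPLE)
print("confirmed bug fixes:", confirmed)
print("everything else:", others)
```

With a real file you'd swap `io.StringIO(text)` for an opened file handle. The point is just that the label column is there in the raw data even when the viewer doesn't show it.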
That's why I'm on the side of the replicators. None of the rebuttal's counterarguments I checked hold up. Not only that, but the biggest claimed problem in the replication, getting all of the false positives wrong, can be factually shown wrong... and potentially traceable to a GitHub UI issue.
The replicators will probably write a rebuttal rebuttal, and then we might get a rebuttal rebuttal rebuttal, maybe a debate or two, and I will keep watching in case I have to fix my talk. And that's my ultimate motivation: blatant, shameless laziness.
why did I spend two hours writing about this I'm so behind on work now aaaaaaaaaaaaaaah
¹ If you aren't coming, you can watch it online here.
³ That piece was hugely influential on me. Forget the actual type system stuff. It showed me how incredibly valuable aggregating and curating information is. Sure, we had lots of information out there about this stuff, but here was a person who actually put it in one place! And evaluated it! This is amazing!