Nextjournal Blog, Sep 10 2019 UTC

No, your ten rules aren’t simple and they won’t fix Reproducibility

I'm not sure how these papers made it through peer review, but it seems to be a prevalent idea at PLOS that we can fix the problem of computational reproducibility if only we apply Ten Simple Rules. It happened in 2013 with "Ten Simple Rules for Reproducible Computational Research" and it has been repeated this year with "Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks".

Now don't get me started on how somebody in their right mind postulates ten rules and still calls them simple. But the deeper truth is that computational reproducibility is an incredibly hard problem to which there is no simple solution.

If you don't believe me, I challenge you to find a paper more than a few years old that comes with a reproducible analysis and try to run it. My favourite example is "How much of the world is woody?", a shining example of reproducible research whose authors went well beyond what the Ten Simple Rules dictate: they even used a continuous integration service to run the whole analysis from top to bottom after each change. Yet, proving my point, they still had to make a commit to fix an incompatibility with one of their libraries years after they released their analysis.

Now this is a very simple example; it's just using one language (R) and a few libraries. The complexity of a typical computational biology pipeline is many times that. It often starts with simulations that run on a cluster of specialised hardware, followed by a series of post-processing steps spanning tools written in different languages. For a good primer on the subject I recommend Konrad Hinsen's "Dealing With Software Collapse".

This isn't just a problem for scientific computing; it affects all software. Consider the left-pad fiasco: a single developer removed one of his libraries from the npm package registry and broke thousands of downstream projects that depended on it. Given the way most of our package registries are currently designed, this won't be the last time it happens.

The good news is that I believe it is possible to fix computational reproducibility by applying a principle of functional programming at the systems level: immutability (which is what we're doing at Nextjournal). In a follow-up post, I will show how applying immutability to code, data and the computational environment gives us a much better chance of keeping things runnable.
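
To make the idea slightly more concrete, here is a minimal sketch of what immutability can look like for a package or data store, assuming content addressing with SHA-256. It is an illustration only, not a description of how Nextjournal or npm work; `ContentStore`, `put`, `get` and the `pinned` mapping are hypothetical names made up for this example.

```python
import hashlib


class ContentStore:
    """Hypothetical append-only store: blobs are keyed by the SHA-256 of their bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Storing the same bytes twice is a no-op; existing entries are never overwritten.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        # Re-hash on read so corruption or substitution is detected.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
v1 = store.put(b"def left_pad(s, n): return s.rjust(n)")
v2 = store.put(b"def left_pad(s, n): return s.rjust(n, ' ')")

# An analysis pins the exact content it ran against, not a mutable name.
pinned = {"left_pad": v1}
```

Because an address is derived from the content itself, publishing `v2` cannot change what `v1` resolves to, and the store above has no operation that deletes or overwrites an entry. That is the property a name-based, mutable registry lacks.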

Your thoughts? Find me as @mkvlr on Twitter to discuss!