Making science reproducible @nextjournal

Papers We Will Love

These are my slides and notes from the talk I gave at the Papers We Love Berlin meetup on November 23rd, 2017.

This is going to be a short talk. I do hope this gives us the time for a discussion afterwards.

I want you to imagine a future where scientific papers are not published as PDFs, but as living documents with executable code. With graphs whose source can be viewed along with their input data. Imagine these artefacts could be embedded around the web, but always traceable back to their source and reusable at a click of a button. And finally, all of this code would continue to run, no matter if it's one year or ten years later.

About ten years ago, I tried really hard to apply ideas from papers to my problem – which was tubes.

I was writing a program that would reverse engineer the parametric surface of a tube based on 3D scan data.

Who has ever forked a repository on GitHub?

It has transformed the way we as developers do our work. This was made possible by two things. The first is a strong technical foundation based on immutability, which means that what has been stored does not change over time.
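To make the idea of immutability a bit more concrete, here is a minimal sketch of a content-addressed store, where every piece of content is filed under a hash of its own bytes. The `ContentStore` class and its method names are made up for this illustration; this is not Git's actual object format, only the general principle.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content itself, so the same bytes
        # always get the same key and different bytes get a different one.
        key = hashlib.sha256(data).hexdigest()
        # Existing entries are never overwritten: storing again is a no-op.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ContentStore()
v1 = store.put(b"print('hello')\n")
v2 = store.put(b"print('hello, world')\n")  # an edit creates a new entry
```

Because an edit produces a new key instead of overwriting the old entry, anyone holding `v1` can still retrieve exactly the bytes they saw before.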

But I would argue that the real change that enabled this transformation wasn't a technical one.

It was a change in usability. GitHub put the source code front and center and allowed you to fork a repository with just one click. This made it so much easier to make a change to a library, a change that you wouldn't even have considered making otherwise. This is something that isn't talked about nearly enough, especially in papers.

Now this isn't just a matter of convenience. We know that people are losing trust in science. I don’t think it’s a stretch to say that the US, the second largest emitter of CO2 after China, backing out of the Paris Agreement is related to this growing mistrust.

In his climate change article Bret Victor underlines the importance of technology for tackling climate change. If there’s one thing you take away from this talk today, please read that article. It’s also a great example of how much nicer an interactive web page with links can be for presenting ideas rather than a PDF.

I caught a glimpse of how much messier things can get while doing my diploma thesis in physics. Looking for a way to use my software development skills – while somehow covering up the fact that I didn't know a whole lot about physics – I ended up in a group doing computational molecular biology.

In this group, a typical publication would depend on simulation and molecular modelling packages, custom Java and Python code, and a handful of unrelated visualisation tools. Finally, you would put it all together in LaTeX. A simulation would run on multiple nodes on the group's GPU cluster, producing hundreds of gigabytes of raw data.

Now if you search around for how to approach this problem, here's a thing that comes up: the ten simple rules.

They go like this.

Can you imagine how much additional work these rules mean for a workflow like this? How likely is it that someone else will read through the descriptions and redo the steps? How would you estimate their chances of success?

You see that I'm talking about the work behind the paper. I believe that if we truly want to evolve the medium, we must use tools that have our end goal in mind: producing an artefact that can be reproduced and reused.

What you see here is an article that contains code that trained a neural net in image recognition.

Let’s look at what’s needed to run code. This is what we call dependency hell: your code runs on a brittle foundation of language-specific packages, a language runtime, system libraries and an operating system.
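To get a feeling for how many layers are involved, here is a small sketch that records some of them for the Python process it runs in, using only the standard library. The function name `snapshot_environment` and the shape of the result are my own choices for this illustration. Note that it only sees the Python packages, the interpreter and some basic facts about the operating system; system libraries are not captured at all, which is part of what makes the foundation so brittle.

```python
import json
import platform
from importlib import metadata


def snapshot_environment() -> dict:
    """Record the interpreter, the OS and the installed Python packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that carry no name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = snapshot_environment()
# Serialise it so it could be stored next to the results it belongs to.
snapshot_json = json.dumps(snapshot, indent=2)
```

Such a snapshot tells you what was installed, but it does not guarantee that the same versions can still be installed later.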

Who here knows about left-pad? It is a Node.js package containing only 11 lines of JavaScript. When it was removed from the npm registry after its maintainer was faced with a legal threat, it broke thousands of builds that depended on it.

A more radical solution to this problem is storing the complete environment, from packages down to the operating system. A popular way of doing this today is with Docker.

This is one of the reasons why we treat data separately and don’t just put them in this environment.

Now I’ve talked about a fundamental change in the way scientists work. I believe the missing piece here could be an idea originally conceived by Ted Nelson in the 1960s: transclusions. He argues that cut-and-paste the way it works today is not what authors want. Today when we copy something we lose all references to the original. What we actually want is to just reference the original and maybe make changes on top of it.
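Here is a toy sketch of what referencing instead of copying could look like. It is not Nelson's design, just an illustration under my own assumptions: the `Transclusion` class and the `publish` and `render` functions are hypothetical names, and a source is identified by a hash of its text.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Transclusion:
    """A reference to a character range inside a published source."""
    source_id: str
    start: int
    end: int


Part = Union[str, Transclusion]


def publish(library: Dict[str, str], text: str) -> str:
    # Identify the source by a hash of its content.
    source_id = hashlib.sha256(text.encode("utf-8")).hexdigest()
    library[source_id] = text
    return source_id


def render(library: Dict[str, str],
           parts: List[Part]) -> List[Tuple[str, Optional[str]]]:
    """Resolve references, keeping the origin of every rendered piece."""
    rendered = []
    for part in parts:
        if isinstance(part, Transclusion):
            text = library[part.source_id][part.start:part.end]
            rendered.append((text, part.source_id))
        else:
            rendered.append((part, None))  # the author's own words
    return rendered


library: Dict[str, str] = {}
original = "Cut-and-paste loses every reference to the original."
source_id = publish(library, original)

article = ["Nelson's point: ", Transclusion(source_id, 0, 13), " is lossy."]
pieces = render(library, article)
text = "".join(piece for piece, _ in pieces)
```

The rendered text reads like any other sentence, but every borrowed piece still knows where it came from, so a reader could jump back to the source.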

That’s it from my side and I’m happy to take any questions now. Thank you!