Towards Reproducibility: Git

In my previous post, I hopefully convinced you that computational reproducibility is a hard problem without a simple solution. It needs a principled approach, one that applies immutability at the systems level to make computation reproducible.

Functional programming with immutable data shows that this approach works well on a smaller scale. Programmers know that pure functions are easier to test and reason about. By eliminating global state and creating new values instead of mutating existing ones, we ensure that function calls don’t interfere with each other, which in turn makes concurrency much easier.
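
To make the contrast concrete, here is a minimal sketch in Python. The function names and the sample values are made up for illustration; the point is only that the pure version returns a new value and leaves its input alone.

```python
from typing import List, Tuple


def add_reading_impure(readings: List[float], value: float) -> List[float]:
    """Mutates its argument: every holder of `readings` sees the change."""
    readings.append(value)
    return readings


def add_reading_pure(readings: Tuple[float, ...], value: float) -> Tuple[float, ...]:
    """Builds a new tuple and leaves the input untouched."""
    return readings + (value,)


shared = [1.0, 2.0]
add_reading_impure(shared, 3.0)   # `shared` itself is now longer

original = (1.0, 2.0)
updated = add_reading_pure(original, 3.0)   # `original` is unchanged
```

Because `original` can never change, it is safe to hand to any number of concurrent callers without locking.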

This is also the theoretical foundation for Git, which uses Merkle trees to store our source code. Whenever we commit a change to a repository, Git computes a SHA-1 hash over the contents of the directory tree and stores it, together with metadata, as a commit object. The metadata includes a pointer to the parent commit, author information and a commit message.
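
The chain of hashes can be sketched in a few lines of Python. This is a simplified illustration, not Git's implementation: it assumes a flat directory of regular files with ASCII names, no subdirectories, and a made-up author line. The object naming scheme itself (SHA-1 over a `<type> <size>` header, a NUL byte, and the body) is the one Git uses.

```python
import hashlib
from typing import Dict, Optional


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over '<type> <size>', a NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_id(files: Dict[str, bytes]) -> str:
    """ID of a flat tree: one entry per file, sorted by name.

    Each entry is '<mode> <name>', a NUL byte, then the raw 20-byte blob ID.
    """
    entries = b""
    for name in sorted(files):
        blob = git_object_id("blob", files[name])
        entries += b"100644 " + name.encode() + b"\0" + bytes.fromhex(blob)
    return git_object_id("tree", entries)


def commit_id(tree: str, parent: Optional[str], message: str) -> str:
    """ID of a commit that points at a tree and, optionally, a parent commit."""
    # Hypothetical author and timestamp, fixed so the result is deterministic.
    who = "Jane Doe <jane@example.com> 0 +0000"
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    lines += [f"author {who}", f"committer {who}", "", message]
    return git_object_id("commit", ("\n".join(lines) + "\n").encode())


first = commit_id(tree_id({"hello.txt": b"hello world\n"}), None, "first")
second = commit_id(tree_id({"hello.txt": b"hello again\n"}), first, "second")
```

Changing a single byte in `hello.txt` changes the blob ID, which changes the tree ID, which changes the commit ID. And because `second` embeds the ID of `first`, a commit ID pins down the entire history behind it.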

This model enables many of the properties developers love about Git:

Efficient storage. We can keep the entire history of a project locally, even when it spans thousands of commits and millions of lines of code.
Speed. Because everything is stored locally, operations like switching branches or reverting a commit are fast. By reducing the cost of these operations, Git changed how programmers behave and made small branches (and experimentation) much more common.
Trustless data transfer. Once the hash of a given object (e.g. a commit) is known, the data can come from an untrusted source, because the client can verify that it received the correct data.
Collaboration without a central remote. While it's somewhat ironic that GitHub managed to establish itself as a central authority around a distributed version control system, Git's robustness allows programmers to collaborate even when GitHub is down and to easily back up or migrate their code elsewhere.
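
The trustless transfer property in the list above comes down to one check: hash what you received and compare it with the ID you already trust. Here is a minimal sketch, limited to blob objects; the helper names are my own and not part of any Git API.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID Git assigns to a file's contents (SHA-1 over header + body)."""
    header = f"blob {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def verify_blob(expected_id: str, received: bytes) -> bool:
    """True only if the received bytes hash to the ID we asked for."""
    return git_blob_id(received) == expected_id


# ID obtained from a source we trust, e.g. a commit we already verified.
trusted_id = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"

ok = verify_blob(trusted_id, b"hello world\n")        # True
tampered = verify_blob(trusted_id, b"hello w0rld\n")  # False
```

The bytes themselves can come from any mirror or peer. Only the ID has to come from somewhere we trust.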

But when the goal is years- or decades-long computational reproducibility, Git is not enough. Here's why:

It's a bad fit for storing data. Git was designed to store source code (in the form of text files) and excels in this area. But when you store larger files in your repository, things become painfully slow. Extensions like Git Large File Storage deal with this, but they add complexity and cost, and still come with a per-file limit of 2GB.
It's a bad fit for storing virtual machine images. This problem is similar to storing large files, except the 2GB limitation is even more problematic for virtual machine images.
It allows rewriting history. Commands like rebase replace existing commits with new ones. Read this for a more thorough look at why rewriting history is a bad idea.
It's hard to use. It's easy to forget this once we're past the initial learning curve, but Git is actually incredibly hard to use. There are good ideas to improve this, but I wouldn't hold my breath.
The commit model can be painful. This relates to the previous point of being hard to use, but I think this deserves an explicit mention: having to manually create commits (including messages) is not always easy. I think everybody who has used Git substantially has wished for a way to restore previous states that haven’t been committed yet. Automatic versioning, with the ability to name a certain change, such as the technique used by Google Docs, is much easier to learn.
It's hard to cross repository boundaries. Git works best when all your work happens within the boundaries of a single repository. The problem of making atomic changes across different repositories has led to the adoption of monorepositories (a single repository holding multiple projects). These often require custom tools that are difficult to operate.

While Git is great for versioning source code, reproducing or re-running a computational pipeline is a bigger, more challenging problem. In my next post, I will look at how we can apply Git’s underlying principles to building a simple immutable data store for reproducibility.