How to Version Control Jupyter Notebooks

1. Introduction

Jupyter notebooks generate files that may contain metadata, source code, formatted text, and rich media. Unfortunately, this makes these files poor candidates for conventional version control solutions, which work best with plain text.

Version control is an important creative tool that engenders experimentation and eases collaboration between peers. It lowers the risks of making a mistake or erasing another person's work because a complete record exists of all changes.

Exploration is a critical part of data analysis. Jupyter's inherent interactivity has made it a popular tool amongst data scientists and researchers. It has taken several years, but version control solutions are beginning to catch up. This article explores a few of the latest and greatest.

1.1. Problems With Jupyter and Version Control


Jupyter notebook files are human-readable JSON .ipynb files.

fold -s -w80 simple-nb.ipynb

The JSON data above renders the following result in Jupyter Notebook:

It is uncommon to edit the JSON source directly because the format is so verbose; it's easy to forget required punctuation, unbalance brackets like {} and [], and corrupt the file. More troublesome, Jupyter notebook files are often littered with cell output stored as base64-encoded binary blobs. The sine wave output looks like this, trimmed for legibility:

   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYwAAAEWCAYAAAB1xKBvAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzsvXmcHNd13/s9vc4+2EgABHeQEkVSXGGRFLembFNSPn7Wyy45i5UXh5ZjvcSy4xcr78WK5bwkzvKSeIllOqaVxZKcOJLN+FHc0dxJEVxAAgQBAiCIdbDP0tPT+80fVdXdmOnl1q17ezBm/T6f+QDdXVXnVtU996z3HFFKESNGjBgxYvRDYrkHECNGjBgxVgZigREjRowYMbQQC4wYMWLEiKGFWGDEiBEjRgwtxAIjRowYMWJoIRYYMWLEiBFDC7HAiBEDEJG/JiKPL/c4YsQ4nxELjBgfGojIXSLyoojMiMgZEXlBRH4IQCn1B0qp+x3QfExE/q+2z5tERHX5boNt+jFi2EQsMGJ8KCAiE8CfAr8BrAE2Ab8ClB2Tfha4t+3zPcC7Hb57Tyk15XgsMWJEQiwwYnxY8BEApdS3lVJ1pdSCUupxpdRbACLyRRF5PjjY1/i/JCLvichZEfktEZG23/8PEdnl//aYiFzWhe6zwJ0iEvDa3cC/A7Ys+u5Z/7qrReRPReSkf+0/FZGL/d8+LyLb2i8uIl8RkYf9/2dF5F+LyEEROS4i3xCR4YjPLUaMJmKBEePDgj1AXUT+k4h8VkRWa5zzY8APATcCfwX4NICI/O/APwL+AnAB8Bzw7S7X+AGQ9a8BnjXxBLB30XfP+v9PAL8PXAZcCiwAv+..."
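To get a rough sense of how much of a notebook file is output rather than source, the sketch below walks the cells of a notebook's JSON and totals the size of each part. The small in-memory notebook and the helper name size_breakdown are illustrative assumptions, not part of the original example; a real file would be loaded with json.load instead.

```python
import json


def size_breakdown(notebook):
    """Return (source_chars, output_chars) for a notebook dict."""
    source_chars = 0
    output_chars = 0
    for cell in notebook.get("cells", []):
        # "source" may be a single string or a list of lines.
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        source_chars += len(source)
        # Only code cells carry outputs; serialize each one to measure it.
        for output in cell.get("outputs", []):
            output_chars += len(json.dumps(output))
    return source_chars, output_chars


# Hypothetical stand-in for a notebook: one code cell with a fake PNG blob.
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["t = np.arange(0.0, 2.0, 0.01)\n", "plt.plot(t, np.sin(t))"],
            "outputs": [
                {
                    "output_type": "display_data",
                    "data": {"image/png": "iVBORw0KGgo" + "A" * 5000},
                    "metadata": {},
                }
            ],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

source_chars, output_chars = size_breakdown(example)
print(f"source: {source_chars} characters, outputs: {output_chars} characters")
```

Even in this toy case the output dwarfs the two lines of code that produced it, which is why diffs of rerun notebooks are dominated by blob changes.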

This creates misleading and unwieldy diffs when doing something as simple as rerunning a notebook with different input data. For example, updating the periodicity of the sine waves involves changing a single line from t = np.arange(0.0, 2.0, 0.01) to t = np.arange(0.0, 4.0, 0.01). This produces a minor change in the notebook...

... that looks like a significant change in the git commit log. Scroll through the output and you will immediately see the issue.

git --git-dir=/jupyter-git/.git log -p -1 > /results/log.txt
fold -s -w80 /results/log.txt

2. Built-In Solutions

2.1. Clear Output Manually

The simplest solution is to always clear the output before committing: in Jupyter's menu, select Cell > All Output > Clear, then save the notebook. This removes any binary blobs that have been generated by the notebook. There are three main drawbacks:

  • It is a manual process.
  • Collaborators on other machines will need to rerun the notebook to see the output, requiring additional time and setup.
  • Collaborators on other machines may still create noise when new metadata is generated, like this information at the end of simple-nb.ipynb:
 "metadata": {
  "kernelspec": {
   "display_name": "SageMath (stable)",
   "language": "sagemath",
   "name": "sagemath"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  }
 }
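The manual clearing step can also be scripted. The following is a minimal sketch that assumes the notebook follows the usual nbformat 4 layout (a top-level "cells" list whose code cells hold "outputs" and "execution_count"); the function name clear_outputs and the sample notebook are hypothetical.

```python
import copy
import json


def clear_outputs(notebook):
    """Return a copy of the notebook dict with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hypothetical notebook with one markdown cell and one executed code cell.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Sine wave"},
        {
            "cell_type": "code",
            "execution_count": 11,
            "metadata": {},
            "source": "plt.plot(t, s)",
            "outputs": [
                {
                    "output_type": "display_data",
                    "data": {"image/png": "iVBORw0KGgo..."},
                    "metadata": {},
                }
            ],
        },
    ],
    "metadata": {"kernelspec": {"name": "python3"}},
    "nbformat": 4,
    "nbformat_minor": 2,
}

cleaned = clear_outputs(example)
print(json.dumps(cleaned, indent=1))
```

Note that this sketch leaves the notebook-level metadata untouched, so the kernelspec noise described in the last bullet would still show up in diffs.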

2.2. Convert to HTML

As a best practice, many Jupyter users will generate HTML and pure Python versions of their notebook using the built-in nbconvert tool. This ensures the output can easily be displayed by any computer with a web browser.

jupyter nbconvert /jupyter-git/simple-nb.ipynb --output-dir="/results" --output="simple-nb.html"
cat /results/simple-nb.html

Opening the above file, simple-nb.html, in a browser window will render the Python code and resulting sine wave just as it would look in a Jupyter notebook.

2.3. Convert to Python

jupyter nbconvert --to="python" creates a succinct, readable record of the notebook's code cells. Peruse the output below and note how much shorter simple-nb-nbconvert.py is than the JSON or HTML versions.

The simple Python document is perfect for version control and makes working in teams much easier. Changes are easily spotted and diffs are more readable.

jupyter nbconvert /jupyter-git/simple-nb.ipynb --to="python" --output-dir="/results" --output="simple-nb-nbconvert"
cat /results/simple-nb-nbconvert.py
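To illustrate why the converted script is so much shorter than the JSON, here is a simplified stand-in for the conversion: it keeps only the source of code cells and drops everything else. This is not nbconvert's implementation (its real exporter is template-driven and does more); the function name code_cells_to_script and the sample notebook are assumptions made for the example.

```python
def code_cells_to_script(notebook):
    """Join the source of every code cell into one plain-text script."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells are skipped in this sketch
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip("\n"))
    # Separate cells with a blank line and end the file with a newline.
    return "\n\n".join(chunks) + "\n"


# Hypothetical notebook: outputs and metadata never reach the script.
example = {
    "cells": [
        {"cell_type": "markdown", "source": "# Sine wave"},
        {"cell_type": "code", "source": "import numpy as np\n", "outputs": []},
        {
            "cell_type": "code",
            "source": ["t = np.arange(0.0, 2.0, 0.01)\n"],
            "outputs": [{"output_type": "display_data", "data": {"image/png": "iVBORw0KGgo..."}}],
        },
    ],
    "metadata": {},
}

script = code_cells_to_script(example)
print(script)
```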

2.4. Conclusion

These are useful tools, but leave something to be desired when compared to other solutions. Read on to see how version control with Jupyter notebooks can be more useful and tightly integrated.

3. External Tools

3.1. nbdime

nbdime was specifically created to solve problems related to diffing and merging Jupyter notebooks. The tool understands the structure of .ipynb files, so it can make content-aware decisions and offer more informative messaging.

3.1.1. Diffing

In this scenario, new output is created after rerunning the notebook. A traditional git diff is not very helpful.

cd /nbdime-git
git diff > /results/git-diff.txt

Scroll through the diff and you'll immediately see the problem: the binary blob makes the output virtually illegible.

fold -s -w80 git-diff.txt

Running nbdime's nbdiff provides a more useful output by highlighting the change in context. Note that it also trims the binary blob:

cd /nbdime-git
nbdiff
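The idea behind a content-aware diff can be sketched with Python's standard difflib: compare only the cell sources and ignore outputs and execution counts. This is a deliberately naive illustration, not nbdime's algorithm, and the names cell_sources and source_diff as well as both sample notebooks are hypothetical.

```python
import difflib


def cell_sources(notebook):
    """Return the source text of each cell as a list of strings."""
    sources = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        sources.append(source)
    return sources


def source_diff(old_nb, new_nb):
    """Unified diff of cell sources, ignoring outputs and execution counts."""
    old_lines = "\n".join(cell_sources(old_nb)).splitlines()
    new_lines = "\n".join(cell_sources(new_nb)).splitlines()
    return list(
        difflib.unified_diff(old_lines, new_lines, "before", "after", lineterm="")
    )


def make_notebook(stop, blob):
    """Build a hypothetical two-cell notebook with a fake image output."""
    return {
        "cells": [
            {"cell_type": "code", "source": "import numpy as np", "outputs": []},
            {
                "cell_type": "code",
                "source": f"t = np.arange(0.0, {stop}, 0.01)",
                "outputs": [{"output_type": "display_data", "data": {"image/png": blob}}],
            },
        ]
    }


old_nb = make_notebook("2.0", "iVBORw0KGgoAAAA")
new_nb = make_notebook("4.0", "iVBORw0KGgoBBBB")

diff_lines = source_diff(old_nb, new_nb)
for line in diff_lines:
    print(line)
```

Because the blobs never enter the comparison, the one-line change to np.arange is the only difference reported.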

3.1.2. Merging

Merging is clearer as well. In the first example, two users, local and remote, have made edits to the base notebook. When one user merges their local file with another user's updated remote file, there are no conflicts and nbmerge displays an output similar to nbdiff.

nbmerge simple-nbdime-base.ipynb simple-nbdime-local.ipynb simple-nbdime-remote.ipynb --decisions
[W nbmergeapp:64] Decisions: 0 conflicted decisions of 2 total:
==== decision at /cells/0:
--- local_diff (selected):
## replaced /cells/0/execution_count:
-  11
+  12

## inserted before /cells/0/outputs/0:
+  output:
+    output_type: execute_result
+    execution_count: 12
+    data:
+      image/png: iVBORw0K...<snip base64, md5=6a9b3279fefe3054...>

## deleted /cells/0/outputs/0:
-  output:
-    output_type: execute_result
-    execution_count: 11
-    data:
-      image/png: iVBORw0K...<snip base64, md5=20bce36ace1d7e31...>

==== decision at /cells/1:
--- remote_diff (selected):
## replaced /cells/1/execution_count:
-  9
+  10

## replaced /cells/1/outputs/0/execution_count:
-  9
+  10

## inserted before /cells/1/outputs/1:
+  output:
+    output_type: execute_result
+    execution_count: 10
+    data:
+      image/png: iVBORw0K...<snip base64, md5=5808ce171c4518b6...>

## deleted /cells/1/outputs/1:
-  output:
-    output_type: execute_result
-    execution_count: 9
-    data:
-      image/png: iVBORw0K...<snip base64, md5=fa26bad070e548a3...>

On the other hand, when two users alter the same sections of the base file, nbmerge offers the user a more Jupyter-friendly conflict resolution:

nbmerge simple-nbdime-11.ipynb simple-nbdime-12.ipynb simple-nbdime-13.ipynb --decisions
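The decision logic of a three-way merge can be sketched per cell: take whichever side changed, and flag a conflict only when both sides changed the same cell differently. The sketch below is a toy under strong assumptions (all three notebooks have the same number of cells, and only cell sources are compared); nbdime itself works on the full notebook structure. The function name merge_cells and the sample data are hypothetical.

```python
def merge_cells(base, local, remote):
    """Three-way merge of equal-length lists of cell sources.

    Returns (merged, conflicts), where conflicts holds the indices of cells
    that local and remote both changed, in different ways.
    """
    if not (len(base) == len(local) == len(remote)):
        raise ValueError("this sketch only handles matching cell counts")
    merged = []
    conflicts = []
    for index, (b, l, r) in enumerate(zip(base, local, remote)):
        if l == r:
            merged.append(l)  # both sides agree, or neither changed
        elif l == b:
            merged.append(r)  # only remote changed this cell
        elif r == b:
            merged.append(l)  # only local changed this cell
        else:
            merged.append(b)  # both changed differently: keep base, flag it
            conflicts.append(index)
    return merged, conflicts


base = ["a = 1", "b = 2"]

# Local and remote edit different cells: no conflict.
merged, conflicts = merge_cells(base, ["a = 10", "b = 2"], ["a = 1", "b = 20"])
print(merged, conflicts)

# Local and remote edit the same cell differently: conflict at cell 0.
merged2, conflicts2 = merge_cells(base, ["a = 10", "b = 2"], ["a = 100", "b = 2"])
print(merged2, conflicts2)
```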

These features are simply not available with the built-in Jupyter solutions. nbdime also features Git and Mercurial integration as well as browser-based visual diffing and merging.

3.2. ReviewNB

ReviewNB is a GitHub app that also offers visual diffing with an interface that looks similar to the traditional Jupyter IDE. Because the outputs are visualized, problems associated with committing binary blobs disappear.

ReviewNB is a simple tool built specifically for GitHub integration. This means the software is less flexible, but also easy to install and use. Perhaps the most attractive feature is the recent addition of cell-level comments and conversation threads around open issues.

3.3. Jupytext

The previous solutions make Jupyter notebooks more friendly to version control, but they have drawbacks. nbconvert processes are manual (but scriptable) and they force the user to rerun the notebook after stripping the output. nbdime offers more complete solutions for diff and merge, but doesn't make it easy to edit plain text outside of the notebook. Jupytext uses YAML metadata to offer the most complete version control solution.

3.3.1. Setup

Jupytext takes some configuration to get started.

pip install jupytext --upgrade

A Jupyter configuration file must be generated, or an existing one appended to, so that it contains this line: c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager".

jupyter notebook --generate-config -y
echo 'c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"' >> ~/.jupyter/jupyter_notebook_config.py
cat ~/.jupyter/jupyter_notebook_config.py

3.3.2. Formats

Jupytext can be configured to automatically pair a git-friendly file for input data while preserving the output data in the .ipynb file. The options include:

  • Julia: .jl
  • Python: .py
  • R: .R
  • Markdown: .md
  • RMarkdown: .Rmd
  • and more!

3.3.2.1. Markdown

jupytext --to markdown --output /results/simple-nb.md /jupyter-git/simple-nb.ipynb
cat /results/simple-nb.md

3.3.2.2. Python

jupytext --to python --output /results/simple-nb-jupytext.py /jupyter-git/simple-nb.ipynb
cat /results/simple-nb-jupytext.py

Compare the Python created by nbconvert, simple-nb-nbconvert.py, with jupytext's simple-nb-jupytext.py. Jupytext's light format avoids inserting cell markers; it is paired with a .ipynb file and can accurately reconstruct input cells without them. Furthermore, jupytext inserts this YAML header information as a comment in the Python .py file (note the format_name):

#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: light
#       format_version: '1.3'
#       jupytext_version: 0.8.5
#   kernelspec:
#     display_name: SageMath (stable)
#     language: sagemath
#     name: sagemath
# ---

Note the similar YAML header information in simple-nb.md. This technique simultaneously relieves two pain points associated with Jupyter notebooks: clean version control and easy collaboration. Notebooks can be configured individually or a global default can be added to the aforementioned jupyter_notebook_config.py file.
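Because the header is ordinary commented text, it can be inspected with a few lines of code. The sketch below is a naive reader that flattens the nested keys into a single dictionary; it is not a YAML parser and not how Jupytext reads its own header. The function name read_header and the sample script are assumptions for illustration.

```python
def read_header(script_text):
    """Collect 'key: value' pairs from the leading comment block of a script."""
    pairs = {}
    for line in script_text.splitlines():
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        body = line.lstrip("#").strip()
        if body == "---" or ":" not in body:
            continue
        key, _, value = body.partition(":")
        if value.strip():
            # Keep only leaf entries, dropping surrounding quotes.
            pairs[key.strip()] = value.strip().strip("'\"")
    return pairs


# Sample script text modeled on the header shown above.
sample = """#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: light
#       format_version: '1.3'
#       jupytext_version: 0.8.5
#   kernelspec:
#     display_name: SageMath (stable)
#     language: sagemath
#     name: sagemath
# ---

import numpy as np
"""

header = read_header(sample)
print(header["format_name"], header["extension"])
```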

3.3.3. Pair an Individual Notebook

To pair a notebook with a Python script, open the .ipynb file in Jupyter Notebook. Select Edit > Edit Notebook Metadata in Jupyter's menu and add "jupytext": {"formats": "ipynb,py"}, to the JSON:

{
  "jupytext": {"formats": "ipynb,py"},
  "kernelspec": {
    (...)
  },
  "language_info": {
    (...)
  }
}

When the .ipynb is loaded or reloaded in Jupyter, the input cells will now be read from the associated .py file.
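The same metadata edit can be made outside the Jupyter menu by modifying the notebook's JSON directly. This is a minimal sketch that operates on an in-memory dictionary; the helper name pair_with_script and the sample notebook are hypothetical, and a real workflow would read and write the .ipynb file with the json module.

```python
import json


def pair_with_script(notebook, formats="ipynb,py"):
    """Set the jupytext 'formats' entry in a notebook dict's metadata, in place."""
    metadata = notebook.setdefault("metadata", {})
    metadata.setdefault("jupytext", {})["formats"] = formats
    return notebook


# Hypothetical notebook whose existing metadata must be preserved.
example = {
    "cells": [],
    "metadata": {
        "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}
    },
    "nbformat": 4,
    "nbformat_minor": 2,
}

pair_with_script(example)
print(json.dumps(example["metadata"], indent=2))
```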

3.3.4. Round Trip Test

To ensure the accuracy of building a .ipynb file from .py source, the --test flag takes a notebook on a round trip from .ipynb to .py and back to .ipynb, then compares the two .ipynb files.

jupytext --test -x /jupyter-git/simple-nb.ipynb --to python

No issues!
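The value of a round-trip check is easiest to see with a toy text format. The sketch below serializes cell sources with a made-up separator line, parses them back, and compares the result with the input. The separator and all function names are hypothetical and unrelated to Jupytext's actual formats; the point is only that a round trip exposes inputs the text format cannot represent faithfully.

```python
MARKER = "# --- cell ---"  # hypothetical separator, not a Jupytext format


def to_text(sources):
    """Serialize a list of cell sources into one text document."""
    return ("\n" + MARKER + "\n").join(sources)


def from_text(text):
    """Parse the text document back into a list of cell sources."""
    return text.split("\n" + MARKER + "\n")


def round_trip_ok(sources):
    """Return True if serializing and parsing reproduces the input exactly."""
    return from_text(to_text(sources)) == sources


ordinary = ["import numpy as np", "t = np.arange(0.0, 2.0, 0.01)\nprint(t.shape)"]
tricky = ["x = 1\n" + MARKER + "\ny = 2"]  # one cell that contains the separator

print(round_trip_ok(ordinary))
print(round_trip_ok(tricky))
```

The second case fails because the single cell is split in two on the way back, which is exactly the kind of discrepancy a round-trip test is meant to surface before the text file becomes the source of truth.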

3.3.5. Version Control the Python Script

Add the .py file to version control. Every saved change to a Python cell in this Jupyter notebook will now be reflected in the .py file. Two different people can now work on these .py files simultaneously. Pulling, pushing, and merging code will be handled just as they would be for any other Python project. The .ipynb file never needs to be shared, unless someone wants to share the output of their notebook. This addresses any issues regarding committing binary blobs to version control.

4. Nextjournal

Version control will always be a little complicated in Jupyter due to the nature of the notebook file format. If you would like to avoid this entirely, you should try Nextjournal. Nextjournal promises complete reproducibility across your entire project. From computational environments, to code, prose and data - everything is automatically version controlled. No installation or configuration required!

Nextjournal makes it effortless to collaborate using the remix feature and reuse work from other articles via the platform's immutable transclusions. You can even upload your Jupyter notebooks and use Jupyter kernels.