How to Version Control Jupyter Notebooks

The Definitive Guide

Introduction

Jupyter notebooks generate files that may contain metadata, source code, formatted text, and rich media. Unfortunately, this makes these files poor candidates for conventional version control solutions, which work best with plain text.

Version control is an important creative tool that engenders experimentation and eases collaboration between peers. It lowers the risks of making a mistake or erasing another person's work because a complete record exists of all changes.

Exploration is a critical part of data analysis. Jupyter's inherent interactivity has made it a popular tool amongst data scientists and researchers. It has taken several years, but version control solutions are beginning to catch up. This article explores a few of the latest and greatest.

Problems With Jupyter and Version Control

simple-nb.ipynb

Jupyter notebook files are human-readable JSON .ipynb files.

fold -s -w80 simple-nb.ipynb

The JSON data above renders the following result in Jupyter Notebook:

It is uncommon to edit the JSON source directly because the format is so verbose; it's easy to forget required punctuation, unbalance brackets like {} and [], and corrupt the file. More troublesome, the notebook file is often littered with cell output stored as encoded binary blobs. The sine wave from simple-nb.ipynb looks like this, trimmed for legibility:

   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYwAAAEWCAYAAAB1xKBvAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzsvXmcHNd13/s9vc4+2EgABHeQEkVSXGGRFLembFNSPn7Wyy45i5UXh5ZjvcSy4xcr78WK5bwkzvKSeIllOqaVxZKcOJLN+FHc0dxJEVxAAgQBAiCIdbDP0tPT+80fVdXdmOnl1q17ezBm/T6f+QDdXVXnVtU996z3HFFKESNGjBgxYvRDYrkHECNGjBgxVgZigREjRowYMbQQC4wYMWLEiKGFWGDEiBEjRgwtxAIjRowYMWJoIRYYMWLEiBFDC7HAiBEDEJG/JiKPL/c4YsQ4nxELjBgfGojIXSLyoojMiMgZEXlBRH4IQCn1B0qp+x3QfExE/q+2z5tERHX5boNt+jFi2EQsMGJ8KCAiE8CfAr8BrAE2Ab8ClB2Tfha4t+3zPcC7Hb57Tyk15XgsMWJEQiwwYnxY8BEApdS3lVJ1pdSCUupxpdRbACLyRRF5PjjY1/i/JCLvichZEfktEZG23/8PEdnl//aYiFzWhe6zwJ0iEvDa3cC/A7Ys+u5Z/7qrReRPReSkf+0/FZGL/d8+LyLb2i8uIl8RkYf9/2dF5F+LyEEROS4i3xCR4YjPLUaMJmKBEePDgj1AXUT+k4h8VkRWa5zzY8APATcCfwX4NICI/O/APwL+AnAB8Bzw7S7X+AGQ9a8BnjXxBLB30XfP+v9PAL8PXAZcCiwAv+..."
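To get a feel for how much of a notebook file is taken up by encoded output, the JSON can be inspected with a few lines of Python. This is a minimal sketch that assumes the usual notebook layout ("cells", each with an optional "outputs" list whose entries carry a "data" mapping); the function name output_payload_sizes and the stand-in notebook are illustrative and not part of any Jupyter tool.

```python
import json


def output_payload_sizes(notebook: dict) -> list:
    """Return (cell_index, mime_type, size_in_chars) for every output payload."""
    sizes = []
    for index, cell in enumerate(notebook.get("cells", [])):
        for output in cell.get("outputs", []):
            for mime, payload in output.get("data", {}).items():
                if isinstance(payload, list):
                    # Multi-line text payloads are stored as a list of strings.
                    text = "".join(payload)
                elif isinstance(payload, str):
                    text = payload
                else:
                    # Anything else (for example a JSON object) is measured
                    # by its serialized form.
                    text = json.dumps(payload)
                sizes.append((index, mime, len(text)))
    return sizes


# Hypothetical stand-in for simple-nb.ipynb; a real file would be read with
# json.load() instead.
example = {
    "cells": [
        {
            "cell_type": "code",
            "source": ["t = np.arange(0.0, 2.0, 0.01)"],
            "outputs": [
                {
                    "output_type": "display_data",
                    "data": {"image/png": "iVBORw0KGgo" * 100},
                }
            ],
        }
    ]
}

print(output_payload_sizes(example))
```

Even in this toy case the encoded image is far longer than the single line of source that produced it, which is why the diffs become so noisy.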

This creates misleading and unwieldy diffs when doing something as simple as rerunning a notebook with different input data. For example, doubling the time range of the plotted sine wave involves changing a single line from t = np.arange(0.0, 2.0, 0.01) to t = np.arange(0.0, 4.0, 0.01). This produces a minor change in the notebook...

... that looks like a significant change in the git commit log. Scroll through the output and you will immediately see the issue.

git --git-dir=/jupyter-git/.git log -p -1 > /results/log.txt
fold -s -w80 /results/log.txt
log.txt

Built-In Solutions

Clear Output Manually

The simplest solution is to always clear the output before committing: select Cell > All Output > Clear, then save. This removes any binary blobs that have been generated by the notebook. There are three main drawbacks:

  • It is a manual process.

  • Collaborators on other machines will need to rerun the notebook to see the output, requiring additional time and setup.

  • Collaborators on other machines may still create noise when new metadata is generated, like this information at the end of simple-nb.ipynb:

 "metadata": {
  "kernelspec": {
   "display_name": "SageMath (stable)",
   "language": "sagemath",
   "name": "sagemath"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  }
 }
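The first drawback, that clearing is manual, can be softened with a small script. The following is a minimal sketch working directly on the notebook JSON, assuming the standard "cells", "outputs", and "execution_count" keys; strip_outputs is a hypothetical helper name, not a Jupyter command.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of the notebook with code-cell outputs and counters cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []             # drop stored results, including images
            cell["execution_count"] = None   # reset the In [n] counter
    return cleaned


# Hypothetical stand-in notebook with one executed code cell.
executed = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Sine wave"]},
        {
            "cell_type": "code",
            "execution_count": 11,
            "source": ["t = np.arange(0.0, 2.0, 0.01)"],
            "outputs": [{"output_type": "display_data",
                         "data": {"image/png": "iVBORw0KGgo"}}],
        },
    ],
    "metadata": {"kernelspec": {"name": "sagemath"}},
}

cleaned = strip_outputs(executed)
print(cleaned["cells"][1]["outputs"], cleaned["cells"][1]["execution_count"])
```

Note that the notebook-level metadata is left untouched, so the third drawback above still applies: collaborators with different kernels can continue to generate metadata noise.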

Convert to HTML

As a best practice, many Jupyter users will generate HTML and pure Python versions of their notebook using the built-in nbconvert tool. This ensures the output can easily be displayed by any computer with a web browser.

jupyter nbconvert /jupyter-git/simple-nb.ipynb --output-dir="/results" --output="simple-nb.html"
cat /results/simple-nb.html
simple-nb.html

Opening the above file, simple-nb.html, in a browser window will render the Python code and resulting sine wave just as it would look in a Jupyter notebook.

Convert to Python

jupyter nbconvert --to="python" creates a succinct, readable record of the notebook's code cells. Peruse the output below and note how much shorter simple-nb-nbconvert.py is than the JSON or HTML versions.

The simple Python document is perfect for version control and makes working in teams much easier. Changes are easily spotted and diffs are more readable.

jupyter nbconvert /jupyter-git/simple-nb.ipynb --to="python" --output-dir="/results" --output="simple-nb-nbconvert"
cat /results/simple-nb-nbconvert.py
simple-nb-nbconvert.py
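Conceptually, the conversion keeps only the input of each code cell. The sketch below illustrates that idea on the raw JSON; it is a simplified stand-in and not nbconvert's implementation, which does more (for example, it can carry Markdown cells over as comments). The name code_cells_to_script is hypothetical.

```python
def code_cells_to_script(notebook: dict) -> str:
    """Join the source of every code cell into one plain-text script."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # Markdown and raw cells are skipped in this sketch
        source = cell.get("source", "")
        if isinstance(source, list):
            # Cell source is commonly stored as a list of lines.
            source = "".join(source)
        chunks.append(source.rstrip("\n"))
    # Separate cells with a blank line and end the file with a newline.
    return "\n\n".join(chunks) + "\n"


# Hypothetical stand-in notebook.
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Sine wave"]},
        {"cell_type": "code", "source": ["import numpy as np\n"], "outputs": []},
        {"cell_type": "code", "source": "t = np.arange(0.0, 4.0, 0.01)",
         "outputs": [{"data": {"image/png": "iVBORw0KGgo"}}]},
    ]
}

print(code_cells_to_script(notebook))
```

Because the result contains no outputs or metadata, a one-line change to the notebook shows up as a one-line change in the script.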

Conclusion

These are useful tools, but leave something to be desired when compared to other solutions. Read on to see how version control with Jupyter notebooks can be more useful and tightly integrated.

External Tools

nbdime

nbdime was specifically created to solve problems related to diffing and merging Jupyter notebooks. The tool understands the structure of .ipynb files, so it can make content-aware decisions and offer more informative messaging.

Diffing

In this scenario, new output is created after rerunning the notebook. A traditional git diff is not very helpful.

cd /nbdime-git
git diff > /results/git-diff.txt
git-diff.txt

Scroll through the diff and you'll immediately see the problem: the binary blob makes the output virtually illegible.

fold -s -w80 /results/git-diff.txt

Running nbdime's nbdiff provides a more useful output by highlighting the change in context. Note that it also trims the binary blob:

cd /nbdime-git
nbdiff
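The benefit of a content-aware diff can be illustrated with the standard library alone. The sketch below compares only the cell sources of two notebooks and ignores outputs entirely; it is a rough approximation of the idea and not how nbdime works internally, since nbdime understands the full notebook structure and also reports output and metadata changes. The helper names are hypothetical.

```python
import difflib


def cell_sources(notebook: dict) -> list:
    """Collect the source lines of all cells, ignoring outputs and metadata."""
    lines = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        lines.extend(source.splitlines())
    return lines


def source_diff(old: dict, new: dict) -> list:
    """Unified diff of the cell sources of two notebooks."""
    return list(difflib.unified_diff(
        cell_sources(old), cell_sources(new),
        fromfile="old", tofile="new", lineterm="",
    ))


# Hypothetical stand-in notebooks: one source line and the stored image differ.
old_nb = {"cells": [{"cell_type": "code",
                     "source": ["import numpy as np\n",
                                "t = np.arange(0.0, 2.0, 0.01)"],
                     "outputs": [{"data": {"image/png": "iVBORw0KGgoAAA"}}]}]}
new_nb = {"cells": [{"cell_type": "code",
                     "source": ["import numpy as np\n",
                                "t = np.arange(0.0, 4.0, 0.01)"],
                     "outputs": [{"data": {"image/png": "iVBORw0KGgoBBB"}}]}]}

print("\n".join(source_diff(old_nb, new_nb)))
```

Two notebooks that differ only in their stored outputs produce an empty diff here, which is the behavior a plain git diff cannot offer.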

Merging

Merging is clearer as well. In the first example, two users, local and remote, have made edits to the base notebook. When one user merges their local file with another user's updated remote file, there are no conflicts and nbmerge displays an output similar to nbdiff.

simple-nbdime-base.ipynb
simple-nbdime-local.ipynb
simple-nbdime-remote.ipynb
nbmerge simple-nbdime-base.ipynb simple-nbdime-local.ipynb simple-nbdime-remote.ipynb --decisions
[W nbmergeapp:64] Decisions: 0 conflicted decisions of 2 total:
==== decision at /cells/0:
--- local_diff (selected):
## replaced /cells/0/execution_count:
- 11
+ 12
## inserted before /cells/0/outputs/0:
+ output:
+ output_type: execute_result
+ execution_count: 12
+ data:
+ image/png: iVBORw0K...<snip base64, md5=6a9b3279fefe3054...>
## deleted /cells/0/outputs/0:
- output:
- output_type: execute_result
- execution_count: 11
- data:
- image/png: iVBORw0K...<snip base64, md5=20bce36ace1d7e31...>
==== decision at /cells/1:
--- remote_diff (selected):
## replaced /cells/1/execution_count:
- 9
+ 10
## replaced /cells/1/outputs/0/execution_count:
- 9
+ 10
## inserted before /cells/1/outputs/1:
+ output:
+ output_type: execute_result
+ execution_count: 10
+ data:
+ image/png: iVBORw0K...<snip base64, md5=5808ce171c4518b6...>
## deleted /cells/1/outputs/1:
- output:
- output_type: execute_result
- execution_count: 9
- data:
- image/png: iVBORw0K...<snip base64, md5=fa26bad070e548a3...>

On the other hand, when two users alter the same sections of the base file, nbmerge offers the user a more Jupyter-friendly conflict resolution:

simple-nbdime-11.ipynb
simple-nbdime-12.ipynb
simple-nbdime-13.ipynb
nbmerge simple-nbdime-11.ipynb simple-nbdime-12.ipynb simple-nbdime-13.ipynb --decisions
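The two merge scenarios follow the usual three-way merge rule, applied per cell: a change made on only one side is accepted, and a cell changed differently on both sides is a conflict. The sketch below shows that rule on lists of cell sources. It is a deliberately simplified illustration that assumes all three notebooks have the same number of cells; it is not nbdime's algorithm, and merge_cells is a hypothetical name.

```python
def merge_cells(base: list, local: list, remote: list) -> tuple:
    """Three-way merge of cell sources; returns (merged, conflicting_indices)."""
    if not (len(base) == len(local) == len(remote)):
        raise ValueError("this sketch only handles equal numbers of cells")
    merged, conflicts = [], []
    for index, (b, l, r) in enumerate(zip(base, local, remote)):
        if l == r:
            merged.append(l)         # both sides agree (or neither changed)
        elif l == b:
            merged.append(r)         # only remote changed this cell
        elif r == b:
            merged.append(l)         # only local changed this cell
        else:
            merged.append(b)         # both changed it differently: keep base
            conflicts.append(index)  # and report the cell as conflicted
    return merged, conflicts


# Local edits cell 0, remote edits cell 1: no conflicts.
clean = merge_cells(["a = 1", "b = 2"], ["a = 10", "b = 2"], ["a = 1", "b = 20"])

# Both sides edit cell 0 differently: one conflict.
conflicted = merge_cells(["a = 1"], ["a = 10"], ["a = 100"])

print(clean)
print(conflicted)
```

The first call mirrors the conflict-free example above, where each user touched a different cell; the second mirrors the case where both users altered the same section.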

These features are simply not available with the built-in Jupyter solutions. nbdime also features Git and Mercurial integration as well as browser-based visual diffing and merging.

ReviewNB

ReviewNB is a GitHub app that also offers visual diffing with an interface that looks similar to the traditional Jupyter IDE. Because the outputs are visualized, problems associated with committing binary blobs disappear.

ReviewNB is a simple tool built specifically for GitHub integration. This means the software is less flexible, but also easy to install and use. Perhaps the most attractive feature is the recent addition of cell-level comments and conversation threads around open issues.

Neptune

Neptune is a collaboration tool that can integrate with Jupyter and JupyterLab as an extension. Version control is just one of Neptune's features. The team, project, and user management features make this more than a version control tool, but the software's lightweight footprint may make it a compelling candidate regardless.

Neptune makes it easy to share notebook diffs at specific checkpoints with hyperlinks. The comparisons include media-rich output from cells. The interface also makes it easy to browse different checkpoints or notebook files.

Jupytext

The previous solutions make Jupyter notebooks more friendly to version control, but they have drawbacks. nbconvert processes are manual (but scriptable) and they force the user to rerun the notebook after stripping the output. nbdime offers more complete solutions for diff and merge, but doesn't make it easy to edit plain text outside of the notebook. Jupytext uses YAML metadata to offer the most complete version control solution.

Setup

Jupytext takes some configuration to get started.

pip install jupytext --upgrade

A Jupyter configuration file must be generated (or, if one already exists, appended to) so that it contains this line: c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager".

jupyter notebook --generate-config -y
echo 'c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"' >> ~/.jupyter/jupyter_notebook_config.py
cat ~/.jupyter/jupyter_notebook_config.py

Formats

Jupytext can be configured to automatically pair a git-friendly text file, which holds the input cells, with the .ipynb file, which preserves the output data. The text formats include:

  • Julia: .jl

  • Python: .py

  • R: .R

  • Markdown: .md

  • RMarkdown: .Rmd

  • and more!

Markdown
jupytext --to markdown --output /results/simple-nb.md /jupyter-git/simple-nb.ipynb
cat /results/simple-nb.md
simple-nb.md
Python
jupytext --to python --output /results/simple-nb-jupytext.py /jupyter-git/simple-nb.ipynb
cat /results/simple-nb-jupytext.py
simple-nb-jupytext.py

Compare the Python created by nbconvert, simple-nb-nbconvert.py, with jupytext's simple-nb-jupytext.py. Jupytext's light format avoids inserting cell markers; it is paired with a .ipynb file and can accurately reconstruct input cells without them. Furthermore, jupytext inserts this YAML header information as a comment in the Python .py file (note the format_name):

#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: light
#       format_version: '1.3'
#       jupytext_version: 0.8.5
#   kernelspec:
#     display_name: SageMath (stable)
#     language: sagemath
#     name: sagemath
# ---

Note the similar YAML header information in simple-nb.md. This technique simultaneously relieves two pain points associated with Jupyter notebooks: clean version control and easy collaboration. Notebooks can be configured individually or a global default can be added to the aforementioned jupyter_notebook_config.py file.

Pair an Individual Notebook

To associate simple-nb-jupytext.py with simple-nb.ipynb, open the .ipynb file in Jupyter Notebook. Select Edit > Edit Notebook Metadata in Jupyter's menu and add "jupytext": {"formats": "ipynb,py"}, to the JSON:

{
  "jupytext": {"formats": "ipynb,py"},
  "kernelspec": {
    (...)
  },
  "language_info": {
    (...)
  }
}

When the .ipynb is loaded or reloaded in Jupyter, the input cells will now be read from the associated .py file.
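The same metadata edit can be made outside the Jupyter interface by modifying the notebook JSON. The following is a minimal sketch of that edit, assuming the notebook has already been loaded into a dictionary; pair_with_script is a hypothetical helper name and the only key it writes is the "jupytext" entry shown above.

```python
import copy


def pair_with_script(notebook: dict, formats: str = "ipynb,py") -> dict:
    """Return a copy of the notebook whose metadata requests jupytext pairing."""
    paired = copy.deepcopy(notebook)
    metadata = paired.setdefault("metadata", {})
    # Keep any existing jupytext settings and only set the formats entry.
    metadata.setdefault("jupytext", {})["formats"] = formats
    return paired


# Hypothetical stand-in for the metadata of simple-nb.ipynb.
notebook = {
    "cells": [],
    "metadata": {"kernelspec": {"name": "sagemath"}},
}

paired = pair_with_script(notebook)
print(paired["metadata"])
```

Existing entries such as kernelspec are preserved, and the original dictionary is left unmodified so the change can be reviewed before it is written back to disk.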

Round Trip Test

To ensure the accuracy of building a .ipynb file from .py source, a --test flag will take a notebook from .ipynb to .py and back to .ipynb, then compare the two .ipynb files.

jupytext --test -x /jupyter-git/simple-nb.ipynb --to python

No issues!
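The idea behind a round trip test can be sketched independently of jupytext: serialize the input cells to text, parse the text back into cells, and check that nothing changed. The example below uses a toy text format with a "# %%" line before each cell. It is an illustration of the testing technique only, not jupytext's light format or its implementation, and all function names are hypothetical.

```python
MARKER = "# %%"


def cells_to_text(sources: list) -> str:
    """Serialize cell sources to text, one marker line before each cell."""
    return "\n".join(f"{MARKER}\n{source}" for source in sources)


def text_to_cells(text: str) -> list:
    """Parse text produced by cells_to_text back into a list of cell sources."""
    cells = []
    current = None
    for line in text.split("\n"):
        if line == MARKER:
            if current is not None:
                cells.append("\n".join(current))
            current = []  # start collecting a new cell
        elif current is not None:
            current.append(line)
    if current is not None:
        cells.append("\n".join(current))
    return cells


def round_trip_ok(sources: list) -> bool:
    """True if converting to text and back reproduces the original cells."""
    return text_to_cells(cells_to_text(sources)) == sources


good = ["import numpy as np", "t = np.arange(0.0, 2.0, 0.01)\ns = np.sin(2 * np.pi * t)"]
# A cell that itself contains a marker line cannot survive this toy format.
bad = ["x = 1\n# %%\ny = 2"]

print(round_trip_ok(good), round_trip_ok(bad))
```

The second case shows why such a test is worth running: a text format can silently split or merge cells, and comparing the rebuilt notebook against the original is the simplest way to catch it.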

Version Control the Python Script

Add the .py file to version control. Every saved change to a Python cell in this Jupyter notebook will now be reflected in the .py file. Two different people can now work on these .py files simultaneously. Pulling, pushing, and merging code will be handled just as they would be for any other Python project. The .ipynb file never needs to be shared, unless someone wants to share the output of their notebook. This addresses any issues regarding committing binary blobs to version control.

Nextjournal

Version control will always be a little complicated in Jupyter due to the nature of the notebook file format. If you would like to avoid this entirely, you should try Nextjournal. Nextjournal promises complete reproducibility across your entire project. From computational environments to code, prose, and data, everything is automatically version controlled. No installation or configuration required!

Nextjournal makes it effortless to collaborate using the remix feature and reuse work from other articles via the platform's immutable transclusions. You can even upload your Jupyter notebooks and use Jupyter kernels.
