How to Version Control Jupyter Notebooks
The Definitive Guide
Introduction
Jupyter notebooks generate files that may contain metadata, source code, formatted text, and rich media. Unfortunately, this makes these files poor candidates for conventional version control solutions, which work best with plain text.
Version control is an important creative tool that engenders experimentation and eases collaboration between peers. It lowers the risks of making a mistake or erasing another person's work because a complete record exists of all changes.
Exploration is a critical part of data analysis. Jupyter's inherent interactivity has made it a popular tool amongst data scientists and researchers. It has taken several years, but version control solutions are beginning to catch up. This article explores a few of the latest and greatest.
Problems With Jupyter and Version Control
Jupyter notebook files are human-readable JSON .ipynb files.
fold -s -w80 NJ__REFec4177e5_f354_4574_af09_cc30fb391f30_simple_nb_ipynb
The JSON data above renders the following result in Jupyter Notebook:
It is uncommon to edit the JSON source directly because the format is so verbose; it's easy to forget required punctuation, unbalance brackets like {} and [], and corrupt the file. More troublesome, notebook files are often littered with cell output stored as binary blobs. The sine wave from simple-nb.ipynb looks like this, trimmed for legibility:
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYwAAAEWCAYAAAB1xKBvAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzsvXmcHNd13/s9vc4+2EgABHeQEkVSXGGRFLembFNSPn7Wyy45i5UXh5ZjvcSy4xcr78WK5bwkzvKSeIllOqaVxZKcOJLN+FHc0dxJEVxAAgQBAiCIdbDP0tPT+80fVdXdmOnl1q17ezBm/T6f+QDdXVXnVtU996z3HFFKESNGjBgxYvRDYrkHECNGjBgxVgZigREjRowYMbQQC4wYMWLEiKGFWGDEiBEjRgwtxAIjRowYMWJoIRYYMWLEiBFDC7HAiBEDEJG/JiKPL/c4YsQ4nxELjBgfGojIXSLyoojMiMgZEXlBRH4IQCn1B0qp+x3QfExE/q+2z5tERHX5boNt+jFi2EQsMGJ8KCAiE8CfAr8BrAE2Ab8ClB2Tfha4t+3zPcC7Hb57Tyk15XgsMWJEQiwwYnxY8BEApdS3lVJ1pdSCUupxpdRbACLyRRF5PjjY1/i/JCLvichZEfktEZG23/8PEdnl//aYiFzWhe6zwJ0iEvDa3cC/A7Ys+u5Z/7qrReRPReSkf+0/FZGL/d8+LyLb2i8uIl8RkYf9/2dF5F+LyEEROS4i3xCR4YjPLUaMJmKBEePDgj1AXUT+k4h8VkRWa5zzY8APATcCfwX4NICI/O/APwL+AnAB8Bzw7S7X+AGQ9a8BnjXxBLB30XfP+v9PAL8PXAZcCiwAv+..."
This creates misleading and unwieldy diffs when doing something as simple as rerunning a notebook with different input data. For example, doubling the time range over which the sine wave is plotted involves changing a single line from t = np.arange(0.0, 2.0, 0.01) to t = np.arange(0.0, 4.0, 0.01). This produces a minor change in the notebook that looks like a significant change in the git commit log, because the regenerated image blob is rewritten in full. Scroll through the output and you will immediately see the issue.
git --git-dir=/jupyter-git/.git log -p -1 > /results/log.txt
fold -s -w80 /results/log.txt
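The effect described above can be reproduced in miniature. The following is a rough sketch, not the real simple-nb.ipynb: the notebook dictionaries are written by hand, and the long strings of "A" and "B" characters are placeholders standing in for base64-encoded PNG data. It diffs the JSON of two such notebooks and compares how many changed characters come from the source line versus the image blob.

```python
import difflib
import json


def make_notebook(source_line, image_blob):
    """Build a minimal notebook-like dict with one code cell and one image output."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": 1,
                "metadata": {},
                "outputs": [
                    {
                        "data": {"image/png": image_blob},
                        "metadata": {},
                        "output_type": "display_data",
                    }
                ],
                "source": [source_line],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 2,
    }


def changed_lines(before, after):
    """Return only the added and removed lines of a unified diff of the two JSON texts."""
    a = json.dumps(before, indent=1, sort_keys=True).splitlines()
    b = json.dumps(after, indent=1, sort_keys=True).splitlines()
    return [
        line
        for line in difflib.unified_diff(a, b, lineterm="")
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Placeholder strings standing in for real base64-encoded PNG data.
old = make_notebook("t = np.arange(0.0, 2.0, 0.01)", "iVBORw0KGgo" + "A" * 2000)
new = make_notebook("t = np.arange(0.0, 4.0, 0.01)", "iVBORw0KGgo" + "B" * 2000)

changes = changed_lines(old, new)
source_changes = [line for line in changes if "np.arange" in line]
blob_changes = [line for line in changes if "image/png" in line]
source_chars = sum(len(line) for line in source_changes)
blob_chars = sum(len(line) for line in blob_changes)
print(f"source: {source_chars} characters changed, image blob: {blob_chars} characters changed")
```

Even with a placeholder blob of only 2,000 characters, the image accounts for nearly all of the diff; real plot output is typically far larger.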
Try Nextjournal. The notebook for reproducible research.
- Automatically version-controlled all the time
- Supports Python, R, Julia, Clojure and more
- Invite co-workers, collaborate in real-time
- Import your existing Jupyter notebooks
Built-In Solutions
Clear Output Manually
The simplest solution is to always clear the output before committing. Cell → All Output → Clear → Save. This removes any binary blobs that have been generated by the notebook. There are three main drawbacks:
- It is a manual process.
- Collaborators on other machines will need to rerun the notebook to see the output, requiring additional time and setup.
- Collaborators on other machines may still create noise when new metadata is generated, like this information at the end of simple-nb.ipynb:
"metadata": {
"kernelspec": {
"display_name": "SageMath (stable)",
"language": "sagemath",
"name": "sagemath"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.15"
}
}
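The manual clearing step can also be scripted. The sketch below is one possible approach using only the standard json module; the example notebook is a hand-written stand-in (a real file would be read with json.load), and the function name strip_outputs is made up for illustration.

```python
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs and execution counts cleared."""
    cleaned = json.loads(json.dumps(notebook))  # simple deep copy for JSON-compatible data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hand-written stand-in for a notebook; the image string is a truncated placeholder.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Sine wave"]},
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "outputs": [
                {
                    "data": {"image/png": "iVBORw0KGgo..."},
                    "metadata": {},
                    "output_type": "display_data",
                }
            ],
            "source": ["t = np.arange(0.0, 2.0, 0.01)"],
        },
    ],
    "metadata": {"kernelspec": {"name": "sagemath"}},
    "nbformat": 4,
    "nbformat_minor": 2,
}

stripped = strip_outputs(example)
print(json.dumps(stripped, indent=1))
```

Note that this only addresses the first drawback. The notebook-level metadata, such as kernelspec, is deliberately left untouched here, so the metadata noise described above would remain.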
Convert to HTML
As a best practice, many Jupyter users will generate HTML and pure Python versions of their notebook using the built-in nbconvert tool. This ensures the output can easily be displayed by any computer with a web browser.
jupyter nbconvert /jupyter-git/simple-nb.ipynb --output-dir="/results" --output="simple-nb.html"
cat /results/simple-nb.html
Opening the above file, simple-nb.html, in a browser window will render the Python code and resulting sine wave just as it would look in a Jupyter notebook.
Convert to Python
jupyter nbconvert --to="python" creates a succinct, readable record of the notebook's code cells. Peruse the output below and note how much shorter simple-nb-nbconvert.py is than the JSON or HTML versions.
The simple Python document is perfect for version control and makes working in teams much easier. Changes are easily spotted and diffs are more readable.
jupyter nbconvert /jupyter-git/simple-nb.ipynb --to="python" --output-dir="/results" --output="simple-nb-nbconvert"
cat /results/simple-nb-nbconvert.py
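To make the idea behind this conversion concrete, here is a minimal sketch that pulls the code cells out of a notebook dictionary and joins them into a script. It is an approximation for illustration only: nbconvert's real exporter does more than this, and the function name and example notebook are assumptions.

```python
def code_cells_to_script(notebook):
    """Join the source of every code cell into one script, skipping markdown and outputs."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source either as one string or as a list of line strings.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


# Hand-written stand-in for a notebook with one markdown cell and two code cells.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Sine wave"]},
        {"cell_type": "code", "outputs": [], "source": ["import numpy as np"]},
        {
            "cell_type": "code",
            "outputs": [{"data": {"image/png": "iVBORw0KGgo..."}}],
            "source": ["t = np.arange(0.0, 2.0, 0.01)\n", "s = np.sin(t)"],
        },
    ]
}

script = code_cells_to_script(example)
print(script)
```

Because the outputs never reach the script, a rerun with different data changes only the lines whose source changed.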
Conclusion
These are useful tools, but leave something to be desired when compared to other solutions. Read on to see how version control with Jupyter notebooks can be more useful and tightly integrated.
External Tools
nbdime
nbdime was specifically created to solve problems related to diffing and merging Jupyter notebooks. The tool understands the structure of .ipynb files, so it can make content-aware decisions and offer more informative messaging.
Diffing
In this scenario, new output is created after rerunning the notebook. A traditional git diff is not very helpful.
cd /nbdime-git
git diff > /results/git-diff.txt
Scroll through the diff and you'll immediately see the problem. The binary blob makes the output virtually illegible:
fold -s -w80 NJ__REFc4e7a8ac_2112_4314_859b_5d9eab911e5a_git_diff_txt
Running nbdime's nbdiff provides a more useful output by highlighting the change in context. Note that it also trims the binary blob:
cd /nbdime-git
nbdiff
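The general idea of a content-aware diff can be sketched in a few lines: compare the cell inputs and leave the outputs out of the comparison altogether. This is a simplified illustration, not nbdime's algorithm, and the function names and example notebooks are assumptions.

```python
import difflib


def source_lines(notebook):
    """Flatten the input of all cells into a list of lines, ignoring outputs entirely."""
    lines = []
    for index, cell in enumerate(notebook.get("cells", [])):
        source = cell.get("source", "")
        if isinstance(source, list):
            source = "".join(source)
        lines.append(f"## cell {index} ({cell.get('cell_type')})")
        lines.extend(source.splitlines())
    return lines


def source_diff(before, after):
    """Unified diff of the cell inputs of two notebook dicts."""
    return list(
        difflib.unified_diff(
            source_lines(before), source_lines(after), "before", "after", lineterm=""
        )
    )


# Two hand-written notebook stand-ins; the image strings are placeholders.
before = {
    "cells": [
        {
            "cell_type": "code",
            "outputs": [{"data": {"image/png": "iVBORw0KGgo" + "A" * 500}}],
            "source": ["t = np.arange(0.0, 2.0, 0.01)\n", "s = np.sin(t)"],
        }
    ]
}
after = {
    "cells": [
        {
            "cell_type": "code",
            "outputs": [{"data": {"image/png": "iVBORw0KGgo" + "B" * 500}}],
            "source": ["t = np.arange(0.0, 4.0, 0.01)\n", "s = np.sin(t)"],
        }
    ]
}

diff = source_diff(before, after)
print("\n".join(diff))
```

The resulting diff shows only the edited source line, even though the stored image output differs as well.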
Merging
Merging is clearer as well. In the first example, two users, local and remote, have made edits to the base notebook. When one user merges their local file with another user's updated remote file, there are no conflicts and nbmerge displays an output similar to nbdiff.
nbmerge NJ__REFd4efffad_cac3_4179_bd48_b01651392b38_simple_nbdime_base_ipynb NJ__REFe54f2634_4222_4995_8356_22ee9edda9ee_simple_nbdime_local_ipynb NJ__REF690c1a5c_dcda_4dbe_b909_d16df7a3a6e7_simple_nbdime_remote_ipynb --decisions
On the other hand, when two users alter the same sections of the base file, nbmerge offers the user a more Jupyter-friendly conflict resolution:
nbmerge NJ__REF51355d8b_22cd_4933_8a66_804b2f3b60dc_simple_nbdime_11_ipynb NJ__REF79295c1b_ce3b_4637_aa13_285dbe84f771_simple_nbdime_12_ipynb NJ__REF162b38f9_52d7_431d_9a29_a8c4e19ca5dc_simple_nbdime_13_ipynb --decisions
These features are simply not available with the built-in Jupyter solutions. nbdime also features Git and Mercurial integration as well as browser-based visual diffing and merging.
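The two merge scenarios above (edits to different cells versus edits to the same cell) follow the usual three-way merge logic. The sketch below illustrates that logic on lists of cell sources under a strong simplifying assumption: all three versions have the same number of cells. It is not how nbdime is implemented; a real notebook merge must also handle inserted and deleted cells, outputs, and metadata.

```python
def merge_cells(base, local, remote):
    """Three-way merge of cell sources, cell by cell.

    Returns (merged, conflicts), where conflicts lists the indices of cells
    that local and remote both changed in different ways.
    """
    if not (len(base) == len(local) == len(remote)):
        raise ValueError("this sketch only handles versions with matching cell counts")
    merged, conflicts = [], []
    for index, (b, l, r) in enumerate(zip(base, local, remote)):
        if l == r:          # both sides agree (or neither changed the cell)
            merged.append(l)
        elif l == b:        # only remote changed the cell
            merged.append(r)
        elif r == b:        # only local changed the cell
            merged.append(l)
        else:               # both changed it differently: needs a human decision
            merged.append(l)
            conflicts.append(index)
    return merged, conflicts


base = ["import numpy as np", "t = np.arange(0.0, 2.0, 0.01)", "plt.plot(t, s)"]

# Scenario 1: local and remote edit different cells, so the merge is clean.
local = ["import numpy as np", "t = np.arange(0.0, 4.0, 0.01)", "plt.plot(t, s)"]
remote = ["import numpy as np", "t = np.arange(0.0, 2.0, 0.01)", "plt.plot(t, s, 'r')"]
clean_merge, clean_conflicts = merge_cells(base, local, remote)

# Scenario 2: both edit the same cell differently, so a conflict is reported.
remote_clash = ["import numpy as np", "t = np.arange(0.0, 8.0, 0.01)", "plt.plot(t, s)"]
clash_merge, clash_conflicts = merge_cells(base, local, remote_clash)

print(clean_merge, clean_conflicts)
print(clash_merge, clash_conflicts)
```

In the conflicting case this sketch keeps the local version as a placeholder and reports the cell index, leaving the decision to the user.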
ReviewNB
ReviewNB is a GitHub app that also offers visual diffing with an interface that looks similar to the traditional Jupyter IDE. Because the outputs are visualized, problems associated with committing binary blobs disappear.
ReviewNB is a simple tool built specifically for GitHub integration. This means the software is less flexible, but also easy to install and use. Perhaps the most attractive feature is the recent addition of cell-level comments and conversation threads around open issues.
Neptune
Neptune is a collaboration tool that can integrate with Jupyter and JupyterLab as an extension. Version control is just one of Neptune's features. The team, project, and user management features make this more than a version control tool, but the software's lightweight footprint may make it a compelling candidate regardless.
Neptune makes it easy to share notebook diffs at specific checkpoints with hyperlinks. The comparisons include media-rich output from cells. The interface also makes it easy to browse different checkpoints or notebook files.
Jupytext
The previous solutions make Jupyter notebooks more friendly to version control, but they have drawbacks. nbconvert processes are manual (but scriptable) and they force the user to rerun the notebook after stripping the output. nbdime offers more complete solutions for diff and merge, but doesn't make it easy to edit plain text outside of the notebook. Jupytext uses YAML metadata to offer the most complete version control solution.
Setup
Jupytext takes some configuration to get started.
pip install jupytext --upgrade
A Jupyter configuration file must be generated, or an existing one appended to, with this setting: c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager".
jupyter notebook --generate-config -y
echo 'c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"' >> ~/.jupyter/jupyter_notebook_config.py
cat ~/.jupyter/jupyter_notebook_config.py
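One caveat with appending via echo is that running the command twice adds the setting twice. As a hedged alternative, the sketch below appends the line only when it is missing. The helper name ensure_setting is invented for this example, and the demonstration writes to a temporary file rather than the real ~/.jupyter/jupyter_notebook_config.py.

```python
import tempfile
from pathlib import Path

SETTING = 'c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"'


def ensure_setting(config_path, setting=SETTING):
    """Append the setting to the config file unless an identical line is already present.

    Returns True if the file was modified, False otherwise.
    """
    path = Path(config_path)
    existing = path.read_text() if path.exists() else ""
    if setting in (line.strip() for line in existing.splitlines()):
        return False
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    path.write_text(existing + separator + setting + "\n")
    return True


# Demonstrated on a throwaway file so the real Jupyter configuration is not touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_config = Path(tmp) / "jupyter_notebook_config.py"
    demo_config.write_text("# Configuration file for jupyter-notebook.\n")
    first_run = ensure_setting(demo_config)
    second_run = ensure_setting(demo_config)
    final_text = demo_config.read_text()

print(first_run, second_run)
print(final_text)
```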
Formats
Jupytext can be configured to automatically pair a git-friendly file for input data while preserving the output data in the .ipynb file. The options include:
- Julia: .jl
- Python: .py
- R: .R
- Markdown: .md
- RMarkdown: .Rmd
- and more!
Markdown
jupytext --to markdown --output /results/simple-nb.md /jupyter-git/simple-nb.ipynb
cat /results/simple-nb.md
Python
jupytext --to python --output /results/simple-nb-jupytext.py /jupyter-git/simple-nb.ipynb
cat /results/simple-nb-jupytext.py
Compare the Python created by nbconvert, simple-nb-nbconvert.py, with jupytext's simple-nb-jupytext.py. Jupytext's light format avoids inserting cell markers; it is paired with a .ipynb file and can accurately reconstruct input cells without them. Furthermore, jupytext inserts this YAML header information as a comment in the Python .py file (note the format_name):
# jupytext:
# text_representation:
# extension: .py
# format_name: light
# format_version: '1.3'
# jupytext_version: 0.8.5
# kernelspec:
# display_name: SageMath (stable)
# language: sagemath
# name: sagemath
# ---
Note the similar YAML header information in the jupyter_notebook_config.py file.
Pair an Individual Notebook
To associate simple-nb-jupytext.py with simple-nb.ipynb, open the .ipynb file in Jupyter Notebook. Select Edit → Edit Notebook Metadata in Jupyter's menu and add "jupytext": {"formats": "ipynb,py"}, to the JSON:
{
"jupytext": {"formats": "ipynb,py"},
"kernelspec": {
(...)
},
"language_info": {
(...)
}
}
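The same metadata edit can be made without the Jupyter menu by editing the notebook's JSON programmatically. The following is a minimal sketch using the standard json module; the function name pair_with_script is hypothetical and the example notebook is a hand-written stand-in for a file loaded with json.load.

```python
import json


def pair_with_script(notebook, formats="ipynb,py"):
    """Return a copy of the notebook dict with the Jupytext pairing entry added to its metadata."""
    paired = json.loads(json.dumps(notebook))  # simple deep copy for JSON-compatible data
    paired.setdefault("metadata", {})["jupytext"] = {"formats": formats}
    return paired


# Stand-in notebook carrying the kernelspec shown earlier in this article.
example = {
    "cells": [],
    "metadata": {
        "kernelspec": {
            "display_name": "SageMath (stable)",
            "language": "sagemath",
            "name": "sagemath",
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2,
}

paired = pair_with_script(example)
print(json.dumps(paired["metadata"], indent=1, sort_keys=True))
```

The existing kernelspec and language_info entries are preserved; only the "jupytext" key is added.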
When the .ipynb is loaded or reloaded in Jupyter, the input cells will now be read from the associated .py file.
Round Trip Test
To ensure the accuracy of building a .ipynb file from .py source, a --test flag will take a notebook from .ipynb → .py → .ipynb and compare the two .ipynb files.
jupytext --test -x /jupyter-git/simple-nb.ipynb --to python
No issues!
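What a round trip check verifies can be shown with a toy converter. The sketch below is a deliberately simplified stand-in, not Jupytext's implementation: it writes cells to a script using a "# %%" marker line chosen for this example, reads them back, and compares the result with the original cells.

```python
CELL_MARKER = "# %%"  # marker line chosen for this sketch


def cells_to_script(cells):
    """Write each cell source to the script, preceded by a marker line."""
    return "\n".join(f"{CELL_MARKER}\n{source}" for source in cells) + "\n"


def script_to_cells(script):
    """Split a script back into cell sources at each marker line."""
    cells = []
    current = None
    for line in script.splitlines():
        if line == CELL_MARKER:
            if current is not None:
                cells.append("\n".join(current))
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        cells.append("\n".join(current))
    return cells


def round_trip_ok(cells):
    """True if converting to a script and back reproduces the original cells."""
    return script_to_cells(cells_to_script(cells)) == cells


ordinary = ["import numpy as np", "t = np.arange(0.0, 2.0, 0.01)\ns = np.sin(t)"]
tricky = ["x = 1\n# %%\ny = 2"]  # a cell that happens to contain the marker itself

print(round_trip_ok(ordinary))
print(round_trip_ok(tricky))
```

The second case fails the round trip because the toy format cannot tell a marker from cell content, which is exactly the kind of loss a round trip test is meant to catch before the text file becomes the source of truth.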
Version Control the Python Script
Add the .py file to version control. Every saved change to a Python cell in this Jupyter notebook will now be reflected in the .py file. Two different people can now work on these .py files simultaneously. Pulling, pushing, and merging code will be handled just as they would be for any other Python project. The .ipynb file never needs to be shared, unless someone wants to share the output of their notebook. This addresses any issues regarding committing binary blobs to version control.
Nextjournal
Version control will always be a little complicated in Jupyter due to the nature of the notebook file format. If you would like to avoid this entirely, you should try Nextjournal. Nextjournal promises complete reproducibility across your entire project. From computational environments, to code, prose and data - everything is automatically version controlled. No installation or configuration required!
Nextjournal makes it effortless to collaborate using the remix feature and reuse work from other articles via the platform's immutable transclusions. You can even upload your Jupyter notebooks and use Jupyter kernels.