David Schmudde / Jun 03 2019

Nextjournal and Jupyter

While Nextjournal can run Jupyter notebooks, it is based on a fundamentally different technology. The platform was written from the ground up to address some of the shortcomings of traditional Jupyter notebooks.

What follows is an examination of the differences between the two platforms. The Jupyter ecosystem is too large for this brief survey, so the focus is on core tooling and some of the most popular solutions. The goal is to show where Nextjournal reduces friction and saves time.

Uncompromised System Access

  • Nextjournal: standard system commands for reliable environment management, encompassing default environments
  • Jupyter: kernel middleman, magic commands, and installation workarounds

Version Control

  • Nextjournal: automatic, synchronous version control across all code, commentary, and data
  • Jupyter: manual, selective version control

Collaboration

  • Nextjournal: real time teamwork, group management, and key sharing for external databases and repositories
  • Jupyter: most collaboration happens outside the notebook on special platforms

Compute

  • Nextjournal: allocate in a single menu, native bash scripting for batch jobs, pipeline results between runtimes
  • Jupyter: complex GPU administration outside of the notebook, easy to forget to shut down costly GPU allocation, difficult to provision data between machines.

Share

  • Nextjournal: technical support built in, working drafts, publish at a memorable URL
  • Jupyter: share static notebooks on GitHub or nbviewer. Share runnable notebooks on Binder, but it is not an editing platform.

Reproduce and Reuse

  • Nextjournal: instant reproducibility with a single click, reusable components between notebooks
  • Jupyter: command line tools for reproducing computational environments aren't guaranteed to work

The Notebook and the Shell

Uncompromised System Access

This notebook begins by using Nextjournal's default Python environment.

import platform; platform.python_version()
'3.6.8'

The environment comes pre-loaded with dozens of useful packages. Use Bash in Nextjournal just as you would on your local machine to see what packages are installed.

conda list; printf "\npip packages:\n"; pip list

Installation using Bash also works as expected.

pip install haishoku

Nextjournal views newly installed or upgraded packages as a new environment. Sharing these changes is trivial in Nextjournal. More on this later; a detailed description is available in Runtimes & Environments.

So far this notebook contains one Nextjournal runtime with both Python and Bash cells. To demonstrate how Jupyter handles some features, it is easy to create a second runtime with a Jupyter Python kernel:

!jupyter --version

Note the indicator, Jupyter, in the lower right of the Python cell. This is in contrast to the Nextjournal cells, which have been assigned the name NJ.

Jupyter

Which Python Do You Mean?

In Jupyter notebooks, the python executable demonstrated above is determined by the kernel itself. On any given computer, the kernel's python executable is not always the same as the command line's python executable. Related dependency errors are difficult to trace.
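
One way to see the mismatch is to compare the interpreter running the current cell with the first python found on the PATH. This is a minimal sketch rather than part of the original notebook; describe_pythons is a hypothetical helper name, and the comparison resolves symlinks so that two paths to the same binary count as equal.

```python
import os
import shutil
import sys


def describe_pythons():
    """Report the interpreter running this code and the one a shell would find."""
    kernel_python = sys.executable         # interpreter executing this cell
    shell_python = shutil.which("python")  # first "python" on PATH, or None
    same = bool(kernel_python) and shell_python is not None and (
        os.path.realpath(kernel_python) == os.path.realpath(shell_python)
    )
    return {"kernel": kernel_python, "shell": shell_python, "same": same}


info = describe_pythons()
print(info)
```

If "same" is False, a package installed from the command line may end up in a different environment than the one the kernel imports from.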

Installation For the Right Python

This is evident when running a package installation from within a Jupyter cell. !conda install --yes numpy does not guarantee that the package is destined for the kernel's runtime. To ensure the correct destination:

import sys
!conda install --yes --prefix {sys.prefix} numpy

It may be confusing, but the above cell is a mix of Python and shell commands. sys is a Python module and import is the Python statement that enables its use. The shell executable conda is prefixed by a Jupyter bang (!). The rest are conda arguments, with {sys.prefix} inserted to point conda at the kernel's Python installation.

Three separate pieces of software - Python, Conda in the shell, and Jupyter - are working together to reliably install Python dependencies. Some Jupyter users forgo the notebook cell and install dependencies directly from the command line.
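
The same pattern carries over to pip. The sketch below is not part of the original notebook; it assumes pip is available to the kernel's interpreter, and pip_install_command is a hypothetical helper name. It only builds the command, so nothing is installed by accident.

```python
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this code."""
    # "python -m pip" runs pip as a module of that exact interpreter,
    # so the package is installed into the kernel's own environment.
    return [sys.executable, "-m", "pip", "install", package]


command = pip_install_command("haishoku")
print(" ".join(command))
# To actually install, pass the list to subprocess.check_call(command).
```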

Installing dependencies from the command line is a fine practice, but it does not resolve the issue of having multiple versions of Python on one system. It also disassociates the notebook's explicit dependencies from the notebook itself.

Nextjournal's approach to building environments solves this issue. Environments also offer many more benefits for reproducible research, which will be explored later in this notebook.

Version Control

Automatic, Synchronous Version Control

Nextjournal boasts automatic, synchronous version control across all code, commentary, and data hosted on the platform. All changes are recorded. Go back in time, anytime.

Make changes to unfamiliar code and rerun to see new results. Experiment with alternative ideas knowing that you can always go back. Play!

Synchronous version control impacts every stage of research - from early collaborative efforts to peer reviewing published work. All computational environments are trivially reproducible and all processes are essentially documented.

Jupyter

Once the environment is built, a user can run the .ipynb file assuming they also have access to the data. It is easy to go back to the original notebook file at any time, but difficult to go back to any step in between. Version control of a Jupyter notebook is not trivial, even if you are familiar with a tool like GitHub.

Because of the way that .ipynb files handle binary blobs, they are not ideal candidates for use in version control systems. There are several workarounds, all with various tradeoffs. nbconvert is the most basic and comes with every Jupyter install.

simple-nb.ipynb

jupyter nbconvert --to="python" creates a succinct, readable record of the notebook's code cells. The simple Python document works well with version control - changes are easily spotted and diffs are more readable.

%%bash
jupyter nbconvert simple-nb.ipynb --to="python" --output-dir="/results" --output="simple-nb-nbconvert"
cat /results/simple-nb-nbconvert.py
simple-nb-nbconvert.py
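
Another common workaround is to strip outputs from the notebook before committing it, so that diffs only show changes to the source. The sketch below is an illustration and not the notebook's own method. It assumes the nbformat 4 JSON layout, where code cells carry "outputs" and "execution_count" fields; strip_outputs and the tiny example notebook are hypothetical.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of an nbformat-4 notebook dict with outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results and embedded blobs
            cell["execution_count"] = None  # drop counters that change on every run
    return cleaned


# A tiny stand-in notebook; a real file would be loaded with json.load().
example = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["1 + 1"],
            "execution_count": 3,
            "outputs": [
                {
                    "output_type": "execute_result",
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                    "execution_count": 3,
                }
            ],
        },
    ],
}

stripped = strip_outputs(example)
print(json.dumps(stripped, indent=1))
```

The tradeoff is that the committed file no longer records results, so readers must rerun the notebook to see them.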

More complete solutions require more administrative work, and none of them offers automatic, synchronous version control across code, commentary, and data. For a more complete survey of what's possible, see How to Version Control Jupyter Notebooks.

Collaborate

Real Time Collaboration and Fully Reproducible Environments

Sharing a notebook means sharing the computational environment as well as the data needed to run. If the data is stored somewhere other than Nextjournal, the platform makes it easy to share keys with other members of the team.

Real Time Teamwork

Nextjournal offers a full set of collaboration tools. Invite people to help edit your work and collaborate in real-time in the notebook.

Computational environments and data travel with the notebook. Public or private databases and repositories hosted outside of Nextjournal are explicitly mounted within the notebook. For example, an S3 bucket:

nextjournal-s3-demo
Public

And a GitHub repository:

All this can be accomplished on a per-notebook basis or by creating a group. Groups allow the added benefit of reuse, private group databases and repositories, and the ability to publish under a single group profile.

Share Secrets With Colleagues

Nextjournal stores your secrets in a fully encrypted vault separate from your notebook. Stored secrets can be referenced in notebooks, used in runtime environment variables, and shared with your collaborators. Group management makes it even easier to set up and continually share access with select collaborators.
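
Inside a runtime, a secret exposed as an environment variable can be read like any other variable. This is a minimal sketch, assuming the secret was mapped to a variable named EXAMPLE_API_TOKEN; that name and the read_secret helper are hypothetical, and the placeholder value is set only so the example runs on its own.

```python
import os


def read_secret(name):
    """Fetch a secret exposed to the runtime as an environment variable."""
    value = os.environ.get(name)
    if value is None:
        raise KeyError("environment variable {0} is not set".format(name))
    return value


# EXAMPLE_API_TOKEN is a hypothetical name; the fallback is a placeholder.
os.environ.setdefault("EXAMPLE_API_TOKEN", "not-a-real-token")
token = read_secret("EXAMPLE_API_TOKEN")
print("token loaded, length", len(token))
```

Reading the value at run time keeps it out of the notebook's text, so the notebook can be shared without exposing the secret itself.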

Jupyter

Collaboration is not a core component of the original Jupyter project. Most collaborative aspects are managed through version control using external tools like GitHub.

JupyterHub is an initiative to host Jupyter notebooks for multiple users on a remote server (or collection of servers). It requires considerable investment to set up, so distributions have been created to ease the process.

Platforms such as CoCalc have been built on top of JupyterHub and offer integrated version control and a chat window, but not Google Docs-style real-time collaboration.

The JupyterHub roadmap suggests that real-time collaboration is being considered but not in the immediate future.

Further friction is created when attempting to share data or recreate computational environments. The latter will be explored later in this notebook.

Compute

Single Click Allocation and Scripted Pipelines

Resourcing

Nextjournal runs on a fully managed cloud computing infrastructure: no setup or maintenance required. Compute resources can be fully customized (machine types, number of instances, etc.) with full GPU support.

Creating a new cell using Nextjournal's standard PyTorch environment is simple:

import platform, torch

print("This environment runs PyTorch version {0} on {1} {2}".format(torch.__version__, torch.cuda.device_count(), torch.cuda.get_device_name(0)))

At this point, it's also worth noting that the PyTorch runtime joins NJ, the Nextjournal Python default runtime, and the Jupyter runtime in this notebook. All were added with just a few clicks.

New PyTorch, TensorFlow, TFLearn, and Keras notebooks can be created using Nextjournal's single click defaults.

Pipelines

Allocating computational resources is an important part of many data-driven pipelines. For example, after all data and environments are in place, a computationally intense Bash script running on a GPU can feed an R cell for plotting:

python -c 'import torch; print(torch.rand(3,3).cuda())' > /results/big-process.txt

Now to pipe to R

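import torch
import matplotlib.pyplot as plt

# Create a random 100x100 tensor on the GPU
x = torch.rand(100, 100).cuda()

def showTensor(aTensor):
    # Copy the tensor back to the CPU so matplotlib can render it
    plt.figure()
    plt.imshow(aTensor.cpu().numpy())
    plt.colorbar()
    # Save before show(): the figure can be empty once show() returns
    plt.savefig("/results/test.png")
    plt.show()

showTensor(x)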

Data flow is simplified by Nextjournal's integration of data sources and computational environments within the notebook. The result is an easier time spawning new runtimes and moving data between them. It is all available from the Nextjournal GUI - no extensions or extra notebook installations required.
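
As a rough illustration of the hand-off between steps, one stage can write a results file that a later stage reads. This sketch is not taken from the original notebook: a temporary directory stands in for the /results directory, and produce and consume are hypothetical names for the two stages.

```python
import os
import tempfile


def produce(results_dir):
    """First stage: write three rows of numbers to a text file."""
    path = os.path.join(results_dir, "big-process.txt")
    with open(path, "w") as handle:
        for row in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
            handle.write(" ".join(str(value) for value in row) + "\n")
    return path


def consume(path):
    """Second stage: read the file back and sum each column."""
    with open(path) as handle:
        rows = [[float(v) for v in line.split()] for line in handle if line.strip()]
    return [sum(column) for column in zip(*rows)]


# A temporary directory stands in for the notebook's /results directory.
with tempfile.TemporaryDirectory() as results_dir:
    totals = consume(produce(results_dir))

print(totals)  # [9.0, 12.0]
```

Plain text files keep the two stages independent of each other's language, which is why the same hand-off works between Bash, Python, and R cells.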

Jupyter

Jupyter can be configured to work with many cloud compute providers. Like the other examples on this list, it requires significant configuration to work.

Generally speaking, the steps include choosing and configuring a provider like Amazon or Google, configuring a local machine to work with the provider, installing a remote Jupyter instance and required packages, generating and configuring security certificates, configuring Jupyter to work with the GPU and ensuring proper security restrictions are in place, and finally installing the latest versions of your chosen libraries.

Pipelines starting from Bash can be tricky with the requirement of special Jupyter Magic commands. A provisioned data source that works well on one machine may need to be reconfigured at an indefinite point in the future or when run on another machine. The same is true for the computational environment on which the pipeline depends.

Share

Technical Problems? Ask For Help and Collaborate on Drafts

Create your project in our state-of-the-art editor which includes code auto-completion and language-specific documentation. If you get stuck on an error, use the Ask for help button.

Generate, share (and revoke) secret links to your working drafts for review and collaboration. Once ready, publish versions under your Nextjournal profile on a permanent URL.

Jupyter

Jupyter notebooks are automatically rendered in GitHub repositories and nbviewer remains a popular way to share notebooks online. If the reader wants to interact with these notebooks, they will have to download the .ipynb file and install all dependencies or run it on a cloud service like Binder.

Reproduce & Reuse

  • Common Jupyter solutions create a plain text file that points to the source repositories, which will need to be downloaded on a new computer. Binary storage and version control require additional tools.
  • Common Jupyter cloud solutions do not offer robust version control

Instant Reproducibility and Component Reusability

Click the remix button to create an instant copy of the notebook including all dependencies down to the operating system:

The duplicate can be explored, edited, and run without a single extra step of configuration or installation.

Remix relies on Nextjournal's synchronous version control across all code, commentary, and data. Changes to the remix do not affect the original; changes to the original do not affect the remix. The original author automatically retains all attribution.

Environments can be reused on any computer with a web browser or Docker installed locally.

civisanalytics/civis-jupyter-python3
Download as Docker image from:
This image was imported from: latest

Environments can be built once and reused indefinitely. In fact, the Nextjournal team depends on the reproducible nature of our notebooks to build our default environments for Python, R, Julia, and Clojure. They are created like any other article with all the same benefits from remixing.

Jupyter

When reproducing results from a Jupyter notebook, the notebook file (.ipynb), runtime environment, and data must all be available. Jupyter cloud solutions can help smooth this process, but there are drawbacks. This notebook will continue to look at the core Jupyter experience, which will provide insight into how the cloud services actually work.

Export

Conda offers a number of command line utilities for managing environments. This is essential for building the runtime environment for the .ipynb file. The simplest, conda env export, will prepare a plain text file that can be used to build the environment on another computer.

!conda env export -n base -q > /results/environment.yml

Import

The following cell takes the Jupyter environment and imports it into the NJ runtime.

cat environment.yml > env.yml # Move the YAML file to another part of the system
conda env create -p /opt/conda2 -f env.yml

There are multiple commands and configuration options when working with Jupyter environments that Nextjournal automates. Keep in mind:

  • When running a project on a new machine, conda env create will download all dependencies again. If there is an issue with the repository at some point in the future, the environment will not be easily reproducible.
  • Conda cannot guarantee package parity between operating systems - Linux, macOS, and Windows. This is not an issue with Nextjournal.

conda list --explicit writes a spec file listing the exact packages in the environment, which can be used to download an identical set of dependencies on a second computer. Note that the output indicates the dependency stack is for linux-64. The software that spec-file.txt downloads will not be compatible with colleagues using Windows or macOS.

%%bash
conda list --explicit > spec-file.txt
head spec-file.txt

If a repository moves or disappears before spec-file.txt is referenced, the only solution will be to find the dependency using some other method.
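
Before sharing a spec file, it can help to check which platform it targets and which URLs it depends on. The following is a minimal sketch, assuming the usual layout of conda list --explicit output (a "# platform:" comment, an @EXPLICIT marker, then one package URL per line). parse_explicit_spec is a hypothetical helper, and the URL in the sample is a made-up placeholder.

```python
def parse_explicit_spec(text):
    """Extract the platform and package URLs from an explicit spec file."""
    target_platform = None
    urls = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("# platform:"):
            target_platform = line.split(":", 1)[1].strip()
        elif line and not line.startswith("#") and line != "@EXPLICIT":
            urls.append(line)
    return target_platform, urls


# Illustrative input only; the package URL is a placeholder, not a real package.
sample = """\
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://example.invalid/pkgs/main/linux-64/some-package-1.0-0.tar.bz2
"""

target_platform, urls = parse_explicit_spec(sample)
print(target_platform, len(urls))
```

Every URL in the list is a single point of failure: if any one of them stops resolving, the environment can no longer be rebuilt from the file alone.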

If you use GitHub for version control, it is possible to use yet another piece of software called Binder to turn a Git repo into a Jupyter notebook running in the browser able to reproduce results. However, it does not offer any security features and offers no direct version control integration.

Conclusion

Nextjournal was built for researchers, journalists, and scientists who want to focus on their work. The hours spent configuring a working system are better spent elsewhere. Best of all, if the convenience of Nextjournal is not self-evident after using the platform, there is no lock-in. It's easy to import existing Jupyter/IPython, RMarkdown, or Markdown notebooks and export any Nextjournal notebook to Markdown.