Getting Started with Python on Nextjournal
This article serves as a simple and concise starting point for Python programming tasks. Go ahead and make it your own - just 'Remix' and code away!
1. Pre-Installed Packages and Manual Installation
All articles ship with many useful packages pre-installed, so we can use them right out of the box.
pip freeze
import numpy as np
import plotly
But even if we're missing our favorite package, we can use bash cells to install it, since each cell runs within a Docker container with its own filesystem. If we do our installation in a separate Setup runtime, we can then save and export that runtime as a custom environment for use by our Main runtime, and never have to run the installations again.
1.1. Conda Package Manager
The default Python environment on Nextjournal includes the conda package manager. Let's say we want to create some statistical graphics using seaborn to explore and visualize some given dataset.
conda install -y seaborn
1.2. The Python Package Index
Not all packages are available in the conda package manager. The Python Package Index (PyPI) spans many more packages. Let's install two powerful machine learning libraries - XGBoost and scikit-learn - using the pip install command.
The XGBoost installation involves compiling a library, so first we'll have to install some build tools. The base system is Ubuntu, so we can use apt-get to install what we need.
apt-get update
apt-get install -y make g++
Now we can use pip.
pip install xgboost scikit-learn
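To double-check that both libraries are now importable, we can run a quick version check (a minimal sketch; the printed versions depend on what pip just installed):

import xgboost
import sklearn

# print the installed versions to confirm the installation worked
print(xgboost.__version__)
print(sklearn.__version__)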
1.3. Installing from GitHub
Since we have complete control over the filesystem, we can also clone and build packages directly from GitHub. Let's install XGBoost again, this time from its GitHub repository.
apt-get install -y git
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost; make -j4
Now we Save this runtime's state as a new environment, preserving our installed packages for use by other runtimes, here or in other articles.
2. Exploring a Dataset
Once we have installed all required packages for our article, we can explore the dataset we wish to analyze. In our case we use the Iris dataset - a classic among machine learning datasets - in which we try to classify different types of iris plants.
2.1. Load Data
import numpy as np
import pandas as pd

iris_df = pd.read_csv(iris_data.csv↩)
If needed we can always convert DataFrames to NumPy arrays and vice versa.
# .values gives the underlying NumPy array (as_matrix() is deprecated)
iris_array = iris_df.values
print(iris_array[:5, :])
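Going the other way works just as well: a NumPy array can be wrapped back into a DataFrame (a small sketch; the column names here are only placeholders, since we set the real ones below):

# rebuild a DataFrame from the NumPy array, passing placeholder column names
iris_from_array = pd.DataFrame(iris_array, columns=["c0", "c1", "c2", "c3", "c4"])
print(iris_from_array.head())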
But let's stay with Pandas for now and inspect the shape of our DataFrame. We can reference variables from other cells by typing in the cell name and selecting the variable from the dropdown menu.
n_examples, n_variables = iris_df.shape
print(n_examples, n_variables)
Our dataset consists of just n_examples examples with n_variables variables each (including the target). Let's see what the DataFrame looks like. Plotly's figure factory offers a convenient table display for that.
# plotly imports
import plotly.plotly as py
import plotly.figure_factory as ff
# plotly.graph_objs contains all the helper classes to make/style plots
from plotly.graph_objs import *

# display the dataframe only looking at the first 3 rows and 5 columns
ff.create_table(iris_df.iloc[:3, :].round(2), index=True)
Unfortunately the dataset has no column names set. Let's consult the dataset's website and set the columns.
column_list = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Iris Type']
iris_df.columns = column_list
Now our DataFrame should look better.
ff.create_table(iris_df.iloc[:3,:].round(2), index = True)
2.2. Visualize Data
Let's also visualize the data using whatever plotting library suits us, starting with the most commonly used one - matplotlib.
import matplotlib.pyplot as plt

fig_scatter, ax = plt.subplots()
for iris_type in iris_df["Iris Type"].unique():
    ax.scatter(x=iris_df["Sepal Length"].loc[iris_df["Iris Type"] == iris_type],
               y=iris_df["Sepal Width"].loc[iris_df["Iris Type"] == iris_type],
               label=iris_type)
ax.legend()
ax.grid(True)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
fig_scatter
Pretty good! Another way to visualize the data nicely is with seaborn.
import seaborn as sns

sns.jointplot(x="Sepal Length", y="Sepal Width", data=iris_df, size=5)
fig_jointplot = plt.gcf()
fig_jointplot
Now let's pick out a single variable and take a more detailed look through a box plot.
fig_boxplot, ax = plt.subplots()
ax = sns.boxplot(x="Iris Type", y="Petal Length", data=iris_df)
ax = sns.stripplot(x="Iris Type", y="Petal Length", data=iris_df,
                   jitter=True, edgecolor="gray")
plt.xlabel("")
fig_boxplot
But we have four variables to compare, not just one. Let's visualize them in pair plots.
sns.pairplot(iris_df, hue="Iris Type", size=3)
fig_pairplot = plt.gcf()
fig_pairplot
From those plots we can definitely see that there are features which separate our data. But we can do better.
3. Processing the Data
When using machine learning models to describe the data at hand, it is often necessary to preprocess the data beforehand to make learning easier for the model. scikit-learn offers not only machine learning APIs, but also a good collection of preprocessing tools. Let's put them to use.
from sklearn import preprocessing

scaled_data = preprocessing.scale(iris_df.drop("Iris Type", axis=1))
scaled_df = pd.DataFrame(scaled_data, columns=column_list[:4])
scaled_df = pd.concat([scaled_df, iris_df["Iris Type"]], axis=1)
ff.create_table(scaled_df.head(), index=True)
print(scaled_df.describe())
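As a quick sanity check, every scaled feature column should now have a mean of roughly zero and a population standard deviation of roughly one (a minimal sketch; preprocessing.scale standardizes with ddof=0):

# confirm that each scaled feature is centered with unit variance
for col in column_list[:4]:
    print(col, round(scaled_df[col].mean(), 6), round(scaled_df[col].std(ddof=0), 6))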
We scaled the data by centering each column on its mean and scaling it component-wise to unit variance. Let's look at the original and scaled distributions of the Sepal Width variable.
fig, axes = plt.subplots()
sns.despine(left=True)
sns.distplot(iris_df["Sepal Width"], color="g", ax=axes, label="original")
sns.distplot(scaled_df["Sepal Width"], color="b", ax=axes, label="scaled")
plt.setp(axes, yticks=[])
plt.tight_layout()
plt.legend()
fig
{ "thing1": axes.lines[0].get_xdata().tolist(), "thing2": axes.lines[0].get_xdata().tolist() }
var data = nilgreen-glade↩
var iris = data["thing1"]
var scaled = data["thing2"]

var traces = [
  {
    type: 'scatter',
    mode: 'lines',
    name: 'original',
    x: [...Array(iris.length).keys()],
    y: iris
  },
  {
    type: 'scatter',
    mode: 'lines',
    name: 'scaled',
    x: [...Array(scaled.length).keys()],
    y: scaled
  }
]
var layout = {
  title: 'Sepal Width: original vs. scaled',
  xaxis: { title: 'Sepal Width' }
}
Nextjournal.plot(traces, layout)
OK, this looks like something we can work with. Let's save the DataFrame.
4. Saving Results
With Nextjournal we can also directly save our results. If we want to save the scaled DataFrame from above, we simply write to the "/results" folder.
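For instance, writing the scaled DataFrame to a CSV file in that folder could look like this (a minimal sketch; the file name is just an illustration):

# write the scaled DataFrame into /results so it shows up as a downloadable result
scaled_df.to_csv("/results/scaled_iris.csv", index=False)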
The results saved that way are automatically displayed and can be downloaded right from the article.
Time to create your own data science workflow!