Getting Started with Python on Nextjournal

pip freeze

This article serves as a simple and concise starting point for Python programming tasks. Go ahead and make it your own - just 'Remix' and code away!

Pre-Installed Packages and Manual Installation

All articles ship with a lot of useful packages already pre-installed (see list above), so we can simply use them out of the box.

import numpy as np
import pandas as pd
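To check whether a given package is already available before trying to install it, a small helper like the following can be used. This is a minimal sketch: the helper name `is_installed` and the package names in the loop are illustrative assumptions, not part of Nextjournal.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# the package names below are just examples
for name in ["numpy", "pandas", "seaborn"]:
    print(name, "available" if is_installed(name) else "missing")
```

The check only locates the module and does not import it, so it stays fast even for large packages.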

But even if we're missing our favorite package, we can make use of bash cells to install packages since all articles have their own filesystem within a Docker container. If we do our installation within a separate Setup runtime, we can then save and export that runtime as a custom environment for use by our Main runtime, and never have to run the installations again.

Conda Package Manager

Nextjournal pre-installs the conda package manager. Let's say we want to create some statistical graphics using seaborn to explore and visualize a given dataset.

conda install seaborn

The Python Package Index

Not all packages are available in the conda package manager. The Python Package Index (PyPI) spans many more packages. Let's install two powerful machine learning libraries - XGBoost and scikit-learn - using the pip install command.

pip install xgboost scikit-learn

Installing from GitHub

Since we have complete control over the filesystem, we can also clone and install packages from GitHub. Let's try to install XGBoost again, this time directly from GitHub.

git clone --recursive
cd xgboost; make -j4

Exploring a Dataset

Once we have installed all required packages for our article, we can explore the dataset we wish to analyze. In our case we use the Iris dataset, a classic machine learning dataset, where we try to classify different types of iris plants.

Load Data


We can read our dataset using Pandas' read_csv and store it as a DataFrame. Pandas, which is built on top of NumPy, offers some higher level data manipulation tools.

iris_df = pd.read_csv("iris_data.csv")

If needed we can always convert DataFrames to NumPy arrays and vice versa.

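# as_matrix() was removed from pandas; to_numpy() replaces it
# (on older pandas versions, iris_df.values does the same)
iris_array = iris_df.to_numpy()
print(iris_array[:5, :])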
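The "vice versa" direction can be sketched as a round trip. The small DataFrame below is a stand-in with made-up values, and `to_numpy()` assumes pandas 0.24 or newer.

```python
import numpy as np
import pandas as pd

# small stand-in DataFrame; the values are illustrative only
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# DataFrame -> NumPy array (the column labels are dropped)
arr = df.to_numpy()

# NumPy array -> DataFrame (the labels have to be supplied again)
df_again = pd.DataFrame(arr, columns=df.columns)

print(type(arr).__name__, arr.shape)
print(df_again)
```

With a mixed-type DataFrame such as one holding both numbers and strings, the resulting array has dtype `object`, so numeric columns are usually converted on their own.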

But let's stay with Pandas for now and inspect the shape of our DataFrame. We can reference variables from other cells by typing in the cell name and selecting the variable from the dropdown menu.

n_examples, n_variables = (load data.iris_df.shape)
print(n_examples, n_variables)

Our dataset consists of only 149 examples with 5 variables each (including the target). Let's see what the DataFrame looks like. Plotly's figure factory offers a table helper for that.

# plotly imports
import plotly.plotly as py
import plotly.figure_factory as ff
# plotly.graph_objs contains all the helper classes to make/style plots
from plotly.graph_objs import *

# display the dataframe only looking at the first 3 rows and 5 columns
ff.create_table(load data.iris_df.iloc[:3,:].round(2), 
                index = True)

Unfortunately the dataset has no column names set. Let's consult the dataset's website and set the columns.

column_list = ['Sepal Length', 'Sepal Width', 
               'Petal Length', 'Petal Width', 'Iris Type']

iris_df.columns = column_list

Now our DataFrame should look better.

ff.create_table(set columns.iris_df.iloc[:3,:].round(2), 
                index = True)
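The column names can also be supplied while loading. By default `read_csv` treats the first line of a file as the header, so a file without a header line loses one data row to the column labels. This may be why the shape reported earlier is 149 rather than the 150 rows the Iris dataset is usually described as having. The sketch below uses a three-line in-memory sample in the same layout instead of the real file; the variable names are illustrative.

```python
import io

import pandas as pd

# three rows in the same layout as the Iris file, with no header line
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

column_names = ["Sepal Length", "Sepal Width",
                "Petal Length", "Petal Width", "Iris Type"]

# default behaviour: the first line is consumed as the header
default_df = pd.read_csv(sample_csv)

# rewind the buffer and read again, this time naming the columns ourselves
sample_csv.seek(0)
named_df = pd.read_csv(sample_csv, header=None, names=column_names)

print(default_df.shape)  # one row fewer than the sample has lines
print(named_df.shape)    # every line is kept as data
```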

Visualize Data

Let's also visualize the data using whatever plotting library suits us. We start with the most commonly used one, matplotlib.

import matplotlib.pyplot as plt

fig_scatter, ax = plt.subplots()
for iris_type in iris_df["Iris Type"].unique():
    # select the rows that belong to the current iris type
    subset = iris_df[iris_df["Iris Type"] == iris_type]
    ax.scatter(x=subset["Sepal Length"], y=subset["Sepal Width"],
               label=iris_type)
ax.set_xlabel("Sepal Length")
ax.set_ylabel("Sepal Width")
# show which color belongs to which iris type
ax.legend()


Pretty good! Another way to visualize the data in a nice fashion is seaborn.

import seaborn as sns

# seaborn 0.9 renamed the `size` argument to `height`
sns.jointplot(x="Sepal Length", y="Sepal Width", data=iris_df, height=5)

fig_jointplot = plt.gcf()


Now let's pick out a single variable and take a more detailed look through a box plot.

fig_boxplot, ax = plt.subplots()

ax = sns.boxplot(x="Iris Type", y="Petal Length", data=iris_df)
ax = sns.stripplot(x="Iris Type", y="Petal Length", data=iris_df, 
                   jitter=True, edgecolor="gray")


But we have four variables to compare, not just one. Let's visualize them in pair plots.

# seaborn 0.9 renamed the `size` argument to `height`
sns.pairplot(iris_df, hue="Iris Type", height=3)

fig_pairplot = plt.gcf()


From those plots we can definitely see that there are features which separate our data. But we can do better.

Processing the Data

When using machine learning models to describe the data at hand it is often necessary to process the data beforehand to facilitate the learning process for the model. scikit-learn offers not only machine learning APIs, but also a good collection of preprocessing tools. Let's put them to use.

from sklearn import preprocessing

scaled_data = preprocessing.scale(iris_df.drop("Iris Type", axis=1))

scaled_df = pd.DataFrame(scaled_data, columns=column_list[:4])
scaled_df = pd.concat([scaled_df, iris_df["Iris Type"]], axis=1)

ff.create_table(scaled_df.head(), index = True)
print(scaled_df.describe())

We scaled the data by centering each column on its mean and scaling it to unit variance. Let's look at the original and scaled distributions of the Sepal Width variable.

fig, axes = plt.subplots()

sns.distplot(iris_df["Sepal Width"], color="g", ax=axes)
sns.distplot(scaled_df["Sepal Width"], color="b", ax=axes)

plt.setp(axes, yticks=[])
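
The scaling step can also be reproduced by hand, which makes it easier to see what "unit variance" means here. The sketch below uses a small made-up array rather than the Iris data and assumes the population standard deviation (`ddof=0`), which is what scikit-learn's `preprocessing.scale` is documented to use.

```python
import numpy as np

# illustrative values only, not taken from the dataset
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# center each column on its mean, then divide by its standard deviation
# (population standard deviation, ddof=0)
scaled = (x - x.mean(axis=0)) / x.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```

Note that pandas' `DataFrame.std()` defaults to `ddof=1`, so the standard deviations shown by `describe()` on the scaled data can come out slightly above 1.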


OK, this looks like something we can work with. Let's save the DataFrame.

Saving Results

With Nextjournal we can also directly save our results. If we want to save the scaled DataFrame from above we simply write to the "/results" folder.
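
A minimal sketch of such a save, assuming CSV is an acceptable output format. The helper name `save_results` and the file name `scaled_iris.csv` mentioned afterwards are made up for illustration; only the "/results" folder comes from the article.

```python
import os

import pandas as pd


def save_results(df, filename, results_dir="/results"):
    """Write df as a CSV file into results_dir and return the path written."""
    # the folder name is taken from the article; create it if it is missing
    os.makedirs(results_dir, exist_ok=True)
    path = os.path.join(results_dir, filename)
    # index=False leaves out the row index, which carries no information here
    df.to_csv(path, index=False)
    return path
```

In the article this could be called as `save_results(scaled_df, "scaled_iris.csv")`.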


The results saved that way are automatically displayed and can be downloaded right from the article.

Time to create your own data science workflow!