Getting Started with Python on Nextjournal

This article serves as a simple and concise starting point for Python programming tasks. Go ahead and make it your own - just 'Remix' and code away!

1.
Pre-Installed Packages and Manual Installation

All articles ship with many useful packages pre-installed, so we can use them right out of the box.

pip freeze

import numpy as np
import plotly

But even if our favorite package is missing, we can use bash cells to install it, since each runtime runs within a Docker container with its own filesystem. If we do the installation within a separate Setup runtime, we can then save and export that runtime as a custom environment for our Main runtime to use, and never have to run the installations again.

1.1.
Conda Package Manager

The default Python environment on Nextjournal includes the conda package manager. Let's say we want to create some statistical graphics using seaborn to explore and visualize a given dataset.

conda install -y seaborn

1.2.
The Python Package Index

Not all packages are available through the conda package manager. The Python Package Index (PyPI) hosts many more. Let's install two powerful machine learning libraries - XGBoost and scikit-learn - using the pip install command.

The XGBoost installation involves a library compilation, so first we'll have to install some build tools. The base system is Ubuntu, so we can use apt-get to install what we need.

apt-get update
apt-get install -y make g++

Now we can use pip.

pip install xgboost scikit-learn

1.3.
Installing from GitHub

Since we have complete control over the filesystem, we can also clone and install packages from GitHub. Let's install XGBoost again, this time directly from GitHub.

apt-get install -y git
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost; make -j4
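
Building the library is only half of it: to actually import xgboost from Python we also need to install the Python package that ships with the repository. A minimal sketch, assuming the repository still contains a python-package directory with its setup.py (the layout may have changed, so check the current build docs):

cd xgboost/python-package
python setup.py install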

Now we can save this runtime's state as a new environment, preserving our installed packages for use by other runtimes, here or in other articles.

2.
Exploring a Dataset

Once we have installed all the required packages for our article, we can explore the dataset we wish to analyze. In our case we use the Iris dataset - a classic among machine learning datasets - in which we try to classify different types of iris plants.

2.1.
Load Data

iris_data.csv

We can read our dataset using Pandas' read_csv and store it as a DataFrame. Pandas, which is built on top of NumPy, offers some higher-level data manipulation tools.

import numpy as np
import pandas as pd

# iris_data.csv is the Nextjournal file reference to the dataset uploaded above
iris_df = pd.read_csv(iris_data.csv)

If needed, we can always convert DataFrames to NumPy arrays and vice versa.

# as_matrix() was removed in newer pandas versions; to_numpy() is the current API
iris_array = iris_df.to_numpy()
print(iris_array[:5,:])
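
For the other direction - turning a plain NumPy array back into a DataFrame - a minimal sketch that simply reuses the column labels from iris_df:

# rebuild a DataFrame from the array, reusing the original column labels
back_to_df = pd.DataFrame(iris_array, columns=iris_df.columns)
print(back_to_df.head())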

But let's stay with Pandas for now and inspect the shape of our DataFrame. We can reference variables from other cells by typing in the cell name and selecting the variable from the dropdown menu.

n_examples, n_variables = (iris_df.shape)
print(n_examples, n_variables)

Our dataset is small - only around 150 examples with 5 variables each (including the target). Let's see what the DataFrame looks like. Plotly's figure factory offers a handy table view for that.

# plotly imports
import plotly.figure_factory as ff
# plotly.graph_objs contains all the helper classes to make/style plots
from plotly.graph_objs import *

# display the dataframe only looking at the first 3 rows and 5 columns
ff.create_table(iris_df.iloc[:3,:].round(2), 
                index = True)

Unfortunately, the dataset has no column names set. Let's consult the dataset's website and set them ourselves.

column_list = ['Sepal Length', 'Sepal Width', 
               'Petal Length', 'Petal Width', 'Iris Type']

iris_df.columns = column_list

Now our DataFrame should look better.

ff.create_table(iris_df.iloc[:3,:].round(2), 
                index = True)

2.2.
Visualize Data

Let's also visualize the data using whatever plotting library suits us, starting with the most commonly used: matplotlib.

import matplotlib.pyplot as plt

fig_scatter, ax = plt.subplots()
for iris_type in iris_df["Iris Type"].unique():
    # plot each iris type as its own series so it gets a separate color and legend entry
    subset = iris_df[iris_df["Iris Type"] == iris_type]
    ax.scatter(x=subset["Sepal Length"], y=subset["Sepal Width"],
               label=iris_type)
ax.legend()
ax.grid(True)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")

fig_scatter

Pretty good! Another nice way to visualize the data is with seaborn.

import seaborn as sns

# note: older seaborn versions used size= instead of height=
sns.jointplot(x="Sepal Length", y="Sepal Width", data=iris_df, height=5)

fig_jointplot = plt.gcf()

fig_jointplot

Now let's pick out a single variable and take a more detailed look through a box plot.

fig_boxplot, ax = plt.subplots()

ax = sns.boxplot(x="Iris Type", y="Petal Length", data=iris_df)
ax = sns.stripplot(x="Iris Type", y="Petal Length", data=iris_df, 
                   jitter=True, edgecolor="gray")
plt.xlabel("")

fig_boxplot

But we have four variables to compare, not just one. Let's visualize them all in a pair plot.

# note: older seaborn versions used size= instead of height=
sns.pairplot(iris_df, hue="Iris Type", height=3)

fig_pairplot = plt.gcf()

fig_pairplot

From these plots we can definitely see that there are features which separate our classes. But we can do better.

3.
Processing the Data

When using machine learning models to describe the data at hand, it is often necessary to preprocess the data to make the learning problem easier for the model. scikit-learn offers not only machine learning APIs, but also a good collection of preprocessing tools. Let's put them to use.

from sklearn import preprocessing

scaled_data = preprocessing.scale(iris_df.drop("Iris Type", axis=1))

scaled_df = pd.DataFrame(scaled_data, columns=column_list[:4])
scaled_df = pd.concat([scaled_df, iris_df["Iris Type"]], axis=1)

ff.create_table(scaled_df.head(), index = True)
print(scaled_df.describe())
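
preprocessing.scale is a one-off function. If we later want to apply exactly the same transformation to new data (say, a held-out test set), scikit-learn's StandardScaler estimator is the more common idiom - a minimal sketch using the same feature columns as above:

from sklearn.preprocessing import StandardScaler

# fit the scaler on the four feature columns; the fitted object can be reused on new data
scaler = StandardScaler()
scaled_again = scaler.fit_transform(iris_df[column_list[:4]])

# the learned statistics are stored on the fitted scaler
print(scaler.mean_)
print(scaler.scale_)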

We scaled the data by centering each column on its mean and scaling it component-wise to unit variance. Let's look at the original and scaled distributions of the Sepal Width variable.

fig, axes = plt.subplots()
sns.despine(left=True)

sns.distplot(iris_df["Sepal Width"], color="g", ax=axes, 
             label="original")
sns.distplot(scaled_df["Sepal Width"], color="b", ax=axes, 
             label="scaled")

plt.setp(axes, yticks=[])
plt.tight_layout()
plt.legend()

fig
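
Besides the visual comparison, a quick numeric check confirms the effect; note that pandas' std() uses the sample estimate, so the scaled values land close to, but not exactly at, 1.

# after scaling, the feature means should be ~0 and the standard deviations ~1
print(scaled_df[column_list[:4]].mean().round(3))
print(scaled_df[column_list[:4]].std().round(3))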

Nextjournal also lets us pass results between languages. The dict returned by this Python cell - the two density curves that distplot drew above - can be referenced from a JavaScript cell and charted with Plotly.

# return the density curves drawn above so another cell can use them;
# lines[0] and lines[1] are the original and scaled "Sepal Width" KDE curves
{
  "original": {"x": axes.lines[0].get_xdata().tolist(),
               "y": axes.lines[0].get_ydata().tolist()},
  "scaled": {"x": axes.lines[1].get_xdata().tolist(),
             "y": axes.lines[1].get_ydata().tolist()}
}

// green-glade references the result of the Python cell above
var data = green-glade

var original = data["original"]
var scaled = data["scaled"]

var traces = [
  { type: 'scatter', mode: 'lines', name: 'original',
    x: original.x, y: original.y },
  { type: 'scatter', mode: 'lines', name: 'scaled',
    x: scaled.x, y: scaled.y }
]

var layout = {
  title: 'Sepal Width: original vs. scaled',
  xaxis: { title: 'Sepal Width' }
}

Nextjournal.plot(traces, layout)

OK, this looks like something we can work with. Let's save the DataFrame.

4.
Saving Results

With Nextjournal we can also save our results directly. If we want to save the scaled DataFrame from above, we simply write it to the "/results" folder.

scaled_df.to_csv("/results/scaled_iris_data.csv")
scaled_iris_data.csv

The results saved that way are automatically displayed and can be downloaded right from the article.
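
To double-check what was written, we can read the file straight back in - a minimal sketch:

# read the saved results file back into a DataFrame (index_col=0 restores the saved index)
check_df = pd.read_csv("/results/scaled_iris_data.csv", index_col=0)
print(check_df.head())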

Time to create your own data science workflow!