Getting Started with Python on Nextjournal
This article serves as a simple and concise starting point for Python programming tasks. Go ahead and make it your own - just 'Remix' and code away!
1. Pre-Installed Packages and Manual Installation
All articles ship with many useful packages pre-installed, so we can use them right out of the box.
pip freeze
import numpy as np
import plotly
But even if we're missing our favorite package, we can use bash cells to install it, since each cell runs within a Docker container with its own filesystem. If we do our installation in a separate Setup runtime, we can then save and export that runtime as a custom environment for use by our Main runtime, and never have to run the installations again.
1.1. Conda Package Manager
The default Python environment on Nextjournal includes the conda package manager. Let's say we want to create some statistical graphics using seaborn to explore and visualize some given dataset.
conda install -y seaborn
1.2. The Python Package Index
Not all packages are available in the conda package manager. The Python Package Index (PyPI) spans many more packages. Let's install two powerful machine learning libraries - XGBoost and scikit-learn - using the pip install command.
The XGBoost installation involves compiling a library, so first we'll have to install some build tools. The base system is Ubuntu, so we can use apt-get to install what we need.
apt-get update
apt-get install -y make g++
Now we can use pip.
pip install xgboost scikit-learn
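To double-check that both libraries are now importable, we can run a quick version check (a minimal sketch; the printed versions depend on what pip just installed):

import xgboost
import sklearn

# print the installed versions to confirm the installation worked
print(xgboost.__version__)
print(sklearn.__version__)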
1.3. Installing from GitHub
Since we have complete control over the filesystem, we can also clone and build packages directly from GitHub. Let's install XGBoost again, this time from its GitHub repository.
apt-get install -y git
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost; make -j4
Now we Save this runtime's state as a new environment, preserving our installed packages for use by other runtimes, here or in other articles.
2. Exploring a Dataset
Once we have installed all required packages for our article, we can explore the dataset we wish to analyze. In our case we use the Iris dataset - a classic among machine learning datasets - in which we try to classify different types of iris plants.
2.1. Load Data
import numpy as np
import pandas as pd

iris_df = pd.read_csv(iris_data.csv↩)
If needed we can always convert DataFrames to NumPy arrays and vice versa.
# .values gives the underlying NumPy array (as_matrix() is deprecated)
iris_array = iris_df.values
print(iris_array[:5, :])
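Going the other way works just as well: a NumPy array can be wrapped back into a DataFrame (a small sketch; the column names here are only placeholders, since we set the real ones below):

# rebuild a DataFrame from the NumPy array, passing placeholder column names
iris_from_array = pd.DataFrame(iris_array, columns=["c0", "c1", "c2", "c3", "c4"])
print(iris_from_array.head())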
But let's stay with Pandas for now and inspect the shape of our DataFrame. We can reference variables from other cells by typing in the cell name and selecting the variable from the dropdown menu.
n_examples, n_variables = iris_df.shape
print(n_examples, n_variables)
Our dataset consists of just n_examples examples with n_variables variables each (including the target). Let's see what the DataFrame looks like. Plotly's figure factory offers a convenient table display for that.
# plotly imports
import plotly.plotly as py
import plotly.figure_factory as ff
# plotly.graph_objs contains all the helper classes to make/style plots
from plotly.graph_objs import *

# display the dataframe only looking at the first 3 rows and 5 columns
ff.create_table(iris_df.iloc[:3, :].round(2), index=True)
Unfortunately the dataset has no column names set. Let's consult the dataset's website and set the columns.
column_list = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Iris Type']
iris_df.columns = column_list
Now our DataFrame should look better.
ff.create_table(iris_df.iloc[:3,:].round(2), index = True)
2.2. Visualize Data
Let's also visualize the data using whatever plotting library suits us, starting with the most commonly used one - matplotlib.
import matplotlib.pyplot as plt

fig_scatter, ax = plt.subplots()
for iris_type in iris_df["Iris Type"].unique():
    ax.scatter(x=iris_df["Sepal Length"].loc[iris_df["Iris Type"] == iris_type],
               y=iris_df["Sepal Width"].loc[iris_df["Iris Type"] == iris_type],
               label=iris_type)
ax.legend()
ax.grid(True)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
fig_scatter
Pretty good! Another way to visualize the data nicely is with seaborn.
import seaborn as sns

sns.jointplot(x="Sepal Length", y="Sepal Width", data=iris_df, size=5)
fig_jointplot = plt.gcf()
fig_jointplot
Now let's pick out a single variable and take a more detailed look through a box plot.
fig_boxplot, ax = plt.subplots()
ax = sns.boxplot(x="Iris Type", y="Petal Length", data=iris_df)
ax = sns.stripplot(x="Iris Type", y="Petal Length", data=iris_df,
                   jitter=True, edgecolor="gray")
plt.xlabel("")
fig_boxplot
But we have four variables to compare, not just one. Let's visualize them in pair plots.
sns.pairplot(iris_df, hue="Iris Type", size=3)
fig_pairplot = plt.gcf()
fig_pairplot
From those plots we can definitely see that there are features which separate our data. But we can do better.
3. Processing the Data
When using machine learning models to describe the data at hand, it is often necessary to preprocess the data beforehand to make learning easier for the model. scikit-learn offers not only machine learning APIs, but also a good collection of preprocessing tools. Let's put them to use.
from sklearn import preprocessing

scaled_data = preprocessing.scale(iris_df.drop("Iris Type", axis=1))
scaled_df = pd.DataFrame(scaled_data, columns=column_list[:4])
scaled_df = pd.concat([scaled_df, iris_df["Iris Type"]], axis=1)
ff.create_table(scaled_df.head(), index=True)
print(scaled_df.describe())
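As a quick sanity check, every scaled feature column should now have a mean of roughly zero and a population standard deviation of roughly one (a minimal sketch; preprocessing.scale standardizes with ddof=0):

# confirm that each scaled feature is centered with unit variance
for col in column_list[:4]:
    print(col, round(scaled_df[col].mean(), 6), round(scaled_df[col].std(ddof=0), 6))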
We scaled the data by centering each column on its mean and scaling it component-wise to unit variance. Let's look at the original and scaled distributions of the Sepal Width variable.
fig, axes = plt.subplots()
sns.despine(left=True)
sns.distplot(iris_df["Sepal Width"], color="g", ax=axes, label="original")
sns.distplot(scaled_df["Sepal Width"], color="b", ax=axes, label="scaled")
plt.setp(axes, yticks=[])
plt.tight_layout()
plt.legend()
fig
{ "thing1": axes.lines[0].get_xdata().tolist(), "thing2": axes.lines[0].get_xdata().tolist() }
var data = nilgreen-glade↩
var iris = data["thing1"]
var scaled = data["thing2"]

var traces = [
  {
    type: 'scatter',
    mode: 'lines',
    name: 'original',
    x: [...Array(iris.length).keys()],
    y: iris
  },
  {
    type: 'scatter',
    mode: 'lines',
    name: 'scaled',
    x: [...Array(scaled.length).keys()],
    y: scaled
  }
]
var layout = {
  title: 'Sepal Width: original vs. scaled',
  xaxis: { title: 'Sepal Width' }
}
Nextjournal.plot(traces, layout)
OK, this looks like something we can work with. Let's save the DataFrame.
4. Saving Results
With Nextjournal we can also directly save our results. If we want to save the scaled DataFrame from above, we simply write to the "/results" folder.
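For instance, writing the scaled DataFrame to a CSV file in that folder could look like this (a minimal sketch; the file name is just an illustration):

# write the scaled DataFrame into /results so it shows up as a downloadable result
scaled_df.to_csv("/results/scaled_iris.csv", index=False)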
The results saved that way are automatically displayed and can be downloaded right from the article.
Time to create your own data science workflow!