# Getting Started with Python on Nextjournal

Gregor Koehler

This article serves as a simple and concise starting point for Python programming tasks. Go ahead and make it your own - just 'Remix' and code away!

# Pre-Installed Packages and Manual Installation

All articles ship with a lot of useful packages already pre-installed, so we can simply use them out of the box.

```bash id=065f7282-30ac-4ea4-a5ec-0082f67a7425
pip freeze
```

```python id=b2bc8a41-70dd-4d7e-8e78-f98f1fd4b0ba
import numpy as np
import plotly
```

But even if we're missing our favorite package, we can make use of bash cells to install packages since each cell runs within a [Docker](https://www.docker.com/what-docker) container with its own filesystem. If we do our installation within a separate Setup runtime, we can then save and export that runtime as a **custom environment** for use by our Main runtime, and never have to run the installations again.

## Conda Package Manager

The default Python environment on Nextjournal includes the [conda package manager](https://conda.io/docs/). Let's say we want to create some statistical graphics using [seaborn](http://seaborn.pydata.org) to explore and visualize some given dataset.

```bash id=769adb00-18e2-4ed7-8af6-06ccf04f7dfa
conda install -y seaborn
```

## The Python Package Index

Not all packages are available in the conda package manager. The [Python Package Index (PyPI)](https://pypi.python.org/pypi) spans many more packages. Let's install two powerful machine learning libraries - [XGBoost](http://xgboost.readthedocs.io/en/latest/) and [scikit-learn](http://scikit-learn.org/stable/) - using the `pip install` command.

The XGBoost installation involves a library compilation, so first we'll have to install some build tools. The base system is Ubuntu, so we can use `apt-get` to install what we need.

```bash id=8af414d7-635a-4caa-8dea-64806d692d39
apt-get update
apt-get install -y make g++
```

Now we can use pip.
```bash id=1f3c672a-d672-4b07-ad8c-5b69ece66ff2
pip install xgboost scikit-learn
```

## Installing from GitHub

Since we have complete control over the filesystem, we can also clone and install packages from GitHub. Let's try to install XGBoost again, this time directly from GitHub.

```bash id=c2d3111e-33d2-44e2-975e-846e7aa0aa9e
apt-get install -y git
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost; make -j4
```

Now we save this runtime's state as a **new environment**, preserving our installed packages for use by other runtimes, here or in other articles.

# Exploring a Dataset

Once we have installed all required packages for our article, we can explore the dataset we wish to analyze. In our case we use the [Iris Dataset](https://archive.ics.uci.edu/ml/datasets/Iris) - a classic machine learning dataset - in which the task is to classify different types of iris plants.

## Load Data

[iris_data.csv][nextjournal#file#2a96c883-c41d-4a37-b29c-c14ed32d7857]

We can read our dataset using Pandas' `read_csv` and store it as a DataFrame. [Pandas](https://pandas.pydata.org), which is built on top of [NumPy](http://www.numpy.org), offers some higher level data manipulation tools.

```python id=8a823a41-93dc-4f7d-a84b-7e5f84c3529f
import numpy as np
import pandas as pd

iris_df = pd.read_csv([reference][nextjournal#reference#9dd83147-e046-47fb-899e-e9b2d7bf66b2])
```

If needed we can always convert DataFrames to NumPy arrays and vice versa.

```python id=3502e232-627c-48a0-bb17-88da498f1ae9
# .values returns the underlying NumPy array
# (as_matrix() is deprecated and removed in newer pandas releases)
iris_array = iris_df.values
print(iris_array[:5, :])
```

But let's stay with Pandas for now and inspect the shape of our DataFrame. We can reference variables from other cells by typing in the cell name and selecting the variable from the dropdown menu.
```python id=de93fa39-73b7-4ac3-ade3-04cd8ff3ead4
# shape is a (rows, columns) tuple
n_examples, n_variables = iris_df.shape
print(n_examples, n_variables)
```

Our dataset only consists of [reference][nextjournal#reference#d0caa040-11ac-49f4-a6e2-261c43654eb9] examples with [reference][nextjournal#reference#701d574d-104f-4590-96dc-5556272578e5] variables each (including the target). Let's see what the DataFrame looks like. Plotly's figure factory offers `create_table` for that.

```python id=473c475d-d5c8-4192-924c-5f9f554091e2
# plotly imports
import plotly.figure_factory as ff

# plotly.graph_objs contains all the helper classes to make/style plots
from plotly.graph_objs import *

# display only the first 3 rows of the dataframe (all 5 columns)
ff.create_table(iris_df.iloc[:3, :].round(2), index=True)
```

Unfortunately the dataset has no column names set. Let's consult the dataset's website and set the columns.

```python id=c5f2751a-6982-4f3d-84ca-de85b1c8df2c
column_list = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Iris Type']
iris_df.columns = column_list
```

Now our DataFrame should look better.

```python id=8ba362ad-0890-48a4-b75b-98bd71ef36b0
ff.create_table(iris_df.iloc[:3, :].round(2), index=True)
```

## Visualize Data

Let's also visualize the data using whatever plotting library suits us, starting with the most commonly used one - [matplotlib](https://matplotlib.org).

```python id=224a0436-1474-4440-9daa-60a654472bb7
import matplotlib.pyplot as plt

fig_scatter, ax = plt.subplots()

# draw one scatter series per iris type so each gets its own color and legend entry
for iris_type in iris_df["Iris Type"].unique():
    ax.scatter(x=iris_df["Sepal Length"].loc[iris_df["Iris Type"] == iris_type],
               y=iris_df["Sepal Width"].loc[iris_df["Iris Type"] == iris_type],
               label=iris_type)

ax.legend()
ax.grid(True)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")

fig_scatter
```

Pretty good! Another way to visualize the data in a nice fashion is [seaborn](https://seaborn.pydata.org).
```python id=3a8da2e0-b30c-4e0d-89a6-1197e2d48d59
import seaborn as sns

# `height` replaced the older `size` argument in seaborn 0.9
sns.jointplot(x="Sepal Length", y="Sepal Width", data=iris_df, height=5)

fig_jointplot = plt.gcf()
fig_jointplot
```

Now let's pick out a single variable and take a more detailed look through a [box plot](https://en.wikipedia.org/wiki/Box_plot).

```python id=43ca0c2c-2c2a-443c-8e1b-86061eb67cdf
fig_boxplot, ax = plt.subplots()

# box plot per iris type, with the individual observations overlaid as jittered points
ax = sns.boxplot(x="Iris Type", y="Petal Length", data=iris_df)
ax = sns.stripplot(x="Iris Type", y="Petal Length", data=iris_df,
                   jitter=True, edgecolor="gray")
plt.xlabel("")

fig_boxplot
```

But we have four variables to compare, not just one. Let's visualize them in pair plots.

```python id=5a671a32-fe8a-4a4a-bc69-ad82b8b6eae4
# `height` replaced the older `size` argument in seaborn 0.9
sns.pairplot(iris_df, hue="Iris Type", height=3)

fig_pairplot = plt.gcf()
fig_pairplot
```

From those plots we can definitely see that there are features which separate our data. But we can do better.

# Processing the Data

When using machine learning models to describe the data at hand, it is often necessary to process the data beforehand to facilitate the learning process for the model. [scikit-learn](http://scikit-learn.org/stable/) offers not only machine learning APIs, but also a good collection of [preprocessing tools](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing). Let's put them to use.

```python id=79da0cc5-a8e6-4ffa-bcf3-7f5b087db6c0
from sklearn import preprocessing

# scale the four numeric columns, then re-attach the target column
scaled_data = preprocessing.scale(iris_df.drop("Iris Type", axis=1))
scaled_df = pd.DataFrame(scaled_data, columns=column_list[:4])
scaled_df = pd.concat([scaled_df, iris_df["Iris Type"]], axis=1)

ff.create_table(scaled_df.head(), index=True)
```

```python id=133ae426-e3c2-48bb-8409-f873dbdbe55e
print(scaled_df.describe())
```

We scaled the data by centering each column on its mean and scaling it to unit variance. Let's look at the original and scaled distributions of the `Sepal Width` variable.
```python id=47446733-5939-4f19-8e15-30291f7ac90e
fig, axes = plt.subplots()
sns.despine(left=True)

# draw both distributions on the same axes for comparison
sns.distplot(iris_df["Sepal Width"], color="g", ax=axes, label="original")
sns.distplot(scaled_df["Sepal Width"], color="b", ax=axes, label="scaled")

plt.setp(axes, yticks=[])
plt.tight_layout()
plt.legend()

fig
```

```python id=9bb961e4-08d2-404c-b5ae-102b4f47dadd
# x-values of the two density curves drawn above:
# lines[0] belongs to the original data, lines[1] to the scaled data
{
  "thing1": axes.lines[0].get_xdata().tolist(),
  "thing2": axes.lines[1].get_xdata().tolist()
}
```

```javascript id=7dcd3eea-f2f5-448f-be29-15221339d076
var data = [reference][nextjournal#reference#d3ac9bce-5f78-4af9-a0bf-4e467e83dbb7]
var iris = data["thing1"]
var scaled = data["thing2"]

// one line trace per exported array, plotted against the point index
var traces = [
  {
    type: 'scatter',
    mode: 'lines',
    name: 'original',
    x: [...Array(iris.length).keys()],
    y: iris
  },
  {
    type: 'scatter',
    mode: 'lines',
    name: 'scaled',
    x: [...Array(scaled.length).keys()],
    y: scaled
  }
]

var layout = {
  title: 'Sepal Width: original vs. scaled',
  xaxis: { title: 'Point index' },
  yaxis: { title: 'Sepal Width' }
}

Nextjournal.plot(traces, layout)
```

OK, this looks like something we can work with. Let's save the DataFrame.

# Saving Results

With Nextjournal we can also directly save our results. If we want to save the scaled DataFrame from above, we simply write to the `"/results"` folder.

```python id=701a52e5-361c-4c3a-8167-fe35970add2c
scaled_df.to_csv("/results/scaled_iris_data.csv")
```

[scaled_iris_data.csv][nextjournal#output#701a52e5-361c-4c3a-8167-fe35970add2c#scaled_iris_data.csv]

Results saved this way are automatically displayed and can be downloaded right from the article. Time to create your own data science workflow!
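
As a starting point for such a workflow, the scaling step from the Processing the Data section (centering a column on its mean and scaling it to unit variance) can be reproduced by hand. The following is a minimal sketch using only the standard library; the helper name `standardize` and the sample values are made up for illustration and are not taken from the Iris file. It assumes the population standard deviation, which is what `preprocessing.scale` is understood to use.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Center a list of numbers on its mean and scale it to unit variance."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation (divides by n)
    if std == 0:
        # A constant column has no spread; leave it centered at zero
        std = 1.0
    return [(value - mean) / std for value in column]


# Hypothetical measurements, for illustration only
sepal_widths = [3.5, 3.0, 3.2, 3.1, 3.6]
scaled_widths = standardize(sepal_widths)

# After scaling, the mean should be close to 0 and the standard deviation close to 1
print(fmean(scaled_widths), pstdev(scaled_widths))
```

Comparing these two numbers with the `mean` and `std` rows of `scaled_df.describe()` is a quick sanity check; note that pandas reports the sample standard deviation there, so its `std` value will sit slightly above 1.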
[nextjournal#file#2a96c883-c41d-4a37-b29c-c14ed32d7857]:
[nextjournal#reference#9dd83147-e046-47fb-899e-e9b2d7bf66b2]: <#nextjournal#reference#9dd83147-e046-47fb-899e-e9b2d7bf66b2>
[nextjournal#reference#d0caa040-11ac-49f4-a6e2-261c43654eb9]: <#nextjournal#reference#d0caa040-11ac-49f4-a6e2-261c43654eb9>
[nextjournal#reference#701d574d-104f-4590-96dc-5556272578e5]: <#nextjournal#reference#701d574d-104f-4590-96dc-5556272578e5>
[nextjournal#reference#d3ac9bce-5f78-4af9-a0bf-4e467e83dbb7]: <#nextjournal#reference#d3ac9bce-5f78-4af9-a0bf-4e467e83dbb7>
[nextjournal#output#701a52e5-361c-4c3a-8167-fe35970add2c#scaled_iris_data.csv]:

This notebook was exported from https://nextjournal.com/a/C6Mo8UvDy78LYkYHa5r8cy?change-id=CFVKLTdK7F72Ax5uthn84X