David Schmudde / Aug 06 2019

Data science intro with panthera

Clojure + Pandas + Numpy = 💖

I'll show how to get the most out of the Pandas and Clojure ecosystems at the same time.

This intro is based on this Kaggle notebook; you can follow along with it if you come from the Python world.

Env setup

The easiest way to go is the provided Docker image, but if you want to set up your own machine just follow along.

System install

If you want to install everything at the system level you should do something equivalent to what we do below:

sudo apt-get update
sudo apt-get install libpython3.6-dev
pip3 install numpy pandas

conda

To work within a conda environment just create a new one with:

conda create -n panthera python=3.6 numpy pandas
conda activate panthera

Then start your REPL from the activated conda environment. This is the best way to install the requirements for panthera, because in the process you also get MKL with Numpy.

Here

Now require panthera main API namespace and define a little helper to better inspect data-frames

(require '[panthera.panthera :as pt])
;; (require '[clojupyter.display :as display])
(require '[libpython-clj.python :as py])

(defn show
  [obj]
  (print obj)
  #_(display/html
    (py/call-attr obj "to_html")))
user/show
;; (helper/add-dependencies '[metasoarous/oz "1.5.4"])
;; (require '[oz.notebook.clojupyter :as oz])

A brief primer

pokemon.csv

We will work with Pokemons! Datasets are available here.

We can read data into panthera from various formats; one of the most used functions for this is read-csv. Most panthera functions accept a data-frame or a series as their first argument, followed by one or more required arguments and then a map of options.

To see which options are available you can check the docs or even the original Pandas docs. Just remember that if you pass keywords they'll be converted to Python names automatically (for example :index-col becomes index_col), while if you pass strings you have to use the original Python names.
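The naming rule can be pictured with a small sketch in plain Clojure. The helper kw->py-name below is hypothetical and is not panthera's actual implementation; it only illustrates the dash-to-underscore convention described above.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, for illustration only: mimics the naming rule
;; described above. Keywords get dashes turned into underscores,
;; strings pass through untouched.
(defn kw->py-name
  [k]
  (if (keyword? k)
    (str/replace (name k) "-" "_")
    k))

(kw->py-name :index-col)   ; => "index_col"
(kw->py-name "index_col")  ; => "index_col"
```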

Below as an example we read-csv our file, but we want to get only the first 10 rows, so we pass a map to the function like {:nrows 10}.

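;; "pokemon.csv" stands for the dataset attached to this notebook;
;; adjust the path to wherever your copy lives
(print (pt/read-csv "pokemon.csv" {:nrows 10}))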

The cool thing is that we can chain operations: the thread-first macro (->) is our friend!

Below we read the whole csv, get the correlation matrix and then show it

(-> (pt/read-csv "pokemon.csv")
    pt/corr
    print)
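For readers new to Clojure, here is a minimal sketch of what thread-first does, using only core functions so it runs without panthera: each result is inserted as the first argument of the next form.

```clojure
;; -> rewrites the chain into nested calls, passing each result
;; as the FIRST argument of the following form
(def threaded (-> 5 inc (* 2) str))
(def nested   (str (* (inc 5) 2)))

(= threaded nested) ; => true, both are "12"
```

This is why a data-frame can flow through a pipeline of panthera functions: as described above, they take the data-frame or series as their first argument.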

Since we'll be using pokemon.csv a lot, let's give it a name, defonce is great here

;; defonce keeps the file from being re-read if this form is evaluated again
(defonce pokemon (pt/read-csv "pokemon.csv"))

Let's see how plotting goes

(defn heatmap 
  [data x y z]
  {:data {:values data}
   :width 500
   :height 500
   :encoding {:x {:field x
                  :type "nominal"}
              :y {:field y
                  :type "nominal"}}
   :layer [{:mark "rect"
            :encoding {:color {:field z
                               :type "quantitative"}}}
           {:mark "text"
            :encoding {:text 
                       {:field z
                        :type "quantitative"
                        :format ".2f"}
                       :color {:value "white"}}}]})
user/heatmap
(with-meta
  (-> pokemon
      pt/corr
      pt/reset-index
      (pt/melt {:id-vars :index})
      pt/->clj
      (heatmap :index :variable :value))
    {:nextjournal/viewer "vega-lite"})

What we did is plot the heatmap of the correlation matrix shown above. Don't worry too much about all the steps we took; we'll see each of them one by one later on!

What if we already read our data but we want to see only some rows? We have the head function for that

(show (pt/head pokemon))
(show (pt/head pokemon 10))

Another nice thing we can do is get the column names

(pt/names pokemon)
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x3442f97f", "Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary'], dtype='object')", Map]

Now, when you see output like the one above, it means that the data is still in Python. That's OK if you keep working within panthera, but what if you want to do something with the column names using Clojure?

(vec (pt/names pokemon))
Vector(12) ["#", "Name", "Type 1", "Type 2", "HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed", "Generation", "Legendary"]

That's it! Just call vec and now you have a nice Clojure vector that you can work with.

N.B.: with many Python objects you can directly treat them as similar Clojure collections. For instance in this case we can do something like below

(doseq [a (pt/names pokemon)] (println a))

Some plotting

Plotting is a nice way to learn how to munge data: you get fast visual feedback and the results are usually nice to look at!

Let's plot Speed and Defense

(defn line-plot
  [data x y & [color]]
  (let [spec {:data {:values data}
              :mark "line"
              :width 600
              :height 300
              :encoding {:x {:field x
                             :type "quantitative"}
                         :y {:field y
                             :type "quantitative"}
                         :color {}}}]
    (if color
      (assoc-in spec [:encoding :color] {:field color
                                         :type "nominal"})
      (assoc-in spec [:encoding :color] {:value "blue"}))))
user/line-plot
(with-meta
(-> pokemon
    (pt/subset-cols :# :Speed :Defense)
    (pt/melt {:id-vars :#})
    pt/->clj
    (line-plot :# :value :variable))
      {:nextjournal/viewer "vega-lite"})

Let's look at the operations above:

  • subset-cols: we use this to, well, subset columns. We can choose N columns by label, and we get a 'new' data-frame with only the selected columns

  • melt: this transforms the data-frame from wide to long format (for more info about it, see further below)

  • ->clj: this turns data-frames and series into a Clojure vector of maps

subset-cols is pretty straightforward:

(-> pokemon (pt/subset-cols :Speed :Attack) pt/head show)
(-> pokemon (pt/subset-cols :Speed :Attack :HP :#) pt/head show)
(-> pokemon (pt/subset-cols :# :Attack) pt/head)
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x505b8227", " # Attack 0 1 49 1 2 62 2 3 82 3 4 100 4 5 52", Map]

->clj tries to work out the best way to transform panthera data structures into Clojure ones

(-> pokemon (pt/subset-cols :Speed) pt/head pt/->clj)
Vector(5) [Map, Map, Map, Map, Map]
(-> pokemon (pt/subset-cols :Speed :HP) pt/head pt/->clj)
Vector(5) [Map, Map, Map, Map, Map]

Now we want to see what happens when we plot Attack vs Defense

(defn scatter
  [data x y & [color]]
  (let [spec {:data {:values data}
              :mark "point"
              :width 600
              :height 300
              :encoding {:x {:field x
                             :type "quantitative"}
                         :y {:field y
                             :type "quantitative"}
                         :color {}}}]
    (if color
      (assoc-in spec [:encoding :color] {:field color
                                         :type "nominal"})
      (assoc-in spec [:encoding :color] {:value "dodgerblue"}))))
user/scatter
(with-meta
(-> pokemon
    (pt/subset-cols :Attack :Defense)
    pt/->clj
    (scatter :attack :defense))
    {:nextjournal/viewer "vega-lite"})

And now the Speed histogram

(defn hist
  [data x & [color]]
  (let [spec {:data {:values data}
              :mark "bar"
              :width 600
              :height 300
              :encoding {:x {:field x
                             :bin {:maxbins 50}
                             :type "quantitative"}
                         :y {:aggregate "count"
                             :type "quantitative"}
                         :color {}}}]
    (if color
      (assoc-in spec [:encoding :color] {:field color
                                         :type "nominal"})
      (assoc-in spec [:encoding :color] {:value "dodgerblue"}))))
user/hist
(with-meta
(-> pokemon
    (pt/subset-cols :Speed)
    pt/->clj
    (hist :speed))
    {:nextjournal/viewer "vega-lite"})

Data-frames basics

Creation

How to create data-frames? Above we read a csv, but what if we already have some data in the runtime we want to deal with? Nothing easier than this:

(show (pt/data-frame [{:a 1 :b 2} {:a 3 :b 4}]))

What if we don't care about column names, or we'd prefer to add them to an already generated data-frame?

(show (pt/data-frame (to-array-2d [[1 2] [3 4]])))

Columns of data-frames are just series:

(-> pokemon (pt/subset-cols "Defense") pt/pytype)
:series
(pt/series [1 2 3])
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x111e8b43", "0 1 1 2 2 3 dtype: int64", Map]

The column name is the name of the series:

(pt/series [1 2 3] {:name :my-series})
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x7321b30", "0 1 1 2 2 3 Name: my-series, dtype: int64", Map]

Filtering

One of the most straightforward ways to filter data-frames is with booleans. We have filter-rows that takes either booleans or a function that generates booleans

(-> pokemon
    (pt/filter-rows #(-> % (pt/subset-cols "Defense") (pt/gt 200)))
    show)

gt is exactly what you think it is: >. Check the Basic concepts notebook to better understand how math works in panthera.
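The idea of boolean filtering can be sketched in plain Clojure, without panthera. The rows and values below are made up for illustration: first build a mask with one boolean per row, then keep the rows whose mask entry is true.

```clojure
;; Toy rows standing in for a data-frame (values are made up)
(def rows
  [{:name "A" :defense 230}
   {:name "B" :defense 80}
   {:name "C" :defense 210}])

;; Step 1: build a boolean mask, one entry per row
(def mask (mapv #(> (:defense %) 200) rows)) ; => [true false true]

;; Step 2: keep only the rows whose mask entry is true
(def kept
  (vec (keep-indexed (fn [i row] (when (mask i) row)) rows)))

(mapv :name kept) ; => ["A" "C"]
```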

Now we have to bring Numpy into the equation. Let's say we want to filter the data-frame on 2 conditions at the same time. We can do that using npy:

(require '[panthera.numpy :refer [npy]])
(defn my-filter
  [col1 col2]
  (npy :logical-and 
       {:args [(-> pokemon
                   (pt/subset-cols col1)
                   (pt/gt 200))
               (-> pokemon
                   (pt/subset-cols col2)
                   (pt/gt 100))]}))
user/my-filter
(-> pokemon
    (pt/filter-rows (my-filter :Defense :Attack))
    show)

panthera.numpy works a little differently from regular panthera: usually you need only npy to access all of numpy's functions.

For instance:

(-> pokemon
    (pt/subset-cols :Defense)
    ((npy :log))
    pt/head)
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x5aa0f40e", "0 3.891820 1 4.143135 2 4.418841 3 4.812184 4 3.761200 Name: Defense, dtype: float64", Map]

Above we just calculated the log of the whole Defense column! Remember that npy operations are vectorized, so usually it is faster to use them (or equivalent panthera ones) than Clojure ones (unless you're doing more complicated operations, then Clojure would probably be faster).

Now let's try to do some more complicated things:

(/ (pt/sum (pt/subset-cols pokemon :Speed)) 
   (pt/n-rows pokemon))
Vector(2) [27311, 400]

Above we see how we can combine operations on series (the result is the Clojure ratio 27311/400, about 68.28). Of course that's just a mean, and we have a function for that!

(defn col-mean
  [col]
  (pt/mean (pt/subset-cols pokemon col)))
user/col-mean

Now we would like to add a new column that says high when the value is above the mean, and low for the opposite.

npy is really helpful here:

(npy :where {:args [(pt/gt (pt/head (pt/subset-cols pokemon :Speed)) (col-mean :Speed))
                    "high"
                    "low"]})
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x20bed54", "['low' 'low' 'high' 'high' 'low']", Map]

But this is pretty ugly and we can't chain it with other functions. It is pretty easy to wrap it into a chainable function:

(defn where
  [& args]
  (npy :where {:args args}))
user/where
(-> pokemon
    (pt/subset-cols :Speed)
    pt/head
    (pt/gt (col-mean :Speed))
    (where "high" "low"))
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x6a99ce1a", "['low' 'low' 'high' 'high' 'low']", Map]
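For comparison, the same high/low labelling can be sketched in plain Clojure on a small vector. The helper names mean and label-by-mean are hypothetical, and note that this version uses the mean of the sample itself rather than the mean of the whole Speed column.

```clojure
;; Arithmetic mean as a double
(defn mean
  [xs]
  (/ (reduce + xs) (double (count xs))))

;; Label each value "high" if it is above the mean, "low" otherwise
(defn label-by-mean
  [xs]
  (let [m (mean xs)]
    (mapv #(if (> % m) "high" "low") xs)))

(mean [45 60 80 80 65])          ; => 66.0
(label-by-mean [45 60 80 80 65]) ; => ["low" "low" "high" "high" "low"]
```

The npy version stays vectorized on the Python side, while this one walks the values one by one in Clojure.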

That seems to work! Let's add a new column to our data-frame:

(def speed-level
  (-> pokemon
    (pt/subset-cols :Speed)
    (pt/gt (col-mean :Speed))
    (where "high" "low")))

(-> pokemon
    (pt/assign {:speed-level speed-level})
    (pt/subset-cols :speed_level :Speed)
    (pt/head 10)
    show)

Of course we didn't actually add speed_level to pokemon; we created a new data-frame. Everything here is as immutable as possible. Let's check that this is really the case:

(vec (pt/names pokemon))
Vector(12) ["#", "Name", "Type 1", "Type 2", "HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed", "Generation", "Legendary"]

Inspecting data

Other than head we have tail

(show (pt/tail pokemon))

We can always check the shape of the data structure we're interested in. shape returns the row and column counts

(pt/shape pokemon)
Vector(4) [libpython_clj.python.bridge$generic_python_as_list$fn$reify__24449, "0x227c733c", "(800, 12)", Map]

If you want just one of the two, you can use either n-rows or n-cols, or get the required value by index:

(pt/n-rows pokemon)
800
((pt/shape pokemon) 0)
800

Exploratory data analysis

Now we can move to something a little more interesting: some data analysis.

One of the first things we might want to do is to look at some frequencies. value-counts is our friend

(-> pokemon
    (pt/subset-cols "Type 1")
    (pt/value-counts {:dropna false}))
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x4e8a03f", "Water 112 Normal 98 Grass 70 Bug 69 Psychic 57 Fire 52 Rock 44 Electric 44 Ghost 32 Dragon 32 Ground 32 Dark 31 Poison 28 Steel 27 Fighting 27 Ice 24 Fairy 17 Flying 4 Name: Type 1, dtype: int64", Map]

As we can see we get counts by group automatically and this can come in handy!
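If you come from Clojure, value-counts plays a role similar to the core function frequencies. A minimal sketch on a made-up vector (nil standing in for a missing value):

```clojure
;; Made-up sample of types; nil plays the role of a missing value
(def types ["Grass" "Fire" "Water" "Grass" nil "Water" "Grass"])

;; frequencies returns a map from each distinct value to its count
(def counts (frequencies types))

counts ; => {"Grass" 3, "Fire" 1, "Water" 2, nil 1}
```

Note that frequencies always counts nil, which corresponds to passing {:dropna false} in the panthera call above.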

There's also a nice way to see many stats at once for all the numeric columns: describe

(show (pt/describe pokemon))

If you need some of these stats only for some columns, chances are that there's a function for that!

(-> (pt/subset-cols pokemon :HP)
    ((juxt pt/mean pt/std pt/minimum pt/maximum)))
Vector(4) [69.25875, 25.53466903233207, 1, 255]
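The juxt trick used above is plain Clojure: it builds a function that applies several functions to the same argument and returns a vector of the results. A small sketch with core functions and made-up numbers:

```clojure
;; Made-up sample values
(def hp [45 60 80 39 58])

;; juxt returns a function that calls each given function on its input
;; and collects the results into a vector
(def summary
  ((juxt #(apply min %) #(apply max %) count) hp))

summary ; => [39 80 5]
```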

Reshaping data

One of the most common operations with rectangular data is reshaping it to make other operations easier.

R users know exactly what I mean by tidy data. If you have no idea what that is, check the link, but the main point is this: while most people are used to working with double-entry matrices (like the one built above with describe), it is much easier to work with long data, with one row per observation and one column per variable.

In panthera there's melt as a workhorse for this process

(-> pokemon pt/head show)
(-> pokemon pt/head (pt/melt {:id-vars "Name" :value-vars ["Attack" "Defense"]}) show)

Above we told panthera that we wanted to melt our data-frame and that we would like to have the column Name act as the main id, while we're interested in the value of Attack and Defense.
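To make the wide-to-long idea concrete, here is a hypothetical pure-Clojure sketch that works on a vector of maps. The function melt-rows and the toy data are assumptions for illustration; panthera itself delegates the real work to Pandas.

```clojure
;; Hypothetical sketch of a wide -> long reshape on a vector of maps.
;; For every value column we emit one row per original row, keeping the id.
(defn melt-rows
  [rows id-var value-vars]
  (for [v   value-vars
        row rows]
    {id-var    (get row id-var)
     :variable v
     :value    (get row v)}))

;; Toy wide data (made up)
(def wide
  [{:name "A" :attack 49 :defense 49}
   {:name "B" :attack 62 :defense 63}])

(melt-rows wide :name [:attack :defense])
;; => ({:name "A", :variable :attack,  :value 49}
;;     {:name "B", :variable :attack,  :value 62}
;;     {:name "A", :variable :defense, :value 49}
;;     {:name "B", :variable :defense, :value 63})
```

Two wide rows with two value columns become four long rows, one per observation.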

This makes it much easier to group values by some variable:

(-> pokemon 
    pt/head 
    (pt/melt {:id-vars "Name" :value-vars ["Attack" "Defense"]}) 
    (pt/groupby :variable)
    pt/mean)
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x41d1518a", " value variable Attack 69.0 Defense 72.2", Map]
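The same group-then-aggregate step can be sketched in plain Clojure with group-by. The helper mean-by and the rows below are made up for illustration and do not reproduce the numbers above.

```clojure
;; Toy long-format rows (made up)
(def long-rows
  [{:variable :attack  :value 49}
   {:variable :attack  :value 62}
   {:variable :defense :value 49}
   {:variable :defense :value 63}])

;; Hypothetical helper: group rows by key k, then average column v per group
(defn mean-by
  [rows k v]
  (into {}
        (for [[group group-rows] (group-by k rows)]
          [group (/ (reduce + (map v group-rows))
                    (double (count group-rows)))])))

(mean-by long-rows :variable :value)
;; => {:attack 55.5, :defense 56.0}
```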

If you've ever used Excel you already know about pivot, which is the opposite of melt

(-> pokemon 
    pt/head 
    (pt/melt {:id-vars "Name" :value-vars ["Attack" "Defense"]}) 
    (pt/pivot {:index "Name" :columns "variable" :values "value"})
    show)

What if we have more than one data-frame? We can combine them however we want!

(show 
  (pt/concatenate
    [(pt/head pokemon)
     (pt/tail pokemon)]
    {:axis 0
     :ignore-index true}))

Just a second to discuss some options:

  • :axis: most panthera operations can be applied either by rows or by columns; we decide which with this keyword, where 0 = rows and 1 = columns

  • :ignore-index: panthera works by index. To better understand what kinds of indexes there are, and most of their quirks, check Basic concepts

To better understand :axis let's look at another example

(show
  (pt/concatenate
    (repeat 2 (pt/head pokemon))
    {:axis 1}))
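As a rough analogy in plain Clojure, with data as a vector of maps: concatenating along rows stacks the vectors, while concatenating along columns merges the maps pairwise. This is only a sketch with made-up data. In particular, it pairs rows by position, whereas Pandas aligns them by index.

```clojure
;; Toy data (made up)
(def top    [{:name "A" :hp 45} {:name "B" :hp 60}])
(def bottom [{:name "Y" :hp 50} {:name "Z" :hp 70}])
(def extra  [{:speed 45} {:speed 60}])

;; Like :axis 0, stack rows: more rows, same columns
(def by-rows (vec (concat top bottom)))

;; Like :axis 1, glue columns side by side: same rows, more columns
(def by-cols (mapv merge top extra))

(count by-rows) ; => 4
by-cols         ; => [{:name "A", :hp 45, :speed 45} {:name "B", :hp 60, :speed 60}]
```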

Types, types everywhere

There are many dedicated types, but no worries, there are nice ways to deal with them.

(pt/dtype pokemon)
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x68cb547c", "# int64 Name object Type 1 object Type 2 object HP int64 Attack int64 Defense int64 Sp. Atk int64 Sp. Def int64 Speed int64 Generation int64 Legendary bool dtype: object", Map]

I guess there isn't much to say about :int64 and :bool, but :object surely looks more interesting. When panthera (numpy included) finds either strings or something it doesn't know how to deal with, it falls back to the most general type possible, which is :object.

:objects are usually bloated. If we want to save some overhead and it makes sense to treat the values as categorical, we can convert them to :category

(-> pokemon
    (pt/subset-cols "Type 1")
    (pt/astype :category)
    pt/head)
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0xbb216f2", "0 Grass 1 Grass 2 Grass 3 Grass 4 Fire Name: Type 1, dtype: category Categories (18, object): [Bug, Dark, Dragon, Electric, ..., Psychic, Rock, Steel, Water]", Map]
(-> pokemon
    (pt/subset-cols "Speed")
    (pt/astype :float)
    pt/head)
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x7c5a4b0", "0 45.0 1 60.0 2 80.0 3 80.0 4 65.0 Name: Speed, dtype: float64", Map]

Dealing with missing data

One of the most painful operations for data scientists and engineers is dealing with the unknown: NaN (or nil, Null, etc).

panthera tries to make this as painless as possible:

(-> pokemon
    (pt/subset-cols "Type 2")
    (pt/value-counts {:dropna false}))
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x22dd4576", "NaN 386 Flying 97 Ground 35 Poison 34 Psychic 33 Fighting 26 Grass 25 Fairy 23 Steel 22 Dark 20 Dragon 18 Ice 14 Water 14 Ghost 14 Rock 14 Fire 12 Electric 6 Normal 4 Bug 3 Name: Type 2, dtype: int64", Map]

We could check for NaN in other ways as well:

(-> pokemon (pt/subset-cols "Type 2") ((juxt pt/hasnans? (comp pt/all? pt/not-na?))))
Vector(2) [true, false]

One of the ways to deal with missing data is to just drop rows

(-> pokemon
    (pt/dropna {:subset ["Type 2"]})
    (pt/subset-cols "Type 2")
    (pt/value-counts {:dropna false}))
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x65052234", "Flying 97 Ground 35 Poison 34 Psychic 33 Fighting 26 Grass 25 Fairy 23 Steel 22 Dark 20 Dragon 18 Ice 14 Rock 14 Ghost 14 Water 14 Fire 12 Electric 6 Normal 4 Bug 3 Name: Type 2, dtype: int64", Map]

But let's say we want to replace missing observations with a flag or value of some kind. We can do that easily with fill-na:

(-> pokemon
    (pt/subset-cols "Type 2")
    (pt/fill-na :empty)
    (pt/head 10))
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x7369cd1", "0 Poison 1 Poison 2 Poison 3 Poison 4 empty 5 empty 6 Flying 7 Dragon 8 Flying 9 empty Name: Type 2, dtype: object", Map]
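The two strategies, dropping and filling, can be pictured in plain Clojure on a small made-up vector where nil stands in for NaN. This is only an analogy and does not go through panthera.

```clojure
;; Made-up sample; nil plays the role of NaN
(def type-2 ["Poison" nil "Flying" nil])

;; Dropping: remove the missing entries altogether
(def dropped (vec (remove nil? type-2)))

;; Filling: replace each missing entry with a flag value
(def filled (mapv #(if (nil? %) "empty" %) type-2))

dropped ; => ["Poison" "Flying"]
filled  ; => ["Poison" "empty" "Flying" "empty"]
```

Dropping changes the number of observations, while filling keeps every row and makes the gap explicit.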

Time and dates

Programmers hate time, that's a fact. Panthera tries to make this experience as painless as possible

(def times
  ["1992-01-10","1992-02-10","1992-03-10","1993-03-15","1993-03-16"])

(pt/->datetime times)
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x639929c8", "DatetimeIndex(['1992-01-10', '1992-02-10', '1992-03-10', '1993-03-15', '1993-03-16'], dtype='datetime64[ns]', freq=None)", Map]
(-> pokemon
    pt/head
    (pt/set-index (pt/->datetime times))
    show)
(-> pokemon
    pt/head
    (pt/set-index (pt/->datetime times))
    (pt/select-rows "1993-03-16" :loc))
Vector(4) [libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__24566, "0x30eb68cb", "# 5 Name Charmander Type 1 Fire Type 2 NaN HP 39 Attack 52 Defense 43 Sp. Atk 60 Sp. Def 50 Speed 65 Generation 1 Legendary False Name: 1993-03-16 00:00:00, dtype: object", Map]
(-> pokemon
    pt/head
    (pt/set-index (pt/->datetime times))
    (pt/select-rows (pt/slice "1992-03-10" "1993-03-16") :loc)
    show)