Parens for Python - UMAP & Trimap
We are going to explore some more Python libraries through the use of libpython-clj.
This time, we are going to focus on a couple of dimensionality reduction libraries called UMAP and Trimap. We will need a few supporting libraries installed to work through the examples. First, the Clojure dependencies in deps.edn:
{:deps
 {org.clojure/clojure {:mvn/version "1.10.1"}
  cnuernber/libpython-clj {:mvn/version "1.36"}}}
Install the Python dependencies:
pip3 install seaborn
pip3 install matplotlib
pip3 install scikit-learn
pip3 install numpy
pip3 install pandas
pip3 install umap-learn
pip3 install trimap
We also need to set up a plotting helper namespace with matplotlib:
(ns gigasquid.plot
  (:require [libpython-clj.require :refer [require-python]]
            [libpython-clj.python :as py :refer [py. py.. py.-]]))
First, we define a quick macro to render plots on our local system. It lets matplotlib (the library that seaborn is built on) run headlessly and save each figure to a file.
;;;; have to set the headless mode before requiring pyplot
(def mplt (py/import-module "matplotlib"))
(py. mplt "use" "Agg")
(require-python 'matplotlib.pyplot)
(require-python 'matplotlib.backends.backend_agg)
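(defmacro with-show
  "Takes forms that draw with matplotlib.pyplot and saves the resulting figure as a PNG under results/"
  [& body]
  ;; syntax-quote builds the expansion; name# symbols are auto-gensyms
  `(let [_# (matplotlib.pyplot/clf)
         fig# (matplotlib.pyplot/figure)
         agg-canvas# (matplotlib.backends.backend_agg/FigureCanvasAgg fig#)]
     ;; splice the caller's plotting forms in as a single do block
     ~(cons 'do body)
     (py. agg-canvas# "draw")
     ;; (gensym) runs on each call, so every figure gets a unique file name
     (matplotlib.pyplot/savefig (str "results/" (gensym) ".png"))))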
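The macro above follows a common shape: run some setup, splice in the caller's forms, then run some teardown. Here is a minimal sketch of that shape in plain Clojure, with no Python involved. The names `events` and `with-logged-steps` are hypothetical stand-ins invented for this illustration; they are not part of libpython-clj or of the plotting namespace.

```clojure
;; Hypothetical stand-in for with-show: it records steps in an atom
;; instead of drawing and saving a figure.
(def events (atom []))

(defmacro with-logged-steps
  "Runs a setup step, then the body forms, then a teardown step.
  Returns the unique label generated for this run."
  [& body]
  `(let [label# (str "run-" (gensym))]          ; label# is an auto-gensym local
     (swap! events conj [:setup label#])        ; setup, like clf/figure
     ~(cons 'do body)                           ; the caller's forms, spliced in
     (swap! events conj [:teardown label#])     ; teardown, like draw/savefig
     label#))

;; Usage: the two body forms run between setup and teardown.
(with-logged-steps
  (swap! events conj [:body 1])
  (swap! events conj [:body 2]))
```

After the call, `@events` holds a `:setup` entry, the two `:body` entries, and a `:teardown` entry, in that order, with the same generated label on the first and last.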
UMAP
UMAP is a dimensionality reduction library. That sounds like a mouthful, but it basically takes a complicated dataset with many variables and reduces it to something much simpler while trying to keep the fundamental characteristics.
(ns gigasquid.umap
  (:require [libpython-clj.require :refer [require-python]]
            [libpython-clj.python :as py :refer [py. py.. py.-]]
            [gigasquid.plot :as plot]))
;;;; you will need all these things below installed
;;; with pip or something else
;;; What is umap? - dimensionality reduction library
(require-python '[seaborn :as sns])
(require-python '[matplotlib.pyplot :as pyplot])
(require-python '[sklearn.datasets :as sk-data])
(require-python '[sklearn.model_selection :as sk-model])
(require-python '[numpy :as numpy])
(require-python '[pandas :as pandas])
(require-python '[umap :as umap])
Next we are going to follow along with the code tutorial at https://umap-learn.readthedocs.io/en/latest/basic_usage.html
We then set up the defaults for plotting and get some data to work with. We'll look at the Iris dataset. It isn't very representative of real-world data since both the number of points and the number of features are small, but it will illustrate what is going on with dimensionality reduction.
;;; set the defaults for plotting
(sns/set)
(def iris (sk-data/load_iris))
(py.- iris DESCR)
We define a data frame and a series for the data set and can then plot the species.
(def iris-df (pandas/DataFrame (py.- iris data) :columns (py.- iris feature_names)))
(py/att-type-map iris-df)
(def iris-name-series (let [iris-name-map (zipmap (range 3) (py.- iris target_names))]
                        (pandas/Series (map (fn [item]
                                              (get iris-name-map item))
                                            (py.- iris target)))))
(py. iris-df __setitem__ "species" iris-name-series)
(py/get-item iris-df "species")
(plot/with-show
  (sns/pairplot iris-df :hue "species"))
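The species series above is built by looking up each numeric target code in a small map. A minimal sketch of that lookup in plain Clojure, with no Python needed: the three names are the Iris class names, while the `targets` vector is a handful of made-up codes standing in for the real 150-element target array.

```clojure
;; The three Iris class names, indexed by target code.
(def target-names ["setosa" "versicolor" "virginica"])

;; Made-up target codes for illustration; the real array has 150 entries.
(def targets [0 0 1 2 1])

;; zipmap pairs each code with its name:
;; {0 "setosa", 1 "versicolor", 2 "virginica"}
(def name-map (zipmap (range 3) target-names))

;; A Clojure map is also a function of its keys, so it can be mapped
;; directly over the codes.
(def species (mapv name-map targets))
;; species is ["setosa" "setosa" "versicolor" "virginica" "versicolor"]
```

The result keeps the order of the input codes, which is what lets the series line up row by row with the data frame.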
Now it's time to reduce! First we define a reducer and then train it to learn about the manifold. The fit_transform function first fits the data and then transforms it into a numpy array.
(def reducer (umap/UMAP))
(def embedding (py. reducer fit_transform (py.- iris data)))
(py.- embedding shape) ;=> (150, 2)
;;; 150 samples with 2 columns. Each row of the array is a 2-dimensional representation of the corresponding flower. Thus we can plot the embedding as a standard scatterplot and color by the target array (since it applies to the transformed data, which is in the same order as the original).
(str (first embedding)) ;=> [12.449954 -6.0549345]
(let [colors (mapv #(py/get-item (sns/color_palette) %)
                   (py.- iris target))
      x (mapv first embedding)
      y (mapv last embedding)]
  (plot/with-show
    (pyplot/scatter x y :c colors)
    (py. (pyplot/gca) set_aspect "equal" "datalim")
    (pyplot/title "UMAP projection of the Iris dataset" :fontsize 24)))
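The plotting code above splits the two-column embedding into x and y coordinates and picks one color per row from the target codes. A small sketch of that data shaping in plain Clojure, using made-up numbers: `embedding-rows`, `targets`, and `palette` are hypothetical stand-ins, and the RGB values are not seaborn's real palette.

```clojure
;; Made-up rows standing in for the (150, 2) embedding array.
(def embedding-rows [[12.4 -6.1] [11.9 -5.7] [-3.2 4.8]])

;; Made-up target codes, one per row, in the same order as the rows.
(def targets [0 0 2])

;; Hypothetical three-color palette as RGB triples.
(def palette [[0.1 0.2 0.7] [0.9 0.5 0.1] [0.2 0.6 0.2]])

(def xs (mapv first embedding-rows))          ; first column of every row
(def ys (mapv last embedding-rows))           ; second (last) column of every row
(def colors (mapv #(nth palette %) targets))  ; color for each row's class
```

Because the rows, targets, and colors all share one ordering, the i-th point is drawn at `(nth xs i)`, `(nth ys i)` in the i-th color.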
UMAP with Digits Data
Now let's use a more complicated dataset: the handwritten digits set we all know and love.
(def digits (sk-data/load_digits))
(str (py.- digits DESCR))
Let's take a look at the images to see what we are dealing with:
(plot/with-show
  (let [[fig ax-array] (pyplot/subplots 20 20)
        axes (py. ax-array flatten)]
    (doall (map-indexed (fn [i ax]
                          (py. ax imshow (py/get-item (py.- digits images) i) :cmap "gray_r"))
                        axes))
    (pyplot/setp axes :xticks [] :yticks [] :frame_on false)
    (pyplot/tight_layout :h_pad 0.5 :w_pad 0.01)))
Now, let's draw a scatterplot matrix of the first 10 of the 64 dimensions (the grayscale pixel values) of each digit.
(def digits-df (pandas/DataFrame (mapv #(take 10 %) (py.- digits data))))
(def digits-target-series (pandas/Series (mapv #(str "Digit " %) (py.- digits target))))
(py. digits-df __setitem__ "digit" digits-target-series)
(plot/with-show
  (sns/pairplot digits-df :hue "digit" :palette "Spectral"))
Let's reduce it!
;;;; use umap with the fit instead
(def reducer (umap/UMAP :random_state 42))
(py. reducer fit (py.- digits data))
;;; now we can look at the embedding_ attribute on the reducer or call transform on the original data
(def embedding (py. reducer transform (py.- digits data)))
(str (py.- embedding shape))
We now have a dataset with 1797 rows but only 2 columns. We can plot the resulting embedding, coloring the data points by the class to which they belong (the digit).
(plot/with-show
  (let [x (mapv first embedding)
        y (mapv last embedding)
        colors (py.- digits target)
        bounds (numpy/subtract (numpy/arange 11) 0.5)
        ticks (numpy/arange 10)]
    (pyplot/scatter x y :c colors :cmap "Spectral" :s 5)
    (py. (pyplot/gca) set_aspect "equal" "datalim")
    (py. (pyplot/colorbar :boundaries bounds) set_ticks ticks)
    (pyplot/title "UMAP projection of the Digits dataset" :fontsize 24)))
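The two examples in this section call the reducer in two ways: fit_transform in one step for Iris, and fit followed by transform for the digits. The sketch below shows only that call pattern in plain Clojure, using a toy "reducer" that learns column means and subtracts them. It is not UMAP and does no dimensionality reduction; `fit`, `transform`, and `fit-transform` here are hypothetical functions written for this illustration.

```clojure
;; Toy stand-in for a reducer: fitting learns the mean of each column.
(defn fit
  "Returns a model holding the per-column means of rows."
  [rows]
  (let [n (count rows)]
    {:means (apply mapv (fn [& col] (/ (reduce + col) n)) rows)}))

(defn transform
  "Applies a fitted model to rows by subtracting the learned means."
  [model rows]
  (mapv #(mapv - % (:means model)) rows))

(defn fit-transform
  "Fits on rows and transforms the same rows in one call."
  [rows]
  (transform (fit rows) rows))

(def data [[1.0 10.0] [3.0 20.0]])

;; Two steps: keep the fitted model around, then transform.
(def model (fit data))
(def two-step (transform model data))

;; One step: fit and transform together.
(def one-step (fit-transform data))
```

In this toy both routes give the same rows. Keeping the fitted model is what makes it possible to transform other data later; whether a real library returns exactly the same numbers from both routes depends on that library.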
Trimap
Trimap is another dimensionality reduction library that uses a different algorithm: https://pypi.org/project/trimap/
(ns gigasquid.trimap
  (:require [libpython-clj.require :refer [require-python]]
            [libpython-clj.python :as py :refer [py. py.. py.-]]
            [gigasquid.plot :as plot]))
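(require-python '[trimap :as trimap])
(require-python '[sklearn.datasets :as sk-data])
(require-python '[matplotlib.pyplot :as pyplot])
;; numpy is used below to build the colorbar bounds and ticks
(require-python '[numpy :as numpy])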
We can run the digits example with it too.
(def digits (sk-data/load_digits))
(def digits-data (py.- digits data))
(def embedding (py. (trimap/TRIMAP) fit_transform digits-data))
(str (py.- embedding shape))
Finally, we can visualize it as before:
(plot/with-show
  (let [x (mapv first embedding)
        y (mapv last embedding)
        colors (py.- digits target)
        bounds (numpy/subtract (numpy/arange 11) 0.5)
        ticks (numpy/arange 10)]
    (pyplot/scatter x y :c colors :cmap "Spectral" :s 5)
    (py. (pyplot/gca) set_aspect "equal" "datalim")
    (py. (pyplot/colorbar :boundaries bounds) set_ticks ticks)
    (pyplot/title "Trimap projection of the Digits dataset" :fontsize 24)))
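The bounds and ticks passed to the colorbar in the digits plots can be worked out in plain Clojure to see why they fit together. This is a small sketch of the arithmetic only, assuming the usual meaning of numpy's arange (integers from 0 up to, but not including, the argument); it does not call numpy or matplotlib.

```clojure
;; Plain-Clojure equivalent of (numpy/subtract (numpy/arange 11) 0.5):
;; eleven edges from -0.5 to 9.5.
(def bounds (mapv #(- % 0.5) (range 11)))

;; Plain-Clojure equivalent of (numpy/arange 10): the ten digit labels 0..9.
(def ticks (vec (range 10)))

;; The midpoint between each pair of neighbouring edges.
(def midpoints (mapv #(/ (+ %1 %2) 2) bounds (rest bounds)))
```

Eleven edges make ten bands, one per digit, and the midpoint of band k is exactly k, so each tick label sits in the middle of its color band.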
I hope that you have enjoyed this example and that it will spur your curiosity to try Python interop for yourself. You can find this code example, along with others, at https://github.com/gigasquid/libpython-clj-examples