Parens for Python - UMAP & Trimap
We are going to explore some more Python libraries through the use of libpython-clj.
This time, we are going to focus on a couple of dimensionality reduction libraries called UMAP and Trimap. We need a few supporting libraries installed to go through the examples:
    {:deps {org.clojure/clojure {:mvn/version "1.10.1"}
            cnuernber/libpython-clj {:mvn/version "1.36"}}}

Install the Python dependencies:
    pip3 install seaborn
    pip3 install matplotlib
    pip3 install scikit-learn
    pip3 install numpy
    pip3 install pandas
    pip3 install umap-learn
    pip3 install trimap

(The scikit-learn package is still imported as sklearn.) We also need to set up a plotting namespace with matplotlib:
    (ns gigasquid.plot
      (:require [libpython-clj.require :refer [require-python]]
                [libpython-clj.python :as py :refer [py. py.. py.-]]))

First, we have to define a quick macro to render plots on our local system. It switches matplotlib (the library that seaborn is built on) to a headless backend, so figures can be drawn and saved without a display.
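As a warm-up, here is a toy macro with the same shape as the real one defined next: some setup, the caller's forms spliced in as a `do` block, then some teardown. The name `with-steps` and the keywords are hypothetical and exist only for illustration; this is a sketch of the pattern, not part of the plotting code.

```clojure
;; Toy macro (hypothetical name) showing the setup / body / teardown shape.
(defmacro with-steps
  "Runs body between a setup step and a teardown step, returning a log."
  [& body]
  `(let [log# (atom [:setup])]                ; log# is an auto-gensym, so it cannot clash with body
     (swap! log# conj ~(cons 'do body))       ; splice the caller's forms in as (do ...)
     (swap! log# conj :teardown)
     @log#))

(with-steps (+ 1 2)) ;=> [:setup 3 :teardown]
```

The syntax quote, the `#` suffix on local names, and `~(cons 'do body)` are the three pieces that make this kind of wrapper macro work.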
    ;;;; have to set the headless mode before requiring pyplot
    (def mplt (py/import-module "matplotlib"))
    (py. mplt "use" "Agg")

    (require-python matplotlib.pyplot)
    (require-python matplotlib.backends.backend_agg)

    (defmacro with-show
      "Takes forms with matplotlib.pyplot to then show locally"
      [& body]
      `(let [_# (matplotlib.pyplot/clf)
             fig# (matplotlib.pyplot/figure)
             agg-canvas# (matplotlib.backends.backend_agg/FigureCanvasAgg fig#)]
         ;; run the plotting forms passed in by the caller
         ~(cons 'do body)
         (py. agg-canvas# "draw")
         ;; saves to a uniquely named file; the results/ directory must already exist
         (matplotlib.pyplot/savefig (str "results/" (gensym) ".png"))))

UMAP
UMAP is a dimensionality reduction library. That sounds like a mouthful, but it basically takes a complicated dataset with many variables and reduces it down to something much simpler without losing the fundamental characteristics.
    (ns gigasquid.umap
      (:require [libpython-clj.require :refer [require-python]]
                [libpython-clj.python :as py :refer [py. py.. py.-]]
                [gigasquid.plot :as plot]))

    ;;;; you will need all these things below installed
    ;;; with pip or something else

    ;;; What is umap? - dimensionality reduction library

    (require-python [seaborn :as sns])
    (require-python [matplotlib.pyplot :as pyplot])
    (require-python [sklearn.datasets :as sk-data])
    (require-python [sklearn.model_selection :as sk-model])
    (require-python [numpy :as numpy])
    (require-python [pandas :as pandas])
    (require-python [umap :as umap])

Next we are going to follow along with the code tutorial from https://umap-learn.readthedocs.io/en/latest/basic_usage.html
We next set up the defaults for plotting and get some data to work with. We'll look at the Iris dataset. It isn't very representative of real-world data, since both the number of points and the number of features are small, but it will illustrate what is going on with dimensionality reduction.
    ;;; set the defaults for plotting
    (sns/set)

    (def iris (sk-data/load_iris))
    (py.- iris DESCR)

We define a data frame and a series for the data set and can then plot the species.
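The series is built by mapping each numeric target (0, 1 or 2) to its species name. A minimal plain-Clojure sketch of that lookup follows; `sample-targets` is a hypothetical five-element stand-in for `(py.- iris target)`, and the names are written out by hand rather than read from `(py.- iris target_names)`.

```clojure
;; The three species names in the Iris dataset, in target order.
(def target-names ["setosa" "versicolor" "virginica"])

;; Index -> name lookup, the same zipmap idea used for the real series.
(def name-map (zipmap (range 3) target-names))

;; Hypothetical sample of target values (the real array has 150 entries).
(def sample-targets [0 0 1 2 1])

;; A map is a function of its keys, so it can be mapped directly.
(def sample-species (mapv name-map sample-targets))
;=> ["setosa" "setosa" "versicolor" "virginica" "versicolor"]
```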
    (def iris-df (pandas/DataFrame (py.- iris data) :columns (py.- iris feature_names)))
    (py/att-type-map iris-df)

    (def iris-name-series
      (let [iris-name-map (zipmap (range 3) (py.- iris target_names))]
        (pandas/Series (map (fn [item] (get iris-name-map item))
                            (py.- iris target)))))

    (py. iris-df __setitem__ "species" iris-name-series)
    (py/get-item iris-df "species")

    (plot/with-show
      (sns/pairplot iris-df :hue "species"))

Now it's time to reduce! First we define a reducer and then train it to learn about the manifold. The fit_transform method first fits the data and then transforms it, returning a numpy array.
    (def reducer (umap/UMAP))
    (def embedding (py. reducer fit_transform (py.- iris data)))
    (py.- embedding shape) ;=> (150, 2)

    ;;; 150 samples with 2 columns. Each row of the array is a 2-dimensional
    ;;; representation of the corresponding flower. Thus we can plot the
    ;;; embedding as a standard scatterplot and color by the target array
    ;;; (since it applies to the transformed data, which is in the same order
    ;;; as the original).

    (str (first embedding)) ;=> [12.449954 -6.0549345]

    (let [colors (mapv #(py/get-item (sns/color_palette) %) (py.- iris target))
          x (mapv first embedding)
          y (mapv last embedding)]
      (plot/with-show
        (pyplot/scatter x y :c colors)
        (py. (pyplot/gca) set_aspect "equal" "datalim")
        (pyplot/title "UMAP projection of the Iris dataset" :fontsize 24)))

UMAP with Digits Data
Now let's use a dataset with more complicated data: the handwritten digits set we all know and love.
    (def digits (sk-data/load_digits))
    (str (py.- digits DESCR))

Let's take a look at the images to see what we are dealing with:
    (plot/with-show
      (let [[fig ax-array] (pyplot/subplots 20 20)
            axes (py. ax-array flatten)]
        (doall (map-indexed
                (fn [i ax]
                  (py. ax imshow (py/get-item (py.- digits images) i) :cmap "gray_r"))
                axes))
        (pyplot/setp axes :xticks [] :yticks [] :frame_on false)
        (pyplot/tight_layout :h_pad 0.5 :w_pad 0.01)))

Each image is stored as a row of 64 grayscale values. Now, let's do a scatterplot matrix of the first 10 of those 64 dimensions.
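The data preparation for that plot comes down to two small transformations: keep the first 10 values of every 64-value row, and turn each numeric target into a readable label. A plain-Clojure sketch, using hypothetical `sample-rows` and `sample-targets` in place of `(py.- digits data)` and `(py.- digits target)`:

```clojure
;; Two hypothetical 64-element rows standing in for the digits data.
(def sample-rows [(vec (range 64)) (vec (range 100 164))])

;; Keep only the first 10 dimensions of each row.
(def first-ten (mapv #(vec (take 10 %)) sample-rows))
;=> [[0 1 2 3 4 5 6 7 8 9] [100 101 102 103 104 105 106 107 108 109]]

;; Hypothetical targets, turned into labels for the plot legend.
(def sample-targets [0 7])
(def labels (mapv #(str "Digit " %) sample-targets))
;=> ["Digit 0" "Digit 7"]
```

Note the `#(...)` reader shorthand: without the leading `#`, a form such as `(take 10 %)` is not a function and will not compile.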
    (def digits-df (pandas/DataFrame (mapv #(take 10 %) (py.- digits data))))
    (def digits-target-series
      (pandas/DataFrame (mapv #(str "Digit " %) (py.- digits target))))
    (py. digits-df __setitem__ "digit" digits-target-series)

    (plot/with-show
      (sns/pairplot digits-df :hue "digit" :palette "Spectral"))

Let's reduce it!
    ;;;; use umap with the fit instead
    (def reducer (umap/UMAP :random_state 42))
    (py. reducer fit (py.- digits data))

    ;;; now we can look at the embedding attribute on the reducer
    ;;; or call transform on the original data
    (def embedding (py. reducer transform (py.- digits data)))
    (str (py.- embedding shape))

We now have a dataset with 1797 rows but only 2 columns. We can plot the resulting embedding, coloring the data points by the class to which they belong (the digit).
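Two details of the plotting code are easy to miss: the embedding rows are split into separate x and y sequences, and the colorbar boundaries are offset by 0.5 so that each digit's tick sits in the middle of its color band. A plain-Clojure sketch of both, where `sample-embedding` is a hypothetical three-row stand-in for the real 1797-row array and `mapv` with `-` replaces the numpy calls:

```clojure
;; Hypothetical 2-column embedding (the real one comes from UMAP).
(def sample-embedding [[1.5 -2.0] [0.25 3.0] [4.0 0.5]])

;; Split the rows into the x and y coordinates that scatter expects.
(def xs (mapv first sample-embedding)) ;=> [1.5 0.25 4.0]
(def ys (mapv last sample-embedding))  ;=> [-2.0 3.0 0.5]

;; 11 boundaries from -0.5 to 9.5 enclose 10 bands, one per digit.
(def bounds (mapv #(- % 0.5) (range 11)))

;; Ticks at 0..9 fall at the center of each band.
(def ticks (vec (range 10)))
```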
    (plot/with-show
      (let [x (mapv first embedding)
            y (mapv last embedding)
            colors (py.- digits target)
            bounds (numpy/subtract (numpy/arange 11) 0.5)
            ticks (numpy/arange 10)]
        (pyplot/scatter x y :c colors :cmap "Spectral" :s 5)
        (py. (pyplot/gca) set_aspect "equal" "datalim")
        (py. (pyplot/colorbar :boundaries bounds) set_ticks ticks)
        (pyplot/title "UMAP projection of the Digits dataset" :fontsize 24)))

Trimap
Trimap is another dimensionality reduction library that uses a different algorithm: https://pypi.org/project/trimap/
    (ns gigasquid.trimap
      (:require [libpython-clj.require :refer [require-python]]
                [libpython-clj.python :as py :refer [py. py.. py.-]]
                [gigasquid.plot :as plot]))

    (require-python [trimap :as trimap])
    (require-python [sklearn.datasets :as sk-data])
    (require-python [matplotlib.pyplot :as pyplot])
    ;; numpy is used by the plotting code at the end of this section
    (require-python [numpy :as numpy])

We can do the digits example using it too.
    (def digits (sk-data/load_digits))
    (def digits-data (py.- digits data))

    (def embedding (py. (trimap/TRIMAP) fit_transform digits-data))
    (str (py.- embedding shape))

Finally, we can visualize it as before:
    (plot/with-show
      (let [x (mapv first embedding)
            y (mapv last embedding)
            colors (py.- digits target)
            bounds (numpy/subtract (numpy/arange 11) 0.5)
            ticks (numpy/arange 10)]
        (pyplot/scatter x y :c colors :cmap "Spectral" :s 5)
        (py. (pyplot/gca) set_aspect "equal" "datalim")
        (py. (pyplot/colorbar :boundaries bounds) set_ticks ticks)
        (pyplot/title "Trimap projection of the Digits dataset" :fontsize 24)))

I hope that you have enjoyed this example and that it will spur your curiosity to try Python interop for yourself. You can find this code example, along with others, here: https://github.com/gigasquid/libpython-clj-examples