Gaussian Processes
Practical introduction to Gaussian Processes (GP) in Clojure. Part 1/2.
Short introduction
First of all, I don't want to describe the internals of GP here. There are plenty of great articles and videos linked below; read and watch them first. What is important (a quick code preview follows this list):
GP generates random, continuous functions.
When you take all possible functions, the function values at any particular point follow a normal distribution.
You can select a subset of functions by conditioning on known values or by setting the noise level.
The smoothness of the functions is defined by the covariance, which is calculated from a kernel.
A kernel defines the similarity between points; closer points give higher covariance.
There are plenty of kernels to choose from.
The regression line (or interpolation) is just the mean of a subset of functions.
For every point you can calculate the standard deviation, which tells you the uncertainty of the prediction.
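As a quick preview of what this looks like in code, here is a minimal sketch using the fastmath API that is loaded in the Dependencies section below (the data points are arbitrary):

(require '[fastmath.regression :as reg]) ;; loaded properly in the Dependencies section below

(def gp-preview (reg/gaussian-process+ [0 1 2] [1 -1 2])) ;; condition a GP on three points
(gp-preview 0.5)                         ;; mean prediction (regression) at x = 0.5
(reg/predict-all gp-preview [0.5] true)  ;; mean and standard deviation (uncertainty) at x = 0.5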
In this article I mainly focus on regression and interpolation. Additionally, I'll explore kernels and their parametrization.
The second article will focus on Bayesian Optimization, where GP is used.
Fastmath implements two GPs in the fastmath.regression namespace. One is backed by the SMILE library and is limited to regression. The second one is implemented directly in Clojure and provides more information, like the standard deviation and prior and posterior samplers.
Links to external resources:
https://katbailey.github.io/post/gaussian-processes-for-dummies/
https://distill.pub/2019/visual-exploration-gaussian-processes/
Dependencies
Let's load dependencies. Three libraries are used:
cljplot - for charts
fastmath - for kernels and GP itself
clojure2d - for colors, gradients
{:deps
{org.clojure/clojure {:mvn/version "1.10.1"}
cljplot {:mvn/version "0.0.2-SNAPSHOT"}
generateme/fastmath {:mvn/version "1.4.0-SNAPSHOT"}
clojure2d {:mvn/version "1.2.0-SNAPSHOT"}}}
(require '[cljplot.build :as b]
         '[cljplot.core :refer :all]
         '[fastmath.core :as m]
         '[fastmath.kernel :as k]
         '[fastmath.random :as r]
         '[fastmath.regression :as reg]
         '[clojure2d.color :as c])
GP
Let's start with the SMILE version and explore regression and interpolation between points a little bit.
This version accepts two parameters:
kernel - kernel object (default: gaussian)
lambda - noise factor (default: 0.5)
To predict, just call the predict / predict-all functions or call the regression object itself (all regression objects implement IFn).
(def domain [-3.0 3.0]) ;; our domain, 1d
(def xs [0 1 -2 -2.001]) ;; 4 points
(def ys [-2 3 0.5 -0.6]) ;; 4 values
(def scatter-data (map vector xs ys)) ;; reorganize data for scatter plot
(def gp1 (reg/gaussian-process xs ys))
(def gp2 (reg/gaussian-process {:kernel (k/kernel :gaussian 0.8)
:lambda 0.000001} xs ys))
(reg/predict gp1 1.5) ;; prediction!
(gp1 1.5)
(gp2 1.5)
(reg/predict-all gp2 [0 1 -2 -2.001])
(defn draw-gp
"Draw multi function chart with legend.
* xs - domain points
* ys - values
* title - chart title
* legend-name - legend title
* labels - legend labels
* gps - map of gaussian processes to visualize
* h, w - optional height and width of the chart"
[xs ys title legend-name labels gps & [h w]]
(let [pal (c/palette-presets :category10) ;; use category10 palette
leg (map #(vector :line %1 {:color %2}) labels pal) ;; construct legend entries
scatter-data (map vector xs ys)] ;; prepare scatter data
(-> (xy-chart {:width (or w 700) :height (or h 300)}
(-> (b/series [:grid]
[:scatter scatter-data {:size 8}])
(b/add-multi :function gps {:samples 400
:domain domain} {:color pal}))
(b/add-axes :bottom)
(b/add-axes :left)
(b/add-legend legend-name leg)
(b/add-label :top title)))))
(let [gp (reg/gaussian-process {:lambda 0.5} xs ys)] ;; create gaussian process
(save (draw-gp xs ys "Default GP regression with Smile"
"Lambda" [0.5] {:r1 gp})
"/results/default-smile.jpg"))
Lambda / noise
You may observe that the line doesn't go through the points. This is a result of using the lambda parameter with a fairly high value.
The lambda parameter measures the level of noise in our data and regulates how tightly the model fits. Let's see how it works.
(let [lambdas [0.00005 0.1 2.0]]
(save (draw-gp xs ys "GP regression with different lambda"
"Lambda" lambdas
(map vector lambdas
(map #(reg/gaussian-process {:lambda %} xs ys) lambdas)))
"/results/smile-different-lambda.jpg"))
As you can see, the lower the lambda the better the interpolation, but very low values also lead to overfitting.
Let's try with more points from some noisy toy function.
(defn toyfn [x] (+ (r/grand) (m/exp x) (* -5 (m/cos (* 3 x)))))
(save (xy-chart {:width 700 :height 300}
(b/series [:grid] [:function toyfn {:domain domain}])
(b/add-axes :bottom)
(b/add-axes :left)) "/results/toyfn.jpg")
(let [lambdas [0.00005 0.1 2.0]
xs (repeatedly 30 #(apply r/drand domain))
ys (map toyfn xs)]
(save (draw-gp xs ys "GP regression with different lambda"
"Lambda" lambdas
(map vector lambdas
(map #(reg/gaussian-process {:lambda %} xs ys) lambdas)))
"/results/toy-different-lambda.jpg"))
To summarize: the lower the lambda, the tighter the fit.
Kernels
The other important parameter is the kernel. A kernel is a measure of correlation between points. Usually kernels are constructed so that closer points get higher correlation.
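To make this concrete, here is a tiny hand-rolled one-dimensional Gaussian (RBF) kernel, independent of fastmath.kernel and used only for illustration (the sigma parametrization here is my own, not fastmath's):

;; a hand-rolled 1d Gaussian (RBF) kernel, for illustration only
(defn rbf-kernel [sigma]
  (fn [x y]
    (m/exp (- (/ (m/sq (- x y)) (* 2.0 sigma sigma))))))

(def k-demo (rbf-kernel 1.0))
(k-demo 0.0 0.1)                        ;; close points   => close to 1.0
(k-demo 0.0 3.0)                        ;; distant points => close to 0.0
(= (k-demo 0.0 3.0) (k-demo 3.0 0.0))   ;; symmetric, as required below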
The fastmath.kernel namespace defines plenty of kernels. Most of them can be used in a GP, but some cannot: to work with a GP, a kernel has to be symmetric and positive semi-definite.
(sort (keys (methods k/kernel)))
The kernel multimethod creates the actual kernel function, which operates on vectors. The multimethod accepts a kernel name as a keyword plus additional parameters (mostly a scaling factor).
(def k1 (k/kernel :gaussian)) ;; default scaling = 1.0
(def k2 (k/kernel :mattern-32)) ;; mattern 3/2 kernel
(def k3 (k/kernel :mattern-32 0.5)) ;; scaling parameter
{:k1 (k1 [0] [1])
:k2 (k2 [0] [1])
:k3 (k3 [0] [1])}
Let's see how they look. Blue with scaling 1.0, red with scaling 0.5:
(def kernels [:gaussian :cauchy :anova :linear :inverse-multiquadratic
:mattern-12 :mattern-52 :periodic :exponential])
(let [fk (fn [f] #(f [0.1] [%])) ;; similarity between the point 0.1 and a given x
      make-data (fn [scale] (map (fn [k] [(str k)
                                          (fk (k/kernel k scale))]) kernels))
cfg {:domain [-3 3] :samples 200 :stroke {:size 2}}]
(-> (xy-chart {:width 700 :height 600}
(-> (b/lattice :function (make-data 1) cfg {:grid true})
(b/add-series (b/lattice :function (make-data 0.5)
(assoc cfg :color (c/color 245 80 60)) {:label name})))
(b/update-scales :x :ticks 5)
(b/add-label :top "Various kernels around 0.1")
(b/add-axes :bottom)
(b/add-axes :left)
(b/add-axes :right))
(save "/results/kernels.jpg")))
Let's look at covariance matrices.
(let [r (range -1 1 0.025)
data (map (fn [k] [(str k)
(let [kernel (k/kernel k)]
(for [x r y r] [[x y] (kernel [x] [y])]))]) kernels)]
(-> (xy-chart {:width 700 :height 600}
(b/lattice :heatmap data {} {:label name :grid true})
(b/add-label :top "Covariance matrices"))
(save "/results/covariances.jpg")))
Now let's use some of the kernels to play with regression.
(save (draw-gp xs ys "GP regression with different kernel"
"Kernel" kernels
(map vector kernels
(map #(reg/gaussian-process {:lambda 0.01
                             :kernel (k/kernel %)} xs ys) kernels)))
"/results/smile-different-kernels.jpg")
Exploration
Now a little bit more about kernels. Let's see how the kernel parameters affect the regression.
(defn draw-gp-lattice
[title gps]
(xy-chart {:width 700 :height 500}
(-> (b/lattice :function gps {:domain domain :samples 400}
{:shape [-1 1] :grid true})
(b/add-series (b/lattice :scatter (zipmap (map first gps) (repeat scatter-data)) {:size 8} {:label name :shape [-1 1]})))
(b/add-axes :bottom)
(b/add-axes :left)
(b/add-label :top title)))
(defn gen-gps
"Generate gaussian processes for given parametrization."
[lambdas kernels kernel-params]
(map (fn [l k p]
(let [n (str "Kernel: " k "(" p "), lambda=" l)
kernel (k/kernel k p)]
[n (reg/gaussian-process {:kernel kernel :lambda l} xs ys)])) (cycle lambdas) (cycle kernels) kernel-params))
(save (draw-gp-lattice "Various gaussian kernels"
(gen-gps [0.001] [:gaussian] [0.1 1 4])) "/results/gk.jpg")
(save (draw-gp-lattice "Various mattern kernels"
(gen-gps [0.0001] [:mattern-12] [0.1 1 4])) "/results/mk.jpg")
(save (draw-gp-lattice "Various periodic kernels"
(gen-gps [0.0001] [:periodic] [0.1 1 4])) "/results/pk.jpg")
GP+
gaussian-process+ is the enhanced version of GP which exposes some additional information, like:
prior distribution
posterior distribution
standard deviation
Let's see how all of the above depend on the kernel and the noise.
This version accepts the following parameters as a map:
kernel - the same as above (default: gaussian)
kscale - variance scaler; the result of the kernel is multiplied by this value (default: 1.0)
noise - the same as lambda, just a different name for the same parameter (default: 1.0e-6)
normalize? - should ys be normalized to have zero mean or not? (default: false)
(def gp1+ (reg/gaussian-process+ xs ys))
(def gp2+ (reg/gaussian-process+ {:normalize? true} xs ys))
{:gp1+ (gp1+ 1.5)
:gp2+ (gp2+ 1.5)}
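All of the options can be combined. Here is a sketch with the full option map; the particular values are arbitrary, chosen only for illustration:

;; all four options together; the values are arbitrary, for illustration only
(def gp3+ (reg/gaussian-process+ {:kernel (k/kernel :mattern-52 0.5) ;; Matern 5/2 with scaling 0.5
                                  :kscale 1.5       ;; multiply kernel output (variance scaling)
                                  :noise 0.05       ;; observation noise (lambda)
                                  :normalize? true} ;; normalize ys to zero mean
                                 xs ys))
(gp3+ 1.5)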
Prior
Prior samples are functions drawn from the GP without any knowledge of the data. Every point has a N(0,1) distribution.
We can't get an explicit function, so we have to sample points and interpolate between the values.
(let [gp (reg/gaussian-process+ xs ys)]
(reg/prior-samples gp (range 0 1 0.1)))
(defn draw-prior
([gp] (draw-prior gp 20))
([gp cnt]
(let [xs (range -3 3.1 0.1)
priors (map #(vector % (map vector xs (reg/prior-samples gp xs))) (range cnt))]
(xy-chart {:width 700 :height 300}
(-> (b/series [:grid])
(b/add-multi :line priors {} {:color (cycle (c/palette-presets :category20c))})
(b/add-serie [:hline 0 {:color :black}]))
(b/add-axes :bottom)
(b/add-axes :left)
(b/add-label :top "Priors")))))
(save (draw-prior (reg/gaussian-process+ xs ys)) "/results/prior.jpg")
When we make the noise level higher, the priors become more random.
(save (draw-prior (reg/gaussian-process+ {:noise 0.5} xs ys)) "/results/priorn.jpg")
When we use a different kernel, the structure of the functions reflects the kernel shape.
(save (draw-prior (reg/gaussian-process+ {:kernel (k/kernel :periodic 2)
:noise 0.01} xs ys) 5)
"/results/priorp.jpg")
(save (draw-prior (reg/gaussian-process+ {:kernel (k/kernel :linear)} xs ys))
"/results/priorl.jpg")
Posteriors
When some points are known, you can draw posterior functions using posterior-samples. Posteriors represent a selection of possible functions for the given kernel and noise value.
(def xs [-4 -1 2])
(def ys [-5 -1 2])
(defn draw-posterior
([gp] (draw-posterior gp 20))
([gp cnt]
(let [xxs (range -5 5.1 0.2)
posteriors (map #(vector % (map vector xxs (reg/posterior-samples gp xxs))) (range cnt))]
(xy-chart {:width 700 :height 300}
(-> (b/series [:grid])
(b/add-multi :line posteriors {} {:color (cycle (c/palette-presets :category20c))})
(b/add-serie [:function gp {:domain [-5 5] :color :black :stroke {:size 2 :dash [5 2]}}])
(b/add-serie [:scatter (map vector xs ys) {:size 8}]))
(b/add-axes :bottom)
(b/add-axes :left)
(b/add-label :top "Posteriors")))))
(save (draw-posterior (reg/gaussian-process+ {:kernel (k/kernel :gaussian 0.5)}
xs ys))
"/results/posterior.jpg")
(save (draw-posterior (reg/gaussian-process+ {:kernel (k/kernel :periodic 0.2 6.5)}
xs ys))
"/results/posteriorp.jpg")
Standard deviation
Instead of drawing posterior samples, we can draw confidence intervals using the standard deviation. This gives information about areas of uncertainty, which will be important in the next part about Bayesian Optimization.
The charts below show the 50% and 95% confidence intervals together with the predicted function (the mean). The 95% band is the mean ± 1.96 * stddev and the 50% band is the mean ± 0.67 * stddev, the corresponding quantiles of the normal distribution.
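To see where those bands come from at a single point, here is a minimal sketch that builds the 95% interval directly from the mean and standard deviation returned by predict-all:

;; 95% interval at one point: mean ± 1.96 * stddev
(let [gp (reg/gaussian-process+ [-4 -1 2] [-5 -1 2])  ;; the points used in the posterior example above
      [[mu stddev]] (reg/predict-all gp [0.0] true)]  ;; request the standard deviation as well
  {:mean mu
   :ci-95 [(- mu (* 1.96 stddev)) (+ mu (* 1.96 stddev))]})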
(def xs [-5 1 2])
(def ys [17 10 12])
(defn draw-stddev
[gp]
(let [xxs (range -5 5.1 0.2)
pairs (reg/predict-all gp xxs true)
mu (map first pairs)
stddev (map second pairs)
s95 (map (partial * 1.96) stddev)
s50 (map (partial * 0.67) stddev)]
(xy-chart {:width 700 :height 300}
(b/series [:grid]
[:ci [(map vector xxs (map - mu s95)) (map vector xxs (map + mu s95))] {:color (c/color :lightblue 180)}]
[:ci [(map vector xxs (map - mu s50)) (map vector xxs (map + mu s50))] {:color (c/color :lightblue)}]
[:line (map vector xxs mu) {:color :black :stroke {:size 2 :dash [5 2]}}]
[:scatter (map vector xs ys) {:size 8}])
(b/add-axes :bottom)
(b/add-axes :left)
(b/add-label :top "Confidence intervals"))))
(save (draw-stddev (reg/gaussian-process+ xs ys)) "/results/predict.jpg")
You can see that the function goes down towards zero in areas without known points. This is caused by the lack of normalization.
(save (draw-stddev (reg/gaussian-process+ {:normalize? true} xs ys)) "/results/predictn.jpg")
With data normalization the line is flatter, but this creates higher uncertainty.
Let's try different kernels and options.
(save (draw-stddev (reg/gaussian-process+ {:kernel (k/kernel :mattern-32)
:noise 0.2
:normalize? true}
xs ys)) "/results/predictnm.jpg")
(save (draw-stddev (reg/gaussian-process+ {:kernel (k/kernel :gaussian 0.2)
:normalize? false}
xs ys)) "/results/predictb.jpg")
(save (draw-stddev (reg/gaussian-process+ {:kscale 0.1}
xs ys)) "/results/predict-low-scale.jpg")
Play with it yourself.
PART 2 - Bayesian Optimization (soon)