Clojure Spec for Data Science
At the SciCloj meetup in Berlin, the idea came up to use clojure.spec to validate input data and then use generators derived from the same specs to fill in replacements for the invalid values. Here we explore a proof of concept.
{:deps
 {org.clojure/clojure {:mvn/version "1.10.1"}
  org.clojure/test.check {:mvn/version "0.10.0-alpha3"}}}
(require '[clojure.spec.alpha :as s])
(require '[clojure.spec.gen.alpha :as gen])
Let's start with an example data set: a handful of small integers, plus two values (999 and an empty string) that do not belong.
(def data [1 2 1 4 5 1 999 ""])
Define specs for what we consider valid data: integers from 1 to 5 (the upper bound of s/int-in is exclusive), and a collection of such integers:
(s/def ::n (s/int-in 1 6))
(s/def ::input (s/coll-of ::n))
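Before running the specs against real data, it can help to probe the boundaries by hand. The following is a small sketch that repeats the two spec definitions so it runs on its own; the probe values are chosen only for illustration.

```clojure
(require '[clojure.spec.alpha :as s])

(s/def ::n (s/int-in 1 6))
(s/def ::input (s/coll-of ::n))

;; s/int-in includes the lower bound and excludes the upper bound.
(s/valid? ::n 1)            ;; => true
(s/valid? ::n 5)            ;; => true
(s/valid? ::n 6)            ;; => false
(s/valid? ::n "")           ;; => false, not an integer at all

;; A collection is valid only if every element is.
(s/valid? ::input [1 2 5])  ;; => true
(s/valid? ::input [1 2 6])  ;; => false
```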
Now check whether our input data is considered valid:
(s/valid? ::input data)
(s/explain-data ::input data)
It is not valid, and clojure.spec can tell us exactly which problems it found. Each problem carries an :in path that points at the offending value:
(def explained (s/explain-data ::input data))
(get-in data (-> explained ::s/problems first :in))
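The same lookup can be done for every problem at once. As a rough sketch (the name `invalid-entries` is made up for this example), we can pair each problem's :in path with the :val that failed:

```clojure
(require '[clojure.spec.alpha :as s])

(s/def ::n (s/int-in 1 6))
(s/def ::input (s/coll-of ::n))

(def data [1 2 1 4 5 1 999 ""])
(def explained (s/explain-data ::input data))

;; One [path value] pair per failing element.
;; `invalid-entries` is a hypothetical name used only here.
(def invalid-entries
  (map (juxt :in :val) (::s/problems explained)))

invalid-entries
;; => ([[6] 999] [[7] ""])
```

Each problem map also has a :via key listing the specs that were traversed, which is what lets us find the right generator later on.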
To use a generator, we just pass s/gen the name of the spec for which we need a value:
(gen/generate (s/gen ::n))
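To get a feel for what the generator produces, we can draw several values at once with gen/sample. This is only a sketch; the sample size of 20 is arbitrary, and the values differ from run to run.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.gen.alpha :as gen])

(s/def ::n (s/int-in 1 6))

;; Draw 20 values from the generator derived from ::n.
(def samples (gen/sample (s/gen ::n) 20))

;; Every generated value conforms to the spec it came from.
(every? #(s/valid? ::n %) samples)
;; => true
```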
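We can put this together to validate the input data and automatically fill in the values that failed validation. For each problem, :in tells us where the bad value sits, and the last entry of :via names the spec it failed, so we generate a replacement from that spec.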
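(def valid-data
  (reduce (fn [d p]
            ;; Replace the value at path (:in p) with a fresh one generated
            ;; from the innermost spec that failed, the last entry of (:via p).
            (update-in d
                       (:in p)
                       (fn [_] (gen/generate (s/gen (-> p :via last))))))
          data
          (::s/problems explained)))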
(s/valid? ::input valid-data)
And our data is now valid!
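The steps above could be wrapped into a single function. The sketch below uses a hypothetical helper named `fill-invalid`. It assumes the data is an associative structure such as a vector, and that every problem has a non-empty :in path and a named spec at the end of :via; it does not handle problems reported on the collection itself.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.gen.alpha :as gen])

(s/def ::n (s/int-in 1 6))
(s/def ::input (s/coll-of ::n))

(defn fill-invalid
  "Hypothetical helper: returns data with every value that fails spec
  replaced by a generated one. Returns data unchanged if it is valid."
  [spec data]
  (if-let [explained (s/explain-data spec data)]
    (reduce (fn [d {:keys [in via]}]
              ;; `in` is the path to the bad value,
              ;; `(last via)` is the spec it failed.
              (assoc-in d in (gen/generate (s/gen (last via)))))
            data
            (::s/problems explained))
    data))

(def repaired (fill-invalid ::input [1 2 1 4 5 1 999 ""]))

(s/valid? ::input repaired)
;; => true
```

The first six values are left as they were; only the last two positions are replaced.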
TODO:
Write a custom generator that analyzes the valid input data and generates the most common value.
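One possible starting point for this TODO, as a sketch only: a hypothetical helper `most-common-gen` that counts the values passing the spec and returns a constant generator for the most frequent one. It falls back to the spec's own generator when no value is valid, and it does not define how ties are broken.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.gen.alpha :as gen])

(s/def ::n (s/int-in 1 6))

(defn most-common-gen
  "Hypothetical helper: a generator that always yields the most frequent
  value in data that satisfies spec. Falls back to the spec's own
  generator if no value in data is valid."
  [spec data]
  (let [valid (filter #(s/valid? spec %) data)]
    (if (seq valid)
      ;; frequencies returns a map of value -> count; pick the entry
      ;; with the highest count and return its key.
      (gen/return (key (apply max-key val (frequencies valid))))
      (s/gen spec))))

(def data [1 2 1 4 5 1 999 ""])

(gen/generate (most-common-gen ::n data))
;; => 1
```

Such a generator could then take the place of the `(s/gen ...)` call in the fill-in step.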