Dieter Komendera / Feb 27 2020

Clojure Spec for Data Science

At the SciCloj meetup in Berlin, the idea came up to use clojure.spec to validate input data and then use generators derived from the same specs to fill in replacements for the invalid values. Here we explore a proof of concept.

{:deps {org.clojure/clojure    {:mvn/version "1.10.1"}
        org.clojure/test.check {:mvn/version "0.10.0-alpha3"}}}

deps.edn

(require '[clojure.spec.alpha :as s])
(require '[clojure.spec.gen.alpha :as gen])

Let's start with an example data set. Most values are small integers, but it also contains 999 and an empty string:

(def data [1 2 1 4 5 1 999 ""])

user/data

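Define specs for what we consider valid data. s/int-in takes an inclusive lower and an exclusive upper bound, so ::n accepts the integers 1 through 5: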

(s/def ::n (s/int-in 1 6))
(s/def ::input (s/coll-of ::n))

:user/input

Then check whether our input data is valid:

(s/valid? ::input data)

false

(s/explain-data ::input data)

Map {:clojure.spec.alpha/problems: List(2), :clojure.spec.alpha/spec: :user/input, :clojure.spec.alpha/value: Vector(8)}

The data is not valid, and clojure.spec tells us exactly which problems it found. Each problem's :in key holds the path to the offending value inside the input:

(def explained *1)

Map {:clojure.spec.alpha/problems: List(2), :clojure.spec.alpha/spec: :user/input, :clojure.spec.alpha/value: Vector(8)}

(get-in data (-> explained ::s/problems first :in))

999
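To see what each problem carries, here is a small sketch that keeps only the keys used in the rest of this post. The var name `problems` is introduced for illustration; the output shown assumes the code is evaluated in the `user` namespace, as in the cells above.

```clojure
(require '[clojure.spec.alpha :as s])

(s/def ::n (s/int-in 1 6))
(s/def ::input (s/coll-of ::n))

(def data [1 2 1 4 5 1 999 ""])

;; For every problem keep the path into the data (:in), the failing
;; value (:val) and the chain of spec names that were checked (:via).
(def problems
  (mapv #(select-keys % [:in :val :via])
        (::s/problems (s/explain-data ::input data))))

problems
;; => [{:in [6], :val 999, :via [:user/input :user/n]}
;;     {:in [7], :val "", :via [:user/input :user/n]}]
```

The last entry of :via names the spec that the failing value was checked against, which is what lets us look up a matching generator later.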

To use the generators, we pass the name of a spec to s/gen and draw a value from the resulting generator with gen/generate:

(gen/generate (s/gen ::n))
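Since generated values are random, a single call is not very telling. As a minimal sketch, gen/sample can draw several values at once, and we can confirm that each one satisfies the spec it was generated from. The var name `samples` is only for illustration.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.gen.alpha :as gen])

(s/def ::n (s/int-in 1 6))

;; Draw ten random values from the generator derived from ::n.
(def samples (gen/sample (s/gen ::n) 10))

;; Every generated value conforms to the spec it came from.
(every? #(s/valid? ::n %) samples)
;; => true
```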

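We can put this together to validate the input data and automatically replace each value that failed validation with a generated one. For every problem, :in tells us where the value lives and the last entry of :via tells us which spec to generate from.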

(reduce (fn [d p]
          (update-in d
                     (:in p)
                     (fn [_] (gen/generate (s/gen (-> p :via last))))))
        data
        (-> explained ::s/problems))

Vector(8) [1, 2, 1, 4, 5, 1, 5, 4]

(def valid-data *1)
(s/valid? ::input valid-data)

true

And our data is now valid!
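The steps above could be wrapped into one function. This is only a sketch: `fill-invalid` is a hypothetical name, and it assumes that every failing value was checked against a registered spec (so the last entry of :via is usable with s/gen) and that the :in path can be addressed with assoc-in, as is the case for vectors and maps.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.gen.alpha :as gen])

(s/def ::n (s/int-in 1 6))
(s/def ::input (s/coll-of ::n))

(defn fill-invalid
  "Hypothetical helper: replaces every value in `data` that fails `spec`
  with a value generated from the spec it failed against."
  [spec data]
  (reduce (fn [d {:keys [in via]}]
            ;; `in` is the path to the bad value, `(last via)` the name
            ;; of the spec that rejected it.
            (assoc-in d in (gen/generate (s/gen (last via)))))
          data
          ;; explain-data returns nil for valid data, so the reduce
          ;; then simply returns `data` unchanged.
          (::s/problems (s/explain-data spec data))))

(def repaired (fill-invalid ::input [1 2 1 4 5 1 999 ""]))

(s/valid? ::input repaired)
;; => true
```

Values that were already valid are left untouched; only the positions reported in the problems are overwritten.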

TODO:

Write a custom generator that analyzes the valid input data and generates the most common value.
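One possible direction for this, as a rough sketch rather than a finished design: count the values that already satisfy the spec, pick the most frequent one, and attach a constant generator to the spec with s/with-gen. The names `most-common-valid` and `mode-spec` are made up for this example, and the sketch assumes the data contains at least one valid value.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.gen.alpha :as gen])

(s/def ::n (s/int-in 1 6))

(defn most-common-valid
  "Hypothetical helper: the most frequent value in `coll` that satisfies
  `spec`. Assumes at least one value in `coll` is valid."
  [spec coll]
  (->> coll
       (filter #(s/valid? spec %))
       frequencies
       (apply max-key val)
       key))

(defn mode-spec
  "Hypothetical helper: a copy of `spec` whose generator always returns
  the most common valid value found in `coll`."
  [spec coll]
  (let [mode (most-common-valid spec coll)]
    (s/with-gen spec (fn [] (gen/return mode)))))

(def data [1 2 1 4 5 1 999 ""])

(most-common-valid ::n data)
;; => 1

(gen/generate (s/gen (mode-spec ::n data)))
;; => 1
```

A less rigid variant might weight each valid value by how often it occurs instead of always returning the single most common one.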
