Boulder Rescue Response Times
Through Colorado's OpenData initiative we have access to many great datasets, and Boulder, Colorado specifically publishes a rescue-unit response-time dataset. We dive into this interesting dataset to show how to quickly explore it and train a model to see how predictive a subset of the columns is of the rescue response time.
{:deps
{org.clojure/clojure {:mvn/version "1.10.0"}
org.clojure/tools.deps.alpha
{:git/url "https://github.com/clojure/tools.deps.alpha.git"
:sha "f6c080bd0049211021ea59e516d1785b08302515"}
techascent/tech.ml.dataset {:mvn/version "2.0-beta-56"}
techascent/tech.viz {:mvn/version "0.3"}
techascent/tech.ml {:mvn/version "2.0-beta-56"
:exclusions [org.ow2.asm/asm]}
compliment {:mvn/version "0.3.9"}}}
You can add dependencies by modifying the deps.edn
above (requires a runtime restart)...
(use 'clojure.tools.deps.alpha.repl)
(clojure-version)
(require '[tech.ml.dataset :as ds])
(require '[tech.v2.datatype :as dtype])
(require '[tech.io :as io])
Loading The Data
You can get the original dataset from: http://data-boulder.opendata.arcgis.com/datasets/6ae16fc5d05b46189800f189a587a223_0.csv. Be warned, it is around 70MB and updates weekly with new information.
(def ds (ds/->dataset "boulder-rescue-unit-response-times-latest.csv.gz"
                      {:key-fn keyword}))
(dtype/shape ds)
(println ds)
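If you would rather not keep a local gzipped copy, ->dataset reads from anything tech.io can open, which should include plain http URLs. A minimal sketch, assuming the URL above is still live (this re-downloads ~70MB on every run):
(comment
  (def ds (ds/->dataset
           "http://data-boulder.opendata.arcgis.com/datasets/6ae16fc5d05b46189800f189a587a223_0.csv"
           {:key-fn keyword})))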
Some Quick Explorations
One great first step is to generate descriptive statistics across all of the columns. Note that this command works even for date/time columns.
(println (vary-meta (ds/descriptive-stats ds)
                    assoc
                    :print-column-max-width 15
                    :print-index-range :all))
Let's zero in on CALLTYPE for a moment.
(->> (ds :CALLTYPE)
     (frequencies)
     (sort-by second >)
     (map #(hash-map :calltype (first %) :frequency (second %)))
     (ds/->>dataset)
     (#(vary-meta % assoc :print-index-range :all))
     (println))
;; Mean response time by various columns
(require '[tech.v2.datatype.datetime.operations :as dtype-dt-ops])
(require '[tech.v2.datatype.functional :as dfn])
(defn mean-response-by
[ds by-colname]
(as-> (ds/select-columns ds [by-colname :RESPONSETIME]) ds
(ds/drop-rows ds (ds/missing ds))
(ds/group-by-column by-colname ds)
;;Dataset group by produces a map of key to dataset
(map (fn [[calltype ds]]
{by-colname calltype
:mean-response-sec (-> (ds :RESPONSETIME)
(dtype-dt-ops/get-seconds)
(dfn/mean))})
ds)
(sort-by :mean-response-sec ds)
(ds/->>dataset {:dataset-name (str "mean-res-time-by-" (name by-colname))} ds)
(vary-meta ds assoc :print-index-range :all)))
(println (mean-response-by ds :CALLTYPE))
(println (mean-response-by ds :SEVERITY))
Fixing Parsing Issues
Often your data is messy. A few of the columns failed to parse because our parsing code doesn't know their exact datetime format. We can fix these types of issues without reloading the entire dataset.
;;A set of columns failed to parse because they have a custom datetime format
(take 5 (ds :TIMEPHONEPICKUP))
;;We can parse one of those columns with a custom parser like this:
(require '[tech.ml.dataset.column :as ds-col])
;;parse-column takes a datatype or a datatype paired with a parsing function.
;;If the datatype is a datetime type, the parsing function may be a string, in
;;which case it is used to construct a java.time.format.DateTimeFormatter object.
(take 5 (ds-col/parse-column [:zoned-date-time "yyyy/MM/dd HH:mm:ssx"]
(ds :TIMEPHONEPICKUP)))
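To make that format string concrete, here is what it expands to in plain java.time, checked against a hypothetical raw value in the file's format:
(java.time.ZonedDateTime/parse
 "2019/06/15 13:02:11+00"
 (java.time.format.DateTimeFormatter/ofPattern "yyyy/MM/dd HH:mm:ssx"))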
;;So now we just parse all of them. We make the further assumption that values
;;at the epoch time 1970-01-01 actually represent missing data.
(def date-columns
[:TIMEPHONEPICKUP :TIMECALLCLOSED :TIMEASSIGNED :TIMEENROUTE :TIMEARRIVED
:UNITCLEARED :TRANSPORTDEPART :TRANSPORTARRIVED])
(defn parse-date-columns
[ds]
(->> date-columns
;;parsing is single-threaded so a pmap in this case is helpful. This isn't
;;true for most of the operations on a column, however. Most operations are
;;either lazy or parallelized.
(pmap (fn [cname]
[cname (ds-col/parse-column
[:zoned-date-time "yyyy/MM/dd HH:mm:ssx"]
(ds-col/set-missing (ds cname)
;;argfilter works like filter but
;;is non-lazy, parallelized, and
;;produces its result as indexes
(dfn/argfilter
 #(= "1970/01/01 00:00:00+00" %)
 (ds cname))))]))
;;flatten out pairs
(apply concat)
(apply ds/assoc ds)))
;;Let's see the date range of the dataset.
(-> (parse-date-columns ds)
(ds/select-columns date-columns)
(ds/descriptive-stats)
(println))
Heading Toward Training/Inference
Let's narrow our columns down to a few I thought were interesting. Maybe I missed an important one, but these are categorical columns, while many of the other columns are various measures of timings internal to the rescue-response process.
(def interesting-columns
[:TIMEPHONEPICKUP :RESPONSETIME
:LATATASSIGNTIME :LONGATASSIGNTIME
:X :Y :RESPONSEPLAN :INCIDENTTYPE
:PROBLEM :CALLTYPE :DETERMINANT :SEVERITY])
(def simple-ds (as-> (parse-date-columns ds) ds
(ds/select-columns ds interesting-columns)
(ds/drop-rows ds (ds/missing ds))
(vary-meta ds assoc :print-column-max-width 15)))
Let's print out what we have so far. Note that LONGATASSIGNTIME is an integer while X is a floating point number. Further note that LONGATASSIGNTIME is positive even though Boulder's longitude is negative...
(println simple-ds)
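A quick sanity check of that observation; the raw value below is hypothetical but matches the magnitudes printed above, and dividing by 1,000,000 then flipping the sign recovers a plausible Boulder longitude:
(- (/ 105251000 1000000.0))
;;=> -105.251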
Often it is useful to force the system to use fewer characters per column even if this means truncating output.
(println (vary-meta (ds/descriptive-stats simple-ds)
assoc :print-column-max-width 15
:print-index-range :all))
Calculating Distance From Incident
Switching gears a bit, let's get to training a model. One of the features we want is the distance from the unit's location at assign time to the XY coordinate, which we guess is the incident location.
Following from our note above, we have a little bit of work to do before we can calculate that distance: the assign-time coordinates are integers scaled by 1,000,000, and the longitude needs its sign flipped.
(defn append-distance
  [ds]
  ;;Remove all rows missing any of these coordinates first so the readers
  ;;below line up row-for-row with the dataset we assoc onto.
  (let [ds (ds/drop-rows ds (reduce dtype/set-or
                                    [(ds-col/missing (ds :LATATASSIGNTIME))
                                     (ds-col/missing (ds :LONGATASSIGNTIME))
                                     (ds-col/missing (ds :X))
                                     (ds-col/missing (ds :Y))]))
        ;;Readers allow random typed access. They also overload IFn, which means
        ;;you can use them as a function of their indexes for untyped access.
        lat-assign (dtype/->reader (ds :LATATASSIGNTIME))
        lon-assign (dtype/->reader (ds :LONGATASSIGNTIME))
        assign-denom 1000000.0
        ;;It is important to note that these elementwise math functions are
        ;;lazy.
        lon-assign (dfn/- (dfn// lon-assign assign-denom))
        lat-assign (dfn// lat-assign assign-denom)
        target-lon (dtype/->reader (ds :X))
        target-lat (dtype/->reader (ds :Y))]
    (ds/assoc ds :distance
              (dtype/make-reader :float64
                                 (ds/row-count ds)
                                 (let [x-diff (- (double (lon-assign idx))
                                                 (double (target-lon idx)))
                                       y-diff (- (double (lat-assign idx))
                                                 (double (target-lat idx)))]
                                   (Math/sqrt
                                    (+ (* x-diff x-diff)
                                       (* y-diff y-diff))))))))
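As an aside, here is a tiny, self-contained demonstration of the reader behavior the comments above rely on -- typed random access plus the IFn overload (a sketch using a small literal vector):
(let [r (dtype/->reader [1.5 2.5 3.5])]
  ;;readers implement IFn over their indexes
  [(r 0) (r 2)])
;;=> [1.5 3.5]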
(println (take 5 ((append-distance simple-ds) :distance)))
;;Let's plot a histogram of distances. Taking the log because the scales are so
;;different. In general they are well distributed with a tail of bad distances
;;at the end.
(require '[tech.viz.vega :as vega])
(-> (vega/histogram (dfn/log ((append-distance simple-ds) :distance)) "Distances"
{:bin-count 20})
(vega/vega->svg-file "results/distances.svg"))
Now we prepare the dataset for training and split it into train/test datasets.
Building Train/Test Datasets
(def categoricals
[:RESPONSEPLAN :INCIDENTTYPE :PROBLEM
:CALLTYPE :DETERMINANT :SEVERITY])
;;Useful things for building out ml pipelines
(require '[tech.ml.dataset.pipeline :as ds-pipe])
(defn prepare-for-training
  [ds]
  (as-> (append-distance ds) ds
    (ds/select-columns ds (concat [:distance :RESPONSETIME]
                                  categoricals))
    ;;Convert string columns to numbers, storing the conversion table in the
    ;;column metadata
    (ds-pipe/string->number ds)
    ;;Get the number of seconds as opposed to a duration. xgboost doesn't know
    ;;anything about durations...
    (ds/update-column ds :RESPONSETIME dtype-dt-ops/get-seconds)
    ;;We will try to predict responsetime
    (ds/set-inference-target ds :RESPONSETIME)
    ;;Ignore missing for now. Really we would need a plan for missing values,
    ;;as incoming data would undoubtedly have them.
    (ds/drop-rows ds (ds/missing ds))
    ;;Remember how we said the distance calculation was lazy? Many of the
    ;;previous operations were also lazy. dtype/clone forces them to complete
    ;;and stores the result in dense primitive arrays.
    (dtype/clone ds)))
;;train dataset is in :train-ds, test dataset is in :test-ds
(def train-test-split (-> (prepare-for-training simple-ds)
(ds/->train-test-split)))
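A note on reproducibility: we believe ->train-test-split accepts :train-fraction and :seed options; a hedged sketch (the option names are our assumption, so verify against your version):
(comment
  (ds/->train-test-split (prepare-for-training simple-ds)
                         {:train-fraction 0.8 :seed 42}))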
The train/test datasets now contain numbers where they previously had strings.
(println (ds/head (train-test-split :train-ds)))
Training And Verifying
We train a model and extract some explanatory information from it.
(require '[tech.ml :as ml])
(require '[tech.libs.xgboost])
;;Because we set the inference target above, the ml system knows the target column.
(def model (ml/train {:model-type :xgboost/regression}
(:train-ds train-test-split)))
;;Let's see what columns the model thought were important. Also, you can safely
;;save out the model with nippy/freeze and friends. No need to pickle or take
;;any further steps; the model is already pure data, not an object.
(->> (get (ml/explain-model model) "gain")
     (map #(hash-map :colname (first %) :gain (second %)))
     (ds/->>dataset)
     (println))
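Since the model is pure data, saving and restoring it is one call each way with nippy. A sketch, assuming nippy is available on the classpath:
(comment
  (require '[taoensso.nippy :as nippy])
  (nippy/freeze-to-file "results/model.nippy" model)
  (def model (nippy/thaw-from-file "results/model.nippy")))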
Let's check the accuracy. We are fans of Mean Absolute Error (MAE) because its units are those of the original space. In this example the original space is seconds, and it is nice to see the error expressed in those units.
;;We like MAE because it does not change units.
(def test-ds (:test-ds train-test-split))
(def predictions (double-array (ml/predict model test-ds)))
(require '[tech.ml.loss :as loss])
(loss/mae (test-ds :RESPONSETIME) predictions)
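For clarity, MAE is just the mean of the absolute residuals, so computing it by hand should agree with the loss/mae call above:
(dfn/mean (dfn/abs (dfn/- (test-ds :RESPONSETIME) predictions)))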
(-> (vega/histogram (dfn/- (test-ds :RESPONSETIME) predictions)
"Residuals"
{:bin-count 20})
(vega/vega->svg-file "results/residuals.svg"))
(-> (dfn/descriptive-stats (dfn/abs (dfn/- (test-ds :RESPONSETIME) predictions))
                           [:min :mean :max :standard-deviation :skew])
    (println))
The mean is being pushed up quite a bit by a few outliers. This article is getting a bit long, but a potential next step would be to analyze those outliers and see what is going on there.
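As a starting point for that analysis, one sketch: pull out the test rows whose absolute residual exceeds some threshold. The 600-second cutoff here is an arbitrary choice, and we assume ds/select-rows is available in this version:
(let [abs-residuals (dfn/abs (dfn/- (test-ds :RESPONSETIME) predictions))]
  (ds/select-rows test-ds
                  (dfn/argfilter #(> (double %) 600.0) abs-residuals)))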
Error Rates By Column
Earlier we mapped columns from string->number. Now we want to map back so we can see what is going on in the dataset a bit better. The mapping is stored as metadata on the column.
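You can inspect that metadata directly; a quick peek at one column's mapping (the exact metadata shape may vary across versions):
(take 5 (:label-map (meta (test-ds :CALLTYPE))))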
;;The string->number operation leaves a :label-map in the column's metadata
;;so you can get back to the original strings from the final number.
(require '[clojure.set :as set])
(defn reverse-map-labels
[ds]
(->> (map meta ds)
(filter :label-map)
(reduce (fn [ds {:keys [name label-map]}]
(let [src-rdr (dtype/->reader (ds name))
inv-map (set/map-invert label-map)]
(ds/assoc
ds name
(dtype/make-reader :string (ds/row-count ds)
(get inv-map (long (src-rdr idx))
"")))))
ds)))
(def predict-ds (as-> test-ds ds
(ds/assoc ds :predictions predictions)
(ds/assoc ds :residuals (dfn/- (ds :RESPONSETIME)
(ds :predictions)))
(reverse-map-labels ds)))
(println predict-ds)
To wrap up, we will get the MAE by CALLTYPE. It would be easy to extend this to several other columns to see whether their values affect MAE.
;;Now let's get the mean absolute error by calltype from the predictions dataset
(defn mae-by-column
[predict-ds colname]
(->> (ds/group-by-column colname predict-ds)
(map (fn [[calltype predict-ds]]
{colname calltype
:response-mae-sec (loss/mae (predict-ds :RESPONSETIME)
(predict-ds :predictions))}))
(sort-by :response-mae-sec)
(ds/->>dataset {:dataset-name (str "mae-by-" (name colname))})))
(println (mae-by-column predict-ds :CALLTYPE))
(println (mae-by-column predict-ds :SEVERITY))