Boulder Rescue Response Times

Through Colorado's OpenData initiative we have access to many great datasets, and Boulder, Colorado specifically publishes a rescue-response-time dataset. We dive into this interesting dataset to show how to quickly explore it and train a model to see how predictive a subset of the columns is of the rescue response time column.

{:deps
 {org.clojure/clojure {:mvn/version "1.10.0"}
  org.clojure/tools.deps.alpha
  {:git/url "https://github.com/clojure/tools.deps.alpha.git"
   :sha "f6c080bd0049211021ea59e516d1785b08302515"}
  techascent/tech.ml.dataset {:mvn/version "2.0-beta-56"}
  techascent/tech.viz {:mvn/version "0.3"}
  techascent/tech.ml {:mvn/version "2.0-beta-56"
                      :exclusions [org.ow2.asm/asm]}
  compliment {:mvn/version "0.3.9"}}}
deps.edn

You can add dependencies by modifying the deps.edn above (requires a runtime restart)...

(use 'clojure.tools.deps.alpha.repl)
(clojure-version)
;; => "1.10.0"
(require '[tech.ml.dataset :as ds])
(require '[tech.v2.datatype :as dtype])
(require '[tech.io :as io])

Loading The Data

You can get the original dataset from: http://data-boulder.opendata.arcgis.com/datasets/6ae16fc5d05b46189800f189a587a223_0.csv. Be warned, it is around 70MB and updates weekly with new information.

(def ds (ds/->dataset "boulder-rescue-unit-response-times-latest.csv.gz"
                      {:key-fn keyword}))
(dtype/shape ds)
;; => [48 136367]
(println ds)

Some Quick Explorations

One great first step is to generate descriptive statistics across all of the columns. Note that this command works even for date/time columns.

(println (vary-meta (ds/descriptive-stats ds)
                    assoc
                    :print-column-max-width 15
                    :print-index-range :all))

Let's zero in on CALLTYPE for a moment.

(->> (ds :CALLTYPE)
     (frequencies)
     (sort-by second >)
     (map #(hash-map :calltype (first %) :frequency (second %)))
     (ds/->>dataset)
     (#(vary-meta % assoc :print-index-range :all))
     (println))
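The heavy lifting in that pipeline is plain core Clojure; `frequencies` paired with `sort-by` is the idiom. A tiny standalone example (the call-type strings here are made up):

```clojure
;;Count occurrences, then sort the [value count] pairs by descending count.
(->> ["EMS" "Fire" "EMS" "EMS" "Rescue"]
     (frequencies)
     (sort-by second >))
;; => (["EMS" 3] ["Fire" 1] ["Rescue" 1])
```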
;; Mean response time by various columns
(require '[tech.v2.datatype.datetime.operations :as dtype-dt-ops])
(require '[tech.v2.datatype.functional :as dfn])
(defn mean-response-by
  [ds by-colname]
  (as-> (ds/select-columns ds [by-colname :RESPONSETIME]) ds
    (ds/drop-rows ds (ds/missing ds))
    (ds/group-by-column by-colname ds)
    ;;Dataset group-by produces a map of key to dataset
    (map (fn [[col-val ds]]
           {by-colname col-val
            :mean-response-sec (-> (ds :RESPONSETIME)
                                   (dtype-dt-ops/get-seconds)
                                   (dfn/mean))})
         ds)
    (sort-by :mean-response-sec ds)
    (ds/->>dataset {:dataset-name :mean-response-time-by} ds)
    (vary-meta ds assoc :print-index-range :all)))
(println (mean-response-by ds :CALLTYPE))
(println (mean-response-by ds :SEVERITY))

Fixing Parsing Issues

Often your data is messy. A few of the columns failed to parse because their exact datetime format isn't one our parse code recognizes. We can fix these types of issues without reloading the entire dataset.

;;A set of columns failed to parse because they have a custom datetime format
(take 5 (ds :TIMEPHONEPICKUP))
;; => ("1970/01/01 00:00:00+00" "1970/01/01 00:00:00+00" "1970/01/01 00:00:00+00"
;;     "1970/01/01 00:00:00+00" "1970/01/01 00:00:00+00")
;;We can parse one of those columns with a custom parser like this:
(require '[tech.ml.dataset.column :as ds-col])
;;parse-column takes a datatype or a datatype paired with a parsing function.
;;If the datatype is a datetime type the parsing function may be string in which
;;case it is used to construct a java.time.format.DateTimeFormatter object.
(take 5 (ds-col/parse-column [:zoned-date-time "yyyy/MM/dd HH:mm:ssx"] 
                             (ds :TIMEPHONEPICKUP)))
;; => List(5) (Vector(4), Vector(4), Vector(4), Vector(4), Vector(4))
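Under the hood, a string format paired with a datetime datatype becomes a `java.time.format.DateTimeFormatter`. A minimal sketch of that parsing using only `java.time` (the sample timestamp is made up):

```clojure
(import '[java.time ZonedDateTime]
        '[java.time.format DateTimeFormatter])

;;"x" matches a zone offset such as "+00"
(def fmt (DateTimeFormatter/ofPattern "yyyy/MM/dd HH:mm:ssx"))

;;parses to a ZonedDateTime at 2020-05-01T13:45:10Z
(ZonedDateTime/parse "2020/05/01 13:45:10+00" fmt)
```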
;;So now we just parse all of them.  We make a further assumption that columns
;;with the epoch time 1970-01-01 are actually missing values.
(require '[tech.ml.dataset.column :as ds-col])
(def date-columns
  [:TIMEPHONEPICKUP :TIMECALLCLOSED :TIMEASSIGNED :TIMEENROUTE :TIMEARRIVED
   :UNITCLEARED :TRANSPORTDEPART :TRANSPORTARRIVED])
(defn parse-date-columns
  [ds]
  (->> date-columns
       ;;Parsing is single-threaded, so a pmap helps here.  This isn't true for
       ;;most operations on a column, however; most operations are either lazy
       ;;or parallelized.
       (pmap (fn [cname]
               [cname (ds-col/parse-column
                       [:zoned-date-time "yyyy/MM/dd HH:mm:ssx"]
                       (ds-col/set-missing (ds cname)
                                           ;;argfilter works like filter but is
                                           ;;non-lazy, parallelized, and
                                           ;;produces its result as indexes
                                           (dfn/argfilter
                                            #(= "1970/01/01 00:00:00+00" %)
                                            (ds cname))))]))
       ;;flatten out pairs
       (apply concat)
       (apply ds/assoc ds)))
;;Let's see the date range of the dataset.
(-> (parse-date-columns ds)
  (ds/select-columns date-columns)
  (ds/descriptive-stats)
  (println))

Heading Toward Training/Inference

Let's narrow down our columns to a few I thought were interesting. Maybe I missed an important one, but these were the categorical columns, while many of the other columns are various measures of timings internal to the rescue response process.

(def interesting-columns
  [:TIMEPHONEPICKUP :RESPONSETIME
   :LATATASSIGNTIME :LONGATASSIGNTIME
   :X :Y :RESPONSEPLAN :INCIDENTTYPE
   :PROBLEM :CALLTYPE :DETERMINANT :SEVERITY])
(def simple-ds (as-> (parse-date-columns ds) ds
                (ds/select-columns ds interesting-columns)
                (ds/drop-rows ds (ds/missing ds))
                (vary-meta ds assoc :print-column-max-width 15)))

Let's print out what we have so far. Note that LONGATASSIGNTIME is an integer while X is a floating point number. Further note that LONGATASSIGNTIME is positive, even though longitudes in Colorado are negative...

(println simple-ds)

Often it is useful to force the system to use fewer characters per column even if this means truncating output.

(println (vary-meta (ds/descriptive-stats simple-ds) 
                     assoc :print-column-max-width 15
                           :print-index-range :all))

Calculating Distance From Incident

Switching gears a bit, let's get to training a model. One of the features we want is the distance from the unit's position at assign time to the XY coordinate, which we guess is the incident location.

Following from our note above, we have a little bit of work to do before we can calculate that distance.
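The distance used here is a plain Euclidean distance in degree space — fine for comparing responses against each other, though a haversine formula would be needed for actual meters. The core computation as a standalone sketch:

```clojure
(defn euclidean-distance
  "Straight-line distance between two points; here the units are degrees."
  [lon1 lat1 lon2 lat2]
  (let [x-diff (- lon1 lon2)
        y-diff (- lat1 lat2)]
    (Math/sqrt (+ (* x-diff x-diff) (* y-diff y-diff)))))

(euclidean-distance 0.0 0.0 3.0 4.0)
;; => 5.0
```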

(defn append-distance
  [ds]
  ;;First remove all rows missing any of the coordinates we need.
  (let [ds (ds/drop-rows ds (reduce dtype/set-or
                                    [(ds-col/missing (ds :LATATASSIGNTIME))
                                     (ds-col/missing (ds :LONGATASSIGNTIME))
                                     (ds-col/missing (ds :X))
                                     (ds-col/missing (ds :Y))]))
        ;;Readers allow random typed access.  They also overload IFn, which
        ;;means you can use them as a function of their indexes for untyped
        ;;access.
        lat-assign (dtype/->reader (ds :LATATASSIGNTIME))
        lon-assign (dtype/->reader (ds :LONGATASSIGNTIME))
        assign-denom 1000000.0
        ;;It is important to note that these elementwise math functions are
        ;;lazy.  We scale the integer coordinates down and negate the longitude.
        lon-assign (dfn/- (dfn// lon-assign assign-denom))
        lat-assign (dfn// lat-assign assign-denom)
        target-lon (dtype/->reader (ds :X))
        target-lat (dtype/->reader (ds :Y))]
    (ds/assoc ds :distance
              (dtype/make-reader :float64
                                 (ds/row-count ds)
                                 (let [x-diff (- (double (lon-assign idx))
                                                 (double (target-lon idx)))
                                       y-diff (- (double (lat-assign idx))
                                                 (double (target-lat idx)))]
                                   (Math/sqrt
                                    (+ (* x-diff x-diff)
                                       (* y-diff y-diff))))))))
(println (take 5 ((append-distance simple-ds) :distance)))
;;Let's plot a histogram of distances.  Taking the log because the scales are so
;;different.  In general they are well distributed with a tail of bad distances
;;at the end.
(require '[tech.viz.vega :as vega])
(-> (vega/histogram (dfn/log ((append-distance simple-ds) :distance)) "Distances" 
                    {:bin-count 20})
    (vega/vega->svg-file "results/distances.svg"))

Building Train/Test Datasets

Now we prepare the dataset for training and split it into train/test datasets.

(def categoricals
  [:RESPONSEPLAN :INCIDENTTYPE :PROBLEM
   :CALLTYPE :DETERMINANT :SEVERITY])
;;Useful things for building out ml pipelines
(require '[tech.ml.dataset.pipeline :as ds-pipe])
(defn prepare-for-training
  [ds]
  (-> (append-distance ds)
      (ds/select-columns (concat [:distance :RESPONSETIME]
                                 categoricals))
      ;;Convert string columns to numbers, storing the conversion table in the
      ;;column metadata
      (ds-pipe/string->number)
      ;;Get the number of seconds as opposed to a duration.  xgboost doesn't
      ;;know anything about durations...
      (ds/update-column :RESPONSETIME #(dtype-dt-ops/get-seconds %))
      ;;We will try to predict RESPONSETIME
      (ds/set-inference-target :RESPONSETIME)
      ;;Ignore missing for now.  Really we would need a plan for missing values,
      ;;as incoming data would undoubtedly have them.
      (as-> ds (ds/drop-rows ds (ds/missing ds)))
      ;;Remember how we said the distance calculation was lazy?  Many of the
      ;;previous operations were also lazy.  dtype/clone forces them to complete
      ;;and stores the result in dense primitive arrays.
      (dtype/clone)))
;;train dataset is in :train-ds, test dataset is in :test-ds
(def train-test-split (-> (prepare-for-training simple-ds)
                          (ds/->train-test-split)))
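Conceptually, `ds/->train-test-split` shuffles the row indexes and partitions them between the two datasets. A pure-Clojure sketch of the idea (`train-test-split-indexes` is a hypothetical helper, not the library's implementation):

```clojure
(defn train-test-split-indexes
  "Shuffle row indexes and partition them into train/test index seqs."
  [n-rows train-fraction]
  (let [idxs (shuffle (range n-rows))
        n-train (Math/round (* train-fraction n-rows))]
    {:train-indexes (take n-train idxs)
     :test-indexes  (drop n-train idxs)}))

(let [{:keys [train-indexes test-indexes]} (train-test-split-indexes 10 0.7)]
  [(count train-indexes) (count test-indexes)])
;; => [7 3]
```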

The train/test datasets now have numbers where they previously had strings.

(println (ds/head (train-test-split :train-ds)))

Training And Verifying

We train a model and get some explanatory information from it.

(require '[tech.ml :as ml])
(require '[tech.libs.xgboost])
;;Because we set the inference target above, the ml system knows the target column.
(def model (ml/train {:model-type :xgboost/regression} 
                     (:train-ds train-test-split)))
;;Let's see what columns the model thought were important.  Also you can safely
;;save out the model with nippy/freeze and friends.  No need to pickle or any
;;further steps, the model is already pure data, not an object.
(->> (get (ml/explain-model model) "gain")
     (map #(hash-map :colname (first %) :gain (second %)))
     (ds/->>dataset)
     (println))

Let's check the accuracy. We are fans of Mean Absolute Error (MAE) because its units are those of the original space. In this example, the original space is seconds, and it is nice to see the accuracy expressed in those units.
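For reference, MAE is simply the mean of the absolute errors; the article uses `tech.ml.loss/mae`, but a pure-Clojure sketch makes the definition concrete:

```clojure
(defn mae
  "Mean absolute error between two equal-length sequences."
  [truth predictions]
  (/ (reduce + (map (fn [y y-hat]
                      (Math/abs (- (double y) (double y-hat))))
                    truth predictions))
     (count truth)))

(mae [10 20 30] [12 18 33])
;; => 2.3333333333333335
```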

;;We like MAE because it does not change units.
(def test-ds (:test-ds train-test-split))
(def predictions (double-array (ml/predict model test-ds)))
(require '[tech.ml.loss :as loss])
(loss/mae (test-ds :RESPONSETIME) predictions)
;; => 108.14828096542252
(-> (vega/histogram (dfn/- (test-ds :RESPONSETIME) predictions)
                    "Residuals"
                    {:bin-count 20})
    (vega/vega->svg-file "results/residuals.svg"))
(-> (dfn/descriptive-stats (dfn/abs (dfn/- (test-ds :RESPONSETIME) predictions))
                           [:min :mean :max :standard-deviation :skew])
    (println))

The mean is being pushed up quite a bit by a few outliers. This article is getting a bit long, but a potential next step would be to analyze those outliers and see what is going on there.

Error Rates By Column

Earlier we mapped columns from string->number. Now we want to map back so we can see what is going on in the dataset a bit better. The mapping is stored as metadata on the column.
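Since the label map is an ordinary Clojure map from string to number, inverting it is a single call to `clojure.set/map-invert` (the label map shown is a made-up example):

```clojure
(require '[clojure.set :as set])

;;hypothetical label map of the kind string->number stores in metadata
(def label-map {"Fire" 0 "EMS" 1 "Rescue" 2})

(set/map-invert label-map)
;; => {0 "Fire", 1 "EMS", 2 "Rescue"}
```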

;;The string->number operation leaves a :label-map in the column's metadata
;;so you can get back to the original strings from the final number.
(require '[clojure.set :as set])
(defn reverse-map-labels
  [ds]
  (->> (map meta ds)
       (filter :label-map)
       (reduce (fn [ds {:keys [name label-map]}]
                 (let [src-rdr (dtype/->reader (ds name))
                       inv-map (set/map-invert label-map)]
                   (ds/assoc ds name
                             (dtype/make-reader :string (ds/row-count ds)
                                                (get inv-map (long (src-rdr idx))
                                                     "")))))
               ds)))
(def predict-ds (as-> test-ds ds
                  (ds/assoc ds :predictions predictions)
                  (ds/assoc ds :residuals (dfn/- (ds :RESPONSETIME)
                                                 (ds :predictions)))
                  (reverse-map-labels ds)))
(println predict-ds)

To wrap up, we compute the MAE by CALLTYPE. It would be easy to extend this to several other columns to see whether their values affect MAE.

;;Now let's get mean-average-error by calltype from the predictions dataset
(defn mae-by-column
  [predict-ds colname]
  (->> (ds/group-by-column colname predict-ds)
       (map (fn [[col-val predict-ds]]
              {colname col-val
               :response-mae-sec (loss/mae (predict-ds :RESPONSETIME)
                                           (predict-ds :predictions))}))
       (sort-by :response-mae-sec)
       (ds/->>dataset {:dataset-name :mae-by-column})))
(println (mae-by-column predict-ds :CALLTYPE))
(println (mae-by-column predict-ds :SEVERITY))