Alan / Mar 01 2019

Getting Started

This is a tutorial for a couple of new Clojure libraries for Machine Learning and ETL, part of the tech.ml stack.

Author: Chris Nuernberger

Translated to Nextjournal: Alan Marazzi

The API is still alpha; we are putting our efforts into extending and beautifying it. Comments are welcome!

This walkthrough follows an excellent article on advanced regression techniques. The target is to predict the SalePrice column.

1. Prepare the environment

We just have to mount a deps.edn file with all our dependencies.

{:deps
 {org.clojure/clojure {:mvn/version "1.9.0"}
  techascent/tech.ml {:mvn/version "0.19"}
  metasoarous/oz {:mvn/version "1.5.2"}
  org.clojure/tools.deps.alpha
  {:git/url "https://github.com/clojure/tools.deps.alpha.git"
   :sha "f6c080bd0049211021ea59e516d1785b08302515"}}}
deps.edn
(require '[oz.notebook.clojupyter :as oz])
(require '[tech.ml.dataset.etl :as etl])
(require '[tech.ml.dataset.etl.pipeline-operators :as pipe-ops])
(require '[tech.ml.dataset.etl.math-ops :as pipe-math])
(require '[tech.ml.dataset.etl.column-filters :as col-filters])
(require '[tech.ml.dataset :as dataset])
(require '[tech.ml.dataset.column :as ds-col])
(require '[tech.ml :as ml])
(require '[tech.ml.loss :as loss])
(require '[tech.ml.utils :as ml-utils])
(require '[clojure.core.matrix :as m])

;;use tablesaw as dataset backing store
(require '[tech.libs.tablesaw :as tablesaw])

;;model generators
(require '[tech.libs.smile.regression])

;;put/get nippy
(require '[tech.io :as io])
(require '[clojure.pprint :as pp])
(require '[clojure.set :as c-set])

(import '[java.io File])

(defn pp-str
  [ds]
  (with-out-str
    (pp/pprint ds)))

(defn print-table
  ([ks data]
     (->> data
          (map (fn [item-map]
                 (->> item-map
                      (map (fn [[k v]]
                             [k (if (or (float? v)
                                        (double? v))
                                  (format "%.3f" v)
                                  v)]))
                      (into {}))))
          (pp/print-table ks)))
  ([data]
   (print-table (sort (keys (first data))) data)))

(defn print-dataset
  ([dataset column-names index-range]
   (print-table column-names (-> (dataset/select dataset column-names index-range)
                                 (dataset/->flyweight))))
  ([dataset column-names]
   (print-dataset dataset column-names :all)))

(defn gridsearch-model
  [dataset-name dataset loss-fn opts]
  (let [gs-options (ml/auto-gridsearch-options
                    (assoc opts
                           :gridsearch-depth 75
                           :top-n 20))]
    (println (format "Dataset: %s, Model %s"
                     dataset-name
                     (:model-type opts)))
    ;; time-section returns both the result and the elapsed milliseconds
    (let [{results :retval
           milliseconds :milliseconds}
          (ml-utils/time-section
           (ml/gridsearch
            (assoc gs-options :k-fold 10)
            loss-fn
            dataset))]
      (->> results
           (mapv #(merge %
                         {:gridsearch-time-ms milliseconds
                          :dataset-name dataset-name}))))))

(defn gridsearch-dataset
  [dataset-name force-gridsearch? dataset options]
  (let [ds-filename (format "file://ames-%s-results.nippy" dataset-name)]
    (if (or force-gridsearch?
            (not (.exists ^File (io/file ds-filename))))
      (let [base-systems [{:model-type :smile.regression/lasso}
                          {:model-type :smile.regression/ridge}
                          {:model-type :smile.regression/elastic-net}]
            results (->> base-systems
                         (map #(merge options %))
                         (mapcat
                          (partial gridsearch-model
                                   dataset-name
                                   dataset
                                   loss/rmse))
                         vec)]
        (io/put-nippy! ds-filename results)
        results)
      (io/get-nippy ds-filename))))

(defn results->accuracy-dataset
  [gridsearch-results]
  (->> gridsearch-results
       (map (fn [{:keys [average-loss options predict-time train-time]}]
              {:average-loss average-loss
               :model-name (str (:model-type options))
               :predict-time predict-time
               :train-time train-time}))))

Well, that wasn't particularly pleasant, but at least it is something you can cut & paste...

We mount the datasets as well.

train.csv
test.csv
ames-final-results.nippy
ames-one-hot-results.nippy
sample_submission.csv
ames-skew-fix-results.nippy
(def src-dataset (tablesaw/path->tablesaw-dataset "train.csv"))

(println (m/shape src-dataset))

The shape is backward compared to pandas. This is intentional: core.matrix is a row-major linear algebra system, while tech.ml.dataset is column-major. To keep conversions between the two sane, the dataset reports its shape as [columns rows], whereas pandas reports [rows columns] and returns [1460 81] for the same file.
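To make the orientation concrete, here is a minimal sketch in plain Clojure with no tech.ml calls. The data and the names `toy-columns`, `column-major-shape` and `row-major-shape` are hypothetical; the only point is that a container holding one vector per column naturally reports [columns rows].

```clojure
;; A toy column-major "dataset": one vector per column (hypothetical data).
(def toy-columns
  {"Id"        [1 2 3 4]
   "GrLivArea" [1710 1262 1786 1717]
   "SalePrice" [208500 181500 223500 140000]})

(defn column-major-shape
  "Shape as [n-columns n-rows]."
  [columns]
  [(count columns) (count (first (vals columns)))])

(defn row-major-shape
  "Shape as [n-rows n-columns], the orientation pandas reports."
  [columns]
  (vec (reverse (column-major-shape columns))))

(println (column-major-shape toy-columns)) ;; [3 4]
(println (row-major-shape toy-columns))    ;; [4 3]
```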

2. Outliers

We first graph the data to check for outliers, then remove them.

(defn scatter
  "Build a Plotly scatter spec with the first column of cols on x
  and the second on y."
  [dataset cols]
  (let [[x-col y-col] cols
        data (-> dataset
                 (dataset/select cols :all)
                 (dataset/->flyweight))]
    {:data [{:x (map #(get % x-col) data)
             :y (map #(get % y-col) data)
             :mode "markers"
             :type "scatter"}]
     :layout {:autosize false :width 600 :height 600}}))
(scatter src-dataset ["SalePrice" "GrLivArea"])
(.plot js/Nextjournal scatter)
(def filtered-ds (pipe-ops/filter src-dataset "GrLivArea" '(< (col) 4000)))
(scatter filtered-ds ["SalePrice" "GrLivArea"])
(.plot js/Nextjournal scatter-filtered)

3. Initial Pipeline

We now begin to construct our data processing pipeline. Note that all pipeline operations are also available as REPL functions from the pipe-ops namespace. Since we already have the pipeline outline we want from the article, I will represent the pipeline mainly as pure data.

(def initial-pipeline-from-article
  '[[remove "Id"]
    [m= "SalePrice" (log1p (col))]])
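The `log1p` transform on the target is easy to check by hand. The sketch below uses only `java.lang.Math`, not the pipeline's own math operators, to show the transform and its inverse `expm1`, which is what you would apply to a model's predictions to get back to a price. The helper names and the sample price are hypothetical.

```clojure
(defn log1p-price
  "log(1 + x): compresses large prices and is well defined at 0."
  [price]
  (Math/log1p (double price)))

(defn expm1-price
  "Inverse of log1p-price: exp(y) - 1."
  [log-price]
  (Math/expm1 (double log-price)))

(let [price 208500.0
      y     (log1p-price price)]
  (println "transformed:" y)
  (println "round trip: " (expm1-price y)))
```

Because the model is trained on the transformed column, its loss is also measured on the log scale.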

4. Categorical Fixes

Whether a column is categorical is defined by an attribute on the column.

(def more-categorical
  '[[set-attribute 
     ["MSSubClass" "OverallQual" "OverallCond"] 
     :categorical? true]])

(println "pre-categorical-count" 
         (count (col-filters/categorical? filtered-ds)))

(def post-categorical-fix 
  (-> (etl/apply-pipeline filtered-ds
                         (concat initial-pipeline-from-article
                                 more-categorical)
                          {})
      :dataset))

(println "post-categorical-count" 
         (count (col-filters/categorical? post-categorical-fix)))

5. Missing Entries

Missing data is a theme that will come up again and again. Pandas has great tooling to clean up missing entries and we borrow heavily from it.

;; Impressive patience to come up with this list!!
(def initial-missing-entries
  '[
    ;; Handle missing values for features where median/mean or most common value
    ;; doesn't make sense

    ;; Alley : data description says NA means "no alley access"
    [replace-missing "Alley" "None"]
    ;; BedroomAbvGr : NA most likely means 0
    [replace-missing ["BedroomAbvGr"
                      "BsmtFullBath"
                      "BsmtHalfBath"
                      "BsmtUnfSF"
                      "EnclosedPorch"
                      "Fireplaces"
                      "GarageArea"
                      "GarageCars"
                      "HalfBath"
                      ;; KitchenAbvGr : NA most likely means 0
                      "KitchenAbvGr"
                      "LotFrontage"
                      "MasVnrArea"
                      "MiscVal"
                      ;; OpenPorchSF : NA most likely means no open porch
                      "OpenPorchSF"
                      "PoolArea"
                      ;; ScreenPorch : NA most likely means no screen porch
                      "ScreenPorch"
                      ;; TotRmsAbvGrd : NA most likely means 0
                      "TotRmsAbvGrd"
                      ;; WoodDeckSF : NA most likely means no wood deck
                      "WoodDeckSF"
                      ]
     0]
    ;; BsmtQual etc : data description says NA for basement features is "no basement"
    [replace-missing ["BsmtQual"
                      "BsmtCond"
                      "BsmtExposure"
                      "BsmtFinType1"
                      "BsmtFinType2"
                      ;; Fence : data description says NA means "no fence"
                      "Fence"
                      ;; FireplaceQu : data description says NA means "no fireplace"
                      "FireplaceQu"
                      ;; GarageType etc : data description says NA for garage
                      ;; is "no garage"
                      "GarageType"
                      "GarageFinish"
                      "GarageQual"
                      "GarageCond"
                      ;; MiscFeature : NA means "no misc feature"
                      "MiscFeature"
                      ;; PoolQC : data description says NA means "no pool"
                      "PoolQC"
                      ]
     "No"]
    [replace-missing "CentralAir" "N"]
    [replace-missing ["Condition1"
                      "Condition2"]
     "Norm"]
    ;; Condition : NA most likely means Normal
    ;; EnclosedPorch : NA most likely means no enclosed porch
    ;; External stuff : NA most likely means average
    [replace-missing ["ExterCond"
                      "ExterQual"
                      ;; HeatingQC : NA most likely means typical
                      "HeatingQC"
                      ;; KitchenQual : NA most likely means typical
                      "KitchenQual"
                      ]
     "TA"]
    ;; Functional : data description says NA means typical
    [replace-missing "Functional" "Typ"]
    ;; LotShape : NA most likely means regular
    [replace-missing "LotShape" "Reg"]
    ;; MasVnrType : NA most likely means no veneer
    [replace-missing "MasVnrType" "None"]
    ;; PavedDrive : NA most likely means not paved
    [replace-missing "PavedDrive" "N"]
    [replace-missing "SaleCondition" "Normal"]
    [replace-missing "Utilities" "AllPub"]])

(println "pre missing fix #1")
(pp/pprint (dataset/columns-with-missing-seq post-categorical-fix))

(def post-missing
  (-> (etl/apply-pipeline post-categorical-fix initial-missing-entries {})
      :dataset))

(println "post missing fix #1")

(pp/pprint (dataset/columns-with-missing-seq post-missing))

6. String->Number

We need to convert string data into numbers somehow. One method is to build a lookup table so that each string column is converted into one numeric column. The exact encoding of these strings can be very important for communicating semantic information from the dataset to the ML system. We remember all these mappings because we have to use them later: they are stored both in the recorded pipeline and in the options map, so we can reverse-map label values back into their original categorical values.

(def str->number-initial-map
  {
   "Alley"  {"Grvl"  1 "Pave" 2 "None" 0}
   "BsmtCond"  {"No"  0 "Po"  1 "Fa"  2 "TA"  3 "Gd"  4 "Ex"  5}
   "BsmtExposure"  {"No"  0 "Mn"  1 "Av" 2 "Gd"  3}
   "BsmtFinType1"  {"No"  0 "Unf"  1 "LwQ" 2 "Rec"  3 "BLQ"  4
                     "ALQ"  5 "GLQ"  6}
   "BsmtFinType2"  {"No"  0 "Unf"  1 "LwQ" 2 "Rec"  3 "BLQ"  4
                     "ALQ"  5 "GLQ"  6}
   "BsmtQual"  {"No"  0 "Po"  1 "Fa"  2 "TA" 3 "Gd"  4 "Ex"  5}
   "ExterCond"  {"Po"  1 "Fa"  2 "TA" 3 "Gd" 4 "Ex"  5}
   "ExterQual"  {"Po"  1 "Fa"  2 "TA" 3 "Gd" 4 "Ex"  5}
   "FireplaceQu"  {"No"  0 "Po"  1 "Fa"  2 "TA"  3 "Gd"  4 "Ex"  5}
   "Functional"  {"Sal"  1 "Sev"  2 "Maj2"  3 "Maj1"  4 "Mod" 5
                   "Min2"  6 "Min1"  7 "Typ"  8}
   "GarageCond"  {"No"  0 "Po"  1 "Fa"  2 "TA"  3 "Gd"  4 "Ex"  5}
   "GarageQual"  {"No"  0 "Po"  1 "Fa"  2 "TA"  3 "Gd"  4 "Ex"  5}
   "HeatingQC"  {"Po"  1 "Fa"  2 "TA"  3 "Gd"  4 "Ex"  5}
   "KitchenQual"  {"Po"  1 "Fa"  2 "TA"  3 "Gd"  4 "Ex"  5}
   "LandSlope"  {"Sev"  1 "Mod"  2 "Gtl"  3}
   "LotShape"  {"IR3"  1 "IR2"  2 "IR1"  3 "Reg"  4}
   "PavedDrive"  {"N"  0 "P"  1 "Y"  2}
   "PoolQC"  {"No"  0 "Fa"  1 "TA"  2 "Gd"  3 "Ex"  4}
   "Street"  {"Grvl"  1 "Pave"  2}
   "Utilities"  {"ELO"  1 "NoSeWa"  2 "NoSewr"  3 "AllPub"  4}
   })


(def str->number-pipeline
  (->> str->number-initial-map
       (map (fn [[k v-map]]
              ['string->number k v-map]))))

(def str-num-result (etl/apply-pipeline post-missing str->number-pipeline {}))
(def str-num-dataset (:dataset str-num-result))
(def str-num-ops (:options str-num-result))

(pp/pprint (:label-map str-num-ops))
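The lookup-table idea can be sketched in plain Clojure, independent of the `string->number` pipeline operator. The functions `encode-column` and `decode-column` are hypothetical helpers written for this illustration; throwing on an unknown string is a choice of the sketch, not a statement about how tech.ml behaves.

```clojure
(require '[clojure.set :as c-set])

;; Ordinal encoding: the numbers preserve the Po < Fa < TA < Gd < Ex ordering.
(def kitchen-qual-map {"Po" 1 "Fa" 2 "TA" 3 "Gd" 4 "Ex" 5})

(defn encode-column
  "Map each string to its number via the lookup table; unknown strings throw."
  [lookup column]
  (mapv (fn [s]
          (if (contains? lookup s)
            (get lookup s)
            (throw (ex-info "No encoding for value" {:value s}))))
        column))

(defn decode-column
  "Reverse the encoding by inverting the lookup table."
  [lookup column]
  (mapv (c-set/map-invert lookup) column))

(let [raw     ["TA" "Gd" "Ex" "TA"]
      encoded (encode-column kitchen-qual-map raw)]
  (println encoded)                                   ;; [3 4 5 3]
  (println (decode-column kitchen-qual-map encoded))) ;; [TA Gd Ex TA]
```

Keeping the table around is what makes the reverse mapping possible, which is why the mappings are remembered rather than discarded after the conversion.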

7. Replacing values

There is a numeric operator that maps each value in a column to another value. We now use it to provide simplified versions of some of the columns.

(def replace-maps
  {
   ;; Create new features
   ;; 1* Simplifications of existing features
   "SimplOverallQual" {"OverallQual" {1  1, 2  1, 3  1, ;; bad
                                      4  2, 5  2, 6  2, ;; average
                                      7  3, 8  3, 9  3, 10  3 ;; good
                                      }}
   "SimplOverallCond" {"OverallCond" {1  1, 2  1, 3  1,       ;; bad
                                      4  2, 5  2, 6  2,       ;; average
                                      7  3, 8  3, 9  3, 10  3 ;; good
                                      }}
   "SimplPoolQC" {"PoolQC" {1  1, 2  1,    ;; average
                            3  2, 4  2     ;; good
                            }}
   "SimplGarageCond" {"GarageCond" {1  1,             ;; bad
                                    2  1, 3  1,       ;; average
                                    4  2, 5  2        ;; good
                                    }}
   "SimplGarageQual" {"GarageQual" {1  1,             ;; bad
                                    2  1, 3  1,       ;; average
                                    4  2, 5  2        ;; good
                                    }}
   "SimplFireplaceQu"  {"FireplaceQu" {1  1,           ;; bad
                                       2  1, 3  1,     ;; average
                                       4  2, 5  2      ;; good
                                        }}
   "SimplFunctional"  {"Functional" {1  1, 2  1,           ;; bad
                                     3  2, 4  2,           ;; major
                                     5  3, 6  3, 7  3,     ;; minor
                                     8  4                  ;; typical
                                      }}
   "SimplKitchenQual" {"KitchenQual" {1  1,             ;; bad
                                      2  1, 3  1,       ;; average
                                      4  2, 5  2        ;; good
                                      }}
   "SimplHeatingQC"  {"HeatingQC" {1  1,           ;; bad
                                   2  1, 3  1,     ;; average
                                   4  2, 5  2      ;; good
                                    }}
   "SimplBsmtFinType1"  {"BsmtFinType1" {1  1,         ;; unfinished
                                         2  1, 3  1,   ;; rec room
                                         4  2, 5  2, 6  2 ;; living quarters
                                          }}
   "SimplBsmtFinType2" {"BsmtFinType2" {1 1,           ;; unfinished
                                        2 1, 3 1,      ;; rec room
                                        4 2, 5 2, 6 2  ;; living quarters
                                        }}
   "SimplBsmtCond" {"BsmtCond" {1 1,    ;; bad
                                2 1, 3 1, ;; average
                                4 2, 5 2  ;; good
                                }}
   "SimplBsmtQual" {"BsmtQual" {1 1,      ;; bad
                                2 1, 3 1, ;; average
                                4 2, 5 2  ;; good
                                }}
   "SimplExterCond" {"ExterCond" {1 1,      ;; bad
                                  2 1, 3 1, ;; average
                                  4 2, 5 2  ;; good
                                  }}
   "SimplExterQual" {"ExterQual" {1 1,      ;; bad
                                  2 1, 3 1, ;; average
                                  4 2, 5 2  ;; good
                                  }}
   })


(def simplifications
  (->> replace-maps
       (mapv (fn [[k v-map]]
               (let [[src-name replace-data] (first v-map)]
                 ['m= k ['replace ['col src-name] replace-data]])))))

(pp/pprint (take 3 simplifications))

(def replace-dataset (-> (etl/apply-pipeline str-num-dataset simplifications {})
                         :dataset))

(pp/pprint (-> (dataset/column str-num-dataset "KitchenQual")
                (ds-col/unique)))

(pp/pprint (-> (dataset/column replace-dataset "SimplKitchenQual")
                (ds-col/unique)))
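What a single replace map does to a column can be sketched in plain Clojure. The helper `replace-values` is hypothetical, and letting values without an entry pass through unchanged is an assumption of this sketch, not a description of the pipeline's `replace` operator.

```clojure
;; Same mapping as "SimplKitchenQual" above, with 3 (TA) spelled out as well.
(def simpl-kitchen-qual {1 1, 2 1, 3 1, 4 2, 5 2})

(defn replace-values
  "Map each value through replace-map; values with no entry pass through
  unchanged (an assumption of this sketch)."
  [replace-map column]
  (mapv #(get replace-map % %) column))

(println (replace-values simpl-kitchen-qual [3 4 5 2 3])) ;; [1 2 2 1 1]
```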

8. Linear Combinations

We create a set of simple linear combinations that derive from our semantic understanding of the dataset.

(def linear-combinations
  ;; 2* Combinations of existing features
  ;; Overall quality of the house
  '[
    [m= "OverallGrade" (* (col "OverallQual") (col "OverallCond"))]
    ;; Overall quality of the garage
    [m= "GarageGrade" (* (col "GarageQual") (col "GarageCond"))]
    ;; Overall quality of the exterior
    [m= "ExterGrade" (* (col "ExterQual") (col "ExterCond"))]
    ;; Overall kitchen score
    [m= "KitchenScore" (* (col "KitchenAbvGr") (col "KitchenQual"))]
    ;; Overall fireplace score
    [m= "FireplaceScore" (* (col "Fireplaces") (col "FireplaceQu"))]
    ;; Overall garage score
    [m= "GarageScore" (* (col "GarageArea") (col "GarageQual"))]
    ;; Overall pool score
    [m= "PoolScore" (* (col "PoolArea") (col "PoolQC"))]
    ;; Simplified overall quality of the house
    [m= "SimplOverallGrade" (* (col "SimplOverallQual") (col "SimplOverallCond"))]
    ;; Simplified overall quality of the exterior
    [m= "SimplExterGrade" (* (col "SimplExterQual") (col "SimplExterCond"))]
    ;; Simplified overall pool score
    [m= "SimplPoolScore" (* (col "PoolArea") (col "SimplPoolQC"))]
    ;; Simplified overall garage score
    [m= "SimplGarageScore" (* (col "GarageArea") (col "SimplGarageQual"))]
    ;; Simplified overall fireplace score
    [m= "SimplFireplaceScore" (* (col "Fireplaces") (col "SimplFireplaceQu"))]
    ;; Simplified overall kitchen score
    [m= "SimplKitchenScore" (* (col "KitchenAbvGr") (col "SimplKitchenQual"))]
    ;; Total number of bathrooms
    [m= "TotalBath" (+ (col "BsmtFullBath") (* 0.5 (col "BsmtHalfBath"))
                       (col "FullBath") (* 0.5 (col "HalfBath")))]
    ;; Total SF for house (incl. basement)
    [m= "AllSF"  (+ (col "GrLivArea") (col "TotalBsmtSF"))]
    ;; Total SF for 1st + 2nd floors
    [m= "AllFlrsSF" (+ (col "1stFlrSF") (col "2ndFlrSF"))]
    ;; Total SF for porch
    [m= "AllPorchSF" (+ (col "OpenPorchSF") (col "EnclosedPorch")
                        (col "3SsnPorch") (col "ScreenPorch"))]
    ;; Encode MasVnrType
    [string->number "MasVnrType" ["None" "BrkCmn" "BrkFace" "CBlock" "Stone"]]
    [m= "HasMasVnr" (not-eq (col "MasVnrType") 0)]
    ]
  )

(def linear-combined-ds (-> (etl/apply-pipeline replace-dataset 
                                                linear-combinations
                                                {})
                            :dataset))



;; print-dataset prints the table itself, so no extra println is needed
;; (wrapping print-table in println would also print its nil return value).
(print-dataset linear-combined-ds
               ["TotalBath" "BsmtFullBath" "BsmtHalfBath" "FullBath" "HalfBath"]
               (range 10))

(print-dataset linear-combined-ds
               ["AllSF" "GrLivArea" "TotalBsmtSF"]
               (range 10))
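As a sanity check on the `TotalBath` expression, the same arithmetic can be written row by row in plain Clojure. The rows here are hypothetical and `total-bath` is a helper invented for this sketch; the pipeline itself works on whole columns.

```clojure
(defn total-bath
  "Full baths count 1, half baths count 0.5, above and below ground."
  [{:strs [BsmtFullBath BsmtHalfBath FullBath HalfBath]}]
  (+ BsmtFullBath (* 0.5 BsmtHalfBath) FullBath (* 0.5 HalfBath)))

(def rows
  [{"BsmtFullBath" 1 "BsmtHalfBath" 0 "FullBath" 2 "HalfBath" 1}
   {"BsmtFullBath" 0 "BsmtHalfBath" 1 "FullBath" 2 "HalfBath" 0}])

(println (mapv total-bath rows)) ;; [3.5 2.5]
```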

9. Correlation Table

Let's check the correlations between the various columns and the target column (SalePrice).

(def article-correlations
  ;;Default for pandas is pearson.
  ;;  Find most important features relative to target
  (->> 
    {"SalePrice"            1.000
     "OverallQual"          0.819
     "AllSF"                0.817
     "AllFlrsSF"            0.729
     "GrLivArea"            0.719
     "SimplOverallQual"     0.708
     "ExterQual"            0.681
     "GarageCars"           0.680
     "TotalBath"            0.673
     "KitchenQual"          0.667
     "GarageScore"          0.657
     "GarageArea"           0.655
     "TotalBsmtSF"          0.642
     "SimplExterQual"       0.636
     "SimplGarageScore"     0.631
     "BsmtQual"             0.615
     "1stFlrSF"             0.614
     "SimplKitchenQual"     0.610
     "OverallGrade"         0.604
     "SimplBsmtQual"        0.594
     "FullBath"             0.591
     "YearBuilt"            0.589
     "ExterGrade"           0.587
     "YearRemodAdd"         0.569
     "FireplaceQu"          0.547
     "GarageYrBlt"          0.544
     "TotRmsAbvGrd"         0.533
     "SimplOverallGrade"    0.527
     "SimplKitchenScore"    0.523
     "FireplaceScore"       0.518
     "SimplBsmtCond"        0.204
     "BedroomAbvGr"         0.204
     "AllPorchSF"           0.199
     "LotFrontage"          0.174
     "SimplFunctional"      0.137
     "Functional"           0.136
     "ScreenPorch"          0.124
     "SimplBsmtFinType2"    0.105
     "Street"               0.058
     "3SsnPorch"            0.056
     "ExterCond"            0.051
     "PoolArea"             0.041
     "SimplPoolScore"       0.040
     "SimplPoolQC"          0.040
     "PoolScore"            0.040
     "PoolQC"               0.038
     "BsmtFinType2"         0.016
     "Utilities"            0.013
     "BsmtFinSF2"           0.006
     "BsmtHalfBath"        -0.015
     "MiscVal"             -0.020
     "SimplOverallCond"    -0.028
     "YrSold"              -0.034
     "OverallCond"         -0.037
     "LowQualFinSF"        -0.038
     "LandSlope"           -0.040
     "SimplExterCond"      -0.042
     "KitchenAbvGr"        -0.148
     "EnclosedPorch"       -0.149
     "LotShape"            -0.286}
    (sort-by second >)))

(def tech-ml-correlations (get (dataset/correlation-table 
                                 linear-combined-ds 
                                 :pearson) 
                               "SalePrice"))

(pp/print-table (map #(zipmap [:pandas :tech.ml.dataset] [%1 %2])
                     (take 20 article-correlations)
                     (take 20 tech-ml-correlations)))
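For reference, the Pearson coefficient that both pandas and the `:pearson` option compute can be sketched in a few lines of plain Clojure. The function names and the two small columns are hypothetical; this is the textbook formula, not the library's implementation.

```clojure
(defn mean [xs] (/ (reduce + xs) (double (count xs))))

(defn pearson
  "Pearson correlation of two equal-length numeric sequences.
  Returns NaN if either sequence has zero variance."
  [xs ys]
  (let [mx  (mean xs)
        my  (mean ys)
        dx  (map #(- % mx) xs)
        dy  (map #(- % my) ys)
        cov (reduce + (map * dx dy))
        sx  (Math/sqrt (reduce + (map * dx dx)))
        sy  (Math/sqrt (reduce + (map * dy dy)))]
    (/ cov (* sx sy))))

;; Hypothetical columns: quality scores and prices that rise together.
(def overall-qual [5 6 7 7 8 9])
(def sale-price   [120 150 180 175 240 310])

(println (pearson overall-qual sale-price))
(println (pearson overall-qual (map - overall-qual)))
```

A column correlated with itself scores 1, and a column correlated with its negation scores -1, which is why `SalePrice` sits at the top of the table with 1.000.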

10. Polynomial Combinations

We now extend our linear models to be effectively polynomial models for a subset of the columns. We use the correlation table to choose which columns get polynomial terms (the article's author used the top 10).

(defn polynomial-combinations
  [correlation-seq]
  (let [correlation-colnames (->> correlation-seq
                                  (drop 1)
                                  (take 10)
                                  (map first))]
    (->> correlation-colnames
         (mapcat (fn [colname]
                   [['m= (str colname "-s2") ['** ['col colname] 2]]
                    ['m= (str colname "-s3") ['** ['col colname] 3]]
                    ['m= (str colname "-sqrt") ['sqrt ['col colname]]]])))))

(def polynomial-pipe (polynomial-combinations tech-ml-correlations))

(def poly-data (-> (etl/apply-pipeline linear-combined-ds polynomial-pipe {})
                      :dataset))

(pp/pprint (take 4 polynomial-pipe))


(print-dataset poly-data 
               ["OverallQual"
                "OverallQual-s2"
                "OverallQual-s3"
                "OverallQual-sqrt"]
               (range 10))
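The three derived columns are simple element-wise transforms. A minimal plain-Clojure sketch, using a hypothetical helper `polynomial-features` and the same column-name suffixes as the pipeline above:

```clojure
(defn polynomial-features
  "Return a map of new column name -> transformed column for the
  square, cube and square root of column."
  [colname column]
  {(str colname "-s2")   (mapv #(Math/pow (double %) 2) column)
   (str colname "-s3")   (mapv #(Math/pow (double %) 3) column)
   (str colname "-sqrt") (mapv #(Math/sqrt (double %)) column)})

(println (polynomial-features "OverallQual" [4 9]))
```

A model that is linear in x, x^2, x^3 and sqrt(x) can fit a curved relationship with x while remaining linear in its coefficients, which is what lets lasso, ridge and elastic-net benefit from these columns.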

11. Numeric Vs. Categorical

The article considers anything non-numeric to be categorical. This is a point on which the tech.ml.dataset system differs: in tech.ml.dataset any column can be considered categorical, and the underlying datatype does not change this. Earlier, the article converted numeric columns to strings to indicate that they are categorical, whereas we just set metadata.

(def numerical-features 
  (col-filters/select-columns poly-data '[and
                                          [not "SalePrice"]
                                          numeric?]))

(def categorical-features 
  (col-filters/select-columns poly-data '[and
                                          [not "SalePrice"]
                                          [not numeric?]]))

(println (count numerical-features))

(println (count categorical-features))

;; Categorical features as printed out when using pandas. Print the metadata
;; of those that are not in our non-numeric set.
(pp/pprint
 (->> (c-set/difference
       (set ["MSSubClass", "MSZoning", "Alley", "LandContour", "LotConfig",
             "Neighborhood", "Condition1", "Condition2", "BldgType",
             "HouseStyle", "RoofStyle", "RoofMatl", "Exterior1st",
             "Exterior2nd", "MasVnrType", "Foundation", "Heating", "CentralAir",
             "Electrical", "GarageType", "GarageFinish", "Fence", "MiscFeature",
             "MoSold", "SaleType", "SaleCondition"])
       (set categorical-features))
      (map (comp ds-col/metadata (partial dataset/column poly-data)))))

(def fix-all-missing
  '[
    ;;Fix any remaining numeric columns by using the median.
    [replace-missing numeric? (median (col))]
    ;;Fix any string columns by using 'NA'.
    [replace-missing string? "NA"]])


(def missing-fixed (-> (etl/apply-pipeline poly-data fix-all-missing {})
                       :dataset))

(pp/pprint (dataset/columns-with-missing-seq missing-fixed))
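The median fill used for the remaining numeric columns can be sketched in plain Clojure. Here `nil` stands in for a missing value and both helpers are hypothetical; the sketch assumes the column has at least one non-missing value.

```clojure
(defn median
  "Median of a non-empty sequence of numbers."
  [xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        mid    (quot n 2)]
    (if (odd? n)
      (nth sorted mid)
      (/ (+ (nth sorted (dec mid)) (nth sorted mid)) 2.0))))

(defn fill-with-median
  "Replace every nil in column with the median of the non-nil values."
  [column]
  (let [m (median (remove nil? column))]
    (mapv #(if (nil? %) m %) column)))

(println (fill-with-median [65 nil 80 68 nil 60])) ;; [65 66.5 80 68 66.5 60]
```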

12. Skew

Here is where things go a bit awry. We attempt to fix skew, but the attempted fix barely reduces the actual skew in the dataset. We will talk about what went wrong. We also begin running models on the intermediate stages to see what effect some of these steps have.

(def fix-all-skew
  '[[m= [and
         [numeric?]
         [not "SalePrice"]
         [> (abs (skew (col))) 0.5]]
     (log1p (col))]])

(def skew-fix-result (etl/apply-pipeline missing-fixed 
                                         fix-all-skew 
                                         {:target "SalePrice"}))
(def skew-fixed (:dataset skew-fix-result))
(def skew-fixed-options (:options skew-fix-result))

;;Force gridsearch here if you want to re-run the real deal.  I saved the results 
;;to s3 and you downloaded them as part of get-data.sh
(def skew-fix-models
  (gridsearch-dataset "skew-fix" false
                      (pipe-ops/string->number
                       skew-fixed
                       (col-filters/string? skew-fixed))
                      skew-fixed-options))

(println "Pre-fix skew counts"
         (count (col-filters/select-columns
                 missing-fixed
                 '[and
                   [numeric?]
                   [not "SalePrice"]
                   [> (abs (skew (col))) 0.5]])))

(println "Post-fix skew counts"
         (count (col-filters/select-columns
                 skew-fixed
                 '[and
                   [numeric?]
                   [not "SalePrice"]
                   [> (abs (skew (col))) 0.5]])))

That didn't work. Or at least it barely did. What happened??

(let [skewed? '[and
                [numeric?]
                [not "SalePrice"]
                [> (abs (skew (col))) 0.5]]
      before-columns (set (col-filters/select-columns missing-fixed skewed?))
      after-columns (set (col-filters/select-columns skew-fixed skewed?))
      ;; columns that were skewed before the fix and are still skewed after it
      check-columns (c-set/intersection before-columns after-columns)]
  (->> check-columns
       (map (fn [colname]
              (let [{before-mean :mean
                     before-skew :skew}
                    (-> (dataset/column missing-fixed colname)
                        (ds-col/stats [:mean :skew]))
                    {after-mean :mean
                     after-skew :skew}
                    (-> (dataset/column skew-fixed colname)
                        (ds-col/stats [:mean :skew]))]
                {:column-name colname
                 :before-skew before-skew
                 :after-skew after-skew
                 :before-mean before-mean
                 :after-mean after-mean})))
       (print-table [:column-name
                     :before-skew :after-skew
                     :before-mean :after-mean])))

Maybe you can see the issue now. For columns with positive skew and small means, the log1p fix has very little effect. For columns with very large values, it may push the skew all the way to negative. And for columns with negative skew, it makes things worse.
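These three cases can be reproduced on toy data. The sketch below is plain Clojure with hypothetical columns, and it uses the simple moment-based skewness m3 / m2^(3/2); the `skew` operator in the pipeline may apply a bias correction, so exact values can differ.

```clojure
(defn mean [xs] (/ (reduce + xs) (double (count xs))))

(defn skewness
  "Population skewness m3 / m2^(3/2) (simple moment-based definition)."
  [xs]
  (let [m  (mean xs)
        d  (map #(- % m) xs)
        m2 (mean (map #(* % %) d))
        m3 (mean (map #(* % % %) d))]
    (/ m3 (Math/pow m2 1.5))))

(defn log1p-column [xs] (mapv #(Math/log1p (double %)) xs))

;; Hypothetical columns, one per case described above.
(def small-positive [0 0 0 1 2])                       ;; positive skew, small mean
(def large-positive [1000 100000 120000 150000 400000]) ;; positive skew, large values
(def negative       [1 50 80 95 100])                  ;; negative skew

(doseq [[label column] [["small-positive" small-positive]
                        ["large-positive" large-positive]
                        ["negative"       negative]]]
  (println label
           "before:" (format "%.3f" (skewness column))
           "after:"  (format "%.3f" (skewness (log1p-column column)))))
```

On this data the first column stays above the 0.5 threshold, the second flips from positive to negative skew, and the third becomes more negative than it started.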

(let [data (->> (results->accuracy-dataset skew-fix-models)
                (group-by :model-name))]
  {:data [{:y (->> (get data ":smile.regression/lasso")
                   (map :average-loss))
           :type "box"
           :name "lasso"}
          {:y (->> (get data ":smile.regression/elastic-net")
                   (map :average-loss))
           :type "box"
           :name "elastic-net"}]
   :layout {:autosize false :width 600 :height 600}})
(.plot js/Nextjournal box-skew)

13. std-scaler

  • range-scaler - scale a column so that its min/max equal the min/max of a target range. The range defaults to [-1 1].
  • std-scaler - scale a column so that its mean is 0 and its variance (and therefore its standard deviation) is 1.

(def std-scale-numeric-features
  [['std-scaler (vec numerical-features)]
   ['string->number 'string?]])

(def scaled-pipeline-result 
  (etl/apply-pipeline 
   skew-fixed 
   std-scale-numeric-features 
   {:target "SalePrice"}))



(def final-dataset (:dataset scaled-pipeline-result))
(def final-options (:options scaled-pipeline-result))



(println "Before std-scaler")

(->> (dataset/select skew-fixed (take 10 numerical-features) :all)
     (dataset/columns)
     (map (fn [col]
            (merge (ds-col/stats col [:mean :variance])
                   {:column-name (ds-col/column-name col)})))
     (print-table [:column-name :mean :variance]))

(println "\n\nAfter std-scaler")

(->> (dataset/select final-dataset (take 10 numerical-features) :all)
     (dataset/columns)
     (map (fn [col]
            (merge (ds-col/stats col [:mean :variance])
                   {:column-name (ds-col/column-name col)})))
     (print-table [:column-name :mean :variance]))
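Both scalers reduce to a line or two of arithmetic per column. The following is a plain-Clojure sketch with hypothetical helper names, not the pipeline operators themselves; it assumes the sample variance (dividing by n - 1) and a column that is not constant.

```clojure
(defn mean [xs] (/ (reduce + xs) (double (count xs))))

(defn variance
  "Sample variance (divides by n - 1); an assumption of this sketch."
  [xs]
  (let [m (mean xs)]
    (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
       (dec (count xs)))))

(defn std-scale
  "Shift to mean 0 and divide by the standard deviation."
  [xs]
  (let [m (mean xs)
        s (Math/sqrt (variance xs))]
    (mapv #(/ (- % m) s) xs)))

(defn range-scale
  "Linearly map the column's [min max] onto [lo hi], default [-1 1]."
  ([xs] (range-scale xs -1.0 1.0))
  ([xs lo hi]
   (let [mn (apply min xs)
         mx (apply max xs)]
     (mapv #(+ lo (* (- hi lo) (/ (- % mn) (double (- mx mn))))) xs))))

(def gr-liv-area [1710 1262 1786 1717 2198]) ;; hypothetical values

(let [scaled (std-scale gr-liv-area)]
  (println "mean:    " (mean scaled))
  (println "variance:" (variance scaled)))

(println (range-scale [0 5 10])) ;; [-1.0 0.0 1.0]
```

After `std-scale` the printed mean and variance should be 0 and 1 up to floating-point rounding, matching the pattern in the "After std-scaler" table.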

14. Final Models

We now train a range of models on our prepared data.

(def final-models (gridsearch-dataset "final" false final-dataset final-options))

(let [data (->> (results->accuracy-dataset final-models)
                (group-by :model-name))]
  {:data [{:y (->> (get data ":smile.regression/lasso")
                   (map :average-loss))
           :type "box"
           :name "lasso"}
          {:y (->> (get data ":smile.regression/elastic-net")
                   (map :average-loss))
           :type "box"
           :name "elastic-net"}
          {:y (->> (get data ":smile.regression/ridge")
                   (map :average-loss))
           :type "box"
           :name "ridge"}]
   :layout {:autosize false :width 600 :height 600}})
(.plot js/Nextjournal box-models)
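The box plots compare models by their `:average-loss`, which the helper functions at the top obtain from `loss/rmse` with `:k-fold 10`. As a rough illustration of those two ingredients, here is a plain-Clojure sketch; `rmse` and `k-fold-indices` are hypothetical stand-ins written for this example, and the way folds are assigned here (by index modulo k) is an assumption, not how `ml/gridsearch` necessarily splits the data.

```clojure
(defn rmse
  "Root mean squared error between predictions and labels."
  [predictions labels]
  (let [sq-errs (map (fn [p y] (let [e (- p y)] (* e e))) predictions labels)]
    (Math/sqrt (/ (reduce + sq-errs) (double (count labels))))))

(defn k-fold-indices
  "Split row indices 0..n-1 into k folds. Returns one
  {:test [...] :train [...]} map per fold; each row is tested exactly once."
  [n k]
  (for [i (range k)]
    (let [test-idx (filterv #(= i (mod % k)) (range n))]
      {:test  test-idx
       :train (vec (remove (set test-idx) (range n)))})))

(println (rmse [1 2 3 4] [3 4 5 6]))   ;; 2.0
(println (first (k-fold-indices 6 3))) ;; {:test [0 3], :train [1 2 4 5]}
```

Averaging the per-fold RMSE gives one loss value per candidate model, and since `SalePrice` was transformed with log1p, that loss is measured on the log scale.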