Getting Started
This is a tutorial for a couple of new Clojure libraries for Machine Learning and ETL -- part of the tech.ml stack.
Author: Chris Nuernberger
Translated to Nextjournal: Alan Marazzi
The API is still alpha; we are putting our efforts into extending and polishing it. Comments are welcome!
We work through an excellent article on advanced regression techniques. The target is to predict the SalePrice column.
1. Prepare the environment
We just have to mount a deps.edn file with all our dependencies.
{:deps {org.clojure/clojure {:mvn/version "1.9.0"}
        techascent/tech.ml {:mvn/version "0.19"}
        metasoarous/oz {:mvn/version "1.5.2"}
        org.clojure/tools.deps.alpha
        {:git/url "https://github.com/clojure/tools.deps.alpha.git"
         :sha "f6c080bd0049211021ea59e516d1785b08302515"}}}
(require '[oz.notebook.clojupyter :as oz])
(require '[tech.ml.dataset.etl :as etl])
(require '[tech.ml.dataset.etl.pipeline-operators :as pipe-ops])
(require '[tech.ml.dataset.etl.math-ops :as pipe-math])
(require '[tech.ml.dataset.etl.column-filters :as col-filters])
(require '[tech.ml.dataset :as dataset])
(require '[tech.ml.dataset.column :as ds-col])
(require '[tech.ml :as ml])
(require '[tech.ml.loss :as loss])
(require '[tech.ml.utils :as ml-utils])
(require '[clojure.core.matrix :as m])

;;use tablesaw as dataset backing store
(require '[tech.libs.tablesaw :as tablesaw])

;;model generators
(require '[tech.libs.smile.regression])

;;put/get nippy
(require '[tech.io :as io])

(require '[clojure.pprint :as pp])
(require '[clojure.set :as c-set])

(import '[java.io File])

(defn pp-str
  [ds]
  (with-out-str
    (pp/pprint ds)))

(defn print-table
  ([ks data]
   (->> data
        (map (fn [item-map]
               (->> item-map
                    (map (fn [[k v]]
                           [k (if (or (float? v)
                                      (double? v))
                                (format "%.3f" v)
                                v)]))
                    (into {}))))
        (pp/print-table ks)))
  ([data]
   (print-table (sort (keys (first data))) data)))

(defn print-dataset
  ([dataset column-names index-range]
   (print-table column-names
                (-> (dataset/select dataset column-names index-range)
                    (dataset/->flyweight))))
  ([dataset column-names]
   (print-dataset dataset column-names :all)))

(defn gridsearch-model
  [dataset-name dataset loss-fn opts]
  (let [gs-options (ml/auto-gridsearch-options
                    (assoc opts
                           :gridsearch-depth 75
                           :top-n 20))]
    (println (format "Dataset: %s, Model %s" dataset-name (:model-type opts)))
    (let [gs-start (System/nanoTime)
          {results :retval
           milliseconds :milliseconds} (ml-utils/time-section
                                        (ml/gridsearch
                                         (assoc gs-options :k-fold 10)
                                         loss-fn
                                         dataset))]
      (->> results
           (mapv #(merge % {:gridsearch-time-ms milliseconds
                            :dataset-name dataset-name}))))))

(defn gridsearch-dataset
  [dataset-name force-gridsearch? dataset options]
  (let [ds-filename (format "file://ames-%s-results.nippy" dataset-name)]
    (if (or force-gridsearch?
            (not (.exists ^File (io/file ds-filename))))
      (let [base-systems [{:model-type :smile.regression/lasso}
                          {:model-type :smile.regression/ridge}
                          {:model-type :smile.regression/elastic-net}]
            results (->> base-systems
                         (map #(merge options %))
                         (mapcat (partial gridsearch-model
                                          dataset-name
                                          dataset
                                          loss/rmse))
                         vec)]
        (io/put-nippy! ds-filename results)
        results)
      (io/get-nippy ds-filename))))

(defn results->accuracy-dataset
  [gridsearch-results]
  (->> gridsearch-results
       (map (fn [{:keys [average-loss options predict-time train-time]}]
              {:average-loss average-loss
               :model-name (str (:model-type options))
               :predict-time predict-time
               :train-time train-time}))))
Well, that wasn't particularly pleasant, but at least it is something you can cut and paste...
We mount the datasets as well.
(def src-dataset (tablesaw/path->tablesaw-dataset "train.csv"))

(println (m/shape src-dataset))
The shape is backward compared to pandas. This is intentional; core.matrix is a row-major linear algebra system, while tech.ml.dataset is column-major. Thus, to ensure sanity when doing conversions, we represent the data in its natural shape. Note that pandas returns [1460 81].
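The shape convention can be illustrated with a toy column-major "dataset"; this is a hedged sketch in plain Clojure (the `toy-ds` data and `shape` helper are illustrative, not part of the library):

```clojure
;; A column-major dataset is naturally a map of column-name -> column data.
;; Its shape is [column-count row-count], the transpose of pandas' [rows cols].
(def toy-ds {"SalePrice" [208500 181500 223500]
             "GrLivArea" [1710 1262 1786]})

(defn shape
  [ds]
  [(count ds) (count (first (vals ds)))])

(shape toy-ds) ;; => [2 3]
```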
2. Outliers
We first check for outliers, graph and then remove them.
(defn scatter
  [dataset cols]
  (let [data (-> dataset
                 (dataset/select cols :all)
                 (dataset/->flyweight))]
    {:data [{:x (map #(get % "SalePrice") data)
             :y (map #(get % "GrLivArea") data)
             :mode "markers"
             :type "scatter"}]
     :layout {:autosize false
              :width 600
              :height 600}}))
(scatter src-dataset ["SalePrice" "GrLivArea"])
(.plot js/Nextjournal scatter)
(def filtered-ds (pipe-ops/filter src-dataset "GrLivArea" '(< (col) 4000)))
(scatter filtered-ds ["SalePrice" "GrLivArea"])
(.plot js/Nextjournal scatter-filtered)
3. Initial Pipeline
We now begin to construct our data processing pipeline. Note that all pipeline operations are available as repl functions from the pipe-ops namespace. Since we have the pipeline outline we want from the article, I will mainly represent the pipeline as pure data.
(def initial-pipeline-from-article
  '[[remove "Id"]
    [m= "SalePrice" (log1p (col))]])
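The log1p step above compresses the long right tail of sale prices so the target is closer to normally distributed. A minimal sketch of the idea in plain Clojure (the `log1p`/`expm1` wrappers here are illustrative helpers around `java.lang.Math`, not the pipeline operators themselves):

```clojure
;; log1p(x) = ln(1 + x); expm1 is its exact inverse, so predictions made in
;; log space can be mapped back to dollars.
(defn log1p [x] (Math/log1p (double x)))
(defn expm1 [x] (Math/expm1 (double x)))

(def prices [100000.0 200000.0 750000.0])
(def log-prices (mapv log1p prices))

;; round-tripping recovers the original values (up to floating point)
(mapv (comp #(Math/round ^double %) expm1) log-prices)
;; => [100000 200000 750000]
```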
4. Categorical Fixes
Whether columns are categorical or not is defined by attributes.
(def more-categorical
  '[[set-attribute ["MSSubClass" "OverallQual" "OverallCond"] :categorical? true]])

(println "pre-categorical-count" (count (col-filters/categorical? filtered-ds)))

(def post-categorical-fix
  (-> (etl/apply-pipeline filtered-ds
                          (concat initial-pipeline-from-article more-categorical)
                          {})
      :dataset))

(println "post-categorical-count" (count (col-filters/categorical? post-categorical-fix)))
5. Missing Entries
Missing data is a theme that will come up again and again. Pandas has great tooling for cleaning up missing entries, and we borrow heavily from it.
;; Impressive patience to come up with this list!!
(def initial-missing-entries
  '[;; Handle missing values for features where median/mean or most common value
    ;; doesn't make sense

    ;; Alley : data description says NA means "no alley access"
    [replace-missing "Alley" "None"]

    ;; BedroomAbvGr : NA most likely means 0
    [replace-missing ["BedroomAbvGr"
                      "BsmtFullBath"
                      "BsmtHalfBath"
                      "BsmtUnfSF"
                      "EnclosedPorch"
                      "Fireplaces"
                      "GarageArea"
                      "GarageCars"
                      "HalfBath"
                      ;; KitchenAbvGr : NA most likely means 0
                      "KitchenAbvGr"
                      "LotFrontage"
                      "MasVnrArea"
                      "MiscVal"
                      ;; OpenPorchSF : NA most likely means no open porch
                      "OpenPorchSF"
                      "PoolArea"
                      ;; ScreenPorch : NA most likely means no screen porch
                      "ScreenPorch"
                      ;; TotRmsAbvGrd : NA most likely means 0
                      "TotRmsAbvGrd"
                      ;; WoodDeckSF : NA most likely means no wood deck
                      "WoodDeckSF"]
     0]

    ;; BsmtQual etc : data description says NA for basement features is "no basement"
    [replace-missing ["BsmtQual"
                      "BsmtCond"
                      "BsmtExposure"
                      "BsmtFinType1"
                      "BsmtFinType2"
                      ;; Fence : data description says NA means "no fence"
                      "Fence"
                      ;; FireplaceQu : data description says NA means "no fireplace"
                      "FireplaceQu"
                      ;; GarageType etc : data description says NA for garage
                      ;; is "no garage"
                      "GarageType"
                      "GarageFinish"
                      "GarageQual"
                      "GarageCond"
                      ;; MiscFeature : NA means "no misc feature"
                      "MiscFeature"
                      ;; PoolQC : data description says NA means "no pool"
                      "PoolQC"]
     "No"]

    [replace-missing "CentralAir" "N"]
    ;; Condition : NA most likely means Normal
    [replace-missing ["Condition1" "Condition2"] "Norm"]
    ;; EnclosedPorch : NA most likely means no enclosed porch
    ;; External stuff : NA most likely means average
    [replace-missing ["ExterCond"
                      "ExterQual"
                      ;; HeatingQC : NA most likely means typical
                      "HeatingQC"
                      ;; KitchenQual : NA most likely means typical
                      "KitchenQual"]
     "TA"]
    ;; Functional : data description says NA means typical
    [replace-missing "Functional" "Typ"]
    ;; LotShape : NA most likely means regular
    [replace-missing "LotShape" "Reg"]
    ;; MasVnrType : NA most likely means no veneer
    [replace-missing "MasVnrType" "None"]
    ;; PavedDrive : NA most likely means not paved
    [replace-missing "PavedDrive" "N"]
    [replace-missing "SaleCondition" "Normal"]
    [replace-missing "Utilities" "AllPub"]])

(println "pre missing fix #1")
(pp/pprint (dataset/columns-with-missing-seq post-categorical-fix))

(def post-missing
  (-> (etl/apply-pipeline post-categorical-fix initial-missing-entries {})
      :dataset))

(println "post missing fix #1")
(pp/pprint (dataset/columns-with-missing-seq post-missing))
6. String->Number
We need to convert string data into numbers somehow. One method is to build a lookup table such that one string column gets converted into one numeric column. The exact encoding of these strings can be very important for communicating semantic information from the dataset to the ml system. We remember all these mappings because we have to use them later. They get stored both in the recorded pipeline and in the options map so we can reverse-map label values back into their initial categorical values.
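The lookup-table idea above can be sketched in plain Clojure; this toy `quality->number` map mirrors the ordinal encodings defined below, and `map-invert` plays the role of the reverse mapping the pipeline records (the helper names here are illustrative, not library API):

```clojure
(require '[clojure.set :as cset])

;; One lookup table turns one string column into one numeric column, and the
;; inverted map recovers the original labels.
(def quality->number {"Po" 1 "Fa" 2 "TA" 3 "Gd" 4 "Ex" 5})
(def number->quality (cset/map-invert quality->number))

(def encoded (mapv quality->number ["TA" "Gd" "Ex" "TA"]))
;; => [3 4 5 3]

(mapv number->quality encoded)
;; => ["TA" "Gd" "Ex" "TA"]
```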
(def str->number-initial-map
  {"Alley" {"Grvl" 1 "Pave" 2 "None" 0}
   "BsmtCond" {"No" 0 "Po" 1 "Fa" 2 "TA" 3 "Gd" 4 "Ex" 5}
   "BsmtExposure" {"No" 0 "Mn" 1 "Av" 2 "Gd" 3}
   "BsmtFinType1" {"No" 0 "Unf" 1 "LwQ" 2 "Rec" 3 "BLQ" 4 "ALQ" 5 "GLQ" 6}
   "BsmtFinType2" {"No" 0 "Unf" 1 "LwQ" 2 "Rec" 3 "BLQ" 4 "ALQ" 5 "GLQ" 6}
   "BsmtQual" {"No" 0 "Po" 1 "Fa" 2 "TA" 3 "Gd" 4 "Ex" 5}
   "ExterCond" {"Po" 1 "Fa" 2 "TA" 3 "Gd" 4 "Ex" 5}
   "ExterQual" {"Po" 1 "Fa" 2 "TA" 3 "Gd" 4 "Ex" 5}
   "FireplaceQu" {"No" 0 "Po" 1 "Fa" 2 "TA" 3 "Gd" 4 "Ex" 5}
   "Functional" {"Sal" 1 "Sev" 2 "Maj2" 3 "Maj1" 4 "Mod" 5 "Min2" 6 "Min1" 7 "Typ" 8}
   "GarageCond" {"No" 0 "Po" 1 "Fa" 2 "TA" 3 "Gd" 4 "Ex" 5}
   "GarageQual" {"No" 0 "Po" 1 "Fa" 2 "TA" 3 "Gd" 4 "Ex" 5}
   "HeatingQC" {"Po" 1 "Fa" 2 "TA" 3 "Gd" 4 "Ex" 5}
   "KitchenQual" {"Po" 1 "Fa" 2 "TA" 3 "Gd" 4 "Ex" 5}
   "LandSlope" {"Sev" 1 "Mod" 2 "Gtl" 3}
   "LotShape" {"IR3" 1 "IR2" 2 "IR1" 3 "Reg" 4}
   "PavedDrive" {"N" 0 "P" 1 "Y" 2}
   "PoolQC" {"No" 0 "Fa" 1 "TA" 2 "Gd" 3 "Ex" 4}
   "Street" {"Grvl" 1 "Pave" 2}
   "Utilities" {"ELO" 1 "NoSeWa" 2 "NoSewr" 3 "AllPub" 4}})

(def str->number-pipeline
  (->> str->number-initial-map
       (map (fn [[k v-map]]
              ['string->number k v-map]))))

(def str-num-result (etl/apply-pipeline post-missing str->number-pipeline {}))
(def str-num-dataset (:dataset str-num-result))
(def str-num-ops (:options str-num-result))

(pp/pprint (:label-map str-num-ops))
7. Replacing values
There is a numeric operator that maps values in a column from one value to another. We now use this to provide simplified versions of some of the columns.
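The semantics of this kind of replace operation can be sketched in a few lines of plain Clojure (the `replace-vals` helper is illustrative, not the library operator): values found in the map are substituted, and everything else passes through unchanged.

```clojure
;; Collapse a 1-10 quality scale into bad/average/good (1/2/3).
(defn replace-vals
  [column value-map]
  (mapv #(get value-map % %) column))

(replace-vals [1 2 3 4 5 6 7 8 9 10]
              {1 1, 2 1, 3 1,        ;; bad
               4 2, 5 2, 6 2,        ;; average
               7 3, 8 3, 9 3, 10 3}) ;; good
;; => [1 1 1 2 2 2 3 3 3 3]
```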
(def replace-maps
  {;; Create new features
   ;; 1* Simplifications of existing features
   "SimplOverallQual" {"OverallQual" {1 1, 2 1, 3 1, ;; bad
                                      4 2, 5 2, 6 2, ;; average
                                      7 3, 8 3, 9 3, 10 3 ;; good
                                      }}
   "SimplOverallCond" {"OverallCond" {1 1, 2 1, 3 1, ;; bad
                                      4 2, 5 2, 6 2, ;; average
                                      7 3, 8 3, 9 3, 10 3 ;; good
                                      }}
   "SimplPoolQC" {"PoolQC" {1 1, 2 1, ;; average
                            3 2, 4 2 ;; good
                            }}
   "SimplGarageCond" {"GarageCond" {1 1, ;; bad
                                    2 1, 3 1, ;; average
                                    4 2, 5 2 ;; good
                                    }}
   "SimplGarageQual" {"GarageQual" {1 1, ;; bad
                                    2 1, 3 1, ;; average
                                    4 2, 5 2 ;; good
                                    }}
   "SimplFireplaceQu" {"FireplaceQu" {1 1, ;; bad
                                      2 1, 3 1, ;; average
                                      4 2, 5 2 ;; good
                                      }}
   "SimplFunctional" {"Functional" {1 1, 2 1, ;; bad
                                    3 2, 4 2, ;; major
                                    5 3, 6 3, 7 3, ;; minor
                                    8 4 ;; typical
                                    }}
   "SimplKitchenQual" {"KitchenQual" {1 1, ;; bad
                                      2 1, 3 1, ;; average
                                      4 2, 5 2 ;; good
                                      }}
   "SimplHeatingQC" {"HeatingQC" {1 1, ;; bad
                                  2 1, 3 1, ;; average
                                  4 2, 5 2 ;; good
                                  }}
   "SimplBsmtFinType1" {"BsmtFinType1" {1 1, ;; unfinished
                                        2 1, 3 1, ;; rec room
                                        4 2, 5 2, 6 2 ;; living quarters
                                        }}
   "SimplBsmtFinType2" {"BsmtFinType2" {1 1, ;; unfinished
                                        2 1, 3 1, ;; rec room
                                        4 2, 5 2, 6 2 ;; living quarters
                                        }}
   "SimplBsmtCond" {"BsmtCond" {1 1, ;; bad
                                2 1, 3 1, ;; average
                                4 2, 5 2 ;; good
                                }}
   "SimplBsmtQual" {"BsmtQual" {1 1, ;; bad
                                2 1, 3 1, ;; average
                                4 2, 5 2 ;; good
                                }}
   "SimplExterCond" {"ExterCond" {1 1, ;; bad
                                  2 1, 3 1, ;; average
                                  4 2, 5 2 ;; good
                                  }}
   "SimplExterQual" {"ExterQual" {1 1, ;; bad
                                  2 1, 3 1, ;; average
                                  4 2, 5 2 ;; good
                                  }}})

(def simplifications
  (->> replace-maps
       (mapv (fn [[k v-map]]
               (let [[src-name replace-data] (first v-map)]
                 ['m= k ['replace ['col src-name] replace-data]])))))

(pp/pprint (take 3 simplifications))

(def replace-dataset
  (-> (etl/apply-pipeline str-num-dataset simplifications {})
      :dataset))

(pp/pprint (-> (dataset/column str-num-dataset "KitchenQual")
               (ds-col/unique)))
(pp/pprint (-> (dataset/column replace-dataset "SimplKitchenQual")
               (ds-col/unique)))
8. Linear Combinations
We create a set of simple linear combinations that derive from our semantic understanding of the dataset.
(def linear-combinations
  ;; 2* Combinations of existing features
  '[;; Overall quality of the house
    [m= "OverallGrade" (* (col "OverallQual") (col "OverallCond"))]
    ;; Overall quality of the garage
    [m= "GarageGrade" (* (col "GarageQual") (col "GarageCond"))]
    ;; Overall quality of the exterior
    [m= "ExterGrade" (* (col "ExterQual") (col "ExterCond"))]
    ;; Overall kitchen score
    [m= "KitchenScore" (* (col "KitchenAbvGr") (col "KitchenQual"))]
    ;; Overall fireplace score
    [m= "FireplaceScore" (* (col "Fireplaces") (col "FireplaceQu"))]
    ;; Overall garage score
    [m= "GarageScore" (* (col "GarageArea") (col "GarageQual"))]
    ;; Overall pool score
    [m= "PoolScore" (* (col "PoolArea") (col "PoolQC"))]
    ;; Simplified overall quality of the house
    [m= "SimplOverallGrade" (* (col "SimplOverallQual") (col "SimplOverallCond"))]
    ;; Simplified overall quality of the exterior
    [m= "SimplExterGrade" (* (col "SimplExterQual") (col "SimplExterCond"))]
    ;; Simplified overall pool score
    [m= "SimplPoolScore" (* (col "PoolArea") (col "SimplPoolQC"))]
    ;; Simplified overall garage score
    [m= "SimplGarageScore" (* (col "GarageArea") (col "SimplGarageQual"))]
    ;; Simplified overall fireplace score
    [m= "SimplFireplaceScore" (* (col "Fireplaces") (col "SimplFireplaceQu"))]
    ;; Simplified overall kitchen score
    [m= "SimplKitchenScore" (* (col "KitchenAbvGr") (col "SimplKitchenQual"))]
    ;; Total number of bathrooms
    [m= "TotalBath" (+ (col "BsmtFullBath")
                       (* 0.5 (col "BsmtHalfBath"))
                       (col "FullBath")
                       (* 0.5 (col "HalfBath")))]
    ;; Total SF for house (incl. basement)
    [m= "AllSF" (+ (col "GrLivArea") (col "TotalBsmtSF"))]
    ;; Total SF for 1st + 2nd floors
    [m= "AllFlrsSF" (+ (col "1stFlrSF") (col "2ndFlrSF"))]
    ;; Total SF for porch
    [m= "AllPorchSF" (+ (col "OpenPorchSF")
                        (col "EnclosedPorch")
                        (col "3SsnPorch")
                        (col "ScreenPorch"))]
    ;; Encode MasVnrType
    [string->number "MasVnrType" ["None" "BrkCmn" "BrkFace" "CBlock" "Stone"]]
    [m= "HasMasVnr" (not-eq (col "MasVnrType") 0)]])

(def linear-combined-ds
  (-> (etl/apply-pipeline replace-dataset linear-combinations {})
      :dataset))

(let [print-columns ["TotalBath" "BsmtFullBath" "BsmtHalfBath"
                     "FullBath" "HalfBath"]]
  (println (print-table print-columns
                        (-> linear-combined-ds
                            (dataset/select print-columns (range 10))
                            (dataset/->flyweight)))))

(let [print-columns ["AllSF" "GrLivArea" "TotalBsmtSF"]]
  (println (print-table print-columns
                        (-> linear-combined-ds
                            (dataset/select print-columns (range 10))
                            (dataset/->flyweight)))))
9. Correlation Table
Let's check the correlations between the various columns and the target column (SalePrice).
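Pearson correlation (the default both for pandas' corr and for the correlation table below) is covariance divided by the product of standard deviations. A minimal self-contained sketch of the formula in plain Clojure (the `pearson` helper is illustrative, not the library function):

```clojure
(defn mean [xs] (/ (reduce + xs) (count xs)))

;; r = sum((x - mx)(y - my)) / (sqrt(sum((x - mx)^2)) * sqrt(sum((y - my)^2)))
(defn pearson
  [xs ys]
  (let [mx (mean xs)
        my (mean ys)
        dx (map #(- % mx) xs)
        dy (map #(- % my) ys)
        cov (reduce + (map * dx dy))
        sx (Math/sqrt (reduce + (map #(* % %) dx)))
        sy (Math/sqrt (reduce + (map #(* % %) dy)))]
    (/ cov (* sx sy))))

;; perfectly linear relationship => correlation of (approximately) 1.0
(pearson [1.0 2.0 3.0 4.0] [2.0 4.0 6.0 8.0])
```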
(def article-correlations
  ;;Default for pandas is pearson.
  ;; Find most important features relative to target
  (->> {"SalePrice" 1.000 "OverallQual" 0.819 "AllSF" 0.817
        "AllFlrsSF" 0.729 "GrLivArea" 0.719 "SimplOverallQual" 0.708
        "ExterQual" 0.681 "GarageCars" 0.680 "TotalBath" 0.673
        "KitchenQual" 0.667 "GarageScore" 0.657 "GarageArea" 0.655
        "TotalBsmtSF" 0.642 "SimplExterQual" 0.636 "SimplGarageScore" 0.631
        "BsmtQual" 0.615 "1stFlrSF" 0.614 "SimplKitchenQual" 0.610
        "OverallGrade" 0.604 "SimplBsmtQual" 0.594 "FullBath" 0.591
        "YearBuilt" 0.589 "ExterGrade" 0.587 "YearRemodAdd" 0.569
        "FireplaceQu" 0.547 "GarageYrBlt" 0.544 "TotRmsAbvGrd" 0.533
        "SimplOverallGrade" 0.527 "SimplKitchenScore" 0.523 "FireplaceScore" 0.518
        "SimplBsmtCond" 0.204 "BedroomAbvGr" 0.204 "AllPorchSF" 0.199
        "LotFrontage" 0.174 "SimplFunctional" 0.137 "Functional" 0.136
        "ScreenPorch" 0.124 "SimplBsmtFinType2" 0.105 "Street" 0.058
        "3SsnPorch" 0.056 "ExterCond" 0.051 "PoolArea" 0.041
        "SimplPoolScore" 0.040 "SimplPoolQC" 0.040 "PoolScore" 0.040
        "PoolQC" 0.038 "BsmtFinType2" 0.016 "Utilities" 0.013
        "BsmtFinSF2" 0.006 "BsmtHalfBath" -0.015 "MiscVal" -0.020
        "SimplOverallCond" -0.028 "YrSold" -0.034 "OverallCond" -0.037
        "LowQualFinSF" -0.038 "LandSlope" -0.040 "SimplExterCond" -0.042
        "KitchenAbvGr" -0.148 "EnclosedPorch" -0.149 "LotShape" -0.286}
       (sort-by second >)))

(def tech-ml-correlations
  (get (dataset/correlation-table linear-combined-ds :pearson) "SalePrice"))

(pp/print-table (map #(zipmap [:pandas :tech.ml.dataset] [%1 %2])
                     (take 20 article-correlations)
                     (take 20 tech-ml-correlations)))
10. Polynomial Combinations
We now extend the power of our linear models so that they are effectively polynomial models over a subset of the columns. We use the correlation table to decide which columns are worth it (the author used the top 10).
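For each selected column c the pipeline adds c squared, c cubed and sqrt(c), so a linear model over these features can fit low-order polynomial relationships. A minimal plain-Clojure sketch of the per-value math (the `poly-features` helper is illustrative, not the pipeline operator):

```clojure
;; Derive the three extra features the -s2/-s3/-sqrt columns hold.
(defn poly-features
  [x]
  {:x x
   :x-s2 (Math/pow x 2)
   :x-s3 (Math/pow x 3)
   :x-sqrt (Math/sqrt x)})

(poly-features 4.0)
;; => {:x 4.0, :x-s2 16.0, :x-s3 64.0, :x-sqrt 2.0}
```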
(defn polynomial-combinations
  [correlation-seq]
  (let [correlation-colnames (->> correlation-seq
                                  (drop 1)
                                  (take 10)
                                  (map first))]
    (->> correlation-colnames
         (mapcat (fn [colname]
                   [['m= (str colname "-s2") ['** ['col colname] 2]]
                    ['m= (str colname "-s3") ['** ['col colname] 3]]
                    ['m= (str colname "-sqrt") ['sqrt ['col colname]]]])))))

(def polynomial-pipe (polynomial-combinations tech-ml-correlations))

(def poly-data
  (-> (etl/apply-pipeline linear-combined-ds polynomial-pipe {})
      :dataset))

(pp/pprint (take 4 polynomial-pipe))

(print-dataset poly-data
               ["OverallQual" "OverallQual-s2" "OverallQual-s3" "OverallQual-sqrt"]
               (range 10))
11. Numeric Vs. Categorical
The article considers anything non-numeric to be categorical. This is a point on which the tech.ml.dataset system differs: any column can be considered categorical, and the underlying datatype does not change this definition. Earlier the article converted numeric columns to strings to indicate they are categorical, but we just set metadata.
(def numerical-features
  (col-filters/select-columns poly-data '[and [not "SalePrice"] numeric?]))
(def categorical-features
  (col-filters/select-columns poly-data '[and [not "SalePrice"] [not numeric?]]))

(println (count numerical-features))
(println (count categorical-features))

;;I printed out the categorical features from the article when using pandas.
(pp/pprint (->> (c-set/difference
                 (set ["MSSubClass" "MSZoning" "Alley" "LandContour" "LotConfig"
                       "Neighborhood" "Condition1" "Condition2" "BldgType"
                       "HouseStyle" "RoofStyle" "RoofMatl" "Exterior1st"
                       "Exterior2nd" "MasVnrType" "Foundation" "Heating"
                       "CentralAir" "Electrical" "GarageType" "GarageFinish"
                       "Fence" "MiscFeature" "MoSold" "SaleType" "SaleCondition"])
                 (set categorical-features))
                (map (comp ds-col/metadata (partial dataset/column poly-data)))))
(def fix-all-missing
  '[;;Fix any remaining numeric columns by using the median.
    [replace-missing numeric? (median (col))]
    ;;Fix any string columns by using 'NA'.
    [replace-missing string? "NA"]])

(def missing-fixed
  (-> (etl/apply-pipeline poly-data fix-all-missing {})
      :dataset))

(pp/pprint (dataset/columns-with-missing-seq missing-fixed))
12. Skew
Here is where things go a bit awry. We attempt to fix skew, but the attempted fix barely reduces the actual skew in the dataset. We will talk about what went wrong. We also begin running models on the intermediate stages to see the effect of these transformations.
(def fix-all-skew
  '[[m= [and
         [numeric?]
         [not "SalePrice"]
         [> (abs (skew (col))) 0.5]]
     (log1p (col))]])

(def skew-fix-result
  (etl/apply-pipeline missing-fixed fix-all-skew {:target "SalePrice"}))
(def skew-fixed (:dataset skew-fix-result))
(def skew-fixed-options (:options skew-fix-result))

;;Force gridsearch here if you want to re-run the real deal. I saved the results
;;to s3 and you downloaded them as part of get-data.sh
(def skew-fix-models
  (gridsearch-dataset "skew-fix" false
                      (pipe-ops/string->number skew-fixed
                                               (col-filters/string? skew-fixed))
                      skew-fixed-options))

(println "Pre-fix skew counts"
         (count (col-filters/select-columns
                 missing-fixed
                 '[and [numeric?] [not "SalePrice"] [> (abs (skew (col))) 0.5]])))
(println "Post-fix skew counts"
         (count (col-filters/select-columns
                 skew-fixed
                 '[and [numeric?] [not "SalePrice"] [> (abs (skew (col))) 0.5]])))
That didn't work. Or at least it barely did. What happened??
(let [before-columns (set (col-filters/select-columns
                           missing-fixed
                           '[and [numeric?] [not "SalePrice"]
                             [> (abs (skew (col))) 0.5]]))
      after-columns (set (col-filters/select-columns
                          skew-fixed
                          '[and [numeric?] [not "SalePrice"]
                            [> (abs (skew (col))) 0.5]]))
      check-columns (c-set/intersection before-columns after-columns)]
  (->> check-columns
       (map (fn [colname]
              (let [{before-min :min before-max :max
                     before-mean :mean before-skew :skew}
                    (-> (dataset/column missing-fixed colname)
                        (ds-col/stats [:min :max :mean :skew]))
                    {after-min :min after-max :max
                     after-mean :mean after-skew :skew}
                    (-> (dataset/column skew-fixed colname)
                        (ds-col/stats [:min :max :mean :skew]))]
                {:column-name colname
                 :before-skew before-skew
                 :after-skew after-skew
                 :before-mean before-mean
                 :after-mean after-mean})))
       (print-table [:column-name :before-skew :after-skew
                     :before-mean :after-mean])))
Maybe you can see the issue now. For positive skew and small means, the log1p fix has very little effect. For very large values, it may push the skew all the way to negative. And for negative skew, it makes things worse.
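A rough illustration of the small-mean case in plain Clojure, using the population sample-skewness formula (the `skewness` helper is illustrative, not the library's skew operator): log1p is nearly linear near zero, so it barely changes the distribution's shape.

```clojure
(defn mean [xs] (/ (reduce + xs) (count xs)))

;; Skewness as the third central moment over variance^(3/2).
(defn skewness
  [xs]
  (let [m (mean xs)
        n (count xs)
        m2 (/ (reduce + (map #(Math/pow (- % m) 2) xs)) n)
        m3 (/ (reduce + (map #(Math/pow (- % m) 3) xs)) n)]
    (/ m3 (Math/pow m2 1.5))))

;; Positive skew with a tiny mean: log1p leaves the skew almost unchanged.
(def small-vals [0.0 0.0 0.0 0.1 0.2 2.0])

(skewness small-vals)
(skewness (map #(Math/log1p %) small-vals))
```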
(let [data (->> (results->accuracy-dataset skew-fix-models)
                (group-by :model-name))]
  {:data [{:y (->> (get data ":smile.regression/lasso")
                   (map :average-loss))
           :type "box"
           :name "lasso"}
          {:y (->> (get data ":smile.regression/elastic-net")
                   (map :average-loss))
           :type "box"
           :name "elastic-net"}]
   :layout {:autosize false :width 600 :height 600}})
(.plot js/Nextjournal box-skew)
13. std-scaler
- range-scaler - scale column such that min/max equal a range min/max. Range defaults to [-1 1].
- std-scaler - scale column such that mean = 0 and variance,stddev = 1.
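The std-scaler math can be sketched in a few lines of plain Clojure (the `std-scale` helper is illustrative, using the sample standard deviation, not the library operator): subtract the mean, divide by the standard deviation, and the result has mean 0 and variance 1.

```clojure
(defn mean [xs] (/ (reduce + xs) (count xs)))

;; (x - mean) / stddev for every value in the column.
(defn std-scale
  [xs]
  (let [m (mean xs)
        variance (/ (reduce + (map #(Math/pow (- % m) 2) xs))
                    (dec (count xs)))
        sd (Math/sqrt variance)]
    (mapv #(/ (- % m) sd) xs)))

(def scaled (std-scale [1.0 2.0 3.0 4.0 5.0]))

(mean scaled) ;; approximately 0.0 (up to floating point)
```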
(def std-scale-numeric-features
  [['std-scaler (vec numerical-features)]
   ['string->number 'string?]])

(def scaled-pipeline-result
  (etl/apply-pipeline skew-fixed std-scale-numeric-features {:target "SalePrice"}))
(def final-dataset (:dataset scaled-pipeline-result))
(def final-options (:options scaled-pipeline-result))

(println "Before std-scaler")
(->> (dataset/select skew-fixed (take 10 numerical-features) :all)
     (dataset/columns)
     (map (fn [col]
            (merge (ds-col/stats col [:mean :variance])
                   {:column-name (ds-col/column-name col)})))
     (print-table [:column-name :mean :variance]))

(println "\n\nAfter std-scaler")
(->> (dataset/select final-dataset (take 10 numerical-features) :all)
     (dataset/columns)
     (map (fn [col]
            (merge (ds-col/stats col [:mean :variance])
                   {:column-name (ds-col/column-name col)})))
     (print-table [:column-name :mean :variance]))
14. Final Models
We now train our prepared data across a range of models.
(def final-models (gridsearch-dataset "final" false final-dataset final-options))
(let [data (->> (results->accuracy-dataset final-models)
                (group-by :model-name))]
  {:data [{:y (->> (get data ":smile.regression/lasso")
                   (map :average-loss))
           :type "box"
           :name "lasso"}
          {:y (->> (get data ":smile.regression/elastic-net")
                   (map :average-loss))
           :type "box"
           :name "elastic-net"}
          {:y (->> (get data ":smile.regression/ridge")
                   (map :average-loss))
           :type "box"
           :name "ridge"}]
   :layout {:autosize false :width 600 :height 600}})
(.plot js/Nextjournal box-models)