Titanic Walkthrough
Remix this to get started with Julia
"$VERSION"
Here are the files available in the Kaggle Titanic competition:
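If the competition files have been downloaded next to this notebook, listing the working directory is an easy way to confirm they are present (the three filenames below are the standard Kaggle Titanic names, assumed rather than taken from this notebook):

readdir()   # expect "gender_submission.csv", "test.csv", "train.csv" among the entries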
Let's load them into Julia so we can work with them as tables.
import Pkg; Pkg.add("CSV")
using CSV

# Standard Kaggle Titanic file names (assumed): example submission, test set, training set.
survival_data = CSV.read("gender_submission.csv")
test_data = CSV.read("test.csv")
train_data = CSV.read("train.csv")
train_data.Sex
I don't want to worry about filling in missing values with modes or averages right now, so I remove any rows that are missing data.
using DataFrames
dropmissing!(train_data)
dropmissing!(test_data)
dropmissing!(survival_data)
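For reference, the imputation approach skipped above might look like this for the Age column, filling missing entries with the column mean instead of dropping the rows (a minimal sketch, not part of the original walkthrough):

using Statistics

# Fill missing ages with the mean of the observed ages instead of dropping the rows.
mean_age = mean(skipmissing(train_data[:Age]))
train_data[:Age] = coalesce.(train_data[:Age], mean_age)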
I tried to use XGBoost here, but it doesn't build on Julia v1.1. The issue is being discussed here: https://github.com/dmlc/XGBoost.jl/issues/65
Let's import the Alan Turing Institute's MLJ.jl instead and use its DecisionTreeClassifier, described here: https://alan-turing-institute.github.io/MLJ.jl/dev/
Pkg.add("MLJ") Pkg.add("DecisionTree") Pkg.add("MLJModels") using MLJ DecisionTreeClassifier import MLJModels import DecisionTree import MLJModels.DecisionTree_.DecisionTreeClassifier
In the sample code, max_depth was originally set to 2 with poor results, so I increased the depth of the model to 4.
tree_model = DecisionTreeClassifier(max_depth=4)

# Returned model:
# DecisionTreeClassifier(pruning_purity = 1.0, max_depth = 4, min_samples_leaf = 1,
#                        min_samples_split = 2, min_purity_increase = 0.0, n_subfeatures = 0.0,
#                        display_depth = 5, post_prune = false, merge_purity_threshold = 0.9)
Ideally this model would include all columns, but the DecisionTreeClassifier requires every feature to have the Continuous scientific type, so I will include only ticket class, age, ticket fare, and a 0/1 encoding of sex.
Pkg.add("IterableTables") function columncheck(x) if (x != :Pclass) && (x != :Age) && (x != :Fare) && (x != :Genderfix) return false else return true end end train_data[:Genderfix] = train_data[:Sex] == 'M' ? 1 : 0 train_data[:Genderfix] = coerce(Continuous, train_data[:Genderfix]) train_data[:Pclass] = coerce(Continuous, train_data[:Pclass] * 1) train_data[:Age] = coerce(Continuous, train_data[:Age] * 1) train_data[:Fare] = coerce(Continuous, train_data[:Fare] * 1) train_data[:Survived] = coerce(Multiclass, train_data[:Survived] * 1) trainhelp = train_data[:, filter(columncheck, names(train_data))] #scitype(trainhelp[:Pclass]) tree = machine(tree_model, trainhelp, train_data.Survived)
Now we provide a randomly shuffled 90% of the training-data rows as training data; the other 10% will be used for testing. (In a real Kaggle project we would instead predict on the test CSV we loaded; a sketch of that appears at the end.)
train, test = partition(eachindex(train_data.Survived), 0.9, shuffle=true)
fit!(tree, rows=train)
yhat = predict(tree, trainhelp[test,:])
misclassification_rate(yhat, train_data.Survived[test])
The resulting classifier is not that accurate :'( and its misclassification rate varies from 20-40% depending on what falls into the test set.
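Because a single shuffled split is noisy, a cross-validated estimate gives a steadier number. A minimal sketch using MLJ's built-in resampling (exact keyword names can vary between MLJ versions):

# Evaluate the same machine over 6 folds instead of one random holdout split.
evaluate!(tree,
          resampling=CV(nfolds=6, shuffle=true),
          measure=misclassification_rate)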
Naive Bayes
Let's try running the same data through a Naive Bayes classifier instead.
Pkg.add("CSV") using CSV train_data = CSV.read(train.csv) using DataFrames dropmissing!(train_data)
Pkg.add("MLJ") Pkg.add("NaiveBayes") Pkg.add("MLJModels") using MLJ MultinomialNBClassifier import MLJModels import NaiveBayes import MLJModels.NaiveBayes_.MultinomialNBClassifier
bmodel = MultinomialNBClassifier()

Pkg.add("IterableTables")

# Keep only the feature columns used by the model.
function columncheck(x)
    if (x != :Pclass) && (x != :Age) && (x != :Fare) && (x != :Genderfix)
        return false
    else
        return true
    end
end

# The multinomial model wants Count features, so round everything down to integers first.
train_data[:Age] = map(x -> floor(x * 1), train_data[:Age])
train_data[:Age] = coerce(Count, train_data[:Age])

# The Sex column holds the strings "male"/"female", so encode it as 0/1 element-wise.
train_data[:Genderfix] = map(x -> x == "male" ? 1 : 0, train_data[:Sex])
train_data[:Genderfix] = coerce(Count, train_data[:Genderfix])

train_data[:Pclass] = coerce(Count, train_data[:Pclass] * 1)
train_data[:Fare] = map(x -> floor(x * 1), train_data[:Fare])
train_data[:Fare] = coerce(Count, train_data[:Fare])
train_data[:Survived] = coerce(Multiclass, train_data[:Survived] * 1)

trainhelp = train_data[:, filter(columncheck, names(train_data))]
#scitype(trainhelp[:Pclass])

# Bind the Naive Bayes model to the feature table and the target column.
tree = machine(bmodel, trainhelp, train_data.Survived)
train, test = partition(eachindex(train_data.Survived), 0.8, shuffle=true)
fit!(tree, rows=train)
yhat = predict(tree, trainhelp[test,:])
misclassification_rate(yhat, train_data.Survived[test])
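As noted earlier, a real Kaggle entry would predict on the test CSV rather than on a held-out slice of the training data. A rough sketch of what that could look like with the Naive Bayes machine fitted above (assumed, not part of the original walkthrough; note that dropmissing! removed some test rows, while Kaggle expects a prediction for every passenger, so a real submission would impute instead):

# Apply the same feature preparation to the test set (Count coercions, matching the machine above).
test_data[:Age] = map(x -> floor(x * 1), test_data[:Age])
test_data[:Age] = coerce(Count, test_data[:Age])
test_data[:Genderfix] = map(x -> x == "male" ? 1 : 0, test_data[:Sex])
test_data[:Genderfix] = coerce(Count, test_data[:Genderfix])
test_data[:Pclass] = coerce(Count, test_data[:Pclass] * 1)
test_data[:Fare] = map(x -> floor(x * 1), test_data[:Fare])
test_data[:Fare] = coerce(Count, test_data[:Fare])
testhelp = test_data[:, filter(columncheck, names(test_data))]

# Collapse the probabilistic predictions to the most likely class
# (on older MLJ versions, mode.(predict(tree, testhelp)) does the same thing).
preds = predict_mode(tree, testhelp)
submission = DataFrame(PassengerId = test_data[:PassengerId], Survived = preds)
CSV.write("submission.csv", submission)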