Nick Doiron / Jun 03 2019
Remix of Julia by Nextjournal

Titanic Walkthrough

Remix this to get started with Julia.

"$VERSION"
"1.1.0"

Here are the files available from the Kaggle Titanic competition:

gender_submission.csv
test.csv
train.csv

Let's load them into Julia so we can work with them as tables.

import Pkg
Pkg.add("CSV")
using CSV
survival_data = CSV.read("gender_submission.csv")
test_data = CSV.read("test.csv")
train_data = CSV.read("train.csv")
train_data.Sex
891-element Array{Union{Missing, String},1}:
 "male"
 "female"
 "female"
 "female"
 "male"
 "male"
 "male"
 "male"
 "female"
 "female"
 ⋮
 "female"
 "male"
 "male"
 "female"
 "male"
 "female"
 "female"
 "male"
 "male"
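To sanity-check the load before going further, here is a quick look at the table's shape and first few rows. A minimal sketch; since we call dropmissing! on it below, the loaded table behaves as a DataFrame, so the DataFrames helpers apply:

using DataFrames

# How many rows and columns did we load, and what are the column names?
size(train_data), names(train_data)

# Peek at the first five rows
first(train_data, 5)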

I don't want to worry about filling in missing values with the mode or an average right now, so I remove any rows that are missing data.

using DataFrames
dropmissing!(train_data)
dropmissing!(test_data)
dropmissing!(survival_data)
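If we did want to keep those rows instead, one common alternative is to fill the missing ages with the mean of the observed ages. A minimal sketch of that idea, not part of the original walkthrough (assuming only Age needs imputing):

using Statistics

# Hypothetical alternative to dropmissing!: impute missing ages with the mean observed age
mean_age = mean(skipmissing(train_data[:Age]))
train_data[:Age] = coalesce.(train_data[:Age], mean_age)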

I tried to use XGBoost here, but it doesn't build on Julia v1.1; that is being discussed here: https://github.com/dmlc/XGBoost.jl/issues/65

Let's import the Alan Turing Institute's MLJ.jl instead, and use their Decision Tree classifier described here: https://alan-turing-institute.github.io/MLJ.jl/dev/

Pkg.add("MLJ")
Pkg.add("DecisionTree")
Pkg.add("MLJModels")
using MLJ
@load DecisionTreeClassifier
import MLJModels
import DecisionTree
import MLJModels.DecisionTree_.DecisionTreeClassifier

In the sample code, max_depth was originally set to 2 with poor results, so I increased the depth of the model to 4.

tree_model = DecisionTreeClassifier(max_depth=4)
DecisionTreeClassifier(pruning_purity = 1.0,
                       max_depth = 4,
                       min_samples_leaf = 1,
                       min_samples_split = 2,
                       min_purity_increase = 0.0,
                       n_subfeatures = 0.0,
                       display_depth = 5,
                       post_prune = false,
                       merge_purity_threshold = 0.9,)

Ideally this model would include all columns, but the DecisionTreeClassifier requires every input feature to have the Continuous scientific type, so I will include only ticket class, age, fare, and a numeric encoding of sex.

Pkg.add("IterableTables")
function columncheck(x)
  if (x != :Pclass) && (x != :Age) && (x != :Fare) && (x != :Genderfix)
    return false
  else
    return true
  end
end
train_data[:Genderfix] = train_data[:Sex] == 'M' ? 1 : 0
train_data[:Genderfix] = coerce(Continuous, train_data[:Genderfix])
train_data[:Pclass] = coerce(Continuous, train_data[:Pclass] * 1)
train_data[:Age] = coerce(Continuous, train_data[:Age] * 1)
train_data[:Fare] = coerce(Continuous, train_data[:Fare] * 1)
train_data[:Survived] = coerce(Multiclass, train_data[:Survived] * 1)
trainhelp = train_data[:, filter(columncheck, names(train_data))]
#scitype(trainhelp[:Pclass])
tree = machine(tree_model, trainhelp, train_data.Survived)
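Before fitting, it is worth confirming that the coercions produced the scientific types the classifier expects; the commented-out scitype call above hints at this check:

# Each feature column should now report a Continuous element scitype,
# and the target should report Multiclass
scitype(trainhelp[:Pclass])
scitype(trainhelp[:Age])
scitype(train_data[:Survived])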

Now we provide a randomly shuffled 90% of the training rows for fitting (the other 10% is held out for evaluation; in a real Kaggle project we would train on all of train.csv and predict on the separate test CSV we loaded).

train, test = partition(eachindex(train_data.Survived), 0.9, shuffle=true)
fit!(tree, rows=train)
Machine{DecisionTreeClassifier} @ 8…14
yhat = predict(tree, trainhelp[test,:])
misclassification_rate(yhat, train_data.Survived[test])
0.277778

The resulting classifier is not that accurate :'( and its misclassification rate varies from 20-40% depending on which rows land in the test set.
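To see how much of that spread comes from the particular split versus the depth setting, one simple check is to repeat the shuffled 90/10 split several times for a few candidate depths and average the misclassification rate. A rough sketch, reusing trainhelp from above (the depths and trial count here are just illustrative):

using Statistics

# Average the holdout misclassification rate over several random 90/10 splits,
# for a few candidate tree depths
for depth in (2, 4, 8)
    rates = Float64[]
    for trial in 1:10
        tr, te = partition(eachindex(train_data.Survived), 0.9, shuffle=true)
        mach = machine(DecisionTreeClassifier(max_depth=depth), trainhelp, train_data.Survived)
        fit!(mach, rows=tr)
        push!(rates, misclassification_rate(predict(mach, trainhelp[te, :]), train_data.Survived[te]))
    end
    println("max_depth = ", depth, ": mean rate ", round(mean(rates), digits=3), " ± ", round(std(rates), digits=3))
end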

Naive Bayes

Let's try running the same CSV through a Naive Bayes classifier instead.

Pkg.add("CSV")
using CSV
train_data = CSV.read(
train.csv
) using DataFrames dropmissing!(train_data)
Pkg.add("MLJ")
Pkg.add("NaiveBayes")
Pkg.add("MLJModels")
using MLJ
@load MultinomialNBClassifier
import MLJModels
import NaiveBayes
import MLJModels.NaiveBayes_.MultinomialNBClassifier
bmodel = MultinomialNBClassifier()

Pkg.add("IterableTables")
function columncheck(x)
  if (x != :Pclass) && (x != :Age) && (x != :Fare) && (x != :Genderfix)
    return false
  else
    return true
  end
end
# MultinomialNB expects count-valued (integer) features, so round ages and fares down
train_data[:Age] = map(x -> floor(x * 1), train_data[:Age])
train_data[:Age] = coerce(Count, train_data[:Age])
# Sex is stored as "male"/"female", so encode it as 1/0 row by row
train_data[:Genderfix] = map(x -> x == "male" ? 1 : 0, train_data[:Sex])
train_data[:Genderfix] = coerce(Count, train_data[:Genderfix])
train_data[:Pclass] = coerce(Count, train_data[:Pclass] * 1)
train_data[:Fare] = map(x -> floor(x * 1), train_data[:Fare])
train_data[:Fare] = coerce(Count, train_data[:Fare])
train_data[:Survived] = coerce(Multiclass, train_data[:Survived] * 1)
trainhelp = train_data[:, filter(columncheck, names(train_data))]
#scitype(trainhelp[:Pclass])
tree = machine(bmodel, trainhelp, train_data.Survived)
Machine{MultinomialNBClassifier} @ 2…46
train, test = partition(eachindex(train_data.Survived), 0.8, shuffle=true)
fit!(tree, rows=train)
Machine{MultinomialNBClassifier} @ 2…46
yhat = predict(tree, trainhelp[test,:])
misclassification_rate(yhat, train_data.Survived[test])
0.324324