GSoC 2020: MLJTime

Introduction to Time Series Classification in Julia

Introduction

In this notebook, I want to give you an introduction to time series classification and our new toolbox in Julia called MLJTime. While Julia is a great language for ML, and the MLJ (Machine Learning in Julia) ecosystem provides many composable tools, there is as yet no dedicated time series toolbox.

Time series data has recently received renewed interest from the ML community, and a lot of interesting work is being done in the field right now; see, for example, sktime.

The goal of MLJTime is to advance the time series analysis capabilities in Julia, partially inspired by the sktime package in Python and other existing time series packages in Julia. The vision of MLJTime is to provide state-of-the-art time series algorithms and MLJ-compatible tools for model building, tuning and evaluation.   

About the project

I am Aadesh, and this project is part of Google Summer of Code 2020. Over the course of the summer, I will be developing MLJTime, guided by my mentors Markus Löning and Sebastian Vollmer. This is the first blog post about MLJTime; more will follow.

Setting up the environment

We add the packages required for this notebook.

using Pkg
Pkg.add("ZipFile")
Pkg.add("IndexedTables")
Pkg.add("Statistics")
Pkg.add("DecisionTree")
Pkg.add("CSVFiles")
Pkg.add("MLJModelInterface")
Pkg.add("MLJBase")
Pkg.add("StableRNGs")
Pkg.add("CategoricalArrays")
Pkg.add("MLJTuning")
Pkg.add("Plots")
Pkg.add(PackageSpec(url="https://github.com/alan-turing-institute/MLJTime.jl.git", rev="master"))

Load data

The package MLJTime provides access to some of the common time series classification data sets collected in the timeseriesclassification.com repository.

These are well-known, standard data sets that can be used to get started with data processing and time series classification. Here, we use the Chinatown data set. One can also use CSV files; in that case, specify the location of the file on your machine with the load_dataset(path) function.

using MLJTime
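To load your own CSV instead of a built-in data set, here is a minimal sketch. The path below is hypothetical, and we assume load_dataset returns the same (X, y) pair as ts_dataset; check the MLJTime documentation for the exact signature:

# hypothetical path — point this at a CSV file on your machine;
# we assume the return value matches that of ts_dataset
X, y = load_dataset("/path/to/Chinatown_TRAIN.csv")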

In this tutorial we use the Chinatown data set, derived from the Pedestrian Counting System of the City of Melbourne, Australia. The city has developed an automated pedestrian counting system to better understand pedestrian activity within the municipality, such as how people use different city locations at different times of the day. This analysis can support decision-making and urban planning for the future. Chinatown is an extract of data from 10 locations for the whole of 2017. Classes are based on whether a day is a weekday or a weekend day:

- Class 1: Weekend
- Class 2: Weekday

X, y = ts_dataset("Chinatown")
(Table with 363 rows, 24 columns:
Columns:
#   colname  type
────────────────────
1   1        Float64
2   2        Float64
3   3        Float64
4   4        Float64
5   5        Float64
6   6        Float64
7   7        Float64
8   8        Float64
9   9        Float64
10  10       Float64
11  11       Float64
12  12       Float64
13  13       Float64
14  14       Float64
15  15       Float64
16  16       Float64
17  17       Float64
18  18       Float64
19  19       Float64
20  20       Float64
21  21       Float64
22  22       Float64
23  23       Float64
24  24       Float64,
CategoricalArrays.CategoricalValue{Float64,UInt32}[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 … 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0])

This is a binary time series classification problem; the target takes two values:

unique(y)
2-element Array{Float64,1}:
 1.0
 2.0

Plots

Let's visualise a time series for each class.

using Plots

series = matrix(X)  # convert the IndexedTable of series to a Matrix (one row per instance)
# indices of the first instance of each class
Idx_class1, Idx_class2 = findfirst(x -> x == 1, y), findfirst(x -> x == 2, y)
plot(transpose(series[[Idx_class1, Idx_class2], :]),
     xlabel="Time (24 hrs)", ylabel="Pedestrian count",
     label=["class 1 Weekend" "class 2 Weekday"], lw=3)

Split data into training and test sets

train, test = partition(eachindex(y), 0.7, shuffle=true, rng=1234) #70:30 split
([177, 229, 217, 60, 196, 100, 131, 362, 295, 211 … 50, 111, 71, 159, 261, 294, 41, 219, 150, 325], [81, 357, 170, 23, 175, 114, 184, 163, 332, 15 … 99, 182, 324, 74, 80, 84, 127, 76, 152, 237])

In MLJTime, a case/instance is a pair (X, y), where X is an IndexedTable with n observations x1, …, xn (the time series) and y is a discrete class variable stored as CategoricalValues.

X_train, y_train = X[train], y[train];
X_test, y_test = X[test], y[test];

Time series classification model

TimeSeriesForestClassifier is an adaptation of the random forest algorithm to the time series setting:

  1. Split the series into multiple random intervals,

  2. Extract features (mean, standard deviation and slope) from each interval,

  3. Train a decision tree on the extracted features,

  4. Repeat steps 1–3 to build an ensemble of trees.

For more details, take a look at the time series forest paper by Deng et al. (2013).
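To make step 2 concrete, here is a minimal sketch of the kind of per-interval feature extraction a time series forest performs. This is illustrative only, not MLJTime's internal implementation, and interval_features is a hypothetical helper name:

using Statistics

# Extract (mean, standard deviation, slope) from one interval of a series.
# Illustrative sketch — not MLJTime's internal code.
function interval_features(x::AbstractVector{<:Real}, lo::Int, hi::Int)
    seg = x[lo:hi]
    t = lo:hi
    slope = cov(t, seg) / var(t)  # least-squares slope over the interval
    return (mean(seg), std(seg), slope)
end

# e.g. features of hours 3 to 10 of the first Chinatown series
interval_features(series[1, :], 3, 10)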

In MLJTime, we can write:

model = TimeSeriesForestClassifier(n_trees=3)
TimeSeriesForestClassifier(
    n_trees = 3,
    random_state = nothing,
    min_interval = 3,
    max_depth = -1,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0,
    pdf_smoothing = 0.0,
    display_depth = 5) @388

MLJTime has the same API as MLJ: we follow the MLJ-style interface to interact with learning models, including the usual fit!, predict and evaluate! functions. The only difference between MLJ and MLJTime is that X contains time series data (an IndexedTable) rather than tabular data.

In short, the basic workflow is:

mach = machine(model, X, y)
fit!(mach)
predict(mach, X_test)

mach = machine(model, X_train, y_train)
Machine{TimeSeriesForestClassifier} @097 trained 0 times.
  args:
    1: Source @272 ⏎ `ScientificTypes.Table{AbstractArray{ScientificTypes.Continuous,1}}`
    2: Source @882 ⏎ `AbstractArray{ScientificTypes.Multiclass{2},1}`

Fit model

As in MLJ, to fit the machine you use the function fit!, optionally specifying the rows to be used for training.

The fitresult will vary from model to model, though classifiers usually return a tuple whose first element corresponds to the fitted model and whose second keeps track of how classes are named (so that predictions can be appropriately labelled).

fit!(mach)
Machine{TimeSeriesForestClassifier} @097 trained 1 time.
  args:
    1: Source @272 ⏎ `ScientificTypes.Table{AbstractArray{ScientificTypes.Continuous,1}}`
    2: Source @882 ⏎ `AbstractArray{ScientificTypes.Multiclass{2},1}`
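To see this for our classifier, we can peek at the machine's raw fitresult after fitting. This is an internal, model-specific detail, shown here only for illustration:

# internal detail, for illustration only: for many classifiers the
# fitresult is a tuple of (fitted model, class names)
mach.fitresult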

Generate predictions

You can now use the machine to make predictions with the predict function, specifying the rows to be used for prediction. Note that the output is probabilistic: effectively a vector with a score for each class. You can get point predictions by applying the mode function to y_pred, or by using predict_mode:

y_pred = predict(mach, X_test)
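For example, a short sketch of turning the probabilistic predictions into point predictions (assuming predict_mode and mode are accessible from MLJBase, as in MLJ):

using MLJBase: predict_mode, mode

# point predictions directly from the machine:
predict_mode(mach, X_test)
# or, equivalently, the mode of each predicted distribution:
mode.(y_pred)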

Evaluate performance

Note that multiple measures can be specified jointly. Here only one measure, cross_entropy, is (implicitly) selected. The reported measurement is the mean taken over the folds of the default resampling strategy (6-fold cross-validation).

using MLJBase: evaluate!
evaluate!(mach, measure=cross_entropy)
┌───────────────┬───────────────┬──────────────────────────────────────────┐
│ _.measure     │ _.measurement │ _.per_fold                               │
├───────────────┼───────────────┼──────────────────────────────────────────┤
│ cross_entropy │ 2.84          │ [5.87, 2.22e-16, 1.72, 2.57, 4.29, 2.57] │
└───────────────┴───────────────┴──────────────────────────────────────────┘
_.per_observation = [[[36.0, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 36.0], [2.22e-16, 2.22e-16, ..., 36.0], [2.22e-16, 2.22e-16, ..., 2.22e-16]]]
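As a quick check, the headline number is simply the mean of the per-fold values shown above; a sketch using the fields of the returned evaluation object:

using Statistics

result = evaluate!(mach, measure=cross_entropy)
# the reported measurement is the mean over the folds
mean(result.per_fold[1]) ≈ result.measurement[1]  # true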

Tuning

Like in MLJ, tuning is implemented as a model wrapper. After wrapping a model in a tuning strategy (e.g. grid search) and binding the wrapped model to data in a machine, fitting the machine initiates a search for optimal model hyper-parameters.

Wrapping a model in a tuning strategy means creating a new "self-tuning" version of the model, tuned_model = TunedModel(model=...), in which further keyword arguments specify:

  1. The tuning strategy (here, a grid search).

  2. The resampling strategy.

  3. The measure (or measures) on which to base performance evaluations.

  4. The range, i.e. the space of hyper-parameters to be searched.

using MLJTuning
tsf = TimeSeriesForestClassifier()
TimeSeriesForestClassifier(
    n_trees = 200,
    random_state = nothing,
    min_interval = 3,
    max_depth = -1,
    min_samples_leaf = 1,
    min_samples_split = 2,
    min_purity_increase = 0.0,
    n_subfeatures = 0,
    post_prune = false,
    merge_purity_threshold = 1.0,
    pdf_smoothing = 0.0,
    display_depth = 5) @218

The range function takes a model (tsf), a symbol for the hyper-parameter of interest (:n_trees) and an indication of how to sample values.

r = range(tsf, :n_trees, lower=5, upper=10, scale=:log)
MLJBase.NumericRange(Int64, :n_trees, ... )
cv = CV(nfolds=5, shuffle=true)
CV(
    nfolds = 5,
    shuffle = true,
    rng = _GLOBAL_RNG()) @868
tuned_model = TunedModel(model=tsf, ranges=r, measure=cross_entropy, resampling=cv)

Now we define the machine with the tuned model.

tuned_mach = machine(tuned_model, X_train, y_train)
Machine{ProbabilisticTunedModel{Grid,…}} @072 trained 0 times.
  args:
    1: Source @583 ⏎ `ScientificTypes.Table{AbstractArray{ScientificTypes.Continuous,1}}`
    2: Source @324 ⏎ `AbstractArray{ScientificTypes.Multiclass{2},1}`

As before, we fit the machine and generate predictions.

fit!(tuned_mach)
Machine{ProbabilisticTunedModel{Grid,…}} @072 trained 1 time.
  args:
    1: Source @583 ⏎ `ScientificTypes.Table{AbstractArray{ScientificTypes.Continuous,1}}`
    2: Source @324 ⏎ `AbstractArray{ScientificTypes.Multiclass{2},1}`
y_pred = predict(tuned_mach, X_test)
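Before evaluating, we can inspect the hyper-parameters selected by the grid search; a sketch using the standard MLJTuning accessor (fitted_params on a tuned machine exposes best_model):

using MLJBase: fitted_params

# the best model found during tuning and its selected number of trees
best = fitted_params(tuned_mach).best_model
best.n_trees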

Finally, we evaluate the performance and observe that tuning improves the predictive performance: the cross_entropy decreases from 2.84 to 1.13.

evaluate!(tuned_mach, measure=cross_entropy)
┌───────────────┬───────────────┬──────────────────────────────────────────┐
│ _.measure     │ _.measurement │ _.per_fold                               │
├───────────────┼───────────────┼──────────────────────────────────────────┤
│ cross_entropy │ 1.13          │ [1.68, 0.838, 0.858, 0.858, 0.858, 1.72] │
└───────────────┴───────────────┴──────────────────────────────────────────┘
_.per_observation = [[[36.0, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 2.22e-16], [2.22e-16, 2.22e-16, ..., 2.22e-16]]]

Development roadmap: what comes next?

In future work, I hope to add:

  1. Support for multivariate time series classification algorithms.

  2. Support for composition classes: pipelines.

  3. Non-tree-based algorithms.

  4. Forecasting.

We’re actively looking for contributors. Please get in touch if you’re interested in Julia, machine learning and time series. You can find us on GitHub: https://github.com/alan-turing-institute/MLJTime.jl
