**GSoC 2020: MLJTime**

## Introduction to Time Series Classification in Julia

### Introduction

In this notebook, I want to give you an introduction to time series classification and our new toolbox in Julia called MLJTime. While Julia is a great language for machine learning, and the MLJ (Machine Learning in Julia) ecosystem provides many composable tools, there has been no dedicated time series toolbox.

Time series data has recently received renewed interest from the ML community. A lot of interesting work is being done in the time series field right now, see, for example, sktime.

The goal of MLJTime is to advance the time series analysis capabilities in Julia, partially inspired by the sktime package in Python and other existing time series packages in Julia. The vision of MLJTime is to provide state-of-the-art time series algorithms and MLJ-compatible tools for model building, tuning and evaluation.

### About the project

I am Aadesh and this project is part of my Google Summer of Code 2020. Over the course of the summer, I will be developing MLJTime. Guiding me on my quest are my mentors Markus Löning and Sebastian Vollmer. This is the first blog post about MLJTime, more are to follow.

**Setting up the environment**

We add the required packages for the Notebook.

```julia
using Pkg

Pkg.add("ZipFile")
Pkg.add("IndexedTables")
Pkg.add("Statistics")
Pkg.add("DecisionTree")
Pkg.add("CSVFiles")
Pkg.add("MLJModelInterface")
Pkg.add("MLJBase")
Pkg.add("StableRNGs")
Pkg.add("CategoricalArrays")
Pkg.add("MLJTuning")
Pkg.add("Plots")
Pkg.add(PackageSpec(url="https://github.com/alan-turing-institute/MLJTime.jl.git", rev="master"))
```

**Load data**

The package MLJTime provides access to some of the common time series classification data sets collected in the timeseriesclassification.com repository.

These are well-known, standard data sets that can be used to get started with data processing and time series classification. Here, we use the `Chinatown` data set. One can also load a `csv` file from disk by passing its location to the `load_dataset(path)` function.

```julia
using MLJTime
```

In this tutorial we use the `Chinatown` data set, derived from the Pedestrian Counting System of the City of Melbourne, Australia. The city has developed an automated pedestrian counting system to better understand pedestrian activity within the municipality, such as how people use different city locations at different times of the day. The data analysis can facilitate decision making and urban planning. Chinatown is an extract of data from 10 locations for the whole year 2017. Classes are based on whether data are from a normal day or a weekend day: Class 1 is Weekend, Class 2 is Weekday.

```julia
X, y = ts_dataset("Chinatown")
```

This is a binary time series classification problem with two class values:

```julia
unique(y)
```

### Plots

Let's visualise a time series for each class.

```julia
using Plots

series = matrix(X)
idx_class1, idx_class2 = findfirst(x -> x == 1, y), findfirst(x -> x == 2, y)
plot(transpose(series[[idx_class1, idx_class2], :]),
     xlabel="Time (24 hrs)", ylabel="Pedestrian count",
     label=["class 1 Weekend" "class 2 Weekday"], lw=3)
```

### Split data into training and test sets

```julia
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=1234)  # 70:30 split
```

In MLJTime, a case/instance is a pair {X, y}, where X is an `IndexedTable` with n observations x1, . . . , xn (the time series) and y is a discrete class variable holding `CategoricalValue`s.

```julia
X_train, y_train = X[train], y[train];
X_test, y_test = X[test], y[test];
```

**Time series classification model**

`TimeSeriesForestClassifier` is a modification of the random forest algorithm to the time series setting:

1. Split the series into multiple random intervals,
2. Extract features (mean, standard deviation and slope) from each interval,
3. Train a decision tree on the extracted features,
4. Ensemble steps 1–3.

For more details, take a look at the paper.
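To make steps 1 and 2 concrete, here is a minimal sketch of interval feature extraction in plain Julia. This is an illustration only, not MLJTime's implementation, and the helper names (`interval_features`, `random_interval`) are hypothetical:

```julia
using Statistics, Random

# Step 2: extract the three interval features used by time series forests —
# mean, standard deviation and slope (least-squares line fit) of a sub-series.
function interval_features(series::AbstractVector, lo::Int, hi::Int)
    x = series[lo:hi]
    t = collect(1.0:length(x))
    # Slope of the least-squares line through the points (t, x).
    slope = cov(t, x) / var(t)
    return (mean(x), std(x), slope)
end

# Step 1: draw one random interval [lo, hi] (length >= 2) within a series of length n.
function random_interval(rng, n::Int)
    lo = rand(rng, 1:n-1)
    hi = rand(rng, lo+1:n)
    return lo, hi
end

rng = MersenneTwister(42)
series = [1.0, 2.0, 4.0, 3.0, 5.0, 7.0]
lo, hi = random_interval(rng, length(series))
interval_features(series, lo, hi)
```

Each tree in the ensemble sees its own set of random intervals, so the forest can pick up discriminative patterns at many temporal scales.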

In MLJTime, we can write:

```julia
model = TimeSeriesForestClassifier(n_trees=3)
```

MLJTime has the same API as MLJ. We follow the MLJ-style interface to interact with our learning models, including the common `fit!`, `predict` and `evaluate!` functions. The only difference between MLJ and MLJTime is that `X` contains time series data (an `IndexedTable`) rather than tabular data. The general workflow is:

```julia
mach = machine(model, X, y)
fit!(mach)
predict(mach, X_test)
```

Here, we bind the model to the training data:

```julia
mach = machine(model, X_train, y_train)
```

**Fit model**

As in MLJ, to fit the machine you can use the `fit!` function, optionally specifying the rows to be used for training. The resulting `fitresult` will vary from model to model, though classifiers will usually return a tuple whose first element corresponds to the fitted model and whose second keeps track of how the classes are named (so that predictions can be appropriately labelled).

```julia
fit!(mach)
```
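To inspect what was learned, MLJ's generic `fitted_params` accessor can be applied to a fitted machine (MLJ-compatible models get at least a default implementation); the exact contents of the returned named tuple depend on the model:

```julia
using MLJBase: fitted_params

# Returns a named tuple describing the fitted model, e.g. the
# trained trees and class labels for a classifier.
fitted_params(mach)
```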

**Generate predictions**

You can now use the machine to make predictions with the `predict` function, specifying the rows to be used for prediction. Note that the output is probabilistic, effectively a vector with a score for each class. You can get the most likely class by applying the `mode` function to `y_pred`, or by using `predict_mode`:

```julia
y_pred = predict(mach, X_test)
```
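For example, assuming the usual MLJ conventions for probabilistic predictions, point predictions can be obtained either way:

```julia
using MLJBase: predict_mode

# Broadcast `mode` over the vector of per-instance class distributions...
mode.(y_pred)

# ...or ask the machine for point predictions directly.
predict_mode(mach, X_test)
```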

### Evaluate performance

Note that multiple measures can be specified jointly. Here only one measure is (implicitly) specified, the `cross_entropy`, but we still have to select the corresponding results. The reported measurement is the mean taken over the folds.

```julia
using MLJBase: evaluate!

evaluate!(mach, measure=cross_entropy)
```

### Tuning

Like in MLJ, tuning is implemented as a model wrapper. After wrapping a model in a tuning strategy (e.g. cross-validation) and binding the wrapped model to data in a machine, fitting the machine initiates a search for optimal model hyper-parameters.

As above, *wrapping a model in a tuning strategy* means creating a new "self-tuning" version of the model, `tuned_model = TunedModel(model=...)`, in which further keyword arguments specify:

1. The tuning algorithm,
2. The resampling strategy,
3. The measure (or measures) on which to base performance evaluations,
4. The range, i.e. the space of hyper-parameters to be searched.

```julia
using MLJTuning

tsf = TimeSeriesForestClassifier()
```

The `range` function takes a model (`tsf`), a symbol for the hyper-parameter of interest (`:n_trees`) and an indication of how to sample values.

```julia
r = range(tsf, :n_trees, lower=5, upper=10, scale=:log)
cv = CV(nfolds=5, shuffle=true)
tuned_model = TunedModel(model=tsf, ranges=r, measure=cross_entropy, resampling=cv)
```

Now we define the machine with the tuned model.

```julia
tuned_mach = machine(tuned_model, X_train, y_train)
```

As before, we fit the machine and generate predictions.

```julia
fit!(tuned_mach)
y_pred = predict(tuned_mach, X_test)
```

Finally, one can evaluate the performance and observe that tuning improves the predictive performance: the `cross_entropy` decreases from 2.84 to 1.13.

```julia
evaluate!(tuned_mach, measure=cross_entropy)
```

**Development roadmap: what comes next?**

In future work, I hope to add:

1. Support for multivariate time series classification algorithms,
2. Support for composition classes: pipelines,
3. Non-tree-based algorithms,
4. Forecasting.

We’re actively looking for contributors. Please get in touch if you’re interested in Julia, machine learning and time series. You can find us on GitHub: https://github.com/alan-turing-institute/MLJTime.jl