GSoC 2020 : MLJTime
Introduction to Time Series Classification in Julia
Introduction
In this notebook, I want to give you an introduction to time series classification and our new toolbox in Julia called MLJTime. While Julia is a great language for machine learning, and the MLJ (Machine Learning in Julia) ecosystem provides many composable tools, there has been no dedicated time series toolbox.
Time series data has recently received renewed interest from the ML community, and a lot of interesting work is being done in the field right now; see, for example, sktime.
The goal of MLJTime is to advance the time series analysis capabilities in Julia, partially inspired by the sktime package in Python and other existing time series packages in Julia. The vision of MLJTime is to provide state-of-the-art time series algorithms and MLJ-compatible tools for model building, tuning and evaluation.
About the project
I am Aadesh and this project is part of my Google Summer of Code 2020. Over the course of the summer, I will be developing MLJTime. Guiding me on my quest are my mentors Markus Löning and Sebastian Vollmer. This is the first blog post about MLJTime, more are to follow.
Setting up the environment
First, we add the packages required for this notebook.
using Pkg
Pkg.add("ZipFile")
Pkg.add("IndexedTables")
Pkg.add("Statistics")
Pkg.add("DecisionTree")
Pkg.add("CSVFiles")
Pkg.add("MLJModelInterface")
Pkg.add("MLJBase")
Pkg.add("StableRNGs")
Pkg.add("CategoricalArrays")
Pkg.add("MLJTuning")
Pkg.add("Plots")
Pkg.add(PackageSpec(url="https://github.com/alan-turing-institute/MLJTime.jl.git", rev="master"))
Load data
The package MLJTime provides access to some of the common time series classification data sets collected in the timeseriesclassification.com repository.
These are well-known, standard data sets that can be used to get started with data processing and time series classification. Here, we use the Chinatown data set. One can also use CSV files; in that case, you will need to specify the location of the file on your machine with the load_dataset(path) function.
using MLJTime
In this tutorial we are using the Chinatown example from the Pedestrian Counting System. The City of Melbourne, Australia, has developed an automated pedestrian counting system to better understand pedestrian activity within the municipality, such as how people use different city locations at different times of the day. The data analysis can facilitate decision-making and urban planning for the future. Chinatown is an extract of data from 10 locations for the whole year 2017. Classes are based on whether the data are from a normal day or a weekend day: Class 1: Weekend, Class 2: Weekday.
X, y = ts_dataset("Chinatown")
This is a binary time series classification problem; the target takes two values:
unique(y)
Plots
Let's visualise a time series for each class.
using Plots
series = matrix(X)
Idx_class1, Idx_class2 = findfirst(x->x == 1, y), findfirst(x->x == 2, y)
plot(transpose(series[[Idx_class1, Idx_class2], :]), xlabel="Time(24 hrs)", ylabel="Pedestrian count", label = ["class 1 Weekend" "class 2 Weekday"], lw = 3)
Split data into training and test sets
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=1234) #70:30 split
In MLJTime, a case/instance is a pair {X, y}, where X is an IndexedTable with n observations x1, . . . , xn (the time series) and y is a discrete class variable with CategoricalValues.
X_train, y_train = X[train], y[train];
X_test, y_test = X[test], y[test];
Time series classification model
TimeSeriesForestClassifier is an adaptation of the random forest algorithm to the time series setting:
Split the series into multiple random intervals,
Extract features (mean, standard deviation and slope) from each interval,
Train a decision tree on the extracted features,
Ensemble steps 1-3.
For more details, take a look at the paper.
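The interval-feature extraction in steps 1 and 2 can be sketched in plain Julia. This is a minimal illustration, not MLJTime's actual implementation; the function names interval_features and random_interval_features are hypothetical:

```julia
using Statistics, Random

# Extract the three interval features (mean, standard deviation, slope)
# from one interval of a univariate series. The slope is the least-squares
# fit of the interval values against time: cov(t, x) / var(t).
function interval_features(series::AbstractVector, start::Int, stop::Int)
    x = series[start:stop]
    t = collect(1.0:length(x))
    slope = cov(t, x) / var(t)
    return (mean(x), std(x), slope)
end

# Step 1: draw random intervals; step 2: collect their features into one
# flat feature vector, on which a decision tree can be trained (step 3).
function random_interval_features(series::AbstractVector, n_intervals::Int;
                                  rng = Random.GLOBAL_RNG)
    n = length(series)
    features = Float64[]
    for _ in 1:n_intervals
        start = rand(rng, 1:(n - 1))
        stop = rand(rng, (start + 1):n)
        append!(features, interval_features(series, start, stop))
    end
    return features
end
```

In the full classifier, such a feature vector is computed for every series, a decision tree is trained on the resulting feature matrix, and steps 1-3 are repeated for each tree in the ensemble.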
In MLJTime, we can write:
model = TimeSeriesForestClassifier(n_trees=3)
MLJTime has the same API as MLJ. We follow the MLJ-style interface to interact with our learning models, including the common fit!, predict and evaluate! functions. The only API difference between MLJ and MLJTime is that X contains time series data (an IndexedTable) rather than tabular data.
mach = machine(model, X_train, y_train)
Fit model
As in MLJ, to fit the machine you use the function fit!, optionally specifying the rows to be used for training.
The resulting fitresult will vary from model to model, though classifiers will usually return a tuple whose first element corresponds to the fitted model and whose second keeps track of how the classes are named (so that predictions can be appropriately labelled).
fit!(mach)
Generate predictions
You can now use the machine to make predictions with the predict function, specifying the rows to be used for the prediction. Note that the output is probabilistic, effectively a vector with a score for each class. You can get the mode by applying the mode function to y_pred, or by using predict_mode:
y_pred = predict(mach, X_test)
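Taking the mode of a probabilistic prediction just means picking the class with the highest score. As a minimal illustration with a plain dictionary (MLJ's predict_mode does this for its own distribution objects; predicted_mode here is a hypothetical name):

```julia
# Return the class label with the highest predicted probability.
function predicted_mode(dist::Dict)
    best_label, best_p = first(dist)
    for (label, p) in dist
        if p > best_p
            best_label, best_p = label, p
        end
    end
    return best_label
end
```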
Evaluate performance
Note that multiple measures can be specified jointly; here only one measure is (implicitly) used, the cross_entropy. The reported measurement is the mean taken over the folds.
using MLJBase: evaluate!
evaluate!(mach, measure=cross_entropy)
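For intuition, the cross_entropy of a single probabilistic prediction is the negative log of the probability the model assigned to the true class. A rough sketch, assuming each prediction is represented as a dictionary from class label to probability (mean_cross_entropy is a hypothetical name, not MLJBase's implementation):

```julia
using Statistics

# Mean cross-entropy over a set of probabilistic predictions.
# probs[i] maps each class label to its predicted probability for
# observation i; y[i] is the true class label.
function mean_cross_entropy(probs, y)
    return mean(-log(probs[i][y[i]]) for i in eachindex(y))
end
```

A perfect prediction (probability 1 on the true class) contributes 0, and the score grows without bound as the probability assigned to the true class approaches 0, which is why tuning aims to drive this measure down.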
Tuning
Like in MLJ, tuning is implemented as a model wrapper. After wrapping a model in a tuning strategy (e.g. grid search) and binding the wrapped model to data in a machine, fitting the machine initiates a search for optimal model hyper-parameters.
Wrapping a model in a tuning strategy means creating a new "self-tuning" version of the model, tuned_model = TunedModel(model=...), in which further keyword arguments specify:
The tuning algorithm.
The resampling strategy.
The measure (or measures) on which to base performance evaluations.
The range, i.e. the space of hyper-parameters to be searched.
using MLJTuning
tsf = TimeSeriesForestClassifier()
The range function takes a model (tsf), a symbol for the hyper-parameter of interest (:n_trees) and an indication of how to sample values.
r = range(tsf, :n_trees, lower=5, upper=10, scale=:log)
cv = CV(nfolds=5, shuffle=true)
tuned_model = TunedModel(model=tsf, range=r, measure=cross_entropy, resampling=cv)
Now we define the machine with the tuned model.
tuned_mach = machine(tuned_model, X_train, y_train)
As before, we fit the machine and generate predictions.
fit!(tuned_mach)
y_pred = predict(tuned_mach, X_test)
Finally, one can evaluate the performance and observe that tuning improves the predictive performance: the cross_entropy decreases from 2.84 to 1.13.
evaluate!(tuned_mach, measure=cross_entropy)
Development roadmap: what comes next?
In future work, I hope to add:
1. Support for multivariate time series classification algorithms.
2. Support for composition classes: pipelines.
3. Non-tree-based algorithms.
4. Forecasting.
We’re actively looking for contributors. Please get in touch if you’re interested in Julia, machine learning and time series. You can find us on GitHub: https://github.com/alan-turing-institute/MLJTime.jl