Pietro Vertechi / Dec 13 2018
Remix of Julia by Nextjournal

StatsMakie: statistical and data visualizations on the GPU in Julia

The StatsMakie package has just been released. It allows to do statistical and data visualizations using the high performance plotting package Makie.

1.
Set up

pkg"add StatsMakie"
pkg"add Distributions StatsBase KernelDensity RDatasets"

2.
Statistics

StatsMakie allows to visualize many statistical objects in Julia.

using AbstractPlotting, GLMakie, StatsMakie
using StatsBase
x = randn(10000)
h = fit(Histogram, x)
plot(h)

The plot type is inferred from the data type. A 1D histogram is plotted by default as a bar plot, whereas a 2D histogram would default to a heatmap

N = 10000
x = 100 .* randn(N)
y = 100 .* (randn(N) .+ 5 .* rand(0:1, N))
h = fit(Histogram, (x, y))
plot(h)

but different plot types can be used, as long as they make sense for that type of data:

vbox(surface(h), wireframe(h))

Also kernel densities can be visualized, for a smoother estimate:

using KernelDensity
d = kde((x, y))
plot(d)

Distributions admit a canonical visualization (pdf on the support)

using Distributions
d1 = Normal()
d2 = Poisson()
vbox(plot(d1), plot(d2))

And sometimes also more sophisticated objects admit a plotting representation. For example, a QQPair will tell us whether two variables are distributed according to the same distribution.

q1 = qqbuild(x, y)
q2 = qqbuild(x, x)
vbox(plot(q1, markersize = 10), plot(q2, markersize = 10))

In this case for example x is not distributed like y (as the markers are very far from the line) but unsurprisingly x is distributed like x.

3.
Grouping, styling and data

StatsMakie also provides utilities to work with data (potentially grouped) and to use some variables from a dataset to style plots. Let's set up a few variables to play with:

N = 1000
a = rand(1:2, N) # a discrete variable
b = rand(1:2, N) # a discrete variable
x = randn(N) # a continuous variable
y = @. x * a + 0.8*randn() # a continuous variable
z = x .+ y; # a continuous variable

Now we can see how x and y are related and how variable a influences their relationship. This is done with Group. We can group by as many things as we want and we can use any attribute to do so, provided that there is a default palette for that attribute.

ms = 0.25
plot0 = scatter(x, y, markersize = ms)
plot1 = scatter(Group(color = a), x, y, markersize = ms)
plot2 = scatter(Group(marker = a), x, y, markersize = ms)
plot3 = scatter(Group(marker = a, color = b), x, y, markersize = ms)
vbox(hbox(plot0, plot1), hbox(plot2, plot3))

Manual values of the grouping attribute can also be provided:

scatter(Group(color = a), x, y, markersize = ms, color = [:red, :blue])

Style instead is used to style by a continuous quantity, without grouping the data (a key difference will become clear when discussing Analysis).

scatter(Group(marker = a), Style(color = z), x, y, markersize = ms)

Often datasets of this type are encoded in a tabular structure, like a DataFrame or an IndexedTable or some sort of spreadsheet. StatsMakie also supports plotting from those sources.

using RDatasets
iris = RDatasets.dataset("datasets", "iris")

After specifying what argument represents the dataset with Data, StatsMakie will replace, in the other arguments, the column symbol with the relative values:

scatter(Data(iris), Group(marker = :Species), Style(color = :PetalLength), :SepalLength, :SepalWidth)

4.
Analysis

Unfortunately, the two things presented so far (support for statistical objects and grouping), don't go really well together. What if I want to plot the density of PetalLength of each different species in a different color? I should use the split-apply-combine strategy in DataFrames, manually compute the density, give a color, etc...

Analyses try to circumvent this issue. They are basically functions (you can actually just use functions) that are run on the data after grouping. There are a few built-in but it's quite easy to write your own.

plot(density, Data(iris), Group(color = :Species), :SepalLength)

Note that analyses work in combination with grouping but not with styling as the analysis could change the length of the dataset so the original styling vector may no longer be applicable.

Some extra facilities are also provided to simplify the grouping (Position.stack or Position.dodge, for bars to not overlap), which can be useful for grouped histogram, barplot, boxplot or violin.

plot(Position.stack, histogram, Data(iris), Group(color = :Species), :SepalLength)

Analyses also allow the user to pass keywords that will be used at the time of fitting. For example the default number of bins is very conservative and we may wish to increase it a bit:

plot(Position.stack, histogram(nbins = 30), Data(iris), Group(color = :Species), :SepalLength)

As in the first plot, one can also use custom plot styles rather than the default one. I kind of like visualizing 2D densities as the wireframe of a surface, so that I can have many in the same plot and compare them:

wireframe(
  density(trim=true), Data(iris), Group(:Species), 
  :SepalLength, :SepalWidth, 
  transparency = true, linewidth = 0.1
)

A few more analyses are supported, like linear or non-linear regression:

using StatsMakie: smooth, linear
scatter(Data(iris), Group(color = :Species), :SepalLength, :SepalWidth)
plot!(linear, Data(iris), Group(color = :Species), :SepalLength, :SepalWidth)
plot!(smooth, Data(iris), Group(color = :Species), :SepalLength, :SepalWidth, linestyle = :dash)

5.
Theming

Colors, markers, etc. vary according to some default palettes. In particular colors are designed to be distinguishable by color-blind readers (see reference). However, the theme can be customized and the package MakieThemes offers many lovely options.

using MakieThemes, CSV
datafolder = joinpath(dirname(pathof(MakieThemes)), "..", "data");

for data  (:www, :drivers, :mtcars, :diamonds)
  @eval $(data) = CSV.read(joinpath(datafolder, $(string(data))*".tsv"), delim = '\t', allowmissing = :none)
end

AbstractPlotting.set_theme!(ggthemr(:fresh))

p1 = scatterlines(Data(www), :Minute, :Users,
  Group(color = :Measure, marker = :Measure),
  markersize = 6, marker = [:rect, :circle])

p2 = plot(density, Data(mtcars),
  :mpg, Group(color = :cyl))

p3 = plot(Position.stack, histogram, Data(diamonds),
  :price, Group(color = :cut));

p4 = boxplot(Data(drivers), :Year, :Deaths)

vbox(hbox(p1, p2), hbox(p3, p4))

6.
Future directions

That's pretty much it for the time being. Many key features are still missing (automatic legend and label, facet plots where a categorical variable sets the layout, support for non-numerical x and y axis) and many enhancements are planned (new statistical analyses, such as clustering, confidence intervals on regressions, support for population analysis and error estimation, a default UI to create these plots from a graphical environment). So stay tuned for more! And give StatsMakie a try already if you like bleeding edge / slightly unstable software.