Working examples

Iris dataset

The Iris dataset, sometimes called Anderson's iris dataset, comprises measurements of four variables (sepal and petal width and length) from 150 plants belonging to three species: Iris setosa, Iris virginica, and Iris versicolor. Ronald Fisher analysed this dataset in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis (Gedeon, 2003).

]add Turing, RDatasets, StatsPlots, MLDataUtils, NNlib
# Load Turing.
using Turing
# Load RDatasets.
using RDatasets
# Functionality for splitting and normalizing the data.
using MLDataUtils: shuffleobs, splitobs, rescale!
# We need a softmax function which is provided by NNlib.
using NNlib: softmax
# Set a seed for reproducibility.
using Random
Random.seed!(0)
# Hide the progress prompt while sampling.
Turing.setprogress!(false);
# Load StatsPlots for visualizations and diagnostics.
using StatsPlots
# Import the "iris" dataset.
data = RDatasets.dataset("datasets", "iris");
# Show twenty random rows.
data[rand(1:size(data, 1), 20), :]
# Recode the `Species` column.
species = ["setosa", "versicolor", "virginica"]
data[!, :Species_index] = indexin(data[!, :Species], species)
# Show twenty random rows of the new species columns
data[rand(1:size(data, 1), 20), [:Species, :Species_index]]
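The recoding step relies on `indexin`, which returns, for each element of its first argument, the position of that element in the second argument (or `nothing` if it is absent). A minimal sketch with plain strings, using a hypothetical `labels` vector rather than the actual `Species` column:

```julia
# Category order used for the integer coding (same order as in the notebook).
species_levels = ["setosa", "versicolor", "virginica"]

# Hypothetical labels, for illustration only.
labels = ["virginica", "setosa", "versicolor", "setosa"]

# Position of each label within `species_levels`.
codes = indexin(labels, species_levels)

# Labels that are not in `species_levels` map to `nothing`.
missing_code = indexin(["unknown"], species_levels)
```

Because `setosa` is listed first, it receives index 1, which matches its role as the base category in the model below.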
# Split our dataset 50%/50% into training/test sets.
trainset, testset = splitobs(shuffleobs(data), 0.5)
# Define features and target.
features = [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]
target = :Species_index
# Turing requires data in matrix and vector form.
train_features = Matrix(trainset[!, features])
test_features = Matrix(testset[!, features])
train_target = trainset[!, target]
test_target = testset[!, target]
# Standardize the features.
μ, σ = rescale!(train_features; obsdim = 1)
rescale!(test_features, μ, σ; obsdim = 1);
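To make the standardization step concrete, here is a plain-Julia sketch of column-wise z-scoring on a hypothetical toy matrix. It assumes the usual definition (subtract the training mean, divide by the training standard deviation) and is not necessarily how `rescale!` is implemented internally; the names `X_train`, `X_test`, `mu`, and `sigma` are illustrative only.

```julia
using Statistics

# Hypothetical toy data: rows are observations, columns are features.
X_train = [1.0 10.0; 2.0 20.0; 3.0 30.0; 4.0 40.0]
X_test = [2.5 25.0; 5.0 50.0]

# Column-wise mean and standard deviation, computed on the training set only.
mu = mean(X_train; dims = 1)
sigma = std(X_train; dims = 1)

# Apply the same shift and scale to both sets, so the test set is
# transformed with training statistics rather than its own.
Z_train = (X_train .- mu) ./ sigma
Z_test = (X_test .- mu) ./ sigma
```

Reusing the training statistics for the test set mirrors the notebook's second `rescale!` call, which passes `μ` and `σ` explicitly.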
# Bayesian multinomial logistic regression
@model function logistic_regression(x, y, σ)
    n = size(x, 1)
    length(y) == n || throw(DimensionMismatch("number of observations in `x` and `y` is not equal"))
    # Priors of intercepts and coefficients.
    intercept_versicolor ~ Normal(0, σ)
    intercept_virginica ~ Normal(0, σ)
    coefficients_versicolor ~ MvNormal(4, σ)
    coefficients_virginica ~ MvNormal(4, σ)
    # Compute the likelihood of the observations.
    values_versicolor = intercept_versicolor .+ x * coefficients_versicolor
    values_virginica = intercept_virginica .+ x * coefficients_virginica
    for i in 1:n
        # the 0 corresponds to the base category `setosa`
        v = softmax([0, values_versicolor[i], values_virginica[i]])
        y[i] ~ Categorical(v)
    end
end;
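The fixed 0 in the softmax input deserves a short illustration. The sketch below uses a hand-written softmax (the notebook itself uses `NNlib.softmax`) and hypothetical scores for a single observation. It shows that the output is a valid probability vector and that shifting all scores by a constant does not change it, which is why one category's score can be pinned to 0.

```julia
# Hand-written softmax, for illustration only.
function softmax_sketch(z)
    e = exp.(z .- maximum(z))  # subtract the maximum for numerical stability
    return e ./ sum(e)
end

# Hypothetical linear predictors for one observation.
value_versicolor = 1.2
value_virginica = -0.4

# The leading 0 is the fixed score of the base category `setosa`.
scores = [0.0, value_versicolor, value_virginica]
p = softmax_sketch(scores)

# Adding the same constant to every score leaves the probabilities unchanged.
p_shifted = softmax_sketch(scores .+ 5.0)
```

Without a pinned category, the intercepts and coefficients would only be identified up to such a shift.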
chain = sample(logistic_regression(train_features, train_target, 1), HMC(0.05, 10), MCMCThreads(), 1500, 4)
plot(chain)

GSoC 2021 work product

1. Violin plots

Violin plots are similar to box plots, with the addition of a rotated kernel density plot on one or both sides. Use the call plot(chain::Chains; kwargs...) with seriestype = :violinplot, or the shorthand violinplot(chain::Chains; kwargs...). Use the kwarg colordim to create violin plots grouped by chain (colordim = :chain) or by parameter (colordim = :parameters).

violinplot(chain; colordim = :chain)
violinplot(chain; colordim = :parameters)

If combined = true, the chains are appended and only one plot per parameter is returned; in this case colordim is set to :chain. Otherwise (combined = false), the violin plots are grouped as defined by colordim.

NOTE: Discrete parameters are plotted as defined in StatsPlots.jl.

For plotting multiple parameters, Ridgeline, Forest and Caterpillar plots can be useful.

2. Ridgeline

Given a chain object, ridgelineplot(chain::Chains, par_names::Vector{Symbol}; kwargs...) returns a ridgeline plot for the sampled parameters specified in par_names.

ridgelineplot(chain, chain.name_map[:parameters])

For ridgeline plots, the following attributes are defined:

(a) Fill

The area below the curve can be filled over a quantile interval (fill_q = true) or an HPD interval (fill_hpd = true). These two options are mutually exclusive; the defaults are fill_hpd = true and fill_q = false. If both fill_q = false and fill_hpd = false, the whole area below the curve is filled. If no fill color is desired, specify this through the series attributes.

(b) Mean and median

A vertical line can be plotted representing the mean (show_mean = true) or the median (show_median = true) of the kernel density estimate (KDE). Both lines can be plotted at the same time.

(c) Intervals

At the bottom of each density plot, a quantile interval (show_qi = true) or an HPD interval (show_hpdi = true) can be plotted. These options are mutually exclusive; the defaults are show_qi = false and show_hpdi = true. Quantile intervals use the values specified in q, whereas HPD intervals use only the smaller value specified in hpd_val.

Note: When one parameter is given, it will be plotted as a density plot with all the elements described above.
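As an illustration of how the attributes above combine, the following sketch switches from the default HPD-based fill and interval to their quantile-based counterparts and draws both the mean and the median. It assumes the `chain` object sampled earlier in this notebook and that StatsPlots is loaded; `pars` is an illustrative name, and `q` and `hpd_val` are left at their defaults.

```julia
# Assumes `chain` from the sampling cell above and `using StatsPlots`.
# Plot only the two intercepts of the model.
pars = [:intercept_versicolor, :intercept_virginica]

ridgelineplot(chain, pars;
    fill_q = true, fill_hpd = false,    # fill the quantile interval, not the HPD interval
    show_mean = true, show_median = true,
    show_qi = true, show_hpdi = false)  # draw the quantile interval at the bottom
```

Each mutually exclusive pair is set explicitly here, so the call does not depend on which option wins when both are true.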

3. Forest and Caterpillar plots

References

  • Gedeon, T. D. (2003). AI 2003: Advances in Artificial Intelligence: 16th Australian Conference on AI, Perth. Springer Science & Business Media. 1075 pp.
