What if Fisher & Anderson Provided Uncertainties?
The classic Iris flower dataset was collected by Edgar Anderson (1935) and made iconic by Ronald Fisher's introduction of Linear Discriminant Analysis (1936). The data contains 4 measurements of 3 Iris species (setosa, versicolor, and virginica): the lengths and widths of each sample's petals and sepals. The dataset has served as a fantastic pedagogical example because it features one class which is linearly separable and two which are not. It's also a great dataset for testing wild algorithms on because it has perfect class balance (Nc=50), a small number of independent variables (4), and only 150 total samples. But I always felt there was something missing from the modern reproductions of this dataset. Something, maybe, only a natural scientist or engineer would concern themselves with.
Anyone who has ever made an informative model has been concerned with error or uncertainty estimation. Practitioners in these fields will often ask, or show, how good a model is, usually by displaying scalar summary statistics, plots of residuals, etc. But when we measure something in the real world, scientists are trained not to assume that their measurements are infallible.
In fact, it is well known that every time a ruler, micrometer, caliper, or any other measuring implement is used, there is some uncertainty associated with the readings. If you have studied quantum mechanics you may even learn that really small stuff cannot be known with absolute certainty even with a supposedly perfect tool! Note: no such perfect tool may exist anyway.
For those of us who haven't pursued that line of education I offer the following example to illustrate the idea. Consider a ruler with 1mm graduations being used to measure the length of a steel rod. You measure the rod and determine that its length is 158mm, but a little of its length falls somewhere between the 158mm and 159mm graduations. Is it 158mm? 159mm? 158.5mm? Maybe you messed up - so you measure it again. This time you see the length is closer to 158mm than it is to 159mm. This is a little contradictory. So you ask a friend who has better eyesight than you to measure it. They determine the rod is exactly 159mm. That can't be right! You call in another friend and they swear it's closer to 157mm. Instead of upholding certainty in all matters, you now choose to explore a new paradigm.
Perhaps the proper solution is to admit that we don't know the exact length of the steel rod using this ruler. We can state pretty confidently that it is 158mm +/- 1mm. You see - the ruler, used in conjunction with the 'human analyzer', has an uncertainty of +/- 1mm.
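This is exactly the kind of statement that Measurements.jl (used later in this post) lets us write down directly. A minimal sketch of the idea, with a second, purely made-up rod just to show that the uncertainties carry through arithmetic:
using Measurements
rod = 158.0 ± 1.0        #mm, ruler + 'human analyzer' uncertainty
other_rod = 42.0 ± 1.0   #mm, a second, hypothetical measurement
println(rod + other_rod) #uncertainties combine in quadrature: roughly 200.0 ± 1.4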
Well, this is common knowledge to someone in a STEM field. Yet in data science it's rarely accounted for or, in my opinion, is underutilized - often because this element is not explicitly known. The purpose of this blog post is to explore the classic iris dataset, presuming we knew the measurement error in Anderson's iris data, and see what we can learn with a bread-and-butter analytic method. What are the implications of this simple idea, after all?
The Prompt
Looking at the data we can see that Edgar Anderson was confident in reporting the 4 length measurements (in centimeters) in the Iris dataset to one decimal place. Maybe then we can assume there is a +/- 0.1cm uncertainty in all of the measurements.
Now we can incorporate this information into our data using the Measurements.jl (Julia) package! And we'll use my package, ChemometricsTools.jl, to load in the data and do some analysis.
using Pkg
Pkg.update()
Pkg.add(["Measurements", "ChemometricsTools", "LinearAlgebra", "Plots", "Statistics"])
using Measurements, ChemometricsTools, LinearAlgebra, Plots, Statistics
iris = ChemometricsTools.ChemometricsToolsDataset("iris.csv")
println("One sample from each Iris class:")
println(iris[[1,51,101],:])
Great, we now have the packages we want loaded into memory. Now let's separate the species/class column from the measurement data, and add the uncertainty values to each measurement.
species = iris[!, :species]
measures = iris[!, [:sepal_length, :sepal_width, :petal_length, :petal_width ] ]
measures = measures .± 0.1 #cm
println(measures[ [ 1, 51, 101 ],:])
Although this is a simple dataset, it still has 4 dimensions - meaning we can't display it on our computer screen in an easy way! Thankfully there are dimension reduction techniques, such as the classic Principal Component Analysis (PCA), available for doing this. We will borrow some code from ChemometricsTools and allow the Measurements.jl floating point type to be used in it. Multiple dispatch, a feature of Julia, would allow us to simply define a numeric type and use my package - but I want people to see my implementation of the NIPALS algorithm anyway. A lot of people use PCA, but few know how to implement one other than by using a Singular Value Decomposition (SVD).
PCA Implementation
To implement PCA you can always use an SVD: call down to LAPACK or whatever linear algebra library you prefer. Yes - PCA is the same thing as an SVD of the (centered) data. But, if we want to propagate uncertainty, we need our math to be done in one language. Julia is awesome here because it does math fast out of the box, and we can easily dispatch on the Measurements data type to use the same code we'd use to do PCA with any other numeric type!
For this post I used the NIPALS (nonlinear iterative partial least squares) algorithm to solve the eigenvalue decomposition that is PCA. PCA is basically a description of the transformation induced by the data's covariance matrices (row and column spaces). Eigenvalue decompositions are a commonly used tool to describe linear transformations in terms of what they do to a linear basis. In fact, they completely define how the space changes during a transform via eigenvalues and eigenvectors. So another way to perform a PCA is to take an eigenvalue decomposition of both covariance matrices. Both NIPALS and SVD can obtain the left hand and right hand eigen/singular vectors simultaneously. I chose NIPALS because it is simple and offers a convenient way to propagate the error.
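If you want to convince yourself of that SVD/eigendecomposition equivalence before reading the NIPALS code, here is a quick numerical check on plain Float64 random data (no uncertainties, just a sanity sketch):
using LinearAlgebra, Statistics
Xdemo = randn(150, 4)
Xdemo_c = Xdemo .- mean(Xdemo, dims = 1)                       #column center
svals = svd(Xdemo_c).S                                         #singular values (descending)
evals = sqrt.(reverse(eigvals(Symmetric(Xdemo_c' * Xdemo_c)))) #eigenvalues of X'X come out ascending, so reverse
println(isapprox(svals, evals; rtol = 1e-8))                   #true - same values via two routes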
struct PCA_NIPALS_Uncertainty
Scores::Array{Measurement{Float64},2}
Loadings::Array{Measurement{Float64},2}
Values::Array
end
"""
PCA_NIPALS_Uncertainty(X; Factors = minimum(size(X)) - 1, tolerance = 1e-7, maxiters = 200)
Computes a PCA from `X` using the NIPALS algorithm with a user specified number of latent variables (`Factors`).
The tolerance is the minimum change in the Frobenius norm before ceasing execution. Returns a PCA_NIPALS_Uncertainty object.
"""
function PCA_NIPALS_Uncertainty(X::Array{Measurement{Float64},2}; Factors = minimum(size(X)) - 1, tolerance = 1e-7, maxiters = 200)
tolsq = tolerance * tolerance
#Instantiate some variables up front for performance...
Xsize = size(X)
Tm = zeros(Measurement{Float64}, ( Xsize[1], Factors ) )
Pm = zeros(Measurement{Float64}, ( Factors, Xsize[2] ) )
t = zeros(Measurement{Float64}, ( 1, Xsize[1] ) )
p = zeros(Measurement{Float64}, ( 1, Xsize[2] ) )
#Start with the residuals equal to the input data
Residuals = copy(X)
for factor in 1:Factors
lastErr = sum(abs, Residuals);
curErr = tolerance + 1;
t = Residuals[:, 1]
iterations = 0
while (abs(curErr - lastErr) > tolsq) && (iterations < maxiters)
p = Residuals' * t
p = p ./ sqrt.( p' * p )
t = Residuals * p
#Track change in Frobenius norm
lastErr = curErr
curErr = sqrt(sum( ( Residuals - ( t * p' ) ) .^ 2))
iterations += 1
end
Residuals -= t * p'
Tm[:,factor] = t
Pm[factor,:] = p
end
#Find singular values/eigenvalues
EigVal = sqrt.( LinearAlgebra.diag( Tm' * Tm ) )
#Normalize the score columns by the singular values
Tm = Tm * LinearAlgebra.Diagonal( 1.0 ./ EigVal )
return PCA_NIPALS_Uncertainty(Tm, Pm, EigVal)
end
"""
(T::PCA_NIPALS_Uncertainty)(Z::Array; Factors = length(T.Values), inverse = false)
Calling a PCA object on new data brings the new data `Z` into or out of (`inverse` = true) the PCA basis.
"""
function (T::PCA_NIPALS_Uncertainty)(Z::Array; Factors = length(T.Values), inverse = false)
if (inverse)
return Z * (Diagonal(T.Values[1:Factors]) * T.Loadings[1:Factors,:])
else
return Z * (Diagonal( 1.0 ./ T.Values[1:Factors]) * T.Loadings[1:Factors,:])'
end
end
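Before applying this to the iris data, here is a tiny, purely hypothetical round trip on random data, just to show the call pattern (it relies on the packages already loaded above):
Xtoy = randn(10, 4) .± 0.1                     #made-up data with a fixed ±0.1 uncertainty
toy_pca = PCA_NIPALS_Uncertainty(Xtoy; Factors = 3)
toy_scores = toy_pca(Xtoy)                     #project into the PCA basis (10 x 3)
toy_back = toy_pca(toy_scores; inverse = true) #map back to the original space (10 x 4, approximate at 3 factors)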
Applying it to the data
In order to use a PCA meaningfully we ought to pretreat our data (center & scale) prior to its application, so we will do that using a ChemometricsTools pipeline as well:
#Preprocess
preprocess_pipeline = Pipeline( Matrix( measures ), Center, Scale )
preprocessed_measures = preprocess_pipeline( Matrix( measures ) )
#Now perform a PCA
pca = PCA_NIPALS_Uncertainty( preprocessed_measures; Factors = 4, tolerance = 1e-6, maxiters = 200 )
#Make a Scree Plot
cumulative_var_explained = 100.0 .* cumsum( pca.Values .^ 2 ) / sum(pca.Values .^ 2)
l = @layout [a b]
p1 = scatter( cumulative_var_explained, title = "Scree Plot", label = "",
xlab = "Singular Values Used", ylab = "Explained Variance (%)")
p2 = bar( map( x -> x.err, cumulative_var_explained), title = "Explained Uncertainty", label = "",
xlab = "Singular Values Used", ylab = "Explained Variance Uncertainty (%)")
plot(p1, p2, layout = l)
The beauty of using Julia here is that the code in ChemometricsTools.jl for centering and scaling works with this "new" Measurements.jl data type without any effort. No special code was written; Julia just does the math, and Measurements.jl propagates the uncertainty through the arbitrary operations.
Looking at the plots you can notice that the uncertainty of the explained variance (shown as error bars in the left plot and as the bar chart on the right) follows the opposite trend to the explained variance itself. The more singular values used, the more certain our estimate of how much variance they explain becomes. Does this make sense? I'd presume so: we expect our reconstruction to be perfect (within numerical noise) when the rank of our variable reduction equals the rank of the data. This isn't without cost, but that will be discussed later. Neat!
An interesting thought here is: how many components should be chosen? We usually say look for the "elbow" or use a similar heuristic. Maybe a good answer is when a singular value becomes indistinguishable from another due to propagated measurement uncertainty. There's some fuel in that simple idea for a short paper - open invitation for someone else to explore.
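For the curious, here is a rough sketch of that heuristic (my own quick formulation, not a vetted criterion): stop adding components once two consecutive singular values overlap within their propagated uncertainties.
overlaps(a, b) = abs(Measurements.value(a) - Measurements.value(b)) <= Measurements.uncertainty(a) + Measurements.uncertainty(b)
suggested = findfirst(k -> overlaps(pca.Values[k], pca.Values[k+1]), 1:(length(pca.Values) - 1))
println(suggested) #nothing means no consecutive pair overlaps within uncertainty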
But... what does this mean with regards to visualization? From the Scree plot we see that we can have a nice 2-D scores plot representation if we are willing to settle for explaining only ~95% of the variance.
color_map = Dict("setosa" => :purple, "virginica" => :blue, "versicolor" => :yellow)
label_map = Dict( "setosa" => findall(species .== "setosa"),
"virginica" => findall(species .== "virginica"),
"versicolor" => findall(species .== "versicolor"),)
lbl = "setosa"
s = scatter( pca.Scores[label_map[lbl], 1], pca.Scores[label_map[lbl],2], title = "Scores Plot",
markerstrokecolor=:grey, color = color_map[lbl], label = lbl, legend = :bottomright,
xlab = "1st Component Scores", ylab = "2nd Component Scores");
lbl = "virginica"
scatter!(s, pca.Scores[label_map[lbl],1], pca.Scores[label_map[lbl],2], title = "Scores Plot",
markerstrokecolor=:grey, color = color_map[lbl], labels = lbl);
lbl = "versicolor"
scatter!(s, pca.Scores[label_map[lbl],1], pca.Scores[label_map[lbl],2], title = "Scores Plot",
markerstrokecolor=:grey, color = color_map[lbl], labels = lbl)
At last, our 4-D data is mostly represented on a 2-D plane. Sure, we've sacrificed some variance to do this, but there was no guarantee that the variance we lost would have made our classes more separable anyway. And now we have an interesting explanation for our data.
The setosa class, even with measurement uncertainty, is linearly separable from both virginica and versicolor in this basis. Go ahead and run a linear learning machine on it if you don't believe me (or try the quick check sketched below). However, there are samples in the virginica and versicolor classes that could easily be mistaken for one another given the propagated measurement uncertainty.
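Here's one such quick check (a sketch, not a tuned classifier): it simply tests whether the setosa samples occupy their own interval along the first component's score values.
pc1 = map(Measurements.value, pca.Scores[:, 1])    #central values of the first component scores
setosa_idx = label_map["setosa"]
other_idx = vcat(label_map["virginica"], label_map["versicolor"])
separable = maximum(pc1[setosa_idx]) < minimum(pc1[other_idx]) ||
            minimum(pc1[setosa_idx]) > maximum(pc1[other_idx])
println("Setosa separable along the first component: ", separable)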
Traditionally, people perform a variable reduction, make this plot, run a linear classifier, and conjecture the following: two classes were not linearly separable. Well, that is useful, but why?
Well, here we can say, "ah-hah, given the uncertainty of the sample measures, several samples from the two classes could occupy the same space". I.e., several samples in different classes are statistically indistinguishable from one another. Is that more interesting? I think so! Sure, we fudged an uncertainty number to make this statement, but it still remains a feasible explanation for our failure to discriminate these classes, and for why we should revisit the data - ask for a new measurement, or a new property. It's more descriptive than saying "the model doesn't work perfectly for 2 classes".
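We can even put a rough number on that claim. Below is a sketch (my own crude criterion, not a formal statistical test) that counts virginica/versicolor pairs whose score intervals (value ± uncertainty) overlap in both plotted components:
interval_overlap(a, b) = abs(Measurements.value(a) - Measurements.value(b)) <= Measurements.uncertainty(a) + Measurements.uncertainty(b)
confusable = count( interval_overlap(pca.Scores[i, 1], pca.Scores[j, 1]) && interval_overlap(pca.Scores[i, 2], pca.Scores[j, 2])
                    for i in label_map["virginica"], j in label_map["versicolor"] )
println("Potentially confusable virginica/versicolor pairs: ", confusable)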
Yum... Data Science - emphasis on Science.
Let's go a little further down the rabbit hole
Cool, I showed one textbook case, with a hole in it, where this paradigm is valuable. But what else does uncertainty propagation tell us? What, exactly, does PCA do to our propagated uncertainty? This is going to confuse many people who view error as a 1-D facet of an analysis. See, there is model error, then there is variable error, and... other kinds too. Let's see who is willing to hang on through this exposition. Hopefully you are, otherwise I've failed to make this interesting.
What happens to our reconstructed uncertainty (NOT reconstruction error) as we vary the number of latent variables?
LVs = 4
reconstruct_est = [ pca(pca(preprocessed_measures; Factors = L); Factors = L, inverse = true) for L in 1:LVs ];
As a sanity check, the full reconstruction should be similar, if not completely equivalent, to the original values. Let's do that first,
error(X) = map( x -> x.err, X) #extract the uncertainty of each element (note: this shadows Base.error)
Plots.default(size = (800, 600) )
l = @layout [a b; c d]
p1 = scatter( error(reconstruct_est[4][:,1]), error(preprocessed_measures[:,1]), label = names(measures)[1],
title = "Original vs Reconstructed Uncertainties", xlab = "Reconstructed", ylab = "Original",
legend = :bottomright);
p2 = scatter( error(reconstruct_est[4][:,2]), error(preprocessed_measures[:,2]), label = names(measures)[2],
xlab = "Reconstructed", ylab = "Original", legend = :bottomright);
p3 = scatter( error(reconstruct_est[4][:,3]), error(preprocessed_measures[:,3]), label = names(measures)[3],
xlab = "Reconstructed", ylab = "Original", legend = :bottomright);
p4 = scatter( error(reconstruct_est[4][:,4]), error(preprocessed_measures[:,4]), label = names(measures)[4],
xlab = "Reconstructed", ylab = "Original", legend = :bottomright);
plot( p1,p2,p3,p4, layout = l )
So... we see what we'd hope: perfect correlations and within-numerical-noise agreement between the reconstructions and the original data. Whew! Great, now we can proceed with some confidence to see what happens to our uncertainty as we build the PCA model from different numbers of latent variables.
u_reconstruct_est = [ preprocess_pipeline(r_e, inverse = true) for r_e in reconstruct_est]
l = @layout [a b; c d]
p1 = bar( [ mean(error(u_reconstruct_est[lv][:,1])) for lv in 1:4], label = "",
title = names(measures)[1], xlab = "LVs", ylab = "Mean of Uncertainties" )
p2 = bar( [ mean(error(u_reconstruct_est[lv][:,2])) for lv in 1:4], label = "",
title = names(measures)[2], xlab = "LVs", ylab = "Mean of Uncertainties" )
p3 = bar( [ mean(error(u_reconstruct_est[lv][:,3])) for lv in 1:4], label = "",
title = names(measures)[3], xlab = "LVs", ylab = "Mean of Uncertainties" )
p4 = bar( [ mean(error(u_reconstruct_est[lv][:,4])) for lv in 1:4], label = "",
title = names(measures)[4], xlab = "LVs", ylab = "Mean of Uncertainties" )
plot( p1,p2,p3,p4, layout = l )
Interesting! Petal width, sepal width, and sepal length all show less uncertainty than the original data until all 4 latent variables are used, i.e. until the model is full rank. Petal length shows more, but matches the original data at 4 latent variables. So why on earth could this happen? Well... remember, the fewer the latent variables, the more reconstruction error, right (less variance explained)?
Maybe we really need to lower the uncertainty of a component, and the cost of a poorer reconstruction is worth every penny. Or maybe we just want to understand what we are trading away as we model. I've shown PCA here, but surely many other things can be done.
Conclusions and Obvious Extensions
So what if Fisher and Anderson had incorporated uncertainty into their data/analysis? Well, they'd have had more explanatory power. Sure, Fisher used LDA not PCA, but we could apply this methodology to LDA - easily. In fact, I have code for LDA in my package (https://github.com/caseykneale/ChemometricsTools.jl/blob/d8cd288ae76b221274a54cc204cd146791bddf98/src/Analysis.jl#L75), so why don't you take a try at it? Remix this blog post and do something cool :).
I've seen dozens of papers arguing over how to propagate uncertainty through, say, Partial Least Squares regression/classification models. Guess what... I've used this same approach there as well, and it works as expected. No need for horrendously large analytic, yet completely approximate, expressions for these tasks - just let your computer do it. Julia is a prime programming language for this. Julia lets us define operators and use multiple dispatch, so if we want to use a new algorithm with an uncertain number type (Measurement{Float64} for instance) we can do that very easily! No need to rewrite an entire math library/class. Just fire away.
So yea. Someone who was desperate could probably write 1-2 papers out of this. I am not in a position to spend money on publications - this is just for fun. So go ahead, write a paper propagating uncertainty through PLS, or LDA, or both, or whatever you want. Or a paper on how picking the number of factors can benefit from experimental uncertainty, as described above. Or, if you're like me, just make a NextJournal and take a crack at applying these ideas elsewhere!
Oh, but what if the model you have has a matrix inversion? Just do a pseudoinverse via PCA (hint: the PCA code is already in this blog) and yes, that works too... probably at some cost though! Studying that adequately could be... another... paper?
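In case that hint is too cryptic, here is a minimal sketch of what I mean - a hypothetical helper built on the convention used above, where X ≈ Scores * Diagonal(Values) * Loadings:
function pca_pinv(T::PCA_NIPALS_Uncertainty; Factors = length(T.Values))
    #Truncated pseudoinverse: Loadings' * Diagonal(1 ./ Values) * Scores'
    return T.Loadings[1:Factors, :]' * LinearAlgebra.Diagonal( 1.0 ./ T.Values[1:Factors] ) * T.Scores[:, 1:Factors]'
end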
Acknowledgements
Big thanks to Mose Giordano for writing Measurements.jl. I wrote an alternate library for this many months back to handle a case Measurements.jl does not handle. Mose's work is fantastic: cleanly written, rigorous, and published. Due credit must be given - I love the package! The library I wrote was written in one afternoon and was rushed, to say the least. It is quite a bit faster for doing the PCA, but it carries a pretty serious technical-debt cost compared to Mose's.
Big thanks to Fredrik Bagge Carlson (who created MonteCarloMeasurements.jl), Valentin Bogad (promising student and Julia community member), and those who commented on the draft's errors and suggested improvements! If anything isn't quite right in this post it is not their fault - they just made quick constructive suggestions.