The mathematical Ideal and Softmax in Julia
A Poetic Notion
Walking an elegant equation or proof backwards is an unraveling. It’s hard to see how the equation got so refined, so simple… and it's usually an aggregate over many years of small refined edits. Rarely does math come out Ideal. It’s messy and gross and gets refined over time. Students are rarely told this, and you’d hardly guess how messy math is looking through books.
We put to paper an attempt to describe the world in matters of precision that are proportional to the phenomenon; it is a realization unfolding and in progress. What lands never feels sufficient. It never seems right. Scribbles followed by notes and pages of pages finally reduce to some elegant form. That is what we are after, elegant form.
In more plain words, the perfect looking form of an algorithm gets very messy very quickly when you start working through it. This is nowhere more abundantly clear than in software.
What is Ideal?
Take Euler’s number, represented as e. To be represented as a single alphabetic character is an abstraction towards simplicity…. in other words, it is Ideal. A common numeric representation is
ℯ
But that is not totally accurate. It is also an abstraction towards simplicity since the number just goes on and on (known to over a trillion digits). For our purposes a mathematical Ideal is an abstraction towards simplicity. Another way of saying is that a mathematical Ideal is an approximation to a reality. That is, we know e goes to many digits, but in calculating quantities using e it's often enough to approximate.
Simplifying a representation (i.e., abstraction) of mathematical properties seems to be a unique human trait that allows us to represent more and more complicated mathematical ideas in simple ways.
Computing Quantities
We can compute the exponential of an integer by raising to the exponent of Euler’s number e¹ = 2.7182. I’ve used the Julia programming language to show the exp
function to calculate exponentiation for 0,1,2,3 with the default base e
.
myExpon = [exp(x) for x in 0:3]
If we sum the four quantities of exp
over 0,1,2,3 we get 1.0 + 2.71828 + 7.38906 + 20.0855 = 31.1929
. Another simplification: I’ve reduced the quantities to 4 digits after the decimal point. Seems odd to do this from a mathematical point of view because it has the consequence of impacting the accuracy. But this “rounding” for floating point numbers is a common area of innovation, research, and discussion in using math and software. Computers are finite resources and approximating values where possible is good resource management. It’s another kind of abstraction towards simplicity. Doing math on a computer is all about managing approximations to a reality. And if we can live with 31.1929
over 31.19287485057736
then we have made things simpler and easier to manage
esum = sum(myExpon)
If I decide I want to derive some distribution for the exponentiated values of 0, 1, 2, 3
then I can iterate over them and divide by the sum.
A = [exp(0.0)/esum, exp(1.0)/esum, exp(2.0)/esum, exp(3.0)/esum]
I can represent this as an enumerated list of values, that all add up to 1.0
; a distribution
sum(A)
Stepping Back
All that we’ve done is take some number, e, compute it with others, (e⁰, e¹, e², e³) and asked what if we summed these up and divided each element by the sum. This is a fine way to do math, it’s a great way to teach math, and its arguably the way most great mathematical discoveries were made. It’s messy, meandering, and less than ideal; but we ended up somewhere important.
We end up with the Softmax function; not in its Ideal form but in the way one finds the peak of a mountain by stepping past multiple stones. I’ll not go into what Softmax is used for, why its cool, et. cetera (one can find plenty of that by an internet search).
Playing around with numbers
The beauty of mathematics is that if you start following a pattern or playing around with some computation you eventually get somewhere interesting. There is no need to start from an elegant equation and break it down; instead one can start at "the messy bottom", so to speak, and work up to abstraction.
Softmax is a heavily used function in some of the worlds most advanced software. Machine Learning applications have used softmax as a "differentiable" argmax for doing multi-class classification for many years. The avante garde Deep Learning frameworks, Tensorflow, PyTorch, MXNet, Flux, Gorgonia, etc… all use softmax (typically for loss function in the last layer of a network where one needs multi-class classification). For this reason, one can find many resources on softmax, including software implementations embedded within the advanced software projects I mention. Note that the Python projects (Tensorflow, Pytorch, MXNext) depend on the low-level C++ execution, which means you’ll likely find both Python and C++ implementations and it’ll be quite noisy. If you want to see an implementation in a cutting edge software project I’d suggest either FluxML (written in Julia), or Gorgonia (written in Go).
Julia Implementation
f(x) = exp.(x) ./ sum(exp.(x))
This mathematical representation is considered naive from a computational point of view. In all branches of mathematics, and to all working mathematicians, evolving ideas toward representational elegance, simplicity, and interpretability (e.g., abstraction towards simplicity) is the pinnacle of mathematical achievement. But where finite resources are the reality and discrete values the necessity, such Idealism is cast as naive because it does not handle errors due to resource limits (e.g., numerical overflow and underflow).
It’s a funny tension. Software writers appreciate elegance, simplicity, and interpretability too, but not at the cost of broken computation and/or buggy programs.
scores4 = [1.1, 5.0, 2.8, 7.3]
f(scores4)
scores4BIG = [10_000.1, 50_000.1, 2_000_000.8, 7_000_000.3]
f(scores4BIG)
Ouch. Overflow. Python, Go, and Rust will overflow too. Any programming language will. We hit the numerical limits of both the programming languages and the devices. In this case we can see that the
(typemin(Float64), typemax(Float64))
of floating point numerals is -Inf, Inf
, a representational convenience for machine epsilon. That is our upper limit and the softmax function above exceeds that numerical limit if our numbers are too high.
eps(Float64) #machine epsilon
(eps(Inf), eps(NaN))
To provide numerical stability subtract the highest number from all the others.
#maximum(scores4) -> 7.3
sub_max_scores4 = scores4 .- maximum(scores4)
One last thing we do is provide a scaling parameter θ (theta) that allows us to grow the numbers out. Creating greater distance between the small and large numbers makes it easier to see where relative boundaries between large and small happen to be. Define θ = 2.5.
We get the same results if we multiply all our numbers by θ and subtract by max.
θ = 2.5
#two scalar-vector multiplications
(scores4 * θ .- maximum(scores4 * θ))
Or, subtract by max and then multiply by θ.
#one scalar-vector multiplication
(scores4 .- maximum(scores4)) * θ
We’ve gone from original values [1.1, 5.0, 2.8, 7.3]
, to subtracted maximum values [-6.199999999999999, -2.3, -4.5, 0.0]
, to scaled and subtracted maximum values [-15.499999999999998, -5.75, -11.25, 0.0]
. You can see we’ve increased the relative distances quite a bit.
Since we are doing this in Julia we can be mindful of our types. I should be clear though: you do not need to provide types for function arguments nor return values; but I'll show here what it might look like to do that. Julia provides an interesting sets of numeric types as well as vector and matrix types. I can use a Real
type for all Real numbers, or various kinds of Abstract types that allow me to construct functions that can accept a wide a variety of types of Vector, Matrix, Float, etc.
#abstract exponentiation function, subtract max for numerical stability
_exp(x::AbstractVecOrMat) = exp.(x .- maximum(x))
#abstract exponentiation function, subtract max for numerical stability and scale by theta
_exp(x::AbstractVecOrMat, θ::AbstractFloat) = exp.((x .- maximum(x)) * θ)
#softmax algorithm expects stablized eponentiated e
_sftmax(e::AbstractVecOrMat, d::Integer) = (e ./ sum(e, dims = d))
# top level softmax function
function softmax(X::AbstractVecOrMat{T}, dim::Integer)::AbstractVecOrMat where T <: AbstractFloat
_sftmax(_exp(X), dim)
end
function softmax(X::AbstractVecOrMat{T}, dim::Integer, θ::Float64)::AbstractVecOrMat where T <: AbstractFloat
_sftmax(_exp(X, θ), dim)
end
Julia also has what is called multiple dispatch; the parameters of my functions will dispatch various methods. This means I can write many different functions called softmax
that accept different parameters and Julia’s compiler will construct methods for those specifically defined functions (based on parameters). I can inspect those methods for type stability and work towards building faster executing programs by exploiting types and type combinations that yield better compiler output (the above methods show a Union, we probably don’t want that and for an optimization pass we’d likely refactor a bit depending on our goals for the program).
println(methods(softmax))
The Goal
The goal of this post is not to motivate and explain softmax, nor to provide the best softmax function for Julia, nor to introduce the Julia language. Most software needing softmax already has it implemented and it is implemented in ways that best suit the program goals, the programming language properties, and the kinds of devices and hardware engineers expect to run on.
Instead, the goal here is to highlight just how far away this (or any) softmax implementation is from the Ideal representation found in mathematical texts. Considering Julia’s type system, how functions and methods are compiled, and the numerical guts of this specific language, we’ve traveled far from the simplicity and elegance of the equation.
And yet, implementations have their own elegance too. Mine may not be elegant, but look under the hood at any well-engineered library and you'll find one.
Other Resource for Softmax
the-softmax-function-and-its-derivative