div.ProseMirror

Automatic Differentiation using Dual Numbers

Modern machine learning libraries such as Flux is built upon the idea of Automatic Differentiation. One way to implement that is by defining dual numbers.

struct Dual{T<:Real} <: Real
  x::T
  ϵ::T
end
0.3s
autograd-dualnum (Julia)

With dual numbers we keep track of the result of performing a mathematical operation on a number as well as the derivative of that number.

If inline_formula not implemented, then inline_formula not implemented. We bake this into the basic constructor for a dual number.

Dual(x) = Dual(x, one(x))
0.8s
autograd-dualnum (Julia)
Dual

To make it easier to work we we create a nice representation of a dual number

Base.show(io::IO, d::Dual) = print(io, d.x, " + ", d.ϵ, ")
0.2s
autograd-dualnum (Julia)

Let us us create a couple of dual numbers

Dual(3), Dual(2, 4)
1.0s
autograd-dualnum (Julia)
(3 + 1ϵ, 2 + 4ϵ)

Elementary rules of differentiation

Think of ϵ as derivative of inline_formula not implemented (inline_formula not implemented), we can then define how different mathematical operations should work with dual numbers.

Differentiation is linear

Let us start with addition and subtraction. Derivation rules says that the derivative of the function inline_formula not implemented with respect to inline_formula not implemented is:

formula not implemented
import Base: +, -, *, /
a::Dual + b::Dual = Dual(a.x + b.x, a.ϵ + b.ϵ)
a::Dual - b::Dual = Dual(a.x - b.x, a.ϵ - b.ϵ)
0.7s
autograd-dualnum (Julia)
- (generic function with 175 methods)

Let us try these operations out with some numbers:

Dual(3) + Dual(4)
0.4s
autograd-dualnum (Julia)
7 + 2ϵ

The product rule

The derivative of a function inline_formula not implemented defined as the product of two functions is defined as:

formula not implemented

We can see how the derivative of inline_formula not implemented follows from this.

formula not implemented

We can use this definition to implement multiplication for dual numbers.

a::Dual * b::Dual = Dual(a.x*b.x, b.x*a.ϵ + a.x*b.ϵ)
0.5s
autograd-dualnum (Julia)
* (generic function with 358 methods)

Again let us check out if they work as expected. If inline_formula not implemented, then inline_formula not implemented and for the derivative we get inline_formula not implemented.

x = Dual(3)
y = Dual(6)
x * x, x * y, y * y
0.5s
autograd-dualnum (Julia)
(9 + 6ϵ, 18 + 9ϵ, 36 + 12ϵ)

The quotient rule

Say the function inline_formula not implemented is defined as a quotient of two function inline_formula not implemented and inline_formula not implemented like this:

formula not implemented

Then the derivative of inline_formula not implemented is defined as:

formula not implemented
a::Dual / b::Dual = Dual(a.x / b.x, (a.ϵ*b.x - b.ϵ*a.x) / b.x^2)
0.4s
autograd-dualnum (Julia)
/ (generic function with 116 methods)

We can try this out with some numbers again:

y/x
0.5s
autograd-dualnum (Julia)
2.0 + -0.333333ϵ

The chain rule

This applies when we try to derivate a function, where the argument to the function is another function. This is really at the core of what is done when back propagation is performed in a neural network.

Given that we have a function inline_formula not implemented the derivative is defined as:

formula not implemented

To get an intuition for why this rule makes sense, we can use Leibniz's notation instead:

formula not implemented

We can use this rule to define what the derivative of inline_formula not implemented and inline_formula not implemented should be when their arguments are functions (dual numbers). So we want to derive cases like this:

formula not implemented

If we apply the chain rule to derive inline_formula not implemented and inline_formula not implemented we get:

formula not implemented

We leave doing the same derivation for inline_formula not implemented as an exercise to the reader. To read and understand the implementation below it helps to think of the variable d as

formula not implemented
import Base: sin, cos
sin(d::Dual) = Dual(sin(d.x), cos(d.x) * d.ϵ)
cos(d::Dual) = Dual(cos(d.x), sin(d.x) * (-d.ϵ))
0.4s
autograd-dualnum (Julia)
cos (generic function with 13 methods)

We could implement more cases but this enough of the basics. Next we want to play nice with the Julia conversion and promotion system.

Number Conversion and Promotion

Julia being a numerical language has a system that makes it easier to define all the rules for conversion and promotion when dealing with numbers which are not of the same type in the same expression.

You could do manage this manually by simply defining a large number of methods dealing with each case. But that gets cumbersome. Instead Julia lets you define promotion rules.

import Base: convert, promote_rule
function promote_rule(::Type{Dual{T}}, ::Type{Dual{R}}) where {T,R} 
  Dual{promote_type(T,R)}
end
function promote_rule(::Type{Dual{T}}, ::Type{R}) where {T,R}
  Dual{promote_type(T,R)}
end
0.6s
autograd-dualnum (Julia)
promote_rule (generic function with 124 methods)

This rule says that when dealing with an expression involving a type R and a dual number of type Dual{T} then the resulting type is of type Dual{S}where S is a common number type of R and T. In Julia you can use the promote_type(T, R) function to find a common number type. Here are some examples.

promote_type(Int8, Int16)    |> println
promote_type(UInt64, Int64)  |> println
promote_type(Int64, Float64) |> println
0.7s
autograd-dualnum (Julia)

This function relies on collections of promote_rules having been define. Because we have defined such rules for our dual number we can also use promote_type for dual numbers.

promote_type(Dual{Int64}, Dual{Int8})    |> println
promote_type(Dual{Float64}, Dual{Int64}) |> println
promote_type(Float64, Dual{Int64})       |> println
0.4s
autograd-dualnum (Julia)

Of course promotion is worthless if Julia doesn't know how to convert every type to the promoted type. Hence we need to define conversion functions. Or rather we need to add conversion methods to the convert function defined in the Base module.

function convert(::Type{Dual{T}}, x::Dual) where T
  Dual(convert(T, x.x), convert(T, x.ϵ))
end
function convert(::Type{Dual{T}}, x::Real) where T
  Dual(convert(T, x), zero(T))
end
0.4s
autograd-dualnum (Julia)
convert (generic function with 184 methods)

With the conversion functions we can use dual numbers together with regular numbers and Julia will use these covert functions when need to to make conversions.

2*Dual(2, 3)        |> println
Dual(2.1) + Dual(3) |> println
2 + Dual(3)         |> println
0.7s
autograd-dualnum (Julia)

Derivative of a Function

We have now built enough functionality to be able to define a function for performing the derivative of another function.

derive(f, x) = f(Dual(x)).ϵ
0.5s
autograd-dualnum (Julia)
derive (generic function with 1 method)

We can try it on some different well known cases to see if it works as expected.

f(x) = x^2
g(x) = 6x^3 + 2x^2
df(x) = derive(f, x)
dg(x) = derive(g, x)
# Manually derive g(x) for comparison
h(x) = 18x^2 + 4x
println("f(3) = ", f(3), " then f'(3) = ", df(3))
println("g(1) = ", g(1), " then g'(1) = ", dg(1))
println(dg(1), " == ", h(1))
0.4s
autograd-dualnum (Julia)

Of course checking lots of individual values to see if the derivation works is cumbersome. A better way to get a quick overview is to plot each function an their corresponding derivative.

using Plots
plot([f, df], linewidth=3, xlabel="x", label=["f(x)" "f'(x)"])
title!("Function f(x) and its derivative f'(x)")
vline!([0], linewidth=2, color=:orange, label="global minimum")
56.0s
autograd-dualnum (Julia)
using Plots; gr()
plot([sin, x -> derive(sin, x)], linewidth=3, xlabel="x", label=["sin(x)" "sin'(x)"])
vline!([π/2], linewidth=2, color=:orange, label=nothing)
scatter!([π/2, π/2], [1, 0], markersize=6, label="maximum", color=:orange)
5.3s
autograd-dualnum (Julia)

Function Gradients

In practice derivations are not all that useful for real world machine learning because we are not dealing with functions with single inputs. We have multiple function arguments which we want to differentiate. You can think of a gradient as a generalization of the derivative.

using Plots; plotly()
bulb(x, y) = sin(x) + cos(y) 
surface(range(-π,π, length=30), range(-π,π, length=30), bulb)
14.1s
autograd-dualnum (Julia)

You can play with this interactive plot, and it should hopefully be obvious to you that the derivative is not possible to define for a point on this surface. The slope isn't a single number. You have a slope in both the x and the y direction.

We construct a gradient from partial derivatives. When creating a partial derivative, we treat every argument to a function as a constant except one. When creating partial derivatives we don't use the symbol inline_formula not implemented but inline_formula not implemented.

So say we have a function inline_formula not implemented, we can define the two partial derivatives as:

formula not implemented

The gradient of inline_formula not implemented is denoted as inline_formula not implemented and is a vector composed of its partial derivatives.

formula not implemented

When implementing this in Julia, we want to generalize so that the arguments to the function inline_formula not implemented is a vector of inline_formula not implemented values called xs. Before showing the code, let me explain a bit of how it works.

  1. We loop over every argument to f .

  2. For each iteration i we pick a value x which we will derivate with respect to.

  3. This x will be wrapped in a dual number, because being a dual number is what causes derivatives to be calculated. Otherwise numbers just acts as constants in the function.

function gradient(f, xs...)
  gs = []
  for (i, x) in enumerate(xs)
      if i == 1
        push!(gs, f(Dual(x), xs[2:end]...))
      else
        push!(gs, f(xs[1:(i-1)]..., Dual(x), xs[(i+1):end]...))
      end
  end
  [g.ϵ for g in gs]
end
0.6s
autograd-dualnum (Julia)
gradient (generic function with 1 method)

Let us just try a few simple examples to see how this works.

f(x, y) = 3x + 10y
gradient(f, 1, 1) |> println
gradient(f, 2, 3) |> println
1.0s
autograd-dualnum (Julia)
f(x, y) = 3x^2 + 10y^2
gradient(f, 1, 1) |> println
gradient(f, 2, 3) |> println
0.5s
autograd-dualnum (Julia)

Computing Gradients of Parameterized Functions

While these derivatives and gradient functions work fine in the general case, we want some Julia functions tailored towards the needs of machine learning. In machine learning the idea is that we are turning the inputs to a function into constants and the parameters of the function into variables. For instance if we have the function:

formula not implemented

Then when performing training, what we actually want to do is to derivate with respect to the parameters inline_formula not implemented and inline_formula not implemented not with respect to inline_formula not implemented. We want to find this:

formula not implemented

We need Julia to be able to keep tabs on the parameters of a function. Because Julia doesn't have pointers as in C/C++ or Go we have to create a wrapper for the parameters.

import Base: getproperty, setproperty!
const Params = Dict{Symbol, Real}
function getproperty(params::Params, key::Symbol)
   if hasfield(Params, key)
       getfield(params, key)
   else
       getindex(params, key)
   end
end
function setproperty!(params::Params, key::Symbol, x::Real)
  if hasfield(Params, key)
      setfield!(params, key, x)
  else
      setindex!(params, x, key)
  end
end
0.6s
autograd-dualnum (Julia)
setproperty! (generic function with 9 methods)

This is just a simple way to make a structure similar in behavior to an object in object-oriented dynamic languages such as Ruby and Python.

Let me explain a little bit how this works. Whenever you see something such as point.x in Julia code it is actually syntax sugar for getproperty(point, :x). This is usually implemented to simply call getfield(point, :x), which is the raw access to fields in a Julia struct. If you want to access fields without any extra code running, this is what you should call. But don't use this as an optimization attempt. If getproperty doesn't do anything but call getfield then there will never be any overhead. The Julia JIT will remove any indirection.

Anyway let me clarify how this works with an example:

point = Dict{Symbol, Real}()
point[:x] = 10
println("x accessed through key: ", point[:x])
# Because we have registered getproperty for Dict with these type parameters
# we can also write:
point.y = 20
println("y accessed through key: ", point[:y])
println("y accessed as property: ", point.y)
println("x accessed as property: ", point.x)
0.6s
autograd-dualnum (Julia)

This works because Params is just an alias for Dict{Symbol, Real}(), and they can be used interchangeably.

Let me demonstrate that we can create the exact same function using params to store parameter values.

using Plots; gr()
xs = range(-10,10, length=30)
ys = xs
a = 3
b = 2
c = 6
f(x, y) = a*x^2 + b*x*y + c*y^2
surface(xs, ys, f, layout=(2, 1), subplot=1)
params = Params()
params.a = 3
params.b = 2
params.c = 6
g(x, y) = params.a*x^2 + params.b*x*y + params.c*y^2 
surface!(xs, ys, g, layout=(2, 1), subplot=2)
15.5s
autograd-dualnum (Julia)

We can use this to create a gradient function which uses a params data structure instead.

function gradient(f, params::Params)
  gs = Params()       # To hold gradient
  
  for k in keys(params)
    backup = params[k]
    params[k] = Dual(backup) # So we can derivate with respect to paramer `k`
    gs[k] = f().ϵ
    params[k] = backup
  end
  
  gs
end
0.2s
autograd-dualnum (Julia)
gradient (generic function with 2 methods)

We should verify that we get the same result using both gradient functions.

f(x, y) = 3x + 10y
gradient(f, 1, 1) |> println
gradient(f, 2, 3) |> println
ps = Params()
ps.x = 1
ps.y = 1
keys(ps)
g() = 3ps.x + 10ps.y
gradient(g, ps) |> println
ps.x = 2
ps.y = 3
gradient(g, ps) |> println
2.6s
autograd-dualnum (Julia)
Runtimes (1)