# Automatic Differentiation using Dual Numbers

Modern machine learning libraries such as Flux are built upon the idea of automatic differentiation. One way to implement it is with dual numbers.

```julia
struct Dual{T<:Real} <: Real
    x::T
    ϵ::T
end
```

With dual numbers we keep track of both the result of performing a mathematical operation on a number and the derivative of that operation.

If $f(x) = x$, then $f'(x) = 1$. We bake this into the basic constructor for a dual number.

```julia
Dual(x) = Dual(x, one(x))
```

To make it easier to work with, we create a nice representation of a dual number:

```julia
Base.show(io::IO, d::Dual) = print(io, d.x, " + ", d.ϵ, "ϵ")
```

Let us create a couple of dual numbers:

```julia
Dual(3), Dual(2, 4)
```

## Elementary rules of differentiation

Think of $\epsilon$ as the derivative of $x$ ($\epsilon = \frac{\mathrm{d}y}{\mathrm{d}x}$); we can then define how different mathematical operations should work with dual numbers.
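One standard way to see why carrying these $(x, \epsilon)$ pairs computes derivatives is to treat $\epsilon$ as an infinitesimal with $\epsilon^2 = 0$. A Taylor expansion then truncates itself:

$$f(x + \epsilon) = f(x) + f'(x)\epsilon + \frac{f''(x)}{2}\epsilon^2 + \dots = f(x) + f'(x)\epsilon$$

so evaluating $f$ on a dual number yields both the value and the derivative in one pass.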

**Differentiation is linear**

Let us start with addition and subtraction. The rules of differentiation say that the derivative of the function $h(x) = f(x) \pm g(x)$ with respect to $x$ is:

$$h'(x) = f'(x) \pm g'(x)$$

```julia
import Base: +, -, *, /

a::Dual + b::Dual = Dual(a.x + b.x, a.ϵ + b.ϵ)
a::Dual - b::Dual = Dual(a.x - b.x, a.ϵ - b.ϵ)
```

Let us try these operations out with some numbers:

```julia
Dual(3) + Dual(4)
```

**The product rule**

The derivative of a function $h(x) = f(x)g(x)$ defined as the product of two functions is:

$$h'(x) = f'(x)g(x) + f(x)g'(x)$$

We can see how the derivative of $x^2$ follows from this:

$$\frac{\mathrm{d}}{\mathrm{d}x}x^2 = 1 \cdot x + x \cdot 1 = 2x$$

We can use this definition to implement multiplication for dual numbers.

```julia
a::Dual * b::Dual = Dual(a.x*b.x, b.x*a.ϵ + a.x*b.ϵ)
```

Again let us check if this works as expected. If $f(x) = x^2$, then $f(3) = 9$, and for the derivative we get $f'(3) = 2 \cdot 3 = 6$.

```julia
x = Dual(3)
y = Dual(6)

x * x, x * y, y * y
```

**The quotient rule**

Say the function $h(x)$ is defined as a quotient of two functions $f(x)$ and $g(x)$ like this:

$$h(x) = \frac{f(x)}{g(x)}$$

Then the derivative of $h(x)$ is defined as:

$$h'(x) = \frac{f'(x)g(x) - f(x)g'(x)}{g(x)^2}$$

```julia
a::Dual / b::Dual = Dual(a.x / b.x, (a.ϵ*b.x - b.ϵ*a.x) / b.x^2)
```

We can try this out with some numbers again:

```julia
y/x
```

**The chain rule**

This applies when we differentiate a function whose argument is itself another function. This is really at the core of what happens when backpropagation is performed in a neural network.

Given a function $h(x) = f(g(x))$, the derivative is defined as:

$$h'(x) = f'(g(x)) \cdot g'(x)$$

To get an intuition for why this rule makes sense, we can use Leibniz's notation instead:

$$\frac{\mathrm{d}y}{\mathrm{d}x} = \frac{\mathrm{d}y}{\mathrm{d}u} \cdot \frac{\mathrm{d}u}{\mathrm{d}x}$$

We can use this rule to define what the derivative of $\sin$ and $\cos$ should be when their arguments are functions (dual numbers). So we want to handle cases like this:

$$h(x) = \sin(g(x))$$

If we apply the chain rule to differentiate $\sin(g(x))$, we get:

$$h'(x) = \cos(g(x)) \cdot g'(x)$$

We leave doing the same derivation for $\cos$ as an exercise to the reader. To read and understand the implementation below, it helps to think of the variable `d` as

$$d = g(x) + g'(x)\epsilon$$

```julia
import Base: sin, cos

sin(d::Dual) = Dual(sin(d.x), cos(d.x) * d.ϵ)
cos(d::Dual) = Dual(cos(d.x), -sin(d.x) * d.ϵ)
```
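As a quick sanity check, here is a minimal self-contained sketch (it redefines a stripped-down `Dual` so it runs on its own) verifying the `sin` rule numerically:

```julia
# Stripped-down Dual carrying a value and a derivative, as above
struct Dual{T<:Real} <: Real
    x::T   # g(x)
    ϵ::T   # g'(x)
end

# Chain rule for sin, mirroring the definition in the text
Base.sin(d::Dual) = Dual(sin(d.x), cos(d.x) * d.ϵ)

# sin'(0) = cos(0) = 1
d0 = sin(Dual(0.0, 1.0))
println(d0.ϵ)                    # 1.0

# For g(x) = x^2 at x = 2: d = Dual(g(2), g'(2)) = Dual(4.0, 4.0),
# so sin(d).ϵ should equal cos(4) * 4
dc = sin(Dual(4.0, 4.0))
println(dc.ϵ == cos(4.0) * 4.0)  # true
```

Note how the second check exercises exactly the $d = g(x) + g'(x)\epsilon$ reading of the variable `d`.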

We could implement more cases, but this is enough of the basics. Next we want to play nice with Julia's conversion and promotion system.

## Number Conversion and Promotion

Julia, being a language built for numerical work, has a system that makes it easy to define the rules for conversion and promotion when an expression mixes numbers of different types.

You could manage this manually by defining a large number of methods dealing with each case, but that gets cumbersome. Instead Julia lets you define promotion rules.

```julia
import Base: convert, promote_rule

function promote_rule(::Type{Dual{T}}, ::Type{Dual{R}}) where {T,R}
    Dual{promote_type(T,R)}
end

function promote_rule(::Type{Dual{T}}, ::Type{R}) where {T,R}
    Dual{promote_type(T,R)}
end
```

These rules say that when dealing with an expression involving a type `R` and a dual number of type `Dual{T}`, the resulting type is `Dual{S}`, where `S` is a common number type of `R` and `T`. In Julia you can use the `promote_type(T, R)` function to find a common number type. Here are some examples:

```julia
promote_type(Int8, Int16) |> println
promote_type(UInt64, Int64) |> println
promote_type(Int64, Float64) |> println
```

This function relies on collections of `promote_rule`s having been defined. Because we have defined such rules for our dual numbers, we can also use `promote_type` for them.

```julia
promote_type(Dual{Int64}, Dual{Int8}) |> println
promote_type(Dual{Float64}, Dual{Int64}) |> println
promote_type(Float64, Dual{Int64}) |> println
```

Of course promotion is worthless if Julia doesn't know how to convert every type to the promoted type. Hence we need to define conversion functions, or rather, add conversion methods to the `convert` function defined in the `Base` module.

```julia
function convert(::Type{Dual{T}}, x::Dual) where T
    Dual(convert(T, x.x), convert(T, x.ϵ))
end

function convert(::Type{Dual{T}}, x::Real) where T
    Dual(convert(T, x), zero(T))
end
```

With the conversion functions in place, we can mix dual numbers with regular numbers, and Julia will call these `convert` methods whenever a conversion is needed.

```julia
2*Dual(2, 3) |> println
Dual(2.1) + Dual(3) |> println
2 + Dual(3) |> println
```

## Derivative of a Function

We have now built enough functionality to define a function that computes the derivative of another function.

```julia
derive(f, x) = f(Dual(x)).ϵ
```
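Pulling the pieces together, here is a minimal self-contained sketch of `derive` (it redefines a reduced `Dual` with just the product rule, so the snippet runs on its own):

```julia
struct Dual{T<:Real} <: Real
    x::T
    ϵ::T
end
Dual(x) = Dual(x, one(x))   # seed the derivative with 1

# Product rule, as defined earlier in the text
Base.:*(a::Dual, b::Dual) = Dual(a.x * b.x, b.x * a.ϵ + a.x * b.ϵ)

# The derivative is just the ϵ part of f evaluated at a dual number
derive(f, x) = f(Dual(x)).ϵ

f(x) = x * x                # f'(x) = 2x
println(derive(f, 3))       # 6
```

The wrapped input starts with $\epsilon = 1$ (the derivative of $x$ with respect to itself), and the arithmetic rules propagate the derivative through the computation.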

We can try it on some well-known cases to see if it works as expected.

```julia
f(x) = x^2
g(x) = 6x^3 + 2x^2

df(x) = derive(f, x)
dg(x) = derive(g, x)

# Manually differentiated g(x) for comparison
h(x) = 18x^2 + 4x

println("f(3) = ", f(3), " then f'(3) = ", df(3))
println("g(1) = ", g(1), " then g'(1) = ", dg(1))
println(dg(1), " == ", h(1))
```

Of course, checking lots of individual values to see if differentiation works is cumbersome. A better way to get a quick overview is to plot each function and its corresponding derivative.

```julia
using Plots

plot([f, df], linewidth=3, xlabel="x", label=["f(x)" "f'(x)"])
title!("Function f(x) and its derivative f'(x)")
vline!([0], linewidth=2, color=:orange, label="global minimum")
```

```julia
using Plots; gr()

plot([sin, x -> derive(sin, x)], linewidth=3, xlabel="x", label=["sin(x)" "sin'(x)"])
vline!([π/2], linewidth=2, color=:orange, label=nothing)
scatter!([π/2, π/2], [1, 0], markersize=6, label="maximum", color=:orange)
```

## Function Gradients

In practice, plain derivatives are not all that useful for real-world machine learning, because we are not dealing with functions of a single input. We have multiple function arguments with respect to which we want to differentiate. You can think of a gradient as a generalization of the derivative.

```julia
using Plots; plotly()

bulb(x, y) = sin(x) + cos(y)
surface(range(-π, π, length=30), range(-π, π, length=30), bulb)
```

You can play with this interactive plot, and it should hopefully be obvious that a single derivative cannot be defined for a point on this surface. The slope isn't a single number: there is a slope in both the x and the y direction.

We construct a gradient from partial derivatives. When creating a partial derivative, we treat every argument to a function as a constant except one. For partial derivatives we don't use the symbol $\mathrm{d}$ but $\partial$.

So say we have a function $f(x, y)$; we can define the two partial derivatives as:

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}, \qquad \frac{\partial f}{\partial y} = \lim_{h \to 0} \frac{f(x, y + h) - f(x, y)}{h}$$

The gradient of $f$ is denoted $\nabla f$ and is a vector composed of its partial derivatives:

$$\nabla f = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right]$$

When implementing this in Julia, we want to generalize so that the arguments to the function $f$ form a vector of values called `xs`. Before showing the code, let me explain a bit of how it works.

We loop over every argument to `f`. For each iteration `i` we pick a value `x` which we will differentiate with respect to. This `x` will be wrapped in a dual number, because being a dual number is what causes derivatives to be calculated. Otherwise numbers just act as constants in the function.

```julia
function gradient(f, xs...)
    gs = []
    for (i, x) in enumerate(xs)
        if i == 1
            push!(gs, f(Dual(x), xs[2:end]...))
        else
            push!(gs, f(xs[1:(i-1)]..., Dual(x), xs[(i+1):end]...))
        end
    end
    [g.ϵ for g in gs]
end
```

Let us just try a few simple examples to see how this works.

```julia
f(x, y) = 3x + 10y

gradient(f, 1, 1) |> println
gradient(f, 2, 3) |> println

f(x, y) = 3x^2 + 10y^2

gradient(f, 1, 1) |> println
gradient(f, 2, 3) |> println
```
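As a sanity check on the last example, we can work out the gradient by hand. For $f(x, y) = 3x^2 + 10y^2$:

$$\nabla f = \left[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right] = [6x, 20y]$$

so at $(1, 1)$ the gradient is $[6, 20]$ and at $(2, 3)$ it is $[12, 60]$, which is what `gradient` should print.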

### Computing Gradients of Parameterized Functions

While these derivative and gradient functions work fine in the general case, we want some Julia functions tailored towards the needs of machine learning. In machine learning the idea is that we turn the inputs of a function into constants and the parameters of the function into variables. For instance, if we have the function:

$$f(x) = ax + b$$

then when performing training, what we actually want is to differentiate with respect to the parameters $a$ and $b$, not with respect to $x$. We want to find this:

$$\left[\frac{\partial f}{\partial a}, \frac{\partial f}{\partial b}\right]$$

We need Julia to be able to keep tabs on the parameters of a function. Because Julia doesn't have pointers as in C/C++ or Go, we have to create a wrapper for the parameters.

```julia
import Base: getproperty, setproperty!

const Params = Dict{Symbol, Real}

function getproperty(params::Params, key::Symbol)
    if hasfield(Params, key)
        getfield(params, key)
    else
        getindex(params, key)
    end
end

function setproperty!(params::Params, key::Symbol, x::Real)
    if hasfield(Params, key)
        setfield!(params, key, x)
    else
        setindex!(params, x, key)
    end
end
```

This is just a simple way to make a structure similar in behavior to an object in object-oriented dynamic languages such as Ruby and Python.

Let me explain a little bit how this works. Whenever you see something such as `point.x` in Julia code, it is actually syntax sugar for `getproperty(point, :x)`. This is usually implemented to simply call `getfield(point, :x)`, which is the raw access to fields in a Julia struct. If you want to access fields without any extra code running, this is what you should call. But don't use this as an optimization attempt: if `getproperty` doesn't do anything but call `getfield`, then there will never be any overhead. The Julia JIT will remove any indirection.
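To illustrate the same mechanism without overloading methods for `Dict`, here is a hypothetical `Point` type (the name and field are made up for this sketch, not part of the tutorial code) that forwards unknown properties to an internal dictionary:

```julia
# Hypothetical example: real field :coords, every other property
# goes through the dictionary it wraps.
struct Point
    coords::Dict{Symbol, Float64}
end
Point() = Point(Dict{Symbol, Float64}())

function Base.getproperty(p::Point, key::Symbol)
    if hasfield(Point, key)
        getfield(p, key)                      # raw field access, e.g. p.coords
    else
        getindex(getfield(p, :coords), key)   # p.x becomes coords[:x]
    end
end

function Base.setproperty!(p::Point, key::Symbol, x::Real)
    setindex!(getfield(p, :coords), x, key)   # p.x = v becomes coords[:x] = v
end

p = Point()
p.x = 10        # stored in the dictionary as 10.0
println(p.x)    # 10.0
```

The `hasfield` check is what lets real struct fields and dictionary-backed properties coexist behind the same dot syntax.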

Anyway let me clarify how this works with an example:

```julia
point = Dict{Symbol, Real}()
point[:x] = 10
println("x accessed through key: ", point[:x])

# Because we have registered getproperty for Dict with these type parameters
# we can also write:
point.y = 20

println("y accessed through key: ", point[:y])
println("y accessed as property: ", point.y)
println("x accessed as property: ", point.x)
```

This works because `Params` is just an alias for `Dict{Symbol, Real}`, and the two can be used interchangeably.

Let me demonstrate that we can create the exact same function using params to store parameter values.

```julia
using Plots; gr()

xs = range(-10, 10, length=30)
ys = xs

a = 3
b = 2
c = 6
f(x, y) = a*x^2 + b*x*y + c*y^2
surface(xs, ys, f, layout=(2, 1), subplot=1)

params = Params()
params.a = 3
params.b = 2
params.c = 6
g(x, y) = params.a*x^2 + params.b*x*y + params.c*y^2
surface!(xs, ys, g, layout=(2, 1), subplot=2)
```

We can use this to create a gradient function which uses a params data structure instead.

```julia
function gradient(f, params::Params)
    gs = Params() # To hold the gradient

    for k in keys(params)
        backup = params[k]
        params[k] = Dual(backup) # So we can differentiate with respect to parameter k
        gs[k] = f().ϵ
        params[k] = backup
    end

    gs
end
```

We should verify that we get the same result using both gradient functions.

```julia
f(x, y) = 3x + 10y

gradient(f, 1, 1) |> println
gradient(f, 2, 3) |> println

ps = Params()
ps.x = 1
ps.y = 1
keys(ps)

g() = 3ps.x + 10ps.y

gradient(g, ps) |> println

ps.x = 2
ps.y = 3

gradient(g, ps) |> println
```