Simon Danisch / Aug 01 2019

Neural Networks From Scratch with a twist

There have been lots of "Neural Network from Scratch" articles lately. So you may ask: why, for the love of god, would you write yet another?

Having been part of the Julia community and its machine learning efforts for quite a while, I think I can add a unique perspective on the matter.

While most articles that implement DNNs from scratch only work for toy examples, I will show how to build them while maintaining production-ready performance. This works out pretty well thanks to Julia's unique strengths in this area, so you may also read this article to learn about some of Julia's main advantages for user-friendly, high-performance programming. Furthermore, I will explain basic DNN concepts like back-propagation and automatic differentiation in a more clutter-free way.

These three perspectives, namely achieving state-of-the-art performance quickly, learning about Julia, and explaining nasty details in an accessible way, were enough motivation to write yet another article :)

So, for readers who are less familiar with the topic, let's start with a very general explanation of what a DNN is.

High-Level View of a DNN

At its core, any DNN is very simple. It's basically a black box that contains a huge tunable function with millions of parameters. Training fine-tunes those parameters to a specific problem, molding the function into returning the right answers:

Inside that tunable function, we usually have lots of layers made up of smaller functions. In principle, these can be any functions that have an input, an output, and parameters we can tune. In practice, networks are mostly built from a handful of functions that have proven to be effective (a tiny numeric sketch follows the list):

  • softmax (exp.(x) ./ sum(exp.(x)))
  • dense (W * x .+ b)
  • relu (max(zero(x), x))
  • convolution
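To make these concrete, here is a tiny numeric sketch with toy values only; the proper definitions follow in the From Scratch section below.

# Toy input vector
x = [1.0, -2.0, 0.5]
# relu clips negative activations to zero
max.(zero.(x), x)            # [1.0, 0.0, 0.5]
# softmax squashes arbitrary scores into probabilities that sum to 1
exp.(x) ./ sum(exp.(x))      # ≈ [0.604, 0.030, 0.366]
# dense is an affine map: weights * input .+ bias
W, b = ones(2, 3), zeros(2)
W * x .+ b                   # [-0.5, -0.5]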

Tuning, a.k.a. Back-Propagation

#TODO: actually, I feel like I could come up with an even better example to visualize the basic workhorse of a DNN.

For now, let's use a simple stand-in: a chain of four rigid links, where each link gets rotated by a tunable angle. next_position walks from one joint to the next, predict runs the whole chain, and the loss is the squared distance between the chain's tip and a target point. Back-propagation will then tell us how to nudge each angle to bring the tip closer to the target.

include("utilities.jl")

function next_position(position, angle)
  position .+ (sin(angle), cos(angle))
end

# Our tunable function ... or chain of flexible links
function predict(chain, input)
  output = next_position(input, chain[1])  # Layer 1
  output = next_position(output, chain[2]) # Layer 2
  output = next_position(output, chain[3]) # Layer 3
  output = next_position(output, chain[4]) # Layer 4
  return output
end

function loss(chain, input, target)
  sum((predict(chain, input) .- target) .^ 2)
end

chain = [(rand() * pi) for i in 1:4]
input, target = (0.0, 0.0), (3.0, 3.0)
weights, s = visualize(chain, input, target)
s
using Zygote
function loss_gradient(chain, input, target)
  # first index, to get gradient of first argument
  Zygote.gradient(loss, chain, input, target)[1]
end
for i in 1:100
  # get gradient of loss function
  angle∇ = loss_gradient(chain, input, target)
  # update weights with our loss gradients
  # this updates the weights in the direction of smaller loss
  chain .-= 0.01 .* angle∇
  # update visualization
  weights[] = chain
  sleep(0.01)
end;
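Before moving on, it is worth peeking at what Zygote.gradient actually returns here. A small sanity check, reusing the chain, input and target from above:

# The gradient has one entry per tunable angle in the chain; each entry tells
# us in which direction (and how strongly) to nudge that angle to reduce the loss.
g = loss_gradient(chain, input, target)
length(g) == length(chain)  # true: one partial derivative per link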

From Scratch

Now let's build the real thing. Below, we define all the building blocks ourselves: a Glorot-style weight initialization, a Dense layer, the softmax and relu activations, a cross-entropy loss, a forward pass through a chain of layers, and a small gradient-descent training loop. At the end, we train the resulting network on MNIST and test it on a single image.

using Colors, ImageShow, FileIO
import Zygote, Flux

# Glorot (Xavier) uniform initialization: draws weights from
# U(-sqrt(6 / (fan_in + fan_out)), sqrt(6 / (fan_in + fan_out)))
# to keep activations from exploding or vanishing
glorot_uniform(dims...) = (rand(Float32, dims...) .- 0.5f0) .* sqrt(24.0f0/sum(dims))

struct Dense{M <: AbstractMatrix, V <: AbstractVector, F <: Function}
  W::M
  b::V
  func::F
end

function Dense(in, out, func = identity)
  Dense(glorot_uniform(out, in), zeros(Float32, out), func)
end

# Making Dense callable: applying a layer to an input is simply layer(x)
function (a::Dense)(x::AbstractArray)
  a.func.(a.W * x .+ a.b)
end

softmax(xs) = exp.(xs) ./ sum(exp.(xs))

relu(x::Real) = max(zero(x), x)
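Just to see these pieces in action, a quick sketch with toy sizes (not part of the final model):

layer = Dense(3, 2, relu)    # random 2×3 weight matrix, zero bias, relu activation
layer(rand(Float32, 3))      # 2-element Vector{Float32}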

function crossentropy(ŷ::AbstractVecOrMat, y::AbstractVecOrMat; weight = 1)
  -sum(y .* log.(ŷ) .* weight) * 1 // size(y, 2)
end
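A quick sanity check with hypothetical scores: a confident, correct prediction should get a much smaller cross-entropy than a confidently wrong one.

y_true = Float32[0, 1, 0]                    # one-hot target: class 2 is correct
confident_right = softmax(Float32[0, 5, 0])  # most probability mass on class 2
confident_wrong = softmax(Float32[5, 0, 0])  # most probability mass on class 1
crossentropy(confident_right, y_true)        # small, ≈ 0.013
crossentropy(confident_wrong, y_true)        # large, ≈ 5.0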
function forward(network, input)
  result = input
  for layer in network
    result = layer(result)
  end
  return result
end
loss(network, x, y) = crossentropy(forward(network, x), y)
function loss_gradient(network, input, target)
  # first index, to get gradient of first argument
  Zygote.gradient(loss, network, input, target)[1]
end
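Zygote's gradient mirrors the structure of the network, which is exactly what the apply_gradient! methods below rely on. A minimal sketch with a tiny throwaway network (toy sizes, purely for inspection):

# The gradient comes back as a Tuple with one entry per layer: a NamedTuple
# with W, b and func fields for each Dense layer, and nothing for
# parameter-free layers like softmax.
tiny_network = (Dense(4, 3, relu), Dense(3, 2), softmax)
tiny_x = rand(Float32, 4, 1)
tiny_y = reshape(Float32[1, 0], 2, 1)
tiny_grad = loss_gradient(tiny_network, tiny_x, tiny_y)
propertynames(tiny_grad[1])  # should list :W, :b and :func
tiny_grad[3]                 # nothing, softmax has no parameters to tune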

apply_gradient!(a, b::Nothing, optimizer) = nothing
function apply_gradient!(a, b::NamedTuple, optimizer)
  for field in propertynames(b)
    apply_gradient!(getfield(a, field), getfield(b, field), optimizer)
  end
end
function apply_gradient!(a::Tuple, b, optimizer)
  for (alayer, blayer) in zip(a, b)
    apply_gradient!(alayer, blayer, optimizer)
  end
end
"""
We use standard Gradient descent for nothing as Optimizer
"""
function apply_gradient!(a::AbstractArray, b::AbstractArray, optimizer::Nothing)
  a .-= 0.1 .* b
end

function train!(network, X, Y, optimizer = nothing, epochs = 100)
  for epoch in 1:epochs
    grad = loss_gradient(network, X, Y)
    apply_gradient!(network, grad, optimizer)
    @show epoch
  end
end
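The optimizer argument is a hook for smarter update rules. As a sketch of how it could be used (not something we need for MNIST), here is a variant of gradient descent with a configurable learning rate, plugged in through one extra apply_gradient! method:

# Hypothetical optimizer: plain gradient descent with a configurable step size.
struct SimpleSGD
  learning_rate::Float32
end

# Only the leaf method changes; the recursive methods above simply pass the
# optimizer through to it.
function apply_gradient!(a::AbstractArray, b::AbstractArray, optimizer::SimpleSGD)
  a .-= optimizer.learning_rate .* b
end

# Usage: train!(network, X, Y, SimpleSGD(0.05f0))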

function test(n)
  img = X[1:28^2, n:n]
  # argmax returns a CartesianIndex; its first entry is the row with the highest
  # probability, and subtracting 1 maps that row back to the digit 0..9
  predict = Tuple(argmax(forward(network, img)))[1] - 1
  @show predict
  save("/results/test.png", Gray.(reshape(img, (28, 28))))
  return nothing
end
network = (
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax
)
imgs = Flux.Data.MNIST.images()
labels = Flux.Data.MNIST.labels()
# one-hot encode the labels: a 10×N matrix with a single 1 per column
Y = Flux.onehotbatch(labels, 0:9)
# flatten each 28×28 image into a column of a 784×N Float32 matrix
X = Float32.(hcat(float.(reshape.(imgs, :))...))
train!(network, X, Y)
test(1)