Efficient Neural Network Loss Landscape Generation

Robert Luciani, Hao Huang, & Simon Danisch


Fairly recently we came across the paper by Li et.al on generating loss surfaces. The plots looked so cool, we immediately wanted to try it ourselves. Unfortunately, the paper's associated GitHub repo had so much Python code for this one task that we didn't have the energy to read through it. With the research paper as a guide and a bit of hacking, we ended up with a few dozen lines of Julia code and some sweet looking plots.

The two implementations are almost certainly not equivalent, but still, for rendering a 51x51 pixel surface this version is relatively "interactive":

Reference implementation: 60 minutes on 4x GTX 1080 Ti

This implementation: 4 minutes on a K80 cloud-instance


We begin by producing a set of Gaussian random set of weights (a vector) ϕ similar in shape and scale to a trained model θ. We do it again for another random vector ψ. They will act as the x and y axes the final plot. We then do piecewise linear interpolation between the three vectors and plot out the loss at those points. That's basically it!

An intriguing part about this method is that each time the code is run, the generated landscape looks nearly identical. It's a neat result because the vectors that we map to plot axes are chosen essentially at random, with no consideration given to perpendicularity. In theory, assuming the weights of the network are not degenerate, one might be able to solve for two perfectly orthogonal vectors so that the resulting landscape would be more "true". In practice though, the nature of high-dimensional space seems to render such a step unnecessary as any set of random vectors are "orthogonal enough", hence the consistent results.


The combined expressiveness and high performance of Julia makes it very well suited for Machine Learning work. You can grab mathematical pseudocode from a science book and basically copy it verbatim in Julia, and on the first run it will execute both as intended and blazingly fast. Ultimately, it lets programmers move seamlessly between being mathematicians and computer engineers.


Flux provides nearly all convenience functions an ML practitioner might want, including state of the art autodifferentiation, integration with ODE solvers, and very elegant GPU and TPU support. One thing we find especially appealing is that given Flux's pure Julia implementation, all of its functionality is trivially modifiable.

A DNN in Flux is typically built out of functions like maxpool and objects like Dense. These convenience objects provided by Flux are essentially structs to keep track of weights, gradients, activations, and so on. Once objects are initialized like dlayer = Dense(2, 4, σ) they can be called with input data like dlayer(dat). For feed-forward networks the convenience object Chain lets us create a sequence of layers and transformations that we can index into and change easily. For example by creating a mode like vgg = Chain(Conv, ..., softmax) you can pass an image through the first five operations like vgg[1:5](img). You can access the bias gradients of the last dense layer like vgg[29].b.grad. This flexibility be useful later when we want to do things like overwrite layer weights constantly.


We chose the FashionMNIST dataset because it is small enough to be quickly downloaded and complex enough to produce an interesting landscape. Given that batch size affects how a network learns, it follows that the resulting look will be affected by what value is chosen. We want to train the network as we would under ordinary circumstances, which is to say with a full training dataset. However, only a small set of training samples are needed for the final landscape generation step.

using Flux
using CuArrays

batchsize = 64

dat_x = Flux.Data.FashionMNIST.images(:train)
dat_x = [reshape(Float32.(x), 28, 28, 1, 1) for x in dat_x]
dat_x = gpu.([reduce((x, y)->cat(x, y, dims=4), batch) 
            for batch in Iterators.partition(dat_x, batchsize)])

dat_y = Flux.Data.FashionMNIST.labels(:train)
dat_y = gpu.([Flux.onehotbatch(batch, 0:9) 
            for batch in Iterators.partition(dat_y, batchsize)])

dat = zip(dat_x, dat_y);

We define a VGG-like network with 13 layers, leaky relu activations, and batchnorm. It will easily learn the FashionMNIST dataset in less than one epoch. We set the optimizer to ADAM with η = 0.001 and β = (0.9, 0.999). The loss function is ordinary cross entropy:

H(y,y^)=i=1nyilog2y^inH(y, ŷ) = -\sum_{i=1}^{n} \frac{y_i log_2 ŷ_i}{n}
vgg = gpu(Chain(
    Conv((3,3), 1  =>64,  relu, pad=(1,1)), BatchNorm(64),
    Conv((3,3), 64 =>64,  relu, pad=(1,1)), BatchNorm(64), 
    x -> maxpool(x, (2,2)),
    Conv((3,3), 64 =>128, relu, pad=(1,1)), BatchNorm(128),
    Conv((3,3), 128=>128, relu, pad=(1,1)), BatchNorm(128), 
    x -> maxpool(x, (2,2)),
    Conv((3,3), 128=>256, relu, pad=(1,1)), BatchNorm(256),
    Conv((3,3), 256=>256, relu, pad=(1,1)), BatchNorm(256),
    Conv((3,3), 256=>256, relu, pad=(1,1)), BatchNorm(256), 
    x -> maxpool(x, (2,2)),
    Conv((3,3), 256=>512, relu, pad=(1,1)), BatchNorm(512),
    Conv((3,3), 512=>512, relu, pad=(1,1)), BatchNorm(512),
    Conv((3,3), 512=>512, relu, pad=(1,1)), BatchNorm(512), 
    x -> maxpool(x, (2,2)), x -> reshape(x, :, size(x,4)),
    Dense(512,  4096, relu),
    Dense(4096, 4096, relu), Dropout(0.5),
    Dense(4096, 10),

loss(x, y) = Flux.crossentropy(vgg(x), y) 

Flux.train!(loss, params(vgg), dat, ADAM())
# Write out to the /results/ folder, which will get stored in an online 
# data bucket, which we can more easily reference anywhere
using BSON

model = cpu(vgg)
BSON.@save "/results/model.bson" model;

We define functions to help us extract, modify, and reinsert arrays into our Chain object. We include the weights and biases of both convolutional and dense layers. Other layers such as batchnorm are left as is.

minweights() returns a dict with a layer number and its weights/biases:

Dict(layer1 => [[W,W,W], [b,b,b]], layer3 => [[W,W], [b,b]])

rdnweights() returns a similar dict except the weights are Gaussian random and have the same shape/norm as the input model.

Run (Julia)
function minweights(layers)
    out = Dict{Int, Array}()
    for i in 1:length(layers)
        if typeof(layers[i]) <: Conv || typeof(layers[i]) <: Dense
            out[i] = getweights(layers[i])
    return out

function rndweights(layers, n)
    out = [Dict{Int, Array}() for i in 1:n]
    for θ in out, i in 1:length(layers)
        if typeof(layers[i]) <: Conv || typeof(layers[i]) <: Dense
            w, b = getweights(layers[i])
            d = [gpu(randn(Float32, size(w))), gpu(randn(Float32, size(b)))]
            θ[i] = filternorm(d, getweights(layers[i]))
    return out

getweights(layer::T) where T <: Conv = [copy(layer.weight.data), copy(layer.bias.data)]
getweights(layer::T) where T <: Dense = [copy(layer.W.data), copy(layer.b.data)]

function setweights!(model, weights)
    for (l, w) in weights
        if typeof(model[l]) <: Conv 
            copyto!(model[l].weight.data, w[1]), copyto!(model[l].bias.data, w[2])
            copyto!(model[l].W.data, w[1]), copyto!(model[l].b.data, w[2])
setweights! (generic function with 1 method)

The weights in neural networks (especially rectified ones) can be scale invariant and vary wildly between layers. The first thing we need is the ability to normalize vectors so that we can change it by a predictable amount. The robust way to do this is to normalize each filter j on each layer i in the random vector ϕ to the norm of its counterpart in the trained model θ like so:

ϕi,jϕi,jϕi,jθi,jϕ_{i,j} \leftarrow \frac{ϕ_{i,j}}{\left\lVertϕ_{i,j}\right\rVert}\left\lVertθ_{i,j}\right\rVert

One can also norm entire layers at a time which is a bit faster but not as smooth.

layernorm(dir, min) = lnkern.(dir, min)
lnkern(dir, min) = sqrt(sum(min.^2)) .* dir ./ sqrt(sum(dir.^2))

filternorm(dir, min) = fnkern.(dir, min)
function fnkern(dir, min)
    dim = ndims(dir) == 4 ? [1,2,3] : [1]
    sqrt.(sum(min.^2, dims=dim)) .* dir ./ sqrt.(sum(dir.^2, dims=dim))
fnkern (generic function with 1 method)

n most circumstances, two random vectors will perturb a given model's loss in fairly similar ways. This means that in order to generate interesting asymmetric landscapes we need to treat the plot axes differently. We can do this by parameterizing asymetrically, norming intermediate vectors, or applying linear interpolation twice.

function linear2x(α, β, θ1, θ2, θ3)
    ψ = α .* θ2 .+ (1-α) .* θ1
    return β .* θ3 .+ (1-β) .* layernorm(ψ, θ1)

function barycentric(α, β, θ1, θ2, θ3)
    ϕ  = α .* (θ2 .- θ1) .+ θ1
    ψ  = β .* (θ3 .- θ1) .+ θ1
    return α .* filternorm(ϕ, θ1) .+ (1-β) .* filternorm(ψ, θ1)
barycentric (generic function with 1 method)

Below is the entirety of the code that generates the loss landscape. The function landscape begins by setting aside the optimized model vector θm and generates additional random vectors θ1, θ2. It then interpolates between those vectors using the offset parameters α, β, creating a new vector θd which is reinserted into the model. We then calculate the cross entropy using that vector and store the loss. The last step is simply to replace infinities and log scale the values.

using ProgressMeter

function landscape(dnn, dat, lbl, res, interpolate)
    x,  y,  z  = collect(res), collect(res), []
    θ1, θ2     = rndweights(dnn, 2)
    θm, θd     = minweights(dnn), Dict()
    @showprogress for α in x, β in y
        for (i, w) in θm
            θd[i] = interpolate(α, β, θm[i], θ1[i], θ2[i])
        setweights!(dnn, θd)
        push!(z, Flux.crossentropy(dnn(dat), lbl).data)
    setweights!(dnn, θm)
    z[isinf.(z)] .= maximum(z[.!(isinf.(z))])
    log.(reshape(z, length(x), length(y)))
landscape (generic function with 1 method)


# Load a pre-trained model from disk to save time.
using BSON, Flux, CuArrays
# use ctrl + e to insert a reference to a saved result file
model vgg = gpu(model);
# Consider what resolution surface you want before rendering.
res = -2.5f0:0.1f0:2.5f0
zmap = landscape(vgg, dat_x[end], dat_y[end], res, linear2x)
BSON.@save "/results/zmap.bson" res zmap;
# Load a pre-rendered surface to save even more time!
using BSON, Flux

zmap res = -2.4:0.01:2.4
using Makie

contour(res, res, zmap, levels = 10, linewidth = 2, colormap=:darktest)


We added some functions to convert back and forth between local (manifold) coordinates and zmap indices. This allows us to plot the path that optimizers might take across this sort of landscape. For the interactive sliders to be fun, the optimizer parameters need to be adjusted to match the resolution of the surface.

using Flux.Optimise: apply!

function zdepth(xy, xmap, ymap, zmap)
    x, y = xy.data
    z = 0f0
    try z = zmap[findfirst(v->v>x, xmap)-1, findfirst(v->v>y, ymap)-1]
    catch e end
    return [x, y, z]

function zgrad!(xy, xmap, ymap, z)
    x, y = xy.data
    try zx, zy = findfirst(v->v>x, xmap)-1, findfirst(v->v>y, ymap)-1
        xgrad, ygrad = z[zx,zy]-z[zx+1,zy], z[zx,zy]-z[zx,zy+1] 
        copyto!(xy.grad, [xgrad, ygrad])
    catch e end

function descend(xy, xres, yres, zmap, opt, steps)
    out = []
    for i in 1:steps
        zgrad!(xy, xres, yres, zmap)
        xy.data .+= apply!(opt, xy.data, xy.grad) 
        push!(out, [zdepth(xy, xres, yres, zmap)...])
    return Point3f0.(out)
descend (generic function with 1 method)
s_x   = Makie.slider(res, raw=true, camera=campixel!, start=0f0)
s_y   = Makie.slider(res, raw=true, camera=campixel!, start=0f0)
s_opt = Makie.slider(1:1:100, raw=true, camera=campixel!, start=2)

opt   = NADAM(0.1, (0.99, 0.999))
path  = lift((x, y, steps) -> descend(param([to_value(x) to_value(y)]), 
            res, res, zmap, opt, to_value(steps)), 
            s_x[end][:value], s_y[end][:value], s_opt[end][:value])

scene = surface(res, res, zmap, colormap=:coolwarm)
scene = lines!(scene, path, color=RGBAf0(0.2, 0.9, 0.2, 1), linewidth=3)
scene = contour!(scene, res, res, zmap, levels = 15, linewidth = 3, 
                transformation = (:xy, minimum(scene.limits[])[3]), colormap=:lighttest)

Makie.hbox(scene, Makie.vbox(s_y, s_x, s_opt))


That was it! We hope the code above exmplified that legibility doesn't have to come at the expense of performance and utility. Tools like Flux, Makie, CuArrays, and the whole Julia ecosystem makes life enjoyable for people interested in Deep Learning and scientific computing. As a small bonus we've modified the plot above to run interactively in a web-browser. Try moving the model around and see where the NADAM optimizer takes you. GLHF!

Run (Julia)
using WebSockets, WebIO, Interact, CSSUtil, ImageShow
# use online version for WebIO, so that the output doesn't go away after the 
# runner shuts down
ENV["WEBIO_BUNDLE_URL"] = "https://simondanisch.github.io/ReferenceImages/generic_http.js"
opt = NADAM(0.1, (0.99, 0.999))

s1 = Interact.slider(res, value = 1.9f0, label = "x")
s2 = Interact.slider(res, value = 1.06f0, label = "y")
npoints = 30
cam_rotation = Interact.slider(LinRange(0, 2pi, 360), value = 20, label = "cam rotation")

path = map(s1, s2) do x, y
  descend(param([x y]), res, res, zmap, opt, npoints)
scene_s = Scene(resolution = (600, 650))

surface!(scene_s, res, res, zmap, colormap=:coolwarm)

  scene_s, observe(path), color = (:white, 0.4), linewidth = 2,
  linestyle = :dot, overdraw = true
  scene_s, observe(path), color = RGBf0(0.9, 0.2, 0.4), markersize = 0.09

xm, ym, zm = minimum(scene_s.limits[])

  scene_s, res, res, zmap, levels = 15, linewidth = 3, 
  transformation = (:xy, minimum(scene.limits[])[3]), colormap=:lighttest
scene_s.center = false
# squish z a bit
scale!(scene_s, 1.0, 1.0, 0.6)
rotate_cam!(scene_s, 0.0, -0.4, 0.0)

# display the sliders
lastval = Ref(cam_rotation[])
on(cam_rotation) do x
  global lastval
  # cam rotation is relative
  rotate_cam!(scene_s, lastval[] - x, 0.0, 0.0)
  lastval[] = x
scene_obs = Observables.async_latest(map(path, cam_rotation) do p, rot
# Display slider and scene. The slider will go away once the runner shuts down 
# To interact with them, log into nextjournal and start the notebook!
CSSUtil.vbox(cam_rotation, CSSUtil.hbox(s1, s2), scene_obs)