Simon Danisch / Dec 04 2018

Howto: Reproducible Benchmarks

Benchmarking and tuning your own algorithm can be pretty fun! Reproducing someone else's benchmarks, on the other hand, can quickly turn into a frustrating battle. This article will help with common pitfalls and offers a way to publish your benchmarks online, so that it's easy for readers to run them and add new insights!

Reproducing the Environment

It's sometimes not that simple to get a benchmark running, especially if it depends on lots of software! It's even worse when a benchmark runs, but one forgets to set e.g. an environment variable that would speed up the baseline - so you euphorically publish a blog post about how you beat benchmark X, when you simply failed to reproduce the baseline correctly. To save people from such embarrassments, this article will give some tips on how to publish your benchmarks.

Luckily, nextjournal almost by coincidence came up with the perfect platform to share your benchmarks and enable readers to immediately run and reproduce them! Nextjournal is a platform where you can publish articles that contain code and data, all frozen into a runnable docker container. This is supposed to make research more reproducible, but it can also be used for benchmarking. The beauty is that every article can be "remixed": you can create a new version of an existing article in the exact same environment (running in the cloud) as the original author. So if you find a benchmarking article hosted on nextjournal, you can easily extend it with your own version. You can also combine the environments of multiple articles into one, making it easy to compare benchmarks written in different languages or frameworks.

In this article, I will write my own benchmarks and analysis, using and explaining the tools nextjournal offers. Ironically, it was quite a bit of work to set up and measure the C benchmarks used in this article correctly - work one really doesn't want to do when quickly trying to establish a baseline on one's own hardware!

Reproducible Measurements

When benchmarking, the first goal is to create statistically significant measurements and comparisons - which turns out to be quite a science in itself. Even when running the same code on the same hardware, there is enough noise on most machines to spoil the measurements. Let's first explore how badly noise can affect our measurements! With nextjournal you have the option to use a "shared" runner or a "dedicated" runner to execute your code. Shared runners share the CPU with multiple other users, while a dedicated runner boots up a whole machine just for your article. As you can already guess, a shared runner probably has much more noise - but how bad can it be, and how reliable is a dedicated machine? I prepared two articles running benchmarks from the BenchmarkGame, which I remixed and ran on dedicated and shared hardware.

The benchmarks are run using BenchmarkTools, a framework that applies lots of tricks to eliminate noise. BenchmarkTools times the same code until a significant number of samples is reached, and encapsulates those samples in a single benchmarking object. Those objects can then be compared and tested for statistically significant differences. For the above articles, I record around 1000 benchmark objects for each code sample.
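As a minimal sketch of that workflow (illustrative only - this is not the article's actual benchmark code, and the benchmarked expression is arbitrary):

```julia
# Minimal BenchmarkTools sketch - illustrative, not the article's benchmark code.
using BenchmarkTools

# @benchmark repeatedly times the expression and collects all timing
# samples into a single Trial object.
trial = @benchmark sum(rand(1000))

# A Trial can be reduced to one representative estimate; the minimum is
# a common choice, since noise only ever adds time to a measurement.
best = minimum(trial)
```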

The most interesting statistic is: how likely is it that I benchmark the exact same native code, but measure a significant difference in execution speed? To find out, let's compare the benchmark samples generated from the same code against each other!
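To make that comparison concrete, here is a small sketch of how BenchmarkTools' `judge` classifies two estimates of the same code (the benchmarked expression is arbitrary, and `judge` uses the library's default tolerances):

```julia
using BenchmarkTools

# Benchmark the exact same code twice...
t1 = minimum(@benchmark sin(1.0))
t2 = minimum(@benchmark sin(1.0))

# ...and judge the two minimum-time estimates against each other.
# `judge` returns a TrialJudgement whose `time` field is a Symbol:
# :invariant (no significant difference), :improvement, or :regression.
judgement = judge(t1, t2)
judgement.time
```

Ideally this yields `:invariant` every time - how often it doesn't is exactly what the self-comparison below measures.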

using Serialization, BenchmarkTools
# Data from the other articles! The URLs were extracted from the exported articles,
# but nextjournal should soon offer cross-referencing data between articles.
data_shared_night = open(deserialize, download(""))
data_dedicated_night = open(deserialize, download(""))
data_shared_day = open(deserialize, download(""))
data_dedicated_day = open(deserialize, download(""))
using Plots; plotly()
function judge_each_other(samples)
  min_trial = minimum.(samples) # compare each pair by minimal time
  res = Symbol[]; N = length(samples)
  for i in 1:N
    for j in i:N
      push!(res, judge(min_trial[i], min_trial[j]).time)
    end
  end
  res
end
function selfcompare_plot(data, title = "")
  judges = judge_each_other(data)
  ks = unique(judges)
  frequencies = map(k-> count(isequal(k), judges), ks)
  frequencies = filter(x-> x != 0, frequencies)
  bar(string.(ks), frequencies, legend = false, title = title)
end
plot(
  selfcompare_plot(data_shared_day["julia"]["nbody"], "shared"),
  selfcompare_plot(data_dedicated_night["julia"]["nbody"], "dedicated"),
)
using StatPlots
function benchgame_compare(data)
  names = ["fannkuch", "nbody", "mandel"]
  res = [minimum(map(x-> minimum(x).time / 10^9, data[lang][name])) for name in names, lang in ("julia", "c")]
  groupedbar(names, res, bar_position = :dodge, label = ["julia" "c"])
end
plot(benchgame_compare.((data_shared_day, data_dedicated_day, data_dedicated_night))...)
names = ["fannkuch", "nbody", "mandel"]
data = ["shared 1" => data_shared_night, "dedicated 1" => data_dedicated_night, "dedicated 2" => data_dedicated_day]
lang = ["julia", "c"]
datasets = vec([d[l] for l in lang, (n, d) in data])
res = [minimum(map(x-> minimum(x).time / 10^9, d[name])) for name in names, d in datasets]
# iteration order must match `datasets` above (language varies fastest)
labels = permutedims(vec(["$l in $n" for l in lang, (n, d) in data]))
groupedbar(names, res, bar_position = :dodge, label = labels)
function biggest_speedup(data)
  absmin = minimum(map(x-> minimum(x).time, data))
  absmax = maximum(map(x-> minimum(x).time, data))
  absmax / absmin
end
function findbiggest()
  maxidiff = 0.0
  for data in (data_dedicated_night, data_shared_night)
    for lang in ("c", "julia")
      for bench in ("mandel", "nbody", "fannkuch")
        diff = biggest_speedup(data[lang][bench])
        maxidiff = max(diff, maxidiff)
      end
    end
  end
  maxidiff
end