The Julia Language Challenge

I've been using the Julia language for around 5 years now, and I've grown rather fond of it. By now, I can't really imagine how to do the things I'm doing in any other language. This could, of course, be a misconception and the result of not knowing any other language well enough - a form of the Dunning–Kruger effect. But my impression is that Julia is the best language to write high performance numeric libraries for machine learning, data science, solvers, optimizations, etc.... 

Because of that, I get into a lot of discussions that go, "Why Julia? We always just use a scripting language and write the fast part in C/C++," and all different combinations of that argument. I usually quickly reply that this approach ends up in a lot of effort and it doesn't actually compose very well to mix a low-level language with a scripting language! One ends up with a difficult to maintain C/C++ library, and a scripting language that can't be used for all tasks, since the slow speed becomes a bottleneck. Or the C/C++ library doesn't work with the high-level constructs that you learned to love. Even Tensorflow and Google have acknowledged this, and are trying to rewrite Tensorflow in one language, namely, Swift! 

But how close to the truth am I, really?  How big is the difference? And how fast and composable can an implementation become in a modern C++?
Obviously, I can't just sit down and learn how to write the best and most elegant code in another language - it would take years until I reached the level of my current Julia skills. This is where the Julia challenge comes into play!
I put together a reference implementation for a problem that nicely illustrates the fundamental principles which make Julia so productive and scalable for numeric libraries. It's the foundation that allows one to freely combine packages and still get optimal performance. (If you're curious about such packages, have a look at this article: Some State of the Art Packages in Julia 1.0.)
I can't really imagine writing those in any other language, so I dare you to teach me! Use Python + Numba, Python + C, Swift, Cython, Go, Lua-Jit, Rust, D - surprise us! If all works out, this can be a great learning experience for everyone following the challenge! :) 

1.
Rules

There are 3 categories: developer effort, performance and extensibility.
A successful implementations needs to shine in all 3 categories.

2.
The Problem: Broadcasting

A common problem is that you have a bunch of arrays and scalars, and want to apply a function element-wise over the arrays. This is an extension of the regular map function:

In many programming languages, map is the name of a higher-order function that applies a given function to each element of a list, returning a list of results in the same order. It is often called apply-to-all when considered in functional form.

Julia

It's a fundamental operation for any array library, and is used heavily in machine learning and any other area in computer science. If you were in the situation of needing to implement a Numpy-like library, this would be one of your first challenges.

It's such a basic operation that Julia has its own syntax for it. You can apply any function element-wise by just putting a dot in front of it:

x = fill(0.0, (4, 4)) # 4x4 array of zeros
y = [1, 2, 3, 4] # 4 element vector
z = 2 # a scalar
# a user defined function, let's not make it complicated
user_defined(x) = x + 1 
# y's 1st dimension gets repeated for the 2nd dimension in x
# and the scalar z get's repeated for all dimensions
# the below is equal to 
# `broadcast(+, broadcast(+, broadcast(user_defined, x), y), z)`
println(user_defined.(x) .+ y .+ z)
# or try column vector with a row vector
A = 1:4
b = A'
println(A .+ b) # gives a 4x4 matrix
# Or it can be used quite general purpose:
x = ComplexF64[1im, 2im, 3im]
println(getfield.(x, :im)) # or just use imag.(x)

1.4s

What makes it difficult to implement is that you can really throw anything at it! It works for n-arguments, n-dimensional arrays with arbitrary user types and functions - and pretty much needs to have optimal performance in all cases.

What I haven't even mentioned yet is that Julia takes multiple broadcast calls, e.g. user_defined.(x) .+ y .+ z, and fuses them into one loop without any temporary allocations. This mechanism is customizable, and one can influence how to fuse the call tree to allow domain specific optimizations. This is possible because Julia's implementation allows you to overload it, and passes you a lazy representation of the broadcast expression which you're free to modify appropriately. 

So, the spec for our little toy broadcast implementation is the following:

Needs to work for n arguments
Needs to work for any combinations of scalars and n-dimensional arrays
Inside the implementation, one needs to have access to the whole call tree and it should be possible to rewrite that tree or flatten it into one loop
User defined functions and types need to be inlined and SIMD acceleration should be used whenever possible

3.
Reference Implementation

The below snippet implements a (quite) fully featured lazy, n-dimensional, n-argument broadcast. You can actually play around with the implementation by creating a Nextjournal account with signup code: juliacon, and then clicking edit in the right corner of the article!

1.1s

import Base: getindex, iterate, axes, eachindex, tail, @propagate_inbounds
struct LazyBroadcast{F, Args}
    f::F
    args::Args
end
br_getindex(scalar, I) = scalar # Scalars no need to index them
@propagate_inbounds function br_getindex(A::AbstractArray, I)
    idx = ntuple(i-> ifelse(size(A, i) === 1, 1, I[i]), Val(ndims(A)))
    return A[CartesianIndex(idx)]
end
@propagate_inbounds function br_getindex(x::LazyBroadcast, I)
    # this could be a map, but the current map in 1.0 has a perf problem
    return x.f(getindex_arg(x.args, I)...) 
end
getindex_arg(args::Tuple{}, I) = () # recursion ancor
@propagate_inbounds function getindex_arg(args::NTuple{N, Any}, I) where N
    return (br_getindex(args[1], I), getindex_arg(tail(args), I)...)
end
@propagate_inbounds getindex(x::LazyBroadcast, I) = br_getindex(x, Tuple(I))
function materialize!(out::AbstractArray, call_tree::LazyBroadcast)
    # an n-dimensional simd accelerated loop
    @simd for i in CartesianIndices(axes(out)) 
        @inbounds out[i] = call_tree[i]
    end
    return out
end
br_construct(x) = x
function br_construct(x::Expr)
    x.args .= br_construct.(x.args) # apply recursively
    if Meta.isexpr(x, :call) # replace calls to construct LazyBroadcasts
        x = :(LazyBroadcast($(x.args[1]), ($(x.args[2:end]...),)))
    end
    x
end
# macro to enable the syntax @broadcast a + b - sin(c) to construct our type
macro broadcast(call_expr) 
    esc(br_construct(call_expr))
end

1.1s

A first prototype took me half an hour to write, and a further 1.5 hours to optimize all cases to match Julia's SIMD optimized broadcast implementation in terms of performance. This should be the baseline for truly winning this challenge.

4.
Winning Criteria For Implementation

4.1.
Solution

Code must not be much longer while still being readable. Exemptions for languages with tiny standard libraries.
One shouldn't need to spend a lot of time to implement this, so that it can be easily rewritten or improved.
materialize! needs to have access to the full call tree, and should be able to rewrite the tree without losing any performance.
It should be possible to construct the lazy broadcast object as close to the actual calls as possible as in @broadcast a + b - sin(c)

4.2.
Performance

To win the performance part, your implementation needs to be at least as fast as Julia's base broadcast implementation for arbitrary dimensions and argument combinations. That means it needs to use SIMD acceleration and compile away all those high level, lazy indexing constructs. It also needs to work with user defined types and functions that are out of the control of your implementation!

using BenchmarkTools

reference(out, a, b, c) = (out .= a .+ b .- sin.(c))

a = rand(1000, 1000);
b = rand(1000);
c = 1.0
out1 = similar(a);
out2 = similar(a);
br = @broadcast a + b - sin(c)

@btime materialize!($out1, $br)
@btime reference($out2, $a, $b, $c)
@assert out1 == out2
nothing

24.4s

4.3.
Types & Functions From Different Packages

It's mandatory to use an existing package without changing it. This can also be emulated by putting it in an isolated module not using any constructs defined in our implementation. This is necessary in order to make Julia's packages talk to each other and combine already existing code just like Lego. So it's important to use a normal, user defined type/object/class in the language that you target!

# Library A - I chose GeometryTypes. 
pkg"add GeometryTypes" # use package manager to install the package
# Any library with NVectors specializing to the length will do!
using GeometryTypes
# You need to take my word for it, that this is just a normal Julia struct
# Or you can inspect it with dump(Point3)
const Point3 = Point{3, Float32} 

# function needs to come from different library
module LibraryB
  # no odd stuff, no functors, no special lambda expression! 
  # this function needs to be a normal language function
  # as can be found in the wild
	super_custom_func(a, b) = sqrt(sum(a .* b))
end
# emulate that the function comes from a different library
using .LibraryB: super_custom_func

11.2s

using BenchmarkTools

a = rand(Point3, 10^6)
b = rand(Point3, 10^6)
out = fill(0f0, 10^6)

@btime $out .= super_custom_func.($a, $b)
br = @broadcast super_custom_func(a, b)
@btime materialize!($out, $br)
nothing

27.1s

4.4.
Feed Our Type Into Other Functions

The inverse of the above: now it's time for our type to be consumed by other functions. For simplicity, we just use Julia's Base sum function, which should already get us all the SIMD accelerated speed we wish for. For it to work, we just need to implement the iterator protocol.

Implementing the iterator protocol for our lazy broadcast is a bit more challenging, since now we actually need to figure out its shape so we can iterate over it (we got around that before by using an output array).

# Simplified implementation to take the axes of the array with the largest 
# dimensionality (axes -> the index range an array iterates over)
biggest(a, b, c, rest...) = biggest(biggest(a, b), biggest(c, rest...))
biggest(a::NTuple{N1, Any}, b::NTuple{N2, Any}) where {N1, N2} = 
	ifelse(N1 > N2, a, b)
biggest(a) = a
flatten_args(t::LazyBroadcast, rest...) = 
	(flatten_args(t.args...)..., flatten_args(rest...)...)
flatten_args(t::Any, rest...) = (t, flatten_args(rest...)...)
flatten_args() = ()
# the indexing axes of our array
axes(br::LazyBroadcast) = biggest(map(axes, flatten_args(br))...)
# lazy view that can be used to index all elements in br
eachindex(br::LazyBroadcast) = CartesianIndices(axes(br))

1.0s

With that, we can overload Julia's iteration protocol:

iterate(br::LazyBroadcast) = iterate(br, (eachindex(br),))
@propagate_inbounds function iterate(bc::LazyBroadcast, s)
    istate = iterate(s...) # iterate indices
    istate === nothing && return nothing # if done
    i, newstate = istate
    return (bc[i], (s[1], newstate))
end

0.7s

Which now enables us to feed LazyBroadcast into any function that accepts iterators! Important to win the challenge: the function we use needs to be out of our control and must be part of some standard library:

@btime sum($br) 
@btime sum($out .= super_custom_func.($a, $b))

@assert sum(br) ≈ sum(super_custom_func.(a, b))

nothing

23.7s

5.
Bonus Points

5.1.
Multithreading

Now let's write a simple multi-threaded materialize!, just because we can :) 

using Statistics
function threaded_materialize!(out::AbstractArray, x::LazyBroadcast)
    Threads.@threads for i in CartesianIndices(axes(out))
        @inbounds out[i] = x[i]
    end
    return out
end
a = @benchmark threaded_materialize!($out, $br)
b = @benchmark materialize!($out, $br)
hyper_threads = Sys.CPU_THREADS
# intels hyper threads don't really count in such intense calculations
cores = hyper_threads / 2
# scaling
println("best scaling ", minimum(b).time / minimum(a).time / cores)
println("average scaling ", mean(b).time / mean(a).time / cores)
nothing

23.4s

Next bonus point could be to write a simple GPU-accelerated materialize! To not further bloat this article, I'll leave that for another challenge.

6.
Submission

This is it - if you get through this just fine with your language, I'd be delighted to see your code! If you don't, but have some interesting fragments going, please publish them as well.

Results are accepted as a pull request (PR) to: https://github.com/SimonDanisch/julia-challenge. The Github repository can also be used to turn this into a community effort. Just make a PR with an incomplete solution and continue polishing it with others.

 Good luck and happy coding!

The Julia Language Challenge

1. Rules

2. The Problem: Broadcasting

3. Reference Implementation

4. Winning Criteria For Implementation

4.1. Solution

4.2. Performance

4.3. Types & Functions From Different Packages

4.4. Feed Our Type Into Other Functions

5. Bonus Points

5.1. Multithreading

6. Submission

1.
Rules

2.
The Problem: Broadcasting

3.
Reference Implementation

4.
Winning Criteria For Implementation

4.1.
Solution

4.2.
Performance

4.3.
Types & Functions From Different Packages

4.4.
Feed Our Type Into Other Functions

5.
Bonus Points

5.1.
Multithreading

6.
Submission