A fresh approach to N-Body problems

N-body problems describe the motion of a system of N mutually interacting particles or bodies. They first appeared in astronomy as a way of describing the motion of the planets, but the same formalism also applies to molecular dynamics and other systems of entities interacting via potentials. The simplest case, N=2, has an analytical solution, but from N=3 onwards there is no general analytical solution and numerical methods are required.

Mathematically, this means solving a system of ordinary differential equations (ODEs) constructed by computing the forces on each particle of the system. Since in most cases of interest N is very large, the resulting system of ODEs is also large, which in turn places performance constraints on the numerical methods we can use to solve it.
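Concretely, for bodies with masses $m_i$ and positions $\vec{r}_i$, Newton's second law gives

$$ m_i \frac{d^2 \vec{r}_i}{dt^2} = \vec{F}_i = \sum_{j \neq i} \vec{F}_{ij}, \qquad i = 1, \dots, N, $$

where $\vec{F}_{ij}$ is the force exerted on body $i$ by body $j$, so in three dimensions every additional body adds three second-order ODEs to the system.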

Moreover, in most cases the equations of motion have a special structure that preserves some invariants, such as the total energy of the system. We therefore need a method with both good conservation properties and good performance. For these reasons, the most widely used numerical method in practice is the Verlet integrator (or variations such as the leapfrog method).
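For reference, velocity Verlet advances the positions and velocities as

$$ \vec{r}_{n+1} = \vec{r}_n + \vec{v}_n \Delta t + \tfrac{1}{2} \vec{a}_n \Delta t^2, \qquad \vec{v}_{n+1} = \vec{v}_n + \tfrac{1}{2} \left( \vec{a}_n + \vec{a}_{n+1} \right) \Delta t, $$

requiring a single force evaluation per step while preserving the symplectic structure of the dynamics, which explains its good long-term energy behaviour.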

In Julia, the answer to solving large systems of ODEs is to use the differential equation integrators provided by the SciML ecosystem. If we do not want to construct the system of equations ourselves, we can use one of the dedicated N-body problem packages in Julia to simplify the task. One of them is NBodySimulator.jl, which can use the vast number of integrators in the SciML ecosystem.

NBodySimulator.jl benchmarks

One question that arises naturally is whether there are other numerical methods that could be used besides the Verlet method. To answer it, I started benchmarking several integrators to see how they behave. As a test problem, I chose one of the examples from the NBodySimulator.jl package, a liquid argon simulation. The benchmark can be seen in the SciMLBenchmarks.jl repository here.

As the benchmark requires a very long runtime and significant computing power, it would be counter-productive to reproduce it here, so I will present only the main results.

When we solve a system of ODEs numerically, we are implicitly converting a continuous dynamical system into a discrete one. It is important to note that the mathematical properties of the continuous system are not necessarily shared by its discrete version. Thus, if we want to preserve the qualitative features of the system, we should choose an integration method that gives the discrete system properties as close as possible to those of the original.

In the case of classical mechanics, examples of such properties that we want to preserve are energy conservation and the symplectic structure. The family of symplectic integrators is of particular interest here, as preserving the symplectic structure indirectly helps with energy conservation.

Comparing most of the symplectic methods in OrdinaryDiffEq.jl (see the complete list here) in terms of runtime vs energy error gives the following result.

As we can see, the low order methods generally perform better than the high order ones in this case. We can also see that CalvoSanz4 and McAte2 are competitive with the Verlet method.
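Trying a different method in such a benchmark essentially amounts to passing another integrator to run_simulation. A minimal sketch (assuming a liquid argon NBodySimulation has already been constructed as sim and a fixed timestep τ chosen, as in the setup shown later in this post):

using NBodySimulator, OrdinaryDiffEq

# Extra arguments to run_simulation are forwarded to the underlying ODE solver,
# so switching integrators is just a matter of passing a different method.
res_verlet = run_simulation(sim, VelocityVerlet(), dt=τ)
res_cs4    = run_simulation(sim, CalvoSanz4(), dt=τ)
res_mcate2 = run_simulation(sim, McAte2(), dt=τ)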

Besides symplectic integrators, there is another family of integrators specialized for second-order ODEs, the Runge-Kutta-Nyström methods, which can be more efficient in certain conditions thanks to adaptive time stepping, but which do not conserve energy as well.

Comparing them to the best symplectic integrators from the previous benchmark gives

As we can see above, the symplectic methods perform significantly better on energy error, but they are slower.

Molly.jl optimizations

During the development of these benchmarks, I discovered another Julia package targeting N-body simulations, Molly.jl. This package was built for molecular dynamics and was thus more advanced than NBodySimulator.jl in those aspects. After a discussion with my JSoC mentor, I decided to start working on this package (together with its author) to optimize and enhance it so that it can become a future successor of NBodySimulator.jl.

In order to see the performance impact of the optimizations, we will consider two test problems (taken from the tests of the package): a Lennard-Jones gas and a peptide simulation with more complex interactions.

## Lennard-Jones gas
using Molly
using Test
temperature = 298
timestep = 0.002
n_steps = 1000
box_size = 2.0
n_atoms = 100
s = Simulation(
    simulator=VelocityVerlet(),
    atoms=[Atom(attype="Ar", name="Ar", resnum=i, resname="Ar", charge=0.0,
                mass=10.0, σ=0.3, ϵ=0.2) for i in 1:n_atoms],
    general_inters=Dict("LJ" => LennardJones(true)),
    coords=[box_size .* rand(SVector{3}) for i in 1:n_atoms],
    velocities=[velocity(10.0, temperature) .* 0.01 for i in 1:n_atoms],
    temperature=temperature,
    box_size=box_size,
    neighbour_finder=DistanceNeighbourFinder(trues(n_atoms, n_atoms), 10, 2.0),
    thermostat=AndersenThermostat(10.0),
    loggers=Dict("temp" => TemperatureLogger(100),
                    "coords" => CoordinateLogger(100)),
    timestep=timestep,
    n_steps=n_steps
)
@time simulate!(s, parallel=false)
# commit b7d87b241e8422284f96dcd2dec2b2a8b310f2e8 (before #5): 4.007033 seconds (151.04 M allocations: 6.070 GiB, 16.34% gc time)
# commit 92308291a316d58c9e5569369edde7b3d4624831 (v0.1.0): 0.179878 seconds (34.52 k allocations: 29.653 MiB)
# commit 45c8a78fce5491373131a6dc1049c55fee9d34df (current master): 0.145809 seconds (33.14 k allocations: 4.652 MiB)
# commit 8b2b220dc5630aeaecb57459655186a56dbd8c74 (unify): 0.159678 seconds (33.05 k allocations: 4.649 MiB)
using BenchmarkTools
# new version (master)
# @btime Molly.accelerations($s, parallel=false)
# old version (up to v0.1.0)
n = find_neighbours(s, nothing, s.neighbour_finder, 0, parallel=false)
@btime Molly.accelerations($s, $n, parallel=false);
# commit b7d87b241e8422284f96dcd2dec2b2a8b310f2e8 (before #5): 3.022 ms (151773 allocations: 6.20 MiB)
# commit 92308291a316d58c9e5569369edde7b3d4624831 (v0.1.0): 123.643 μs (1 allocation: 2.50 KiB)
# commit 45c8a78fce5491373131a6dc1049c55fee9d34df (current master): 121.729 μs (1 allocation: 2.50 KiB)
# commit 8b2b220dc5630aeaecb57459655186a56dbd8c74 (unify): 134.209 μs (1 allocation: 2.50 KiB)

The first part of the benchmark measures the complete simulation time, while the second part measures the time for a single call of the accelerations function (which is the most important part of a time step).

## Peptide
using Molly
using Test
timestep = 0.0002
n_steps = 100
temp = 298
atoms, specific_inter_lists, general_inters, nb_matrix, coords, box_size = readinputs(
            normpath(dirname(pathof(Molly)), "..", "data", "5XER", "gmx_top_ff.top"),
            normpath(dirname(pathof(Molly)), "..", "data", "5XER", "gmx_coords.gro"))
true_n_atoms = 5191
@test length(atoms) == true_n_atoms
@test length(coords) == true_n_atoms
@test size(nb_matrix) == (true_n_atoms, true_n_atoms)
@test length(specific_inter_lists) == 3
@test length(general_inters) == 2
@test box_size == 3.7146
s = Simulation(
    simulator=Molly.VelocityVerlet(),
    atoms=atoms,
    specific_inter_lists=specific_inter_lists,
    general_inters=general_inters,
    coords=coords,
    velocities=[velocity(a.mass, temp) .* 0.01 for a in atoms],
    temperature=temp,
    box_size=box_size,
    neighbour_finder=DistanceNeighbourFinder(nb_matrix, 10),
    thermostat=Molly.AndersenThermostat(10.0),
    loggers=Dict("temp" => TemperatureLogger(10)),
    timestep=timestep,
    n_steps=n_steps
)
@time simulate!(s, parallel=false)
# commit b7d87b241e8422284f96dcd2dec2b2a8b310f2e8 (before #5): 105.474686 seconds (4.20 G allocations: 162.335 GiB, 16.98% gc time)
# commit 92308291a316d58c9e5569369edde7b3d4624831 (v0.1.0): 6.870994 seconds (1.77 M allocations: 535.806 MiB, 0.65% gc time)
# commit 45c8a78fce5491373131a6dc1049c55fee9d34df (current master): 7.695775 seconds (1.77 M allocations: 233.411 MiB, 0.35% gc time)
# commit 8b2b220dc5630aeaecb57459655186a56dbd8c74 (unify): 8.374999 seconds (14.49 M allocations: 691.883 MiB, 0.83% gc time)
using BenchmarkTools
# new version (master)
# @btime Molly.accelerations($s, parallel=false)
# old version (up to v0.1.0)
n = find_neighbours(s, nothing, s.neighbour_finder, 0, parallel=false)
@btime Molly.accelerations($s, $n, parallel=false);
# commit b7d87b241e8422284f96dcd2dec2b2a8b310f2e8 (before #5): 903.236 ms (41561843 allocations: 1.60 GiB)
# commit 92308291a316d58c9e5569369edde7b3d4624831 (v0.1.0): 49.141 ms (16351 allocations: 1.29 MiB)
# commit 45c8a78fce5491373131a6dc1049c55fee9d34df (current master): 57.982 ms (16351 allocations: 1.56 MiB)
# commit 8b2b220dc5630aeaecb57459655186a56dbd8c74 (unify): 63.042 ms (36908 allocations: 3.39 MiB)

In order to see the performance evolution, I locally ran the above code on the commits mentioned in the comments and recorded the results (the absolute values are not relevant in this case).

julia> versioninfo()
Julia Version 1.5.0
Commit 96786e22cc* (2020-08-01 23:44 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "/opt/visual-studio-code/code"
  JULIA_NUM_THREADS = 6

The first data point is before my first pull request to Molly.jl; the second one is after some changes aimed at eliminating dynamic dispatch and type instability. As we can observe, these were the most significant performance improvements.

The next step was to improve the testing infrastructure by checking energy conservation during simulations. This required implementing the potentials in addition to the forces for each type of interaction. I also made some performance improvements targeting non-bonded interactions (which can be seen in the speedup in the Lennard-Jones test). This gives the situation at the third data point.

The last data point is from an ongoing PR by Joe Greener (@jgreener64), which aims to make Molly work on GPUs and to be differentiable via Zygote.jl.

Optimization details

In this section I will give more details about the optimizations that I implemented in Molly. The first round of optimizations was aimed at eliminating dynamic dispatch, which was occurring during the accelerations call, one of the most frequently called functions during the simulation. This was achieved by ensuring that the Simulation structure has concrete types for its fields, which lets type inference determine the types of expressions like s.atoms and thus eliminates dynamic dispatch.
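As an illustration (a simplified sketch, not Molly's actual definition), the difference is between a field whose element type is abstract and one whose concrete type is carried in a type parameter:

# Abstractly typed field: the element type of atoms is unknown to the compiler,
# so every access such as sim.atoms[i].mass triggers dynamic dispatch.
struct SlowSim
    atoms::Vector{Any}
end

# Parametric field: the concrete atom type A is part of the struct's type,
# so accesses are fully inferred and dispatch is resolved at compile time.
struct FastSim{A}
    atoms::Vector{A}
end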

Another optimization was part of the energy function PR and was aimed at the vector function, which computes the displacement between two coordinates while taking periodic boundary conditions into account. Since almost all potentials and forces depend on the distance between particles, this is one of the most important functions in the simulation to optimize. The version in the package (at v0.1.0) uses broadcasting to compute the result.

vector(c1, c2, box_size::Real) = vector1D.(c1, c2, box_size)
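Here vector1D computes the displacement along a single dimension under the minimum image convention. A minimal sketch of such a function (not necessarily Molly's exact implementation, assuming the coordinates already lie inside the box):

function vector1D(c1, c2, box_size)
    dx = c2 - c1
    if dx > box_size / 2
        dx - box_size       # the periodic image of c2 on the other side is closer
    elseif dx < -box_size / 2
        dx + box_size
    else
        dx
    end
end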

Since Molly uses StaticArrays to represent coordinates for performance (in the CPU case), we can add a new method specialized on StaticArrays as a performance optimization while retaining the generic fallback. What could such an optimization look like?

The simplest way to optimize the above code would be to directly hardcode the resulting SVector

vector(c1::SVector{3}, c2::SVector{3}, box_size::Real) =
    SVector{3}(vector1D(c1[1], c2[1], box_size),
               vector1D(c1[2], c2[2], box_size),
               vector1D(c1[3], c2[3], box_size))

But this is not as general as the previous definition. If we wanted to use this optimization for 2D simulations as well, we would have to write another method. What can we do to keep generality in this case?

The answer is to use the compiler to write the code for us. This can be done via metaprogramming, more specifically with generated functions.

@generated function vector(c1::SVector{N}, c2::SVector{N}, box_size::Real) where N
    quote
        Base.Cartesian.@ncall $N SVector{$N} i->vector1D(c1[i], c2[i], box_size)
    end
end
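For N = 3 the generated body expands to essentially the hardcoded version above, roughly

# Illustration of what @ncall produces for N = 3:
SVector{3}(vector1D(c1[1], c2[1], box_size),
           vector1D(c1[2], c2[2], box_size),
           vector1D(c1[3], c2[3], box_size))

but the same definition now also covers 2D simulations (SVector{2} coordinates) without any extra code.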

We can use similar techniques for other optimizations, such as eliminating if checks by encoding information in type parameters. For example, there is an option to check whether the force between two particles is too big and to truncate it if necessary. This can help in the case of incorrectly chosen initial conditions that place particles too close together.

if @generated
  quote
    if !(F === Nothing)
      f = min(f, cutoff.max_force)
    end
  end
else
  if !(F === Nothing)
    f = min(f, cutoff.max_force)
  end
end

The above shows the use of optionally generated functions to eliminate the if check completely from the generated code when the F type parameter is Nothing.

Comparing with NBodySimulator.jl

After optimizing the Molly.jl code, a natural question is how it compares with NBodySimulator.jl. To this end, we can take a test problem and simulate it with both packages to see the differences.

using BenchmarkTools, NBodySimulator
using StaticArrays
using Molly
using NBodySimulator: gather_bodies_initial_coordinates
using NBodySimulator: obtain_data_for_lennard_jones_interaction, pairwise_lennard_jones_acceleration!
using NBodySimulator: gather_accelerations_for_potentials, gather_simultaneous_acceleration
const T = 289.0 # K
const kb = 8.3144598e-3 # kJ/(K*mol)
const ϵ = T * kb
const σ = 0.34 # nm
const ρ = 100
const N = 300
const m = 10.0
const L = (m*N/ρ)^(1/3) # 10.229σ
const R = 0.5*L
const v_dev = sqrt(kb * T / m)

We will consider the NBodySimulator setup as the reference and create an equivalent one in Molly. We will benchmark both the total simulation time and the acceleration computation in one step.

function nb_setup(τ, t)
    bodies = generate_bodies_in_cell_nodes(N, m, v_dev, L)
    
    lj_parameters = LennardJonesParameters(ϵ, σ, R)
    lj_system = PotentialNBodySystem(bodies, Dict(:lennard_jones => lj_parameters));
    
    pbc = CubicPeriodicBoundaryConditions(L)
    t1 = 0.0
    t2 = t
    simulation = NBodySimulation(lj_system, (t1, t2), pbc, kb)
end
function molly_setup(τ, t)
    simulation = nb_setup(τ, t)
    u0, v0, n = gather_bodies_initial_coordinates(simulation)
    atoms = [Atom(mass=m, σ=σ, ϵ=ϵ) for i=1:N]
    coords = [SVector{3}(u0[:,i]) for i=1:N]
    velocities = [SVector{3}(v0[:,i]) for i=1:N]
    #coeff = R/σ
    #cutoff = ShiftedPotentialCutoff(nothing, coeff, coeff^2, inv(coeff))
    general_inters = (LennardJones(false),)
    
    Simulation(
        simulator=Molly.VelocityVerlet(),
        atoms=atoms,
        general_inters=general_inters,
        coords=coords,
        velocities=velocities,
        temperature=T,
        box_size=L,
        loggers=Dict("temp" => TemperatureLogger(100)),
        timestep=τ, # ps
        n_steps=t/τ
    )
end
function acceleration(simulation)
    u0, v0, n = gather_bodies_initial_coordinates(simulation)
    acceleration_functions = gather_accelerations_for_potentials(simulation)
    simultaneous_acceleration = gather_simultaneous_acceleration(simulation)
    function soode_system!(dv, v, u, p, t)
        @inbounds for i = 1:n
            a = MVector(0.0, 0.0, 0.0)
            for acceleration! in acceleration_functions
                acceleration!(a, u, v, t, i);
            end
            dv[:, i] .= a
        end
        for acceleration! in simultaneous_acceleration
            acceleration!(dv, u, v, t);
        end
    end
    return soode_system!
end

We can now initialize the setups and benchmark the accelerations

τ = 0.5e-3 # ps or 1e-12 s
t = 10_000τ
s = molly_setup(τ, t)
nb_sim = nb_setup(τ, t)
f = acceleration(nb_sim)
u0, v0, n = gather_bodies_initial_coordinates(nb_sim);
# Benchmarks
n = find_neighbours(s, nothing, s.neighbour_finder, 0, parallel=false)
@btime Molly.accelerations($s, $n, parallel=false)
@btime $f(dv, $v0, $u0, nothing, 0.0) setup=(dv=zero(v0))

As we can see, the Molly version is over 2x faster than the NBodySimulator one. Let us now look at the full simulation time.

@btime simulate!(s, parallel=false) setup=(s=molly_setup($τ, $t))
@btime run_simulation(s) setup=(s=nb_setup($τ, $t));

In this case, the NBodySimulator version is almost 2x faster. This indicates that its time integration is more efficient, which was expected, as the package uses the high performance integrators from OrdinaryDiffEq. A future integration of these integrators with Molly is planned.

Potential cutoffs

Theory

The energy function PR added the possibility to record the energy of the system during its evolution. While computing the energy of the system seems quite simple at first sight, a more subtle problem appears when we take a closer look at the potentials.

The total potential energy of a system is given as a sum of the individual inter-particle potentials

$$ V = \sum_{i} \sum_{j > i} V(r_{ij}), \qquad r_{ij} = |\vec{r}_i - \vec{r}_j| $$

The forces acting on the particles are given by

$$ \vec{F}_i = -\nabla_{\vec{r}_i} V = -\sum_{j \neq i} \frac{dV(r_{ij})}{dr_{ij}} \, \hat{r}_{ij} $$

In the case of the Lennard-Jones potential, the inter-particle potential is given by

$$ V_{LJ}(r) = 4\varepsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right] $$

and the forces are given by

$$ F_{LJ}(r) = -\frac{dV_{LJ}}{dr} = \frac{24\varepsilon}{r} \left[ 2 \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right] $$

As the potential, and thus also the force, decreases rapidly with distance, almost every implementation of the Lennard-Jones force calculation uses a cutoff radius beyond which the force is set to 0.

While this sounds like a very sensible approach, it introduces a discontinuity in the force function and it also requires us to modify the potential, since beyond the cutoff radius the force is 0 while the derivative of the unmodified potential is not. One way to truncate the potential is to shift it by its value at the cutoff.

$$ V_{SP}(r) = \begin{cases} V(r) - V(r_c) & r \leq r_c \\ 0 & r > r_c \end{cases} $$

This way the potential function is continuous and the relation between forces and potentials is satisfied. This truncation method is called shifted potential cutoff.

Another option is to shift the force in order to make it continuous

$$ F_{SF}(r) = \begin{cases} F(r) - F(r_c) & r \leq r_c \\ 0 & r > r_c \end{cases} $$

This requires a more complicated change in the potential in order to satisfy the relation between them. This method is called the shifted force cutoff. The continuity of the force is desirable as it may give better energy conservation properties as shown in Toxvaerd 2011.
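For completeness, one consistent choice of the modified potential (so that the force is exactly $-dV_{SF}/dr$ and both vanish at the cutoff) is

$$ V_{SF}(r) = \begin{cases} V(r) - V(r_c) + (r - r_c)\, F(r_c) & r \leq r_c \\ 0 & r > r_c \end{cases} $$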

There are also more complicated truncation methods that interpolate between the original potential and 0, but we will consider those two for the moment.

The truncation approximations that we use can significantly alter the qualitative features of the simulation as shown in many articles in the molecular dynamics literature (Fitzner 2017, van der Spoel 2006 and others).

Implementation

After the brief theoretical review above, we can focus on the implementation of cutoffs in Molly. As the truncation approximation can be theoretically applied to any sufficiently rapidly decaying potential, it is important to decouple the potential implementation from the cutoff approximation.

In general, what the cutoffs share is a number of points at which the force expression changes. If we have no cutoff, that number is 0; for a shifted force or shifted potential cutoff it is 1; and in more complex cases there can be 2 points at which the force expression changes.

Because of the nature of the cutoff, we must exit early beyond the cutoff radius, which means returning from inside the force function corresponding to the potential. In order to decouple the cutoff approximation from the potential, we can use the following

if cutoff_points(C) == 0
  f = force(inter, r2, inv(r2), params)
elseif cutoff_points(C) == 1
  sqdist_cutoff = cutoff.sqdist_cutoff * σ2
  r2 > sqdist_cutoff && return
  f = force_cutoff(cutoff, r2, inter, params)
elseif cutoff_points(C) == 2
  sqdist_cutoff = cutoff.sqdist_cutoff * σ2
  activation_dist = cutoff.activation_dist * σ2
  r2 > sqdist_cutoff && return
  if r2 < activation_dist
    f = force(inter, r2, inv(r2), params)
  else
    f = force_cutoff(cutoff, r2, inter, params)
  end
end

This way we can have any number of cutoff approximations for the same potential without needing to rewrite the code each time. A similar technique is used for the potential.

For the cutoff implementations we now just have to dispatch on the cutoff type

cutoff_points(::Type{ShiftedForceCutoff{F, T}}) where {F, T} = 1
force_cutoff(::NoCutoff, r2, inter, params) = force(inter, r2, inv(r2), params)
potential_cutoff(::NoCutoff, r2, inter, params) = potential(inter, r2, inv(r2), params)
function force_cutoff(cutoff::ShiftedPotentialCutoff{F}, r2, inter, params) where F
    f = force(inter, r2, inv(r2), params)
    if @generated
        quote
            if !(F === Nothing)
                f = min(f, cutoff.max_force)
            end
        end
    else
        if !(F === Nothing)
            f = min(f, cutoff.max_force)
        end
    end
    return f
end
function potential_cutoff(cutoff::ShiftedPotentialCutoff, r2, inter, params)
    potential(inter, r2, inv(r2), params) - potential(inter, cutoff.sqdist_cutoff, cutoff.inv_sqdist_cutoff, params)
end

As we can see, with this design the relationship between the force and the potential is enforced by dispatching on the same type in both the force_cutoff function and the potential_cutoff one.
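To illustrate the extensibility of this design, a new truncation scheme only needs to define these dispatch methods for its own type. A hypothetical sketch (the type and its fields below are made up for illustration, they are not part of Molly):

# A hypothetical hard truncation: keep the unmodified force and potential up to
# the cutoff and rely on the early return beyond it (note that this leaves the
# potential discontinuous at the cutoff).
struct HardTruncationCutoff{T}
    sqdist_cutoff::T
    inv_sqdist_cutoff::T
end

cutoff_points(::Type{HardTruncationCutoff{T}}) where {T} = 1
force_cutoff(::HardTruncationCutoff, r2, inter, params) = force(inter, r2, inv(r2), params)
potential_cutoff(::HardTruncationCutoff, r2, inter, params) = potential(inter, r2, inv(r2), params)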

As a further performance optimization, we could also use optionally generated functions in the force implementation to insert only the code corresponding to the desired number of cutoff points at compile time.

Outlook

As the next steps for this project, I would like to help integrate Molly with the integrators in OrdinaryDiffEq. To this end, I created the ParticleAccelerations.jl package, which will factor out the summation algorithms for forces and potentials and provide a way to decouple the summation implementations from the interaction implementations (which will live in Molly). An interesting aspect to tackle would be to implement Ewald summation, which could improve the efficiency of the summation algorithm.
