Fast fixed-point sine and cosine approximation with Julia

I needed a fast fixed-point sine and cosine approximation in Julia. The accuracy was secondary.

Let's create a reference benchmark using floating-point operations first:

using BenchmarkTools
sinsf = Vector{Float64}(undef, 1000); cossf = Vector{Float64}(undef, 1000)
function bench_sincos(sins, coss, freq::T) where T
  @inbounds @fastmath @simd for i = 1:length(sins)
    sins[i], coss[i] = sincos(T(2π) * freq * i)
  end
end
@btime bench_sincos($sinsf, $cossf, $0.002)

Since we do not care too much about accuracy, can we do any better if we switch to Float32?

sinsf32 = Vector{Float32}(undef, 1000); cossf32 = Vector{Float32}(undef, 1000)
@btime bench_sincos($sinsf32, $cossf32, $Float32(0.002))

Alright, that's an improvement of about 40%. Can we do even better?

The first idea was to use a look-up table (LUT). I created the LUT in such a way that the number of indices representing a complete sine/cosine wave is of the form $2^n$, where $n$ is an integer. This makes it possible to get rid of the rather costly mod operation that confines the angle to the range between $0$ and $2\pi$. Instead, you can use x & (2^n - 1), which is considerably less computationally expensive.
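As a tiny sanity check of that identity (my own example; it holds for any nonnegative integer):

x = 12345
println(x & (2^10 - 1) == mod(x, 2^10))  # true: masking the low bits replaces mod for power-of-two lengths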

const n = 10
const ampl = 10
const COS_LUT = floor.(Int, cos.(2π .* (0:1 << n - 1) ./ 1 << n) .* (1 << ampl))
const SIN_LUT = floor.(Int, sin.(2π .* (0:1 << n - 1) ./ 1 << n) .* (1 << ampl))
Base.@propagate_inbounds cos_lut(x::Int) = COS_LUT[x & (1 << n - 1) + 1]
Base.@propagate_inbounds sin_lut(x::Int) = SIN_LUT[x & (1 << n - 1) + 1]
sincos_lut(x) = (sin_lut(x), cos_lut(x))
sins_lut = Vector{Int}(undef, 1000); coss_lut = Vector{Int}(undef, 1000)
function bench_sincos_lut(sins::Vector{T}, coss::Vector{T}, freq) where T <: Integer
  fp = sizeof(T) * 8 - n - 2
  delta = floor(T, freq * 1 << (fp + n))  
  @inbounds for i = T(1):T(length(sins)) # Putting @simd here is slower?
    sins[i], coss[i] = sincos_lut((i * delta) >> fp)
  end
end
@btime bench_sincos_lut($sins_lut, $coss_lut, $0.002)

That's already 21 times faster than the original floating-point solution. What about accuracy?

using Plots
x = -1<<9:1<<9
plot(x, sin_lut.(x) ./ 1 << ampl .- sin.(2π .* x ./ 1 << n), lab="Sine error")

The accuracy could certainly be increased by using a finer quantization and discretization, at the expense of more memory. But can we gain more speed if we decrease the precision again?

const COS_LUT32 = floor.(Int32, cos.(2π .* (0:1 << n - 1) ./ 1 << n) .* (1 << ampl))
const SIN_LUT32 = floor.(Int32, sin.(2π .* (0:1 << n - 1) ./ 1 << n) .* (1 << ampl))
Base.@propagate_inbounds cos_lut(x::Int32) = COS_LUT32[x & (1 << n - 1) + one(x)]
Base.@propagate_inbounds sin_lut(x::Int32) = SIN_LUT32[x & (1 << n - 1) + one(x)]
sins_lut32 = Vector{Int32}(undef, 1000); coss_lut32 = Vector{Int32}(undef, 1000)
const COS_LUT16 = floor.(Int16, cos.(2π .* (0:1 << n - 1) ./ 1 << n) .* (1 << ampl))
const SIN_LUT16 = floor.(Int16, sin.(2π .* (0:1 << n - 1) ./ 1 << n) .* (1 << ampl))
Base.@propagate_inbounds cos_lut(x::Int16) = COS_LUT16[x & (1 << n - 1) + one(x)]
Base.@propagate_inbounds sin_lut(x::Int16) = SIN_LUT16[x & (1 << n - 1) + one(x)]
sins_lut16 = Vector{Int16}(undef, 1000); coss_lut16 = Vector{Int16}(undef, 1000)
@btime bench_sincos_lut($sins_lut32, $coss_lut32, $0.002)
@btime bench_sincos_lut($sins_lut16, $coss_lut16, $0.002)

There doesn't seem to be an improvement; it even gets a little slower. Who knows why? My guess is that the loop cannot be vectorized well with SIMD, because the LUT values are not read from consecutive memory locations.
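One way to check this hypothesis (a quick inspection, not a proof) is to look at the generated code and search for vector gather instructions:

using InteractiveUtils
@code_llvm bench_sincos_lut(sins_lut32, coss_lut32, 0.002)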

Can we still do better?

I googled for sine approximations and found this: http://www.coranac.com/2009/07/sines/ . It approximates the sine function with a polynomial and exploits the fact that the sine is an odd function: only the odd coefficients of the polynomial need to be calculated, the even ones are zero. The accuracy depends on the order of the polynomial. However, a drawback is that it only approximates the sine; the cosine has to be obtained via $\cos(x) = \sin(x + \frac{\pi}{2})$. There must be a better way to calculate both, no?
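To make the odd-only structure concrete (a plain truncated Taylor series, not the optimized coefficients from that article):

taylor_sin(x) = x - x^3/6 + x^5/120  # only odd powers: the sine is an odd function
println(taylor_sin(0.5), " vs ", sin(0.5))  # 0.4794271 vs 0.4794255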

I googled again and finally found this: http://www.olliw.eu/2014/fast-functions/ . He shifts both the cosine and the sine by $\frac{\pi}{4}$ to the left, so that the sine becomes the mirror image of the cosine about the y-axis. In this case, the approximations are given by

$$\cos(x) \approx A(x) + B(x), \qquad \sin(x) \approx A(x) - B(x),$$

where $A(x)$ contains the even terms and $B(x)$ the odd terms of the polynomial. Note that the shifted function is neither even nor odd anymore, which is why both even and odd terms appear. Before we express the problem in fixed-point arithmetic, let's check what accuracy we can expect. First we need to calculate the coefficients of the even and odd terms. They are given in the reference, but I will calculate them here again, since we need them later.
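Concretely, writing the sixth-order polynomial as $f(t) = c_1 + c_2 t + \dots + c_7 t^6$, fitted to $\cos((t + \frac{1}{2})\frac{\pi}{2})$ on $[-\frac{1}{2}, \frac{1}{2}]$, the split used below is $A(t) = c_1 + c_3 t^2 + c_5 t^4 + c_7 t^6$ and $B(t) = c_2 t + c_4 t^3 + c_6 t^5$ (my notation).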

W = [
    1   0       0       0       0       0       0     # y-axis intercept
    1   -1/2    1/4     -1/8    1/16    -1/32   1/64  # point at -1/2
    1   1/2     1/4     1/8     1/16    1/32    1/64  # point at 1/2
    0   1       -2/2    3/4     -4/8    5/16    -6/32 # derivative at -1/2
    0   1       2/2     3/4     4/8     5/16    6/32  # derivative at 1/2
    0   0       2       6/2     12/4    20/8    30/16 # 2nd derivative at 1/2
    0   0       2       -6/2    12/4    -20/8   30/16 # 2nd derivative at -1/2
]
v = [1/√2,  1,  0,  0,  -π/2,  0,  -(π/2)^2]  # target values of cos((x + 1/2)π/2) matching the rows of W
c = W \ v
7-element Array{Float64,1}:
  0.707107
 -1.11067
 -0.872348
  0.456159
  0.179252
 -0.0539105
 -0.0142718
fast_cos(x) = c[1] + c[2]*x + c[3]*x^2 + c[4]*x^3 + c[5]*x^4 + c[6]*x^5 + c[7]*x^6
x = 0:0.001:1
y_approx = fast_cos.(x .- 1 ./ 2)
y = cos.(x .* π ./ 2)
println("Max. error: ", maximum(y_approx .- y))
plot(x, y_approx .- y, lab="Cosine error")
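To see the even/odd split at work before going to fixed point, here is a quick floating-point check (my own sketch, reusing the coefficients c computed above):

even_part(t) = c[1] + c[3]*t^2 + c[5]*t^4 + c[7]*t^6  # A(t)
odd_part(t) = c[2]*t + c[4]*t^3 + c[6]*t^5            # B(t)
t = -0.3
println(even_part(t) + odd_part(t) - cos((t + 1/2) * π / 2))  # ≈ 0: cos via A + B
println(even_part(t) - odd_part(t) - sin((t + 1/2) * π / 2))  # ≈ 0: sin via A - B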

That means we can expect a maximum error of around $6.22\mathrm{e}{-6}$. If you are looking for more accuracy, you can add more coefficients to the polynomial, for example by also constraining the first and second derivative at the y-axis intercept. If we choose our fixed-point output to have 17 fixed-point bits (corresponding to a resolution of $7.63\mathrm{e}{-6}$), we can expect to confine the approximation error to the least significant bit. Now we need to transform the coefficients to fixed-point values. I did it in such a way that each multiplication uses the full bit size. You can find more on this here: https://www.nullhardware.com/blog/fixed-point-sine-and-cosine-for-embedded-systems/ . Here are the functions to calculate $A(x)$ and $B(x)$:

get_quadrant_size(x::Int64) = 18 # I have chosen 18 bits instead of 17 bits
@inline function calc_A(x::Int64)
    p = 65; r = 5; q = 63; k = 63; A = 6521908912666390528; C = -8045990844691798016
    E = 6613222665482916864; G = -64274600444504
    n = get_quadrant_size(x)
    a = 18
    x² = (x * x) >> n
    rounding = one(x) << (k - a - 1)
    y1 = (x² * G) >> r
    y2 = (E + y1) >> n
    y3 = (x² * y2) >> (p - q)
    y4 = (C + y3) >> n
    y5 = (x² * y4) >> (q - k)
    y6 = (A + rounding + y5) >> (k - a)
end
@inline function calc_B(x::Int64)
    p = 64; r = 4; q = 62; B = -5122062798018288640; D = 8414646631798223872;
    F = -60697792564407
    n = get_quadrant_size(x)
    a = 18
    x² = (x * x) >> n
    rounding = one(x) << (q - a - 1)
    y1 = (x² * F) >> r
    y2 = (D + y1) >> n
    y3 = (x² * y2) >> (p - q)
    y4 = (B + y3) >> n
    y5 = (rounding + x * y4) >> (q - a)
end
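For orientation: calc_A is simply Horner's rule applied to the even part, $A(t) \approx c_1 + t^2 (c_3 + t^2 (c_5 + t^2 c_7))$, and calc_B the analogue for the odd part, $B(t) \approx t (c_2 + t^2 (c_4 + t^2 c_6))$. The constants A, C, E, G and B, D, F are these coefficients pre-scaled to fill the 64-bit word, and the shifts undo the scaling after each multiplication. This is my reading of the code; the constants themselves come from the conversion functions shown at the end.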

At the end of the document I will show you how to calculate the fixed-point coefficients as well as the corresponding bit shifts.

The function is now valid for the first quadrant. To extend it to all four quadrants, we need to look at bits 19 and 20 of the phase. Here is a bit table showing how to select the correct quadrant:

bit 20  bit 19  |  cos    sin
  0       0     |   cos    sin
  0       1     |  -sin    cos
  1       0     |  -cos   -sin
  1       1     |   sin   -cos

Olli selects the correct sine and cosine approximation by using multiple branches. Since I'd like to avoid branches in order to make full use of SIMD, I did it the following way:

@inline mysign(x) = x >= zero(x) ? one(x) : -one(x)
@inline function get_first_bit_sign(x)
    n = get_quadrant_size(x)
    mysign(x << (sizeof(x) * 8 - n - 1))
end
@inline function get_second_bit_sign(x)
    n = get_quadrant_size(x)
    mysign(x << (sizeof(x) * 8 - n - 2))
end
@inline function get_angle(x)
    n = get_quadrant_size(x)
    x & (one(x) << n - one(x)) - one(x) << (n - 1)
end
@inline function sincos(x::Union{Int16, Int32, Int64})
    first_bit_sign = get_first_bit_sign(x)
    second_bit_sign = get_second_bit_sign(x)
    angle = get_angle(x)
    A = calc_A(angle)
    B = calc_B(angle)
    cos_approx = second_bit_sign * (first_bit_sign * A + B)
    sin_approx = second_bit_sign * (A - first_bit_sign * B)
    sin_approx, cos_approx
end
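Before looking at the sign handling in detail, here is a quick spot check (my own example): the phase 5 << 16 corresponds to 0.3125 of a full turn (112.5°), which lies in the second quadrant.

x = 5 << 16  # bit 19 set, bit 20 clear: second quadrant
s_fx, c_fx = sincos(x)
println(s_fx / (1 << 18), " vs ", sin(2π * x / (1 << 20)))  # both ≈ 0.924
println(c_fx / (1 << 18), " vs ", cos(2π * x / (1 << 20)))  # both ≈ -0.383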

Here the first bit means bit 19 and the second bit means bit 20. The branch in mysign(x) doesn't seem to be that harmful to SIMD; however, I am open to other approaches. The multiplications of the signs with $A(x)$ and $B(x)$ follow from the bit table above. Let's see how fast we got:

sinsi = Vector{Int}(undef, 1000); cossi = Vector{Int}(undef, 1000)
function bench_sincos(sins::Vector{T}, coss::Vector{T}, freq) where T <: Integer
  n = get_quadrant_size(zero(T)) + 2
  fp = sizeof(T) * 8 - n - 2
  delta = floor(T, freq * 1 << (fp + n))  
  @inbounds for i = T(1):T(length(sins)) 
    sins[i], coss[i] = sincos((i * delta) >> fp)
  end
end
@btime bench_sincos($sinsi, $cossi, $0.002)

That's slower than the LUT variant above; it does not seem to vectorize very well. We will have a look at the variants with less precision further down, but let's check the accuracy first:

x = 0:1<<6:1<<20
plot(x, getindex.(sincos.(x), 1) ./ 1 << 18 .- sin.(2π .* x ./ 1 << 20), lab="Sine error")

It is more accurate than the LUT above by multiple orders of magnitude. I would guess that it can be made faster if SIMD is exploited to a greater extent. Let's see what happens if we decrease the integer precision. Since we have fewer bits available in this case, it doesn't make sense to keep a sixth-order polynomial. I made the same evaluation as above and concluded to use a fifth-order polynomial for Int32 and a third-order polynomial for Int16. Here is the code for that:

get_quadrant_size(x::Int16) = 6
get_quadrant_size(x::Int32) = 14
@inline function calc_A(x::Int16)
    p = 15; r = 1; A = Int16(23170); C = Int16(-849)
    n = get_quadrant_size(x)
    a = 6
    x² = (x * x) >> n
    rounding = one(x) << (p - a - 1)
    (A + rounding + (x² * C) >> r) >> (p - a)
end
@inline function calc_B(x::Int16)
    p = 14; r = 3; B = Int16(-17790); D = Int16(702)
    n = get_quadrant_size(x)
    a = 6
    x² = (x * x) >> n
    rounding = one(x) << (p - a - 1)
    (rounding + x * (B + (x² * D) >> r) >> n) >> (p - a)
end
@inline function calc_A(x::Int32)
    p = 31; r = 3; q = 31; A = Int32(1518500249); C = Int32(-1871437695); E = Int32(180476)
    n = get_quadrant_size(x)
    a = 14
    x² = (x * x) >> n
    rounding = one(x) << (q - a - 1)
    (A + rounding + (x² * (C + (x² * E) >> r) >> n) >> (p - q)) >> (q - a)
end
@inline function calc_B(x::Int32)
    p = 32; r = 4; q = 30; B = Int32(-1193052062); D = Int32(1974511850); F = Int32(-256048)
    n = get_quadrant_size(x)
    a = 14
    x² = (x * x) >> n
    rounding = one(x) << (q - a - 1)
    (rounding + x * (B + (x² * (D + (x² * F) >> r) >> n) >> (p - q)) >> n) >> (q - a)
end
sinsi32 = Vector{Int32}(undef, 1000); cossi32 = Vector{Int32}(undef, 1000)
sinsi16 = Vector{Int16}(undef, 1000); cossi16 = Vector{Int16}(undef, 1000)
@btime bench_sincos($sinsi32, $cossi32, $0.002)
@btime bench_sincos($sinsi16, $cossi16, $0.002)

That does seem to vectorize much better: that's around 400 ns for a thousand sine and cosine calculations. Wow! Let's have a look at the accuracy again:

x = 0:1<<4:1<<16
plot(x, getindex.(sincos.(Int32.(x)), 1) ./ 1 << 14 .- sin.(2π .* x ./ 1 << 16), lab="Sine error with Int32")
x = 0:1<<8
plot(x, getindex.(sincos.(Int16.(x)), 1) ./ 1 << 6 .- sin.(2π .* x ./ 1 << 8), lab="Sine error with Int16")

So for the Int16 case the maximum error reaches up to 2%, which is plausible: with only 6 amplitude bits the quantization step alone is $2^{-6} \approx 1.6\%$. That might still be enough for some applications, and with respect to the first floating-point variant it is about 65 times faster.

A side note: the frequency of 0.002 cannot be represented very well with an Int16 phase. That's why the fixed-point variant drifts away from the floating-point one:

plot(sinsi16 ./ 1 << 6, lab="Int16 sin approx. with Int16 phase")
plot!(sinsf, lab="Float64 sin")
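Concretely (my own back-of-the-envelope using the scaling in bench_sincos for Int16): with n = 8 and fp = 6, the phase increment becomes delta = floor(Int16, 0.002 * 2^14) = 32 instead of the exact 32.768, so the generated frequency is 32 / 2^14 ≈ 0.00195 rather than 0.002, an error of roughly 2.3% that accumulates over the 1000 samples.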

To circumvent that problem you can represent the phase as Int32:

function bench_sincos(sins::Vector{Int16}, coss::Vector{Int16}, freq)
  n = get_quadrant_size(zero(Int16)) + 2
  fp = 32 - n - 2
  delta = floor(Int32, freq * 1 << (fp + n))  
  @inbounds @simd for i = Int32(1):Int32(length(sins))
    sins[i], coss[i] = sincos(Int16((i * delta) >> fp))
  end
end
@btime bench_sincos($sinsi16, $cossi16, $0.002)
plot(sinsi16 ./ 1 << 6, lab="Int16 sin approx. with Int32 phase")
plot!(sinsf, lab="Float64 sin")

That looks much better. The runtime has increased a little, from 400 ns to 500 ns, however.

As promised above, here is finally the code that translates the floating-point coefficients into fixed-point coefficients. This is the version for the sixth-order polynomial:

# returns the smallest integer i such that abs(coeff) < 2.0^i
function coeff_within(coeff)
    for i = -32:32
        if abs(coeff) < 2.0^i
            return i
        end
    end
end
function calc_sixth_order_A_coeff(coeff)
    bitsize = 64
    n = 18
    p = bitsize - coeff_within(coeff[5]) - 1
    E = floor(Int64, coeff[5] * Int128(1) << p)
    r = bitsize - coeff_within(coeff[7]) - p
    G = floor(Int64, coeff[7] * Int128(1) << (r + p - n))
    q = bitsize - coeff_within(coeff[3]) - 1
    C = floor(Int64, coeff[3] * Int128(1) << q)
    k = bitsize - coeff_within(coeff[1]) - 1
    A = floor(Int64, coeff[1] * Int128(1) << k)
    (p = p, r = r, q = q, k = k, A = A, C = C, E = E, G = G)
end
function calc_sixth_order_B_coeff(coeff)
    bitsize = 64
    n = 18
    p = bitsize - coeff_within(coeff[4]) - 1
    D = floor(Int64, coeff[4] * Int128(1) << p)
    r = bitsize - coeff_within(coeff[6]) - p
    F = floor(Int64, coeff[6] * Int128(1) << (r + p - n))
    q = bitsize - coeff_within(coeff[2]) - 1
    B = floor(Int64, coeff[2] * Int128(1) << q)
    (p = p, r = r, q = q, B = B, D = D, F = F)
end
calc_sixth_order_A_coeff(c)
(p = 65, r = 5, q = 63, k = 63, A = 6521908912666390528, C = -8045990844691793920, E = 6613222665482852352, G = -64274600444494)
calc_sixth_order_B_coeff(c)
(p = 64, r = 4, q = 62, B = -5122062798018289664, D = 8414646631798232064, F = -60697792564407)
