Koustav Chowdhury / Aug 19 2019

RobinDict

This blog explains in detail the dictionary that I wrote as part of my project in Julia. I will discuss the implementation, a few benchmarks and some interesting plots here. So let's get started! I have already introduced the intuition behind this special hash table in a previous blog.

Implementation

mutable struct RobinDict{K,V} <: AbstractDict{K,V}
    hashes::Vector{UInt32}
    keys::Array{K,1}
    vals::Array{V,1}
    count::Int
    idxfloor::Int
end

This is the definition of the structure of RobinDict. It stores the 32-bit hash of each key alongside the key and value. The hash is kept to ease the calculation of a key's displacement from its preferred location in the table, and also to speed up the comparison of equal keys.

Given a hash value, the displacement of a key from its desired position can be easily calculated. The dibs value, or Distance to Initial Bucket, is a measure of this displacement.

desired_index(hash, sz) = (hash & (sz - 1)) + 1

function calculate_distance(h::RobinDict{K, V}, index) where {K, V} 
    @assert isslotfilled(h, index)
    sz = length(h.keys)
    @inbounds index_init = desired_index(h.hashes[index], sz)
    return (index - index_init + sz) & (sz - 1)
end
calculate_distance (generic function with 1 method)

As can be seen, the size of the table is maintained as a power of 2 to ease the modulo calculation. The calculate_distance function computes the dibs value for the key residing in the bucket marked with index.
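The arithmetic above can be checked by hand. Here is a tiny standalone worked example, assuming a table of size 8 and a made-up 32-bit hash value (the numbers are illustrative, not from the real hash function):

```julia
# Made-up worked example of desired_index and the dibs calculation,
# for a table of size 8 (a power of 2).
desired_index(hash, sz) = (hash & (sz - 1)) + 1

h  = UInt32(45)               # example hash value
sz = 8
idx = desired_index(h, sz)    # (45 & 7) + 1 == 6

# if this key actually sits in bucket 8, its dibs value is:
dibs = (8 - idx + sz) & (sz - 1)   # (8 - 6 + 8) & 7 == 2
```

The `& (sz - 1)` trick is exactly why the table size must be a power of 2: it replaces an expensive modulo with a bitwise AND.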

function rh_insert!(h::RobinDict{K, V}, key::K, val::V) where {K, V}
    # table full
    @assert h.count != length(h.keys)
    
    ckey, cval, chash = key, val, hash_key(key)
    sz = length(h.keys)
    index_init = desired_index(chash, sz)

    index_curr = index_init
    probe_distance = 0
    probe_current = 0
    @inbounds while true
        if (isslotempty(h, index_curr)) || (isslotfilled(h, index_curr) && isequal(h.keys[index_curr], ckey))
            break
        end
        probe_distance = calculate_distance(h, index_curr)

        if probe_current > probe_distance
            h.vals[index_curr], cval = cval, h.vals[index_curr]
            h.keys[index_curr], ckey = ckey, h.keys[index_curr]
            h.hashes[index_curr], chash = chash, h.hashes[index_curr]
            probe_current = probe_distance
        end
        probe_current += 1
        index_curr = (index_curr & (sz - 1)) + 1
    end
    
    @inbounds if isslotfilled(h, index_curr) && isequal(h.keys[index_curr], ckey)
        h.vals[index_curr] = cval
        return index_curr
    end

    @inbounds if isslotempty(h, index_curr)
        h.count += 1
    end

    @inbounds h.vals[index_curr] = cval
    @inbounds h.keys[index_curr] = ckey
    @inbounds h.hashes[index_curr] = chash
    
    @assert probe_current >= 0
    
    if h.idxfloor == 0
        h.idxfloor = index_curr
    else
        h.idxfloor = min(h.idxfloor, index_curr)
    end
    return index_curr
end
rh_insert! (generic function with 1 method)

rh_insert! lies at the heart of the dictionary. It is the primary procedure for inserting keys into the table. I have already discussed the procedure here, so I will not cover it again. The main takeaway is that insertion uses linear probing, with "rich" keys (those close to their desired bucket) giving up their slots to "poorer" incoming keys along the way.
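The swap rule can be seen in miniature with a toy table. The sketch below assumes identity hashing of small integers into a size-8 table, with 0 marking an empty slot; the names (toy_insert!, slots) are made up for illustration and are not part of DataStructures.jl:

```julia
# Toy Robin Hood insertion: identity "hash", size-8 table, 0 = empty slot.
toy_sz = 8
toy_desired(k) = (k & (toy_sz - 1)) + 1
toy_dib(k, idx) = (idx - toy_desired(k) + toy_sz) & (toy_sz - 1)

function toy_insert!(slots::Vector{Int}, key::Int)
    ckey = key
    idx = toy_desired(ckey)
    probe = 0
    while true
        if slots[idx] == 0
            slots[idx] = ckey          # found an empty slot: done
            return idx
        end
        resident = toy_dib(slots[idx], idx)
        if probe > resident
            # the incoming key is "poorer" than the resident:
            # steal the slot and carry the displaced resident forward
            slots[idx], ckey = ckey, slots[idx]
            probe = resident
        end
        probe += 1
        idx = (idx & (toy_sz - 1)) + 1
    end
end

slots = zeros(Int, toy_sz)
foreach(k -> toy_insert!(slots, k), (1, 2, 9))
# 1 and 9 both want slot 2; 2 wants slot 3. Inserting 9 last, it displaces
# 2 from slot 3 (9 is poorer there), and 2 moves on to slot 4.
```

Without the swap, 9 would have probed past 2 and landed at slot 4 with a dibs of 2 while 2 kept a dibs of 0; the swap evens out the two displacements, which is the whole point of Robin Hood hashing.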

function rh_search(h::RobinDict{K, V}, key::K) where {K, V}
    sz = length(h.keys)
    chash = hash_key(key)
    index = desired_index(chash, sz)
    cdibs = 0
    @inbounds while true
        if isslotempty(h, index)
            return -1
        elseif cdibs > calculate_distance(h, index)
            return -1
        elseif h.hashes[index] == chash && (h.keys[index] === key || isequal(h.keys[index], key))
            return index
        end
        index = (index & (sz - 1)) + 1
    end
end
rh_search (generic function with 1 method)

You might ask what's special about this search procedure: it looks like a standard linear-probe search. True, but not exactly the same! The cdibs counter ensures that the search never overshoots the maximum probe length, which is typically around 7-8 slots at a load factor of 70%.
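The early-exit rule can be demonstrated on a hand-built toy table. The sketch below again assumes identity hashing into a size-8 table (the names are made up for illustration): keys 1 and 9 both want slot 2, so 9 sits at slot 3; key 3 wants and gets slot 4.

```julia
# Toy Robin Hood search with the dibs-based early exit.
srch_sz = 8
srch_desired(k) = (k & (srch_sz - 1)) + 1
srch_dib(k, idx) = (idx - srch_desired(k) + srch_sz) & (srch_sz - 1)

table = [0, 1, 9, 3, 0, 0, 0, 0]   # 0 marks an empty slot

function toy_search(table, key)
    idx = srch_desired(key)
    cdibs = 0
    while true
        table[idx] == 0 && return -1                    # empty slot: absent
        cdibs > srch_dib(table[idx], idx) && return -1  # overshot: can't be further
        table[idx] == key && return idx
        cdibs += 1
        idx = (idx & (srch_sz - 1)) + 1
    end
end

toy_search(table, 9)    # found at slot 3
toy_search(table, 10)   # 10 wants slot 3; at slot 4 its cdibs (1) already
                        # exceeds the resident's dibs (0), so we stop early
                        # without ever reaching an empty slot
```

The second call is the interesting one: a plain linear-probe search would have had to scan until the empty slot 5, while the dibs invariant lets us give up one slot earlier.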

function rh_delete!(h::RobinDict{K, V}, index) where {K, V}
    @assert index > 0

    # this assumes that there is a key/value present in the dictionary at index
    index0 = index
    sz = length(h.keys)
    @inbounds while true
        index0 = (index0 & (sz - 1)) + 1
        if isslotempty(h, index0) || calculate_distance(h, index0) == 0
            break
        end
    end
    #index0 represents the position before which we have to shift backwards 
    
    # the backwards shifting algorithm
    curr = index
    next = (index & (sz - 1)) + 1
    
    @inbounds while next != index0
        h.vals[curr] = h.vals[next]
        h.keys[curr] = h.keys[next]
        h.hashes[curr] = h.hashes[next]
        curr = next
        next = (next & (sz-1)) + 1
    end
    
    #curr is at the last position, reset back to normal
    isbitstype(K) || isbitsunion(K) || ccall(:jl_arrayunset, Cvoid, (Any, UInt), h.keys, curr-1)
    isbitstype(V) || isbitsunion(V) || ccall(:jl_arrayunset, Cvoid, (Any, UInt), h.vals, curr-1)
    @inbounds h.hashes[curr] = 0x0

    h.count -= 1
    # this is necessary because key at idxfloor might get deleted 
    h.idxfloor = get_next_filled(h, h.idxfloor)
    return h
end
rh_delete! (generic function with 1 method)

The rh_delete! function performs the crucial task of deleting an item from the table. This is done by shifting the subsequent entries of the cluster backwards. The slight advantage of this procedure is that we do away with the need to maintain tombstones. I have explained the backward-shift algorithm in detail here.
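Here is the backward shift in miniature, on the same kind of toy table as before (identity hashing, size 8, 0 marking an empty slot; all names are made up for illustration). Keys 1, 9 and 17 all want slot 2 and occupy slots 2-4:

```julia
# Toy backward-shift deletion for a Robin Hood table.
del_sz = 8
del_desired(k) = (k & (del_sz - 1)) + 1
del_dib(k, idx) = (idx - del_desired(k) + del_sz) & (del_sz - 1)

tbl = [0, 1, 9, 17, 0, 0, 0, 0]   # 1, 9, 17 all want slot 2

function toy_delete!(tbl, index)
    # find the first slot we must NOT shift: an empty one, or one
    # whose resident already sits in its desired bucket (dibs == 0)
    i0 = index
    while true
        i0 = (i0 & (del_sz - 1)) + 1
        (tbl[i0] == 0 || del_dib(tbl[i0], i0) == 0) && break
    end
    # shift everything strictly between index and i0 back by one slot
    curr = index
    nxt = (index & (del_sz - 1)) + 1
    while nxt != i0
        tbl[curr] = tbl[nxt]
        curr = nxt
        nxt = (nxt & (del_sz - 1)) + 1
    end
    tbl[curr] = 0   # the last shifted-out slot becomes empty
    return tbl
end

toy_delete!(tbl, 2)   # delete key 1: 9 and 17 each shift back one slot
```

After the deletion, 9 sits in its desired slot 2 and 17 has a dibs of 1 instead of 2: every surviving key in the cluster gets closer to home, and no tombstone is left behind.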

Demonstration

To try your hands at RobinDict, all you have to do is clone the latest version of DataStructures.jl. My code resides in src/robin_dict.jl. Any form of contribution to RobinDict is always welcome and encouraged.

import Pkg; Pkg.add("DataStructures");
using DataStructures

As a quick demo, try running some Dict code in the code cell below. The API for RobinDict and Dict is the same. There's even a constructor for converting a Dict into a RobinDict! Go on, don't be shy! 🤗

h = RobinDict('a'=>1, 'b'=>'c')
RobinDict{Char,Any} with 2 entries:
  'a' => 1
  'b' => 'c'

Benchmarks

This is the part I was most excited about: comparing how my dictionary fared against the present implementation of Dict residing in base/dict.jl. The code for the benchmark suite resides in this PR. Here I'll present a tabulated form of the results for the cases in which I benchmarked both Dict and RobinDict.

  • Dict and RobinDict are actually Dict{Any, Any} and RobinDict{Any, Any}, respectively.
  • push! has been averaged over 1M entries.
  • pop! has been calculated for 10 evals and 100 samples. The average value has been written in the benchmark table.

Plots

Well, I've obtained some plots to verify certain assumptions about the behaviour of a dictionary with Robin Hood hashing. The main intuition behind this hashing technique was to bring down the variance of the displacement of an entry from its hashed position. In other words, every entry tries to stay as close as it can to its original position. Presented below are the expected value and the variance of the DIB for several load factors.

Take-aways

  • As the load factor increases, the expected value of the displacement increases. This can be explained by noting that under high load there is a large amount of clustering in a linear-probed table. Robin Hood hashing distributes this displacement more evenly than an ordinary table does; for a naively implemented linear-probed table, the expected value would be much worse.
  • The kinks in the plots represent points where rehash! happens. Notice that as the load factor increases, the frequency of rehash! operations decreases. This is important because rehash! is the most costly dictionary operation.

The above points were the primary influence on my decision to settle on a load factor of 70%. We don't want a high expected displacement, because that would increase the number of probes during insertion, search and deletion; neither do we want the rehash to happen too frequently. It's a trade-off that had to be made.

Future work

There's always scope for improvement, and my implementation is no exception. Currently, I'm working on intertwining the hash and dibs information into a single state called meta. The work resides in this branch. I'm also looking to optimize the rehash operation even further in this implementation. The benchmarks have been favorable so far, even without that optimization.

During my project, I came to know of Google's Swiss table implementation, and I was highly impressed by its simplicity. I am planning to integrate some of its optimizations into RobinDict, but I haven't exactly carved out the path yet.

Some reflections

I am extremely grateful for all the input I received from my mentor, Jameson Nash (@vtjnash), during the entire tenure of this project. There were times when I was stuck in the implementation, and he guided me well past those stages. I could have accomplished much more had I not fallen sick during this period. Thanks a ton for understanding.

I would also like to thank @chethega, who goes by an interesting name, Foo Bar, on the Julia Slack, for all the time and iterations he spent reviewing this PR. We discussed some roadblocks in the implementation and planned how to mitigate them. I am eagerly waiting to work on some interesting projects that he recommended.

Finally, I am thankful to the entire Julia community for showing faith in my project and selecting it for its Season of Contributions. You guys are the best!