RobinDict
This blog post explains, in detail, the dictionary that I wrote as part of my project in Julia. I will discuss the implementation, a few benchmarks, and some interesting plots here. So let's get started! I have already introduced the intuition behind this special hash table in a previous blog.
Implementation
```julia
mutable struct RobinDict{K,V} <: AbstractDict{K,V}
    hashes::Vector{UInt32}
    keys::Array{K,1}
    vals::Array{V,1}
    count::Int
    idxfloor::Int
end
```
This is the definition of the `RobinDict` structure. It stores a 32-bit hash of each key, alongside the key and the value. The hash is stored to make it easy to compute the displacement of a key from its preferred location in the table, and it also makes comparisons between keys cheaper.
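The snippets below also use a couple of small internal helper predicates that I won't reproduce in full here. A plausible sketch of them, assuming (as the deletion code later suggests, where a freed slot's hash is reset to `0x0`) that a zero hash marks an empty slot:

```julia
# Hypothetical sketch of the slot predicates used by the code below; the actual
# helpers in src/robin_dict.jl may differ. A stored hash of 0 marks an empty slot.
isslotempty(h::RobinDict, index)  = h.hashes[index] == 0
isslotfilled(h::RobinDict, index) = h.hashes[index] != 0
```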
Given a hash value, the displacement of a key from its preferred position can be easily calculated. The dibs value, or Distance to Initial Bucket, is a measure of this displacement.
```julia
desired_index(hash, sz) = (hash & (sz - 1)) + 1

function calculate_distance(h::RobinDict{K, V}, index) where {K, V}
    @assert isslotfilled(h, index)
    sz = length(h.keys)
    index_init = desired_index(h.hashes[index], sz)
    return (index - index_init + sz) & (sz - 1)
end
```
As can be seen, the size of the table is maintained as a power of 2 to ease the modulo calculation. The `calculate_distance` function computes the dibs value for the key residing in the bucket marked with `index`.
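A tiny worked example (with made-up numbers, not taken from the library) shows how the bit mask replaces the modulo and how the distance wraps around the table:

```julia
sz = 8                                   # table size, always a power of 2
key_hash = UInt32(42)                    # 0b101010
desired  = (key_hash & (sz - 1)) + 1     # (42 % 8) + 1 == 3 (1-based bucket)

index = 6                                # suppose probing placed the key here
dibs  = (index - desired + sz) & (sz - 1)   # (6 - 3 + 8) & 7 == 3
```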
```julia
function rh_insert!(h::RobinDict{K, V}, key::K, val::V) where {K, V}
    # table must not be full
    @assert h.count != length(h.keys)

    ckey, cval, chash = key, val, hash_key(key)
    sz = length(h.keys)
    index_init = desired_index(chash, sz)

    index_curr = index_init
    probe_distance = 0
    probe_current = 0
    while true
        if (isslotempty(h, index_curr)) || (isslotfilled(h, index_curr) && isequal(h.keys[index_curr], ckey))
            break
        end
        probe_distance = calculate_distance(h, index_curr)

        # Robin Hood step: the incoming entry has probed further than the
        # resident one, so they swap places and the displaced entry keeps probing.
        if probe_current > probe_distance
            h.vals[index_curr], cval = cval, h.vals[index_curr]
            h.keys[index_curr], ckey = ckey, h.keys[index_curr]
            h.hashes[index_curr], chash = chash, h.hashes[index_curr]
            probe_current = probe_distance
        end
        probe_current += 1
        index_curr = (index_curr & (sz - 1)) + 1
    end

    # existing key: overwrite the value and return
    if isslotfilled(h, index_curr) && isequal(h.keys[index_curr], ckey)
        h.vals[index_curr] = cval
        return index_curr
    end

    if isslotempty(h, index_curr)
        h.count += 1
    end

    h.vals[index_curr] = cval
    h.keys[index_curr] = ckey
    h.hashes[index_curr] = chash

    @assert probe_current >= 0

    if h.idxfloor == 0
        h.idxfloor = index_curr
    else
        h.idxfloor = min(h.idxfloor, index_curr)
    end

    return index_curr
end
```
`rh_insert!` lies at the heart of the dictionary's operation. It is the primary procedure for inserting keys into the dictionary. I have already discussed the procedure here, so I won't cover it again. The main takeaway is that the insertion procedure uses linear probing, swapping entries whenever the incoming key has probed further than the resident one.
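For context, here is what the insertion path looks like from the outside; I'm assuming here that `setindex!` (i.e. `h[k] = v`) ultimately routes through `rh_insert!`:

```julia
h = RobinDict{String,Int}()
h["apple"]  = 1     # fresh insert: count goes up, idxfloor gets set
h["apple"]  = 2     # existing key: value is overwritten, count stays the same
h["banana"] = 3
length(h)           # 2
```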
```julia
function rh_search(h::RobinDict{K, V}, key::K) where {K, V}
    sz = length(h.keys)
    chash = hash_key(key)
    index = desired_index(chash, sz)
    cdibs = 0
    while true
        if isslotempty(h, index)
            return -1
        elseif cdibs > calculate_distance(h, index)
            return -1
        elseif h.hashes[index] == chash && (h.keys[index] === key || isequal(h.keys[index], key))
            return index
        end
        index = (index & (sz - 1)) + 1
    end
end
```
You might ask what's special about this search procedure; it looks like a standard linear-probe search. Well, true, but not exactly the same! The `cdibs` check ensures that the search procedure doesn't overshoot the maximum probe length: if the resident entry at the current slot sits closer to its home bucket than we are into our own probe, a matching key could not be located any further along. At a load factor of 70%, the maximum probe length is typically around 7-8 slots.
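Since `rh_search` returns the slot index on a hit and `-1` on a miss, the public lookup functions can be layered on top of it. The following is only a hedged sketch; the actual definitions in src/robin_dict.jl may differ:

```julia
# Hypothetical wrappers over rh_search's index / -1 convention.
haskey_sketch(h::RobinDict{K,V}, key::K) where {K,V} = rh_search(h, key) > 0

function get_sketch(h::RobinDict{K,V}, key::K, default) where {K,V}
    index = rh_search(h, key)
    return index > 0 ? h.vals[index] : default
end
```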
```julia
function rh_delete!(h::RobinDict{K, V}, index) where {K, V}
    # this assumes that there is a key/value present in the dictionary at index
    @assert index > 0

    index0 = index
    sz = length(h.keys)
    while true
        index0 = (index0 & (sz - 1)) + 1
        if isslotempty(h, index0) || calculate_distance(h, index0) == 0
            break
        end
    end
    # index0 represents the position before which we have to shift backwards

    # the backwards shifting algorithm
    curr = index
    next = (index & (sz - 1)) + 1
    while next != index0
        h.vals[curr] = h.vals[next]
        h.keys[curr] = h.keys[next]
        h.hashes[curr] = h.hashes[next]
        curr = next
        next = (next & (sz - 1)) + 1
    end

    # curr is at the last position, reset back to normal
    isbitstype(K) || isbitsunion(K) ||
        ccall(:jl_arrayunset, Cvoid, (Any, UInt), h.keys, curr - 1)
    isbitstype(V) || isbitsunion(V) ||
        ccall(:jl_arrayunset, Cvoid, (Any, UInt), h.vals, curr - 1)
    h.hashes[curr] = 0x0

    h.count -= 1
    # this is necessary because the key at idxfloor might get deleted
    h.idxfloor = get_next_filled(h, h.idxfloor)
    return h
end
```
The `rh_delete!` function performs the crucial task of deleting an item from the table. This is done by backward-shifting the entries that follow the deleted slot. The slight advantage of this procedure is that we do away with the need to maintain tombstones. I have explained the backward-shift algorithm in detail here.
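Note that `rh_delete!` works on a slot index, not a key, so a key-based `delete!` presumably locates the slot with `rh_search` first. A hedged sketch of such a wrapper (not necessarily how the package defines it):

```julia
# Hypothetical key-based deletion built from the two internal routines above.
function delete_sketch!(h::RobinDict{K,V}, key::K) where {K,V}
    index = rh_search(h, key)
    index > 0 && rh_delete!(h, index)
    return h
end
```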
Demonstration
To try your hands at `RobinDict`, all you gotta do is clone the latest version of DataStructures.jl. My code resides in src/robin_dict.jl. Any form of contribution to `RobinDict` is always welcome and encouraged.
```julia
import Pkg; Pkg.add("DataStructures"); using DataStructures
```
As a quick demo, try running some `Dict` code in the code cell below. The API is the same for both `RobinDict` and `Dict`. There's even a constructor for converting a `Dict` to a `RobinDict`! Go on, don't be shy! 🤗
```julia
h = RobinDict('a'=>1, 'b'=>'c')
```
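If you'd rather paste something into your own REPL, a short session might look like this (I'm assuming the conversion constructor is invoked simply as `RobinDict(d)`):

```julia
using DataStructures

d  = Dict("x" => 1, "y" => 2)
rd = RobinDict(d)        # convert an existing Dict to a RobinDict
rd["z"] = 3              # same API as Dict
get(rd, "w", 0)          # 0, since "w" is absent
length(rd)               # 3
```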
Benchmarks
This is the part I was most excited about: comparing how my dictionary fared against the present implementation of `Dict` residing in base/dict.jl. The code for the benchmark suite resides in this PR. Here, I'll present a tabulated form of the results for the cases in which I benchmarked both `Dict` and `RobinDict`.
- `Dict` and `RobinDict` are actually `Dict{Any, Any}` and `RobinDict{Any, Any}`, respectively.
- `push!` has been averaged over 1M entries.
- `pop!` has been measured with 10 evals and 100 samples; the average value has been written in the benchmark table.
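The exact suite lives in the PR linked above; purely as an illustration, a comparison of the insertion workload could be sketched with BenchmarkTools.jl along these lines (the helper `filldict!` is mine, not part of the suite):

```julia
using BenchmarkTools, DataStructures

# Insert n integer entries into the dictionary d.
function filldict!(d, n)
    for i in 1:n
        d[i] = i
    end
    return d
end

@benchmark filldict!(Dict{Any,Any}(), 1_000_000)
@benchmark filldict!(RobinDict{Any,Any}(), 1_000_000)
```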
Plots
Well, I've obtained some plots to verify certain assumptions about how a dictionary with Robin Hood hashing behaves. The main intuition behind this hashing technique is to bring down the variance of the displacement of an entry from its hashed position; in other words, every entry tries to stay as close as it can to its original position. Presented below are the expected value and the variance of the DIB for several load factors.
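For reference, these DIB statistics can be gathered with a few lines of Julia. This is only a sketch that leans on the internal helpers `isslotfilled` and `calculate_distance` shown earlier, not the script I actually used:

```julia
using Statistics

# Mean and variance of the dibs value over all filled slots of the table.
function dib_stats(h::RobinDict)
    dibs = [calculate_distance(h, i) for i in 1:length(h.keys) if isslotfilled(h, i)]
    return mean(dibs), var(dibs)
end

h = RobinDict{Int,Int}()
for k in 1:10_000
    h[k] = k
end
dib_stats(h)   # (expected DIB, variance of DIB) at the current load factor
```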
Take-aways
- As the load factor increases, the expected value of the displacement increases. This can be explained by noting that under high load there is a large amount of clustering in a linear-probed table. Robin Hood hashing tends to even out this displacement compared to an ordinary table; for a naively implemented linear-probed table, the expected value would be much worse.
- The kinks in the plots represent points where `rehash!` happens. Notice that as the load factor increases, the frequency of `rehash!` operations decreases. This is important because `rehash!` is the most costly dictionary operation.
The above-mentioned points were the primary reasons influencing my decision to settle on a load factor of 70%. We don't want a high expected displacement, because that would increase the number of probes during insertion, search, and deletion; neither do we want the rehash to happen too frequently. It's a trade-off that had to be made.
Future work
There's always scope for improvement, and the same is true of my implementation. Currently, I'm working on intertwining the information about the `hash` and the `dibs` into a single state called `meta`. The work resides in this branch. I'm also looking to optimize the `rehash` operation even further in this implementation. The benchmarks have been favorable so far, even without optimizing the rehash operation.
Over the course of my project, I came to know of Google's `SwissDict` implementation, and I was highly impressed by its simplicity. I am planning to integrate some optimizations from that dictionary into `RobinDict`, but I haven't exactly carved out the path yet.
Some reflections
I am extremely grateful for all the inputs I received from my mentor, Jameson Nash @vtjnash during the entire tenure of this project. There were times when I was stuck in the implementation, but he guided me well past those stages. I could've accomplished much more, had I not fallen sick during this period. Thanks a ton for understanding.
I would also like to thank @chethega, who goes by an interesting name, Foo Bar, on the Julia Slack, for all the time and iterations he spent reviewing this PR. We discussed some roadblocks in the implementation and also planned how to mitigate them. I am eagerly looking forward to working on some interesting projects that he recommended.
Finally, I am thankful to the entire Julia community for showing faith in my project and selecting it for its Season of Contributions. You guys are the best!