Blog #1 : Julia Season of Code
My project is to try, test and finally implement a new form of Dict
in Julia which uses Robin Hood hashing scheme. As per the talk with my mentor, Jameson Nash, I got down to work from the first week itself.
Then as per the plan of Week #1 of my proposal, I started out reading the original paper on Robin Hood hashing by Pedro Celis . I started formalising what I've to implement and how I will implement. The paper presents quite a mathematical approach in analyzing the hashing technique, but at the end of the day, it has to boil down to code. The plan is write an alternate form of dictionary, which I call RobinDict
and get it merged in DataStructures.jl
. This can ease out the integration process in base/dict.jl
.
On Robin Hood hashing
Blatantly speaking, Robin Hood hashing technique is a variation of linear probing technique which improves over certain shortcomings of the linear probing. The basic idea is to take normal open addressing, but use one clever trick in order to drastically reduce the variance of the expected average and maximum probe lengths.
The way to do this is : when you probe for a position to insert a new element, if the probe length for the existing element is less than the current probe length for the element being inserted, swap the two elements and keep going.
In this way, the elements which had the comfort of being in the proximity of it's initial hashed index is removed from it position. Stealing from the rich and giving to the poor? That’s Robin Hood all over.
Implementation
I opened a draft pull request in DataStructures.jl
implementing the dictionary which implements the dictionary.
I'll try to explain major snippets of my implementation here.
Definition
mutable struct RobinDict{K,V} <: AbstractDict{K,V} #there is no need to maintain an table_size as an additional variable slots::Array{UInt8,1} # indicator, to be used later on keys::Array{K,1} vals::Array{V,1} dibs::Array{Int,1} # distance to initial bucket - critical for implementation count::Int totalcost::Int maxprobe::Int # length of longest probe idxfloor::Int function RobinDict{K, V}() where {K, V} n = 16 # default size of an empty Dict in Julia new(zeros(UInt, n), Vector{K}(undef, n), Vector{V}(undef, n), zeros(Int, n), 0, 0, 0, 0) end function RobinDict{K, V}(d::RobinDict{K, V}) where {K, V} new(copy(d.slots), copy(d.keys), copy(d.vals), copy(d.dibs), d.count, d.totalcost, d.maxprobe, d.idxfloor) end function RobinDict{K, V}(slots, keys, vals, dibs, count, totalcost, maxprobe, idxfloor) where {K, V} new(slots, keys, dibs, vals, count, totalcost, maxprobe, idxfloor) end end
slots
is an array which indicates whether a position has been occupied or not. There are ways to skip this implementation, but it's not worth all the pain.
keys
and vals
are standard arrays for storing keys and values respectively.
dibs
is an indicator for Distance to Initial Bucket. As I have explained earlier, this forms the heart of Robin Hood hashing technique.
count
keeps the track of number of entries in the dictionary.
totalcost
is a parameter to indicate number of operations performed on the dictionary.
idxfloor
is the first(minimum) index where an entry is to be found. It is 0, when there are no entries in the dictionary.
maxprobe
keeps record of the maximum value in the dibs array. As of now, I've kept this for benchmarking purposes.
Insertion Algorithm
function rh_insert!(h::RobinDict{K, V}, key::K, val::V) where {K, V} # table full if h.count == length(h.keys) return -1 end ckey, cval, cdibs = key, val, 0 sz = length(h.keys) index = hashindex(ckey, sz) while true if (h.slots[index] == 0x0) || (h.slots[index] == 0x1 && h.keys[index] == ckey) break end if h.dibs[index] < cdibs h.vals[index], cval = cval, h.vals[index] h.keys[index], ckey = ckey, h.keys[index] h.maxprobe = max(h.maxprobe, cdibs) h.dibs[index], cdibs = cdibs, h.dibs[index] end cdibs += 1 h.totalcost += 1 index = (index & (sz - 1)) + 1 end if h.slots[index] == 0x1 && h.keys[index] == ckey h.vals[index] = cval return index end if h.slots[index] == 0x0 h.count += 1 end h.slots[index] = 0x1 h.vals[index] = cval h.keys[index] = ckey h.dibs[index] = cdibs h.totalcost += 1 h.maxprobe = max(h.maxprobe, cdibs) if h.idxfloor == 0 h.idxfloor = index else h.idxfloor = min(h.idxfloor, index) end return index end
The first part checks whether the dictionary is full. This part of the code is there for safekeeping. It's an additional protective measure that I've taken to ensure that this scenario never arises.
Inside the loop, I've implemented the standard version of Robin Hood hashing technique. While probing linearly from index
, I keep on performing swaps until either of the two things happen.
- There is a slot which hasn't been filled
- There is a slot which has been occupied by the same key
In the first case, I insert the element into the desired index . In the second case, I update the present value of the vals
array with the new one. In both of the cases, I return the index
of the position.
Search Algorithm
function rh_search(h::RobinDict{K, V}, key::K) where {K, V} sz = length(h.keys) index = hashindex(key, sz) while true if h.slots[index] == 0x0 return -1 elseif cdibs > h.dibs[index] return -1 elseif h.keys[index] == key return index end index = (index & (sz - 1)) + 1 end end
Pretty much straightforward, standard linear probing search algorithm.
Deletion Algorithm
# backward shift deletion by not keeping any tombstones function rh_delete!(h::RobinDict{K, V}, index) where {K, V} #forceful (index < 0) && return false; #this assumes that the key is present in the dictionary at index index0 = index sz = length(h.keys) while true index0 = (index0 & (sz - 1)) + 1 if h.slots[index0] == 0x0 break end end #index0 represents the first empty slot in linear probe # the backwards shifting algorithm curr = index next = (index & (sz - 1)) + 1 while next != index0 h.slots[curr] = h.slots[next] h.vals[curr] = h.vals[next] h.keys[curr] = h.keys[next] h.dibs[curr] = (h.dibs[next] > 0) ? (h.dibs[next] - 1) : 0 curr = next next = (next & (sz-1)) + 1 end #curr is at the last position, reset back to normal h.slots[curr] = 0x0 ccall(:jl_arrayunset, Cvoid, (Any, UInt), h.keys, index-1) ccall(:jl_arrayunset, Cvoid, (Any, UInt), h.vals, index-1) h.dibs[curr] = 0 h.count -= 1 h.totalcost += 1 return h end
One of the most challenging part of the code. Here I am implementing the deletion by backward shifting.
Instead of replacing entries with “tombstone” entries on deletion, the idea is to shift backward all the entries following the entry to delete until either an empty bucket, or a bucket with a DIB(Distance to Initial Bucket) of 0. By doing this, every deletion will shift backwards entries and therefore decrease their respective DIBs by 1 (all their initial DIB values would be >= 1). An intuitive way to understand the backward shift is to think that by shifting backward the entries, the table is left as if the deleted entry had never been inserted. This is why even after a large number of deletions, the mean DIB and the variance of the DIB remain constant and low.
rehash!
algorithm
rehash!
algorithmfunction rehash!(h::RobinDict{K,V}, newsz = length(h.keys)) where {K, V} olds = h.slots oldk = h.keys oldv = h.vals oldd = h.dibs sz = length(olds) newsz = _tablesz(newsz) h.totalcost += 1 h.idxfloor = 1 if h.count == 0 resize!(h.slots, newsz) fill!(h.slots, 0) resize!(h.keys, sz) resize!(h.vals, sz) resize!(h.dibs, sz) h.count = 0 h.maxprobe = 0 h.totalcost = 0 h.idxfloor = 0 return h end slots = zeros(UInt8,newsz) keys = Vector{K}(undef, newsz) vals = Vector{V}(undef, newsz) dibs = Vector{Int}(undef, newsz) fill!(dibs, 0) totalcost0 = h.totalcost count = 0 maxprobe = 0 idxfloor = h.idxfloor for i = 1:sz if olds[i] == 0x1 k = oldk[i] v = oldv[i] d = dibs[i] index0 = index = hashindex(k, newsz) while slots[index] != 0 index = (index & (newsz-1)) + 1 end probe = (index - index0) & (newsz-1) probe > maxprobe && (maxprobe = probe) index < idxfloor && (idxfloor = index) slots[index] = 0x1 keys[index] = k vals[index] = v dibs[index] = d count += 1 if h.totalcost != totalcost0 # if `h` is changed by a finalizer, retry return rehash!(h, newsz) end end end h.slots = slots h.keys = keys h.vals = vals h.dibs = dibs h.count = count h.maxprobe = maxprobe h.idxfloor = idxfloor h.totalcost == totalcost0 return h end
I've tried to keep it as close as I could to the present implementation in dict.jl
.
Going forward
Currently, I'm writing tests for the RobinDict
and progress on that can be seen here .
There are a few hiccups that needs to be resolved, before the dictionary is up and working.
- The first major bug that I've encountered while testing is as follows. Due to backward shifting all the entries, the original mapped position of a key is changed, and thus search is failing in certain circumstances,
- I'm calling
rehash!
function whenever the load factor of the dictionary reaches up to a certain limit(0.80, as of now). This seems to be slowing up the code forsetindex!
operation on size of about 10k entries.
The plan in the upcoming weeks to resolve these issues, and also to complete writing tests and finish writing the docstrings of the functions. Hopefully, by the time of next blog, the PR will already be in review.