Koustav Chowdhury / May 24 2019
Remix of Julia by Nextjournal

Blog #1 : Julia Season of Code

My project is to try, test and finally implement a new form of Dict in Julia which uses Robin Hood hashing scheme. As per the talk with my mentor, Jameson Nash, I got down to work from the first week itself.

Then as per the plan of Week #1 of my proposal, I started out reading the original paper on Robin Hood hashing by Pedro Celis . I started formalising what I've to implement and how I will implement. The paper presents quite a mathematical approach in analyzing the hashing technique, but at the end of the day, it has to boil down to code. The plan is write an alternate form of dictionary, which I call RobinDict and get it merged in DataStructures.jl. This can ease out the integration process in base/dict.jl.

On Robin Hood hashing

Blatantly speaking, Robin Hood hashing technique is a variation of linear probing technique which improves over certain shortcomings of the linear probing. The basic idea is to take normal open addressing, but use one clever trick in order to drastically reduce the variance of the expected average and maximum probe lengths.

The way to do this is : when you probe for a position to insert a new element, if the probe length for the existing element is less than the current probe length for the element being inserted, swap the two elements and keep going.

In this way, the elements which had the comfort of being in the proximity of it's initial hashed index is removed from it position. Stealing from the rich and giving to the poor? That’s Robin Hood all over.

Implementation

I opened a draft pull request in DataStructures.jl implementing the dictionary which implements the dictionary.

I'll try to explain major snippets of my implementation here.

Definition

mutable struct RobinDict{K,V} <: AbstractDict{K,V}
    #there is no need to maintain an table_size as an additional variable
    slots::Array{UInt8,1} # indicator, to be used later on
    keys::Array{K,1}
    vals::Array{V,1}
    dibs::Array{Int,1} # distance to initial bucket - critical for implementation
    count::Int
    totalcost::Int
    maxprobe::Int    # length of longest probe
    idxfloor::Int

    function RobinDict{K, V}() where {K, V}
        n = 16 # default size of an empty Dict in Julia
        new(zeros(UInt, n), Vector{K}(undef, n), Vector{V}(undef, n), zeros(Int, n), 0, 0, 0, 0)
    end

    function RobinDict{K, V}(d::RobinDict{K, V}) where {K, V}
        new(copy(d.slots), copy(d.keys), copy(d.vals), copy(d.dibs), d.count, d.totalcost, d.maxprobe, d.idxfloor)
    end

    function RobinDict{K, V}(slots, keys, vals, dibs, count, totalcost, maxprobe, idxfloor) where {K, V}
        new(slots, keys, dibs, vals, count, totalcost, maxprobe, idxfloor)
    end
end

slots is an array which indicates whether a position has been occupied or not. There are ways to skip this implementation, but it's not worth all the pain.

keys and vals are standard arrays for storing keys and values respectively.

dibs is an indicator for Distance to Initial Bucket. As I have explained earlier, this forms the heart of Robin Hood hashing technique.

count keeps the track of number of entries in the dictionary.

totalcost is a parameter to indicate number of operations performed on the dictionary.

idxfloor is the first(minimum) index where an entry is to be found. It is 0, when there are no entries in the dictionary.

maxprobe keeps record of the maximum value in the dibs array. As of now, I've kept this for benchmarking purposes.

Insertion Algorithm

Shift+Enter to run
function rh_insert!(h::RobinDict{K, V}, key::K, val::V) where {K, V}
    # table full
    if h.count == length(h.keys)
        return -1
    end
    ckey, cval, cdibs = key, val, 0
    sz = length(h.keys)
    index = hashindex(ckey, sz)
    @inbounds while true
    	if (h.slots[index] == 0x0) || (h.slots[index] == 0x1 && h.keys[index] == ckey)
    		break
    	end
        if h.dibs[index] < cdibs
            h.vals[index], cval = cval, h.vals[index]
            h.keys[index], ckey = ckey, h.keys[index]
            h.maxprobe = max(h.maxprobe, cdibs)
            h.dibs[index], cdibs = cdibs, h.dibs[index]
        end
        cdibs += 1
        h.totalcost += 1
        index = (index & (sz - 1)) + 1
    end

    @inbounds if h.slots[index] == 0x1 && h.keys[index] == ckey
        h.vals[index] = cval
        return index
    end

    @inbounds if h.slots[index] == 0x0
    	h.count += 1
    end

    @inbounds h.slots[index] = 0x1
    @inbounds h.vals[index] = cval
    @inbounds h.keys[index] = ckey
    @inbounds h.dibs[index] = cdibs
    h.totalcost += 1
    h.maxprobe = max(h.maxprobe, cdibs)
    if h.idxfloor == 0
        h.idxfloor = index
    else
        h.idxfloor = min(h.idxfloor, index)
    end
    return index
end

The first part checks whether the dictionary is full. This part of the code is there for safekeeping. It's an additional protective measure that I've taken to ensure that this scenario never arises.

Inside the loop, I've implemented the standard version of Robin Hood hashing technique. While probing linearly from index, I keep on performing swaps until either of the two things happen.

  1. There is a slot which hasn't been filled
  2. There is a slot which has been occupied by the same key

In the first case, I insert the element into the desired index . In the second case, I update the present value of the vals array with the new one. In both of the cases, I return the index of the position.

Search Algorithm

function rh_search(h::RobinDict{K, V}, key::K) where {K, V}
	sz = length(h.keys)
	index = hashindex(key, sz)
	while true
		if h.slots[index] == 0x0
			return -1
		elseif cdibs > h.dibs[index]
			return -1
		elseif h.keys[index] == key 
			return index
		end
		index = (index & (sz - 1)) + 1
	end
end

Pretty much straightforward, standard linear probing search algorithm.

Deletion Algorithm

Shift+Enter to run
# backward shift deletion by not keeping any tombstones
function rh_delete!(h::RobinDict{K, V}, index) where {K, V}
    #forceful
    (index < 0) && return false;

    #this assumes that  the key is present in the dictionary at index 
    index0 = index
    sz = length(h.keys)
    while true
        index0 = (index0 & (sz - 1)) + 1
        if h.slots[index0] == 0x0
            break
        end
    end
    #index0 represents the first empty slot in linear probe
    
    # the backwards shifting algorithm
    curr = index
    next = (index & (sz - 1)) + 1

    while next != index0 
        h.slots[curr] = h.slots[next]
        h.vals[curr] = h.vals[next]
        h.keys[curr] = h.keys[next]
        h.dibs[curr] = (h.dibs[next] > 0) ? (h.dibs[next] - 1) : 0
        curr = next
        next = (next & (sz-1)) + 1
    end

    #curr is at the last position, reset back to normal
    h.slots[curr] = 0x0
    ccall(:jl_arrayunset, Cvoid, (Any, UInt), h.keys, index-1)
    ccall(:jl_arrayunset, Cvoid, (Any, UInt), h.vals, index-1)
    h.dibs[curr] = 0
    h.count -= 1
    h.totalcost += 1

    return h
end

One of the most challenging part of the code. Here I am implementing the deletion by backward shifting.

Instead of replacing entries with “tombstone” entries on deletion, the idea is to shift backward all the entries following the entry to delete until either an empty bucket, or a bucket with a DIB(Distance to Initial Bucket) of 0. By doing this, every deletion will shift backwards entries and therefore decrease their respective DIBs by 1 (all their initial DIB values would be >= 1). An intuitive way to understand the backward shift is to think that by shifting backward the entries, the table is left as if the deleted entry had never been inserted. This is why even after a large number of deletions, the mean DIB and the variance of the DIB remain constant and low.

rehash! algorithm

Shift+Enter to run
function rehash!(h::RobinDict{K,V}, newsz = length(h.keys)) where {K, V}
    olds = h.slots
    oldk = h.keys
    oldv = h.vals
    oldd = h.dibs
    sz = length(olds)
    newsz = _tablesz(newsz)
    h.totalcost += 1
    h.idxfloor = 1
    if h.count == 0
        resize!(h.slots, newsz)
        fill!(h.slots, 0)
        resize!(h.keys, sz)
        resize!(h.vals, sz)
        resize!(h.dibs, sz)
        h.count = 0
        h.maxprobe = 0
        h.totalcost = 0
        h.idxfloor = 0
        return h
    end

    slots = zeros(UInt8,newsz)
    keys = Vector{K}(undef, newsz)
    vals = Vector{V}(undef, newsz)
    dibs = Vector{Int}(undef, newsz)
    fill!(dibs, 0)
    totalcost0 = h.totalcost
    count = 0
    maxprobe = 0
    idxfloor = h.idxfloor

    for i = 1:sz
        @inbounds if olds[i] == 0x1
            k = oldk[i]
            v = oldv[i]
            d = dibs[i]
            index0 = index = hashindex(k, newsz)
            while slots[index] != 0
                index = (index & (newsz-1)) + 1
            end
            probe = (index - index0) & (newsz-1)
            probe > maxprobe && (maxprobe = probe)
            index < idxfloor && (idxfloor = index)
            slots[index] = 0x1
            keys[index] = k
            vals[index] = v
            dibs[index] = d
            count += 1

            if h.totalcost != totalcost0
                # if `h` is changed by a finalizer, retry
                return rehash!(h, newsz)
            end
        end
    end

    h.slots = slots
    h.keys = keys
    h.vals = vals
    h.dibs = dibs
    h.count = count
    h.maxprobe = maxprobe
    h.idxfloor = idxfloor
    @assert h.totalcost == totalcost0
    return h
end

I've tried to keep it as close as I could to the present implementation in dict.jl.

Going forward

Currently, I'm writing tests for the RobinDict and progress on that can be seen here .

There are a few hiccups that needs to be resolved, before the dictionary is up and working.

  1. The first major bug that I've encountered while testing is as follows. Due to backward shifting all the entries, the original mapped position of a key is changed, and thus search is failing in certain circumstances,
  2. I'm calling rehash! function whenever the load factor of the dictionary reaches up to a certain limit(0.80, as of now). This seems to be slowing up the code for setindex! operation on size of about 10k entries.

The plan in the upcoming weeks to resolve these issues, and also to complete writing tests and finish writing the docstrings of the functions. Hopefully, by the time of next blog, the PR will already be in review.

References