Ordered Dictionary - An alternative approach

What are dictionaries? They are those cool data structures which guarantee lookup, insert, and deletion in constant time. Underneath, they are hash tables, either based on open-addressing or chaining or a variation of both.

Now what are ordered dictionaries? It is a dictionary subclass that remembers the order in which its contents are added, supporting the usual dictionary methods. If a new entry overwrites an existing entry, the original insertion position is left unchanged. Deleting an entry and reinserting it will move it to the end.

To guarantee the maintenance of order, we have to compromise on time complexity of deletion. As mentioned earlier, deleting and reinserting an entry messes with the order. So, critical challenge is to overcome this.

I'll discuss how I approached this challenge, and I'll describe my thought process in this blog.

Mirror Indexing

Now, what's the major challenge in deleting the key? Locating the position where the key was inserted. Now, we can easily get over this problem by keeping a dictionary to keep track of the position. This is done by the present implementation of OrderedDict in OrderedCollections.jl.

mutable struct OrderedDict{K,V} <: AbstractDict{K,V}
    slots::Array{Int32,1}
    keys::Array{K,1}
    vals::Array{V,1}
    ndel::Int
    dirty::Bool
end
0.2s

Basically, the slots array is used for maintaining the underlying open-addressing based and linearly probed hash table. It stores the index of the key in the keys array. You can call it inverted index.

Deletion works like this. Suppose, you have a key k and it's mapped to a position slot_index in the slots array. We mark slots[slot_index] by a garbage value and just remove the (k, v) from the keys and vals array, respectively.

    ccall(:jl_arrayunset, Cvoid, (Any, UInt), h.keys, ki-1)
    ccall(:jl_arrayunset, Cvoid, (Any, UInt), h.vals, ki-1)
Shift+Enter to run

These calls have to shift elements from ki to the end of the arrays keys and vals. And I guess, that has it's overhead.

Now, what is mirror indexing? It's almost similar to inverted index. Except that, the dictionary would store the most recent position of the key in the keys array. Wait, what? What do you mean by most recent ? Isn't it supposed to be unique. Yes, but not in my implementation. I don't immediately call :jl_arrayunset. I just update it to a garbage value and let it stay. Why? Remember the case, deletion followed by insertion. Do you see a drawback? What if we don't insert. What happens then?

Lazy deletion

In the last section, I talked about updating to garbage value and letting it stay. The problem is that, if there are frequent deletions with few re-insertions. Then, junk key keeps populating in my underlying dictionary. Let me clear it up. Talk is cheap, let's see the code.

import Pkg; Pkg.add("DataStructures");
using DataStructures:RobinDict;
mutable struct OrderedRobinDict{K,V} <: AbstractDict{K,V}
    dict::RobinDict{K, Int32} 
    keys::Vector{K}
    vals::Vector{V}
    count::Int32
end
231.4s

dict is a dictionary which maintains the mirror index of the key in keys array. When I say, junk data in the dictionary, I mean that keys is not cleared immediately following a deletion, and if there are several such items, it would blow out of proportion.

Here's the fix. Delete these junk when it crosses a threshold limit. We know, that dictionaries are defined by their load factor. Let's call this Junk Factor. The cleanup work is decided by the following piece of code. And this is what, I called Lazy Deletion.

# rehash when there are ALLOWABLE_USELESS_GROWTH %
# tombstones, or non-mirrored entries in the dictionary
function check_for_rehash(h::OrderedRobinDict)
    keysl = length(h.keys)
    dictl = length(h)
    return (keysl > (1 + ALLOWABLE_USELESS_GROWTH)*dictl)
end
0.5s
check_for_rehash (generic function with 1 method)

Generic implementation

This approach for maintaining order can be easily generalised. Notice that, there are no assumptions on the underlying container that can be used. As both Lyndon White and Jameson Nash pointed out, I can do away with dependency on RobinDict, by defining my struct, something of this form.

mutable struct OrderedDict{K, V, DT<:AbstractDict{K, Int}} <: AbstractDict{K,V}
     dict::DT
     keys::Vector{K}
     vals::Vector{V}
     count::Int
end
0.3s

Current status

The code for the implementation resides here . The PR is blocked because I'm using RobinDict from DataStructures.jl and OrderedCollections.jl is itself imported by DataStructures.jl. It'll be resolved as soon as this issue is resolved.

Initial benchmarks on my local machine, shows it's performs better than the present implementation. I'll cover detailed benchmarks, once it nears merging.

Runtimes (1)