Dictionary Search!
Well, this blog is going to be a commentary on the dictionary implementations of several languages: Python, Ruby, Go, Rust and Julia. It's part of an investigation I did to improve my own implementation, RobinDict. So, let's start!
Python
The implementation of a dictionary resides in cpython/dictobject.c. The first striking feature of the file is that it is extremely well commented at the beginning.
There's even a file explaining the use cases and data layout of the dictionary. The basic structure is to store the keys and the hashes in a dict-keys object and keep a separate values array.
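As a mental model, here's a rough sketch of that split layout (in Python, not CPython's actual C structs; the class and field names are purely illustrative):

```python
# Rough sketch of the split layout: keys and their cached hashes live
# together, while the values sit in a separate array (illustrative only).
class DictKeys:
    def __init__(self):
        self.hashes = []   # cached hash of each key
        self.keys = []     # the keys themselves

class SplitDict:
    def __init__(self):
        self.dk = DictKeys()
        self.values = []   # stored separately from the keys

d = SplitDict()
d.dk.hashes.append(hash("spam"))
d.dk.keys.append("spam")
d.values.append(42)
```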
Open-Addressing
It uses an open-addressed hash table as its method of hash-collision resolution, with an unusual recurrence for traversing the table. I will explain this in a bit more detail.
Now, unlike regular linear probing, Python uses this recurrence:
j = ((5*j) + 1) mod 2**i
We can be sure of one thing: since the recurrence is taken modulo 2**i, all the indices will be traversed. This is the Hull-Dobell theorem.
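As a quick sanity check of that claim, here's a tiny sketch that runs the recurrence over a small table:

```python
# The recurrence j = (5*j + 1) mod 2**i is a full-cycle linear congruential
# generator (Hull-Dobell), so it visits every index of a 2**i-sized table.
i = 3                         # a table of size 8
size = 2 ** i
j, seen = 0, []
for _ in range(size):
    seen.append(j)
    j = (5 * j + 1) % size
print(seen)                               # [0, 1, 6, 7, 4, 5, 2, 3]
print(sorted(seen) == list(range(size)))  # True: every slot gets probed
```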
But that's not the end of it all. Python spices things up even further with a second optimization that brings the hash of the key into play. Let's see how.
This is done by initializing a variable perturb to the full hash code, and changing the recurrence to:
perturb >>= PERTURB_SHIFT;
j = (5*j) + 1 + perturb;
use j % 2**i as the next table index;
Now the probe sequence depends (eventually) on every bit in the hash code, and the pseudo-scrambling property of recurring on 5*j+1 is more valuable, because it quickly magnifies small differences in the bits that didn't affect the initial index. Note that because perturb is unsigned, if the recurrence is executed often enough, perturb eventually becomes and remains 0. At that point (very rarely reached) the recurrence is on (just) 5*j+1 again, and that's certain to find an empty slot eventually, since it generates every integer in range(2**i).
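Putting the quoted recurrence together, a simplified model of the probe sequence could look like this (probe_sequence is an illustrative helper, not CPython code):

```python
# Simplified model of the perturbed probe sequence described above.
PERTURB_SHIFT = 5

def probe_sequence(h, i, steps=8):
    """Yield the first few table indices probed for hash value h."""
    mask = 2 ** i - 1
    perturb = h & (2 ** 64 - 1)    # treat the hash as unsigned
    j = h & mask                   # initial index from the low bits
    for _ in range(steps):
        yield j & mask
        perturb >>= PERTURB_SHIFT  # mix higher bits of the hash in
        j = 5 * j + 1 + perturb

print(list(probe_sequence(hash("spam"), i=3)))
```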
Load Factor and Growth Rate
Python defines USABLE_FRACTION as the maximum dictionary load. As mentioned here, it is 2/3rd of the dictionary size, which caps the load factor at about 66.67%.
Python defines GROWTH_RATE as the growth rate upon hitting the maximum load. As cited here, the dictionary currently rehashes to a size of about 3 times the number of keys present.
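A rough sketch of these two thresholds, assuming the formulas from the dictobject.c comments (USABLE_FRACTION(n) = (2*n)//3, a new size of roughly 3 * used, and a minimum table size of 8; the helpers are illustrative, not CPython functions):

```python
def usable_fraction(size):
    return (2 * size) // 3          # max entries before a resize triggers

def next_size(used, min_size=8):    # 8 is CPython's minimum table size
    target = 3 * used               # GROWTH_RATE: about 3x the live keys
    size = min_size
    while size <= target:
        size *= 2                   # table sizes stay powers of two
    return size

print(usable_fraction(8))   # 5 -> a fresh 8-slot table resizes on the 6th insert
print(next_size(5))         # 16
```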
Ruby
A dictionary is called Hash in Ruby. Most of the code for Hash resides in st.c. Although Ruby started with bucket-based chaining (from what I know), it has moved on to an open-addressed hash table in its current implementation. The hash used here is a variation of the Murmur3 hash.
Ruby stores the key, hash, and value in an array called entries, and the entries are looked up by mapping their hash value into an array called bins.
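To make that layout concrete, here's a rough model in Python (not Ruby's actual st.c structures; the probing inside insert is just a stand-in):

```python
# Rough model of the entries/bins split: entries keeps (hash, key, value)
# records in insertion order, bins maps hash-derived slots to entry indices.
entries = []
bins = [None] * 8

def insert(key, value):
    h = hash(key)
    entries.append((h, key, value))
    slot = h % len(bins)
    while bins[slot] is not None:   # simple stand-in for the real probing
        slot = (slot + 1) % len(bins)
    bins[slot] = len(entries) - 1   # the bin stores the entry's index

insert("a", 1)
insert("b", 2)
print(entries)
print(bins)
```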
Open-Addressing
From the looks of it, Ruby provides two probing implementations, linear and quadratic. The entire code has #ifdef QUADRATIC_PROBE spilled all over it. I was pretty confused about this, until I noticed here that the #define QUADRATIC_PROBE has been commented out.
I noticed that the recurrence used here is the same as that used by Python, except the perturb shift is 11 here, unlike 5 in Python. See the code here for reference. It was while going through this code base that I came to know about the Hull-Dobell theorem.
Load Factor and Growth Rate
The first point to notice is that this aspect of the dictionary is cleverly hidden in the code. Whereas in Python it was spelled out in the comments, here I had to do the hard work to figure it out.
Well, firstly there's an abstraction called st_features, from which we can see that the size of the bins array is twice that of the entries array, which implies an upper bound on the maximum load factor of about 50%. Understanding what this line does would give a clearer picture of the load factor and growth rate. For now, I'll let it stay.
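Still, a back-of-the-envelope calculation shows why twice-as-large bins put the cap near 50%:

```python
# If the bins array always has twice as many slots as entries can hold,
# the bins can never be more than half full.
entries_capacity = 32
bins_size = 2 * entries_capacity        # relation gleaned from st_features
print(entries_capacity / bins_size)     # 0.5 -> roughly a 50% cap
```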
Go
Unlike all the above-mentioned languages, Go uses chaining for collision resolution. Since it's not an open-addressed hash table, the usual definition of load factor won't work here. A dictionary is called a Map in Go. The implementation can be seen here.
The data is arranged into an array of buckets. Each bucket contains up to 8 key/elem pairs. The low-order bits of the hash are used to select a bucket. If more than 8 keys hash to a bucket, extra buckets are chained on. When the hash table grows, a new array of buckets twice as big is allocated, and buckets are incrementally copied from the old bucket array to the new one.
The load factor here is defined as the average number of key/elem pairs mapped to a bucket. The maximum average load of a bucket that triggers growth is 6.5.
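Here's a toy model of that growth trigger (MAX_AVG_LOAD and needs_to_grow are names I made up; the 8-per-bucket and 6.5 figures come from the description above):

```python
# Toy model: each bucket holds up to 8 key/elem pairs, and growth kicks in
# once the average load per bucket exceeds 6.5.
MAX_AVG_LOAD = 6.5

def needs_to_grow(count, num_buckets):
    """count: total key/elem pairs; num_buckets: size of the bucket array."""
    return count / num_buckets > MAX_AVG_LOAD

print(needs_to_grow(26, 4))   # False: average load is exactly 6.5
print(needs_to_grow(27, 4))   # True:  average load is 6.75
```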
Rust
Previously, Rust used the Robin Hood hashing technique, and the improvement presented in the micro-benchmarks is what inspired my JSoC project. This implementation can be viewed here.
The current hash table implementation is a Rust port of Google's SwissTable. The original C++ version of SwissTable can be found here, and this CppCon talk gives an overview of how the algorithm works. A SwissTable is an open-addressing hash table with quadratic probing. Quite interesting!
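For the curious, here's a generic sketch of quadratic probing with triangular increments over a power-of-two table (the actual SwissTable probes over whole groups of control bytes rather than single slots, so this is a simplification):

```python
# Generic quadratic probing with triangular increments (+1, +2, +3, ...),
# which is guaranteed to hit every slot of a power-of-two-sized table.
def quadratic_probe_sequence(h, size, steps=8):
    mask = size - 1
    j = h & mask
    for step in range(1, steps + 1):
        yield j
        j = (j + step) & mask

print(list(quadratic_probe_sequence(hash("key"), size=16)))
```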
I won't invest much time into it now, but I'll surely get back to it as soon as time permits. It would be interesting to get it into DataStructures.jl at some point.
Julia
Julia uses an open-addressed hash table with linear probing as its current implementation of the dictionary, which is called Dict in the language. The implementation can be found here.
Unlike the above-mentioned languages, Julia doesn't store the hash of the key. The basic structure of Dict is an array named slots, which serves as an indicator of whether a slot is empty, filled, or deleted, along with the usual arrays of keys and vals. Apart from these, there is a variable ndel which keeps count of the number of deletions, and a variable maxprobe which keeps track of the maximum probe length while probing for a key.
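A rough model of that layout (again in Python, for consistency; the state flags are illustrative):

```python
# Rough model of Dict's bookkeeping: a slots array of per-slot states plus
# parallel keys/vals arrays.
EMPTY, FILLED, DELETED = 0, 1, 2

size = 16
slots = [EMPTY] * size   # state of each slot
keys = [None] * size
vals = [None] * size
ndel = 0                 # running count of deletions
maxprobe = 0             # longest probe length seen so far
```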
Open-Addressing
Unlike Python and Ruby, the recurrence used here is pretty standard: it probes linearly to the very next index.
j = (j + 1) mod 2**i
Also, in most of the implementations, the table size is maintained as a power of two, so the modulo operation reduces to a cheap bitwise AND.
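A small sketch of both points, the linear step and the power-of-two trick (next_index is an illustrative helper):

```python
# Linear probing step: (j + 1) mod 2**i is the same as a bitwise AND
# with 2**i - 1 when the table size is a power of two.
def next_index(j, size):
    return (j + 1) & (size - 1)   # size must be a power of two

size, j = 8, 6
for _ in range(4):
    print(j, end=" ")             # prints: 6 7 0 1 -- wraps around the table
    j = next_index(j, size)
```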
Load Factor and Growth Rate
The Dict rehash operation happens here in either of two scenarios:
- The number of entries reaches 2/3rd of the table size.
- The number of deletions reaches 3/4th of the table size.
So, the maximum load factor that Dict can achieve is about 67%.
Whenever the rehash! happens, the Dict grows 4 times in size, unless the number of entries exceeds 64000, after which it grows by 2 times.
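A sketch of that growth policy (the threshold and factors are from the description above; new_size itself is an illustrative helper, not Julia's rehash! code):

```python
# Quadruple while the Dict is small, double once it holds more than
# 64000 entries.
def new_size(current_size, num_entries):
    factor = 4 if num_entries <= 64000 else 2
    return current_size * factor

print(new_size(16, 100))            # 64
print(new_size(1 << 17, 100_000))   # 262144: past 64000 entries, only doubles
```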
This pretty standard implementation of linear probing is fast enough for most circumstances. I was curious to try several heuristics to improve its performance, which motivated me to try out the Robin Hood hashing scheme, based on this implementation, as part of my summer project in Julia. I will explain the nuances of my implementation in detail in a separate blog post.