Dictionary Search!
Well, this blog is going to be a commentary on the dictionary implementations of several languages: Python, Ruby, Go, Rust and Julia. It's part of an investigation I did to improve my own implementation, RobinDict. So, let's start!
Python
The implementation of a dictionary resides in cpython/dictobject.c. The first striking feature of the file is that it is extremely well commented at the beginning.
There's even a file explaining the use cases and data layout of the dictionary. The basic structure is to store the keys and the hashes in a dict-keys object and keep a separate values array.
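As a mental model, here's a rough sketch of that split layout (in Python, not CPython's actual C structs; the class and field names are purely illustrative):

```python
# Rough sketch of the split layout: keys and their cached hashes live
# together, while the values sit in a separate array (illustrative only).
class DictKeys:
    def __init__(self):
        self.hashes = []   # cached hash of each key
        self.keys = []     # the keys themselves

class SplitDict:
    def __init__(self):
        self.dk = DictKeys()
        self.values = []   # stored separately from the keys

d = SplitDict()
d.dk.hashes.append(hash("spam"))
d.dk.keys.append("spam")
d.values.append(42)
```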
Open-Addressing
It uses an open-addressed hash table as its method of hash-collision resolution, with an unusual recurrence for traversing the table. I will explain this in a bit more detail.
Now, unlike regular linear probing, Python uses this recurrence:
j = ((5*j) + 1) mod 2**i
We can be sure of one thing: since the recurrence is taken modulo 2**i, all the indices will be traversed. This is the Hull-Dobell theorem.
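As a quick sanity check of that claim, here's a tiny sketch that runs the recurrence over a small table:

```python
# The recurrence j = (5*j + 1) mod 2**i is a full-cycle linear congruential
# generator (Hull-Dobell), so it visits every index of a 2**i-sized table.
i = 3                         # a table of size 8
size = 2 ** i
j, seen = 0, []
for _ in range(size):
    seen.append(j)
    j = (5 * j + 1) % size
print(seen)                               # [0, 1, 6, 7, 4, 5, 2, 3]
print(sorted(seen) == list(range(size)))  # True: every slot gets probed
```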
But that's not the end of it all. Python spices things up even further with a second optimization that brings the hash of the key into play. Let's see how.
This is done by initializing a variable perturb to the full hash code, and changing the recurrence to:
perturb >>= PERTURB_SHIFT;
j = (5*j) + 1 + perturb;
use j % 2**i as the next table index;
Now the probe sequence depends (eventually) on every bit in the hash code, and the pseudo-scrambling property of recurring on 5*j+1 is more valuable, because it quickly magnifies small differences in the bits that didn't affect the initial index. Note that because perturb is unsigned, if the recurrence is executed often enough, perturb eventually becomes and remains 0. At that point (very rarely reached) the recurrence is on (just) 5*j+1 again, and that's certain to find an empty slot eventually, since it generates every integer in range(2**i).
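Putting the quoted recurrence together, a simplified model of the probe sequence could look like this (probe_sequence is an illustrative helper, not CPython code):

```python
# Simplified model of the perturbed probe sequence described above.
PERTURB_SHIFT = 5

def probe_sequence(h, i, steps=8):
    """Yield the first few table indices probed for hash value h."""
    mask = 2 ** i - 1
    perturb = h & (2 ** 64 - 1)    # treat the hash as unsigned
    j = h & mask                   # initial index from the low bits
    for _ in range(steps):
        yield j & mask
        perturb >>= PERTURB_SHIFT  # mix higher bits of the hash in
        j = 5 * j + 1 + perturb

print(list(probe_sequence(hash("spam"), i=3)))
```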
Load Factor and Growth Rate
Python defines USABLE_FRACTION as the maximum dictionary load. As mentioned here, it is 2/3rd of the dictionary size, which caps the load factor at about 66.67%.
Python defines GROWTH_RATE as the growth rate upon hitting the maximum load. As cited here, the dictionary currently rehashes to a size of about 3 times the number of keys present.
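A rough sketch of these two thresholds, assuming the formulas from the dictobject.c comments (USABLE_FRACTION(n) = (2*n)//3, a new size of roughly 3 * used, and a minimum table size of 8; the helpers are illustrative, not CPython functions):

```python
def usable_fraction(size):
    return (2 * size) // 3          # max entries before a resize triggers

def next_size(used, min_size=8):    # 8 is CPython's minimum table size
    target = 3 * used               # GROWTH_RATE: about 3x the live keys
    size = min_size
    while size <= target:
        size *= 2                   # table sizes stay powers of two
    return size

print(usable_fraction(8))   # 5 -> a fresh 8-slot table resizes on the 6th insert
print(next_size(5))         # 16
```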
Ruby
A dictionary is called Hash in Ruby. Most of the code for Hash resides in st.c. Although Ruby started with bucket-based chaining (from what I know), it has moved on to an open-addressed hash table in its current implementation. The hash used here is a variation of the Murmur3 hash.
Ruby stores the key, hash, and value in an array called entries, and the entries are looked up by mapping their hash value into an array called bins.
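To make that layout concrete, here's a rough model in Python (not Ruby's actual st.c structures; the probing inside insert is just a stand-in):

```python
# Rough model of the entries/bins split: entries keeps (hash, key, value)
# records in insertion order, bins maps hash-derived slots to entry indices.
entries = []
bins = [None] * 8

def insert(key, value):
    h = hash(key)
    entries.append((h, key, value))
    slot = h % len(bins)
    while bins[slot] is not None:   # simple stand-in for the real probing
        slot = (slot + 1) % len(bins)
    bins[slot] = len(entries) - 1   # the bin stores the entry's index

insert("a", 1)
insert("b", 2)
print(entries)
print(bins)
```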
Open-Addressing
From the looks of it, Ruby provides two probing implementations, linear and quadratic. The entire code has #ifdef QUADRATIC_PROBE spilled all over it. I was pretty confused about this, until I noticed here that the #define QUADRATIC_PROBE has been commented out.
I noticed that the recurrence used here is the same as that used by Python, except the perturb shift is 11 here, unlike 5 in Python. See the code here for reference. It was while going through this code base that I came to know about the Hull-Dobell theorem.
Load Factor and Growth Rate
The first point to notice is that this aspect of the dictionary is cleverly hidden in the code. Whereas in Python it was spelled out in the comments, here I had to do the hard work to figure it out.
Well, firstly there's an abstraction called st_features, from which we can see that the size of the bins array is twice that of the entries array, which implies an upper bound on the maximum load factor of about 50%. Understanding what this line does would give a clearer picture of the load factor and growth rate. For now, I'll let it stay.
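Still, a back-of-the-envelope calculation shows why twice-as-large bins put the cap near 50%:

```python
# If the bins array always has twice as many slots as entries can hold,
# the bins can never be more than half full.
entries_capacity = 32
bins_size = 2 * entries_capacity        # relation gleaned from st_features
print(entries_capacity / bins_size)     # 0.5 -> roughly a 50% cap
```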
Go
Unlike all the above-mentioned languages, Go uses chaining for collision resolution. Since it's not an open-addressed hash table, the usual definition of load factor won't work here. A dictionary is called a Map in Go. The implementation can be seen here.
The data is arranged into an array of buckets. Each bucket contains up to 8 key/elem pairs. The low-order bits of the hash are used to select a bucket. If more than 8 keys hash to a bucket, extra buckets are chained on. When the hash table grows, a new array of buckets twice as big is allocated, and buckets are incrementally copied from the old bucket array to the new one.
The load factor here is defined as the average number of key/elem pairs mapped to a bucket. The maximum average load of a bucket that triggers growth is 6.5.
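Here's a toy model of that growth trigger (MAX_AVG_LOAD and needs_to_grow are names I made up; the 8-per-bucket and 6.5 figures come from the description above):

```python
# Toy model: each bucket holds up to 8 key/elem pairs, and growth kicks in
# once the average load per bucket exceeds 6.5.
MAX_AVG_LOAD = 6.5

def needs_to_grow(count, num_buckets):
    """count: total key/elem pairs; num_buckets: size of the bucket array."""
    return count / num_buckets > MAX_AVG_LOAD

print(needs_to_grow(26, 4))   # False: average load is exactly 6.5
print(needs_to_grow(27, 4))   # True:  average load is 6.75
```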
Rust
Previously, Rust used the Robin Hood hashing technique, and the improvement presented in the micro-benchmarks is what inspired my JSoC project. This implementation can be viewed here.
The current hash table implementation is a Rust port of Google's SwissTable. The original C++ version of SwissTable can be found here, and this CppCon talk gives an overview of how the algorithm works. A SwissTable is an open-addressing hash table with quadratic probing. Quite interesting!
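For the curious, here's a generic sketch of quadratic probing with triangular increments over a power-of-two table (the actual SwissTable probes over whole groups of control bytes rather than single slots, so this is a simplification):

```python
# Generic quadratic probing with triangular increments (+1, +2, +3, ...),
# which is guaranteed to hit every slot of a power-of-two-sized table.
def quadratic_probe_sequence(h, size, steps=8):
    mask = size - 1
    j = h & mask
    for step in range(1, steps + 1):
        yield j
        j = (j + step) & mask

print(list(quadratic_probe_sequence(hash("key"), size=16)))
```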
I won't invest much time into it now, but I'll surely get back to it as soon as time permits. It would be interesting to get it into DataStructures.jl at some point.
Julia
Julia uses an open-addressed hash table with linear probing as its current implementation of the dictionary, which is called Dict in the language. The implementation can be found here.
Unlike the above-mentioned languages, Julia doesn't store the hash of the key. The basic structure of Dict is an array named slots, which serves as an indicator of whether a slot is empty, filled, or deleted, along with the usual arrays of keys and vals. Apart from these, there is a variable ndel which keeps count of the number of deletions, and a variable maxprobe which keeps track of the maximum probe length while probing for a key.
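A rough model of that layout (again in Python, for consistency; the state flags are illustrative):

```python
# Rough model of Dict's bookkeeping: a slots array of per-slot states plus
# parallel keys/vals arrays.
EMPTY, FILLED, DELETED = 0, 1, 2

size = 16
slots = [EMPTY] * size   # state of each slot
keys = [None] * size
vals = [None] * size
ndel = 0                 # running count of deletions
maxprobe = 0             # longest probe length seen so far
```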
Open-Addressing
Unlike Python and Ruby, the recurrence used here is pretty standard: it probes linearly to the very next index.
j = (j + 1) mod 2**i
Also, in most of the implementations, the table size is maintained as a power of two, so the modulo operation reduces to a cheap bitwise AND.
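A small sketch of both points, the linear step and the power-of-two trick (next_index is an illustrative helper):

```python
# Linear probing step: (j + 1) mod 2**i is the same as a bitwise AND
# with 2**i - 1 when the table size is a power of two.
def next_index(j, size):
    return (j + 1) & (size - 1)   # size must be a power of two

size, j = 8, 6
for _ in range(4):
    print(j, end=" ")             # prints: 6 7 0 1 -- wraps around the table
    j = next_index(j, size)
```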
Load Factor and Growth Rate
The Dict rehash operation happens here in either of two scenarios:
- The number of entries reaches 2/3rd of the table size.
- The number of deletions reaches 3/4th of the table size.
So, the maximum load factor that Dict can achieve is about 67%.
Whenever the rehash! happens, the Dict grows 4 times in size, unless the number of entries exceeds 64000, after which it grows by 2 times.
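A sketch of that growth policy (the threshold and factors are from the description above; new_size itself is an illustrative helper, not Julia's rehash! code):

```python
# Quadruple while the Dict is small, double once it holds more than
# 64000 entries.
def new_size(current_size, num_entries):
    factor = 4 if num_entries <= 64000 else 2
    return current_size * factor

print(new_size(16, 100))            # 64
print(new_size(1 << 17, 100_000))   # 262144: past 64000 entries, only doubles
```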
This pretty standard implementation of linear probing is fast enough for most circumstances. I was curious to try several heuristics to improve its performance, which motivated me to try out the Robin Hood hashing scheme, based on this implementation, as part of my summer project in Julia. I will explain the nuances of my implementation in detail in a separate blog post.