GSoC'20 in DataStructures.jl

I am highly interested in data structures and my project this summer under Google Summer of Code, aptly reflects my interests. In this blog, I'll summarise the work that I've done this summer. Read on!

OrderedRobinDict

PR - https://github.com/JuliaCollections/DataStructures.jl/pull/658

This PR started in OrderedCollections.jl, but ended up getting in merged here in DataStructures.jl, because of a "package-dependancy-loop".

This is an ordered dictionary, which makes use of RobinDict, that I wrote last summer, to maintain the order of the keys. Apart from that, it uses a concept of "lazy" removal of tombstones, so that the amortised time complexity of deletion is better the O(n), where n is the number of keys. I've explained the approach for writing this dictionary in details here.

While writing this code, I fixed few bugs, which where present in RobinDict. It was fun, correcting my own mistakes, which were left unnoticed.

SwissDict

PR - https://github.com/JuliaCollections/DataStructures.jl/pull/634

SwissDict is inspired from SwissTables developed by Google engineers. This is a "compact" dictionary which works under extremely high load factor (98%), and still competes with Dict in terms of performance. It allocates 1/3rd the memory as compared to Dict (benchmarks posted as a comment in the PR). A cppcon talk by Matt Kulukundis inspired by implementation. My code uses llvmcall which were actually fun to write. It was quite a challenge getting this to work as per my expectations, given I had to implement this completely based on the talk. I didn't have reference to any paper of sorts. The keys are inserted in blocks of 16 elements, with the last 8 bits of the hash determining the location of the key in the block. The remaining 24/56 bits (depending on the architecture) determine the index of the block in which insertion can take place. While searching for a key, we first look-up the index of the bucket and then "quickly" search for the probable location of the key. This is done using LLVM mumbo-jumbo, (actually a machine level instruction) which allows us to compare 16 position simultaneously. This allows for the relatively fast lookup under high load factor, and the compactness.

Red Black Tree

PR - https://github.com/JuliaCollections/DataStructures.jl/pull/638

A red–black tree is a kind of self-balancing binary search tree in computer science. Each node of the binary tree has an extra bit, and that bit is often interpreted as the color (red or black) of the node. These color bits are used to ensure the tree remains approximately balanced during insertions and deletions. The height of the tree is logarithmic in terms of number of nodes, and this ensures search, insert and delete operations can be performed in O(log n) time complexity.

AVL Tree

PR - https://github.com/JuliaCollections/DataStructures.jl/pull/640

AVL Trees are also a variant of self-balancing binary search trees, offering similar complexity of search, insert and delete as does Red Black Tree. But, the manner in which height rebalancing occurs in an AVL Tree is quite different from that of a RBTree. AVL Trees are more strictly balanced than Red Black Tree. Hence, it is optimal for usage in look-up intensive operations.

AVLTree also implements order statistics on keys. Adding a field subsize to the nodes, can help in answering order statistics in O(log n). Maintaining of this augmented field is done while insertion, deletion, and in rotations. I haven't found a Julia code for doing order-statistics on keys in logarithmic time, so it is really code that I've this feature added to AVLTree .

Splay Tree

PR - https://github.com/JuliaCollections/DataStructures.jl/pull/645

A splay tree is a self-balancing binary search tree with the additional property that recently accessed elements are quick to access again. It performs basic operations such as insertion, look-up and removal in O(log n) amortized time. The motivation for amortized analysis is that looking at the worst-case run time per operation, rather than per algorithm, can be too pessimistic. Although the worst case height of a splay tree can be O(n), but over a sequence of operations, the logarithmic bound of complexity of operations can be achieved.

Other Work

Apart from this, I have a PR about BTree which is a WIP. I am stuck in implementing the delete operation, which involves a bit of case-work.

I cleaned up code in RobinDict, removing the import to the parent file in this PR. Also, I completed a really old PR about comprehensive benchmark suite for dictionaries.

I find it fun, reviewing pull requesting for DataStructures.jl, and to-be-honest, I did actually learn by reading other people's code.

My Experience

Making into Google Summer of Code was the best part of my summer. I am really passionate about data structures, and spend most of my idle time, reading about them. So, this project is a culmination of my interests. Getting an opportunity to do it in Julia was really a plus point. The community here is super-awesome, and always eager to help.

My mentors, Lyndon White and Jameson Nash, were really helpful and highly supporting. They constantly provided feedbacks and spent lot of time in reviewing my pull requests. Particularly, in getting OrderedRobinDict merged, it took a wholesome effort from Jameson's side. Lyndon was particularly patient and guided me really well, especially in the RBTree PR.

As a last note, I want to thank my parents, who supported me through the summer, because I had to stay at home, because of the pandemic. I am equally grateful to my friends who motivated me and supported me. All in all, it was a really awesome experience contributing to the Julia ecosystem, this summer.