Search for Descendants within a Graph
First things first: Setup
{:deps
{org.clojure/clojure {:mvn/version "1.10.0"}
org.clojure/tools.deps.alpha
{:git/url "https://github.com/clojure/tools.deps.alpha.git"
:sha "f6c080bd0049211021ea59e516d1785b08302515"}
compliment {:mvn/version "0.3.9"}
ubergraph {:mvn/version "0.8.2"}}}
apt-get update
apt-get install graphviz
(use clojure.tools.deps.alpha.repl)
(ns-unmap *ns* ancestors)
(ns-unmap *ns* descendants)
(require [ubergraph.core :as ugraph])
The problem
In a project in the field of e-commerce, my team was faced with the following task: From a relevant large number of boolean statements M
, the most specific super-statement S
(in the sense of superset) had to be found for a requested Boolean statement R
, which might be, but in most cases is not, part of M
.
An example: Assume M = {a, b, a && b}
and R = a && b && c
, then S = a && b
.
Note that the set M
was the nodes of a graph that was constantly updated asynchronously. The edges of this graph expressed a relationship between the nodes, but this relationship is not important here.
As the Boolean statements become more complicated, it is relatively difficult to check whether a statement A
is a super-statement of a statement B
. Therefore, as the size of the graph increases, the approach of checking each node of the graph does not scale.
A variant to speed up the search would have been to build an artificial binary search tree in the graph. As an example, let us assume M = {a, b, c, d, e, f}
. Then we could create the following binary search tree:
(-> "a || b || c"
(ugraph/digraph)
(ugraph/add-edges ["a || b || c" "a || b" {-> true}])
(ugraph/add-edges ["a || b" "a" {-> true}])
(ugraph/add-edges ["a || b" "c" {-> false}])
(ugraph/add-edges ["a" "b" {-> false}])
(ugraph/add-edges ["a || b || c" "d || e" {-> false}])
(ugraph/add-edges ["d || e" "d" {-> true}])
(ugraph/add-edges ["d || e" "f" {-> false}])
(ugraph/add-edges ["d" "e" {-> false}])
(ugraph/viz-graph {:auto-label true
:save {:filename "/results/t-0.jpeg"
:format :jpeg}}))
nil
We rejected this idea because we did not want to create a disjunction with 100,000+ Boolean statements as the root node.
I would like to describe an alternative approach in the following and would therefore like to abstract from the original problem.
Reformulation
Instead of working with Boolean statements and super-statements, the problem could also be described as follows: Suppose we have a graph G
whose nodes are natural numbers and whose edges express a relationship between the numbers. For simplicity's sake, this relationship should be understood as "my checksum is". So node 3 points to itself and node 13 points to node 4. Instead of looking for the super-statement, we look for the greatest divisor of a natural number which is also a node of G
. (Please note that a solution for this problem must also work for the original problem).
A solution
Protocol
To make our solution work for the original problem as well as for the derived problem, we design the solution algorithm against a protocol.
(defprotocol Tree
"The descendants search tree abstraction within a graph"
(root [self] "This node must be the ancestor of every other node")
(query [self] "A map to distinguish the edges already existing in the graph and the edges created for searching")
(descendant? [self a b] "The (probably expensive) predicate, that returns true if a is a descendant of b")
(compare-descendants [self a b] "If the search found two possible candidates, this function decides which candidate is a better fit. Should return a positive number, if a is a better fit.")
(graph [self] "Returns the underlying graph data structure")
(update-graph [self f] "Updates the underlying graph data structure and returns a new Tree"))
Implementation
Here we implement the protocol for the derived problem and call it DivisionTree
. A node A
is a descendent of a node B
, if B
is dividable by A
without a rest. The compare-descendants
function can be used, if multiple candidates have been found. For the original problem, the implementation for this function would have been more complicated. We also increment a counter for each call to the expensive function descendant?
.
(defrecord DivisionTree [graph counter] Tree
(root [self] 1)
(query [self] {:type :division})
(graph [self] graph)
(descendant? [self a b] (do (swap! counter inc) (= 0 (mod a b))))
(compare-descendants [self a b] (compare a b))
(update-graph [self f] (update self :graph f)))
(defn sum-of-digits [n]
(loop [count n acc 0]
(if (zero? count) acc
(recur (quot count 10)
(+ acc (mod count 10))))))
(defn insert-number [insert+balance tree n]
(-> tree
(insert+balance n)
(update-graph (fn [g]
(ugraph/add-edges g
[n
(sum-of-digits n)
{:type :sum-of-digit}])))))
(defn divison-graph [numbers]
(reduce (fn [t n] (insert-number (fn [& args] (first args)) t n))
(map->DivisionTree {:graph (ugraph/digraph 1)})
numbers))
Without any additional edges for searching, the internal graph with nodes form 1 up to 14 looks like this:
(defn viz
([tree] (viz tree {}))
([tree {file-name :file-name
layout :layout}]
(viz tree
(or file-name (str "g-" (int (rand 99999))))
(or layout :dot)))
([tree file-name layout]
(let [$graph (graph tree)
all-tree-edges (ugraph/find-edges (graph tree) {:type :division})
colored-graph (reduce (fn [g edge]
(ugraph/add-attr g edge :color :red))
$graph
all-tree-edges)]
(ugraph/viz-graph colored-graph {:layout layout :save {:filename (str "/results/" file-name ".jpeg") :format :jpeg}}))
nil))
(viz (divison-graph (range 1 15)) {:layout :neato})
Now we add edges for a fast search of descendants.
Construction Algorithm
To create a most even balanced tree, the construction algorithm consists of two steps for inserting a value v
:
Walk down from the root of the tree, following each edge that is an ancestor of
v
. From all the deepest nodes of all paths, insertv
as descendant of the node with the fewest descendants.For all the leafs of the tree that are descendants of
v
, move each leafl
tov
, if the width of the ancestor ofl
is larger than the width ofv
.
(defn descendants-ways [tree ancestor]
(ugraph/find-edges (graph tree)
(merge {:src ancestor}
(query tree))))
(defn descendants
([c-ways] (map :dest c-ways))
([tree ancestor] (descendants (descendants-ways tree ancestor))))
(defn ancestors [tree node]
(map :src
(ugraph/find-edges (graph tree)
(merge {:dest node}
(query tree)))))
(defn ancestor [tree node]
(first (ancestors tree node)))
(def conjj (fnil conj {}))
(defn insert [tree v ancestor]
(if (= v (root tree))
tree
(-> tree
(update-graph (ugraph/add-edges % [ancestor v (query tree)]))
(update :leafs disj ancestor)
(update :leafs conjj v))))
(defn leaf? [tree node]
(empty? (descendants tree node)))
(defn update-leafs-index [tree node ancestor new-ancestor]
(-> tree
(update :leafs disj new-ancestor)
(update :leafs (fn [leafs] (if (leaf? tree node)
(conjj leafs node)
leafs)))
(update :leafs (fn [leafs] (if (leaf? tree ancestor)
(conjj leafs ancestor)
leafs)))))
(defn move [tree node new-ancestor]
(let [ancestor (ancestor tree node)]
(-> tree
(update-graph (fn [g]
(-> g
(ugraph/remove-edges [ancestor
node
(query tree)])
(ugraph/add-edges [new-ancestor
node
(query tree)]))))
(update-leafs-index node ancestor new-ancestor))))
(defn descendants-count [tree ancestor]
(count (descendants tree ancestor)))
(defn move-if-ancestor-is-relieved [tree node sub-node-path]
(let [candidate (last sub-node-path)
node-descendants-count (descendants-count tree node)
sub-node-path-count (descendants-count tree (ancestor tree candidate))]
(if (< node-descendants-count sub-node-path-count)
(move tree candidate node)
tree)))
(defn walk-leafs
([tree start matches?]
(walk-leafs tree matches? [[start]] ()))
([tree matches? stack result]
(if-let [frst (first stack)]
(let [node (last frst)
matching-descendants (seq (filter matches?
(ancestors tree node)))]
(recur tree
matches?
(into (vec (rest stack))
(map (fn [c] (conj frst c))
matching-descendants))
(if matching-descendants result
(conj result frst))))
result)))
(defn leafs [tree]
(if-let [$leafs (seq (:leafs tree))]
$leafs
(list (root tree))))
(defn self-path? [node path]
(contains? (set path) node))
(defn alternative-paths [tree ancestor-node]
(letfn [(ancestor? [node] (descendant? tree node ancestor-node))]
(sequence (comp (filter ancestor?)
(mapcat (fn [leaf]
(walk-leafs tree leaf ancestor?)))
(remove (partial self-path? ancestor-node)))
(leafs tree))))
(defn balance [tree node]
(reduce (fn [t n] (move-if-ancestor-is-relieved t node n))
tree
(alternative-paths tree node)))
(defn select-node-with-fewest-descendants [tree node-seq]
(first (sort-by (descendants-count tree %) node-seq)))
(defn insert+balance
([tree node] (insert+balance tree (root tree) node))
([tree ancestor node]
(if-let [candidate
(select-node-with-fewest-descendants
tree
(filter (descendant? tree node %)
(descendants tree ancestor)))]
(recur tree candidate node)
(balance (insert tree node ancestor) node))))
Inserting an element has worst case time complexity of O(n)
: If a division-graph
would only consist of primes and we would insert another prime, we had to check all nodes for step 1 and again all nodes for step 2.
To illustrate the algorithm, we create a balanced-divison-tree
within a division-graph
.
(defn balanced-divison-tree [numbers]
(reduce (fn [t n] (insert-number insert+balance t n))
(map->DivisionTree {:graph (ugraph/multidigraph 1)
:counter (atom 0)})
numbers))
(defn reset-counter [tree]
(let [counter (:counter tree)]
(reset! (:counter tree) 0)
counter))
(def $balanced-division-tree
; if we add the numbers in the order 14, 13, 12, etc.
; the balancing kicks in. Otherwise, we would have a tree with
; all nodes connected to the root.
(balanced-divison-tree (reverse (range 1 15))))
(viz $balanced-division-tree {:layout :neato})
The black edges represent the sum-of-digit relationship and the red edges point from an ancestor to a direct descendant.
Since we do not care about the sum-of-digit relationship at this point, we create a view of the graph that focuses on the search tree.
(defn search-tree-view [tree]
(let [all-edges (ugraph/find-edges (graph tree) {})]
(update-graph tree
(fn [g]
(let []
(reduce (fn [g* edge]
(if (= (ugraph/attrs g* edge)
(query tree))
g*
(ugraph/remove-edges g* edge)))
g
all-edges))))))
(viz (search-tree-view $balanced-division-tree))
Now that we have a search tree, we can search for the oldest known ancestor of a value.
Search Algorithm
(defn ancestor-candidates [tree ancestor node]
(filter (partial descendant? tree node)
(descendants tree ancestor)))
(defn choose-oldest-known-ancestor [tree a b]
(if (pos? (compare-descendants tree a b))
a
b))
(defn oldest-known-ancestors
([tree node]
(oldest-known-ancestors tree
node
(root tree)
(ancestor-candidates tree
(root tree)
node)))
([tree node $oldest-known-ancestor candidates]
(if-let [next-candidate (first candidates)]
(if-let [next-ancestor-candidates
(and (not= next-candidate node)
(seq (ancestor-candidates tree
next-candidate
node)))]
(recur tree
node
$oldest-known-ancestor
(into (rest candidates)
next-ancestor-candidates))
(recur tree
node
(choose-oldest-known-ancestor tree
$oldest-known-ancestor
next-candidate)
(rest candidates)))
$oldest-known-ancestor)))
(oldest-known-ancestors $balanced-division-tree 15)
(reduce (fn [m n]
(assoc m
n
(oldest-known-ancestors $balanced-division-tree n)))
(sorted-map)
(range 10 30))
Results
Performance improvements are highly dependent on the nature of the data. That means, we have to measure the reduction of expensive work against our specific dataset. For the division-graph
problem and a graph size of the first 10000 natural numbers, we could measure the reduction of work like this:
(def big-tree-size 10000)
(def big-division-tree (balanced-divison-tree (range 1 big-tree-size)))
(defn work-reduction [number-of-searches]
(let [_ (reset-counter big-division-tree)
number-of-nodes (count (ugraph/nodes (graph big-division-tree)))
_ (doall (map (partial oldest-known-ancestors big-division-tree)
(range big-tree-size
(+ number-of-searches big-tree-size))))
number-of-expensive-calls (reset-counter big-division-tree)]
(double (/ number-of-expensive-calls
(* number-of-nodes number-of-searches)))))
(reduce (fn [m number-of-searches]
(assoc m
number-of-searches
(work-reduction number-of-searches)))
{}
[10 100 1000])
As already mentioned, the results are highly dependent on how even the data can be partitioned within the tree as well as the size of the graph. In our case the expensive work is now only about a tenth of the original work.
Conclusion
Depending on the nature of the hierarchical data, we could speed up the search for the oldest known ancestor significantly if we can tolerate the construction of the search tree in O(n)
time.
Going back to the original problem and our E-Commerce project, the set of Boolean statements was small enough in our case, so that we could check all nodes of the graph for each search request by applying a fast heuristic check first, followed by the more precise but expensive check.
It was an interesting problem, though.
Addendum
Deleting
For the sake of completeness, here is the implementation for deleting elements. To support different use cases, the deletion of an element within the search tree is not coupled with the deletion of an element within the graph.
(defn delete-ways [g query]
(let [ways (ugraph/find-edges g query)]
(reduce ugraph/remove-edges g ways)))
(defn delete [tree node]
(let [query (query tree)]
(let [leaf-candidates (ugraph/find-edges (graph tree)
(assoc query :dest node))
tree* (update-graph tree
(-> %
(delete-ways (assoc query :dest node))
(delete-ways (assoc query :src node))))]
(reduce (fn [t {leaf-candidate :src}]
(if (leaf? t leaf-candidate)
(update t conjj :leafs leaf-candidate)
t))
(update tree* :leafs disj node)
leaf-candidates))))
(defn delete+balance [tree node]
(if-let [descendants-ways (seq (descendants-ways tree node))]
(reduce (fn [t c] (insert+balance t c))
(delete tree node)
(descendants descendants-ways))
(delete tree node)))
(def balanced-divison-tree-ignoring-5 (delete+balance $balanced-division-tree 5))
(viz (search-tree-view balanced-divison-tree-ignoring-5) "wo5")
(oldest-known-ancestors balanced-divison-tree-ignoring-5 15)
(oldest-known-ancestors balanced-divison-tree-ignoring-5 5)
5 must not be deleted from the graph, since 5 is still the sum of digits of 14. It is just not relevant for search anymore.
; shows also the black edges from the underlying graph
(viz balanced-divison-tree-ignoring-5)