Variations on Balanced Trees: Lazy Red-Black Trees (Stefan Kahrs)
Overview • some general introduction on BSTs • some specific observations on red-black trees • how we can make them lazy - and why we may want to • conclusions
Binary Search Trees • commonly used data structure to implement sets or finite maps (only keys shown): 56 33 227
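The keys on the slide (56, 33, 227) can be reproduced with a minimal BST sketch; the class and method names below (`BST`, `insert`, `contains`) are ours for illustration, not from the talk:

```java
// Minimal binary search tree for integer keys (illustrative sketch only).
class BST {
    static class Node {
        int key; Node left, right;
        Node(int key) { this.key = key; }
    }
    Node root;

    // Standard BST insertion: walk down, attach a fresh leaf; duplicates ignored.
    void insert(int key) { root = insert(root, key); }
    private Node insert(Node n, int key) {
        if (n == null) return new Node(key);
        if (key < n.key) n.left = insert(n.left, key);
        else if (key > n.key) n.right = insert(n.right, key);
        return n;
    }

    // Standard BST search: follow the ordering until the key is found or a nil is hit.
    boolean contains(int key) {
        Node n = root;
        while (n != null) {
            if (key == n.key) return true;
            n = key < n.key ? n.left : n.right;
        }
        return false;
    }
}
```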
A problem with ordinary BSTs • on random data, searching, inserting, or deleting an entry performs in O(log(n)) time, where n is the number of entries, but. . . • if the data is biased, this can deteriorate to O(n) • . . . and thus building a tree by repeated insertion can deteriorate to O(n²)
Therefore. . . • people have come up with various schemes that make trees self-balance • the idea is always that insertion/deletion pay an O(log(n)) tax to maintain an invariant • the invariant guarantees that search, insert, and delete all perform in logarithmic time
Well-known invariants for trees • Braun trees: sizes of left/right subtree differ by at most 1 – too strong for search trees: O(n^0.58) • AVL trees: depths of left/right subtree differ by at most 1 • 2-3-4 trees: a node has 1 to 3 keys and 2 to 4 subtrees (special case of B-tree) • Red-Black trees: an indirect realisation of 2-3-4 trees
Red-Black Tree • BST with an additional colour field which can be RED or BLACK • invariant 1: red nodes have only black children; root and nil are black • thus a non-empty black node, taken together with its red children, has between 2 and 4 black children – this is what makes it a 2-3-4 node • invariant 2: all paths to leaves go through the same number of black nodes
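The two invariants can be checked mechanically; the sketch below is ours (names `RBCheck`, `blackHeight` are not from the talk) and deliberately leaves root blackness to the caller:

```java
// Hedged sketch: validating both red-black invariants on a simple node type.
class RBCheck {
    static final boolean RED = true, BLACK = false;
    static class Node {
        int key; boolean colour; Node left, right;
        Node(int key, boolean colour, Node left, Node right) {
            this.key = key; this.colour = colour; this.left = left; this.right = right;
        }
    }
    static boolean isRed(Node n) { return n != null && n.colour == RED; }

    // Returns the black height of the subtree if invariants 1 and 2 hold
    // below this node, or -1 on any violation. Nil children count as black
    // leaves (black height 1). Root blackness is not checked here.
    static int blackHeight(Node n) {
        if (n == null) return 1;
        if (n.colour == RED && (isRed(n.left) || isRed(n.right)))
            return -1;                          // invariant 1: no red-red edge
        int l = blackHeight(n.left), r = blackHeight(n.right);
        if (l == -1 || r == -1 || l != r)
            return -1;                          // invariant 2: equal black counts
        return l + (n.colour == BLACK ? 1 : 0);
    }
}
```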
Example: [tree diagram; keys 68, 12, 7, 83, 75, 43, 70, 96, 76, 94, 98]
Perceived Wisdom • Red-Black trees are cheaper to maintain than AVL trees, though they may not be quite as balanced • pretty balanced though: the average path length of a Red-Black tree is in the worst case only 5% longer than that of a Braun tree
Aside: a problem with balanced trees • an ordinary BST has, on random data, an average path length of 2·ln(n) • this is only 38% longer than the average path length of a Braun tree • thus: most balanced-tree schemes lose against ordinary BSTs on random data, because they fail to pay for their tax out of those 38% • red-black trees succeed though
Algorithms on RB trees • search: unchanged, ignores colour • insert: – insert as in BST (a fresh red node) – rotate subtrees until the colour violation goes away – colour root black • delete (more complex than insert): – delete as in BST – if the black height underflows, rotate from siblings until the underflow goes away
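As a flavour of the repair step, here is a single left rotation with the usual CLRS-style recolouring; this is an illustrative fragment of ours, not the talk's code, and a full insert fix-up needs the symmetric cases too:

```java
// One repair step in isolation: rotate left around n and swap colours,
// as used when a red-red violation sits on a right spine (sketch only).
class Rotate {
    static class Node {
        int key; boolean red; Node left, right;
        Node(int key, boolean red, Node left, Node right) {
            this.key = key; this.red = red; this.left = left; this.right = right;
        }
    }
    // n's right child becomes the new subtree root; the new root takes
    // n's old colour and n becomes red.
    static Node rotateLeft(Node n) {
        Node r = n.right;
        n.right = r.left;
        r.left = n;
        r.red = n.red;
        n.red = true;
        return r;
    }
}
```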
Example: [tree diagram; the earlier tree with 69 newly inserted below 70: keys 68, 12, 7, 83, 75, 43, 70, 69, 96, 76, 94, 98]
Example: [tree diagram after repair, now rooted at 75: keys 75, 68, 83, 12, 96, 7, 43, 70, 69, 76, 94, 98]
Standard Imperative Algorithm • find the place of insertion in a loop • check with your parent whether you are a naughty child, and correct behaviour if necessary by going up the tree
Problem with this Question: how do you go up the tree? Answer: children should know their parent. Which means: trees in imperative implementations are often not proper trees; every link consists of two pointers (one down, one up)
Functional Implementations • in a pure FP language such as Haskell you don’t have pointer comparison, so parent pointers won’t work • instead we do something like this: insert x tree = repair (simplInsert x tree) • simplInsert inserts the data in a subtree and produces a tree with a potential invariant violation at the top; repair fixes that • the ancestors sit on the recursion stack
Recursion • actually, nothing stops us from doing likewise in an imperative language, using recursive insertion (or deletion) • cost: recursive calls rather than loops • benefit: no parent pointers – saves memory and makes all rotations cheaper • is still more expensive though. . .
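The recursive insert-then-repair scheme can be transcribed into Java in the style of Okasaki's functional red-black insertion; the sketch below uses our own naming (`ins`, `balance`, `FunRB`), with the repair happening as the recursion unwinds, exactly as the slide describes: ancestors live on the call stack, and no parent pointers are needed.

```java
// Hedged Java transcription of the functional insert-then-repair pattern
// (Okasaki-style balance); names are ours, not from the talk.
class FunRB {
    static final boolean R = true, B = false;
    static class Node {
        final boolean red; final Node left; final int key; final Node right;
        Node(boolean red, Node left, int key, Node right) {
            this.red = red; this.left = left; this.key = key; this.right = right;
        }
    }
    static boolean isRed(Node n) { return n != null && n.red; }

    static Node insert(Node t, int x) {
        Node n = ins(t, x);
        return new Node(B, n.left, n.key, n.right);   // colour root black
    }
    // simplInsert: descend, insert a fresh red node, repair on the way back up.
    static Node ins(Node t, int x) {
        if (t == null) return new Node(R, null, x, null);
        if (x < t.key) return balance(t.red, ins(t.left, x), t.key, t.right);
        if (x > t.key) return balance(t.red, t.left, t.key, ins(t.right, x));
        return t;
    }
    // repair: a black node with a red child and red grandchild is rebuilt
    // as a red node with two black children (one rotation, four cases).
    static Node balance(boolean red, Node l, int k, Node r) {
        if (!red) {
            if (isRed(l) && isRed(l.left))
                return new Node(R, new Node(B, l.left.left, l.left.key, l.left.right),
                                l.key, new Node(B, l.right, k, r));
            if (isRed(l) && isRed(l.right))
                return new Node(R, new Node(B, l.left, l.key, l.right.left),
                                l.right.key, new Node(B, l.right.right, k, r));
            if (isRed(r) && isRed(r.left))
                return new Node(R, new Node(B, l, k, r.left.left),
                                r.left.key, new Node(B, r.left.right, r.key, r.right));
            if (isRed(r) && isRed(r.right))
                return new Node(R, new Node(B, l, k, r.left),
                                r.key, new Node(B, r.right.left, r.right.key, r.right.right));
        }
        return new Node(red, l, k, r);
    }
}
```

Note how `ins` is not tail-recursive: `balance` runs after the recursive call returns, which is exactly the obstacle the next slide addresses.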
Can we do better? • the problem is that the recursive insertion algorithm is not tail-recursive and thus not directly loopifiable: we repair after we insert • what if we turn this around? newInsert x tree = simplInsert x (repair tree) • this is the fundamental idea behind lazy red-black trees
What does that mean? • we allow colour violations to happen in the first place • these violations remain in the tree • we repair them when we are about to revisit a node • this is all nicely loopifiable and requires no parent pointers
In the imperative code • where we used to have . . . n = n.left; . . . to continue in the left branch • we now have: n = n.left.repair();
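The talk does not show the body of `repair()`; as a hedged stand-in with the same shape of descent loop, the sketch below uses the classic top-down colour flip (splitting a black node with two red children) as its local repair step. Everything here is our own naming, and a real implementation would re-blacken the root afterwards.

```java
// Flavour of repair-on-descent (not the talk's exact code): each step
// repairs the node it is about to visit, so no upward pass is needed.
class LazyDescent {
    static class Node {
        int key; boolean red; Node left, right;
        Node(int key, boolean red) { this.key = key; this.red = red; }
    }
    static boolean isRed(Node n) { return n != null && n.red; }

    // Local repair: split a "4-node" (black parent with two red children)
    // by a colour flip. The caller must re-blacken the root if needed.
    static Node repair(Node n) {
        if (n != null && !n.red && isRed(n.left) && isRed(n.right)) {
            n.red = true;
            n.left.red = false;
            n.right.red = false;
        }
        return n;
    }

    // Descent loop in the style of the slide: n = repair(n.left);
    // Note that this makes even a lookup mutate colours as a side effect.
    static Node find(Node root, int key) {
        Node n = repair(root);
        while (n != null && n.key != key)
            n = repair(key < n.key ? n.left : n.right);
        return n;
    }
}
```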
Invariants? • the standard red-black tree invariant is broken by this (which affects search) • in principle, we can have B-R-R-B-R-R paths, though these are rare • but this is as bad as it gets, so we do have an invariant that guarantees O(log(n)) • average path lengths are similar to RB trees
Performance? • I implemented this in Java, and the performance data were initially inconclusive (JIT compiler, garbage collection) • after forcing gc between tests, standard RB remains faster (40% faster on random inputs), though this may still be tweakable • so what is the extra cost, and can we do anything about it?
Checks! • most nodes we visit and check are fine • especially high up in the tree, as these are constantly repaired • . . . and the ones low down do not matter that much anyway • so we could move from regular maintenance to student-flat maintenance, i.e. repair trees only once in a blue moon
What? • yes, the colour invariant goes to pot with that • we do maintain black height though. . . • . . . and trust the healing powers of occasional repair: suppose we have a biased insertion sequence and don’t repair for a while. . .
Example: [tree diagram; a degenerate left-leaning chain with keys 96, 83, 70, 43, 12, 7] suppose the tree has this shape, and now we insert a 5 in repair-mode
Result: [tree diagram after the repairing insertion, now rooted at 83: keys 83, 43, 12, 7, 5, 96, 70]
Findings • on random data, the performance of lazy red-black trees is virtually unaffected even if we perform safe-insert only once in 100 insertions • on biased data it works a bit better under student-flat maintenance, but still loses to RB (15% slower for this bias) • average tree depth: 1.5 longer than RB – on random inputs – also on biased inputs (where plain BSTs fall off the cliff)
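The talk does not say how the 1-in-100 safe-insert policy was realised; one minimal possibility is a counter that decides whether a given operation runs in repair mode (the class name `StudentFlat` is our joke on the slide's term, not the talk's code):

```java
// Hypothetical sketch of the "once in a blue moon" policy: every period-th
// call reports that this operation should use the repairing descent.
class StudentFlat {
    private final int period;   // e.g. 100: repair once per 100 operations
    private int count = 0;
    StudentFlat(int period) { this.period = period; }
    boolean shouldRepair() { return ++count % period == 0; }
}
```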
Conclusions • Ultimately: failure! • Lazy RB trees are not faster than normal ones • On random inputs, lazy RB trees perform very similarly to plain BSTs • Some small room for improvement – I doubt though that the gap to plain RB can be closed • Perhaps other algorithms would benefit more from lazy invariant maintenance?