Traditional tree data structures provide logarithmic run-time access and manipulation to elements using a key. Here, we demonstrate how to adapt trees to operate on indices rather than keys. The resulting data structure provides a vector-like interface that achieves logarithmic lookup, insert, and deletion – as compared to the linear insertion and deletion run-time for vectors. Furthermore, we demonstrate how the data structure can be further modified to implement efficient manipulation of large text buffers.
Base Tree
For simplicity, the rest of this post will focus on AVL trees, although the same techniques apply to other trees such as red-black trees.
Each node of a standard AVL consists of four components: a reference to the left child, a reference to the right child, a balance factor, and the sorting key itself. From a given node, the left subtree consists of all nodes “less than” the key and the right subtree consists of all node “greater than” the key.
Index Tree
Next, we modify the AVL tree implementation in order to provide an interface that closely resembles a vector. In particular, all operations (e.g. insert, delete, lookup) operate using an index instead of a key. The nodes are indexed based on an in-order traversal of the tree – the first element is on the left and the last is on the right. After an insertion or deletion, all subsequent indices are automatically updated.
// base index tree interface
void idxtree_insert(struct idxtree_root_t *root, struct idxtree_node_t *node);
struct idxtree_node_t *idxtree_lookup(struct idxtree_root_t *root, unsigned int idx);
struct idxtree_node_t *idxtree_remove(struct idxtree_root_t *root, unsigned int idx);
The key insight of index trees is to store the size of each subtree at each index without ever explicitly storing indices. Because insertion and deletion only modify a small part of the tree, these operations can be efficiently implemented in logarithmic time. Let us look at the details.
Each node is modified to both add the size that node’s subtree and a reference to the parent node. The tree must maintain the invariant that the node’s size matches the number of elements in its subtree. The key field is no longer necessary and therefore removed. Notice that nowhere do we directly store indices; instead, a node’s index is implicit based on its location in the tree. The figures below provides a sample tree and the equivalent vector. Each node is given a letter name, followed by the size of its subtree. Let’s see how we use this to perform various operations.
Index lookup
First, we will see how to compute the index of any given node by starting at the root. Because indexing is base on an in-order tranversal, the index of the root is exactly equal to the size of the left subtree – i.e., the number of elements “left” of the root. For the above tree, the index of the root is 2. While this scheme works for the root, lets see what happens as we descend the tree.
Let us descend to the right child “C”. Looking exclusively at the subtree “C”, we know that its index would be 1 (the size of the left subtree). However, because the root “A” has an index of 1, all nodes in the subtree “C” to be at least 2 – they occur “after” the root. Combining these two facts, the index of “C” is exactly their sum 3 + 1 = 4.
If we instead descend to the left child “B”, these nodes fall before the root and are unaffected by its index. It follows that the index of “B” is exactly the number of nodes in its left subtree, which is exactly zero.
Continue down the tree to the node “E”. The node lies left of node “C” but right of node “A”. Thus, the index of “E” is only affected by the index of “A” – its index is 3 + 0 = 3. Because node “F” lies to the right of “C”, it is affected by its index. Again, all elements in the subtree of “F” must be at least 5, so the index of “F” is 5 + 0 = 4.
Inspired by the above process, we will derive an algorithm for calculating the index of an arbitrary node. Starting at the node itself, we begin by setting the index to be size of the left subtree. Next, we walk to the parent node. If we walk left (we were at the right child), then we add one plus the size parent’s left subtree to the index. If we walk right (we were at the left child), we add nothing. This process is repeated until we reach the root.
Using the previous tree as an example, let us start at the node “E”. Because it has no left subtree, we first set its index to 0. Next, we walk up to “C”. Since we walked to the right, we add nothing. Next, we walk up to “A”. Since we walked left, we add one plus the left subtree to the index: 0 + 3 = 3. Having reacheded the root, we conclude that the index of “E” is 3.
Lookup by Index
Given an index, we can reverse the index calculation to find a target node. The following pseudocode provides the implementation for looking up a node:
def lookup(node, index):
if index == node.left.size:
return node
elif index < node.left.size:
return lookup(node.left, index)
else:
return lookup(node.right, index - node.left.size - 1)
Notice that the function is very easily defined recursively, only adjusting the index if we travel to the right child.
Insertion
Insertion uses a simple modification of the standard AVL tree implementation. The previous lookup algorithm finds the location to insert the node. In order to maintain the size invariant, all parent nodes must be updated by incrementing their size. Insertion closely follows the standard implementation of node insertion in an AVL tree, including the necessary rotations to maintain balance. After each rotation, the affected nodes must be updated to maintain the size invariant – the size is most easily calcuated by summing the subtree sizes and adding one.
Removal
Removal follows the same pattern as insertion. The target node is found using the lookup by index algorithm. Once found, the node is removed using the standard AVL tree implementation, decrement each parent’s size and accounting for rotations.
Size Tree
The index tree relies on the implicit assumption that the each node has a size of 1. However, we modify the tree to allow every node to store a variable number of elements – we call this data structure a size tree.
For a concrete example, assume we store a list of characters at each node. If the user inserts a character at a given index, we can simply expand the size of a node rather than insert a completely new node. Note that the size of every parent node must still be adjusted.
In this way, we reduce the overhead of creating and destroying nodes. Such an optimization makes it ideal for text manipulation where users frequently insert and delete small amounts of data from a much larger buffer.
Wrapping Up
In this post, we covered a tree-based data structure that provides a vector-like (index-based) interface. The data structure was demonstrated using an AVL tree such that all operations including lookup, insert, and delete are performed with logarithmic running time. Furthermore, the data structure can store multiple elements at each node to reduce allocation overhead. Ideally, this data structure is ideal for large data sets dominated by inserts and deletes using an index. A reference implementation of the index tree can be found on github.