Implementing block-stored prefix trees in XML-DBMS c Oleg Borisenko Ilya Taranov Institute for System Programming, Russian Academy of Sciences borisenko@ispras.ru, taranov@ispras.ru Abstract B-trees are widely used in modern databases for index- ing [6]. Storing the keys of arbitrary length in B-trees The problem of search efficiency through large is possible however causes implementation problems. It amount of text data is well-known problem in is difficult to implement the efficient storage of variable computer science. We would like to introduce a length keys. In practice usually only a limited part of the BST data structure that allows searches through key is stored in the tree node, and the remainder is stored a set of string values, and is optimized for read- in separate overflow pages[2, 16]. This approach is quite ing and writing large blocks of data. This paper effective if the keys are short or differ only in first char- describes the algorithms for insertion, deletion acters. and search of variable-length strings in disk- The other problem arises if one key corresponds to a resident trie structures. This data structure is set of different value nodes; it is difficult to implement an used for value indexes on XML data. We also efficient search by key/value pair. Most of B-tree imple- compare our implementation with existing B+ mentations do not provide an ability to store such mul- tree implementation and show that our structure timaps. At the same time our indexes may have duplicate occupies several times less space with the same key/value pairs. To be able to store identical logical1 keys search efficiency. the concatenation of key/value pairs is usually used as a physical key. 1 Introduction This article describes an index implementation for disk-resident usage that fits our requirements. It can be The problem of developing data structures that provide shown as an extension of a data structure known as a basic dictionary operations (insert, deletion and lookup) trie[5] which is also called prefix tree. and is optimized for disk storage has been investigated in The idea of using prefix trees as a replacement for B- computer science for a long time but remains very impor- trees has already been discussed in other works. A mod- tant. This work describes a new disk-resident data struc- ified B-tree, called S(b)-tree [7], stores a binary Patri- ture BST that stores a set of string keys and algorithms cia[17] tree in its nodes. The feature of S(b)-tree is that for insertion, deletion and search for this structure. Also nodes do not store a key itself, but the number of bits we compare the existing implementation of B+ tree with passed during the comparison. During the search for the our novel approach within Sedna XML DBMS [8, 18] string you may have to compare this string with the string and show that the latter has an opportunity for significant stored in external memory. However, such a comparison economy of disk space providing the same search time is not a big problem in itself. The problem is to store all as B+ tree. string keys in a separate location. S(b)-tree is presented Our structure has been developed for use as a value in- as a data structure to support full-text index and it is good dex backend in Sedna XML DBMS. Sedna XML DMBS enough for this task[12]. has native support for value indices on nodes and it has The B-trie[1] is very similar to our approach. The ba- the following requirements to the structures used for in- sis of this work was the implementation of a prefix tree dex management: that provides efficient usage of CPU cache [9]. Both structures propose effective partition of the prefix tree 1. Structure must provide the ability to store keys of into blocks (buckets). But in our work we do not use unlimited length. For example, URI’s can be very stored supplementary data structures and provide more large in practice. effective strategy to keep the blocks filled. 2. Structure must implement the concept of multimap In addition, there are quite simple implementations of abstract data type. An index for an XML document similar data structures, i.e. Index Fabric[14], which is a in Sedna may have duplicate (key, value) pairs. B-tree with keys stored in a prefix tree in internal nodes. 2 Review of existing solutions 3 Building the structure In this paper we introduce a prefix tree based data struc- Most common data structures for implementaing ture which is optimized for reading and writing large database index are B-tree[3] and its variants[4, 19, 13]. 1 Here the physical key is the key that is actually used in comparison Proceedings of the Spring Researcher’s Colloquium on Database and search. Logical key refers to a key that is passed to index system and Information Systems, Moscow, Russia, 2012 interface. blocks of data. We call our structure the Block String If this condition holds, the set K uniquely defines a tree Trie or BST. Let the set of stored strings be denoted by T. K = {s1 , s2 , . . . , sn }. The basic operations for our structure are defined as follows: Theorem. Any non-empty set of strings K defines one tree T and the one tree only for which the conditions 1–4 1. Insert string s to the set K. holds. 2. Find all the strings with the prefix s in the set K. Proof. Let us assume that this statement does not hold and the set of strings K does not define the single tree T . 3. Remove string s from the set K. This may happen in two cases: the set of strings K does Such structure implements a set abstract data type but not define any tree at all, or the set K defines more than not a map. Now we specify how exactly we store the one tree. We are not going to consider the first case, as (key,value) pairs in our data structure. Each pair can be any non-empty set K defines at least one tree construc- represented as s = k + c + v (we call it physical key), tion procedure. where k is a key (or logical key), v is string representa- Let us assume that the set of strings K defines more tion of a value, and c is a symbol that is absent in the than one tree T that satisfies the conditions 1–4. Con- alphabet that k is built of 2 . To find all (key,value) pairs, sider two trees which corresponds to a set K. That means that correspond to a given key k we look for all strings that there is a string k ∈ K that has paths S(x11 , x1n ) and prefixed with k + c. Thus the dictionary problem can be S(x21 , x2m ) in trees T and T 0 respectively, such as: solved using the introduced data structure. k = pref ix(x11 ) + c(e11 ) + pref ix(x12 ) + 3.1 Prefix tree +c(e12 ) + · · · + c(e1n−1 ) + pref ix(x1n ) = Our data structure is a rooted tree T which stores a set K = pref ix(x21 ) + c(e21 ) + pref ix(x22 ) + of strings. The structure is as follows: +c(e22 ) + · · · + c(e2m−1 ) + pref ix(x2m ) 1. Each vertex x of tree T has the following properties: The fact that trees differ means either of the following: (a) A prefix pref ix(x) is a string which may have 1. In trees T and T 0 , there are vertices x1i and x2i , such zero length. that pref ix(x1i ) 6= pref ix(x2i ). (b) An array E(x) of n(x) exiting edges ordered in lexicographic order of characters c(e) they 2. In trees T and T 0 , there are edges e1i and e2i such are labeled with. that c(e1i ) 6= c(e2i ). (c) Auxiliary flags (will be described as they ap- 3. In trees T and T 0 , there are vertices x1i and x2i , such pear further in text). 1|2 that one vertex has the flag f inal(xi ), while the 2. Each edge e = (x, Li (x)). Li (x) is a node, pointed other does not. by i-th edge from the array E(x). Each edge is la- First two cases contradict the definition of our tree and beled with character c(e). In this case we consider the third case is in conflict with the condition 4. That a character the string of the length of one. means that trees T and T 0 are equal, hence there is only 3. Any path S(x1 , xn ) = x1 e1 x2 e2 . . . en−1 xn in the one tree, corresponding to the set K. tree is string s, obtained from the path S(x1 , xn ) We consider only minimal trees in further sections un- by concatenating its substrings s = pref ix(x1 ) + less stated otherwise. However the condition (4) is not c(e1 ) + pref ix(x2 ) + c(e2 ) + · · · + c(en−1 ) + required in the implementation of the structure and may pref ix(xn ). We also introduce the flag f inal(x), be not held if lazy removal technique is implemented. that shows whether a node has a corresponding key from the set K. Thus the string s defined by 3.2 Block separation S(x1 , xn ) belongs to the set of strings K if and only if the vertex xn is labeled with flag f inal(xn ). The structure described in the previous section is de- signed primarily for storage and retrieval of strings in The last property implies that a set K may have more memory. But we require the structure, that can be stored than one corresponding tree T . To build a tree T 0 that in external memory and our algorithms should effectively represents the same set as tree T we can add vertex x0 operate with reading and writing blocks of the fixed size. with an arbitrary prefix and not marked as f inal(x0 ). First of all we introduce a special type of vertex that Both trees T and T 0 will represent the same set of strings does not store the prefix and links to other vertices and K. Thus it seems reasonable to introduce an additional keeps only a reference to a node located in another block. condition, which covers such situations: We call such vertices reference nodes and we mark them with the service flag external(x). It should be noted that 4. Each vertex x ∈ T , is such that n(x) ≤ 1, is marked the nodes that are referenced by a reference vertices, can as f inal(x). A tree that satisfies that condition is not be reference nodes themselves. called minimal. Each vertex x that satisfies (doesn’t In addition we introduce the notion of twig. satisfy) the condition n(x) ≤ 1 ⇒ f inal(x) is called important (redundant). A twig is a rooted tree B such that all leaf vertices are marked as either f inal(x) or external(x). If a tree T 2 For text strings, we use null character. has no reference nodes it consists of a single twig. 3.3 Algorithms 3.3.1 Search The search algorithm for the BST requires two input pa- rameters: a pointer to r to the root node of the tree and string k to be found. The algorithm BST −Search(r, k) returns a node xn+1 which satisfies the following: 1. String s(xn ) defined by path S(r, xn ) is a prefix of the string k or equals it if exists. 2. The search string k is a prefix of a string s(xn+1 ) or equals it. We define the relationship A ≤ B as follows: string A is a prefix of B or equals it. Thus our requirements for result may be expressed using the following inequality: s(xn ) ≤ k ≤ s(xn+1 ). Figure 1: BST example without blocks The search procedure starts from the root of the tree. The function takes a pointer x to the root node and the search string k, an intermediate result is stored in the The concept of twig is a key for the problem of divid- stack S. In addition there is also an auxiliary function ing BST into blocks. Every tree T can be divided into y = L(x, c) which returns node y which is pointed by twigs by inserting a new reference node before any other that edge of vertex x, which is marked with character c node. Twigs (or NULL if there is no such an edge). This feature is im- plemented using binary search algorithm and the edges Blocks consist of the constant amount bytes W 3 . The are stored in sorted array in our implementation. We also distribution of twigs into blocks is organized in such a introduce the function Cut(p, s) which returns a string, way that in one block one or more twigs from the initial that is produced by removing the prefix p from a string tree can be stored. All the twigs in one block have a s, so that p + Cut(p, s) = s. The search algorithm can common direct ancestor. These conditions are necessary be written in compact form as follows: to ensure that the locality of changes in the tree and for block level locking. These conditions also imply that the block containing the root node has no other twigs. BST-Search(x, k, S) 1: Push(S, x) 2: if external(x) then 3: Disk-Read(J(x)) 4: return BST-Search(J(x), k, S) 5: end if 6: if not Is-Prefix(pref ix(x), k)) then 7: if Is-Prefix(k, pref ix(x)) then 8: return x 9: else 10: return NIL 11: end if 12: else 13: s ← Cut(pref ix(x), k) 14: if Empty(s) then 15: return x 16: else if L(x, s[1]) = NIL then 17: return NIL 18: else 19: return BST-Search(L(x, s[1]), Cut(s[1], s), S) 20: end if 21: end if Figure 2: BST block separation example The search function returns node x such that all the paths from the root to all possible leaf vertices which contain node x define the strings prefixed by a search string. The function returns N U LL if there is no such a node. We can find all strings prefixed by k traversing the 3 In the Sedna DBMS the default block size is 64KB. tree starting from node x. 3.3.2 Insertion splitting procedure starts and after that the insert proce- dure should be executed again. By mean of this approach The insert operation begins with building the structure, we will access not more than D+2 blocks (excluding the that describes the modifications we are going to make in possible block splitting). our tree. If some block on the way of our modifications does not enough space for modifications to be made, the block is split (this process is discussed in the next sub- 3.3.3 Block splitting section) and the insertion process is invoked again. One of the distinguishing features of BST compared to The insert procedure starts with executing the BST − B-trees is that the BST tree is not balanced. As we have Search(k, r, S) (where k is the string to be inserted) proven earlier1 T is uniquely determined by the set K method, but we need only the path S as result. Also of stored strings so we can not decrease the height of we will need three additional strings derived from path the tree. Nevertheless there are studies that show that S and string k: common, rest and key. We construct unbalanced tree affect the performance very little ([12]). these strings in the following way. First of all, we need The height of BST is the height of its twigs which must the string s0 , that is the string that corresponds to S with be read to find given prefix in the worst case. the last vertex’s prefix removed. From the definition of The procedure of new space allocation performs the search procedure it is obvious that this string is a prefix splitting of the block and is called only in case the block of added string k or equals it. We also need the string k 0 doesn’t have enough space to insert a new vertex. We use which is built from k by removing s0 from the beginning. two different strategies for block splitting. To build the string common common we take the great- The first strategy makes the blocks grow in “width” est common prefix of strings k 0 and p = pref ix(S[n]), and doesn’t affect the height of the tree so it is more where S[n] is the last vertex of the path S. In turn the preferable. The algorithm handles the situation when strings rest and key are p and k 0 with no prefix common one block contains several twigs. In this case we divide at the beginning respectively. The idea becomes clearer if illustrated by the following scheme: twigsP contained in block P into two sets P1 and P2 such that | w∈P1 w(B) − w∈P2 w(B)| is minimal over all We consider five cases in our insertion procedure: possible partitions P1 and P2 . Thus these sets must sep- arate twigs contained in the block to about half of their 1. There are no nodes in the tree. In this case, we need total size. One of these sets remains in its original block, to allocate new page for a new node. and a second set is moved to the new block. This al- 2. Rest and key are empty strings. In this case, it gorithm affects exactly three blocks — one block that is sufficient to mark the latest node of the path as contains the parent node (this is a significant fact, which f inal (in case of a minimal tree it will already be will play a role in calculating the total number of units marked as f inal). affected by the separation.) and two blocks containing the twigs (old and new). 3. Key is an empty string, rest is non-empty. In this The second strategy is used in case the block where case, we should split the final node of the path in the vertex is added to has only one twig and there is still two parts. One of these parts contains the prefix not enough space. In this case we are looking for the first common and is marked as f inal. node x starting from the root of the twig which has N > 1 links. This node has N child nodes that will be the 4. Key is a non-empty string, rest is empty. In this roots of N new twigs. Next, we move all nodes from the case, we need to add an extra node with the prefix root to x to the a parent twig contained in another block4 . key that will be a child to the last node of the path. Next, we split the remaining N resulting twigs using the first split strategy. If the initial twig is the root of the tree, 5. Key and rest are non-empty strings. In this case, then the set of nodes is not transferred to the block that the node xn is split into three ones, with prefixes contains an parent vertex node (since it does not exist) common, rest and key respectively, the last one is but into the new block. This is the only case when the tree marked as f inal. grows in height. This algorithm accesses exactly three blocks; one of these blocks contains the ancestor twig The tricky part is related to splitting of a block on and two blocks contain the newly created twigs. steps 4 and 5. If the vertex which is being split is a leaf So the only case left to handle is the one when the node of a twig that has a descendant twig we create a block contains only one twig, and the first algorithm is new vertex with a prefix key in a new twig. We need to applicable and at the same time in the block containing find a child twig which lies in a block which has more the branch of an ancestor there is not enough space to free space left. In theory this is a very expensive op- move the set of nodes of the twigs contained in the block eration because it affects all the blocks that contain the that we are trying to separate. child twigs. The number of such blocks is not greater While looking for a key during the insertion process, than the number of children twigs. But in practice such we build a path to the vertex which is going to be split an approach is unacceptable since in addition to the com- during insertion. We collect some auxiliary data about pactness of storage the number of blocks being accessed blocks on our way: how many twigs do these blocks con- should also be minimized. There is a compromise that al- tain, how much space is occupied by the root nodes, how lows to meet both requirements in practice: we limit the much free space do these blocks have, etc. It is very im- number of blocks accessed by some constant D (in our implementation D = 2 ). If none of the D blocks have 4 Strictly speaking there is no guarantee that the block that contains enough space for the new vertex to be added, the block a parent twig has enough room to accept this set of nodes. pref (x1 ) + c1 + · · · + pref (xn−1 ) cn−1 pref ix(xn ) z }| { s brownfoxjumpsoverala z y dog |{z} | {z } common rest z}|{ k brownfoxjumpsoverala z y gopher | {z } key Figure 3: Example for insert portant to note that we do not evaluate this data every > create index "urlbtree[1]" time. It is stored in a block header and is updated only on doc("dblp.xml")/dblp/* when the block is being changed. by ./url as xs:string With all this information collected we build the block using "btree"& splitting sequence starting from the block we need to in- sert new vertex to. The number of accessed blocks to 2h + 1 in worst case where h is the height of the tree. For BST: 3.3.4 Deletion > create index "urlbstrie[1]" on doc("dblp.xml")/dblp/* Following the example of most papers on B-trees, we in- by ./url as xs:string troduce an operation of lazy removal[10]. This approach using "bstrie"& is widely applied to the database systems because the procedure for the deletion of the element from the B- tree can be even trickier than the insert procedure. Lazy removal allows to significantly accelerate updates on the In our tests we measure space occupied by the same database. indexes for both structures, the time needed for updates In the developed structure delete is simply removing and search time. Indexes were created on publications the flag f inal from the vertex, that corresponds to the by their identifiers, URI and EE field. The total number string we are looking for. Thus after the removal of the of indexed publications is about 3 million. However we vertex we have the redundant vertex and consequently a show here only the space occupied by index, because the non-minimal tree. This approach requires a procedure to search time is almost the same for both types of trees. minimize the tree (or a separate twig). We remove all Distinct Key volume BST B-tree redundant vertices, which in turn are of two types: values Kb Kb Kb 1. Vertices that are not marked as f inal and have no key 2 989 811 59 337 98 752 106 432 outgoing edges. These nodes can be simply re- url 1 915 933 72 828 81 984 103 296 moved. ee 1 667 275 74 975 53 440 101 248 2. Vertices that are not marked as f inal and have ex- actly one outgoing edge. In this case it is enough to Table 1: Comparison between BST and B+ tree merge the nodes with the concatenation of pref ix+ As one can see, BST can archive even significant data edgecharacter + pref ix. compression, while times of search and update are al- After this procedure is applied there are no redun- most the same for both structures. dant nodes remaining. It may also happen that some twigs have no nodes or only a single external-node re- 3.5 Conclusion mains. Such twigs must be removed and the child twigs We have introduced a new trie structure that effectively are propagaded. In some cases this may require splitting solves the dictionary problem for strings of unlimited of the block. length and operates with large blocks of data. Thus minimizing the procedure for one twig may ac- Structure is implemented and is ready to use within cess 4 blocks in the worst case: 3 blocks by separation Sedna XML DBMS[15]. The code is open source under procedure of the original block and 1 block containing Apache license and is available on the project site. the descendant twig may be accessed. Tests show that the structure is well suited for use with indexes on fields that contain URI or strings which may 3.4 Tests and results contain the same prefixes. The BST structure also shows Tests were performed within Sedna DBMS. All the good results in the general case. tests were made on loaded DBLP Computer Science Bibliography[11] using B+ tree structure and BST both implemented in Sedna for handling indexes. Requests References for the construction of the index are as follows: [1] Nikolas Askitis and Justin Zobel. B-tries for disk- based string management. VLDB J., 18(1):157– For B+ tree: 179, 2009. [2] Ricardo A. Baeza-Yates. An adaptive overflow [16] B. Srinivasan. An adaptive overflow technique to technique for b-trees. In François Bancilhon, defer splitting in b-trees. Comput. J., 34(5):397– Costantino Thanos, and Dennis Tsichritzis, editors, 405, 1991. EDBT, volume 416 of Lecture Notes in Computer Science, pages 16–28. Springer, 1990. [17] Wojciech Szpankowski. Patricia tries again revis- ited. J. ACM, 37(4):691–711, 1990. [3] Rudolf Bayer and Edward M. McCreight. Orga- nization and maintenance of large ordered indices. [18] Ilya Taranov, Ivan Shcheklein, Alexander Kalinin, Acta Inf., 1:173–189, 1972. Leonid Novak, Sergei Kuznetsov, Roman Pas- tukhov, Alexander Boldakov, Denis Turdakov, [4] Rudolf Bayer and Karl Unterauer. Prefix b-trees. Konstantin Antipin, Andrey Fomichev, Peter Ple- ACM Trans. Database Syst., 2(1):11–26, 1977. shachkov, Pavel Velikhov, Nikolai Zavaritski, Maxim Grinev, Maria P. Grineva, and Dmitry Li- [5] Walter A. Burkhard. Hashing and trie algorithms zorkin. Sedna: native xml database management for partial match retrieval. ACM Trans. Database system (internals overview). In Ahmed K. Elma- Syst., 1(2):175–187, 1976. garmid and Divyakant Agrawal, editors, SIGMOD [6] Eugene Inseok Chong, Jagannathan Srinivasan, Conference, pages 1037–1046. ACM, 2010. Souripriya Das, Chuck Freiwald, Aravind Yala- [19] Nikolaus Walczuch and Herbert Hoeger. Using in- manchi, Mahesh Jagannath, Anh-Tuan Tran, dividual prefixes in b+ -trees. Journal of Systems Ramkumar Krishnan, and Richard Jiang. A map- and Software, 47(1):45–51, 1999. ping mechanism to support bitmap index and other auxiliary structures on tables stored as primary b+ - trees. SIGMOD Record, 32(2):78–88, 2003. [7] Paolo Ferragina and Roberto Grossi. The string b-tree: A new data structure for string search in external memory and its applications. J. ACM, 46(2):236–280, 1999. [8] Andrey Fomichev, Maxim Grinev, and Sergei D. Kuznetsov. Sedna: A native xml dbms. In Jirı́ Wiedermann, Gerard Tel, Jaroslav Pokorný, Mária Bieliková, and Julius Stuller, editors, SOFSEM, volume 3831 of Lecture Notes in Computer Sci- ence, pages 272–281. Springer, 2006. [9] Steffen Heinz, Justin Zobel, and Hugh E. Williams. Burst tries: a fast, efficient data structure for string keys. ACM Trans. Inf. Syst., 20(2):192–223, 2002. [10] Theodore Johnson and Dennis Shasha. B-trees with inserts and deletes: Why free-at-empty is better than merge-at-half. J. Comput. Syst. Sci., 47(1):45– 76, 1993. [11] Michael Ley. Die trierer informatik-bibliographie dblp. In GI Jahrestagung, pages 257–266, 1997. [12] Joong Chae Na and Kunsoo Park. Simple im- plementation of string b-trees. In Alberto Apos- tolico and Massimo Melucci, editors, SPIRE, vol- ume 3246 of Lecture Notes in Computer Science, pages 214–215. Springer, 2004. [13] Ratko Orlandic and Hosam M. Mahmoud. Storage overhead of o-trees, b-trees and prefix b-trees: A comparative analysis. Int. J. Found. Comput. Sci., 7(3):209–226, 1996. [14] Neal Sample, Brian F. Cooper, Michael J. Franklin, Gı́sli R. Hjaltason, Moshe Shadmon, and Levy Cohe. Managing complex and varied data with the indexfabric(tm). In ICDE, pages 492–493. IEEE Computer Society, 2002. [15] Sedna XML DBMS source code downloads, 2012. http://sedna.org/download.html.