DataGuide-based Distribution for XML Documents c Alexander Kalinin Institute for System Programming of the Russian Academy of Sciences allex.kalinin@gmail.com Abstract Distribution is a well-known solution to in- Q1:/lib/*[name() eq 'book' or name() eq 'article'] crease performance and provide load balanc- ing in case you need optimal resource uti- lization. Together with replication it also allows improved reliability, accessibility and fault-tolerance. However since the amount of data is large there is a problem of main- NODE1 NODE2 taining meta-information about distribution and /lib/*[name() eq 'book'] /lib/*[name() eq 'article'] finding needed data fragments during execu- tion of queries. These problems are well un- derstood but they have not received much at- tention in the context of XML data manage- ment. This paper presents research-in-progress, NODE3 which examines the possibility of management MERGE of meta-information about XML data distribu- tion extending auxillary index structure called DataGuide. 1 Introduction Figure 1: Example of load balancing Distribution and replication are often used in data man- databases ([1] or [2] for example), which deal with large agement to provide load-balancing and improve reliabil- amount of XML data. For them data replication could ity and accessibility. Distribution means partitioning data give the same benefits as for relational databases. Let us into some fragments and allocating the corresponding consider a small example: fragments on some number of sites. Replication deals with problem of allocating the same fragments on differ- Example 1. Let us look at Figure 1. This figure illus- ent sites. Since these terms are often used in the same trates simple load balancing example. There is a query context, in this paper we will be using terms “replica- Q1, which retrieves book and article elements. If book tion” and “distribution” interchangeably. This is justified elements were distributed to N ode1 and article ele- since we assume distribution of data into fragments and ments to N ode2 then this task could be done in parallel replication some of the fragments on multiple sites. on these nodes. Then result would be merged on some When using replication numerous problems arise. N ode3, which could be an original node query had ar- Among them we first consider: rived to. Without such distribution N ode3 would process such query by itself and probably sequentially by retriev- 1. Determining fragments to distribute and replicate ing book nodes and then article nodes. and their placement. Now let us review three aforementioned problems in 2. Management of meta-information about distribu- the context of XML data management. tion. First of all, there is a problem of determining distribu- tion fragments and their placement. This implies choos- 3. Management of corresponding fragments and repli- ing fragment’s size. For XML documents two choices cas, e.g. consistency of replicas during updates. seem reasonable: the whole document (document-level replication) or individual nodes (node-level replication). These problems have been well researched in the re- Document-level distribution is a well-known solution in lational world but to the best of our knowledge they case of small documents or mostly read-only environ- have not received much attention in the context of XML ment. The main benefit here is as DBMS executes some data management. However there are exist native XML query on document D it can choose any of the sites the Proceedings of the Spring Young Researcher’s Colloquium document D resides on to execute the query. This, at the on Database and Information Systems, Saint-Petersburg, Rus- same time, provides good load balancing and resource sia, 2009 utilization during read-only queries. However there is a problem of managing such replicated documents when updates arrive. Since document D is replicated fully among some sites update of any of its part propagates Peter to all of these sites. In case such updates are common Old Street,25 they can destroy all benefits we receive from such repli- John cation scheme. Another problem here is the size of the UStreet,16 swimming fragment. In case of large documents such replication cycling becomes quite space-inefficient. With regard to this we believe that replication based on nodes is a more efficient solution in general, when database may contain docu- Robert Old Street,25 ments large in size or when updates are not uncommon. So in this paper we imply node-level replication. Mary If we use node-level replication we must deal with Quensway,34 painting large amount of meta-information about distribution, e.g. on what sites particular nodes reside. This task in not trivial since the number of nodes in a large XML doc- doc ument may be quite high. The efficient management of person such data is the main theme of the presented research. The most common way to deal with such information is to use some kind of auxillary index structures. In @age name addr child hobby this paper we propose using DataGuide[10] as such in- person dex. We will describe DataGuide in details and give some examples in the next section. Here it is sufficient to say that DataGuide resembles path-index in such way name addr hobby as every possible path in a document is represented in a corresponding DataGuide structure. Since nodes are lo- cated by paths in every XML document we believe this Figure 2: Example of DataGuide structure to be the most appropriate solution in case of node-level replication. Moreover DataGuide is used in elaborate one, which is our main research proposal. Sec- some native XML implementations (e.g. Sedna[2]) for tion 4 gives a brief explanation of work to be done for optimization of XPath[3], which is a navigational lan- this research. In Section 5 we describe related work. And guage for XML documents and also deals with nodes. Section 6 concludes this paper. In such systems using DataGuide to store replication schema means integration with query executor straight- 2 DataGuide forward. Such sound integration allows to receive most benefits from replication since one of its main goals is to DataGuide is a structure that describes different paths of make query evaluation more efficient. XML document. More precisely: At first we propose straightforward and most obvious Definition 1. DataGuide for XML document D is an way to extend DataGuide, which requires minimal modi- XML document DG with following properties: fications. Such DataGuide is then replicated itself among participating sites. Then, the more elaborate approach 1. For every path in D there is a unique equivalent follows, which includes extending DataGuide, distribut- path in DG ing it in some fragments and replicating them only for nodes that really need it. Such approach creates multi- 2. For every path in DG there is a equivalent path in level environment, hence the name M LDG (Multi-Level D Data Guide). Figure 2 presents an example of DataGuide. Top half Last problem remaining – maintenance of corre- contains document D and the bottom half contains the sponding replicas. The most infamous problem here is corresponding DataGuide. One of the most important to keep replicas in consistent state. Much work has been features of DataGuide is that it effectively serves as a done in this field outside of XML data management. But path-level index. Since XPath is the main navigational since most algorithms deal with abstract “data elements” language for XML documents DataGuide is used in its and do not impose restrictions on underlying data model optimizations. For example, in Sedna XML database they can be used for dealing with replication of XML nodes of DataGuide point to the corresponding nodes data as well. So we believe it to be a separate problem of a document, which allows quick evaluation of XPath and do not include any suggestions on the problem in this queries. Let us look at some example. paper. The rest of the paper is organized as follows. In the Example 2. Consider query Q1: /doc/individual. next section we describe DataGuide structure and give Query executor would see that such path does not exist some examples of its usability during evaluation of some and return empty sequence. It does not even have to look path-expressions. Then, in Section 3 we discuss the pos- into any nodes. sibility of using DataGuide to store information about Now consider Q2: /doc/person/child. In this replication as well. First we give there the most straight- case query executor could find corresponding document forward approach and then we move further to the more nodes by looking at DataGuide and using it as an index. RDG MLDG( LEVEL 1) MLDG( LEVEL 2) MLDG( LEVEL 3) Sites A,B,C,D Site A Sites B,C Site D doc doc doc doc person {A} person person person child child person person {B,C} @age name addr child {B,C} hobby @age name addr child {B,C} hobby hobby person name addr hobby {D} name addr hobby {D} Figure 3: Multi-level DataGuide In fact, in this case it would not look at any unnecessary to site A only for node /doc. We will call such extended nodes at all. DataGuide RDG (Replication-aware DataGuide). At last, consider Q3: /doc/person[./child]. The proposed scheme allows executor to easily redi- There is a path /doc/person/child in DG but accord- rect part of a query to the specified sites. For example, ing to the definition it does not have to be unique. This for query Q3: /doc/person/child[@age=’15’] we means that not every “/doc/person” node has “child” would go to one of the nodes B or C and continue node as a child. However in this case DataGuide al- query evaluation there. For DataGuide-aware executor lows efficient iteration over “/doc/person/child” nodes traversing DataGuide would be part of the job anyway. and obtaining their parents, which would be resulting Other executors could evaluate Q3 on RDG since it is “/doc/person” nodes. structured as XML document. The main caveat here is that in order to work efficiently such RDG should This example shows that DataGuide can be used as be fully replicated between all sites we use to answer effective XML data map. We believe it is straightfor- queries. The main benefit here is that every site knows ward to extend it to be able to support information about about data allocation and can redirect parts of a query distribution, i.e. serve as a distribution map. Since we right to the corresponding nodes. Moreover it can an- consider node-level replication here and nodes are ac- swer some queries locally without even accessing the cessed by path expressions DataGuide is a natural choice. other machines. As an example consider query Q4 : Moreover it would allow easy integration with query ex- f n : count(/doc/person/child/brother) received by ecutors, which are aware of path expressions anyway. In the site A. Since the corresponding path is absent from the next section we will present extensions to DataGuide RDG then A can give the answer, 0, right away even to support replication. though it does not contain /doc/person/child nodes itself. 3 Extending DataGuide with replication Despite of some benefits the RDG approach also has In this section we discuss the possibility of tuning some drawbacks. Since we must replicate RDG to every DataGuide to support replication. We use the term “site” site it becomes somewhat hard to maintain it in case of to distinguish nodes on which data reside from XML updates. In fact updates can be of two types. First ones nodes. are updates of corresponding XML document. For exam- First and the most straightforward way is to extend ple, consider adding node with the name “child” to some DataGuide with information about sites the particular of the /doc/person/child/person nodes. Since the nodes reside on. Consider DataGuide DG from Figure corresponding path is absent from the DataGuide we 2. Let us assume that /doc/person/child nodes are would need to update it for every site. So such simple in- replicated among sites B and C, and site A contains all sert of a node becomes an update of the entire cluster of other nodes except /doc/person/child. Then we can sites. It would be beneficial to update just sites B and C store such information in the nodes of the DataGuide. where /doc/person/child/person reside. The sec- For example, node /doc/person/child would con- ond type of updates are updates of meta-information. For tain pair (B, C) allowing to find sites corresponding example, if we want to replicate /doc/person/child document nodes reside on. It is a matter of choice to some site D. Again since RDG is fully replicated we whether to propagate such information to the children should propagate such update to every site. However the of /doc/person/child. We could assume that without last problem could be easily alleviated. For example, for any explicit directions node resides on the same site as site A we could store information only about site B and its nearest ancestor. In this case we would write pointer site B knows that its data replicated also at site C. In this case this last update would touch only sites B and C. In fact they can execute this query as it first arrived Considering aforementioned shortcomings we pro- straight to one of them. This means we do not even need pose another approach. The main idea is to dis- to rewrite the query. tribute RDG itself creating multi-level environment. Consider query Q1 arriving at site D. Similar to A We call such extended DataGuide M LDG (Multi- it knows nothing about nodes in question. But it knows Level DataGuide). Let us look at Figure 3. The that /doc/person/child/person nodes are allocated left part presents RDG discussed earlier. Nodes at sites B and C. So if somebody could give an answer /doc/person/child are replicated on sites B and it is one of them. Again it redirects the query to either B C and nodes /doc/person/child/person/hobby are or C. placed on D. Since site A does not store all these Now consider query Q2 : /doc/person/name. If it nodes we can prune RDG tree leaving information arrives at site A the answer could be given right away about nodes that actually reside here and information since A knows it. If the query arrives at site B it must on where to find /doc/person/child. Such pruned be redirected to site A. B could do that since it knows RDG would be replicated on every site that stores repli- that /doc/person allocated at A from its DataGuide. cas of A. Such sites form level one in our distributed However if the query arrives at site D it could not be environment. Sites B and C receive another variation redirected right away since D knows only about B and of RDG. They are aware of the internal structure of C nodes. In this case it redirects the query at site B(C) /doc/person/child, so they contain more information and B(C) would redirect it to A. about corresponding subtrees. For example, they know about /doc/person/child/person nodes of which Multivelel redirection, such as D → B → C from A is not aware of. B and C form the second level. the above example shows another problem: if query ar- Lastly, /doc/person/child/person/hobby nodes are rives at some level but must be answered by a level distributed on site D. Again, B and C know where to much higher or lower than the current there would find these nodes, but they are not aware of their internal be a large number of redirections. Such indirection structure. D receives another modification of the initial could be alleviated by allowing levels to see more in- RDG as specified on Figure 3. formation about other levels. For example, we could The main rule here is that every site is aware of the add path /doc/person/child/person/hobby to the internal structure of its replica. This means that it stores DataGuide at site A allowing A to redirect the corre- the whole DataGuide only for subtrees belonging to the sponding questions without B or C. In this case the en- nodes that are replicated to it. Such DataGuides exclude vironment becomes more centralized and resembles the any subtrees for nodes that belong elsewhere as, for ex- RDG approach. However the amount of replicated in- ample, the case with /doc/person/child on site A. As formation is still small compared to RDG and at the can be seen on Figure 3 every site on level greater than same time it would allow processing queries without un- one also stores ancestor context for its replicas, which is necessary redirections. We could say that there would be a path from document root to the replicated node. These a some kind of trade-off between more centralized en- paths become parts of site’s DataGuide too. This al- vironment with some performance gains and more dis- lows two important things. First of all, this part of the tributed environments, which is more robust to the up- DataGuide allows pointing back to the previous level. dates of DataGuide. For example, /doc/person stores information about A As for the updates let us look at the following exam- and /doc/person/child/person stores information ple. about B and C. Without such “pointers” it would be im- Example 4. First of all, if we need to add some node that possible for such sites to receive queries since they would is already reflected in the DataGuide the same rule as for not know where to redirect them in case they have not got queries applies. Such update would be executed in place the needed nodes. Another reason is that ancestor context or redirected to the corresponding site. It is obvious that allows higher levels to issue proper, without mangling, no update of DataGuide is needed. path-queries to retrieve needed nodes from lower levels. Updates become more interesting when they involve In this case the structure of a query executor for every site nodes that are new to a DataGuide hence requiring it to would remain the same because of uniformity of distri- be changed. For example, let us assume that we want bution information’s structure. Moreover it yields more to add a child with the name “SSN” to the one of the natural description for disjoint node replicas. For exam- /doc/person/child/person nodes. This requires up- ple, if some site would hold /doc/person/name and dating DataGuide but only at sites B and C. Compare it /doc/person/hobby replicas it would be more natural with the RDG approach where such update would touch to hold it together with from-the-root paths and at the all sites to update fully replicated RDG. That is one of same time this allows to evaluate “usual” queries such as the examples where maintaing M LDG pays off. fn:count(/doc/person/name). Another benefit of M LDG would be fur- Let us see how queries could be evaluated in such ther distribution of XML data. For example, multi-level environment. let us assume that we want to reallocate nodes Example 3. Consider query Q1 : /doc/person/child/person/addr from B and C to /doc/person/child/person/name arriving at site some site F . In this case we would update only sites B A. A cannot answer this query but it knows that and C since it would be enough to locate these nodes. /doc/person/child nodes are allocated at sites B With RDG we would update all sites since every site and C. So it redirects this query to either B or C. B must know where to find relocated nodes. This example and C indeed contain all the data needed for this query. also illustrates that trade-off we were talking about earlier. If, for example, we want A to be able to locate relational data. But other works in this field discuss cor- aforementioned nodes we would update it as well. Again rectness problems for replicated data and propose algo- this alleviates indirection but makes changing data rithms that guarantee one or another level of correctness. allocation map less straightforward. XML replication can certainly benefit from these works, but this is not a part of our research. It is more of the third This example concludes our proposal. To summa- problem we described in “Introduction” since such algo- rize, the main idea is to use multi-leveled DataGuide rithms usually work without assuming something about (M LDG) as a representation of a data allocation map. underlying data structure. Multi-levelness allows more efficient updates and evolu- Now we discuss some works here that deal with dis- tion of the data allocation schema, i.e. reallocating nodes tribution of XML data. First one [6] discusses similar to another sites. problem of representing distribution map. Authors pro- There is much to be done in this research and in the pose ReplicationGuide(RG), which resembles RDG next section we will give a brief summary of future work. we discussed earlier. However in this paper authors are more concerned with performing structural joins and maintaining information about different physical paths 4 Future Work that correspond to one RG path in the form of µP IDs. In this section we give a brief summary of further di- This is somewhat similar to maintaing positional infor- rections of our research. First of all, when replicating mation we discussed in the previous section. However, XML data we must be aware of so-called document or- authors do not divulge into details whether µP IDs could der. Imagine, for example, that we want to retrieve some be used to solve serialization or update problems we of the /doc/person/child/person nodes. Such query mentioned earlier. Moreover this approach lacks in two would be easily delivered to one of the nodes B and C. ways. Firstly, RG is replicated on every site. As we have But then we must face the fact that some parts of the seen it can be cumbersome in some cases. Secondly, this nodes in question (“hobby” nodes in our example) are approach and µP IDs particulary are too dependent on located on another site. We must assemble all these parts DT D or XM LSchema, which can be bad in cases data from sites B (C) and D in the document order. Docu- does not follow such rigorous structure. ment order problem does not end with serialization how- Second paper [7] discusses partial evaluation problem ever. Most update extensions to XQuery allow user to in a distributed environment. The proposed algorithms specify the exact placement of the inserted node. Again deal with so-called “Boolean XPath queries“, which is original DataGuide is not suited for the purpose of stor- a subset of XPath queries that answer “true” or “false” ing such positional information. Maintaining informa- depending on existence of nodes corresponding to path- tion about document order is straightforward in the ab- expression. Authors propose efficient distribution algo- sence of distribution on the internal representation level, rithm, which guarantees some nice properties. However but in the distributed environment it could become not- this work is related to the problem of distributed query so-trivial task. We believe that using some of the node evaluation and does not discuss any problems of main- identifier (N ID) schemes can be a life-savior here, but taining distribution map. we must elaborate on some effective way to store such The final paper [12] deals with interesting prob- information in M LDG. lem of caching results of XPath queries in a peer-to- Another direction for the research could be to look peer environment. Authors propose two approaches: more closely on some of the XPath evaluation methods IndexCache and DataCache. The former involves and thinking about easy integration of M LDG with such storing results on peers that requested them and main- methods. Such integraton would allow more effective taining prefix-based index to allow other peers to access path-query evaluation in the distributed environment and it. Such prefix-based index is distributed among peers gaining the most benefits from the replication. for efficient querying. DataCache involves storing results And last but not least, we plan to do some experi- on particular sites since every site maintains its own por- ments to show benefits of the presented approach. There tion of query space. This allows to eliminate redundancy. could be a performance decrease because of redirecting Notice that prefix-based index resembles DataGuide ap- queries to another sites, but it should be tolerable in one proach as its primary goal is to path-index cached results. cases and completely surpassed by benefits of replication So this has little to do with querying distributed database in another. As for the updates it should be beneficial to but this technique can be beneficial as a performance in- RDG approach for the reasons presented in the previous creasing add-on. section. 6 Conclusion 5 Related Work This paper is a research-in-progress discussing some as- There is a large amount of work concerning replication in pects of managing replication for XML data. Replication relational databases(for example [8, 9, 5, 4, 11]). Some is an important task for any DBMS natively managing of such papers deal with replication of strictly relational XML data. Managing meta-information about replica- data. However since relational and XML data are differ- tion, i.e. some kind of replication map is not an easy and ent in structure these works are not suitable for our pur- trivial task. In this paper we have discussed approaches pose. Since XML data could be represented as a tree with to manage such information. We have described RDG, arbitrary depth level this implies somewhat more com- extended version of well-known DataGuide, and more plex map than in the case of “almost flat” (table-oriented) elaborate M LDG, which is a multi-leveled version of RDG itself. We have provided numerous examples to show benefits of our approach during evaluation of query and update statements. Also we have discussed direc- tions for future work to continue the presented research. References [1] eXist Native XML Database. http://exist.sourceforge.net/. [2] Sedna Native XML Database. http://modis.ispras.ru/sedna. [3] XML Path Language (XPath). http://www.w3.org/TR/xpath. [4] Fuat Akal, Can Türker, Hans-Jörg Schek, Yuri Bre- itbart, Torsten Grabs, and Lourens Veen. Fine- grained replication and scheduling with freshness and correctness guarantees. In VLDB, pages 565– 576, 2005. [5] Yuri Breitbart, Raghavan Komondoor, Rajeev Ras- togi, S. Seshadri, and Abraham Silberschatz. Up- date propagation protocols for replicated databases. In SIGMOD Conference, pages 97–108, 1999. [6] Jan-Marco Bremer and Michael Gertz. On dis- tributing xml repositories. In WebDB, pages 73–78, 2003. [7] Peter Buneman, Gao Cong, Wenfei Fan, and Anas- tasios Kementsietsidis. Using partial evaluation in distributed query evaluation. In VLDB, pages 211– 222, 2006. [8] Khuzaima Daudjee and Kenneth Salem. Lazy database replication with ordering guarantees. In ICDE, pages 424–435, 2004. [9] Khuzaima Daudjee and Kenneth Salem. Lazy database replication with snapshot isolation. In VLDB, pages 715–726, 2006. [10] Roy Goldman and Jennifer Widom. Dataguides: Enabling query formulation and optimization in semistructured databases. In VLDB, pages 436– 445, 1997. [11] Jim Gray, Pat Helland, Patrick E. O’Neil, and Den- nis Shasha. The dangers of replication and a so- lution. In SIGMOD Conference, pages 173–182, 1996. [12] Kostas Lillis and Evaggelia Pitoura. Cooperative xpath caching. In SIGMOD Conference, pages 327–338, 2008.