DataGuide-based Distribution for XML Documents

                                                c Alexander Kalinin
                        Institute for System Programming of the Russian Academy of Sciences
                                               allex.kalinin@gmail.com


                       Abstract
      Distribution is a well-known solution to in-                      Q1:/lib/*[name() eq 'book' or name() eq 'article']
      crease performance and provide load balanc-
      ing in case you need optimal resource uti-
      lization. Together with replication it also
      allows improved reliability, accessibility and
      fault-tolerance. However since the amount
      of data is large there is a problem of main-                             NODE1                             NODE2

      taining meta-information about distribution and                   /lib/*[name() eq 'book']        /lib/*[name() eq 'article']
      finding needed data fragments during execu-
      tion of queries. These problems are well un-
      derstood but they have not received much at-
      tention in the context of XML data manage-
      ment. This paper presents research-in-progress,                                              NODE3
      which examines the possibility of management                                              MERGE
      of meta-information about XML data distribu-
      tion extending auxillary index structure called
      DataGuide.

1     Introduction                                                      Figure 1: Example of load balancing
Distribution and replication are often used in data man-      databases ([1] or [2] for example), which deal with large
agement to provide load-balancing and improve reliabil-       amount of XML data. For them data replication could
ity and accessibility. Distribution means partitioning data   give the same benefits as for relational databases. Let us
into some fragments and allocating the corresponding          consider a small example:
fragments on some number of sites. Replication deals
with problem of allocating the same fragments on differ-      Example 1. Let us look at Figure 1. This figure illus-
ent sites. Since these terms are often used in the same       trates simple load balancing example. There is a query
context, in this paper we will be using terms “replica-       Q1, which retrieves book and article elements. If book
tion” and “distribution” interchangeably. This is justified   elements were distributed to N ode1 and article ele-
since we assume distribution of data into fragments and       ments to N ode2 then this task could be done in parallel
replication some of the fragments on multiple sites.          on these nodes. Then result would be merged on some
    When using replication numerous problems arise.           N ode3, which could be an original node query had ar-
Among them we first consider:                                 rived to. Without such distribution N ode3 would process
                                                              such query by itself and probably sequentially by retriev-
    1. Determining fragments to distribute and replicate      ing book nodes and then article nodes.
       and their placement.
                                                                  Now let us review three aforementioned problems in
    2. Management of meta-information about distribu-         the context of XML data management.
       tion.                                                      First of all, there is a problem of determining distribu-
                                                              tion fragments and their placement. This implies choos-
    3. Management of corresponding fragments and repli-       ing fragment’s size. For XML documents two choices
       cas, e.g. consistency of replicas during updates.      seem reasonable: the whole document (document-level
                                                              replication) or individual nodes (node-level replication).
    These problems have been well researched in the re-       Document-level distribution is a well-known solution in
lational world but to the best of our knowledge they          case of small documents or mostly read-only environ-
have not received much attention in the context of XML        ment. The main benefit here is as DBMS executes some
data management. However there are exist native XML           query on document D it can choose any of the sites the
Proceedings of the Spring Young Researcher’s Colloquium
                                                              document D resides on to execute the query. This, at the
on Database and Information Systems, Saint-Petersburg, Rus-   same time, provides good load balancing and resource
sia, 2009                                                     utilization during read-only queries. However there is a
problem of managing such replicated documents when
updates arrive. Since document D is replicated fully
                                                                                       <doc>
among some sites update of any of its part propagates                                  <person age='55'>
                                                                                        <name>Peter</name>
to all of these sites. In case such updates are common                                  <addr>Old Street,25</addr>
                                                                                        <child>
they can destroy all benefits we receive from such repli-                                <person>
                                                                                           <name>John</name>
cation scheme. Another problem here is the size of the                                     <addr>UStreet,16</addr>
                                                                                           <hobby>swimming</hobby>
fragment. In case of large documents such replication                                      <hobby>cycling</hobby>
                                                                                         </person>
becomes quite space-inefficient. With regard to this we                                 </child>
believe that replication based on nodes is a more efficient                             <child>
                                                                                         <person>
solution in general, when database may contain docu-                                       <name>Robert</name>
                                                                                           <addr>Old Street,25</addr>
ments large in size or when updates are not uncommon.                                    </person>
                                                                                        </child>
So in this paper we imply node-level replication.                                      </person>
                                                                                       <person age='20'>
                                                                                        <name>Mary</name>
    If we use node-level replication we must deal with                                  <addr>Quensway,34</addr>
                                                                                        <hobby>painting</hobby>
large amount of meta-information about distribution, e.g.                              </person>
                                                                                       </doc>
on what sites particular nodes reside. This task in not
trivial since the number of nodes in a large XML doc-                                                doc

ument may be quite high. The efficient management of
                                                                                                   person
such data is the main theme of the presented research.
The most common way to deal with such information
is to use some kind of auxillary index structures. In                           @age    name addr child hobby

this paper we propose using DataGuide[10] as such in-
                                                                                                           person
dex. We will describe DataGuide in details and give
some examples in the next section. Here it is sufficient
to say that DataGuide resembles path-index in such way                                        name    addr hobby

as every possible path in a document is represented in a
corresponding DataGuide structure. Since nodes are lo-
cated by paths in every XML document we believe this                       Figure 2: Example of DataGuide
structure to be the most appropriate solution in case of
node-level replication. Moreover DataGuide is used in         elaborate one, which is our main research proposal. Sec-
some native XML implementations (e.g. Sedna[2]) for           tion 4 gives a brief explanation of work to be done for
optimization of XPath[3], which is a navigational lan-        this research. In Section 5 we describe related work. And
guage for XML documents and also deals with nodes.            Section 6 concludes this paper.
In such systems using DataGuide to store replication
schema means integration with query executor straight-        2     DataGuide
forward. Such sound integration allows to receive most
benefits from replication since one of its main goals is to   DataGuide is a structure that describes different paths of
make query evaluation more efficient.                         XML document. More precisely:
    At first we propose straightforward and most obvious      Definition 1. DataGuide for XML document D is an
way to extend DataGuide, which requires minimal modi-         XML document DG with following properties:
fications. Such DataGuide is then replicated itself among
participating sites. Then, the more elaborate approach            1. For every path in D there is a unique equivalent
follows, which includes extending DataGuide, distribut-              path in DG
ing it in some fragments and replicating them only for
nodes that really need it. Such approach creates multi-           2. For every path in DG there is a equivalent path in
level environment, hence the name M LDG (Multi-Level                 D
Data Guide).
                                                                 Figure 2 presents an example of DataGuide. Top half
    Last problem remaining – maintenance of corre-            contains document D and the bottom half contains the
sponding replicas. The most infamous problem here is          corresponding DataGuide. One of the most important
to keep replicas in consistent state. Much work has been      features of DataGuide is that it effectively serves as a
done in this field outside of XML data management. But        path-level index. Since XPath is the main navigational
since most algorithms deal with abstract “data elements”      language for XML documents DataGuide is used in its
and do not impose restrictions on underlying data model       optimizations. For example, in Sedna XML database
they can be used for dealing with replication of XML          nodes of DataGuide point to the corresponding nodes
data as well. So we believe it to be a separate problem       of a document, which allows quick evaluation of XPath
and do not include any suggestions on the problem in this     queries. Let us look at some example.
paper.
    The rest of the paper is organized as follows. In the     Example 2. Consider query Q1: /doc/individual.
next section we describe DataGuide structure and give         Query executor would see that such path does not exist
some examples of its usability during evaluation of some      and return empty sequence. It does not even have to look
path-expressions. Then, in Section 3 we discuss the pos-      into any nodes.
sibility of using DataGuide to store information about           Now consider Q2: /doc/person/child. In this
replication as well. First we give there the most straight-   case query executor could find corresponding document
forward approach and then we move further to the more         nodes by looking at DataGuide and using it as an index.
                    RDG                                    MLDG( LEVEL 1)                 MLDG( LEVEL 2)              MLDG( LEVEL 3)

                Sites A,B,C,D                              Site A                         Sites B,C                   Site D

                                                                                                      doc              doc
                          doc
                                                                    doc
                                                                                                      person {A}      person
                        person
                                                                person                                child            child

                                                                                                      person          person {B,C}
        @age name addr child {B,C}       hobby
                                                 @age name addr child {B,C}    hobby
                                                                                                                       hobby
                                person
                                                                                          name addr      hobby {D}


                   name addr hobby {D}


                                                    Figure 3: Multi-level DataGuide

In fact, in this case it would not look at any unnecessary                    to site A only for node /doc. We will call such extended
nodes at all.                                                                 DataGuide RDG (Replication-aware DataGuide).
   At last, consider Q3: /doc/person[./child].                                    The proposed scheme allows executor to easily redi-
There is a path /doc/person/child in DG but accord-                           rect part of a query to the specified sites. For example,
ing to the definition it does not have to be unique. This                     for query Q3: /doc/person/child[@age=’15’] we
means that not every “/doc/person” node has “child”                           would go to one of the nodes B or C and continue
node as a child. However in this case DataGuide al-                           query evaluation there. For DataGuide-aware executor
lows efficient iteration over “/doc/person/child” nodes                       traversing DataGuide would be part of the job anyway.
and obtaining their parents, which would be resulting                         Other executors could evaluate Q3 on RDG since it is
“/doc/person” nodes.                                                          structured as XML document. The main caveat here
                                                                              is that in order to work efficiently such RDG should
   This example shows that DataGuide can be used as
                                                                              be fully replicated between all sites we use to answer
effective XML data map. We believe it is straightfor-
                                                                              queries. The main benefit here is that every site knows
ward to extend it to be able to support information about
                                                                              about data allocation and can redirect parts of a query
distribution, i.e. serve as a distribution map. Since we
                                                                              right to the corresponding nodes. Moreover it can an-
consider node-level replication here and nodes are ac-
                                                                              swer some queries locally without even accessing the
cessed by path expressions DataGuide is a natural choice.
                                                                              other machines. As an example consider query Q4 :
Moreover it would allow easy integration with query ex-
                                                                              f n : count(/doc/person/child/brother) received by
ecutors, which are aware of path expressions anyway. In
                                                                              the site A. Since the corresponding path is absent from
the next section we will present extensions to DataGuide
                                                                              RDG then A can give the answer, 0, right away even
to support replication.
                                                                              though it does not contain /doc/person/child nodes
                                                                              itself.
3   Extending DataGuide with replication
                                                                                  Despite of some benefits the RDG approach also has
In this section we discuss the possibility of tuning                          some drawbacks. Since we must replicate RDG to every
DataGuide to support replication. We use the term “site”                      site it becomes somewhat hard to maintain it in case of
to distinguish nodes on which data reside from XML                            updates. In fact updates can be of two types. First ones
nodes.                                                                        are updates of corresponding XML document. For exam-
    First and the most straightforward way is to extend                       ple, consider adding node with the name “child” to some
DataGuide with information about sites the particular                         of the /doc/person/child/person nodes. Since the
nodes reside on. Consider DataGuide DG from Figure                            corresponding path is absent from the DataGuide we
2. Let us assume that /doc/person/child nodes are                             would need to update it for every site. So such simple in-
replicated among sites B and C, and site A contains all                       sert of a node becomes an update of the entire cluster of
other nodes except /doc/person/child. Then we can                             sites. It would be beneficial to update just sites B and C
store such information in the nodes of the DataGuide.                         where /doc/person/child/person reside. The sec-
For example, node /doc/person/child would con-                                ond type of updates are updates of meta-information. For
tain pair (B, C) allowing to find sites corresponding                         example, if we want to replicate /doc/person/child
document nodes reside on. It is a matter of choice                            to some site D. Again since RDG is fully replicated we
whether to propagate such information to the children                         should propagate such update to every site. However the
of /doc/person/child. We could assume that without                            last problem could be easily alleviated. For example, for
any explicit directions node resides on the same site as                      site A we could store information only about site B and
its nearest ancestor. In this case we would write pointer                     site B knows that its data replicated also at site C. In this
case this last update would touch only sites B and C.           In fact they can execute this query as it first arrived
    Considering aforementioned shortcomings we pro-             straight to one of them. This means we do not even need
pose another approach. The main idea is to dis-                 to rewrite the query.
tribute RDG itself creating multi-level environment.                Consider query Q1 arriving at site D. Similar to A
We call such extended DataGuide M LDG (Multi-                   it knows nothing about nodes in question. But it knows
Level DataGuide). Let us look at Figure 3. The                  that /doc/person/child/person nodes are allocated
left part presents RDG discussed earlier.             Nodes     at sites B and C. So if somebody could give an answer
/doc/person/child are replicated on sites B and                 it is one of them. Again it redirects the query to either B
C and nodes /doc/person/child/person/hobby are                  or C.
placed on D. Since site A does not store all these                  Now consider query Q2 : /doc/person/name. If it
nodes we can prune RDG tree leaving information                 arrives at site A the answer could be given right away
about nodes that actually reside here and information           since A knows it. If the query arrives at site B it must
on where to find /doc/person/child. Such pruned                 be redirected to site A. B could do that since it knows
RDG would be replicated on every site that stores repli-        that /doc/person allocated at A from its DataGuide.
cas of A. Such sites form level one in our distributed          However if the query arrives at site D it could not be
environment. Sites B and C receive another variation            redirected right away since D knows only about B and
of RDG. They are aware of the internal structure of             C nodes. In this case it redirects the query at site B(C)
/doc/person/child, so they contain more information             and B(C) would redirect it to A.
about corresponding subtrees. For example, they know
about /doc/person/child/person nodes of which                       Multivelel redirection, such as D → B → C from
A is not aware of. B and C form the second level.               the above example shows another problem: if query ar-
Lastly, /doc/person/child/person/hobby nodes are                rives at some level but must be answered by a level
distributed on site D. Again, B and C know where to             much higher or lower than the current there would
find these nodes, but they are not aware of their internal      be a large number of redirections. Such indirection
structure. D receives another modification of the initial       could be alleviated by allowing levels to see more in-
RDG as specified on Figure 3.                                   formation about other levels. For example, we could
    The main rule here is that every site is aware of the       add path /doc/person/child/person/hobby to the
internal structure of its replica. This means that it stores    DataGuide at site A allowing A to redirect the corre-
the whole DataGuide only for subtrees belonging to the          sponding questions without B or C. In this case the en-
nodes that are replicated to it. Such DataGuides exclude        vironment becomes more centralized and resembles the
any subtrees for nodes that belong elsewhere as, for ex-        RDG approach. However the amount of replicated in-
ample, the case with /doc/person/child on site A. As            formation is still small compared to RDG and at the
can be seen on Figure 3 every site on level greater than        same time it would allow processing queries without un-
one also stores ancestor context for its replicas, which is     necessary redirections. We could say that there would be
a path from document root to the replicated node. These         a some kind of trade-off between more centralized en-
paths become parts of site’s DataGuide too. This al-            vironment with some performance gains and more dis-
lows two important things. First of all, this part of the       tributed environments, which is more robust to the up-
DataGuide allows pointing back to the previous level.           dates of DataGuide.
For example, /doc/person stores information about A                 As for the updates let us look at the following exam-
and /doc/person/child/person stores information                 ple.
about B and C. Without such “pointers” it would be im-
                                                                Example 4. First of all, if we need to add some node that
possible for such sites to receive queries since they would
                                                                is already reflected in the DataGuide the same rule as for
not know where to redirect them in case they have not got
                                                                queries applies. Such update would be executed in place
the needed nodes. Another reason is that ancestor context
                                                                or redirected to the corresponding site. It is obvious that
allows higher levels to issue proper, without mangling,
                                                                no update of DataGuide is needed.
path-queries to retrieve needed nodes from lower levels.
                                                                    Updates become more interesting when they involve
In this case the structure of a query executor for every site
                                                                nodes that are new to a DataGuide hence requiring it to
would remain the same because of uniformity of distri-
                                                                be changed. For example, let us assume that we want
bution information’s structure. Moreover it yields more
                                                                to add a child with the name “SSN” to the one of the
natural description for disjoint node replicas. For exam-
                                                                /doc/person/child/person nodes. This requires up-
ple, if some site would hold /doc/person/name and
                                                                dating DataGuide but only at sites B and C. Compare it
/doc/person/hobby replicas it would be more natural
                                                                with the RDG approach where such update would touch
to hold it together with from-the-root paths and at the
                                                                all sites to update fully replicated RDG. That is one of
same time this allows to evaluate “usual” queries such as
                                                                the examples where maintaing M LDG pays off.
fn:count(/doc/person/name).
                                                                    Another benefit of M LDG would be fur-
    Let us see how queries could be evaluated in such
                                                                ther distribution of XML data.              For example,
multi-level environment.
                                                                let us assume that we want to reallocate nodes
Example      3.    Consider      query      Q1         :        /doc/person/child/person/addr from B and C to
/doc/person/child/person/name arriving at site                  some site F . In this case we would update only sites B
A. A cannot answer this query but it knows that                 and C since it would be enough to locate these nodes.
/doc/person/child nodes are allocated at sites B                With RDG we would update all sites since every site
and C. So it redirects this query to either B or C. B           must know where to find relocated nodes. This example
and C indeed contain all the data needed for this query.        also illustrates that trade-off we were talking about
earlier. If, for example, we want A to be able to locate      relational data. But other works in this field discuss cor-
aforementioned nodes we would update it as well. Again        rectness problems for replicated data and propose algo-
this alleviates indirection but makes changing data           rithms that guarantee one or another level of correctness.
allocation map less straightforward.                          XML replication can certainly benefit from these works,
                                                              but this is not a part of our research. It is more of the third
   This example concludes our proposal. To summa-             problem we described in “Introduction” since such algo-
rize, the main idea is to use multi-leveled DataGuide         rithms usually work without assuming something about
(M LDG) as a representation of a data allocation map.         underlying data structure.
Multi-levelness allows more efficient updates and evolu-          Now we discuss some works here that deal with dis-
tion of the data allocation schema, i.e. reallocating nodes   tribution of XML data. First one [6] discusses similar
to another sites.                                             problem of representing distribution map. Authors pro-
   There is much to be done in this research and in the       pose ReplicationGuide(RG), which resembles RDG
next section we will give a brief summary of future work.     we discussed earlier. However in this paper authors
                                                              are more concerned with performing structural joins and
                                                              maintaining information about different physical paths
4   Future Work
                                                              that correspond to one RG path in the form of µP IDs.
In this section we give a brief summary of further di-        This is somewhat similar to maintaing positional infor-
rections of our research. First of all, when replicating      mation we discussed in the previous section. However,
XML data we must be aware of so-called document or-           authors do not divulge into details whether µP IDs could
der. Imagine, for example, that we want to retrieve some      be used to solve serialization or update problems we
of the /doc/person/child/person nodes. Such query             mentioned earlier. Moreover this approach lacks in two
would be easily delivered to one of the nodes B and C.        ways. Firstly, RG is replicated on every site. As we have
But then we must face the fact that some parts of the         seen it can be cumbersome in some cases. Secondly, this
nodes in question (“hobby” nodes in our example) are          approach and µP IDs particulary are too dependent on
located on another site. We must assemble all these parts     DT D or XM LSchema, which can be bad in cases data
from sites B (C) and D in the document order. Docu-           does not follow such rigorous structure.
ment order problem does not end with serialization how-           Second paper [7] discusses partial evaluation problem
ever. Most update extensions to XQuery allow user to          in a distributed environment. The proposed algorithms
specify the exact placement of the inserted node. Again       deal with so-called “Boolean XPath queries“, which is
original DataGuide is not suited for the purpose of stor-     a subset of XPath queries that answer “true” or “false”
ing such positional information. Maintaining informa-         depending on existence of nodes corresponding to path-
tion about document order is straightforward in the ab-       expression. Authors propose efficient distribution algo-
sence of distribution on the internal representation level,   rithm, which guarantees some nice properties. However
but in the distributed environment it could become not-       this work is related to the problem of distributed query
so-trivial task. We believe that using some of the node       evaluation and does not discuss any problems of main-
identifier (N ID) schemes can be a life-savior here, but      taining distribution map.
we must elaborate on some effective way to store such             The final paper [12] deals with interesting prob-
information in M LDG.                                         lem of caching results of XPath queries in a peer-to-
    Another direction for the research could be to look       peer environment. Authors propose two approaches:
more closely on some of the XPath evaluation methods          IndexCache and DataCache. The former involves
and thinking about easy integration of M LDG with such        storing results on peers that requested them and main-
methods. Such integraton would allow more effective           taining prefix-based index to allow other peers to access
path-query evaluation in the distributed environment and      it. Such prefix-based index is distributed among peers
gaining the most benefits from the replication.               for efficient querying. DataCache involves storing results
    And last but not least, we plan to do some experi-        on particular sites since every site maintains its own por-
ments to show benefits of the presented approach. There       tion of query space. This allows to eliminate redundancy.
could be a performance decrease because of redirecting        Notice that prefix-based index resembles DataGuide ap-
queries to another sites, but it should be tolerable in one   proach as its primary goal is to path-index cached results.
cases and completely surpassed by benefits of replication     So this has little to do with querying distributed database
in another. As for the updates it should be beneficial to     but this technique can be beneficial as a performance in-
RDG approach for the reasons presented in the previous        creasing add-on.
section.
                                                              6    Conclusion
5   Related Work
                                                              This paper is a research-in-progress discussing some as-
There is a large amount of work concerning replication in     pects of managing replication for XML data. Replication
relational databases(for example [8, 9, 5, 4, 11]). Some      is an important task for any DBMS natively managing
of such papers deal with replication of strictly relational   XML data. Managing meta-information about replica-
data. However since relational and XML data are differ-       tion, i.e. some kind of replication map is not an easy and
ent in structure these works are not suitable for our pur-    trivial task. In this paper we have discussed approaches
pose. Since XML data could be represented as a tree with      to manage such information. We have described RDG,
arbitrary depth level this implies somewhat more com-         extended version of well-known DataGuide, and more
plex map than in the case of “almost flat” (table-oriented)   elaborate M LDG, which is a multi-leveled version of
RDG itself. We have provided numerous examples to
show benefits of our approach during evaluation of query
and update statements. Also we have discussed direc-
tions for future work to continue the presented research.


References
 [1] eXist          Native         XML         Database.
     http://exist.sourceforge.net/.
 [2] Sedna         Native         XML          Database.
     http://modis.ispras.ru/sedna.
 [3] XML         Path      Language             (XPath).
     http://www.w3.org/TR/xpath.
 [4] Fuat Akal, Can Türker, Hans-Jörg Schek, Yuri Bre-
     itbart, Torsten Grabs, and Lourens Veen. Fine-
     grained replication and scheduling with freshness
     and correctness guarantees. In VLDB, pages 565–
     576, 2005.
 [5] Yuri Breitbart, Raghavan Komondoor, Rajeev Ras-
     togi, S. Seshadri, and Abraham Silberschatz. Up-
     date propagation protocols for replicated databases.
     In SIGMOD Conference, pages 97–108, 1999.
 [6] Jan-Marco Bremer and Michael Gertz. On dis-
     tributing xml repositories. In WebDB, pages 73–78,
     2003.
 [7] Peter Buneman, Gao Cong, Wenfei Fan, and Anas-
     tasios Kementsietsidis. Using partial evaluation in
     distributed query evaluation. In VLDB, pages 211–
     222, 2006.
 [8] Khuzaima Daudjee and Kenneth Salem. Lazy
     database replication with ordering guarantees. In
     ICDE, pages 424–435, 2004.
 [9] Khuzaima Daudjee and Kenneth Salem. Lazy
     database replication with snapshot isolation. In
     VLDB, pages 715–726, 2006.
[10] Roy Goldman and Jennifer Widom. Dataguides:
     Enabling query formulation and optimization in
     semistructured databases. In VLDB, pages 436–
     445, 1997.
[11] Jim Gray, Pat Helland, Patrick E. O’Neil, and Den-
     nis Shasha. The dangers of replication and a so-
     lution. In SIGMOD Conference, pages 173–182,
     1996.
[12] Kostas Lillis and Evaggelia Pitoura. Cooperative
     xpath caching. In SIGMOD Conference, pages
     327–338, 2008.