                     Fragmenting XML Documents
                       via Structural Constraints

               Angela Bonifati1 , Alfredo Cuzzocrea2 , and Bruno Zinno1
                                 1
                                 Icar-CNR, Via P. Bucci 41/C,
                                     87036 Rende, Italy
                          {bonifati, zinno}@icar.cnr.it
                       2
                         DEIS, University of Calabria, Via P.Bucci 41/C,
                                     87036 Rende, Italy
                           cuzzocrea@si.deis.unical.it



       Abstract. XML query processors suffer from main-memory limitations that pre-
       vent them from processing large XML documents. While content-based predi-
       cates can be used to project down parts of the documents, it may still be necessary
       to resize the obtained projections according to structural constraints. In this pa-
       per, we consider size, tree-width and tree-depth constraints to enable a structure-
       driven fragmentation of XML documents.
       Although a set of heuristics performing this kind of fragmentation can be easily
       devised, a key problem is determining the values of the structural constraints given
       as input to these heuristics, since the search space is prohibitively large. To alle-
       viate the problem, we introduce special-purpose structure histograms that report
       the constraint values for the fragments of a given document. We then present a
       prediction algorithm that probes those histograms to output the expected number
       of fragments, when fixed input values of the constraints are used. Furthermore,
       we study how to relax the fixed constraints by means of classical distributions.
       An experimental evaluation of our study shows the effectiveness of our fragmen-
       tation methodology on some representative XML datasets.


1 Introduction
A major ongoing effort in XML processing is undoubtedly making it as fast and effi-
cient as possible. Query engines for XML are being designed and implemented, with
the specific goal of employing indexes to improve their performance [10]. Others [23]
employ statistics to cost the most frequently asked queries, or use classical algebraic
techniques [16] to optimize query plans.
    An XML query engine that is parsimonious in resources may still benefit from a further
resizing of the obtained query results. This need also arises in many resource-critical
contexts, such as a distributed database or a stream processor. We propose a suitable
fragmentation of XML input trees and tree results that applies the well-known ‘divide
and conquer’ principle. The advantages of XML fragmentation have already been demonstrated
in an XML query engine [4, 5] and in a distributed setting [3]. Fragmentation of XML
documents as proposed by the previous works has been based on semantics, whereas in
this paper we work out a novel kind of fragmentation, which is orthogonal to the former
and is guided only by the structural properties of an XML document.
The problem: building constraint-driven fragments. Given an XML document, mod-
eled w.l.o.g. as a tree, there exist several ways of splitting it into subtrees, which may
be semantically driven or structurally driven. Usually, query processors decide to ap-
ply projections and selections beforehand in order to reduce the amount of data to be
manipulated in memory during query evaluation. Notwithstanding the effectiveness of
pushing algebraic operators down the query plan, it may happen that the sizes of inter-
mediate results are still too large to fit in memory. If for instance we consider a 100MB
XMark document, and a query Q1 asking for open auctions sold by people owning a
credit card (creditcard being an optional element) and for closed auctions sold by
any person, the resulting query plan would look like the one shown in Fig. 1 (a). The two
branches of the join operator(s) would in such a case be of size 29.6MB (6.1MB) and
17.09MB (11.8MB), respectively. Along with the size, the intermediate results can also
be resized w.r.t. tree-width and tree-depth constraints, which may affect the query pro-
cessing time as well. These can be prohibitively large for the join branches above, i.e.
70 (38) and 9 (2), respectively, for the left-hand-side join operator.
     If structure-driven fragmentation is employed, the subtrees output by the selections
 and projections can be resized to smaller pieces according to the structural constraints
 and can be then processed per piece. Fig.1(b) pictures the result of fragmentation on the
 two operands of the join(s).
     A suitable application of structure-driven fragmentation is streaming XML process-
 ing (e.g. [8]). Stream query processors are mainly memory-based, thus motivating the
 use of smaller fragments given for instance by top-down navigation of the original doc-
 ument. Our fragmentation could be employed to obtain smaller XML messages input to
the stream (see Fig. 1 (c)), carefully designed not to exceed specified memory require-
ments at query runtime. Thus, using the three structural constraints together allows us
to obtain approximately uniform fragments, e.g. to be used in a uniform stream.
     Finally, distributed and parallel query processing may leverage the fragmentation of
 the original documents, in order to improve their performance. This issue will be further
 discussed in our experimental study on XP2P [3], a P2P-based infrastructure we have
 developed.
     Coming back to our problem, we can state it as follows. Let t be an XML tree, w
a tree-width constraint, d a tree-depth constraint, and s a tree-size constraint. We split t
into valid fragments f of t such that size(f) ≤ s, tw(f) ≤ w, td(f) ≤ d, and there exists
no other valid fragment f′ of t such that f ∩ f′ ≠ ∅, where size, tw and td are functions
returning the size, the maximum width, and the maximum depth of f, respectively.
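To make these constraint functions concrete, the following is a minimal Python sketch of the three structural measures and the validity check from the problem statement. The Node class, its fields, and the conventions of measuring width as maximum fan-out and depth as an edge count are our assumptions, not part of the formal model.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        label: str
        size: int                       # size of this node's own content (assumed, e.g. bytes)
        children: list = field(default_factory=list)

    def size(t: Node) -> int:
        """Total size of the subtree rooted at t."""
        return t.size + sum(size(c) for c in t.children)

    def tw(t: Node) -> int:
        """Maximum fan-out in the subtree rooted at t (our reading of tree-width)."""
        if not t.children:
            return 0
        return max(len(t.children), max(tw(c) for c in t.children))

    def td(t: Node) -> int:
        """Maximum depth of the subtree rooted at t, with the root at depth 0."""
        if not t.children:
            return 0
        return 1 + max(td(c) for c in t.children)

    def is_valid_fragment(f: Node, s: int, w: int, d: int) -> bool:
        return size(f) <= s and tw(f) <= w and td(f) <= d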
The proposal. The above constraints may turn out to be incompatible if randomly specified,
thus possibly leading to empty solutions. We have devised SimpleX, a set of heuristics
performing the above fragmentation. Since the search space is prohibitively large, a key
problem is determining the values of the structural constraints given as input to these heuristics.
To alleviate the problem, we have designed a set of summary data structures, namely
the structure histograms, which allow determining correct combinations of the constraint
values without actually performing the fragmentation beforehand. The histograms have been
implemented within an analysis module that uses algorithms to predict an interval for
the number of fragments produced by the heuristics.
Fig. 1. Query plan of query Q1 (a), with the fragmentation operator applied (b), and example of
use of fragmentation for XML data streams (c).
                  Node       Size (KB)
                  site       145
                  people     100
                  catgraph   45
                  person1    20
                  person2    50
                  person3    30
                  edge1      15
                  edge2      10
                  edge3      20

                             Table 1. Sizes of subtrees of Fig. 2.

This is particularly interesting for large XML datasets and query results, as it offers a
visual summarization tool that can be inspected at any time for prediction.
    Moreover, we study how to relax the fixed constraints by means of classical distri-
butions.
    Finally, we have probed the effectiveness of our set of heuristics and structure his-
tograms on a set of significant XML datasets.
    The rest of the paper is organized as follows: Section 2 presents the heuristics for
structure-driven fragmentation; Section 3 describes the structure histograms and their
use in our analysis module; Section 4 presents a variety of experiments that probe the
effectiveness of our techniques; Section 5 discusses related work; finally, Section 6
concludes the paper.


2 SimpleX: simple top-down heuristics for shredding an XML
  document

The fragmentation problem stated above has a linear cost function and integer con-
straints, and is intrinsically exponential. To effectively explore the search space, we
have designed a set of simple top-down heuristics for document fragmentation. They
all start at the root of the document and proceed in a top-down fashion. At each step,
the current subtree width, depth and size are checked against the constraints w, d, s.
If the constraints are satisfied, the subtree becomes a valid fragment and is pruned
from the document to constitute a separate valid XML document. A new node containing
as PC-data the path expression of the obtained fragment then replaces the given
subtree in the original document. If instead the constraints are not satisfied, the
algorithm inspects the next subtree in the XML tree according to the criterion
established by the heuristic.
     A first criterion to select the next subtree is for instance given by the order of visit,
i.e. depth-first or breadth-first. We call these variants in-depth and in-width. Fig. 2 rep-
resents an XML tree compliant with the XMark DTD, whose subtree sizes 3 are reported
in Table 1 as absolute numbers (dots in Fig. 2 represent PC-data elements whose sizes
appear in Table 1).
     Applying for instance the in-depth heuristics with constrained size s = 100 and
depth d = 1 and unconstrained width w, the XML tree gets fragmented as in Fig. 2 (a),
whereas with s = 100, d = 1 and w = 1, it gets fragmented as shown in Fig. 2 (b). The
application of the other heuristics on the sample tree is omitted for conciseness.
 3
     The subtree depth and width can be easily inferred from Fig. 2 and are omitted for space
     reasons.
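As an illustration, here is a hedged Python sketch of the in-depth variant, reusing the Node class and the size, tw, td helpers from the earlier sketch. The stub representation and the choice of never emitting the root itself as a fragment are our assumptions.

    def fragment_in_depth(t, s, w, d, path="", out=None):
        """Depth-first, top-down fragmentation under the s, w, d constraints."""
        if out is None:
            out = []
        here = f"{path}/{t.label}"
        for i, c in enumerate(t.children):
            frag_path = f"{here}/{c.label}"
            if size(c) <= s and tw(c) <= w and td(c) <= d:
                out.append((frag_path, c))          # c becomes a valid fragment
                t.children[i] = Node(frag_path, 0)  # stub carrying the path expression
            else:
                fragment_in_depth(c, s, w, d, here, out)  # inspect next subtree in depth
        return out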
Fig. 2. A sample XMark tree fragmented with one of the SimpleX heuristics and two (three)
constraints in (a) (in (b)).



    SimpleX 4 is one possible set of simple heuristics among the various ones that can
be applied for shredding a document (e.g. bottom-up or random-access). In principle,
no heuristic is better than any other, as this actually depends on the structure of
the document. Our aim in this paper is not to find the best heuristic, but to
show how to tune the fragmentation constraints for the SimpleX heuristics when summary
data structures are employed. To this purpose, we employ ad-hoc summary structures,
namely the structure histograms, which allow estimating the number of fragments satisfying
the constraint values. We introduce the structure histograms next.


3 Structure histograms
Given an XML document X, the structure histograms summarize X by counting the
fragments (i.e., the subtrees) in X according to the following structural properties:
(i) the fragment size s, (ii) the fragment tree-depth d, and (iii) the fragment
tree-width w.
    Formally, let X be an XML document, let p be a structural property defined on X,
and let Dp = [pmin, pmax] be the value domain of p. A class ∆p is defined on Dp as
∆p = [p′min, p′max], such that pmin ≤ p′min ≤ p′max ≤ pmax. Then, a structure
histogram built on X, denoted by HS(X, p, ∆p), grouping p by the aggregation function
f = COUNT (thus reporting the frequency of the fragments) over ∆p-wise steps, is a
tuple ⟨Dp, Hp⟩, such that each bucket b(∆p) in the co-domain Hp counts the fragments
in X having a value of the (structural) property p ranging over ∆p = [p′min, p′max].
We call HS a one-dimensional histogram computed over p.
    Moreover, to support parametric summarization of XML data and thus improve
the fragmentation prediction, we introduce an extension of the previous histograms,
called parametric structure histograms and denoted by HS^P(X, p, ∆p), such that P is a
fixed structural property w.r.t. which the histogram over p is computed.
    We call HS^P a two-dimensional histogram computed over ⟨P, p⟩.
    More precisely, in our fragmentation framework, we need to introduce the following
structure histograms summarizing X:
 4
     In the remainder, and in the experiments, we will simply indicate the set of heuristics as Sim-
     pleX. We mean that we apply all the heuristics in the set and pick the results of the most
     efficient one at each run.
                          HD           HS (partial)     HW
                          d   f        s     f          w   f
                          2   1        145   1          2   1
                          1   2        100   1          3   2
                          0   6        45    1          0   6
                                       20    1
                                       ...   ...

           Table 2. HD, HS (partial) and HW for the sample XMark tree of Fig. 2.



- the Tree-Size Structure Histogram HS (X, s, ∆s), which summarizes X w.r.t. the size
  s (i.e., p = s);
- the Tree-Depth Structure Histogram HD (X, d, ∆d), which summarizes X w.r.t. the
  tree-depth d (i.e., p = d);
- the Tree-Width Structure Histogram HW (X, w, ∆w), which summarizes X w.r.t. the
  tree-width w (i.e., p = w);
- the Max-Tree-Size Parametric Structure Histogram HD^S(X, s, ∆d), which, having fixed
  the size s by computing its max value (i.e., P = s), summarizes X w.r.t. the tree-depth
  d (i.e., p = d);

     In particular, given an input XML document X and a structural property p, we build
the output structure histogram HS(X, p, ∆p) by setting the input parameters Dp and
∆p as follows (note that the input parameter p is directly set by the target
user/application): (i) Dp = [0, MaxValue], such that MaxValue is the maximum
value of the structural property p among all the fragments in X; (ii) ∆p = N · |Dp|,
such that 0 ≤ N ≤ 1 is an empirically set parameter and |Dp| is the cardinality of Dp.
Examples of such structure histograms for the subtree in Fig. 2 are shown in Table 2.
     Note that building the histograms for an arbitrary XML tree is necessarily exponen-
tial in the worst case, but our heuristics can significantly trim the number of inspected
fragments.
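A minimal sketch of building a one-dimensional structure histogram follows, again in Python. For brevity it enumerates one candidate fragment per node (the full subtree rooted there); as just noted, exhaustive fragment enumeration is exponential and is trimmed by the heuristics in practice.

    from collections import Counter

    def all_subtrees(t):
        """Yield the subtree rooted at each node of t (a simplifying assumption)."""
        yield t
        for c in t.children:
            yield from all_subtrees(c)

    def structure_histogram(x, prop, bucket_width):
        """prop is one of the structural property functions size, tw, td;
        buckets cover Dp in ∆p-wide steps and count fragments (f = COUNT)."""
        buckets = Counter()
        for f in all_subtrees(x):
            v = prop(f)
            lo = (v // bucket_width) * bucket_width
            buckets[(lo, lo + bucket_width)] += 1
        return dict(sorted(buckets.items()))

    # e.g. structure_histogram(doc, td, 1) yields an HD-like table as in Table 2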
     A user (or application) who wants to partition a document and knows how many frag-
ments she wants to obtain may want to know the constraint values that yield exactly
that number of fragments. In other words, she would like to properly tune the
constraint values. Moreover, the constraints specified by the user may be incompati-
ble with each other, or the final results may be biased by the inherent structure of the dataset.
In order to automate the task of deciding the constraint values, we have devised an
algorithm that predicts the range of frequencies by inspecting the structure histograms.
The algorithm pseudocode is shown in Fig. 3. For space reasons, we limit ourselves
to discussing the algorithm on the XMark sample of Fig. 2 and showing that it predicts
the range of frequencies quite sharply. Earlier, we pointed out that if we disregard
the width w in (a), we obtain a rather different fragmentation w.r.t. (b), where w has a
non-null value. We start by looking at the histogram HD reported in Table 2 and we
remark that for a depth value d = 1 we would obtain two fragments. The
histogram HW in turn tells us that there are two nodes with width w = 3, and these
nodes cannot be part of the same fragment if w is chosen to be 1: in such a case we
would generate 6 fragments out of those nodes. Thus, only by looking at HD and HW,
we learn that the number of fragments shall be in the range [2, 6]. If we further add
the third constraint s, the upper bound of the above range may increase or not, depending
on whether the fragments obtained so far satisfy the value of s. This leads to
choosing a value of s from histogram HS that pessimistically corresponds to the size of
the largest subtree located at depth d = 1 (e.g. the subtrees rooted at nodes people and
catgraph in Fig. 2), information that we learn from an HD^S histogram. In this par-
ticular example, we can choose for instance a size s equal to 100, thus obtaining the
fragmentation shown in Fig. 2 (b) quite straightforwardly.


              function predictInterval(HD: tree-depth structure histogram,
                  HS: tree-size structure histogram,
                  HW: tree-width structure histogram,
                  s0, d0, w0: size, depth, width constraints): return [fmin, fmax]
            1   Let d0 be the chosen depth in HD        // alternatively, w0 in HW
            2     such that HD(d0, ∆d0, f0)
            3   Pick the max width wmax in HW           // alternatively, dmax in HD
            4   Let fmin = f0, fmax = sum(f0, wmax − w0)
            5   Let smax be the max size at depth d0 in HD^S
            6   Let fsmax be the corresponding frequency in HS
            7     If s0 >= smax
            8        return [fmin, fmax]
            9     else {
           10        fmax += fsmax
           11        for each wi in the interval wmax − w0 in HW
           12           fmax += fwi · (wi − w0)
           13     }
           14   return [fmin, fmax]

     Fig. 3. Algorithm for predicting the range of frequencies using the structure histograms.
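For concreteness, below is a hedged Python transcription of predictInterval, with each histogram represented as a plain dict from a property value to its frequency, and HD^S as a dict from a depth to the max fragment size at that depth. Line 4 of the pseudocode is ambiguous; reading sum(f0, wmax − w0) as f0 + (wmax − w0) is our interpretation, so this sketch may not reproduce the paper's figures exactly.

    def predict_interval(hd, hs, hw, hds, s0, d0, w0):
        f0 = hd[d0]                               # lines 1-2: frequency at chosen depth d0
        wmax = max(hw)                            # line 3: max width in HW
        fmin = f0                                 # line 4
        fmax = f0 + max(wmax - w0, 0)             # line 4, under our reading of sum(...)
        smax = hds[d0]                            # line 5: max size at depth d0 in HD^S
        fsmax = hs.get(smax, 0)                   # line 6: its frequency in HS
        if s0 >= smax:                            # lines 7-8
            return fmin, fmax
        fmax += fsmax                             # line 10
        for wi in range(w0 + 1, wmax + 1):        # line 11: widths in (w0, wmax]
            fmax += hw.get(wi, 0) * (wi - w0)     # line 12
        return fmin, fmax                         # line 14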




3.1 Relaxing constraints with distributions

As we have seen, choosing correct values for the input constraints of the fragmentation
algorithm is a non-trivial task. An incorrect value for such constraints would lead to too
many fragments or too few of them, or even to an empty solution in some cases. Indeed,
there may exist random values of w, d and s that turn out to be incompatible with
each other. Fig. 4 shows the search space for the NASA dataset.
    In order to alleviate this problem, we let the constraints vary along classical dis-
tributions (such as Uniform, Gauss, Zipf). Thus, along with choosing fixed bounds for
s, w, d, we also assign ranges to them according to those distributions. This is further moti-
vated by the fact that an XML document contains ‘unbreakable’ pieces of text (such as
PC-data, entities, etc.) that need to be taken into account in the choice of the constraint
values (especially for the size constraint).
Fig. 4. The three-dimensional search space (size vs. width vs. depth) for the NASA dataset.


           Document (size in MB)           # elems.    maxDepth  maxWidth  provenance
           XMARK (113)                     3,332,129   11        25,500    [20]
           XMARK (30)                      501,705     11        7,649     [20]
           NASA (24)                       476,645     7         2,434     [19]
           FourReligiousWorks-Bom (1.5)    7,656       5         79        [9]

           Query   Description
           QDi     FOR $p IN XPathExpr RETURN $p

                       Table 3. XML documents and queries used.




By empirically comparing the output of SimpleX against the baseline case given by a
constant distribution, we will show below that non-constant distributions generally
behave better.
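A sketch of how such relaxation can be realized: each time the heuristic needs a bound, it draws one from the chosen distribution over an assumed range [c_min, c_max]. The parameterizations below (and the Pareto-based approximation of Zipf) are illustrative assumptions, not the exact settings used in our experiments.

    import random

    def draw_constraint(c_min, c_max, law="uniform"):
        """Draw a constraint value from [c_min, c_max] under a classical distribution."""
        if law == "uniform":
            return random.uniform(c_min, c_max)
        if law == "gauss":           # centered on the midpoint, clipped to the range
            mu, sigma = (c_min + c_max) / 2, (c_max - c_min) / 6
            return min(max(random.gauss(mu, sigma), c_min), c_max)
        if law == "zipf":            # skewed towards c_min via a Pareto draw
            offset = random.paretovariate(1.5) - 1.0
            return min(c_min + offset * (c_max - c_min), c_max)
        raise ValueError(f"unknown distribution: {law}")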


4 Experimental assessment

We have conducted an experimental study aimed at showing the effectiveness of our
structure-driven fragmentation methodology. The experiments are divided into three
classes. First, we build the structure histograms for representative XML datasets, and
show their use to decide the final values of the constraints. Second, we define and measure
the accuracy error of fragmentation under the SimpleX set of heuristics, comparing the
baseline case of fixed constraints against the cases in which the constraints vary according to
classical distributions (e.g. Uniform, Gauss, Zipf). Finally, we demonstrate the utility
of fragmentation in a distributed setting, such as a DHT-based P2P network. In such
a case, we measure the impact of fragmentation on network performance against the
expected ideal behavior.
    All the experiments have been performed on a machine with a 1.2 GHz processor
and 512 MB RAM, running SuSE Linux 9.1. We uniquely identify each fragment with
its absolute root-to-leaf path expression. Each fragment stores, in extra sub-nodes,
the path expressions of its subfragments and, separately, the path expression of its parent
fragment. We have presented this data model in [3]. Notice that any data model
other than ours (such as that of [5]) can be adopted here to represent the fragments.
Finally, the datasets and queries employed in the study are summarized in Table 3.
                  [s1 − s2] (KB)   freq        w      freq       d   freq
                  [50 − 100]       20          674    1          4   6100
                  [100 − 150]      8           730    1          5   3430
                  [150 − 200]      6           1188   1          6   1854
                  [200 − 250]      1           2434   1          7   1

  Table 4. Window frames of the histograms in Fig. 5, depicting some of the frequency values.




Which constraint values? Ask the structure histograms! Fig. 5 shows the structure
histograms for the NASA dataset. In order to improve readability, we report the
complete histograms in Fig. 5 and some of the frequency values separately in Table 4. Note
that such histograms have a size of the order of KB, thus being reasonably small. For
instance, considering the triple d0 = 5, w0 = 1200 and s0 = 230KB, and applying
the algorithm of Fig. 3, we obtained a prediction range equal to [1200, 2500]. Similar
results for the other datasets of Table 3 and other values of the constraints are omitted for
space reasons. Moreover, notice that the histograms also allow a user to quickly discard
incompatible constraint values, as they summarize only valid ones.


Fig. 5. Tree-Size (HS), Tree-Width (HW), Tree-Depth (HD) and, finally, Max-Tree-Size Para-
metric (HD^S) structure histograms for the NASA dataset.
                  Distribution   # fragments   Avg. # nodes   es     ed     ew
                  None           1879          253            0.97   0.6    0.97
                  Uniform        61            7813           0.02   0.57   0.89
                  Gauss          57            8362           0.09   0.42   0.88
                  Zipf           71            6713           0.1    0.57   0.9

               Table 5. Application of SimpleX to NASA with/without distribution.




SimpleX heuristics at work. We have put the proposed heuristics to work on the
datasets of Table 3. We have considered various values of the parameters s, w and d, and run
the heuristics both with fixed values of these parameters and with values varying according
to classical distributions (e.g. Uniform, Gauss, Zipf). We define the accuracy error ec
of fragmentation w.r.t. constraint c as follows. Let ca be the average value of the size
(tree-depth and tree-width, resp.) obtained with the SimpleX heuristics, c0 the fixed value
of the constraint input to the algorithm of Fig. 3, and [cmin, cmax] the interval of the
constraint value as obtained via the particular distribution. Then the accuracy error of
fragmentation is given by

    ec = |c0 − ca| / c0

for fixed constraints, and by

    ec = |(cmin + cmax)/2 − ca| / ((cmin + cmax)/2)

for non-fixed constraints.
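The two formulas translate directly into code; a small sketch follows, with illustrative inputs.

    def accuracy_error_fixed(c0, ca):
        """e_c for a fixed constraint c0 and observed average ca."""
        return abs(c0 - ca) / c0

    def accuracy_error_relaxed(c_min, c_max, ca):
        """e_c when the constraint is drawn from the range [c_min, c_max]."""
        mid = (c_min + c_max) / 2
        return abs(mid - ca) / mid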
    As an example, Table 5 shows the results obtained with the NASA dataset and constraint
values s = 230KB, w = 1200 and d = 5. The lower the accuracy error, the better the
heuristics match the fragmentation constraints.
    It can be noticed that the case in which the constraints are strict upper bounds leads
to considerably more fragments than the cases in which distributions are applied. On average,
fragmentation via distributions achieves lower accuracy errors than the case when distri-
butions are not used. Results with other datasets and other values of the constraints showed
the same trend.
Scattering XML fragments in DHT-based P2P networks. As we already discussed,
there exist several applications of our fragmentation strategy that justify its effective-
ness. In this section, we present some experiments performed on XP2P,
our DHT-based P2P simulator [3].
    For each experiment, we have scattered in the network a certain number of fragments
obtained with our structure-driven fragmentation. We then measured the network
scalability when varying both the number of peers and the number of queries.
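As a sketch of how fragments can be scattered, each fragment's absolute path expression can be hashed onto the identifier ring, in the style of Chord-like DHTs; XP2P's actual keying scheme may differ in its details.

    import hashlib

    def chord_key(fragment_path: str, m: int = 32) -> int:
        """Map a fragment's absolute path expression onto a 2^m identifier ring."""
        digest = hashlib.sha1(fragment_path.encode()).digest()
        return int.from_bytes(digest, "big") % (2 ** m)

    # e.g. chord_key("/site/open_auctions/open_auction[5]") places that fragment
    # on the peer responsible for the resulting identifier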
    Fig. 6 (a) shows the number of hops versus the number of peers when XMark30 has been
divided into 1000 fragments with the constraint values predicted by our analysis tool.
Here we have considered exactly as many queries of kind QDi (see Table 3) as the
number of fragments, each query being propagated to the successor peers as dictated
by the current peer's list of successors (i.e. at logarithmic distance). It can be noticed that
the case where XML structure-driven fragmentation is used closely tracks the original
Chord 5 logarithmic curve. Finally, Fig. 6 (b) shows the number of hops when varying the
number of fragments for the XMark30 dataset within a network of 500 peers. The fragmenta-
tion slightly increases the number of hops, if compared with the constant curve that
represents no fragmentation.

 5
     The original Chord simulator only stores on each peer an identifier of resources. In XP2P, we
     have extended it to store XML fragments.

Fig. 6. Nr. of hops w.r.t. nr. of peers with 1000 fragments (a); nr. of hops w.r.t. nr. of fragments
with 500 peers (b). In both cases, the number of queries of kind QDi equals the number of
fragments.


5 Related work

The advantages of fragmenting relational databases are well established [17], both for
horizontal fragmentation [7], which splits a given relation into disjoint sub-relations,
and for vertical fragmentation [13], which projects a given relation onto a subset of its at-
tributes. More recently, fragmentation techniques have been adapted to object-oriented [1],
semi-structured [14], and native XML [4] databases.
    [15] proposes an innovative approach for supporting the distribution of XML databases
via horizontal fragmentation. It employs a query-oriented cost model taking into ac-
count the most frequently asked queries, and uses heuristics to optimize the fragmenta-
tion based on the efficiency of such queries. Being query-driven, this approach does not
consider the structural properties of XML data, as we do in our proposal.
    XFrag [4] is a framework for processing XQuery-formatted queries on XML frag-
ments in order to reduce memory consumption. This work focuses on how to employ
fragments for query optimization, thus reinforcing the case for our proposal. Several query
optimization techniques have been presented for XML data, among which [16] and [11].
While the former relies on algebraic projections, the latter is based on tree-automata.
These techniques can be combined with ours to let the processor evaluate queries in
parallel on multiple fragments.
    [6] and [18] propose summary data structures for XML twigs and paths that allow de-
riving an estimate of query selectivity by using statistical methods, such as histograms
or wavelets. Notwithstanding the importance of the above data structures for query op-
timization, our histograms are instead aimed at predicting the number of fragments of
an XML document when applying SimpleX heuristics. The latter prediction could not
be inferred by looking at the above data structures.
    [22] proposes the position histograms, which allow the estimation of both simple
and complex pattern query answers. Furthermore, when XML schema information is
available, they employ the so-called coverage histograms that extend the former and
allow the target XML database to be better summarized. Differently from ours, those
histograms help estimate the sizes of child and descendant steps in path expressions.
Histograms are used for a rather different purpose in our framework, as stated above.
    Finally, XRel [21] is a path-based approach to store XML documents in an RDBMS
and retrieve them afterwards. While the path identification of fragments is similar to
ours, the focus of the paper is on building an extension of relational databases for XML
data.


6 Conclusions
We have presented a fragmentation strategy for XML documents that is driven by struc-
tural constraints. To the best of our knowledge, this is the first work addressing such a
problem. We further offer the user or the application a prediction of the ‘outcome’ of
the fragmentation, by means of the so-called structure histograms. Using classical dis-
tributions, we are able to vary the constraint values, thus improving the fragmentation
performance. We are currently developing new heuristics that use additional data struc-
tures, which could be used in combination with histograms to make the prediction more
precise. Moreover, we plan to embed our fragmentation tool and its analysis module in
an existing XQuery engine.


References
 1. Bellatreche, L., Karlapalem, K., Simonet, A.: Algorithms and Support for Horizontal Class
    Partitioning in Object-Oriented Databases. Distributed and Parallel Databases 8 (2000)
 2. Bohannon, P., Freire, J., Roy, P., Simeon, J.: From XML Schema to Relations: A Cost-based
    Approach to XML Storage. In: Proc. of ICDE. (2002)
 3. Bonifati, A., Matrangolo, U., Cuzzocrea, A., Jain, M.: XPath Lookup Queries in P2P Net-
    works. In: Proc. of WIDM. (2004)
 4. Bose, S., Fegaras, L.: XFrag: A Query Processing Framework for Fragmented XML Data.
    In: Proc. of WebDB. (2005)
 5. Bremer, J.M., Gertz, M.: On Distributing XML Repositories. In: Proc. of WebDB. (2003)
 6. Chen, Z., Jagadish, H.V., Korn, F., Koudas, N., Muthukrishnan, S., Ng, R.T., Srivastava, D.:
    Counting Twig Matches in a Tree. In: Proc. of ICDE. (2001)
 7. Ezeife, C., Barker, K.: A Comprehensive Approach to Horizontal Class Fragmentation in a
    Distributed Object based System. Distributed and Parallel Databases 3 (1995)
 8. Florescu, D., Hillery, C., Kossmann, D., et al.: The BEA/XQRL Streaming XQuery Proces-
    sor. In: Proc. of VLDB. (2003)
 9. Ibiblio.org web site. Available at www.ibiblio.org/xml/books/biblegold/examples/baseball/
    (2004)
10. Jagadish, H.V., Al-Khalifa, S., Chapman, A., Lakshmanan, L.V., Nierman, A., Paparizos,
    S., Patel, J., Srivastava, D., Wiwatwattana, N., Wu, Y., Yu, C.: Timber: A Native XML
    Database. VLDB Journal 11 (2002)
11. Koch, C.: Efficient Processing of Expressive Node-Selecting Queries on XML Data in Sec-
    ondary Storage: A Tree Automata-based Approach. In: Proc. of VLDB. (2003)
12. Krishnamurthy, R., Chakaravarthy, V.T., Naughton, J.F.: On the Difficulty of Finding Opti-
    mal Relational Decompositions for XML Workloads: A Complexity Theoretic Perspective.
    In: Proc. of ICDT. (2003)
13. Lin, X., Orlowska, M., Zhang, Y.: A Graph-based Cluster Approach for Vertical Partitioning
    in Database Systems. Data and Knowledge Engineering 11 (1993)
14. Ma, H., Schewe, K.D.: Fragmentation of XML Documents. In: Proc. of SBBD. (2003)
15. Ma, H., Schewe, K.D.: Heuristic Horizontal XML Fragmentation. In: Proc. of CAiSE.
    (2005)
16. Marian, A., Simeon, J.: Projecting XML Documents. In: Proc. of VLDB. (2003)
17. Ozsu, M., Valduriez, P.: Principles of Distributed Database Systems. Alan Apt (1999)
18. Polyzotis, N., Garofalakis, M.N.: Statistical Synopses for Graph-Structured XML Databases.
    In: Proc. of SIGMOD. (2002)
19. University of Washington's XML repository. Available at
    www.cs.washington.edu/research/xml/datasets (2004)
20. Xmark: An XML Benchmark Project. Available at http://monetdb.cwi.nl/xml/ (2002)
21. Yoshikawa, M., Amagasa, T., Shimura, T., Uemura, S.: XRel: A Path-based Approach to
    Storage and Retrieval of XML Documents Using Relational Databases. ACM Transactions
    on Internet Technology 1 (2001)
22. Wu, Y., Patel, J., Jagadish, H.: Using Histograms to Estimate Answer Sizes for XML Queries.
    Information Systems 28 (2003)
23. Zhang, N., Haas, P., Josifovski, V., Lohman, G., Zhang, C.: Statistical Learning Techniques
    for Costing XML Queries. In: Proc. of VLDB. (2005)