131


       Heuristic Horizontal XML Fragmentation

                             Hui Ma, Klaus-Dieter Schewe

                Massey University, Information Science Research Centre
                 Private Bag 11 222, Palmerston North, New Zealand
                           [h.ma|k.d.schewe]@massey.ac.nz


       Abstract. A challenging question is how XML can be used to support
       distributed databases. This leads to the problem of how to obtain a suit-
       able, cost-eﬃcient distribution design for XML documents. In this paper
       we sketch a heuristic approach to minimise query costs for the case of
       horizontal fragmentation. The approach is based on a cost model that
       takes the complex structure of queries on XML documents into account.
       We show that the minimisation of transportation costs is decisive, and
       that this can be achieved locally by either accepting or rejecting a hori-
       zontal fragmentation with a simple predicate that arises from one of the
       most frequent queries.


1    Introduction
Since a few years XML has attracted a lot of attention as a highly expressive
datamodel. In particular, XML enables an integrated view on data at least within
an organisation, and it is not unrealistic to set up an enterprise-wide XML-
database. As large organisations are located at diﬀerent places, a consequence of
this view is that we have to deal with distributed databases. Thus, the problem
arises to provide adequate distribution techniques for XML, and to show that
these techniques will enable increased performance of the distributed systems in
comparison to using centralised XML documents.
    In the context of the relational datamodel (RDM) distribution design mainly
addressed the problems of schema fragmentation and allocation of the fragments
to the machines in a computer network [11]. Fragmentation is commonly divided
into horizontal fragmentation, which splits a relation into disjoint unions, and
vertical fragmentation, which projects a relation onto a subset of its attributes.
Horizontal and vertical fragmentation have been discussed intensively in [3, 4,
7]. Several authors have approached the generalisation of fragmentation tech-
niques to object oriented datamodels. For instance, horizontal fragmentation is
discussed in [1, 5, 8], and vertical fragmentation in [2, 6, 10]. Due to the similarity
between object oriented and semi-structured data, some of these techniques have
also been adapted to XML [9].
    In this paper we continue our work on distribution design for XML-based
databases from [9], where we introduced fragmentation operations for XML.
Finding a reasonable distribution is usually accompanied by a query performance
cost model. The problem is to design fragments and to allocate them in such a

Proceedings of the CAiSE'05 Forum - O. Belo, J. Eder, J. Falcão e Cunha, O. Pastor (Eds.)
© Faculdade de Engenharia da Universidade do Porto, Portugal 2005 - ISBN 972-752-078-2
132 Hui Ma, Klaus-Dieter Schewe


way that the overall performance of the distributed database system is better
than the one of an equivalent centralised one.
    This problem was only brieﬂy sketched in [9] assuming that XML documents
are stored in a relational database and that cost optimisation techniques known
from the RDM would be applied. In this paper we present an alternative ap-
proach that does not assume a relational representation of the XML documents.
Instead of this we develop a query cost model for XML. Then we present a
heuristic approach to minimise query costs for the case of horizontal fragmen-
tation. We show that the minimisation of transportation costs is decisive, and
that this can be achieved locally by either accepting or rejecting a horizontal
fragmentation with a simple predicate that arises from one of the most frequent
queries.


2    A Query Cost Model

Let us now look brieﬂy at query processing costs. For this we assume that the
queries are expressed by some XML query algebra, so each query give rise to
a query tree. We may then make the assumption that these query trees are
optimised, which allows us to assume that the leaves of such trees are the input
document(s), predecessors of leaves are selection nodes, which themselves have
projection nodes as predecessors. In exceptional cases there are no such selection
or projection nodes.
    Crucial to the query costs are the sizes of documents that have to be built
during query execution, as these documents have to be stored at secondary
storage, retrieved from there again, and sent between the locations of a network.
Therefore, we ﬁrst approach an estimation of sizes adapting the work in [8].
    In order to do so, we look at the elements  and attributes a deﬁned in
the schema DTD, and estimate the size s() = s(e), respectively, depending
on the deﬁning regular expression e for , the attributes of , and the types of
these attributes. Then, if r is the root element, s(r) is the (estimated) size of a
document rooted at r.
    Let si be the average size of elements in the set Vi – recall that Vi = dom(bi )
for a base type bi . In particular, let us assume b0 = ID, i.e. s0 is the average size
of an identiﬁer. This can be used to determine ﬁrst the size s(a) of an attribute
a, i.e. the average space needed for it in storage. We obtain s(a) = si , if a is of
type bi , s(a) = s0 , if a is of an identiﬁer or reference type, and s(a) = r · s0 if
a is of mutiple reference type. In the last of these cases r is the average number
of elements named  that are referenced in .
    Now let att() denote the set of attributes deﬁned on element n. We can
proceed inductively to deﬁne the size s(e):

                                                              we obtain s(e) = bi .
 – If the deﬁning expression e for  is a base type bi , then 
 – If the deﬁning expression e for  is , we obtain s(e) = a∈att() s(a).
 – If the deﬁning
                  expression e for  is an element name n, we obtain s(e) =
   s(n) + a∈att() s(a).
                                                                                          133

                                                                     ∗
 – If the deﬁning          expression e for  is an iteration (e ) , we obtain s(e) =
          
   q · s(e ) + a∈att() s(a), where q is the average number of successor elements
   with deﬁning expression e .
 – If
   the     deﬁningexpression e for  is a sequence e1 , . . . , en , we obtain s(e) =
       n
       i=1  s(e   i) +      a∈att() s(a).
   If the deﬁning expression
 –                                     e for  is a choice e1 | · · · | en , we obtain s(e) =
       n
       i=1  p i · s(e i ) +   a∈att()  s(a), where pi is the probability that the successor
                                                           n
   of  is deﬁned by ei . In particular, we have i=1 pi = 1.

     Horizontal fragmenentation corresponds to replacing a document x by some
union x1 υp . . . υp xn . The path p used herein just leads to a successor of the root
element. The new documents xi (i = 1, . . . , n) are all obtained by selections,
and we may assume n = 2 by switching to horizontal fragmentation with normal
predicates, i.e. satisﬁable conjunctions of simple selection formulae.
     Another round of query optimisation might shift the selection σp2 ,ϕ and the
projection πp1 ,e inside the newly introduced union νp3 , but the “upper part” of
the query tree would not be aﬀected. Therefore, in order to optimise horizontal
fragmentation, it is decisive and suﬃcient to consider the projection-selection
subqueries above.
     The size of a selection node σp ,ϕ is p · s, where s is the size of the successor
node in the query tree and p is the probability that an element that matches
                                                                                       se
the path p will satisfy ϕ. The size of a projection node πp ,e is (1 − p) · s · ,
                                                                                       so
where s is the size of the successor node in the query tree, so is the average size
of an element reached by p, se is the average size of an element deﬁned by e,
and p is the probability that two elements x1 , x2 matching p coincide on their
                                      
projection to e, i.e. πee (x1 ) = πee (x2 ).
     Fragmentation results in a set of fragments {f1 , . . . , fn } of average sizes
s1 , . . . , sn . If the network has nodes N1 , . . . , Nk we have to allocate these frag-
ments to the nodes, which gives rise to a mapping λ : {1, . . . , n} → {1, . . . , k},
which we call a location assignment.
     However, the fragments only appear on the leaves of query trees. More gen-
erally, we must associate a node λ(v) with each node v in each relevant query
tree. λ(v) indicates the node in the network, at which the intermediate query
result corresponding to v will be stored.
     Given a location assignment λ we can compute the total costs of query
processing. Let the set of queries be Qm = {Q1 , . . . , Qm }. Query costs are
composed of two parts: storage costs and transportation costs: costsλ (Qj ) =
storλ (Qj ) + transλ (Qj ).
     The storage costs give a measure for retrieving the data back from secondary
storage, which is mainly determined by the size of the data. The transportation
costs provide a measure for transporting between two nodes of the network.
     The storage costs of a query Qj depend on the size of the involved documents
and on the assigned locations,    which decide the storage cost factors. It can be
expressed as storλ (Qj ) =          s(h) · dλ(h) , where h ranges over the nodes of the
                                h
134 Hui Ma, Klaus-Dieter Schewe


query tree for Qj , s(h) are the sizes of the involved documents, and di indicates
the storage cost factor for node Ni (i = 1, . . . , k).
   The transportation costs of query Qj depend on the sizes of the involved
documents and on the assigned locations, which decide the transport cost factor
between every pair of sites. It can be expressed by transλ (Qj ) =      cλ(h )λ(h) ·
                                                                     h h
s(h ). Again the sum ranges over the nodes h of the query tree for Qj , h runs
over the predecessors of h in the query tree, and cij is a transportation cost
factor for data transport from node Ni to node Nj (i, j ∈ {1, . . . , k}).
    For each query Qj we get a value for its frequency f reqj . The total costs of
all the queries in Qm are the sum of the costs of each query multiplied by its
                                           
                                           m
frequency. It can be expressed by costλ =      costλ (Qj ) · f reqj .
                                             j=1


3    Optimisation of Horizontal Fragmentation
We now present a heuristic approach for the optimisation of horizontal fragmen-
tation. The major objective of the approach is to base the fragmentation deci-
sion on the eﬃciency of the most frequent queries. We have already seen that we
may concentrate on projection-selection subqueries (PSSs) and that horizontal
fragmentation with follow-on algebraic query optimisation will only aﬀect these
subqueries. As each such subquery deﬁnes a set of simple selection formulae.
Obviously, we should only be interested in horizontal fragmentation based on
normal predicates that are deﬁned by this set of simple selection predicates.
    Let us ﬁrst look at the eﬀects to query costs of a fragmentation with a single
simple predicate. Basically we have to adapt query trees, take into account which
eﬀect algebraic optimisation will have on the PSSs, and ﬁnd a modiﬁed location
assignment λ that would reduce the query processing costs, provided such a λ
exists. We have to distinguish three cases for this scenario.

Case I. Assume that the selection formula ϕ in the PSS has the form ϕ ≡ ψ∧ϕ0
and that we use ϕ0 to fragment the document x. So, the leave x would be replaced
by the fragment x1 of x that is determined by ϕ0 , and its predecessor in the query
tree would become σp,ψ . There are no other changes to the query tree.
    Then the size associated with the projection node does not change. Therefore,
if the location assignment λ was optimal before the fragmentation, it does not
make sense to change it for any node other than the three nodes corresponding to
the PSS. Unless the selection node is deleted from the query tree, its size remains
unchanged. This implies that there is at most a small change to the storage costs.
In particular, as transportation costs are larger anyway, we neglect this eﬀect
on the storage costs. The processing of the subquery itself does not require any
transport of data. Therefore, we can assume that λ assigns the same network
node Ni to these nodes in the query tree.
    As a consequence, the fragmentation can only require the change of the as-
signed network node from Ni to some Nj . This corresponds to the allocation of
the fragment x1 .
                                                                                     135


Case II. Assume now that the fragmentation with the selection formula ϕ0
leads to two fragments x1 and x2 . Then the PSS would be replaced by a subquery
of the form πp1 ,e (pσp2 ,ϕ (p x1 υp3 x2 /)/). Algebraic query optimisation will
move the selection inside the union.
    In this case the size associated with the projection node does not change. As
we can assume that the union node is assigned to the same network node as the
projection node, we obtain again marginal changes to the storage costs. Thus,
we can concentrate on the transportation costs. In addition, the processing of
the subquery would require sending the smaller selection result to the location
of the larger one.
    As a consequence, the fragmentation can only require that the network node
assigned to the two new fragments are diﬀerent and maybe also diﬀerent from
the network node that was assigned to x. Then the fragment with the larger
size after selection will determine the location assignment for the union and the
projection node.


Case III. The assumption for the third case is the same as for case II. However,
we now assume that the sizes of σp2 ,ϕ (xi ) are almost equal for i = 1, 2, and that
the projection has only a small impact on the size of the result of the PSS. In
this case it is advantageous to take the predecessor node n of πp1 ,e in the query
tree, and to assign λ(n) to the projection and the union node. This implies that
we have to transport both selection results to the same network node λ(n). As
in the two other cases, we may neglect the changes to the storage costs.

    Now consider all the PSSs. Let {q1 , . . . , qN } be the set of these local queries.
Each of these local queries qi has a “target node” Ti , i.e. a network node to which
the result of the query will be transported. Let si be the size of the result of qi
and let f reqi be the frequency of qi (i = 1, . . . , N ). Let x be some document
fragment. Then for each network node Nj the costs of allocating x to the node
                    N 
Nh arecosth (x) = j=1 i,Ti =Nj si · f reqi · ch,i .
    We apply the cost minimisation heuristics, which allocates a fragment x
to a network node Nh such that costh (x) will be minimal. According to our
discussion for the three basic cases of how fragmentation aﬀects the query costs,
the allocation of fragments to network nodes following the cost minimisation
heuristics, already determines the location assignment, provided that an optimal
location assignment for the queries was given prior to the fragmentation.
    Now we observe that one of the two fragments resulting from horizontal frag-
mentation with a formula in Φ will reside at the same location as the document
before fragmentation, whereas the other fragment will be moved to a new loca-
tion. In the ﬁrst case the transportation costs remain the same. In the second
case the transportation costs will be reduced.
    This suggests to take a total order on the elements of Φ according to their
frequency, i.e. let Φ = {ϕ1 , . . . , ϕ } with f req(ϕi ) ≥ f req(ϕj ) iﬀ i ≤ j. Then
determine Ψ = {ϕ1 , . . . , ϕ } such that fragmentation with elements in Ψ leads
136 Hui Ma, Klaus-Dieter Schewe


to a re-allocation of a fragment, whereas the fragmentation with elements in
Φ − Ψ does not add changes. Then  can be determined by binary search.

4    Conclusion
In this paper we continued our work on distribution design for XML-based
databases from [9]. We introduced a detailed query performance cost model,
and then addressed the problem to design fragments and to allocate them in
such a way that the overall performance of the distributed system is better than
the one of an equivalent centralised one. Our approach does not assume a re-
lational representation of the XML documents. Instead of this we presented a
heuristic approach to minimise query costs for the case of horizontal fragmenta-
tion. We showed that the minimisation of transportation costs is decisive, and
that this can be achieved locally by either accepting or rejecting a horizontal
fragmentation with a simple predicate that arises from one of the most frequent
queries.

References
 1. Bellatreche, L., Karlapalem, K., and Simonet, A. Algorithms and sup-
    port for horizontal class partitioning in object-oriented databases. Distributed and
    Parallel Databases 8, 2 (2000), 155–179.
 2. Chinchwadkar, G. S., and Goh, A. An overview of vertical partitioning in
    object oriented databases. The Computer Journal 42, 1 (1999).
 3. Chu, P.-C. A transaction oriented approach to attribute partitioning. Information
    Systems 17, 4 (1992), 329–342.
 4. Chu, P.-C., and Ieong, I. T. A transaction-based approach to vertical parti-
    tioning for relational databases. IEEE Transactions on Software Engineering 19,
    8 (1993), 804–812.
 5. Ezeife, C. I., and Barker, K. A comprehensive approach to horizontal class frag-
    mentation in a distributed object based system. Distributed and Parallel Databases
    3, 3 (1995), 247–272.
 6. Ezeife, C. I., and Barker, K. Distributed object based design: Vertical frag-
    mentation of classes. Distributed and Parallel Databases 6, 4 (1998), 317–350.
 7. Lin, X., Orlowska, M., and Zhang, Y. A graph-based cluster approach for
    vertical partitioning in databases systems. Data & Knowledge Engineeering 11, 2
    (1993), 151–170.
 8. Ma, H. Distribution design in object oriented databases. Master’s thesis, Massey
    University, 2003.
 9. Ma, H., and Schewe, K.-D. Fragmentation of XML documents. In Proceedings
    XVIII Simpósio Brasileiro de Bancos de Dados (SBBD 2003) (Manaus, Brazil,
    2003), pp. 200–214.
10. Malinowski, E., and Chakravarthy, S. Fragmentation techniques for dis-
    tributing object-oriented databases. In Conceptual Modeling - ER ’97 (1997),
    D. W. Embley and R. C. Goldstein, Eds., vol. 1331 of Lecture Notes in Computer
    Science, Springer, pp. 347–360.
11. Özsu, M. T., and Valduriez, P. Principles of Distributed Database Systems.
    Alan Apt, New Jersey, 1999.