<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hui Ma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klaus-Dieter Schewe</string-name>
          <email>k.d.schewe@massey.ac.nz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Massey University, Information Science Research Centre Private Bag 11 222</institution>
          ,
          <addr-line>Palmerston North</addr-line>
          ,
          <country country="NZ">New Zealand</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A challenging question is how XML can be used to support distributed databases. This leads to the problem of how to obtain a suitable, cost-efficient distribution design for XML documents. In this paper we sketch a heuristic approach to minimise query costs for the case of horizontal fragmentation. The approach is based on a cost model that takes the complex structure of queries on XML documents into account. We show that the minimisation of transportation costs is decisive, and that this can be achieved locally by either accepting or rejecting a horizontal fragmentation with a simple predicate that arises from one of the most frequent queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Since a few years XML has attracted a lot of attention as a highly expressive
datamodel. In particular, XML enables an integrated view on data at least within
an organisation, and it is not unrealistic to set up an enterprise-wide
XMLdatabase. As large organisations are located at different places, a consequence of
this view is that we have to deal with distributed databases. Thus, the problem
arises to provide adequate distribution techniques for XML, and to show that
these techniques will enable increased performance of the distributed systems in
comparison to using centralised XML documents.</p>
      <p>
        In the context of the relational datamodel (RDM) distribution design mainly
addressed the problems of schema fragmentation and allocation of the fragments
to the machines in a computer network [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Fragmentation is commonly divided
into horizontal fragmentation, which splits a relation into disjoint unions, and
vertical fragmentation, which projects a relation onto a subset of its attributes.
Horizontal and vertical fragmentation have been discussed intensively in [
        <xref ref-type="bibr" rid="ref3 ref4 ref7">3, 4,
7</xref>
        ]. Several authors have approached the generalisation of fragmentation
techniques to object oriented datamodels. For instance, horizontal fragmentation is
discussed in [
        <xref ref-type="bibr" rid="ref1 ref5 ref8">1, 5, 8</xref>
        ], and vertical fragmentation in [
        <xref ref-type="bibr" rid="ref10 ref2 ref6">2, 6, 10</xref>
        ]. Due to the similarity
between object oriented and semi-structured data, some of these techniques have
also been adapted to XML [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        In this paper we continue our work on distribution design for XML-based
databases from [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where we introduced fragmentation operations for XML.
Finding a reasonable distribution is usually accompanied by a query performance
cost model. The problem is to design fragments and to allocate them in such a
way that the overall performance of the distributed database system is better
than the one of an equivalent centralised one.
      </p>
      <p>
        This problem was only briefly sketched in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] assuming that XML documents
are stored in a relational database and that cost optimisation techniques known
from the RDM would be applied. In this paper we present an alternative
approach that does not assume a relational representation of the XML documents.
Instead of this we develop a query cost model for XML. Then we present a
heuristic approach to minimise query costs for the case of horizontal
fragmentation. We show that the minimisation of transportation costs is decisive, and
that this can be achieved locally by either accepting or rejecting a horizontal
fragmentation with a simple predicate that arises from one of the most frequent
queries.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>A Query Cost Model</title>
      <p>Let us now look briefly at query processing costs. For this we assume that the
queries are expressed by some XML query algebra, so each query give rise to
a query tree. We may then make the assumption that these query trees are
optimised, which allows us to assume that the leaves of such trees are the input
document(s), predecessors of leaves are selection nodes, which themselves have
projection nodes as predecessors. In exceptional cases there are no such selection
or projection nodes.</p>
      <p>
        Crucial to the query costs are the sizes of documents that have to be built
during query execution, as these documents have to be stored at secondary
storage, retrieved from there again, and sent between the locations of a network.
Therefore, we first approach an estimation of sizes adapting the work in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>In order to do so, we look at the elements and attributes a defined in
the schema DTD, and estimate the size s( ) = s(e), respectively, depending
on the defining regular expression e for , the attributes of , and the types of
these attributes. Then, if r is the root element, s(r) is the (estimated) size of a
document rooted at r.</p>
      <p>Let si be the average size of elements in the set Vi – recall that Vi = dom(bi)
for a base type bi. In particular, let us assume b0 = ID, i.e. s0 is the average size
of an identifier. This can be used to determine first the size s(a) of an attribute
a, i.e. the average space needed for it in storage. We obtain s(a) = si, if a is of
type bi, s(a) = s0, if a is of an identifier or reference type, and s(a) = r · s0 if
a is of mutiple reference type. In the last of these cases r is the average number
of elements named that are referenced in .</p>
      <p>Now let att( ) denote the set of attributes defined on element n. We can
proceed inductively to define the size s(e):</p>
      <p>a∈att( ) s(a).
– If the defining expression e for is a base type bi, then we obtain s(e) = bi.
– If the defining expression e for is , we obtain s(e) = a∈att( ) s(a).
– If the defining expression e for is an element name n, we obtain s(e) =
s(n) +
– If the defining expression e for is an iteration (e )∗, we obtain s(e) =
q · s(e ) + a∈att( ) s(a), where q is the average number of successor elements
with defining expression e .
– If the defining expression e for is a sequence e1, . . . , en, we obtain s(e) =
n
i=1 s(ei) + a∈att( ) s(a).
– If the defining expression e for is a choice e1 | · · · | en, we obtain s(e) =
in=1 pi · s(ei) + a∈att( ) s(a), where pi is the probability that the successor
of is defined by ei. In particular, we have in=1 pi = 1.</p>
      <p>Horizontal fragmenentation corresponds to replacing a document x by some
union x1υp . . . υpxn. The path p used herein just leads to a successor of the root
element. The new documents xi (i = 1, . . . , n) are all obtained by selections,
and we may assume n = 2 by switching to horizontal fragmentation with normal
predicates, i.e. satisfiable conjunctions of simple selection formulae.</p>
      <p>Another round of query optimisation might shift the selection σp2,ϕ and the
projection πp1,e inside the newly introduced union νp3 , but the “upper part” of
the query tree would not be affected. Therefore, in order to optimise horizontal
fragmentation, it is decisive and sufficient to consider the projection-selection
subqueries above.</p>
      <p>The size of a selection node σp ,ϕ is p · s, where s is the size of the successor
node in the query tree and p is the probability that an element that matches
the path p will satisfy ϕ. The size of a projection node πp ,e is (1 − p) · s · ssoe,
where s is the size of the successor node in the query tree, so is the average size
of an element reached by p, se is the average size of an element defined by e,
and p is the probability that two elements x1, x2 matching p coincide on their
projection to e, i.e. πee (x1) = πee (x2).</p>
      <p>Fragmentation results in a set of fragments {f1, . . . , fn} of average sizes
s1, . . . , sn. If the network has nodes N1, . . . , Nk we have to allocate these
fragments to the nodes, which gives rise to a mapping λ : {1, . . . , n} → {1, . . . , k},
which we call a location assignment .</p>
      <p>However, the fragments only appear on the leaves of query trees. More
generally, we must associate a node λ(v) with each node v in each relevant query
tree. λ(v) indicates the node in the network, at which the intermediate query
result corresponding to v will be stored.</p>
      <p>Given a location assignment λ we can compute the total costs of query
processing. Let the set of queries be Qm = {Q1, . . . , Qm}. Query costs are
composed of two parts: storage costs and transportation costs: costsλ(Qj ) =
storλ(Qj ) + transλ(Qj ).</p>
      <p>The storage costs give a measure for retrieving the data back from secondary
storage, which is mainly determined by the size of the data. The transportation
costs provide a measure for transporting between two nodes of the network.</p>
      <p>The storage costs of a query Qj depend on the size of the involved documents
and on the assigned locations, which decide the storage cost factors. It can be
expressed as storλ(Qj ) = s(h) · dλ(h), where h ranges over the nodes of the
frequency. It can be expressed by costλ =
query tree for Qj , s(h) are the sizes of the involved documents, and di indicates
the storage cost factor for node Ni (i = 1, . . . , k).</p>
      <p>The transportation costs of query Qj depend on the sizes of the involved
documents and on the assigned locations, which decide the transport cost factor
between every pair of sites. It can be expressed by transλ(Qj ) =
h h
s(h ). Again the sum ranges over the nodes h of the query tree for Qj , h runs
over the predecessors of h in the query tree, and cij is a transportation cost
factor for data transport from node Ni to node Nj (i, j ∈ {1, . . . , k}).</p>
      <p>For each query Qj we get a value for its frequency f reqj . The total costs of
all the queries in Qm are the sum of the costs of each query multiplied by its
m</p>
      <p>cλ(h )λ(h) ·
j=1
costλ(Qj ) · f reqj .
3</p>
    </sec>
    <sec id="sec-3">
      <title>Optimisation of Horizontal Fragmentation</title>
      <p>We now present a heuristic approach for the optimisation of horizontal
fragmentation. The major objective of the approach is to base the fragmentation
decision on the efficiency of the most frequent queries. We have already seen that we
may concentrate on projection-selection subqueries (PSSs) and that horizontal
fragmentation with follow-on algebraic query optimisation will only affect these
subqueries. As each such subquery defines a set of simple selection formulae.
Obviously, we should only be interested in horizontal fragmentation based on
normal predicates that are defined by this set of simple selection predicates.</p>
      <p>Let us first look at the effects to query costs of a fragmentation with a single
simple predicate. Basically we have to adapt query trees, take into account which
effect algebraic optimisation will have on the PSSs, and find a modified location
assignment λ that would reduce the query processing costs, provided such a λ
exists. We have to distinguish three cases for this scenario.</p>
      <p>Case I. Assume that the selection formula ϕ in the PSS has the form ϕ ≡ ψ∧ϕ0
and that we use ϕ0 to fragment the document x. So, the leave x would be replaced
by the fragment x1 of x that is determined by ϕ0, and its predecessor in the query
tree would become σp,ψ. There are no other changes to the query tree.</p>
      <p>Then the size associated with the projection node does not change. Therefore,
if the location assignment λ was optimal before the fragmentation, it does not
make sense to change it for any node other than the three nodes corresponding to
the PSS. Unless the selection node is deleted from the query tree, its size remains
unchanged. This implies that there is at most a small change to the storage costs.
In particular, as transportation costs are larger anyway, we neglect this effect
on the storage costs. The processing of the subquery itself does not require any
transport of data. Therefore, we can assume that λ assigns the same network
node Ni to these nodes in the query tree.</p>
      <p>As a consequence, the fragmentation can only require the change of the
assigned network node from Ni to some Nj . This corresponds to the allocation of
the fragment x1.</p>
      <p>Case II. Assume now that the fragmentation with the selection formula ϕ0
leads to two fragments x1 and x2. Then the PSS would be replaced by a subquery
of the form πp1,e( p σp2,ϕ( p x1υp3 x2 / ) / ). Algebraic query optimisation will
move the selection inside the union.</p>
      <p>In this case the size associated with the projection node does not change. As
we can assume that the union node is assigned to the same network node as the
projection node, we obtain again marginal changes to the storage costs. Thus,
we can concentrate on the transportation costs. In addition, the processing of
the subquery would require sending the smaller selection result to the location
of the larger one.</p>
      <p>As a consequence, the fragmentation can only require that the network node
assigned to the two new fragments are different and maybe also different from
the network node that was assigned to x. Then the fragment with the larger
size after selection will determine the location assignment for the union and the
projection node.</p>
      <p>Case III. The assumption for the third case is the same as for case II. However,
we now assume that the sizes of σp2,ϕ(xi) are almost equal for i = 1, 2, and that
the projection has only a small impact on the size of the result of the PSS. In
this case it is advantageous to take the predecessor node n of πp1,e in the query
tree, and to assign λ(n) to the projection and the union node. This implies that
we have to transport both selection results to the same network node λ(n). As
in the two other cases, we may neglect the changes to the storage costs.</p>
      <p>Now consider all the PSSs. Let {q1, . . . , qN } be the set of these local queries.
Each of these local queries qi has a “target node” Ti, i.e. a network node to which
the result of the query will be transported. Let si be the size of the result of qi
and let f reqi be the frequency of qi (i = 1, . . . , N ). Let x be some document
fragment. Then for each network node Nj the costs of allocating x to the node
Nh arecosth(x) = jN=1 i,Ti=Nj si · f reqi · ch,i.</p>
      <p>We apply the cost minimisation heuristics, which allocates a fragment x
to a network node Nh such that costh(x) will be minimal. According to our
discussion for the three basic cases of how fragmentation affects the query costs,
the allocation of fragments to network nodes following the cost minimisation
heuristics, already determines the location assignment, provided that an optimal
location assignment for the queries was given prior to the fragmentation.</p>
      <p>Now we observe that one of the two fragments resulting from horizontal
fragmentation with a formula in Φ will reside at the same location as the document
before fragmentation, whereas the other fragment will be moved to a new
location. In the first case the transportation costs remain the same. In the second
case the transportation costs will be reduced.</p>
      <p>This suggests to take a total order on the elements of Φ according to their
frequency, i.e. let Φ = {ϕ1, . . . , ϕ } with f req(ϕi) ≥ f req(ϕj) iff i ≤ j. Then
determine Ψ = {ϕ1, . . . , ϕ } such that fragmentation with elements in Ψ leads
to a re-allocation of a fragment, whereas the fragmentation with elements in
Φ − Ψ does not add changes. Then can be determined by binary search.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
        In this paper we continued our work on distribution design for XML-based
databases from [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We introduced a detailed query performance cost model,
and then addressed the problem to design fragments and to allocate them in
such a way that the overall performance of the distributed system is better than
the one of an equivalent centralised one. Our approach does not assume a
relational representation of the XML documents. Instead of this we presented a
heuristic approach to minimise query costs for the case of horizontal
fragmentation. We showed that the minimisation of transportation costs is decisive, and
that this can be achieved locally by either accepting or rejecting a horizontal
fragmentation with a simple predicate that arises from one of the most frequent
queries.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bellatreche</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karlapalem</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Simonet</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Algorithms and support for horizontal class partitioning in object-oriented databases</article-title>
          .
          <source>Distributed and Parallel Databases</source>
          <volume>8</volume>
          ,
          <issue>2</issue>
          (
          <year>2000</year>
          ),
          <fpage>155</fpage>
          -
          <lpage>179</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chinchwadkar</surname>
            ,
            <given-names>G. S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Goh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>An overview of vertical partitioning in object oriented databases</article-title>
          .
          <source>The Computer Journal</source>
          <volume>42</volume>
          ,
          <issue>1</issue>
          (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chu</surname>
          </string-name>
          , P.-C.
          <article-title>A transaction oriented approach to attribute partitioning</article-title>
          .
          <source>Information Systems</source>
          <volume>17</volume>
          ,
          <issue>4</issue>
          (
          <year>1992</year>
          ),
          <fpage>329</fpage>
          -
          <lpage>342</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chu</surname>
          </string-name>
          , P.-C., and
          <string-name>
            <surname>Ieong</surname>
            ,
            <given-names>I. T.</given-names>
          </string-name>
          <article-title>A transaction-based approach to vertical partitioning for relational databases</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          <volume>19</volume>
          ,
          <issue>8</issue>
          (
          <year>1993</year>
          ),
          <fpage>804</fpage>
          -
          <lpage>812</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ezeife</surname>
            ,
            <given-names>C. I.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Barker</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>A comprehensive approach to horizontal class fragmentation in a distributed object based system</article-title>
          .
          <source>Distributed and Parallel Databases</source>
          <volume>3</volume>
          ,
          <issue>3</issue>
          (
          <year>1995</year>
          ),
          <fpage>247</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ezeife</surname>
            ,
            <given-names>C. I.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Barker</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Distributed object based design: Vertical fragmentation of classes</article-title>
          .
          <source>Distributed and Parallel Databases</source>
          <volume>6</volume>
          ,
          <issue>4</issue>
          (
          <year>1998</year>
          ),
          <fpage>317</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Orlowska</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and Zhang,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <article-title>A graph-based cluster approach for vertical partitioning in databases systems</article-title>
          .
          <source>Data &amp; Knowledge Engineeering</source>
          <volume>11</volume>
          ,
          <issue>2</issue>
          (
          <year>1993</year>
          ),
          <fpage>151</fpage>
          -
          <lpage>170</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Ma, H.
          <article-title>Distribution design in object oriented databases</article-title>
          .
          <source>Master's thesis</source>
          , Massey University,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Ma, H., and
          <string-name>
            <surname>Schewe</surname>
          </string-name>
          , K.-D.
          <article-title>Fragmentation of XML documents</article-title>
          .
          <source>In Proceedings XVIII Simpo´sio Brasileiro</source>
          de Bancos de Dados (SBBD
          <year>2003</year>
          )
          <article-title>(Manaus</article-title>
          , Brazil,
          <year>2003</year>
          ), pp.
          <fpage>200</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Malinowski</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Chakravarthy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Fragmentation techniques for distributing object-oriented databases</article-title>
          .
          <source>In Conceptual Modeling - ER '97</source>
          (
          <year>1997</year>
          ),
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Embley</surname>
          </string-name>
          and R. C. Goldstein, Eds., vol.
          <volume>1331</volume>
          of Lecture Notes in Computer Science, Springer, pp.
          <fpage>347</fpage>
          -
          <lpage>360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. O¨ zsu, M. T., and
          <string-name>
            <surname>Valduriez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Principles of Distributed Database Systems</article-title>
          . Alan Apt, New Jersey,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>