=Paper= {{Paper |id=Vol-256/paper-12 |storemode=property |title=XML Query Algera for Cost-based Optimization |pdfUrl=https://ceur-ws.org/Vol-256/submission_2.pdf |volume=Vol-256 |dblpUrl=https://dblp.org/rec/conf/syrcodis/LukichevB07 }} ==XML Query Algera for Cost-based Optimization== https://ceur-ws.org/Vol-256/submission_2.pdf
             XML Query Algera for Cost-based Optimization ∗

                              c Maxim Lukichev
                              °                                  Dmirty Barashev
                                           University of Saint-Petersburg
                                                  lmax@inbox.ru



                       Abstract                              formations were considered in [9], including seman-
                                                             tical optimization, pushing predicates, joins reorder-
    Several requirements for algebra suitable
                                                             ing, eliminating redundant document-order sorts.
    for ecient cost-based optimization are pre-
                                                             Rules of transforming XQuery queries to forms more
    sented. It is shown that known XML alge-
                                                             suitable for translation to SQL were described in [7].
    bras do not fully satisfy this requirements.
                                                                XML Query Algebra [3] comes from the activity of
    A new algebra to satisfy better the require-
                                                             the W3C XML Query Working Group. That algebra
    ments is introduced.
                                                             is mainly intended for the formal denition of query
                                                             languages semantics. The algebra itself is an ab-
1   Introduction                                             stract version of XQuery, where high-level operators
Continously growing usage of XML data demands                (e.g., n-ary for and sortby clauses) are mapped into
for development of powerful query optimization sys-          low-level algebraic operators. Rewriting rules are
tems. Optimization approaches for XML databases              provided resembling functional programming lan-
depend on database type. Relational based XML                guages rules and nested relational rules.
DBMS decompose documents into conventional or                   Xtasy [12, 5] algebra also as YAT [13], XAT
special binary relations [1]. XQuery clauses in such         [14] and SAL [4] algebras has operators dened on
systems are translated to queries in SQL-like lan-           relational-like structures. Operations similar to re-
guage and query processors employ traditional re-            lational as selection, projection, join, cross product,
lational query optimization techniques. So called            order by etc. have appeared in these algebras. But
native XML DBMSs use their own data storage                  also operations specic to XPath and XQuery have
formats. Query optimizers in the native XML                  appeared. So in Xtasy they are presented by path
databases are not as ecient as their relational coun-       and return operations, those similar to bind and tree
terparts. Usually they are limited with some logical         in YAT algebra. In XAT and SAL algebras for vari-
optimization methods and follow a naive physical             able binding map operation is used.
plan. However, an ecient optimizer should con-                 TAX [8] is a query algebra developed in the con-
sider many equivalent physical plans and choose the          text of the TIMBER project. TAX data model is
best one using some cost function. Execution of              based on unordered collections of ordered data trees,
the best physical plan may be signicantly, some-            and each TAX operator takes as input collections
times several orders of magnitude faster comparing           of data trees, and produces as output collections
to naive one. So the problem of generating the op-           of data trees. Unlike YAT and SAL, TAX directly
timization space is very important. Optimization             manipulates trees without the need for an explicit
space is a set of equivalent expressions in some query       intermediate structure. Data extraction and bind-
algebra and thus the nal plan quality is dened by          ing are performed by using pattern trees: pattern
some properties of the chosen algebra. Unlike re-            trees, which resemble Xtasy input lters, describe
lational systems, there is no standard algebra for           the structure of the desired data, and impose condi-
XML queries, but there are many dierent algebras            tions on them.
[12, 14, 4, 13, 8] each having its own pros and cons.           MonetDB system [1] worth special attention.
   In this work we gather requirements for query al-         Data in that system are stored as binary relations.
gebra suitable for ecient cost-based optimization           Queries are translated to special intermediate SQL-
and propose an algebra based on elements of XAT              like representation and then are executed as SQL.
[14] and Xtasy [12] algebras and which meets the             Benchmarks show a signicant superiority of Mon-
requirements.                                                etDB over native XML DBMS, mostly when working
                                                             with big documents. The reason is that native XML
2   Related work                                             databases are lacking powerful query optimizers and
                                                             ecient indexing structures.
Many researches in XQuery optimization are focused
on logical optimizations. A number of logical trans-
                                                             3   Analysis of Algebra requirements
    ∗ This work was partially supported by RFBR (grant 07-
07-00268a).                                                  Time required to evaluate some query could dier
Proceedings of the Spring Young Researcher's Colloquium      very much depending on the chosen evaluation plan.
on Database and Information Systems, Moscow, Russia, 2007    The main optimizer role is to nd the most eec-
tive one. The space of available physical plans is        operations have enough good algebraic properties.
mainly dened by properties of algebra operations.        So the requirement (2) is satised by these algebras.
And the wider is that space the more eective eval-       But requirement (3) does not completely satised
uation plan could be found. Such properties of alge-      by them. The reason is that their operations used
braic operations as commutativity, associativity and      for xpath representation do not have complete num-
idempotance are desirable properties because they         ber of desirable algebraic properties. Therefore it is
extend search space, increasing optimizer's freedom       impossible to arbitrary change evaluation order of
for choosing optimal plan.                                operations for navigational expressions.
   One of the most important and widely used con-
structions of XQuery are nested for-clauses. The          4     XAnswer Algebra
order of their evaluation in evaluation plan is signif-
icant for performance as order of join operations in      In this section we propose new algebra called XAn-
relational algebra. Therefore the representation of       swer. This algebra is based on some elements of
nested for-clauses with operations with good alge-        XTasy and XAT algebras and satisfy requirements
braic properties is an important requirement.             described in previous section. First, it would be de-
   It is impossible to choose an order of operations      scribed data structures, algebra operations dened
evaluation without estimation of a size of an inter-      on. After that main algebra operations, that are
mediate result, for example result of joining two se-     similar to relational, would be introduced. And at
quences from the given three. Therefore data struc-       last specic for XQuery algebraic operations would
tures in query algebra must be good enough to rep-        be dened and compared with analogues of Xtasy
resent measurable intermediate results.                   algebra.
   XPath expressions also play an important role in
XQuery. There may be dierent plans of evaluating         4.1   XAnswer data structure
the same XPath expression and some plans may be           XAnswer algebraic operations are dened on
much more expensive than others [10]. So if XPath         relational-like data structures like in [12, 14]. Be-
would be presented with operations with good al-          low, this structure will be called Envelop. It is rep-
gebraic properties probably the more eective plan        resented with a table and consists of tuples which
could be found. It is the last important requirement.     contains XML-node values. Order of table attributes
   Concluding, XML query algebra suitable for cost-       or tuples is not signicant.
based optimization is expected to satisfy the follow-        In case of XQuery each attribute of Envelop is a
ing requirements:                                         name of variable that was bound in corresponding
                                                          subexpression. In case of XPath there are no any
 1. operations are dened on data structures suit-
                                                          variables, therefore some unique identier is used as
    able for representing measurable intermediate
                                                          corresponindg attribute instead of variable name.
    results
                                                             Lets      consider     a      path      expression:
 2. nested for-clauses are mapped to operations           book/author/address/country.           Lets Assume
    with good algebraic properties such as commu-         that identier $VA denotes values of nodes with
    tativity and associativity                            tag name book, identier $VB  author, $VC 
                                                          address, $VD  country. The Envelop, obtained
 3. xpath expressions are also represented by oper-       after evaluation of considered path expression could
    ations with good algebraic properties.                be represented with a table shown in Figure 1.
   W3C algebra operations dened on sequences.
This denition leads to several problems with an in-
termediate result representation, because sequences
do not provide any information about correspond-
ing bound variables and e.t.c. Therefore there are
problems with intermediate result storing and es-
timation that leads to problems with reordering of
some expensive operations. So this algebra does not
satisfy requirement (1). But this requirement is sat-                 Figure 1: Envelop structure
ised by YAT, XAT, Xtasy algebras. These algebras
have operations dened on relational-like structures,
                                                             XML data model assumes ordering of tags which
which are as relations consist of tuples. This struc-
                                                          is missing in our Envelop structure. However for op-
ture is suitable for representing intermediate result
                                                          timization purposes better algebraic properties are
of joining group of sequences. In XAT algebra such
                                                          more important than ordering so we assume that
structure called XAT-table, in Xtasy  Env.
                                                          query result should be additionally sorted if needed.
   The XML Query algebra is very useful for imple-
                                                          Assuming these two Envelops are equal independent
menting simple XML query processors with ability
                                                          of tuple or attribute order.
of logical optimization, but it appears unlikely that
it will form the basis for eective implementations of
                                                          4.2   Algebraic operations
XML query processors with cost-based optimization.
   YAT, XAT and Xtasy algebras have representa-           Path expressions are main building blocks of XQuery
tion of nested for-clauses with join operations. These    expressions. Also their evaluation is one of the most
expensive elements in XQuery evaluation. Therefore        4.2.2   Data Extraction operations
it is important to increase quality of xpath evalua-
                                                          There are some operations (leaf or unary opera-
tion plans. So query algebra has to provide a wide
                                                          tions), which produces new Envelop without pro-
space of equivalent plans for navigational expres-
                                                          vided any another as input. These operations are
sions. This problem could be solved by representa-
                                                          presented with function call and element search op-
tion of XPath with operations with good algebraic
                                                          erations. GetDocumentRoot operation is a member
properties. So XAnswer algebra use structural-join
                                                          of function call operations family. This operation
(binary) and data extraction (unary) operations to
                                                          generates a new Envelop containing root node for
represent path expressions and each step of path ex-
                                                          provided document. GetDocuemntRoot operation
pression is represented with these operations.
                                                          dened as follows:
4.2.1 Structural-join operation
                                                            GetDocumentRoot(Document) = {x |
Structural-join operation appears in many works
                                                                  x is a root element of the Document}
around the XPath optimization, for example [6, 15,
2]. XAnswer also has this operation with following           The next important data extraction operation is
denition:                                                GetElement operation. It searches for elements, that
                                                          satisfy NodeTest condition in the provided set of
  A ⊗axisAi Bj B = {(x, y) |                              documents. For example as a NodeTest condition
            x ∈ A, y ∈ B, axisAi Bj (x, y) = true}        could be a NameTest, i.e. some tag name.
   Here A, B  two input Envelops. Ai , Bj                 GetElement(N odeT est, DocSet) = {x |
attribute identiers by which join is performed.
axisAi Bj  is a predicate by which join is performed.       x ∈ elements of documents f rom DocSet,
It returns true, if elements of x and y, corresponding                            N odeT est(x) = true}
to identiers Ai , Bj are in axis relation (for example
child or parent).                                            The result of evaluation of this operation is a new
                                                          Envelop containing values of nodes satisfying to the
Example 1                                                 NodeTest. As a single attribute of this Envelop,
                                                          some unique identier is set. The following example
                                                          shows XAnswer algebraic expression for some
                                                          typical XPath twig query.

                                                          Example 2

                                                          Lets consider following path expression:
                                                            document(doc.xml)/manufacturers//dealer/address
                                                            The algebraic expression corresponding to this
                                                          path expression is:

                                                          ((GetDocumentRoot(doc.xml)
                                                          ⊗childv1 v2 GetElement(manuf acturers, doc.xml))
                                                          ⊗childv2 v3 GetElement(dealer, doc.xml))
                                                          ⊗childv3 v4 GetElement(address, doc.xml)
         Figure 2: Structural join example
                                                             This algebraic expression could be represented
   This example shows structural join of two En-          with a tree shown in Figure 3. For simplicity in
velops by child axis for attributes A2 and B1. Re-        this gure does not provided some information like
sult of joining is a new Envelop, every tuple of which    axis predicates for structural joins. Evaluation is
obtained by union of tuples of rst and second En-        performed by left-deep walking through this tree.
velops in case when tuple element corresponding to        It means that for this tree rst would be extracted
attribute B1 of the second Envelop is a child of tu-      document root node then manufacturers nodes then
ple element corresponding to attribute A2 of the rst     would be performed structural join by child axis and
Envelop.                                                  so on.
   Structural join operation has associativity(1)            Sometimes much more eective plan could be
and commutativity(2) properties. These properties         achieved by changing evaluation order of structural
could be prooved using operation denition.               joins. So Figure 4. shows alternative equivalent al-
                                                          gebraic expression for the expression from Example
  (1) : (A ⊗axisAi Bj B) ⊗axisBm Ck C =                   2. Equivalence of these expressions could be proved
                                                          by sequential applying of rule (1). In this case rst
          = A ⊗axisAi Bj (B ⊗axisBm Ck C)                 would be performed joining of manufacturers and
  (2) : (A ⊗axisAi Bj B) = (B ⊗axisAi Bj A)               document root then joining of dealers and addresses,
                                                          and at last the structural join by descendant-or-self
                                                          axis for manufacturers and dealers is performed.
                                                           projection by attribute corresponding to Ai . The
                                                           variable name became an identier of the single
                                                           attribute of the new Envelop.

                                                           Example 3




   Figure 3: An xpath algebraic expression tree

                                                                          Figure 5: For operation



                                                           4.3.2   Let operation
                                                           Let operation dened as follows:

                                                           Letvar
                                                              f unci ..in A = {(x, z) | x ∈ A, z = f unci1 ..in (x)}
                                                                    1

Figure 4: An alternative xpath algebraic expression        Here var is a variable name; func - a function re-
tree                                                       ferring to some already bound variables; i1 ..in  at-
                                                           tribute identiers corresponding to those variables.
4.2.3 Classical algebraic operations                       For example as this function could be an XPath ex-
                                                           pression like following: $b/address/country.
Some operations in XAnswer derived from relational            The result is a new Envelop, obtained by ap-
algebra. That are such operations as selection, pro-       pending to the old one a new column with results
jection, join, cross product, orderby. . . These oper-     of evaluation of function for each tuple. The given
ations have the same properties as their relational        variable name became an identier of appended
analogues.                                                 attribute.
Selection:
                                                           Example 4
  SelectP A = {x ∈ A | P (x) = true}

Projection:

  ProjectAi1 ...Ain A = {z |
                      z = (xAi1 , . . . , xAin ), x ∈ A}
Join:

  A ./P B = {(x, y) | x ∈ A, y ∈ B,
                                 P(x, y) = true}

4.3 Specic for XQuery operations
For and Let operators also known as variable binding                      Figure 6: Let operation
operators are one of the most important operators
in XQuery. These operators are similar in that they
dene a variable in query evaluation context and set          In this case path expression depends on single
to it some current value.                                  variable corresponding to attribute denoted by $A2.
XAnswer also has For and Let operations.
                                                           4.3.3   Return operation
4.3.1 For operation                                        The next one important algebraic operation, specic
For operation dened as follows:                           for XQuery, is a Return operation. XAnswer Return
                                                           operation by functions and representation is similar
    var
F orA i
        A = {z | z = P rojectAi A, x ∈ A}                  to the XTasy Return operation [12, 5].
                                                           ReturnOF A, where A - an Envelop produced with
Here, var is a variable name; Ai  identier of En-        one of the XAnswer operations like Join or For op-
velop attribute; A - Envelop.                              eration. OF - Output Filter, which is dened by
   The result of evaluation of For operation is a          following rule:
new Envlelop, obtained from given by applying                 OF ::= OF1,. . . ,OFn | label[val] | @label[val] |val
   val ::= var | varCopy | xpath.
   Output Filter is a description of activities for rep-
resentation of an evaluated data. For example it
could be an XML element or an XML attribute con-
structor, or a variable value or a result of evaluation
of navigational expression for some bound variable
and so on.

4.3.4 DJoin operation
This operation is used when evaluation of one En-
velop depends on evaluation of another. The only
way to perform this operation is nested loops.
Therefore the major goal of optimization is to trans-
late it to another join operations whenever it is pos-
sible.

4.3.5 Examples
In the following example there are shown two alge-
                                                           Figure 7: An algebraic expression tree for query from
braic expression for Xtasy and XAnswer algebras
                                                           example 6
for the same XQuery expression.

Example 5
                                                              The next example shows a query with nested
  for $b in document("books.xml")/book                     for-clauses where the inner has a dependency to the
  /author/addr                                             variable dened in the outer.
  return $b
An Xtasy algebraic expression for this query:              Example 7

  Returnentry[$b] path(_,$b,in)book[(_,_,/)author[
                                                             for $a in document(doc.xml)//dealer,
  (_,_,/)addr[∅]]] (books.xml)
                                                             $b in $a//address
An XAnswer algebraic expression:                             return $b
                                                           A tree of algebraic expression for this case is shown
  Returnentry[$b] ((GetDocumentRoot(books.xml)
                                                           in Figure 8. In this representation both for-clauses
  ⊗childv1 v2 GetElement(book, books.xml))
                                                           input lters have the same part, corresponding to
  ⊗childv2 v3 GetElement(author, books.xml))               the expression document(doc.xml)//dealer. In this
  ⊗childv3 v4 GetElement(addr, books.xml)                  case common sub-expression would be evaluated
                                                           once and then obtained result would be reused in
Inspite of horizontal and vertical decomposition           evaluation of the second for-clause.
rules of path operation in Xtasy [12], the space
of equivalent plans obtained in terms of XAnswer
algebra is wider then in terms of XTasy.

Example 6

   for $a in document(doc.xml)/manufacturers
   /dealer//address,
   $b in document(doc2.xml)//manager
   return $a
The tree of algebraic expression, corresponding to
the given query is shown in Figure 7. For simplic-
ity, nodes corresponding to join operation do not
provide any information about predicates by which
these join operations are performed. Copy of vari-
able value operation is used as output lter for the
return operation. As result a new element with dif-
ferent to current elements id is created. Also as out-
put lter it could be a variable reference operation.      Figure 8: A tree of algebraic expression for example
In this case an element with the same id would be          7
returned. This is the way to make changes in the
document.
4.4 Using some algebraic properties in query opti-        5   Conclusion
    mization
                                                          This paper outlines several requirements for query
Sometimes nested for-clauses have not dependencies        algebra suitable for building high-performance cost-
between themselves but they have similar parts            based optimizer for native XML DBMS. Also it was
of input path expressions. In this case special           shown that known algebras do not completely satisfy
optimization technique could be applied. It obtains       to these requirements. An another algebra that sat-
common parts of path expressions and rewrites             ises to these requirements better was introduced.
expression to provide sharing result of common            This algebra is a base of XAnswer optimizer for na-
parts. Such query and corresponding to it algebraic       tive XML DBMSs which is currently under develop-
tree after some reforming are described in Example        ment.
8. In [14] such optimization is called expression
minimization. It is shown that benet of this
optimization for class of similar queries could reach
20-70%, depending on XML document structure.              References
Each sub-query could have it's own optimal plans           [1] Monetdb home page http://monetdb.cwi.nl.
that have not common parts. In this case techniques
derived from multi query optimization for building         [2] S. Al-Khalifa, H. V. Jagadish, J. M. Patel,
optimal plan for group of queries [11] could be used.          Y. Wu, N. Koudas, and D. Srivastava. "struc-
In case of such queries the space of equivalent plans          tural joins: A primitive for ecient xml query
in terms of XAnswer algebra is wider than in terms             pattern matching". In 18th International Con-
of XAT.                                                        ference on Data Engineering (ICDE'02), 2002.

Example 8                                                  [3] M. Fernandez B. Choi and J. Simeon. The
                                                               xquery formal semantics: A foundation for im-
Lets consider following query: nd title for those             plementation and optimization. Technical re-
books author of that is a rst author at least of one          port, World Wide Web Consortium, 2002.
book.
Corresponding XQuery expression:                           [4] Catriel Beeri and Yariv Tzaban. SAL: An al-
                                                               gebra for semistructured data and XML. In
   for $a in document(bib.xml)/book/author[1]
                                                               WebDB (Informal Proceedings), pages 3742,
   for $b in document(bib.xml)/book                          1999.
   where $b/author = $a
   return $b/title                                         [5] C.Sartiani. Ecient Management of Semistruc-
   It is easy to see that expression docu-                     tured XML Data. PhD thesis, Universita degli
ment(bib.xml)/author could be evaluated once for             Studi di Pisa, 2003.
rst for-clause and then obtained result could be
reused in evaluation of second for-clause. In this case    [6] Torsten Grust. Accelerating the xpath location
algebraic expression could be presented as a graph,            steps. In ACM SIGMOD, pages 109120, 2002.
shown in Figure 9.
                                                           [7] Torsten Grust, Maurice Van Keulen, and Jens
                                                               Teubner. Accelerating xpath evaluation in any
                                                               rdbms. ACM Trans. Database Syst., 29(1):91
                                                               131, 2004.

                                                           [8] H. V. Jagadish, Laks V. S. Lakshmanan, Di-
                                                               vesh Srivastava, and Keith Thompson. Tax: A
                                                               tree algebra for xml. In DBPL '01: Revised Pa-
                                                               pers from the 8th International Workshop on
                                                               Database Programming Languages, pages 149
                                                               164, London, UK, 2002. Springer-Verlag.

                                                           [9] Ioana Manolescu, Daniela Florescu, and Don-
                                                               ald Kossmann. Answering xml queries on het-
                                                               erogeneous data sources. In VLDB '01: Pro-
                                                               ceedings of the 27th International Conference
                                                               on Very Large Data Bases, pages 241250, San
                                                               Francisco, CA, USA, 2001. Morgan Kaufmann
                                                               Publishers Inc.
Figure 9: A graph of algebraic expression for exam-
ple 8 after performing minimization                       [10] Priti Mishra and Margaret H. Eich. Join pro-
                                                               cessing in relational databases. ACM Comput.
                                                               Surv., 24(1):63113, 1992.
[11] Prasan Roy, S. Seshadri, S. Sudarshan, and
     Siddhesh Bhobe. Ecient and extensible algo-
     rithms for multi query optimization. In SIG-
     MOD '00: Proceedings of the 2000 ACM SIG-
     MOD international conference on Management
     of data, pages 249260, New York, NY, USA,
     2000. ACM Press.
[12] Carlo Sartiani and Antonio Albano. Yet an-
     other query algebra for xml data. In IDEAS '02:
     Proceedings of the 2002 International Sympo-
     sium on Database Engineering & Applications,
     pages 106115, Washington, DC, USA, 2002.
     IEEE Computer Society.
[13] S. Cluet V. Christophides and J. Simeon.
     Semistructured and structured integration rec-
     onciled: Yat += ecient query processing.
     Technical report, INRIA, Verso database group,
     1998.
[14] Song Wang, Elke A. Rundensteiner, and Murali
     Mani. Optimization of nested xquery expres-
     sions with orderby clauses. In ICDEW '05: Pro-
     ceedings of the 21st International Conference
     on Data Engineering Workshops, page 1277,
     Washington, DC, USA, 2005. IEEE Computer
     Society.
[15] J.Patel Y.Wu and H.Jagadish. Structural join
     order selection for xml query optimization. In
     ICDE ' 03 : International Conference on Data
     Engineering, 2003.