=Paper=
{{Paper
|id=Vol-256/paper-12
|storemode=property
|title=XML Query Algera for Cost-based Optimization
|pdfUrl=https://ceur-ws.org/Vol-256/submission_2.pdf
|volume=Vol-256
|dblpUrl=https://dblp.org/rec/conf/syrcodis/LukichevB07
}}
==XML Query Algera for Cost-based Optimization==
XML Query Algera for Cost-based Optimization ∗
c Maxim Lukichev
° Dmirty Barashev
University of Saint-Petersburg
lmax@inbox.ru
Abstract formations were considered in [9], including seman-
tical optimization, pushing predicates, joins reorder-
Several requirements for algebra suitable
ing, eliminating redundant document-order sorts.
for ecient cost-based optimization are pre-
Rules of transforming XQuery queries to forms more
sented. It is shown that known XML alge-
suitable for translation to SQL were described in [7].
bras do not fully satisfy this requirements.
XML Query Algebra [3] comes from the activity of
A new algebra to satisfy better the require-
the W3C XML Query Working Group. That algebra
ments is introduced.
is mainly intended for the formal denition of query
languages semantics. The algebra itself is an ab-
1 Introduction stract version of XQuery, where high-level operators
Continously growing usage of XML data demands (e.g., n-ary for and sortby clauses) are mapped into
for development of powerful query optimization sys- low-level algebraic operators. Rewriting rules are
tems. Optimization approaches for XML databases provided resembling functional programming lan-
depend on database type. Relational based XML guages rules and nested relational rules.
DBMS decompose documents into conventional or Xtasy [12, 5] algebra also as YAT [13], XAT
special binary relations [1]. XQuery clauses in such [14] and SAL [4] algebras has operators dened on
systems are translated to queries in SQL-like lan- relational-like structures. Operations similar to re-
guage and query processors employ traditional re- lational as selection, projection, join, cross product,
lational query optimization techniques. So called order by etc. have appeared in these algebras. But
native XML DBMSs use their own data storage also operations specic to XPath and XQuery have
formats. Query optimizers in the native XML appeared. So in Xtasy they are presented by path
databases are not as ecient as their relational coun- and return operations, those similar to bind and tree
terparts. Usually they are limited with some logical in YAT algebra. In XAT and SAL algebras for vari-
optimization methods and follow a naive physical able binding map operation is used.
plan. However, an ecient optimizer should con- TAX [8] is a query algebra developed in the con-
sider many equivalent physical plans and choose the text of the TIMBER project. TAX data model is
best one using some cost function. Execution of based on unordered collections of ordered data trees,
the best physical plan may be signicantly, some- and each TAX operator takes as input collections
times several orders of magnitude faster comparing of data trees, and produces as output collections
to naive one. So the problem of generating the op- of data trees. Unlike YAT and SAL, TAX directly
timization space is very important. Optimization manipulates trees without the need for an explicit
space is a set of equivalent expressions in some query intermediate structure. Data extraction and bind-
algebra and thus the nal plan quality is dened by ing are performed by using pattern trees: pattern
some properties of the chosen algebra. Unlike re- trees, which resemble Xtasy input lters, describe
lational systems, there is no standard algebra for the structure of the desired data, and impose condi-
XML queries, but there are many dierent algebras tions on them.
[12, 14, 4, 13, 8] each having its own pros and cons. MonetDB system [1] worth special attention.
In this work we gather requirements for query al- Data in that system are stored as binary relations.
gebra suitable for ecient cost-based optimization Queries are translated to special intermediate SQL-
and propose an algebra based on elements of XAT like representation and then are executed as SQL.
[14] and Xtasy [12] algebras and which meets the Benchmarks show a signicant superiority of Mon-
requirements. etDB over native XML DBMS, mostly when working
with big documents. The reason is that native XML
2 Related work databases are lacking powerful query optimizers and
ecient indexing structures.
Many researches in XQuery optimization are focused
on logical optimizations. A number of logical trans-
3 Analysis of Algebra requirements
∗ This work was partially supported by RFBR (grant 07-
07-00268a). Time required to evaluate some query could dier
Proceedings of the Spring Young Researcher's Colloquium very much depending on the chosen evaluation plan.
on Database and Information Systems, Moscow, Russia, 2007 The main optimizer role is to nd the most eec-
tive one. The space of available physical plans is operations have enough good algebraic properties.
mainly dened by properties of algebra operations. So the requirement (2) is satised by these algebras.
And the wider is that space the more eective eval- But requirement (3) does not completely satised
uation plan could be found. Such properties of alge- by them. The reason is that their operations used
braic operations as commutativity, associativity and for xpath representation do not have complete num-
idempotance are desirable properties because they ber of desirable algebraic properties. Therefore it is
extend search space, increasing optimizer's freedom impossible to arbitrary change evaluation order of
for choosing optimal plan. operations for navigational expressions.
One of the most important and widely used con-
structions of XQuery are nested for-clauses. The 4 XAnswer Algebra
order of their evaluation in evaluation plan is signif-
icant for performance as order of join operations in In this section we propose new algebra called XAn-
relational algebra. Therefore the representation of swer. This algebra is based on some elements of
nested for-clauses with operations with good alge- XTasy and XAT algebras and satisfy requirements
braic properties is an important requirement. described in previous section. First, it would be de-
It is impossible to choose an order of operations scribed data structures, algebra operations dened
evaluation without estimation of a size of an inter- on. After that main algebra operations, that are
mediate result, for example result of joining two se- similar to relational, would be introduced. And at
quences from the given three. Therefore data struc- last specic for XQuery algebraic operations would
tures in query algebra must be good enough to rep- be dened and compared with analogues of Xtasy
resent measurable intermediate results. algebra.
XPath expressions also play an important role in
XQuery. There may be dierent plans of evaluating 4.1 XAnswer data structure
the same XPath expression and some plans may be XAnswer algebraic operations are dened on
much more expensive than others [10]. So if XPath relational-like data structures like in [12, 14]. Be-
would be presented with operations with good al- low, this structure will be called Envelop. It is rep-
gebraic properties probably the more eective plan resented with a table and consists of tuples which
could be found. It is the last important requirement. contains XML-node values. Order of table attributes
Concluding, XML query algebra suitable for cost- or tuples is not signicant.
based optimization is expected to satisfy the follow- In case of XQuery each attribute of Envelop is a
ing requirements: name of variable that was bound in corresponding
subexpression. In case of XPath there are no any
1. operations are dened on data structures suit-
variables, therefore some unique identier is used as
able for representing measurable intermediate
corresponindg attribute instead of variable name.
results
Lets consider a path expression:
2. nested for-clauses are mapped to operations book/author/address/country. Lets Assume
with good algebraic properties such as commu- that identier $VA denotes values of nodes with
tativity and associativity tag name book, identier $VB author, $VC
address, $VD country. The Envelop, obtained
3. xpath expressions are also represented by oper- after evaluation of considered path expression could
ations with good algebraic properties. be represented with a table shown in Figure 1.
W3C algebra operations dened on sequences.
This denition leads to several problems with an in-
termediate result representation, because sequences
do not provide any information about correspond-
ing bound variables and e.t.c. Therefore there are
problems with intermediate result storing and es-
timation that leads to problems with reordering of
some expensive operations. So this algebra does not
satisfy requirement (1). But this requirement is sat- Figure 1: Envelop structure
ised by YAT, XAT, Xtasy algebras. These algebras
have operations dened on relational-like structures,
XML data model assumes ordering of tags which
which are as relations consist of tuples. This struc-
is missing in our Envelop structure. However for op-
ture is suitable for representing intermediate result
timization purposes better algebraic properties are
of joining group of sequences. In XAT algebra such
more important than ordering so we assume that
structure called XAT-table, in Xtasy Env.
query result should be additionally sorted if needed.
The XML Query algebra is very useful for imple-
Assuming these two Envelops are equal independent
menting simple XML query processors with ability
of tuple or attribute order.
of logical optimization, but it appears unlikely that
it will form the basis for eective implementations of
4.2 Algebraic operations
XML query processors with cost-based optimization.
YAT, XAT and Xtasy algebras have representa- Path expressions are main building blocks of XQuery
tion of nested for-clauses with join operations. These expressions. Also their evaluation is one of the most
expensive elements in XQuery evaluation. Therefore 4.2.2 Data Extraction operations
it is important to increase quality of xpath evalua-
There are some operations (leaf or unary opera-
tion plans. So query algebra has to provide a wide
tions), which produces new Envelop without pro-
space of equivalent plans for navigational expres-
vided any another as input. These operations are
sions. This problem could be solved by representa-
presented with function call and element search op-
tion of XPath with operations with good algebraic
erations. GetDocumentRoot operation is a member
properties. So XAnswer algebra use structural-join
of function call operations family. This operation
(binary) and data extraction (unary) operations to
generates a new Envelop containing root node for
represent path expressions and each step of path ex-
provided document. GetDocuemntRoot operation
pression is represented with these operations.
dened as follows:
4.2.1 Structural-join operation
GetDocumentRoot(Document) = {x |
Structural-join operation appears in many works
x is a root element of the Document}
around the XPath optimization, for example [6, 15,
2]. XAnswer also has this operation with following The next important data extraction operation is
denition: GetElement operation. It searches for elements, that
satisfy NodeTest condition in the provided set of
A ⊗axisAi Bj B = {(x, y) | documents. For example as a NodeTest condition
x ∈ A, y ∈ B, axisAi Bj (x, y) = true} could be a NameTest, i.e. some tag name.
Here A, B two input Envelops. Ai , Bj GetElement(N odeT est, DocSet) = {x |
attribute identiers by which join is performed.
axisAi Bj is a predicate by which join is performed. x ∈ elements of documents f rom DocSet,
It returns true, if elements of x and y, corresponding N odeT est(x) = true}
to identiers Ai , Bj are in axis relation (for example
child or parent). The result of evaluation of this operation is a new
Envelop containing values of nodes satisfying to the
Example 1 NodeTest. As a single attribute of this Envelop,
some unique identier is set. The following example
shows XAnswer algebraic expression for some
typical XPath twig query.
Example 2
Lets consider following path expression:
document(doc.xml)/manufacturers//dealer/address
The algebraic expression corresponding to this
path expression is:
((GetDocumentRoot(doc.xml)
⊗childv1 v2 GetElement(manuf acturers, doc.xml))
⊗childv2 v3 GetElement(dealer, doc.xml))
⊗childv3 v4 GetElement(address, doc.xml)
Figure 2: Structural join example
This algebraic expression could be represented
This example shows structural join of two En- with a tree shown in Figure 3. For simplicity in
velops by child axis for attributes A2 and B1. Re- this gure does not provided some information like
sult of joining is a new Envelop, every tuple of which axis predicates for structural joins. Evaluation is
obtained by union of tuples of rst and second En- performed by left-deep walking through this tree.
velops in case when tuple element corresponding to It means that for this tree rst would be extracted
attribute B1 of the second Envelop is a child of tu- document root node then manufacturers nodes then
ple element corresponding to attribute A2 of the rst would be performed structural join by child axis and
Envelop. so on.
Structural join operation has associativity(1) Sometimes much more eective plan could be
and commutativity(2) properties. These properties achieved by changing evaluation order of structural
could be prooved using operation denition. joins. So Figure 4. shows alternative equivalent al-
gebraic expression for the expression from Example
(1) : (A ⊗axisAi Bj B) ⊗axisBm Ck C = 2. Equivalence of these expressions could be proved
by sequential applying of rule (1). In this case rst
= A ⊗axisAi Bj (B ⊗axisBm Ck C) would be performed joining of manufacturers and
(2) : (A ⊗axisAi Bj B) = (B ⊗axisAi Bj A) document root then joining of dealers and addresses,
and at last the structural join by descendant-or-self
axis for manufacturers and dealers is performed.
projection by attribute corresponding to Ai . The
variable name became an identier of the single
attribute of the new Envelop.
Example 3
Figure 3: An xpath algebraic expression tree
Figure 5: For operation
4.3.2 Let operation
Let operation dened as follows:
Letvar
f unci ..in A = {(x, z) | x ∈ A, z = f unci1 ..in (x)}
1
Figure 4: An alternative xpath algebraic expression Here var is a variable name; func - a function re-
tree ferring to some already bound variables; i1 ..in at-
tribute identiers corresponding to those variables.
4.2.3 Classical algebraic operations For example as this function could be an XPath ex-
pression like following: $b/address/country.
Some operations in XAnswer derived from relational The result is a new Envelop, obtained by ap-
algebra. That are such operations as selection, pro- pending to the old one a new column with results
jection, join, cross product, orderby. . . These oper- of evaluation of function for each tuple. The given
ations have the same properties as their relational variable name became an identier of appended
analogues. attribute.
Selection:
Example 4
SelectP A = {x ∈ A | P (x) = true}
Projection:
ProjectAi1 ...Ain A = {z |
z = (xAi1 , . . . , xAin ), x ∈ A}
Join:
A ./P B = {(x, y) | x ∈ A, y ∈ B,
P(x, y) = true}
4.3 Specic for XQuery operations
For and Let operators also known as variable binding Figure 6: Let operation
operators are one of the most important operators
in XQuery. These operators are similar in that they
dene a variable in query evaluation context and set In this case path expression depends on single
to it some current value. variable corresponding to attribute denoted by $A2.
XAnswer also has For and Let operations.
4.3.3 Return operation
4.3.1 For operation The next one important algebraic operation, specic
For operation dened as follows: for XQuery, is a Return operation. XAnswer Return
operation by functions and representation is similar
var
F orA i
A = {z | z = P rojectAi A, x ∈ A} to the XTasy Return operation [12, 5].
ReturnOF A, where A - an Envelop produced with
Here, var is a variable name; Ai identier of En- one of the XAnswer operations like Join or For op-
velop attribute; A - Envelop. eration. OF - Output Filter, which is dened by
The result of evaluation of For operation is a following rule:
new Envlelop, obtained from given by applying OF ::= OF1,. . . ,OFn | label[val] | @label[val] |val
val ::= var | varCopy | xpath.
Output Filter is a description of activities for rep-
resentation of an evaluated data. For example it
could be an XML element or an XML attribute con-
structor, or a variable value or a result of evaluation
of navigational expression for some bound variable
and so on.
4.3.4 DJoin operation
This operation is used when evaluation of one En-
velop depends on evaluation of another. The only
way to perform this operation is nested loops.
Therefore the major goal of optimization is to trans-
late it to another join operations whenever it is pos-
sible.
4.3.5 Examples
In the following example there are shown two alge-
Figure 7: An algebraic expression tree for query from
braic expression for Xtasy and XAnswer algebras
example 6
for the same XQuery expression.
Example 5
The next example shows a query with nested
for $b in document("books.xml")/book for-clauses where the inner has a dependency to the
/author/addr variable dened in the outer.
return $b
An Xtasy algebraic expression for this query: Example 7
Returnentry[$b] path(_,$b,in)book[(_,_,/)author[
for $a in document(doc.xml)//dealer,
(_,_,/)addr[∅]]] (books.xml)
$b in $a//address
An XAnswer algebraic expression: return $b
A tree of algebraic expression for this case is shown
Returnentry[$b] ((GetDocumentRoot(books.xml)
in Figure 8. In this representation both for-clauses
⊗childv1 v2 GetElement(book, books.xml))
input lters have the same part, corresponding to
⊗childv2 v3 GetElement(author, books.xml)) the expression document(doc.xml)//dealer. In this
⊗childv3 v4 GetElement(addr, books.xml) case common sub-expression would be evaluated
once and then obtained result would be reused in
Inspite of horizontal and vertical decomposition evaluation of the second for-clause.
rules of path operation in Xtasy [12], the space
of equivalent plans obtained in terms of XAnswer
algebra is wider then in terms of XTasy.
Example 6
for $a in document(doc.xml)/manufacturers
/dealer//address,
$b in document(doc2.xml)//manager
return $a
The tree of algebraic expression, corresponding to
the given query is shown in Figure 7. For simplic-
ity, nodes corresponding to join operation do not
provide any information about predicates by which
these join operations are performed. Copy of vari-
able value operation is used as output lter for the
return operation. As result a new element with dif-
ferent to current elements id is created. Also as out-
put lter it could be a variable reference operation. Figure 8: A tree of algebraic expression for example
In this case an element with the same id would be 7
returned. This is the way to make changes in the
document.
4.4 Using some algebraic properties in query opti- 5 Conclusion
mization
This paper outlines several requirements for query
Sometimes nested for-clauses have not dependencies algebra suitable for building high-performance cost-
between themselves but they have similar parts based optimizer for native XML DBMS. Also it was
of input path expressions. In this case special shown that known algebras do not completely satisfy
optimization technique could be applied. It obtains to these requirements. An another algebra that sat-
common parts of path expressions and rewrites ises to these requirements better was introduced.
expression to provide sharing result of common This algebra is a base of XAnswer optimizer for na-
parts. Such query and corresponding to it algebraic tive XML DBMSs which is currently under develop-
tree after some reforming are described in Example ment.
8. In [14] such optimization is called expression
minimization. It is shown that benet of this
optimization for class of similar queries could reach
20-70%, depending on XML document structure. References
Each sub-query could have it's own optimal plans [1] Monetdb home page http://monetdb.cwi.nl.
that have not common parts. In this case techniques
derived from multi query optimization for building [2] S. Al-Khalifa, H. V. Jagadish, J. M. Patel,
optimal plan for group of queries [11] could be used. Y. Wu, N. Koudas, and D. Srivastava. "struc-
In case of such queries the space of equivalent plans tural joins: A primitive for ecient xml query
in terms of XAnswer algebra is wider than in terms pattern matching". In 18th International Con-
of XAT. ference on Data Engineering (ICDE'02), 2002.
Example 8 [3] M. Fernandez B. Choi and J. Simeon. The
xquery formal semantics: A foundation for im-
Lets consider following query: nd title for those plementation and optimization. Technical re-
books author of that is a rst author at least of one port, World Wide Web Consortium, 2002.
book.
Corresponding XQuery expression: [4] Catriel Beeri and Yariv Tzaban. SAL: An al-
gebra for semistructured data and XML. In
for $a in document(bib.xml)/book/author[1]
WebDB (Informal Proceedings), pages 3742,
for $b in document(bib.xml)/book 1999.
where $b/author = $a
return $b/title [5] C.Sartiani. Ecient Management of Semistruc-
It is easy to see that expression docu- tured XML Data. PhD thesis, Universita degli
ment(bib.xml)/author could be evaluated once for Studi di Pisa, 2003.
rst for-clause and then obtained result could be
reused in evaluation of second for-clause. In this case [6] Torsten Grust. Accelerating the xpath location
algebraic expression could be presented as a graph, steps. In ACM SIGMOD, pages 109120, 2002.
shown in Figure 9.
[7] Torsten Grust, Maurice Van Keulen, and Jens
Teubner. Accelerating xpath evaluation in any
rdbms. ACM Trans. Database Syst., 29(1):91
131, 2004.
[8] H. V. Jagadish, Laks V. S. Lakshmanan, Di-
vesh Srivastava, and Keith Thompson. Tax: A
tree algebra for xml. In DBPL '01: Revised Pa-
pers from the 8th International Workshop on
Database Programming Languages, pages 149
164, London, UK, 2002. Springer-Verlag.
[9] Ioana Manolescu, Daniela Florescu, and Don-
ald Kossmann. Answering xml queries on het-
erogeneous data sources. In VLDB '01: Pro-
ceedings of the 27th International Conference
on Very Large Data Bases, pages 241250, San
Francisco, CA, USA, 2001. Morgan Kaufmann
Publishers Inc.
Figure 9: A graph of algebraic expression for exam-
ple 8 after performing minimization [10] Priti Mishra and Margaret H. Eich. Join pro-
cessing in relational databases. ACM Comput.
Surv., 24(1):63113, 1992.
[11] Prasan Roy, S. Seshadri, S. Sudarshan, and
Siddhesh Bhobe. Ecient and extensible algo-
rithms for multi query optimization. In SIG-
MOD '00: Proceedings of the 2000 ACM SIG-
MOD international conference on Management
of data, pages 249260, New York, NY, USA,
2000. ACM Press.
[12] Carlo Sartiani and Antonio Albano. Yet an-
other query algebra for xml data. In IDEAS '02:
Proceedings of the 2002 International Sympo-
sium on Database Engineering & Applications,
pages 106115, Washington, DC, USA, 2002.
IEEE Computer Society.
[13] S. Cluet V. Christophides and J. Simeon.
Semistructured and structured integration rec-
onciled: Yat += ecient query processing.
Technical report, INRIA, Verso database group,
1998.
[14] Song Wang, Elke A. Rundensteiner, and Murali
Mani. Optimization of nested xquery expres-
sions with orderby clauses. In ICDEW '05: Pro-
ceedings of the 21st International Conference
on Data Engineering Workshops, page 1277,
Washington, DC, USA, 2005. IEEE Computer
Society.
[15] J.Patel Y.Wu and H.Jagadish. Structural join
order selection for xml query optimization. In
ICDE ' 03 : International Conference on Data
Engineering, 2003.