=Paper=
{{Paper
|id=None
|storemode=property
|title=Deferred Node-copying Scheme for XQuery Processors
|pdfUrl=https://ceur-ws.org/Vol-567/poster04.pdf
|volume=Vol-567
|dblpUrl=https://dblp.org/rec/conf/dateso/KursV10
}}
==Deferred Node-copying Scheme for XQuery Processors==
<pdf width="1500px">https://ceur-ws.org/Vol-567/poster04.pdf</pdf>
<pre>
               Deferred Node-copying Scheme for XQuery Processors

                                 Jan Kurš and Jan Vraný
                         Software Engineering Group, FIT ČVUT,
                       Kolejní 550/2, 160 00, Prague, Czech Republic
                        kurs.jan@post.cz, jan.vrany@fit.cvut.cz


           Abstract. XQuery is a generic, widely adopted language for querying
           and manipulating XML data. Many currently available native XML
           databases use XQuery as their primary query language. The XQuery
           specification requires each XML node to belong to exactly one XML tree.
           When an XML subtree is appended to a new XML structure, the whole
           subtree has to be copied. This may lead to excessive and unnecessary
           data copying and duplication. In this paper, we present a new XML node
           copying scheme that defers the node data copy operation until it is
           necessary. We show that this scheme significantly reduces the number
           of XML node copy operations required during query processing.


       Keywords: XML, XQuery, XQuery Processor, Smalltalk


   1    Introduction
   XQuery is an XML query language designed by the World Wide Web Consortium.
   Although the language is widely adopted, fast and efficient implementations
   are still lacking, and optimization techniques for XQuery remain a subject of
   active research. The XQuery 1.0 and XPath 2.0 Data Model specification [1]
   forbids sharing nodes among multiple XML node hierarchies. Section 2.1 says:
        ...
        Every node belongs to exactly one tree, and every tree has exactly one
        root node.
        ...
   If an XML node is added to a new XML tree, a naive realization of this
   requirement creates a new node (by copying the original one) and places the
   copy into the new XML tree. Consider the query in figure 1, which is to be
   evaluated and whose output is to be serialized to an output file.
       The whole XML subtree that matches fn:doc("doc.xml")//authors is copied
   by the naive realization but never used. This may lead to excessive node
   copying and higher memory consumption, depending on the subtree size.
       In this paper we will describe an efficient node-copying scheme that avoids
   unnecessary copying while preserving XQuery semantics. We will also discuss its
   correctness and benchmark results.


J. Pokorný, V. Snášel, K. Richta (Eds.): Dateso 2010, pp. 131–138, ISBN 978-80-7378-116-3.


let $authors := element authors { fn:doc("doc.xml")//authors }
let $titles := element titles { fn:doc("doc.xml")//titles }
return element result { $titles }


                        Fig. 1. Simple document-creating query


    The paper is organized as follows: section 2 gives an overall description of the
node-copying scheme mentioned above. Section 3 discusses experimental results
based on running XMark benchmarks. Section 4 provides a brief overview of
related work. Section 5 concludes by summarizing the presented work.


2      Deferred Node-copying
The basic idea is simple: share existing XML nodes between node hierarchies and
defer the node-copy operation until it is unavoidable. In our implementation,
an XML node can belong to multiple node hierarchies, while the XQuery
specification requirement mentioned in section 1 is still preserved.
   The deferred node-copying scheme has been developed to meet two main
goals:
    – separate the query processing logic from the underlying physical data model, and
    – reduce memory consumption by preventing unnecessary data copying.
   The first goal has its origin in software engineering. XQuery processors
should be able to operate over various data models, not necessarily XML-based.
Moreover, a good separation of the query processor from the physical data model
makes it possible to use one XQuery implementation in multiple environments: as a
standalone XQuery tool or within a database management system.
   The second goal came from practical needs. For large documents and
complex queries, a naive XQuery implementation may consume, in edge
cases, twice as much memory as actually needed.

2.1     XDM Adaptor
The XDM specification defines a sequence to be an instance of the data model.
Each sequence consists of zero or more items. An item is either a node or an
atomic value. The specification also defines a number of node properties such as
dm:node-name or dm:parent.
    To meet our first goal, we separate a node from its physical data storage
through an XDM adaptor, which operates on so-called node ids. A node id is a
unique identifier of an XML node within a particular physical storage. The
structure of the node id is not defined; in fact, a node id can be anything: a
reference to a DOM node in memory, a pointer into a database file or a simple
integer.
    Using an XDM adaptor gives us an easy and straightforward way to access
different physical data models. An XDM adaptor abstracts any kind of data source
and may use any kind of optimization (such as extensive caching) to access data
efficiently. The physical data storage and access strategies, however, are hidden
from the rest of the XQuery processor.
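
The adaptor idea can be illustrated with a minimal Python sketch. It is not the
authors' implementation: the names XDMAdaptor, DictAdaptor, node_name, parent and
children are hypothetical, and node ids are assumed to be plain integers, which is
one of the options mentioned above.

```python
from abc import ABC, abstractmethod
from typing import Hashable, Optional, Sequence


class XDMAdaptor(ABC):
    """Hypothetical adaptor: the query processor sees only opaque node ids."""

    @abstractmethod
    def node_name(self, node_id: Hashable) -> Optional[str]: ...

    @abstractmethod
    def parent(self, node_id: Hashable) -> Optional[Hashable]: ...

    @abstractmethod
    def children(self, node_id: Hashable) -> Sequence[Hashable]: ...


class DictAdaptor(XDMAdaptor):
    """Toy storage: integer node ids mapped to (name, parent id, child ids)."""

    def __init__(self, table):
        self._table = table

    def node_name(self, node_id):
        return self._table[node_id][0]

    def parent(self, node_id):
        return self._table[node_id][1]

    def children(self, node_id):
        return self._table[node_id][2]


# Node 0 is a root element; nodes 1 and 2 are its two elem children.
storage = DictAdaptor({
    0: ("root", None, [1, 2]),
    1: ("elem", 0, []),
    2: ("elem", 0, []),
})
child_names = [storage.node_name(c) for c in storage.children(0)]
```

A different backend (a DOM tree, a database file) would only need another
subclass; code written against XDMAdaptor would stay the same.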

2.2   Node States
In order to defer the copy operation, a new node property called node state is
introduced. Each node is in exactly one of the following three states:
Accessed State. Nodes that come from an external data source are in the accessed
   state.
Constructed State. Nodes that are constructed during query processing are
   in the constructed state.
Hybrid State. Nodes that belong to multiple node hierarchies are in the hybrid
   state.

2.3   Actions
During query processing, the state of a node may change. The state diagram
of a node is shown in figure 2. There are three kinds of actions:
Copy Action. The copy action is performed whenever an XML subtree is appended
  to a new XML hierarchy. The original subtree should be duplicated in order
  to meet the requirement that an XML node belongs to just one node
  hierarchy.
Change Action. The change action models any change in the data model, such
  as setting a new parent.
“Child Read” Action. The “child read” action represents the situation in which
  the XQuery processor accesses the child nodes of a given node.


      [Figure: state diagram over the states Hybrid, Accessed and Constructed plus
      an error outcome, with transitions labelled Copy, Change, Child read and
      Change/child read]

                                Fig. 2. Node State Transitions
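
The state machine can also be written down as a lookup table. The following
Python sketch follows our reading of figure 2 and of the transition rules given
in section 2.4; the names State, Action, TRANSITIONS and next_state are
hypothetical, and the assumption that a child read leaves an accessed node in
the accessed state is taken from the diagram labels only.

```python
from enum import Enum


class State(Enum):
    ACCESSED = "accessed"
    CONSTRUCTED = "constructed"
    HYBRID = "hybrid"


class Action(Enum):
    COPY = "copy"
    CHANGE = "change"
    CHILD_READ = "child read"


# (state, action) -> next state; a missing entry means the action is an error.
TRANSITIONS = {
    (State.ACCESSED, Action.COPY): State.HYBRID,
    (State.ACCESSED, Action.CHILD_READ): State.ACCESSED,  # assumed self-loop
    (State.CONSTRUCTED, Action.COPY): State.HYBRID,
    (State.CONSTRUCTED, Action.CHANGE): State.CONSTRUCTED,
    (State.CONSTRUCTED, Action.CHILD_READ): State.CONSTRUCTED,
    (State.HYBRID, Action.COPY): State.HYBRID,
    (State.HYBRID, Action.CHANGE): State.CONSTRUCTED,
    (State.HYBRID, Action.CHILD_READ): State.CONSTRUCTED,
}


def next_state(state: State, action: Action) -> State:
    """Return the state after `action`, or raise if the action is forbidden."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"{action.value} is not permitted in {state.value} state"
        ) from None
```

The only forbidden pair is a change on an accessed node, which matches the
error outcome in the diagram.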


    Consider a document doc.xml (its content is shown in figure 3) and query 1
(figure 4). During execution of the query, the following actions are performed:


<?xml version="1.0"?>
<root>
  <elem>elem1</elem>
  <elem>elem2</elem>
</root>


                                  Fig. 3. doc.xml contents

element myroot {
  attribute attr { 'value' },
  fn:doc("doc.xml")/root/elem[1]
}


                                  Fig. 4. Example Query 1


    1. The myroot element is created in the constructed state. Change actions
       are then issued on that node: setting the node name “myroot”, adding the
       attribute “attr” and appending a text node.
    2. Afterwards, doc.xml is read and two child read actions are performed in
       order to evaluate the XPath expression.
    3. Finally, the first elem (accessed) node from doc.xml is to be added to the
       myroot (constructed) node, so the elem node and all its descendants should
       be copied.

2.4      Transitions


                    Fig. 5. Two XML trees sharing one hybrid node


Accessed Node Transitions. When a copy action on an accessed node is triggered,
the node state is changed from accessed to hybrid and no physical data
copy is made. Changes to accessed nodes are not permitted: any change
immediately leads to an error.

Constructed Node Transitions. Copy operations on constructed nodes behave
exactly as on accessed nodes. Changes to constructed nodes are permitted.

Hybrid Node Transitions. Transitions triggered by actions on hybrid nodes are
a bit more interesting:
Copy Action. A copy action on a hybrid node is a no-op. The same node is
   returned with its state unchanged.
Change Action. Whenever any node property (dm:parent, dm:node-name, etc.)
   is to be changed, the node state is changed to constructed and all node
   properties are copied. See the query in figure 6. When the expression
   at line 5 is processed, two things happen (in this order):
    1. The text node “elem1” (the result of the $doc/root/elem[1]/text()
        expression) is added to the myelem element. The states of the nodes
        after this addition are depicted on the left side of figure 7.
    2. Afterwards, the text node value “elem1” has to be changed to “elem1
        is the first”, because the specification requires adjacent text nodes
        in element content to be merged. Obviously, the hybrid text node must
        be copied: the XML data accessible through $doc must remain unchanged.
Child Read Action. When an XML tree is appended to a new structure, the
   state of the root node of the appended tree is changed to hybrid and a
   reference from the new structure is added to the hybrid node. The rest of
   the appended tree (the descendants of the root node) is unchanged: these
   nodes do not know that their parent has changed its state to hybrid. This
   causes serious problems when XPath expressions are evaluated. To overcome
   this issue, we convert the hybrid node to a constructed one during the
   child read action. This behavior is illustrated in figure 8.
    Data are physically copied only when a hybrid node is either being changed or
its children are being read.


1   let $doc := doc("doc.xml")
2   return
3     element myroot {
4       element myelem {
5         $doc/root/elem[1]/text(), " is the first"
6       }
7     }


                            Fig. 6. Example Query 2
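
The change action on a hybrid node can be replayed with a small Python sketch.
It is an illustration under simplifying assumptions, not the authors' code: the
names Node, append, change_value and stats are hypothetical, only childless
nodes such as the text node from figure 6 are handled, and node states are kept
as plain strings.

```python
from dataclasses import dataclass, field
from typing import List

stats = {"physical_copies": 0}  # counts nodes whose data was really duplicated


@dataclass(eq=False)
class Node:
    value: str
    state: str = "accessed"  # "accessed", "constructed" or "hybrid"
    children: List["Node"] = field(default_factory=list)


def append(parent: Node, child: Node) -> int:
    """Copy action: share `child` with `parent` instead of duplicating it."""
    child.state = "hybrid"
    parent.children.append(child)
    return len(parent.children) - 1


def change_value(parent: Node, index: int, new_value: str) -> None:
    """Change action on the child at `index`, as seen from `parent`."""
    node = parent.children[index]
    if node.state == "accessed":
        raise ValueError("accessed nodes must not be changed")
    if node.state == "hybrid":
        # Simplification of this sketch: only leaf nodes are supported.
        assert not node.children, "sketch handles childless nodes only"
        # The deferred copy happens now; the shared original stays untouched.
        node = Node(node.value, "constructed")
        parent.children[index] = node
        stats["physical_copies"] += 1
    node.value = new_value


text = Node("elem1")                      # accessed text node from doc.xml
source = Node("elem", children=[text])    # its parent in the source document
myelem = Node("myelem", state="constructed")

i = append(myelem, text)                  # step 1: node is shared, no copy yet
copies_after_append = stats["physical_copies"]
change_value(myelem, i, "elem1 is the first")  # step 2: copy on change
```

After step 2 the source document still holds the text “elem1”, while myelem
holds its own constructed text node with the merged value.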


Serialization of Result Set. Once the query is processed, serialization of the
result set does not lead to XML node copying. Because the query is already
processed, no node state transitions need to be performed during serialization
and thus no node copies need to be created. Obviously, if the application wants
to work with the result set as with nodes in memory and wants to modify it, the
result set must be copied.


               Fig. 7. Change of hybrid node into the constructed node


              Fig. 8. Change of hybrid node while exploring the children


3     Discussion
3.1    Specification Conformance
Although the deferred node-copying scheme does not require XML nodes to
belong to exactly one node hierarchy, it preserves the original XQuery semantics.
Our claim is based on the results from the XQuery Test Suite [3].
    The axes tests and element constructor tests from the Minimal Conformance -
Expressions section of the XQTS Catalogue cover the node identity semantics and
were used to test the correctness of the deferred node-copying scheme. Our
proof-of-concept implementation successfully passes all the mentioned test cases.

3.2    Benchmarks
The presented deferred node-copying scheme has been developed in order to increase
XQuery processor performance by reducing the number of copy operations. A natural
question is whether this scheme has a substantial effect in real-world applications.
Table 1 shows the number of copy operations for selected XMark [2] queries¹ on
a file created with the XMark data generator.
    The number of saved copies depends on the characteristics of the query. No
new nodes are created in query Q1, which is why there is no difference in the
results. In query Q2, text nodes are appended to elements. These text nodes do
not need to be copied at all, only switched to the hybrid state.
    During the execution of Q13, a subtree is appended to each result item.
Without the optimization, each element of the subtree has to be copied; with the
optimization turned on, only a few nodes are copied (22 instead of 560).
1
    Plus one nonstandard query marked INC. Its code is element a
    { doc("file:///auctions.xml") }. We include it as an illustration of an extreme case.


                 Query       DNC              IC
                           Nh     Nc       Nh     Nc
                 Q1         0      0        0      0
                 Q2       106      0        0    106
                 Q3         0     44        0     44
                 Q4         0      0        0      0
                 Q5         0      0        0      0
                 Q6         0      0        0      0
                 Q7         0      0        0      0
                 Q8        25     25        0     50
                 Q9        12     25        0     39
                 Q10      402      1        0   1244
                 Q11       12     25        0     39
                 Q12        3      3        0      6
                 Q13       22     22        0    560
                 Q14        0      0        0      0
                 Q15        7      0        0      7
                 Q16        0      6        0      6
                 Q17        0    138        0    138
                 Q18        0      0        0      0
                 Q19      217    217        0    434
                 Q20        8      0        0     12
                 INC     2074    114        0   5857

Legend:

Nh – number of hybrid nodes created
Nc – number of physically copied nodes
DNC – evaluated using deferred node-copying scheme
IC – evaluated using immediate copy as specified by the XQuery specification

                               Table 1. Benchmark results
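
To make the table easier to read, the share of avoided physical copies can be
computed per query. The Python sketch below is our own arithmetic on the Nc
columns of Table 1, restricted to the queries where the two schemes differ; the
names rows and saved are hypothetical.

```python
# (query, physical copies with DNC, physical copies with IC), from Table 1.
rows = [
    ("Q2", 0, 106), ("Q8", 25, 50), ("Q9", 25, 39), ("Q10", 1, 1244),
    ("Q11", 25, 39), ("Q12", 3, 6), ("Q13", 22, 560), ("Q15", 0, 7),
    ("Q19", 217, 434), ("Q20", 0, 12), ("INC", 114, 5857),
]


def saved(dnc: int, ic: int) -> float:
    """Fraction of physical copies avoided relative to immediate copying."""
    return (ic - dnc) / ic


for query, dnc, ic in rows:
    print(f"{query:>4}: {ic - dnc:5d} copies avoided ({saved(dnc, ic):.1%})")
```

For Q13, for example, 538 of the 560 copies are avoided. For Q3, Q16 and Q17
both schemes copy the same number of nodes, so they are left out.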


4     Related Work

eXist XQuery Processor. eXist² is an open-source native XML database with
XQuery as its primary query language. As far as we know, the eXist XQuery
implementation unconditionally copies nodes whenever a node is to be added to
a different node hierarchy. Our approach is different, since we avoid unnecessary
copy operations.


Saxon XQuery Processor. Saxon³ is a well-known, widely adopted XML tool
set that includes XSLT 2.0, XPath 2.0 and XQuery 1.0 processors. Saxon’s XQuery
processor introduces the concept of virtual nodes: lightweight shallow copies of
nodes that share as many properties as possible with their origin.
    Similarly to our approach, some of the standard XDM properties, namely the
parent property, may be overridden for a given virtual node. When the Saxon
XQuery processor iterates over a virtual node’s children, those are converted to
virtual nodes as well.
2
    http://exist.sourceforge.net/
3
    http://saxon.sourceforge.net/
    However, the presented deferred node-copying scheme differs from the virtual
node approach in several aspects:
 1. Creating a virtual copy requires a new object to be allocated in memory.
    The deferred node-copying scheme shares the same object.
 2. The creation of virtual copies is a part of the XQuery processing logic and
    must be explicitly expressed, whereas our approach separates the copying
    logic of an XDM model from the query evaluation logic.


5     Conclusion and Future Work

This paper presents a deferred XML node-copying scheme for XQuery processors
that significantly reduces the number of source node copy operations required
during query processing. The scheme defers the copy operation until it is
unavoidable. Whether a node is actually copied depends on its node state, a new
property which is maintained for each node in addition to the standard XDM
properties. The correctness of this approach has been successfully tested with
the XQuery Test Suite.
    The main benefits of the deferred node-copying scheme are: (i) efficiency,
(ii) ease of implementation, (iii) independence of the physical data model and
(iv) independence of the XQuery processing logic.
    In future work, we plan to extend this scheme for use with various XML
indexing approaches, Ctree [4, 5] in particular.


References
1. M. Fernández, A. Malhotra, J. Marsh, M. Nagy, and N. Walsh.
   XQuery 1.0 and XPath 2.0 Data Model (XDM). W3C, 1st edition, 2006.
   http://www.w3.org/TR/xpath-datamodel.
2. A. Schmidt, F. Waas, M. Kersten, M. J. Carey, I. Manolescu, and R. Busse. XMark:
   A benchmark for XML data management. In VLDB, pages 974–985, 2002.
3. W3C XML Query Working Group. XML Query Test Suite. W3C, 1st edition, 2006.
   http://www.w3.org/XML/Query/test-suite/.
4. Q. Zou, S. Liu, and W. W. Chu. Ctree: a compact tree for indexing XML data.
   In WIDM ’04: Proceedings of the 6th annual ACM international workshop on Web
   information and data management, pages 39–46, New York, NY, USA, 2004. ACM.
5. Q. Zou, S. Liu, and W. W. Chu. Using a compact tree to index and query XML
   data. In CIKM ’04: Proceedings of the thirteenth ACM international conference
   on Information and knowledge management, pages 234–235, New York, NY, USA,
   2004. ACM.

</pre>