A flexible extension of XQuery Full-Text

                      Emanuele Panzeri and Gabriella Pasi

                          University of Milano-Bicocca
                       Viale Sarca 336, 20126 Milano, Italy
                       {panzeri,pasi}@disco.unimib.it


      Abstract. This paper presents the implementation of an extension of
      the XQuery Full-Text language on top of the BaseX query engine. The
      proposed extension adds to the language two new flexible axes that allow
      users to express structural constraints that are evaluated in an approx-
      imate way with respect to a considered path; the constraints evaluation
      produces a scored set of elements. The implementation and the efficiency
      evaluations of the constraints are reported in this paper.


1   Introduction
Recent works have been dedicated at improving standard XML query languages,
such as XQuery and XPath, by enriching their expressiveness in both content
constraints [1, 8] and structural constraints [2, 6] evaluation. While the work
reported in [1] has been adopted by W3C in the XQuery Full-Text extension[7],
no approximate matching for structural-based constraints has been standardized
by W3C yet. The adoption of structured query models (such as XQuery) to
inquiry highly structured document repositories or XML databases forces users
to be well aware of the underlying structure; non of the previous approaches
allows users to directly specify flexible structural constraints the evaluation of
which produces weighted fragments. The XQuery Full-Text language extension
proposed in [5] was the first proposal to introduce a set of flexible constraints
with an approximate matching. The extension allows users to formulate queries
where the relative position of important nodes can be specified independently
from an exact knowledge of the underlying structure. The extension gives to the
user the ability to express structural constraints with approximate matching and
to obtain a weighted set of fragments; users can also define a score combination
using standard XQuery operators and obtain a customized element ranking.
    In this work we present the implementation of the flexible constraints, as de-
fined and motivated in [5], named Near and Below, that allow users to explicitly
specify their tolerance to an approximate structural matching. The implementa-
tion, performed on top of the BaseX query engine [4], integrates and extends the
fragment scoring introduced by the FullText extension by taking into account
also the structural scores computed by the approximate constraint evaluation.

2   The XQuery Full-Text extension in a nutshell
For each element matched by the Below and Near, a score is computed by the
approximate matching; the score is in the interval ]0, 1] where 1 represents a full
satisfaction of the constraint evaluation, while values less than 1 are assigned to
target nodes far from the context node.
    The constraint Below is defined as an XPath axis (like, for example, the
children, self, etc axes) the evaluation of which is aimed at identifying el-
ements that are direct descendants of a node. The Below constraint is spec-
ified as: c/below::t, where c is the context node, and t is the target node.
The score computed by the Below axis evaluation, is computed by the formula:
                       1
wbelow (c, t) = |desc arcs(c,t)| . Where desc arcs(c, t) is a function that returns the
set of unique descending arcs from c to t.
    The constraint Near is specified as a flexible axis of a path expression; it
allows to identify XML elements connected through any path to the context
node. The axis allows to define a maximum distance n that acts as a threshold
on the number of arcs between the context node and the target node; nodes the
distance of which is more then n arcs are filtered out from the possible results.
The Near syntax is:c/near(n)::t and the score for its evaluation is computed
                              1
                                    if |arcs(c, t)| ≤ n
as: wnear (c, t, n) = |arcs(c,t)|                       where c is the context node, t
                              0     else.
is the current target node, n is the maximum allowed distance and arcs(c, e)
returns the set of arcs in the shortest path between c and t.

3    Implementation
The new axes have been integrated into the BaseX XQuery engine by extending
both its language interpreter and its XQuery evaluation processor to include
a new score-structure Score Variable definition. BaseX has been chosen for
being the first system (and the only one, to the best of our knowledge) to im-
plement the full XQuery Full-Text language. As described in [3], BaseX adopts
an efficient indexing schema for XML documents. The XQuery FLOWR clauses
have been made capable to identify the new structural score variable, and to al-
low its usage in sorting, ordering and results display. As an example the XQuery
for clause has been extended as follows:
ForClause ::="for" "$" VarName TypeDeclaration? PositionalVar? FTScoreVar?
  StructScoreVar? "in" ExprSingle ("," "$" VarName TypeDeclaration?
  PositionalVar? FTScoreVar? "in" ExprSingle)*
StructScoreVar ::= "score-structure" "$" VarName

    where Varname is a valid variable name; TypeDeclaration is a variable type
declaration; and ExprSingle is the actual query for node selection as defined in
the XQuery language. From the user point of view this approach offers unlimited
possibilities of the usage of the new Structure Score Variable: the user can define
aggregation functions using the default XQuery constructs.
    Fig. 1a shows an example of the Near constraint application: the query
person//act/near::title is evaluated and the three gray title nodes are
matched with a score of 0.3, for the act/movie/title node, and 0.25 for the
other two nodes. In Fig. 1b the evaluation of the query person/below::name is
shown: three name nodes are retrieved; person/name node with a score of 1 and
0.3 for the other nodes.
                           (a)                          (b)

       Fig. 1: Example of (a) Near and (b) Below constraint evaluation.


4   Evaluation

The performed evaluations compare the efficiency of all the new axes constraints
with the standard XPath/XQuery counterparts (if available): in particular for
the Below axis evaluation we executed each query using both the Below and the
descendant axes. Concerning the Near axis evaluation, instead, no counterpart
could be identified due to the innovative nature of the proposed axis.
    The axes evaluations have been performed by using the IMDB INEX Data-
Centric collection. Performance tests have been executed with an increasing size
of the evaluated collection to verify the overhead introduced by the flexible axis
evaluation in comparison with standard (if applicable) XPath axes constraints.
Due to the nature of the BaseX indexing system that caches queries, result set,
and opened databases, the evaluations have been performed by unloading the
BaseX system between each run. All evaluation tests have been executed 5 times,
and the average timings (removing the worst and the best results) are presented.
    Below axis evaluation: The Below axis has been compared with the stan-
dard descendant axis: both axes have been evaluated by executing the test
without any query optimization introduced by BaseX. Five queries containing
the Below axis have been evaluated against each collection by measuring its ex-
ecution time. The same query, with the Below axis replaced by the descendant
axis has then been executed and its timings compared. In Fig. 2 the evaluation
results are sketched: not surprisingly the Below axis evaluation takes more time
than the equivalent descendant axis to obtain the query results, due to the
computation of the structural score. The Below axis evaluation takes in average
36% more time than the execution of the descendant counterpart.
    Near axis evaluation: The Near axis evaluation has been performed by
using the same IMDb collection used for the evaluation of the Below axis. The
queries used during the evaluation process have been defined so as to require the
BaseX engine to retrieve all the XML elements without neither adopting any
optimization strategy nor any query re-writing; this aspect forced the BaseX
system to perform a sequential analysis of the target nodes, and thus to provide a
complete execution of the Near axis evaluation. Furthermore the BaseX Full-Text
index has been avoided, further enforcing the complete iteration over any target
node without using any BaseX pre-pruning strategy. These aspects allowed to
measure the efficiency of the Near axis evaluation implementation.
      Fig. 2: Comparison between Below and descendant axis evaluation.


5   Conclusions and Future Work
The Below and Near axes, semantically and syntactically defined in [5] have
been implemented and evaluated on top of the BaseX system, where both the
query interpreter and the evaluation engine have been extended to identify and
evaluate the new axes. The obtained results confirm that, although the flexible
evaluation of both axes requires relatively longer times, the proposed flexible
evaluation and the subsequent XML element ranking based on both textual and
structural constraints can be successfully introduced into the XQuery language.
Ongoing works are being conducted related to the definition, alongside the BaseX
data structures, of ad-hoc indexes to better evaluate the new flexible constraints
by adopting efficient pruning techniques during target node identification, thus
further improving the axis evaluation performance.


References
1. S. Amer-Yahia, C. Botev, and J. Shanmugasundaram. TeXQuery: A Full-Text
   Search Extension to XQuery. In WWW ’04, pages 583–594. ACM, 2004.
2. S. S. Bhowmick, C. Dyreson, E. Leonardi, and Z. Ng. Towards non-directional
   Xpath evaluation in a RDBMS. In CIKM ’09, pages 1501–1504, 2009.
3. C. Grün. Storing and Querying Large XML Instances. PhD thesis, Universität
   Konstanz, December 2010.
4. C. Grün, S. Gath, A. Holupirek, and M. H. Scholl. XQuery Full Text Implementation
   in BaseX. In XSym ’09, pages 114–128, 2009.
5. E. Panzeri and G. Pasi. An Approach to Define Flexible Structural Constraints in
   XQuery. In AMT, pages 307–317, 2012.
6. B. Truong, S. Bhowmick, and C. Dyreson. Sinbad:towards structure-independent
   querying of common neighbors xml databases. In DASFAA’12, pages 156–171. 2012.
7. W3C. XQuery/XPath FullText. www.w3.org/TR/xpath-full-text-10, March 2011.
8. C. Yu and H. V. Jagadish. Querying Complex Structured Databases. In VLDB ’07,
   pages 1010–1021, 2007.