<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Method for Evaluating Full-text Search Queries in Native XML Databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roman Pastukhov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>3 Data Model &amp; Query Semantics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for System Programming of the Russian Academy of Sciences Ph.D. advisor: Grinev M. N</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Proceedings of the Spring Young Researcher's Colloquium On Database and Information Systems SYRCoDIS</institution>
          ,
          <addr-line>Moscow, Russia, 2007</addr-line>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>5</lpage>
      <abstract>
        <p>In this paper we consider the problem of efficiently producing results for full-text keyword search queries over XML documents. We describe full-text search query semantics and propose a method for efficient evaluation of keyword search queries with these semantics suitable for native XML databases. Method uses inverted file index which may be efficiently updated when a part of some XML document is updated.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>One of the main features of XML databases is ability to
store semi-structured data as well as structured data.
XQuery and XPath languages allow addressing parts of
XML documents and querying them. These languages
are convenient for querying regularly structured data. If
data is semi-structured or its structure is unknown
making queries using these languages becomes difficult.
In such cases it is easier to use keyword search queries.
One of the key advantages of keyword search queries is
its simplicity – users do not have to learn a complex
query language and can issue queries without prior
knowledge about the structure of the underlying data.</p>
      <p>
        Keyword searching over XML introduces new
challenges. The result of a keyword search query is not
always the entire document, but can be a deeply nested
XML element. In general XML keyword search results
can be arbitrarily nested elements, and returning the
“deepest” node containing the keywords usually gives
more context information (see [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ])
      </p>
      <p>
        In this paper we describe semantics of full-text
queries over XML documents and a method for
evaluating such queries in an XML database system that
supports XQuery data model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
The recent increase in the number of XML repositories
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] has motivated extensive work on designing
languages for XML full-text search [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8 ref9">5, 6, 7, 8, 9</xref>
        ].
      </p>
      <p>
        There has been extensive research in information
retrieval on the efficient evaluation of full-text queries
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], including structured full-text queries [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and of
XML queries such as XQuery/IR [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], XSEarch [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
XIRQL [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], XXL [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Niagara [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, these
works develop algorithms for specific full-text
predicates in isolation.
      </p>
      <p>
        The idea of computing the most specific elements
for conjunctive queries has been actively explored using
deepest common ancestors [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ]. We extend this
idea to support the efficient evaluation of queries with
complex full-text predicates.
      </p>
      <sec id="sec-1-1">
        <title>3.2 Full-text Search Query Semantics</title>
        <p>Full-text search queries concerned in this paper are
composed of keywords and four operations –
conjunction, disjunction, proximity and order.</p>
        <p>Conjunction and disjunction operations combine several
(at least 2) sub-queries into a single query. Proximity
and order operations are applied to a single query to
produce another query.</p>
        <p>Consider a query Q consisting of several keywords
and a mapping M that maps some of these words to
occurrences of these words in an XML document. If Q
has a sub-query q than M|q denotes a restriction of M to
the set of keywords in q. Lets define a predicate
matches(Q, M) to be true when one of the following
conditions is true:
01 Op-And(lists)
02 Current_match[i] = &lt;empty list&gt; for all i = {0, …, lists.count - 1}
03 while lists are not completely processed:
04 Choose i, such that lists[i] is the list with the lowest numbering scheme
number;
05 Elem = lists[i].next;
06 If nodes in current_match have no common ancestor with elem:
07 If all lists in current_match are not empty:
08 Put values from current_match to output stream
09 Current_match[i] = &lt;empty list&gt; for all i = {0, …, lists.count - 1}
10 Add elem to current_match[i];
12 If all lists in current_match are not empty:
13 Put values from current_match to output stream;
01 Op-Or(lists)
02 while lists are not completely processed:
03 choose i, such that lists[i] is the list with the lowest numbering scheme
number;
04 output l[i].next to the output stream;</p>
      </sec>
      <sec id="sec-1-2">
        <title>Listing 1</title>
      </sec>
      <sec id="sec-1-3">
        <title>Listing 2</title>
        <p>•</p>
        <p>If Q is a single keyword and M has a mapping
for this word.
• If Q is a conjunction query with sub-queries
S={q0, q1, … , qn} and {∀q∈S | matches(q,
M|q)}
• If Q is a disjunction query with sub-queries
S={q0, q1, … , qn} and {∃q∈S | matches(q,
M|q)}
• If Q is a proximity query with distance d and
sub-query q, such than matches(q, M) is true
and the maximal distance between the words in
image of M is no more than (d-1).
• If Q is an order query with sub-query q, such
that matches(q, M) is true, M is injective and
the order of the words in q matches the orders
of their images defined by M.</p>
        <p>The result of query Q consists of elements E such
that E is the “deepest” common ancestor of image of
some mapping M that makes predicate matches(Q, M)
true.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4 Query processing</title>
      <p>In order to evaluate the result of a query Q we will
translate this query into a query plan tree. Each node
corresponds to one of the four operations (conjunction,
disjunction, proximity or order), leaves correspond to
keywords in the query and edges correspond to relations
between queries and sub-queries.</p>
      <p>Each operator node receives a list of all matches
from its child nodes and produces a result to its parent.</p>
      <p>Output of the root operator (corresponding to the
whole query) is transformed to a set of nodes.</p>
      <sec id="sec-2-1">
        <title>4.1 Index structure</title>
        <p>To allow efficient query evaluation we will use an
inverted index which contains a list of all word
occurrences (including position of word in node text) in
nodes along with their identifiers and numbering
scheme labels.</p>
        <p>A numbering scheme assigns a unique label to each
node of an XML document according to some
schemespecific rules. The labels encode information about
relative position of the node in the document. Thus, the
main purpose of numbering scheme is to provide
mechanisms to quickly determine the structural
relationship between a pair of nodes.</p>
        <p>Most native XML databases use some sort of
numbering scheme.</p>
        <p>We will require numbering scheme to provide these
mechanisms:
(1) determining ancestor-descendent
relationship between to nodes;
(2) comparing nodes by document order;
(3) determining whether two nodes have a
common ancestor;</p>
        <p>Practically every numbering scheme used in native
XML databases provides mechanisms (1) and (2). If it
doesn’t provide mechanism (3) we can easily modify it
by adding document root identifier to numbering
scheme labels. This modified numbering scheme will
provide mechanism (3).</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.2 Data used by operations</title>
        <p>
          To represent a list of all matches that is transferred
between operators we will use tuples that consist of lists
of word occurrences similar to the contents of inverted
files in the index. Operators in the query plan tree
leaves corresponding to keywords in the query will
simply read an inverted file for this word and return
tuples consisting of a single list describing some node
and word positions of the keyword in the text of this
node. Just like inverted files, lists in tuples contain
numbering scheme labels for nodes.
01 Op-Order(list)
02 while list is not completely processed:
03 elem = list.next;
04 if elem.width == 1:
05 put elem to output stream;
06 else:
07 create a list ord_list of all word occurrences in lists of elem along with
list number (list contains pairs &lt;list number, word occurence&gt;) in document order;
08 initialize array of pointers to ord_list p, to make p[i] point to the first
occurrence of a word from elem’s i’th list, which is after p[i-1] in the ord_list
(for i&gt;0);
09 p[n] points to the end of ord_list, n = elem.width
10 while all elements of p are defined:
11 output tuple i’th list of which consists of elements of elem’s i’th list
between p[i] and p[i+1] in ord_list;
12 p[0] = first element belonging to 0’th list which is after p[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] in
ord_list;
13 for each i in [1..n-1]:
14 p[i] = first element belonging to i’th list which is after p[i-1] in
ord_list;
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Listing 3</title>
        <p>If a tuple consists of n lists, then we will say that
width of this tuple is n (denoted as t.width in the
pseudo-code, here t is a variable referencing some
tuple).</p>
        <p>A tuple represents a set of matches that is Cartesian
product of sets of word occurrences in each list (i.e. if
we choose one word in each list of a tuple we will get
one of the matches represented by this tuple)</p>
        <p>All nodes in a tuple always have a common
ancestor. Tuples in an output stream of some operator
are returned in the document order of their respective
“deepest” common ancestors of nodes in each tuple.</p>
      </sec>
      <sec id="sec-2-4">
        <title>4.3 Conjunction operation</title>
        <p>This operator simply combines tuples from its
subqueries to a single tuple. Pseudo-code for this operation
is shown in listing 2.</p>
      </sec>
      <sec id="sec-2-5">
        <title>4.4 Disjunction operation</title>
        <p>Disjunction operation returns all tuples produced by its
sub-query operators in document order. Pseudo-code for
this operation is shown in listing 1.</p>
      </sec>
      <sec id="sec-2-6">
        <title>4.5 Order operation</title>
        <p>Order operation filters tuples returned by its sub-query
operator, so that all word occurrences in matches are in
the correct order (it’s always corresponds to the orders
of lists in tuple). This may split one tuple into several
tuples. Pseudo-code for this operation is shown in
listing 3.</p>
      </sec>
      <sec id="sec-2-7">
        <title>4.6 Proximity operation</title>
        <p>Proximity operation filters tuples returned by its
subquery operator, so that all word occurrences in matches
are within a window of specific size. Pseudo-code for
this operation is shown in listing 4.</p>
        <p>Since we do not have information about indices of
the first word in the nodes, this operator may produce
false matches which should be checked at later stages of
query execution by computing exact values of node
staring word index array (sw array in the pseudo-code).</p>
        <p>This is not needed if all words in the match that need to
fit in some window are in the same node.
01 Op-Window(list, window):
02 while list is not completely processed:
03 elem = list.next
04 if elem.width == 1:
05 put elem to output stream;
06 else:
07 create a list L of all different nodes in elem with a maximal word number
for each node. List is ordered by document order of nodes;
08 for each node U in the list L compute sw[U] as the sum of maximal word
numbers of nodes before node u;
09 while all lists elem[i], i={0..elem.width-1} are not empty:
10 choose word W with minimum S(W) = sw[node that contains W] + the ordinal
number of word W in the text of node;
11 if all lists elem[i], i={0..elem.width-1} contain words W1, such that
S(W1) &lt; S(W) + window:
12 output a tuple that consists of all words W1 from lists of elem, such
that S(W1) &lt; S(W) + window (if word W1 is in i’th list of elem, it will be in the
i’th list of the resulting tuple) to the output stream;
In this article we described semantics for full-text
search queries over XML data and proposed query
evaluation method suitable for native XML databases.</p>
        <p>Method uses inverted file indices which can be
effectively updated: if node changes do not affect its
ancestor or sibling nodes in the index.</p>
        <p>
          Proposed method allows efficient evaluation of most
full-text queries described in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6 Future work</title>
      <p>
        Investigating the following problems may lead to
improving the proposed query evaluation method:
• devise an efficient compression method
suitable for proposed inverted file index (and a
fixed numbering scheme);
• add relevance calculation and see how can
index be changed to allow efficient evaluation
of ranked queries;
• if the result of a full-text query will be
presented as a set of nodes, some operations
may not need to return a full set of matches
that include word occurrences, instead they can
return just a set of matching nodes. This may
be used to improve query evaluation
performance;
• see whether the proposed method can be
modified to allow “not“ or “mild not”[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
queries;
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fuhr</surname>
          </string-name>
          , K. Grobjohann, “
          <article-title>XIRQL: A Language for Information Retrieval in XML Documents”</article-title>
          ,
          <string-name>
            <given-names>SIGIR</given-names>
            <surname>Conf</surname>
          </string-name>
          .,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kersten</surname>
          </string-name>
          , M. Windhouwer, “
          <article-title>Querying XML Documents Made Easy: Nearest Concept Queries”</article-title>
          ,
          <string-name>
            <given-names>ICDE</given-names>
            <surname>Conf</surname>
          </string-name>
          .,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>[3] XQuery 1.0 and XPath 2</source>
          .
          <article-title>0 Data Model (XDM) http</article-title>
          ://www.w3.org/TR/xpath-datamodel/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>[4] Initiative for the Evaluation of XML Retrieval</article-title>
          . http://inex.is.informatik.uni-duisburg.de/2005/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Mamou. Y.</given-names>
            <surname>Kanza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sagiv. XSEarch: A Semantic Search</surname>
          </string-name>
          <article-title>Engine for XML</article-title>
          .
          <source>VLDB</source>
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Florescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kossmann</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Manolescu.</surname>
          </string-name>
          <article-title>Integrating Keyword Search into XML Query Processing</article-title>
          .
          <source>WWW</source>
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fuhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Grossjohann. XIRQL</surname>
          </string-name>
          :
          <article-title>An Extension of XQL for Information Retrieval</article-title>
          .
          <source>SIGIR</source>
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Theobald</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. Weikum.</surname>
          </string-name>
          <article-title>The Index-Based XXL Search Engine for Querying XML Data with Relevance Ranking</article-title>
          .
          <source>EDBT</source>
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[9] The World Wide Web Consortium. XQuery 1.0 and XPath 2</source>
          .0 Full-Text.
          <article-title>Working draft</article-title>
          . http://www.w3.org/TR/xquery-full-text/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E. W.</given-names>
            <surname>Brown</surname>
          </string-name>
          .
          <article-title>Fast Evaluation of Structured Queries for Information Retrieval</article-title>
          .
          <source>SIGIR</source>
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>J. M. Bremer</surname>
            ,
            <given-names>M. Gertz.</given-names>
          </string-name>
          <article-title>XQuery/IR: Integrating XML Document and Data Retrieval</article-title>
          .
          <source>WebDB</source>
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Naughton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>DeWitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lohman</surname>
          </string-name>
          .
          <article-title>On Supporting Containment Queries in Relational Database Management Systems</article-title>
          .
          <source>SIGMOD</source>
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shanmugasundaram</surname>
          </string-name>
          . XRANK:
          <article-title>Ranked Keyword Search over XML Documents</article-title>
          .
          <source>SIGMOD</source>
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kersten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Windhouwer</surname>
          </string-name>
          .
          <article-title>Querying XML Documents Made Easy: Nearest Concept Queries</article-title>
          .
          <source>ICDE</source>
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          .
          <article-title>Efficient Keyword Search for Smallest LCAs in XML Databases</article-title>
          .
          <source>SIGMOD</source>
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>