<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A flexible XML query language for NON dummies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefania Marrara</string-name>
          <email>stefania.marrara@unimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Panzeri</string-name>
          <email>panzeri@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriella Pasi</string-name>
          <email>pasi@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universit`a degli Studi di Milano - DTI via Bramante 65</institution>
          ,
          <addr-line>26013 Crema (CR)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universit`a degli Studi di Milano Bicocca - DISCo viale Sarca 336</institution>
          ,
          <addr-line>20126 Milano (MI)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper introduces and motivates our proposal of an XPath extension which allows the definition of queries with flexible constraints on both content and structure. The proposed language allow expert users to benefit of the recall improvements of flexible languages while using their collection knowledge to improve the retrieval precision. 1 Introduction and Motivation</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        XML (Extensible Markup Language), a World Wide Web Consortium (W3C)
standard for the World Wide Web, since its first introduction in 1998 has proved
its ability to be the basis for the data interchange on the Internet. Differently
from the most popular HyperText Markup Language (HTML), XML allows
document designers to set up their own tag vocabulary, and to describe the structure
of documents by defining tag nesting. It is widely recognized that, in order to
exploit the power of XML, a query language for XML documents should
allow constraint formulation on both document content and structure. In other
words, it should represent tag nesting as well as their content. The birth of huge
collections of XML documents has implied the definition of appropriate query
languages, whose main representatives are today the W3C standards XPath [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]
and XQuery [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The main issue with these XML query languages is that they
assume the user be fully aware of the target document structure, and allow
only an exact specification of the target documents, due to the boolean nature
of their query-document matching systems. This assumption is debatable since
most XML documents have no pre-set structure (DTD or XML Schema); even
worse, it requires the user to write a different query for each variation of the
document structure itself. In order to tackle this problem, in the last few years
both Information Retrieval (IR) and Database (DB) communities proposed
flexible XML query languages with search paths that provide a loose example of
the information the user is interested in. In Information Retrieval, the
proposals are broadly classified as content-only search (CO) and content and structure
search (CAS). CO approaches usually allow keyword based queries without any
possibility to specify constraints on the expected document structure. Most of
the existing keyword search systems return query results based on the notion of
Lowest Common Ancestor (LCA) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and its variants [
        <xref ref-type="bibr" rid="ref14 ref16">16, 14</xref>
        ]. On the opposite
side, CAS approaches focus on approximate matching of limited XPath
predicates (usually the child axis is transformed into a descendant axis during the
query evaluation), and on designing indexes to score document fragments. These
approaches are based on the notion of structural hint which considers the query
structure as a mere template of the required information. All fragments similar
to the template are retrieved. In the DB community, the most important
contributes are XPath Full Text (XPathFT), and XQuery Full-Text (XQueryFT)
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which include full-text capabilities in the traditional query languages.
Differently from CAS approaches, XPathFT and XQuery FT do not include structural
hints but query structures are evaluated as in the traditional standards. All
mentioned approaches focus on the notion of dummy user: the main idea is that users
are often unable or too lazy in order to formalize complex queries and therefore
keyword based queries or queries without too strict structural constraints are
preferred. In this work we start from a different perspective. In the last years
many important collections such as DBPL, Wikipedia, the Cancer Gene Disease
and Gene Compound or the Protein Data Bank (PDBML) have been created
by the scientific and research communities. Users belonging to these
communities are often scientists, high-educated people who, due to their continue use,
develop a good, even if partial (due to its dimensions), knowledge of the
collection document structure, and therefore they are both able to and interested
in formalizing complex queries if this can improve the retrieval precision. With
this idea in mind, in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] a flexible XPath extension has been proposed (briefly
sketched in Section 2), which allows users to define the extent and type of
desired constraint relaxation in the query both on content and structure. In this
approach the query is not considered as a mere template of the required
information, as the user can include exact and flexible constraints in the same query
on both content and structure. In this way the user can benefit of the recall
improvements typical of flexible languages while exploiting his/her knowledge of
the considered collection to improve the retrieval precision.
1.1
      </p>
      <p>
        Related Work
Recent research in both Information Retrieval and Database communities has
led to several approaches aimed at introducing some degrees of flexibility in
XML retrieval [
        <xref ref-type="bibr" rid="ref12 ref18 ref9">9, 12, 18</xref>
        ]. In the Information Retrieval research context, CO
approaches address the issue of querying XML documents by using a keyword
based approach [
        <xref ref-type="bibr" rid="ref2 ref7 ref8">2, 7, 8</xref>
        ], while CAS models consider both document content and
structure in the retrieval process: just as the keywords are hints, so are the
structural constraints. Examples of CAS approaches are offered by, for instance,
XIRQL [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], NEXI [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], TeXQuery [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and FleXPath [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and the recent
standard XQuery and XPath Full Text 1.0 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Other approaches proposed in the
literature define some flexibility on the evaluation of the query structure in a
more explicit way. For instance, in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] the authors define some
relaxations such as the introduction of generalized data types, the adoption of
edit distances on paths, and some operations to modify the structure like delete
a node, insert intermediate nodes or rename a node. FleXPath [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is the first
approach proposing a formalization of relaxations on the evaluation of the structure
of XML queries, as well as the first algebraic framework for spanning relaxations.
In addition, it proposes new ranking functions with properties that relaxations
must satisfy, and it develops efficient evaluation algorithms.
      </p>
      <p>
        All the above mentioned approaches, however, do not allow users to define the
extent and the type of the desired flexible constraints in the query: the query is
considered as a mere template of the required information. All fragments similar
to the template are retrieved. By these approaches the user has no possibility
to distinguish between portions of the query that must be considered as exact
and those that allow a certain flexibility in the retrieval process. In order to
allow users to explicitly specify both content-based and structure-based flexible
constraints in a query aimed at retrieving XML fragments, in [
        <xref ref-type="bibr" rid="ref11 ref6">11, 6</xref>
        ] a flexible
language has been proposed and defined as an extension of the XPath query
language. The flexible constraints have been syntactically defined in compliance
with the XPath language; the query evaluation produces a score, expressing the
degree of compatibility of the document’s content/structure with respect to the
user requirement. In Section 2 this language is briefly presented.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>A flexible extension of XPath</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref10 ref11 ref6">6, 10, 11</xref>
        ] a new approach aimed at introducing flexibility in the XPath query
language has been defined. This new language extends the syntax of XPath
to allow the definition of some flexible constraints on both the content (the
defined constraints are named cw and around ), and the structure of the XML
document (the defined constraints are named near, below, and approximately).
Informally, cw (contain words) is applied to nodes that have a textual content,
and is introduced to specify keyword-based constraints, as in usual IR query
languages. The constraint cw is followed by the list of search terms to be retrieved
in the textual element.
      </p>
      <p>The constraint around is applied to numeric or date values (within specific
numeric or data content nodes), and it requires their approximate evaluation.
The structure-based constraints below and near, inserted as a flexible axis of a
path expression, allow to extract XML fragments (i.e. elements, attributes or
text) that are, respectively, direct descendants or connected through any path
to the current node (also allowing users to fix a maximum path length). For
each retrieved fragment a retrieval status value is computed that is inversely
proportional to the distance (measured in number of arcs) between the ideal path
structure and the retrieved one. Finally, the constraint approximately allows to
select the elements with a given name that have a number of direct descendants
close to the one indicated in the query.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>The aim of this paper was to introduce and motivate the proposal of an XPath
extension that allows the specification of flexible constraints on both content
and structure of XML documents. The proposed language, briefly sketched in
Section 2, allows expert users to benefit of the recall improvements typical of
flexible languages. In future work the full syntax and semantics of this XPath
extension will be presented, as well as the definition of an appropriate data
structure; moreover evaluations will be performed.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. XQuery and XPath
          <issue>Full Text 1</issue>
          .0. http://www.w3.org/TR/xpath-full-text-
          <volume>10</volume>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Consens</surname>
          </string-name>
          , G. Kazai, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lalmas</surname>
          </string-name>
          .
          <article-title>Structural relevance: a common basis for the evaluation of structured document retrieval</article-title>
          .
          <source>In CIKM'08</source>
          , pages
          <fpage>1153</fpage>
          -
          <lpage>1162</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Amer-Yahia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Botev</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Shanmugasundaram</surname>
          </string-name>
          .
          <article-title>Texquery: a full-text search extension to xquery</article-title>
          .
          <source>In WWW'04</source>
          , pages
          <fpage>583</fpage>
          -
          <lpage>594</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.</given-names>
            <surname>Amer-Yahia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Tree pattern relaxation</article-title>
          .
          <source>In EDBT'02</source>
          , pages
          <fpage>496</fpage>
          -
          <lpage>513</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.</given-names>
            <surname>Amer-Yahia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V. S.</given-names>
            <surname>Lakshmanan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Pandit</surname>
          </string-name>
          .
          <article-title>Flexpath: flexible structure and full-text querying for xml</article-title>
          .
          <source>In SIGMOD'04</source>
          , pages
          <fpage>83</fpage>
          -
          <lpage>94</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Campi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Damiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guinea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marrara</surname>
          </string-name>
          , G. Pasi, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Spoletini</surname>
          </string-name>
          .
          <article-title>A fuzzy extension of the xpath query language</article-title>
          .
          <source>J. Intell. Inf. Syst.</source>
          ,
          <volume>33</volume>
          (
          <issue>3</issue>
          ):
          <fpage>285</fpage>
          -
          <lpage>305</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mandelbrod</surname>
          </string-name>
          , Y. Mass, and
          <string-name>
            <given-names>A.</given-names>
            <surname>Soffer</surname>
          </string-name>
          .
          <article-title>Searching xml documents via xml fragments</article-title>
          .
          <source>In SIGIR'03</source>
          , pages
          <fpage>151</fpage>
          -
          <lpage>158</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          .
          <article-title>Controlling overlap in content-oriented xml retrieval</article-title>
          .
          <source>In SIGIR'05</source>
          , pages
          <fpage>314</fpage>
          -
          <lpage>321</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mamou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kanza</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sagiv</surname>
          </string-name>
          .
          <article-title>Xsearch: a semantic search engine for xml</article-title>
          .
          <source>In VLDB'03</source>
          , pages
          <fpage>45</fpage>
          -
          <lpage>56</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. E. Damiani,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marrara</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasi. Fuzzyxpath</surname>
          </string-name>
          :
          <article-title>Using fuzzy logic an ir features to approximately query xml documents</article-title>
          .
          <source>In IFSA'07</source>
          , pages
          <fpage>199</fpage>
          -
          <lpage>208</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. E. Damiani,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marrara</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasi</surname>
          </string-name>
          .
          <article-title>A flexible extension of xpath to improve xml querying</article-title>
          .
          <source>In SIGIR'08</source>
          , pages
          <fpage>849</fpage>
          -
          <lpage>850</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>D.</given-names>
            <surname>Florescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kossmann</surname>
          </string-name>
          ,
          <string-name>
            <surname>and I. Manolescu.</surname>
          </string-name>
          <article-title>Integrating keyword search into xml query processing</article-title>
          .
          <source>Comput. Netw.</source>
          ,
          <volume>33</volume>
          (
          <issue>1-6</issue>
          ):
          <fpage>119</fpage>
          -
          <lpage>135</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>N.</given-names>
            <surname>Fuhr</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Grobjohann</surname>
          </string-name>
          .
          <article-title>Xirql: a query language for information retrieval in xml documents</article-title>
          .
          <source>In SIGIR'01</source>
          , pages
          <fpage>172</fpage>
          -
          <lpage>180</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>J. W. G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Effective keyword search for valuable lcas over xml documents</article-title>
          .
          <source>In CIKM</source>
          , pages
          <fpage>31</fpage>
          -
          <lpage>40</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>C. B. L. Guo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Shao</surname>
            and
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Shanmugasundaram</surname>
          </string-name>
          . Xrank:
          <article-title>Ranked keyword search over xml documents</article-title>
          .
          <source>In SIGMOD</source>
          , pages
          <fpage>16</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Y. K. S. Cohen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mamou</surname>
            and
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Sagiv. Xsearch</surname>
          </string-name>
          :
          <article-title>A semantic search engine for xml</article-title>
          .
          <source>In VLDB</source>
          , pages
          <fpage>45</fpage>
          -
          <lpage>56</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>T.</given-names>
            <surname>Schlieder</surname>
          </string-name>
          .
          <article-title>Similarity search in xml data using cost-based query transformations</article-title>
          .
          <source>In WebDB'01</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>A.</given-names>
            <surname>Theobald</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Adding relevance to xml</article-title>
          . In WebDB'
          <volume>00</volume>
          , pages
          <fpage>105</fpage>
          -
          <lpage>124</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>A.</given-names>
            <surname>Trotman</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Sigurbj</surname>
          </string-name>
          <article-title>¨ornsson. Narrowed extended xpath i (nexi)</article-title>
          .
          <source>In INEX'04</source>
          , pages
          <fpage>16</fpage>
          -
          <lpage>40</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20. W3C.
          <article-title>Xml path language (xpath) 2.0</article-title>
          . http://www.w3.org/TR/xpath20/,
          <year>November 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21. W3C.
          <article-title>Xquery 1.0: An xml query language</article-title>
          . http://www.w3.org/TR/xquery/,
          <year>November 2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>