A flexible XML query language for NON
                     dummies

          Stefania Marrara1, Emanuele Panzeri2 , and Gabriella Pasi2
                     1
                        Università degli Studi di Milano - DTI
      via Bramante 65, 26013 Crema (CR), Italy, stefania.marrara@unimi.it
                2
                  Università degli Studi di Milano Bicocca - DISCo
     viale Sarca 336, 20126 Milano (MI), Italy, {panzeri,pasi}@disco.unimib.it


      Abstract. This paper introduces and motivates our proposal of an
      XPath extension which allows the definition of queries with flexible con-
      straints on both content and structure. The proposed language allow
      expert users to benefit of the recall improvements of flexible languages
      while using their collection knowledge to improve the retrieval precision.


1   Introduction and Motivation
XML (Extensible Markup Language), a World Wide Web Consortium (W3C)
standard for the World Wide Web, since its first introduction in 1998 has proved
its ability to be the basis for the data interchange on the Internet. Differently
from the most popular HyperText Markup Language (HTML), XML allows doc-
ument designers to set up their own tag vocabulary, and to describe the structure
of documents by defining tag nesting. It is widely recognized that, in order to
exploit the power of XML, a query language for XML documents should al-
low constraint formulation on both document content and structure. In other
words, it should represent tag nesting as well as their content. The birth of huge
collections of XML documents has implied the definition of appropriate query
languages, whose main representatives are today the W3C standards XPath [20]
and XQuery [21]. The main issue with these XML query languages is that they
assume the user be fully aware of the target document structure, and allow
only an exact specification of the target documents, due to the boolean nature
of their query-document matching systems. This assumption is debatable since
most XML documents have no pre-set structure (DTD or XML Schema); even
worse, it requires the user to write a different query for each variation of the
document structure itself. In order to tackle this problem, in the last few years
both Information Retrieval (IR) and Database (DB) communities proposed flex-
ible XML query languages with search paths that provide a loose example of
the information the user is interested in. In Information Retrieval, the propos-
als are broadly classified as content-only search (CO) and content and structure
search (CAS). CO approaches usually allow keyword based queries without any
possibility to specify constraints on the expected document structure. Most of
the existing keyword search systems return query results based on the notion of
Lowest Common Ancestor (LCA) [15], and its variants [16, 14]. On the opposite
side, CAS approaches focus on approximate matching of limited XPath pred-
icates (usually the child axis is transformed into a descendant axis during the
query evaluation), and on designing indexes to score document fragments. These
approaches are based on the notion of structural hint which considers the query
structure as a mere template of the required information. All fragments similar
to the template are retrieved. In the DB community, the most important con-
tributes are XPath Full Text (XPathFT), and XQuery Full-Text (XQueryFT)
[1], which include full-text capabilities in the traditional query languages. Differ-
ently from CAS approaches, XPathFT and XQuery FT do not include structural
hints but query structures are evaluated as in the traditional standards. All men-
tioned approaches focus on the notion of dummy user: the main idea is that users
are often unable or too lazy in order to formalize complex queries and therefore
keyword based queries or queries without too strict structural constraints are
preferred. In this work we start from a different perspective. In the last years
many important collections such as DBPL, Wikipedia, the Cancer Gene Disease
and Gene Compound or the Protein Data Bank (PDBML) have been created
by the scientific and research communities. Users belonging to these communi-
ties are often scientists, high-educated people who, due to their continue use,
develop a good, even if partial (due to its dimensions), knowledge of the col-
lection document structure, and therefore they are both able to and interested
in formalizing complex queries if this can improve the retrieval precision. With
this idea in mind, in [11] a flexible XPath extension has been proposed (briefly
sketched in Section 2), which allows users to define the extent and type of de-
sired constraint relaxation in the query both on content and structure. In this
approach the query is not considered as a mere template of the required infor-
mation, as the user can include exact and flexible constraints in the same query
on both content and structure. In this way the user can benefit of the recall
improvements typical of flexible languages while exploiting his/her knowledge of
the considered collection to improve the retrieval precision.

1.1 Related Work
Recent research in both Information Retrieval and Database communities has
led to several approaches aimed at introducing some degrees of flexibility in
XML retrieval [9, 12, 18]. In the Information Retrieval research context, CO ap-
proaches address the issue of querying XML documents by using a keyword
based approach [2, 7, 8], while CAS models consider both document content and
structure in the retrieval process: just as the keywords are hints, so are the
structural constraints. Examples of CAS approaches are offered by, for instance,
XIRQL [13], NEXI [19], TeXQuery [3], and FleXPath [5], and the recent stan-
dard XQuery and XPath Full Text 1.0 [1]. Other approaches proposed in the
literature define some flexibility on the evaluation of the query structure in a
more explicit way. For instance, in [13], [17], and [4] the authors define some
relaxations such as the introduction of generalized data types, the adoption of
edit distances on paths, and some operations to modify the structure like delete
a node, insert intermediate nodes or rename a node. FleXPath [5] is the first ap-
proach proposing a formalization of relaxations on the evaluation of the structure
of XML queries, as well as the first algebraic framework for spanning relaxations.
In addition, it proposes new ranking functions with properties that relaxations
must satisfy, and it develops efficient evaluation algorithms.
    All the above mentioned approaches, however, do not allow users to define the
extent and the type of the desired flexible constraints in the query: the query is
considered as a mere template of the required information. All fragments similar
to the template are retrieved. By these approaches the user has no possibility
to distinguish between portions of the query that must be considered as exact
and those that allow a certain flexibility in the retrieval process. In order to
allow users to explicitly specify both content-based and structure-based flexible
constraints in a query aimed at retrieving XML fragments, in [11, 6] a flexible
language has been proposed and defined as an extension of the XPath query
language. The flexible constraints have been syntactically defined in compliance
with the XPath language; the query evaluation produces a score, expressing the
degree of compatibility of the document’s content/structure with respect to the
user requirement. In Section 2 this language is briefly presented.
2   A flexible extension of XPath
In [6, 10, 11] a new approach aimed at introducing flexibility in the XPath query
language has been defined. This new language extends the syntax of XPath
to allow the definition of some flexible constraints on both the content (the
defined constraints are named cw and around ), and the structure of the XML
document (the defined constraints are named near, below, and approximately).
Informally, cw (contain words) is applied to nodes that have a textual content,
and is introduced to specify keyword-based constraints, as in usual IR query
languages. The constraint cw is followed by the list of search terms to be retrieved
in the textual element.
    The constraint around is applied to numeric or date values (within specific
numeric or data content nodes), and it requires their approximate evaluation.
The structure-based constraints below and near, inserted as a flexible axis of a
path expression, allow to extract XML fragments (i.e. elements, attributes or
text) that are, respectively, direct descendants or connected through any path
to the current node (also allowing users to fix a maximum path length). For
each retrieved fragment a retrieval status value is computed that is inversely
proportional to the distance (measured in number of arcs) between the ideal path
structure and the retrieved one. Finally, the constraint approximately allows to
select the elements with a given name that have a number of direct descendants
close to the one indicated in the query.


3   Conclusions
The aim of this paper was to introduce and motivate the proposal of an XPath
extension that allows the specification of flexible constraints on both content
and structure of XML documents. The proposed language, briefly sketched in
Section 2, allows expert users to benefit of the recall improvements typical of
flexible languages. In future work the full syntax and semantics of this XPath
extension will be presented, as well as the definition of an appropriate data
structure; moreover evaluations will be performed.


References
 1. XQuery and XPath Full Text 1.0. http://www.w3.org/TR/xpath-full-text-10/.
 2. M. S. Ali, M. P. Consens, G. Kazai, and M. Lalmas. Structural relevance: a
    common basis for the evaluation of structured document retrieval. In CIKM’08,
    pages 1153–1162, 2008.
 3. S. Amer-Yahia, C. Botev, and J. Shanmugasundaram. Texquery: a full-text search
    extension to xquery. In WWW’04, pages 583–594, 2004.
 4. S. Amer-Yahia, S. Cho, and D. Srivastava. Tree pattern relaxation. In EDBT’02,
    pages 496–513, 2002.
 5. S. Amer-Yahia, L. V. S. Lakshmanan, and S. Pandit. Flexpath: flexible structure
    and full-text querying for xml. In SIGMOD’04, pages 83–94, 2004.
 6. A. Campi, E. Damiani, S. Guinea, S. Marrara, G. Pasi, and P. Spoletini. A fuzzy
    extension of the xpath query language. J. Intell. Inf. Syst., 33(3):285–305, 2009.
 7. D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass, and A. Soffer. Searching xml
    documents via xml fragments. In SIGIR’03, pages 151–158, 2003.
 8. C. L. A. Clarke. Controlling overlap in content-oriented xml retrieval. In SIGIR’05,
    pages 314–321, 2005.
 9. S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. Xsearch: a semantic search engine
    for xml. In VLDB’03, pages 45–56, 2003.
10. E. Damiani, S. Marrara, and G. Pasi. Fuzzyxpath: Using fuzzy logic an ir features
    to approximately query xml documents. In IFSA’07, pages 199–208, 2007.
11. E. Damiani, S. Marrara, and G. Pasi. A flexible extension of xpath to improve xml
    querying. In SIGIR’08, pages 849–850, 2008.
12. D. Florescu, D. Kossmann, and I. Manolescu. Integrating keyword search into xml
    query processing. Comput. Netw., 33(1-6):119–135, 2000.
13. N. Fuhr and K. Grobjohann. Xirql: a query language for information retrieval in
    xml documents. In SIGIR’01, pages 172–180, 2001.
14. J. W. G. Li, J. Feng and L. Zhou. Effective keyword search for valuable lcas over
    xml documents. In CIKM, pages 31–40, 2007.
15. C. B. L. Guo, F. Shao and J. Shanmugasundaram. Xrank: Ranked keyword search
    over xml documents. In SIGMOD, pages 16–27, 2003.
16. Y. K. S. Cohen, J. Mamou and Y. Sagiv. Xsearch: A semantic search engine for
    xml. In VLDB, pages 45–56, 2003.
17. T. Schlieder. Similarity search in xml data using cost-based query transformations.
    In WebDB’01, 2001.
18. A. Theobald and G. Weikum. Adding relevance to xml. In WebDB’00, pages
    105–124, 2001.
19. A. Trotman and B. Sigurbjörnsson. Narrowed extended xpath i (nexi). In INEX’04,
    pages 16–40, 2004.
20. W3C. Xml path language (xpath) 2.0. http://www.w3.org/TR/xpath20/, Novem-
    ber 2007.
21. W3C. Xquery 1.0: An xml query language. http://www.w3.org/TR/xquery/,
    November 2007.