<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Method for Evaluating Full-text Search Queries in Native XML Databases</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Roman</forename><surname>Pastukhov</surname></persName>
							<email>ignatich@mail.ru</email>
							<affiliation key="aff0">
								<orgName type="department">Institute for System Programming</orgName>
								<orgName type="institution">Russian Academy of Sciences</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">A Method for Evaluating Full-text Search Queries in Native XML Databases</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">6A14606CDA4F1734DE600E852C8FF34A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T02:43+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we consider the problem of efficiently producing results for full-text keyword search queries over XML documents. We describe full-text search query semantics and propose a method for efficient evaluation of keyword search queries with these semantics suitable for native XML databases. Method uses inverted file index which may be efficiently updated when a part of some XML document is updated.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>One of the main features of XML databases is ability to store semi-structured data as well as structured data. XQuery and XPath languages allow addressing parts of XML documents and querying them. These languages are convenient for querying regularly structured data. If data is semi-structured or its structure is unknown making queries using these languages becomes difficult. In such cases it is easier to use keyword search queries. One of the key advantages of keyword search queries is its simplicity -users do not have to learn a complex query language and can issue queries without prior knowledge about the structure of the underlying data.</p><p>Keyword searching over XML introduces new challenges. The result of a keyword search query is not always the entire document, but can be a deeply nested XML element. In general XML keyword search results can be arbitrarily nested elements, and returning the "deepest" node containing the keywords usually gives more context information (see <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>)</p><p>In this paper we describe semantics of full-text queries over XML documents and a method for evaluating such queries in an XML database system that supports XQuery data model <ref type="bibr" target="#b2">[3]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related work</head><p>The recent increase in the number of XML repositories <ref type="bibr" target="#b3">[4]</ref> has motivated extensive work on designing languages for XML full-text search <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>.</p><p>There has been extensive research in information retrieval on the efficient evaluation of full-text queries <ref type="bibr" target="#b2">[3]</ref>, including structured full-text queries <ref type="bibr" target="#b9">[10]</ref> and of XML queries such as XQuery/IR <ref type="bibr" target="#b10">[11]</ref>, XSEarch <ref type="bibr" target="#b4">[5]</ref>, XIRQL <ref type="bibr" target="#b6">[7]</ref>, XXL <ref type="bibr" target="#b7">[8]</ref> and Niagara <ref type="bibr" target="#b11">[12]</ref>. However, these works develop algorithms for specific full-text predicates in isolation.</p><p>The idea of computing the most specific elements for conjunctive queries has been actively explored using deepest common ancestors <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>. We extend this idea to support the efficient evaluation of queries with complex full-text predicates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Data Model &amp; Query Semantics</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">XML Data Model</head><p>The eXtensible Markup Language (XML) is a hierarchial format for data representation and exchange. An XML document consists of nested XML elements starting with root element. Each element can have attributes and text values, in addition to nested subelements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Full-text Search Query Semantics</head><p>Full-text search queries concerned in this paper are composed of keywords and four operationsconjunction, disjunction, proximity and order. Conjunction and disjunction operations combine several (at least 2) sub-queries into a single query. Proximity and order operations are applied to a single query to produce another query.</p><p>Consider a query Q consisting of several keywords and a mapping M that maps some of these words to occurrences of these words in an XML document. If Q has a sub-query q than M|q denotes a restriction of M to the set of keywords in q. Lets define a predicate matches(Q, M) to be true when one of the following conditions is true: Put values from current_match to output stream;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proceedings of the</head><p>Listing 1</p><p>• If Q is a single keyword and M has a mapping for this word.</p><formula xml:id="formula_0">• If Q is a conjunction query with sub-queries S={q0, q1, … , qn} and {∀q∈S | matches(q, M|q)} • If Q is a disjunction query with sub-queries</formula><p>S={q0, q1, … , qn} and {∃q∈S | matches(q, M|q)} • If Q is a proximity query with distance d and sub-query q, such than matches(q, M) is true and the maximal distance between the words in image of M is no more than (d-1). • If Q is an order query with sub-query q, such that matches(q, M) is true, M is injective and the order of the words in q matches the orders of their images defined by M. The result of query Q consists of elements E such that E is the "deepest" common ancestor of image of some mapping M that makes predicate matches(Q, M) true.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Query processing</head><p>In order to evaluate the result of a query Q we will translate this query into a query plan tree. Each node corresponds to one of the four operations (conjunction, disjunction, proximity or order), leaves correspond to keywords in the query and edges correspond to relations between queries and sub-queries.</p><p>Each operator node receives a list of all matches from its child nodes and produces a result to its parent.</p><p>Output of the root operator (corresponding to the whole query) is transformed to a set of nodes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Index structure</head><p>To allow efficient query evaluation we will use an inverted index which contains a list of all word occurrences (including position of word in node text) in nodes along with their identifiers and numbering scheme labels.</p><p>A numbering scheme assigns a unique label to each node of an XML document according to some schemespecific rules. The labels encode information about relative position of the node in the document. Thus, the main purpose of numbering scheme is to provide mechanisms to quickly determine the structural relationship between a pair of nodes.</p><p>Most native XML databases use some sort of numbering scheme.</p><p>We will require numbering scheme to provide these mechanisms:</p><p>(1) determining ancestor-descendent relationship between to nodes; <ref type="bibr" target="#b1">(2)</ref> comparing nodes by document order; <ref type="bibr" target="#b2">(3)</ref> determining whether two nodes have a common ancestor; Practically every numbering scheme used in native XML databases provides mechanisms (1) and <ref type="bibr" target="#b1">(2)</ref>. If it doesn't provide mechanism (3) we can easily modify it by adding document root identifier to numbering scheme labels. This modified numbering scheme will provide mechanism (3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Data used by operations</head><p>To represent a list of all matches that is transferred between operators we will use tuples that consist of lists of word occurrences similar to the contents of inverted files in the index. Operators in the query plan tree leaves corresponding to keywords in the query will simply read an inverted file for this word and return tuples consisting of a single list describing some node and word positions of the keyword in the text of this node. Just like inverted files, lists in tuples contain numbering scheme labels for nodes. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Listing 3</head><p>If a tuple consists of n lists, then we will say that width of this tuple is n (denoted as t.width in the pseudo-code, here t is a variable referencing some tuple).</p><p>A tuple represents a set of matches that is Cartesian product of sets of word occurrences in each list (i.e. if we choose one word in each list of a tuple we will get one of the matches represented by this tuple)</p><p>All nodes in a tuple always have a common ancestor. Tuples in an output stream of some operator are returned in the document order of their respective "deepest" common ancestors of nodes in each tuple.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Conjunction operation</head><p>This operator simply combines tuples from its subqueries to a single tuple. Pseudo-code for this operation is shown in listing 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Disjunction operation</head><p>Disjunction operation returns all tuples produced by its sub-query operators in document order. Pseudo-code for this operation is shown in listing 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Order operation</head><p>Order operation filters tuples returned by its sub-query operator, so that all word occurrences in matches are in the correct order (it's always corresponds to the orders of lists in tuple). This may split one tuple into several tuples. Pseudo-code for this operation is shown in listing 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Proximity operation</head><p>Proximity operation filters tuples returned by its subquery operator, so that all word occurrences in matches are within a window of specific size. Pseudo-code for this operation is shown in listing 4.</p><p>Since we do not have information about indices of the first word in the nodes, this operator may produce false matches which should be checked at later stages of query execution by computing exact values of node staring word index array (sw array in the pseudo-code). This is not needed if all words in the match that need to fit in some window are in the same node. output a tuple that consists of all words W1 from lists of elem, such that S(W1) &lt; S(W) + window (if word W1 is in i'th list of elem, it will be in the i'th list of the resulting tuple) to the output stream;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Listing 4 5 Conclusion</head><p>In this article we described semantics for full-text search queries over XML data and proposed query evaluation method suitable for native XML databases. Method uses inverted file indices which can be effectively updated: if node changes do not affect its ancestor or sibling nodes in the index.</p><p>Proposed method allows efficient evaluation of most full-text queries described in <ref type="bibr" target="#b8">[9]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Future work</head><p>Investigating the following problems may lead to improving the proposed query evaluation method:</p><p>• devise an efficient compression method suitable for proposed inverted file index (and a fixed numbering scheme); • add relevance calculation and see how can index be changed to allow efficient evaluation of ranked queries; • if the result of a full-text query will be presented as a set of nodes, some operations may not need to return a full set of matches that include word occurrences, instead they can return just a set of matching nodes. This may be used to improve query evaluation performance; • see whether the proposed method can be modified to allow "not" or "mild not" <ref type="bibr" target="#b8">[9]</ref> queries;</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Spring Young Researcher's Colloquium On Database and Information Systems SYRCoDIS, Moscow, Russia, 2007</head><label></label><figDesc></figDesc><table><row><cell cols="2">01 Op-And(lists)</cell></row><row><cell>02</cell><cell>Current_match[i] = &lt;empty list&gt; for all i = {0, …, lists.count -1}</cell></row><row><cell>03</cell><cell>while lists are not completely processed:</cell></row><row><cell>04</cell><cell>Choose i, such that lists[i] is the list with the lowest numbering scheme</cell></row><row><cell cols="2">number;</cell></row><row><cell>05</cell><cell>Elem = lists[i].next;</cell></row><row><cell>06</cell><cell>If nodes in current_match have no common ancestor with elem:</cell></row><row><cell>07</cell><cell>If all lists in current_match are not empty:</cell></row><row><cell>08</cell><cell>Put values from current_match to output stream</cell></row><row><cell>09</cell><cell>Current_match[i] = &lt;empty list&gt; for all i = {0, …, lists.count -1}</cell></row><row><cell>10</cell><cell>Add elem to current_match[i];</cell></row><row><cell>12</cell><cell>If all lists in current_match are not empty:</cell></row><row><cell>13</cell><cell></cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">XIRQL: A Language for Information Retrieval in XML Documents</title>
		<author>
			<persName><forename type="first">N</forename><surname>Fuhr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Grobjohann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR Conf</title>
				<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Querying XML Documents Made Easy: Nearest Concept Queries</title>
		<author>
			<persName><forename type="first">A</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kersten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Windhouwer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDE Conf</title>
				<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<ptr target="http://www.w3.org/TR/xpath-datamodel/" />
		<title level="m">XQuery 1.0 and XPath 2.0 Data Model</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<ptr target="http://inex.is.informatik.uni-duisburg.de/2005/" />
		<title level="m">Initiative for the Evaluation of XML Retrieval</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">XSEarch: A Semantic Search Engine for XML</title>
		<author>
			<persName><forename type="first">S</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Y</forename><surname>Mamou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kanza</surname></persName>
		</author>
		<author>
			<persName><surname>Sagiv</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">VLDB</title>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Integrating Keyword Search into XML Query Processing</title>
		<author>
			<persName><forename type="first">D</forename><surname>Florescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kossmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Manolescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">WWW</title>
		<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">XIRQL: An Extension of XQL for Information Retrieval</title>
		<author>
			<persName><forename type="first">N</forename><surname>Fuhr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Grossjohann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR</title>
				<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The Index-Based XXL Search Engine for Querying XML Data with Relevance Ranking</title>
		<author>
			<persName><forename type="first">A</forename><surname>Theobald</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Weikum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EDBT</title>
				<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<ptr target="http://www.w3.org/TR/xquery-full-text/" />
		<title level="m">The World Wide Web Consortium</title>
				<imprint/>
	</monogr>
	<note>XQuery 1.0 and XPath 2.0 Full-Text. Working draft</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Fast Evaluation of Structured Queries for Information Retrieval</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">W</forename><surname>Brown</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR</title>
				<imprint>
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">XQuery/IR: Integrating XML Document and Data Retrieval</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Bremer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gertz</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002">2002</date>
			<pubPlace>WebDB</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">On Supporting Containment Queries in Relational Database Management Systems</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Naughton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lohman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGMOD</title>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">XRANK: Ranked Keyword Search over XML Documents</title>
		<author>
			<persName><forename type="first">L</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Botev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shanmugasundaram</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGMOD</title>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Querying XML Documents Made Easy: Nearest Concept Queries</title>
		<author>
			<persName><forename type="first">A</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kersten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Windhouwer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICDE</title>
				<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Efficient Keyword Search for Smallest LCAs in XML Databases</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Papakonstantinou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGMOD</title>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
