<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">One Query to Bind Them All</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Daniel</forename><forename type="middle">M</forename><surname>Herzig</surname></persName>
							<email>herzig@kit.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">Institute AIFB Karlsruhe Institute of Technology</orgName>
								<address>
									<postCode>76128</postCode>
									<settlement>Karlsruhe</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Thanh</forename><surname>Tran</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Institute AIFB Karlsruhe Institute of Technology</orgName>
								<address>
									<postCode>76128</postCode>
									<settlement>Karlsruhe</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">One Query to Bind Them All</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">A15C6FE33DD642A153AB89AF6F768655</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T11:48+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Recently, SPARQL became the standard language for querying RDF data on the Web. Like other formal query languages, it applies a Boolean-match semantics, i.e. results adhere strictly to the query. Thus, queries formulated for one dataset cannot easily be reused for querying other datasets. If another target dataset is to be queried, the queries need to be rewritten using the vocabulary of the target dataset, while preserving the captured information need. This is a tedious manual task, which requires knowledge of the target vocabulary and often relies on computationally expensive techniques, such as mapping, data consolidation or reasoning methods. Given the scale as well as the dynamics of Web datasets, even automatic rewriting is often infeasible. In this paper, we elaborate on a novel approach that allows reusing existing SPARQL queries adhering to one dataset to search for entities in other datasets, which are neither linked nor otherwise integrated beforehand. We use the results returned by the given seed query to construct an entity relevance model (ERM), which captures the content and the structure of relevant results. Candidate entities from the target dataset are obtained using existing keyword search techniques and subsequently ranked according to their similarity to the ERM. During this ranking task, we compute mappings between the structure of the ERM and of the candidates on-the-fly. The effectiveness of this approach is shown in experiments using large-scale datasets and compared to a keyword search baseline.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Publishing structured data on the Web according to the Linked Data principles has been advocated by the Linked Data movement and the Semantic Web community, and recently also receives increasing support from companies like Yahoo!, Google, and Facebook, as well as from governmental institutions. The amount of Linked Data on the Web increases rapidly and is now in the order of several billion RDF triples. In order to query these structured data effectively, SPARQL, an expressive formal query language, has been developed; it became the standard recommended by the W3C in 2008 <ref type="bibr" target="#b0">[1]</ref>. However, like other formal query languages, e.g. SQL for relational databases, this query language applies a Boolean-match semantics, which means that results to a query are crisp and strictly adhere to the constraints specified in the query.</p><p>As a consequence, these queries are bound to specific datasets, respectively the corresponding vocabularies. Whereas the database community relies mainly on query rewriting <ref type="bibr" target="#b1">[2]</ref>, the answer of the Linked Data community to this problem is (1) interlinking datasets manually and with the help of mapping and consolidation techniques <ref type="bibr" target="#b2">[3]</ref>, (2) promoting the reuse of vocabularies, and (3) exploiting the semantics of RDF for reasoning <ref type="bibr" target="#b3">[4]</ref>. All these data integration approaches require high upfront investments, given the scale as well as the heterogeneity of Web datasets, before the actual querying across datasets is possible. Moreover, most of these integration steps need to be performed again if the underlying data changes, which is frequently the case in the dynamic data Web scenario.
Another way to bridge this schema- and data-mismatch is to apply information retrieval techniques, in particular keyword search, which crosses the gap between two datasets by simply ignoring the structure and applying the 'bag of words' notion not just to webpages, but also to data retrieval <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>.</p><p>In this paper, we elaborate on a novel approach, which allows searching for entities in a target dataset immediately, without any prior integration, using a star-shaped SPARQL query adhering to the vocabulary of another dataset. This approach combines the flexibility of unstructured information retrieval solutions with the nature of database-style querying by incorporating the structure of the underlying data. The idea is to start with a structured seed query specified for one particular data source. Based on the content as well as the structure of results obtained from this data source, we construct an Entity Relevance Model (ERM) that can be seen as an abstract representation of relevant results. Instead of relying on mappings for rewriting the structured seed query, we simply treat the seed query as a keyword query and submit it against other data sources to obtain additional results. Then, we create mappings between the structure of candidate results and the structure of the ERM at runtime, which are used for the subsequent matching and ranking. Candidates matching more closely the content as well as the structure captured by the ERM are ranked higher. This way, we can use unstructured keywords for querying, but at the same time incorporate structure information into the matching and ranking that are part of the search process.</p><p>We investigate the effectiveness of our approach in experiments using DBpedia as the source dataset and an RDF representation of IMdb as the target dataset. Our approach is not limited to these datasets, but general enough to be applied to any datasets.
However, the scenario involving DBpedia also illustrates an important use case: DBpedia, as the nucleus of the Web of data, is probably the best known dataset, and many users already have SPARQL queries or know how to create such queries for DBpedia. In addition, many applications are built on top of DBpedia, which will also benefit if the underlying retrieval incorporates entities from other datasets on-the-fly.</p><p>Contributions. The contributions of this work can be summarized as follows: (1) We propose a novel approach for searching Web data sources that does not rely on upfront data integration, but uses an Entity Relevance Model (ERM) for searching as well as for computing mappings on-the-fly, in line with the pay-as-you-go data integration <ref type="bibr" target="#b6">[7]</ref> paradigm. (2) We provide one specific implementation of this approach and show how an ERM can be constructed and applied to rank search results exploiting mappings created at runtime. (3) In experiments, we investigate the feasibility and effectiveness of our approach and compare it to a common keyword search baseline.</p><p>Outline. Section 2 defines the research problem and briefly sketches our new approach. This approach of relevance-based on-the-fly mappings is presented in detail in Section 3. Evaluation results are presented in Section 4. Section 5 discusses related work before we conclude in Section 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Overview</head><p>In this section we define the setting of the addressed problem, present existing solutions, and briefly sketch the framework we propose to deal with this problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Data on the Web</head><p>The problem we address is situated in a Web data scenario. Therefore, the kind of Web data that is of most interest is RDF data. For reasons of simplicity and generality, we employ a generic graph-based data model that omits specific RDF features such as blank nodes. In this model, entity nodes are RDF resources, literal nodes correspond to RDF literals, attributes are RDF properties, and edges stand for RDF triples.</p><p>Data Graph. The data graph is a directed and labeled graph G = (N, E). The set of nodes N is a disjoint union of entities N E and literals N L , i.e. N = N E ∪ N L . Edges E can be conceived as a disjoint union E = E E ∪ E L of edges representing connections between entities, i.e. a(e 1 , e 2 ) ∈ E E iff e 1 , e 2 ∈ N E , and connections between entities and literals, i.e. a(e 1 , v) ∈ E L iff e 1 ∈ N E and v ∈ N L . Given this graph structure, we define the set of edges including the targeted nodes A(e) = {a(e, •) ∈ E} as the description of the entity e ∈ N E , and each member a(e, •) ∈ A(e) is called an attribute of e. Thus, we in fact neglect the distinction between E E and E L and simply refer to all edges as attributes. The set of distinct attribute labels of an entity e, i.e. A'(e) = {a | a(e, •) ∈ A(e)}, is called the model of e.</p></div>
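To make the data model concrete, the following minimal sketch (illustrative only; the class and all identifiers are hypothetical, not part of the paper) models entity descriptions A(e) and entity models A'(e):

```python
# Hypothetical sketch of the paper's graph data model: an entity's
# description A(e) is its set of outgoing edges, and its model A'(e)
# is the set of distinct attribute labels. All names are illustrative.
from collections import defaultdict

class DataGraph:
    def __init__(self):
        # entity -> list of (attribute_label, value) pairs, where a value
        # is either another entity or a literal string
        self.edges = defaultdict(list)

    def add(self, entity, label, value):
        self.edges[entity].append((label, value))

    def description(self, entity):
        """A(e): all attributes (label, value) of the entity."""
        return list(self.edges[entity])

    def model(self, entity):
        """A'(e): the set of distinct attribute labels of the entity."""
        return {label for label, _ in self.edges[entity]}

g = DataGraph()
g.add("e1", "director", "Rainer Werner Fassbinder")
g.add("e1", "label", "World on a Wire")
g.add("e1", "starring", "Klaus Löwitsch")
```

Note that, as in the model above, the sketch deliberately treats entity-entity and entity-literal edges uniformly as attributes.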
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Research Problem</head><p>Given this model of Web data, structured queries can be specified to search over such datasets. The most commonly used language for querying RDF data on the Web is SPARQL <ref type="bibr" target="#b0">[1]</ref>. One essential feature of SPARQL is the Basic Graph Pattern (BGP), which conceptually resembles the notion of conjunctive queries that constitute an important fragment of other widely used structured query languages such as SQL. Basically, a BGP is a set of conjunctive triple patterns, each of the form predicate(subject, object). They represent patterns because predicate, subject or object might either be a variable or be specified explicitly (as a constant). Answering these queries amounts to the task of graph pattern matching, where subgraphs in the data graph matching the query pattern are returned as results. Predicates are matched against attributes in the data, whereas bindings to subjects and objects in the query are entities and literal values, respectively.</p><p>One particular form of BGP of high importance is the so-called entity query. Essentially, entity queries are star-shaped queries with the node in the center of the star representing the entity (or entities) to be retrieved. Figure <ref type="figure" target="#fig_3">3</ref> provides several examples. According to a recent Web query log study performed by Yahoo! researchers, this type of query is the most common on the Web <ref type="bibr" target="#b7">[8]</ref>, and an analysis of SPARQL logs showed that it is also the most common type used in the Semantic Web context. Further, most of the current Semantic Web search engines such as Sig.ma<ref type="foot" target="#foot_0">1</ref> and Falcons <ref type="bibr" target="#b4">[5]</ref> focus on answering this type of query. For the sake of clarity, we also focus on this type of query in this paper to illustrate the main ideas underlying our approach.</p><p>Problem.
The problem is finding relevant entities in a set of target datasets D_t given a source dataset D_s and an entity query q_s adhering to the vocabulary of D_s. Clearly, if there are no syntactical differences at the level of schema and data as discussed above, then q_s can directly be used to retrieve information from D_t. However, given that Web data is heterogeneous in this sense, the following two strategies can be applied.</p><p>Baseline KW. The first and most widespread solution to this end is to use keyword search over so-called 'bag-of-words' representations of entities. That is, the description of an entity is simply a set of terms. A query is also a set of terms, which is then matched against the flat term-based representation of the data. This unstructured approach bridges differences at the level of schemas and data simply by ignoring the fine-grained semantics and structure available in the entity descriptions, as illustrated in Fig. <ref type="figure" target="#fig_1">1</ref>. Clearly, this approach is simple but also flexible in the sense that the same keyword query specified for D_s can also be used for D_t, because results from D_t can be obtained whenever there are matches at the level of terms.</p><p>Our approach. In this paper, we present a framework to address this problem of searching for entities in Web data using on-the-fly mappings computed in a pay-as-you-go fashion based on entity relevance models (ERM). This framework is instantiated in the following three steps. (1) First, we compute an ERM from the results returned from the source dataset D_s using q_s. (2) Second, we present a light-weight on-the-fly integration technique, which maps the structure of result candidates of the target datasets D_t to the structure of the ERM. (3) Finally, the result candidates are ranked according to their similarity to the ERM using the mappings computed at runtime.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Entity Search Over Web Data</head><p>In this section, we present how the entity relevance model is constructed and discuss how this model can be exploited for ranking and relevance-based on-the-fly data integration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Entity Relevance Model</head><p>We aim to build a model that captures the structure and content of entities relevant to the information need expressed in the seed entity query q_s. This proposed model is called the Entity Relevance Model (ERM). It is based on the notion of structured relevance models proposed before <ref type="bibr" target="#b8">[9]</ref>.</p><p>The ERM is a composite model that consists of several atomic language models <ref type="bibr" target="#b9">[10]</ref> and is constructed as follows. Let q_s be the query submitted against the source dataset D_s, and R_s = {e_1, . . . , e_i, . . . , e_m} be the set of m entities obtained for this query. Across all m entities, we observe n distinct attribute labels a^1_s , . . . , a^j_s , . . . , a^n_s. For each observed attribute label a^j_s, we add a field a^j_s to the ERM and create an attribute-specific language model ERM_{a^j_s} to instantiate this field. This atomic model ERM_{a^j_s}(w) specifies the probability of w occurring in the field a^j_s, for every word w ∈ V, where V is the vocabulary of all words. ERM_{a^j_s}(w) is computed from the entity descriptions a^j_s ∈ A(e), ∀e ∈ R_s. In particular, every ERM_{a^j_s}(w) captures the values v (i.e. the content) of the attributes with label a^j_s and is defined as follows:</p><formula xml:id="formula_0">ERM_{a^j_s}(w) = \frac{\sum_{e \in R_s} \sum_{v \in a^j_s} n(w, v)}{\sum_{e \in R_s} \sum_{v \in a^j_s} |v|}<label>(1)</label></formula><p>where n(w, v) denotes the number of occurrences of word w in the content value v of attribute a^j_s and |v| the length of v. Note that the outer sum goes over the entities e ∈ R_s returned by q_s and the inner sum goes over all values v of attributes with label a^j_s. Thus, entity descriptions that do not have the attribute a^j_s do not contribute to ERM_{a^j_s}(w).
In order to capture the importance of the atomic attribute-specific models, we define the probability of occurrence P(a^j_s) for every ERM_{a^j_s}(w) as P(a^j_s) = n(a^j_s , e_s^{1...m})/m, where the numerator denotes the number of entities having attributes with label a^j_s and m = |R_s| is the total number of entity results. In summary, an ERM has n fields; each corresponds to an attribute with a specific label and has a corresponding language model with a probability of occurrence. An example of an ERM constructed from two entities is illustrated in Fig. <ref type="figure">2</ref>.</p><p>Fig. <ref type="figure">2</ref>. (1) Two entities e_1 and e_2; (2) the ERM built from them, with a field for each attribute label a_s, each weighted with P(a_s) and equipped with a relevance model ERM_{a_s}(w) defining the probability of w occurring in this field; (3) the representation of an entity e_t with language models for each attribute with label a_t.</p></div>
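The construction above (Equation 1 plus the field weights P(a^j_s)) can be sketched as follows; the representation of entities as dicts of attribute values and all names are assumptions made for illustration, not the authors' implementation:

```python
# Illustrative construction of an ERM (Equation 1) from a result set R_s.
# Each entity is a dict mapping attribute labels to lists of string values.
from collections import Counter, defaultdict

def build_erm(results):
    counts = defaultdict(Counter)   # field -> summed word counts n(w, v)
    lengths = Counter()             # field -> summed value lengths |v|
    occurs = Counter()              # field -> number of entities having it
    for entity in results:
        for label, values in entity.items():
            occurs[label] += 1
            for v in values:
                words = v.lower().split()
                counts[label].update(words)
                lengths[label] += len(words)
    m = len(results)
    erm = {}
    for label in counts:
        # ERM_{a}(w): maximum likelihood estimate over the field's values
        lm = {w: n / lengths[label] for w, n in counts[label].items()}
        # P(a): fraction of result entities that carry this attribute label
        erm[label] = {"p": occurs[label] / m, "lm": lm}
    return erm

e1 = {"director": ["Rainer Werner Fassbinder"], "label": ["World on a Wire"]}
e2 = {"director": ["Rainer Werner Fassbinder"], "starring": ["Barbara Valentin"]}
erm = build_erm([e1, e2])
```

In this toy run, the director field occurs in both entities, so its weight P is 1, while starring occurs in only one of the two, giving P = 0.5, mirroring the occurrence probabilities described above.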
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Searching over Web Data Using ERM</head><p>We tackle the problem of searching over heterogeneous data in a way similar to entity consolidation. That is, given the results e_s ∈ R_s from the source dataset obtained for the seed query, we aim to find out which entities in the target datasets are similar to R_s. We use the ERM as the model of those relevant results. In particular, we estimate which entities e_t of D_t are relevant for the query q_s by measuring their similarity to the ERM and rank them by decreasing similarity. First, we capture the values of a^k_t of each entity e_t using a language model e_t^{a^k_t}(w). Similar to the ERM, this model is defined based on the number of word occurrences as follows:</p><formula xml:id="formula_2">e_t^{a^k_t}(w) = \frac{\sum_{v \in a^k_t} n(w, v)}{\sum_{v \in a^k_t} |v|}<label>(2)</label></formula><p>As before, the sum goes over all values of attributes with label a^k_t, n(w, v) denotes the number of occurrences of w in v, and |v| denotes the length of v. Fig. <ref type="figure">2</ref> (3) illustrates an example.</p><p>The similarity of a candidate entity e_t from the target dataset D_t to the ERM is calculated as the sum of the weighted similarities between the language model of field a^j_s and the language model of a^k_t, which is calculated by their cross entropy H:</p><formula xml:id="formula_3">Sim(ERM, e_t) = \sum_{a^j_s} \beta_{a^j_s} \cdot P(a^j_s) \cdot H(ERM_{a^j_s} \| e_t^{a^k_t})<label>(3)</label></formula><p>In particular, we apply this similarity calculation only when we know which field a^j_s of the ERM should be matched against which attribute a^k_t of e_t. We address this problem in the next section and show how the ERM can be exploited to create on-the-fly schema mappings, i.e. mappings between an attribute a^k_t and a field a^j_s of the ERM. Equation 3 applies to corresponding pairs of attribute and field. If there is no mapping between a^j_s and a^k_t, then we use a "maximum distance".
This distance is computed as the cross entropy between ERM_{a^j_s} and a language model that contains all words in the vocabulary except the ones of ERM_{a^j_s}. The similarity for each field a^j_s is weighted by the occurrence probability of this field, P(a^j_s), and the parameter β_{a^j_s}. This parameter gives us the flexibility to boost the importance of attributes that occur in the query q_s as follows:</p><formula xml:id="formula_4">\beta_{a^j_s} = \begin{cases} 1 &amp; \text{if } a^j_s \notin q_s \\ b &amp; \text{if } a^j_s \in q_s,\; b &gt; 1 \end{cases}<label>(4)</label></formula><p>For constructing the language models of the ERM and of the candidate entities, a maximum likelihood estimation has been used, which is proportional to the count of the words in an attribute value. However, such an estimation assigns zero probabilities to words not occurring in the attribute value. In order to address this issue, e_t^{a^k_t}(w) is smoothed using a collection-wide model c_s(w), which captures the probability of w occurring in the entire dataset D_s. This smoothing is controlled by the Jelinek-Mercer parameter λ. As a result, the entropy H is calculated over the vocabulary V of field a^j_s as:</p><formula xml:id="formula_5">H(ERM_{a^j_s} \| e_t^{a^k_t}) = \sum_{w \in V_{a^j_s}} ERM_{a^j_s}(w) \cdot \log(\lambda \cdot e_t^{a^k_t}(w) + (1 - \lambda) \cdot c_s(w))<label>(5)</label></formula></div>
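Equations 3 and 5 can be sketched as follows, assuming language models are given as dictionaries of word probabilities. The default floor standing in for the collection probability of unseen words, the parameter values, and all names are assumptions of this illustration; the "maximum distance" term for unmapped fields is only indicated by a comment:

```python
# Sketch of the smoothed cross entropy H (Equation 5) and the weighted
# similarity Sim (Equation 3), following the sign convention of Equation 5.
import math

def cross_entropy(erm_lm, cand_lm, coll_lm, lam=0.8):
    h = 0.0
    for w, p in erm_lm.items():
        # Jelinek-Mercer smoothing with the collection model c_s(w);
        # 1e-6 is an assumed floor for words unseen in the collection
        q = lam * cand_lm.get(w, 0.0) + (1 - lam) * coll_lm.get(w, 1e-6)
        h += p * math.log(q)
    return h

def similarity(erm, candidate, mapping, coll_lm, b=10.0, query_fields=()):
    # mapping: ERM field -> candidate attribute label (or None)
    sim = 0.0
    for field, info in erm.items():
        target = mapping.get(field)
        if target is None:
            continue  # the paper's "maximum distance" term would go here
        beta = b if field in query_fields else 1.0   # Equation 4
        sim += beta * info["p"] * cross_entropy(info["lm"], candidate[target], coll_lm)
    return sim

erm = {"director": {"p": 1.0, "lm": {"fassbinder": 0.5, "rainer": 0.5}}}
candidate = {"directedBy": {"fassbinder": 0.5, "rainer": 0.5}}
score = similarity(erm, candidate, {"director": "directedBy"}, {})
```

With this sign convention a candidate whose mapped attribute shares the ERM field's vocabulary scores higher (closer to zero) than one that does not, which is what the ranking relies on.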
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">On-The-Fly Integration Using ERM</head><p>We want to determine which attribute of an entity needs to be compared to a given field of the ERM constructed for D_s. The ERM is not only used for search, but also exploited for this alignment task. The details for computing mappings between entity attributes a^1_t , . . . , a^r_t and ERM fields a^1_s , . . . , a^n_s are presented in Algorithm 1. The rationale of the algorithm is that a field a^j_s is aligned to an attribute a^k_t when the cross entropy H(ERM_{a^j_s} || e_t^{a^k_t}) between their language models is low, i.e. a mapping is established if H is lower than a threshold t (normalized based on the highest cross entropy, line 12). That is, entropy is used as a similarity measure, and the features used to compare attributes are based on their entire contents represented as language models. The algorithm performs n · r comparisons in the worst case for an ERM with n fields and an entity with r = |A'(e_t)| attribute labels. Note that this on-the-fly algorithm operates only on entities that are requested as part of the search process, i.e. n and r are rather small compared to full-fledged upfront integration that takes the entire schema into account, see Table <ref type="table" target="#tab_3">1 and Table 3</ref>. Notably, this computation comes for free: searching also requires the same computation (Equation <ref type="formula" target="#formula_3">3</ref>), so the entropy values computed here are kept and subsequently reused for that. Moreover, for faster performance, ERM fields having an occurrence probability of P(a^j_s) &lt; c can be pruned, because of their negligible influence. In addition, existing mappings can be used to reduce the number of comparisons even further.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 On-the-fly Alignment</head><p>Input: ERM fields a^1_s , . . . , a^n_s ; entity e_t with attributes a^1_t , . . . , a^r_t ; threshold t ∈ [0, 1]. Output: Mappings A := {(a^j_s , a^k_t) | a^j_s ∈ ERM, a^k_t ∈ A'(e_t) ∪ {null}}.</p><p>1: A := new Map
2: for all a^j_s ∈ ERM do
3:    candMappings := new OrderedByValueMap
4:    for all a^k_t ∈ A'(e_t) do
5:       if a^k_t ∉ A.values then // if not already aligned
6:          h ← H(ERM_{a^j_s} || e_t^{a^k_t}) // see Equation (5)
7:          candMappings.add(a^k_t , h)
8:       end if
9:    end for
10:   bestA ← candMappings.firstValue
11:   worstA ← candMappings.lastValue
12:   if bestA &lt; t · worstA then
13:      a^k_t ← candMappings.firstKey
14:      A.add(a^j_s , a^k_t)
15:   else
16:      A.add(a^j_s , null) // no mapping found
17:   end if
18: end for
19: return A</p></div>
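A runnable rendering of Algorithm 1 could look as follows. Here the distance is the negated smoothed cross entropy of Equation 5, so that lower values mean more similar and the normalized threshold test of line 12 operates on positive numbers; this sign convention, the smoothing constants, and all names are assumptions of the sketch, not the authors' implementation:

```python
# Illustrative sketch of Algorithm 1 (on-the-fly alignment).
import math

def distance(erm_lm, cand_lm, lam=0.8, floor=1e-6):
    # floor stands in for the collection model c_s(w) of unseen words
    return -sum(p * math.log(lam * cand_lm.get(w, 0.0) + (1 - lam) * floor)
                for w, p in erm_lm.items())

def align(erm, entity, t=0.75):
    """Map each ERM field to at most one attribute of the entity."""
    mappings = {}
    for field, info in erm.items():
        taken = {v for v in mappings.values() if v is not None}
        # sorted list plays the role of the OrderedByValueMap
        cands = sorted((distance(info["lm"], lm), label)
                       for label, lm in entity.items() if label not in taken)
        if not cands:
            mappings[field] = None
            continue
        best_h, best_label = cands[0]
        worst_h = cands[-1][0]
        # keep the mapping only if the best candidate clearly undercuts
        # the worst one (normalized threshold test of line 12)
        mappings[field] = best_label if t * worst_h > best_h else None
    return mappings

erm = {"director": {"lm": {"fassbinder": 0.5, "rainer": 0.5}},
       "starring": {"lm": {"klaus": 0.5, "barbara": 0.5}}}
entity = {"directedBy": {"rainer": 0.33, "werner": 0.34, "fassbinder": 0.33},
          "actors": {"klaus": 0.5, "peter": 0.5},
          "year": {"1973": 1.0}}
```

On this toy input, the director field aligns with the directedBy attribute and starring with actors, since their word distributions dominate the alternatives by more than the threshold margin.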
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>In this section, we report on the experiments conducted using an implementation of the proposed approach. We performed the experiments with the following parameter setting: β = 10, c = 0.8, t = 0.75, and followed the Cranfield <ref type="bibr" target="#b10">[11]</ref> methodology for the experiments on search performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Web Datasets</head><p>Our experiments were conducted with two RDF Web datasets, DBpedia 3.5.1 and IMdb. DBpedia is a structured representation of Wikipedia, which contains more than 9 million entities of various types, among them about 50k entities typed as films. The IMdb dataset<ref type="foot" target="#foot_1">2</ref> is derived from www.imdb.com and transformed into RDF; it contains information on movies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Queries and Ground Truth</head><p>Our goal is to find relevant entities in the target datasets D_t for a given query q_s. We determine the relevant entities in D_t by manually rewriting the query q_s to obtain a structured query q_t adhering to the vocabulary of D_t. Fig. <ref type="figure" target="#fig_3">3</ref> shows such a set of queries: one query serves as the source query q_s, and the results of the other two queries capture the ground truth for the IR experiments. Note that this ground truth is conservative in the sense that it considers only matches that can be obtained through query rewriting based on same-as mappings. However, some possibly relevant entities may either be represented differently (using attributes that cannot be aligned via same-as mappings) or lack information for some of the query predicates. Hence, this ground truth defines a lower bound on the actual performance.</p><p>We created two query sets, each containing 23 BGP entity queries of different complexities, ranging from 2 to 4 triple patterns and yielding a varying number of relevant entities (see Table <ref type="table" target="#tab_2">2</ref>). The structured queries represent real-world information needs, e.g. retrieve "all movies directed by Steven Spielberg", "all movies available in English and also in Hebrew", or "all movies directed by Rainer Werner Fassbinder, which were released in 1982". The last query is illustrated in Fig. <ref type="figure" target="#fig_3">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Keyword Query Baseline (KW)</head><p>We compare the performance of our ERM approach against Semplore <ref type="bibr" target="#b5">[6]</ref>, a real-world keyword search system based on Lucene. Semplore creates a virtual document for every entity description and uses the concatenation of attribute and attribute value as document terms. In the same way, we transform the structured query into a keyword query of disjunctive terms by using the concatenations of predicates and constants of the structured query as keywords. The resulting keyword query retrieves all virtual documents representing entity descriptions that contain some of the corresponding terms. Fig. <ref type="figure" target="#fig_1">1</ref> illustrates an example.</p></div>
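The baseline's query transformation can be sketched as follows; the triple-pattern representation and the ':'-concatenated term format are assumptions made for illustration, not Semplore's actual term syntax:

```python
# Illustrative sketch of the baseline transformation: each (predicate,
# constant) pair of the structured query becomes one concatenated keyword
# term, mirroring the terms of the entity's virtual document.
def to_keyword_query(triple_patterns):
    # triple_patterns: (subject, predicate, object); variables start with '?'
    terms = []
    for s, p, o in triple_patterns:
        if not o.startswith("?"):
            terms.append(p + ":" + o.lower().replace(" ", "_"))
    return terms

q = [("?x", "director", "Rainer Werner Fassbinder"),
     ("?x", "releaseDate", "1982")]
keywords = to_keyword_query(q)
```

The resulting term list is interpreted disjunctively, so any virtual document containing at least one of the terms is retrieved.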
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Search Performance</head><p>We evaluate search performance using the standard IR measures precision, recall, mean average precision (MAP) and mean reciprocal rank (MRR). We retrieve the top five thousand entities, rank them, and compute the metrics based on the top one thousand entities returned by each method. We evaluate ERM against the baseline described above. ERM performs alignment on-the-fly and operates without any upfront integration or prior knowledge about the schemas. When comparing the performance of ERM and KW in Fig. <ref type="figure" target="#fig_4">4</ref>(a), we observe that ERM outperforms KW: ERM yields a MAP of 0.55, whereas KW achieves a MAP of about 0.2. The robustness of the retrieval performance of ERM can be observed in Fig. <ref type="figure" target="#fig_4">4(b)</ref>, which shows the interpolated precision across the recall levels. ERM considerably outperforms the KW baseline across all recall levels.</p></div>
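For reference, the reported MAP and MRR values rest on the standard per-query measures, which can be sketched as follows (purely illustrative; the example data is made up):

```python
# Average precision (averaged over queries to obtain MAP) and reciprocal
# rank (averaged to obtain MRR), as standardly defined in IR evaluation.
def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i   # precision at each relevant rank
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i      # rank of the first relevant result
    return 0.0
```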
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related Work</head><p>We have discussed related work throughout the paper. Basically, there are two existing lines of approaches: one based on keyword search and one based on structured query rewriting. Our approach represents a novel combination of the flexibility of keyword search with the power of query rewriting. Just like keyword search, it does not rely on precomputed mappings. However, it is able to exploit the fine-grained structure of query and results, which is the primary advantage of structured query rewriting. In addition, it can leverage existing mappings created by tools like <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>.</p><p>More precisely, there exist similar keyword search retrieval models that also take the structure of the underlying data into account. In fact, the model underlying our approach originates from the concept of language models <ref type="bibr" target="#b9">[10]</ref>, which have been proposed for modeling resources and queries as multinomial distributions over the vocabulary terms. More precisely, the foundation of our work is established by Lavrenko et al., who propose relevance-based language models to directly capture the relevance behind documents and queries. In this context, structure information has been exploited for constructing structured relevance models <ref type="bibr" target="#b8">[9]</ref> (SRM). This is the work most closely related to the ERM. The difference is that while the goal of SRM is to predict values of empty fields in a single-dataset scenario, the ERM targets searching in a completely different setting involving multiple heterogeneous datasets. Besides the query rewriting <ref type="bibr" target="#b1">[2]</ref> strategy, there are database-oriented approaches bridging the schema and vocabulary gap by query relaxation <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>.
However, these approaches also need a precomputation step to either identify duplicates <ref type="bibr" target="#b13">[14]</ref> or derive alternative relations <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>.</p><p>The proposed mapping technique is in principle similar to existing techniques, e.g. <ref type="bibr" target="#b15">[16]</ref>, to the extent that it relies on the same features, i.e., values of attributes. However, while the use of language models for representing these features as well as the similarity calculation based on entropy is common for retrieval tasks, it has not been applied to the schema mapping problem before. We consider this a promising approach for embedding the pay-as-you-go data integration paradigm <ref type="bibr" target="#b6">[7]</ref> into the search process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>We have proposed a novel approach for searching Web datasets using a single structured seed query that adheres to the vocabulary of just one of the datasets. It is based on an entity relevance model computed from the seed query, which captures the structure and content of relevant results. This model is used both for matching and ranking relevant results and for performing data integration on the fly. The approach combines the flexibility of keyword search, in the sense that no upfront integration is required, with the power of structured query rewriting techniques, which comes from the use of the fine-grained structure of queries and results. The experiments showed that our approach outperforms a common keyword search baseline.</p><p>We demonstrated our approach with one source and one target dataset. However, there is no inherent limitation on the number of target datasets. So, theoretically, one query can be used 'to bind them all'.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Baseline: A structured query (1) transformed into a keyword query (2) and matched against bag of word representations of entities (3).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Example ERM with field-level term distributions ERM as (w) over the fields label, starring, director, type, i:actors, i:directors, and i:producer (e.g. starring: klaus:0.25, löwitsch:0.25).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Example of manually created queries as ground truth. Each query represents the same information need for one of the datasets.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. IR Performance of ERM compared to KW Baseline.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>gives details about each dataset and also shows the varying average size of entity descriptions |A(e)|.</figDesc><table><row><cell cols="2">Dataset #Entities</cell><cell>#Distinct Attribute Labels</cell><cell>|A(e)| ±StdDev.</cell></row><row><cell>DBpedia</cell><cell>9.1M</cell><cell>39.6K</cell><cell>9±18.2</cell></row><row><cell>IMdb</cell><cell>859K</cell><cell>32</cell><cell>11.4±6.4</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Relevant Entities per Query for each dataset</figDesc><table><row><cell cols="3">Rel. Entities IMdb DBpedia</cell></row><row><cell>max</cell><cell>834</cell><cell>47</cell></row><row><cell>avg.</cell><cell>114.9</cell><cell>10.9</cell></row><row><cell>median</cell><cell>21</cell><cell>5</cell></row><row><cell>min</cell><cell>1</cell><cell>1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Average Number of Fields of an ERM</figDesc><table><row><cell cols="2">Source Dataset |ERM | ± StdDev.</cell></row><row><cell>IMdb</cell><cell>15.8±6.7</cell></row><row><cell>DBpedia</cell><cell>23±5.4</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://sig.ma/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">We thank Julien Gaugaz and the L3S Research Center for providing us their version of the IMdb dataset.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Acknowledgements</head><p>This work was supported by the German Federal Ministry of Education and Research (BMBF) under the iGreen project (grant 01IA08005K).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<ptr target="http://www.w3.org/TR/rdf-sparql-query/" />
		<title level="m">SPARQL Query Language for RDF</title>
				<imprint>
			<date type="published" when="2008-01">January 2008</date>
		</imprint>
	</monogr>
	<note>W3C Recommendation 15 January 2008</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Query rewriting and answering under constraints in data integration systems</title>
		<author>
			<persName><forename type="first">A</forename><surname>Calì</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lembo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rosati</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCAI</title>
				<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="16" to="21" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A survey on ontology mapping</title>
		<author>
			<persName><forename type="first">N</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">Y</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Han</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGMOD Rec</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="34" to="41" />
			<date type="published" when="2006-09">September 2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Robust and scalable linked data reasoning incorporating provenance and trust annotations</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Bonatti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Polleres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sauro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Web Sem</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="165" to="201" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Searching linked objects with falcons: Approach, implementation and evaluation</title>
		<author>
			<persName><forename type="first">G</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int. J. Semantic Web Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="49" to="70" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Semplore: A scalable IR approach to search the web of data</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Penin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Pan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Web Sem</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="177" to="188" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Web-scale data integration: You can afford to pay as you go</title>
		<author>
			<persName><forename type="first">J</forename><surname>Madhavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Halevy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Jeffery</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CIDR</title>
		<imprint>
			<biblScope unit="page" from="342" to="350" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Ad-hoc object retrieval in the web of data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pound</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zaragoza</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 19th Int. Conf. on WWW</title>
				<meeting>of the 19th Int. Conf. on WWW</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="771" to="780" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Information retrieval on empty fields</title>
		<author>
			<persName><forename type="first">V</forename><surname>Lavrenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Allan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">HLT-NAACL</title>
				<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="89" to="96" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A language modeling approach to information retrieval</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Ponte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR</title>
				<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="275" to="281" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">The CRANFIELD Tests on Index Language Devices</title>
		<author>
			<persName><forename type="first">C</forename><surname>Cleverdon</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1967">1967</date>
			<publisher>Aslib</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Falcon-ao: A practical ontology matching system</title>
		<author>
			<persName><forename type="first">W</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Web Sem</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="237" to="239" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">AROMA results for OAEI 2009</title>
		<author>
			<persName><forename type="first">J</forename><surname>David</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Ontology Matching Workshop</title>
				<imprint>
			<publisher>ISWC</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Query relaxation using malleable schemas</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gaugaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">T</forename><surname>Balke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Nejdl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGMOD Conference</title>
				<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="545" to="556" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Query relaxation for entity-relationship search</title>
		<author>
			<persName><forename type="first">S</forename><surname>Elbassuoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ramanath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Weikum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ESWC</title>
		<imprint>
			<biblScope unit="page" from="62" to="76" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">One size does not fit all: Customizing ontology alignment using user feedback</title>
		<author>
			<persName><forename type="first">S</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fokoue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Srinivas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Semantic Web Conference</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="177" to="192" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
