<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discovering Related Data Sources in Data-Portals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Andreas Wagner</string-name>
        </contrib>
        <contrib contrib-type="author">
<string-name>Peter Haase</string-name>
          <email>peter.haase@fluidops.com</email>
        </contrib>
        <contrib contrib-type="author">
<string-name>Achim Rettinger</string-name>
        </contrib>
        <contrib contrib-type="author">
<string-name>Holger Lamm</string-name>
        </contrib>
      </contrib-group>
      <abstract>
<p>To allow effective querying on the Web of data, systems frequently rely on data from multiple sources for answering queries. For instance, a user may wish to combine data from sources comprised in different statistical catalogs. Given such federated queries, in order to enable an interactive exploration of results, systems must allow user involvement during data source selection. That is, a user should be able to choose the data sources contributing to query results, thereby allowing her to refine/expand current findings. For this, one needs effective recommendations for data sources to be picked: data source contextualization. Recent work, however, solely aims at source contextualization for "Web tables", while heavily relying on schema information and simple table structures. Addressing these shortcomings, we exploit work from the field of data mining and show how to enable effective Web data source contextualization. Based on a real-world finance use case, we built a contextualization engine, which we integrated into a Web search system, our data portal, for accessing statistical data sets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>The amount of RDF data available on the Web today, such as Linked Data (http://www.w3.org/DesignIssues/LinkedData.html), RDFa, and Microformats, is large and rapidly increasing (http://webdatacommons.org). RDF data contains descriptions of entities, with each description being a set of triples. A triple associates an entity identifier (subject) with an object via a predicate. A set of triples forms a data graph.</p>
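<p>To make the triple and data-graph notions concrete, consider a minimal sketch in Python; the shortened identifiers are purely illustrative stand-ins for those in Fig. 1, not part of any described system:</p>

```python
# A data graph modeled as a set of (subject, predicate, object) triples.
# Identifiers are illustrative, loosely following the Eurostat example.
triples = {
    ("es:data/tec00001", "rdf:type", "qb:Observation"),
    ("es:data/tec00001", "es-prop:geo", "es-dic:DE"),
    ("es:data/tec00001", "sm:obsValue", "2496200.0"),
}

def description(entity, graph):
    """The description of an entity: all triples having it as subject."""
    return {t for t in graph if t[0] == entity}

# The example entity is described by all three triples.
print(len(description("es:data/tec00001", triples)))  # 3
```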
<p>RDF data is oftentimes highly distributed, with each data source comprising one or more RDF graphs (Fig. 1). Most notably, Linked Data as well as Web-accessible SPARQL endpoints have contributed to this development.</p>
<p>Integrated Querying of Multiple Data Sources. In order to fulfill information needs over multiple, distributed data sources, a number of issues need to be addressed. These range from the ability to discover and identify relevant data sources, to the ability to integrate them, and finally to support querying them in a transparent manner.</p>
<p>Src. 1: tec00001 (Eurostat).
es:data/tec00001 rdf:type qb:Observation ;
  es-prop:geo es-dic:DE ;
  es-prop:unit es-dic:MIO_EUR ;
  es-prop:indic-na es-dic:B11 ;
  sd:time "2010-01-01"^^xs:date ;
  sm:obsValue "2496200.0"^^xs:double .</p>
      <p>Src. 2: gov_q_ggdebt (Eurostat).
es:data/gov_q_ggdebt rdf:type qb:Observation ;
  es-prop:geo es-dic:DE ;
  es-prop:unit es-dic:MIO_EUR ;
  es-prop:indic-na es-dic:F2 ;
  sd:time "2010-01-01"^^xs:date ;
  sm:obsValue "1786882.0"^^xs:decimal .</p>
      <p>Src. 3: NY.GDP.MKTP.CN (Worldbank).
wbi:NY.GDP.MKTP.CN rdf:type qb:Observation ;
  sd:refArea wbi:classification/country/DE ;
  sd:refPeriod "2010-01-01"^^xs:date ;
  sm:obsValue "2500090.5"^^xs:double ;
  wbprop:indicator wbi:classification/indicator/NY.GDP.MKTP.CN .</p>
<p>
        Say a user is interested in economic data. Here, catalogs like Eurostat (http://ec.europa.eu/eurostat/) or Worldbank (http://worldbank.org) offer rich statistical information about, e.g., the GDP, spread across many sources. However, these data sources are very specific, and in order to provide the user with her desired information, a system has to combine data from multiple sources. Processing queries in such a manner requires knowledge about which source features which information. This problem is commonly known as source selection: a system chooses data sources relevant for a given query and query fragment, respectively. Previous works selected sources by means of indexes, e.g., [
        <xref ref-type="bibr" rid="ref11 ref8">8, 11</xref>
        ], link traversal, e.g., [
        <xref ref-type="bibr" rid="ref11 ref9">9, 11</xref>
        ], or by using available source metadata, e.g., [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
<p>
        Data Source Contextualization. Existing approaches for source selection aim solely at a mapping of queries/query fragments to sources featuring exactly matching data [
        <xref ref-type="bibr" rid="ref11 ref5 ref8 ref9">5, 8, 9, 11</xref>
        ]. In particular, such works do not consider "source semantics", i.e., what sources are actually about and how they relate to each other. For instance, consider a user searching for GDP rates in the EU. A traditional system may discover sources in Eurostat to comprise matching data. At the same time, other sources offer contextual information concerning, e.g., the national debt. Notice, such sources are actually not relevant to the user's query,
      </p>
<p>but relevant to her information need. Integrating these additional sources for contextualization of known, relevant sources provides a user with broader results in terms of result dimensions (schema complement) and result entities (entity complement), respectively. See also the example in Fig. 1.</p>
<p>To enable systems to identify and integrate sources for contextualization, we argue that user involvement during source selection is a key factor. That is, starting with an initial search result obtained via, e.g., a SPARQL or keyword query, a user should be able to choose and change the sources used for result computation. In particular, users should be recommended contextual sources at each step of the search process. After modifying the selected sources, results may be reevaluated and/or queries expanded.</p>
<p>
        Recent work on data source contextualization focuses on Web tables [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], while using top-level schemata such as Freebase (http://www.freebase.com/). Further, it restricts data to a simple table-structured form. We argue that such a solution is not a good fit for the "wild" Web of data. In particular, considering Linked Data, data sources frequently feature schema-less data and/or high-dimensional, heterogeneous entities. Targeting the Web of data, we propose an approach based on well-known techniques from the field of data mining. That is, we extract a sample of entities from each data source and learn clusters of entities. Then, we exploit the constructed clusters as a description for data sources, and find contextual sources via similarity measures between entity clusters.
      </p>
<p>Contributions. In this work, we provide the following contributions:
(1) We present an entity-based solution for data source contextualization in the Web of data. This engine is based on well-known data mining strategies, and does not require schema information or data adhering to a particular form.
(2) We implemented our system, the data portal, based on a real-world use case, thereby showing its practical relevance and feasibility. A prototype version of this portal is freely available (http://data.fluidops.net/) and currently tested by a pilot customer.</p>
<p>Outline. In Sect. 2, we present a real-world use case. In Sect. 3, we outline our contextualization engine, before we discuss the data portal system in Sect. 4. We present related work in Sect. 5. We conclude with Sect. 6.</p>
      <sec id="sec-4-1">
        <title>Use Case Scenario</title>
<p>In this section, we introduce a real-world use case to illustrate challenges and opportunities in contextualizing data sources. The scenario is situated in financial research, provided by a pilot user in a private bank.</p>
<p>In their daily work, financial researchers heavily rely on a variety of open and closed Web data sources in order to provide prognoses of future trends. A typical example is the analysis of government debt. During the financial crisis in 2008-2009, most European countries accumulated high debts. To lower doubts about repaying</p>
      </sec>
</sec>
    <sec id="sec-6">
<p>these debts, most countries set up plans to reduce their public budget deficits. The fulfillment of these plans is essential for the Euro zone's development.</p>
<p>To analyze such plans, a financial researcher requires an overview of public revenue and expenditure in relation to the gross domestic product (GDP). To measure this, she needs information about the deficit target, the revenue/expenditure/deficit, and GDP estimates. This information is publicly available, provided by catalogs like Eurostat and Worldbank. However, it is spread across a huge space of sources. That is, there is no single source satisfying her information needs; instead, data from multiple sources has to be identified and combined.</p>
<p>To start her search process, a researcher may give "gross domestic product" as a keyword query. The result is GDP data from a large number of sources. At this point, data source selection is "hidden" from the researcher, and sources are solely ranked via the number and quality of keyword hits. However, knowing where her information comes from is critical. In particular, she may want to restrict and/or know the following meta-data:
- General information about the data source, e.g., the name of the author and a short description of the data source contents.
- Information about entities contained in the data source, e.g., the single countries of the European Union.
- Descriptions of the dimensions of the observations, e.g., the covered time range or the data unit of the observations.</p>
<p>By means of faceted search, the researcher finally restricts her data source to tec00001 (Eurostat, Fig. 1), featuring "Gross domestic product at market prices". However, searching the data source space in such a manner requires extensive knowledge. Further, the researcher was not only interested in plain GDP data; she was also looking for additional information.</p>
<p>For this, a system should suggest data sources that might be of interest, based on sources known to be relevant. These contextual sources may feature related, additional information w.r.t. current search results/sources, for instance, data sources containing information about the GDP of further countries or with a different temporal range. By such means, the researcher may discover new sources more easily, as one source of interest links to another, allowing her to explore the space of sources.</p>
      <sec id="sec-6-1">
        <title>Contextualisation Engine</title>
        <p>In this section, we outline an approach for Web data source contextualization.</p>
<p>
          For this, we conceive a data source D ∈ 𝒟 as a set of multiple RDF graphs, with 𝒟 as the set comprising all sources in the data space. Further, an entity e is given by an RDF instance contained in source D, and described by a subgraph Ge in D [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], see also Fig. 1.
        </p>
<p>
          Related Entities. The intuition behind our approach is simple: if data sources contain similar entities, they are somehow related. In other words, we rely on entities to capture the "latent" semantics of data sources. That is, we employ offline procedures as follows: we (1) extract entities, (2) measure similarity between them, and (3) cluster them.
(1) Entity Extraction. We start by extracting entities from each source D. First, for scalability reasons, we go over all entities in the data graphs in D and collect a sample, with every entity having the same probability of being selected. For each selected entity e, we crawl its surrounding subgraph, resulting in a graph Ge that describes e [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. For cleaning Ge, we apply standard data cleansing strategies to fix, e.g., missing or wrong data types.
(2) Entity Similarity. In a second step, we define a dissimilarity measure, dis, between two entities based on previous work on kernel functions for RDF [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. That is, for a given entity pair e′ and e″, we count common substructures in Ge′ and Ge″. The more "overlapping structures" between the two graphs are found, the lower we score the dissimilarity between e′ and e″. In addition to the structural characteristics of entities, we also consider their literal dissimilarity. For entities e′ and e″ we pairwise compare their literals by means of string and numerical kernels, respectively [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The former counts the number of common substrings, given a literal pair from e′/e″. The latter, on the other hand, computes the numerical distance between two literals associated with e′ and e″. We aggregate these three different dissimilarity measures for e′ and e″ via kernel aggregation strategies [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Intuitively, such a kernel aggregation combines multiple kernels using, e.g., weighted summation.
(3) Entity Clustering. Last, we apply clustering techniques to mine for entity groups. More precisely, we aim at discovering clusters, Cj, comprising similar entities, which may or may not originate from the same source. Thus, clusters relate sources by relating their contained entities. We use k-means [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as a well-known and simple algorithm for computing entity clusters. k-means adheres to four steps [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. (a) Choose k initial cluster centers mi. (b) Based on the above dissimilarity function, dis, an indicator function is given as: 1(e, Cj) is 1 iff dis(e, Cj) &lt; dis(e, Cz) for all z ≠ j, and 0 otherwise. Intuitively, 1(·,·) assigns each entity e to its "closest" cluster Cj. (c) Update the cluster centers mi, and reassign (if necessary) entities to new clusters. (d) Stop if a convergence threshold is reached, e.g., no (or minimal) reassignments occurred. Otherwise go back to (b).
        </p>
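<p>The four k-means steps (a)-(d) above can be sketched as follows. Since dis is an arbitrary kernel-based dissimilarity, the sketch represents each cluster center by a medoid entity rather than a mean in feature space; this, along with the function names and the toy 1-d data, is an assumption for illustration only:</p>

```python
import random

def cluster_entities(entities, dis, k, max_iter=20, seed=0):
    """Steps (a)-(d): (a) pick k initial centers, (b) assign each entity
    to its closest center via dis, (c) update centers (here: medoids),
    (d) stop once no reassignment occurs."""
    rnd = random.Random(seed)
    centers = rnd.sample(entities, k)              # (a) initial centers
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {                         # (b) indicator 1(e, Cj)
            e: min(range(k), key=lambda j: dis(e, centers[j]))
            for e in entities
        }
        if new_assignment == assignment:           # (d) convergence
            break
        assignment = new_assignment
        for j in range(k):                         # (c) medoid update
            members = [e for e, c in assignment.items() if c == j]
            if members:
                centers[j] = min(members,
                                 key=lambda m: sum(dis(m, e) for e in members))
    return assignment

# Toy usage: two well-separated groups of 1-d "entities",
# with absolute difference as the dissimilarity.
points = [1.0, 1.2, 0.9, 10.0, 10.5, 9.8]
clusters = cluster_entities(points, lambda a, b: abs(a - b), k=2)
```

<p>A pluggable dis keeps the sketch agnostic to the concrete kernel aggregation; any function mapping an entity pair to a non-negative number can be substituted.</p>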
<p>
          Contextualisation Score. Similar to [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], given a source D′, we compute two scores, ec(D″ | D′) and sc(D″ | D′), for quantifying the contextualization of D′ via a second source D″. Both scores are aggregated to a contextualization score for data source D″ given D′.
        </p>
<p>
          The former is an indicator for the entity complement of D″ w.r.t. D′. That is, ec asks: how many new, similar entities does D″ contribute to the given entities in D′? The latter score, sc, measures how many new "dimensions" are added by D″, compared to those already present in D′ (schema complement). In contrast to [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], however, we do not rely on any kind of "external" information, such as top-level schemata. Instead, we solely exploit the semantics as captured by entities.
        </p>
<p>Fig. 2: The data portal system features two kinds of services: source space exploration and query processing. For the former, our source contextualization engine is integrated as a key component. Overall, source space exploration requires source meta-data as well as entity clusters to be available. Entity clusters are computed in an offline process, while meta-data may be updated frequently at runtime. Query processing, on the other hand, distributes query fragments via a federation layer. Each fragment is evaluated over one or more sources. For this, each data source is mapped to a SPARQL endpoint, for which data is accessed via a data loader. For our running example, the necessary sources are loaded via three endpoints: gov_q_ggdebt, tec00001, and NY.GDP.MKTP.CN.</p>
<p>Let us first define an entity complement score ec : 𝒟 × 𝒟 → [0, 1]. In the most simplistic manner, we may measure ec by counting the overlapping clusters between both sources:

ec(D″ | D′) := Σ_{Cj ∈ cluster(D′)} 1(Cj, D″) · |Cj|  /  Σ_{Cj ∈ cluster(D′)} |Cj|

with cluster as a function mapping data sources to the clusters their entities are assigned to. Further, let 1(C, D) be an indicator function, returning 1 if cluster C is associated with data source D via one or more entities in D.</p>
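<p>A possible reading of ec as code, sketched under the assumption that the score is normalized by the total size of D′'s clusters so that it stays within [0, 1]; the dictionaries stand in for the cluster(·) mapping and the cluster sizes, and the toy data is hypothetical:</p>

```python
def ec(d_new, d_known, cluster, cluster_size):
    """Entity complement ec(D'' | D'): the cluster mass of D' that D''
    also covers. `cluster` maps source -> set of cluster ids,
    `cluster_size` maps cluster id -> |C|. Normalizing by the total
    mass of D''s clusters is an assumption to keep the score in [0, 1]."""
    known_clusters = cluster[d_known]
    total = sum(cluster_size[c] for c in known_clusters)
    if total == 0:
        return 0.0
    shared = sum(cluster_size[c] for c in known_clusters
                 if c in cluster[d_new])
    return shared / total

# Hypothetical toy data: D2 shares cluster C2 (size 6) with D1.
cluster = {"D1": {"C1", "C2"}, "D2": {"C2", "C3"}}
cluster_size = {"C1": 4, "C2": 6, "C3": 2}
print(ec("D2", "D1", cluster, cluster_size))  # 0.6
```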
<p>Considering the schema complement score, sc : 𝒟 × 𝒟 → [0, 1], we aim to count new dimensions (properties) that are introduced by D″. Thus, a simple realization of sc may be given by:

sc(D″ | D′) := 1/|cluster(D″)| · Σ_{Cj ∈ cluster(D″)} |props(Cj) \ ∪_{Ci ∈ cluster(D′)} props(Ci)| / |props(Cj)|

with props as a function projecting a cluster C to a set of properties, where each property is contained in a description of an entity in C.</p>
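<p>Analogously, sc can be sketched as follows; averaging over the clusters of D″ is an assumption, made to keep the score within [0, 1], and the property names in the toy data are illustrative:</p>

```python
def sc(d_new, d_known, cluster, props):
    """Schema complement sc(D'' | D'): for each cluster of D'', the
    fraction of its properties not yet covered by any cluster of D',
    averaged over D'''s clusters (the averaging is an assumption)."""
    covered = set().union(*(props[c] for c in cluster[d_known]))
    new_clusters = cluster[d_new]
    if not new_clusters:
        return 0.0
    return sum(len(props[c] - covered) / len(props[c])
               for c in new_clusters) / len(new_clusters)

# Hypothetical toy data: cluster C2 adds two new dimensions out of four.
cluster = {"D1": {"C1"}, "D2": {"C2"}}
props = {"C1": {"es-prop:geo", "sm:obsValue"},
         "C2": {"es-prop:geo", "sm:obsValue", "sd:refPeriod", "es-prop:unit"}}
print(sc("D2", "D1", cluster, props))  # 0.5
```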
<p>Finally, a contextualization score cs is obtained by a monotonic aggregation of ec and sc. In our case, we apply a weighted summation:

cs(D″ | D′) := 1/2 · ec(D″ | D′) + 1/2 · sc(D″ | D′)</p>
<p>Runtime Behavior and Scalability. Regarding online performance, i.e., the computation of the contextualization score cs given the offline-learned clusters, we aimed at simple and lightweight heuristics. For ec, only an assignment of data sources to clusters (function cluster(D)) and the cluster sizes |Cj| are needed. Further, measure sc only requires an additional mapping of clusters to "contained" properties (function props(C)). All necessary statistics are easily kept in memory.</p>
<p>
          With regard to offline clustering behavior, we expect our approach to perform well, as existing work on kernel k-means clustering has shown such approaches to scale to large data sets, e.g., [
          <xref ref-type="bibr" rid="ref18 ref3">3, 18</xref>
          ].
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>Source Contextualization in the Data-Portal</title>
<p>We have implemented the presented algorithms (Sect. 3) for data source contextualization in a data portal, enabling on-demand access to data sources from a number of open statistical data catalogs. Based on our real-world use case, we show how source contextualization is used within this portal.</p>
<p>Overview. Towards an active involvement of users in the source selection process, we implemented a contextualization engine and integrated it in a system offering two services: source space exploration and distributed query processing.</p>
<p>Using the former, users may explore the space of sources, i.e., search and discover data sources of interest. Here, the contextualization engine fosters the discovery of relevant sources during exploration. The query processing service, on the other hand, allows queries to be federated over multiple sources. See also Fig. 2 for an overview.</p>
        <p>Interaction between both services is tight and user-driven. In particular,
sources discovered during source exploration may be used for answering queries.
On the other hand, sources employed for result computation may be inspected,
and via contextualization other relevant sources may be found.</p>
<p>
          The data portal is based on the Information Workbench [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], and a running prototype is available (http://data.fluidops.net/). Following our use case (Sect. 2), we populated the system
        </p>
      </sec>
<p>with statistical data/sources from Eurostat and Worldbank. This population involved an extraction of meta-data from the data catalogs, represented using the VoID and DCAT vocabularies. The meta-data includes information about the accessibility of the actual data sources, which is used in a second step to load and populate the data sources locally. Every data source is stored in a triple store and accessible via a dedicated SPARQL endpoint. Overall, we have a total of more than 10,000 data sources available.</p>
<p>Source Exploration and Selection. A typical search process starts with looking for "the right" sources. That is, a user begins with an exploration of the data source space. For instance, she may issue the keyword query "gross domestic product", yielding sources with matching words in their meta-data. If this query does not lead to sources suitable for her information need, a faceted search interface or a tag cloud may be used. For instance, she refines her sources via the entity "Germany" in a faceted search (Fig. 3).</p>
<p>Once the user has discovered a source of interest, its structure as well as entity information is shown. For example, a textual source description for GDP (current US$) is given in Fig. 4. More details about source GDP (current US$) are given via an entity and schema overview, respectively (Fig. 5-a/b). Note, the entities used here have been extracted by our approach, and are visualized by means of a map. Using these rich source descriptions, a user can get to know the data and data sources before issuing queries.</p>
<p>Further, for every source a ranked list of contextualization sources is given. For GDP (current US$), e.g., the source GDP at Market Prices is recommended (Fig. 5-c). This way, the user is guided from one source of interest to another. At any point, she may select a particular source for querying. Eventually, she not only knows her relevant sources, but has also gained first insights into data schema and entities.</p>
<p>Processing Queries over Selected Sources. In this second component, we provide means for issuing and processing queries over multiple (previously selected) sources. Say a user has chosen GDP (current US$) as well as its contextualization source GDP at Market Prices (Fig. 5-d). Due to her previous exploration, she knows that the former provides the German GDP from 2000 to 2010, while the second one features the GDP for the years 2011 and 2012 in Germany.</p>
<p>
        Knowing the data sources that contain the desired data, the user may simply add them to the federation by clicking on the corresponding button. The federation can then be queried transparently, i.e., as if the data was physically integrated in a single source. Query processing is handled by the FedX engine [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which enables efficient execution of federated queries over multiple data sources.
      </p>
<p>Following the example, the user may issue the SPARQL query shown in Fig. 6. Here, the user combined data from the sources GDP (current US$) and GDP at Market Prices. That is, while GDP data from 2000 to 2010 was retrieved from source GDP (current US$), GDP information for the years 2011 and 2012 was loaded from the data source GDP at Market Prices.</p>
<p>The results of this query can be visualized using widgets offered by the Information Workbench. For instance, as shown in Fig. 7, GDP information for Germany may be depicted as a bar chart.</p>
<p>Future Applications of the Contextualization Engine. Besides the current usage of source contextualization, we see further applications in the future. In particular, the learned entity clusters may be used for the visualization of data source search results, or even for the visualization of SPARQL query results. Further, SPARQL results could be ranked based on data source contextualization scores and user inputs for source selection, respectively.</p>
      <p>Fig. 5: (a/b) Entities and schema, respectively. (c) Contextualization sources for GDP (current US$).</p>
      <p>SELECT ?year ?gdp
WHERE {
  {
    ?obs1 rdf:type qb:Observation ;
      wb-property:indicator wbi-ci:NY.GDP.MKTP.CN ;
      sdmx-dimension:refArea wbi-cc:DE ;
      sdmx-dimension:refPeriod ?year ;
      sdmx-measure:obsValue ?gdp .
  }
  UNION
  {
    ?obs2 rdf:type qb:Observation ;
      qb:dataSet es-data:tec00001 ;
      es-property:geo es-dic:geo#DE ;
      sdmx-dimension:timePeriod ?year ;
      sdmx-measure:obsValue ?gdp .
    FILTER(?year &gt; "2010-01-01"^^xsd:date)
  }
}
Fig. 6: The SPARQL query of the running example; both sources, GDP (current US$) and GDP at Market Prices, were selected during source exploration.</p>
      <sec id="sec-7-1">
        <title>Related Work</title>
<p>
          Closest to our approach is recent work on "finding related tables" on the Web [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In fact, our notion of entity and schema complement is adopted from that paper. However, [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] focuses on flat entities in Web tables, i.e., entities adhere to a simple and fixed relational structure. In contrast, we consider entities as subgraphs contained in Web data sources. Further, we do not require any kind of "external" information. Most notably, we do not use top-level schemata. We argue that relying on such information would limit the applicability of our approach.
        </p>
<p>
          Also related are approaches on data source recommendation for source linking, e.g., [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ]. Here, given a source D, the task is to find (and rank) other sources sharing the same contents, in order to interlink such data sources with D. Existing works commonly exploit keyword search, ontology matching, or user feedback/information. In contrast, our contextualization engine does not depend on user or schema information. Instead, it exploits clusters of entities, learned from the data, and based on structural and literal similarities. Most importantly, however, our goals differ: recommendation for source linking aims at discovering exactly the same entities across sources. Instead, we aim at finding either completely new entities, which are somehow related to known/relevant entities (entity complement), or the same entities featuring different properties (schema complement), i.e., providing additional information.
        </p>
<p>
          Another line of work is concerned with query processing over distributed RDF data, e.g., [
          <xref ref-type="bibr" rid="ref11 ref5 ref8 ref9">5, 8, 9, 11</xref>
          ]. During source selection, these approaches frequently exploit indexes or source meta-data for mapping queries/query fragments to sources. Our approach is complementary, as it enables systems to involve their users during source selection. We outlined such an extension of the traditional search process, as well as its benefits, throughout the paper.
        </p>
<p>
          Last, data integration for Web search has received much attention. Some works target rewriting queries, e.g., [
          <xref ref-type="bibr" rid="ref19 ref2">2, 19</xref>
          ], while others rely on keyword search, reducing queries and sources to bags of words, e.g., [
          <xref ref-type="bibr" rid="ref1 ref17">1, 17</xref>
          ]. We target, however, a "fuzzy" form of integration, i.e., we do not give exact mappings of entities, but merely measure whether sources contain entities that might be "somehow" related. That is, our contextualization score indicates whether sources might refer to similar entities, and may provide different data for these entities.
        </p>
      </sec>
      <sec id="sec-7-2">
        <title>Conclusion and Future Work</title>
<p>We presented a novel approach for Web data source contextualization. For this, we adapted well-known techniques from the field of data mining. More precisely, we provide a framework for source contextualization, to be instantiated in an application-specific manner. By means of a real-world use case and prototype, we showed how source contextualization allows for user involvement during source selection. Based on our use cases and data portal system, we plan to conduct empirical experiments for validating the effectiveness of our approach. In fact, we aim at a comparison with related work from the field of Web-table contextualization, as discussed in Sect. 5.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mika</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vigna</surname>
          </string-name>
          .
<article-title>Effective and efficient entity search in RDF data</article-title>
          .
          <source>In ISWC</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
<surname>Calì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          .
          <article-title>Query rewriting and answering under constraints in data integration systems</article-title>
          .
          <source>In IJCAI</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Chitta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jin</surname>
          </string-name>
,
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Havens</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <article-title>Approximate kernel k-means: solution to large scale kernel clustering</article-title>
          .
          <source>In SIGKDD</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Das Sarma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Finding related tables</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>O.</given-names>
            <surname>Görlitz</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          .
          <article-title>SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions</article-title>
          .
          <source>In COLD Workshop</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Grimnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Edwards</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Preece</surname>
          </string-name>
          .
          <article-title>Instance based clustering of semantic web resources</article-title>
          .
          <source>In ESWC</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>P.</given-names>
            <surname>Haase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwarte</surname>
          </string-name>
          .
          <article-title>The Information Workbench as a self-service platform for linked data applications</article-title>
          .
          <source>In COLD Workshop</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karnstedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sattler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          .
          <article-title>Data summaries for on-demand queries over linked data</article-title>
          .
          <source>In WWW</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>O.</given-names>
            <surname>Hartig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Freytag</surname>
          </string-name>
          .
          <article-title>Executing SPARQL queries over the web of linked data</article-title>
          .
          <source>In ISWC</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Murty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Flynn</surname>
          </string-name>
          .
          <article-title>Data clustering: a review</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>G.</given-names>
            <surname>Ladwig</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          .
          <article-title>Linked data query processing strategies</article-title>
          .
          <source>In ISWC</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>U.</given-names>
            <surname>Lösch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bloehdorn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rettinger</surname>
          </string-name>
          .
          <article-title>Graph kernels for RDF data</article-title>
          .
          <source>In ESWC</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>d'Aquin</surname>
          </string-name>
          .
          <article-title>Identifying relevant sources for data linking using a semantic web index</article-title>
          .
          <source>In Workshop on Linked Data on the Web</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>L. A. P.</given-names>
            <surname>Paes Leme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Casanova</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          .
          <article-title>Identifying candidate datasets for data interlinking</article-title>
          .
          <source>In ICWE</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schenkel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          .
          <article-title>FedX: Optimization Techniques for Federated Query Processing on Linked Data</article-title>
          .
          <source>In ISWC</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Cristianini</surname>
          </string-name>
          .
          <source>Kernel Methods for Pattern Analysis</source>
          .
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Penin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pan</surname>
          </string-name>
          .
          <article-title>Semplore: A scalable IR approach to search the Web of Data</article-title>
          .
          <source>JWS</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rudnicky</surname>
          </string-name>
          .
          <article-title>A large scale clustering scheme for kernel k-means</article-title>
          .
          <source>In Pattern Recognition</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gaugaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-T.</given-names>
            <surname>Balke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Nejdl</surname>
          </string-name>
          .
          <article-title>Query relaxation using malleable schemas</article-title>
          .
          <source>In SIGMOD</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>