=Paper=
{{Paper
|id=Vol-1549/article-07
|storemode=property
|title=Discovering Related Data Sources in Data-Portals
|pdfUrl=https://ceur-ws.org/Vol-1549/article-07.pdf
|volume=Vol-1549
|dblpUrl=https://dblp.org/rec/conf/semweb/Wagner0RL13
}}
==Discovering Related Data Sources in Data-Portals==
Discovering Related Data Sources in Data-Portals
Andreas Wagner†, Peter Haase‡, Achim Rettinger†, and Holger Lamm‡
† Karlsruhe Institute of Technology, ‡ fluid Operations
{a.wagner,rettinger}@kit.edu, peter.haase@fluidops.com
Abstract. To allow effective querying on the Web of data, systems frequently rely on data from multiple sources for answering queries. For instance, a user may wish to combine data from sources comprised in different statistical catalogs. Given such federated queries, in order to enable an interactive exploration of results, systems must allow user involvement during data source selection. That is, a user should be able to choose the data sources contributing to query results, thereby allowing her to refine or expand current findings. For this, one needs effective recommendations for the data sources to be picked: data source contextualization. Recent work, however, solely aims at source contextualization for “Web tables”, while heavily relying on schema information and simple table structures. Addressing these shortcomings, we exploit work from the field of data mining and show how to enable effective Web data source contextualization. Based on a real-world finance use-case, we built a contextualization engine, which we integrated into a Web search system, our data portal, for accessing statistics data sets.
1 Introduction
The amount of RDF data available on the Web today, such as Linked Data1 ,
RDFa and Microformats, is large and rapidly increasing.2 RDF data contains
descriptions of entities, with each description being a set of triples. A triple
associates an entity identifier (subject) with an object via a predicate. A set of
triples forms a data graph.
RDF data is oftentimes highly distributed, with each data source comprising
one or more RDF graphs (Fig. 1). Most notably, Linked Data as well as Web-
accessible SPARQL endpoints have contributed to this development.
Integrated Querying of Multiple Data Sources. In order to fulfill information needs over multiple, distributed data sources, a number of issues need to be addressed. These range from the ability to discover and identify relevant data sources, to the ability to integrate them, and finally to support for querying them in a transparent manner.
1 http://www.w3.org/DesignIssues/LinkedData.html
2 http://webdatacommons.org
Src. 1: tec00001 (Eurostat):
es:data/tec0001
    rdf:type qb:Observation ;
    es-prop:geo es-dic:DE ;
    es-prop:unit es-dic:MIO_EUR ;
    es-prop:indic_na es-dic:B11 ;
    sd:time "2010-01-01"^^xs:date ;
    sm:obsValue "2496200.0"^^xs:double .

Src. 2: gov_q_ggdebt (Eurostat):
es:data/gov_q_ggdebt
    rdf:type qb:Observation ;
    es-prop:geo es-dic:DE ;
    es-prop:unit es-dic:MIO_EUR ;
    es-prop:indic_na es-dic:F2 ;
    sd:time "2010-01-01"^^xs:date ;
    sm:obsValue "1786882.0"^^xs:decimal .

Src. 3: NY.GDP.MKTP.CN (Worldbank):
wbi:NY.GDP.MKTP.CN
    rdf:type qb:Observation ;
    sd:refArea wbi:classification/country/DE ;
    sd:refPeriod "2010-01-01"^^xs:date ;
    sm:obsValue "2500090.5"^^xs:double ;
    wbprop:indicator wbi:classification/indicator/NY.GDP.MKTP.CN .
Fig. 1: Src. 1 and Src. 3 describe Germany’s GDP in 2010. Note that they contextualize each other, as they feature different observation values and properties, respectively. Src. 2 also provides additional information w.r.t. Src. 1, as it holds the German debt in 2010. Src. 1-3 each contain one entity: es:data/tec0001, es:data/gov_q_ggdebt, and wbi:NY.GDP.MKTP.CN. Every entity description, Ge, equals the entire graph contained in its source.
Say a user is interested in economic data. Here, catalogs like Eurostat3 or Worldbank4 offer rich statistical information about, e.g., GDP, spread across many sources. However, these data sources are very specific, and in order to
provide the user with her desired information, a system has to combine data
from multiple sources. Processing queries in such a manner requires knowledge
about what source features which information. This problem is commonly known
as source selection: a system chooses data sources relevant for a given query
and query fragment, respectively. Previous works selected sources by means of
indexes, e.g., [8, 11], link-traversal, e.g., [9, 11], or by using available source meta-
data, e.g., [5].
Data Source Contextualization. Existing approaches for source selection
aim solely at a mapping of queries/query fragments to sources featuring exactly
matching data [5, 8, 9, 11]. In particular, such works do not consider “source semantics”, i.e., what sources are actually about and how they relate to each other.
For instance, consider a user searching for GDP rates in the EU. A traditional
system may discover sources in Eurostat to comprise matching data. At the
same time, other sources offer contextual information concerning, e.g., the na-
tional debt. Notice, such sources are actually not relevant to the user’s query,
3 http://ec.europa.eu/eurostat/
4 http://worldbank.org
but relevant to her information need. Integration of these additional sources for contextualization of known, relevant sources provides a user with broader results in terms of result dimensions (schema complement) and result entities (entity complement), respectively. See also the example in Fig. 1.
For enabling systems to identify and integrate sources for contextualization,
we argue that user involvement during source selection is a key factor. That is,
starting with an initial search result obtained via, e.g., a SPARQL or keyword
query, a user should be able to choose and change sources used for result compu-
tation. In particular, users should be recommended contextual sources at each
step of the search process. After modifying the selected sources, results may be
reevaluated and/or queries expanded.
Recent work on data source contextualization focuses on Web tables [4], while using top-level schemas such as Freebase5. Further, it restricts data to a simple table-structured form. We argue that such a solution is not a good fit for
the “wild” Web of data. In particular, considering Linked Data, data sources
frequently feature schema-less data and/or high-dimensional, heterogeneous en-
tities. Targeting the Web of data, we propose an approach based on well-known
techniques from the field of data mining. That is, we extract a sample of entities
from each data source and learn clusters of entities. Then, we exploit the con-
structed clusters as a description for data sources, and find contextual sources
via similarity measures between entity clusters.
Contributions. In this work, we provide the following contributions:
(1) We present an entity-based solution for data source contextualization
in the Web of data. This engine is based on well-known data mining
strategies, and does not require schema information or data adhering to
a particular form.
(2) We implemented our system, the data-portal, based on a real-world use
case, thereby showing its practical relevance and feasibility. A prototype
version of this portal is freely available and currently tested by a pilot
customer.6
Outline. In Sect. 2, we present a real-world use case. In Sect. 3, we outline
our contextualization engine, before we discuss the data portal system in Sect.
4. We present related work in Sect. 5. We conclude with Sect. 6.
2 Use Case Scenario
In this section, we introduce a real-world use case to illustrate challenges and
opportunities in contextualizing data sources. The scenario is situated in financial
research, provided by a pilot user in a private bank.
In their daily work, financial researchers heavily rely on a variety of open and
closed Web data sources in order to provide prognoses of future trends. A typical
example is the analysis of government debt. During the financial crisis in 2008-2009, most European countries incurred high debts. To lower doubts about repaying
5 http://www.freebase.com/
6 http://data.fluidops.net/
these debts, most countries set up a plan to reduce their public budget deficits.
The fulfillment of these plans is essential for the Euro zone’s development.
To analyze such plans, a financial researcher requires an overview of public revenue and expenditure in relation to the gross domestic product (GDP). To measure this, she needs information about the deficit target, the revenue, expenditure, and deficit, as well as GDP estimates. This information is publicly available,
provided by catalogs like Eurostat and Worldbank. However, it is spread across a
huge space of sources. That is, there is no single source satisfying her information
needs, instead data from multiple sources have to be identified and combined.
To start her search process, a researcher may give “gross domestic product” as a keyword query. The result is GDP data from a large number of sources. At this point, data source selection is “hidden” from the researcher, and sources are solely ranked via the number and quality of keyword hits. However, knowing where
her information comes from is critical. In particular, she may want to restrict
and/or know the following meta-data:
– General information about the data source, e.g., the name of the author and
a short description of the data source contents.
– Information about entities contained in the data source, e.g., the single coun-
tries of the European Union.
– Description about the dimensions of the observations, e.g., the covered time
range or the data unit of the observations.
By means of faceted search, the researcher finally restricts her data source
to tec00001 (Eurostat, Fig. 1) featuring “Gross domestic product at market
prices”. However, searching the data source space in such a manner requires ex-
tensive knowledge. Further, the researcher was not only interested in plain GDP
data – she was also looking for additional information.
For this, a system should suggest data sources that might be of interest, based
on sources known to be relevant. These contextual sources may feature related,
additional information w.r.t. current search results/sources. For instance, data
sources containing information about the GDP of further countries or with a dif-
ferent temporal range. By such means, the researcher may discover new sources
more easily, as one source of interest links to another – allowing her to explore
the space of sources.
3 Contextualization Engine
In this section, we outline an approach for Web data source contextualization.
For this, we conceive a data source D ∈ 𝒟 as a set of multiple RDF graphs, with 𝒟 as the set comprising all sources in the data space. Further, an entity e is given by an RDF instance contained in source D, and described by a subgraph Ge in D [6]; see also Fig. 1.
Related Entities. The intuition behind our approach is simple: if data
sources contain similar entities, they are somehow related. In other words, we
rely on entities to capture the “latent” semantics of data sources. That is, we employ offline procedures as follows: we (1) extract entities, (2) measure similarity
between them, and (3) cluster them.
(1) Entity Extraction. We start by extracting entities from each source D. First,
for scalability reasons, we go over all entities in data graphs in D and collect
a sample, with every entity having the same probability of being selected.
For each selected entity e, we crawl its surrounding subgraph – resulting in
a graph Ge that describes e [6]. For cleaning Ge , we apply standard data
cleansing strategies to fix, e.g., missing or wrong data types.
(2) Entity Similarity. In a second step, we define a dissimilarity measure, dis, between two entities based on previous work on kernel functions for RDF [12]. That is, for a given entity pair e' and e'', we count common substructures in Ge' and Ge''. The more “overlapping structures” are found between the two graphs, the lower we score the dissimilarity between e' and e''. In addition to the structural characteristics of entities, we also consider their literal dissimilarity. For entities e' and e'' we pairwise compare their literals by means of string and numerical kernels, respectively [16]. The former counts the number of common substrings, given a literal pair from e'/e''. The latter, on the other hand, computes the numerical distance between two literals associated with e' and e''. We aggregate these three different dissimilarity measures for e' and e'' via kernel aggregation strategies [16]. Intuitively, such a kernel aggregation combines multiple kernels using, e.g., weighted summation.
(3) Entity Clustering. Last, we apply clustering techniques to mine for entity
groups. More precisely, we aim at discovering clusters, Cj , comprising similar
entities, which may or may not originate from the same source. Thus, clusters
relate sources by relating their contained entities. We use k-means [10] as
a well-known and simple algorithm for computing entity clusters. k-means
adheres to four steps [10]: (a) Choose k initial cluster centers mi. (b) Based on the above dissimilarity function, dis, an indicator function is given as: 1(e, Cj) is 1 iff dis(e, Cj) < dis(e, Cz) for all z ≠ j, and 0 otherwise. Intuitively, 1(·, ·) assigns each entity e to its “closest” cluster Cj. (c) Update the cluster centers mi, and reassign (if necessary) entities to new clusters. (d) Stop if a convergence threshold is reached, e.g., no (or minimal) reassignments occurred; otherwise go back to (b). A minimal sketch of this offline pipeline is given below.
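To make the offline procedure concrete, here is a minimal Python sketch of the three steps: uniform entity sampling, an aggregated dissimilarity, and a k-means-style clustering loop. It is a simplified stand-in, not the portal's implementation: entities are reduced to sets of (predicate, object) pairs, a Jaccard-style distance replaces the RDF/string/numeric kernel aggregation of [12, 16], and medoids serve as cluster centers since the kernel-induced dissimilarity has no vector mean.

import random

def sample_entities(entities, n):
    # Step (1): uniform sample; every entity has the same selection probability.
    return random.sample(entities, min(n, len(entities)))

def dis(e1, e2):
    # Step (2): aggregated dissimilarity; a Jaccard distance over
    # (predicate, object) pairs stands in for the kernel aggregation.
    union = e1 | e2
    return 0.0 if not union else 1.0 - len(e1 & e2) / len(union)

def cluster_entities(entities, k, max_iter=20):
    # Step (3): k-means-style loop; medoids replace vector means
    # because the dissimilarity is not Euclidean.
    medoids = random.sample(entities, k)                       # (a) initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for e in entities:                                     # (b) assign to closest center
            j = min(range(k), key=lambda i: dis(e, medoids[i]))
            clusters[j].append(e)
        new_medoids = [
            min(c, key=lambda m: sum(dis(m, e) for e in c)) if c else medoids[j]
            for j, c in enumerate(clusters)                    # (c) update centers
        ]
        if new_medoids == medoids:                             # (d) convergence check
            break
        medoids = new_medoids
    return clusters

# Toy usage with two single-entity sources (cf. Fig. 1):
src1 = [frozenset({("es-prop:geo", "DE"), ("es-prop:indic_na", "B11")})]
src2 = [frozenset({("es-prop:geo", "DE"), ("es-prop:indic_na", "F2")})]
print(cluster_entities(sample_entities(src1 + src2, 100), k=1))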
Contextualization Score. Similar to [4], given a source D', we compute two scores, ec(D'' | D') and sc(D'' | D'), for quantifying the contextualization of D' via a second source D''. Both scores are aggregated to a contextualization score for data source D'' given D'.
The former is an indicator for the entity complement of D'' w.r.t. D'. That is, ec asks: how many new, similar entities does D'' contribute to the entities given in D'? The latter score, sc, measures how many new “dimensions” are added by D'', compared to those already present in D' (schema complement). In contrast to [4], however, we do not rely on any kind of “external” information, such as top-level schema. Instead, we solely exploit semantics as captured by entities.
[Fig. 2: architecture diagram of the FluidOps Data Portal. It shows the data source exploration service (with the contextualization engine, entity clusters, and data source meta-data), the query processing service issuing SPARQL queries against a federation layer, and the data access layer, where a data loader populates per-source SPARQL endpoints (gov_q_ggdebt, tec00001, NY.GDP.MKTP.CN) from the Eurostat and Worldbank providers.]
Fig. 2: The data portal system features two kinds of services: source space ex-
ploration, and query processing. For the former, our source contextualization
engine is integrated as a key component. Overall, source space exploration re-
quires source meta-data as well as entity clusters to be available. Entity clusters
are computed as an offline process, while meta-data may be updated frequently
during runtime. On the other hand, query processing distributes query fragments
via a federation layer. Each fragment is evaluated over one or more sources. For
this, each data source is mapped to a SPARQL endpoint, for which data is accessed via a data loader. For our running example, the necessary sources are loaded via three endpoints: gov_q_ggdebt, tec00001, and NY.GDP.MKTP.CN.
Let us first define an entity complement score ec : 𝒟 × 𝒟 → [0, 1]. In the most simplistic manner, we may measure ec by counting the overlapping clusters between both sources:

\[ ec(D'' \mid D') := \frac{\sum_{C_j \in cluster(D')} \mathbf{1}(C_j, D'')\,|C_j|}{\sum_{C_i \in cluster(D')} |C_i|} \]

with cluster as the function mapping a data source to the clusters its entities are assigned to. Further, let 1(C, D) be an indicator function, returning 1 if cluster C is associated with data source D via one or more entities in D, and 0 otherwise.
Considering the schema complement score, sc : 𝒟 × 𝒟 → [0, 1], we aim to count new dimensions (properties) that are introduced by D''. Thus, a simple realization of sc may be given by:

\[ sc(D'' \mid D') := \sum_{C_j \in cluster(D'')} \frac{\bigl|\, props(C_j) \setminus \bigcup_{C_i \in cluster(D')} props(C_i) \,\bigr|}{|props(C_j)|} \]

with props as the function projecting a cluster C to a set of properties, where each property is contained in a description of an entity in C.
Finally, a contextualization score cs is obtained by a monotonic aggregation of ec and sc. In our case, we apply a weighted summation:

\[ cs(D'' \mid D') := \tfrac{1}{2} \cdot ec(D'' \mid D') + \tfrac{1}{2} \cdot sc(D'' \mid D') \]
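A minimal Python sketch of how these scores could be computed online from statistics kept in memory (cf. the following paragraph on runtime behavior): a mapping cluster_of from sources to the clusters containing their entities, cluster sizes size_of, and a mapping props_of from clusters to their properties. The names and the exact normalizations (the denominator of ec and the averaging in sc) are assumptions chosen to keep both scores in [0, 1]; they are not taken verbatim from the implementation.

def ec(d2, d1, cluster_of, size_of):
    # Entity complement: size-weighted share of D'-clusters that D'' also populates.
    clusters = cluster_of[d1]
    total = sum(size_of[c] for c in clusters)
    if total == 0:
        return 0.0
    return sum(size_of[c] for c in clusters if c in cluster_of[d2]) / total

def sc(d2, d1, cluster_of, props_of):
    # Schema complement: average fraction of properties per D''-cluster
    # that do not occur in any cluster of D'.
    known = set().union(*(props_of[c] for c in cluster_of[d1])) if cluster_of[d1] else set()
    clusters = [c for c in cluster_of[d2] if props_of[c]]
    if not clusters:
        return 0.0
    return sum(len(props_of[c] - known) / len(props_of[c]) for c in clusters) / len(clusters)

def cs(d2, d1, cluster_of, size_of, props_of):
    # Contextualization score: equally weighted sum of both complements.
    return 0.5 * ec(d2, d1, cluster_of, size_of) + 0.5 * sc(d2, d1, cluster_of, props_of)

# Toy usage: source "D1" populates clusters 0 and 1, source "D2" clusters 1 and 2.
cluster_of = {"D1": {0, 1}, "D2": {1, 2}}
size_of = {0: 40, 1: 60, 2: 50}
props_of = {0: {"es-prop:geo"}, 1: {"es-prop:geo", "es-prop:unit"}, 2: {"es-prop:indic_na"}}
print(cs("D2", "D1", cluster_of, size_of, props_of))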
Runtime Behavior and Scalability. Regarding online performance, i.e.,
computation of contextualization score cs, given the offline learned clusters, we
aimed at simple and lightweight heuristics. For ec, only the assignment of data sources to clusters (function cluster(D)) and the cluster sizes |C| are needed. Further, measure sc only requires an additional mapping of clusters to “contained” properties (function props(C)). All necessary statistics are easily kept in memory. With regard to offline clustering behavior, we expect our approach to perform well, as existing work on kernel k-means clustering has shown that such approaches scale to large data sets, e.g., [3, 18].
4 Source Contextualization in the Data-Portal
We have implemented the presented algorithms (Sect. 3) for data source con-
textualization in a data-portal, enabling on demand access to data sources from
a number of open statistic data catalogs. Based on our real-world use case, we
show how the source contextualization is used within this portal.
Overview. Towards an active involvement of users in the source selection
process, we implemented a contextualization engine and integrated it into a system offering two services: source space exploration and distributed query processing.
Using the former, users may explore the space of sources, i.e., search and
discover data sources of interest. Here, the contextualization engine fosters dis-
covery of relevant sources during exploration. The query processing service, on
the other hand, allows queries to be federated over multiple sources. See also
Fig. 2 for an overview.
Interaction between both services is tight and user-driven. In particular,
sources discovered during source exploration may be used for answering queries.
On the other hand, sources employed for result computation may be inspected,
and via contextualization other relevant sources may be found.
The data portal is based on the Information Workbench [7], and a running
prototype is available.7 Following our use-case (Sect. 2), we populated the system
7 http://data.fluidops.net/
Fig. 3: Faceted search exploration of data sources.
with statistical data/sources from Eurostat and Worldbank. This population
involved an extraction of meta-data from data catalogs, represented using the
VoID and DCAT vocabularies. The meta-data includes information about the
accessibility of the actual data sources, which is used in a second step to load
and populate the data sources locally. Every data source is stored in a triple
store and accessible via a dedicated SPARQL endpoint. Overall, we have a total
of more than 10000 data sources available.
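As an illustration of how such VoID/DCAT meta-data could be consumed programmatically, the following Python sketch queries dataset descriptions via SPARQL. The endpoint URL is hypothetical and merely stands in for wherever the portal exposes its meta-data; void:Dataset, dcterms:title, and void:sparqlEndpoint are standard VoID and Dublin Core terms.

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical meta-data endpoint; not the portal's actual API.
sparql = SPARQLWrapper("http://example.org/portal/metadata/sparql")
sparql.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX void:    <http://rdfs.org/ns/void#>
    SELECT ?dataset ?title ?endpoint WHERE {
      ?dataset a void:Dataset ;
               dcterms:title ?title ;
               void:sparqlEndpoint ?endpoint .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["title"]["value"], row["endpoint"]["value"])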
Source Exploration and Selection. A typical search process starts with
looking for “the right” sources. That is, a user begins with exploration of the
data source space. For instance, she may issue a keyword query “gross domestic
product”, yielding sources with matching words in their meta-data. If this query
does not lead to sources suitable for her information need, a faceted search
interface or a tag cloud may be used. For instance, she refines her sources via the entity “Germany” in a faceted search (Fig. 3).
Once the user has discovered a source of interest, its structure as well as entity information is shown. For example, a textual source description for GDP (current US$) is given in Fig. 4. More details about source GDP (current US$) are given via an entity and schema overview, respectively (Fig. 5a/b). Note, entities used here have been extracted by our approach, and
are visualized by means of a map. Using these rich source descriptions, a user
can get to know the data and data sources before issuing queries.
Further, for every source a ranked list of contextualization sources is given.
For GDP (current US$), e.g., the source GDP at Market Prices is recommended (Fig. 5c). This way, the user is guided from one source of interest to another.
At any point, she may select a particular source for querying. Eventually, she not only knows her relevant sources, but has also gained first insights into data schema and entities.
Fig. 4: Textual information for source GDP (current US$).
Processing Queries over Selected Sources. In this second component,
we provide means for issuing and processing queries over multiple (previously
selected) sources. Say, a user has chosen GDP (current US$) as well as its con-
textualization source GDP at Market Prices (Fig. 5-d). Due to her previous
exploration, she knows that the former provides the German GDP from 2000 -
2010, while the second one features GDP from years 2011 and 2012 in Germany.
Knowing data sources that contain the desired data, the user may simply add
them to the federation by clicking on the corresponding button. The federation
can then be queried transparently, i.e., as if the data was physically integrated
in a single source. Query processing is handled by the FedX engine [15], which
enables efficient execution of federated queries over multiple data sources.
Following the example, the user may issue the SPARQL query shown in Fig.
6. Here, the user combined data from the sources GDP (current US$) and GDP
at market prices. That is, while GDP data from 2000 to 2010 was retrieved
from source GDP (current US$), GDP information for the years 2011 and 2012
was loaded from the data source GDP at market prices.
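As a sketch of how such a federated query might be submitted programmatically, the Python snippet below sends a query to a single SPARQL endpoint that is assumed to front the FedX federation; the endpoint URL is illustrative, and the query is abbreviated (see Fig. 6 for the full UNION query).

from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed endpoint of the federation set up by the user; illustrative only.
fed = SPARQLWrapper("http://example.org/portal/federation/sparql")
fed.setQuery("""
    PREFIX qb:             <http://purl.org/linked-data/cube#>
    PREFIX sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>
    PREFIX sdmx-measure:   <http://purl.org/linked-data/sdmx/2009/measure#>
    SELECT ?year ?gdp WHERE {    # abbreviated; see Fig. 6 for the full UNION query
      ?obs a qb:Observation ;
           sdmx-dimension:timePeriod ?year ;
           sdmx-measure:obsValue ?gdp .
    } ORDER BY ?year
""")
fed.setReturnFormat(JSON)
results = [(b["year"]["value"], b["gdp"]["value"])
           for b in fed.query().convert()["results"]["bindings"]]
# results can then be fed into a chart widget, cf. Fig. 7.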
The results of this query can be visualized using widgets offered by the Information Workbench. For instance, as shown in Fig. 7, GDP information for Germany may be depicted as a bar chart.
Future Applications of the Contextualization Engine. Besides the
current usage of source contextualization, we see further applications in the
future. In particular, the learned entity clusters may be used for data source
search result visualization, or even for visualization of SPARQL query results.
Further, SPARQL results could be ranked based on data source contextualization
sources and user inputs for source selection, respectively.
Fig. 5: (a+b) Source information for GDP (current US$) based on its entities
and schema, respectively. (c) Contextualization sources for GDP (current US$).
SELECT ?year ?gdp
WHERE {
  {
    ?obs1 rdf:type qb:Observation ;
          wb-property:indicator wbi-ci:NY.GDP.MKTP.CN ;
          sdmx-dimension:refArea wbi-cc:DE ;
          sdmx-dimension:refPeriod ?year ;
          sdmx-measure:obsValue ?gdp .
  }
  UNION
  {
    ?obs2 rdf:type qb:Observation ;
          qb:dataset es-data:tec00001 ;
          es-property:geo es-dic:geo#DE ;
          sdmx-dimension:timePeriod ?year ;
          sdmx-measure:obsValue ?gdp .
    FILTER (?year > "2010-01-01"^^xsd:date)
  }
}
Fig. 6: Query asking for Germany’s GDP from 2000-2012. Relevant sources, GDP
(current US$) and GDP at Market Prices, were selected during source explo-
ration.
5 Related Work
Closest to our approach is recent work on “finding related tables” on the Web
[4]. In fact, our notion of entity and schema complement is adopted from that
Fig. 7: Visualization of results for query in Fig. 6.
paper. However, [4] focuses on flat entities in Web tables, i.e., entities adhere
to a simple and fixed relational structure. In contrast, we consider entities as
subgraphs contained in Web data sources. Further, we do not require any kind of
“external” information. Most notably, we do not use top-level schema. We argue
that relying on such information would limit the applicability of our approach.
Also related are approaches on data source recommendation for source link-
ing, e.g., [13, 14]. Here, given a source D, the task is to find (and rank) other
sources sharing same contents, in order to interlink such data sources with D.
Existing works commonly exploit keyword search, ontology matching, or user
feedback/information. In contrast, our contextualization engine does not depend
on user or schema information. Instead, it exploits clusters of entities, learned
from the data, and based on structural and literal similarities. Most importantly,
however, our goal differs: recommendation for source linking aims at discovering exactly the same entities across sources. Instead, we aim at finding either
completely new entities, which are somehow related to known/relevant entities
(entity complement), or same entities that feature different properties (schema
complement), i.e., provide additional information.
Another line of work is concerned with query processing over distributed
RDF data, e.g., [5, 8, 9, 11]. During source selection, these approaches frequently
exploit indexes or source meta-data, for mapping queries/query fragments to
sources. Our approach is complementary, as it enables systems to involve their users during source selection. We outlined such an extension of the traditional
search process, as well as its benefits throughout the paper.
Last, data integration for Web search has received much attention. Some
works target rewriting queries, e.g., [2, 19], while others rely on keyword search,
reducing queries and sources to bags-of-words, e.g., [1, 17]. We target, however,
a “fuzzy” form of integration, i.e., we do not give exact mappings of entities,
but merely measure whether sources contain entities that might be “somehow”
related. That is, our contextualization score indicates whether sources might
refer to similar entities, and may provide different data for these entities.
6 Conclusion and Future Work
We presented a novel approach for Web data source contextualization. For this,
we adapted well-known techniques from the field of data mining. More precisely,
we provide a framework for source contextualization, to be instantiated in an
application-specific manner. By means of a real-world use-case and prototype,
we show how source contextualization allows for user involvement during source
selection. Based on our use-cases and data-portal system, we plan to conduct
empirical experiments for validating the effectiveness of our approach. In fact,
we aim at a comparison with related work from the field of web-table contextu-
alization, as discussed in Sect. 5.
References
1. R. Blanco, P. Mika, and S. Vigna. Effective and efficient entity search in RDF data. In ISWC, 2011.
2. A. Calì, D. Lembo, and R. Rosati. Query rewriting and answering under constraints in data integration systems. In IJCAI, 2003.
3. R. Chitta, R. Jin, T. C. Havens, and A. K. Jain. Approximate kernel k-means:
solution to large scale kernel clustering. In SIGKDD, 2011.
4. A. Das Sarma, L. Fang, N. Gupta, A. Halevy, H. Lee, F. Wu, R. Xin, and C. Yu.
Finding related tables. In SIGMOD, 2012.
5. O. Görlitz and S. Staab. SPLENDID: SPARQL Endpoint Federation Exploiting
VOID Descriptions. In COLD Workshop, 2011.
6. G. A. Grimnes, P. Edwards, and A. Preece. Instance based clustering of semantic
web resources. In ESWC, 2008.
7. P. Haase, M. Schmidt, and A. Schwarte. The information workbench as a self-
service platform for linked data applications. In COLD Workshop, 2011.
8. A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data
summaries for on-demand queries over linked data. In WWW, 2010.
9. O. Hartig, C. Bizer, and J. Freytag. Executing SPARQL queries over the web of
linked data. In ISWC, 2009.
10. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM
Computing Surveys, 1999.
11. G. Ladwig and T. Tran. Linked data query processing strategies. In ISWC, 2010.
12. U. Lösch, S. Bloehdorn, and A. Rettinger. Graph kernels for RDF data. In ESWC, 2012.
13. A. Nikolov and M. d’Aquin. Identifying relevant sources for data linking using a
semantic web index. In Workshop on Linked Data on the Web, 2011.
14. L. A. P. Paes Leme, G. R. Lopes, B. P. Nunes, M. Casanova, and S. Dietze.
Identifying candidate datasets for data interlinking. In ICWE, 2013.
15. A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. FedX: Optimization
Techniques for Federated Query Processing on Linked Data. In ISWC, 2011.
16. J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. 2004.
17. H. Wang, Q. Liu, T. Penin, L. Fu, L. Zhang, T. Tran, Y. Yu, and Y. Pan. Semplore:
A scalable IR approach to search the Web of Data. JWS, 2009.
18. R. Zhang and A. Rudnicky. A large scale clustering scheme for kernel k-means. In
Pattern Recognition, 2002.
19. X. Zhou, J. Gaugaz, W.-T. Balke, and W. Nejdl. Query relaxation using malleable
schemas. In SIGMOD, 2007.