    A Clustering Approach for Holistic Link Discovery
                   (Project overview)

                    Markus Nentwig, Anika Groß, and Erhard Rahm

          Database Group, Department of Computer Science, University of Leipzig


       Abstract. Pairwise link discovery approaches for the Web of Data do not scale to
       many sources, thereby limiting the potential for data integration. We thus propose
       a holistic approach for linking many data sources based on a clustering of entities
       representing the same real-world object. Our clustering approach utilizes existing
       links and can deal with entities of different semantic types. The approach is able
       to identify errors in existing links and can find numerous additional links. An ini-
       tial evaluation on real-world linked data shows the effectiveness of the proposed
       holistic entity matching.


1   Introduction
Linking entities between sources has been a major effort in recent years to support data
integration in the so-called Web of Data. A large number of tools for semi-automatic
link discovery have been developed to facilitate the generation of new links (mostly of
type owl:sameAs) [5]. Repositories such as BioPortal [7] or LinkLion [6] collect
numerous links for many sources to improve their availability and re-usability without
having to repeatedly determine such links for new applications and use cases.
    Despite the advances made, there are significant limitations in the achieved inter-
linking of data sources and in the current approaches for link discovery. First, the de-
gree of inter-linking is still low and automatically generated links are wrong in many
cases [2]. Current approaches for link discovery only match two data sources at a time
(pairwise linking) resulting in a poor scalability to many sources [8]. This is because
the number of pairwise mappings increases quadratically with the number of sources,
e.g., one would need almost 20,000 mappings to fully interconnect 200 sources.
    Most current link discovery approaches process only two data sources at a time,
which restricts scalability to many sources; only few approaches natively support
multiple data sources. In [3], the quality of joins on Linked Open Data is improved
by determining highly connected entity groups in a set of given links using metrics
such as edge betweenness. The joint entity matching approach in [1] aims at finding
links between multiple data sources based on an iteratively adapted matrix of pairwise
similarity values. Existing approaches to determine owl:sameAs links also focus on
entities of the same type while many sources contain entities of different types (biblio-
graphic datasets contain publication and author entities, geographical datasets contain
numerous kinds of entities such as countries, lakes, etc.). Furthermore, existing links
are hardly utilized when additional links need to be determined.
    The need for holistic approaches to integrate many data sources has been outlined
in [8] with the suggestion to use clustering-based approaches to link and fuse matching
entities for improved scalability. We are working on such clustering-based approaches
for the Web of Data [4] and summarize the approach and initial evaluation results in
this short project overview. The approach utilizes already existing links and supports
the integration of entities of different semantic types. All matching entities from differ-
ent sources are grouped into a single cluster thereby supporting a much more compact
representation of match results than with binary links. Furthermore, the cluster-based
approach facilitates the integration of additional sources and entities since they only
need to be matched against the set of already existing clusters rather than performing
pairwise linking with numerous different sources.
     We consider a set of k data sources containing entities of different types. Each
entity e is referenced by a URI and has a set of describing semantic properties (i.e.,
RDF vocabulary). Two entities of different sources can be connected by an owl:sameAs
link if they were found to represent the same real-world object. All same-as links be-
tween two sources Si and Sj (1 ≤ i, j ≤ k) constitute a binary equivalence mapping
Mi,j = {(e1, e2, sim) | e1 ∈ Si, e2 ∈ Sj, sim ∈ [0, 1], i ≠ j}. Link discovery tools can
assign a similarity value sim to indicate the strength of a connection, with 1 denoting
equality (highest similarity). For k data sources, there can be up to k·(k−1)/2 such
equivalence mappings. For holistic entity clustering, we use the set of existing mappings
M = ∪i,j=1..k Mi,j and the set of associated entities E of the k data sources as input.
The goal is to compute a set of n clusters C = {c1, . . . , cn} such that each cluster only
includes matching entities (denoting the same real-world object) and that different clus-
ters represent different entities. In this paper, we consider duplicate-free data sources,
such that a cluster can contain at most k entities. For each cluster we determine a cluster
representative r derived from the cluster entities to simplify the comparison between
clusters.
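
To make this setting concrete, the following minimal Python sketch models entities, links and clusters as used in the rest of the paper; the class and field names are our own illustration, not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass(frozen=True)
class Entity:
    uri: str                      # entities are referenced by a URI
    source: str                   # data source Si the entity belongs to
    label: str = ""               # normalized label property
    etype: Optional[str] = None   # harmonized semantic type, None if unknown

@dataclass(frozen=True)
class Link:
    e1: Entity                    # e1 and e2 come from different sources
    e2: Entity
    sim: float                    # similarity in [0, 1], 1 = equality

@dataclass
class Cluster:
    entities: Set[Entity]         # at most one entity per (duplicate-free) source
    representative: Optional[dict] = None

    def sources(self) -> Set[str]:
        # data sources already covered; a cluster holds at most k entities
        return {e.source for e in self.entities}
```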
     The following Sec. 2 will describe and illustrate the workflow for the proposed
holistic entity clustering. We then present preliminary evaluation results in Sec. 3 and
conclude.



2   Holistic Clustering

Our holistic clustering approach utilizing existing links consists of four main steps:
preprocessing, initial clustering based on connected components, cluster splitting and
iterative cluster merging. We illustrate the approach in Fig. 1 for partially linked geo-
graphical entities from four data sources. Due to space restrictions, linked entities (and
corresponding properties) are shortened to their IDs and clusters are represented by
thick bordered boxes. While our algorithm is generic, it can be customized to spe-
cific domains by providing appropriate background knowledge, similarity functions and
thresholds to determine relevant entities and clusters. For the considered geographical
domain, the similarity function determines a combined similarity from the string (tri-
gram) similarity on normalized labels, the similarity of the semantic entity type and the
normalized geographical distance. The details of the workflow will be described in the
rest of this section.
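
As an illustration of such a domain-specific function, here is a minimal sketch of a combined similarity for the geographical domain; the Dice-style trigram similarity, the neutral score for missing types, the 100 km distance cut-off and the weights are our assumptions, not values from the paper.

```python
import math

def trigram_sim(label1: str, label2: str) -> float:
    # Dice coefficient over character trigrams of the normalized labels
    trigrams = lambda s: {s[i:i + 3] for i in range(len(s) - 2)} if len(s) > 2 else {s}
    t1, t2 = trigrams(label1.lower()), trigrams(label2.lower())
    return 2 * len(t1 & t2) / (len(t1) + len(t2))

def type_sim(t1, t2) -> float:
    # harmonized types: equal -> 1, missing -> neutral (assumed), else 0
    if t1 is None or t2 is None:
        return 0.5
    return 1.0 if t1 == t2 else 0.0

def geo_sim(coord1, coord2, max_km: float = 100.0) -> float:
    # haversine distance, normalized; max_km is an assumed cut-off
    (lat1, lon1), (lat2, lon2) = coord1, coord2
    p1, p2, dp, dl = map(math.radians, (lat1, lat2, lat2 - lat1, lon2 - lon1))
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return max(0.0, 1.0 - (2 * 6371 * math.asin(math.sqrt(h))) / max_km)

def combined_sim(e1: dict, e2: dict, w=(0.5, 0.2, 0.3)) -> float:
    # weighted combination; the weights are assumptions, not from the paper
    return (w[0] * trigram_sim(e1["label"], e2["label"])
            + w[1] * type_sim(e1["type"], e2["type"])
            + w[2] * geo_sim(e1["coord"], e2["coord"]))
```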
[Figure omitted. Fig. 1: Application of holistic clustering to the running example
(entities 0-16 from four data sources): (a) preprocessing, (b) initial clustering,
(c) cluster decomposition, (d) cluster merge; clusters in (c) and (d) carry
representatives r1-r8.]


2.1   Preprocessing

During preprocessing we normalize property values needed for the similarity computa-
tion, i.e., we simplify entity labels, harmonize information about the semantic types of
entities and check that the input mappings do not violate the assumption of duplicate-
free data sources.
    Information about the semantic type of entities differs substantially between sources
or may be missing. For instance, DBpedia uses City and Town whereas Freebase has a
type citytown and other related types. To overcome such differences, we use background
knowledge about the equivalence and comparability of entity types of different sources
to harmonize the type information. We manually determined this type mapping for our
geographical sources although it could be constructed with the help of ontology match-
ing approaches. Based on the type mapping we simplified numerous types to more gen-
eral ones, e.g., the types city or suburb are treated as type Settlement. After harmonizing
the type information, we remove all links where the linked entities have incompatible
types. Note that we do not exclude links to entities with missing type information.
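
A small sketch of such a background-knowledge type mapping; the City/Town/citytown entries follow the paper's example, while the remaining dictionary entries and the key format are illustrative placeholders.

```python
# Background knowledge: source-specific types -> harmonized, more general type.
TYPE_MAPPING = {
    "dbpedia:City": "Settlement",
    "dbpedia:Town": "Settlement",
    "freebase:citytown": "Settlement",
    "dbpedia:Lake": "BodyOfWater",    # illustrative entry
}

def harmonize(raw_type):
    # types missing or absent from the mapping stay None, so that links
    # to entities without type information are not removed
    return TYPE_MAPPING.get(raw_type)

def types_compatible(t1, t2) -> bool:
    # a link is removed only if both harmonized types are known and differ
    return t1 is None or t2 is None or t1 == t2
```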
    With the assumption of duplicate-free data sources in place, we check whether all
input mappings comply with this restriction. In Fig. 1, entities 11 and 12 come from the
same source so that the links (9-11) and (9-12) violate the 1:1 assumption. In such
cases, we only keep the best-matching link (9-11) and drop weaker links (9-12) as
shown in Fig. 1a.
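
A minimal sketch of this one-to-one enforcement, assuming links are given as (e1, e2, sim) tuples and entities as (uri, source) pairs; the function name and data layout are our own.

```python
def enforce_duplicate_free(links):
    """Keep, per entity and counterpart source, only the best-matching link."""
    best = {}
    for link in links:
        e1, e2, sim = link
        for anchor, other in ((e1, e2), (e2, e1)):
            slot = (anchor, other[1])       # one slot per entity and source
            if slot not in best or sim > best[slot][2]:
                best[slot] = link
    # a link survives only if it wins the slot at both endpoints, e.g.
    # (9-11) is kept while the weaker (9-12) is dropped
    return [l for l in links
            if best[(l[0], l[1][1])] is l and best[(l[1], l[0][1])] is l]
```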


2.2   Initial Clustering

Using the preprocessed entities and mappings, we first identify a set of initial clusters
by computing all connected components, i.e., the transitive closure over the given links.
Each resulting connected component forms an initial cluster covering all entities that
are directly or indirectly connected via same-as links in M. In our running example,
we create five different clusters covering 2-4 entities (see Fig. 1 b).
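
The connected components can be computed with a standard union-find structure, as in the following sketch (not necessarily the paper's implementation); entities are assumed hashable, e.g. (uri, source) pairs.

```python
def initial_clustering(entities, links):
    """Connected components over the same-as links (transitive closure)."""
    parent = {e: e for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    for e1, e2, _sim in links:
        parent[find(e1)] = find(e2)           # union the two components

    components = {}
    for e in entities:
        components.setdefault(find(e), set()).add(e)
    return list(components.values())
```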
2.3   Cluster Decomposition

The initially created clusters can contain entities that should actually be separated, e.g.,
due to wrong input links or because of an insufficiently high transitive similarity be-
tween entities. For this reason, we decompose clusters (1) based on incompatible se-
mantic types and (2) by excluding entities based on intra-cluster similarity values.
Finally, for each resulting cluster a cluster representative is created.

Type-based Grouping: While we eliminate links with incompatible semantic types dur-
ing preprocessing, there can be entities without type information (e.g., entity (0)) lead-
ing to clusters with entities of different types during the initial clustering. We split such
clusters into several smaller sub-clusters with entities of the same type. Entities without
semantic types are then added to the sub-cluster of their most similar neighbor using
computed similarities between cluster members. For the considered cluster of our ex-
ample, we first build sub-clusters (2,3) and the singleton cluster (1) of different
types (e.g. Settlement vs. BodyOfWater). The untyped entity (0) is assigned to the
cluster with the more similar (geographically closer) entity (1) resulting in sub-cluster
(0,1).
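
A sketch of this type-based grouping, assuming entities are dictionaries with label, type and coordinate properties and sim is a combined similarity function as described above:

```python
def split_by_type(cluster, sim):
    """Split a cluster into sub-clusters of equal semantic type; entities
    without a type join the sub-cluster of their most similar member."""
    by_type, untyped = {}, []
    for e in cluster:
        if e["type"] is None:
            untyped.append(e)
        else:
            by_type.setdefault(e["type"], []).append(e)
    sub_clusters = list(by_type.values())
    for e in untyped:
        if not sub_clusters:                  # cluster held only untyped entities
            sub_clusters.append([e])
            continue
        # e.g. the untyped entity (0) joins the geographically closer (1)
        target = max(sub_clusters,
                     key=lambda sc: max(sim(e, m) for m in sc))
        target.append(e)
    return sub_clusters
```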

Similarity-based Refinement: We further split clusters based on the computed intra-
cluster similarity between entities. For each entity, we determine the average similarity
of its links to other cluster members and separate an entity if the average similarity is
below a given threshold ts . This process is executed iteratively as long as the average
similarity of an entity is smaller than ts . In the merge phase, such separated entities may
be added to other more similar clusters. As shown in Fig. 1 c we separate entity (7)
from the cluster (6,7,8) since the entity had a low label similarity to (6) and (8).
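
A minimal sketch of this iterative refinement; always removing the entity with the lowest average similarity first is our reading of the description:

```python
def similarity_refinement(cluster, sim, ts=0.7):
    """Iteratively separate the entity with the lowest average link
    similarity to the other members while that average stays below ts."""
    members, separated = list(cluster), []
    while len(members) > 1:
        def avg_sim(e):
            return sum(sim(e, o) for o in members if o is not e) / (len(members) - 1)
        worst = min(members, key=avg_sim)
        if avg_sim(worst) >= ts:
            break                      # all remaining averages meet the threshold
        members.remove(worst)          # e.g. entity (7) leaves cluster (6,7,8)
        separated.append(worst)        # may join a better cluster when merging
    return members, separated
```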

Cluster Representative: For each resulting cluster we create a cluster representative to
(1) facilitate the computation of inter-cluster similarities in the merge step and (2) to
efficiently match new entities, e.g., from additional data sources. We create the repre-
sentative by combining the properties from all entities in a cluster and select a preferred
value for each property, e.g., based on a majority consensus, the maximal length of la-
bels or pre-determined source priorities (for geo-coordinates). We also keep track of the
data sources represented in the cluster to avoid unnecessary merges for already covered
data sources.
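
A sketch of the representative construction; majority consensus for the type, the longest label, and source priorities for geo-coordinates follow the text, while the concrete priority order and selection rules are our assumptions.

```python
from collections import Counter

def build_representative(cluster, source_priority=("GeoNames", "DBpedia", "Freebase")):
    """Combine member properties into a single cluster representative."""
    types = [e["type"] for e in cluster if e["type"]]
    rep = {
        "type": Counter(types).most_common(1)[0][0] if types else None,
        "label": max((e["label"] for e in cluster), key=len, default=""),
        # track covered sources to avoid merges with overlapping sources
        "sources": {e["source"] for e in cluster},
    }
    ranked = sorted((e for e in cluster if e.get("coord")),
                    key=lambda e: (source_priority.index(e["source"])
                                   if e["source"] in source_priority
                                   else len(source_priority)))
    rep["coord"] = ranked[0]["coord"] if ranked else None
    return rep
```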


2.4   Cluster Merge

Lastly, we merge similar clusters up to the maximal possible cluster size k. To this end,
we determine the similarity between clusters by applying the domain-specific similarity
function on the cluster representatives. Given the typically large number of clusters, this
is an expensive operation if we consider all pairs of clusters (quadratic complexity). We
avoid unneeded comparisons by excluding pairs of clusters with incompatible types,
with overlapping data sources, or whose merge would exceed k entities. The cluster
mapping CM computed for the remaining cluster pairs is restricted to the most similar
pairs of clusters with a similarity exceeding the merge similarity threshold tm .
[Figure omitted. Fig. 2: (a) Data set structure: number of entities and links;
(b) cluster sizes in workflow phases.]


    Cluster merging is an iterative process that continues as long as there are merge can-
didates in CM. In each iteration, we select the pair of clusters (c1, c2) with the highest
similarity from CM, merge them into a new cluster cm and compute a new representative
for it. c1 and c2 are removed from C, and the affected cluster pairs are removed from
CM. Furthermore, cm is added to C, and CM is extended by similar cluster pairs for
the new cluster cm, obeying the above restrictions on cluster pairs. The termination of the
loop and the merge step is guaranteed since we reduce the number of clusters in each
iteration. Applying the approach to our example leads to the merging of (0,1,r1 ) and
(4,5,r3 ) into the new cluster (0,1,4,5,r8 ) (see Fig. 1 c,d) due to a high similarity
of all properties.
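
A sketch of this merge loop, with the cluster mapping CM simplified to a max-heap of candidate pairs in which stale entries are skipped after merges; cluster_sim and make_rep stand in for the similarity function and the representative construction described above.

```python
import heapq

def merge_phase(clusters, cluster_sim, make_rep, k, tm=0.7):
    """Iteratively merge the most similar admissible cluster pair until no
    pair in the cluster mapping CM exceeds the merge threshold tm."""
    alive = {i: c for i, c in enumerate(clusters)}  # c: {"members", "rep"}
    next_id, heap = len(alive), []

    def admissible(c1, c2):
        # skip incompatible types, overlapping sources, oversized results
        t1, t2 = c1["rep"]["type"], c2["rep"]["type"]
        if t1 is not None and t2 is not None and t1 != t2:
            return False
        s1, s2 = c1["rep"]["sources"], c2["rep"]["sources"]
        return not (s1 & s2) and len(s1) + len(s2) <= k

    def push_pairs(cid):
        for oid in list(alive):
            if oid != cid and admissible(alive[cid], alive[oid]):
                s = cluster_sim(alive[cid]["rep"], alive[oid]["rep"])
                if s > tm:
                    heapq.heappush(heap, (-s, cid, oid))  # max-heap via negation

    for cid in list(alive):
        push_pairs(cid)

    while heap:
        _, a, b = heapq.heappop(heap)           # most similar remaining pair
        if a not in alive or b not in alive:
            continue                            # stale pair, a side was merged
        members = alive[a]["members"] + alive[b]["members"]
        del alive[a], alive[b]                  # each merge shrinks the set of
        alive[next_id] = {"members": members,   # clusters, so the loop terminates
                          "rep": make_rep(members)}
        push_pairs(next_id)
        next_id += 1
    return list(alive.values())
```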
    For the given example, we clustered entities from four different data sources thereby
finding previously unknown links and eliminating wrong existing links for improved
data quality. The six resulting clusters (Fig. 1 d), with sizes 2, 4, 2, 1, 3 and 4, implicitly
represent 1+6+1+0+3+6 = 17 pairwise entity links, compared to 12 initially given links
of which 3 turned out to be incorrect. In
particular, we could now identify matches between previously unconnected sources.


3   Initial Evaluation

We evaluate our holistic clustering approach using the location subset of the OAEI
2011 Instance Matching benchmark with links of presumed high quality. Fig. 2 a shows
the number of links between the four geographical data sources and the number of
entities that are interconnected by these links. We retrieved additional entity properties
via SPARQL endpoints or REST APIs of the respective sources in 2015. Still, geo-
coordinates were missing for 1009 entities (13.4%) and type information even for
2525 entities (33.5%), including all entities from the NY Times (NYT) dataset. We use the
similarity function described in Sec. 2; the similarity thresholds ts , tm are set to 0.7.
    We first evaluate the resulting cluster sizes for the different phases of our holistic
clustering approach applied to these datasets (Fig. 2 b). During preprocessing (not
shown in the figure), we already removed seven wrong NYT-GeoNames links based on
the one-to-one cardinality restriction; the missing type information for NYT did not
allow removal of type-incompatible links during preprocessing. The initial clustering
results only in clusters of sizes 3 and 4 since each NYT entity is linked with an entity
in Freebase and DBpedia. Applying the type-based grouping and similarity-based re-
finement results in a significant number of cluster splits and clusters of size 1 and 2 due
to incompatible entity types and partially low intra-cluster similarity. During the merge
phase some of the smaller clusters can be merged into larger ones leading to more clus-
ters of sizes 3 and 4. In particular, 15 singleton clusters could be merged into clusters of
size 2 and 3. Overall, the resulting clusters represent 9510 links with 4596 new links and
713 deleted links compared to the input link set. In particular, we could cluster many
entities from the previously unconnected sources GeoNames, DBpedia and Freebase.


4   Conclusion
We proposed a new holistic approach for clustering-based link discovery for many data
sources. The approach utilizes existing links and can match entities of different seman-
tic types. The determined entity clusters facilitate the integration of further data sources
without having to individually link them to every other data source. An initial evaluation
for linked data from the geographical domain confirmed that the new approach holds
great promise as it can identify wrong links and many additional links even between
previously unconnected sources. In the future, we will evaluate the scalability and qual-
ity of our approach on larger datasets and more sources from different domains based
on a parallel Hadoop-based implementation that is currently under development. We
will also study specific aspects such as improving the quality of current mapping col-
lections like BioPortal and the incremental extension of entity clusters when integrating
new data sources.


References
1. Christoph Böhm, Gerard de Melo, Felix Naumann, and Gerhard Weikum. LINDA: Distributed
   Web-of-Data-Scale Entity Matching. In Proc. of the 21st ACM CIKM, pages 2104–2108.
   ACM, 2012.
2. Daniel Faria, Ernesto Jiménez-Ruiz, Catia Pesquita, Emanuel Santos, and Francisco M Couto.
   Towards annotating potential incoherences in BioPortal mappings. In The Semantic Web–
   ISWC 2014, pages 17–32. Springer, 2014.
3. Jan-Christoph Kalo, Silviu Homoceanu, Jewgeni Rose, and Wolf-Tilo Balke. Avoiding Chi-
   nese whispers: Controlling end-to-end join quality in linked open data stores. In ACM Web
   Science 2015, 2015.
4. Markus Nentwig, Anika Groß, Axel-Cyrille Ngonga Ngomo, and Erhard Rahm. Holistic
   Entity Clustering for Linked Data. Technical report, 2016. Submitted for publication.
5. Markus Nentwig, Michael Hartung, Axel-Cyrille Ngonga Ngomo, and Erhard Rahm. A Sur-
   vey of Current Link Discovery Frameworks. Semantic Web J., 2016.
6. Markus Nentwig, Tommaso Soru, Axel-Cyrille Ngonga Ngomo, and Erhard Rahm.
   LinkLion: A Link Repository for the Web of Data. In ESWC 2014 Posters & Demos, 2014.
7. Natalya F Noy, Nigam H Shah, Patricia L Whetzel, Benjamin Dai, Michael Dorf, et al. Bio-
   Portal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res,
   37:W170–W173, 2009.
8. Erhard Rahm. The Case for Holistic Data Integration. In Proc. ADBIS. Springer LNCS, 2016.