Applying edge-counting semantic similarities to Link
            Discovery: Scalability and Accuracy

    Kleanthi Georgala1,2 , Mohamed Ahmed Sherif2 , Michael Röder2 , and Axel-Cyrille
                                  Ngonga Ngomo1,2
              1
               Department of Computer Science, Paderborn University, Germany,
              2
                Department of Computer Science, University of Leipzig, Germany
                      georgala@informatik.uni-leipzig.de
            {mohamed.sherif,michael.roeder,axel.ngonga}@upb.de


         Abstract. With the growth in number and variety of RDF datasets comes an in-
         creasing need for both scalable and accurate solutions to support link discovery at
         instance level within and across these datasets. In contrast to ontology matching,
         most linking frameworks rely solely on string similarities to this end. The limited
         use of semantic similarities when linking instances is partly due to the current
         literature stating that they (1) do not improve the F-measure of instance linking
         approaches and (2) are impractical to use because they lack time efficiency. We
         revisit the combination of string and semantic similarities for linking instances.
         Contrary to the literature, our results suggest that this combination can improve
         the F-measure achieved by instance linking systems when the combination of
         the measures is performed by a machine learning approach. To achieve this in-
         sight, we had to address the scalability of semantic similarities. We hence present
         a framework for the rapid computation of semantic similarities based on edge
         counting. This runtime improvement allowed us to run an evaluation of 5 bench-
         mark datasets. Our results suggest that combining string and semantic similarities
         can improve the F-measure by up to 6% absolute.


1     Introduction
RDF knowledge graphs (KGs) are used in a plethora of applications [10], especially
when published using the Linked Data paradigm. The provision of links3 between such
KGs is of central importance for numerous tasks such as federated queries [22] and
question answering [25]. Popular solutions to linking instances (often called link dis-
covery, short LD in the literature, see [12] for a survey) often implement specialized
measures for particular datatypes (e.g., geospatial or temporal data). In all other cases,
state-of-the-art LD frameworks such as SILK [1] and LIMES [14] rely on string similar-
ities and machine learning to compute links between instances in RDF KGs. While the
   Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons Li-
   cense Attribution 4.0 International (CC BY 4.0). This work has been supported by the EU
   H2020 project KnowGraphs (GA no. 860801) as well as the BMVI projects LIMBO (GA no.
   19F2029C) and OPAL (GA no. 19F2028A).
 3
   The fourth principal of Linked Data, see http://www.w3.org/DesignIssues/
   LinkedData
2          K. Georgala et al.

use of string similarities has been shown to work well in a large number of papers (see,
e.g., [12,4]), string similarities have the major drawback of not considering the seman-
tics of the sequences of tokens they aim to compare. Hence, most string similarity mea-
sures return low scores for pairs of strings such as (lift, elevator), (holiday,
vacation), (headmaster, principal) and (aubergine, eggplant), although
they often stand for the same real-world concepts. Edge-counting semantic similarities
(e.g., [26,9,20]) alleviate this problem by using a dictionary to compute a semantic
distance between sequence of tokens within the need for an overlap. The synonymy
between aubergine and eggplant would hence lead semantic similarity to assign
the pair (aubergine, eggplant) a similarity score close to 1.
    The use of semantic similarities has been paid little attention to in LD for at least two
reasons: First, semantic similarities scale poorly and are thus impractical when used on
large knowledge graphs.4 Moreover, current works (e.g., [11]) suggest that they lead to
no improvement in F-measure. The goal of this paper is hence twofold: (1) we present
means to accelerate the computation of four popular bounded edge-counting semantic
similarities. (2) We then combine string and semantic similarities using two state-of-
the-art machine learning approaches for LD. Our results refute the current state of the
art and suggest that semantic similarities can help achieve better results in LD.


2      Preliminaries

The formal framework underlying our preliminaries is derived from [23]. A KG K is a
set of triples (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L), where I is the set of all IRIs, B is the
set of all RDF blank nodes and L is the set of all literals. LD frameworks aim to compute
the set M = {(s, t) ∈ S × T : R(s, t)} where S and T are sets of RDF resources and
R is a binary relation. Note that this setting generalizes what is often known as entity
matching or deduplication [12], where the relation R must be owl:sameAs. Given
that M is generally difficult to compute directly, declarative LD frameworks compute
an approximation M 0 ⊆ S × T of M by executing a link specification (LS), which we
define formally in the following. Let M be the set of all similarity functions. We define
a similarity function m ∈ M as a function m : S × T × P 2 → [0, 1], where P is the set
of all properties, where ps , pt ∈ P. We write m(s, t, ps , pt ) to signify the similarity of
s and t w.r.t. their properties ps resp. pt . An atomic LS L is a pair L = ((m(ps , pt ), θ),
where θ ∈ [0, 1] is a similarity threshold. A complex LS L is a tuple L = op(L1 , L2 )
where two subspecification L1 and L2 are combined using the specification operator
op. Here, we consider the binary operators union (t), intersection (u) and difference
(\).
     The edge-counting semantic similarities are based on a lexical vocabulary. We de-
fine a lexical vocabulary as a directed acyclic graph (DAG) G = (V, E), where:

    – The set of vertices V is a set of concepts ci , were each ci stands for a set of syn-
      onyms. We denote |V | with nV .
    – E ⊆ V × V is a set of directed edges ejk = (cj , ck ). We denote |E| with nE .
 4
     This general finding is supported by our evaluation results presented in Section 4.
                      Applying edge-counting semantic similarities to Link Discovery             3

 – The edge ejk stands for the hypernymy relation from a parent concept cj to a child
   concept ck . We write cj → ck and we say that cj is a hypernym of ck . We also de-
   fine the hyponymy relation as a directed relation from a child concept ck to a parent
   concept. We write cj ← ck and we say that cj is a hyponym of ck . Hypernymy and
   hyponymy are transitive.
 – The root r is the unique node of the dictionary that has no parent concept.
 – A leaf concept ci is a concept node without any children concepts.
 – A concept is a common subsumer of c1 and c2 (denoted cs(c1 , c2 )) iff that concept
   is a hypernym of both c1 and c2 .
 – The least common subsumer (LSO) of c1 and c2 (denoted lso(c1 , c2 )) is “the most
   specific concept which is an ancestor of both c1 and c2 ” [26].
 – We define the directed path from c1 to c2 via a common subsumer cs(c1 , c2 ) as:
   path(c1 , c2 ) = {c1 ← ci ← . . . ← cs(c1 , c2 ) → cj → . . . → c2 : i, j, k ∈
   N, i, j, k ≤ nv }. Note that there can be multiple path(c1 , c2 ) between two con-
   cepts.
 – len(c1 , c2 ) is the length of the shortest path(c1 , c2 ) between two concepts c1 and
   c2 . Note that len defines a metric. Hence, it is symmetric and abides by the triangle
   inequality, i.e., len(c1 , c2 ) ≤ len(c1 , c3 ) + len(c2 , c3 ) for any (c1 , c2 , c3 ) ∈ V 3 .
 – We define depthm (ci ) as the length of the shortest path between r and ci . Analo-
   gously, depthM (ci ) as the maximum depth(ci ). We set D = max depthM (c).
                                                                                          c∈V

Note that the following holds:

 – depthm (r, ci ) = len(r, ci )
 – depthm (lso(c1 , c2 )) ≤ min(depthm (c1 ), depthm (c2 ))
 – depthM (lso(c1 , c2 )) ≤ min(depthM (c1 ), depthM (c2 ))
 – (triangle inequality) |len(r, c1 ) − len(r, c2 )| ≤ len(c1 , c2 ) ⇔ |depthm (c1 ) −
   depthm (c2 )| ≤ len(c1 , c2 )

    The Shortest Path (SP) similarity [20] of two concepts c1 and c2 is defined as the
length of their shortest path in comparison to the maximum distance (2D). We use the
normalized formulation of SP, i.e.,
                                                2D − len(c1 , c2 )
                               SP(c1 , c2 ) =                      .                            (1)
                                                      2D
   The Leacock and Chodorow metric (LCH) takes both the path between two con-
cepts and the depth of the hierarchy into consideration [8]. We use the normalized for-
mulation of LCH:
                                          
                                          1                                 if c1 = c2
                      LCHN (c1 , c2 ) =       − log
                                                      
                                                          len(c1 ,c2 )
                                                                         
                                                                                                (2)
                                                             2D
                                          
                                                      log(2D)
                                                                             else.

The normalized Wu Palmer (WP) similarity takes the path between two concepts and
the depth of their LSO into consideration [26]:

                                        2 × depthM (lso(c1 , c2 ))
                  WP(c1 , c2 ) =                                                                (3)
                                   2 × depthM (lso(c1 , c2 )) + N1 + N2
4         K. Georgala et al.

where N1 = len(lso(c1 , c2 ), c1 ) and N2 = len(lso(c1 , c2 ), c2 ). The Li et al. metric
(L I) is another take on using the path between two concepts and their LSO to define a
similarity [9]:
                                              eβdepth(lso(c1 ,c2 )) − e−βdepth(lso(c1 ,c2 ))
               L I(c1 , c2 ) = e−αlen(c1 ,c2 ) βdepth(lso(c ,c ))                              (4)
                                              e            1 2      + e−βdepth(lso(c1 ,c2 ))

where L I(c1 , c2 ) ∈ (0, 1). We set depth(lso(c1 , c2 )) = depthM (lso(c1 , c2 )), since the
original specification does not state which depth(lso(c1 , c2 )) to use.


3      Approach

Fundamentally, hECATE aims to compute the set M 0 = {(s, t) ∈ S×T : m(s, t, ps , pt ) ≥
θ}, where m is an edge-counting similarity. To achieve this goal, the approach makes
use of upper bounds which can be derived from the formulation of this family of mea-
sures. Take the SP similarity for example: For any two concepts c1 and c2 , SP(c1 , c2 ) ≥
θ implies len(c1 , c2 ) ≤ 2D(1 − θ). Formally, this means that we can discard all com-
parisons of pairs (c1 , c2 ) with len(c1 , c2 ) > 2D(1 − θ) without compromising the
computation of M 0 . Note that the computation of len(c1 , c2 ) can be carried out on-
line or offline, which affects the total runtime of our approach as discussed in Section
4. As similar bounds can be derived for the other edge-counting measures, hECATE
generalizes the computation of M 0 for edge-counting semantic similarities by using the
following algorithm. Our approach takes (1) two sets of resources, S and T , (2) an
atomic LS L = ((m(ps , pt ), θ), where m is one of the four semantic similarities de-
scribed in Section 2, and (3) a lexical vocabulary structured as DAG (VDAG) as input.
Our goal is to compute the mapping M 0 = [[L]] For each pair (s, t), hECATE retrieves
and pre-processes the property values for ps resp. pt . The pre-processing consists of
tokenizing and extracting all stop-words from the objects of the triples (s, ps , os ) and
(t, pt , ot ). In order to include a pair (s, t) in M 0 , the algorithm compares each set of
source tokens from os (sT okens) to each set of target tokens of ot (tT okens). The
pair of objects (os , ot ) with the highest similarity which abides by the bounds we de-
rive for each measure is finally used to compute the similarity between s and t, and
decides whether or not this pair should be added to M 0 . To do so, for each token
sT oken ∈ sT okens, we find the tT oken ∈ tT okens that is most similar. First, the
algorithm checks if sT oken and tT oken have been compared before If the tokens are
being compared for the first time the algorithm checks if the tokens are equal and as-
signs the value of 1 to T T Sim. Otherwise, it calls the function compare(sToken,
tToken, VDAG) that compares the corresponding sets of concepts obtained from
the input VDAG. 5 Then, T T Sim is compared to the maximum token-to-token similar-
ity and maxT T Sim is updated. The procedure continues until the highest similarity
between the current sT oken and a tT oken is found or maxT T Sim is equal to 1. The
algorithm aggregates the highest similarities maxT T Sim of all sT oken ∈ sT okens
and calculates an average similarity. This is done for all pairs of (sT okens, tT okens)
searching for the pair with the maximum similarity. If this maxSimilarity > θ the
 5
     Note that our algorithm handles homonyms by considering that a token can be included in
     more than one concept.
                    Applying edge-counting semantic similarities to Link Discovery    5

pair (s, t) can be added to the final mapping M 0 . The key behind hECATE lies in the
token comparison algorithm compare(sToken, tToken, VDAG) (Algorithms 1
and 3). For a pair of tokens (sToken, tToken), we retrieve the set of concepts
they belong to in the VDAG. If both sets of concepts are not empty, we compare each
source sCon with each target concept tCon and define the maximum similarity of
two tokens as the highest similarity of the corresponding concept pairs. To do so, we
first retrieve the set of all hypernym paths of each concept to the root of the VDAG using
the getPaths(concept, VDAG) algorithm. This algorithm traverses the VDAG by
utilizing the hypernym relation. It starts from the concept node and explores all paths to
the root node.For SP and LCH, we additionally retrieve the maximum depth D found
in the VDAG and the len(sCon, tCon) before calculating the corresponding sim-
ilarity as described in Equations 1 and 2 resp. For calculating len(sCon, tCon)
our algorithm relies on the set of hypernym paths of the concepts (Algorithm 2). For
each pair of hypernym paths hp1 and hp2 the two concepts have, the algorithm iterates
over both paths simultaneously, from top to bottom, until they do not share a common
node. Then, it proceeds in calculating the length of the newly found path, as the number
of concepts that the two paths do not have in common. Finally, the minimum length
that has been found is returned. For WP and L I, the comparison algorithm retrieves the
depth of the LSO between sCon and tCon (depth( lso(sCon, tCon)), and N1
and N2 by calling the function getLSO (hps1 , hps2 ) (Algorithm 4). This function uti-
lizes the set of hypernym paths in a similar manner as the min length algorithm. For
each combination of hypernym paths hp1 and hp2 of the concepts, the algorithm tra-
verses them simultaneously searching for the last node they have in common. If this
node is deeper than any other common node found so far or it has the same depth but
the remaining paths are shorter, it is taken as new LSO. Accordingly, the remaining
path lengths N1 and N2 are updated. Based on the deepest LSO and the derived val-
ues for depthM (lso(sCon, tCon), N1 and N2 , we proceed in calculating the
corresponding similarity as described in Equations 3 and 4 resp.
     Our first extension of hECATE is based on the idea of pre-computing and storing a
set of values that are used often in our algorithm. For edge-counting similarities, these
are the hypernym paths. Consequently, the extension hECATE-I of hECATE precom-
putes all hypernym paths for all concepts included in the VDAG, using the getPaths(
concept, VDAG) function. Therefore, every time the getPaths(concept, VDAG)
is invoked at runtime, hECATE-I retrieves the paths from an index. Our second ex-
tension of hECATE, hECATE-IF, combines hECATE-I with the idea of minimizing
unnecessary comparison between concepts by filtering out pairs of source and target
concepts that do not satisfy a condition for each semantic similarity. The filtering is
performed inside compare(sToken, tToken, VDAG) for each pair of concepts
sCon and tCon. Given a semantic similarity, if a pair of concepts satisfies the corre-
sponding filtering condition, then the algorithm proceeds normally as described before.
If the condition is not met the algorithm does not compute the similarity between the
two concepts. For the SP similarity, two concepts will be considered for comparison, if
the following holds:
                                           2D − len(c1 , c2 )
                          SP(c1 , c2 ) ≥ θ ⇔                  ≥θ
                                                   2D                                (5)
                     ⇒ |depthm (c1 ) − depthm (c2 )| ≤ 2D(1 − θ)

   For the WP similarity, the following must hold:
6            K. Georgala et al.


    Algorithm 1: compare(sCon, tCon,                         Algorithm 3: compare(sCon, tCon,
    V DAG) for SP or LCH                                     V DAG) for WP or L I
        Input: source concept sCon, target                       Input: source concept sCon, target
              concept tCon, and a vocabulary                           concept tCon, and a vocabulary
              DAG VDAG                                                 DAG VDAG
        Output: a similarity value                               Output: a similarity value
    1   D ← V DAG.getM axDepth(sCon)                         1   hps1 ← getP aths(sCon, V DAG)
    2   hps1 ← getP aths(sCon, V DAG)                        2   hps2 ← getP aths(tCon, V DAG)
    3   hps2 ← getP aths(tCon, V DAG)                        3   depth, N1 , N2 ← getLSO(hps1 , hps2 )
    4   minLength ←                                          4   Return
         getM inLength(hps1 , hps2 )                              computeSimilarity(N1 , N2 , depth)
    5   Return
         computeSimilarity(D, minLength)                     Algorithm 4: getLSO(hps1 , hps2 )
                                                                 Input: two sets of hypernym paths, hps1
    Algorithm 2: getM inLength(hps1 ,                                  and hps2
    hps2 )                                                       Output: depthM (lso(sCon, tCon)), N1
        Input: two sets of hypernym paths, hps1                           and N2
              and hps2                                       1   dLSO ← 0, N1 ← 0, N2 ← 0
        Output: len(sCon, tCon)                              2   foreach hp1 ∈ hps1 do
    1   size ← M AX V ALU E                                  3       foreach hp2 ∈ hps2 do
    2   foreach hp1 ∈ hps1 do                                4             l1 ← 0, l2 ← 0
    3       foreach hp2 ∈ hps2 do                            5             while l1 < hp1 .size() ∧ l2 <
    4            l1 ← 0, l2 ← 0                                             hp2 .size() ∧ hp1 .get(l1 ) ==
    5            while l1 < hp1 .size() ∧ l2 <                              hp2 .get(l2 ) do
                  hp2 .size() ∧ hp1 .get(l1 ) ==             6                  l1 ← l1 + 1, l2 ← l2 + 1
                  hp2 .get(l2 ) do
                                                             7             newSize ←
    6                 l1 ← l1 + 1, l2 ← l2 + 1
                                                                            hp1 .size() + hp2 .size() − 2l1
    7            newSize ←                                   8             oldSize ← N1 + N2
                  hp1 .size() + hp2 .size() − 2l1            9             if condition is met then
    8            if newSize < size then                     10                  dLSO ← l1 ,
                  size ← newSize ;                                                N1 ← hp1 .size() − l1
                                                            11                  N2 ← hp2 .size() − l2
    9   Return size
                                                            12   Return dLSO, N1 , N2


                                                         2depthM (lso(c1 , c2 ))
                          W U (c1 , c2 ) ≥ θ ⇔                                        ≥θ
                                                    2depthM (lso(c1 , c2 )) + N1 + N2
                 ⇔ 2depthM (lso(c1 , c2 )) ≥ θ(N1 + N2 ) + 2θdepthM (lso(c1 , c2 ))
                                                   2depthM (lso(c1 , c2 ))(1 − θ)                             (6)
                                          ⇔ N1 + N2 ≤
                                                                  θ
                                         2 min(depthM (c1 ), depthM (c2 ))(1 − θ)
                             ⇒ N1 + N2 ≤
                                                            θ

Based on the triangle inequality and Section 2, Equation 6 can be written as:
                        Applying edge-counting semantic similarities to Link Discovery         7


                                           2 min(depthM (c1 ), depthM (c2 ))(1 − θ)
                            len(c1 , c2 ) ≤                                         ⇒
                                                              θ                               (7)
                                           2min(depthM (c1 ), depthM (c2 ))(1 − θ)
           |depthm (c1 ) − depthm (c2 )| ≤
                                                              θ
     For the LCH similarity, two concepts will be considered for comparison, iff:
                             −log len(c 1 ,c2 )
                                       2D             log(2D) − log(len(c1 , c2 ))
          LCH(c1 , c2 ) ≥ θ ⇔                   ≥θ⇔                                ≥θ⇔
                                log(2D)                         log(2D)
                         log(len(c1 , c2 ))                                                   (8)
                      1−                     ≥ θ ⇔ log(len(c1 , c2 )) ≤ log(2D)(1 − θ) ⇔
                             log(2D)
             len(c1 , c2 ) ≤ 2log(2D)(1−θ) ⇒ |depthm (c1 ) − depthm (c2 )| ≤ 2log(2D)(1−θ)

     When considering the L I similarity, we make the following variable replacements
for the sake of legibility: x = depthM (lso(c1 , c2 )), y = min(depthM (c1 ), depthM (c2 ))
and z = len(c1 , c2 ). Then, two concepts will be considered for comparison, iff:
                                                                                  (e2βx −1)
                         eβx − e−βx                      eβx − e−βx
 L I(c1 , c2 ) ≥ θ ⇔ e−αz βx            ≥θ⇔e     αz
                                                      ≤ βx                     eβx
                                                                    ⇔ eαz ≤ (e2βx       ⇔
                         e + e−βx                      (e + e−βx )θ             βx
                                                                                  +1)
                                                                                      θ
                                                                                    e

    αz     (e2βx − 1)         (e2βy − 1)
e        ≤ 2βx        ⇒ eαz ≤ 2βy        ⇔ αz ≤ ln(e2βy − 1) − lnθ − ln(e2βy + 1) ⇔
          (e    + 1)θ        (e    + 1)θ
                                                             ln(e2βy − 1) − lnθ − ln(e2βy + 1)
                           |depthm (c1 ) − depthm (c2 )| ≤
                                                                             α
                                                                                            (9)
Based on Equations 5, 7, 8 and 9, each filtering condition requires the knowledge of
depthm (sCon), depthM (sCon), depthm (tCon) and depthM (tCon). Hence, we fur-
ther extend the index hECATE-IF relies on by precomputing depthm (ci ) and depthM (ci )
for every concept ci .


4        Evaluation

Our evaluation addresses the following three research questions: Q1 . How do our strate-
gies for improving the runtime of semantic similarities compare to each other w.r.t.
runtime?, Q2 . How do the different edge-counting semantic similarities compare w.r.t.
runtime?, and Q3 . Can semantic similarities improve the F-measure of LD systems?
    We evaluate our approach against five benchmark data sets: Abt-Buy, Amazon-GP
and DBLP-ACM described in [7], DailyMed-Drugbank (dubbed DM-DB) and Movies
described in [16]. We use WordNet6 as a DAG. To address Q1 and Q2 , we conduct a set
of experiments using the basic hECATE algorithm (dubbed hECATE-B) as a baseline
as well as hECATE-I and hECATE-IF. For an easier comparison, all methods are
implemented in the LD framework L IMES [14]. For hECATE-B and hECATE-I, we
create one atomic LS for each semantic similarity, where m iss the name of the edge-
counting similarity, θ = 0.1. We use the ’description’ as the source and target
properties for Abt-Buy and Amazon-GP datasets, ’title’ for the DBLP-Scholar and
 6
     https://wordnet.princeton.edu/
8       K. Georgala et al.

Movies datasets and ’name’ for the DM-DB dataset. For hECATE-IF, we use the
same values for m, ps and pt as before, but θ is derived from the interval [0.1, 1] with
an increment step of 0.1, since the θ is given as a parameter to the filtering functions.
For each dataset, we perform the aforementioned LSs against 2v instances from the
source and target datasets. We start with v = 2 and increment v until all instances are
covered (e.g., the maximal value of v is 9 for the Amazon-Google dataset). We define
a maximum runtime for each LS of 2 hrs. Each experiment is executed 3 times and we
present the average values.
     As explained in Section 1, the second goal of this work is to evaluate edge-counting
semantic similarities in LD in terms of accuracy. Consequently, for Q3 , we use the
hECATE extension with the best runtime performance based on the results of Q1
and executed a set of experiments using 2 machine learning (ML) algorithms: W OM -
BAT [23] and D RAGON [19]. We choose these two approaches because (1) they achieve
state-of-the-art performance while being deterministic, (2) they are open-source, mean-
ing our experiments can be easily reproduced and (3) they are able to generate complex
link specifications with any arbitrary number of measures. We perform a 10-fold cross
validation by allowing W OMBAT and D RAGON to use only string similarities (StrSim),
only semantic similarities (SmtSim) and a combination of both (StrSmtSim) as input.
We use the levenshtein, cosine and qgrams similarity measures for strings
implemented in L IMES [14]. For each dataset, we use all properties apart from those
that corresponded to numeric values. W OMBAT is configured as presented in [23] and
D RAGON is configured as presented in [19]. We use two termination criteria for W OM -
BAT : Either a LS with F-measure of 1 is found or a maximal depth of refinement of 10
is reached. For the string similarities, W OMBAT produced LSs with a minimum θ value
of 0.4 and for the semantic similarities, the minimum θ value is set to 0.7. D RAGON
terminates either when no new nodes are found or when the height of the decision
tree reached the maximum of 3. Additionally, we compare the achieved F1 scores with
scores for E AGLE [15], E UCLID [13], J48 [5] reported by [19], a Multilayer Perceptron
classifier reported by [24] and the Pessimistic as well as Re-weighted versions of the
work presented at [6].
    As expected, Figure 1 shows that hECATE-B has the highest runtimes compared
to hECATE-I and hECATE-IF in all datasets, except DM-DB. This supports the claim
that semantic similarities typically scale poorly. The results show that both extensions
improve the runtime of all semantic similarities, making them more amenable for LD
and scalable for larger datasets. Precisely, LCH’s, WP’s and SP’s runtimes improve by
71% and 57% on average when hECATE-I and hECATE-IF strategies are used resp.
L I has the least improvement by 65% and 50%. Comparing the two extensions, in all
datasets and for all semantic similarities, hECATE-I outperforms hECATE-IF by 30%
on average. A detailed analysis of the runtimes shows that even though hECATE-IF
reduces the number of comparisons between semantically different concepts and thus
the comparison time, the additional runtime cost of filtering creates an overhead that re-
sults in a worse total execution time than hECATE-I (Table 1). Regarding the DM-DB
dataset, the only property for both source and target datasets, name, consists of only one
value, which corresponds to the official name of a drug. That value can only be asso-
ciated with one concept. As a result, introducing an indexing and/or filtering technique
                                                                                     Applying edge-counting semantic similarities to Link Discovery                                                                                                                                                                  9

                                     3E03      hECATE-B       hECATE-I        hECATE-IF
                                                                                                                                                4E03                            hECATE-B     hECATE-I     hECATE-IF
                                                                                                                                                                                                                                                                  2E01
                                                                                                                                                                                                                                                                                 hECATE-B   hECATE-I   hECATE-IF


Overall Runtime for S=1081, T=1092


                                                                                                                                                                                                                              Overall Runtime for S=1046, T=936
                                     2E03                                                                                                       3E03


                                                                                                          Overall Runtime for S=512, T=512
                                                                                                                                                                                                                                                                  2E01
                                     2E03                                                                                                       3E03                                                                                                              1E01
                                     2E03                                                                                                       2E03
                                                                                                                                                                                                                                                                  1E01
                                     2E03                                                                                                       2E03
                                                                                                                                                                                                                                                                  1E01
                                     1E03                                                                                                       2E03
                                                                                                                                                                                                                                                                  8E00
                                     1E03                                                                                                       1E03
                                                                                                                                                                                                                                                                  6E00
                                     8E02                                                                                                       1E03

                                     5E02                                                                                                       7E02                                                                                                              4E00

                                     3E02                                                                                                       4E02                                                                                                              2E00

                                     0E00                                                                                                       0E00                                                                                                              0E00
                                             LCH              Li          Wu-Palmer       Shortest Path                                                        LCH              Li         Wu-Palmer    Shortest Path                                                     LCH        Li       Wu-Palmer   Shortest Path


                                                   (a) Abt-Buy                                                                                                  (b) Amazon-GP                                                                                                   (c) DM-DB
                                     3E03                                                                                                               2E02
                                                   hECATE-B        hECATE-I     hECATE-IF                                                                                  hECATE-B    hECATE-I     hECATE-IF
Overall Runtime for S=2616, T=2294


                                                                                                                   Overall Runtime for S=1017, T=1042
                                     2E03                                                                                                               2E02

                                     2E03                                                                                                               2E02

                                     2E03                                                                                                               2E02

                                     2E03                                                                                                               1E02

                                     1E03                                                                                                               1E02

                                     1E03                                                                                                               1E02

                                     8E02                                                                                                               7E01

                                     5E02                                                                                                               5E01

                                     3E02                                                                                                               2E01

                                     0E00                                                                                                               0E00
                                             LCH              Li          Wu-Palmer       Shortest Path                                                        LCH              Li          Wu-Palmer   Shortest Path


                                              (d) DBLP-ACM                                                                                                             (e) Movies

Fig. 1: Average runtime in seconds of hECATE-B, hECATE-I and hECATE-IF on all
datasets. For hECATE-IF, the standard deviation among different θ values is added.


produces an unnecessary overhead. Overall, Q1 can be answered with hECATE-I being
the most efficient approach.
     To answer Q2 we compare the runtimes of the single semantic similarities revealing
that L I has the worst runtime (see Figure 1). For the Movies dataset, we notice that
hECATE-I requires 100K more token comparisons for L I compared to the other sim-
ilarities (Table 1). The better runtime of the other similarities is caused by a condition
inside our algorithm which stops as soon as two tokens/concepts have a similarity of 1.
In contrast to the other similarities, L I(c1 , c2 ) ∈ (0, 1), i.e., it can never be 1. However,
based on Table 1, L I’s runtime shows a great improvement as the values of θ increase
in relation to the other metrics. This justifies the fact that the runtimes for L I have the
highest standard deviation, whereas SP, LCH and WP are less influenced by the differ-


Table 1: Number of concept comparisons performed by hECATE-I and hECATE-IF
for the Movies dataset. The numbers for hECATE-B are the same as for hECATE-I.
            Threshold                                                                        0.1                                 0.2                                 0.3              0.4           0.5                 0.6                                              0.7      0.8           0.9            1
                                                 hECATE-I 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M
                                             SP
                                                hECATE-IF 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.0M 51.4M 10.3M
                                                       hECATE-I 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M
                                            WP
                                                      hECATE-IF 61.8M 61.8M 61.8M 61.7M 61.5M 60.3M 56.7M 44.7M 27.8M 10.3M
                                                       hECATE-I 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M 61.8M
                                            LCH
                                                      hECATE-IF 61.8M 61.8M 61.8M 61.4M 60.4M 56.5M 42.4M 42.4M 28.5M 28.5M
                                                       hECATE-I 61.9M 61.9M 61.9M 61.9M 61.9M 61.9M 61.9M 61.9M 61.9M 61.9M
                                             LI
                                                      hECATE-IF 61.9M 61.4M 59.9M 56.6M 51.5M 42.1M 28.3M 28.2M 10.0M 00.0M
10       K. Georgala et al.


Table 2: Average F-measure achieved by W OMBAT, D RAGON, E UCLID, E AGLE, J48
and Multilayer Perception within a 10-fold cross validation. The semantic similarities
use the hECATE-I strategy.
Algorithm            W OMBAT                 D RAGON         E UCLID E AGLE J48 Perceptron
Similarities StrSim SmtSim StrSmtSim StrSim SmtSim StrSmtSim StrSim StrSim StrSim StrSim
Abt-Buy      0.65     0.65          0.66      0.51     0.02          0.10          0.00    0.56   0.43      0.43
Amazon-GP    0.71     0.60          0.77      0.64     0.06          0.05          0.71    0.73   0.41      0.36
DBLP-ACM     0.97     0.74          0.97      0.93     0.81          0.93          0.98    0.98   0.77      0.97
DM-DB        0.94     0.71          0.97      0.89     0.65          0.89          1.00    1.00   0.94        -
Movies       1.00     0.73          1.00      0.93     0.80          0.93          0.98    0.99   0.84        -


Table 3: Maximum F-measure achieved by W OMBAT, D RAGON, Pessimistic and Re-
weighted using 2% of the data for training over 7 iterations [6]. The semantic similari-
ties use the hECATE-I strategy.
     Algorithm            W OMBAT                 D RAGON        Pessimistic Re-weighted
     Similarities StrSim SmtSim StrSmtSim StrSim SmtSim StrSmtSim StrSim       StrSim
     Abt-Buy        0.35     0.39          0.34      0.24     0.10          0.24          0.36       0.37
     Amazon-GP      0.53     0.33          0.43      0.45     0.13          0.35          0.39       0.43
     DBLP-ACM       0.91     0.55          0.91      0.90     0.66          0.90          0.93       0.95
     DM-DB          0.94     0.71          0.97      0.94     0.71          0.96            -          -
     Movies         0.97     0.33          0.97      0.96     0.33          0.96            -          -


ent values of θ. The answer for Q2 is that for all hECATE strategies, SP is the fastest
similarity, whereas L I is the slowest.
    To answer Q3 , we add the 4 edge-counting measures L I, WP, SP, and LCH to the
state-of-the-art algorithms W OMBAT [23] and D RAGON [19]. We evaluate their per-
formance with and without string similarities using a ten-fold cross validation. Table 2
shows the results of our experiments with these machine-learning algorithms. In the 6
right most columns of Table 2, we report the F1 score of the string-based LD algorithms.
While the performance of D RAGON remained the same or even worsened for 3 of the
5 datasets, adding semantic similarities to the W OMBAT algorithm improved its overall
performance for 3 datasets by up to 6% F-measure absolute. As expected, this effect is
most pronounced in datasets which rely on long textual descriptions such as Amazon-
GP. A look into the specifications learned by W OMBAT suggests that this effect is due
to the approach combining semantic and string similarities using operators such as t
and learning the correct threshold for each of these measures. The improvement on the
DM-DB datasets is achieved using the \ operator, not allowing semantically similar
concepts to be matched together. This refutes current results (see [11] where the same
similarities have been used) and suggests that the refinement operators can combine
semantic and string similarities in a way that improves the F-measure. For enabling a
comparison with [6], we used the same configuration setting and report the maximum
F-measure in Table 3. It can be seen that WOMBAT outperforms the Pessimistic and
Re-weighted methods on the majority of the datasets.
                       Applying edge-counting semantic similarities to Link Discovery            11

5   Related Work

We give a brief overview of linking approaches which use semantic similarities. An
exhaustive list of frameworks can be found in [12]. Over the past few years, semantic
similarities were used in ontology matching (OM) [21]. In this context, concepts in two
ontologies O1 and O2 are often matched based on a third ontology, e.g., WordNet. This
ontology can be viewed as a background knowledge source or a mediating ontology [2].
Frameworks such as AgreementMaker [3], Zhishi.links [18] and RuleMiner [17] utilize
semantic similarities in this way to improve structural matching on the ontology level.
While these enhancements have a positive effect on their instance level matching, to the
best of our knowledge no instance linking tool has used semantic similarities directly
and shown an improvement of the overall linking results. [11] compare the effect of a
predefined set of combinations of string and semantic similarities for label comparison
and suggest that semantic similarities do not improve the F-measure of the instance
matching task. Our results suggest the contrary by showing that dataset-specific combi-
nations of measures actually can achieve a better performance.


6   Conclusions and Future Work

To study the effect of semantic similarities on LD, we presented hECATE, a generic
framework for improving the runtime of edge-counting semantic similarities. Our eval-
uation of the framework shows that there is still a lot of potential in improving the
runtime of semantic similarities for LD. We used hECATE to evaluate the performance
of string similarities in LD on five datasets. Our evaluation shows that combining se-
mantic similarities with string similarities can indeed increase the F-measure achieved
by LD algorithms. This result is of central importance as it goes against current assump-
tions. The reason why we are indeed able to use semantic similarities for improving the
F-measure of LD in some cases lies in the refinement operator employed by W OMBAT.
In future works, we will investigate means that will allow improving the runtimes of
semantic similarities, extend our works beyond edge-counting similarities and aim to
classify datasets w.r.t. how suitable they are for semantic similarities.


References

 1. Bizer, C., Volz, J., Kobilarov, G., Gaedke, M.: Silk - A Link Discovery Framework for the
    Web of Data. In: 18th International World Wide Web Conference (April 2009)
 2. Cross, V., Silwal, P., Morell, D.: Using a reference ontology with semantic similarity in
    ontology alignment. In: Proceedings of the 3rd ICBO (2012)
 3. Cruz, I.F., Antonelli, F.P., Stroe, C.: Agreementmaker: Efficient matching for large real-world
    schemas and ontologies. PVLDB 2, 1586–1589 (2009)
 4. Euzenat, J., Ferrara, A., Meilicke, C., Nikolov, A., Pane, J., Scharffe, F., Shvaiko, P., Stuck-
    enschmidt, H., Šváb-Zazamal, O., Svátek, V., et al.: Results of the ontology alignment eval-
    uation initiative 2010. Tech. rep., University of Trento (2011)
 5. Holmes, G., Donkin, A., Witten, I.H.: Weka: a machine learning workbench. In: Proceedings
    of ANZIIS ’94. pp. 357–361 (Nov 1994)
12       K. Georgala et al.

 6. Kejriwal, M., Miranker, D.P.: Semi-supervised instance matching using boosted classifiers.
    In: The Semantic Web. Latest Advances and New Domains. pp. 388–402. Springer Interna-
    tional Publishing (2015)
 7. Köpcke, H., Thor, A., Rahm, E.: Evaluation of Entity Resolution Approaches on Real-world
    Match Problems. Proc. VLDB Endow. 3(1-2), 484–493 (Sep 2010)
 8. Leacock, C., Chodorow, M.: Combining local context and wordnet similarity for word sense
    identification (01 1998)
 9. Li, Y., Bandar, Z.A., McLean, D.: An approach for measuring semantic similarity between
    words using multiple information sources. IEEE Trans. on Knowl. and Data Eng. 15(4),
    871–882 (Jul 2003)
10. Malyshev, S., Krötzsch, M., González, L., Gonsior, J., Bielefeldt, A.: Getting the most out
    of wikidata: Semantic technology usage in wikipedia’s knowledge graph. In: International
    Semantic Web Conference. pp. 376–394 (2018)
11. McCrae, J.P., Buitelaar, P.: Linking Datasets Using Semantic Textual Similarity. CYBER-
    NETICS AND INFORMATION TECHNOLOGIES 18(1), 109–123 (2018)
12. Nentwig, M., Hartung, M., Ngonga Ngomo, A.C., Rahm, E.: A survey of current link dis-
    covery frameworks. Semantic Web pp. 1–18 (2015)
13. Ngomo, A.C.N., Lyko, K.: Unsupervised learning of link specifications: deterministic vs.
    non-deterministic. In: OM (2013)
14. Ngonga Ngomo, A.C.: On Link Discovery using a Hybrid Approach. Journal on Data Se-
    mantics 1(4), 203–217 (2012)
15. Ngonga Ngomo, A.C., Lyko, K.: Eagle: Efficient active learning of link specifications us-
    ing genetic programming. In: The Semantic Web: Research and Applications. pp. 149–163.
    Springer Berlin Heidelberg, Berlin, Heidelberg (2012)
16. Ngonga Ngomo, A.C., Lyko, K.: Unsupervised learning of link specifications: deterministic
    vs. non-deterministic. In: Proceedings of the Ontology Matching Workshop (2013)
17. Niu, X., Rong, S., Wang, H., Yu, Y.: An effective rule miner for instance matching in a
    web of data. In: Proceedings of the 21st ACM International Conference on Information and
    Knowledge Management. pp. 1085–1094. CIKM ’12, ACM, New York, NY, USA (2012)
18. Niu, X., Rong, S., Zhang, Y., Wang, H.: Zhishi.links results for OAEI 2011. Ontology Match-
    ing p. 220 (2011)
19. Obraczka, D., Ngomo, A.C.N.: Dragon: Decision tree learning for link discovery. In: 19TH
    International Conference On Web Engineering. Springer (2019)
20. Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application of a metric on
    semantic nets. IEEE Trans. Systems, Man, and Cybernetics 19, 17–30 (1989)
21. Rodrı́guez, M.A., Egenhofer, M.J.: Determining semantic similarity among entity classes
    from different ontologies. IEEE Trans. on Knowl. and Data Eng. 15(2), 442–456 (2003)
22. Saleem, M., Ali, M.I., Verborgh, R., Ngonga Ngomo, A.C.: Federated query processing over
    linked data. In: Tutorial at ISWC (2015)
23. Sherif, M., Ngonga Ngomo, A.C., Lehmann, J.: WOMBAT - A Generalization Approach for
    Automatic Link Discovery. In: 14th Extended Semantic Web Conference, Portorož, Slove-
    nia, 28th May - 1st June 2017 (2017)
24. Soru, T., Ngomo, A.C.N.: A comparison of supervised learning classifiers for link discovery.
    In: Proceedings of the 10th Intern. Conf. on Semantic Systems. pp. 41–44. ACM (2014)
25. Usbeck, R., Ngonga Ngomo, A.C., Haarmann, B., Krithara, A., Röder, M., Napolitano, G.:
    7th open challenge on question answering over linked data (QALD-7). In: Semantic Web
    Evaluation Challenge. pp. 59–69. Springer International Publishing (2017)
26. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the 32Nd
    Annual Meeting on Association for Computational Linguistics. pp. 133–138. ACL ’94, As-
    sociation for Computational Linguistics, Stroudsburg, PA, USA (1994)