Improving Link Specifications using Context-Aware
                          Information

              Andrea Cimmino                         Carlos R. Rivero                       David Ruiz
          University of Seville, Spain              Rochester Institute of          University of Seville, Spain
              cimmino@us.es                          Technology, USA                       druiz@us.es
                                                       crr@cs.rit.edu


ABSTRACT                                                           types of links between datasets, but the most common one is
There is an increasing interest in publishing data using the       owl:sameAs [7]. To generate the links, a link discovery task
Linked Open Data philosophy. To link the RDF datasets,             must be performed, which aims to find all pair of instances
a link discovery task is performed to generate owl:sameAs          that are describing the same concept [13, 15].
links. There are two ways to perform this task: by means              Link discovery can be performed in two different ways:
of a classifier or a link specification; we focus in the lat-      by means of a classifier [22], which links instances with
ter approach. Current link specification techniques only use       owl:sameAs if it considers them as the same; or generat-
the data properties of the instances that they are linking,        ing a link specification, which is a set of restrictions over
and they do not take the context information into account.         the data properties of two instances [17, 11]. Each pair of
In this paper, we present a proposal that aims to gener-           data properties is associated to a similarity function and a
ate context-aware link specifications to improve the regular       threshold, if the similarity function returns a value higher
link specifications, increasing the effectiveness of the results   than the threshold, then the restriction is satisfied and the
in several real-world scenarios where the context is crucial.      two instances are linked using owl:sameAs. For example, a
Our context-aware link specifications are independent from         link specification defines that two instances of Person are the
similarity functions, transformations or aggregations. We          same if both have a data property describing their names,
have evaluated our proposal using two real-world scenarios         and their literals are exactly the same.
in which we improve precision and recall with respect to              Unfortunately, in some scenarios, this definition is not
regular link specifications in 23% and 58%, respectively.          suitable to generate owl:sameAs links. Under some circum-
                                                                   stances, taking only the literals of the data properties of
                                                                   two instances into account may lead to mix up instances
Keywords                                                           that are very similar but actually different, e.g. if two peo-
Linked Data, Link Discovery, Link Specification, Contex-           ple are different but have the same name. In these cases,
Aware Link Specification                                           a link specification should include conditions over the con-
                                                                   text information, data properties of other instances that are
1.   INTRODUCTION                                                  related with the main two to have more information and
   In recent years, we have witnessed an increasing interest in    improve the effectiveness.
the Linked Open Data [9]. As a matter of fact, the number             In this paper, we focus on improving current link spec-
of datasets in 2011 were 452 and in 2014 that number raised        ifications making them context aware. We aim to extend
to 2,289 [6]. Publicly-available datasets have to fulfill the      the definition used by actual techniques [17, 11], and add
Linked Data principles, which manly consist in using IRIs          restrictions over the instances in the context of the pair that
as names of things, using HTTP IRIs so that people can             we are linking. To achieve this, we introduce the concept
look up those names, when someone looks up a IRI provide           of overlap factor. If we define different link specifications
useful information using the standards (RDF or SPARQL),            over the instances in both contexts, we can handle them as
and finally, include links to other IRIs so that they can dis-     sets potentially overlapped by means of an equity criteria
cover more things [2]. Since the number of datasets has            defined by the link specifications. The overlap factor is a
increased in the recent years and these principles establishes     function that measures the overlapping between contexts.
that the datasets must be linked with others and published         Thanks to this, we are able to define restrictions over this
in RDF formats [1], a huge effort has been done to link            value. In this paper, we restrict ourselves to two different
these RDF datasets automatically [14]. There are different         types of overlap factors, namely: exists or for all. The for-
                                                                   mer means that there is a pair of instances in the context
                                                                   considered the same by means of a link specification. The
                                                                   latter, means that all the instances in both contexts are the
                                                                   same.
                                                                      Figure 1 depicts a sample scenario where the context is
                                                                   crucial to obtain a good precision. Figure 1(a) shows a part
                                                                   of the data model of DBLP and the National Science Foun-
                                                                   dation (NSF), both were built using authors that have pub-
Copyright is held by the author/owner(s).                          lished in the International Conference of Very Large Data
WWW2016 Workshop: Linked Data on the Web (LDOW2016).
                                                                      CALSAR: forall LSAR and exists LSAP


                                                  dblp:Author                                                  nsf:Researcher
                                                 dblp:name                     LSAR: Jaro > 0.98              nsf:name                 nsf:leads
                                                                                                              nsf:email
                                                                                                              nsf:position                       nsf:Award
                                                   dblp:writes                                                                                nsf:name
                                                                                                                                              nsf:sponsor
                                                   dblp:Article                                                   nsf:Paper                   nsf:startDate
                                                                                                                                              nsf:amountToDate
                                            dblp:title                     LSAP: Levenshtein > 0.90           nsf:title
                                            dblp:date                                                         nsf:date                 nsf:supports
                                            dblp:numberOfPages                                                nsf:pages
                                            dblp:recordedAt                                                   nsf:conference

                                                             (a) Data model and context-aware link specification

                                                                                                                      nsf:leads
                                   dblpU:Wang0011:Wei                                              nsfU:WeiWang0007
                                                                                                                                   nsfU:AN-0423336
         dblp:writes             dblp:name : "Wei Wang"                (LSAR) sameAs                                                                          nsf:supports
                                                                                               nsf:name : "Wei Wang"
                                 rdf:type : dblp:Author                                                                           rdf:type: nsf:Award
                                                                                               rdf:type : nsf:Researcher
                                                                                                                                                                   nsfU:AN-0423336/#13

                                                                  (CALSAR) (LSAR) sameAs                                                                  nsf:title : "Efficient Computation ..."
                         dblpU:conf/vldb/JiangWL03                                                 nsfU:WeiWang0012                                       rdf:type : nsf:Paper
                                                                                               nsf:name : "Wei Wang"                 nsf:leads
                        dblp:title : "Holistic Twig ..."
                        rdf:type : dblp:Article                                                rdf:type : nsf:Researcher
                                                                                                                                   nsfU:AN-1043034
                                                                                                                                                             nsf:supports
                                                                                                                                  rdf:type: nsf:Award
                  dblpU:conf/vldb/YuanLLWYZ05                                                      nsfU:YangWang0023                                                nsfU:AN-1043034/#2
                                                                                                                                  nsf:leads               nsf:title : "Necessary and sufficient ..."
              dblp:title : "Efficient Computation ..."                                         nsf:name : "Yang Wang"
              rdf:type : dblp:Article                                                          rdf:type : nsf:Researcher                                  rdf:type : nsf:Paper

                                                           (b) Sample instances and links generated to relate them

                        Figure 1: Scenario DBLP-NSF: Improving precision using context-aware link specification


Bases (PVLDB) of 2013. DBLP contains these authors and                                                 author in DBLP, dblpU:Wang0011:Wei. We see that there
their articles, the NSF researchers from its portal, with the                                          are two researchers whose name is “Wei Wang” that lead
same name of these authors, that leads awards which sup-                                               two different NSF grants (nsf:AN-1043034, nsf:AN-0423336).
ports papers. We wish to identify authors in DBLP that                                                 Using LSAR alone, we link the three instances (dblpU:Wang-
have been awarded with NSF grants using their names and                                                0011:Wei and nsfU:WeiWang0012, and dblpU:Wang0011:Wei
publications. We include two link specifications: LSAR ,                                               and nsfU:WeiWang0007). However, using LSAP we are able
which links the dblp:Author and nsf:Researcher instances if                                            to identify one paper in DBLP written by “Wei Wang” that
their literals of dblp:name and nsf:name obtain a score over                                           also appears in NSF. If we use both links at the same time
0.98 using Jaro; LSAP , which links dblp:Article and nsf:Paper                                         as part of the context information of the authors, we discard
instances if their literals dblp:title and nsf:title obtain a score                                    one of the previous links (dblpU:Wang0011:Wei and nsfU:We-
over 0.90 using Levenshtein. We rely on these link specifica-                                          iWang0012), which is not correct since it is actually linking
tions and add overlap factors to them, each of which states                                            another researcher in NSF whose name is “Wei Wang”. As
how many instances should be linked. The overlap factors                                               a result, we improve precision using context information.
for LSAR and LSAP are for all and exists, respectively. The                                               Figure 2 depicts another sample scenario, in which DBLP
resulting context-aware link specification is interpreted as                                           acts both as source and target, where the context is cru-
follows: a pair of dblp:Author and nsf:Researcher instances                                            cial to obtain a good recall. Figure 2(a) shows a part of
are the same by means of the context-aware link specifica-                                             the data model, this scenario has several authors and their
tion if all the instances covered by LSAR , pairs of dblp:Author                                       aliases (different names that refer to the same person); both
and nsf:Researcher instances, are the same, and if at least one                                        datasets contains the same authors and their articles but
pair covered by LSAP , dblp:Article and nsf:Paper instances,                                           they have different aliases in each dataset. We include a
are the same.                                                                                          link specification, LSAA , that links dblp:Article instances if
   Actual link specification techniques can generate LSAR                                              the literals of their dblp:title obtain a score over 0.99 using
or LSAP (notice that LSAP would not link instances of db-                                              Jaccard. We rely on LSAA and we add to it a for all as over-
lp:Author and nsf:Researcher), but they are not able to use                                            lap factor. The resulting context-aware link specification is
LSAP to link instances of dblp:Author and nsf:Researcher.                                              interpreted as follows: two instances of dblp:Author are the
Additionally, they are not able to generate and apply overlap                                          same by means of the contex-aware link specification if all
factors over them.                                                                                     their dblp:Article instances are the same by means of LSAA .
   Figure 1(b) depicts a set of sample instances, the link                                                Figure 2(b) depicts a set of sample instances, the link
specifications LSAR and LSAP and the context-aware link                                                specification LSAA and the context-aware link specification
specification CALSAR . We focus on “Wei Wang”, who is an                                               CALSAA . We focus on link ”Hosagrahar V. Jagadish” and
                            CALSAA: forall LSAA


                                                                                      dblpU:Jagadish_0002:Hosagrahar.V.                                    dblpU:Jagadish_0001:H.V.
         dblp:Author                                   dblp:Author
                                                                                  dblp:name : "Hosagrahar V. Jagadish"             (CALSAA) sameAs       dblp:name : "H. V. Jagadish"
        dblp:name                                     dblp:name
                                                                                  rdf:type : dblp:Author                                                 rdf:type : dblp:Author

         dblp:writes                                   dblp:writes                                  dblpU:journals/pvldb/LabrinidisJ12        dblpU:journals/pvldb/LabrinidisJ12   dblp:writes
                                                                        dblp:writes
                                                                                                   dblp:title : "Challenges and ..."         dblp:title : "Challenges and ..."
         dblp:Article                                  dblp:Article                                rdf:type : dblp:Article                   rdf:type : dblp:Article

     dblp:title             LSAA: Jaccard > 0.99   dblp:title                                          dblpU:conf/sigmod/QianCJ12              dblpU:conf/sigmod/QianCJ12
     dblp:date                                     dblp:date
     dblp:numberOfPages                            dblp:numberOfPages                                 dblp:title : "Sample-driven ..."         dblp:title : "Sample-driven .."
     dblp:recordedAt                               dblp:recordedAt                                    rdf:type : dblp:Article                  rdf:type : dblp:Article

     (a) Data model and context-aware link specifica-                                 (b) Sample instances and links generated to relate them
     tion

                          Figure 2: Scenario DBLP-DBLP: Improving recall using context-aware link specification


”H. V. Jagadish” (dblpU:Jagadish 0002:Hosagrahar.V. and d-                                  to exploit the information contained in instances related to
blpU:Jagadish 0001:H.V.), which are different names of the                                  the pair that is been analyzed by it. It takes two data mod-
same person. A regular link specification would compare                                     els as input and generates a probabilistic model. In the first
these literals, instead we rely on their publications. Using                                place, the technique computes the probabilities of equiva-
LSAR as part of the context information of the authors, we                                  lences of instances, then, the probabilities for relationships
are able to link all their dblp:Article instances and, hence,                               with other instances and, finally, it creates the equivalences
link the instances dblpU:Jagadish 0002:Hosagrahar.V. and d-                                 between the classes. Hassanzadeh et al. [8] proposed a semi-
blpU:Jagadish 0001:H.V. through their publications. We im-                                  supervised technique that receives two datasets as input.
prove the recall because we do not rely on the names of au-                                 This technique works as follows: having all the data proper-
thors, which differs from both datasets, instead we use their                               ties of all the classes in each dataset, the technique iterates
publications, which titles do not differ from both datasets.                                over one set and searches in the other set, according to a
As result, we improve the recall using the context informa-                                 string distance, the most similar data properties. The tech-
tion.                                                                                       nique returns the ranked set of pairs according to the string
   We performed several experiments using the datasets in-                                  distance.
troduced in the Figure 1 and 2, in which we improved 23%                                       Regarding the link specification approaches, Isele and Bi-
in precision and 58% in recall, respectively.                                               zer [11] proposed GenLink, which is a supervised genetic
   The rest of the article is organized as follows: in Section                              programming technique that generates link specifications as
2, we report on several related proposals and their features;                               trees. It starts with a population made of random link spec-
Section 3 introduces our conceptual framework; Section 4                                    ifications and some recurrent link specification predefined
presents our proposal to generate context-aware link specifi-                               by the authors. Then, using genetic operations (reproduc-
cations; Section 5 reports the results obtained in our experi-                              tion, crossover and mutation), the population is evolved and
ments; and, finally, Section 6 recaps on our main conclusions                               its quality evaluated by means of a fitness function, which
and future work.                                                                            uses training data provided by the user. The technique stops
                                                                                            when a configured maximum number of iterations is reached,
2.     RELATED WORK                                                                         or a link specification obtains a value in the fitness function
                                                                                            over a threshold given by the user. Based on GenLink, the
   Over the last years, several approaches have been devel-                                 same authors proposed ActiveGenLink [12], which aims to
oped to address the link discovery task. There are two ways                                 reduce the number of labelled examples using active learn-
to face link discovery. The first approach is building different                            ing. ActiveGenLink selects link candidates to be labelled
kind of classifiers to establish if two instances are the same                              by the user from a pool of unlabelled instances through a
[22]. The second approach is through the discovery of accu-                                 query strategy. Then, once the user labels a given example,
rate link specifications, which specify conditions that must                                it adds the example to the training data and evolves the pop-
hold true for two entities to be interlinked [11, 12, 17, 18,                               ulation using GenLink. Another semi-supervised technique
19, 21]. The main difference between these two approaches                                   is EAGLE [17], which is based in a genetic programming
is that the former does not specify why two instances are the                               technique with active learning, and it aims to generate link
same, i.e., it works like a black box that receives as input two                            specifications as trees. It starts detecting similar classes and
instances and outputs whether or not they are the same; the                                 data properties using RAVEN [16]. Then, EAGLE evolves
latter generates a specification of why two instances should                                an initial random population of link specifications according
be the same, describing which data properties have to fulfil                                to genetic operators. After that, the technique computes the
the conditions.                                                                             most informative links and asks the user to label them. This
   We focus only in the link specification approach, although,                              process is repeated until the stop condition is fulfilled, i.e., a
we have also analyzed some classifiers since they exploit the                               maximal number of iterations is reached, or the fitness value
context information [8, 10, 23]. Holub et al. [10] proposed                                 of a link specification is over a given threshold. An unsuper-
a technique that works with a fixed formula that takes the                                  vised learning technique was proposed by Nikolov et al. [19],
instances related directly to the pair that is been linked into                             which starts with a random population and keeps iterating
account. PARIS [23] is an unsupervised technique developed
over it, applying genetic operators, until a maximal num-         3.1      Foundations
ber of iterations is reached, or the fitness of the population      We are focusing on RDF datasets, which are triple stores
does not improve for several iterations. Since this technique     that contain literals and IRIs. Our proposal focuses on the
does not work with labelled data, the fitness function uses       analysis of different instances, each of which entails several
two criteria defined by the authors to evaluate link specifica-   concepts as follows:
tions, namely: pseudo-F-measure and neighborhood growth.
When a stop condition is reached, it returns the link speci-         • IRI: it uniquely identifies a web location. Note that
fication with the highest fitness value from the population.           we use expressions like dblp:name to refer to IRIs, in
EUCLID [18] is an unsupervised technique that, using differ-           which dblp: is a prefix. Table 2 summarizes all of our
ent similarity functions, evaluates the data properties of the         prefixes. For example, some sample IRIs in Figure 1
instances and generates a space of similarity values. Then,            are dblpU:Wang0011:Wei and nsfU:YangWang0023.
depending on different heuristics, it iterates over that space
updating the scores and pruning them until a solution is             • BlankNode: are placeholders for IRIs whose actual value
found or a stop condition is reached. The unsupervised tech-           is unknown. They have only local scope and are purely
nique proposed by Song and Heflin [21] focuses on metrics              an artifact of the serialization. Blank nodes are disjoint
to improve the candidate selection to be as more scalable              from IRIs and Literals.
as possible. Candidate selection is a process to pick pairs
of instances, each of which has a high probability to be the         • Instance: an instance is an IRI or a BlankNode that we
same. The process is performed by selecting and compar-                are interested in linking.
ing only part of the data properties of each instance in the
pair. It then extracts a set of data properties very useful in
disambiguation, which identify why the pair of instances are       Pref.     IRI
                                                                   rdf:      http://www.w3.org/1999/02/22-rdf-syntax-ns#type
the same.
                                                                   owl:      http://www.w3.org/2002/07/owl#
   As far as we know, none of the previous link specification      dblp:     http://example.org/voc/dblp#
techniques is able to exploit context information. Holub et        nsf:      http://example.org/voc//nsf#
al. [10] proposed a technique that takes into account only         db-       http://example.org/urls/dblp#
the instances of the context one-hop related to the pair that      lpU:
is been linked. PARIS [23] takes as input all the datasets and     nsfU:     http://example.org/urls/nsf#
generates a probabilistic model to classify input instances.
The technique by Hassanzadeh et al. [8] returns a ranking of                 Table 2: IRI prefixes used in the paper
most similar data properties using several string distances.
None of the previous techniques is able to apply transforma-
tions on data properties and only [8] is able to use different       • Class: we can assign classes to Instances, each of which
string similarity measures, however it only returns a ranked           is an IRI that represents a real-world concept. When
list of data property and not why two instances should be              we assign a Class to an Instance, we are explicitly saying
linked. Additionally, [10] only takes one-hop connected in-            that the Instance belongs to the type of the Class. We
stances into account, although, many real-world scenarios              use rdf:type to represent this assignment. For exam-
require to take more than one-hop related instances into ac-           ple, in Figure 1, Instances related to Classdblp:Author
count [20].                                                            represent authors in DBLP.
   Table 1 summarizes all the techniques and their different
features. Those that generate link specifications are classi-        • DataProperty: Instances may comprise attributes that
fied as LS; if they take into account the context, been LS or          describe features of the Instances, which are plain lit-
not, then we classify them as context-aware (C-A). Finally, if         erals. To represent these features, we use data prop-
the technique is independent of any specific function (aggre-          erties, which are IRIs that identify these literals. In
gations, transformations, string distance measures or string           Figure 1, the names of the dblp:Author Instances are
similarities), we classify it as function independent (FI).            identified using dblp:name.

 Technique                                  LS     C-A    FI         • Literal: it denotes a value that a data property takes.
 [10] Holub et al. (2015)                   No     Yes    No           For example, in Figure 1 “Yang Wang” for nsfU:Yang-
 [23] Suchanek et al. (2011)                No     Yes    No           Wang0023 or “Wei Wang” for nsfU:WeiWang0012 are
 [8] Hassanzadeh et al. (2013)              No     Yes    No           sample literals for the same data property. Depending
 [5, 12] Isele and Bizer (2011, 2012)       Yes    No     Yes          on the Instance, data properties have different literals.
 [17, 18] Ngonga and Lyko (2012,2013)       Yes    No     Yes
 [19] Nikolov et al. (2012)                 Yes    No     Yes        • ObjectProperty: Instances can be related to other In-
 [21] Song and Heflin (2011)                Yes    No     No
                                                                       stances by means of object properties, which are IRIs;
                                                                       a set of Instances related conform a graph. Note that,
        Table 1: Comparison of current techniques                      in RDF, object properties are first-class citizens and
                                                                       they are not subordinated to Instances. Figure 1 shows
                                                                       a sample object property that relates dblp:Author and
3.   PRELIMINARIES                                                     dblp:Article using dblp:writes. Notice that we can add
  In the following, we present the formalization of several            multiple relations connecting the same Instance to mul-
concepts that we use to describe our proposal. We define its           tiple Instances; for example, in Figure 1, the object
foundations, what a link specification is and what we mean             property dblp:writes may relate one dblp:Author with
by context-aware link specification.                                   several dblp:Article Instances.
                                                   LinkSpecification                                          CALinkSpecification
                                               source: Set<Class>                                          source: Set<Class>
                                               target: Set<Class>                                          target: Set<Class>

                                                      Condition        *                                                            *
                                                                                                                 CACondition


                                      SameAsCondition        ConditionComposite                   CASameAsCondition     ConditionComposite
                                   f: Function               f: Aggregation                        oF: OverlapFactor    f: Aggregation
                                   threshold: Double
                                               2
                                           Operand       *                                                                2
                                                                                      LinkSpecification           ObjectLeafNode
                                                                                  source: Set<Class>         prop: ObjectProperty       *
                          DataLeafNode          OperandComposite                  target: Set<Class>         dataset: {SRC, TRG}
                     prop: DataProperty          f: Transformation
                     dataset: {SRC, TRG}

                                    (a) Link Specification                          (b) Context-Aware Link Specification

                                                              Figure 3: Models for link discovery


3.2    Link Specification                                                            Figure 3(b) specifies the structure of a context-aware link
   When performing link discovery, we have a source and                           specification using an UML-like notation. Each ObjectLeaf-
a target datasets that we wish to relate using owl:sameAs                         Node represents the object properties that connect the sets
links. To link the Instances of each dataset we generate a                        of classes in CALinkSpecification with the other sets of classes
link specification. A link specification has been defined in                      in the LinkSpecifications. A CASameAsCondition specifies an
multiple manners in the literature [3, 11, 12, 17]. We have                       OverlapFactor over a LinkSpecification. OverlapFactor takes
based our work in the definition given by Isele and Bizer                         as values for all, if all the Instances are required to be con-
[11]. A link specification is a set of restrictions that define                   sidered the same by means of the LinkSpecification, or exists,
the equality between a source and a target sets of classes                        if just one pair of Instances is required. ConditionComposi-
based on their data properties. For instance, a dblp:Author                       te combines different CASameAsConditions or other Condi-
and a nsf:Researcher are the same if they have very similar                       tionComposites, the Aggregation functions are: AND or OR
literals for dblp:name and nsf:name.                                              Boolean conditions. Finally, CALinkSpecification represents
   Figure 3(a) depicts how we model link specifications using                     the two main sets of classes, source and target, that the
an UML-like notation. Each DataLeafNode represents a spe-                         Instances we wish to link with owl:sameAs belongs.
cific data property and the dataset it belongs. OperandCom-                          We present a sample context-aware link specification in
posite specifies one Transformation function to be applied                        Figure 1(a) between dblp:Author and nsf:Researcher, that we
over the literals; examples of these transformations are low-                     refer to as CALSAR . The Instances of both classes are consid-
ercase, tokenize, concatenate or remove prefix. SameAsCon-                        ered the same using some of their data properties, dblp:name
dition represents a threshold and a string distance measure,                      and nsf:name, but also taking the data properties of Instances
or a string similarity, that defines when two Operands are                        belonging to the context into account. It uses two link spec-
the same; some of the well-known string distance measures                         ifications to link the different kind of Instances, LSAR and
are Levenshtein, Jaccard, Jaro and Jaro-Winkler. Condi-                           LSAP , and over them it defines two overlap factors, for all
tionComposite combines different SameAsConditions or other                        and exists, respectively.
ConditionComposites. Thanks to this, it is possible to define
restrictions over data properties and combine the results, for
example, using AND or OR Boolean conditions. LinkSpeci-                           4.       APPROACH
fication contains the sets of source and target classes of the                       Our proposal aims to generate context-aware link specifi-
Instances that we are relating with owl:sameAs links.                             cations by means of Algorithm 1. The input is an example
   Figure 1(a) shows two sample link specifications. One of                       composed by two Instances from different datasets represent-
them between the Instances of dblp:Author and nsf:Resear-                         ing the same concept. The output of the algorithm is a
cher (LSAR ). The Instances of both classes are the same                          context-aware link specification.
if literals of data properties dblp:name and nsf:name are the                        The algorithm takes each input Instance and explores the
same by means of a Jaro comparison and a threshold of 0.98.                       Instances related with them by means of their object prop-
The second link specification (LSAP ) relates dblp:Article and                    erties, retrieving all the new Instances from both contexts
nsf:Paper, which are the same if literals of data properties                      (lines 10-11 in Algorithm 1). Then, the algorithm generates
dblp:title and nsf:title are the same by means of a Levenshtein                   a set of link specifications for the Instances in each context,
comparison and a threshold of 0.90.                                               line 12 in Algorithm 1. For example, receiving as input the
                                                                                  Instances dblpU:Wang0011:Wei and nsfU:WeiWang0007 from
3.3    Context-Aware Link Specification                                           Figure 1(b), the algorithm firstly retrieves all the Instances
   A context-aware link specification extends the given def-                      that are related with them, dblpU:conf/vldb/JiangWL03 and
inition of link specification by defining when two Instances                      dblpU:conf/vldb/YuanLLWYZ05 for the first one, nsfU:AN-
are the same, like before, but the restrictions are not defined                   0423336 and nsfU:AN-0423336/#13 for the second one. It is
only over their data properties, but also taking data proper-                     important to notice that not only the Instances one-hop away
ties of other Instances that belongs to a different set of classes                are retrieved but also those that are more distant. Then,
into account, which are connected by object properties.                           the algorithm generates two link link specifications, LSAR
Algorithm 1 generateCALinkSpecification                           2), it firstly applies the current link specification to the In-
 1: input                                                         stances, and then, it measures the ratio of Instances linked
 2:   i1 , i2 : Instance                                          by a owl:sameAs generated with the current link specification
 3: output                                                        (line 16 in Algorithm2). In our technique, if all the Instances
 4:   cals: CALinkSpecification                                   in both contexts covered by the link specification are linked
 5: variables                                                     by owl:sameAs, then a for all overlap factor is assigned to
 6:   C1 , C2 : P Instance                                        the current link specification (lines 19-20 in Algorithm 2).
 7:   LS: P LinkSpecification                                     If only one owl:sameAs is generated between the Instances in
 8:   SO: P CASameAsCondition                                     the context, covered by the current link specification, then
 9:                                                               an exists overlap factor is assigned to it (lines 21-22 in Al-
10: C1 ← expand (i1 )                                             gorithm 2); otherwise, the link specification is discarded.
11: C2 ← expand (i2 )                                             In Figure 1(b), the algorithm assigns for all to LSAR , since
12: LS ← generateLinkSpecifications (C1 , C2 )                    there is only a pair of Instances covered by it and both ful-
13: SO ← assignOverlapFactor (LS, C1 , C2 , i1 , i2 )             fill the restrictions of LSAR . The algorithm assigns exists
14: cals ← createCALinkSpecification (SO, i1 , i2 )               to LSAP since only one pair of dblp:Article and nsf:Paper In-
                                                                  stances fulfill the conditions of LSAP . Additionally, we also
                                                                  store the sets of object properties that connect the class of
                                                                  the input Instance with the class of the Instances covered by
and LSAP , relating the Instances in both contexts. Actual
                                                                  the link specification (lines 17-18 in Algorithm 2). In Fig-
link specification techniques are able to generate LSAR and
                                                                  ure 1(b), the algorithm relates LSAR with an empty set of
LSAP , so, in this paper, we do not focus on generating them.
                                                                  object properties because the Instances covered by it are the
We assume that generateLinkSpecifications returns the best
                                                                  same that the input. Then the algorithm assigns to LSAP
link specifications for the Instances in the context.
                                                                  two sets of object properties, {dblp:writes } and {nsf:leads,
   The algorithm assigns to each link specification an overlap
                                                                  nsf:supports }, that connect the main Instancesdblp:Author
factor. In addition, we store the set of object properties
                                                                  and nsf:Researcher with the class of the Instances covered
that connect the class of the input Instance with the class
                                                                  by LSAP , dblp:Article and nsf:Paper. Finally, for each link
of the Instances covered by the link specification (line 13 in
                                                                  specification, its related overlap factor and the sets of ob-
Algorithm 1 and Algorithm 2). Finally, it combines with
                                                                  ject properties, the algorithm creates a CASameAsCondition
different aggregation functions the results obtained in the
                                                                  (line 24 in Algorithm 2). Every CASameAsCondition is added
previous step, creating the context-aware link specification
                                                                  to a set, which is the output of the algorithm when there are
(line 14 in Algorithm 1).
                                                                  no more link specifications to compute.
Algorithm 2 assignOverlapFactor
 1: input                                                         Algorithm 3 createCALinkSpecification
 2:   LS: P LinkSpecification                                      1: input
 3:   i1 , i2 : Instance                                           2:   i1 , i2 : Instance
 4:   C1 , C2 : P Instance                                         3:   SO: P CASameAsCondition
 5: output                                                         4: output
 6:   SO: P CASameAsCondition                                      5:   cals: CALinkSpecification
 7: variables                                                      6: variables
 8:   ls: LinkSpecification                                        7:   classsrc , classtrg : P Class
 9:   o1 , o2 : Double                                             8:   aggrAND: ConditionComposite
10:   oF: OverlapFactor                                            9:
11:   opsrc , optrg : P ObjectLeafNode                            10: aggrAND ← combineWithAndAggregations (SO)
12:                                                               11: classsrc ← extractRDFClass (i1 )
13: SO ← ∅                                                        12: classtrg ← extractRDFClass (i2 )
14: oF ← {}                                                       13: cals ← createCALS (aggrAND, classsrc , classtrg )
15: for each ls in LS
16:   (o1 ,o2 ) ← measureOverlap (ls, C1 , C2 )
17:   opsrc ← objectPropertiesPath (ls, C1 , i1 )
                                                                     Algorithm 3 receives as input a set of CASameAsCondi-
18:   optrg ← objectPropertiesPath (ls, C2 , i2 )
                                                                  tion and the input Instances, the output of the algorithm
19:   if o1 = 1.0 and o2 = 1.0 then
                                                                  is a CALinkSpecification. The algorithm starts combining
20:       oF ← {for all }
                                                                  all the different CASameAsConditions of the input set with
21:   if o1 > 0.0 and o2 > 0.0 then
                                                                  and aggregations (line 10 in Algorithm 3). Finally the algo-
22:       oF ← {exists }
                                                                  rithm extracts the class of each input Instance (lines 11-12
23:   if oF = {for all } or oF = {exists } then
                                                                  in Algorithm 3) and creates a CALinkSpecification (line 13
24:       SO ∪ createCASameAsCond (oF, ls, opsrc , optrg )
                                                                  in Algorithm 3). In Figure 1(b), the classes of the input In-
                                                                  stances are dblp:Author for Wang0011:Wei and nsf:Researcher
   Algorithm 2 receives as input a set of link specifications,    for WeiWang0007, the final context-aware link specification
two Instances to be linked, and two sets of Instance that         with the aggregation functions is depicted in this figure. It
are the contexts; the output of the algorithm is a set of         links dblp:Author and nsf:Researcher Instances if they have
CASameAsCondition. The algorithm starts iterating over the        similar names (for all LSAR ) and some of their publications
link specifications and, for each of them (line 15 in Algorithm   have similar titles (exists LSAP ).
                       LS: Jaro(dblp : name, nsf : name) ≥ threshold.       CALS: for all LS and exists Jaro(dblp : title, nsf : title) ≥ threshold.
                      1                                                                          1
                     0.9                                                                        0.9
                     0.8                                                                        0.8
                     0.7                                                                        0.7
     Effectiveness


                                                                                Effectiveness
                     0.6                                                                        0.6
                     0.5                                                                        0.5
                     0.4                                                                        0.4
                     0.3                                                                        0.3
                                                            Precision                                                                   Precision
                     0.2                                     Recall                             0.2                                      Recall
                     0.1                                   F-Measure                            0.1                                    F-Measure

                       0.7   0.75    0.8     0.85    0.9      0.95      1                         0.7    0.75    0.8     0.85    0.9      0.95      1
                                           Threshold                                                                   Threshold
                                (a) Link specification                                                  (b) Context-Aware link specfication

                                 Figure 4: Effectiveness results when specifications are given by an expert in DBLP-NSF


5.                   EVALUATION                                                                 cher instances that have the same name but are different
   We use two scenarios in which we study the effectiveness                                     authors, therefore, taking some context information into ac-
using link specifications and context-aware link specifica-                                     count, like their publications, is crucial to perform a suitable
tions. Both scenarios were built with real data using re-                                       link discovery task.
searchers that have published in PVLDB of 2013 extracted                                           To build this scenario, firstly, we extracted from DBLP all
from DBLP. Furthermore, there are real-world situations in                                      the articles and authors that have been published in PVLDB
which taking the context into account is crucial to perform                                     of 2013. Then, we looked up their names in the NSF portal
the optimal link discovery task. For each scenario, we did                                      and we extracted all their related information. Finally, we
two evaluations: in the first one, an expert defined a link                                     created two RDF datasets, whose data models are depicted
specification, to the best of his/her knowledge, and then,                                      in Figures 1(b) and 2(b), respectively. The resulting DBLP
the same expert defined a context-aware link specification.                                     dataset comprises 764 instances of dblp:Author and 47,225 in-
Since link specifications are very sensitive to their accep-                                    stances of dblp:Article. The resulting NSF dataset comprises
tance threshold, for each defined specification, we tuned the                                   235 instances of nsf:Researcher, 235 instances of nsf:Award,
acceptance threshold value of their string similarity from 0.7                                  and 6,877 instances of nsf:Paper. Since NSF has information
to 1.00 and analyzed for which values the best effectiveness                                    about different disciplines, in this dataset we have several re-
was achieved. In the second evaluation, we used GenLink                                         searchers that have the same name but are different people.
[11] to generate link specifications between the same classes                                   For example, in Figure 1, instances WeiWang0012 and Wei-
of the previous experiments, whose goal is to analyze the                                       Wang0007 have the same literal for nsf:name, although they
impact of adding context to a regular link specification gen-                                   are describing different researchers. In the whole dataset,
erated by a technique.                                                                          only 74 instances of nsf:Researcher have different literals for
   We make our data, algorithms, and scripts, publicly avail-                                   nsf:name.
able [4]. Therefore, our results can be reproduced and tested                                      Figure 4(a) depicts the effectiveness obtained with the link
by third parties and researchers can extend our results to                                      specification provided by the expert, a Jaro comparison over
cope with future requirements.                                                                  dblp:name and nsf:name, and using all possible acceptance
   We have implemented our technique in Java 1.8, and Jena                                      thresholds values from 0.7 to 1.0. Figure 4(b) depicts the
3.0.0. Our experiments were run on a computer that was                                          effectiveness obtained using the context-aware link specifi-
equipped with a Intel Core i7 2.8 GHz CPU and 16 GB                                             cation that extends the previous link specification, adding
RAM, running on Mac OS 10.9.5 (64-bits).                                                        a link specification composed by Jaro over dblp:title and
   In section 5.1, we present our first scenario, DBLP-NSF,                                     nsf:title, and using the best threshold for the dblp:Author
we describe its characteristics, the relationships between the                                  and nsf:Researcher link specification. The overlap factor for
datasets and how we built them. Section 5.2 follows the                                         the link specifications between the publications is exists and
same structure, in which we present our second scenario                                         for the link specification between persons is for all.
DBLP-DBLP.                                                                                         The results in Figure 4(a) shows how the effectiveness of
                                                                                                the link specification is better if the threshold acceptance
                                                                                                is higher, although it never reaches the best precision or
5.1                   NSF-DBLP scenario                                                         F-Measure of 1.0, it always obtains a recall of 1.00. Re-
   In this scenario, we have 188 owl:sameAs links between db-                                   call never changes because every dblp:Author that should
lp:Author and nsf:Researcher instances, which we consider our                                   be linked with a nsf:Researcher has exactly the same name,
gold standard. All of them relate authors and researchers                                       hence, if the threshold is low, the string metric generates
with the same name and publications in common. Between                                          false positives but always recognizes pairs of instances with
the datasets, we have 57 pair of dblp:Author and nsf:Resear-
                          LS: Jaro(dblp : name, dblp : name) ≥ threshold.                       CALS: for all Jaro(dblp:title, dblp:title) ≥ threshold.
                         1                                                                        1
                        0.9                                                                     0.9
                        0.8                                                                     0.8
                        0.7                                                                     0.7
        Effectiveness


                                                                                Effectiveness
                        0.6                                                                     0.6
                        0.5                                                                     0.5
                        0.4                                                                     0.4
                        0.3                                                                     0.3
                                                                Precision                                                                 Precision
                        0.2                                      Recall                         0.2                                        Recall
                        0.1                                    F-Measure                        0.1                                      F-Measure

                          0.7     0.75   0.8     0.85    0.9      0.95      1                     0.7      0.75   0.8      0.85    0.9      0.95      1
                                               Threshold                                                                Thresholds
                                     (a) Link specification                                             (b) Context-Aware link specfication

                                Figure 5: Effectiveness results when specifications are given by an expert in DBLP-DBLP


the same name (covering all the correct links). If the thresh-                           amples, Genlink generated LSN1 and LSN5 ; both have a Jac-
old is high, the precision improves by pruning these false                               card distance ≤ 0.37 over the literals of the data properties
positives.                                                                               dblp:name and nsf:name. Using 10 examples, GenLink gen-
   In Figure 4(b), the context-aware link specification ob-                              erated LSN10 , which has a Jaccard distance ≤ 0.21 for the
tains a precision that improves when the acceptance thresh-                              same data properties. On the other hand, the link specifi-
old is higher, however, the recall decreases for values higher                           cations between dblp:Article and nsf:Paper using 1 example
than 0.83 of acceptance threshold. The context-aware link                                was LST1 , it has a Levenshtein distance ≤ 29.48 over db-
specification reaches 1.00 in precision and recall for thresh-                           lp:title and nsf:title, using 5 examples GenLink generated
olds in the range of 0.80-0.83. Recall drops when the thresh-                            LST5 , which has a Jaccard distance ≤ 0.59 over the same
old is higher because this time we are comparing the names                               data properties and, finally, using 10 examples GenLink gen-
of the authors, and also the titles of their publications, which                         erated LST10 , it has a Levenshtein distance ≤ 7.05 over the
are written slightly different, e.g., SmartSaver turning flash                           same data properties.
drive into a disk energy saver for mobile computers and                                     We analyzed the effectiveness of dblp:Author and nsf:Re-
“SmartSaver: turning flash drive into a disk energy saver                                searcher link specification, and the context-aware link spec-
for mobilecomputers”. As result, an exact string matching                                ification for the same classes. The results in Table 3 shows
would not recognize them as the same. Due to this issue, re-                             how, when we took context into account, precision improved
call drops for higher thresholds. On the contrary, precision                             by 0.18 (1 example), 0.21 (examples) and 0.24 (10 examples).
improves when the threshold is higher, it mainly generates                               However, recall dropped by 0.05 in the context-aware link
false negatives but the instances linked are always correct.                             specification made by 10 examples because the acceptance
                                                                                         threshold was restrictive enough to not recognize titles writ-
                       LS for DBLP-NSF                                                   ten slightly different, as we explained before.
                    LS                P                     R         F
                   LSN1              0.76                  1.00      0.86                5.2          DBLP-DBLP scenario
                   LSN5              0.76                  1.00      0.86
                  LSN10              0.76                  1.00      0.86                  This scenario has 62 owl:sameAs links between the source
                                                                                         and target datasets. Both contains dblp:Author instances
                      CALS for DBLP-NSF
                                                                                         with similar names and aliases, which are different enough
        LS and their overlap factors  P                     R         F
                                                                                         to produce false positives using comparators with low ac-
      for all LSN1 and exists LST1   0.94                  1.00      0.97
                                                                                         ceptance thresholds, and false negatives with high accep-
       for all LSN5 and exists LST5  0.97                  1.00      0.99
                                                                                         tance thresholds. This scenario was built using the same
     for all LSN10 and exists LST10 1.00                   0.95      0.98
                                                                                         authors and publications in the previous scenario. We took
         CALS Best improvement       0.24                    -       0.13                the whole list filtered by authors with aliases (like “H. V.
                                                                                         Jagadish” and “Hosagrahar Visvesvaraya Jagadish”), then,
Table 3: GenLink results for dblp:Author and nsf:Researcher                              we split the instances in two datasets, each of which were
link specifications and context-aware link specification                                 obtained by using a different alias for the same person. The
                                                                                         data model of the source and target datasets is depicted in
  Table 3 shows the results obtained using GenLink to gen-                               Figure 2(a). Both datasets contain 58 dblp:Author instances
erate several link specifications between the classes of pre-                            and their publications, which are 5284 dblp:Article in total.
vious experiments. We used different number of examples                                  We conducted similar experiments in this scenario as previ-
to generate the links specifications. On one hand, for the                               ously.
instances of dblp:Author and nsf:Researcher with 1 and 5 ex-                               Figure 5(a) shows the results obtained for the link spec-
ification given by the expert, which relates by means of a         amples it generated LST10 that relates the different dblp:title
Jaro distance the dblp:name of dblp:Author instances from          by means of a Levenshtein distance ≤ 1.76.
the source and target dataset, then, we obtain the precision,        We analyzed the effectiveness for the source and target
recall and F-Measure for each possible threshold acceptance        dblp:Author instances using the link specification, and then,
value. Figure 5(b) shows the results for a context-aware link      the context-aware link specification. The results in Table
specification that uses a link specification which, by means       4 show how, when we take context into account, precision
of a Jaro distance, relates the dblp:title from the source and     does not change but recall improves by 0.58 (1 example),
target dblp:Article instances. The overlap factor for the link     0.54 (5 examples) and 0.58 (10 examples); which entails an
specification is for all.                                          improvement in the F-Measure of 0.46 (1 example), 0.45 (5
   The results of Figure 5(a) shows that the best F-Measure        examples) and 0.46 (10 examples).
results are obtained for thresholds values between 0.72 and
0.77; however, it never obtains a F-Measure of 1.00. For
higher thresholds, recall decreases while precision increases,     6.   CONCLUSION AND FUTURE WORK
this tendency is inverted for lower thresholds. Due to au-            In the literature, there are several techniques that gen-
thors’ aliases, recall behaves in the same way of Figure 4(b)      erate link specifications to perform a link discovery task;
with publication titles; if the threshold is higher, the link      however, none of them is able to exploit context informa-
specification does not recognize as the same some aliases,         tion. In this paper, we present a proposal to extend the
e.g., “H.V. Jagadish” and “Hosagrahar V. Jagadish”. On the         definition of link specification by means of the concept of
contrary, when the threshold is higher, precision improves         overlap factor, which let us exploit context information and
because the linked instances have similar names.                   define context-aware link specifications. Additionally, we
   Figure 5(b) shows that the context-aware link specifica-        have identified two real-world scenarios where the context
tions always obtain a precision of 1.00, recall and F-Measure      is crucial and where, the current techniques, are not able to
increases when the threshold is higher, achieving 1.00. This       obtain the best effectiveness without taking the context into
situation is the same as Figure 4(a), the titles of the publica-   account.
tions in each dataset have exactly the same literal, therefore,       Our experimental results prove how context-aware link
when the threshold is higher, the recall improves. Precision       specifications obtain a better effectiveness in comparison
is always 1.00 because the CALS of this example only links         with regular link specifications in our scenarios. We ob-
two instances of dblp:Author if all their publications are ex-     tained an improvement of 23% in precision and 58% in recall,
actly the same, due to the for all restriction. If just one        respectively.
publication is not linked, then their authors are also not            In future work, we plan to develop a technique to nav-
linked; therefore, if a link is actually generated, it is always   igate through context information of instances by not us-
correct.                                                           ing all of their object properties, and selecting only those
                                                                   more suitable to build effective context-aware link specifi-
                      LS for DBLP-DBLP                             cations. Additionally, we plan to add more metrics to cal-
                   LS               P         R       F            culate the overlap factor extending our current for all and
                  LSN1             1.00      0.26    0.45          exists restrictions. Finally, this paper is focused on gener-
                  LSN5             1.00      0.30    0.46          ating owl:sameAs links, but an interesting extension of our
                 LSN10             1.00      0.26    0.45          work is the generation of other kind of links in an automatic
                    CALS for DBLP-DBLP                             way, depending on the results of the overlap factor.
      LS and their overlap factors  P         R       F
              for all LST1         1.00      0.84    0.91
              for all LST5         1.00      0.84    0.91
                                                                   Acknowledgements
             for all LST10         1.00      0.84    0.91          Supported by the Spanish R&D&I program under grant
       CALS Best improvement         -       0.58    0.46          TIN2013-40848-R.

Table 4: GenLink results for source and target dblp:Author         7.   REFERENCES
link specification and context-aware link specification
                                                                    [1] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data:
   Table 4 shows the link specifications generated by Gen-              Principles and state of the art. In WWW, pages 1–40,
Link for the same classes of the previous experiments. On               2008.
one hand, for the source and target dblp:Author instances,          [2] C. Bizer, T. Heath, and T. Berners-Lee. Linked
with 1 example, Genlink generated LSN1 that relates source              Data-the story so far. Int. J. Semantic Web Inf. Syst.
and target dblp:name data properties by means of a Jaccard              5(3), pages 205–227, 2009.
distance ≤ 0.15, with 5 examples generated LSN5 that re-            [3] M. G. Carvalho, A. H. Laender, M. A. Gonçalves, and
lates the same data properties by means of a Levenshtein                A. S. da Silva. Replica identification using genetic
distance ≤ 1.48 and, finally, with 10 examples it generated             programming. In SAC, pages 1801–1806, 2008.
LSN10 , which relates the same data properties by means of          [4] A. Cimmino, C. R. Rivero, and D.Ruiz. Research
a Levenshtein distance ≤ 1.15. On the other hand, for the               prototype, repositories and experimental results. URL
source and target dblp:Article instances, with 1 example Gen-           http://www.tdg-seville.info/acimmino/Cals, 2016.
Link generated LST1 that relates source and target dblp:title       [5] M. G. de Carvalho, A. H. F. Laender, M. A.
by means of a Levenshtein distance ≤ 1.76, with 5 examples              Gonçalves, and A. S. da Silva. A genetic programming
it generated LST5 that relates the same data properties by              approach to record deduplication. IEEE Trans.
means of a Levenshtein distance ≤ 1.46, finally, with 10 ex-            Knowl. Data Eng., 24(3):399–412, 2012.
 [6] I. Ermilov, M. Martin, J. Lehmann, and S. Auer.
     Linked Open Data Statistics: Collection and
     Exploitation. In KESW, pages 242–249. 2013.
 [7] H. Halpin, P. J. Hayes, J. P. McCusker, D. L.
     McGuinness, and H. S. Thompson. When owl:sameAs
     isn’t the same: An analysis of identity in Linked Data.
     In ISWC, pages 305–320. 2010.
 [8] O. Hassanzadeh, K. Q. Pu, S. H. Yeganeh, R. J.
     Miller, L. Popa, M. A. Hernández, and H. Ho.
     Discovering linkage points over web data. PVLDB,
     6(6):444–456, 2013.
 [9] T. Heath and C. Bizer. Linked Data: Evolving the
     Web into a Global Data Space. Morgan & Claypool
     Publishers, 2011.
[10] M. Holub, O. Proksa, and M. Bieliková. Detecting
     identical entities in the Semantic Web Data. In
     SOFSEM, pages 519–530. 2015.
[11] R. Isele and C. Bizer. Learning expressive linkage rules
     using genetic programming. PVLDB, 5(11):1638–1649,
     2012.
[12] R. Isele and C. Bizer. Active learning of expressive
     linkage rules using genetic programming. J. Web
     Sem., 23:2–15, 2013.
[13] R. Isele, A. Jentzsch, and C. Bizer. Efficient
     multidimensional blocking for link discovery without
     losing recall. In ACM SIGMOD workshops, 2011.
[14] M. Nentwig, M. Hartung, A.-C. N. Ngomo, and
     E. Rahm. A survey of current Link Discovery
     frameworks. Web Sem. J., pages 1–18, 2015.
[15] A.-C. N. Ngomo and S. Auer. LIMES: A time-efficient
     approach for large-scale Link Discovery on the Web of
     Data. In IJCAI, pages 2312–2317, 2011.
[16] A.-C. N. Ngomo, J. Lehmann, S. Auer, and
     K. Höffner. RAVEN - Active learning of link
     specifications. In ISWC workshops, pages 25–37, 2011.
[17] A.-C. N. Ngomo and K. Lyko. EAGLE: Efficient
     active learning of link specifications using genetic
     programming. In ESWC, pages 149–163. 2012.
[18] A.-C. N. Ngomo and K. Lyko. Unsupervised learning
     of link specifications: deterministic vs.
     non-deterministic. In ISWC workshops, pages 25–36,
     2013.
[19] A. Nikolov, M. d’Aquin, and E. Motta. Unsupervised
     learning of Link Discovery configuration. In ESWC,
     pages 119–133. 2012.
[20] C. R. Rivero, I. Hernández, D. Ruiz, and
     R. Corchuelo. Exchanging data amongst linked data
     applications. Knowl. Inf. Syst., 37(3):693–729, 2013.
[21] D. Song and J. Heflin. Automatically generating data
     linkages using a domain-independent candidate
     selection approach. In ISWC, pages 649–664. 2011.
[22] T. Soru and A.-C. N. Ngomo. A comparison of
     supervised learning classifiers for Link Discovery. In
     SEM, pages 41–44, 2014.
[23] F. M. Suchanek, S. Abiteboul, and P. Senellart.
     PARIS: Probabilistic alignment of relations, instances,
     and schema. PVLDB, 5(3):157–168, 2011.