<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving Link Specifications using Context-Aware Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Cimmino</string-name>
          <email>cimmino@us.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos R. Rivero</string-name>
          <email>crr@cs.rit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Ruiz</string-name>
          <email>druiz@us.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rochester Institute of Technology</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Seville</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>There is an increasing interest in publishing data using the Linked Open Data philosophy. To link RDF datasets, a link discovery task is performed to generate owl:sameAs links. There are two ways to perform this task: by means of a classifier or a link specification; we focus on the latter approach. Current link specification techniques only use the data properties of the instances that they are linking, and they do not take the context information into account. In this paper, we present a proposal that aims to generate context-aware link specifications to improve regular link specifications, increasing the effectiveness of the results in several real-world scenarios where the context is crucial. Our context-aware link specifications are independent of similarity functions, transformations or aggregations. We have evaluated our proposal using two real-world scenarios in which we improve precision and recall with respect to regular link specifications by 23% and 58%, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In recent years, we have witnessed an increasing interest in
Linked Open Data [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. As a matter of fact, the number
of datasets was 452 in 2011 and rose to 2,289 in 2014 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Publicly available datasets have to fulfill the
Linked Data principles, which mainly consist of using IRIs
as names of things; using HTTP IRIs so that people can
look up those names; providing useful information using the
standards (RDF or SPARQL) when someone looks up an IRI;
and, finally, including links to other IRIs so that people can
discover more things [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Since the number of datasets has
increased in recent years and these principles establish
that datasets must be linked with others and published
in RDF formats [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a huge effort has been made to link
these RDF datasets automatically [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. There are different
types of links between datasets, but the most common one is
owl:sameAs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. To generate the links, a link discovery task
must be performed, which aims to find all pairs of instances
that describe the same concept [
        <xref ref-type="bibr" rid="ref13 ref15">13, 15</xref>
        ].
      </p>
      <p>
        Link discovery can be performed in two different ways:
by means of a classifier [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], which links instances with
owl:sameAs if it considers them to be the same; or by
generating a link specification, which is a set of restrictions over
the data properties of two instances [
        <xref ref-type="bibr" rid="ref11 ref17">17, 11</xref>
        ]. Each pair of
data properties is associated with a similarity function and a
threshold; if the similarity function returns a value higher
than the threshold, then the restriction is satisfied and the
two instances are linked using owl:sameAs. For example, a
link specification may define that two instances of Person are the
same if both have a data property describing their names,
and their literals are exactly the same.
      </p>
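      <p>As an illustration only, the notion of a link specification described above can be sketched in Python as a list of restrictions, each pairing two data properties with a similarity function and a threshold. This is a minimal sketch and not the implementation of any of the cited techniques: the names Restriction, same_as and exact_match, the ex:name property and the sample literals are hypothetical, and the comparison uses "greater than or equal" so that an exact match with threshold 1.0 is accepted.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation: an instance maps data property IRIs to literals.
Instance = Dict[str, str]


def exact_match(a: str, b: str) -> float:
    """Similarity that is 1.0 for identical literals and 0.0 otherwise."""
    return 1.0 if a == b else 0.0


@dataclass
class Restriction:
    source_prop: str
    target_prop: str
    similarity: Callable[[str, str], float]
    threshold: float

    def holds(self, src: Instance, trg: Instance) -> bool:
        # Both instances must carry the data property being compared.
        if self.source_prop not in src or self.target_prop not in trg:
            return False
        score = self.similarity(src[self.source_prop], trg[self.target_prop])
        # Assumption: a score equal to the threshold is accepted.
        return score >= self.threshold


def same_as(restrictions: List[Restriction], src: Instance, trg: Instance) -> bool:
    """Link the pair with owl:sameAs only if every restriction is satisfied."""
    return all(r.holds(src, trg) for r in restrictions)


# Two Person instances are the same if their names are exactly the same.
person_spec = [Restriction("ex:name", "ex:name", exact_match, 1.0)]
person_a = {"ex:name": "Alice Smith"}
person_b = {"ex:name": "Alice Smith"}
person_c = {"ex:name": "Bob Jones"}

print(same_as(person_spec, person_a, person_b))
print(same_as(person_spec, person_a, person_c))
```
      </preformat>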
      <p>Unfortunately, in some scenarios, this definition is not
suitable to generate owl:sameAs links. Under some
circumstances, taking only the literals of the data properties of
two instances into account may lead to mixing up instances
that are very similar but actually different, e.g., two
different people who have the same name. In these cases,
a link specification should include conditions over the
context information, i.e., the data properties of other instances
that are related to the main two, in order to have more
information and improve the effectiveness.</p>
      <p>
        In this paper, we focus on improving current link
specifications by making them context-aware. We aim to extend
the definition used by current techniques [
        <xref ref-type="bibr" rid="ref11 ref17">17, 11</xref>
        ], and add
restrictions over the instances in the context of the pair that
we are linking. To achieve this, we introduce the concept
of overlap factor. If we define different link specifications
over the instances in both contexts, we can handle the contexts as
sets that potentially overlap by means of an equality criterion
defined by the link specifications. The overlap factor is a
function that measures the overlap between contexts.
Thanks to this, we are able to define restrictions over this
value. In this paper, we restrict ourselves to two different
types of overlap factors, namely: exists and for all. The
former means that there is a pair of instances in the contexts
that is considered the same by means of a link specification. The
latter means that all the instances in both contexts are the
same.
      </p>
      <p>Figure 1 depicts a sample scenario where the context is
crucial to obtain good precision. Figure 1(a) shows a part
of the data models of DBLP and the National Science
Foundation (NSF); both were built using authors that have
published in the International Conference on Very Large Data
Bases (PVLDB) of 2013. DBLP contains these authors and
their articles; NSF contains the researchers from its portal
who have the same names as these authors and who lead awards
that support papers. We wish to identify authors in DBLP that
have been awarded NSF grants using their names and
publications. We include two link specifications: LSAR,
which links dblp:Author and nsf:Researcher instances if
the literals of dblp:name and nsf:name obtain a score over
0.98 using Jaro; and LSAP, which links dblp:Article and nsf:Paper
instances if the literals of dblp:title and nsf:title obtain a score
over 0.90 using Levenshtein. We rely on these link
specifications and add overlap factors to them, each of which states
how many instances should be linked. The overlap factors
for LSAR and LSAP are for all and exists, respectively. The
resulting context-aware link specification is interpreted as
follows: a pair of dblp:Author and nsf:Researcher instances
is the same by means of the context-aware link
specification if all the instances covered by LSAR (pairs of dblp:Author
and nsf:Researcher instances) are the same, and if at least one
pair covered by LSAP (dblp:Article and nsf:Paper instances)
is the same.</p>
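      <p>The interpretation above can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: exact string equality stands in for the Jaro and Levenshtein comparisons, the titles are placeholders, and the reading of for all as "every instance in each context is linked to some instance in the other context" is our assumption for the purpose of the sketch. The helper names (exists, for_all, cals_ar) are hypothetical.</p>
      <preformat>
```python
from typing import Callable, Dict, List

Instance = Dict[str, str]
LinkSpec = Callable[[Instance, Instance], bool]


def exists(spec: LinkSpec, src_ctx: List[Instance], trg_ctx: List[Instance]) -> bool:
    """At least one source/target pair in the contexts is linked by spec."""
    return any(spec(s, t) for s in src_ctx for t in trg_ctx)


def for_all(spec: LinkSpec, src_ctx: List[Instance], trg_ctx: List[Instance]) -> bool:
    """Assumed reading: every instance on each side is linked to the other side."""
    if not src_ctx or not trg_ctx:
        return False
    every_src = all(any(spec(s, t) for t in trg_ctx) for s in src_ctx)
    every_trg = all(any(spec(s, t) for s in src_ctx) for t in trg_ctx)
    return every_src and every_trg


def ls_ar(author: Instance, researcher: Instance) -> bool:
    # Stand-in for "Jaro score over 0.98" on dblp:name / nsf:name.
    return author["name"] == researcher["name"]


def ls_ap(article: Instance, paper: Instance) -> bool:
    # Stand-in for "Levenshtein score over 0.90" on dblp:title / nsf:title.
    return article["title"] == paper["title"]


def cals_ar(author, articles, researcher, papers) -> bool:
    """Context-aware specification: for all over LSAR AND exists over LSAP."""
    return for_all(ls_ar, [author], [researcher]) and exists(ls_ap, articles, papers)


# Placeholder data: one author, two homonymous researchers.
author = {"name": "Wei Wang"}
articles = [{"title": "Title A"}, {"title": "Title B"}]
researcher_1 = {"name": "Wei Wang"}
papers_1 = [{"title": "Title A"}]
researcher_2 = {"name": "Wei Wang"}
papers_2 = [{"title": "Title C"}]

print(cals_ar(author, articles, researcher_1, papers_1))
print(cals_ar(author, articles, researcher_2, papers_2))
```
      </preformat>
      <p>In this sketch both researchers satisfy the name-based specification, but only the first one shares a publication with the author, so only that pair is linked.</p>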
      <p>Current link specification techniques can generate LSAR
or LSAP (notice that LSAP would not link instances of
dblp:Author and nsf:Researcher), but they are not able to use
LSAP to link instances of dblp:Author and nsf:Researcher.
Additionally, they are not able to generate and apply overlap
factors over them.</p>
      <sec id="sec-1-1">
        <p>Figure 1(b) depicts a set of sample instances, the link
specifications LSAR and LSAP, and the context-aware link
specification CALSAR. We focus on "Wei Wang", who is an
author in DBLP, dblpU:Wang0011:Wei. We see that there
are two researchers whose name is "Wei Wang" that lead
two different NSF grants (nsf:AN-1043034, nsf:AN-0423336).
Using LSAR alone, we link the three instances
(dblpU:Wang0011:Wei with nsfU:WeiWang0012, and
dblpU:Wang0011:Wei with nsfU:WeiWang0007). However, using
LSAP we are able to identify one paper in DBLP written by
"Wei Wang" that also appears in NSF. If we use both link
specifications at the same time as part of the context
information of the authors, we discard one of the previous
links (dblpU:Wang0011:Wei and nsfU:WeiWang0012), which is
not correct since it actually links another researcher in NSF
whose name is "Wei Wang". As a result, we improve precision
using context information.</p>
        <p>Figure 2 depicts another sample scenario, in which DBLP
acts both as source and target, where the context is
crucial to obtain good recall. Figure 2(a) shows a part of
the data model. This scenario has several authors and their
aliases (different names that refer to the same person); both
datasets contain the same authors and their articles, but
the authors have different aliases in each dataset. We include a
link specification, LSAA, that links dblp:Article instances if
the literals of their dblp:title obtain a score over 0.99 using
Jaccard. We rely on LSAA and add a for all overlap factor
to it. The resulting context-aware link specification is
interpreted as follows: two instances of dblp:Author are the
same by means of the context-aware link specification if all
their dblp:Article instances are the same by means of LSAA.</p>
        <p>Figure 2(b) depicts a set of sample instances, the link
specification LSAA and the context-aware link specification
CALSAA. We focus on linking "Hosagrahar V. Jagadish" and
"H. V. Jagadish" (dblpU:Jagadish 0002:Hosagrahar.V. and
dblpU:Jagadish 0001:H.V.), which are different names of the
same person. A regular link specification would compare
these literals; instead, we rely on their publications. Using
LSAA as part of the context information of the authors, we
are able to link all their dblp:Article instances and, hence,
link the instances dblpU:Jagadish 0002:Hosagrahar.V. and
dblpU:Jagadish 0001:H.V. through their publications. We
improve the recall because we do not rely on the names of the
authors, which differ between the two datasets; instead, we use
their publications, whose titles do not differ between the two
datasets. As a result, we improve the recall using the context
information.</p>
        <p>We performed several experiments using the datasets
introduced in Figures 1 and 2, in which we improved precision
by 23% and recall by 58%, respectively.</p>
        <p>The rest of the article is organized as follows: in Section
2, we report on several related proposals and their features;
Section 3 introduces our conceptual framework; Section 4
presents our proposal to generate context-aware link
specifications; Section 5 reports the results obtained in our
experiments; and, finally, Section 6 recaps our main conclusions
and future work.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Over the last years, several approaches have been
developed to address the link discovery task. There are two ways
to approach link discovery. The first is to build different
kinds of classifiers to establish whether two instances are the same
[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. The second is to discover
accurate link specifications, which specify conditions that must
hold true for two entities to be interlinked [
        <xref ref-type="bibr" rid="ref11 ref12 ref17 ref18 ref19 ref21">11, 12, 17, 18,
19, 21</xref>
        ]. The main difference between these two approaches
is that the former does not specify why two instances are the
same, i.e., it works like a black box that receives two
instances as input and outputs whether or not they are the same; the
latter generates a specification of why two instances should
be the same, describing which data properties have to fulfill
the conditions.
      </p>
      <p>
        We focus only on the link specification approach, although
we have also analyzed some classifiers since they exploit the
context information [
        <xref ref-type="bibr" rid="ref10 ref23 ref8">8, 10, 23</xref>
        ]. Holub et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed
a technique that works with a fixed formula that takes the
instances related directly to the pair that is being linked into
account. PARIS [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] is an unsupervised technique developed
to exploit the information contained in instances related to
the pair that is being analyzed. It takes two data
models as input and generates a probabilistic model. First,
the technique computes the probabilities of
equivalences of instances; then, the probabilities for relationships
with other instances; and, finally, it creates the equivalences
between the classes. Hassanzadeh et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] proposed a
semi-supervised technique that receives two datasets as input.
This technique works as follows: given all the data
properties of all the classes in each dataset, the technique iterates
over one set and searches the other set for the most similar
data properties according to a string distance. The
technique returns the set of pairs ranked according to the string
distance.
      </p>
      <p>
        Regarding the link specification approaches, Isele and
Bizer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed GenLink, which is a supervised genetic
programming technique that generates link specifications as
trees. It starts with a population made of random link
specifications and some recurrent link specifications predefined
by the authors. Then, using genetic operations
(reproduction, crossover and mutation), the population is evolved and
its quality is evaluated by means of a fitness function, which
uses training data provided by the user. The technique stops
when a configured maximum number of iterations is reached,
or a link specification obtains a fitness value
over a threshold given by the user. Based on GenLink, the
same authors proposed ActiveGenLink [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which aims to
reduce the number of labelled examples using active
learning. ActiveGenLink selects link candidates to be labelled
by the user from a pool of unlabelled instances through a
query strategy. Then, once the user labels a given example,
it adds the example to the training data and evolves the
population using GenLink. Another semi-supervised technique
is EAGLE [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which is based on genetic programming
with active learning, and aims to generate link
specifications as trees. It starts by detecting similar classes and
data properties using RAVEN [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Then, EAGLE evolves
an initial random population of link specifications according
to genetic operators. After that, the technique computes the
most informative links and asks the user to label them. This
process is repeated until the stop condition is fulfilled, i.e., a
maximum number of iterations is reached, or the fitness value
of a link specification is over a given threshold. An
unsupervised learning technique was proposed by Nikolov et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ],
which starts with a random population and keeps iterating
over it, applying genetic operators, until a maximum
number of iterations is reached, or the fitness of the population
does not improve for several iterations. Since this technique
does not work with labelled data, the fitness function uses
two criteria defined by the authors to evaluate link
specifications, namely: pseudo-F-measure and neighborhood growth.
When a stop condition is reached, it returns the link
specification with the highest fitness value from the population.
EUCLID [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] is an unsupervised technique that, using
different similarity functions, evaluates the data properties of the
instances and generates a space of similarity values. Then,
depending on different heuristics, it iterates over that space,
updating the scores and pruning them until a solution is
found or a stop condition is reached. The unsupervised
technique proposed by Song and Heflin [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] focuses on metrics
to improve candidate selection so that it is as scalable
as possible. Candidate selection is a process to pick pairs
of instances, each of which has a high probability of being the
same. The process is performed by selecting and
comparing only part of the data properties of each instance in the
pair. It then extracts a set of data properties that are very useful
for disambiguation, which identify why the pair of instances are
the same.
      </p>
      <p>
        As far as we know, none of the previous link specification
techniques is able to exploit context information. Holub et
al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed a technique that takes into account only
the instances of the context that are one hop away from the pair that
is being linked. PARIS [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] takes as input all the datasets and
generates a probabilistic model to classify input instances.
The technique by Hassanzadeh et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] returns a ranking of the
most similar data properties using several string distances.
None of the previous techniques is able to apply
transformations on data properties and only [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is able to use different
string similarity measures; however, it only returns a ranked
list of data properties and not why two instances should be
linked. Additionally, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] only takes one-hop connected
instances into account, although many real-world scenarios
require taking instances that are more than one hop away into
account [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>Table 1 summarizes all the techniques and their different
features. Those that generate link specifications are
classified as LS; if they take the context into account, whether
they are LS or not, then we classify them as context-aware (C-A). Finally, if
the technique is independent of any specific function
(aggregations, transformations, string distance measures or string
similarities), we classify it as function independent (FI).</p>
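      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Technique</th><th>LS</th><th>C-A</th><th>FI</th></tr>
          </thead>
          <tbody>
            <tr><td>[<xref ref-type="bibr" rid="ref10">10</xref>] Holub et al. (2015)</td><td>No</td><td>Yes</td><td>No</td></tr>
            <tr><td>[<xref ref-type="bibr" rid="ref23">23</xref>] Suchanek et al. (2011)</td><td>No</td><td>Yes</td><td>No</td></tr>
            <tr><td>[<xref ref-type="bibr" rid="ref8">8</xref>] Hassanzadeh et al. (2013)</td><td>No</td><td>Yes</td><td>No</td></tr>
            <tr><td>[<xref ref-type="bibr" rid="ref12 ref5">5, 12</xref>] Isele and Bizer (2011, 2012)</td><td>Yes</td><td>No</td><td>Yes</td></tr>
            <tr><td>[<xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>] Ngonga and Lyko (2012, 2013)</td><td>Yes</td><td>No</td><td>Yes</td></tr>
            <tr><td>[<xref ref-type="bibr" rid="ref19">19</xref>] Nikolov et al. (2012)</td><td>Yes</td><td>No</td><td>Yes</td></tr>
            <tr><td>[<xref ref-type="bibr" rid="ref21">21</xref>] Song and Heflin (2011)</td><td>Yes</td><td>No</td><td>No</td></tr>
          </tbody>
        </table>
      </table-wrap>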
    </sec>
    <sec id="sec-3">
      <title>3. PRELIMINARIES</title>
      <p>In the following, we present the formalization of several
concepts that we use to describe our proposal. We define its
foundations, what a link specification is and what we mean
by context-aware link specification.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Foundations</title>
      <p>We focus on RDF datasets, which are triple stores
that contain literals and IRIs. Our proposal focuses on the
analysis of different instances, each of which entails several
concepts as follows:</p>
      <p>IRI: it uniquely identifies a web location. Note that
we use expressions like dblp:name to refer to IRIs, in
which dblp: is a prefix. Table 2 summarizes all of our
prefixes. For example, some sample IRIs in Figure 1
are dblpU:Wang0011:Wei and nsfU:YangWang0023.</p>
      <p>BlankNode: blank nodes are placeholders for IRIs whose actual value
is unknown. They have only local scope and are purely
an artifact of the serialization. Blank nodes are disjoint
from IRIs and Literals.</p>
      <p>Instance: an instance is an IRI or a BlankNode that we
are interested in linking.</p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Prefix</th><th>IRI</th></tr>
          </thead>
          <tbody>
            <tr><td>rdf:</td><td>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</td></tr>
            <tr><td>owl:</td><td>http://www.w3.org/2002/07/owl#</td></tr>
            <tr><td>dblp:</td><td>http://example.org/voc/dblp#</td></tr>
            <tr><td>nsf:</td><td>http://example.org/voc//nsf#</td></tr>
            <tr><td>dblpU:</td><td>http://example.org/urls/dblp#</td></tr>
            <tr><td>nsfU:</td><td>http://example.org/urls/nsf#</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Class: we can assign classes to Instances, each of which
is an IRI that represents a real-world concept. When
we assign a Class to an Instance, we are explicitly saying
that the Instance belongs to the type of the Class. We
use rdf:type to represent this assignment. For
example, in Figure 1, Instances related to the Class dblp:Author
represent authors in DBLP.</p>
      <p>DataProperty: Instances may comprise attributes that
describe features of the Instances, which are plain
literals. To represent these features, we use data
properties, which are IRIs that identify these literals. In
Figure 1, the names of the dblp:Author Instances are
identified using dblp:name.</p>
      <p>Literal: it denotes a value that a data property takes.
For example, in Figure 1, "Yang Wang" for nsfU:YangWang0023
or "Wei Wang" for nsfU:WeiWang0012 are
sample literals for the same data property. Depending
on the Instance, data properties have different literals.</p>
      <sec id="sec-4-1">
        <p>ObjectProperty: Instances can be related to other
Instances by means of object properties, which are IRIs;
a set of related Instances forms a graph. Note that,
in RDF, object properties are first-class citizens and
they are not subordinated to Instances. Figure 1 shows
a sample object property that relates dblp:Author and
dblp:Article using dblp:writes. Notice that we can add
multiple relations connecting the same Instance to
multiple Instances; for example, in Figure 1, the object
property dblp:writes may relate one dblp:Author with
several dblp:Article Instances.</p>
        <p>Figure 3: (a) Link Specification; (b) Context-Aware Link Specification.</p>
        <p>
          When performing link discovery, we have a source and
a target dataset that we wish to relate using owl:sameAs
links. To link the Instances of each dataset, we generate a
link specification. A link specification has been defined in
multiple ways in the literature [
          <xref ref-type="bibr" rid="ref11 ref12 ref17 ref3">3, 11, 12, 17</xref>
          ]. We have
based our work on the definition given by Isele and Bizer
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. A link specification is a set of restrictions that define
the equality between a source and a target set of classes
based on their data properties. For instance, a dblp:Author
and an nsf:Researcher are the same if they have very similar
literals for dblp:name and nsf:name.
        </p>
        <p>Figure 3(a) depicts how we model link specifications using
a UML-like notation. Each DataLeafNode represents a
specific data property and the dataset it belongs to.
OperandComposite specifies one Transformation function to be applied
over the literals; examples of these transformations are
lowercase, tokenize, concatenate or remove prefix.
SameAsCondition represents a threshold and a string distance measure,
or a string similarity, that defines when two Operands are
the same; some of the well-known string distance measures
are Levenshtein, Jaccard, Jaro and Jaro-Winkler.
ConditionComposite combines different SameAsConditions or other
ConditionComposites. Thanks to this, it is possible to define
restrictions over data properties and combine the results, for
example, using AND or OR Boolean conditions.
LinkSpecification contains the sets of source and target classes of the
Instances that we are relating with owl:sameAs links.</p>
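        <p>To make the model more concrete, the following Python sketch mirrors the elements named above (DataLeafNode, OperandComposite, SameAsCondition, ConditionComposite and LinkSpecification) as small data classes. It is a simplified illustration under our own assumptions rather than a reference implementation: the evaluation methods (value, holds, links), the token-based jaccard helper, and the sample titles are hypothetical, and a condition is considered satisfied when the score is greater than or equal to the threshold.</p>
        <preformat>
```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Union

Instance = Dict[str, str]  # data property IRI -> literal


@dataclass
class DataLeafNode:
    prop: str
    dataset: str  # "SRC" or "TRG"

    def value(self, src: Instance, trg: Instance) -> str:
        return (src if self.dataset == "SRC" else trg)[self.prop]


@dataclass
class OperandComposite:
    f: Callable[..., str]  # Transformation, e.g. str.lower
    operands: List["Operand"]

    def value(self, src: Instance, trg: Instance) -> str:
        return self.f(*(o.value(src, trg) for o in self.operands))


Operand = Union[DataLeafNode, OperandComposite]


@dataclass
class SameAsCondition:
    f: Callable[[str, str], float]  # string similarity
    threshold: float
    left: Operand
    right: Operand

    def holds(self, src: Instance, trg: Instance) -> bool:
        score = self.f(self.left.value(src, trg), self.right.value(src, trg))
        return score >= self.threshold


@dataclass
class ConditionComposite:
    f: Callable[[List[bool]], bool]  # Aggregation: all (AND) or any (OR)
    conditions: List["Condition"]

    def holds(self, src: Instance, trg: Instance) -> bool:
        return self.f([c.holds(src, trg) for c in self.conditions])


Condition = Union[SameAsCondition, ConditionComposite]


@dataclass
class LinkSpecification:
    source: FrozenSet[str]
    target: FrozenSet[str]
    condition: Condition

    def links(self, src: Instance, trg: Instance) -> bool:
        return self.condition.holds(src, trg)


def jaccard(a: str, b: str) -> float:
    """Token-based Jaccard similarity (simplified stand-in)."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta.intersection(tb)) / len(union) if union else 1.0


# Lowercase both titles, then require a Jaccard score of at least 0.9.
title_condition = SameAsCondition(
    jaccard,
    0.9,
    OperandComposite(str.lower, [DataLeafNode("dblp:title", "SRC")]),
    OperandComposite(str.lower, [DataLeafNode("nsf:title", "TRG")]),
)
spec = LinkSpecification(
    frozenset({"dblp:Article"}),
    frozenset({"nsf:Paper"}),
    ConditionComposite(all, [title_condition]),
)

print(spec.links({"dblp:title": "Linking RDF Data"}, {"nsf:title": "linking rdf data"}))
print(spec.links({"dblp:title": "Linking RDF Data"}, {"nsf:title": "an unrelated title"}))
```
        </preformat>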
        <p>Figure 1(a) shows two sample link specifications. The first
one (LSAR) relates the Instances of dblp:Author and
nsf:Researcher. The Instances of both classes are the same
if the literals of data properties dblp:name and nsf:name are the
same by means of a Jaro comparison and a threshold of 0.98.
The second link specification (LSAP) relates dblp:Article and
nsf:Paper, which are the same if the literals of data properties
dblp:title and nsf:title are the same by means of a Levenshtein
comparison and a threshold of 0.90.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3.3 Context-Aware Link Specification</title>
      <p>A context-aware link specification extends the given
definition of link specification by defining when two Instances
are the same, like before, but the restrictions are not defined
only over their data properties; they also take into account the data
properties of other Instances that belong to different sets of classes
and that are connected by object properties.</p>
      <p>Figure 3(b) specifies the structure of a context-aware link
specification using a UML-like notation. Each
ObjectLeafNode represents the object properties that connect the sets
of classes in CALinkSpecification with the other sets of classes
in the LinkSpecifications. A CASameAsCondition specifies an
OverlapFactor over a LinkSpecification. OverlapFactor takes
as values for all, if all the Instances are required to be
considered the same by means of the LinkSpecification, or exists,
if just one pair of Instances is required.
ConditionComposite combines different CASameAsConditions or other
ConditionComposites; the Aggregation functions are AND or OR
Boolean conditions. Finally, CALinkSpecification represents
the two main sets of classes, source and target, to which the
Instances we wish to link with owl:sameAs belong.</p>
      <sec id="sec-5-1">
        <p>We present a sample context-aware link specification in
Figure 1(a) between dblp:Author and nsf:Researcher, which we
refer to as CALSAR. The Instances of both classes are
considered the same using some of their data properties, dblp:name
and nsf:name, but also taking the data properties of Instances
belonging to the context into account. It uses two link
specifications to link the different kinds of Instances, LSAR and
LSAP, and over them it defines two overlap factors, for all
and exists, respectively.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. APPROACH</title>
      <p>Our proposal aims to generate context-aware link
specifications by means of Algorithm 1. The input is an example
composed of two Instances from different datasets
representing the same concept. The output of the algorithm is a
context-aware link specification.</p>
      <p>The algorithm takes each input Instance and explores the
Instances related to it by means of object
properties, retrieving all the new Instances from both contexts
(lines 10-11 in Algorithm 1). Then, the algorithm generates
a set of link specifications for the Instances in each context
(line 12 in Algorithm 1).</p>
      <sec id="sec-6-1">
        <p>For example, receiving as input the Instances
dblpU:Wang0011:Wei and nsfU:WeiWang0007 from
Figure 1(b), the algorithm first retrieves all the Instances
that are related to them: dblpU:conf/vldb/JiangWL03 and
dblpU:conf/vldb/YuanLLWYZ05 for the first one, and
nsfU:AN-0423336 and nsfU:AN-0423336/#13 for the second one. It is
important to notice that not only the Instances one hop away
are retrieved but also those that are more distant. Then,
the algorithm generates two link specifications, LSAR
and LSAP, relating the Instances in both contexts. Current
link specification techniques are able to generate LSAR and
LSAP, so, in this paper, we do not focus on generating them.
We assume that generateLinkSpecifications returns the best
link specifications for the Instances in the context.</p>
        <preformat>Algorithm 1 generateCALinkSpecification
 1: input
 2:   i1, i2: Instance
 3: output
 4:   cals: CALinkSpecification
 5: variables
 6:   C1, C2: P Instance
 7:   LS: P LinkSpecification
 8:   SO: P CASameAsCondition
 9:
10: C1 := expand(i1)
11: C2 := expand(i2)
12: LS := generateLinkSpecifications(C1, C2)
13: SO := assignOverlapFactor(LS, C1, C2, i1, i2)
14: cals := createCALinkSpecification(SO, i1, i2)</preformat>
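        <p>The expansion step (expand in Algorithm 1) is not detailed here. One possible reading, given as a hedged sketch, is a breadth-first traversal that follows object properties for as many hops as needed. The triple representation, the traversal direction (subject to object only) and the decision to leave the starting Instance out of the returned set are our assumptions; the object properties dblp:writes, nsf:leads and nsf:supports and the Instance identifiers are those of the running example.</p>
        <preformat>
```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, object property, object)


def expand(start: str, triples: List[Triple]) -> Set[str]:
    """Collect every Instance reachable from start through object properties."""
    neighbours: Dict[str, List[str]] = {}
    for subject, _prop, obj in triples:
        neighbours.setdefault(subject, []).append(obj)

    context: Set[str] = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, []):
            # Visit each Instance once; the start itself is not added.
            if nxt != start and nxt not in context:
                context.add(nxt)
                queue.append(nxt)
    return context


triples = [
    ("dblpU:Wang0011:Wei", "dblp:writes", "dblpU:conf/vldb/JiangWL03"),
    ("dblpU:Wang0011:Wei", "dblp:writes", "dblpU:conf/vldb/YuanLLWYZ05"),
    ("nsfU:WeiWang0007", "nsf:leads", "nsfU:AN-0423336"),
    ("nsfU:AN-0423336", "nsf:supports", "nsfU:AN-0423336/#13"),
]

c1 = expand("dblpU:Wang0011:Wei", triples)
c2 = expand("nsfU:WeiWang0007", triples)
print(sorted(c1))
print(sorted(c2))
```
        </preformat>
        <p>In this sketch the paper nsfU:AN-0423336/#13 is two hops away from the researcher and is still part of the context, which matches the remark that more distant Instances are also retrieved.</p>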
        <p>The algorithm assigns an overlap factor to each link
specification. In addition, we store the set of object properties
that connect the class of the input Instance with the class
of the Instances covered by the link specification (line 13 in
Algorithm 1 and Algorithm 2). Finally, it combines the results
obtained in the previous step using different aggregation
functions, creating the context-aware link specification
(line 14 in Algorithm 1).</p>
        <preformat>Algorithm 2 assignOverlapFactor
 1: input
 2:   LS: P LinkSpecification
 3:   i1, i2: Instance
 4:   C1, C2: P Instance
 5: output
 6:   SO: P CASameAsCondition
 7: variables
 8:   ls: LinkSpecification
 9:   o1, o2: Double
10:   oF: OverlapFactor
11:   opsrc, optrg: P ObjectLeafNode
12:
13: SO := {}
14: oF := {}
15: for each ls in LS
16:   (o1, o2) := measureOverlap(ls, C1, C2)
17:   opsrc := objectPropertiesPath(ls, C1, i1)
18:   optrg := objectPropertiesPath(ls, C2, i2)
19:   if o1 = 1.0 and o2 = 1.0 then
20:     oF := {for all}
21:   else if o1 &gt; 0.0 and o2 &gt; 0.0 then
22:     oF := {exists}
23:   if oF = {for all} or oF = {exists} then
24:     SO := SO U createCASameAsCond(oF, ls, opsrc, optrg)</preformat>
        <p>Algorithm 2 receives as input a set of link specifications,
two Instances to be linked, and two sets of Instances that
are the contexts; its output is a set of CASameAsConditions.
The algorithm iterates over the link specifications and, for
each of them (line 15 in Algorithm 2), it first applies the
current link specification to the Instances, and then
measures the ratio of Instances linked by an owl:sameAs
generated with the current link specification (line 16 in
Algorithm 2). In our technique, if all the Instances in both
contexts covered by the link specification are linked by
owl:sameAs, then a for all overlap factor is assigned to the
current link specification (lines 19-20 in Algorithm 2). If
at least one owl:sameAs is generated between the Instances in
the context covered by the current link specification, then
an exists overlap factor is assigned to it (lines 21-22 in
Algorithm 2); otherwise, the link specification is discarded.
In Figure 1(b), the algorithm assigns for all to LSAR, since
there is only one pair of Instances covered by it and both
fulfill the restrictions of LSAR. The algorithm assigns exists
to LSAP since only one pair of dblp:Article and nsf:Paper
Instances fulfills the conditions of LSAP. Additionally, the
algorithm stores the sets of object properties that connect
the class of the input Instance with the class of the Instances
covered by the link specification (lines 17-18 in Algorithm 2).
In Figure 1(b), the algorithm relates LSAR with an empty set of
object properties because the Instances covered by it are the
same as the input Instances. Then the algorithm assigns to LSAP
two sets of object properties, {dblp:writes} and {nsf:leads,
nsf:supports}, that connect the classes of the main Instances,
dblp:Author and nsf:Researcher, with the classes of the
Instances covered by LSAP, dblp:Article and nsf:Paper. Finally,
for each link specification, its related overlap factor, and
its sets of object properties, the algorithm creates a
CASameAsCondition (line 24 in Algorithm 2). Every
CASameAsCondition is added to a set, which is the output of the
algorithm when there are no more link specifications to process.</p>
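        <p>The overlap-factor assignment described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' Java implementation: a link specification is modelled as a boolean function over two string literals, each context is already restricted to the Instances covered by that link specification, the object property paths are omitted, and the names measure_overlap, overlap_factor, assign_overlap_factor and the third specification LS_AW are hypothetical.</p>
        <preformat><![CDATA[
```python
from itertools import product


def measure_overlap(link_spec, ctx_src, ctx_trg):
    """Return (o1, o2): the fraction of source and of target context
    instances that take part in at least one link produced by link_spec."""
    links = [(a, b) for a, b in product(ctx_src, ctx_trg) if link_spec(a, b)]
    linked_src = {a for a, _ in links}
    linked_trg = {b for _, b in links}
    o1 = len(linked_src) / len(ctx_src) if ctx_src else 0.0
    o2 = len(linked_trg) / len(ctx_trg) if ctx_trg else 0.0
    return o1, o2


def overlap_factor(o1, o2):
    """Map the two ratios to 'for all', 'exists', or None (discard)."""
    if o1 == 1.0 and o2 == 1.0:
        return "for all"
    if o1 > 0.0 and o2 > 0.0:
        return "exists"
    return None


def assign_overlap_factor(link_specs, contexts):
    """Keep only the link specifications that receive an overlap factor."""
    conditions = []
    for name, spec in link_specs.items():
        ctx_src, ctx_trg = contexts[name]
        factor = overlap_factor(*measure_overlap(spec, ctx_src, ctx_trg))
        if factor is not None:
            conditions.append((factor, name))
    return conditions


# Toy data (made up): exact, case-insensitive matching stands in for a
# real string similarity with an acceptance threshold.
same = lambda a, b: a.lower() == b.lower()
link_specs = {"LS_AR": same, "LS_AP": same, "LS_AW": same}
contexts = {
    "LS_AR": (["Wei Wang"], ["Wei Wang"]),
    "LS_AP": (["Title A", "Title B"], ["Title A", "Title C", "Title D"]),
    "LS_AW": (["Title A"], ["Award X"]),
}
conditions = assign_overlap_factor(link_specs, contexts)
print(conditions)
```
]]></preformat>
        <p>In this toy input, the specification over names links every Instance on both sides, the one over titles links only one pair, and the hypothetical LS_AW links nothing and is therefore discarded.</p>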
        <p>Algorithm 3 createCALinkSpecification
1: input
2: i1, i2 : Instance
3: SO: 𝒫 CASameAsCondition
4: output
5: cals: CALinkSpecification
6: variables
7: classsrc, classtrg: 𝒫 Class
8: aggrAND: ConditionComposite
9:
10: aggrAND ← combineWithAndAggregations(SO)
11: classsrc ← extractRDFClass(i1)
12: classtrg ← extractRDFClass(i2)
13: cals ← createCALS(aggrAND, classsrc, classtrg)</p>
        <p>Algorithm 3 receives as input a set of
CASameAsConditions and the input Instances; its output
is a CALinkSpecification. The algorithm starts by combining
all the CASameAsConditions of the input set with
and aggregations (line 10 in Algorithm 3). Then, the
algorithm extracts the class of each input Instance (lines 11-12
in Algorithm 3) and creates a CALinkSpecification (line 13
in Algorithm 3). In Figure 1(b), the classes of the input
Instances are dblp:Author for Wang0011:Wei and nsf:Researcher
for WeiWang0007; the final context-aware link specification
with the aggregation functions is depicted in this figure. It
links dblp:Author and nsf:Researcher Instances if they have
similar names (for all LSAR) and some of their publications
have similar titles (exists LSAP).</p>
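        <p>To make the effect of the and aggregation concrete, the following Python sketch evaluates a context-aware link specification on a pair of Instances. It is a simplified illustration under assumptions: Instances are plain dictionaries that use the same keys on both sides (name, titles) instead of dataset-specific properties reached through object property paths, difflib's ratio is used as a stand-in for the Jaro comparison, and the function names similar, holds and make_cals as well as the titles are hypothetical.</p>
        <preformat><![CDATA[
```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """Stand-in string comparison (not Jaro): difflib ratio vs. a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def holds(factor, src_values, trg_values):
    """Check one condition between the values reached from each Instance."""
    if not src_values or not trg_values:
        return False
    if factor == "exists":
        # At least one pair of values must be similar.
        return any(similar(s, t) for s in src_values for t in trg_values)
    # "for all": every value on each side needs a similar counterpart.
    return (all(any(similar(s, t) for t in trg_values) for s in src_values)
            and all(any(similar(s, t) for s in src_values) for t in trg_values))


def make_cals(conditions):
    """Combine (overlap factor, property) conditions with an AND aggregation."""
    def cals(src, trg):
        return all(holds(factor, src[prop], trg[prop]) for factor, prop in conditions)
    return cals


# Made-up Instances: same name everywhere, but only researcher_a shares a title.
cals = make_cals([("for all", "name"), ("exists", "titles")])
author = {"name": ["Wei Wang"], "titles": ["Mining graphs", "Indexing sequences"]}
researcher_a = {"name": ["Wei Wang"], "titles": ["Mining graphs", "Cloud storage"]}
researcher_b = {"name": ["Wei Wang"], "titles": ["Protein folding"]}
print(cals(author, researcher_a), cals(author, researcher_b))
```
]]></preformat>
        <p>A regular link specification over names alone would link the author to both researchers; the exists condition over titles is what separates the two homonyms in this toy example.</p>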
      </sec>
      <sec id="sec-6-2">
        <title>LS: Jaro(dblp:name, nsf:name) ≥ threshold. CALS: for all LS and exists Jaro(dblp:title, nsf:title) ≥ threshold.</title>
        <p>(Figure plots omitted; horizontal axis: Threshold.)</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>EVALUATION</title>
      <p>
        We use two scenarios to study the effectiveness of link
specifications and context-aware link specifications. Both
scenarios were built with real data about researchers who
published in PVLDB in 2013, extracted from DBLP. Both
represent real-world situations in which taking the context
into account is crucial to perform the link discovery task
well. For each scenario, we ran two evaluations. In the first
one, an expert defined a link specification, to the best of
their knowledge, and then the same expert defined a
context-aware link specification. Since link specifications
are very sensitive to their acceptance threshold, for each
defined specification we varied the acceptance threshold of
its string similarity from 0.70 to 1.00 and analyzed for which
values the best effectiveness was achieved. In the second
evaluation, we used GenLink
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to generate link specifications between the same classes
as in the previous experiments; the goal of this evaluation is
to analyze the impact of adding context to a regular link
specification generated by an existing technique.
      </p>
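      <p>The threshold analysis described above amounts to computing precision, recall, and F-Measure against the gold standard for each candidate acceptance threshold. The Python sketch below illustrates that procedure on made-up similarity scores; the step of 0.05, the function names evaluate and sweep, and the toy pairs are assumptions for illustration and are not taken from our scripts.</p>
      <preformat><![CDATA[
```python
def evaluate(scores, gold, threshold):
    """Precision, recall and F-Measure of the links accepted at a threshold."""
    predicted = {pair for pair, sim in scores.items() if sim >= threshold}
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def sweep(scores, gold, start=70, stop=100, step=5):
    """Evaluate thresholds start/100 .. stop/100 (integer steps avoid drift)."""
    return {t / 100: evaluate(scores, gold, t / 100)
            for t in range(start, stop + 1, step)}


# Made-up candidate pairs with their similarity, and a made-up gold standard.
scores = {("a1", "r1"): 1.0, ("a2", "r2"): 1.0,
          ("a1", "r3"): 0.80, ("a3", "r4"): 0.72}
gold = {("a1", "r1"), ("a2", "r2")}

results = sweep(scores, gold)
best = max(results, key=lambda t: results[t][2])  # first threshold with top F
for threshold, (p, r, f) in results.items():
    print(f"{threshold:.2f}  P={p:.2f}  R={r:.2f}  F={f:.2f}")
```
]]></preformat>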
      <p>
        We make our data, algorithms, and scripts publicly
available [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Therefore, our results can be reproduced and tested
by third parties and researchers can extend our results to
cope with future requirements.
      </p>
      <p>We implemented our technique in Java 1.8 with Jena
3.0.0. Our experiments were run on a computer
equipped with an Intel Core i7 2.8 GHz CPU and 16 GB of
RAM, running Mac OS 10.9.5 (64-bit).</p>
      <p>In Section 5.1, we present our first scenario, DBLP-NSF:
we describe its characteristics, the relationships between the
datasets, and how we built them. Section 5.2 follows the
same structure to present our second scenario,
DBLP-DBLP.</p>
    </sec>
    <sec id="sec-8">
      <title>DBLP-NSF scenario</title>
      <p>In this scenario, we have 188 owl:sameAs links between
dblp:Author and nsf:Researcher instances, which we consider our
gold standard. All of them relate authors and researchers
with the same name and publications in common. Across
the datasets, there are also 57 pairs of dblp:Author and
nsf:Researcher instances that have the same name but are
different people; therefore, taking some context information
into account, such as their publications, is crucial to perform
a suitable link discovery task.</p>
      <p>To build this scenario, we first extracted from DBLP all
the articles and authors published in PVLDB in 2013.
Then, we looked up the authors' names in the NSF portal
and extracted all their related information. Finally, we
created two RDF datasets, whose data models are depicted
in Figures 1(b) and 2(b), respectively. The resulting DBLP
dataset comprises 764 instances of dblp:Author and 47,225
instances of dblp:Article. The resulting NSF dataset comprises
235 instances of nsf:Researcher, 235 instances of nsf:Award,
and 6,877 instances of nsf:Paper. Since NSF has information
about different disciplines, this dataset contains several
researchers who have the same name but are different people.
For example, in Figure 1, instances WeiWang0012 and
WeiWang0007 have the same literal for nsf:name, although they
describe different researchers. In the whole dataset,
only 74 instances of nsf:Researcher have different literals for
nsf:name.</p>
      <p>Figure 4(a) depicts the effectiveness obtained with the link
specification provided by the expert, a Jaro comparison over
dblp:name and nsf:name, for acceptance thresholds from 0.7
to 1.0. Figure 4(b) depicts the effectiveness obtained using
the context-aware link specification that extends the previous
link specification by adding a link specification composed of
a Jaro comparison over dblp:title and nsf:title, and that uses
the best threshold for the dblp:Author and nsf:Researcher link
specification. The overlap factor for the link specification
between the publications is exists, and for the link
specification between persons it is for all.</p>
      <p>The results in Figure 4(a) show that the effectiveness of
the link specification improves as the acceptance threshold
increases. Although it never reaches a precision or
F-Measure of 1.0, it always obtains a recall of 1.00.
Recall never changes because every dblp:Author that should
be linked with an nsf:Researcher has exactly the same name;
hence, if the threshold is low, the string metric generates
false positives but always recognizes pairs of instances with
the same name (covering all the correct links). If the
threshold is high, precision improves because these false
positives are pruned.</p>
      <p>In Figure 4(b), the context-aware link specification
obtains a precision that improves as the acceptance threshold
increases; however, recall decreases for acceptance thresholds
higher than 0.83. The context-aware link
specification reaches 1.00 in precision and recall for
thresholds in the range 0.80-0.83. Recall drops when the
threshold is higher because this time we compare not only the
names of the authors but also the titles of their publications,
which are written slightly differently, e.g., "SmartSaver turning
flash drive into a disk energy saver for mobile computers" and
"SmartSaver: turning flash drive into a disk energy saver
for mobilecomputers". As a result, an exact string matching
would not recognize them as the same. In contrast, precision
improves when the threshold is higher: the context-aware link
specification mainly generates false negatives, but the
instances it links are always correct.</p>
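      <p>The effect of such small differences between titles can be checked with a textbook implementation of the Jaro similarity, sketched below in Python. This is a generic version of the metric written for illustration, not the comparator used in our Java prototype, and its scores may differ from those of a particular library; the two titles are the ones quoted above.</p>
      <preformat><![CDATA[
```python
def jaro(s, t):
    """Textbook Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(ls, lt) // 2 - 1, 0)
    s_matched = [False] * ls
    t_matched = [False] * lt
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(lt, i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / ls + matches / lt
            + (matches - transpositions) / matches) / 3


title_dblp = ("SmartSaver turning flash drive into a disk energy saver "
              "for mobile computers")
title_nsf = ("SmartSaver: turning flash drive into a disk energy saver "
             "for mobilecomputers")

score = jaro(title_dblp, title_nsf)
linked_at_exact_match = score >= 1.0  # an acceptance threshold of 1.00
print(round(score, 3), linked_at_exact_match)
```
]]></preformat>
      <p>Because the two literals are not identical, their similarity is strictly below 1.0, so an acceptance threshold of 1.00 rejects the pair and the correct link between the corresponding authors is lost.</p>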
      <sec id="sec-8-1">
        <title>LS for DBLP-NSF</title>
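        <table-wrap>
          <table>
            <thead>
              <tr><th>LS</th><th>P</th></tr>
            </thead>
            <tbody>
              <tr><td>LSN1</td><td>0.76</td></tr>
              <tr><td>LSN5</td><td>0.76</td></tr>
              <tr><td>LSN10</td><td>0.76</td></tr>
            </tbody>
          </table>
        </table-wrap>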
      </sec>
      <sec id="sec-8-2">
        <title>CALS for DBLP-NSF</title>
      </sec>
      <sec id="sec-8-3">
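        <title>LS and their overlap factors</title>
        <table-wrap>
          <table>
            <thead>
              <tr><th>LS and their overlap factors</th><th>P</th></tr>
            </thead>
            <tbody>
              <tr><td>for all LSN1 and exists LST1</td><td>0.94</td></tr>
              <tr><td>for all LSN5 and exists LST5</td><td>0.97</td></tr>
              <tr><td>for all LSN10 and exists LST10</td><td>1.00</td></tr>
            </tbody>
          </table>
        </table-wrap>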
      </sec>
      <sec id="sec-8-4">
        <title>CALS Best improvement 0.24 R 1.00</title>
        <p>amples, Genlink generated LSN1 and LSN5; both have a
Jaccard distance 0.37 over the literals of the data properties
dblp:name and nsf:name. Using 10 examples, GenLink
generated LSN10, which has a Jaccard distance 0.21 for the
same data properties. On the other hand, the link speci
cations between dblp:Article and nsf:Paper using 1 example
was LST1, it has a Levenshtein distance 29.48 over
dblp:title and nsf:title, using 5 examples GenLink generated
LST5, which has a Jaccard distance 0.59 over the same
data properties and, nally, using 10 examples GenLink
generated LST10, it has a Levenshtein distance 7.05 over the
same data properties.</p>
        <p>We analyzed the effectiveness of the dblp:Author and
nsf:Researcher link specification, and of the context-aware link
specification for the same classes. The results in Table 3 show
that, when we took context into account, precision improved
by 0.18 (1 example), 0.21 (5 examples), and 0.24 (10 examples).
However, recall dropped by 0.05 in the context-aware link
specification generated from 10 examples because the acceptance
threshold was restrictive enough to miss titles written
slightly differently, as explained above.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>DBLP-DBLP scenario</title>
      <p>This scenario has 62 owl:sameAs links between the source
and target datasets. Both contain dblp:Author instances
with similar names and aliases, which are different enough
to produce false positives using comparators with low
acceptance thresholds, and false negatives with high
acceptance thresholds. This scenario was built using the same
authors and publications as in the previous scenario. We took
the whole list filtered by authors with aliases (like "H. V.
Jagadish" and "Hosagrahar Visvesvaraya Jagadish"); then,
we split the instances into two datasets, each of which was
obtained by using a different alias for the same person. The
data model of the source and target datasets is depicted in
Figure 2(a). Both datasets contain 58 dblp:Author instances
and their publications, which amount to 5,284 dblp:Article
instances in total. We conducted experiments in this scenario
similar to the previous ones.</p>
      <p>Figure 5(a) shows the results obtained for the link
specification given by the expert, which relates the dblp:name
of dblp:Author instances from the source and target datasets
by means of a Jaro distance; we report precision, recall, and
F-Measure for each acceptance threshold value. Figure 5(b)
shows the results for a context-aware link specification that
uses a link specification which, by means of a Jaro distance,
relates the dblp:title of the source and target dblp:Article
instances. The overlap factor for this link specification is
for all.</p>
      <p>The results in Figure 5(a) show that the best F-Measure
values are obtained for thresholds between 0.72 and
0.77; however, the F-Measure never reaches 1.00. For
higher thresholds, recall decreases while precision increases;
this tendency is inverted for lower thresholds. Due to
authors' aliases, recall behaves in the same way as in Figure 4(b)
with publication titles: if the threshold is higher, the link
specification fails to recognize some aliases as the same
person, e.g., "H.V. Jagadish" and "Hosagrahar V. Jagadish".
In contrast, when the threshold is higher, precision improves
because the linked instances have more similar names.</p>
      <p>Figure 5(b) shows that the context-aware link
specifications always obtain a precision of 1.00, while recall
and F-Measure increase when the threshold is higher, reaching
1.00. This situation mirrors Figure 4(a): the titles of the
publications in each dataset have exactly the same literals;
therefore, recall improves as the threshold increases. Precision
is always 1.00 because the CALS of this example only links
two instances of dblp:Author if all their publications are
exactly the same, due to the for all restriction. If just one
publication is not linked, then their authors are not
linked either; therefore, if a link is generated, it is always
correct.</p>
      <sec id="sec-9-1">
        <title>LS for DBLP-DBLP</title>
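        <table-wrap>
          <table>
            <thead>
              <tr><th>LS</th><th>P</th></tr>
            </thead>
            <tbody>
              <tr><td>LSN1</td><td>1.00</td></tr>
              <tr><td>LSN5</td><td>1.00</td></tr>
              <tr><td>LSN10</td><td>1.00</td></tr>
            </tbody>
          </table>
        </table-wrap>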
      </sec>
      <sec id="sec-9-2">
        <title>CALS for DBLP-DBLP</title>
      </sec>
      <sec id="sec-9-3">
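        <title>LS and their overlap factors</title>
        <table-wrap>
          <table>
            <thead>
              <tr><th>LS and their overlap factors</th><th>P</th></tr>
            </thead>
            <tbody>
              <tr><td>for all LST1</td><td>1.00</td></tr>
              <tr><td>for all LST5</td><td>1.00</td></tr>
              <tr><td>for all LST10</td><td>1.00</td></tr>
            </tbody>
          </table>
        </table-wrap>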
      </sec>
      <sec id="sec-9-4">
        <title>CALS Best improvement R 0.26</title>
        <p>[…] examples it generated LST10, which relates the different dblp:title
literals by means of a Levenshtein distance of 1.76.</p>
        <p>We analyzed the effectiveness for the source and target
dblp:Author instances using the link specification, and then
the context-aware link specification. The results in Table
4 show that, when we take context into account, precision
does not change but recall improves by 0.58 (1 example),
0.54 (5 examples), and 0.58 (10 examples), which entails an
improvement in the F-Measure of 0.46 (1 example), 0.45 (5
examples), and 0.46 (10 examples).</p>
        <p>CONCLUSION AND FUTURE WORK</p>
        <p>In the literature, there are several techniques that
generate link specifications to perform a link discovery task;
however, none of them is able to exploit context
information. In this paper, we present a proposal to extend the
definition of link specification by means of the concept of
overlap factor, which lets us exploit context information and
define context-aware link specifications. Additionally, we
have identified two real-world scenarios where the context
is crucial and where current techniques are not able to
obtain the best effectiveness without taking the context into
account.</p>
        <p>Our experimental results show that context-aware link
specifications obtain better effectiveness than
regular link specifications in our scenarios. We
obtained improvements of 23% in precision and 58% in
recall.</p>
        <p>In future work, we plan to develop a technique to
navigate through the context information of instances without
using all of their object properties, selecting only those that
are most suitable to build effective context-aware link
specifications. Additionally, we plan to add more metrics to
calculate the overlap factor, extending our current for all and
exists restrictions. Finally, this paper focuses on
generating owl:sameAs links, but an interesting extension of our
work is the automatic generation of other kinds of links,
depending on the results of the overlap factor.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgements</title>
      <p>Supported by the Spanish R&amp;D&amp;I program under grant
TIN2013-40848-R.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked Data: Principles and state of the art</article-title>
          .
          <source>In WWW</source>
          , pages
          <volume>1</volume>
          –
          <fpage>40</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked Data-the story so far</article-title>
          .
          <source>Int. J. Semantic Web Inf. Syst</source>
          .
          <volume>5</volume>
          (
          <issue>3</issue>
          ), pages
          <fpage>205</fpage>
          –
          <fpage>227</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Laender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Goncalves</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. S.</given-names>
            <surname>da Silva</surname>
          </string-name>
          .
          <article-title>Replica identification using genetic programming</article-title>
          .
          <source>In SAC</source>
          , pages
          <year>1801</year>
          –
          <year>1806</year>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimmino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Rivero</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          .
          <article-title>Research prototype, repositories and experimental results</article-title>
          . URL http://www.tdg-seville.info/acimmino/Cals,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>M. G. de Carvalho</surname>
            ,
            <given-names>A. H. F.</given-names>
          </string-name>
          <string-name>
            <surname>Laender</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Goncalves</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. S. da</given-names>
            <surname>Silva</surname>
          </string-name>
          .
          <article-title>A genetic programming approach to record deduplication</article-title>
          .
          <source>IEEE Trans. Knowl</source>
          . Data Eng.,
          <volume>24</volume>
          (
          <issue>3</issue>
          ):
          <volume>399</volume>
          –
          <fpage>412</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ermilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>Linked Open Data Statistics: Collection and Exploitation</article-title>
          .
          <source>In KESW</source>
          , pages
          <volume>242</volume>
          –
          <fpage>249</fpage>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Halpin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCusker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and H. S.</given-names>
            <surname>Thompson</surname>
          </string-name>
          .
          <article-title>When owl:sameAs isn't the same: An analysis of identity in Linked Data</article-title>
          .
          <source>In ISWC</source>
          , pages
          <volume>305</volume>
          –
          <fpage>320</fpage>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Yeganeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Ho</surname>
          </string-name>
          .
          <article-title>Discovering linkage points over web data</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>6</volume>
          (
          <issue>6</issue>
          ):
          <volume>444</volume>
          –
          <fpage>456</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Linked Data: Evolving the Web into a Global Data Space</article-title>
          . Morgan &amp; Claypool Publishers,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Holub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Proksa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Bielikova</surname>
          </string-name>
          .
          <article-title>Detecting identical entities in the Semantic Web Data</article-title>
          .
          <source>In SOFSEM</source>
          , pages
          <volume>519</volume>
          –
          <fpage>530</fpage>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Learning expressive linkage rules using genetic programming</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>5</volume>
          (
          <issue>11</issue>
          ):
          <volume>1638</volume>
          –
          <fpage>1649</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Active learning of expressive linkage rules using genetic programming</article-title>
          .
          <source>J. Web Sem</source>
          .,
          <volume>23</volume>
          :2–
          <fpage>15</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Efficient multidimensional blocking for link discovery without losing recall</article-title>
          .
          <source>In ACM SIGMOD workshops</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nentwig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hartung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>A survey of current Link Discovery frameworks</article-title>
          .
          <source>Web Sem. J., pages</source>
          <volume>1</volume>
          –
          <fpage>18</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>LIMES: A time-efficient approach for large-scale Link Discovery on the Web of Data</article-title>
          .
          <source>In IJCAI</source>
          , pages
          <volume>2312</volume>
          –
          <fpage>2317</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>A.-C. N. Ngomo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Auer</surname>
            , and
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Höffner</surname>
          </string-name>
          .
          <article-title>RAVEN - Active learning of link specifications</article-title>
          .
          <source>In ISWC workshops</source>
          , pages
          <volume>25</volume>
          –
          <fpage>37</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Lyko</surname>
          </string-name>
          .
          <article-title>EAGLE: Efficient active learning of link specifications using genetic programming</article-title>
          .
          <source>In ESWC</source>
          , pages
          <volume>149</volume>
          –
          <fpage>163</fpage>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Lyko</surname>
          </string-name>
          .
          <article-title>Unsupervised learning of link specifications: deterministic vs. non-deterministic</article-title>
          .
          <source>In ISWC workshops</source>
          , pages
          <volume>25</volume>
          –
          <fpage>36</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>d'Aquin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          .
          <article-title>Unsupervised learning of Link Discovery configuration</article-title>
          .
          In <source>ESWC</source>
          , pages
          <fpage>119</fpage>
          –
          <lpage>133</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Rivero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Corchuelo</surname>
          </string-name>
          .
          <article-title>Exchanging data amongst linked data applications</article-title>
          .
          <source>Knowl. Inf. Syst.</source>
          ,
          <volume>37</volume>
          (
          <issue>3</issue>
          ):
          <fpage>693</fpage>
          –
          <lpage>729</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Heflin</surname>
          </string-name>
          .
          <article-title>Automatically generating data linkages using a domain-independent candidate selection approach</article-title>
          .
          In <source>ISWC</source>
          , pages
          <fpage>649</fpage>
          –
          <lpage>664</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Soru</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          .
          <article-title>A comparison of supervised learning classifiers for Link Discovery</article-title>
          .
          In <source>SEM</source>
          , pages
          <fpage>41</fpage>
          –
          <lpage>44</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Senellart</surname>
          </string-name>
          .
          <article-title>PARIS: Probabilistic alignment of relations, instances, and schema</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>157</fpage>
          –
          <lpage>168</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>