<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LIGON – Link Discovery with Noisy Oracles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohamed Ahmed Sherif</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kevin Dreßler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axel-Cyrille Ngonga Ngomo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Leipzig</institution>
          ,
          <addr-line>04109 Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Paderborn University, Data Science Group</institution>
          ,
          <addr-line>Technologiepark 6, 33100 Paderborn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Link discovery plays a key role in the integration and use of data across RDF knowledge graphs. Active learning approaches are a common family of solutions to the problem of learning how to compute links from users. So far, only active learning from perfect oracles has been considered in the literature. However, real oracles are often far from perfect (e.g., in crowdsourcing). We hence study the problem of learning how to compute links across knowledge graphs from noisy oracles, i.e., oracles that are not guaranteed to return correct classification results. We present a novel approach for link discovery based on a probabilistic model, with which we estimate the joint odds of the oracles' guesses. We combine this approach with an iterative learning approach based on refinements. The resulting method, Ligon, is evaluated on 11 benchmark datasets. Our results suggest that Ligon achieves more than 95% of the F-measure achieved by state-of-the-art algorithms trained with a perfect oracle.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The provision of links between knowledge graphs in RDF is of central
importance for numerous tasks on the Semantic Web, including federated queries,
question answering and data fusion. While links can be created manually for
small knowledge bases, the sheer size and number of knowledge bases commonly
used in modern applications (e.g., DBpedia with more than 3 × 10⁶ resources)
demand the use of automated link discovery mechanisms. In this work, we focus
on active learning for link discovery. State-of-the-art approaches that rely on
active learning [
        <xref ref-type="bibr" rid="ref12 ref3 ref8">3, 12, 8</xref>
        ] assume that the oracle they rely upon is perfect. Formally,
this means that given an oracle ω, the probability of the oracle returning a wrong
result (i.e., returning false when an example is to be classified as true) is
exactly 0. While these approaches show pertinent results in evaluation scenarios,
within which the need for a perfect oracle can be fulfilled, this need is difficult if
      </p>
    </sec>
    <sec id="sec-2">
      <title>Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</title>
      <sec id="sec-2-1">
        <title>3 See https://www.w3.org/RDF/.</title>
        <p>not impossible to uphold in real-world settings (e.g., when crowdsourcing
training data). No previous work has addressed link discovery based on oracles that
are not perfect.</p>
        <p>
          We address this research gap by presenting a novel approach for learning
link specifications (LS) from noisy oracles, i.e., oracles that are not guaranteed
to return correct classifications. This approach is motivated by the problem of
learning LS using crowdsourcing. Previous works have shown that agents in real
crowdsourcing scenarios are often not fully reliable (e.g., [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]). We model these
agents as noisy oracles, which provide erroneous answers to questions with a
fixed probability. We address the problem of learning from such oracles by using
a probabilistic model, which approximates the odds of the answer of a set of
oracles being correct. Our approach, dubbed Ligon, assumes that the underlying
oracles are independent, i.e., that the probability distributions underlying the oracles
are pairwise independent. Moreover, we assume that the oracles have a static
behavior, i.e., that the probability of them generating correct/incorrect answers
is constant over time.
        </p>
        <p>The contributions of this paper are as follows: (1) We present a formalization
of the problem of learning LS from noisy oracles and derive a probabilistic model
for learning from such oracles. (2) We develop the first learning algorithm
dedicated to learning LS from noisy data. The approach combines iterative operators
for LS with an entropy-based approach for selecting the most informative training
examples. In addition, it uses cumulative evidence to approximate the
probability distribution underlying the noisy oracles that provide it with training data.
Finally, (3) we present a thorough evaluation of Ligon and show that it is
robust against noise, scales well and converges within 10 learning iterations to more
than 95% of the average F-measure achieved by Wombat, a state-of-the-art
approach for learning LS, provided with a perfect oracle.</p>
        <sec id="sec-2-1-1">
          <title>Preliminaries</title>
          <p>Knowledge graphs (also called knowledge bases) in RDF are defined as sets of
triples K ⊆ (R ∪ B) × P × (R ∪ B ∪ L), where R is the set of all resources, i.e., of
all objects in the domain of discourse (e.g., persons and publications); P ⊆ R is
the set of all predicates, i.e., of binary relations (e.g., author); B is the set of all
blank nodes, which basically stand for resources whose existence is known but
whose identity is not relevant to the model; and L is the set of all literals, i.e., of
values associated with datatypes (e.g., integers). The elements of K are referred
to as facts or triples. We call the elements of R entities or resources.</p>
          <p>The link discovery task on RDF knowledge graphs is defined as follows: Let
S and T be two sets of resources, i.e., S ⊆ R and T ⊆ R. Moreover, let r ∈ P be
a predicate. The aim of link discovery is to compute the set M = {(s, t) ∈ S × T :
r(s, t)}. We call M a mapping. In many cases, M cannot be computed directly
and is thus approximated by a mapping M′. To find the set M′, declarative</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4 See https://www.w3.org/RDF/ for more details.</title>
        <p>
          link discovery frameworks rely on link specifications (LS), which describe the
conditions under which r(s, t) can be assumed to hold for a pair (s, t) ∈ S × T.
Several formal models have been used for describing LS in previous works [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
We adopt a formal approach derived from [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] and first describe the syntax and
then the semantics of LS.
        </p>
        <p>LS consist of two types of atomic components: similarity measures m, which
allow comparing the property values of input resources, and operators op, which
can be used to combine LS into more complex LS. Without loss of generality, we
define a similarity measure m as a function m : S × P × T × P → [0, 1]. An
example of a similarity measure is the edit similarity dubbed edit, which allows
computing the similarity of a pair (s, t) ∈ S × T w.r.t. the values of a pair of
properties (p_s, p_t) for s resp. t. An atomic LS is a pair (m, θ), where θ is a
similarity threshold. A complex LS is the result of combining two LS L_1 and L_2
through an operator that allows merging the results of L_1 and L_2. Here, we use
the operators ⊓ (intersection), ⊔ (union) and \ (difference), as they
are complete w.r.t. the Boolean algebra and frequently used to define LS. An
example of a complex LS is given in Figure 1.</p>
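The atomic and complex LS described above can be sketched in a few lines of Python (a toy sketch under our own naming; no link discovery framework is assumed). An atomic LS selects the candidate pairs whose similarity reaches the threshold, and the operators act as plain set operations on the resulting mappings:

```python
def lev(s, t):
    # Wagner-Fischer dynamic program for the Levenshtein distance.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def edit_sim(s, t):
    # Edit similarity as defined in the footnote: the inverse of (1 + lev(s, t)).
    return 1.0 / (1.0 + lev(s, t))

def atomic_ls(pairs, m, theta):
    # Mapping of an atomic LS (m, theta): all pairs whose similarity is at least theta.
    return {(s, t) for (s, t) in pairs if m(s, t) >= theta}

# The operators on mappings correspond to plain set operations: intersection
# for the conjunction, union for the disjunction, set difference for the rest.
pairs = {("Berlin", "Berlin"), ("Berlin", "Bern")}
M1 = atomic_ls(pairs, edit_sim, 0.9)
M2 = atomic_ls(pairs, edit_sim, 0.3)
conjunction = M1.intersection(M2)
```

The sketch only illustrates the semantics of a single measure; real frameworks evaluate similarity measures over property values of the resources rather than over the resources themselves.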
        <p>We define the semantics [[L]] of a LS L w.r.t. a mapping as given in
Table 1. The mapping [[L]] of a LS L w.r.t. S × T contains the link candidates
generated by L. A LS L is subsumed by L′, denoted L ⊑ L′, if for all mappings
we have [[L]] ⊆ [[L′]]. Two LS are equivalent, denoted L ≡ L′, iff L ⊑ L′
and L′ ⊑ L. Subsumption (⊑) is a partial order over the set of LS, denoted L.</p>
        <sec id="sec-2-2-1">
          <title>Noisy Oracles</title>
          <p>We model oracles for r as black boxes with a characteristic function ω :
S × T → {true, false}. The characteristic function ω_i of the oracle i returns
true iff the oracle i assumes that r(s, t) holds. Otherwise, it returns false.
For ease of notation, we define LC = S × T and call the elements of LC link
candidates. For l ∈ LC, we write l → ⊤ to signify that r(l) holds, i.e., that r(s, t)
is true for l = (s, t). Otherwise, we write l → ⊥. We now assume a learning
situation typical for crowdsourcing, where n oracles are presented with a link
candidate l and asked whether l → ⊤ holds. We can describe each oracle i by
the following four probabilities: (1) p(ω_i(l) = true | l → ⊤), i.e., the probability
of the oracle i generating true positives. This value is exactly 1 for a perfect
oracle. (2) p(ω_i(l) = false | l → ⊤), the probability of false negatives (0 for a
perfect oracle). (3) p(ω_i(l) = true | l → ⊥), i.e., the probability of false positives
(0 for a perfect oracle), and (4) p(ω_i(l) = false | l → ⊥), the probability of true
negatives (1 for a perfect oracle). Given that p(A|B) + p(¬A|B) = 1, the sum of
the first two probabilities is always 1, as is the sum of the last two.
(Footnote 5: We define the edit similarity of two strings s and t as the inverse of
(1 + lev(s, t)), where lev is the Levenshtein distance.)</p>
          <p>Example 1. A noisy oracle can have the following description:
p(ω_i(l) = true | l → ⊤) = 0.7, p(ω_i(l) = true | l → ⊥) = 0.5,
p(ω_i(l) = false | l → ⊤) = 0.3, p(ω_i(l) = false | l → ⊥) = 0.5.</p>
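To make the four probabilities concrete, a noisy oracle such as the one in Example 1 can be simulated as follows (a sketch under our own naming; truth stands for the unknown ground-truth labeling of link candidates, which an evaluation harness would supply):

```python
import random

def make_noisy_oracle(p_true_pos, p_true_neg, truth, rng):
    # p_true_pos: probability of answering true on an actual link,
    # p_true_neg: probability of answering false on a non-link;
    # the two error probabilities are the complements of these.
    def omega(l):
        correct = (p_true_pos if truth(l) else p_true_neg) > rng.random()
        return truth(l) if correct else (not truth(l))
    return omega

# Oracle from Example 1 (70% true positives, 50% true negatives),
# queried repeatedly on a candidate that really is a link.
rng = random.Random(42)
omega = make_noisy_oracle(0.7, 0.5, lambda l: True, rng)
answers = [omega(("s", "t")) for _ in range(10000)]
rate = sum(answers) / len(answers)  # empirically close to 0.7
```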
          <p>For compactness, we use the following vector notation in the rest of the
formal model: ω⃗ refers to the vector of the characteristic functions of all oracles.
We write ω⃗(l) = x⃗ to signify that the i-th oracle returned x_i for the link
candidate l. Let us assume that the probabilities underlying all oracles i are
known (we discuss ways to initialize and update these probabilities in the
subsequent section). Recalling that we assume our oracles to be independent, we
can now approximate the probability that l → y (with y ∈ {⊤, ⊥}) for any given
link candidate l using the following Bayesian model:</p>
          <p>p(l → y | ω⃗ = x⃗) = p(l → y) · ∏_{i=1}^{n} p(ω_i = x_i | l → y) / ∏_{i=1}^{n} p(ω_i = x_i). (1)</p>
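Under the independence assumption, the resulting odds computation can be sketched as follows (our own compact encoding: each oracle is summarized by its true-positive and true-negative probabilities, and o_plus stands for the prior odds of a random candidate being a link):

```python
def odds_link(answers, oracles, o_plus=1.0):
    # Odds that l is a link given the answer vector: o_plus times the product
    # over all oracles of
    #   p(omega_i = x_i given l is a link) / p(omega_i = x_i given l is not).
    odds = o_plus
    for x, (p_tp, p_tn) in zip(answers, oracles):
        p_given_top = p_tp if x else 1.0 - p_tp         # answer prob. on a link
        p_given_bottom = (1.0 - p_tn) if x else p_tn    # answer prob. on a non-link
        odds = odds * (p_given_top / p_given_bottom)
    return odds

# Three oracles as in Example 1: each "true" answer multiplies the odds by 0.7/0.5.
three_yes = odds_link([True, True, True], [(0.7, 0.5)] * 3)
```

With three unanimous positive answers the odds grow multiplicatively, which is exactly why even individually weak oracles can jointly produce confident classifications.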
          <p>Recall that the odds of an event A occurring are defined as odds(A) = p(A)/p(¬A).
For example, the odds of any link candidate being a correct link (denoted o⁺)
are given by</p>
          <p>o⁺ = p(l → ⊤) / p(l → ⊥) for any l ∈ LC. (2)</p>
          <p>o⁺ is independent of l and stands for the odds that an element of LC chosen
at random would be a link. Given feedback from our oracles, we can approximate
the odds of a link candidate l being a correct link by computing the following:</p>
          <p>odds(l → ⊤ | ω⃗ = x⃗) = o⁺ · ∏_{i=1}^{n} p(ω_i = x_i | l → ⊤) / p(ω_i = x_i | l → ⊥). (3)</p>
          <p>
            A key idea behind our model is that a link candidate l can be considered to be
a correct link if odds(l → ⊤ | ω⃗ = x⃗) ≥ k with k > 1. A link candidate is assumed
not to be a link if odds(l → ⊤ | ω⃗ = x⃗) ≤ 1/k. All other link candidates remain
unclassified. Computing the odds for a link now boils down to (1) approximating
the four probabilities which characterize our oracles and (2) computing o⁺. As
known from previous works on probabilistic models [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], o⁺ is hard to compute
directly, as it requires knowing the set of links M, which is exactly what we are
trying to compute. Several strategies can be used to approximate o⁺. In this
work, we consider the following three:
          </p>
          <p>1. Ignore strategy: We can assume the probabilities p(l → ⊤) and p(l → ⊥) to
be equally unknown and hence use o⁺ = 1. This reduces Equation 3 to
odds(l → ⊤ | ω⃗ = x⃗) = ∏_{i=1}^{n} p(ω_i = x_i | l → ⊤) / p(ω_i = x_i | l → ⊥). (4)</p>
          <p>
            2. Equivalence strategy: If r is an equivalence relation (e.g., owl:sameAs), then
the set of all possible candidates has the size |S| · |T|. There can be at most
min(|S|, |T|) links between S and T, as no two pairs (s, t) and (s, t′) can be
linked if t ≠ t′ and vice versa (see [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]). Hence, o⁺ ≤ min(|S|, |T|) / (|S| · |T| - min(|S|, |T|)). (5)
          </p>
          <p>3. Approximate strategy: We approximate o⁺ by using our learning approach.
We select the mapping [[L*]] computed using the best specification L* learned
by Ligon (see the subsequent section) as our best current approximation of
the mapping we are trying to learn. o⁺ is then computed as
o⁺ ≈ |[[L*]]| / (|S| · |T| - |[[L*]]|). (6)</p>
          <p>We quantify the effect of these strategies on our learning algorithm in our
experiments.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <p>Algorithm 1: The Ligon learning algorithm.
Input: Set of positive examples E_0; LC; Oracles 1, ..., n; Odds parameter k
1: j ← 0
2: foreach oracle i do
3:     initialize the confusion matrix C_i with 1/2
4: repeat
5:     foreach oracle i do
6:         gather ω_i(l) for each l ∈ E_j
7:         update the confusion matrix C_i
8:         update the characteristic matrix D_i
9:     train the active learner (Wombat by default) using ⋃_{i=0}^{j} E_i
10:    compute the set E of the most informative unlabeled examples
11:    foreach link candidate l ∈ E do
12:        get the oracle result vector x⃗ for l
13:    compute the set E⁺ of positive examples with odds(l → ⊤ | ω⃗ = x⃗) ≥ k
14:    compute the set E⁻ of negative examples with odds(l → ⊤ | ω⃗ = x⃗) ≤ 1/k
15:    j ← j + 1
16:    E_j ← E⁺ ∪ E⁻
17: until termination criterion holds
18: return best link specification</p>
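The candidate classification step (lines 13 and 14 of Algorithm 1) together with the Equivalence strategy bound can be sketched as follows (a toy sketch; odds_of stands for any implementation of the odds model, and the helper names are ours):

```python
def o_plus_equivalence(size_s, size_t):
    # Equivalence strategy: at most min(|S|, |T|) of the |S| * |T| candidate
    # pairs can be links when r is an equivalence relation.
    m = min(size_s, size_t)
    return m / (size_s * size_t - m)

def classify(candidates, odds_of, k):
    # Candidates with odds of at least k become positive examples, those with
    # odds of at most 1/k negative examples; the rest stays unclassified.
    positives, negatives = set(), set()
    for l in candidates:
        o = odds_of(l)
        if o >= k:
            positives.add(l)
        elif 1.0 / k >= o:
            negatives.add(l)
    return positives, negatives

# Toy odds assignments for three candidates, thresholded at k = 16.
toy_odds = {"a": 20.0, "b": 0.01, "c": 1.0}
pos, neg = classify(toy_odds, toy_odds.get, 16)
```

Candidate "c" ends up in neither set, mirroring the unclassified region between 1/k and k described above.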
      <sec id="sec-6-1">
        <title>The LIGON approach</title>
        <p>Ligon is an active learning algorithm designed to learn LS from noisy oracles.
An overview of the approach is given in Algorithm 1 and explained in the sections
below.</p>
        <p>Confusion Matrices. We begin by assuming that we are given an initial set
E_0 ⊆ LC of positive and negative examples for links. In the first step, we aim
to compute the initial approximations of the conditional probabilities which
describe each of the oracles i. To this end, each oracle is assigned a confusion
matrix C_i of dimension 2 × 2 (see lines 2-3 of Algorithm 1). Each entry of the
matrix is initialized with 1/2 to account for potential sampling biases due to high
disparities between conditional probabilities. The first and second row of each C_i
contain counts for links where the oracle returned true resp. false. The first
and second column of C_i contain counts for positive resp. negative examples.
Hence, c_11 basically counts positive examples that were rightly
classified as ⊤ by the oracle. In each learning iteration, we update the confusion
matrix by presenting the oracle with unseen link candidates and incrementing
the entries of C_i (see lines 4-8 of Algorithm 1). We discuss the computation of
the training examples in the subsequent section. Based on the confusion matrix,
we can approximate all conditional probabilities necessary to describe the oracle
by computing the 2 × 2 matrix D_i with d_ij = c_ij / (c_1j + c_2j). For example,
d_11 approximates p(ω_i(l) = true | l → ⊤). We call D_i the characteristic matrix
of the oracle i.
Example 2. Imagine an oracle were presented with a set of 5 positive and 5
negative training examples, of which it classified 4 resp. 3 correctly. We get
C = [9/2, 5/2; 3/2, 7/2] and D = [9/12, 5/12; 3/12, 7/12].</p>
        <p>
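Using the counts from Example 2, the characteristic matrix is obtained by normalizing each column of the confusion matrix (a sketch; a list of lists stands in for the 2 × 2 matrices):

```python
def characteristic_matrix(C):
    # d_ij = c_ij / (c_1j + c_2j): each column of the confusion matrix is
    # normalized into a conditional probability distribution.
    col_sums = [C[0][0] + C[1][0], C[0][1] + C[1][1]]
    return [[C[i][j] / col_sums[j] for j in range(2)] for i in range(2)]

# Example 2: 5 positive and 5 negative examples, 4 resp. 3 classified
# correctly, every count initialized with 1/2.
C = [[4.0 + 0.5, 2.0 + 0.5], [1.0 + 0.5, 3.0 + 0.5]]
D = characteristic_matrix(C)  # first column: [0.75, 0.25]
```

The columns of D sum to 1, reflecting that they encode complementary conditional probabilities for positive resp. negative examples.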
Updating the Characteristic Matrices. Updating the probabilities is done via the
confusion matrices. In each learning iteration, we present all oracles with the link
candidates deemed most informative. Based on the answers of the oracles,
we compute the odds for each of these link candidates. Link candidates l with
odds in [0, 1/k] resp. [k, +∞[ are considered to be false resp. true. The
new classifications are subsequently used to update the counts in the confusion
matrices and therewith also the characteristic matrix of each of the oracles.</p>
        <p>
          Active Learning Approach. So far, we have assumed the existence of an active
learning solution for link discovery. Several active learning approaches have been
developed over recent years [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Of these approaches, solely those based on genetic
programming can generate specifications of arbitrary complexity. However,
genetic programming approaches are not deterministic and are thus difficult to use
in practical applications. Newer approaches based on iterative operators such as
Wombat [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] have been shown to perform well in classical link discovery tasks.
Therefore, we implemented a generic interface to apply Ligon to several active
learning algorithms, and we use the Wombat algorithm as the default
active learning algorithm for Ligon. See our last set of experiments for results of
applying Ligon to other state-of-the-art active learning approaches.</p>
        <p>
          Selecting the Most Informative Examples. Given an active learning algorithm, we
denote the set of the m best LS generated in a given iteration i as B_i. The most
informative examples are those link candidates l which maximize the decision
entropy across the elements of B_i [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Formally, let [[B_i]] be the union of the sets
of link candidates generated by all LS b ∈ B_i. Then, the most informative link
candidates are the l ∈ [[B_i]] which maximize the entropy function e(l, B_i),
defined as follows: let p(l, B_i) be the probability that a link candidate belongs
to [[b]] for b ∈ B_i. Then, e(l, B_i) = -p(l, B_i) log2 p(l, B_i).
        </p>
        <p>Example 3. Let us assume |B_i| = 4. A link candidate l returned by two of the LS
in B_i would have a probability p(l, B_i) = 0.5. Hence, it would have an entropy
e(l, B_i) = 0.5.</p>
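The selection step can be sketched as follows (our own helper names; each element of best_ls_mappings stands for the mapping [[b]] of one LS b in B_i). For a candidate returned by two of four LS this yields an entropy of 0.5, matching Example 3:

```python
import math

def entropy(l, best_ls_mappings):
    # p(l, B_i): fraction of the best specifications whose mapping contains l.
    # e(l, B_i) = -p * log2(p) peaks at p = 0.5, i.e., maximal disagreement.
    p = sum(1 for m in best_ls_mappings if l in m) / len(best_ls_mappings)
    return -p * math.log2(p) if p > 0 else 0.0

def most_informative(candidates, best_ls_mappings, count):
    # Rank link candidates by decision entropy, highest first.
    ranked = sorted(candidates, key=lambda l: entropy(l, best_ls_mappings),
                    reverse=True)
    return ranked[:count]
```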
        <p>Termination Criterion. Ligon terminates after a set number of iterations has
been reached or if a link specification learned by Wombat achieves an
F-measure of 1 on the training data.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Experiments and Results</title>
        <p>We aimed to answer 6 research questions with our experimental evaluation:
Q1. Which combination of strategies for computing odds and the threshold k
leads to the best performance? Q2. How does Ligon behave when provided
with an increasing number of noisy oracles? Q3. How well does Ligon learn
from noisy oracles? Q4. How well does Ligon scale? Q5. How well does Ligon
perform compared to batch learning approaches trained with a similar number of
examples? And Q6. How general is Ligon, i.e., can Ligon be applied to problems
outside the link discovery domain, and does Ligon depend on the underlying
active learning algorithm?</p>
        <p>
          Experimental Setup. All experiments were carried out on a 64-core 2.3
GHz PC running OpenJDK 64-Bit Server 1.8.0_151 on Ubuntu 16.04.3 LTS. Each
experiment was assigned 20 GB RAM. We evaluated Ligon using 8 link
discovery benchmark datasets. Five of these benchmarks were real-world datasets [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ],
while three were synthetic datasets from the OAEI 2010 benchmark. We used the paradigm
proposed by [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and measured the performance of the algorithms using the best
F-measure they achieved. As this measure fails to capture the average behaviour
of an algorithm over several iterations, we also report the normalized average area
under the F-measure curve, which we denote AUC. We initialized Ligon with
10 positive examples (ergo, |E_0| = 10). We fixed the number of most
informative examples to be labeled by the noisy oracles at each iteration to 10. For
labeling the most informative examples, we used n = 2, 4, 8 and 16 noisy oracles,
which were all initialized with random confusion matrices. We set the size of
B to 10. All experiments were repeated 10 times and we report average values.
The characteristic matrices of our noisy oracles were generated at random. To
this end, we generated the true positive and true negative probabilities using a
uniform distribution between 0.5 and 1, i.e., p(ω_i(l) = true | l → ⊤) ∈ [0.5, 1] and
p(ω_i(l) = false | l → ⊥) ∈ [0.5, 1]. The other probabilities were set accordingly, as
they are complementary to the former two.
        </p>
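The oracle initialization used in the setup above can be sketched as follows (the profile encoding and names are ours):

```python
import random

def random_oracle_profile(rng, low=0.5):
    # Draw the true-positive and true-negative probabilities uniformly from
    # [low, 1]; the false-negative and false-positive probabilities are the
    # respective complements.
    p_tp = rng.uniform(low, 1.0)
    p_tn = rng.uniform(low, 1.0)
    return {"tp": p_tp, "fn": 1.0 - p_tp, "tn": p_tn, "fp": 1.0 - p_tn}

# 16 random noisy oracles, as in the largest experimental setting.
rng = random.Random(7)
profiles = [random_oracle_profile(rng) for _ in range(16)]
```

Drawing from [0.5, 1] guarantees that every oracle is at least as good as a coin flip, which is the weakest setting the experiments consider.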
        <sec id="sec-6-2-1">
          <title>6 http://oaei.ontologymatching.org/2010</title>
          <p>Parameter Estimation. Our first series of experiments aimed to answer Q1.
We ran Ligon with k = 2, 4, 8 and 16. These settings were used in combination
with all three of the aforementioned strategies for computing o⁺. A first observation
is that the AUC achieved by Ligon does not depend much on the value of k
nor on the strategy used. This is a highly positive feature of our algorithm, as it
suggests that our approach is robust w.r.t. how it is initialized. Interestingly,
this experiment already suggests that Ligon achieves more than 95% of the
performance of the original Wombat algorithm trained with a perfect oracle.</p>
          <p>We chose to run the remainder of our experiments with the setting k = 16
combined with the Equivalence strategy, as this combination achieved the highest
average F-measure of 0.86.</p>
          <p>Comparison with Perfect Oracle. In our second set of experiments, we
answered Q2 by measuring how well Ligon performed when provided with an
increasing number of oracles. In this series of experiments, we used 2, 4, 8,
and 16 oracles which were initialized randomly. k was set to 16 and we used
the Equivalence strategy. Once more, the robustness of our approach became
evident, as its performance was not majorly perturbed by a variation in the
number of oracles. In all settings, Ligon achieves an average AUC close to 0.86
with no statistically significant difference. We can hence conclude that the performance of
our approach depends mostly on the initial set of examples E_0 being accurate,
which leads to our prior, i.e., the evaluation of the initial confusion matrix of the
oracles, being sufficient. This sufficient approximation means that our Bayesian
model is able to distinguish between erroneous classifications well enough to find
the most informative examples accurately and generalize over them. In other words,
even a small balanced set containing 5 positive and 5 negative examples seems
sufficient to approximate the confusion matrix of the oracles well enough to
detect positive and negative examples consistently in the subsequent steps. This
answers Q2. Figures 2 and 3 show the detailed results of running Ligon for 10
iterations on each of our 8 benchmark datasets.</p>
          <p>To answer Q3, we also ran our approach in combination with a perfect oracle
(i.e., an oracle which knew and returned the perfect classification for all pairs
from (S, T)). The detailed results are provided in Figures 2 and 3. Combining
our approach with a perfect oracle can be regarded as providing an upper bound
for our learning algorithm. Over all datasets, Ligon achieved 95% of the AUC
achieved with the perfect oracle (min = 88% on DBpedia-LMDB, max = 100%
on Restaurants). This answers Q3
and demonstrates that Ligon can learn LS with an accuracy close to that of an
approach provided with perfect answers.</p>
          <p>Runtime. In our third set of experiments, we were interested in knowing
how well our approach scales. To this end, we measured the runtime of our
algorithm while running the experiments carried out to answer Q2 and Q3. In
our experiments, Wombat, the machine learning approach used within Ligon,
accounts for more than 99% of Ligon's runtime (see Table 2 for detailed
results). This shows that the Bayesian framework used to re-evaluate the
characteristic matrices of the oracles is clearly fast enough to be used in interactive
scenarios, which answers Q4. Our approach completes a learning iteration in less
than 10 seconds on most datasets, which we consider acceptable even for
interactive scenarios. The longer runtime on DBLP-GoogleScholar (roughly 16 seconds
per iteration on average) is due to the large size of this dataset. Here, a parallel
version of the Wombat algorithm would help improve the interaction with
the user. The implementation of a parallel version of Wombat goes beyond the
work presented here.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-11">
      <p>
        Comparison with Batch Learning. While active learning commonly
requires a small number of training examples to achieve good F-measures, other
techniques such as pessimistic and re-weighted batch learning have also been
designed to achieve this goal [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In addition, the positive-only learning algorithm
Wombat has also been shown to perform well with a small number of
training examples. In our final set of experiments, we compared the best F-measure
achieved by Ligon when trained with 16 noisy oracles, k = 16 and the
Equivalence strategy with the pessimistic and re-weighted models proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as well
as the two versions of the Wombat approach [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. All approaches were trained
with 2% of the reference data (i.e., with a perfect oracle), as suggested by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
The results of these experiments are shown in Table 3. Note that we did not
consider the datasets Persons 1, Persons 2 and Restaurant, because 2% of the
training data amounts to fewer than 10 examples, which Ligon requires as its
initial training dataset E_0. Our results answer Q5 clearly by showing that Ligon
outperforms previous batch learning algorithms even when trained with noisy
oracles. On average, Ligon is more than 40% better in F-measure. This clearly
demonstrates that our active learning strategy for selecting training examples is
superior to batch learning.
      </p>
      <p>
        Generalization of Ligon. In our last set of experiments, we implemented a
generalization of Ligon for binary classification tasks beyond link discovery. We
thus used the active learning framework JCLAL [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] to wrap WEKA [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
classifiers and implemented Ligon as a custom oracle. We selected three well-known
binary classification datasets (i.e., Diabetes, breast-cancer and Ionosphere) from
the WEKA distribution, on which we applied two state-of-the-art classification
algorithms, namely GBDT and Random Forests [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Based on our previous
experiments, we used 4 noisy oracles, set k to 16 and used the Ignore
strategy, since all the other strategies are specific to the link discovery domain.
We executed two sets of experiments for noisy oracles with true positive/negative
probabilities drawn from the two uniform distributions over [0.5, 1] and [0.75, 1].
On average, Ligon achieves 75% resp. 89% of the learning accuracy for noisy
oracles drawn from [0.5, 1] resp. [0.75, 1]. These results indicate that
Ligon is not only applicable to problems outside the link discovery domain but
also independent of the underlying learning algorithm, achieving
F-measures close to the ones scored using a perfect oracle, which answers Q6.
      </p>
      <sec id="sec-11-1">
        <title>Related Work</title>
        <p>
          Link Discovery for RDF knowledge graphs has been an active research area for
nearly a decade, with the first frameworks for link discovery [
          <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
          ] appearing
at the beginning of the decade. Raven [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] was the first active learning
approach for link discovery and used perceptron learning to detect accurate LS.
Other approaches were subsequently developed to learn LS within the active
learning setting [
          <xref ref-type="bibr" rid="ref11 ref12 ref3">3, 11, 12</xref>
          ]. Unsupervised learning approaches for monogamous
relations [11-13] rely on different pseudo-F-measures to detect links without
any training data. Positive-only learning algorithms [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] address the open-world
characteristic of the Semantic Web by using generalization algorithms to detect
LS. The work presented by [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] proposes an active learning approach for link
prediction in knowledge graphs. Ligon differs from the state of the art in that
it does not assume that it deals with perfect oracles. Rather, it uses several
noisy oracles to achieve F-measures close to those achieved with perfect
oracles. An active learning approach with uncertain labeling knowledge is proposed
by [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], where the authors used diversity density to characterize the uncertainty
of the knowledge. A probabilistic model of active learning with multiple noisy
oracles was introduced by [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] to label the data based on the most reliable oracle.
For crowdsourcing scenarios, [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] proposes a supervised learning algorithm for
multiple annotators (oracles), where the oracles' diverse reliabilities are treated
as latent variables.
        </p>
      </sec>
      <sec id="sec-11-2">
        <title>Conclusions and Future Work</title>
        <p>We presented Ligon, an active learning approach designed to deal with noisy
oracles, i.e., oracles that are not guaranteed to return correct classi cation
results. Ligon relies on a probabilistic model to estimate the joint odds of link
candidates based on the oracles' guesses. Our experiments showed that Ligon
achieves 95% of the learning accuracy of approaches learning with perfect
oracles in the link discovery setting. Moreover, we showed that Ligon is (1) not
dependent on the underlying active learning algorithm and (2) able to deal with
other classi cation problems. In future work, we will evaluate Ligon within real
crowdsourcing scenarios. A limitation of our approach is that it assumes that
the confusion matrix of the oracles is static. While this assumption is valid with
the small number of iterations necessary for our approach to converge, we will
extend our model so as to deal with oracles which change dynamically.
Furthermore, we will extend Ligon to handle n-ary classification problems and evaluate
it against more state-of-the-art approaches from the deep learning domain.
Acknowledgments. This work has been supported by the EU H2020 project
KnowGraphs (GA no. 860801) as well as the BMVI projects LIMBO (GA no. 19F2029C)
and OPAL (GA no. 19F2028A).</p>
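        <p>As an illustration of the probabilistic model sketched above, the following is a minimal, hypothetical naive-Bayes-style combination of noisy oracle guesses into joint log-odds under a uniform prior. It is not the exact Ligon formulation; the function name, the vote representation, and the per-oracle confusion rates are all assumptions made for the sketch.</p>

```python
import math

def joint_log_odds(votes, confusion):
    """Combine binary guesses from noisy oracles into joint log-odds
    that a link candidate is a true link (naive-Bayes-style sketch).

    votes: list of bool guesses, one per oracle (True = "link")
    confusion: list of (tpr, tnr) per oracle, i.e. P(guess=+ | +)
               and P(guess=- | -), assumed known and static
    """
    log_odds = 0.0  # uniform prior: P(+) = P(-)
    for guess, (tpr, tnr) in zip(votes, confusion):
        if guess:  # oracle says "link"
            log_odds += math.log(tpr / (1.0 - tnr))
        else:      # oracle says "no link"
            log_odds += math.log((1.0 - tpr) / tnr)
    return log_odds

# Two fairly reliable oracles agreeing on "link" outweigh
# one near-random dissenter, so the joint odds favour "link".
votes = [True, True, False]
confusion = [(0.9, 0.9), (0.8, 0.85), (0.6, 0.55)]
print(joint_log_odds(votes, confusion) > 0)  # classify as link if positive
```

        <p>Under this sketch, oracles with confusion rates close to chance contribute log-odds terms near zero, so their guesses barely influence the joint decision, which matches the intuition that unreliable annotators should be discounted.</p>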
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Fang</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Active learning with uncertain labeling knowledge</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>43</volume>
          (Supplement C):
          <fpage>98</fpage>
          –
          <lpage>108</lpage>
          ,
          <year>2014</year>
          . ICPR2012 Awarded Papers.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reutemann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>The WEKA data mining software: an update</article-title>
          .
          <source>SIGKDD Explorations</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Active learning of expressive linkage rules using genetic programming</article-title>
          .
          <source>J. Web Sem</source>
          .,
          <volume>23</volume>
          :
          <fpage>2</fpage>
          –
          <lpage>15</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Silk - generating RDF links while publishing or consuming linked data</article-title>
          .
          <source>In Proceedings of the ISWC Posters &amp; Demonstrations Track</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>M.</given-names>
            <surname>Kejriwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Miranker</surname>
          </string-name>
          .
          <article-title>Semi-supervised instance matching using boosted classifiers</article-title>
          .
          <source>In The Semantic Web. Latest Advances and New Domains</source>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. H. Kopcke,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>Evaluation of entity resolution approaches on real-world match problems</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .,
          <volume>3</volume>
          (
          <issue>1</issue>
          -2):
          <fpage>484</fpage>
          –
          <lpage>493</lpage>
          , Sept.
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <source>Introduction to information retrieval</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M.</given-names>
            <surname>Nentwig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hartung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>A survey of current link discovery frameworks</article-title>
          .
          <source>Semantic Web</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          ):
          <fpage>419</fpage>
          –
          <lpage>436</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>LIMES – a time-efficient approach for large-scale link discovery on the web of data</article-title>
          .
          <source>In Proceedings of the International Joint Conference on Artificial Intelligence</source>
          , Spain,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Höffner</surname>
          </string-name>
          .
          <article-title>RAVEN – active learning of link specifications</article-title>
          .
          <source>In Proceedings of the 6th International Workshop on Ontology Matching, Germany</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Lyko</surname>
          </string-name>
          .
          <article-title>EAGLE: efficient active learning of link specifications using genetic programming</article-title>
          .
          <source>In Extended Semantic Web Conference</source>
          , pages
          <fpage>149</fpage>
          –
          <lpage>163</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lyko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Christen</surname>
          </string-name>
          .
          <article-title>COALA – correlation-aware active learning of link specifications</article-title>
          .
          <source>In Extended Semantic Web Conference</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>d'Aquin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          .
          <article-title>Unsupervised learning of link discovery configuration</article-title>
          .
          <source>In Extended Semantic Web Conference</source>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>N.</given-names>
            <surname>Ostapuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cudré-Mauroux</surname>
          </string-name>
          .
          <article-title>ActiveLink: deep active learning for link prediction in knowledge graphs</article-title>
          .
          <source>In The World Wide Web Conference</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>O. G. R.</given-names>
            <surname>Pupo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>del Carmen Rodríguez-Hernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Fardoun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ventura</surname>
          </string-name>
          .
          <article-title>JCLAL: a Java framework for active learning</article-title>
          .
          <source>J. Mach. Learn. Res.</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          .
          <article-title>Learning from multiple annotators: Distinguishing good from random labelers</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>M.</given-names>
            <surname>Sherif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>WOMBAT – a generalization approach for automatic link discovery</article-title>
          .
          <source>In 14th Extended Semantic Web Conference</source>
          , Slovenia. Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>A probabilistic model of active learning with multiple noisy oracles</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Miao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>An</surname>
          </string-name>
          .
          <article-title>Challenges and opportunities for trust management in crowdsourcing</article-title>
          .
          <source>In Proceedings of the 2012 IEEE/WIC/ACM WI-IAT '12, USA</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Almpanidis</surname>
          </string-name>
          .
          <article-title>An up-to-date comparison of state-of-the-art classification algorithms</article-title>
          .
          <source>Expert Systems with Applications</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>