=Paper= {{Paper |id=Vol-1111/om2013_Tpaper3 |storemode=property |title=Unsupervised learning of link specifications: deterministic vs. non-deterministic |pdfUrl=https://ceur-ws.org/Vol-1111/om2013_Tpaper3.pdf |volume=Vol-1111 |dblpUrl=https://dblp.org/rec/conf/semweb/NgomoL13 }} ==Unsupervised learning of link specifications: deterministic vs. non-deterministic== https://ceur-ws.org/Vol-1111/om2013_Tpaper3.pdf
    Unsupervised Learning of Link Specifications:
        Deterministic vs. Non-Deterministic

                 Axel-Cyrille Ngonga Ngomo1 and Klaus Lyko1

                        Department of Computer Science
                              University of Leipzig
                         Johannisgasse 26, 04103 Leipzig
                      ngonga@informatik.uni-leipzig.de,
                     WWW home page: http://limes.sf.net


      Abstract. Link Discovery has been shown to be of utmost importance for
      the Linked Data Web. In previous works, several supervised approaches
      have been developed for learning link specifications out of labelled data.
      Most recently, genetic programming has also been utilized to learn link
      specifications in an unsupervised fashion by optimizing a parametrized
      pseudo-F-measure. The questions underlying this evaluation paper are
      twofold: First, how well do pseudo-F-measures predict the real accu-
      racy of non-deterministic and deterministic approaches across different
      types of datasets? Second, how do deterministic approaches compare to
      non-deterministic approaches? To answer these questions, we evaluated
      linear and Boolean classifiers against classifiers computed by using ge-
      netic programming on six different data sets. We also studied the corre-
      lation between two different pseudo-F-measures and the real F-measures
      achieved by the classifiers at hand. Our evaluation suggests that pseudo-
      F-measures behave differently on the synthetic and real data sets.


1    Introduction
Over the last years, the importance of Link Discovery (LD) as a research topic
has increased significantly. This increase was upheld mainly by the ever-growing
size of the Linked Data Web and the scalability and accuracy requirements it
brings about. The creation of links between knowledge bases, one of the most
important steps in the realization of the vision of the Linked Data Web, has
profited from this boost of research and seen the development of several LD
frameworks and approaches [7, 2, 12, 11, 3]. Two main research focuses played a
role so far: (1) the determination of time-efficient algorithms [3, 7, 6] for LD and
(2) the development of approaches for the efficient computation of link specifi-
cations (also called linkage rules) [8, 4, 10, 9]. In most cases, supervised machine
learning approaches were used to tackle the second challenge of LD. Approaches
developed so far include batch learning using genetic programming [3], the com-
bination of active learning and of linear and Boolean classifiers [8] as well as
the combination of active learning and genetic programming [9]. In addition,
unsupervised approaches for learning link specifications have been recently de-
veloped [10]. While all these approaches have been shown to achieve good results,
unsupervised approaches obviously trump batch and active learning approaches
as they do not require any feedback from the user and can still achieve remark-
ably good performance. In addition, genetic programming approaches yield the
central advantage of being able to exploit the whole spectrum of the link speci-
fication grammar provided by the framework in which they were implemented.
So far, unsupervised approaches to the discovery of link specifications have only
been tested with artificially generated benchmark data and low-noise datasets.
Moreover, no deterministic approach for the unsupervised discovery of link spec-
ifications has been presented so far, although deterministic approaches such as
those presented in [8] counterbalance their limitations in expressiveness by being
clearly more time-efficient than approaches based on genetic programming.
    The aim of this paper is to experimentally examine the unsupervised discov-
ery of link specifications with respect to two main questions:
 1. Are deterministic approaches able to achieve results comparable to those of
    genetic approaches? To address this question, we extended the approach
    presented in [9] and devised an approach for the unsupervised learning of
    Boolean and linear classifiers which is loosely based on the RAVEN ap-
    proach [8]. We refrained from reusing the approach presented in [10] as one
    of the pseudo-measures we rely on was designed especially to work well with
    this approach. Consequently, using it could have led to a bias in our results.
 2. How well are pseudo-F-measures suited for unsupervised discovery performed
    on synthetic and real data sets? Here, we compared the results achieved
    by the approaches above on the three OAEI 2010 datasets1 and on three
    data sets extracted from real data2 . In addition to the pseudo-F-measure
    described in [10] (which we dub Fuβ ), we devised a supplementary pseudo-F-
    measure Fdβ which relies more on the standard definition of the Fβ -measure.
    We performed a correlation analysis of the values of Fuβ , Fdβ and the F1
    measure and detected a surprisingly different behaviour of these measures
    across our two groups of data sets.
The rest of this paper is structured as follows: We first give an overview of the
approaches and measures we used for our experiments. We then present the
results of our experimental setup as well as the results of our experiments. For
the sake of reproducibility, we chose to use freely available datasets and made the
approaches presented herein freely available at the project website.3 We conclude
with a summary of the implications of our results for the LD community.


2   Approaches
In general, a link specification is a classifier C that assigns each element of the
set S ×T to one of the classes of Y = {+1, −1}, where S is called the set of source
1
  Freely available at http://oaei.ontologymatching.org/2010/.
2
  Freely available at http://dbs.uni-leipzig.de/en/research/projects/object_
  matching/fever/benchmark_datasets_for_entity_resolution.
3
  http://saim.sf.net
instances, while T is the set of target instances. (s, t) ∈ S × T is considered by C
to be a correct link when C(s, t) = +1. Otherwise, (s, t) is considered not to be a
potential link. We will assume that the classifier C relies on a complex similarity
function σ which consists of a combination of atomic similarity measures σi .
Each of the atomic similarity measures is associated with a parameter ωi , which
is used in most cases as a threshold or weight for σi . Supervised approaches to the
computation of link specifications use labelled training data L ⊆ S × T × Y to
maximize an objective function such as the distance from the labelled data items
to the boundary of the classifier in the case of Support Vector Machines [1]. The
idea behind unsupervised approaches to learning link specifications is that they
do not utilize any training data (i.e., L = ∅). Instead, they aim to optimize
an objective function F. In the following, we present the non-deterministic and
the deterministic approaches we utilized in our experiments. We then present
two different objective functions that are based on the well-known Fβ -measure.
These functions form the basis for our evaluation.
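To make this setup concrete, the following sketch (property names and data invented for illustration, not taken from any of the frameworks discussed here) shows a classifier built from a single atomic Jaccard measure applied to two properties:

```python
# Hypothetical sketch of a link specification: a classifier over S x T built
# from atomic similarity measures sigma_i weighted by parameters omega_i.

def jaccard(a, b):
    """One possible atomic similarity measure sigma_i (token Jaccard)."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0

def classify(s, t, weights):
    """C(s, t) = +1 iff the weighted sum of atomic similarities reaches 1."""
    score = sum(w * jaccard(s[p], t[p]) for p, w in weights.items())
    return +1 if score >= 1.0 else -1

# Toy source and target instances described by two properties.
s = {"name": "Axel Ngonga", "affiliation": "University of Leipzig"}
t = {"name": "Axel Ngonga", "affiliation": "Leipzig University"}
print(classify(s, t, {"name": 0.8, "affiliation": 0.4}))  # → 1 (link accepted)
```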


2.1   Non-Deterministic Approach



Algorithm 1 EAGLE
Require: Sets of instances S and T , size of population, number of iterations
 Get property mapping (S, T )
 Generate initial population
 repeat
 Compute F for all individuals
 Apply genetic operators to the population
 until Number of iterations is reached
 return Overall fittest individual



     The non-deterministic approach we evaluated is based on the EAGLE ap-
proach presented in [9] and implemented in the LIMES framework [8]. The ap-
proach was modified as described in Algorithm 1. We begin by generating a
random population of n individuals. Let Gt be the population at the iteration
t. To evolve a population to the generation Gt+1 , the fitness of each individual
g t ∈ Gt is computed. For this purpose, the mapping M (g t ) generated by
g t is evaluated and the value F(M (g t )) is assigned to g t . These fitness values
form the basis for selecting individuals for the genetic operator reproduction.
EAGLE uses a tournament setting between two selected individuals to decide
which one is copied to the next generation Gt+1 . On randomly selected indi-
viduals the operator mutation is applied according to a probability called the
mutation rate. A mutation can affect an individual in three different ways: First,
it can alter the thresholds used by the individual. Second, a mutation can alter
the property values that are compared by one of the atomic measures on which
the classifier relies. Finally, mutations can modify the measures included in the
individuals. The third genetic operator, crossover, operates on two parent in-
dividuals and builds a new offspring by swapping two random sub-trees of the
parent genotypes. The application of these operators is carried out iteratively
until a maximal number of iterations is reached. The result of this process is
the mapping M generated by the overall best individual. M is postprocessed as
follows: each s ∈ S such that ∃t ∈ T : (s, t) ∈ M is mapped to arg maxt∈T σ(s, t).
The postprocessed mapping is finally returned.
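The post-processing step above can be sketched as follows (a minimal illustration assuming the mapping is given as a dictionary from pairs to similarity scores):

```python
# For each source s occurring in the mapping M, keep only the link to the
# target t with the highest similarity, i.e. t = arg max sigma(s, t).
def postprocess(mapping):
    best = {}
    for (s, t), score in mapping.items():
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return {(s, t): score for s, (t, score) in best.items()}

M = {("s1", "t1"): 0.9, ("s1", "t2"): 0.7, ("s2", "t2"): 0.8}
print(postprocess(M))  # s1 keeps only its best link (to t1)
```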


2.2    Deterministic Approaches



Algorithm 2 EUCLID
Require: Specification of the datasets S and T
Require: ∆ω ∈ ]0, 1[
Require: γ > 1, θ > 0 with (γ, θ) ∈ N2
 Get property mapping (S, T )
 bestClassifier = (1, . . . , 1)
 Ω=∅
 for k = 0 → ⌈1/∆ω ⌉ − 1 do
   Ω = Ω ∪ {1 − k∆ω }
 end for
 for i = 1 → n do
   σimax = arg maxω∈Ω,σi ∈Σ F(σi ≥ ω); ωimin = 0; ωimax = 1
 end for
 for iteration = 1 → θ do
   ∆ = |ωimax − ωimin |/γ
   C = arg max over (k1 , . . . , kn ) ∈ {0, . . . , γ}n of F(classifier with ωi = ωimin + ki ∆)
   if F(C) > F(bestClassifier) then
     bestClassifier = C
   end if
   Let (ζ1 , . . . , ζn ) be the parameters of bestClassifier
   for j = 1 → n do
     ωjmin = max(0, ζj − ∆); ωjmax = min(1, ζj + ∆)
   end for
   γ = γ/2
 end for
 return bestClassifier



   Linear and Boolean classifiers have been shown in previous work [8] to also
achieve good results on the task of LD. Both types of classifiers can be char-
acterized by a similarity function σ which depends on similarity measures σi
and parameters ωi . Linear classifiers L classify (s, t) as belonging to +1 iff
ω1 σ1 (s, t) + . . . + ωn σn (s, t) ≥ 1. For Boolean classifiers B, each of the inequalities
σi (s, t) ≥ ωi (for i = 1, . . . , n) must be fulfilled by the pair (s, t) for it to belong
to +1. In both cases, a classifier can be encoded by the vector Ω = (ω1 , . . . , ωn ).
Determining the best L or B would
require testing all possible combinations of values ωi ∈ [0, 1], which would be
impracticable. The idea behind our algorithm EUCLID (Efficient and Unsuper-
vised Classification for Link Discovery) is to reduce the number of configurations
that must be tested by applying a search within the search space whose granular-
ity is refined gradually, as described in Algorithm 2: We begin by detecting
the right similarity measure for each pair of properties. To achieve this goal, we
use the similarity σimax = arg maxω∈Ω,σi ∈Σ F(σi ≥ ω) for each pair of properties,
where F(σi ≥ ω) is the value of the pseudo-F-measure (PFM) of the classifier C0
such that C0 (s, t) = +1 ⇐⇒ σi (s, t) ≥ ω, Ω = {ω > 0 with ∃k ∈ N : ω = 1 − k∆ω }
is a set of threshold values and Σ is the set of similarity measures implemented
by the framework at hand. Note that ∆ω is the first of the three parameters
required by EUCLID.
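The threshold set Ω can be generated directly from ∆ω ; a minimal sketch (∆ω = 0.25 chosen only for illustration):

```python
# Build Omega = {1 - k * delta | k in N, 1 - k * delta > 0}.
def threshold_grid(delta):
    omega, k = [], 0
    while 1 - k * delta > 1e-12:  # tolerance guards against float round-off
        omega.append(round(1 - k * delta, 10))
        k += 1
    return omega

print(threshold_grid(0.25))  # [1.0, 0.75, 0.5, 0.25]
```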
    In a second step, we compute the actual link specification. Given two param-
eters γ > 1 and θ > 0, ωimin and ωimax are set to 0 and 1 respectively for each
of the similarities σi . Then the interval [ωimin , ωimax ] is split into γ intervals
of the same size ∆ = |ωimax − ωimin |/γ. For each of the possible parametriza-
tions ωi = ωimin + ki ∆ with ki ∈ {0, . . . , γ}, EUCLID simply runs all of the
resulting classifiers C and computes their fitness F(C). The current overall best
classifier C = (ζ1 , . . . , ζn ) is then used as the new reference point: ωimin is set to
max{0, ζi − ∆} and ωimax to min{ζi + ∆, 1}, while γ := γ/2. This procedure is
repeated θ times and the best overall classifier w.r.t. F is returned.
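For a single threshold (n = 1), the coarse-to-fine refinement can be sketched as follows. The sketch simplifies the algorithm by keeping γ fixed rather than halving it, and uses a toy unimodal objective in place of F:

```python
# Hierarchical grid search over one threshold in [0, 1]: evaluate gamma + 1
# equidistant candidates, then re-center a shrunken interval on the best one.
def euclid_search(fitness, gamma=4, theta=10):
    lo, hi = 0.0, 1.0
    best, best_f = None, float("-inf")
    for _ in range(theta):
        delta = (hi - lo) / gamma
        for k in range(gamma + 1):
            omega = lo + k * delta
            f = fitness(omega)
            if f > best_f:
                best, best_f = omega, f
        # New, finer interval around the current best classifier.
        lo, hi = max(0.0, best - delta), min(1.0, best + delta)
    return best

# Toy objective peaking at omega = 0.62 (stands in for the pseudo-F-measure).
print(round(euclid_search(lambda w: -abs(w - 0.62)), 2))  # → 0.62
```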


3   Pseudo-F-measures
We considered two different PFMs for the automatic discovery of link specifica-
tions. The first PFM, dubbed Fuβ , was proposed by [10] and is based on the Fβ
measure. Consequently, it is defined as
                  Fuβ = (1 + β 2 ) · Pu Ru / (β 2 Pu + Ru ).                    (1)

Let M ⊆ S × T be a mapping generated by an algorithm, S be the set of source
instances and T be the set of target instances. Pu is defined as

                  Pu (M ) = |{s | ∃t : (s, t) ∈ M }| / Σs |{t : (s, t) ∈ M }|,  (2)

while Ru is computed as follows:

                  Ru (M ) = |M | / min(|S|, |T |).                              (3)
Note that Ru (M ) can be larger than 1 as |M | can be larger than min(|S|, |T |).
While this does not occur in the setup proposed by [10], it seems rather counter-
intuitive that a recall measure can lead to values beyond 1. We thus specified
the following novel pseudo-recall dubbed Rd :

      Rd (M ) = (|{s | ∃t : (s, t) ∈ M }| + |{t | ∃s : (s, t) ∈ M }|) / (|S| + |T |).  (4)

This pseudo-recall measures how well all source and target instances are covered
by the mapping M . Thus, Rd (M ) = 1 if every s ∈ S is mapped to at least one
t ∈ T and vice versa. We would argue that it is therewith more in line with the
original definition of precision and recall. Our pseudo-F-measure Fdβ is thus
defined as

      Fdβ (M ) = (1 + β 2 ) · Pd (M )Rd (M ) / (β 2 Pd (M ) + Rd (M )), with Pd = Pu .  (5)
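On a toy mapping the pseudo-measures can be computed directly (instance names invented); note how Ru exceeds 1 as soon as |M | > min(|S|, |T |):

```python
# Pseudo-precision P_u, pseudo-recalls R_u and R_d, and pseudo-F F_d (Eqs. 1-5).
def pseudo_scores(M, S, T, beta=1.0):
    src = {s for s, _ in M}  # sources linked at least once
    tgt = {t for _, t in M}  # targets linked at least once
    P_u = len(src) / len(M)                          # Eq. (2): sum_s |{t}| = |M|
    R_u = len(M) / min(len(S), len(T))               # Eq. (3): may exceed 1
    R_d = (len(src) + len(tgt)) / (len(S) + len(T))  # Eq. (4)
    F_d = (1 + beta**2) * P_u * R_d / (beta**2 * P_u + R_d)  # Eq. (5), P_d = P_u
    return P_u, R_u, R_d, F_d

M = {("s1", "t1"), ("s1", "t2"), ("s2", "t2")}
S, T = {"s1", "s2", "s3"}, {"t1", "t2"}
P_u, R_u, R_d, F_d = pseudo_scores(M, S, T)
print(R_u)  # 1.5: three links but only two targets, so R_u > 1
```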


4     Experiments and Results

The goal of our experiments was twofold. First, we wanted to know how deter-
ministic approaches perform in comparison to non-deterministic approaches for
the discovery of link specifications. The basic intuition here was that if deter-
ministic approaches can achieve F-scores similar to those of non-deterministic
approaches, they should be preferred as they are usually more time-efficient.
We thus compared the maximal F-measure achieved by each of our approaches
on the six different data sets at hand. Moreover, we wanted to measure how
well PFMs can predict the real performance of classifiers. In particular, we were
interested in knowing whether the predictive power of pseudo-F-measures is as
reliable on real data as it has been shown to be on synthetic data. Within this
context, we were also interested in knowing which setting of β led to the best real
F-measure across the different datasets that we used, as β = 0.1 was suggested
in the past [10]. We thus ran our evaluation using both Fu and Fd for β-values
between 0.1 and 2.0 with a 0.1 increment. We used two different measures to
evaluate the correlation between PFM and F1 . In the following, we present the
data, algorithmic parameters and correlation measures used for our experiments.
We then present our results and discuss their implications for the next steps of
research on link discovery.


4.1   Experimental Setup

In all experiments, we assumed that we knew the perfect mapping between the
properties. Each experiment was run on a single thread of an Ubuntu Linux server
running JDK 1.7 and was allocated at most 2 GB of RAM. The processors were
2.0 GHz quad-core Opterons.


Data We ran our experiments on three synthetic and three real datasets. The
synthetic datasets consisted of the widely used and well-known Persons1, Persons2
and Restaurant datasets from the OAEI 2010 set of benchmark data sets. The real
datasets consisted of the ACM-DBLP, Amazon-Google and Abt-Buy datasets
that were extracted from websites or databases and for which a gold standard was
created manually as reported in [5]. The ACM-DBLP dataset consists of 2,617
source and 2,295 target publications (gold standard: 2,224 links). The Amazon-
Google dataset links 1,363 to 3,226 products (gold standard: 1,300 links). Finally,
the Abt-Buy dataset links 1,081 to 1,092 products via 1,097 correct links. All
non-RDF datasets were transformed into RDF and all attribute values were set
to lower case. Apart from this, no other preprocessing step was carried out.

Parametrization of the algorithms Four parameters need to be set to run
EAGLE: the number of iterations, the size of the population, the crossover rate
and the mutation rate. Similarly to [10], we used 20 iterations with a population
of 100 individuals. The crossover and mutation rates were set to 0.6. Given that
this approach is not deterministic, we ran the experiments 5 times and present
the average values in Section 4.2. Note that the standard deviations for the F-
measures were always under 5% of the average value. For EUCLID, we used
∆ω = 0.1, γ = 4, θ = 10.

Correlation Measures To measure the accuracy of the algorithms, we used the
standard F1 -measure. We were interested in determining whether the pseudo-F-
measures Fuβ and Fdβ can be used practically to predict the F1 measure achieved
by an algorithm and which setting of β was the best to achieve this goal. Thus,
we measured the correlation of Fuβ and Fdβ with F1 across different values of
β. We used two different correlation measures: The Pearson and the Spearman
correlation. We used the Pearson correlation to ensure the comparability of our
approach with other correlation studies as this correlation is one of the most
commonly used. The main drawback of the Pearson correlation is that it is most
reliable at detecting linear correlations between distributions. As there was no
reason for assuming a linear relationship between our measures, we opted to
also use another correlation measure that does not make any assumptions about
the type of correlation between the input distributions. We used the Spear-
man correlation [13], which assesses how well a monotonic function can describe
the relationship between the input distributions by comparing the ranks of the
values in the two input distributions. For both correlations, we used a two-tailed
significance test with a confidence threshold of 95%.
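For reference, both correlation coefficients can be computed in a few lines (a sketch without the tie handling that the full Spearman definition requires; the sample values are invented):

```python
# Pearson correlation: normalized covariance of the two samples.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Spearman correlation: Pearson correlation computed on the ranks.
def spearman(xs, ys):
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))

pseudo_f = [0.91, 0.85, 0.78, 0.60]  # invented pseudo-F values
real_f = [0.91, 0.83, 0.80, 0.55]    # invented real F1 values
print(round(spearman(pseudo_f, real_f), 2))  # → 1.0 (same ordering)
```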

4.2   Results
We first measured the F1 -scores achieved by our approaches when relying on Fd
(see Figure 1) and Fu (see Figure 2). Our results indicate that EUCLID is in
general slightly superior to EAGLE. Note that the linear classifier in combination
with Fd even outperforms the supervised approaches presented in [5] and [9] on
the ACM-DBLP data set. In addition, the linear model leads to better results
than the Boolean model in most cases (except on the Restaurant dataset). Our
[Figure 1 appears here: plots of pseudo and real F-measures against β for the
linear, Boolean and genetic classifiers on (a) Persons1, (b) Persons2, (c) Restau-
rant, (d) ACM-DBLP, (e) Amazon-Google and (f) Abt-Buy.]

Fig. 1. Evaluation of algorithms based on Fd . The y-axis shows the different F-
measures while the x-axis stands for different β-values. Note that “FdPseudo” stands
for the pseudo-F-measures achieved by the different classifiers while “FdReal” stands
for the real F-measures.



results suggest that Fd is better suited for EUCLID while Fu and Fd tie for
EAGLE. With respect to runtime, EUCLID requires between 2.5 and 30s and is
therewith between 1 and 2 orders of magnitude faster than EAGLE. Given the
significant difference in runtimes we observed within our experiments, we suggest
[Figure 2 appears here: plots of pseudo and real F-measures based on Fu against
β for the linear, Boolean and genetic classifiers on (a) Persons1, (b) Persons2,
(c) Restaurant, (d) ACM-DBLP, …]

  0 .4                                                                                                  0 .6

  0 .3                                                                                                  0 .5
                                                        F u P s e u d o L in e a r
                                                         F u R e a lL in e a r                          0 .4
  0 .2
                                                          F u P s e u d o B o o le a n
                                                                                                        0 .3
  0 .1                                                     F u R e a lB o o le a n
                                                            F u P s e u d o G e n e tic                 0 .2
  0 .0                                                       F u R e a lG e n e tic
                                                                                                        0 .1
           0 .0                    0 .5                             1 .0                  1 .5   2 .0          0 .0                  0 .5                            1 .0   1 .5   2 .0


                                 (e) Amazon-Google                                                                                              (f) Abt-Buy

Fig. 2. Evaluation of the algorithms based on Fu. The y-axis shows the F-measure while
the x-axis shows the different β-values. Note that “FuPseudo” stands for the pseudo-F-
measures achieved by the different classifiers while “FuReal” stands for the real F-
measures.




that the development of specific algorithms for classifiers of a given type can
lead to algorithms for the discovery of link specifications that are both time-
efficient and highly accurate. The insight we gain is thus a clear answer to our
first question: unsupervised deterministic approaches can perform as well as
unsupervised non-deterministic approaches on both synthetic and real data.


                 Linear           Boolean          Genetic
               Fd      Fu       Fd      Fu       Fd      Fu
Persons1       100     100      99.50   99.50    100     100
Persons2       41.45   40.60    59.12   59.12    33.77   37.04
Restaurant     88.56   88.56    88.56   88.56    88.56   88.56
ACM-DBLP       97.96   97.94    97.46   97.46    97.62   97.71
Amazon-Google  49.08   48.55    39.97   39.97    43.11   42.68
Abt-Buy        48.60   48.60    37.66   37.66    45.03   45.08

Table 1. Maximal F-scores (in %) achieved by the approaches.
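All F-scores reported in Table 1 belong to the standard Fβ family, of which F1 is the special case β = 1; the pseudo-F-measures Fd and Fu used for unsupervised learning are β-parametrized variants defined over unlabelled data earlier in the paper. As a reminder, a minimal sketch of the standard Fβ computation (not the paper's pseudo-variants):

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F_beta score: beta > 1 weights recall higher, beta < 1 precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# beta = 1 yields the familiar F1 score (harmonic mean of precision and recall)
f1 = f_beta(0.8, 0.6)
```

Note that when precision equals recall, Fβ equals that common value for every β, so the choice of β only matters when precision and recall diverge.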




[Figure 3 panels: (a) Pearson, (b) Spearman. Each panel plots the correlation for Fd and Fu against β ∈ [0, 2].]

Fig. 3. Pearson and Spearman correlation of Fuβ and Fdβ with the real F-measure across
different values of β, measured independently of the algorithm used.


    To answer our second question, we first computed the algorithm-independent
correlation of Fd and the F1 measure as well as the correlation of Fu and the F1
measure (see Figure 3). In our experiments, the correlations varied significantly
across the different values of β, yet remained positive and significant in 97.5%
of the cases (the two lowest Pearson correlation scores for Fu were not
significant). This means that optimizing Fu and Fd for one particular
setting of β is a sensible approach towards finding the best classifier for that
particular setting of β. We then computed the Pearson and Spearman correlation
(see Table 2) between Fu , Fd and the F1 measure achieved by the different
approaches across different values of β. Our results were somewhat surprising
as we detected both significant positive and negative correlations across the
different datasets and for both correlations.

                 Linear           Boolean          Genetic
               Fd      Fu       Fd      Fu       Fd      Fu
Persons1       –       –        –       -0.85*   0.84*   -0.1
Persons2       -0.09   0.54*    -0.12   0.43     -0.43   -0.43
Restaurant     0.73*   -0.71*   0.71*   1*       0.84*   0.16
ACM-DBLP       -0.70*  -0.65*   -0.43   -0.97*   0.46*   0.51*
Amazon-Google  -0.95*  0.49     -0.79*  0.07     -0.88*  -0.03
Abt-Buy        -0.9*   0.27     -0.98*  -0.27    0.38    -0.9*

Persons1       –       –        –       -0.87*   0.99*   -0.1
Persons2       0.02    -0.09    0.21    -0.11    -0.56*  -0.56*
Restaurant     0.85*   -0.54*   0.85*   1*       0.91*   0.15
ACM-DBLP       -0.87*  -0.73*   0.05    -0.95*   0.34    0.47*
Amazon-Google  -0.49*  0.68*    -0.27   -0.22    -0.64*  -0.39
Abt-Buy        -0.37   0.59*    -0.82*  -0.47*   -0.13   -0.49*

Table 2. Pearson and Spearman correlation of PFM and real F-measures across dif-
ferent β-values. The top section of the table shows the Pearson correlation while the
bottom section shows the Spearman correlation. The correlations are not defined for
the fields marked with “–” because at least one of the standard deviations involved
is 0. Correlations marked with “*” are significant.

Interestingly, while the number of significant positive and negative correlations
was relatively balanced on the synthetic data sets, negative correlations seemed
to dominate on the real datasets, hinting that Fu and Fd behave differently
depending on the type of data they are confronted with. The negative correlation
values suggest that, to detect the best value of β for a real dataset automatically,
the mapping M which leads to the smallest best value of Fd across the different
values of β should be chosen. This seems rather counter-intuitive and is a
hypothesis that requires more extensive testing on a larger number of real
datasets. Overall, our results show clearly that no single β-value achieves a
maximal F1-measure across our data sets. Still, for real datasets, Fd seems to
perform well for β ∈ [0.8, 1.2]. Stating such an interval for Fu is more difficult,
as the set of β-values that lead to the best mapping is very heterogeneous across
the different datasets. Interestingly, this conclusion diverges from that proposed
in previous work. The answer to our second question is thus that, while the
predictive power of Fu and Fd is sufficient for the results to be used in practical
settings, significant effort still needs to be invested to create a generic
non-parametric PFM that can predict the F1-measure reliably across different
datasets and algorithms.
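The correlation analysis above relies only on the standard Pearson and Spearman coefficients. As an illustrative, self-contained sketch (pure Python, not the evaluation code actually used for the paper), the two can be computed as follows; note that Pearson is undefined whenever one of the standard deviations is 0, which is exactly the situation behind the “–” fields in Table 2:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError when a standard
    deviation is 0, i.e. when the coefficient is undefined."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(xs):
    """Ranks starting at 1, with ties assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))
```

For a monotone but non-linear relation, e.g. xs = [1, 2, 3, 4] against ys = [1, 4, 9, 16], spearman returns 1.0 while pearson stays below 1, since Spearman only compares rank orders.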


5     Conclusion

In this paper, we present a first series of experiments to determine how well
standard classifier models such as linear and Boolean classifiers perform in
comparison to classifiers generated by means of genetic programming in an
unsupervised learning setting based on maximizing a PFM. Overall, our results
indicate that we are still at the beginning of the search towards the “holy grail”
of PFMs. Especially on real data, the maximal PFM achieved by algorithms
across different values of β is often negatively correlated with the value of the
F1. The magnitude of this effect is significantly reduced on synthetic data. This
difference suggests that there is still a need for benchmark generation methods
that allow creating benchmark data sets which reflect real data in a more holis-
tic way. Moreover, our evaluation shows that deterministic classifiers perform
as well as or better than non-deterministic approaches while still bearing the
main advantage of being significantly more time-efficient. Thus, finding more
efficient extensions of EUCLID or similar approaches should allow providing users
of LD frameworks with accurate link specifications within an interactive setting.
Detecting the right parametrization for PFMs, however, remains an unsolved problem.

References
 1. Nello Cristianini and Elisa Ricci. Support vector machines. In Encyclopedia of
    Algorithms. 2008.
 2. Aidan Hogan, Axel Polleres, Jürgen Umbrich, and Antoine Zimmermann. Some en-
    tities are more equal than others: statistical methods to consolidate linked data. In
    Workshop on New Forms of Reasoning for the Semantic Web: Scalable & Dynamic
    (NeFoRS2010), 2010.
 3. R. Isele, A. Jentzsch, and C. Bizer. Efficient Multidimensional Blocking for Link
    Discovery without losing Recall. In WebDB, 2011.
 4. Robert Isele and Christian Bizer. Learning Linkage Rules using Genetic Program-
    ming. In Sixth International Ontology Matching Workshop, 2011.
 5. Hanna Köpcke, Andreas Thor, and Erhard Rahm. Comparative evaluation of entity
    resolution approaches with FEVER. Proc. VLDB Endow., 2(2):1574–1577, 2009.
 6. Axel-Cyrille Ngonga Ngomo. A time-efficient hybrid approach to link discovery.
    In Proceedings of OM@ISWC, 2011.
 7. Axel-Cyrille Ngonga Ngomo and Sören Auer. Limes - a time-efficient approach for
    large-scale link discovery on the web of data. In Proceedings of IJCAI, 2011.
 8. Axel-Cyrille Ngonga Ngomo, Jens Lehmann, Sören Auer, and Konrad Höffner.
    Raven: Active learning of link specifications. In Proceedings of the Ontology Match-
    ing Workshop (co-located with ISWC), 2011.
 9. Axel-Cyrille Ngonga Ngomo and Klaus Lyko. Eagle: Efficient active learning of
    link specifications using genetic programming. In Proceedings of ESWC, 2012.
10. Andriy Nikolov, Mathieu D’Aquin, and Enrico Motta. Unsupervised learning of
    data linking configuration. In Proceedings of ESWC, 2012.
11. George Papadakis, Ekaterini Ioannou, Claudia Niederée, Themis Palpanas, and
    Wolfgang Nejdl. Eliminating the redundancy in blocking-based entity resolution
    methods. In JCDL, 2011.
12. Jennifer Sleeman and Tim Finin. Computing foaf co-reference relations with rules
    and machine learning. In Proceedings of the Third International Workshop on
    Social Data on the Web, 2010.
13. C. Spearman. The proof and measurement of association between two things. The
    American journal of psychology, 15:72–101, 1904.