    Interlinking: Performance Assessment of User Evaluation
               vs. Supervised Learning Approaches

Mofeed Hassan, Jens Lehmann, Axel-Cyrille Ngonga Ngomo
AKSW Research Group, Department of Computer Science, University of Leipzig
mounir@informatik.uni-leipzig.de, lehmann@informatik.uni-leipzig.de, ngonga@informatik.uni-leipzig.de

ABSTRACT
Interlinking knowledge bases is widely recognized as an important but challenging problem. A significant amount of research has been undertaken to provide solutions to this problem with varying degrees of automation and user involvement. In this paper, we present a two-staged experiment for the creation of gold standards that act as benchmarks for several interlinking algorithms. In the first stage, the gold standards are generated through a manual validation process, highlighting the role of users. Using the gold standards obtained from this stage, we assess the performance of human evaluators as well as of supervised interlinking algorithms. We evaluate our approach on several data interlinking tasks with respect to precision, recall and F-measure. Additionally, we perform a qualitative analysis of the types of errors made by humans and machines.

Categories and Subject Descriptors
H.4 [Link Discovery]

General Terms
Data Integration, Enterprise Linked Data

Keywords
Interlinking, Link validation, Gold standard, Manual validation, Performance evaluation

Copyright is held by the author/owner(s).
WWW2015 Workshop: Linked Data on the Web (LDOW2015).

1. INTRODUCTION

Over the last years, the number of knowledge bases published on the Web of Data has grown considerably. According to statistics gathered at the beginning of 2015, the number of published knowledge bases surpassed 3800, providing more than 88 billion triples.^1 In spite of the large number of knowledge bases, the links among them are relatively few, with more than 500 million^2 in 2011, and they are of very different quality [8]. Creating high-quality links across the knowledge bases on the Web of Data thus remains a task of central importance to empower manifold applications on the Web of Data, including federated query processing and cross-ontology question answering. Many algorithms have been proposed and implemented in different interlinking tools to address this task [17, 20, 12, 13]. While these approaches vary w.r.t. several aspects, one of the most important is the degree of user involvement. In [20, 16], interlinking tools are categorized based on the degree of automation, which regulates the amount of user involvement at different levels [17]. In general, the most costly aspect of user involvement in interlinking is the validation of links, also dubbed manual link validation. This is the process in which a user, i.e. a validator or evaluator, specifies whether a link generated by an interlinking tool is correct or incorrect. In frameworks which implement active batch learning to determine links (for example LIMES [9] and SILK [3]), the results of the link validation process are reused to learn presumably better link specifications and thus to generate high-quality links.

^1 http://stats.lod2.eu
^2 http://lod-cloud.net/state/

While several benchmarks have been made available to measure the performance of existing link discovery systems for the Web of Data, several questions pertaining to this task have remained unanswered so far, such as:

   1. Costs of an annotation: The first question pertains to the cost of link discovery. Determining how much it actually costs (w.r.t. time) to validate a link between two knowledge bases enables users to quantify how long it will take them to generate clean links from their knowledge base to other knowledge bases.

   2. When should a tool be used: Human annotators are able to detect links between knowledge bases at a small scale. On the other hand, machines need a significant number of examples and clear patterns in the underlying data to be able to detect high-quality links between knowledge bases. Hence, determining the knowledge base sizes on which machines should be used for link discovery is of utmost practical importance.

   3. Performance of machine learning on small tasks: While it is well established that machine-learning tools perform well on knowledge bases that contain hundreds of resources or more, many of the knowledge bases on the Web of Data are small and pertain to a dedicated domain. Providing guidelines on when to use machine-learning algorithms to link these knowledge bases to other knowledge bases can improve the effectiveness of linking on the Web of Data.

Consequently, we propose an experiment to investigate the effect of user intervention in dataset interlinking on small knowledge bases. We study the effort needed for manual validation using a quantitative approach. Furthermore, we compare the performance of a human validator and of supervised interlinking approaches to find a break-even point beyond which machine-learning techniques should be used. Note that we intentionally limit ourselves to small numbers of resources in our experiments, as (1) experiments on large numbers of resources have already established that machines perform well and (2) such experiments would be intractable for human users due to the long time and great effort required.

The core contributions of this paper are:

   • An evaluation of the performance of human evaluators on the interlinking task.
   • A comparison of human and machine performance on the interlinking task for small knowledge bases.
   • A methodology for designing and executing such experiments.
   • A gold standard for three small interlinking tasks.

The rest of the paper is organized as follows. In Section 2, a short overview of interlinking approaches is provided. Section 3 describes our experimental approach. The experiment setup and preparation are described in Section 4. In Section 5, the results of our experiment are presented. A discussion of the results follows in Section 6, and Section 7 concludes the paper.

2. BACKGROUND

2.1 Interlinking Tools

Due to the strong increase in the number of datasets published on the Web and the rising number of interlinks required among them, many interlinking tools based on different algorithms have been proposed. Several surveys provide comparative studies of these tools and show the major differences among them [20, 16]. Interlinking tools differ in many aspects; two of these aspects are (i) domain dependency and (ii) automation.

With respect to domain dependency, an interlinking tool is classified as domain-dependent when it interlinks two datasets in a specific domain. From the automation perspective, interlinking tools are categorized into (a) automated tools and (b) semi-automated tools, based on the degree of the user's contribution to the interlinking process. In semi-automated tools, user intervention is important for the linking process in several respects, such as setting the link specifications, aligning ontologies, providing positive and negative examples for tools based on supervised learning algorithms, and validating the final generated links.

One of these interlinking tools is RKB-CRS [7]. It is a domain-dependent tool focusing on the universities and publications domains. RKB-CRS is a semi-automated tool whose process depends on URIs provided through a Java program developed by the user; this has to be done for each dataset to be interlinked. The tool applies a string matching technique to find URI equivalences, which can be represented as owl:sameAs relationships.

Another domain-dependent tool is LD-Mapper [15]. It focuses on datasets in the music domain and provides an approach that depends on string similarity while also considering the similarity of the resources' neighbours.

Knofuss [14] is an automatic and domain-independent tool that focuses on merging two datasets, each described by an ontology. An alignment ontology is additionally given by the user in case of ontology heterogeneity. The tool uses two contexts: (i) an application context, which is provided by the datasets' ontologies, and (ii) an object context model, which points out which properties are needed for the matching process. Matching is performed through string matching and adaptive learning techniques. Knofuss operates on local copies of the datasets.

An example of a semi-automated tool that works on local copies of the datasets is RDF-AI [18]. It consists of five linking phases: (i) preprocessing, (ii) matching, (iii) fusion, (iv) interlinking and (v) post-processing, each described by an XML file. The input includes the alignment method and the dataset structure. The matching techniques used are string matching and word relation matching. RDF-AI provides a merged dataset or an entity correspondence list as output.

SILK [3] is a semi-automated, domain-independent tool. Unlike the aforementioned tools, it accesses the datasets through SPARQL endpoints. The user specifies the parameters of the linking process using a declarative language dubbed the Silk Link Specification Language (Silk-SLS). Silk-SLS allows the user to focus on specific types of resources. It supports the use of different matching techniques such as string matching, date similarities and numerical similarities. Set operators like MAX, AVG and MIN combine more than one similarity metric. Links are generated if the similarity of two resources exceeds a previously specified threshold.

LIMES [9] is an interlinking tool belonging to the same category as SILK, being semi-automated and domain-independent. It works as a framework for multiple interlinking algorithms, both unsupervised and supervised. For the unsupervised algorithms, the user provides a link specification, which defines the classes, properties and metrics used for interlinking. The supervised algorithms, in turn, are implemented by applying genetic learning combined with active learning approaches. The target of these algorithms is to find the best classification of candidate links using a minimum amount of training data. Minimizing the training data is achieved by finding the most informative data to be reviewed (labelled) by an oracle. Examples of these algorithms are EAGLE, RAVEN, COALA and EUCLID [11, 12, 13].
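
The principle shared by SILK and LIMES, namely combining atomic similarity metrics through operators such as MIN, MAX or AVG and accepting resource pairs whose combined score exceeds a threshold, can be illustrated by the following sketch. It is plain Python over simplified data structures and reproduces neither the Silk-SLS syntax nor the LIMES API; the metric, the operator and the threshold are illustrative assumptions.

    # Minimal sketch of threshold-based link generation: atomic similarity
    # scores are combined with a set operator (here MAX) and compared
    # against an acceptance threshold.

    def trigram_similarity(a, b):
        """Dice coefficient over character trigrams, in [0, 1]."""
        grams = lambda s: {s[i:i + 3] for i in range(max(len(s) - 2, 1))}
        ga, gb = grams(a.lower()), grams(b.lower())
        return 2 * len(ga & gb) / (len(ga) + len(gb))

    def generate_links(sources, targets, threshold=0.9):
        """sources/targets: lists of dicts with 'uri', 'label' and 'name' keys."""
        links = []
        for s in sources:
            for t in targets:
                # MAX operator over two atomic string metrics
                score = max(trigram_similarity(s["label"], t["label"]),
                            trigram_similarity(s["name"], t["name"]))
                if score >= threshold:
                    links.append((s["uri"], t["uri"], "owl:sameAs"))
        return links
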
RAVEN and EAGLE [10, 11] are two interlinking algorithms that rely on active learning and genetic programming methods with supervised algorithms. As the authors state, these algorithms implement time-efficient matching techniques to reduce the number of comparisons between instance pairs. COALA [12] is combined with EAGLE to additionally take the intra- and inter-correlation between learning examples into account.
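
The active learning idea behind these algorithms can be summarized as a query-label-refit loop: in each iteration the learner asks an oracle (here, a human validator) to label only the most informative candidate links and then re-learns the link specification. The sketch below is illustrative only; the function arguments stand in for the genetic learner, the query strategy and the validator and are not the actual EAGLE/COALA interfaces.

    # Sketch of an active-learning loop for link specification learning.
    # fit_specification, most_informative and ask_oracle are placeholders
    # supplied by the caller.

    def active_learning(candidates, fit_specification, most_informative,
                        ask_oracle, batch_size=10, iterations=5):
        labelled = {}              # link -> True (correct) / False (incorrect)
        spec = None
        for _ in range(iterations):
            # Pick the links whose labels would teach the learner the most;
            # a correlation-aware strategy (as in COALA) would additionally
            # weight this choice by similarities among the examples.
            queries = most_informative(candidates, spec, labelled, batch_size)
            for link in queries:
                labelled[link] = ask_oracle(link)
            # Re-learn the link specification from all labels gathered so far.
            spec = fit_specification(labelled)
        return spec, labelled
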
LIMES was used in our work for several reasons. One reason is its simplicity: it uses a simple configuration file to perform interlinking with any of the contained algorithms, and it supports working on SPARQL endpoints or dump files. The implemented algorithms are another strong point of LIMES, as they supplied our work with different interlinking algorithms in the same pool. Three algorithms, EAGLE, COALA and EUCLID, are used in our work and are dubbed Genetic Active Learning (GAL), Genetic Active Learning with Correlation (GCAL) and Genetic Batch Learning (GBL), respectively.

2.2 Manual Link Validation

Validating the generated links yields two beneficial outputs. First, it provides positive and negative examples for supervised learning algorithms. Second, it creates gold standards that can be used for the assessment of tools and reviewers on other, similar linking tasks. LATC^3 is one of the efforts in generating reviewed link samples to achieve these two benefits.

^3 http://latc-project.eu/

In [6], the authors state that there is a need for more work on generating benchmarks for interlinking algorithms. They provide a summary of different benchmarking approaches, exposing their strengths and weaknesses. One of these approaches is the Ontology Alignment Evaluation Initiative (OAEI), which offers two tracks for evaluating ontology matching algorithms and instance matching algorithms using common benchmarks. Other approaches that rendered benchmarks are Yatskevich et al. [21], Alexe et al. [1], and SWING [6]. According to [4], the criticism of many generated benchmarks centres on three basic points:

   • the use of real data,
   • flexibility of benchmark generation, and
   • scalability and correctness.

Recently, the role of crowdsourcing in link validation and gold standard generation has increased. Crowdsourcing is a new trend for user involvement in the different phases of publishing and linking data. In [19], an analytical study of interlinking and user intervention is presented. It analyses which phases of the interlinking process can be amenable to crowdsourcing, and a general architecture is proposed to integrate interlinking frameworks with crowdsourcing (Amazon Mechanical Turk, MTurk) to enhance the interlinking process, including link validation. In [5], a case study was introduced to find out the problems that users face in ontology matching. This study is one of the few observational studies about user interactions with one of the linking process phases; it focuses on the cognitive process performed by the users to find mappings.

To the best of our knowledge, there is no such observational study about the problems users face while validating dataset interlinks, and no quantifying experiment measuring the effort spent by users in validating links and generating gold standards.

3. EXPERIMENTAL APPROACH

Based on the motivations explained above, we designed a two-stage experiment. The first stage consists of two steps. The first step is performing interlinking between different datasets, representing different domains, using an unsupervised learning algorithm. In the second step, the resulting links are manually validated by human validators. The validators perform this step first individually and then as a group, in which unsure decisions about links are reviewed by all validators. The resulting links are then considered to be a gold standard for their interlinking tasks. Later, we discuss problems in manual link validation by single evaluators and by groups.

In the second stage, different supervised algorithms are applied to the same interlinking tasks. Using the gold standard generated in the first stage of the experiment as a benchmark, the performance of the used approaches can be compared to each other and to humans.

In our evaluation we use real data for generating benchmarks: they are generated from actual data forming three interlinking tasks in different domains. A size limitation was enforced to ease the validation process. We use this benchmark to evaluate different interlinking algorithms in terms of precision, recall and F-measure as assessment measures [2].

An additional qualitative analysis is performed to detect the common problems faced by humans during the link validation process. This analysis focuses on the problems reported by the validators which affect their judgement quality and the difficulty of the process.
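
For reference, the quantitative measures used throughout the paper are computed by comparing the set of links produced by an algorithm (or accepted by a validator) with the gold-standard links. A straightforward sketch, with links represented as (source URI, target URI) pairs:

    # Precision, recall and F-measure of predicted links against a gold standard.

    def precision_recall_f1(predicted, gold):
        true_positives = len(predicted & gold)
        precision = true_positives / len(predicted) if predicted else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
        return precision, recall, f1

    # Example: 2 of 3 predicted links are correct, 2 of 4 gold links are found.
    print(precision_recall_f1({("a", "x"), ("b", "y"), ("c", "z")},
                              {("a", "x"), ("b", "y"), ("d", "w"), ("e", "v")}))
    # -> (0.666..., 0.5, 0.571...)
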
4. EXPERIMENTAL SETUP

In our experiment, we applied linking to six different datasets using LIMES [9]. These datasets form three different linking tasks. Each task corresponds to a specific domain that differs in nature from the other tasks and varies in its familiarity to the reviewers, which affects the correctness of the reviewers' decisions and their effort in different ways. The tasks are:

   • Task 1 represents the geographic domain. In this domain, the basic descriptive information is quite specific, as many locations are described by precise geometric measures.
     In this task, links between the DBpedia and LinkedGeoData datasets needed to be set. Both datasets contain geographic information, for example latitude and longitude, for locations such as cities, states, and countries. We restricted our linking to cities whose labels start with the letter 'A'. This restriction on the labels was made to obtain a reasonable number of links for the evaluators in the first stage of our experiment and to simplify the calculations in the second stage. It also provides a random sample of interlinks with the ability to tune the number of retrieved online instances.
     The label, latitude, and longitude properties are selected for the similarity metrics. The similarity metrics used are trigrams and Euclidean distance, combined in a compound function. Such a compound function combines atomic metrics such as trigrams, Euclidean and Levenshtein using metric operators such as MIN or MAX. Table 1 shows the basic information of this linking task, where 'a' represents the rdf:type property.

     Datasets                DBpedia              LinkedGeoData
     Restrictions            a dbpedia-owl:City   a lgdo:City
                                                  rdfs:label starts with 'A'
     Similarity properties   rdfs:label           rdfs:label
                             wgs84:lat            wgs84:lat
                             wgs84:long           wgs84:long
     Similarity metrics      trigrams
                             euclidean

     Table 1: Link specification for task 1
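
For illustration, the compound measure of Table 1 can be approximated as follows. The sketch uses a generic string similarity as a stand-in for the trigram metric and an assumed normalization of the Euclidean distance between the coordinate pairs; it is not the LIMES implementation, and the choice of the MIN operator is an example.

    import math
    from difflib import SequenceMatcher

    def string_similarity(a, b):
        """Stand-in for a trigram-style string metric, in [0, 1]."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def euclidean_similarity(p, q):
        """Map the Euclidean distance between two (lat, long) pairs to (0, 1]."""
        return 1.0 / (1.0 + math.hypot(p[0] - q[0], p[1] - q[1]))

    def task1_score(city_a, city_b):
        """city_a/city_b: dicts with 'label', 'lat' and 'long' keys."""
        label_sim = string_similarity(city_a["label"], city_b["label"])
        coord_sim = euclidean_similarity((city_a["lat"], city_a["long"]),
                                         (city_b["lat"], city_b["long"]))
        # MIN operator: both the labels and the coordinates have to agree.
        return min(label_sim, coord_sim)
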
   • Task 2 represents the movies domain. This domain is interesting because some of the information about the movies, such as the movie name, can be tricky. Within a movie series, it can be confusing for the validator to make a decision, as the names of the movies are close to each other and the movies share the same actors and even the same director. Deciding then requires additional information, such as the movie's release date, which is not always available.
     In the second task, we performed linking between the DBpedia and LinkedMDB datasets, which contain information about movies. Both have large amounts of information on movies, such as their names, directors, release dates, etc. The triples are restricted to movies with release dates from the year 1990 onwards, which provides a reasonable number of links.
     The similarity function applied for linking is a compound function based on the trigrams metric. This function uses properties such as label, director and release date. Table 2 shows the basic information of this linking task, where 'a' represents the rdf:type property.

     Datasets                DBpedia              LinkedMDB
     Restrictions            a dbpedia-owl:Film   a linkedmdb:film
                                                  initial release date > "1990-01-01"
     Similarity properties   label                label
                             director             director
                             releaseDate          initial release date
     Similarity metrics      trigrams

     Table 2: Link specification for task 2

   • Task 3 represents the drugs domain, in which reviewers have to check chemical and medical information about drugs.
     The third task generated links between the DBpedia and DrugBank datasets. We selected drugs with names starting with the letter 'A'. Further, a compound similarity function involving the Levenshtein similarity metric is used; this function utilizes the label property. Table 3 shows the basic information of this linking task, where 'a' represents the rdf:type property.

     Datasets                DBpedia              DrugBank
     Restrictions            a dbpedia-owl:Drug   a drug:drugs
                                                  rdfs:label starts with 'A'
     Similarity properties   rdfs:label           rdfs:label
                             rdfs:label           rdfs:label
                             rdfs:label           drug:genericName
     Similarity metrics      levenshtein

     Table 3: Link specification for task 3

The aim of the second stage is to investigate whether using machine learning approaches for linking can outperform humans. Our experiment pursues this aim by using three different supervised learning algorithms, EAGLE, COALA and EUCLID [11, 12, 13], all implemented in the LIMES framework [9]. The interlinking approaches are given different percentages of positive and negative examples for each single task. The examples are provided in increasing percentages of 10%, 33% and 50% of the total examples resulting from the first stage of each task. As these examples play the role of the oracle in the supervised learning approaches, the increasing percentages should enhance the algorithms' performance and let it converge towards a score either above or somewhat close to single human performance. The three approaches operate on the same task specifications as in the first stage and on the same datasets.

Link evaluation is performed with an evaluation tool with a graphical user interface dubbed Evalink (see Figure 1). The reviewer specifies the endpoints where the triples of the source and target links are available. The tool enables the evaluators to load the links to be reviewed and retrieves their property information from the specified endpoints. The reviewer can check the values of the correlated properties and give a decision: either 'Correct', 'Incorrect' or 'Unsure'. The time taken for each decision is stored in milliseconds. The source code is available at https://github.com/AKSW/Evalink.
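
The retrieval step performed by Evalink, fetching the properties of a linked resource from a SPARQL endpoint so that the reviewer can compare both sides of a candidate link, can be sketched as follows. This is a simplified illustration rather than the Evalink source code; the endpoint URL is an example, and the 'format' parameter is an assumption that holds for Virtuoso-backed endpoints such as DBpedia's (other endpoints may require an Accept header instead).

    import json
    import urllib.parse
    import urllib.request

    def fetch_properties(endpoint, resource_uri):
        """Return all (predicate, object) pairs of a resource from an endpoint."""
        query = "SELECT ?p ?o WHERE { <%s> ?p ?o }" % resource_uri
        url = endpoint + "?" + urllib.parse.urlencode(
            {"query": query, "format": "application/sparql-results+json"})
        with urllib.request.urlopen(url) as response:
            results = json.load(response)
        return [(b["p"]["value"], b["o"]["value"])
                for b in results["results"]["bindings"]]

    # A reviewer would then compare the two property lists side by side, e.g.
    # fetch_properties("http://dbpedia.org/sparql",
    #                  "http://dbpedia.org/resource/Leipzig")
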
5. RESULTS

Based on the previously described specifications, the experiment was carried out in two stages. The first stage aims to generate a gold standard for each task. A set of five independent reviewers evaluated the generated links such that each gold standard was derived from a minimum four-out-of-five agreement on the decision for each link. In order to express the total effort needed by a reviewer to provide a gold standard, we considered the time for deciding whether a link is correct or incorrect as a measure. This time is measured in milliseconds. In the experiment, the average times for each task to be evaluated by the users, shown in Table 4, are as follows: 18773764 milliseconds for task 1, 16628607 milliseconds for task 2, and 18777477 milliseconds for task 3. Overall, approximately 15 hours of evaluation effort have gone into our experiment per participant.

                  Task 1        Task 2        Task 3
Average time      18773764      16628607      18777477

Table 4: Average times of the evaluation processes for each task (in milliseconds)

Figure 1 (screenshot omitted): Evalink tool to evaluate links. The user selects the task to be evaluated and specifies the proper endpoints to access the triples. The URIs of the task are loaded sequentially and their retrieved information is displayed. By selecting a property in the source dataset, the corresponding property is highlighted on the target dataset's side. By pressing the appropriate button, the decision on the link is specified as either "Correct", "Incorrect", or "Unsure".

A more detailed way to express the effort provided by a user is the average time needed to evaluate a single link within a single task. Table 5 shows these average times. It is evident that there are significant differences between users and that, overall, the evaluation of a large number of links is a time-consuming process.

            Task 1    Task 2    Task 3
user 1      36803     22974     10223
user 2      21465     18821     20358
user 3      12299     39363      9802
user 4      10922     11329     34553
user 5      38853     43811     44664

Table 5: Average times for evaluating a single link within a task, by user (in milliseconds)

An assessment of the link evaluations performed by the users was also carried out. Out of 535 links, 502 links had definite (though possibly differing) decisions from each single user; 32 links did not have enough information to reach a decision and were marked as unsure. Gold standards were created by giving each link a final decision using inter-rater agreement. The assessment was then done by comparing each user's evaluation for each task to the gold standard. Only a small number of decisions made by users were incorrect compared to the gold standard. Details of the assessment are given in Table 6 and Table 7.
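
The derivation of a gold standard from the individual judgements can be stated compactly. The sketch below is illustrative and assumes the four-out-of-five agreement rule described above; links without sufficient agreement remain unresolved and are re-discussed by the group.

    # Build a gold standard by inter-rater agreement: a link is accepted as
    # correct (or rejected as incorrect) only if at least min_agree reviewers
    # gave that decision.

    def build_gold_standard(judgements, min_agree=4):
        """judgements: dict mapping link -> list of 'correct'/'incorrect'/'unsure'."""
        gold, unresolved = {}, []
        for link, votes in judgements.items():
            if votes.count("correct") >= min_agree:
                gold[link] = True
            elif votes.count("incorrect") >= min_agree:
                gold[link] = False
            else:
                unresolved.append(link)   # to be reviewed jointly by all validators
        return gold, unresolved
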
            Correct     Incorrect   Total
user 1      496          6          502
user 2      492         10          502
user 3      481         21          502
user 4      487         15          502
user 5      481         21          502

Table 6: Assessment of the users' evaluations over all evaluated links

The second stage of our experiment consisted of performing the linking between the specified datasets using different supervised learning algorithms and assessing their performance against the generated gold standards in terms of precision, recall and F-measure. LIMES [9] is an interlinking tool that is also a framework with different implemented interlinking algorithms and learning approaches. EAGLE, COALA and EUCLID are used to provide sets of interlinks that are compared to the gold standard as described above. The resulting comparisons are shown in Tables 8, 9 and 10 in terms of precision, recall and F-measure, respectively.

The cost of achieving a specific F-measure w.r.t. the percentage of training data is calculated in terms of time. Using the average times to validate a link in each task, the times for the different percentages are computed. Figures 2, 3 and 4 plot the F-measure against the afforded cost in minutes for the three tasks. The figures show the overall supremacy of GCAL over the other approaches and even over human performance. GBL shows the worst behaviour among the supervised learning approaches. Task 3 was the least costly one, which is explained by the high F-measure values achieved by all algorithms.
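
The cost calculation underlying Figures 2-4 follows directly from the measured per-link validation times; the sketch below shows it with assumed example numbers (they are not taken from the tables above).

    # Labelling cost (in minutes) of a training set that uses a given fraction
    # of the gold standard, given the average validation time per link.

    def labelling_cost_minutes(n_links, fraction, avg_ms_per_link):
        return n_links * fraction * avg_ms_per_link / 60000.0

    # Assumed example: 166 links in a task, 33% used for training, an average
    # of 25000 ms per decision -> roughly 22.8 minutes of labelling effort.
    print(labelling_cost_minutes(166, 0.33, 25000))
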
Figure 2 (plot omitted; F-Measure vs. cost in minutes for GAL, GCAL, GBL and Human): F-Measure results relative to the learning cost of each approach in terms of time (Task 1)

Tasks                   Task 1                                 Task 2                                 Task 3
Measures   Precision    Recall   F-Measure      Precision      Recall     F-Measure       Precision   Recall   F-Measure
User 1       0.81        0.98      0.89           0.98           1          0.99            0.97       0.99      0.98
User 2       0.83         1        0.91           0.93          0.94        0.93            0.96       0.98      0.97
User 3       0.74        0.9       0.81           0.97          0.98        0.98            0.94       0.96      0.95
User 4       0.81        0.98      0.88           0.95          0.97        0.96            0.93       0.95      0.94
User 5       0.82        0.99       0.9           0.91          0.93        0.92            0.91       0.93      0.92

     Table 7: Precision, Recall and F-Measure results achieved by every user in each task.




             Tasks                   Task 1                    Task 2                        Task 3
             percentages     10%      33%      50%     10%      33%       50%      10%        33%     50%
             GAL             0.12     0.32      0.8    0.63     0.33       0.32    0.078      0.47    0.79
             GCAL            0.81     0.76      0.8    0.69     0.27      0.056     0.88      0.54    0.88
             GBL             0.04     0.77      0.4     0.8    0.007      0.047     0.13      0.29    0.29

                       Table 8: Precision results of supervised learning approaches

                The supervised learning approaches are GAL, GCAL and GBL. For each, performance is
                    measured for different percentages of training data: 10%, 33% and 50%.




              Tasks                   Task 1                     Task 2                     Task 3
              percentages    10%       33%      50%     10%       33%       50%    10%       33%      50%
              GAL            0.398     0.35     0.35    0.28      0.71      0.92   0.17       0.5     0.53
              GCAL            0.35     0.29     0.37    0.19      0.82      0.85   0.23      0.53     0.69
              GBL             0.24     0.27     0.18    0.07      0.43      0.79   0.43      0.74     0.98

                       Table 9: Recall results of supervised learning approaches.

                The supervised learning approaches are GAL, GCAL and GBL. For each, performance is
                    measured for different percentages of training data: 10%, 33% and 50%.




              Tasks                  Task 1                     Task 2                      Task 3
              percentages     10%     33%      50%     10%       33%       50%     10%       33%      50%
              GAL              0.4     0.3      0.5     0.4       0.4       0.5     0.1       0.5      0.6
              GCAL            0.49    0.43      0.5    0.31       0.4       0.1    0.37      0.53     0.78
              GBL              0.1     0.4      0.2     0.1        0        0.1     0.2       0.4      0.5

                  Table 10: F-Measure results of supervised learning approaches.

                The supervised learning approaches are GAL, GCAL and GBL. For each, performance is
                    measured for different percentages of training data: 10%, 33% and 50%.

Figure 3 (plot omitted; F-Measure vs. cost in minutes for GAL, GCAL, GBL and Human): F-Measure results relative to the learning cost of each approach in terms of time (Task 2)

Figure 4 (plot omitted; F-Measure vs. cost in minutes for GAL, GCAL, GBL and Human): F-Measure results relative to the learning cost of each approach in terms of time (Task 3)

6. DISCUSSION

Our discussion of the experiment is divided into two parts. The first part concerns the results and observations of the user evaluations. The second part analyzes the performance of the learning algorithms.

The user evaluation aims to generate links as a gold standard to be used as a benchmark for link evaluators and interlinking algorithms. Many observations were recorded while the users evaluated the links; they reveal the major factors that influence the evaluation process. These factors include: (i) the availability of entity descriptions, which comprises endpoint availability and the amount of available information, (ii) domain familiarity and (iii) information ambiguity.

Endpoint availability was occasionally problematic. As the evaluation process required querying the endpoints for the triple information, endpoints that were down or not working cost additional time. This imposed the need to cache the appropriate data, which creates an overhead. This overhead was reasonable for these small datasets, but it will grow for large datasets. Still, having an active endpoint remains necessary because the information is continuously updated.

Once the information is available, the second point, concerning its amount, comes into focus. Although the number of links and their related information were relatively small, the manual evaluation was very tedious and exhausting for the users. Supporting the evaluation with the Evalink tool removed unnecessary effort, such as loading the link information and aligning it, and concentrated the whole evaluation effort on the time the user needs to make a decision. The manual setup of Evalink, however, generated additional configuration effort; the tool should be further extended towards intelligent property mapping.

The help given by Evalink also had an effect with respect to domain familiarity. With a suitable evaluation tool that maps the related properties between the two datasets, domain familiarity did not affect the evaluation much: finding the right properties and comparing their values diminished the difficulties that might arise from evaluating an unfamiliar domain. Information about a resource was in some cases ambiguous and thus did not allow a decision to be made, and in other cases so much information was available that it confused the users. As an example from the drugs domain, a URI offered plenty of information, such as secondaryAccessionNumber and predictedWaterSolubility, which is not crucial for the decision-making process. Both cases caused significant time delays for a subset of the judgements that were made. Filtering the information to avoid unnecessary properties and to provide the crucial ones would add great value to the evaluation process in terms of time and decision correctness. In the case of missing information, it is important to create a confidence measure for the decisions. Building strategies for information integration with other related datasets to compensate for absent information can also help in this case. Measuring the time to generate the gold standards for each task (Table 4 and Table 6), we find that there were no significant differences among the average times of the tasks. This shows how link validation is improved by the availability and clarity of the important properties characterizing the linking task for the validators. It indicates that with trivial domain knowledge and appropriate properties to compare, the users perform the evaluation with an F-measure above 0.8 (Table 7). In these tables we can see that almost all users achieve reasonably high F-measure values in all tasks; the achieved F-measure scores range from 0.81 to 0.99. These ratios are used in the comparison between user performance and machine (algorithm) performance.

The results of the second stage are presented in Figures 2, 3 and 4. We can see that, in most cases, the machine learning algorithms outperform the human in terms of F-measure when the cost of providing the training set is taken into account. GAL, in tasks 1 and 3, performs better than a human for up to 50% of the gold standard used as training data. In task 2, on the other hand, although GAL achieved better results than an average human, its F-measure for lower costs remains almost stable around 0.4, so increasing the labelling effort for the training data provided no significant improvement.
Even in the cases where it improved with more training data, its ultimate performance fell short of human performance in the long run. GCAL and GBL both recorded increasing results on task 1 and task 3 with more training data, while performing worst on task 2. GAL and GBL learn from a portion of the data space; if this portion is a good representative of the data distribution, performance increases. GCAL considers the correlation between training examples: it classifies the training data based on the inter- and intra-correlation, which is calculated from the similarities between the training examples. We conclude from the results of the three tasks that the links of task 1 and task 3, which formed the training data, are good representatives of the geographic and drug datasets, while the links of task 2 are randomly distributed and apparently not good representatives of the movies task. We can further infer that, for small datasets, machine learning algorithms outperform humans when well-representative training data is available. If that is not the case, humans perform better in the long run.

7. CONCLUSION

In our experiment, we emphasized the factors affecting the evaluators in their link evaluations. These factors include: (i) endpoint availability, (ii) the amount of available information, (iii) domain familiarity and (iv) information ambiguity. We quantitatively determined the human effort required for interlinking, in terms of time, for different datasets. The experiment showed how much training data is sufficient to act as a representative of the interlinked datasets. It also revealed experimentally, for small datasets, how strongly the training data's quality as a representative of the dataset affects the machine learning approaches, to the degree that humans may exceed their accuracy.

8. REFERENCES

 [1] B. Alexe, W. C. Tan, and Y. Velegrakis. STBenchmark: towards a benchmark for mapping systems. PVLDB, 1(1):230-244, 2008.
 [2] S. Araujo, J. Hidders, D. Schwabe, and A. P. de Vries. SERIMI - resource description similarity, RDF instance matching and interlinking. CoRR, abs/1107.1104, 2011.
 [3] C. Bizer, J. Volz, G. Kobilarov, and M. Gaedke. Silk - a link discovery framework for the web of data. In 18th International World Wide Web Conference, April 2009.
 [4] J. Euzenat, M.-E. Rosoiu, and C. T. dos Santos. Ontology matching benchmarks: Generation, stability, and discriminability. J. Web Sem., 21:30-48, 2013.
 [5] S. Falconer and M.-A. Storey. A cognitive support framework for ontology mapping. pages 114-127, 2008.
 [6] A. Ferrara, S. Montanelli, J. Noessner, and H. Stuckenschmidt. Benchmarking matching applications on the semantic web. In ESWC (2), pages 108-122, 2011.
 [7] A. Jaffri, H. Glaser, and I. Millard. Managing URI synonymity to enable consistent reference on the semantic web. http://eprints.ecs.soton.ac.uk/15614/, 2008.
 [8] M. Nentwig, T. Soru, A.-C. N. Ngomo, and E. Rahm. LinkLion: A link repository for the web of data.
 [9] A.-C. Ngonga Ngomo and S. Auer. LIMES - a time-efficient approach for large-scale link discovery on the web of data. In Proceedings of IJCAI, 2011.
[10] A.-C. Ngonga Ngomo, J. Lehmann, S. Auer, and K. Höffner. RAVEN - active learning of link specifications. Technical report, 2012.
[11] A.-C. Ngonga Ngomo and K. Lyko. EAGLE: Efficient active learning of link specifications using genetic programming. In Proceedings of ESWC, 2012.
[12] A.-C. Ngonga Ngomo, K. Lyko, and V. Christen. COALA - correlation-aware active learning of link specifications. In Proceedings of ESWC, 2013.
[13] A. Nikolov, M. D'Aquin, and E. Motta. Unsupervised learning of data linking configuration. In Proceedings of ESWC, 2012.
[14] A. Nikolov, V. S. Uren, E. Motta, and A. N. D. Roeck. Handling instance coreferencing in the KnoFuss architecture. In P. Bouquet, H. Halpin, H. Stoermer, and G. Tummarello, editors, IRSW, volume 422 of CEUR Workshop Proceedings. CEUR-WS.org, 2008.
[15] Y. Raimond, C. Sutton, and M. Sandler. Automatic interlinking of music datasets on the semantic web. 2008.
[16] F. Scharffe and J. Euzenat. MeLinDa: an interlinking framework for the web of data, 2011.
[17] F. Scharffe, Z. Fan, A. Ferrara, H. Khrouf, and A. Nikolov. Methods for automated dataset interlinking. Technical Report 4.1, Datalift, 2011.
[18] F. Scharffe, Y. Liu, and C. Zhou. RDF-AI: an architecture for RDF datasets matching, fusion and interlink. In Proc. IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), Pasadena (CA US), 2009.
[19] E. Simperl, S. Wölger, S. Thaler, B. Norton, and T. Bürger. Combining human and computation intelligence: the case of data interlinking tools. IJMSO, 7(2):77-92, 2012.
[20] S. Wolger, K. Siorapes, T. Bürger, E. Simperl, S. Thaler, and C. Hofer. A survey on data interlinking methods. Technical report, Semantic Technology Institute (STI), February 2011.
[21] M. Yatskevich, F. Giunchiglia, and P. Avesani. A large scale dataset for the evaluation of matching systems. Technical report, DISI, University of Trento, 2006.