Results of the Ontology Alignment Evaluation Initiative 2017*

Manel Achichi1, Michelle Cheatham2, Zlatan Dragisic3, Jérôme Euzenat4, Daniel Faria5, Alfio Ferrara6, Giorgos Flouris7, Irini Fundulaki7, Ian Harrow8, Valentina Ivanova3, Ernesto Jiménez-Ruiz9, Kristian Kolthoff10, Elena Kuss10, Patrick Lambrix3, Henrik Leopold11, Huanyu Li3, Christian Meilicke10, Majid Mohammadi12, Stefano Montanelli6, Catia Pesquita13, Tzanina Saveta7, Pavel Shvaiko14, Andrea Splendiani8, Heiner Stuckenschmidt10, Elodie Thiéblin15, Konstantin Todorov1, Cássia Trojahn15, and Ondřej Zamazal16

1 LIRMM/University of Montpellier, France, lastname@lirmm.fr
2 Data Semantics (DaSe) Laboratory, Wright State University, USA, michelle.cheatham@wright.edu
3 Linköping University & Swedish e-Science Research Center, Linköping, Sweden, {zlatan.dragisic,valentina.ivanova,patrick.lambrix,huanyu.li}@liu.se
4 INRIA & Univ. Grenoble Alpes, Grenoble, France, Jerome.Euzenat@inria.fr
5 Instituto Gulbenkian de Ciência, Lisbon, Portugal, dfaria@igc.gulbenkian.pt
6 Università degli studi di Milano, Italy, {alfio.ferrara,stefano.montanelli}@unimi.it
7 Institute of Computer Science-FORTH, Heraklion, Greece, {jsaveta,fgeo,fundul}@ics.forth.gr
8 Pistoia Alliance Inc., USA, {ian.harrow,andrea.splendiani}@pistoiaalliance.org
9 Department of Informatics, University of Oslo, Norway, ernestoj@ifi.uio.no
10 University of Mannheim, Germany, {christian,elena,heiner}@informatik.uni-mannheim.de
11 Vrije Universiteit Amsterdam, Netherlands, h.leopold@vu.nl
12 Faculty of Technology, Policy, and Management, Technical University of Delft, Netherlands, m.mohammadi@tudelft.nl
13 LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal, cpesquita@di.fc.ul.pt
14 TasLab, Informatica Trentina, Trento, Italy, pavel.shvaiko@infotn.it
15 IRIT & Université Toulouse II, Toulouse, France, {cassia.trojahn}@irit.fr
16 University of Economics, Prague, Czech Republic, ondrej.zamazal@vse.cz

Abstract.
Ontology matching consists of finding correspondences between semantically related entities of different ontologies. The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases. These test cases can be based on ontologies of different levels of complexity (from simple thesauri to expressive OWL ontologies) and use different evaluation modalities (e.g., blind evaluation, open evaluation, or consensus). The OAEI 2017 campaign offered 9 tracks with 23 test cases, and was attended by 21 participants. This paper is an overall presentation of that campaign.

* Note that the only official results of the campaign are on the OAEI web site.

1 Introduction

The Ontology Alignment Evaluation Initiative1 (OAEI) is a coordinated international initiative, which organizes the evaluation of an increasing number of ontology matching systems [20, 22]. The main goal of the OAEI is to compare systems and algorithms openly and on the same basis, in order to allow anyone to draw conclusions about the best matching strategies. Furthermore, our ambition is that, from such evaluations, tool developers can improve their systems.

The first two events were organized in 2004: (i) the Information Interpretation and Integration Conference (I3CON) held at the NIST Performance Metrics for Intelligent Systems (PerMIS) workshop and (ii) the Ontology Alignment Contest held at the Evaluation of Ontology-based Tools (EON) workshop of the annual International Semantic Web Conference (ISWC) [46]. Then, a unique OAEI campaign occurred in 2005 at the workshop on Integrating Ontologies held in conjunction with the International Conference on Knowledge Capture (K-Cap) [5]. From 2006 until the present, the OAEI campaigns were held at the Ontology Matching workshop, collocated with ISWC [2, 3, 7–9, 13, 16–19, 21], which this year took place in Vienna, Austria2.
Since 2011, we have been using an environment for automatically processing evaluations (§2.2) which was developed within the SEALS (Semantic Evaluation At Large Scale) project3. SEALS provided a software infrastructure for automatically executing evaluations and evaluation campaigns for typical semantic web tools, including ontology matching. In the OAEI 2017, a new evaluation environment called HOBBIT (§10) was adopted for the novel HOBBIT Link Discovery track; all systems in the other tracks were executed under the SEALS client. The Benchmark track was discontinued in this edition of the OAEI.

This paper synthesizes the 2017 evaluation campaign and introduces the results provided in the papers of the participants. The remainder of the paper is organised as follows: in Section 2, we present the overall evaluation methodology that has been used; Sections 3-11 discuss the settings and the results of each of the test cases; Section 13 overviews lessons learned from the campaign; and finally, Section 14 concludes the paper.

1 http://oaei.ontologymatching.org
2 http://om2017.ontologymatching.org
3 http://www.seals-project.eu

2 General methodology

We first present the tracks and test cases proposed this year to the OAEI participants (§2.1). Then, we discuss the resources used by participants to test their systems and the execution environment used for running the tools (§2.2). Finally, we describe the steps of the OAEI campaign (§2.3-2.5) and report on the general execution of the campaign (§2.6).

2.1 Tracks and test cases

This year's OAEI campaign consisted of 9 tracks gathering 23 test cases, with different evaluation modalities:

Expressive Ontology tracks offer alignments between real-world ontologies expressed in OWL:

Anatomy (§3): The anatomy track comprises a single test case consisting of matching the Adult Mouse Anatomy (2744 classes) and a small fragment of the NCI Thesaurus (3304 classes) describing the human anatomy.
Results are evaluated automatically against a manually curated reference alignment.

Conference (§4): The conference track comprises a single test case that is a suite of 21 matching tasks corresponding to the pairwise combination of 7 ontologies describing the domain of organizing conferences. Results are evaluated automatically against reference alignments in several modalities, and by using logical reasoning techniques.

Large biomedical ontologies (§5): The largebio track comprises 6 test cases involving 3 large and semantically rich biomedical ontologies: FMA, SNOMED-CT, and NCI Thesaurus. These test cases correspond to the pairwise combination of these ontologies in two variants: small overlapping fragments, in which only overlapping sections of the ontologies are matched, and whole ontologies. The evaluation is based on reference alignments automatically derived from the UMLS Metathesaurus, with mappings causing logical incoherence flagged so as not to be taken into account.

Disease & Phenotype (§6): The disease & phenotype track comprises 4 test cases that involve 6 biomedical ontologies covering the disease and phenotype domains: HPO versus MP, DOID versus ORDO, HPO versus MeSH, and HPO versus OMIM. The evaluation has been performed according to (1) a consensus alignment generated from those produced by the participating systems, (2) a set of manually generated mappings, and (3) a manual assessment of unique mappings (i.e., mappings that are not suggested by other systems).

Multilingual tracks offer alignments between ontologies in different languages:

Multifarm (§7): The multifarm track is based on a subset of the Conference data set translated into ten different languages, in addition to their original English: Arabic, Chinese, Czech, Dutch, French, German, Italian, Portuguese, Russian, and Spanish.
It consists of two test cases: same ontologies, where two versions of the same ontology in different languages are matched, and different ontologies, in which two different ontologies in different languages are matched. In total, 45 language pairings are evaluated, meaning that the same ontologies test case comprises 315 matching tasks, and the different ontologies test case comprises 945 matching tasks. Results are evaluated automatically against reference alignments.

Interactive tracks provide simulated user interaction to enable the benchmarking of algorithms designed to make use of it, with respect to both the improvement in the results and the workload of the user:

Interactive Matching Evaluation (§8): The interactive track is based on the test cases from the anatomy and conference tracks. An Oracle, which matching tools can access programmatically, simulates user feedback by querying the reference alignment of the test case. The Oracle can generate erroneous responses at a given rate, to simulate user errors. The evaluation is based on the same reference alignments, and contemplates the number of user interactions and the fraction of erroneous responses received by the tool, in addition to the standard evaluation parameters.

Instance Matching tracks focus on alignments between ontology instances expressed in the form of OWL ABoxes:

Instance Matching (§9): The instance track comprises two independent sub-tracks:

SYNTHETIC: This sub-track consists of matching instances that are found to refer to the same real-world entity corresponding to a creative work (which can be a news item, blog post or programme). It includes two evaluation modalities, Sandbox and Mainbox, which differ in the number of instances to match. The evaluation is automatic, based on a reference alignment, and partially blind: matching tools have access only to the Sandbox reference alignment.
DOREMUS: This sub-track consists of matching real-world datasets about classical music artworks from two major French cultural institutions: the French National Library (BnF) and the Philharmonie de Paris (PP). Both datasets use the same vocabulary, the DOREMUS model, issued from the DOREMUS project4. This sub-track comprises two test cases, heterogeneities (HT) and false-positives trap (FPT), characterized by different degrees of heterogeneity in artwork descriptions. The evaluation is automatic and based on reference alignments.

HOBBIT Link Discovery (§10): The HOBBIT track deals with link discovery for spatial data represented as trajectories or traces, i.e., sequences of (longitude, latitude) pairs. It comprises two test cases: Linking and Spatial. The Linking test case consists in matching traces that have been modified using string-based approaches, different date and coordinate formats, and the addition and/or deletion of intermediate points. In the Spatial test case, the goal is to identify DE-9IM (Dimensionally Extended nine-Intersection Model) topological relations between traces: Equals, Disjoint, Touches, Contains/Within, Covers/CoveredBy, Intersects, Crosses, Overlaps. For each relation, a different pair of source and target datasets is given to the participants, so the test case consists of 8 individual matching tasks. In both test cases, two evaluation modalities, Sandbox and Mainbox, were considered, differing on the number of instances to match. The evaluation is automatic and based on reference alignments.

Process Model Matching (§11): The process model track is concerned with the application of ontology matching techniques to the problem of matching process models. It comprises two test cases used in the Process Model Matching Campaign 2015 [4] which have been converted to an ontological representation, with process model entities being represented as ontology instances. The first test case contains nine process models which represent the application process for a master program of German universities, as well as reference alignments between all pairs of models. The second test case consists of process models which describe the process of registering a newborn child in different countries. The evaluation is automatic, based on reference alignments, and uses standard precision and recall measures as well as a probabilistic variant described in [29].

Table 1 summarizes the variation in the proposed test cases.

4 http://www.doremus.org

Table 1. Characteristics of the test cases (open evaluation is made with already published reference alignments and blind evaluation is made by organizers from reference alignments unknown to the participants).

test           formalism  relations   confidence  modalities  language                 SEALS
anatomy        OWL        =           [0 1]       open        EN                       √
conference     OWL        =, <=      [0 1]       open+blind  EN                       √
largebio       OWL        =           [0 1]       open        EN                       √
phenotype      OWL        =           [0 1]       blind       EN                       √
multifarm      OWL        =           [0 1]       open+blind  AR, CZ, CN, DE, EN,      √
                                                              ES, FR, IT, NL, RU, PT
interactive    OWL        =, <=      [0 1]       open        EN                       √
instance       OWL        =           [0 1]       open+blind  EN                       √
hobbit ld      OWL        =, spatial  N/A         open+blind  EN, N/A                  HOBBIT
process model  OWL        <=          [0 1]       open+blind  EN                       √

2.2 The SEALS client

Since 2011, tool developers have had to implement a simple interface and to wrap their tools in a predefined way, including all required libraries and resources. A tutorial for tool wrapping was provided to the participants, describing how to wrap a tool and how to use the SEALS client to run a full evaluation locally.
This client is then executed by the track organizers to run the evaluation. This approach ensures the reproducibility and comparability of the results of all systems.

2.3 Preparatory phase

Ontologies to be matched and (where applicable) reference alignments were provided in advance, during the period between June 1st and July 15th, 2017. This gave potential participants the opportunity to send observations, bug corrections, remarks and other test cases to the organizers. The goal of this preparatory period is to ensure that the delivered tests make sense to the participants. The final test base was released on July 15th, 2017 and did not evolve after that.

2.4 Execution phase

During the execution phase, participants used their systems to automatically match the test case ontologies. In most cases, ontologies are described in OWL-DL and serialized in the RDF/XML format [11]. Participants can self-evaluate their results either by comparing their output with reference alignments or by using the SEALS client to compute precision and recall. They can tune their systems with respect to the non-blind evaluation as long as the rules published on the OAEI web site are satisfied. This phase was conducted between July 15th and August 31st, 2017, except for the HOBBIT track, which was extended until September 15th, 2017. Like last year, we requested a mandatory registration of systems and a preliminary evaluation of wrapped systems by July 31st, to alleviate the burden of debugging systems with respect to issues with the SEALS client during the evaluation phase.

2.5 Evaluation phase

Participants were required to submit their SEALS-wrapped tools by August 31st, 2017, and their HOBBIT-wrapped tools by September 15th, 2017. Tools were then tested by the organizers, and minor problems were reported to some tool developers, who were given the opportunity to fix their tools and resubmit them.
Initial results were provided directly to the participants between September 1st and October 15th, 2017. The final results for most tracks were published on the respective pages of the OAEI website by October 15th, although some tracks were delayed. The standard evaluation measures are precision, recall and F-measure computed against the reference alignments. More details on the evaluation are given in the sections for the test cases.

2.6 Comments on the execution

Following an initial period of growth, the number of OAEI participants has remained approximately constant since 2012, at slightly over 20 (see Figure 1). This year was no exception, as we counted 21 participating systems. Table 2 lists the participants and the tracks in which they competed. Some matching systems participated with different variants (DiSMatch and LogMap) whereas others were evaluated with different configurations, as requested by developers (see test case sections for details).

Fig. 1. Number of participating systems per year in the OAEI.

Table 2. Participants and the status of their submissions.
Participating systems (21): ALIN, AML, CroLOM, DiSMatch-ar, DiSMatch-sg, DiSMatch-tr, I-Match, KEPLER, Legato, LogMap, LogMap-Bio, LogMapLt, njuLink, ONTMAT, POMap, RADON, SANOM, Silk, Wiki2, XMap, YAM-BIO.

track          participants
anatomy        11
conference     10
largebio       10
phenotype      11
multifarm      7
interactive    4
process model  3
instance       5
hobbit ld      4
total          65

Of the 21 systems, 16 returned non-boolean confidence scores; some systems participated in, or completed, only part of the tasks of certain tracks.

3 Anatomy

The anatomy test case confronts matching systems with two fragments of biomedical ontologies which describe the human anatomy5 and the anatomy of the mouse6. This data set has been used since 2007 with some improvements over the years [15].

5 http://www.cancer.gov/cancertopics/cancerlibrary/terminologyresources/

3.1 Experimental Setting

We conducted experiments by executing each system in its standard setting, and we compared precision, recall, F-measure and recall+ against a manually curated reference alignment. Recall+ indicates the amount of detected non-trivial correspondences, i.e., correspondences whose entities do not have the same normalized label. The approach that generates only trivial correspondences is denoted as the baseline StringEquiv in the following section. We ran the systems on a server with 3.46 GHz (6 cores) and 8GB allocated RAM, using the SEALS client.
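These measures reduce to simple set arithmetic when an alignment is viewed as a set of correspondences. The sketch below is illustrative only: the entity names are invented, and the `trivial` set stands in for the correspondences a StringEquiv-style baseline would find.

```python
# Toy sketch of the anatomy-track measures: precision, recall,
# F-measure and recall+. An alignment is a set of (source, target)
# pairs; the data below is invented for illustration.

def evaluate(system, reference, trivial):
    """`trivial` holds the reference correspondences whose entities
    share a normalized label (what the StringEquiv baseline detects)."""
    tp = system & reference
    precision = len(tp) / len(system) if system else 0.0
    recall = len(tp) / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # recall+ is recall restricted to the non-trivial correspondences.
    nontrivial = reference - trivial
    recall_plus = (len(tp & nontrivial) / len(nontrivial)
                   if nontrivial else 0.0)
    return precision, recall, f1, recall_plus

reference = {("mouse:A", "nci:A"), ("mouse:B", "nci:B"), ("mouse:C", "nci:X")}
trivial = {("mouse:A", "nci:A"), ("mouse:B", "nci:B")}
system = {("mouse:A", "nci:A"), ("mouse:C", "nci:X"), ("mouse:D", "nci:Y")}

p, r, f1, rp = evaluate(system, reference, trivial)
# p = r = f1 = 2/3; recall+ = 1.0 (the one non-trivial match was found)
```

A system can thus score a modest recall while still reaching a high recall+, or vice versa, which is why the two are reported separately below.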
However, we changed the way precision and recall are computed by removing trivial correspondences in the oboInOwl namespace, like
http://...oboInOwl#Synonym = http://...oboInOwl#Synonym
as well as correspondences expressing relations different from equivalence. Thus, the results generated by the SEALS client vary in some cases by 0.5% compared to the results presented below. Using the Pellet reasoner, we also checked whether each generated alignment is coherent, i.e., that there are no unsatisfiable classes when the ontologies are merged with the alignment.

3.2 Results

In Table 3, we show the results of the 11 participating systems that generated an alignment, including 3 versions of LogMap. A number of systems participated in the anatomy track for the first time this year: KEPLER, POMap, SANOM, WikiV2, and YAM-BIO. For more details, we refer the reader to the papers presenting the systems.

Table 3. Comparison against the reference alignment, ordered by F-measure. Runtime is measured in seconds; the "size" column refers to the number of correspondences in the generated alignment.

Matcher      Runtime  Size  Precision  F-measure  Recall  Recall+  Coherent
AML          47       1493  0.95       0.943      0.936   0.832    √
YAM-BIO      70       1474  0.948      0.935      0.922   0.794    -
POMap        808      1492  0.94       0.933      0.925   0.824    -
LogMapBio    820      1534  0.889      0.894      0.899   0.733    √
XMap         37       1412  0.926      0.893      0.863   0.639    √
LogMap       22       1397  0.918      0.88       0.846   0.593    √
KEPLER       234      1173  0.958      0.836      0.741   0.316    -
LogMapLite   19       1148  0.962      0.829      0.728   0.29     -
SANOM        295      1304  0.895      0.828      0.77    0.419    -
Wiki2        2204     1260  0.883      0.802      0.734   0.356    -
StringEquiv  -        946   0.997      0.766      0.622   0.000    -
ALIN         836      516   0.996      0.506      0.339   0.0      √

This year, 5 out of 11 systems were able to achieve the alignment task in less than 100 seconds: LogMapLite, LogMap, XMap, AML and YAM-BIO. In 2016 and 2015, there were 4 out of 13 systems and 6 out of 15 systems, respectively, that generated an alignment in this time frame.

6 http://www.informatics.jax.org/searches/AMA_form.shtml
As in the last five years, LogMapLite has the shortest runtime. The table shows that there is no correlation between the quality of the generated alignment, in terms of precision and recall, and the runtime. This had also been observed in previous OAEI campaigns. The table also shows the results for F-measure, recall+ and the size of alignments. Regarding F-measure, the top 3 ranked systems, AML, YAM-BIO and POMap, each achieve an F-measure above 0.93; among these, AML achieved the highest (0.943). All of the long-term participants in the track showed results comparable in terms of F-measure to their last year's results, and at least as good as the results of the best systems in OAEI 2007-2010. Regarding recall+, AML, LogMap and LogMapLite showed similar results to previous years. LogMapBio improved slightly, from 0.728 in 2016 to 0.733 in 2017, while XMap decreased slightly, from 0.647 to 0.639. Two new participants obtained good results for recall+: POMap scored 0.824 (second place), followed by YAM-BIO with 0.794 (third place). In terms of the number of correspondences, long-term participants computed similar numbers of correspondences as last year: AML and LogMap generated the same number of correspondences, LogMapBio generated 3 more, LogMapLite 1 more, ALIN 6 more, and XMap 1 less. This year, 10 out of 11 systems achieved an F-measure higher than the baseline, a slightly better result than last year, when 9 out of 13 surpassed the baseline. Five systems produced coherent alignments, which is comparable to the last two years, when 7 out of 13 and 5 out of 10 systems achieved this. Two of the three best systems with respect to F-measure (YAM-BIO and POMap) produced incoherent alignments.

3.3 Conclusions

The number of systems participating in the anatomy track has varied throughout the years. This year, it is lower than in the two previous editions, but higher than in 2014.
As noted previously, there are newly-joined systems as well as long-term participants. The systems that participated in the previous edition in 2016 scored similarly to their previous results. As last year, the AML system set the top result for the anatomy track with respect to F-measure. Two of the newly-joined systems (YAM-BIO and POMap) achieved the 2nd and 3rd best scores in terms of F-measure.

4 Conference

The conference test cases require matching several moderately expressive ontologies from the conference organisation domain.

4.1 Test data

The data set consists of 16 ontologies in the domain of organising conferences. These ontologies were developed within the OntoFarm project7. The main features of this test case are:

– Generally understandable domain. Most ontology engineers are familiar with organising conferences. Therefore, they can create their own ontologies as well as evaluate the alignments among their concepts with enough erudition.
– Independence of ontologies. Ontologies were developed independently and based on different resources; they thus capture the issues in organising conferences from different points of view and with different terminologies.
– Relative richness in axioms. Most ontologies were equipped with OWL DL axioms of various kinds; this opens a way to use semantic matchers.

Ontologies differ in their numbers of classes and properties, in expressivity, but also in underlying resources.

7 http://owl.vse.cz:8080/ontofarm/

4.2 Results

We performed three kinds of evaluations. First, we provide results in terms of F-measure, comparison with baseline matchers and results of matchers from previous OAEI editions, and a precision/recall triangular graph based on sharp reference alignments. Second, we provide an evaluation based on the uncertain version of the reference alignment. Finally, we also provide an evaluation based on violations of consistency and conservativity principles.
Evaluation based on sharp reference alignments. We evaluated the results of participants against blind reference alignments (labelled rar2).8 This includes all pairwise combinations between 7 different ontologies, i.e., 21 alignments. We prepared the reference alignments in two steps. First, we generated them as a transitive closure computed on the original reference alignments. In order to obtain a coherent result, conflicting correspondences, i.e., those causing unsatisfiability, were manually inspected and incoherence was resolved by evaluators. The resulting reference alignments are labelled ra2. Second, we detected violations of conservativity using the approach from [44] and resolved them by an evaluator. The resulting reference alignments are labelled rar2. As a result, the degree of correctness and completeness of the new reference alignments is probably slightly better than that of the old ones. However, the differences are relatively limited. Whereas the new reference alignments are not open, the old reference alignments (labelled ra1 on the conference web page) are available; these represent close approximations of the new ones.

Table 4 shows the results of all participants with regard to the reference alignment rar2. F0.5-measure, F1-measure and F2-measure are computed for the threshold that provides the optimal F1-measure. F1 is the harmonic mean of precision and recall, where both are equally weighted; F2 weights recall higher than precision, and F0.5 weights precision higher than recall. The matchers shown in the table are ordered according to their highest average F1-measure. We employed two baseline matchers: edna (string edit distance matcher) was used within the benchmark test cases in previous years and, with regard to performance, it is very similar to the previously used baseline2 in the conference track; StringEquiv is used within the anatomy test case.
This year these baselines divide matchers into two performance groups.

8 More details about evaluation applying other sharp reference alignments are available at the conference web page.

Table 4. The highest average F0.5/F1/F2-measure and the corresponding precision and recall for each matcher with its F1-optimal threshold (ordered by F1-measure). Inc.Align. means the number of incoherent alignments. Conser.V. means the total number of all conservativity principle violations. Consist.V. means the total number of all consistency principle violations.

Matcher      Prec.  F0.5-m.  F1-m.  F2-m.  Rec.  Inc.Align.  Conser.V.  Consist.V.
AML          0.78   0.74     0.69   0.65   0.62  0           39         0
LogMap       0.77   0.72     0.66   0.6    0.57  0           25         0
XMap         0.78   0.72     0.65   0.58   0.55  1           22         4
LogMapLt     0.68   0.62     0.56   0.5    0.47  5           96         25
edna         0.74   0.66     0.56   0.49   0.45
KEPLER       0.67   0.61     0.55   0.49   0.46  12          123        159
WikiV3       0.63   0.59     0.54   0.5    0.47  10          125        58
StringEquiv  0.76   0.65     0.53   0.45   0.41
POMap        0.69   0.59     0.49   0.42   0.38  0           1          0
ALIN         0.86   0.6      0.41   0.31   0.27  0           0          0
SANOM        0.8    0.56     0.38   0.29   0.25  1           11         18
ONTMAT       0.06   0.07     0.1    0.19   0.41  0           1          0

With regard to the two baselines, we can group the tools according to each matcher's position. In all, four tools outperformed both baselines (AML, LogMap, XMap and LogMapLt), and two newcomers (KEPLER and WikiV3) performed better than one baseline. The other matchers (POMap, ALIN, SANOM and ONTMAT) performed worse than both baselines. Four tools (ALIN, POMap, ONTMAT and SANOM) did not match properties at all, which naturally had a negative effect on their overall performance. More details about evaluations considering only classes or only properties are on the conference web page. The performance of all matchers (except ONTMAT) regarding precision, recall and F1-measure is visualised in Figure 2. Matchers are represented as squares or triangles; baselines are represented as circles.
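The weighted F-measures reported in Table 4 follow the standard Fβ definition. As a sanity check, the sketch below (the function name is ours, not an OAEI artifact) reproduces AML's row from its precision and recall.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall:
    beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# AML's precision/recall from Table 4 reproduce its three F-measures:
p, r = 0.78, 0.62
f05, f1, f2 = f_beta(p, r, 0.5), f_beta(p, r, 1.0), f_beta(p, r, 2.0)
# rounded to two decimals: f05 = 0.74, f1 = 0.69, f2 = 0.65
```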
Fig. 2. Precision/recall triangular graph for the conference test case. Dotted lines depict levels of precision/recall, while values of F1-measure are depicted by areas bordered by the corresponding lines F1-measure=0.[5|6|7].

Comparison with previous years with regard to rar2. Four matchers, all top performers, also participated in the conference test cases in OAEI 2016. None of them improved with regard to the F1-measure evaluation.

Evaluation based on the uncertain version of the reference alignments. The confidence values of all matches in the sharp reference alignments for the conference track are all 1.0. For the uncertain version of this track, the confidence value of a match has been set equal to the percentage of a group of people who agreed with the match in question (this uncertain version is based on the reference alignment labelled ra1). One key thing to note is that the group was only asked to validate matches that were already present in the existing reference alignments, so some matches had their confidence value reduced from 1.0 to a number near 0, but no new matches were added.

There are two ways that we can evaluate matchers according to these "uncertain" reference alignments, which we refer to as discrete and continuous. The discrete evaluation considers any match in the reference alignment with a confidence value of 0.5 or greater to be fully correct, and those with a confidence less than 0.5 to be fully incorrect. Similarly, a matcher's match is considered a "yes" if its confidence value is greater than or equal to the matcher's threshold, and a "no" otherwise. In essence, this is the same as the "sharp" evaluation approach, except that some matches have been removed because less than half of the crowdsourcing group agreed with them. The continuous evaluation strategy penalises a matcher more if it misses a match on which most people agree than if it misses a more controversial match. For instance, if A ≡ B has a confidence of 0.85 in the reference alignment and a matcher gives that correspondence a confidence of 0.40, then that is counted as 0.85 × 0.40 = 0.34 true positive and 0.85 − 0.40 = 0.45 false negative.

Out of the ten alignment matchers, three (ALIN, LogMapLt and ONTMAT) use 1.0 as the confidence value for all matches they identify. Two more have a narrow range of confidence values: POMap's values vary between 0.8 and 1.0, with the majority falling between 0.93 and 1.0, while SANOM's values are relatively tightly clustered between 0.73 and 0.9. The remaining five systems (AML, KEPLER, LogMap, WikiV3 and XMap) have a wide variation of confidence values.

When comparing the performance of the matchers on the uncertain reference alignments versus that on the sharp version (see Table 5), we see that in the discrete case all matchers performed the same or slightly better. Improvement in F-measure ranged from 0 to 8 percentage points over the sharp reference alignment. This was driven by increased recall, which is a result of the presence of fewer "controversial" matches in the uncertain version of the reference alignment.

The performance of most matchers is very similar regardless of whether a discrete or continuous evaluation methodology is used (provided that the threshold is optimized to achieve the highest possible F-measure in the discrete case). The primary exceptions to this are KEPLER, LogMap and SANOM, which perform significantly worse when evaluated using the continuous version of the metrics.
In the LogMap and SANOM cases, this is because the matcher assigns low confidence values to some matches in which the labels are equivalent strings, which many crowdsourcers agreed with unless there was a compelling technical reason not to. This hurts recall, but using a low threshold value in the discrete version of the evaluation metrics "hides" this problem. In the case of KEPLER, the issue is that entities whose labels share a word in common have fairly high confidence values, even though they are often not equivalent, for example, "Review" and "Reviewing Event". This hurts precision in the continuous case, but is taken care of by using a high threshold value in the discrete case.

Table 5. F-measure, precision, and recall of the different matchers when evaluated using the sharp (ra1), discrete uncertain and continuous uncertain metrics.

             Sharp               Discrete            Continuous
Matcher      Prec.  F1-m.  Rec.  Prec.  F1-m.  Rec.  Prec.  F1-m.  Rec.
ALIN         0.89   0.41   0.27  0.89   0.49   0.34  0.89   0.5    0.35
AML          0.84   0.74   0.66  0.79   0.78   0.77  0.8    0.77   0.74
KEPLER       0.76   0.59   0.48  0.76   0.67   0.6   0.58   0.62   0.68
LogMap       0.82   0.69   0.59  0.78   0.73   0.68  0.8    0.67   0.57
LogMapLt     0.73   0.59   0.5   0.72   0.67   0.62  0.72   0.67   0.63
ONTMAT       0.06   0.11   0.43  0.06   0.11   0.54  0.06   0.11   0.55
POMap        0.73   0.52   0.4   0.73   0.6    0.5   0.71   0.59   0.51
SANOM        0.81   0.38   0.25  0.81   0.45   0.31  0.81   0.38   0.25
WikiV3       0.67   0.57   0.49  0.74   0.62   0.52  0.73   0.63   0.55
XMap         0.84   0.68   0.57  0.79   0.72   0.67  0.81   0.73   0.67

Five matchers from this year also participated last year, and thus we are able to make some comparisons over time. The F-measures of all systems essentially held constant (within one percent) when evaluated against the uncertain reference alignments. This is in contrast to last year, in which most matchers made modest gains (in the neighborhood of 1 to 6 percent) over 2015. It seems that, barring any new advances, participating matchers have reached something of a steady state on this performance metric.
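The two evaluation strategies can be sketched as follows. The continuous part implements only the true-positive/false-negative weighting from the worked example above (how residual false-positive mass is charged is left implicit in the description, so the sketch stops at tp/fn); all entity names and confidences are illustrative.

```python
# Sketch of the discrete and continuous evaluation strategies for the
# uncertain reference alignments. An alignment maps a correspondence
# (an entity pair) to a confidence in [0, 1].

def discrete_eval(system, reference, threshold):
    """Reference matches with confidence >= 0.5 count as fully correct;
    a system match is a 'yes' at or above the matcher's threshold."""
    ref = {m for m, c in reference.items() if c >= 0.5}
    yes = {m for m, c in system.items() if c >= threshold}
    tp = len(ref & yes)
    precision = tp / len(yes) if yes else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall

def continuous_tp_fn(system, reference):
    """Reference confidence R and system confidence s contribute R*s of
    true-positive mass and R-s of false-negative mass (floored at 0)."""
    tp = sum(reference.get(m, 0.0) * c for m, c in system.items())
    fn = sum(max(rc - system.get(m, 0.0), 0.0)
             for m, rc in reference.items())
    return tp, fn

# A widely agreed match (0.85 agreement) found with low confidence 0.40:
reference = {("cmt:Review", "ekaw:Review"): 0.85}
system = {("cmt:Review", "ekaw:Review"): 0.40}

tp, fn = continuous_tp_fn(system, reference)      # tp = 0.34, fn = 0.45
p_d, r_d = discrete_eval(system, reference, 0.3)  # both 1.0 at threshold 0.3
```

This makes the "hiding" effect discussed above concrete: the discrete evaluation with a low threshold scores the example perfectly, while the continuous evaluation charges 0.45 of false-negative mass for the low confidence.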
Evaluation based on violations of consistency and conservativity principles. We performed an evaluation based on the detection of conservativity and consistency violations [44, 45]. The consistency principle states that correspondences should not lead to unsatisfiable classes in the merged ontology; the conservativity principle states that correspondences should not introduce new semantic relationships between concepts from one of the input ontologies.

Table 4 shows the number of unsatisfiable TBoxes after the ontologies are merged (Inc. Align.), the total number of conservativity principle violations within all alignments (Conser.V.) and the total number of consistency principle violations (Consist.V.). Five tools (ALIN, AML, LogMap, ONTMAT and POMap) have no consistency principle violations (in comparison to seven last year), and two tools (SANOM and XMap) generated only one incoherent alignment each. One tool (ALIN) has no conservativity principle violations, and two further tools (ONTMAT and POMap) average around one conservativity principle violation. We should note that these conservativity principle violations can be "false positives", since an entailment in the aligned ontology can be correct even though it was not derivable from the single input ontologies.

4.3 Conclusions

In conclusion, this year four of ten matchers performed better than both baselines on sharp reference alignments. Further, this year five matchers generated coherent alignments (against seven matchers last year and five matchers the year before). Based on the uncertain reference alignments, we can conclude that all matchers perform better on the fuzzy than on the sharp version of the benchmark, and that eight matchers produce closely corresponding results under the continuous and discrete versions, indicating good agreement with the human matchers.
Finally, none of the five matchers that also participated last year improved their performance with regard to the evaluation based on the sharp or the uncertain reference alignments.

5 Large biomedical ontologies (largebio)

The largebio test case aims at finding alignments between the large and semantically rich biomedical ontologies FMA, SNOMED-CT and NCI, which contain 78,989, 306,591 and 66,724 classes, respectively.

5.1 Test data

The test case has been split into three matching problems: FMA-NCI, FMA-SNOMED and SNOMED-NCI. Each matching problem has been further divided into two tasks involving differently sized fragments of the input ontologies: small overlapping fragments versus whole ontologies (FMA and NCI) or large fragments (SNOMED-CT).

The UMLS Metathesaurus [6] has been selected as the basis for the reference alignments. UMLS is currently the most comprehensive effort for integrating independently developed medical thesauri and ontologies, including FMA, SNOMED-CT and NCI. The extraction of mappings from UMLS is detailed in [26]. Since alignment coherence is an aspect of ontology matching that we aim to promote, in previous editions we provided coherent reference alignments by refining the UMLS mappings using the Alcomo (alignment) debugging system [32], LogMap's (alignment) repair facility [25], or both [27].

However, concerns were raised about the validity and fairness of applying automated alignment repair techniques to make reference alignments coherent [37]. It is clear that using the original (incoherent) UMLS alignments would penalize ontology matching systems that perform alignment repair. However, using automatically repaired alignments would penalize systems that do not perform alignment repair, as well as systems that employ a repair strategy differing from that used on the reference alignments [37]. Thus, as of the 2014 edition, we arrived at a compromise solution that should be fair to all ontology matching systems.
Instead of repairing the reference alignments as normal, by removing correspondences, we flagged the incoherence-causing correspondences in the alignments by setting the relation to "?" (unknown). These "?" correspondences are considered neither positive nor negative when evaluating the participating ontology matching systems; they are simply ignored. This way, systems that do not perform alignment repair are not penalized for finding correspondences that (despite causing incoherences) may or may not be correct, and systems that do perform alignment repair are not penalized for removing such correspondences.

To ensure that this solution was as fair as possible to all alignment repair strategies, we flagged as unknown all correspondences suppressed by any of Alcomo, LogMap or AML [39], as well as all correspondences suppressed from the reference alignments of last year's edition (using Alcomo and LogMap combined). Note that we have used the (incomplete) repair modules of the above-mentioned systems. The flagged UMLS-based reference alignment for the OAEI 2017 campaign is summarized in Table 6.

Table 6. Number of correspondences in the reference alignments of the large biomedical ontologies tasks.

Reference alignment   "=" corresp.   "?" corresp.
FMA-NCI                      2,686            338
FMA-SNOMED                   6,026          2,982
SNOMED-NCI                  17,210          1,634

5.2 Evaluation setting, participation and success

We have run the evaluation on an Ubuntu laptop with an Intel Core i7-4600U CPU @ 2.10GHz x 4 and 15 GB of allocated RAM. Precision, recall and F-measure have been computed with respect to the UMLS-based reference alignment. Systems have been ordered in terms of F-measure.

In the OAEI 2017 largebio track, 10 out of 21 participating systems were able to cope with at least one of the tasks of the track within a 4-hour timeout. Note that we also include the results of Tool1 (whose developers withdrew the system from the campaign) for reference.
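Scoring against a reference with "?"-flagged correspondences, as described above, amounts to removing the flagged entries from both sides of the comparison before counting. A minimal sketch, assuming correspondences are represented as hashable pairs (the function name and set-based representation are illustrative, not from the OAEI tooling):

```python
def score(system, reference_eq, reference_unknown):
    """Precision/recall/F-measure against a flagged reference.

    system: set of correspondences output by a matcher;
    reference_eq: set of "=" (positive) reference correspondences;
    reference_unknown: set of "?" (flagged) correspondences, which are
    neither rewarded nor penalized.
    """
    considered = system - reference_unknown        # ignore flagged matches
    tp = len(considered & reference_eq)
    precision = tp / len(considered) if considered else 0.0
    recall = tp / len(reference_eq) if reference_eq else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```

A system that outputs a "?" correspondence is thus treated exactly like a system that repairs it away: in both cases the correspondence simply drops out of the computation.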
Nine systems were able to complete more than one task, while six systems were able to complete all tasks. This is an improvement with respect to last year's results, where only four systems were able to complete all tasks.

5.3 Background knowledge

Regarding the use of background knowledge, LogMap-Bio uses BioPortal as a mediating ontology provider, that is, it (automatically) retrieves from BioPortal the top 10 ontologies most suitable for the matching task.

LogMap uses normalisations and spelling variants from the general-purpose (biomedical) UMLS Lexicon (a different resource from the UMLS Metathesaurus).

AML has three sources of background knowledge which can be used as mediators between the input ontologies: the Uber Anatomy Ontology (Uberon), the Human Disease Ontology (DOID) and the Medical Subject Headings (MeSH).

YAM-BIO uses as background knowledge a file containing mappings from the DOID and UBERON ontologies to other ontologies such as FMA, NCI or SNOMED CT.

XMAP uses synonyms provided by the UMLS Metathesaurus. Note that matching systems using the UMLS Metathesaurus as background knowledge have a notable advantage, since the largebio reference alignment is also based on the UMLS Metathesaurus.

5.4 Alignment coherence

Together with precision, recall, F-measure and runtimes, we have also evaluated the coherence of the alignments. We report (1) the number of unsatisfiable classes when reasoning with the input ontologies together with the computed alignments, and (2) the ratio of unsatisfiable classes with respect to the size of the union of the input ontologies.

We have used the OWL 2 reasoner HermiT [35] to compute the number of unsatisfiable classes. For the cases in which HermiT could not cope with the input ontologies and the alignments (in less than 2 hours), we have provided a lower bound on the number of unsatisfiable classes (indicated by ≥) using the OWL 2 EL reasoner ELK [28].

Table 7. System runtimes (in seconds) and task completion.
                  FMA-NCI          FMA-SNOMED        SNOMED-NCI
System          Task 1   Task 2   Task 3   Task 4   Task 5   Task 6    Average    #
LogMapLite           1       10        2       18        9       22         10    6
AML                 44       77      109      177      669      312        231    6
LogMap              12       92       57      477      207      652        250    6
XMap                20      130       62      625      106      563        251    6
YAM-BIO             56      279       60      468    2,202      490        593    6
Tool1               65    1,650      245    2,140      481    1,150        955    6
LogMapBio        1,098    1,552    1,223    2,951    2,779    4,728      2,389    6
POMAP              595        -    1,841        -        -        -      1,218    2
SANOM              679        -    3,123        -        -        -      1,901    2
KEPLER             601        -    3,378        -        -        -      1,990    2
Wiki2          108,953        -        -        -        -        -    108,953    1
# Systems           11       10        7        7        7        7     10,795   49

In this OAEI edition, only three distinct systems provided alignment repair facilities: AML, LogMap (and its LogMap-Bio variant), and XMap (which reuses the repair techniques of Alcomo [32]). Note that only LogMap and LogMap-Bio are able to reduce the number of unsatisfiable classes to a minimum across all tasks, leaving only 9 unsatisfiable classes in the worst case (the whole FMA-NCI task).

Tables 8-9 (see the last two columns) show that even the most precise alignment sets may lead to a huge number of unsatisfiable classes. This demonstrates the importance of using techniques to assess the coherence of the generated alignments if they are to be used in tasks involving reasoning. We encourage ontology matching system developers to develop their own repair techniques or to use state-of-the-art techniques such as Alcomo [32], the repair module of LogMap (LogMap-Repair) [25] or the repair module of AML [39], which have worked well in practice [27, 23].

5.5 Runtimes and task completion

Table 7 shows which systems were able to complete each of the matching tasks in less than 4 hours, together with the required computation times. Systems have been ordered with respect to the number of completed tasks and the average time required to complete them. Times are reported in seconds. The last column reports the number of tasks that a system could complete; for example, 7 systems (including the withdrawn system Tool1) were able to complete all six tasks.
The last row shows the number of systems that could finish each of the tasks. The tasks involving SNOMED were harder with respect to both computation times and the number of systems that completed them.

5.6 Results for the FMA-NCI matching problem

Table 8 summarizes the results for the tasks in the FMA-NCI matching problem. XMap and YAM-BIO achieved the highest F-measure in Task 1, while XMap and AML did so in Task 2. Note, however, that the use of background knowledge based on the UMLS Metathesaurus has an important impact on the performance of XMap. The use of background knowledge led to an improvement in recall of LogMap-Bio over LogMap in both tasks, but this came at the cost of precision, resulting in the two variants of the system having identical F-measures.

Table 8. Results for the FMA-NCI matching problem.

                                        Scores              Incoherence
System        Time (s)  # Corresp.   Prec.  F-m.  Rec.     Unsat.   Degree
Task 1: small FMA and NCI fragments
XMap*               20       2,649    0.98  0.94  0.90          2   0.019%
YAM-BIO             56       2,681    0.97  0.93  0.90        800     7.8%
AML                 44       2,723    0.96  0.93  0.91          2   0.019%
LogMapBio        1,098       2,807    0.93  0.92  0.91          2   0.019%
LogMap              12       2,747    0.94  0.92  0.90          2   0.019%
KEPLER             601       2,506    0.96  0.89  0.83      3,707    36.1%
Average         10,193       2,550    0.95  0.89  0.84      1,238    12.0%
LogMapLite           1       2,483    0.97  0.89  0.82      2,045    19.9%
SANOM              679       2,457    0.95  0.87  0.80      1,183    11.5%
POMAP              595       2,475    0.90  0.86  0.83      3,493    34.0%
Tool1               65       2,316    0.97  0.86  0.77      1,128    11.0%
Wiki2          108,953       2,210    0.88  0.80  0.73      1,261    12.3%
Task 2: whole FMA and NCI ontologies
XMap*              130       2,735    0.88  0.87  0.85          9   0.006%
AML                 77       2,968    0.84  0.86  0.87         10   0.007%
YAM-BIO            279       3,109    0.82  0.85  0.89     11,770     8.1%
LogMap              92       2,701    0.86  0.83  0.81          9   0.006%
LogMapBio        1,552       2,913    0.82  0.83  0.83          9   0.006%
Average            541       2,994    0.80  0.81  0.83      7,389     5.1%
LogMapLite          10       3,477    0.67  0.74  0.82     26,478    18.1%
Tool1            1,650       3,056    0.69  0.71  0.74     13,442     9.2%
*Uses background knowledge based on the UMLS Metathesaurus, which is the basis of the largebio reference alignments.
Note that the effectiveness of the systems decreased from Task 1 to Task 2. One reason for this is that with larger ontologies there are more plausible mapping candidates, and thus it is harder to attain both high precision and high recall. Another reason is that the very scale of the problem constrains the matching strategies that systems can employ: AML, for example, forgoes its computationally more complex matching algorithms when handling very large ontologies, due to efficiency concerns. The size of Task 2 proved a problem for a number of systems, which were unable to complete it within the allotted time: POMAP, SANOM, KEPLER and Wiki2.

5.7 Results for the FMA-SNOMED matching problem

Table 9 summarizes the results for the tasks in the FMA-SNOMED matching problem. XMap produced the best results in terms of both recall and F-measure in Tasks 3 and 4, but again, we must highlight that it uses background knowledge based on the UMLS Metathesaurus. Among the other systems, AML and YAM-BIO achieved the highest F-measure in Tasks 3 and 4, respectively.

Table 9. Results for the FMA-SNOMED matching problem.

                                        Scores              Incoherence
System        Time (s)  # Corresp.   Prec.  F-m.  Rec.     Unsat.   Degree
Task 3: small FMA and SNOMED fragments
XMap*               62       7,400    0.97  0.91  0.85          0     0.0%
AML                109       6,988    0.92  0.84  0.76          0     0.0%
YAM-BIO             60       6,817    0.97  0.83  0.73     13,240    56.1%
LogMapBio        1,223       6,315    0.95  0.80  0.69          1   0.004%
LogMap              57       6,282    0.95  0.80  0.69          1   0.004%
Average          1,010       4,623    0.89  0.62  0.51      2,141     9.1%
KEPLER           3,378       4,005    0.82  0.56  0.42      3,335    14.1%
SANOM            3,123       3,146    0.69  0.42  0.30      2,768    11.7%
POMAP            1,841       2,655    0.68  0.42  0.30      1,013     4.3%
LogMapLite           2       1,644    0.97  0.34  0.21        771     3.3%
Tool1              245         979    0.99  0.24  0.14        287     1.2%
Task 4: whole FMA ontology with SNOMED large fragment
XMap*              625       8,665    0.77  0.81  0.84          0     0.0%
YAM-BIO            468       7,171    0.89  0.80  0.73     54,081    26.8%
AML                177       6,571    0.88  0.77  0.69          0     0.0%
LogMap             477       6,394    0.84  0.73  0.65          0     0.0%
LogMapBio        2,951       6,634    0.81  0.72  0.65          0     0.0%
Average            979       5,470    0.84  0.63  0.56      8,445     4.2%
LogMapLite          18       1,822    0.85  0.34  0.21      4,389     2.2%
Tool1            2,140       1,038    0.87  0.23  0.13        649     0.3%
*Uses background knowledge based on the UMLS Metathesaurus, which is the basis of the largebio reference alignments.

Overall, the quality of the results was lower than that observed in the FMA-NCI matching problem, as this matching problem is considerably larger. As in the FMA-NCI matching problem, the effectiveness of all systems decreased as the ontology size increased from Task 3 to Task 4; POMAP, for example, completed the former but was unable to complete the latter.

5.8 Results for the SNOMED-NCI matching problem

Table 10 summarizes the results for the tasks in the SNOMED-NCI matching problem. AML achieved the best results in terms of both recall and F-measure in Tasks 5 and 6, while LogMap and AML achieved the best results in terms of precision in Tasks 5 and 6, respectively.

The overall performance of the systems was lower than in the FMA-SNOMED case, as this test case is even larger. Indeed, several systems were unable to complete even the smaller Task 5 within the allotted time: POMAP, SANOM and KEPLER.
As in the previous matching problems, effectiveness decreased as the ontology size increased. Unlike in the FMA-NCI and FMA-SNOMED matching problems, the use of the UMLS Metathesaurus did not positively impact the performance of XMap, which obtained lower results than expected.

Table 10. Results for the SNOMED-NCI matching problem.

                                        Scores                Incoherence
System        Time (s)  # Corresp.   Prec.  F-m.  Rec.       Unsat.    Degree
Task 5: small SNOMED and NCI fragments
AML                669      14,740    0.87  0.80  0.75       ≥3,966     ≥5.3%
LogMap             207      12,414    0.95  0.80  0.69           ≥0     ≥0.0%
LogMapBio        2,779      13,205    0.89  0.77  0.68           ≥0     ≥0.0%
YAM-BIO          2,202      12,959    0.90  0.77  0.68         ≥549     ≥0.7%
Average            921      12,220    0.89  0.70  0.59       21,264     28.3%
XMap*              106      16,968    0.89  0.69  0.57      ≥46,091    ≥61.3%
LogMapLite           9      10,942    0.89  0.69  0.57      ≥60,450    ≥80.4%
Tool1              481       4,312    0.87  0.35  0.22      ≥37,797    ≥50.2%
Task 6: whole NCI ontology with SNOMED large fragment
AML                312      13,176    0.90  0.77  0.67         ≥720     ≥0.4%
YAM-BIO            490      15,027    0.83  0.76  0.70       ≥2,212     ≥1.2%
LogMapBio        4,728      13,677    0.84  0.73  0.64           ≥5   ≥0.003%
LogMap             652      12,273    0.87  0.71  0.60           ≥3   ≥0.002%
LogMapLite          22      12,894    0.80  0.66  0.57     ≥150,656    ≥79.5%
Average          1,131      13,666    0.84  0.66  0.56       55,496     29.3%
XMap*              563      23,707    0.82  0.66  0.55     ≥137,136    ≥72.4%
Tool1            1,150       4,911    0.81  0.34  0.22      ≥97,743    ≥51.6%
*Uses background knowledge based on the UMLS Metathesaurus, which is the basis of the largebio reference alignments.

6 Disease and Phenotype Track (phenotype)

The Pistoia Alliance Ontologies Mapping project team9 organises this track, which is based on a real use case requiring the alignment of disease and phenotype ontologies. Specifically, in the OAEI 2017 edition of this track, the selected ontologies are the Human Phenotype Ontology (HPO), the Mammalian Phenotype Ontology (MP), the Human Disease Ontology (DOID), the Orphanet and Rare Diseases Ontology (ORDO), the Medical Subject Headings (MESH) ontology, and the Online Mendelian Inheritance in Man (OMIM) ontology.
The extended results for the OAEI 2016 Disease and Phenotype track (previous campaign) are available in [24].

6.1 Test data

The 2017 edition comprises four tasks requiring the pairwise alignment of:
– the Human Phenotype Ontology (HP) to the Mammalian Phenotype Ontology (MP);
– the Human Disease Ontology (DOID) to the Orphanet Rare Disease Ontology (ORDO);
– the Human Phenotype Ontology (HP) to the Medical Subject Headings (MESH); and
– the Human Phenotype Ontology (HP) to Online Mendelian Inheritance in Man (OMIM).

Currently, mappings between these ontologies are mostly curated by bioinformatics and disease experts, who would benefit from the automation of their workflows through the implementation of ontology matching algorithms.

9 http://www.pistoiaalliance.org/projects/ontologies-mapping/

Table 11. Disease and Phenotype ontology versions and sources.

Ontology   Version                      Source
HP         2017-06-30                   OBO Foundry
MP         2017-06-29                   OBO Foundry
DOID       2017-06-13                   OBO Foundry
ORDO       v2.4                         ORPHADATA
MESH       Hoehndorf's version (2014)   BioPortal
OMIM       UMLS 2016AB                  BioPortal

Table 11 summarizes the versions and sources of the ontologies used in the OAEI 2017. Note that the versions and sources of HP, MP, DOID and ORDO differ from the ones used in 2016.

We have extracted "baseline" reference alignments based on the available BioPortal mappings (July 8, 2017). Most of the BioPortal [38] mappings are generated automatically by the LOOM10 system; they should only be considered a baseline, since they are incomplete and may contain errors.

6.2 Evaluation setting

We have run the evaluation on an Ubuntu laptop with an Intel Core i7-4600U CPU @ 2.10GHz x 4 and 15 GB of allocated RAM. In the OAEI 2017 phenotype track, 10 out of 21 participating OAEI 2017 systems were able to cope with at least one of the tasks within 4 hours.
6.3 Evaluation criteria

Systems have been evaluated according to the following criteria:
– Precision and recall with respect to a consensus alignment automatically generated by voting based on the outputs of all participating systems (we have used vote=2, vote=3 and vote=4).
– Semantic recall with respect to manually generated mappings for several areas of interest (e.g., carbohydrate, obesity and breast cancer).
– Manual assessment of a subset of unique mappings (i.e., mappings that are not suggested by other systems).

We have used the OWL 2 reasoner HermiT to calculate the semantic recall. A positive hit means that a mapping in the reference has been (explicitly) included in the output mappings, or that it can be inferred from the input ontologies and the output mappings using reasoning.11

6.4 Use of background knowledge

LogMapBio uses BioPortal as a mediating ontology provider, that is, it retrieves from BioPortal the top 10 ontologies most suitable for the matching task.

10 https://www.bioontology.org/wiki/index.php/BioPortal_Mappings
11 Details about the used notion of semantic precision and recall can be found in [24].

Table 12. Disease and Phenotype task completion.

System       HP-MP   DOID-ORDO   HP-MESH   HP-OMIM
AML          X       X           X         X
DiSMatch     X       X           X         X
LogMap       X       X           X         X
LogMapBio    X       X           X         X
LogMapLite   X       X           X         empty
KEPLER       time    X           time      time
POMAP        X       X           time      time
Tool1        X       X           X         empty
XMap         X       X           X         empty
YAM-BIO      X       X           X         empty
X: completed; empty: produced empty alignment; error: runtime error; time: timed out (4 hours).

LogMap uses normalisations and spelling variants from the general-purpose (biomedical) UMLS Lexicon (a different resource from the UMLS Metathesaurus).

AML has three sources of background knowledge which can be used as mediators between the input ontologies: the Uber Anatomy Ontology (Uberon), the Human Disease Ontology (DOID) and the Medical Subject Headings (MeSH).
Additionally, for the HPO-MP test case, it uses the logical definitions of both ontologies, which define some of their classes as a combination of an anatomic term (i.e., a class from either FMA or Uberon) with a phenotype modifier term (i.e., a class from the Phenotypic Quality Ontology).

YAM-BIO uses as background knowledge a file containing mappings from the DOID and UBERON ontologies to other ontologies such as FMA, NCI or SNOMED CT.

DiSMatch estimates the similarity among concepts through textual semantic relatedness, relying on a corpus of relevant biomedical textual resources.

XMAP uses synonyms provided by the UMLS Metathesaurus.

6.5 Results

AML, DiSMatch, LogMap and LogMapBio produced the most complete results according to both the automatic and the manual evaluation. Table 12 summarizes the tasks for which each system was able to produce results within a 4-hour time frame.

Results against the consensus alignments. Table 13 shows the size of the consensus alignments built from the outputs of the systems participating in the OAEI 2017 campaign. Note that systems participating with different variants only contributed once to the voting, that is, the voting was done by family of systems/variants rather than by individual systems.

Table 13. Size of consensus alignments.

Task        Vote 2   Vote 3   Vote 4
HP-MP        3,130    2,153    1,780
DOID-ORDO    3,354    2,645    2,188
HP-MESH      4,711    3,847    3,227
HP-OMIM      6,834    4,177    3,462

Fig. 3. Results against consensus alignments with vote 2, 3 and 4.

Figure 3 shows the results achieved by each of the participating systems. We deliberately did not rank the systems, since the consensus alignments only allow us to assess how systems perform in comparison with one another. On the one hand, some of the mappings in the consensus alignment may be erroneous (false positives), since all it takes is for 2, 3 or 4 systems to agree on part of the erroneous mappings they find. On the other hand, the consensus alignments are not complete: there will likely be correct mappings that no system is able to find, and, as we will show in the manual evaluation, there are a number of mappings found by only one system (and therefore not in the consensus alignments) which are correct. Nevertheless, the results with respect to the consensus alignments do provide some insights into the performance of the systems, which is why we highlight the four systems that produce results closest to the silver standards: AML, DiSMatch, LogMap and LogMapBio.

Results against manually created mappings. The manually generated mappings for six areas (carbohydrate, obesity, breast cancer, urinary incontinence, abnormal heart and Charcot-Marie-Tooth disease) include 86 mappings between HP and MP and 175 mappings between DOID and ORDO. Most of them represent subsumption relationships. Tables 14 and 15 show the results in terms of recall and semantic recall for each system. LogMapBio and LogMap obtain the best results in terms of semantic recall in the HP-MP task, while AML obtains the best results in the DOID-ORDO task. The results in both tasks are far from optimal, since a large fraction of the manually created mappings have neither been (explicitly) identified by the systems nor can be derived via reasoning.

Table 14. Results against manually created mappings: HP-MP task.

System                 Standard Recall   Semantic Recall
BioPortal (baseline)   0.20              0.51
AML                    0.40              0.62
DiSMatch-ar            0.42              0.65
LogMap                 0.38              0.67
LogMapBio              0.38              0.67
LogMapLt               0.20              0.51
Tool1                  0.31              0.60
POMap                  0.38              0.65
XMap                   0.30              0.60
YAM-BIO                0.22              0.51

Table 15. Results against manually created mappings: DOID-ORDO task.

System                 Standard Recall   Semantic Recall
BioPortal (baseline)   0.13              0.14
AML                    0.33              0.48
DiSMatch-ar            0.21              0.25
DiSMatch-sg            0.21              0.25
DiSMatch-tr            0.21              0.25
KEPLER                 0.13              0.17
LogMap                 0.30              0.42
LogMapBio              0.32              0.44
LogMapLt               0.13              0.14
Tool1                  0.27              0.30
POMap                  0.27              0.30
XMap                   0.13              0.14
YAM-BIO                0.13              0.14
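The vote-based consensus construction used in this track (a mapping enters the consensus alignment if at least `vote` system families output it, with variants of one system counted as a single vote) can be sketched as follows. The function name and the assumption that outputs are already grouped by family are illustrative:

```python
from collections import defaultdict

def consensus(outputs_by_family, vote):
    """Build a consensus alignment by voting.

    outputs_by_family: dict mapping a system family name to the set of
    mappings it produced (variants of one system merged beforehand, so
    each family contributes at most one vote per mapping);
    vote: minimum number of families that must agree (e.g. 2, 3 or 4).
    """
    counts = defaultdict(int)
    for mappings in outputs_by_family.values():
        for m in mappings:          # one vote per family per mapping
            counts[m] += 1
    return {m for m, c in counts.items() if c >= vote}
```

Raising the vote threshold shrinks the consensus, which is consistent with the decreasing sizes from Vote 2 to Vote 4 in Table 13.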
Manual assessment of unique mappings. Figures 4 and 5 show the results of the manual assessment carried out to estimate the precision of the unique mappings generated by the participating systems. Unique mappings are correspondences that no other system (explicitly) provided in its output. We manually evaluated up to 30 mappings per system, focusing the assessment on unique equivalence mappings.

For example, LogMap's output contains 189 unique mappings in the HP-MP task, and the manual assessment revealed an (estimated) precision of 0.9333. In order to also take into account the number of unique mappings that a system is able to discover, Figures 4 and 5 also include an estimate of the positive and negative contribution of each system's unique mappings with respect to the total unique mappings discovered by all participating systems.

Fig. 4. Unique mappings in the HP-MP task.

Fig. 5. Unique mappings in the DOID-ORDO task.

7 MultiFarm

The MultiFarm data set [33] aims at evaluating the ability of matching systems to deal with ontologies in different natural languages. This data set results from the translation of 7 ontologies from the conference track (cmt, conference, confOf, iasted, sigkdd, ekaw and edas) into 10 languages: Arabic, Chinese, Czech, Dutch, French, German, Italian, Portuguese, Russian and Spanish. It is composed of 55 pairs of languages (see [33] for details on how the original MultiFarm data set has been generated). For each pair, taking into account the alignment direction (cmten→confOfde and cmtde→confOfen, for instance, count as distinct matching tasks), there are 49 matching tasks. The whole data set is thus composed of 55 × 49 matching tasks.

7.1 Experimental setting

Part of the data set is used for blind evaluation. This subset includes all matching tasks involving the edas and ekaw ontologies (resulting in 55 × 24 matching tasks). As last year, the results reported here are based on this blind data set.
Participants were able to test their systems on the available subset of matching tasks (open evaluation), accessible via the SEALS repository. The open subset covers 45 × 25 tasks and does not include the Italian translations.

We distinguish two types of matching tasks: (i) tasks where two different ontologies (cmt→confOf, for instance) have been translated into two different languages; and (ii) tasks where the same ontology (cmt→cmt) has been translated into two different languages. For tasks of type (ii), good results are not directly related to the use of specific cross-lingual techniques, but rather to the ability to exploit the identical structure of the ontologies.

This year, 8 systems (out of 22) participated in the MultiFarm track (i.e., those that were assigned to the task in the registration phase): AML, CroLOM, KEPLER, LogMap, LogMapLite, SANOM, WikiV3 and XMAP. LogMapLite does not implement any specific cross-lingual strategy. The number of participants is stable with respect to the last campaigns (7 in 2016, 5 in 2015, 3 in 2014, 7 in 2013, and 7 in 2012). For the sake of simplicity, in the following we refer to systems implementing cross-lingual matching strategies as cross-lingual systems, and to systems without that feature as non-cross-lingual systems. The reader can refer to the OAEI papers for a detailed description of the strategies adopted by each system; in fact, most of them still adopt a translation step before the matching itself.

For this track, the general comments with respect to the runs are: (i) CroLOM participated with the same version as last year; (ii) LogMap encountered problems accessing the Google translator server; (iii) KEPLER generated some parsing errors for some pairs; (iv) some systems (AML, LogMap and LogMapLite) generated correspondences with confidence higher than 1.0 (no post-processing has been done in these cases).
7.2 Execution setting and runtime

The systems have been executed on a Windows machine with an Intel Core i7-7500U CPU @ 2.70GHz x 4 and 8 GB of RAM. All measurements are based on a single run. As Table 16 shows, there are large differences in the time required for a system to complete the 55 × 24 matching tasks. Note as well that the concurrent access to the SEALS repositories during the evaluation period may have an impact on the time required for completing the tasks.

7.3 Evaluation results

Table 16 presents the aggregated results for the 55 × 24 matching tasks. They have been computed using the Alignment API 4.6 and can differ slightly from those computed with the SEALS client. We have not applied any threshold on the results, which are measured in terms of classical precision and recall.

Overall, as expected, systems implementing cross-lingual techniques outperform the non-cross-lingual systems. However, as stated above, this year we did not run all systems, but focused on those registered for the task. AML outperforms all other systems in terms of F-measure for tasks of type (i), keeping its top place; it is followed by LogMap, CroLOM, KEPLER and WikiV3. With respect to tasks of type (ii), AML has relatively low performance, mainly due to some errors in parsing the alignments for which a confidence higher than 1 was generated. KEPLER provided the highest F-measure for tasks of type (ii), followed by LogMap, CroLOM and AML. We observe that WikiV3 is able to maintain its performance in both types of tasks.

Table 16. MultiFarm aggregated results per matcher, for each type of matching task: different ontologies (i) and same ontologies (ii).

                            Type (i) – 22 tests per pair           Type (ii) – 2 tests per pair
System       Time  #pairs   Size   Prec.     F-m.      Rec.        Size   Prec.     F-m.      Rec.
AML           677    55     8.21   .72(.72)  .46(.46)  .35(.35)   45.54   .89(.96)  .26(.28)  .16(.17)
CroLOM       5501    55     8.56   .55(.55)  .36(.36)  .28(.28)   38.76   .89(.90)  .40(.40)  .26(.27)
KEPLER       2180    55    10.63   .43(.43)  .31(.31)  .25(.25)   58.34   .90(.90)  .52(.52)  .38(.38)
LogMap         57    55     6.99   .73(.73)  .37(.37)  .25(.25)   46.80   .95(.96)  .42(.43)  .28(.28)
LogMapLite     38    55     1.16   .36(.36)  .04(.04)  .02(.02)   94.5    .02(.02)  .01(.03)  .01(.02)
SANOM          22    30     2.86   .43(.79)  .13(.25)  .08(.15)    8.33   .54(.99)  .06(.12)  .03(.06)
WikiV3       1343    55    11.89   .30(.30)  .25(.25)  .21(.21)   29.37   .62(.62)  .23(.23)  .14(.14)
XMAP          102    27     3.84   .24(.50)  .06(.14)  .04(.09)   15.76   .66(.91)  .10(.14)  .06(.09)

Time is measured in minutes (for completing the 55 × 24 matching tasks); #pairs indicates the number of pairs of languages for which the tool was able to generate non-empty alignments; size indicates the average number of correspondences generated for the tests where a non-empty alignment was produced. Two kinds of results are reported: those that do not distinguish empty and erroneous (or not generated) alignments, and those (indicated between parentheses) that consider only the non-empty alignments generated for a pair of languages.

With respect to the pairs of languages for test cases of type (i), for the sake of brevity, we do not present the results for all 55 pairs; the reader can refer to the OAEI results web page for the detailed results. Five cross-lingual systems out of seven were able to deal with all pairs of languages (AML, CroLOM, KEPLER, LogMap and WikiV3). While the only non-specific system was able to generate non-empty (but erroneous) results for all pairs, specific systems such as SANOM and XMap had problems dealing with the ar, cn and ru languages and hence were not able to generate alignments for most pairs involving these languages. This behaviour was also observed for specific systems in the last campaign.
For the group of systems implementing cross-lingual strategies, the top F-measures were obtained for the pairs es-it (AML), nl-pt (CroLOM), de-pt (KEPLER), en-nl (LogMap), es-it (SANOM), it-pt (WikiV3) and es-pt (XMap). We can observe that most of the systems deal better with the pairs involving the pt, it, es, nl, de and en languages. This may be due to the coverage or performance of the resources and translations for these languages, together with the fact that dealing with comparable languages12 can make the task easier. In fact, we can also observe that for most systems the worst results have been produced for the pairs involving ar, cn, cz and ru. The exceptions are SANOM and XMap, whose worst results also include pairs involving es, nl and pt, or fr, en and it, respectively. With respect to the only non-cross-lingual system, LogMapLite, it in fact takes advantage of comparable languages, in the absence of specific strategies. This is corroborated by the fact that it generated its best F-measures for the pairs de-en, es-pt, it-pt and es-it. This (expected) behaviour has been observed throughout the campaigns.
12 An example of comparable natural languages is English and German, both belonging to the Germanic language family. Comparable natural languages can also be languages that are not from the same language family. For example, Italian, belonging to the Romance language family, and German, belonging to the Germanic language family, can still be compared using string comparison techniques such as edit distance, as they are both alphabetic letter-based with comparable graphemes. An example of natural languages that are not comparable in this context is Chinese and English, where the former is logogram-based and the latter is alphabetic letter-based [12].
Comparison with previous campaigns.
The number of participants implementing cross-lingual strategies remains stable this year with respect to the last campaigns (7 in 2016, 5 in 2015, 3 in 2014, 7 in 2013 and 2012, and 3 in 2011). 4 systems also participated last year (AML, LogMap, CroLOM, and XMap) and we count 3 new systems (KEPLER, SANOM, and WikiV3). Comparing with the results from last year, in terms of F-measure and with respect to the blind evaluation (cases of type i), AML maintains its performance, with a very slight increase over 2016 (.46 in 2017, .45 in 2016 and .47 in 2015). CroLOM, LogMap, and XMAP maintained their performance (.36, .37 and .06, respectively). The newcomer WikiV3 obtained stable results for both kinds of tasks, but with an F-measure below AML, LogMap, CroLOM and KEPLER. For task ii), we can observe that KEPLER (.52) outperforms LogMap (.44), the best system from last year, in terms of F-measure. 7.4 Conclusion Of the 22 participants, 8 were evaluated in MultiFarm. In terms of performance, the F-measure for blind tests remains relatively stable across campaigns. AML and LogMap keep their positions with respect to the previous campaigns, followed by CroLOM and KEPLER. Still, all systems privilege precision to the detriment of recall, and the results are below those obtained for the original Conference dataset. We can observe as well that the systems are not able to provide good results for, or even deal with, pairs involving specific languages such as ar, cn and ru. As in previous years, cross-lingual approaches are still mainly based on translation strategies, and the combination of other resources (such as cross-lingual links in Wikipedia, BabelNet, etc.) and strategies (machine learning, indirect alignment composition) remains underexploited. As last year, the evaluation was conducted only on the blind set (results have not been reported for the open data set). As future work, we plan to compare the performance of the systems in both multilingual and cross-lingual settings.
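The notion of comparable languages invoked in footnote 12 rests on string similarity over shared graphemes. A small self-contained Levenshtein sketch (the example labels below are ours, not taken from the MultiFarm data) illustrates why edit distance helps between alphabet-based languages but gives no signal across writing systems:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Cognate labels in comparable, letter-based languages stay close...
print(levenshtein("universita", "universitat"))  # it vs. de spelling
# ...while a logogram-based label shares no graphemes at all.
print(levenshtein("university", "大学"))
```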
8 Interactive matching The interactive matching track was organized at OAEI 2017 for the fifth time. The goal of this evaluation is to simulate interactive matching [36, 14], where a human expert is involved to validate correspondences found by the matching system. In the evaluation, we look at how interacting with the user improves the matching results. Currently, this track does not evaluate the user experience or the user interfaces of the systems. 8.1 Datasets The Interactive track uses four OAEI datasets: Anatomy (Section 3), Conference (Section 4), LargeBio (Section 5), and Phenotype (Section 6). For details on the datasets, please refer to their respective sections. 8.2 Experimental setting The Interactive track relies on the SEALS client's Oracle class to simulate user interactions. An interactive matching system can present a correspondence to the oracle, which will tell the system whether that correspondence is right or wrong. This year we have extended this functionality by allowing a system to present a collection of mappings simultaneously to the oracle. If a system presents up to three mappings together and each mapping presented has a mapped entity (i.e., class or property) in common with at least one other mapping presented, the oracle counts this as a single interaction, under the rationale that this corresponds to a scenario where a user is asked to choose between conflicting candidate mappings. To simulate the possibility of user errors, the oracle can be set to reply with a given error probability (randomly, from a uniform distribution). We evaluated systems with four different error rates: 0.0 (perfect user), 0.1, 0.2, and 0.3. The evaluations on the Conference and Anatomy datasets were run on a server with a 3.46 GHz CPU (6 cores) and 8GB RAM allocated to the matching systems. Each system was run ten times, and the final result of a system for each error rate is the average of these runs.
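The behaviour of the oracle described above can be sketched as follows. This is our own illustrative reconstruction, not the actual SEALS Oracle class; class and method names are assumptions:

```python
import random

class Oracle:
    """Simulated domain expert validating candidate correspondences.

    Answers truthfully according to the reference alignment, but flips
    each answer with probability `error_rate` (uniform, as in the track)."""

    def __init__(self, reference, error_rate=0.0, seed=None):
        self.reference = reference        # set of correct correspondences
        self.error_rate = error_rate
        self.rng = random.Random(seed)
        self.interactions = 0

    def validate(self, mappings):
        """One interaction: up to three conflicting mappings at once."""
        assert 1 <= len(mappings) <= 3
        self.interactions += 1            # grouped queries count once
        answers = {}
        for m in mappings:
            truth = m in self.reference
            wrong = self.rng.random() < self.error_rate
            answers[m] = truth != wrong   # flip with probability error_rate
        return answers

oracle = Oracle({("mouse:A", "human:B")}, error_rate=0.0)
print(oracle.validate([("mouse:A", "human:B"), ("mouse:A", "human:C")]))
print(oracle.interactions)
```

Note that a faithful oracle would also check that grouped mappings share a mapped entity; that check is omitted here for brevity.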
For the Conference dataset with the ra1 alignment, precision and recall correspond to the micro-average over all ontology pairs, whereas the number of interactions represents the total number of interactions for all the pairs. Both are averaged over the ten runs. The Phenotype and LargeBio evaluation was run on an Ubuntu laptop with an Intel Core i7-4600U CPU @ 2.10GHz x 4 and 15GB of RAM allocated. Each system was run only once, due to the time required by some of the systems. Since errors are randomly introduced, minor variations between runs are expected; however, as the Phenotype and LargeBio tasks involve large ontologies and a comparatively large number of questions, these variations should be mostly negligible. 8.3 Evaluation For the sake of brevity, we present only the results for the Anatomy, Conference, and LargeBio tasks. For the Phenotype tasks, please refer to the OAEI website13 . Table 17 and Figure 6 show the results for the Anatomy and Conference datasets, and Table 18 and Figure 7 show the results for the LargeBio tasks. The tables include the following information (column names within parentheses):
– The number of unsatisfiable classes resulting from the alignments, computed as detailed in Section 5 (LargeBio dataset only).
– The performance of the system: Precision (Prec.), Recall (Rec.) and F-measure (F-m.) with respect to the fixed reference alignment, as well as Recall+ (Rec.+) for the Anatomy task (as detailed in Section 3). To facilitate the assessment of the impact of user interactions, we also provide the performance results from the original tracks, without interaction (line with Error NI).
– To ascertain the impact of the oracle errors, we provide the performance of the system with respect to the oracle (i.e., the reference alignment as modified by the errors introduced by the oracle): Precision oracle (Prec. oracle), Recall oracle (Rec. oracle) and F-measure oracle (F-m. oracle).
For a perfect oracle these values match the actual performance of the system.
– Total requests (Tot Reqs.) represents the number of distinct user interactions with the tool, where each interaction can contain one to three conflicting mappings that could be analysed simultaneously by a user.
– Distinct mappings (Dist. Mapps) counts the total number of mappings for which the oracle gave feedback to the system (regardless of whether they were submitted simultaneously or separately).
– Finally, the performance of the oracle itself with respect to the errors it introduced can be gauged through the positive precision (Pos. Prec.) and negative precision (Neg. Prec.), which measure respectively the fraction of positive and negative answers given by the oracle that are correct. For a perfect oracle these values are equal to 1 (or 0, if no questions were asked).
13 http://oaei.ontologymatching.org/2017/results/interactive/
The figures show the time intervals between the questions to the user/oracle for the different systems and error rates. Different runs are depicted in different colours. 8.4 Discussion The matching systems that participated in this track employ different user-interaction strategies. While LogMap, XMap and AML make use of user interactions exclusively in the post-matching steps to filter their candidate mappings, ALIN can also add new candidate mappings to its initial set. LogMap and AML both request feedback only on selected mapping candidates (based on their similarity patterns or their involvement in unsatisfiabilities), and AML presents one mapping at a time to the user. XMap also presents one mapping at a time and asks mainly about false mappings. ALIN and LogMap can both ask the oracle to analyse several conflicting mappings simultaneously. The performance of the systems usually improves when interacting with a perfect oracle in comparison with no interaction.
The one exception is XMap in the Conference dataset, because it is barely interactive in this dataset. In general, XMap performs very few requests to the oracle compared to the other systems, except in the SNOMED-NCI task, where it makes the most requests. Thus, it is also the system that improves the least with user interaction. On the other end of the spectrum, ALIN is the system that improves the most, not only because it makes a high number of oracle requests (the most in Anatomy and Conference) but also because its non-interactive performance was the lowest of the interactive systems, and thus the easiest to improve. Although the systems' performance deteriorates as the error rate increases, there are still benefits from the user interaction: some of the systems' measures stay above their non-interactive values even for the larger error rates. Naturally, the more a system relies on the oracle, the more its performance tends to be affected by the oracle's errors. The impact of the oracle's errors is linear for ALIN, AML, and for XMap in most tasks, as the F-measure according to the oracle remains approximately constant across all error rates. It is supra-linear for LogMap in all datasets, and for XMap in the SNOMED-NCI task, as the F-measure according to the oracle decreases as the error rate increases. This means that the latter systems are deliberately or implicitly letting the oracle's replies affect their selection of mappings beyond those they asked about, and thus propagating the oracle's errors. Two models for system response times are frequently used in the literature [10]: Shneiderman and Seow take different approaches to categorising response times. Shneiderman takes a task-centred view and sorts response times into four categories according to task complexity: typing and mouse movement (50-150 ms), simple frequent tasks (1 s), common tasks (2-4 s) and complex tasks (8-12 s).
He suggests that the user is more tolerant of delays as the complexity of the task at hand grows. Unfortunately, no clear definition is given of how to determine task complexity. Seow's model looks at the problem from a user-centred perspective by considering the user's expectations towards the execution of a task: instantaneous (100-200 ms), immediate (0.5-1 s), continuous (2-5 s), captive (7-10 s). Ontology alignment is a cognitively demanding task and falls into the third or fourth category in both models. In this regard, the response times (request intervals, as we call them above) observed in all datasets fall into the tolerable and acceptable ranges, and even into the first categories, in both models. The request intervals for AML, LogMap and XMAP stay at a few milliseconds for most datasets. ALIN's request intervals are higher, but still in the tenths-of-a-second range. It could be the case, however, that a user would not be able to take advantage of these low response times, because the task complexity may result in a higher user response time (i.e., the time the user needs to respond to the system after the system is ready). Regarding the number of unsatisfiable classes resulting from the alignments, we observe some expected variations as the error increases. We note that, with interaction, the alignments
Table 17. Interactive matching results for the Anatomy and Conference datasets
Tool Error | Prec. Rec. F-m. Rec.+ | Prec. oracle Rec. oracle F-m. oracle | Tot. Reqs. Dist. Mapps | Pos. Prec. Neg. Prec.
Anatomy Dataset
ALIN:
NI 0.985 0.339 0.504 0.0 – – – – – – –
0.0 0.993 0.794 0.882 0.454 0.993 0.794 0.882 939 1472 1.0 1.0
0.1 0.94 0.745 0.831 0.403 0.993 0.79 0.88 905 1352 0.905 0.8977
0.2 0.895 0.703 0.787 0.358 0.993 0.788 0.879 891 1311 0.824 0.796
0.3 0.846 0.649 0.735 0.301 0.993 0.781 0.874 882 1266 0.734 0.668
AML:
NI 0.95 0.936 0.943 0.832 – – – – – – –
0.0 0.968 0.948 0.958 0.862 0.968 0.948 0.958 241 240 1.0 1.0
0.1 0.956 0.946 0.95 0.856 0.969 0.949 0.959 266 264 0.73 0.972
0.2 0.939 0.942 0.94 0.849 0.969 0.951 0.96 283 280 0.513 0.93
0.3 0.922 0.939 0.931 0.843 0.97 0.952 0.961 310 308 0.359 0.902
LogMap:
NI 0.911 0.846 0.877 0.593 – – – – – – –
0.0 0.982 0.846 0.909 0.595 0.982 0.846 0.909 388 1164 1.0 1.0
0.1 0.962 0.83 0.891 0.564 0.966 0.803 0.877 388 1164 0.748 0.964
0.2 0.944 0.823 0.88 0.552 0.945 0.762 0.843 388 1164 0.566 0.927
0.3 0.931 0.82 0.872 0.544 0.92 0.722 0.809 388 1164 0.431 0.879
XMap:
NI 0.926 0.863 0.893 0.639 – – – – – – –
0.0 0.927 0.865 0.895 0.644 0.927 0.865 0.895 35 35 1.0 1.0
0.1 0.927 0.865 0.895 0.644 0.927 0.863 0.894 35 35 0.602 0.964
0.2 0.927 0.865 0.895 0.644 0.927 0.862 0.893 35 35 0.422 0.964
0.3 0.927 0.865 0.895 0.644 0.927 0.861 0.893 35 35 0.278 0.93
Conference Dataset
ALIN:
NI 0.892 0.272 0.417 – – – – – – – –
0.0 0.957 0.731 0.829 – 0.957 0.731 0.829 329 571 1.0 1.0
0.1 0.804 0.669 0.73 – 0.961 0.737 0.834 321 549 0.752 0.966
0.2 0.669 0.622 0.645 – 0.965 0.751 0.845 313 534 0.558 0.93
0.3 0.577 0.56 0.568 – 0.966 0.752 0.845 302 517 0.431 0.875
AML:
NI 0.841 0.659 0.739 – – – – – – – –
0.0 0.912 0.711 0.799 – 0.912 0.711 0.799 271 270 1.0 1.0
0.1 0.841 0.701 0.765 – 0.923 0.732 0.816 282 275 0.704 0.975
0.2 0.768 0.672 0.717 – 0.925 0.745 0.825 292 279 0.538 0.92
0.3 0.713 0.651 0.68 – 0.929 0.751 0.83 291 274 0.45 0.877
LogMap:
NI 0.818 0.59 0.686 – – – – – – – –
0.0 0.886 0.61 0.723 – 0.886 0.61 0.723 82 246 1.0 1.0
0.1 0.851 0.598 0.702 – 0.855 0.573 0.686 82 246 0.698 0.978
0.2 0.821 0.585 0.684 – 0.829
0.542 0.656 82 246 0.507 0.941
0.3 0.795 0.581 0.671 – 0.807 0.518 0.631 82 246 0.363 0.902
XMap:
NI 0.837 0.57 0.678 – – – – – – – –
0.0 0.837 0.57 0.678 – 0.837 0.57 0.678 4 4 0.0 1.0
0.1 0.837 0.57 0.678 – 0.837 0.57 0.678 4 4 0.0 1.0
0.2 0.837 0.57 0.678 – 0.837 0.569 0.677 4 4 0.0 1.0
0.3 0.837 0.57 0.678 – 0.837 0.569 0.678 4 4 0.0 1.0
NI stands for non-interactive, and refers to the results obtained by the matching system in the original track.
Fig. 6. Time intervals between requests to the user/oracle for the Anatomy (top 4 plots) and Conference (bottom 4 plots) datasets. Whiskers: Q1-1.5·IQR, Q3+1.5·IQR, IQR=Q3-Q1. The labels under the system names show the average number of requests and the mean time between requests over the ten runs.
Table 18. Interactive matching results for the LargeBio dataset
Tool Error Unsat. | Prec. Rec. F-m. | Prec. oracle Rec. oracle F-m. oracle | Tot. Reqs. Dist. Mapps | Pos. Prec. Neg. Prec.
FMA-NCI Small Dataset
ALIN:
NI N/A 0.995 0.455 0.624 – – – – – – –
0.0 2 0.996 0.63 0.772 0.996 0.63 0.772 653 1,019 1 1
0.1 85 0.971 0.614 0.752 0.996 0.63 0.772 629 932 0.908 0.907
0.2 152 0.958 0.593 0.733 0.996 0.624 0.767 605 881 0.855 0.788
0.3 91 0.937 0.58 0.716 0.996 0.623 0.767 589 855 0.772 0.696
AML:
NI 2 0.963 0.902 0.932 – – – – – – –
0.0 2 0.99 0.913 0.95 0.99 0.913 0.95 449 447 1 1
0.1 222 0.98 0.908 0.943 0.99 0.914 0.95 497 484 0.896 0.936
0.2 2 0.974 0.894 0.932 0.987 0.91 0.947 450 450 0.794 0.768
0.3 2 0.966 0.894 0.929 0.981 0.911 0.945 450 450 0.751 0.734
LogMap:
NI 2 0.944 0.897 0.92 – – – – – – –
0.0 2 0.992 0.901 0.944 0.992 0.901 0.944 1,131 1,131 1 1
0.1 2 0.98 0.881 0.928 0.983 0.892 0.935 1,209 1,209 0.942 0.909
0.2 2 0.967 0.874 0.918 0.964 0.875 0.917 1,247 1,247 0.837 0.84
0.3 2 0.963 0.872 0.915 0.935 0.849 0.89 1,327 1,327 0.727 0.776
XMap:
NI 2 0.977 0.901 0.937 – – – – – – –
0.0 2 0.991 0.9 0.943 0.991 0.9 0.943 188 188 1 1
0.1 2 0.988 0.895 0.939 0.99 0.9 0.943 187 187 0.962 0.819
0.2 2 0.988 0.892 0.938 0.99
0.899 0.942 187 187 0.939 0.753
0.3 2 0.985 0.887 0.933 0.99 0.899 0.942 188 188 0.851 0.628
SNOMED-NCI Small Dataset
AML:
NI 3,966 0.904 0.713 0.797 – – – – – – –
0.0 0 0.972 0.726 0.831 0.972 0.726 0.831 2,730 2,730 1 1
0.1 0 0.967 0.717 0.823 0.972 0.724 0.83 2,730 2,730 0.942 0.857
0.2 0 0.961 0.707 0.815 0.972 0.721 0.828 2,730 2,730 0.88 0.732
0.3 0 0.955 0.697 0.806 0.972 0.719 0.827 2,730 2,730 0.818 0.622
LogMap:
NI 0 0.922 0.663 0.771 – – – – – – –
0.0 0 0.985 0.669 0.797 0.985 0.669 0.797 5,596 5,596 1 1
0.1 16 0.974 0.651 0.78 0.971 0.656 0.783 6,201 6,201 0.945 0.855
0.2 16 0.965 0.64 0.77 0.948 0.639 0.763 6,737 6,737 0.859 0.766
0.3 16 0.959 0.635 0.764 0.92 0.62 0.741 7,159 7,159 0.753 0.693
XMap:
NI 46,091 0.911 0.564 0.697 – – – – – – –
0.0 35,869 0.924 0.59 0.72 0.924 0.59 0.72 11,932 11,689 1 1
0.1 35,455 0.923 0.591 0.721 0.84 0.568 0.678 11,931 11,694 0.99 0.602
0.2 35,968 0.921 0.591 0.72 0.754 0.541 0.63 11,911 11,682 0.975 0.41
0.3 36,619 0.919 0.592 0.72 0.676 0.514 0.584 11,903 11,693 0.953 0.297
NI stands for non-interactive, and refers to the results obtained by the matching system in the original track. ALIN was unable to complete the SNOMED-NCI task.
Fig. 7. Time intervals between requests to the user/oracle for the FMA-NCI (top 4 plots) and SNOMED-NCI (bottom 4 plots) datasets from the LargeBio track. Whiskers: Q1-1.5·IQR, Q3+1.5·IQR, IQR=Q3-Q1. The labels under the system names show the number of requests and the mean time between requests.
produced by the systems are typically larger than without interaction, which makes the repair process harder. The introduction of oracle errors complicates the process further, and may make an alignment irreparable if the system follows the oracle's feedback blindly. 9 Instance matching The instance matching track aims at evaluating the performance of matching tools when the goal is to detect the degree of similarity between pairs of items/instances expressed in the form of OWL ABoxes.
The track is organized in two independent tasks called SYNTHETIC and DOREMUS. Each test is based on two datasets, called source and target, and the goal is to discover the matching pairs (i.e., mappings) between the instances in the source dataset and the instances in the target dataset. For the sake of clarity, we split the presentation of the task results into two different subsections. 9.1 SYNTHETIC task Task data The SYNTHETIC datasets are produced using SPIMBENCH [40], with the aim of generating descriptions of the same entity where value-based, structure-based and semantics-aware transformations are applied to the source data in order to create the target data. The value-based transformations consider mainly typographical errors and different data formats, the structure-based transformations are applied to the structure of object and datatype properties, and the semantics-aware transformations concern the instance level and take schema information into account. The latter are used to examine whether the matching systems take RDFS and OWL constructs into account in order to discover correspondences between instances that can be found only by considering schema information. We stress that an instance in the source dataset can have either no matching counterpart in the target dataset or exactly one. A dataset is composed of a TBox and a corresponding ABox. Source and target datasets share almost the same TBox (differences are found in the properties, due to the employed structure-based transformations). The Sandbox scale is 10K triples ≈ 380 instances, while the Mainbox scale is 50K triples ≈ 1800 instances. We asked the participants to match the creative works (news items, blogposts and programmes) in the source dataset against the instances of the corresponding class in the target dataset. Results The participants in the SYNTHETIC task are the AgreementMakerLight (AML), I-Match, Legato and LogMap systems.
In order to evaluate these systems, we built a ground truth containing the set of expected links, where an instance i1 in the source dataset is associated with an instance j1 in the target dataset that has been generated as a modified description of i1. The value-based, structure-based and semantics-aware transformations were applied on different triples of the source dataset pertaining to one class instance. The systems were judged on the basis of the precision, recall and F-measure results shown in Table 19. LogMap and Legato produce links that are very often correct (resulting in a good precision) but fail to capture a large number of the expected links (resulting in a lower recall). In the case of the AML and I-Match systems, the probability of capturing a correct link is high, but the probability of a retrieved link being correct is lower, resulting in a high (almost perfect) recall but a low precision. Regarding the size of the dataset, the LogMap and Legato systems have better results for the Sandbox dataset. On the other hand, the AML and I-Match systems exhibit the same performance for both the Sandbox and Mainbox datasets.
Table 19. SYNTHETIC task results
System | Sandbox task: Precision Recall F-measure | Mainbox task: Precision Recall F-measure
AML | 0.849 1.000 0.918 | 0.855 1.000 0.922
I-Match | 0.854 0.997 0.920 | 0.856 0.997 0.921
Legato | 0.980 0.730 0.840 | 0.970 0.700 0.810
LogMap | 0.938 0.763 0.841 | 0.893 0.709 0.790
9.2 DOREMUS task Task data The DOREMUS task, having its second appearance at the OAEI, contains real-world datasets coming from two major French cultural institutions: the BnF (French National Library) and the PP (Philharmonie de Paris).
The data are about classical music works and follow the DOREMUS model (a single vocabulary for both datasets) issued from the DOREMUS project.14 Each data entry, or instance, is a bibliographical record about a musical piece, containing properties such as the composer, the title(s) of the work, the year of creation, the key, the genre and the instruments, to name a few. These data have been converted to RDF from their original UNI- and INTER-MARC formats and anchored to the DOREMUS ontology and a set of domain controlled vocabularies with the help of the marc2rdf converter,15 developed for this purpose within the DOREMUS project (for more details on the conversion method and on the ontology we refer to [1] and [31]). Note that these data are highly heterogeneous. We have selected works described both at the BnF and at the PP with different degrees of heterogeneity in their descriptions. The datasets have been selected for the purposes of two sub-tasks. Heterogeneities (HT): This sub-task consists in aligning two datasets, BnF-1 and PP-1, containing about 238 instances each, by discovering 1:1 equivalence relations between them. These data manifest different types of heterogeneities, identified by music library experts, such as multilingualism, differences in catalogs, differences in spelling, different degrees of description, etc. The goal is to test the ability of linking tools to cope with these heterogeneities. The participants are asked to map only instances of the F22 Self-Contained Expression class. False Positives Trap (FPT): This sub-task consists in correctly disambiguating the instances contained in two datasets of small size (75 instances each), BnF-2 and PP-2, by discovering 1:1 equivalence relations between the instances that they contain. Librarian experts have selected several groups of music works with highly similar descriptions across the two datasets, where there exists only one correct match in each group.
The goal is to challenge the linking tools' capacity to avoid generating false positives and to correctly match instances in the presence of highly similar but distinct candidates. The participants are asked to map only instances of the F22 Self-Contained Expression class. Results Five systems participated and returned results on the DOREMUS track: AML, I-Match, Legato, LogMap and NjuLink. Two systems stand out, significantly outperforming the other participants on both sub-tasks: Legato and NjuLink, both achieving F-measures of over 0.9 (NjuLink leading on HT and Legato on FP-Trap). Both tasks appear to be fairly challenging for the majority of the systems, with average F-measures of 0.636 for the HT task and 0.565 for the FP-Trap task.
14 http://www.doremus.org
15 https://github.com/DOREMUS-ANR/marc2rdf
Table 20. Results of the DOREMUS task
System | HT task: Precision Recall F-measure | FP-Trap task: Precision Recall F-measure
AML | 0.851 0.479 0.613 | 0.914 0.427 0.582
I-Match | 0.680 0.071 0.129 | 1.00 0.053 0.101
Legato | 0.930 0.920 0.930 | 1.00 0.980 0.990
LogMap | 0.406 0.882 0.556 | 0.119 0.880 0.210
NjuLink | 0.966 0.945 0.955 | 0.959 0.933 0.946
10 HOBBIT Link Discovery In this track, two benchmark generators are proposed to deal with link discovery for spatial data represented as trajectories, i.e., sequences of (longitude, latitude) pairs. This new track uses the HOBBIT platform16 and follows different instructions from the SEALS-based tracks. We use TomTom17 datasets in order to create the benchmark. TomTom datasets contain representations of traces (GPS fixes). Each trace consists of a number of points, and each point has a timestamp, longitude, latitude and speed. The points are sorted in ascending order by the timestamp of the corresponding GPS fix. Each task of the HOBBIT Link Discovery track is composed of two datasets with different numbers of instances to match, namely the Sandbox and the Mainbox.
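The trace representation described above can be sketched as a simple data structure. Field names are ours, chosen to mirror the description in the text, not the actual benchmark schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Point:
    timestamp: int    # epoch seconds of the GPS fix
    longitude: float
    latitude: float
    speed: float

@dataclass
class Trace:
    points: List[Point]

    def sorted_ok(self) -> bool:
        """Points must be in ascending order by timestamp."""
        ts = [p.timestamp for p in self.points]
        return ts == sorted(ts)

# A two-point trace with (hypothetical) coordinates near Munich.
trace = Trace([Point(0, 11.57, 48.14, 13.9), Point(5, 11.58, 48.15, 12.1)])
print(trace.sorted_ok())  # the points are in timestamp order
```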
The HOBBIT Link Discovery track comprises two tasks: – Task 1 (Linking) measures how well the systems can match traces that have been modified using string-based approaches along with addition and deletion of intermediate points. Since TomTom datasets only contain coordinates, in order to apply the string-based modifications implemented in LANCE [41] we replaced a number of those points with labels retrieved from Linked Data spatial datasets using the Google Maps18 , Foursquare19 and Nominatim OpenStreetMap20 APIs. This task also contains modifications of date and coordinate formats. An instance in the source dataset has one matching counterpart in the target dataset. For the Linking task, the Sandbox scale is 100 instances while the Mainbox scale is 5K instances. We asked the participants to match traces in the source and the target datasets. The participants in the Linking task are the AgreementMakerLight (AML) and OntoIdea systems. For evaluation, we built a ground truth containing the set of expected links, where an instance i1 in the source dataset is associated with an instance j1 in the target dataset that has been generated as an altered description of i1. The transformations consisted of value-based and structure-based transformations applied to different triples pertaining to instances of the class Trace.
16 https://project-hobbit.eu/outcomes/hobbit-platform/
17 https://www.tomtom.com/
18 https://developers.google.com/maps/
19 https://developer.foursquare.com/
20 http://nominatim.openstreetmap.org/
Table 21. HOBBIT Link Discovery Linking Task
System | Precision Recall F-measure | Run Time
Sandbox task
AML | 1.000 1.000 1.000 | 11722
OntoIdea | 0.990 0.990 0.990 | 19806
Mainbox task
AML | 1.000 1.000 1.000 | 134456
OntoIdea | Platform Time Limit (75 mins)
The systems were judged on the basis of the precision, recall, F-measure and runtime results shown in Table 21.
Both the AML and OntoIdea systems return high precision and recall, capturing nearly all the correct links. Regarding runtime, for the Sandbox dataset AML needs less time than OntoIdea, and for the Mainbox dataset AML completes the task with perfect results, in contrast to OntoIdea, which was not able to complete it and stopped when it hit the platform time limit (75 mins). Datasets, reference alignments, and task results are available on the HOBBIT website: https://project-hobbit.eu/challenges/om2017/. – Task 2 (Spatial) measures how well the systems can identify DE-9IM (Dimensionally Extended nine-Intersection Model) topological relations. The supported spatial relations are the following: Equals, Disjoint, Touches, Contains/Within, Covers/CoveredBy, Intersects, Crosses, Overlaps. The traces are represented in the Well-Known Text (WKT) format. For each relation, a different pair of source and target datasets is given to the participants. Given a LineString source geometry s, a LineString target geometry t and a DE-9IM topological relation r, we ask the participants to match an instance from s with one or more instances in t such that their intersection matrix follows the definition of r. For evaluation, we built a ground truth using RADON [42] containing the set of expected links, where an instance i1 in the source dataset is associated with one or more instances in the target dataset that have been generated as altered descriptions of i1. For the Spatial task, the Sandbox scale is 10 instances and the Mainbox scale is 2K instances. The participants in the Spatial task are the AgreementMakerLight (AML), OntoIdea, Rapid Discovery of Topological Relations (RADON) and Silk systems. The systems were judged on the basis of the precision, recall, F-measure and runtime results shown in Table 22 and Figures 8 and 9.
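Each DE-9IM relation is defined by a pattern over the nine-cell intersection matrix (interior, boundary and exterior of the source geometry against those of the target). The sketch below is our illustration of this pattern matching, not the benchmark's evaluation code; the Equals and Disjoint patterns shown are the standard DE-9IM ones:

```python
def de9im_matches(matrix: str, pattern: str) -> bool:
    """Check a 9-character DE-9IM intersection matrix against a pattern.

    Matrix cells hold the dimension of the intersection: '0', '1', '2'
    or 'F' (empty). Pattern cells: 'T' = non-empty, 'F' = empty,
    '*' = don't care, a digit = exact dimension."""
    assert len(matrix) == len(pattern) == 9
    for cell, want in zip(matrix, pattern):
        if want == "*":
            continue
        if want == "T" and cell not in "012":
            return False
        if want != "T" and cell != want:
            return False
    return True

EQUALS = "T*F**FFF*"      # standard DE-9IM pattern for geometric equality
DISJOINT = "FF*FF****"    # standard pattern for disjointness

# A LineString compared with itself yields this intersection matrix:
print(de9im_matches("1FFF0FFF2", EQUALS))    # equality pattern matches
print(de9im_matches("1FFF0FFF2", DISJOINT))  # disjointness does not
```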
We should mention that we present only the time performance, and not precision, recall and F-measure, as all were equal to 1.0 except for OntoIdea, which reports 0.99 for the Touches and Overlaps relations. Moreover, Silk did not participate in the Covers and CoveredBy relations, and OntoIdea did not participate in the Disjoint relation. From the results we can observe that:
• OntoIdea has the best performance on the Sandbox dataset, but on the Mainbox dataset its runtime increases and the system does not seem able to handle large datasets easily.
• Silk seems to show a similar behaviour to OntoIdea.
• The RADON and AML systems seem to handle the growth of the dataset size more smoothly.
• AML does not provide any results for the Disjoint relation, since it reaches the platform time limit.
Datasets, reference alignments, and task results are available on the HOBBIT website: https://project-hobbit.eu/challenges/om2017/. Fig. 8. HOBBIT Link Discovery Spatial Task (Sandbox) Fig. 9. HOBBIT Link Discovery Spatial Task (Mainbox) 11 Process Model Matching In 2013 and 2015, the community interested in business process modeling conducted an evaluation campaign similar to the OAEI [4]. Instead of matching ontologies, the task was to match process models described in different formalisms, such as BPMN and Petri Nets. Within this track, we offer a subset of the tasks from the Process Model Matching Contest as an OAEI track, by converting the process models to an ontological representation. By offering this track, we hope to gain insights into how far ontology matching systems are capable of solving the more specific problem of matching process models. This track is also motivated by the discussions at the end of the 2015 Ontology Matching workshop, where many participants showed their interest in such a track. Table 22.
Spatial Benchmark results: run times per relation and system

Relation     System     Sandbox run time   Mainbox run time
EQUALS       AML        8157               10284
             OntoIdea   1531               567169
             RADON      2215               4680
             Silk       4059               125967
DISJOINT     AML        7173               Time-out (75 min)
             OntoIdea   Not participating
             RADON      1558               19214
             Silk       3224               257877
TOUCHES      AML        11207              20252
             OntoIdea   4712               473430
             RADON      2672               485765
             Silk       4805               1777747
CONTAINS     AML        9191               16966
             OntoIdea   1489               223857
             RADON      2228               6937
             Silk       4160               83958
WITHIN       AML        10186              12308
             OntoIdea   4517               236506
             RADON      2203               5036
             Silk       4037               88758
COVERS       AML        7177               11859
             OntoIdea   1503               313298
             RADON      2180               6772
             Silk       Not participating
COVERED BY   AML        8184               14703
             OntoIdea   1467               304509
             RADON      2132               4721
             Silk       Not participating
INTERSECTS   AML        9269               66681
             OntoIdea   1505               510938
             RADON      2737               339742
             Silk       3582               1718035
CROSSES      AML        8224               19385
             OntoIdea   1509               461693
             RADON      2131               8490
             Silk       3917               203763
OVERLAPS     AML        10223              194838
             OntoIdea   1486               530752
             RADON      2167               60801
             Silk       4217               464382

11.1 Experimental Settings
We used two datasets from the 2015 Process Model Matching Contest. The first dataset (the University Admission dataset) deals with processing the applications of Master students to a university. It consists of nine different process models, each describing the concrete process of a specific German university. We already used this dataset in the 2016 edition of the OAEI. The models are encoded as BPMN process models. We converted the BPMN representation of the process models to a set of assertions (ABox) using the vocabulary defined in the BPMN 2.0 ontology (TBox). The second dataset, known as the Birth Registration dataset, describes the process of registering a newborn child in different countries. The process models were originally available as Petri nets. We also converted them to an ABox in an ontological representation. As a result, the matching tasks are instance matching tasks where each ABox is described by the same TBox. For each pair of processes, manually generated reference alignments are available.
Typical activities within that domain are Sending acceptance, Invite student for interview, or Wait for response. These examples illustrate one of the main differences to the ontology matching task: the labels are usually verb-object phrases that are sometimes extended with additional words. Another important difference is the existence of an execution order (i.e., the model is a complex sequence of activities), which can be understood as the counterpart of a type hierarchy. Only three systems generated non-empty results when run against our datasets. These systems are AML, LogMap, and I-Match. Note that we tried to execute all systems marked as instance matching systems; however, the other systems threw exceptions or produced empty alignments. We collected all generated non-empty alignments. These alignments are the raw results on which the following report is based. In our evaluation, we computed standard precision and recall, as well as their harmonic mean, known as f-measure. The dataset we used consists of several test cases. We aggregated the results and present the micro-average results. The gold standard we used for our first set of evaluation experiments is based on the gold standard that was also used at the Process Model Matching Contest in 2015 [4]. We corrected only some minor mistakes (resulting in changes of less than 0.5 percentage points). In order to compare the results to those obtained by the process model matching community, we also present the recomputed values of the submissions to the 2015 contest. We extend our evaluation (“Standard” in Tables 23 and 24) with an evaluation measure that makes use of a non-binary reference alignment (“Probabilistic” in Tables 23 and 24). This probabilistic measure is based on a gold standard that was manually and independently generated by several domain experts. The numbers of votes of these annotators are used as support values in the probabilistic evaluation.
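To illustrate the two evaluation modes, the sketch below computes micro-averaged precision/recall/f-measure and one plausible support-weighted variant. The exact probabilistic metric is the one defined in [29]; the `support_weighted_prf` formulation here is our own assumption for illustration, treating each gold correspondence's support as the fraction of annotators who voted for it.

```python
def micro_prf(predicted_per_case, gold_per_case):
    """Micro-averaged precision/recall/f-measure over several test cases:
    true positives, predictions, and gold correspondences are summed
    across all cases before the ratios are taken."""
    tp = sum(len(p & g) for p, g in zip(predicted_per_case, gold_per_case))
    n_pred = sum(len(p) for p in predicted_per_case)
    n_gold = sum(len(g) for g in gold_per_case)
    prec = tp / n_pred if n_pred else 0.0
    rec = tp / n_gold if n_gold else 0.0
    fm = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, fm

def support_weighted_prf(predicted, support):
    """A plausible support-weighted variant (illustrative, not the exact
    metric of [29]): `support` maps each gold correspondence to the
    fraction of annotators who voted for it."""
    found = sum(support.get(c, 0.0) for c in predicted)
    prec = found / len(predicted) if predicted else 0.0
    rec = found / sum(support.values()) if support else 0.0
    fm = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, fm

gold = {("a", "x"): 1.0, ("b", "y"): 0.5, ("c", "z"): 0.25}
pred = {("a", "x"), ("b", "y"), ("d", "w")}
print(support_weighted_prf(pred, gold))  # precision = 0.5, recall ≈ 0.857
```

Note how a correspondence with low support (here ("c", "z")) hurts recall less when it is missed, which is the intended softening effect of the probabilistic gold standard.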
For a detailed discussion, please refer to [29]. Furthermore, we evaluate the matching systems via matching patterns. To this end, the matching task as well as the matcher output is automatically categorized into categories of different complexity levels. We classified each alignment exclusively into one of five categories. In this way, the strengths and weaknesses of the matching systems can be analysed. For more details, we refer to [30].
11.2 Results
The following tables show the results of our evaluation. Participants of the Process Model Matching Contest and of the OAEI 2016 edition are depicted in gray font, while this year's OAEI participants are shown in black font. Note that some systems participated with a version whose results are unchanged between the OAEI 2016 and 2017 submissions. We added only one entry for them, with the label OAEI-16/17. This is only the case for the first dataset, which we already used in 2016. Tables 23 and 24 summarize the results of our evaluation. “P” abbreviates precision, “R” is recall, “FM” stands for f-measure, and “Rk” means rank. The prefix “Pro” indicates the probabilistic versions of precision, recall, f-measure, and the associated rank. The OAEI participants are ranked at positions 1, 11, and 12, with an overall number of 17 systems listed in the table (when using the standard metrics). Note that AML-PM at the PMMC 2015 was a matching system based on a predecessor of the AML that participated at the OAEI 2016. The good results of AML are surprising, since we expected that matching systems specifically developed for the purpose of process model matching would outperform ontology matching systems applied to the special case of process model matching.
While AML also contains components that are specifically designed for the process matching task (a flooding-like structural matching algorithm), its relevant main components were developed for ontology matching and the sub-problem of instance matching. AML and LogMap achieve the same results as in 2016. I-Match participates in 2017 for the first time. Compared to the results of the tools specialized in the problem of process model matching, the results of I-Match are still very good: five systems designed specifically for matching process models achieve worse results.

Table 23. Results of the Process Model Matching track for the University Admission dataset

System        Contest     Size    P     R     FM   Rk   ProP  ProR  ProFM  Rk
AML           OAEI-16/17   221  0.719 0.685 0.702   1  0.742 0.283 0.410    2
AML-PM        PMMC-15      579  0.269 0.672 0.385  15  0.377 0.398 0.387    4
BPLangMatch   PMMC-15      277  0.368 0.440 0.401  13  0.532 0.272 0.360    8
DKP           OAEI-16      177  0.621 0.474 0.538   8  0.686 0.219 0.333    9
DKP*          OAEI-16      150  0.680 0.440 0.534   9  0.772 0.211 0.331   10
KnoMa-Proc    PMMC-15      326  0.337 0.474 0.394  14  0.506 0.302 0.378    5
KMatch-SSS    PMMC-15      261  0.513 0.578 0.544   6  0.563 0.274 0.368    7
LogMap        OAEI-16/17   267  0.449 0.517 0.481  11  0.594 0.291 0.390    3
I-Match       OAEI-17      192  0.521 0.431 0.472  12  0.523 0.183 0.271   16
Match-SSS     PMMC-15      140  0.807 0.487 0.608   4  0.761 0.192 0.307   12
OPBOT         PMMC-15      234  0.603 0.608 0.605   5  0.648 0.258 0.369    6
pPalm-DS      PMMC-15      828  0.162 0.578 0.253  17  0.210 0.335 0.258   17
RMM-NHCM      PMMC-15      220  0.691 0.655 0.673   2  0.783 0.297 0.431    1
RMM-NLM       PMMC-15      164  0.768 0.543 0.636   3  0.681 0.197 0.306   13
RMM-SMSL      PMMC-15      262  0.511 0.578 0.543   7  0.516 0.242 0.329   11
RMM-VM2       PMMC-15      505  0.216 0.470 0.296  16  0.309 0.294 0.301   14
TripleS       PMMC-15      230  0.487 0.483 0.485  10  0.486 0.210 0.293   15

The results for the Birth Registration dataset are more interesting, because we use this dataset in 2017 for the first time.
Moreover, the dataset contains a higher proportion of correspondences that are hard to find by comparing the labels on a lexical level. This usually results in a significantly lower F-measure compared to the University Admission dataset. The results show that AML is no longer the best of all matching systems: four systems from the process matching community achieve better results in terms of f-measure. This dataset is dominated by the OPBOT system, while AML is among a group of follow-up systems that still perform significantly better than the rest of the field. The other two systems, LogMap and I-Match, achieve close results that are slightly worse than the average. It is interesting to see that the ranking among the three systems is the same across the two datasets. In the probabilistic evaluation on the University Admission dataset, however, the OAEI participants are ranked at positions 2, 3, and 16, respectively. LogMap rises from position 11 to 3. The (probabilistic) precision improves over-proportionally for this matcher, because LogMap generates many correspondences that are not included in the binary gold standard but are included in the probabilistic one.

Table 24. Results of the Process Model Matching track for the Birth Registration dataset

System        Contest   Size    P     R     FM   Rk   ProP  ProR  ProFM  Rk
AML           OAEI-17    502  0.454 0.391 0.420   5  0.467 0.515 0.490   10
AML-PM        PMMC-15    503  0.423 0.365 0.392   7  0.513 0.505 0.509    7
BPLangMatch   PMMC-15    279  0.645 0.309 0.418   6  0.661 0.417 0.511    5
KnoMa-Proc    PMMC-15    740  0.234 0.297 0.262  15  0.224 0.437 0.296   15
KMatch-SSS    PMMC-15    185  0.800 0.254 0.385   8  0.865 0.379 0.527    4
LogMap        OAEI-17    239  0.615 0.252 0.358  11  0.834 0.411 0.551    3
I-Match       OAEI-17    188  0.734 0.237 0.358  12  0.812 0.366 0.504    8
Match-SSS     PMMC-15    128  0.922 0.202 0.332  13  0.974 0.315 0.476   11
OPBOT         PMMC-15    383  0.713 0.468 0.565   1  0.650 0.517 0.576    1
pPalm-DS      PMMC-15    490  0.502 0.422 0.459   2  0.469 0.521 0.493    9
RMM-NHCM      PMMC-15    267  0.727 0.333 0.456   3  0.781 0.443 0.565    2
RMM-NLM       PMMC-15    128  0.859 0.189 0.309  14  0.912 0.293 0.443   14
RMM-SMSL      PMMC-15    354  0.508 0.309 0.384   9  0.518 0.420 0.464   13
RMM-VM2       PMMC-15    492  0.474 0.400 0.433   4  0.454 0.480 0.466   12
TripleS       PMMC-15    266  0.613 0.280 0.384  10  0.651 0.426 0.515    6

The ranking of LogMap demonstrates that a strength of the probabilistic metric lies in the broadened definition of the gold standard, where weak mappings are included but softened (via the support values). In the probabilistic evaluation for the Birth Registration dataset, the three participating matchers are ranked 3, 8, and 10. LogMap rises from rank 11 to 3 in the probabilistic evaluation. LogMap mainly identifies correspondences with high support (of which many are not included in the binary gold standard). For AML, the opposite effect can be observed: in the Birth Registration dataset, AML does not profit as much from the broadened gold standard in the probabilistic evaluation as the other matching systems do.
The matchers improve their performance compared to the binary evaluation, which indicates that many reasonable alignments are missing from the binary gold standard. For details about the probabilistic metric, please refer to [29]. The results indicate that the progress made in ontology matching also has a positive impact on other related matching problems, as is the case for process model matching. While it might require reconfiguring, adapting, and extending some parts of an ontology matching system, such a system seems to offer a good starting point that can be turned, with a reasonable amount of work, into a good process matching tool. We have to emphasize that only three participants decided to apply their systems to the new process model matching track. Thus, we have to be cautious in generalizing the results observed so far. To allow for an in-depth analysis of the performance of the matching systems, we make use of a new evaluation method which automatically classifies the matching task into matching patterns with different attributes. The matching patterns are assigned automatically to the reference alignment, as well as to the matcher output of the three participating matchers. Then, category-dependent precision, recall, and f-measure are computed for each category separately. For more details, please refer to [30]. Tables 25 and 26 show the results of the matching systems for each of the categories. The second column, the f-measure (FM) over all matching patterns, is given as the micro value, i.e., it is computed over all test cases. The remaining columns provide the category-dependent precision (cP), recall (cR), and f-measure (cFM) for each matcher in each category. cP, cR, and cFM are macro values, independently computed for each category. Moreover, for each category, the table headings contain the fraction of correspondences from the whole dataset as well as the total number of correspondences of that category in the reference alignment.

Table 25. Results assigned to matching patterns of the University Admission dataset

               Cat. trivial     Cat. I            Cat. II           Cat. III          Cat. misc
                                (no word iden.)   (one verb iden.)  (one word iden.)
               [44.3%][103]     [29.3%][68]       [11.6%][27]       [7.3%][17]        [7.3%][17]
Approach  FM   cP   cR   cFM    cP   cR   cFM     cP   cR   cFM     cP   cR   cFM     cP   cR   cFM
AML      .702  .890 .942 .915   .953 .603 .739    .833 .185 .303    .667 .353 .462    .167 .529 .254
I-Match  .472  .907 .942 .924   –    –    –       .400 .074 .125    –    –    –       .500 .059 .105
LogMap   .481  .894 .981 .935   –    –    –       .500 .148 .229    .133 .353 .194    .089 .529 .153

Table 26. Results assigned to matching patterns of the Birth Registration dataset

               Cat. trivial     Cat. I            Cat. II           Cat. III          Cat. misc
                                (no word iden.)   (one verb iden.)  (one word iden.)
               [4.5%][26]       [74.9%][437]      [1.5%][9]         [9.9%][58]        [9.1%][53]
Approach  FM   cP   cR   cFM    cP   cR   cFM     cP   cR   cFM     cP   cR   cFM     cP   cR   cFM
AML      .420  .759 .846 .800   .427 .364 .393    .133 .222 .167    .438 .362 .396    .632 .453 .527
I-Match  .358  .950 .731 .826   .746 .236 .358    .667 .222 .333    .400 .103 .164    .667 .151 .246
LogMap   .358  .339 .731 .463   .726 .261 .384    –    –    –       .357 .086 .139    .818 .170 .281

Cat. I contains alignments which have no word in common (syntactically). It can be observed that for the University Admission dataset it is sufficient to identify mainly trivial correspondences. I-Match and LogMap do not compute any alignments of the most complex category (“Cat. I”). However, AML has a very high performance for “Cat. I”. In the Birth Registration dataset, the fraction of trivial alignments is very low; the most dominant category is “Cat. I”. Therefore, it is not sufficient to focus on the identification of trivial alignments.
In contrast to the University Admission dataset, the matchers compute reasonable alignments from “Cat. I” in the Birth Registration dataset. The low performance of the three matchers for “Cat. trivial” in the Birth Registration dataset indicates mistakes in the binary gold standard.
11.3 Conclusions
In 2016 we organized the Process Model Matching track for the first time. Our evaluation effort was motivated by the idea that ontology matching methods and techniques can also be used in the related field of process model matching. For that reason we converted one (and in 2017 two) of the most prominent process model matching test datasets into an ontological representation. The resulting matching problems are instance matching tasks. While we were aware that an instance matching system would not be able to exploit the sequential aspects of the given process models out of the box, we expected lexical components to generate results that are already at an acceptable level. Even though some of the systems generated very good results, overall only a few of the systems participating at the OAEI were capable of generating any results for our test cases. We still do not fully understand the reasons for this outcome. In order to facilitate the evaluation process for participants who cannot evaluate their matchers with SEALS, we developed a web-based evaluation platform21 to potentially increase the number of participants. This platform was intended to be used by potential participants from the process matching community who are not interested in participating in the OAEI, which is tailored to ontology matching systems. Within this platform, participants are able to select one or multiple gold standards for one of the datasets and subsequently upload their corresponding matcher results.
Afterwards, the participants are able to select from a variety of metrics, including not only different types of precision, recall, and f-measure, but also general statistics on the generated output. Unfortunately, no further matching systems participated via the platform. The participation rate indicates that only a limited number of participants are interested in process model matching. For that reason, we will not offer a third edition of this track in 2018.
12 Statistical analysis
The traditional evaluation carried out in the OAEI tracks consists simply of comparing and ranking systems based on performance scores such as F-measure. In the case of tracks with multiple datasets, performance scores are averaged over all datasets, and the systems are compared accordingly. While performance scores enable us to gauge the performance of matching systems individually, they are insufficient for drawing statistically meaningful comparisons between systems. In the interest of providing a more in-depth comparison of the matching systems that participated in this year's competition, this section presents an analysis based on statistical inference.
12.1 Methods
For one-dataset comparisons, we use McNemar's test. This test takes as input the alignments produced by two matching systems plus the reference alignment, and produces as output an indicator which shows whether either system is better than the other or whether they are approximately the same. This method of comparison does not require a particular performance score to be determined beforehand. Further, the comparison is not solely based on the juxtaposition of two scalars; rather, it is substantiated by statistical evidence (null hypothesis testing).
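A minimal sketch of the exact (binomial) form of McNemar's test on the discordant counts, assuming b and c count the reference correspondences found by only one of the two compared systems; the actual methodology and its variants are detailed in [34].

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on the discordant pairs.

    b: reference correspondences found only by system A
    c: reference correspondences found only by system B
    Returns the p-value of H0: both systems perform equally well."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial tail under H0 (each discordant case equally likely either way)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# System A finds 8 reference correspondences that B misses, while B finds
# only 2 that A misses: suggestive, but not significant at the 5% level.
p = mcnemar_exact(2, 8)
print(round(p, 6))  # 0.109375
```

When b + c is large, the chi-square approximation with continuity correction is typically used instead of the exact binomial tail.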
Two variants of McNemar's test were considered: one where false correspondences were ignored, so that the comparison was predicated only on the correct correspondences found by the matching systems; and another where both correct and false correspondences were considered, meaning that systems were compared based on the full alignments they generated. A directed graph can be used to visualize the outcome of the test. Interested readers are referred to [34] for more details about this methodology. For comparisons over multiple datasets, we used the Friedman test with the corresponding post-hoc procedure. This test requires the specification of one performance score. The outcome of the test can be visualized by critical difference (CD) diagrams. Since the comparisons between matching systems are done pairwise, it is necessary to correct the statistics for multiple testing. We used the Bergmann correction method to control the family-wise error rate in all tests.
21 http://alkmaar.informatik.uni-mannheim.de/pmmc
12.2 Results
Anatomy track In this year's competition, 11 systems participated in the anatomy track. However, the alignments of the LogMap family could not be parsed by the Alignment API, so we had to leave them out of the comparative analysis for this track. Figure 10 shows the directed graph with the outcome of McNemar's test over the participating systems when the false correspondences are not taken into account. Figure 11 shows the corresponding result when all correspondences are considered. The nodes in these graphs are the systems, and a directed edge A → B indicates the superiority of A over B. If there is no such edge between two systems, then they are claimed to be more or less equivalent. According to these figures, AML is the best system and WikiV3 and ALIN are the bottom ones, from both perspectives. There are two differences between the two approaches to conducting the test.
SANOM outperforms KEPLER when the false correspondences are not considered, while KEPLER is better than SANOM when wrong correspondences are taken into account. This means that SANOM discovers more correct correspondences than KEPLER, but also more false correspondences. A similar pattern holds for the comparison of POMAP and YAM-BIO. Interestingly, no systems are declared to be equivalent, so the outcome of McNemar's test is similar to a ranking scheme.
Fig. 10. Comparison of the alignment systems that participated in OAEI 2017 on the anatomy track when the false correspondences are not considered.
Conference track This track consists of 21 small matching tasks between 7 different ontologies. Three different types of matching are considered: (i) M1: only matching the classes; (ii) M2: only matching the properties; (iii) M3: matching both classes and properties. The reference alignment also has three different variants. Hence, there are nine different modes of evaluating systems, based on the type of matching and the type of reference alignment.
Fig. 11. Comparison of the alignment systems that participated in OAEI 2017 on the anatomy track when the false correspondences are taken into account.
The Friedman test was applied considering the F-measure of the systems on each of the 21 tasks for each of the evaluation modes. Figure 12 shows the CD diagram of the systems that participated in this track. In this figure, the x-axis is the average rank obtained by the Friedman test, and systems with the same performance are connected to each other by red lines. The lower the average rank in the CD plot, the better the performance of the system. The CD diagram for this track provides little information and insight about the differences between systems, likely due to the small sample size for the comparison (systems produce only between 90 and 240 correspondences in total in this track).
What is readily seen from this plot is the superiority of AML, LogMap, and XMap, and the poor performance of ALIN, SANOM, and POMap.
Fig. 12. Comparison of the alignment systems that participated in OAEI 2017 on the Conference track. The x-axis is the average rank of each system obtained by the Friedman test. Systems which are not significantly different from each other are connected by red lines.
LargeBio track This track consists of six matching tasks of large size. The Friedman test was applied to the F-measure obtained by each system on each alignment task. Figure 13 shows the corresponding CD diagram for this track. According to this plot, the group containing AML, XMap, YAM-BIO, LogMap, and LogMapBio comprises the best systems, while POMAP, SANOM, and KEPLER are the systems with lackluster performance in this track.
Fig. 13. Comparison of the alignment systems that participated in OAEI 2017 on the LargeBio track. The x-axis is the average rank of each system obtained by the Friedman test. Systems which are not significantly different from each other are connected by red lines.
Multifarm track This track involves 55 matching tasks with ontologies in different languages. The Friedman test was applied to the F-measure obtained by each system on each task. The CD diagram depicting the outcome of the test is shown in Figure 14. According to this graph, AML is by itself the best alignment system in this track. LogMap, CroLOM, and KEPLER perform comparably to each other and better than the remaining systems. At the other extreme, LogMapLite, XMap, and SANOM show poor performance in this track, while WikiV3 ranks in between the two trios.
Fig. 14. Comparison of the alignment systems that participated in OAEI 2017 on the MultiFarm track. The x-axis is the average rank of each system obtained by the Friedman test. Systems which are not significantly different from each other are connected by red lines.
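The average ranks shown on the x-axis of these CD diagrams come from the Friedman procedure, which can be sketched as follows (assuming no tied scores within a dataset, and omitting the Bergmann-corrected post-hoc pairwise comparisons):

```python
def friedman_statistic(scores):
    """Friedman chi-square over a systems x datasets score table.

    scores[i][j] is the F-measure of system i on dataset j; higher is
    better.  Assumes no ties within a dataset for simplicity."""
    k = len(scores)          # number of systems
    n = len(scores[0])       # number of datasets
    rank_sums = [0] * k
    for j in range(n):
        # Rank systems on dataset j: rank 1 for the best score
        order = sorted(range(k), key=lambda i: -scores[i][j])
        for rank, i in enumerate(order, start=1):
            rank_sums[i] += rank
    avg_ranks = [r / n for r in rank_sums]
    chi2 = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    return chi2, avg_ranks

# Three systems over four datasets; system A dominates throughout.
scores = [
    [0.9, 0.8, 0.85, 0.7],   # system A
    [0.6, 0.7, 0.60, 0.6],   # system B
    [0.5, 0.4, 0.50, 0.5],   # system C
]
chi2, ranks = friedman_statistic(scores)
print(chi2, ranks)  # 8.0 [1.0, 2.0, 3.0]
```

The chi-square value decides whether any difference exists at all; only then are the average ranks compared pairwise (with multiple-testing correction) to draw the significance bars of the CD diagram.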
13 Lessons learned and suggestions
The lessons learned from running OAEI 2017 were the following:
A) Like last year, we requested tool registration in June and preliminary submission of wrapped systems by the end of July, but were stricter in enforcing these deadlines. As a result, we recorded the smallest number of errors and incompatibilities with the SEALS client during the evaluation phase among recent OAEI editions.
B) As has been the trend, some system developers struggled to get their systems working with the SEALS client, mostly due to incompatible versions of libraries. While participation in the new HOBBIT track was relatively low due to the novelty of the HOBBIT platform and the short deadline for systems to adapt to it, the solution of using Docker containers to wrap systems seems promising, and we are already looking into phasing out the SEALS client in favour of the HOBBIT platform.
C) While the number of participants this year was similar to that of recent years, their distribution across the tracks was uneven. The expressive ontologies tracks had no shortage of participants, and a fair number still participated in the more specialized multifarm track. However, participation in the interactive matching track and in the three instance matching tracks (process model, instance, and hobbit) was underwhelming. The latter is puzzling considering the prize sponsored by IBM Research for the system with the best performance across the instance matching tracks. Granted, the division of instance matching tracks between the SEALS client and the HOBBIT platform did not help their cause, as of the 7 systems that participated in instance matching tasks, only 2 made both a SEALS and a HOBBIT submission. Nevertheless, the division between “traditional” ontology matching and instance matching is readily apparent, as only 2 systems participated in both track families.
D) In previous years we identified the need to consider non-binary forms of evaluation, namely in cases where there is uncertainty about some of the reference mappings. A first non-binary evaluation type was implemented in the Conference track in 2015, followed by Disease and Phenotype, and Process Model in 2016. This year, we introduced statistical tests to compare matching systems, an analysis that was carried out on the results of 4 tracks. This approach provides more insights into the comparative performance of systems as well as more statistical rigour, and thus we hope that it can be expanded and fully integrated into the OAEI tracks in future editions.
The lessons learned in the various OAEI 2017 tracks were the following:
conference: Since there has been no improvement in matcher performance this year from the perspective of the performed evaluation modalities, we will consider adding or replacing evaluation modalities in future editions of the OAEI to help disclose further matcher characteristics.
largebio: While the current reference alignments, with incoherence-causing mappings flagged as uncertain, make the evaluation fair to all systems, they are only a compromise solution, not an ideal one. Thus, we should aim for manually repairing and validating the reference alignments for future editions.
phenotype: This track attracted a similar level of participation this year compared to last, despite having no cash prize, which demonstrates its intrinsic value and interest among the community of ontology matching algorithm developers.
interactive: This track's participation has remained low, as most systems participating in the OAEI opt to focus exclusively on fully automatic matching. We hope to draw more participants to this track in the future and will continue to expand it so as to better approximate real user interactions.
process model: The results of the Process Model track have shown that the participating ontology matching systems are capable of generating good results for the specific problem of process model matching, even though few were able to exploit the sequential aspects of the process models. Even though we offered an alternative evaluation process for participants who cannot evaluate their matchers with SEALS, this alternative failed to attract further participants. The low participation rate in this track indicates that only a limited number of participants are interested in process model matching. For that reason, we will not offer a third edition of this track in 2018.
instance: In order to attract more instance matching systems to the value semantics (val-sem), value structure (val-struct), and value structure semantics (val-struct-sem) tasks, we need to produce benchmarks that have fewer instances (on the order of 10000) of the same type (in our benchmark we asked systems to compare instances of different types). To balance those aspects, we must then produce benchmarks that contain more complex transformations.
14 Conclusions
The OAEI 2017 saw the same number of participants as in recent years, with a healthy mix of new and returning systems. While last year we posited that new participants were drawn by the allure of prize money in the new Disease and Phenotype track, the evidence this year seems to contradict this. On the one hand, participation in Disease and Phenotype remained high this year despite no prize money. On the other hand, the prize money on offer for performance in instance matching did not attract many participants to those tracks. Nevertheless, the fact that there continues to be corporate interest in ontology matching, to the point of offering prize money, bodes well for the future of the OAEI.
Like last year, judging from the repeated tracks, there has been no substantial progress in the state of the art in ontology matching overall this year:
– There was no noticeable improvement with regard to system run times.
– There were few improvements with regard to F-measure, with the top results in most tracks remaining the same.
– There was no significant progress with regard to the ability of matching systems to handle large ontologies and datasets, either in traditional ontology matching or in instance matching.
– There was no progress with regard to alignment repair systems, with only a few returning systems employing them.
This may be due to a plateau having been reached by matching systems in some tracks, where investing in improving results further would bring diminishing returns. However, it is also the case that long-term participants tend to focus more on the new datasets and tracks on offer than on improving in repeated tracks. Given the variety of tracks on offer, it is difficult for system developers to aim at improving across all tracks each year. Most of the participants have provided a description of their systems and their experience in the evaluation. These OAEI papers, like the present one, have not been peer reviewed. However, they are full contributions to this evaluation exercise and reflect the hard work and clever insight people put into the development of participating systems. Reading the papers of the participants should help people involved in ontology matching find out what makes these algorithms work and what could be improved. The Ontology Alignment Evaluation Initiative will strive to remain a reference for the ontology matching community by improving both the test cases and the testing methodology to better reflect actual needs, as well as by promoting progress in this field [43]. More information can be found at: http://oaei.ontologymatching.org.
Acknowledgements
We warmly thank the participants of this campaign.
We know that they have worked hard to have their matching tools executable in time, and they provided useful reports on their experience. The best way to learn about the results remains to read the papers that follow. We would also like to thank IBM Research for sponsoring the instance matching tracks by offering prize money for the best performing systems. We are grateful to the Universidad Politécnica de Madrid (UPM), especially to Nandana Mihindukulasooriya and Asunción Gómez Pérez, for moving, setting up, and providing the necessary infrastructure to run the SEALS repositories. We are also grateful to Martin Ringwald and Terry Hayamizu for providing the reference alignment for the anatomy ontologies, and we thank Elena Beisswanger for her thorough support on improving the quality of the data set. We thank Khiat Abderrahmane for his support with the Arabic data set and Catherine Comparot for her feedback and support with the MultiFarm test case. We also thank for their support the other members of the Ontology Alignment Evaluation Initiative steering committee: Yannis Kalfoglou (Ricoh laboratories, UK), Miklos Nagy (The Open University, UK), Natasha Noy (Google Inc., USA), Yuzhong Qu (Southeast University, CN), York Sure (Leibniz Gemeinschaft, DE), Jie Tang (Tsinghua University, CN), Heiner Stuckenschmidt (Mannheim Universität, DE), and George Vouros (University of the Aegean, GR). Michelle Cheatham has been supported by the National Science Foundation award ICER-1440202 “EarthCube Building Blocks: Collaborative Proposal: GeoLink”. Jérôme Euzenat, Ernesto Jimenez-Ruiz, Christian Meilicke, Heiner Stuckenschmidt, and Cássia Trojahn dos Santos have been partially supported by the SEALS (IST-2009-238975) European project in previous years. Daniel Faria was supported by the ELIXIR-EXCELERATE project (INFRADEV-3-2015).
Ernesto Jiménez-Ruiz has also been partially supported by the BIGMED project (IKT 259055), the HealthInsight project (IKT 247784), and the SIRIUS Centre for Scalable Data Access (Research Council of Norway, project no. 237889). Catia Pesquita was supported by the FCT through the LASIGE Strategic Project (UID/CEC/00408/2013) and the research grant PTDC/EEI-ESS/4633/2014.

References

1. Manel Achichi, Rodolphe Bailly, Cécile Cecconi, Marie Destandau, Konstantin Todorov, and Raphaël Troncy. Doremus: Doing reusable musical data. In ISWC PD: International Semantic Web Conference Posters and Demos, 2015.
2. Manel Achichi, Michelle Cheatham, Zlatan Dragisic, Jérôme Euzenat, Daniel Faria, Alfio Ferrara, Giorgos Flouris, Irini Fundulaki, Ian Harrow, Valentina Ivanova, Ernesto Jiménez-Ruiz, Elena Kuss, Patrick Lambrix, Henrik Leopold, Huanyu Li, Christian Meilicke, Stefano Montanelli, Catia Pesquita, Tzanina Saveta, Pavel Shvaiko, Andrea Splendiani, Heiner Stuckenschmidt, Konstantin Todorov, Cássia Trojahn, and Ondrej Zamazal. Results of the ontology alignment evaluation initiative 2016. In Proc. 11th ISWC ontology matching workshop (OM), Kobe (JP), pages 73–129, 2016.
3. José Luis Aguirre, Bernardo Cuenca Grau, Kai Eckert, Jérôme Euzenat, Alfio Ferrara, Willem Robert van Hage, Laura Hollink, Ernesto Jiménez-Ruiz, Christian Meilicke, Andriy Nikolov, Dominique Ritze, François Scharffe, Pavel Shvaiko, Ondrej Sváb-Zamazal, Cássia Trojahn, and Benjamin Zapilko. Results of the ontology alignment evaluation initiative 2012. In Proc. 7th ISWC ontology matching workshop (OM), Boston (MA, US), pages 73–115, 2012.
4. Gonçalo Antunes, Marzieh Bakhshandeh, José Borbinha, João Cardoso, Sharam Dadashnia, Chiara Di Francescomarino, Mauro Dragoni, Peter Fettke, Avigdor Gal, Chiara Ghidini, Philip Hake, Abderrahmane Khiat, Christopher Klinkmüller, Elena Kuss, Henrik Leopold, Peter Loos, Christian Meilicke, Tim Niesen, Catia Pesquita, Timo Péus, Andreas Schoknecht, Eitam Sheetrit, Andreas Sonntag, Heiner Stuckenschmidt, Tom Thaler, Ingo Weber, and Matthias Weidlich. The process model matching contest 2015. In 6th EMISA Workshop, pages 127–155, 2015.
5. Benjamin Ashpole, Marc Ehrig, Jérôme Euzenat, and Heiner Stuckenschmidt, editors. Proc. K-Cap Workshop on Integrating Ontologies, Banff (Canada), 2005.
6. Olivier Bodenreider. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32:267–270, 2004.
7. Caterina Caracciolo, Jérôme Euzenat, Laura Hollink, Ryutaro Ichise, Antoine Isaac, Véronique Malaisé, Christian Meilicke, Juan Pane, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, and Vojtech Svátek. Results of the ontology alignment evaluation initiative 2008. In Proc. 3rd ISWC ontology matching workshop (OM), Karlsruhe (DE), pages 73–120, 2008.
8. Michelle Cheatham, Zlatan Dragisic, Jérôme Euzenat, Daniel Faria, Alfio Ferrara, Giorgos Flouris, Irini Fundulaki, Roger Granada, Valentina Ivanova, Ernesto Jiménez-Ruiz, et al. Results of the ontology alignment evaluation initiative 2015. In Proc. 10th ISWC ontology matching workshop (OM), Bethlehem (PA, US), pages 60–115, 2015.
9. Bernardo Cuenca Grau, Zlatan Dragisic, Kai Eckert, Jérôme Euzenat, Alfio Ferrara, Roger Granada, Valentina Ivanova, Ernesto Jiménez-Ruiz, Andreas Oskar Kempf, Patrick Lambrix, Andriy Nikolov, Heiko Paulheim, Dominique Ritze, François Scharffe, Pavel Shvaiko, Cássia Trojahn dos Santos, and Ondrej Zamazal. Results of the ontology alignment evaluation initiative 2013. In Pavel Shvaiko, Jérôme Euzenat, Kavitha Srinivas, Ming Mao, and Ernesto Jiménez-Ruiz, editors, Proc. 8th ISWC ontology matching workshop (OM), Sydney (NSW, AU), pages 61–100, 2013.
10. Jim Dabrowski and Ethan V. Munson. 40 years of searching for the best computer system response time. Interacting with Computers, 23(5):555–564, 2011.
11. Jérôme David, Jérôme Euzenat, François Scharffe, and Cássia Trojahn dos Santos. The alignment API 4.0. Semantic Web Journal, 2(1):3–10, 2011.
12. Cássia Trojahn dos Santos, Bo Fu, Ondrej Zamazal, and Dominique Ritze. State-of-the-art in multilingual and cross-lingual ontology matching. In Towards the Multilingual Semantic Web, Principles, Methods and Applications, pages 119–135. 2014.
13. Zlatan Dragisic, Kai Eckert, Jérôme Euzenat, Daniel Faria, Alfio Ferrara, Roger Granada, Valentina Ivanova, Ernesto Jiménez-Ruiz, Andreas Oskar Kempf, Patrick Lambrix, Stefano Montanelli, Heiko Paulheim, Dominique Ritze, Pavel Shvaiko, Alessandro Solimando, Cássia Trojahn dos Santos, Ondrej Zamazal, and Bernardo Cuenca Grau. Results of the ontology alignment evaluation initiative 2014. In Proc. 9th ISWC ontology matching workshop (OM), Riva del Garda (IT), pages 61–104, 2014.
14. Zlatan Dragisic, Valentina Ivanova, Patrick Lambrix, Daniel Faria, Ernesto Jiménez-Ruiz, and Catia Pesquita. User validation in ontology alignment. In The Semantic Web – ISWC 2016 – 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part I, pages 200–217, 2016.
15. Zlatan Dragisic, Valentina Ivanova, Huanyu Li, and Patrick Lambrix. Experiences from the anatomy track in the ontology alignment evaluation initiative. Journal of Biomedical Semantics, 2017.
16. Jérôme Euzenat, Alfio Ferrara, Laura Hollink, Antoine Isaac, Cliff Joslyn, Véronique Malaisé, Christian Meilicke, Andriy Nikolov, Juan Pane, Marta Sabou, François Scharffe, Pavel Shvaiko, Vassilis Spiliopoulos, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn dos Santos, George Vouros, and Shenghui Wang. Results of the ontology alignment evaluation initiative 2009. In Proc. 4th ISWC ontology matching workshop (OM), Chantilly (VA, US), pages 73–126, 2009.
17. Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, and Cássia Trojahn dos Santos. Results of the ontology alignment evaluation initiative 2010. In Proc. 5th ISWC ontology matching workshop (OM), Shanghai (CN), pages 85–117, 2010.
18. Jérôme Euzenat, Alfio Ferrara, Willem Robert van Hage, Laura Hollink, Christian Meilicke, Andriy Nikolov, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, and Cássia Trojahn dos Santos. Results of the ontology alignment evaluation initiative 2011. In Proc. 6th ISWC ontology matching workshop (OM), Bonn (DE), pages 85–110, 2011.
19. Jérôme Euzenat, Antoine Isaac, Christian Meilicke, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Svab, Vojtech Svatek, Willem Robert van Hage, and Mikalai Yatskevich. Results of the ontology alignment evaluation initiative 2007. In Proc. 2nd ISWC ontology matching workshop (OM), Busan (KR), pages 96–132, 2007.
20. Jérôme Euzenat, Christian Meilicke, Pavel Shvaiko, Heiner Stuckenschmidt, and Cássia Trojahn dos Santos. Ontology alignment evaluation initiative: six years of experience. Journal on Data Semantics, XV:158–192, 2011.
21. Jérôme Euzenat, Malgorzata Mochol, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Svab, Vojtech Svatek, Willem Robert van Hage, and Mikalai Yatskevich. Results of the ontology alignment evaluation initiative 2006. In Proc. 1st ISWC ontology matching workshop (OM), Athens (GA, US), pages 73–95, 2006.
22. Jérôme Euzenat and Pavel Shvaiko. Ontology matching. Springer-Verlag, Heidelberg (DE), 2nd edition, 2013.
23. Daniel Faria, Ernesto Jiménez-Ruiz, Catia Pesquita, Emanuel Santos, and Francisco M. Couto. Towards Annotating Potential Incoherences in BioPortal Mappings. In 13th International Semantic Web Conference, volume 8797 of Lecture Notes in Computer Science, pages 17–32. Springer, 2014.
24. Ian Harrow, Ernesto Jiménez-Ruiz, Andrea Splendiani, Martin Romacker, Peter Woollard, Scott Markel, Yasmin Alam-Faruque, Martin Koch, James Malone, and Arild Waaler. Matching Disease and Phenotype Ontologies in the Ontology Alignment Evaluation Initiative. Journal of Biomedical Semantics, 2018.
25. Ernesto Jiménez-Ruiz and Bernardo Cuenca Grau. LogMap: Logic-based and scalable ontology matching. In Proc. 10th International Semantic Web Conference (ISWC), Bonn (DE), pages 273–288, 2011.
26. Ernesto Jiménez-Ruiz, Bernardo Cuenca Grau, Ian Horrocks, and Rafael Berlanga. Logic-based assessment of the compatibility of UMLS ontology sources. Journal of Biomedical Semantics, 2, 2011.
27. Ernesto Jiménez-Ruiz, Christian Meilicke, Bernardo Cuenca Grau, and Ian Horrocks. Evaluating mapping repair systems with large biomedical ontologies. In Proc. 26th Description Logics Workshop, 2013.
28. Yevgeny Kazakov, Markus Krötzsch, and Frantisek Simancik. Concurrent classification of EL ontologies. In Proc. 10th International Semantic Web Conference (ISWC), Bonn (DE), pages 305–320, 2011.
29. Elena Kuss, Henrik Leopold, Han van der Aa, Heiner Stuckenschmidt, and Hajo A. Reijers. Probabilistic evaluation of process model matching techniques. In Conceptual Modeling: 35th International Conference, ER 2016, Gifu, Japan, November 14-17, 2016, Proceedings, pages 279–292. Springer, 2016.
30. Elena Kuss and Heiner Stuckenschmidt. Automatic classification to matching patterns for process model matching evaluation. In Proceedings of the ER Forum 2017 and the ER 2017 Demo Track, co-located with the 36th International Conference on Conceptual Modelling (ER 2017), Valencia, Spain, November 6-9, 2017, pages 292–305, 2017.
31. Pasquale Lisena, Manel Achichi, Eva Fernández, Konstantin Todorov, and Raphaël Troncy. Exploring linked classical music catalogs with overture. In ISWC PD: International Semantic Web Conference Posters and Demos, 2016.
32. Christian Meilicke. Alignment Incoherence in Ontology Matching. PhD thesis, University of Mannheim, 2011.
33. Christian Meilicke, Raúl García Castro, Frederico Freitas, Willem Robert van Hage, Elena Montiel-Ponsoda, Ryan Ribeiro de Azevedo, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Andrei Tamilin, Cássia Trojahn, and Shenghui Wang. MultiFarm: A benchmark for multilingual ontology matching. Journal of Web Semantics, 15(3):62–68, 2012.
34. Majid Mohammadi, Amir Ahooye Atashin, Wout Hofman, and Yaohua Tan. Comparison of ontology alignment algorithms across single matching task via the McNemar test. arXiv preprint arXiv:1704.00045.
35. Boris Motik, Rob Shearer, and Ian Horrocks. Hypertableau reasoning for description logics. Journal of Artificial Intelligence Research, 36:165–228, 2009.
36. Heiko Paulheim, Sven Hertling, and Dominique Ritze. Towards evaluating interactive ontology matching tools. In Proc. 10th Extended Semantic Web Conference (ESWC), Montpellier (FR), pages 31–45, 2013.
37. Catia Pesquita, Daniel Faria, Emanuel Santos, and Francisco Couto. To repair or not to repair: reconciling correctness and coherence in ontology reference alignments. In Proc. 8th ISWC ontology matching workshop (OM), Sydney (AU), 2013.
38. Manuel Salvadores, Paul R. Alexander, Mark A. Musen, and Natalya Fridman Noy. BioPortal as a dataset of linked biomedical ontologies and terminologies in RDF. Semantic Web, 4(3):277–284, 2013.
39. Emanuel Santos, Daniel Faria, Catia Pesquita, and Francisco Couto. Ontology alignment repair through modularization and confidence-based heuristics. CoRR, abs/1307.5322, 2013.
40. Tzanina Saveta, Evangelia Daskalaki, Giorgos Flouris, Irini Fundulaki, Melanie Herschel, and Axel-Cyrille Ngonga Ngomo. Pushing the limits of instance matching systems: A semantics-aware benchmark for linked data. In WWW, Companion Volume, 2015.
41. Tzanina Saveta, Evangelia Daskalaki, Giorgos Flouris, Irini Fundulaki, Melanie Herschel, and Axel-Cyrille Ngonga Ngomo. Lance: Piercing to the heart of instance matching tools. In International Semantic Web Conference, pages 375–391. Springer, 2015.
42. Mohamed Ahmed Sherif, Kevin Dreßler, Panayiotis Smeros, and Axel-Cyrille Ngonga Ngomo. RADON – Rapid Discovery of Topological Relations. In Proceedings of The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017.
43. Pavel Shvaiko and Jérôme Euzenat. Ontology matching: state of the art and future challenges. IEEE Transactions on Knowledge and Data Engineering, 25(1):158–176, 2013.
44. Alessandro Solimando, Ernesto Jiménez-Ruiz, and Giovanna Guerrini. Detecting and correcting conservativity principle violations in ontology-to-ontology mappings. In The Semantic Web – ISWC 2014, pages 1–16. Springer, 2014.
45. Alessandro Solimando, Ernesto Jiménez-Ruiz, and Giovanna Guerrini. Minimizing conservativity violations in ontology alignments: Algorithms and evaluation. Knowledge and Information Systems, 2016.
46. York Sure, Oscar Corcho, Jérôme Euzenat, and Todd Hughes, editors. Proc. ISWC Workshop on Evaluation of Ontology-based Tools (EON), Hiroshima (JP), 2004.

Montpellier, Dayton, Linköping, Grenoble, Lisboa, Milano, Heraklion, Kent, Oslo, Mannheim, Amsterdam, Delft, Trento, Toulouse, Prague
December 2017