                       Results of the
        Ontology Alignment Evaluation Initiative 2012*

José Luis Aguirre1 , Kai Eckert4 , Jérôme Euzenat1 , Alfio Ferrara2 , Willem Robert van
  Hage3 , Laura Hollink3 , Christian Meilicke4 , Andriy Nikolov5 , Dominique Ritze4 ,
François Scharffe6 , Pavel Shvaiko7 , Ondřej Šváb-Zamazal8 , Cássia Trojahn9 , Ernesto
          Jimenez-Ruiz11 , Bernardo Cuenca Grau11 , and Benjamin Zapilko10
1 INRIA & LIG, Montbonnot, France
  {Jose-Luis.Aguirre,Jerome.Euzenat}@inria.fr
2 Università degli studi di Milano, Italy
  alfio.ferrara@unimi.it
3 Vrije Universiteit Amsterdam, The Netherlands
  {W.R.van.Hage,L.Hollink}@vu.nl
4 University of Mannheim, Mannheim, Germany
  {christian,dominique,kai}@informatik.uni-mannheim.de
5 The Open University, Milton Keynes, United Kingdom
  A.Nikolov@open.ac.uk
6 LIRMM, Montpellier, France
  francois.scharffe@lirmm.fr
7 TasLab, Informatica Trentina, Trento, Italy
  pavel.shvaiko@infotn.it
8 University of Economics, Prague, Czech Republic
  ondrej.zamazal@vse.cz
9 IRIT – Université Toulouse II, Toulouse, France
  cassia.trojahn@irit.fr
10 GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany
  benjamin.zapilko@gesis.org
11 University of Oxford, UK
  {ernesto,berg}@cs.ox.ac.uk




         Abstract. Ontology matching consists of finding correspondences between se-
         mantically related entities of two ontologies. OAEI campaigns aim at compar-
         ing ontology matching systems on precisely defined test cases. These test cases
         can use ontologies of different natures (from simple thesauri to expressive OWL
         ontologies) and different evaluation modalities, e.g., blind evaluation, open evaluation
         or consensus. OAEI 2012 offered 7 tracks with 9 test cases, followed by 21 participants.
         Since 2010, the campaign has been using a new evaluation modality which
         provides more automation to the evaluation. This paper is an overall presentation
         of the OAEI 2012 campaign.

*
    This paper improves on the “Preliminary results” initially published in the on-site proceedings
    of the ISWC workshop on Ontology Matching (OM-2012). The only official results of the
    campaign, however, are on the OAEI web site.
1    Introduction

The Ontology Alignment Evaluation Initiative1 (OAEI) is a coordinated international
initiative, which organizes the evaluation of the increasing number of ontology match-
ing systems [12; 10; 26]. The main goal of OAEI is to compare systems and algorithms
on the same basis and to allow anyone to draw conclusions about the best match-
ing strategies. Our ambition is that, from such evaluations, tool developers can improve
their systems.
     The first two events were organized in 2004: (i) the Information Interpretation and In-
tegration Conference (I3CON) held at the NIST Performance Metrics for Intelligent
Systems (PerMIS) workshop and (ii) the Ontology Alignment Contest held at the Eval-
uation of Ontology-based Tools (EON) workshop of the annual International Semantic
Web Conference (ISWC) [27]. Then, a unique OAEI campaign occurred in 2005 at the
workshop on Integrating Ontologies held in conjunction with the International Confer-
ence on Knowledge Capture (K-Cap) [1]. From 2006 through 2011, the OAEI
campaigns were held at the Ontology Matching workshops collocated with ISWC [11;
9; 3; 6; 7; 8]. In 2012, the OAEI results will be presented again at the Ontology Match-
ing workshop2 collocated with ISWC, in Boston, USA.
     Since last year, we have been promoting an environment for automatically process-
ing evaluations (§2.2), which has been developed within the SEALS (Semantic Evalu-
ation At Large Scale) project3 . SEALS provided a software infrastructure for automatically
executing evaluations, as well as evaluation campaigns for typical semantic web tools,
including ontology matching. An intermediate campaign was executed in March 2012
in coordination with the Second SEALS evaluation campaigns. This campaign, called
OAEI 2011.5, had five tracks and 18 participants, and it only ran on the SEALS plat-
form. The results of OAEI 2011.5 have been independently published on the OAEI web
site and are now integrated in this paper with those of OAEI 2012. For OAEI 2012,
almost all of the OAEI data sets were evaluated under the SEALS modality, providing
a more uniform evaluation setting.
     This paper synthesizes the 2012 evaluation campaign and introduces the results pro-
vided in the papers of the participants. The remainder of the paper is organized as fol-
lows. In Section 2, we present the overall evaluation methodology that has been used.
Sections 3-9 discuss the settings and the results of each of the test cases. Section 10
overviews lessons learned from the campaign. Finally, Section 11 concludes the paper.


2    General methodology

We first present the test cases proposed this year to the OAEI participants (§2.1). Then,
we discuss the resources used by participants to test their systems and the execution
environment used for running the tools (§2.2). Next, we describe the steps of the OAEI
campaign (§2.3-2.5) and report on the general execution of the campaign (§2.6).
 1
   http://oaei.ontologymatching.org
 2
   http://om2012.ontologymatching.org
 3
   http://www.seals-project.eu
2.1    Tracks and test cases

This year’s campaign consisted of 7 tracks gathering 9 data sets and different evaluation
modalities:

The benchmark track (§3): Like in previous campaigns, a systematic benchmark se-
    ries has been proposed. The goal of this benchmark series is to identify the areas in
    which each matching algorithm is strong or weak by systematically altering an on-
    tology. This year, like in OAEI 2011, we used new systematically generated bench-
    marks, based on four ontologies other than the original bibliographic one. For three
    of these ontologies, the evaluation was performed in blind mode.
The expressive ontologies track offers real world ontologies using OWL modelling
    capabilities:
    Anatomy (§4): The anatomy real world case is about matching the Adult Mouse
         Anatomy (2744 classes) and a part of the NCI Thesaurus (3304 classes) de-
         scribing the human anatomy.
    Conference (§5): The goal of the conference task is to find all correct correspon-
         dences within a collection of ontologies describing the domain of organizing
         conferences (a domain well understood by every researcher). Re-
         sults were evaluated automatically against reference alignments and by using
         logical reasoning techniques.
    Large biomedical ontologies (§8): This track aims at finding alignments between
         large and semantically rich biomedical ontologies such as FMA, SNOMED
         CT, and NCI. The UMLS Metathesaurus has been selected as the basis for the
         track’s reference alignments.
Multilingual
    MultiFarm (§6): This dataset is composed of a subset of the Conference dataset,
         translated into eight different languages (Chinese, Czech, Dutch, French, Ger-
         man, Portuguese, Russian, and Spanish) and the corresponding alignments be-
         tween these ontologies.
Directories and thesauri
    Library (§7): The library track is a real-world task to match two thesauri. The goal
         of this track is to evaluate whether matchers can handle such lightweight ontologies,
         which include a large number of concepts and additional descriptions. Results
         are evaluated both against a reference alignment and through manual scrutiny.
Instance matching (§9): The goal of the instance matching track is to evaluate the per-
    formance of different tools on the task of matching RDF individuals which originate
    from different sources but describe the same real-world entity. Instance matching
    is organized in two sub-tasks:
      Sandbox: The Sandbox is a simple dataset that has been specifically conceived to
         provide examples of some specific matching problems (like name spelling and
         other controlled variations). This is intended to serve as a test for those tools
         that are in an initial phase of their development process and/or for tools that are
         facing very focused tasks, such as person name matching.
      IIMB: IIMB is an OWL-based dataset that is automatically generated by intro-
         ducing a set of controlled transformations in an initial OWL Abox, in order:
         i) to provide an evaluation dataset for various kinds of data transformations,
         including value transformations, structural transformations, and logical trans-
         formations; ii) to cover a wide spectrum of possible techniques and tools.
      Table 1 summarizes the variation in the tests under consideration.
       test        formalism   relations   confidence   modalities    language                              SEALS
       benchmark   OWL         =           [0 1]        blind+open    EN                                      √
       anatomy     OWL         =           [0 1]        open          EN                                      √
       conference  OWL-DL      =, <=       [0 1]        blind+open    EN                                      √
       multifarm   OWL         =           [0 1]        open          CZ, CN, DE, EN, ES, FR, NL, PT, RU      √
       library     OWL         =           [0 1]        open          EN, DE                                  √
       large bio   OWL         =           [0 1]        open          EN                                      √
       sandbox     RDF         =           [0 1]        open          EN
       iimb        RDF         =           [0 1]        open          EN

Table 1. Characteristics of the test cases (open evaluation is made with already published refer-
ence alignments and blind evaluation is made by organizers from reference alignments unknown
to the participants).


    We do not present the New York Times (NYT) sub-track, held in the instance match-
ing track, since it had only one participant.

2.2    The SEALS platform
In 2010, participants of the Benchmark, Anatomy and Conference tracks were asked for
the first time to use the SEALS evaluation services: they had to wrap their tools as web
services and the tools were executed on the machines of the tool developers [28]. Since
2011, tool developers have had to implement a simple interface and to wrap their tools in a
predefined way, including all required libraries and resources. A tutorial for tool wrapping
was provided to the participants; it describes how to wrap a tool and how to use a simple
client to run a full evaluation locally. Once the local tests passed successfully, the wrapped
tool was uploaded for a final test on the SEALS portal4 . The evaluation was then executed
by the organizers with the help of the SEALS technology. This approach allowed us to
measure runtime and ensured the reproducibility of the results. As a side effect, it also
ensures that a tool is executed with the same settings for all six tracks that were run in
the SEALS mode. This had already been requested in previous years; however, the rule
was sometimes ignored by participants.
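    To give an idea of what wrapping a tool amounts to, the sketch below shows a hypothetical
minimal bridge interface in the spirit of the SEALS client; the interface and method names
are illustrative assumptions and not the actual SEALS API.

    import java.net.URL;

    // Hypothetical minimal bridge interface, in the spirit of the SEALS wrapping
    // contract: a matcher receives the URLs of two ontologies and returns the URL
    // of the produced alignment (e.g., a file in the Alignment API RDF format).
    interface OntologyMatchingToolBridge {
        URL align(URL sourceOntology, URL targetOntology) throws Exception;
    }

    // A trivial wrapper around an existing matcher implementation.
    class MyMatcherBridge implements OntologyMatchingToolBridge {
        @Override
        public URL align(URL sourceOntology, URL targetOntology) throws Exception {
            // 1. load the two ontologies,
            // 2. run the matcher with its default settings,
            // 3. serialize the resulting alignment to a file,
            // 4. return the URL of that file to the evaluation client.
            java.io.File out = java.io.File.createTempFile("alignment", ".rdf");
            // ... matcher-specific code goes here ...
            return out.toURI().toURL();
        }
    }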

2.3    Preparatory phase
Ontologies to be matched and (where applicable) reference alignments have been pro-
vided in advance during the period between June 15th and July 1st , 2012. This gave
 4
     http://www.seals-project.eu/join-the-community/
potential participants the opportunity to send observations, bug corrections, remarks and
other test cases to the organizers. The goal of this preparatory period is to ensure that
the delivered tests make sense to the participants. The final test base was released on
July 6th , 2012. The data sets did not evolve after that.

2.4   Execution phase
During the execution phase, participants used their systems to automatically match the
test case ontologies. In most cases, ontologies are described in OWL-DL and serialized
in the RDF/XML format [4]. Participants could self-evaluate their results either by comparing
their output with the reference alignments or by using the SEALS client to compute
precision and recall. They could tune their systems with respect to the non-blind evaluation
as long as the rules published on the OAEI web site were satisfied. This phase took place
between July 6th and August 31st , 2012.

2.5   Evaluation phase
Participants have been encouraged to provide (preliminary) results or to upload their
wrapped tools on the SEALS portal by September 1st , 2012. For the SEALS modality,
a full-fledged test including all submitted tools was conducted by the organizers, and
minor problems were reported to some tool developers, until a properly executable
version of every tool had been uploaded to the SEALS portal.
    First results were available by September 22nd , 2012. The track organizers pro-
vided these results individually to the participants. The results were published on the
respective web pages by the track organizers by October 15th . The standard evaluation
measures are precision and recall computed against the reference alignments. To aggregate
these measures over test cases, we used weighted harmonic means, with weights equal to
the number of true positives. We also computed precision/recall graphs; participants were
therefore advised to attach a confidence weight to each correspondence they return. For
some tracks, we additionally computed the degree of alignment coherency. Additionally,
we measured runtimes for all tracks conducted under the SEALS modality.
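    To make this aggregation explicit, write $A_i$ for the alignment returned on test case $i$
and $R_i$ for its reference alignment. With weights $w_i = |A_i \cap R_i|$ (the number of true
positives), the weighted harmonic means of the per-test precisions and recalls reduce to the
globally pooled values:

    P = \frac{\sum_i w_i}{\sum_i w_i / P_i} = \frac{\sum_i |A_i \cap R_i|}{\sum_i |A_i|},
    \qquad
    R = \frac{\sum_i w_i}{\sum_i w_i / R_i} = \frac{\sum_i |A_i \cap R_i|}{\sum_i |R_i|}.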

2.6   Comments on the execution
For a few years, the number of participating systems has remained roughly stable: 4
participants in 2004, 7 in 2005, 10 in 2006, 17 in 2007, 13 in 2008, 16 in 2009, 15
in 2010, 18 in 2011 and 21 in 2012. However, the set of participating systems keeps
changing: in 2012, 7 systems had not participated in any of the previous campaigns.
The list of participants is summarized in Table 2.
    This year only four systems participated in the instance matching track; two of them
(LogMap and LogMapLt) participated also in the SEALS tracks.
    Two tools, OMR and OntoK, are not shown in the table and were not included in
the final evaluation.
    OMR generated alignments with correspondences containing non-existing entities
in one or both of the ontologies being matched. Moreover, it seems that its alignments
    Participants (21): AROMA, ASE, AUTOMSv2, CODI, GOMMA, Hertuda, HotMatch, LogMap,
LogMapLt, MaasMatch, MapSSS, MEDLEY, Optima, SBUEI, semsim, ServOMap, ServOMapLt,
TOAST, WeSeE, WikiMatch, YAM++.

       test case    Confidence  benchmarks  anatomy  conference  multifarm  library  large bio  sandbox  iimb  total
       # systems        14          17         16        18          18        13        13         3      4    102

Table 2. Participants and the state of their submissions: number of systems (out of 21) that
participated in each test case and that returned non-Boolean confidence values; in total, 102
results were submitted.


are iteratively composed one by one, and that the last alignment contains all correspondences
from the other alignments but with changed namespaces, leading in many cases to very low
precision with quite high recall. This behavior was observed across all the tracks.
    OntoK revealed several bugs when preliminary tests were executed; the developers
were not able to fix all of them before proceeding to the final evaluation.
    Another tool was disqualified because we found that, in quite a large number of
cases across different tracks, the results it provided were too often exactly those of
another matcher, including syntactic errors. After further scrutiny, it became obvious
that the system implemented specific tricks to perform well on the OAEI tests rather
than being a genuine matcher.
    Finally, some systems were not able to pass some test cases, as indicated in Table 2.
    The summary of the results track by track is presented in the following sections.


3     Benchmark

The goal of the benchmark data set is to provide a stable and detailed picture of each
algorithm. For that purpose, algorithms are run on systematically generated test cases.


3.1   Test data

The systematic benchmark test set is built around a seed ontology and many variations
of it. Variations are artificially generated, and focus on the characterization of the be-
havior of the tools rather than having them compete on real-life problems. They are
organized in three groups:

Simple tests (1xx) such as comparing the reference ontology with itself;
Systematic tests (2xx) obtained by discarding/modifying features from the reference
    ontology. Considered features are names of entities, comments, the specialization
    hierarchy, instances, properties and classes.
Real-life ontologies (3xx) found on the web.

    A full description of the systematic benchmark test set can be found on the OAEI web
site.
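    As an illustration of the kind of systematic alteration applied, the sketch below removes
all rdfs:label annotations from a seed ontology using OWL API 3.x/4.x-style calls; this is a
simplified example, not the actual generator of [25], which applies many more transformations
(scrambling names, removing the hierarchy, dropping instances, etc.).

    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.model.*;

    import java.io.File;
    import java.util.HashSet;
    import java.util.Set;

    // Simplified example of one systematic alteration: dropping rdfs:label
    // annotations from a seed ontology.
    public class DropLabels {
        public static void main(String[] args) throws Exception {
            OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
            OWLOntology seed = manager.loadOntologyFromOntologyDocument(new File("seed.owl"));

            Set<OWLAxiom> toRemove = new HashSet<>();
            for (OWLAnnotationAssertionAxiom ax : seed.getAxioms(AxiomType.ANNOTATION_ASSERTION)) {
                if (ax.getProperty().isLabel()) {   // rdfs:label annotations only
                    toRemove.add(ax);
                }
            }
            manager.removeAxioms(seed, toRemove);
            manager.saveOntology(seed, IRI.create(new File("altered.owl").toURI()));
        }
    }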
    This year the focus was on scalability, i.e., the ability of matchers to deal with
data sets with an increasing number of elements. To that end, we departed from the usual
bibliographic benchmark that had been used since 2004. We used a test generator [25]
to reproduce the structure of the benchmark for different seed ontologies, from
different domains and with different sizes. We have generated five different benchmarks
against which matchers have been evaluated:

benchmark (biblio) allows for comparison with the systems that have participated since
   2004. The seed ontology concerns bibliographic references and is freely inspired by BibTeX.
   It contains 33 named classes, 24 object properties, 40 data properties, 56 named individuals
   and 20 anonymous individuals. This year we have used a new automatically generated
   version of this benchmark.
benchmark2 is related to the commerce domain. Its seed ontology contains 74
   classes, 106 object properties and 35 named individuals.
benchmark3 is related to bioinformatics. Its seed ontology contains 233 classes, 83
   object properties, 38 data properties and 681 named individuals.
benchmark4 is related to the product design domain. Its seed ontology contains 182
   classes, 88 object properties, 202 data properties and 376 named individuals.
benchmark5 (finance) is based on the Finance ontology5 , which contains 322 classes,
   247 object properties, 64 data properties and 1113 named individuals. It had
   already been used in OAEI 2011 and 2011.5.

    Having these five data sets also allowed us to better evaluate the dependency be-
tween the results and the seed ontology. biblio and finance were disclosed to the par-
ticipants; the other benchmarks were tested in blind mode.
    For all data sets, the reference alignments are still limited: they only match named
classes and properties and use the “=” relation with confidence of 1.


3.2    Results

We evaluated 18 systems from the 22 participating in the SEALS tracks (Table 2). Be-
sides OMR, OntoK and TOAST, which were excluded for the reasons already explained,
CODI could not be executed on our machines due to software license problems. In
the following, we present the evaluation results.

 5
     http://www.fadyart.com/ontologies/data/Finance.owl
Compliance Benchmark compliance tests have been executed on Debian virtual machines
(VMs) with two cores and 8GB RAM, running continuously in parallel, except for the
finance data set, which required 10GB RAM for some systems. For each benchmark
seed ontology, a data set of 94 tests was automatically generated. We excluded from the
whole systematic benchmark test set (111 tests) the tests that were not automatically
generated: 102–104, 203–210, 230–231, 301–304.
    Table 3 shows the compliance results (harmonic means of precision, F-measure and
recall) of the five benchmark data sets for all the participants, as well as those given by
edna, a simple edit-distance algorithm on labels which is used as a baseline. The table
also presents the confidence-weighted values of the same measures.
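    For reference, a baseline of this kind can be sketched as follows (a rough approximation
of what an edit-distance baseline such as edna does; the normalization and the way the
confidence is derived are assumptions, not the actual edna code).

    // Minimal sketch of an edna-style baseline: two entities are matched with a
    // confidence derived from the normalized edit distance between their labels.
    class EditDistanceBaseline {

        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // Confidence in [0,1]: 1.0 for identical labels, lower as labels diverge.
        static double confidence(String label1, String label2) {
            String a = label1.toLowerCase(), b = label2.toLowerCase();
            int maxLen = Math.max(a.length(), b.length());
            return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
        }
    }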
    Only ASE had problems processing the finance data set, and MEDLEY did not
complete the evaluation of the benchmark4 and finance data sets in a reasonable
amount of time (12 hours).
    Table 3 shows that, with few exceptions, all systems achieve higher precision than
recall for all benchmarks. Besides, no tool had worse precision than the baseline, and
only ServOMapLt had a significantly lower recall than the baseline, with LogMap
having slightly lower values for the same measure.
    For those systems which have provided their results with confidence measures dif-
ferent from 1 or 0 (see Table 2), it is possible to draw precision/recall graphs and to
compute weighted precision and recall. Systems providing accurate confidence values
are rewarded by these measures [7]. Precision is increased for systems with many in-
correct correspondences and low confidence, like edna and MaasMatch. Recall is de-
creased for systems with apparently many correct correspondences and low confidence,
like AROMA, LogMap and YAM++. The variation for YAM++ is quite impressive,
especially for the biblio benchmark.
    Precision/recall graphs are given in Figure 1. The graphs show the real precision
at n% recall and they stop when no more correspondences are available; then the end
point corresponds to the precision and recall reported in Table 3.
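    As an illustration of how such a graph can be derived from a confidence-weighted
alignment and a reference alignment, here is a minimal sketch (not the code actually used
for the evaluation; correspondences are encoded as plain strings for simplicity).

    import java.util.*;

    // Sketch of a precision-at-n%-recall computation: correspondences are sorted by
    // decreasing confidence and, for each recall level, the precision of the shortest
    // prefix reaching that level is recorded.
    class PrecisionAtRecall {

        // precisionAt[i] is the precision at recall i/levels; levels that are never
        // reached keep the default value 0.0 (the curve stops there).
        static double[] curve(Map<String, Double> alignment, Set<String> reference, int levels) {
            List<Map.Entry<String, Double>> sorted = new ArrayList<>(alignment.entrySet());
            sorted.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));

            double[] precisionAt = new double[levels + 1];
            int truePositives = 0, returned = 0, nextLevel = 0;
            for (Map.Entry<String, Double> cell : sorted) {
                returned++;
                if (reference.contains(cell.getKey())) truePositives++;
                double recall = (double) truePositives / reference.size();
                double precision = (double) truePositives / returned;
                while (nextLevel <= levels && (double) nextLevel / levels <= recall) {
                    precisionAt[nextLevel++] = precision;
                }
            }
            return precisionAt;
        }
    }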


Comparison across data sets From the results in Table 3, we observe that on average,
all matchers have better performance than the baseline. The group of best systems in
each data set remains relatively the same across the different benchmarks: YAM++,
MapSSS and AROMA seem to generate the best alignments in terms of F-measure.
    We also observe a high variance in the results of some systems across different
benchmarks. Outliers are, for example, a poor precision for AROMA with bench-
mark3 and a poor recall for ServOMapLt with biblio. These variations suggest inter-
dependencies between matching systems and datasets that would need additional anal-
ysis requiring a deep knowledge of the evaluated systems. Such information is, in par-
ticular, useful for developers to detect and fix problems specific to their tools.
    We also compare the results obtained by the tools that have participated in OAEI
2011 and 2012 on biblio and finance benchmarks which have been used in both cam-
paigns. With respect to biblio, we observe negative variations of 2-4% for some
tools, as well as positive variations of 1-3% for others. Regarding finance, the
number of systems able to pass the tests increased, and for many tools that had passed
the tests in previous campaigns, positive variations of 1-3% were observed.
  test        | refalign    | edna                   | AROMA                      | ASE                   | AUTOMSv2              | GOMMA                      | Hertuda
  biblio      | 1.0 1.0 1.0 | .35(.45) .41(.47) .50  | .98(.99) .77(.73) .64(.58) | .49 .51(.52) .54      | .97 .69 .54           | .75(.74) .67(.65) .61(.58) | .90 .68 .54
  benchmark2  | 1.0 1.0 1.0 | .46(.61) .48(.55) .50  | .97(.98) .76(.73) .63(.58) | .72(.74) .61 .53      | .97 .68 .52           | .97 .69(.67) .53(.51)      | .93 .67 .53
  benchmark3  | 1.0 1.0 1.0 | .22(.25) .30(.33) .50  | .38(.43) .53(.54) .83(.73) | .27 .36 .54           | .99(1.0) .70 .54      | .99(1.0) .70(.69) .54(.52) | .94 .68 .54
  benchmark4  | 1.0 1.0 1.0 | .31(.37) .38(.42) .50  | .96 .73(.70) .59(.55)      | .40(.41) .45 .51      | .91(.92) .65 .51(.50) | .86 .63(.62) .50(.49)      | .90 .66 .51
  finance     | 1.0 1.0 1.0 | .22(.25) .30(.33) .50  | .94 .72(.70) .58(.56)      | n.a. n.a. n.a.        | .35 .42(.39) .55(.46) | .95(.94) .66(.64) .51(.49) | .72 .62 .55

  test        | HotMatch    | LogMap                     | LogMapLt    | MaasMatch                  | MapSSS            | MEDLEY
  biblio      | .96 .66 .50 | .73 .56(.51) .45(.39)      | .71 .59 .50 | .54(.90) .56(.63) .57(.49) | .99 .87 .77       | .60(.59) .54(.53) .50(.48)
  benchmark2  | .99 .68 .52 | 1.0 .64(.59) .47(.42)      | .95 .66 .50 | .60(.93) .60(.65) .60(.50) | 1.0 .86 .76(.75)  | .92(.94) .65(.63) .50(.48)
  benchmark3  | .99 .68 .52 | .95(.96) .65(.60) .49(.44) | .95 .65 .50 | .53(.90) .53(.63) .53(.48) | 1.0 .82 .70       | .78 .61(.56) .50(.43)
  benchmark4  | .99 .66 .50 | .99(1.0) .63(.58) .46(.41) | .95 .65 .50 | .54(.92) .54(.64) .54(.49) | 1.0(.99) .81 .68  | to to to
  finance     | .97 .67 .51 | .95(.94) .63(.57) .47(.40) | .90 .66 .52 | .59(.92) .59(.62) .60(.48) | .99 .83 .71       | to to to

  test        | Optima      | ServOMap              | ServOMapLt       | WeSeE                 | WikiMatch             | YAM++
  biblio      | .89 .63 .49 | .88 .58 .43           | 1.0 .33 .20      | .99 .69(.68) .53(.52) | .74 .62 .54           | .98(.95) .83(.18) .72(.10)
  benchmark2  | 1.0 .66 .50 | 1.0 .67 .50           | 1.0 .51 .35(.34) | 1.0 .69(.68) .52      | .97 .67(.68) .52      | .96(1.0) .89(.72) .82(.56)
  benchmark3  | .97 .69 .53 | 1.0 .67 .50           | 1.0 .55 .38      | 1.0 .70(.69) .53      | .96(.97) .68 .52      | .97(1.0) .85(.70) .76(.54)
  benchmark4  | .92 .60 .45 | .89 .60(.59) .45(.44) | 1.0 .41 .26      | 1.0 .67(.66) .50      | .94(.95) .66 .51      | .96(1.0) .83(.70) .72(.54)
  finance     | .96 .66 .51 | .92 .63 .48           | .99 .51(.50) .34 | .99 .70(.69) .54(.53) | .74(.75) .62(.63) .54 | .97(1.0) .90(.72) .84(.57)

Table 3. Results obtained by participants on the five benchmark test cases, aggregated with
harmonic means. Each cell gives precision (P), F-measure (F) and recall (R); values within
parentheses are the confidence-weighted versions of the measures, reported only when different
(n.a.: not available, to: timeout).
   [Figure 1: five precision/recall graphs (precision against recall), one per data set (biblio,
   benchmark2, benchmark3, benchmark4 and finance), for edna, AROMA, ASE, AUTOMSv2,
   GOMMA, Hertuda, HotMatch, LogMap, LogMapLt, MaasMatch, MapSSS, MEDLEY, Optima,
   ServOMap, ServOMapLt, WeSeE, Wmatch and YAM++.]

Fig. 1. Precision/recall graphs for benchmarks. The alignments generated by matchers are cut
under a threshold necessary for achieving n% recall and the corresponding precision is com-
puted. Systems for which these graphs are not meaningful (because they did not provide graded
confidence values) are drawn in dashed lines.
   [Figure 2: semi-log plot of runtime (seconds) against ontology size (classes + properties)
   for the reduced versions of the finance data set.]
Fig. 2. Runtimes for different versions of finance (f25=finance25%, f50=finance50%,
f75=finance75%, f=finance).

   [Figure 3: semi-log plot of runtime (seconds) against ontology size (classes + properties)
   for the five benchmark seed ontologies.]
Fig. 3. Benchmark track runtimes (b=biblio, b2=benchmark2, b3=benchmark3, b4=benchmark4,
f=finance).


Runtime Regarding runtime, scalability has been evaluated from two perspectives:
on the one hand we considered the five seed ontologies from different domains and
with different sizes; on the other hand we considered the finance ontology scaling it
by reducing its size by different factors (25%, 50% and 75%). For both modalities,
the data sets consisted of a subset of 15 tests extracted from the whole systematic
benchmark.
   All the experiments were done on a 3GHz Xeon 5472 (4 cores) machine running
Linux Fedora 8 with 8GB RAM. Figures 2 and 3 show semi-log graphs for runtime
measurements against data set sizes in terms of classes and properties.
    First of all, we observe that for the finance tests the majority of tools have a
monotonically increasing runtime, with the exception of MapSSS, which exhibits an almost
constant response time. On the contrary, this does not happen for the benchmark tests, for
which benchmark3 causes a break in the monotonic behavior. One reason for this could be
that the benchmark3 ontology has a more complex structure than the other ones, and that
matchers relying on structural analysis are more affected than others.
    The figures also show that a set of tools distances itself from the others: LogMapLt,
ServOMapLt, LogMap, ServOMap, GOMMA and AROMA are the fastest tools and are able
to process large ontologies in a short time. On the contrary, some tools were not able to
deal with large ontologies under the same conditions: MEDLEY and MapSSS fall into this
category.

3.3   Conclusions
Having five different benchmarks allowed us to see the degree of dependency on the
shapes and sizes of the seed ontologies. Even if differences were observed in the results
obtained, we can conclude that, except in a few cases, the tools do not show large
variations in their results across the different benchmarks.
    Regarding compliance, we observed that, with very few exceptions, the systems always
performed better than the baseline. However, there were no significant improvements
in the performance of the systems with respect to their performance in the last OAEI
campaigns (OAEI 2011 and 2011.5).
    Regarding runtime, we noticed that, given ontologies of different sizes that largely share
their structure and knowledge domain, the response time generally increases monotonically
with size. On the contrary, this is not always true if the shapes or the knowledge domains
of the ontologies change.
    The results obtained this year confirm that there is no general correlation between
runtime and the quality of alignments: the slowest tools do not necessarily provide the best
compliance results.


4     Anatomy
The anatomy track confronts matchers with a specific type of ontologies from the
biomedical domain. In this domain, many ontologies have been built covering differ-
ent aspects of medical research. We focus on two fragments of biomedical ontologies
which describe the human anatomy and the anatomy of the mouse. The data set of
this track has been used since 2007 with some improvements over the last years. For a
detailed description, we refer the reader to the OAEI 2007 results paper [9].

4.1   Experimental setting
Contrary to previous years, we conducted only a single evaluation experiment by ex-
ecuting each matcher in its standard setting. In our experiments, we compare preci-
sion, recall, F-measure and recall+. The measure recall+ indicates the amount of de-
tected non-trivial correspondences. The matched entities in a non-trivial correspon-
dence do not have the same normalized label. The approach that generates only trivial
correspondences is referred to as the baseline StringEquiv in the following. In
OAEI 2011/2011.5, we executed the systems ourselves (instead of analyzing submitted
alignments) and reported measured runtimes. Unfortunately, we did not use
exactly the same machine compared to previous years. Thus, runtime results are not
fully comparable across years. In 2012, we used an Ubuntu machine with 2.4 GHz (2
cores) and 3GB RAM allocated to the matching systems. Further, we used the SEALS
client to execute our evaluation. However, we slightly changed the way precision and
recall are computed, i.e., the results generated by the SEALS client vary in some cases
by 0.5% compared to the results presented below. In particular, we remove trivial cor-
respondences in the oboInOwl namespace like

  http://...oboInOwl#Synonym = http://...oboInOwl#Synonym

as well as correspondences expressing relations different from equivalence. We also
checked whether the generated alignment is coherent, i.e., there are no unsatisfiable
concepts when the ontologies are merged with the alignment.
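    Roughly, the StringEquiv baseline keeps exactly the trivial correspondences, and
recall+ is recall restricted to the non-trivial part of the reference alignment. The sketch
below makes this reading concrete (illustrative only; the label normalization shown is an
assumption, not the exact procedure used).

    import java.util.*;

    // Illustrative computation of the StringEquiv baseline and of recall+.
    class AnatomyMeasures {

        static final class Correspondence {
            final String entity1, entity2;   // URIs of the matched entities
            final String label1, label2;     // their labels
            Correspondence(String e1, String e2, String l1, String l2) {
                entity1 = e1; entity2 = e2; label1 = l1; label2 = l2;
            }
            String key() { return entity1 + "=" + entity2; }
        }

        // Assumed normalization: lower-casing and stripping non-alphanumerics.
        static String normalize(String label) {
            return label.toLowerCase().replaceAll("[^a-z0-9]", "");
        }

        // A correspondence is trivial when the normalized labels are equal;
        // StringEquiv returns exactly these correspondences.
        static boolean isTrivial(Correspondence c) {
            return normalize(c.label1).equals(normalize(c.label2));
        }

        // recall+ : fraction of the non-trivial reference correspondences found by a system.
        static double recallPlus(Set<String> systemKeys, List<Correspondence> reference) {
            int nonTrivial = 0, found = 0;
            for (Correspondence ref : reference) {
                if (isTrivial(ref)) continue;          // trivial correspondences are ignored
                nonTrivial++;
                if (systemKeys.contains(ref.key())) found++;
            }
            return nonTrivial == 0 ? 0.0 : (double) found / nonTrivial;
        }
    }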


4.2   Results

In Table 4, we list all the participating systems that generated an alignment in less
than ten hours. The listing comprises 17 entries. Three systems participated with two
different versions each: GOMMA (the extension "-bk" refers to the usage of background
knowledge), LogMap and ServOMap (the latter two submitted an additional lightweight
version that uses only some core components). Thus, 14 different systems generated an
alignment within the given time frame. Three participants, ASE, AUTOMSv2 and MEDLEY,
did not finish in time or threw an exception.
Due to several hardware and software requirements, we could not install TOAST on the
machine on which we executed the other systems. We executed the matcher on a dif-
ferent machine of similar strength. For this reason, the runtime of TOAST is not fully
comparable to the other runtimes (indicated by an asterisk).
    Compared to previous years, we can observe a clear speed increase. In 2012, five
systems (counting two versions of the same system as one) finished in less than 100
seconds, compared to two systems in OAEI 2011 and three systems in OAEI 2011.5.
This is a positive trend. Moreover, in 2012 we were finally able
to generate results for 14 of 17 systems, while in 2011 only 7 of 14 systems generated
results of acceptable quality within the given time frame. The top systems in terms of
runtimes are GOMMA, LogMap and ServOMap. Depending on the specific version of
the systems, they require between 6 and 34 seconds to match the ontologies. Table 4
shows that there is no correlation between the quality of the generated alignment in
terms of precision and recall and the required runtime. This result has also been ob-
served in previous campaigns.
    Table 4 also shows the results for precision, recall and F-measure. We ordered the
matching systems with respect to the achieved F-measure. The F-measure is an ag-
gregation of precision and recall. Depending on the application for which the generated
alignment is used, it might, for example, be more important to favor precision over recall
or vice versa. In terms of F-measure, GOMMA-bk is ahead of the other participants.
The differences of GOMMA-bk compared to GOMMA (and the other systems) are
based on mapping composition techniques and the reuse of mappings between UMLS,
Uberon and FMA. GOMMA-bk is followed by a group of matching systems (YAM++,
CODI, LogMap, GOMMA) generating alignments that are very similar with respect to
precision, recall and F-measure (between 0.87 and 0.9 F-measure). To our knowledge,
 Matcher        Runtime(s)    Size   Precision   F-measure    Recall    Recall+   Coherent
 GOMMA-bk            15       1534     0.917        0.923      0.928     0.813        -
 YAM++               69       1378     0.943        0.898      0.858     0.635        -
 CODI               880       1297     0.966        0.891      0.827     0.562        √
 LogMap              20       1392     0.920        0.881      0.845     0.593        √
 GOMMA               17       1264     0.956        0.870      0.797     0.471        -
 MapSSS             453       1212     0.935        0.831      0.747     0.337        -
 WeSeE             15833      1266     0.911        0.829      0.761     0.379        -
 LogMapLt             6       1147     0.963        0.829      0.728     0.290        -
 TOAST              3464*     1339     0.854        0.801      0.755     0.401        -
 ServOMap             34       972     0.996        0.778      0.639     0.054        -
 ServOMapLt           23       976     0.990        0.775      0.637     0.052        -
 HotMatch            672       989     0.979        0.773      0.639     0.145        -
 AROMA                29      1205     0.865        0.766      0.687     0.321        -
 StringEquiv           -       946     0.997        0.766      0.622     0.000        -
 Wmatch            17130      1184     0.864        0.758      0.675     0.157        -
 Optima             6460      1038     0.854        0.694      0.584     0.133        -
 Hertuda             317      1479     0.690        0.681      0.673     0.154        -
 MaasMatch         28890      2737     0.434        0.559      0.784     0.501        -

Table 4. Comparison against the reference alignment; runtime is measured in seconds; the “size”
column refers to the number of correspondences in the generated alignment.


these systems either do not use specific background knowledge for the biomedical do-
main or use it only in a very limited way. The results of these systems are at least as
good as the results of the best system in OAEI 2007-2010; only AgreementMaker, using
additional background knowledge, could generate better results in OAEI 2011. Most of
the evaluated systems achieve an F-measure that is higher than the baseline that is based
on (normalized) string equivalence. Moreover, nearly all systems find many non-trivial
correspondences. An exception is the system ServOMap (and its lightweight version)
that generates an alignment that is quite similar to the alignment generated by the base-
line approach.
    Concerning alignment coherency, only CODI and LogMap generated coherent
alignments. We have to conclude that there have been no improvements compared
to OAEI 2011 with respect to taking alignment coherence into account. LogMap and
CODI had already generated coherent alignments in 2011. Furthermore, it can be observed
(see Section 5) that YAM++ generates coherent alignments for the ontologies of the
Conference track, which are much smaller but more expressive, while it fails to generate
coherent alignments for the larger biomedical ontologies (see also Section 8). This might
be due to using different settings for larger ontologies in order to avoid reasoning problems
with larger inputs.


4.3   Conclusions

Most of the systems top the string equivalence baseline with respect to F-measure.
Moreover, we reported that several systems achieve very good results compared to the
evaluations of the previous years. A clear improvement compared to previous years
can be seen in the number of systems that are able to generate such results. It is also a
positive trend that more matching systems can produce good results within short runtimes.
This might partially be due to the Anatomy track having been offered in its current
form over the last six years, together with the publication of matcher runtimes. At the same
time, new tracks that deal with large (and very large) matching tasks are offered. These
tasks can only be solved with efficient matching strategies, which have been implemented
over the last years.


5     Conference

The conference test case consists of matching several moderately expressive ontologies.
Within this track, participant results were evaluated against reference alignments (containing
only equivalence correspondences) and by using logical reasoning. As last year, the
evaluation has been supported by the SEALS technology. This year we used refined and
harmonized reference alignments.


5.1    Test data

The collection consists of sixteen ontologies in the domain of organizing conferences.
These ontologies have been developed within the OntoFarm project6 .
   The main features of this test case are:

    – Generally understandable domain. Most ontology engineers are familiar with organizing
      conferences. Therefore, they can create their own ontologies as well as evaluate the
      alignments among their concepts with sufficient expertise.
    – Independence of ontologies. The ontologies were developed independently and based
      on different resources; they thus capture the issues of organizing conferences from
      different points of view and with different terminologies.
    – Relative richness in axioms. Most ontologies were equipped with OWL DL axioms
      of various kinds; this opens a way to use semantic matchers.

    The ontologies differ in their numbers of classes and properties, in expressivity, and also
in their underlying resources.


5.2    Results

This year, we provide results in terms of F0.5-measure, F1-measure and F2-measure, a
comparison with a baseline matcher, a precision/recall triangular graph, and a coherency
evaluation.
 6
     http://nb.vse.cz/˜svatek/ontofarm.html
Evaluation based on reference alignments We evaluated the results of participants
against new reference alignments (labelled as ra2 on the conference web-page). This
includes all pairwise combinations between 7 different ontologies, i.e. 21 alignments.
    New reference alignments have been generated as a transitive closure computed
on the original reference alignments. In order to obtain a coherent result, conflicting
correspondences, i.e., those causing unsatisfiability, have been manually inspected and
removed. As a result the degree of correctness and completeness of the new reference
alignment is probably slightly better than for the old one. However, the differences are
relatively limited. Whereas the new reference alignments are not open, the old reference
alignments (labelled as ra1 on the conference web-page) are available; these represent a
close approximation of the new ones.
             Matcher       Precision F0.5 -measure F1 -measure F2 -measure Recall
            YAM++             .78          .75            .71           .67         .65
            LogMap            .77          .71            .63           .57         .53
             CODI             .74          .69            .63           .58         .55
            Optima            .60          .61            .61           .62         .63
           GOMMA              .79          .68            .56           .47         .43
            Hertuda           .70          .63            .56           .49         .46
           MaasMatch          .60          .58            .56           .53         .52
            Wmatch            .70          .63            .55           .48         .45
             WeSeE            .72          .64            .55           .48         .44
           HotMatch           .67          .62            .55           .50         .47
           LogMapLt           .68          .62            .54           .48         .45
            Baseline          .76          .64            .52           .43         .39
           ServOMap           .68          .60            .51           .45         .41
          ServOMapLt          .82          .65            .50           .41         .36
           MEDLEY             .59          .55            .49           .45         .42
             ASE*             .61          .55            .48           .43         .40
            MapSSS            .47          .47            .46           .46         .46
          AUTOMSv2*           .64          .54            .44           .37         .33
            AROMA             .33          .34            .37           .39         .41

Table 5. The highest average F[0.5|1|2]-measures with their corresponding precision and recall,
for each matcher at its F1-optimal threshold.


    Table 5 shows the results of all participants with regard to the new reference align-
ment. F0.5 -measure, F1 -measure and F2 -measure are computed for the threshold that
provides the highest average F1 -measure. F1 is the harmonic mean of precision and re-
call where both are equally weighted; F2 weights recall higher than precision and F0.5
weights precision higher than recall. The matchers shown in the table are ordered ac-
cording to their highest average F1 -measure. Our baseline7 divides matchers into two
groups. Group 1 consists of matchers (YAM++, LogMap, CODI, Optima, GOMMA,
Hertuda, MaasMatch, Wmatch, WeSeE, HotMatch and LogMapLt) having better (or
equal) results than Baseline. Other matchers (ServOMap, ServOMapLt, MEDLEY,
 7
     String matcher based on string equality applied on local names of entities which were lower-
     cased before.
ASE, MapSSS, AUTOMSv2 and AROMA) performed worse than the baseline. Two
matchers (ASE and AUTOMSv2), marked with asterisks, did not generate 3 of the 21
alignments; their results are thus only an approximation.
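    Assuming the standard definition of the F$_\beta$-measure, which is consistent with the
description above, the three variants combine precision $P$ and recall $R$ as:

    F_{\beta} = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R},
    \qquad
    F_1 = \frac{2PR}{P + R}, \quad
    F_2 = \frac{5PR}{4P + R}, \quad
    F_{0.5} = \frac{1.25\,PR}{0.25\,P + R}.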
    Performance of matchers from Group 1 regarding F1 -measure is visualized in Fig-
ure 4.
   [Figure 4: precision/recall triangular graph for YAM++, LogMap, CODI, Optima, GOMMA,
   Hertuda, MaasMatch, Wmatch, WeSeE, HotMatch, LogMapLt and the Baseline, with iso-lines
   at F1-measure = 0.5, 0.6 and 0.7.]

Fig. 4. Precision/recall triangular graph for the conference track. Matchers are represented as
squares and Baseline is represented as a circle. Dotted lines depict level of precision/recall
while values of F1 -measure are depicted by areas bordered by corresponding lines F1 -
measure=0.[5|6|7].


Comparison with previous years Seven matchers also participated in OAEI 2011 and
10 matchers participated in OAEI 2011.5. The largest improvement was achieved by
Optima (precision from .23 to .60 and recall from .52 to .63) and YAM++ (precision
from .74 to .78 and recall from .51 to .65) between OAEI 2011 and 2012. Four matchers
improved between OAEI 2011.5 and 2012, and five matchers improved between OAEI 2011
and 2012.

Runtimes We measured the total time needed to generate all 120 alignments. The evaluation
was executed on a laptop running Ubuntu with an Intel Core i5 at 2.67GHz and 4GB RAM. In
Hertuda - 49 seconds, ServOMapLt - 50 seconds and AROMA - 55 seconds). Next, four
matchers needed less than 2 minutes (ServOMap, HotMatch, GOMMA and ASE). 10
minutes were enough for the next four matchers (LogMap, MapSSS, MaasMatch and
AUTOMSv2). Finally, 5 matchers needed up to 40 minutes to finish all 120 test cases
(Optima - 22 min, MEDLEY - 30 min, WeSeE - 36 min, CODI - 39 min and Wmatch -
40 min). YAM++ did not finish the task of matching all 120 test cases within five hours.
    In conclusion, regarding performance, we can see (clearly from Figure 4) that
YAM++ is on top. The next three matchers (LogMap, CODI and Optima) are relatively
close to each other. This year, the group of matchers above the baseline is larger than in
previous years. Moreover, it is very positive that several matchers managed to improve
their performance within as little as one year or even half a year.


Evaluation based on alignment coherence As in previous years, we applied the Max-
imum Cardinality measure to evaluate the degree of alignment incoherence. Details on
this measure and its implementation can be found in [19]. The results of our experi-
ments are depicted in Table 6. Contrary to last year, we computed the average only over
the test cases of the conference track for which there exists a reference alignment. The
presented results are thus aggregated mean values for 21 test cases. In some cases we
could not compute the degree of incoherence due to the combinatorial complexity of
the problem. In this case we were still able to compute a lower bound for which we
know that the actual degree is (probably only slightly) higher. Such results are marked
with a *. Note that we only included in our evaluation those matchers that generated
alignments for all test cases of the subset with reference alignments.
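    Informally, and assuming that the measure follows the maximum-cardinality idea of [19]
(the precise definition is given there), the degree of incoherence of an alignment $A$ between
ontologies $O_1$ and $O_2$ can be read as the relative number of correspondences that have
to be removed to obtain a coherent subalignment:

    \mathrm{inc}(A, O_1, O_2) = \frac{|A| - \max\{\,|A'| : A' \subseteq A,\ A' \text{ coherent w.r.t. } O_1 \text{ and } O_2\,\}}{|A|}.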




          Matcher                AROMA   CODI   GOMMA   Hertuda   HotMatch   LogMap   LogMapLt   MaasMatch*
          Alignment Size          20.9   11.1     8.2      9.9       10.5      10.3       9.9        83.1
          Incoherence Degree     19.4%     0%    1.1%     5.2%         5%        0%      5.4%       24.3%
          Incoherent Alignments     18      0       2        9          9         0         7          20

          Matcher                MapSSS   MEDLEY*   Optima   ServOMap   ServOMapLt   Wmatch   WeSeE   YAM++
          Alignment Size           14.8      55.6     15.9        9           6.5      9.4     9.9    12.5
          Incoherence Degree      12.6%     30.5%     7.6%     3.3%            0%      3.2%      6%      0%
          Incoherent Alignments      18        20       12        5             0        6       10       0

Table 6. Average size of alignments, average degree of incoherence, and number of incoherent
alignments. The mark * indicates that only a lower bound of the degree of incoherence could be
computed, due to the combinatorial complexity of the problem.


    Four matchers can generate coherent alignments: CODI, LogMap, ServOMapLt, and
YAM++. However, it is not always clear whether this is related to a specific approach
that tries to ensure coherency, or whether it is only indirectly caused by generating small
and highly precise alignments. In particular, the coherence of the alignments from
ServOMapLt, which does not apply any semantic technique, might be explained in this way:
the matcher generates overall the smallest alignments. Since some matchers cannot generate
coherent alignments even though their alignments contain on average only 8 to 12
correspondences, it can be assumed that CODI, LogMap, and YAM++ have implemented
specific coherency-preserving methods: those matchers also generate between 8 and 12
correspondences per alignment, yet none of their alignments is incoherent. This is an
important improvement compared to previous years, in which we observed that only one or
two matchers managed to generate (nearly) coherent alignments.


6     MultiFarm

In order to be able to evaluate the ability of matching systems to deal with ontologies in
different languages, the MultiFarm dataset has been proposed [21]. This dataset results from the translation of seven Conference track ontologies (cmt, conference, confOf, iasted, sigkdd, ekaw and edas) into eight languages: Chinese, Czech, Dutch, French, German, Portuguese, Russian, and Spanish. Together with the original English versions, these translations result in 36 pairs of languages and, overall, 36 × 49 matching tasks (see [21] for details on the pairs of languages and ontologies).


6.1   Experimental setting

For the 2012 evaluation campaign, we have used a subset of the whole MultiFarm
dataset, omitting all the pairs of matching tasks involving the ontologies edas and ekaw
(resulting in 36 × 25 matching tasks). This allows for using the omitted test cases as
blind evaluation tests in the future. Contrary to OAEI 2011.5, we have included the
Chinese and Russian translations.
     Within the MultiFarm dataset, we can distinguish two types of matching tasks: (i) test cases in which two different ontologies have been translated into different languages (cmt–confOf, for instance); and (ii) test cases in which the same ontology has been translated into different languages (cmt–cmt, for instance). For the test cases of type (ii), good results are not directly related to the use of specific techniques for dealing with ontologies in different natural languages, but rather to the ability to exploit the fact that both ontologies have an identical structure (and that the reference alignment covers all entities described in the ontologies).
     This year, seven participating systems (out of the 21 systems that participated in OAEI, see Table 2) use specific multilingual methods: ASE, AUTOMSv2, GOMMA, MEDLEY, WeSeE, Wmatch, and YAM++. The other systems are not specifically designed
to match ontologies in different languages, nor do they make use of a component that
can be used for that purpose.
     ASE (a version of AUTOMSv2) uses the Microsoft Bing Translator API for translating the ontologies to English. This translation is performed before the ASE profiling, configuration and matching methods are executed, so these steps operate only on English-labeled copies of the ontologies. AUTOMSv2 follows a similar approach, but re-uses a free Java API named WebTranslator. GOMMA uses a free translation API (MyMemory) for translating non-English concept labels to English; the translations are associated with the concepts as new synonyms. In this way, GOMMA iteratively builds a bilingual dictionary for each ontology, which is used within the matching process. WeSeE and YAM++, like AUTOMSv2, use Microsoft Bing Translation for translating the labels contained in the input ontologies to English; the translated English ontologies are then matched using the standard matching procedures of WeSeE and YAM++. Finally, Wmatch exploits Wikipedia for extracting inter-language links. All matchers (with the exception of Wmatch) use English as a pivot language. MEDLEY is the only matcher for which we have no information on the techniques it exploits to deal with multilingualism.
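    To give an idea of the pivot-language strategy shared by most of these systems, the following minimal Python sketch translates non-English labels to English before handing them to an ordinary monolingual label matcher. The translate function is a placeholder for whichever external service (Bing Translator, WebTranslator, MyMemory) a given system actually calls, and the exact-match step stands in for the system's real matching procedure.

# Minimal sketch of the pivot-language strategy: translate all labels to
# English, then match as usual. `translate` is a placeholder for an external
# translation service (Bing Translator, WebTranslator, MyMemory, ...).

def translate(label, source_lang, target_lang="en"):
    # Placeholder: a real system would call a translation API here.
    raise NotImplementedError

def to_english(labels, source_lang):
    """Return an English-labeled copy of an ontology's labels
    (maps entity URI -> label)."""
    return {entity: translate(label, source_lang) for entity, label in labels.items()}

def match_labels(labels1, labels2):
    """Naive monolingual matcher: align entities with identical English labels."""
    index = {label.lower(): entity for entity, label in labels2.items()}
    return [(e1, index[l1.lower()]) for e1, l1 in labels1.items()
            if l1.lower() in index]

# Usage (e.g., for the pt-fr pair, both sides are first translated to English):
# alignment = match_labels(to_english(labels_pt, "pt"), to_english(labels_fr, "fr"))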


6.2   Execution setting and runtime

All systems (with the exception of CODI) have been executed on a 3GHz Xeon 5472 (4 cores) machine running Linux Fedora 8 with 8GB RAM. The runtimes for each system can be found in Table 7. As CODI has been executed in a different setting, its runtime cannot be compared with the runtimes of the other systems. We observe large differences in the time required by the systems to complete the 36×25 matching tasks: while WeSeE requires ≈ 15 minutes, Wmatch takes ≈ 17 hours. This is mainly due to the fact that Wmatch uses an external resource (Wikipedia) for looking up inter-language links, which requires considerably more time than simpler requests for translations.


6.3   Overall results

Before discussing the results per pair of languages, we present the results aggregated over the test cases of type (i) and of type (ii), as shown in Table 7. Systems not listed in this table have generated empty alignments for all test cases (ServOMap and ServOMapL), have thrown exceptions (ASE, OMR, OntoK), or have not been evaluated due to their execution requirements (TOAST). AROMA was not able to generate alignments for the test cases of type (i).
    As shown in Table 7, we can observe significant differences between the results obtained for each type of matching task, especially in terms of precision. While the systems that implement specific multilingual techniques clearly generate the best results for test cases of type (i), only one of these systems (YAM++) is among the top three F-measures for type (ii) test cases. For these test cases, MapSSS and CODI, which implement strategies to deal with ontologies that share structural similarities, obtain better results. Thanks to this feature, they have preserved their overall performance this year (using the same version as in the last campaign), even though harder test cases (Chinese and Russian translations) have been included in 2012. For the other matchers in the same situation, on the other hand, the differences in the results are explained by the presence of these harder test cases this year.
    Furthermore, as observed in the OAEI 2011.5 campaign and corroborated in 2012,
MapSSS and CODI have generated very good results on the benchmark track. This
suggests a strong correlation between the ranking in Benchmark and the ranking for
MultiFarm test cases of type (ii), while there is, on the other hand, no (or only a very weak) correlation between the results for test cases of type (i) and type (ii). For that reason, in the following we only analyze the results for test cases of type (i).
                                                Different ontologies (i)   Same ontologies (ii)
                         System     Runtime    Prec.   Fmeas.    Rec.      Prec.   Fmeas.   Rec.
      Multilingual    AUTOMSv2        512.7     .49     .36      .10        .69     .24     .06
                         GOMMA         35.0     .29     .31      .36        .63     .38     .29
                        MEDLEY         76.5     .16     .16      .07        .34     .18     .09
                         WeSeE         14.7     .61     .41      .32        .90     .41     .27
                        Wmatch       1072.0     .22     .21      .22        .43     .17     .11
                         YAM++        367.1     .50     .40      .36        .91     .60     .49
      Non specific       AROMA          6.9                                 .31     .01     .01
                          CODI            x     .17     .08      .02        .82     .62     .50
                       Hertuda         23.5     .00     .01     1.00        .02     .03    1.00
                      HotMatch         16.5     .00     .01      .00        .40     .04     .02
                        LogMap         14.9     .17     .09      .02        .35     .03     .01
                      LogMapLt          5.5     .12     .07      .02        .30     .03     .01
                     MaasMatch        125.0     .02     .03      .14        .14     .14     .14
                        MapSSS         17.3     .08     .09      .04        .97     .66     .50
                        Optima        142.5     .00     .01      .59        .02     .03     .41

Table 7. MultiFarm aggregated results per matcher, for each type of matching task – types (i) and
(ii). Runtime is measured in minutes (time for completing the 36 × 25 matching tasks). The top-5
values for each column are marked in bold-face.


6.4           Language specific results

Table 8 shows the results aggregated per language pair. For the sake of readability, we
present only F-measure values. The reader can refer to the OAEI results web page for
more detailed results on precision and recall.
     As expected and already reported above, the systems that apply specific strategies to deal with multilingual labels outperform all other systems in terms of overall F-measure for both types of test cases: YAM++, followed by WeSeE, GOMMA, AUTOMSv2, Wmatch, and MEDLEY, respectively. Wmatch is able to deal with all pairs of languages, which is not the case for AUTOMSv2 and MEDLEY, especially for the pairs involving Chinese, Czech and Russian.
     Most of the systems translating non-English ontology labels to English obtain better scores on pairs in which English is present (grouping the results by pair, YAM++ is the typical case). This is due to the fact that multiple translations (pt → en and fr → en for matching pt → fr, for instance) may result in more ambiguous translated concepts, which makes it harder to find correct correspondences. Furthermore, as somewhat expected, good results are also obtained for pairs of languages with similar vocabularies (es-pt and fr-pt, for instance). These two observations may explain the top F-measures of the specific multilingual methods: AUTOMSv2 (es-pt, en-es, de-nl, en-nl), GOMMA (en-pt, es-pt, cn-en, de-en), MEDLEY (en-fr, en-pt, cz-en, en-es), WeSeE (en-es, es-fr, en-pt, es-pt, fr-pt), YAM++ (cz-en, cz-pt, en-pt). Wmatch shows an interesting pattern in that Russian appears among its top F-measure pairs: nl-ru, en-es, en-nl, fr-ru, es-ru. This may be explained by its use of Wikipedia inter-language links, which are not limited to English or to similar languages.
          Pair   AUTOMSv2     CODI   GOMMA  Hertuda    HotMatch LogMap   LogMapLt   MaasMatch   MapSSS   MEDLEY   Optima    WeSeE  Wmatch    YAM++
          cn cz                      .34     .01                                      .01                          .01      .29     .21      .35
          cn de                      .34     .01                                      .01                          .01      .25     .08      .35
          cn en                      .41     .01                                      .00                          .01      .31     .10      .41
          cn es                      .33     .01       .01                            .00                          .01      .34     .14      .20
           cn fr                     .35     .01       .01                            .00                          .01      .32     .09      .39
          cn nl                      .25     .01       .01                            .00                          .01      .23     .10      .34
          cn pt                      .31     .01                                      .01                          .01      .30     .09      .35
          cn ru                      .27     .01       .01                            .00                          .01      .27     .13      .33
          cz de               .10    .24     .01           .10             .09        .06        .07      .19      .01      .41     .24      .45
          cz en               .07    .36     .01           .05             .04        .06        .08      .28      .01      .48     .24      .58
          cz es               .11    .30     .01           .11             .11        .06        .11      .13      .01      .47     .25      .20
           cz fr              .01    .16     .01       .01 .01             .01        .04        .01      .08      .01      .47     .20      .53
           cz nl              .09    .21     .01           .04             .04        .07        .05      .09      .01      .48     .21      .55
           cz pt              .15    .37     .01           .13             .13        .06        .12      .18      .01      .44     .12      .57
          cz ru                      .21     .01                                      .00                          .01              .17      .49
          de en    .38        .20    .41     .01                  .22      .20        .06        .16      .27      .01      .39     .28      .52
          de es    .35        .06    .35     .01                  .12      .06        .06        .15      .15      .01      .41     .24      .20
           de fr   .32        .04    .21     .01                  .04      .04        .05        .13      .13      .01      .41     .25      .46
          de nl    .39        .05    .24     .01                  .04      .04        .06        .15      .12      .01      .37     .25      .40
          de pt    .35        .08    .36     .01                  .07      .07        .05        .06      .15      .01      .35     .22      .42
          de ru                      .33     .01       .01                            .01                          .01      .42     .26      .47
          en es    .42        .04    .40     .01           .15             .04        .08        .18      .28      .01      .52     .30      .23
           en fr   .31        .04    .36     .01       .01 .06             .04        .09        .13      .33      .01      .48     .27      .53
          en nl    .39        .10    .38     .01           .08             .10        .09        .15      .24      .01      .49     .29      .53
          en pt    .37        .08    .45     .01           .06             .06        .06        .07      .30      .01      .51     .26      .56
          en ru                      .34     .01                                      .01                          .01      .43     .25      .47
           es fr   .37 .01           .29     .01                  .07 .01             .08        .06 .06           .01      .52     .24      .20
           es nl   .38               .33     .01                                      .05        .01 .03           .01      .46     .27      .16
           es pt   .44 .22           .44     .01                  .24 .23             .11        .23 .22           .01      .51     .25      .25
           es ru                     .21     .01                                      .00                          .01              .28      .19
           fr nl   .27 .13           .21     .01       .01 .13 .12                    .07        .11 .16           .01      .43     .28      .47
           fr pt   .35               .32     .01       .01                            .06        .02 .08           .01      .50     .23      .53
           fr ru                     .24     .01                                      .00                          .01      .47     .28      .46
           nl pt   .37 .04           .32     .01                  .01 .01             .05        .02 .06           .01      .47     .20      .51
           nl ru                     .22     .01                                      .01                          .01      .42     .31      .42
           pt ru                     .30     .01                                      .00                          .01      .45     .17      .44

Table 8. MultiFarm results (F-measure) per pair of languages, for the test cases of type (i). Empty cells represent empty alignments, which we distinguish from wrong ones.


    Among the non-specific systems, none of which can deal with Chinese and Russian at all, MapSSS, LogMap and CODI obtain the better results. These systems perform better on some specific pairs: MapSSS (es-pt, en-es, de-en), LogMap and LogMapLt (es-pt, de-en, en-es, cz-pt), CODI (es-pt, de-en, cz-pt). For all of these systems, the pairs es-pt and de-en obtain the best F-measures. Again, we can see that similarities in the language vocabularies play an important role in the matching task. On the other hand, although it is likely harder to find correspondences for cz-pt than for es-pt, the best scores of some systems include such combinations (cz-pt for CODI and LogMapLt). This can be explained by the specific way each system combines its internal matching techniques (ontology structure, reasoning, coherence, linguistic similarities, etc.).


6.5   Conclusions

We observe that systems with specific methods for dealing with ontologies described in different languages work much better than non-specific systems, which is the expected behavior. However, the absolute results are still not very good when compared to the top results on the original Conference dataset (≈ .75 F-measure for the best matcher). Of all the specific multilingual methods, the techniques implemented in YAM++ generate the best alignments in terms of F-measure (≈ .53 overall F-measure for both types of matching tasks), followed by WeSeE and GOMMA, respectively. With the exception of Wmatch, all systems use English as a pivot language.
    In the OAEI 2011.5 campaign, only 3 out of 19 participants used specific multilingual techniques. This year, new systems implementing specific multilingual methods have joined (seven out of 21). Although there is room for improvement before reaching the same level of performance as on the original dataset, the increasing number of matchers dealing with multilingual matching is a sign that the field is progressing.


7     Library

The library track is a new track within the OAEI. Its challenge is to match two real-world thesauri: TheSoz (social sciences) and STW (economics). However, a library track using different thesauri8 already existed from 2007 to 2009 [9; 3; 6], as well as other thesaurus tracks like the food track9 and the environment track.10 A common motivation is that these tracks use a real-world scenario, i.e., real thesauri. For us, a further motivation is to develop a better understanding of how thesauri differ from ontologies and how these differences affect state-of-the-art ontology matchers. We hope that the community accepts the challenge and that significant improvements will subsequently push the quality of automatic alignments between thesauri. Furthermore, we will use the matching results as input for the maintainers of the reference alignment to improve it. While a full manual evaluation of all matching results is certainly not feasible, in this way we can constantly improve the reference alignment and mitigate possible weaknesses and incompleteness.
 8
   http://oaei.ontologymatching.org/2009/library/
 9
   http://oaei.ontologymatching.org/2006/food/
10
   http://oaei.ontologymatching.org/2007/environment/
7.1   Test data
The library track uses two real-world thesauri that are comparable in many respects. They have roughly the same size, were both originally developed in German, are today both multilingual with English translations, and, most importantly, despite coming from two different domains, they have large overlapping areas. Not least, both are freely available in RDF using SKOS.11

STW The STW Thesaurus for Economics provides vocabulary on any economic sub-
ject: more than 6,000 standardized subject headings (skos:Concepts, with preferred
labels in English and German) and 19,000 additional keywords (skos:altLabels) in
both languages. The vocabulary was developed for indexing purposes in libraries and
economic research institutions and includes technical terms used in law, sociology,
or politics, and geographic names. The entries are richly interconnected by 16,000
skos:broader/narrower and 10,000 skos:related relations. An additional hierarchy of
main categories provides a high level overview. The vocabulary is maintained on a
regular basis by ZBW12 , the German National Library of Economics - Leibniz Centre
for Economics, and has been translated into SKOS [24].

TheSoz The Thesaurus for the Social Sciences (TheSoz) serves as a crucial instrument
for indexing documents and research information in the social sciences. It contains about 12,000 keywords overall, of which 8,000 are standardized subject headings (in English and German) and 4,000 are additional keywords. The thesaurus covers all topics and sub-disciplines of the social sciences. Additionally, terms from associated and related disciplines are included in order to support an accurate and adequate indexing of interdisciplinary, practice-oriented and multi-cultural documents. The thesaurus is
owned and maintained by GESIS13 , the Leibniz Institute for the Social Sciences, and is
available in SKOS [31].

Reference Alignment An alignment between STW and TheSoz already exists and has
been manually created by domain experts in the KoMoHe project [18]. However, it does
not cover the changes and enhancements in both thesauri since 2006. It is available in
SKOS with the different matching types skos:exactMatch, skos:broaderMatch and skos:narrowerMatch. Within the reference alignment, a concept of one thesaurus can be aligned to more than one concept of the other thesaurus; thus, we face an n:m mapping of the concepts. All in all, 4,285 TheSoz concepts and 2,320 STW concepts are
aligned with 2,839 exact matches, 34 broader matches and 1,416 narrower matches.
It is important to note that the reference alignment only contains alignments between
the descriptors of both thesauri, i.e., the concepts that are actually used for document
indexing. The upper part of the hierarchy consists of non-descriptor concepts (or cate-
gories) that are only used to organize the descriptors below them. We take this specialty
into account as we only assess the generated alignments between descriptors and ignore
alignments between non-descriptors. However, this might change in the future, as the
11
   http://www.w3.org/2004/02/skos/
12
   http://zbw.eu/index-e.html
13
   http://www.gesis.org/en/home
results of this track could be used to extend the reference alignment to the upper part of
the hierarchy.


Transformation Most ontology matching systems taking part in the OAEI only work on OWL ontologies and are not (yet) ready to deal with the specificities of a thesaurus. To obtain first results and to lower the barrier to taking part in this challenge, we provide OWL versions of the thesauri, generated with the following mapping (an illustrative conversion sketch is given after the mapping):

skos:Concept                                    →      owl:Class
skos:prefLabel, skos:altLabel                   →      rdfs:label
skos:scopeNote, skos:notation                   →      rdfs:comment
A skos:narrower B                               →      B rdfs:subClassOf A
skos:broader                                    →      rdfs:subClassOf
skos:related                                    →      rdfs:seeAlso
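    The following Python sketch (using rdflib) illustrates one possible implementation of this mapping; it is not the script actually used to produce the OWL versions of the thesauri, and the file names are placeholders.

# Illustrative SKOS-to-OWL conversion following the mapping above
# (not the script actually used for the track); requires rdflib.
from rdflib import Graph, RDF, RDFS, OWL
from rdflib.namespace import SKOS

def skos_to_owl(skos_file, owl_file):
    src, out = Graph(), Graph()
    src.parse(skos_file)
    for c in src.subjects(RDF.type, SKOS.Concept):
        out.add((c, RDF.type, OWL.Class))            # skos:Concept -> owl:Class
    for p in (SKOS.prefLabel, SKOS.altLabel):
        for s, o in src.subject_objects(p):
            out.add((s, RDFS.label, o))              # labels -> rdfs:label
    for p in (SKOS.scopeNote, SKOS.notation):
        for s, o in src.subject_objects(p):
            out.add((s, RDFS.comment, o))            # notes -> rdfs:comment
    for a, b in src.subject_objects(SKOS.narrower):
        out.add((b, RDFS.subClassOf, a))             # A skos:narrower B -> B subClassOf A
    for a, b in src.subject_objects(SKOS.broader):
        out.add((a, RDFS.subClassOf, b))             # skos:broader -> rdfs:subClassOf
    for a, b in src.subject_objects(SKOS.related):
        out.add((a, RDFS.seeAlso, b))                # skos:related -> rdfs:seeAlso
    out.serialize(owl_file, format="xml")

# skos_to_owl("thesoz.rdf", "thesoz.owl")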

    This transformation obviously is not lossless. First and foremost, within the ontol-
ogy, it is not recognizable which label is the preferred one and which ones are alternative
labels. Since matching systems mostly have to focus on the labels, this transformation
might lead to suboptimal results. There are, however, more fundamental differences
between ontologies and thesauri that we show in the next section.


SKOS vs. OWL Thesauri – and other, similar knowledge structures like classifications
or taxonomies – are often called lightweight ontologies [29]. However, ontologies and
thesauri fundamentally differ. This is also reflected by the fact that with SKOS a specific
model for thesauri exists that is formulated in OWL. There, a skos:Concept is not
an owl:Class. Concepts sometimes represent classes, for example the STW concept Commodities. However, this is not true for every skos:Concept, e.g., the STW concept Germany is an instance, not a class.
    Looking at the subordinate concepts of Commodities, they mostly do represent classes, like Metals – Metal products – Razor. Nevertheless, the relation between these concepts in SKOS is skos:broader, not rdfs:subClassOf. A subclass relationship states that if a class B is a subclass of a class A, then all instances of B are also instances of A. Here, all metals are commodities, but not all metal products are metals: a razor consists partly of metal, but it is not a metal.
    Thesauri are created for a very specific purpose and are used in a predetermined way. This is reflected, inter alia, by the distinction between descriptors and non-descriptors. Only descriptors are assigned to publications during indexing or classification. Non-descriptors serve as additional information to provide the correct context or to build up a proper hierarchy. Such a distinction typically does not exist in an ontology.
    Particularly difficult for ontology matchers (and not only automatic ones) is the quasi-synonymy of the labels describing a concept. A skos:altLabel is often used to indicate subconcepts that should be subsumed under the concept in question in order to avoid extensive subclassing. As an example, the STW descriptor 14117-2 with the preferred English label “Tropical fruit” has German alternative labels like “pineapple”, “avocado”, and “kiwi”. In an (OWL) ontology, these alternative labels should be modeled as subclasses of the class Tropical fruit. In contrast, other alternative labels might really indicate alternative, synonymous terms for the preferred label.
    Finally, instead of the arbitrary semantic relations that can be part of an ontology, thesauri contain relations like skos:related or, in TheSoz, compoundEquivalence. These often carry information for the (manual) use of the thesaurus in indexing, i.e., which descriptor should be used in which case or how combinations of descriptors are to be used. Transferring them to ontological relations is not always possible and often depends on the individual case.
    It can be seen that the development of a thesaurus matcher is indeed a challenge
that differs from ontology matching. Nevertheless, the commonalities between thesauri
and ontologies are large enough to pave the way for further developments by means of
current ontology matchers.


7.2   Experimental Setting

To compare the created alignments with the reference alignment, we use the
Alignment API. For this first evaluation, we only included equivalence relations
(skos:exactMatch).
     All matching processes have been performed on a Debian machine with one 2.4GHz
core and 7GB RAM allocated to each system. The evaluation has been executed by us-
ing SEALS technologies. For ServOMap, ServOMapLt and Optima, we used slightly adapted ontologies as input, since these systems cannot handle URIs whose last part consists only of numbers, as is the case in the official version. Each participating system uses the OWL version. We computed precision, recall and F-measure (β = 1) for each matcher. We only consider equivalence correspondences between two descriptors, as non-descriptors are not included in the reference alignment. This filtering improves the precision (≈ 8%) as well as the F-measure (≈ 4%) for all systems. Moreover, we measured the runtime and the size of the created alignment, and checked whether a 1:1 alignment has been created. To assess the results of the matchers, we developed four straightforward matching strategies using the original SKOS version of the thesauri (a minimal sketch of these baselines and of the descriptor-filtered evaluation is given after the list):

 – Matcherpref DE : Compares the German lower-case preferred labels and generates
   a correspondence if these labels are completely equivalent.
 – Matcherpref EN : Compares the English lower-case preferred labels and generates a
   correspondence if these labels are completely equivalent.
 – Matcherpref : Creates a correspondence, if either Matcherpref DE or
   Matcherpref EN or both create a correspondence.
 – MatcherallLabels : Creates a correspondence whenever at least one label (preferred or alternative, in any language) of an entity is equivalent to one label of another entity.
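    The sketch below illustrates, under simplifying assumptions, how such label-based baselines and the descriptor-only evaluation can be realized. The data structures (dictionaries mapping concept URIs to sets of lower-cased labels, sets of descriptor URIs, and the reference alignment as a set of URI pairs) are placeholders for the actual SKOS data and the Alignment API machinery used in the track.

# Minimal sketch of the label-based baselines and the descriptor-filtered
# precision/recall/F-measure computation (illustrative only; the actual
# evaluation used the Alignment API on the SKOS/OWL data).

def match_by_label(labels1, labels2):
    """Align concepts with at least one identical (lower-cased) label.
    labels1/labels2 map a concept URI to a set of lower-cased labels."""
    index = {}
    for c2, ls in labels2.items():
        for l in ls:
            index.setdefault(l, set()).add(c2)
    return {(c1, c2) for c1, ls in labels1.items()
            for l in ls for c2 in index.get(l, ())}

def evaluate(alignment, reference, descriptors1, descriptors2):
    """Precision, recall and F1, restricted to descriptor-descriptor pairs."""
    found = {(a, b) for a, b in alignment
             if a in descriptors1 and b in descriptors2}
    correct = found & reference
    p = len(correct) / len(found) if found else 0.0
    r = len(correct) / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Matcher_prefDE corresponds to match_by_label over German preferred labels
# only; Matcher_allLabels uses all preferred and alternative labels.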


7.3   Results

All systems listed in Table 9 are sorted according to their F-measures. Altogether 13 of
the 21 submitted matching systems were able to create an alignment. Three matching
         System             Precision    F-measure    Recall    Time [s]    Size    1:1
     Matcherpref              0.820        0.720       0.642        75       2190     -
     Matcherpref DE           0.891        0.717       0.601        42       1885     -
     MatcherallLabels         0.544        0.677       0.896       735       4605     -
     GOMMA                    0.537        0.674       0.906       804       4712     -
     ServOMapLt               0.654        0.670       0.687        45       2938     -
     LogMap                   0.688        0.665       0.644        95       2620     -
     ServOMap                 0.717        0.665       0.619        44       2413     √
     YAM++                    0.595        0.664       0.750       496       3522     -
     LogMapLt                 0.577        0.662       0.776        21       3756     -
     Hertuda                  0.465        0.619       0.925      14363      5559     -
     WeSeE                    0.612        0.609       0.607      144070     2774     √
     HotMatch                 0.645        0.608       0.575      14494      2494     √
     Matcherpref EN           0.808        0.569       0.439        36       1518     -
     CODI                     0.434        0.445       0.481      39869      3100     √
     MapSSS                   0.520        0.272       0.184       2171       989     √
     AROMA                    0.107        0.184       0.652       1096      17001    -
     Optima                   0.321        0.117       0.072      37457       624     -

                            Table 9. Results of the Library track.


systems (MaasMatch, MEDLEY, Wmatch) did not finish within the time frame of one week, while five threw an exception (running out of heap space).
    Of all these systems, GOMMA performs best in terms of F-measure, closely followed by ServOMapLt and LogMap. However, precision and recall vary a lot across these three systems. Depending on the application, an alignment achieving either high precision or high recall is preferable. If recall is the focus, the alignment created by GOMMA is probably the best choice, with a recall of about 90%. Other systems generate alignments with higher precision, e.g., ServOMap with over 70% precision, while mostly having significantly lower recall values (except for Hertuda).
    From the results obtained by the matching strategies that take the different types of labels into account, we can see that matching based on preferred labels only outperforms the other matching strategies: Matcherpref achieves the highest F-measure in these tests. The results of Matcherpref DE and Matcherpref EN provide an insight into the language characteristics of both thesauri and of the reference alignment. Matcherpref DE achieves the highest precision (nearly 90%), albeit with a recall of only 60%. Both thesauri as well as the reference alignment have been developed in Germany and focus on German terms. The results of Matcherpref EN show the difference: precision and especially recall decrease significantly when only the preferred English labels are used. On the one hand, only about 80% of the found correspondences are correct; on the other hand, less than half of all correspondences can be found this way. This can be a disadvantage for systems that use NLP techniques on English labels or rely on language-specific background knowledge like WordNet.
    The high precision values of the pref ∗ matchers reflect the fact that the preferred
labels are chosen specifically to unambiguously identify the concepts. Our interpreta-
tion is that the English translations are partly not as precise as the original German
terms (drop in precision) and not consistent regarding the English terminology (drop in
recall).
     In contrast, MatcherallLabels achieves quite a high recall (90%) but a rather low precision (54%). This means that most, but not all, of the correspondences can be found by only looking at equivalent labels; however, nearly half of the correspondences found this way are incorrect. The rather high F-measure of MatcherallLabels is therefore misleading: at least if the results were used unchecked in a retrieval system, higher precision would clearly be preferred over higher recall. In this respect, matchers like ServOMap show better results. In any case, it can be seen that a matching system using the original SKOS version could achieve better results; the information lost when converting SKOS to OWL really matters.
     Concerning runtime, LogMap as well as ServOMap are quite fast, with runtimes below 50 seconds. These values are comparable to, or even better than (LogMapLt), those of the two strategies computing the equivalence of preferred labels. Thus, these systems are very efficient in matching large ontologies while achieving very good results. Other matchers take several hours or even days and do not produce better alignments in terms of F-measure. Computing the correlation between F-measure and runtime, we notice a slightly negative correlation (-0.085), but the small number of samples is not sufficient to make a significant statement. However, we can say for certain that a longer runtime does not necessarily lead to better results.
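    A minimal sketch of this correlation computation over the system rows of Table 9 is given below; it illustrates the method rather than reproduces the exact figure, since the reported value may have been computed over a slightly different selection of rows (e.g., including the baseline matchers).

# Pearson correlation between F-measure and runtime for the matching systems
# of Table 9 (illustrative; the reported -0.085 may stem from a slightly
# different selection of rows).
import numpy as np

fmeasure = [0.674, 0.670, 0.665, 0.665, 0.664, 0.662, 0.619,
            0.609, 0.608, 0.445, 0.272, 0.184, 0.117]
runtime = [804, 45, 95, 44, 496, 21, 14363,
           144070, 14494, 39869, 2171, 1096, 37457]

r = np.corrcoef(fmeasure, runtime)[0, 1]
print(f"Pearson correlation: {r:.3f}")  # slightly negative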
     We further observe that the n:m reference alignment affects the results because some
matching systems (ServOMap, WeSeE, HotMatch, CODI, MapSSS) only create 1:1
alignments and discard correspondences with entities that already occur in another cor-
respondence. Whenever a system creates a lot of n:m correspondences, e.g., Hertuda
and GOMMA, the recall significantly increases. This difference becomes clear when
comparing ServOMapLt and ServOMap. Both systems are mostly based on the same
methods but ServOMapLt does not use the 1:1 filtering. Consequently, its recall in-
creases and its precision decreases.
     Since the reference alignment has not been updated for about six years, it does not cover the updates of both thesauri. Thus, new correct correspondences might be found by the matching systems but counted as incorrect because they are not included in the reference alignment. Therefore, we performed a manual evaluation to check whether the matching systems found correct correspondences that are not included in the reference alignment at all. In turn, this information can help to improve the reference alignment.
     The manual evaluation has been conducted by domain experts. Many newly detected correspondences that were not yet contained in the reference alignment have been considered. So far, we have only examined correspondences between descriptors, as well as those that do not contain a term already matched in the reference alignment.
     The matchers detected between 38 and 251 correct correspondences that had not been in the reference alignment before. These especially include terms with a strong syntactic similarity or equivalence. However, some matching systems even detected difficult correspondences, e.g., between the German label for “automated production” (“Automatische Produktion”) and “CAM”, which have been identified via their associated non-preferred labels. Furthermore, correspondences between geographical terms have been detected, but some of the matchers have not been able to distinguish between the terms for the citizens of a country, its language and the country itself, although these differences can be derived from the structure of the thesauri.
    This manual evaluation also exposed several issues, which can be explained either by the typical behavior of matching systems or by domain-specific differences inside the thesauri. There are similar terms in TheSoz and STW that are used in totally different contexts, e.g., the term “self-assessment”; even when considering the structure of both thesauri, these differences are difficult to identify. In general, term similarity often led to wrong correspondences, which is not surprising. Conversely, in some cases syntactically identical terms have not been detected. So far, we have not had the possibility to evaluate the matching systems against the improved reference alignment, but we plan to perform this additional evaluation soon.

7.4   Conclusion
The newly detected correspondences already constitute a useful result for the maintainers of the two thesauri. The correct correspondences can be added to the existing reference alignment, which is already applied in information portals to support search term recommendation and query expansion services across differently indexed databases. As all matching systems delivered exact matches for their correspondences, some of the wrong correspondences will be examined again in the future to see whether other relationships, such as broader, narrower or related matches, can be considered for them.
    We expect further improvements, if the matchers are tailored more specifically to the
library track, i.e., if they exploit the information found in the original SKOS version.
A promising approach is also the use of additional knowledge, e.g., instance data –
resources that are indexed with different thesauri [30].
    This time, we collected the results of the matchers as a first survey and compared them to our simple string-matching strategies that take advantage of the different types of labels. In future evaluations, we expect that better results can be achieved and that these strategies will simply serve as a baseline.


8     Large biomedical ontologies
This track aims at finding alignments between the large and semantically rich biomedical ontologies FMA, SNOMED CT, and NCI, which contain 78,989, 306,591 and 66,724 classes, respectively.

8.1   Test data
The UMLS Metathesaurus [2] has been selected as the basis for the track's reference alignments. UMLS is currently the most comprehensive effort for integrating independently developed medical thesauri and ontologies, including FMA, SNOMED CT, and NCI. Although the standard UMLS distribution does not directly provide “alignments” (in the OAEI sense) between the integrated ontologies, it is relatively straightforward to extract them from the information provided in the distribution files (see [17] for details).
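    As an illustration only (the reference alignments were produced as described in [17], not with this script), one straightforward way to obtain such correspondences is to group UMLS MRCONSO.RRF entries by concept identifier (CUI) and pair the codes of two source vocabularies that share a CUI. The column positions and source abbreviations below are assumptions that should be checked against the UMLS release documentation.

# Illustrative extraction of cross-ontology equivalences from UMLS MRCONSO.RRF:
# codes from two source vocabularies sharing the same CUI are paired.
# Column indices (CUI, SAB, CODE) and source abbreviations are assumptions to
# be verified against the UMLS documentation; this is not the procedure of [17].
from collections import defaultdict

def extract_mappings(mrconso_path, source1="FMA", source2="NCI",
                     cui_col=0, sab_col=11, code_col=13):
    by_cui = defaultdict(lambda: (set(), set()))
    with open(mrconso_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            cui, sab, code = fields[cui_col], fields[sab_col], fields[code_col]
            if sab == source1:
                by_cui[cui][0].add(code)
            elif sab == source2:
                by_cui[cui][1].add(code)
    return {(c1, c2) for codes1, codes2 in by_cui.values()
            for c1 in codes1 for c2 in codes2}

# mappings = extract_mappings("MRCONSO.RRF", "FMA", "NCI")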
    It has been noticed, however, that although the creation of the UMLS alignments combines expert assessment and auditing protocols, they lead to a significant number of logical inconsistencies when integrated with the corresponding source ontologies [17].
    To address this problem, we have considered two refinements of the UMLS align-
ments that do not lead to (many) unsatisfiable classes. These refinements have been
generated using LogMap’s repair facility [16] and the Alcomo debugging system [20].
    The track has been split into three matching problems: FMA-NCI, FMA-SNOMED and SNOMED-NCI; each matching problem is in turn divided into three tasks involving different fragments of the input ontologies.

FMA-NCI matching We have compared the results of the matching tools against both
the original and refined UMLS alignment sets:

 – Original UMLS alignments: 3,024 alignments (≡).
 – Refined UMLS alignments:
    • LogMap’s repair module : 2,898 alignments (≡, v, w).
     • Alcomo debugging system: 2,819 alignments (≡).

Three tasks have been considered involving different fragments of FMA and NCI:

 – Task 1 consists of matching two (relatively small) modules of FMA (3,696 classes, 5%) and NCI (6,488 classes, 10%).
 – Task 2 consists of matching two (relatively large) modules of FMA (28,861 classes,
   37%) and NCI (25,591 classes, 38%).
 – Task 3 consists of matching the whole FMA and NCI ontologies.


FMA-SNOMED matching We have compared the results of the matching tools
against both the original and refined UMLS alignment sets:

 – Original UMLS alignments: 9,008 alignments (≡).
 – Refined UMLS alignments:
    • LogMap’s repair module : 8,111 alignments (≡, v, w).
     • Alcomo debugging system: 8,132 alignments (≡).

Three tasks have been considered involving different fragments of FMA and SNOMED:

 – Task 4 consists of matching two (relatively small) modules of FMA (10,157 classes,
   13%) and SNOMED (13,412 classes, 5%).
 – Task 5 consists of matching two (relatively large) modules of FMA (50,523 classes,
   64%) and SNOMED (122,464 classes, 40%).
 – Task 6 consists of matching the whole FMA and SNOMED ontologies.
SNOMED-NCI matching We have compared the results of the matching tools against
both the original and refined UMLS alignment sets:

 – Original UMLS alignments: 18,844 alignments (≡).
 – Refined UMLS alignments:
       • LogMap’s repair module : 18,324 alignments (≡, v, w).

    Note that, at the time of creating the datasets, we could not compute a refined UMLS alignment set with Alcomo. The new version of Alcomo, however, has been shown to be able to cope with SNOMED-NCI. Three tasks have been considered involving different
fragments of SNOMED and NCI:

 – Task 7 consists of matching two (relatively small) modules of SNOMED (51,128
   classes, 17%) and NCI (23,958 classes, 36%).
 – Task 8 consists of matching two (relatively large) modules of SNOMED (122,464
   classes, 40%) and NCI (49,795 classes, 75%).
 – Task 9 consists of matching the whole SNOMED and NCI ontologies.


8.2    Results

We have run the evaluation on a high performance server with 16 CPUs and 15GB of allocated RAM. In total, 15 out of 23 participating systems/configurations have been able to cope with at least one of the tasks of the track. Optima and MEDLEY failed to complete the smallest task within the 24-hour timeout, while OMR, OntoK, ASE and WeSeE threw an exception during the matching process. CODI was evaluated in a different setting using only 7GB of RAM and threw an exception related to insufficient memory when processing the smallest matching task. TOAST was not evaluated since it was only configured for the Anatomy track and required a complex installation. LogMapLt, a very fast string matcher, has been used as a baseline.
     Note that GOMMA has also been evaluated with a configuration that exploits
specialized background knowledge based on the UMLS Metathesaurus (GOMMAbk ).
GOMMAbk exploits the alignments O1 -UMLS and UMLS-O2 and applies alignment
composition techniques. LogMap, MaasMatch and YAM++ also use different kinds of
background knowledge. LogMap uses normalisations and spelling variants from the
domain specific resource UMLS Lexicon14 . YAM++ and MaasMatch use the general
purpose background knowledge provided by WordNet15 .
     LogMap has also been evaluated with two configurations. LogMap’s default algo-
rithm computes an estimation of the overlapping between the input ontologies before
the matching process, while the variant LogMapnoe has this feature deactivated.
     Precision and recall in Tables 11-13 are averaged over the results obtained with respect to the different reference alignments. Systems have been ordered by average F-measure.
14
     http://www.nlm.nih.gov/pubs/factsheets/umlslex.html
15
     http://wordnet.princeton.edu/
Alignment coherence We have evaluated the coherence of the generated alignments and we report (1) the number of unsatisfiable classes when reasoning16 with the input ontologies together with the computed alignments, (2) the degree of unsatisfiable classes with respect to the size of the merged ontology (based on the Unsatisfiability Measure proposed in [22]), and (3) an approximation of the root unsatisfiability. The root unsatisfiability aims at providing a more precise estimate of the number of errors, since many of the unsatisfiabilities may be derived (i.e., a subclass of an unsatisfiable class will also be reported as unsatisfiable). The provided approximation is based on LogMap's (incomplete) repair facility and gives the number of classes that this facility needed to repair in order to solve (most of) the unsatisfiabilities [16].
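    As a minimal illustration of measure (2) (a sketch under the assumption that the merged ontology, i.e., the input ontologies plus the alignment encoded as OWL axioms, is available as a single file; this is not the harness used for the track), the unsatisfiability degree can be computed roughly as follows with owlready2, which invokes HermiT by default:

# Rough sketch of measure (2): the fraction of unsatisfiable classes in the
# merged ontology (input ontologies plus the alignment encoded as OWL axioms).
# Assumes owlready2 and a single merged OWL file; not the track's evaluation harness.
from owlready2 import get_ontology, sync_reasoner, default_world

def unsatisfiability_degree(merged_owl_file):
    onto = get_ontology(merged_owl_file).load()
    with onto:
        sync_reasoner()  # HermiT by default
    unsat = set(default_world.inconsistent_classes())
    total = len(list(onto.classes()))  # assumes the file contains all classes
    return len(unsat), (len(unsat) / total if total else 0.0)

# n_unsat, degree = unsatisfiability_degree("file://fma_nci_merged.owl")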
    LogMap and its variant LogMapnoe were the only systems generating an almost clean output. Tables 11-13 show that even the most precise alignment sets may lead to a huge number of unsatisfiable classes. This demonstrates the importance of using techniques to assess the coherence of the generated alignments.
    Note that LogMap may fail to detect and repair unsatisfiable classes that lie outside the computed overlapping (i.e., ontology fragments) between the input ontologies. Thus, LogMapnoe provides, in general, a cleaner output than LogMap.

Runtimes Table 10 shows which systems were able to complete each of the matching
tasks in less than 24 hours and the required computation times. Systems have been or-
dered with respect to the number of completed tasks and total time required to complete
them. The last column reports the number of tasks that a system could complete. For
example, only eight systems were able to complete all nine tasks. The last row shows
the number of systems that could finish each of the tasks. The tasks involving larger
ontology sizes were completed by only 8-10 systems. Furthermore, the tasks involving
SNOMED were also harder with respect to both computation times and the number of
systems that completed the tasks.
    The runtimes were, in general, encouraging. For example, 7 systems completed Task 3 in less than 5 minutes. Additionally, the computation times for these systems increased “smoothly” with respect to the size of the input ontologies.

FMA-NCI matching Table 11 summarizes the results for the three tasks in the FMA-
NCI matching problem. GOMMAbk provided the best results in terms of F-measure in
Task 1, whereas YAM++ in Tasks 2 and 3. GOMMAbk also obtained the best results
in terms of recall in all three tasks, while ServOMap computed the most precise align-
ments. Overall, the results were very positive and 7 systems obtained an F-measure
greater than 0.80 in all three tasks.
    LogMap and LogMapnoe provided the same results in Task 1 since the input ontolo-
gies are already small fragments of FMA and NCI and thus, the overlapping estimation
performed by LogMap did not have any impact.
16
     We have used HermiT [23] in the FMA-NCI and FMA-SNOMED matching problems. In
     the SNOMED-NCI matching problem we have estimated the number of unsatisfiable classes
     with the Dowling-Gallier algorithm for propositional Horn satisfiability [5] (implemented in
     LogMap’s repair facility) since no OWL 2 reasoner is known to cope with the integration of
     SNOMED and NCI via alignments [15].
                       FMA-NCI               FMA-SNOMED                  SNOMED-NCI
  System                                                                                      #
              Task 1    Task 2 Task 3    Task 4 Task 5 Task 6       Task 7 Task 8 Task 9
  LogMapLt      8         29       55      14       96      171       54      104      178    9
  ServOMapL     20        95      251      39       234     517       147     363      738    9
  ServOMap      25        98      204      46       315     532       153     282      654    9
  LogMap        18        77      131      65       484     612       221     514      955    9
  LogMapnoe     18        74      206      63       521     791       211     575     1,505   9
  GOMMA         26        69      217      54       437    1,994      197     527     1,820   9
  GOMMAbk       26        83      231     148       636    1,893      226     638     1,940   9
  YAM++         78        245    1,304    326      3,780   23,900    1,901   6,127   30,155   9
  AROMA         63       7,538      -    51,191   62,801      -     15,624     -        -     5
  MapSSS       561      30,575      -    3,129       -        -     27,381     -        -     4
  Hertuda     3,327        -        -    17,625      -        -        -       -        -     2
  HotMatch    4,271        -        -    31,718      -        -        -       -        -     2
  MaasMatch   27,157       -        -       -        -        -        -       -        -     1
  AUTOMSv2    62,407       -        -       -        -        -        -       -        -     1
  Wmatch      65,399       -        -       -        -        -        -       -        -     1
  Completed     15        10      8       12        9        8       10       8        8      88

                       Table 10. System runtimes (s) and task completion.


    In Task 1, our baseline also provided very good results in terms of F-measure and outperformed 8 of the participating systems. MaasMatch and Hertuda provided competitive results in terms of recall, but their low precision damaged the final F-measure. MapSSS and AUTOMSv2 provided alignments with high precision; however, their F-measure was damaged by low recall.
    The quality of the results in Tasks 2 and 3 decreased considerably with respect to Task 1. This is mostly due to the fact that larger ontologies involve more candidate correspondences, and it is harder to keep precision high without damaging recall, and vice versa.

FMA-SNOMED matching Table 12 summarizes the results for the three tasks in the
FMA-SNOMED matching problem. GOMMAbk provided the best results in terms of
F-measure in Task 4, whereas ServOMap in Tasks 5 and 6. GOMMAbk also obtained
the best results in terms of recall in all three tasks, while LogMapLt computed the most
precise alignments in Task 4 and ServOMapL in Tasks 5 and 6.
    Overall, the results were less positive than in the FMA-NCI matching tasks and only
5 systems obtained an F-measure greater than 0.70 in all three tasks. Furthermore, 6 sys-
tems (including our baseline) failed to provide a recall higher than 0.4. Thus, matching
FMA against SNOMED represents a significant leap in complexity with respect to the
FMA-NCI matching problem.
    As in the FMA-NCI matching problem, the quality of the results also decreases as the ontology size increases. The largest variations affected GOMMAbk and GOMMA, whose average precision decreased from 0.893 and 0.875 (Task 4) to 0.571 and 0.389 (Task 5), respectively. This is an interesting fact, since the background knowledge used by GOMMAbk could not prevent the decrease in precision, although it kept the highest recall.

SNOMED-NCI matching Table 13 summarizes the results for the three matching
tasks in the SNOMED-NCI matching problem. LogMapnoe provided the best results
in terms of both recall and F-measure in Tasks 7 and 8 while YAM++ obtained the
                                                      Average                       Incoherence
      System             Time (s)      Size
                                                P        F       R     All Unsat.     Degree Root Unsat.
                                    Task 1: small FMA and NCI fragments
      GOMMAbk              26          2,843   0.94    0.92     0.91     6,204          61%      193
      YAM++                78          2,614   0.96    0.91     0.86     2,352          23%       92
       LogMap/LogMapnoe     18          2,740   0.93    0.90     0.88       2          0.02%       0
      GOMMA                26          2,626   0.95    0.90     0.86     2,130          21%      127
      ServOMapL            20          2,468   0.96    0.88     0.82     5,778          57%       79
      LogMapLt              8          2,483   0.95    0.87     0.81     2,104          21%      116
      ServOMap             25          2,300   0.97    0.86     0.77     5,597          55%       50
      HotMatch            4,271        2,280   0.96    0.84     0.75      285            3%       65
      Wmatch             65,399        3,178   0.79    0.82     0.86     3,168          31%      482
      AROMA                63          2,571   0.86    0.80     0.76     7,196          70%      421
      Hertuda             3,327        4,309   0.58    0.69     0.86     2,675          26%      277
      MaasMatch          27,157        3,696   0.61    0.68     0.78     9,598          94%     3,113
      AUTOMSv2           62,407        1,809   0.80    0.62     0.50     5,346          52%      392
      MapSSS               561         1,483   0.84    0.57     0.43      565            6%       94
                                     Task 2: big FMA and NCI fragments
      YAM++                245         2,688   0.90    0.87     0.83     22,402         35%      102
      ServOMapL            95          2,640   0.89    0.85     0.81     22,315         35%      143
      GOMMA                69          2,810   0.86    0.84     0.83      2,398          4%      116
      GOMMAbk              83          3,116   0.81    0.84     0.87      4,609          8%      146
      LogMapnoe            74          2,663   0.87    0.83     0.80        5         0.01%       0
      LogMap               77          2,656   0.87    0.83     0.79        5         0.01%       0
      ServOMap             98          2,413   0.91    0.83     0.76     21,688         34%       86
      LogMapLt             29          3,219   0.73    0.77     0.81     12,682         23%      443
       AROMA               7,538        3,856   0.54    0.61     0.69     20,054         24%      1,600
      MapSSS             30,575        2,584   0.38    0.36     0.34     21,893         40%      358
                                    Task 3: whole FMA and NCI ontologies
      YAM++               1,304        2,738   0.89    0.86     0.83     50,550         29%      141
      GOMMA                217         2,843   0.85    0.84     0.83      5,574          4%      139
      ServOMapL            251         2,700   0.87    0.84     0.81     50,334         28%      164
      GOMMAbk              231         3,165   0.80    0.83     0.87     12,939          9%      245
      LogMapnoe            206         2,646   0.87    0.83     0.79        9         0.01%       0
      LogMap               131         2,652   0.86    0.82     0.78        9         0.01%       0
      ServOMap             204         2,465   0.89    0.82     0.76     48,743         27%      114
      LogMapLt             55          3,466   0.68    0.74     0.81     26,429          9%      778

                     Table 11. Results for the FMA-NCI matching problem.


best results in Task 9. GOMMAbk obtained the best recall in Task 9 and ServOMap
generated the most precise alignments in all three tasks.
    As in the previous matching problems, the quality of the results decreases as the ontology size increases. For example, in Task 9 none of the systems could reach an F-measure of 0.7, while seven systems (including our baseline) exceeded this value in Task 7.

8.3    Conclusions
Although the proposed matching tasks represented a significant leap in complexity with
respect to the tasks in previous campaigns, the obtained results have been very promis-
ing and eight systems completed all matching tasks with very competitive results.
    There is, however, plenty of room for improvement: (1) most of the participating
systems disregard the coherence of the generated alignments; (2) the size of the input
ontologies should not significantly affect efficiency; and (3) recall in the tasks involving
SNOMED should be improved while keeping the current precision values.
    The alignment coherence measure was the weakest point of the systems participat-
ing in this track. As shown in Tables 11-13, even highly precise alignment sets may
                                                Average                       Incoherence
      System        Time (s)     Size
                                          P        F       R     All Unsat.     Degree Root Unsat.
                               Task 4: small FMA and SNOMED fragments
      GOMMAbk         148        8,598   0.89    0.90     0.91    13,685          58%     4,674
      ServOMapL        39        6,346   0.92    0.79     0.69    10,584          45%     3,056
      YAM++           326        6,421   0.91    0.79     0.69    14,534          62%     3,150
      LogMapnoe        63        6,363   0.91    0.78     0.69      0              0%       0
      LogMap           65        6,164   0.91    0.77     0.67      2           0.01%       2
      ServOMap         46        6,008   0.92    0.76     0.66    8,165           35%     2,721
      GOMMA            54        3,667   0.88    0.53     0.38    2,058            9%      206
      MapSSS         3,129       3,458   0.75    0.44     0.31    9,084           39%      389
      AROMA          51,191      5,227   0.53    0.40     0.33    21,083          89%     2,296
      HotMatch       31,718      2,139   0.84    0.34     0.21     907             4%      104
      LogMapLt         14        1,645   0.94    0.31     0.18     773             3%       21
      Hertuda        17,625      3,051   0.56    0.30     0.20    1,020            4%       47
                                Task 5: big FMA and SNOMED fragments
      ServOMapL       234        6,563   0.88    0.77     0.69     55,970        32%      1,192
      ServOMap        315        6,272   0.88    0.75     0.65    143,316        83%      1,320
      YAM++          3,780       7,003   0.82    0.75     0.68     69,345        40%      1,360
      LogMapnoe       521        6,450   0.84    0.73     0.64       0            0%        0
      LogMap          484        6,292   0.83    0.71     0.62       0            0%        0
      GOMMAbk         636       12,614   0.57    0.68     0.86     75,910        44%      3,344
      GOMMA           437        5,591   0.39    0.31     0.26     7,343          4%       480
      AROMA          62,801      2,497   0.66    0.30     0.20     54,459        31%       271
      LogMapLt         96        1,819   0.85    0.30     0.18     2,994          2%        24
                               Task 6: whole FMA and SNOMED ontologies
      ServOMapL       517        6,605   0.88    0.77     0.69     99,726         26%     2,862
      ServOMap        532        6,320   0.87    0.75     0.65    273,242         71%     2,617
      YAM++          23,900      7,044   0.81    0.74     0.68    106,107         28%     3,393
      LogMap          612        6,312   0.83    0.71     0.62       10        0.003%       0
      LogMapnoe       791        6,406   0.82    0.71     0.62       10        0.003%       0
      GOMMAbk        1,893      12,829   0.56    0.68     0.86    119,657         31%     5,289
      LogMapLt        171        1,823   0.85    0.30     0.18     4,938           1%       37
      GOMMA          1,994       5,823   0.35    0.29     0.24     10,752          3%      609

                  Table 12. Results for the FMA-SNOMED matching problem.


lead to a huge number of unsatisfiable classes. Thus, the use of techniques to assess
mapping coherence is critical. Unfortunately, only a few systems in OAEI 2012 have
successfully used such techniques. In future campaigns, this aspect should not be ne-
glected. Developers can reuse available state-of-the-art mapping debugging techniques
such as those implemented in Alcomo [20] or in LogMap [16].
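    As an illustration of how such a check could be set up (a minimal sketch only, not the algorithm implemented in Alcomo or LogMap), one can merge the two input ontologies, add every class correspondence of an alignment as an OWL equivalence axiom, and count the unsatisfiable classes with an OWL 2 reasoner such as HermiT [23]. In the sketch below, the file names, IRIs and the hard-coded toy alignment are placeholders.

```java
// A minimal sketch (not Alcomo or LogMap): merge two ontologies, add the class
// correspondences of an alignment as equivalence axioms, and count unsatisfiable
// classes with HermiT. File names and IRIs below are placeholders.
import java.io.File;
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;

import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public class CoherenceCheck {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager man = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = man.getOWLDataFactory();

        // Load the two input ontologies (placeholder file names).
        OWLOntology o1 = man.loadOntologyFromOntologyDocument(new File("fma_fragment.owl"));
        OWLOntology o2 = man.loadOntologyFromOntologyDocument(new File("nci_fragment.owl"));

        // Merge their axioms into a fresh ontology.
        OWLOntology merged = man.createOntology(IRI.create("http://example.org/merged"));
        man.addAxioms(merged, o1.getAxioms());
        man.addAxioms(merged, o2.getAxioms());

        // Interpret each class correspondence as an OWL equivalence axiom
        // (a hard-coded toy alignment here; in practice it would be parsed from RDF).
        List<SimpleEntry<String, String>> alignment = Arrays.asList(
            new SimpleEntry<>("http://example.org/fma#Heart", "http://example.org/nci#Heart"));
        for (SimpleEntry<String, String> corr : alignment) {
            man.addAxiom(merged, df.getOWLEquivalentClassesAxiom(
                df.getOWLClass(IRI.create(corr.getKey())),
                df.getOWLClass(IRI.create(corr.getValue()))));
        }

        // Count the unsatisfiable classes in the merged ontology.
        OWLReasoner reasoner = new ReasonerFactory().createReasoner(merged);
        int unsat = reasoner.getUnsatisfiableClasses().getEntitiesMinusBottom().size();
        int total = merged.getClassesInSignature().size();
        System.out.printf("%d of %d classes unsatisfiable (%.2f%%)%n",
            unsat, total, 100.0 * unsat / total);
        reasoner.dispose();
    }
}
```

The printed ratio of unsatisfiable classes is analogous in spirit, though not necessarily identical in computation, to the incoherence degree reported in Tables 11–13.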
    The UMLS-based reference alignments may contain errors and may also be in-
complete. Thus, in order to turn the current reference alignments into an agreed-upon
gold standard, expert assessment is required, which is almost infeasible for large
alignment sets. In this track, we have opted to move towards a “silver standard”
by “harmonising” the outputs of different matching tools over the different match-
ing tasks (see http://www.cs.ox.ac.uk/isg/projects/SEALS/oaei/
harmonisation/oaei2012_harmo.html for details).
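    The harmonisation procedure itself is documented at the URL above. Purely as an illustration of the general idea, and not of the actual procedure used by the organisers, the hypothetical sketch below retains every correspondence proposed by at least a minimum number of the participating systems; the entity names in the toy example are placeholders.

```java
// Illustrative vote-based harmonisation (not the organisers' actual procedure):
// keep every correspondence proposed by at least minVotes of the systems.
import java.util.*;

public class VoteHarmonisation {

    /** A correspondence reduced to a pair of entity IRIs (relation and confidence omitted). */
    record Correspondence(String entity1, String entity2) {}

    static Set<Correspondence> harmonise(List<Set<Correspondence>> alignments, int minVotes) {
        Map<Correspondence, Integer> votes = new HashMap<>();
        for (Set<Correspondence> alignment : alignments) {
            for (Correspondence c : alignment) {
                votes.merge(c, 1, Integer::sum);
            }
        }
        Set<Correspondence> silver = new HashSet<>();
        for (Map.Entry<Correspondence, Integer> e : votes.entrySet()) {
            if (e.getValue() >= minVotes) {
                silver.add(e.getKey());
            }
        }
        return silver;
    }

    public static void main(String[] args) {
        // Toy outputs of three systems (entity names are placeholders).
        Set<Correspondence> a1 = Set.of(new Correspondence("fma:Heart", "nci:Heart"),
                                        new Correspondence("fma:Lung", "nci:Lung"));
        Set<Correspondence> a2 = Set.of(new Correspondence("fma:Heart", "nci:Heart"));
        Set<Correspondence> a3 = Set.of(new Correspondence("fma:Heart", "nci:Heart"),
                                        new Correspondence("fma:Liver", "nci:Kidney"));
        // Keep correspondences proposed by at least 2 of the 3 systems.
        System.out.println(harmonise(List.of(a1, a2, a3), 2));
    }
}
```

A natural refinement of such a scheme would be to weight votes by each system's measured precision; the fixed threshold above is merely the simplest variant.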
                                                Average                       Incoherence
      System        Time (s)     Size
                                          P        F       R     All Unsat.     Degree Root Unsat.
                               Task 7: small SNOMED and NCI fragments
      LogMapnoe      211        13,525   0.90    0.75     0.65      0             0%        0
      LogMap         221        13,454   0.90    0.75     0.65      0             0%        0
      GOMMAbk        226        12,294   0.94    0.75     0.62    48,681         65%       863
      YAM++         1,901       11,961   0.95    0.74     0.61    50,089         67%       471
      ServOMapL      147        11,730   0.95    0.74     0.60    62,367         83%       657
      ServOMap       153        10,829   0.97    0.71     0.56    51,020         68%       467
      LogMapLt        54        10,947   0.95    0.70     0.56    61,269         82%       801
      GOMMA          197        10,555   0.94    0.68     0.53    42,813         57%       851
      AROMA         15,624      11,783   0.85    0.66     0.54    70,491         94%      1,286
      MapSSS        27,381       9,608   0.79    0.54     0.41    46,083         61%       794
                                Task 8: big SNOMED and NCI fragments
      LogMapnoe       575       13,184   0.88    0.73     0.62       0             0%       0
      YAM++          6,127      13,083   0.86    0.71     0.61    104,492         61%      618
      ServOMapL       363       12,784   0.86    0.70     0.59    136,909         79%     1,101
      LogMap          514       12,142   0.87    0.69     0.57       3         0.002%       2
      ServOMap        282       11,632   0.89    0.69     0.56    110,253         64%      820
      GOMMAbk         638       15,644   0.72    0.66     0.61    116,451         68%     2,741
      LogMapLt        104       12,741   0.81    0.66     0.56    131,073         76%     2,201
      GOMMA           527       12,320   0.80    0.63     0.53     96,945         56%     1,621
                               Task 9: whole SNOMED and NCI ontologies
      YAM++         30,155      14,103   0.79    0.68     0.60    238,593         64%      979
      ServOMapL      738        13,964   0.79    0.68     0.59    286,790         77%     1,557
      LogMap         955        13,011   0.81    0.67     0.57      16         0.004%       10
      LogMapnoe     1,505       13,058   0.81    0.67     0.57       0             0%       0
      ServOMap       654        12,462   0.83    0.67     0.56    230,055         62%     1,546
      GOMMAbk       1,940       17,045   0.66    0.64     0.61    239,708         64%     4,297
      LogMapLt       178        14,043   0.74    0.63     0.56    305,648         82%     3,160
      GOMMA         1,820       13,693   0.71    0.61     0.53    215,959         58%     2,614

                  Table 13. Results for the SNOMED-NCI matching problem.


9   Instance matching

The instance matching track aims at evaluating the performance of different matching
tools on the task of matching RDF individuals which originate from different sources
but describe the same real-world entity. Data interlinking is known under many names
in different research communities: equivalence mining, record linkage, object
consolidation and coreference resolution, to mention only the most common ones. In each case,
these terms refer to the task of finding equivalent entities within or across data sets [14].
As the quantity of data sets published on the Web of data dramatically increases, the
need for tools helping to interlink resources becomes more critical. It is particularly
important to maximize the automation of the interlinking process in order to keep pace
with this expansion.
     Unlike the other tracks, the instance matching track specifically focuses on an on-
tology ABox. However, the problems which have to be resolved in order to correctly
match instances can originate at the schema level (use of different properties and clas-
sification schemas) as well as at the data level, e.g., different formats of values. This
year, the track included two tasks. The first task, called Sandbox, is a simple dataset
specifically conceived to provide examples of specific matching problems (such as name
spelling and other controlled variations). It is intended to serve as a test for tools that
are in an initial phase of their development process and/or for tools that address very
focused tasks, such as person name matching. The second task, called IIMB, is an
OWL-based dataset that is automatically generated by introducing a set of controlled
transformations into an initial OWL ABox.
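    As a rough illustration of the data-level side of the problem, the hypothetical sketch below links individuals from two sources whenever their normalised name values are within a small edit distance. The property values, identifiers and threshold are invented for the example and do not correspond to any participating system.

```java
// Hypothetical data-level matcher: link individuals whose normalised name values
// are within a small edit distance. Values, identifiers and threshold are invented.
import java.util.Map;

public class NaiveInstanceMatcher {

    /** Standard Levenshtein edit distance between two strings. */
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int subst = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + subst);
            }
        }
        return d[a.length()][b.length()];
    }

    /** Lower-cases and strips punctuation to smooth over formatting differences. */
    static String normalise(String s) {
        return s.toLowerCase().replaceAll("[^a-z0-9 ]", "").trim();
    }

    public static void main(String[] args) {
        // Toy "name" values indexed by individual identifier (placeholders).
        Map<String, String> source = Map.of("src:p1", "Stanley Kubrik", "src:p2", "Federico Felini");
        Map<String, String> target = Map.of("tgt:i1", "Stanley Kubrick", "tgt:i2", "Federico Fellini");

        // Declare a link when the normalised names are within edit distance 3.
        for (Map.Entry<String, String> s : source.entrySet()) {
            for (Map.Entry<String, String> t : target.entrySet()) {
                if (editDistance(normalise(s.getValue()), normalise(t.getValue())) <= 3) {
                    System.out.println(s.getKey() + " owl:sameAs " + t.getKey());
                }
            }
        }
    }
}
```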
    The list of participants in the instance matching track is shown in Table 14.
                 Dataset      LogMap     LogMap lite       SBUEI      semsim
                 Sandbox         √            √               √
                 IIMB            √            √               √          √

                    Table 14. Participants in the instance matching track.


9.1    Sandbox
The dataset used for the Sandbox task has been automatically generated by extracting
data from Freebase, an open knowledge base that contains information about 11 million
real objects including movies, books, TV shows, celebrities, locations, companies and
more. Data have been extracted in JSON through the Freebase Java API17 . Sandbox
is a collection of OWL files consisting of 31 concepts, 36 object properties, 13 data
properties and 375 individuals divided into 10 test cases18 . In order to provide simple
matching challenges mainly conceived for systems in their initial development phase,
we limited the way data are transformed from the original ABox to the test cases. In
particular, we introduced only changes in data format (misspellings, errors in text, etc.).

Sandbox results An overview of the precision, recall and F1 -measure results of the
Sandbox task is shown in Table 15.
                           System     Precision   Recall    F1 -measure
                       LogMap            0.94      0.94        0.94
                      LogMap lite        0.95      0.89        0.92
                        SBUEI            0.95      0.98        0.96

                           Table 15. Results of the Sandbox task.


    As expected, all the participating systems obtained very good results for the simple
tests provided by the Sandbox task. This result confirms that currently available instance
matching systems deal effectively with simple errors and syntactic heterogeneities.
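    For completeness, the precision, recall and F1 -measure reported in Tables 15 and 16 follow the usual set-based definitions over correspondences. The sketch below only makes this computation explicit on a toy alignment and reference (correspondences are encoded as plain strings for brevity); it is not the evaluation code used in the campaign.

```java
// Set-based precision, recall and F1 over correspondences (toy data, not the
// campaign's evaluation code). Correspondences are encoded as plain strings.
import java.util.Set;

public class AlignmentMeasures {

    /** Returns {precision, F1, recall} of an alignment A against a reference R. */
    static double[] evaluate(Set<String> alignment, Set<String> reference) {
        long correct = alignment.stream().filter(reference::contains).count();
        double precision = alignment.isEmpty() ? 0.0 : (double) correct / alignment.size();
        double recall = reference.isEmpty() ? 0.0 : (double) correct / reference.size();
        double f1 = (precision + recall) == 0.0 ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, f1, recall};
    }

    public static void main(String[] args) {
        Set<String> alignment = Set.of("a=x", "b=y", "c=z");   // system output
        Set<String> reference = Set.of("a=x", "b=y", "d=w");   // reference alignment
        double[] pfr = evaluate(alignment, reference);
        System.out.printf("P=%.2f F=%.2f R=%.2f%n", pfr[0], pfr[1], pfr[2]);
    }
}
```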

9.2    IIMB
The IIMB task is focused on two main goals:
 1. to provide an evaluation data set for various kinds of data transformations, including
    value transformations, structural transformations and logical transformations;
 2. to cover a wide spectrum of possible techniques and tools.
17
     http://code.google.com/p/freebase-java/
18
     DL expressivity of ontologies is ALHI(D)

     The task is based on the ISLab Instance Matching Benchmark (IIMB), which has been
generated using the SWING tool [13]. Participants were requested to find the correct
correspondences between individuals of the first knowledge base and individuals of the
other one. An important aspect here is that some of the transformations require automatic
reasoning to find the expected alignments.
     IIMB is composed of a set of test cases, each one represented by a set of instances,
i.e., an OWL ABox, built from an initial data set of real linked data extracted from the
web. Then, the ABox is automatically modified in several ways, generating a set of
new ABoxes, called test cases. Each test case is produced by transforming the individual
descriptions in the reference ABox into new individual descriptions that are inserted
into the test case at hand. The goal of transforming the original individuals is twofold:
on the one hand, we provide a simulated situation where data referring to the same objects
are provided in different data sources; on the other hand, we generate different data sets
with a variable level of data quality and complexity. IIMB provides transformation tech-
niques supporting modifications of data property values, modifications of number and
type of properties used for the individual description, and modifications of the classification
of individuals. The first kind of transformations is called data value transformation
and it aims at simulating the fact that data expressing the same real object in different
data sources may be different because of data errors or because of the usage of differ-
ent conventional patterns for data representation. The second kind of transformations is
called data structure transformation and it aims at simulating the fact that the same real
object may be described using different properties/attributes in different data sources.
Finally, the third kind of transformations, called data semantic transformation, simu-
lates the fact that the same real object may be classified in different ways in different
data sources.
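    To make the three transformation families concrete, the hypothetical sketch below applies one toy instance of each (a value misspelling, a property removal and a type change) to an individual represented as a simple property-to-values map. It only illustrates the idea and is not the SWING implementation; the vocabulary and values are invented.

```java
// Toy versions of the three IIMB transformation families (not the SWING tool):
// a data value change, a data structure change and a data semantic change,
// applied to an individual modelled as a property-to-values map.
import java.util.*;

public class ToyTransformations {

    /** Data value transformation: injects a misspelling by dropping one character. */
    static String misspell(String value) {
        int i = new Random(42).nextInt(value.length());
        return value.substring(0, i) + value.substring(i + 1);
    }

    /** Data structure transformation: removes one property from the description. */
    static void dropProperty(Map<String, List<String>> individual, String property) {
        individual.remove(property);
    }

    /** Data semantic transformation: replaces the asserted class with another one. */
    static void reclassify(Map<String, List<String>> individual, String newType) {
        individual.put("rdf:type", new ArrayList<>(List.of(newType)));
    }

    public static void main(String[] args) {
        // A toy individual description (placeholder vocabulary).
        Map<String, List<String>> film = new HashMap<>();
        film.put("rdf:type", new ArrayList<>(List.of("ex:Movie")));
        film.put("ex:name", new ArrayList<>(List.of("Blade Runner")));
        film.put("ex:year", new ArrayList<>(List.of("1982")));

        List<String> names = film.get("ex:name");
        names.set(0, misspell(names.get(0)));  // data value transformation
        dropProperty(film, "ex:year");         // data structure transformation
        reclassify(film, "ex:CreativeWork");   // data semantic transformation
        System.out.println(film);
    }
}
```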
     The 2012 edition has been created by exploiting the same OWL source used for the
Sandbox task. The main difference is that we introduced in IIMB a large set of data
transformations. In particular, test cases 0 to 20 contain changes in data format (misspellings,
errors in text, etc.); test cases 21 to 40 contain changes in structure (missing properties,
changed RDF triples); test cases 41 to 60 contain logical changes (changed class membership,
logical errors); finally, test cases 61 to 80 contain a mix of the previous transformations.


IIMB results An overview of the precision, recall and F1 -measure results per set of
tests of the IIMB subtrack is shown in Table 16. A precision-recall graph visualization
is shown in Figure 5.
    As a general comment, we can conclude that all four systems participating in
this edition of the instance matching track obtained good results, both in terms of pre-
cision and recall. Table 16 suggests that the most challenging tasks in IIMB this year
were those included in test cases 061–080, which provide a combination of different
data transformations, ranging from syntactic to semantic ones. In light of this, we will
further investigate the combination of data transformations in order to generate a new
challenging dataset for instance matching evaluation.
     [Precision/recall plot: precision on the vertical axis, recall on the horizontal axis; systems: refalign, logmap, logmap lite, SBUEI, semsim.]

      Fig. 5. Precision/recall of the systems participating in the IIMB subtrack.




   System                  001–020            021–040           041–060             061–080
                       P      F    R      P      F    R     P      F    R       P      F    R
 LogMap                .94    .90   .87   .93   .96   1.0   .93     .96   1.0   .95      .86   .79
LogMap lite            .95    .78   .66   .93   .95   .97   .74     .84   .98   .77      .73   .69
  SBUEI                .96    .97   .98   .97   .98   .99   .91     .90   .89   .58      .53   .48
  semsim               .93    .93   .93   .91   .91   .91   .94     .94   .94   .66      .66   .66

                              Table 16. Results of the IIMB subtrack.
10    Lessons learned and suggestions

There are, this year, only a few comments about the execution of the evaluation:

A) This year indicated again that requiring participants to implement a minimal inter-
   face was not a strong obstacle to participation. Moreover, the community seems to
   be getting used to the SEALS technology introduced for OAEI 2011. This might be
   one of the reasons for the increased participation.
B) We have not delivered any comparative results prior to the deadline for submitting
   the papers describing the systems. This procedure was intended to avoid a focus on
   the competitive aspects of OAEI. However, several tool developers have complained
   about it, because they wanted to include such results in their papers.
C) In previous years we reported that we had many new participants. The same trend
   can be observed in 2012.
D) Again, given the high number of publications on data interlinking, it is surprising
   to have so few participants in the instance matching track.


11    Conclusions

This year, both in OAEI 2011.5 and 2012, some tracks have focused on scalability
and runtime measurement. The low number of systems that could generate results for
the Anatomy track was an unexpected outcome in 2011. In 2012, however, many more
systems could generate results for this track. Moreover, many systems could also
generate results for the Library track and for the Large Biomed track, which is concerned
with significantly larger test cases.
     Compared to the previous years, we observed a significant improvement in run-
times. Matching systems are becoming more robust and also more efficient with respect
to runtime. In particular, these improvements are more widespread than the increases in
precision and recall scores. We dare say that this improvement has been driven by
OAEI efforts towards more challenging test sets.
     There is a high variance in runtimes and there is no correlation between runtime and
quality of the generated results. This is a result that we already observed in 2011.
     There has been a considerable increase in the number of participants implementing
specific techniques for dealing with the task of matching ontologies in different natural
languages (seven participants in 2012, three in OAEI 2011.5). Although there is room
for improvement before reaching the same level of compliance as on the original OntoFarm
dataset, this increase is a sign that the field is progressing.
     All participants have provided a description of their systems and their experience in
the evaluation. These OAEI papers, like the present one, have not been peer reviewed.
However, they are full contributions to this evaluation exercise and reflect the hard work
and clever insight people put into the development of participating systems. Reading the
papers of the participants should help people involved in ontology matching to find what
makes these algorithms work and what could be improved. Sometimes participants offer
alternative evaluation results.
    The Ontology Alignment Evaluation Initiative will continue these tests by improv-
ing both the test cases and the testing methodology to make them more accurate. Further
information can be found at:

                     http://oaei.ontologymatching.org.

Acknowledgements
We warmly thank the participants of this campaign. We know that they have worked
hard to have their matching tools executable in time, and they provided insightful
papers presenting their experience. The best way to learn about the results remains to
read the following papers.
     We would like to thank Andreas Oskar Kempf from GESIS for the manual evalua-
tion of the new detected correspondences. We thank Jan Noessner for providing data in
the process of constructing the IIMB data set. We are grateful to Martin Ringwald and
Terry Hayamizu for providing the reference alignment for the anatomy ontologies and
thank Elena Beisswanger for her thorough support on improving the quality of the data
set.
     We also thank the other members of the Ontology Alignment Evaluation Initia-
tive steering committee: Yannis Kalfoglou (Ricoh laboratories, UK), Miklos Nagy (The
Open University, UK), Natasha Noy (Stanford University, USA), Yuzhong Qu (South-
east University, CN), York Sure (Leibniz Gemeinschaft, DE), Jie Tang (Tsinghua Uni-
versity, CN), George Vouros (University of the Aegean, GR).
     José Luis Aguirre, Bernardo Cuenca Grau, Jérôme Euzenat, Ernesto Jimenez-Ruiz,
Christian Meilicke, Heiner Stuckenschmidt and Cássia Trojahn dos Santos have been
partially supported by the SEALS (IST-2009-238975) European project.
     Ondřej Šváb-Zamazal has been supported by the CSF grant P202/10/0761.
     Ernesto Jimenez-Ruiz and Bernardo Cuenca Grau have been partially supported by
the Royal Society and by the EPSRC project LogMap.


References

 1. Benjamin Ashpole, Marc Ehrig, Jérôme Euzenat, and Heiner Stuckenschmidt, editors. Proc.
    of the K-Cap Workshop on Integrating Ontologies, Banff (Canada), 2005.
 2. Olivier Bodenreider. The unified medical language system (UMLS): integrating biomedical
    terminology. Nucleic Acids Research, 32:267–270, 2004.
 3. Caterina Caracciolo, Jérôme Euzenat, Laura Hollink, Ryutaro Ichise, Antoine Isaac,
    Véronique Malaisé, Christian Meilicke, Juan Pane, Pavel Shvaiko, Heiner Stuckenschmidt,
    Ondrej Sváb-Zamazal, and Vojtech Svátek. Results of the ontology alignment evaluation
    initiative 2008. In Proc. 3rd International Workshop on Ontology Matching (OM) collocated
    with ISWC, pages 73–120, Karlsruhe (Germany), 2008.
 4. Jérôme David, Jérôme Euzenat, François Scharffe, and Cássia Trojahn dos Santos. The
    alignment API 4.0. Semantic web journal, 2(1):3–10, 2011.
 5. William F. Dowling and Jean H. Gallier. Linear-time algorithms for testing the satisfiability
    of propositional Horn formulae. J. Log. Prog., 1(3):267–284, 1984.
 6. Jérôme Euzenat, Alfio Ferrara, Laura Hollink, Antoine Isaac, Cliff Joslyn, Véronique
    Malaisé, Christian Meilicke, Andriy Nikolov, Juan Pane, Marta Sabou, François Scharffe,
    Pavel Shvaiko, Vassilis Spiliopoulos, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vo-
    jtech Svátek, Cássia Trojahn dos Santos, George Vouros, and Shenghui Wang. Results of the
    ontology alignment evaluation initiative 2009. In Proc. 4th Workshop on Ontology Matching
    (OM) collocated with ISWC, pages 73–126, Chantilly (USA), 2009.
 7. Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François
    Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, and
    Cássia Trojahn dos Santos. Results of the ontology alignment evaluation initiative 2010. In
    Pavel Shvaiko, Jérôme Euzenat, Fausto Giunchiglia, Heiner Stuckenschmidt, Ming Mao, and
    Isabel Cruz, editors, Proc. 5th ISWC workshop on ontology matching (OM) collocated with
    ISWC, Shanghai (China), pages 85–117, 2010.
 8. Jérôme Euzenat, Alfio Ferrara, Willem Robert van Hage, Laura Hollink, Christian Meil-
    icke, Andriy Nikolov, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej
    Sváb-Zamazal, and Cássia Trojahn dos Santos. Results of the ontology alignment evalu-
    ation initiative 2011. In Pavel Shvaiko, Isabel Cruz, Jérôme Euzenat, Tom Heath, Ming
    Mao, and Christoph Quix, editors, Proc. 6th ISWC workshop on ontology matching (OM),
    Bonn (DE), pages 85–110, 2011.
 9. Jérôme Euzenat, Antoine Isaac, Christian Meilicke, Pavel Shvaiko, Heiner Stuckenschmidt,
    Ondrej Svab, Vojtech Svatek, Willem Robert van Hage, and Mikalai Yatskevich. Results of
    the ontology alignment evaluation initiative 2007. In Proc. 2nd International Workshop on
    Ontology Matching (OM) collocated with ISWC, pages 96–132, Busan (Korea), 2007.
10. Jérôme Euzenat, Christian Meilicke, Pavel Shvaiko, Heiner Stuckenschmidt, and Cássia Tro-
    jahn dos Santos. Ontology alignment evaluation initiative: six years of experience. Journal
    on Data Semantics, XV:158–192, 2011.
11. Jérôme Euzenat, Malgorzata Mochol, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Svab,
    Vojtech Svatek, Willem Robert van Hage, and Mikalai Yatskevich. Results of the ontol-
    ogy alignment evaluation initiative 2006. In Proc. 1st International Workshop on Ontology
    Matching (OM) collocated with ISWC, pages 73–95, Athens, Georgia (USA), 2006.
12. Jérôme Euzenat and Pavel Shvaiko. Ontology Matching. Springer, Heidelberg (DE), 2007.
13. Alfio Ferrara, Stefano Montanelli, Jan Noessner, and Heiner Stuckenschmidt. Benchmarking
    matching applications on the semantic web. In Proc. of the 8th Extended Semantic Web
    Conference (ESWC 2011), Heraklion, Greece, 2011.
14. Alfio Ferrara, Andriy Nikolov, and Francois Scharffe. Data linking for the semantic web.
    International Journal on Semantic Web and Information Systems, 7(3):46–76, 2011.
15. E. Jiménez-Ruiz, B. Cuenca Grau, and I. Horrocks. On the feasibility of using OWL 2 DL
    reasoners for ontology matching problems. In OWL Reasoner Evaluation Workshop, 2012.
16. Ernesto Jimenez-Ruiz and Bernardo Cuenca Grau. LogMap: Logic-based and scalable on-
    tology matching. In Int’l Sem. Web Conf. (ISWC), pages 273–288, 2011.
17. Ernesto Jiménez-Ruiz, Bernardo Cuenca Grau, Ian Horrocks, and Rafael Berlanga. Logic-
    based assessment of the compatibility of UMLS ontology sources. J. Biomed. Sem., 2, 2011.
18. Philipp Mayr and Vivien Petras. Building a terminology network for search: The KoMoHe
    project. In Proc. of the Int. Conference on Dublin Core and Metadata Applications, pages
    177 – 182, 2008.
19. Christian Meilicke. Alignment Incoherence in Ontology Matching. PhD thesis, University
    of Mannheim, 2011.
20. Christian Meilicke. Alignment Incoherence in Ontology Matching. PhD thesis, University
    of Mannheim, 2011.
21. Christian Meilicke, Raúl Garcı́a-Castro, Fred Freitas, Willem Robert van Hage, Elena
    Montiel-Ponsoda, Ryan Ribeiro de Azevedo, Heiner Stuckenschmidt, Ondřej Šváb Zamazal,
    Vojtěch Svátek, Andrei Tamilin, Cássia Trojahn, and Shenghui Wang. MultiFarm: A bench-
    mark for multilingual ontology matching. Web Semantics: Science, Services and Agents on
    the World Wide Web, (0):–, 2012.
22. Christian Meilicke and Heiner Stuckenschmidt. Incoherence as a basis for measuring the
    quality of ontology mappings. In Proc. 3rd International Workshop on Ontology Matching
    (OM) collocated with ISWC, pages 1–12, Karlsruhe (Germany), 2008.
23. Boris Motik, Rob Shearer, and Ian Horrocks. Hypertableau reasoning for description logics.
    J. Artif. Intell. Res., 36:165–228, 2009.
24. Joachim Neubert. Bringing the “thesaurus for economics” on to the web of linked data. In
    Proc. of the WWW Workshop on Linked Data on the Web (LDOW), 2009.
25. Maria Roşoiu, Cássia Trojahn dos Santos, and Jérôme Euzenat. Ontology matching bench-
    marks: generation and evaluation. In Pavel Shvaiko, Isabel Cruz, Jérôme Euzenat, Tom
    Heath, Ming Mao, and Christoph Quix, editors, Proc. 6th International Workshop on Ontol-
    ogy Matching (OM) collocated with ISWC, Bonn (Germany), 2011.
26. Pavel Shvaiko and Jérôme Euzenat. Ontology matching: state of the art and future challenges.
    IEEE Transactions on Knowledge and Data Engineering, 2012, to appear.
27. York Sure, Oscar Corcho, Jérôme Euzenat, and Todd Hughes, editors. Proc. of the Workshop
    on Evaluation of Ontology-based Tools (EON) collocated with ISWC, Hiroshima (Japan),
    2004.
28. Cássia Trojahn dos Santos, Christian Meilicke, Jérôme Euzenat, and Heiner Stuckenschmidt.
    Automating OAEI campaigns (first report). In Proc. International Workshop on Evaluation
    of Semantic Technologies (iWEST) collocated with ISWC, Shanghai (China), 2010.
29. Michael Uschold and Michael Gruninger. Ontologies and semantics for seamless connectiv-
    ity. SIGMOD Rec., 33(4):58–64, 2004.
30. Shenghui Wang, Antoine Isaac, Stefan Schlobach, Lourens van der Meij, and Balthasar
    Schopman. Instance-based semantic interoperability in the cultural heritage. Semantic Web,
    3(1):45–64, 2012.
31. Benjamin Zapilko, Johann Schaible, Philipp Mayr, and Brigitte Mathiak. TheSoz: A SKOS
    representation of the thesaurus for the social sciences. Semantic Web – Interoperability,
    Usability, Applicability, 2012. accepted.

       Grenoble, Milano, Amsterdam, Mannheim, Milton-Keynes, Montpellier, Trento,
                                                 Prague, Toulouse, Köln, Oxford,
                                                                 December 2012