=Paper=
{{Paper
|id=Vol-3324/oaei22_paper0
|storemode=property
|title=Results of the Ontology Alignment Evaluation Initiative 2022
|pdfUrl=https://ceur-ws.org/Vol-3324/oaei22_paper0.pdf
|volume=Vol-3324
|authors=Mina Abd Nikooie Pour,Alsayed Algergawy,Patrice Buche,Leyla J. Castro,Jiaoyan Chen,Hang Dong,Omaima Fallatah,Daniel Faria,Irini Fundulaki,Sven Hertling,Yuan He,Ian Horrocks,Martin Huschka,Liliana Ibanescu,Ernesto Jiménez-Ruiz,Naouel Karam,Amir Laadhar,Patrick Lambrix,Huanyu Li,Ying Li,Franck Michel,Engy Nasr,Heiko Paulheim,Catia Pesquita,Tzanina Saveta,Pavel Shvaiko,Cássia Trojahn,Chantelle Verhey,Mingfang Wu,Beyza Yaman,Ondřej Zamazal,Lu Zhou
|dblpUrl=https://dblp.org/rec/conf/semweb/PourABCC0FFFH0022
}}
==Results of the Ontology Alignment Evaluation Initiative 2022==
Results of the Ontology Alignment Evaluation Initiative 2022 Mina Abd Nikooie Pour1 , Alsayed Algergawy2 , Patrice Buche3 , Leyla J. Castro4 , Jiaoyan Chen5 , Hang Dong6 , Omaima Fallatah7 , Daniel Faria8 , Irini Fundulaki9 , Sven Hertling10 , Yuan He6 , Ian Horrocks6 , Martin Huschka11 , Liliana Ibanescu12 , Ernesto Jiménez-Ruiz13 , Naouel Karam14 , Amir Laadhar15 , Patrick Lambrix1,16 , Huanyu Li1 , Ying Li1 , Franck Michel17 , Engy Nasr18 , Heiko Paulheim10 , Catia Pesquita19 , Tzanina Saveta9 , Pavel Shvaiko20 , Cassia Trojahn21 , Chantelle Verhey22 , Mingfang Wu23 , Beyza Yaman24 , Ondrej Zamazal25 and Lu Zhou26 1 Linköping University & Swedish e-Science Research Center, Linköping, Sweden 2 Heinz Nixdorf Chair for Distributed Information Systems, Friedrich Schiller University Jena, Germany 3 UMR IATE, INRAE, University of Montpellier, France 4 ZB MED Information Centre for Life Sciences, Germany 5 Department of Computer Science, The University of Manchester, UK 6 Department of Computer Science, University of Oxford, UK 7 Information School, The University of Sheffield, Sheffield, UK 8 University of Lisbon, Portugal 9 Institute of Computer Science-FORTH, Heraklion, Greece 10 Data and Web Science Group, University of Mannheim, Germany 11 Fraunhofer Institute for High-Speed Dynamics, Ernst-Mach-Institut, EMI, Germany 12 Université Paris-Saclay, INRAE, AgroParisTech, UMR MIA Paris-Saclay, France 13 City, University of London, UK & SIRIUS, University of Oslo, Norway 14 Fraunhofer FOKUS & Institute for Applied Informatics, University of Leipzig, Germany 15 University of Stuttgart, Germany 16 University of Gävle, Sweden 17 University Côte d’Azur, CNRS, Inria 18 Albert Ludwig University of Freiburg, Germany 19 LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal 20 Trentino Digitale SpA, Trento, Italy 21 Institut de Recherche en Informatique de Toulouse, France 22 World Data System, International Technology Office, USA 23 Australian Research Data Commons 24 ADAPT Centre, Trinity College Dublin 25 Prague University of Economics and Business, Czech Republic 26 TigerGraph, Inc. USA Abstract The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases. These test cases can be based on ontologies of different levels of complexity and use different evaluation modalities. The OAEI 2022 campaign offered 14 tracks and was attended by 18 participants. This paper is an overall presentation of that campaign. 1. Introduction The Ontology Alignment Evaluation Initiative1 (OAEI) is a coordinated international initiative, which organizes the evaluation of ontology matching systems [1, 2], and which has been run for eighteen years now. The main goal of the OAEI is to compare systems and algorithms openly and on the same basis, in order to allow anyone to draw conclusions about the best ontology matching strategies. Furthermore, the ambition is that, from such evaluations, developers can improve their systems and offer better tools addressing the evolving application needs. Two first events were organized in 2004: (i) the Information Interpretation and Integration Conference (I3CON) held at the NIST Performance Metrics for Intelligent Systems (PerMIS) workshop and (ii) the Ontology Alignment Contest held at the Evaluation of Ontology-based Tools (EON) workshop of the annual International Semantic Web Conference (ISWC) [3]. 
Then, a unique OAEI campaign occurred in 2005 at the workshop on Integrating Ontologies held in conjunction with the International Conference on Knowledge Capture (K-Cap) [4]. From 2006 until the present, the OAEI campaigns were held at the Ontology Matching workshop, collocated with ISWC [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], which this year took place virtually2 . Since 2011, we have been using an environment for automatically processing evaluations which was developed within the SEALS (Semantic Evaluation At Large Scale) project3 . SEALS provided a software infrastructure for automatically executing evaluations and evaluation campaigns for typical semantic web tools, including ontology matching. Since OAEI 2017, a novel evaluation environment, called HOBBIT (Section 2.1), was adopted for the HOBBIT Link Discovery track, and later extended to enable the evaluation of other tracks. Some tracks are run exclusively through SEALS and others through HOBBIT, but several allow participants to choose the platform they prefer. Since last year, the MELT framework [21] has been adopted in order to facilitate the SEALS and HOBBIT wrapping and evaluation. This year, most tracks have adopted MELT as their evaluation platform. This paper synthesizes the 2022 evaluation campaign and introduces the results provided in the papers of the participants. The remainder of the paper is organized as follows: in Section 2, we present the overall evaluation methodology; in Section 3 we present the tracks and datasets; in Section 4 we present and discuss the results; and finally, Section 5 discusses the lessons learned. OM-2022: Proceedings of the 17th International Workshop on Ontology Matching, October 2022, Hangzhou, China (Virtual) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 1 http://oaei.ontologymatching.org 2 http://om2022.ontologymatching.org 3 http://www.seals-project.eu 2. Methodology 2.1. Evaluation platforms The OAEI evaluation was carried out on one of three alternative platforms: the SEALS client, the HOBBIT platform, or the MELT framework. All of them have the goal of ensuring reproducibility and comparability of the results across matching systems. As of this campaign, the use of the SEALS client and packaging format is deprecated in favor of MELT, with the sole exception of the Interactive Matching track, as simulated interactive matching is not yet supported by MELT. The SEALS client was developed in 2011. It is a Java-based command line interface for ontology matching evaluation, which requires system developers to implement an interface and to wrap their tools in a predefined way, including all required libraries and resources. The HOBBIT platform4 was introduced in 2017. It is a web interface for linked data and ontology matching evaluation, which requires systems to be wrapped inside Docker containers that include a SystemAdapter class and then uploaded to the HOBBIT platform [22]. The MELT framework5 [21] was introduced in 2019 and is under active development. It allows the development, evaluation, and packaging of matching systems for evaluation interfaces like SEALS or HOBBIT. It further enables developers to use Python or any other programming language in their matching systems, which had previously been a hurdle for OAEI participants.
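To give a concrete picture of what wrapping a non-Java matcher can look like, the sketch below exposes a matching system as a minimal HTTP service in Python. It is an illustrative sketch only: the endpoint path, the form field names (source, target) and the placeholder match function are assumptions for this example, not the normative MELT web packaging contract (the MELT documentation referenced in the next section is the authoritative specification); the response uses the Alignment format commonly exchanged in OAEI evaluations.

```python
# Illustrative sketch of a matcher behind an HTTP endpoint (field names are assumptions,
# not the normative MELT web packaging contract).
from flask import Flask, Response, request

app = Flask(__name__)

ALIGNMENT_TEMPLATE = """<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Alignment><xml>yes</xml><level>0</level><type>??</type>
  {cells}
  </Alignment>
</rdf:RDF>"""

CELL_TEMPLATE = """<map><Cell>
  <entity1 rdf:resource="{e1}"/><entity2 rdf:resource="{e2}"/>
  <relation>=</relation>
  <measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">{conf}</measure>
</Cell></map>"""

def match(source_url, target_url):
    # Placeholder matching logic: a real system would download and align both
    # ontologies and return (entity1_iri, entity2_iri, confidence) triples.
    return []

@app.route("/match", methods=["POST"])
def match_endpoint():
    source = request.form.get("source")   # URL of the source ontology
    target = request.form.get("target")   # URL of the target ontology
    cells = "".join(CELL_TEMPLATE.format(e1=e1, e2=e2, conf=c)
                    for e1, e2, c in match(source, target))
    return Response(ALIGNMENT_TEMPLATE.format(cells=cells),
                    mimetype="application/rdf+xml")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Such a service can then be packaged as a Docker container, which is the submission form used by most participants this year.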
The evaluation client6 allows organizers to evaluate packaged systems whereby multiple submission formats are supported (SEALS packages or matchers implemented as Web services). All platforms compute the standard evaluation metrics against the reference alignments: pre- cision, recall, and F-measure. In test cases where different evaluation modalities are required, evaluation was carried out a posteriori, using the alignments produced by the matching systems. 2.2. Submission formats This year, three submission formats were allowed: (1) SEALS package, (2) HOBBIT, and (3) MELT Web interface. With the increasing usage of other programming languages than Java and increasing hardware requirements for matching systems, since 2021 the MELT Web interface was introduced in order to address this issue. It mainly consists of a technology-independent HTTP interface7 which participants can implement as they wish. Alternatively, they can use the MELT framework to assist them, as it can be used to wrap any matching system as docker container implementing the HTTP interface. In 2022, 10 systems were submitted as MELT Web docker container, 5 systems were submitted as SEALS package, 3 systems were uploaded to the HOBBIT platform, and one system implemented the Web interface directly and provided hosting for the system. 4 https://project-hobbit.eu/outcomes/hobbit-platform/ 5 https://github.com/dwslab/melt 6 https://dwslab.github.io/melt/matcher-evaluation/client 7 https://dwslab.github.io/melt/matcher-packaging/web 2.3. OAEI campaign phases As in previous years, the OAEI 2022 campaign was divided into three phases: preparatory, execution, and evaluation. In the preparatory phase, the test cases were provided to participants in an initial assessment period between June 30𝑡ℎ and July 31𝑠𝑡 , 2022. The goal of this phase is to ensure that the test cases make sense to participants, and give them the opportunity to provide feedback to organizers on the test case as well as potentially report errors. At the end of this phase, the final test base was frozen and released. During the ensuing execution phase, participants test and potentially develop their matching systems to automatically match the test cases. Participants can self-evaluate their results either by comparing their output with the reference alignments or by using either of the evaluation platforms. They can tune their systems with respect to the non-blind evaluation as long as they respect the rules of the OAEI. Participants were required to register their systems by July 31𝑠𝑡 and make a preliminary evaluation by August 31𝑡ℎ . The execution phase was terminated on September 30𝑡ℎ , 2022, at which date participants had to submit the (near) final versions of their systems (SEALS-wrapped and/or HOBBIT-wrapped). During the evaluation phase, systems were evaluated by all track organizers. In case minor problems were found during the initial stages of this phase, they were reported to the developers, who were given the opportunity to fix and resubmit their systems. Initial results were provided directly to the participants, whereas final results for most tracks were published on the respective OAEI web pages before the workshop. 3. Tracks and test cases This year’s OAEI campaign consisted of 14 tracks, all of them including OWL ontologies while only one also including SKOS thesauri, namely the Biodiversity and the Ecology track. They can be grouped into: – Schema matching tracks, which have as objective matching ontology classes and/or proper- ties. 
– Instance matching tracks, which have as objective matching ontology instances. – Instance and schema matching tracks, which involve both of the above. – Complex matching tracks, which have as objective finding complex correspondences between ontology entities. – Interactive tracks, which simulate user interaction to enable the benchmarking of interactive matching algorithms. The tracks are summarized in Table 1 and detailed in the following sections. test formalism relations confidence modalities language SEALS HOBBIT MELT T-Box/Schema matching √ √ anatomy OWL = [0 1] open EN √ conference OWL =, <= [0 1] open+blind EN √ multifarm OWL = [0 1] open+blind AR, CZ, CN, DE, EN, ES, FR, IT, NL, RU, PT √ complex OWL = [0 1] open+blind EN, ES √ food OWL = [0 1] open EN √ interactive OWL =, <= [0 1] open EN √ bio-ML OWL =, <= [0 1] open EN √ biodiv OWL/SKOS = [0 1] open EN √ mse OWL =, <=, >= [0 1] open EN √ crosswalks OWL = [0 1] open EN √ common knowl. graph OWL = [0 1] open EN Instance and schema matching √ knowledge graph OWL = [0 1] open EN Instance matching or link discovery √ spimbench OWL = [0 1] open+blind EN √ link discovery OWL = [0 1] open EN Table 1 Tracks in OAEI 2022. 3.1. Anatomy The anatomy track comprises a single test case consisting of matching two fragments of biomedi- cal ontologies which describe the human anatomy8 (3304 classes) and the anatomy of the mouse9 (2744 classes). The evaluation is based on a manually curated reference alignment. This dataset has been used since 2007 with some improvements over the years [23]. Systems are evaluated with the standard parameters of precision, recall, F-measure. Addi- tionally, recall+ is computed by excluding trivial correspondences (i.e., correspondences that have the same normalized label). Alignments are also checked for coherence using the Pellet reasoner. The evaluation was carried out on a machine with a 5 core CPU @ 1.80 GHz with 16GB allocated RAM, using the MELT framework. For some systems, the SEALS client has been used. However, the evaluation parameters were computed a posteriori, after removing from the alignments produced by the systems, correspondences expressing relations other than equiva- lence, as well as trivial correspondences in the oboInOwl namespace (e.g., oboInOwl#Synonym = oboInOwl#Synonym). The results obtained with the SEALS client vary in some cases by 0.5% compared to the results presented in section 4. 3.2. Conference The conference track feature two test cases. The main test case is a suite of 21 matching tasks corresponding to the pairwise combination of 7 moderately expressive ontologies describing the domain of organizing conferences. The dataset and its usage are described in [24]. This year we 8 www.cancer.gov/cancertopics/cancerlibrary/terminologyresources 9 http://www.informatics.jax.org/searches/AMA form.shtml again run a second test case consisting of a suite of three tasks of matching DBpedia ontology (filtered to the dbpedia namespace) and three ontologies from the conference domain. For the main test case the track uses several reference alignments for evaluation: the old (and not fully complete) manually curated open reference alignment, ra1; an extended, also manually curated version of this alignment, ra2; a version of the latter corrected to resolve violations of conservativity, rar2; and an uncertain version of ra1 produced through crowd-sourcing, where the score of each correspondence is the fraction of people in the evaluation group that agree with the correspondence. 
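Across these tracks, comparison against a sharp reference alignment boils down to set operations over correspondences. The following minimal sketch (our illustration, not any platform's actual code) computes precision, recall and the F-beta family of measures, of which F1, F0.5 and F2 appear in the evaluations below; correspondences are represented as simple (entity1, entity2, relation) triples with hypothetical entity IRIs.

```python
# Minimal sketch of the standard alignment metrics: precision, recall and F_beta
# (F1 = harmonic mean; F0.5 favours precision, F2 favours recall).
def precision_recall_fbeta(system, reference, beta=1.0):
    system, reference = set(system), set(reference)
    tp = len(system & reference)                      # correct correspondences
    precision = tp / len(system) if system else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta

# Toy example with hypothetical entity IRIs.
reference = {("o1#Paper", "o2#Article", "="), ("o1#Author", "o2#Writer", "=")}
system_alignment = {("o1#Paper", "o2#Article", "="), ("o1#Chair", "o2#Venue", "=")}
for beta in (0.5, 1.0, 2.0):
    print(beta, precision_recall_fbeta(system_alignment, reference, beta))
```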
The latter reference was used in two evaluation modalities: discrete and continuous evaluation. In the former, correspondences in the uncertain reference alignment with a score of at least 0.5 are treated as correct whereas those with lower score are treated as incorrect, and standard evaluation parameters are used to evaluated systems. In the latter, weighted precision, recall and F-measure values are computed by taking into consideration the actual scores of the uncertain reference, as well as the scores generated by the matching system. For the sharp reference alignments (ra1, ra2 and rar2), the evaluation is based on the standard parameters, as well the F0.5 -measure and F2 -measure and on conservativity and consistency violations. Whereas F1 is the harmonic mean of precision and recall where both receive equal weight, 𝐹2 gives higher weight to recall than precision and F0.5 gives higher weight to precision higher than recall. The second test case contains open reference alignment and systems were evaluated using the standard metrics. Two baseline matchers are used to benchmark the systems: edna string edit distance matcher; and StringEquiv string equivalence matcher as in the anatomy test case. 3.3. Multifarm The multifarm track [25] aims at evaluating the ability of matching systems to deal with ontologies in different natural languages. This dataset results from the translation of 7 ontologies from the conference track (cmt, conference, confOf, iasted, sigkdd, ekaw and edas) into 10 languages: Arabic (ar), Chinese (cn), Czech (cz), Dutch (nl), French (fr), German (de), Italian (it), Portuguese (pt), Russian (ru), and Spanish (es). The dataset is composed of 55 pairs of languages, with 49 matching tasks for each of them, taking into account the alignment direction (e.g. cmt𝑒𝑛 →edas𝑑𝑒 and cmt𝑑𝑒 →edas𝑒𝑛 are distinct matching tasks). While part of the dataset is openly available, all matching tasks involving the edas and ekaw ontologies (resulting in 55 × 24 matching tasks) are used for blind evaluation. We consider two test cases: i) those tasks where two different ontologies (cmt→edas, for instance) have been translated into two different languages; and ii) those tasks where the same ontology (cmt→cmt) has been translated into two different languages. For the tasks of type ii), good results are not only related to the use of specific techniques for dealing with cross-lingual ontologies, but also on the ability to exploit the identical structure of the ontologies. The reference alignments used in this track derive directly from the manually curated Confer- ence ra1 reference alignments. In 2021, alignments have been manually evaluated by domain experts. The evaluation is blind. The systems have been executed on a Ubuntu Linux machine configured with 32GB of RAM running under a Intel Core CPU 2.00GHz x8 cores. 3.4. Complex Matching The complex matching track is meant to evaluate the matchers based on their ability to gen- erate complex alignments. A complex alignment is composed of complex correspondences typically involving more than two ontology entities, such as 𝑜1 :AcceptedPaper ≡ 𝑜2 :Paper ⊓ 𝑜2 :hasDecision.𝑜2 :Acceptance. The Conference dataset is composed of three ontologies: cmt, conference and ekaw from the conference dataset. The reference alignment was created as a consensus between experts. In the evaluation process, the matchers can take the simple reference alignment ra1 as input. 
The preci- sion and recall measures are manually calculated over the complex equivalence correspondences only. The Taxon dataset is composed of four knowledge bases containing knowledge about plant taxonomy: AgronomicTaxon, AGROVOC, TAXREF-LD and DBpedia. The alignment systems have been executed on a Ubuntu Linux machine configured with 32GB of RAM running under a Intel Core CPU 2.00GHz x8 cores. All measurements are based on a single run. This year, the other complex sub-tracks (Hydrography, GeoLink, Populated GeoLink and Populated Enslaved datasets) have been discontinued. 3.5. Food The Food Nutritional Composition track aims at finding alignments between food concepts from CIQUAL10 , the French food nutritional composition database, and food concepts from SIREN11 , the Scientific Information and Retrieval Exchange Network of the US Food and Drug administration. Foods from both databases are described in LanguaL12 , a well-known multilingual thesaurus using faceted classification. LanguaL stands for ”Langua aLimentaria” or ”language of food” and more than 40,000 foods used in food composition databases are described using LanguaL. In [26] we propose the method to provide OWL modelling of food concepts from both datasets, CIQUAL13 and SIREN 14 , and a gold standard. The evaluation was performed using the MELT platform. Every participating system was executed in its standard setting and we compare precision, recall and F-measure as well as the computation time. 3.6. Interactive Matching The interactive matching track aims to assess the performance of semi-automated matching systems by simulating user interaction [27, 28, 29]. The evaluation thus focuses on how interaction with the user improves the matching results. Currently, this track does not evaluate the user experience or the user interfaces of the systems [30, 28]. 10 https://ciqual.anses.fr/ 11 http://langual.org/langual indexed datasets.asp 12 https://www.langual.org/default.asp 13 https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.15454/6CEYU3 14 https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.15454/5LLGVY The interactive matching track is based on the datasets from the Anatomy and Conference tracks, which have been previously described. It relies on the SEALS client’s Oracle class to simulate user interactions. An interactive matching system can present a collection of correspondences simultaneously to the oracle, which will tell the system whether that correspondence is correct or not. If a system presents up to three correspondences together and each correspondence presented has a mapped entity (i.e., class or property) in common with at least one other correspondence presented, the oracle counts this as a single interaction, under the rationale that this corresponds to a scenario where a user is asked to choose between conflicting candidate correspondences. To simulate the possibility of user errors, the oracle can be set to reply with a given error probability (randomly, from a uniform distribution). We evaluated systems with four different error rates: 0.0 (perfect user), 0.1, 0.2, and 0.3. 
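The sketch below illustrates the simulated oracle behaviour described above: answers are validated against the reference alignment and flipped with the configured error probability, and a group of up to three candidates presented together counts as one interaction. This is an illustrative simulation only (class and method names are ours, and the check that grouped candidates actually conflict is omitted); the real evaluation uses the SEALS client's Oracle class.

```python
import random

class SimulatedOracle:
    """Illustrative stand-in for the SEALS Oracle: validates correspondences against
    the reference alignment and errs with a fixed probability."""

    def __init__(self, reference, error_rate=0.0, seed=0):
        self.reference = set(reference)   # correct correspondences
        self.error_rate = error_rate      # probability of a wrong answer
        self.interactions = 0             # grouped requests count as one interaction
        self.distinct = set()             # distinct correspondences asked
        self._rng = random.Random(seed)

    def ask(self, correspondences):
        """correspondences: up to three (conflicting) candidates presented together."""
        assert 1 <= len(correspondences) <= 3
        self.interactions += 1
        answers = {}
        for corr in correspondences:
            self.distinct.add(corr)
            answer = corr in self.reference
            if self._rng.random() < self.error_rate:
                answer = not answer       # simulated user error
            answers[corr] = answer
        return answers

# Example: query an oracle with a 0.2 error rate about one candidate correspondence.
oracle = SimulatedOracle({("mouse#brain", "human#Brain", "=")}, error_rate=0.2)
print(oracle.ask([("mouse#brain", "human#Brain", "=")]))
```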
In addition to the standard evaluation parameters, we also compute the number of requests made by the system, the total number of distinct correspondences asked, the number of positive and negative answers from the oracle, the performance of the system according to the oracle (to assess the impact of the oracle errors on the system) and finally, the performance of the oracle itself (to assess how erroneous it was). The evaluation was carried out on a server with 3.46 GHz (6 cores) and 8GB RAM allocated to the matching systems. For systems requiring more RAM, the evaluation was carried out on a computer with an AMD Ryzen 7 5700G 3.80 GHz CPU and 32GB RAM, with 10GB of max heap space allocated to java.Each system was run ten times and the final result of a system for each error rate represents the average of these runs. For the Conference dataset with the ra1 alignment, precision and recall correspond to the micro-average over all ontology pairs, whereas the number of interactions is the total number of interactions for all the pairs. 3.7. Bio-ML The Bio-ML track [31] incorporates both equivalence and subsumption ontology matching (OM) tasks for biomedical ontologies, with ground truth (equivalence) mappings extracted from Mondo [32] and UMLS [33] (see Table 2). Mondo aims to integrate disease concepts worldwide, while UMLS is a meta-thesaurus for the biomedical domain. Based on techniques (ontology pruning, subsumption mapping construction, negative candidate mapping generation, etc.) proposed in [31], we introduced five OM pairs with their information reported in Table 3. Each OM pair is accompanied with both equivalence and subsumption matching tasks; each matching task has two data split settings: (i) unsupervised setting (90% of the mappings for testing, and 10% for validation), and (ii) semi-supervised setting (70% of the mappings for testing, 20% for training, and 10% for validation). For evaluation, in [31] we proposed both global matching and local ranking; the former aims to evaluate the overall performance by computing Precision, Recall, and F1 metrics for the output mappings against the reference mappings, while the latter aims to evaluate the ability of distinguishing the correct mapping out of several challenging negatives by ranking metrics Hits@K and MRR. Note that subsumption mappings are inherently incomplete, so only local ranking evaluation is applied for subsumption matching. 15 Created from OMIM texts by Mondo’s pipeline tool avaiable at: https://github.com/monarch-initiative/omim. 16 Created by the official snomed-owl-toolkit available at: https://github.com/IHTSDO/snomed-owl-toolkit. Table 2 Information of the source ontologies used for creating the OM datasets in Bio-ML. Mapping Source Ontology Ontology Source & Version #Classes OMIM Mondo15 44,729 ORDO BioPortal, V3.2 14,886 Mondo NCIT BioPortal, V18.05d 140,144 DOID BioPortal, 2017-11-28 12,498 SNOMED UMLS, US.2021.09.0116 358,222 UMLS FMA BioPortal, V4.14.0 104,523 NCIT BioPortal, V21.02d 163,842 Table 3 Information of each OM dataset in Bio-ML, where the numbers of equivalence and subsumption reference mappings are reported in #Refs(≡) and #Refs (⊑), respectively. 
Mapping Source Ontology Pair Category #Refs (≡) #Refs (⊑) OMIM-ORDO Disease 3,721 103 Mondo NCIT-DOID Disease 4,684 3,339 SNOMED-FMA Body 7,256 5,506 UMLS SNOMED-NCIT Pharm 5,803 4,225 SNOMED-NCIT Neoplas 3,804 213 As Bio-ML is a new track in the OAEI this year and it attempts to support machine learning- based OM systems, we adopted a flexible way for evaluating participating systems. First, participants can freely choose any tasks and settings they would like to attend. Second, for systems that have been well-adapted to the MELT platform, we used MELT to produce the output mappings. Third, for systems that have been implemented elsewhere and not easy to be made compatible with MELT, we used their source code. Fourth, we also allowed participants (with trust) to directly upload output mappings if their systems had not been published and had not been made compatible with MELT. All our evaluations were conducted with the DeepOnto17 library on a local machine with Intel Xeon Bronze 3204 CPU 1.90GHz x11 processors, 126GB RAM, and two Quadro RTX 8000 GPUs. The GPUs were mainly used for training systems that involve deep neural networks. 3.8. Biodiversity and Ecology The biodiversity and ecology (biodiv) track is motivated by the GFBio18 (The German Federation for Biological Data) alongside its successor NFDI4Biodiversity19 and the AquaDiva20 projects, which aim at providing semantically enriched data management solutions for data capture, anno- tation, indexing and search [34, 35, 36]. In 2020, we partnered with the D2KAB project21 , which develops the AgroPortal22 ontology repository, to include new matching tasks involving important 17 https://krr-oxford.github.io/DeepOnto/#/ 18 www.gfbio.org 19 www.nfdi4biodiversity.org/en/ 20 www.aquadiva.uni-jena.de 21 www.d2kab.org 22 agroportal.lirmm.fr thesauri in agronomy and environmental sciences. The track features the three tasks also present in former editions: matching the Environment Ontology (ENVO) to the Semantic Web for Earth and Environment Technology Ontology (SWEET), the AGROVOC thesaurus to the US National Agricultural Library Thesaurus (NALT) and the General Multilingual Environmental Thesaurus (GEMET) to the Analysis and Experimentation on Ecosystems thesaurus (ANAEETHES). In the 2021 edition, we added a task to align between two biological taxonomies with rather different but complementary scopes: the well-known NCBI taxonomy (NCBITAXON), and TAXREF-LD [37]. No matching system was able to achieve this matching task due to the large size of the considered taxonomies. To cope with this issue in this years edition, we split the large matching task into a set of smaller, more manageable subtasks through the use of modularization. We obtained six groups corresponding to the kingdoms: Animalia, Bacteria, Chromista, Fungi, Plantae and Protozoa, leading to six well balanced matching subtasks. Table 4 presents detailed information about the ontologies and thesauri used in this year edition. Table 4 Biodiversity and Ecology track ontologies and thesauri. 
Ontology/Thesaurus Format Version Classes Instances ENVO OWL 2021-05-19 6,566 44 SWEET OWL 2019-10-12 4,533 - AGROVOC SKOS 2020-10-02 46 706,803 NALT SKOS 2020-28-01 2 74,158 GEMET SKOS 2020-13-02 7 5,907 ANAEETHES SKOS 2017-22-03 2 3,323 NCBITAXON Animalia OWL 2021-02-15 74729 - TAXREF-LD Animalia OWL 2020-06-23 (v13.0) 73528 - NCBITAXON Bacteria OWL 2021-02-15 326 - TAXREF-LD Bacteria OWL 2020-06-23 (v13.0) 312 - NCBITAXON Chromista OWL 2021-02-15 2344 - TAXREF-LD Chromista OWL 2020-06-23 (v13.0) 2290 - NCBITAXON Fungi OWL 2021-02-15 13149 - TAXREF-LD Fungi OWL 2020-06-23 (v13.0) 12732 - NCBITAXON Plantae OWL 2021-02-15 27013 - TAXREF-LD Plantae OWL 2020-06-23 (v13.0) 26302 - NCBITAXON Protozoa OWL 2021-02-15 538 - TAXREF-LD Protozoa OWL 2020-06-23 (v13.0) 501 - 3.9. Material Sciences and Engineering (MSE) Data in Material Sciences and Engineering (MSE) can be characterised by scarcity, complexity and presence of gaps. Therefore the MSE community aims for ontology-based data integration via decentralized data management architectures. Several actors using different ontologies results in the growing demand for automatic alignment of ontologies in the MSE domain. The MSE track uses small to mid-sized ontologies common in the MSE field that are imple- mented with and without upper-level ontologies. The ontologies follow heterogeneous design principles with only partial overlap to each other. The current version v1.123 of the MSE track includes three test cases summarised in Table 5, where each test case consists of two MSE ontologies to be matched [ O1 ; O2] as well as one manual reference alignment R that can be used for evaluation of the matching task. The benchmark also provides background knowledge resources. Table 5 The building blocks of the MSE track (MSE benchmark v1.1). Inputs First Test Case Second Test Case Third Test Case O1 Reduced MaterialInformation MaterialInformation MaterialInformation O2 MatOnto MatOnto EMMO R = , ⊂, ⊃ corresp. = corresp. = corresp. resources Chemical Elements Dictionary (DICT), EMMO The MSE track makes use of three different MSE ontologies in total, in each of which an ontology using an upper-level ontology is matched to one without an upper-level. The MaterialInformation[38] domain ontology was designed without upper-level ontology and serves as infrastructure for material information and knowledge exchange (545 classes, 98 properties and 411 individuals). Three out of eight submodules of the MaterialInformation were merged to create the Reduced MaterialInformation (32 classes, 43 properties and 17 individuals) for a more efficient creation of the manual reference alignment in the First Test Case, see Table 5.The MatOnto Ontology v2.124 (847 classes, 96 properties and 131 individuals) bases on the upper-level ontology bfo225 . The Elementary Multiperspective Material Ontology (EMMO v1.0.0-alpha2)26 , is a standard representational ontology framework based on current materials modelling and characterisation knowledge incorporating an upper-, mid- and domain-level (451 classes, 35 properties). For every test case, a manual reference alignment R was created in close cooperation with MSE domain experts. The evaluation was performed using the MELT platform on a Windows 10 system with Intel Core i7 870 CPU @2.93GHz x4 and 16 GB RAM. For the time being, no background knowledge was used for evaluation. Every participating system was executed in its standard setting and precision, recall and F-measure as well as the computation time is compared. 3.10. 
Crosswalks Data Schema Matching This is a new track introduced this year. It aims at evaluating the ability of systems to deal with the schema metadata matching task, in particular, with a collection of crosswalks from fifteen research data schemas to Schema.org [39, 40]. It is based on the work carried out by the Research Data Alliance (RDA) Research Metadata Schemas Working Group. The collection serves as a reference for data repositories when they develop their crosswalks, as well as an indication of semantic interoperability among the schemas. 23 https://github.com/EngyNasr/MSE-Benchmark/releases/tag/v1.1 24 https://raw.githubusercontent.com/iNovexIrad/MatOnto-Ontologies/master/matonto-release.ttl 25 http://purl.obolibrary.org/obo/bfo/2.0/bfo.owl 26 https://raw.githubusercontent.com/emmo-repo/EMMO/1.0.0-alpha2/emmo.owl The dataset is composed of 15 source research metadata schemas describing datasets that have been aligned to Schema.org. The source schemas include the discipline-agnostic schemas Dublin Core, Data Catalogue Vocabulary (DCAT), Data Catalogue Vocabulary - Application Profile (DCAT-AP), Registry Interchange Format - Collections and Services (RIF-CS), DataCite Schema and Dataverse; the discipline schemas ISO19115-1, EOSC/EDMI, Data Tag Suite (DATS), Bioschemas, B2FIND, Data Documentation Initiative (DDI), European Clinical Research Infrastructure Network (ECRIN) and Space Physics Archive Search and Extract (SPASE); as well as CodeMeta for software. This year, a subset of these metadata schemas aligned to Schema.org has been considered. This subset corresponds to the set of schemas and vocabularies for which an OWL/RDFS serialisation is available. It involves: Data Catalogue Vocabulary (DCAT-v3), Data Catalogue Vocabulary - Application Profile (DCAT-AP), DataCite, Dublin Core (DC), ISO19115-1 schemas (ISO) and RIF-CS. Using as a reference the manually established correspondences, the evaluation is based on the well-known measures of precision, recall and F-measure. The systems have been executed on an Ubuntu Linux machine with 32GB of RAM and an Intel Core CPU at 2.00GHz with 8 processors. 3.11. Common Knowledge Graphs This track was introduced to the OAEI in 2021, and it evaluates the ability of matching systems to match the schemas (classes) of large cross-domain knowledge graphs such as DBpedia [41], YAGO [42] and NELL [43]. The dataset used for the evaluation is generated from DBpedia and the Never-Ending Language Learner (NELL). While DBpedia is generated from structured data in Wikipedia's articles, NELL is an automatically generated knowledge graph with entities extracted from a large-scale text corpus shared on websites. The automatic extraction process is one of the aspects that make common knowledge graphs different from ontologies, as it often results in less well-formatted and cross-domain datasets. In addition to the NELL and DBpedia test case, this year we introduced a new test case for matching classes from YAGO and Wikidata [44]. The numbers of entities in the four KG datasets are illustrated in Table 6.
Table 6
The number of classes and instances in the two common knowledge graph benchmarks.
Dataset #Classes #Instances
DBpedia 138 631,461
NELL 134 1,184,377
YAGO 304 5,149,594
Wikidata 304 2,158,547
The NELL and DBpedia benchmark [45] was human-annotated and verified by experts. This gold standard is only a partial gold standard, since not every class in each knowledge graph has an equivalent class in the opposite one.
To avoid over-penalizing matches that may discover reasonable matches that are not included in the partial gold standard, our evaluation ignores any predicted matches where neither of the classes in that pair exists in a true positive pair with another class in the reference alignments. In terms of YAGO and Wikidata gold standard, it was originally created [46] and expanded according to OAEI standard as part of [44]. With the respect to the reference alignment, matching systems were evaluated using standard precision, recall, and f-measure. The evaluation was carried out on a Linux virtual machine with 128 GB of RAM and 16 vCPUs (2.4 GHz) processors. The evaluation was performed using MELT for matchers wrapped using both SEALS, and the web packaging via Docker. As baseline, we utilize a simple string matcher which is available through MELT. 3.12. Knowledge Graph The Knowledge Graph track was run for the fourth year. The task of the track is to match pairs of knowledge graphs, whose schema and instances have to be matched simultaneously. The individual knowledge graphs are created by running the DBpedia extraction framework on eight different Wikis from the Fandom Wiki hosting platform27 in the course of the DBkWik project [47, 48]. They cover different topics (movies, games, comics and books) and three Knowledge Graph clusters sharing the same domain e.g. star trek, as shown in Table 7. Table 7 Characteristics of the Knowledge Graphs in the Knowledge Graph track, and the sources they were created from. Source Hub Topic #Instances #Properties #Classes Star Wars Wiki Movies Entertainment 145,033 700 269 The Old Republic Wiki Games Gaming 4,180 368 101 Star Wars Galaxies Wiki Games Gaming 9,634 148 67 Marvel Database Comics Comics 210,996 139 186 Marvel Cinematic Universe Movies Entertainment 17,187 147 55 Memory Alpha TV Entertainment 45,828 325 181 Star Trek Expanded Universe TV Entertainment 13,426 202 283 Memory Beta Books Entertainment 51,323 423 240 The evaluation is based on reference correspondences at both schema and instance levels. While the schema level correspondences were created by experts, the instance correspondences were extracted from the wiki page itself. Due to the fact that not all inter wiki links on a page represent the same concept a few restrictions were made: 1) only links in sections with a header containing “link” are used, 2) all links are removed where the source page links to more than one concept in another wiki (ensures the alignments are functional), 3) multiple links which point to the same concept are also removed (ensures injectivity), 4) links to disambiguation pages were manually checked and corrected. Since we do not have a correspondence for each instance, class, and property in the graphs, this gold standard is only a partial gold standard. The evaluation was executed on a virtual machine (VM) with 32GB of RAM and 16 vCPUs (2.4 GHz), with Debian 9 operating system and Openjdk version 1.8.0 265. For evaluating all possible submission formats, MELT framework is used. The corresponding code for evaluation can be found on Github28 . 27 https://www.wikia.com/ 28 https://github.com/dwslab/melt/tree/master/examples/kgEvalCli The alignments were evaluated based on precision, recall, and f-measure for classes, properties, and instances (each in isolation). The partial gold standard contained 1:1 correspondences and we further assume that in each knowledge graph, only one representation of the concept exists. 
This means that if we have a correspondence in our gold standard, we count a correspondence to a different concept as a false positive. The count of false negatives is only increased if we have a 1:1 correspondence and it is not found by a matcher. As a baseline, we employed two simple string matching approaches. The source code for these matchers is publicly available.29 3.13. SPIMBENCH The SPIMBENCH track consists of matching instances that are found to refer to the same real- world entity corresponding to a creative work (that can be a news item, blog post or programme). The datasets were generated and transformed using SPIMBENCH [49] by altering a set of original linked data through value-based, structure-based, and semantics-aware transformations (simple combination of transformations). They share almost the same ontology (with some differences in property level, due to the structure-based transformations), which describes instances using 22 classes, 31 data properties, and 85 object properties. Participants are requested to produce a set of correspondences between the pairs of matching instances from the source and target datasets that are found to refer to the same real-world entity. An instance in the source dataset can have none or one matching counterpart in the target dataset. The SPIMBENCH task uses two sets of datasets30 with different scales (i.e., number of instances to match): – Sandbox (380 INSTANCES, 10000 TRIPLES). It contains two datasets called source (Tbox1) and target (Tbox2) as well as the set of expected correspondences (i.e., reference alignment). – Mainbox (1800 CWs, 50000 TRIPLES). It contains two datasets called source (Tbox1) and target (Tbox2). This test case is blind, meaning that the reference alignment is not given to the participants. In both cases, the goal is to discover the correspondences among the instances in the source dataset (Tbox1) and the instances in the target dataset (Tbox2). The evaluation was carried out using the HOBBIT platform. 3.14. Link Discovery The Link Discovery track features Spatial test case this year, that deal with link discovery for spatial data represented as trajectories i.e., sequences of longitude, latitude pairs. The track is based on two datasets generated from TomTom31 and Spaten [50]. The Spatial test case aims at testing the performance of systems that deal with topological relations proposed in the state of the art DE-9IM (Dimensionally Extended nine-Intersection 29 http://oaei.ontologymatching.org/2019/results/knowledgegraph/kgBaselineMatchers.zip 30 Although the files are called Tbox1 and Tbox2, they actually contain a Tbox and an Abox. 31 https://www.tomtom.com/en gr/ Model) model [51]. The benchmark generator behind this test case implements all topological relations of DE-9IM between trajectories in the two dimensional space. To the best of our knowledge such a generic benchmark, that takes as input trajectories and checks the performance of linking systems for spatial data does not exist. The focus for the design was (a) on the correct implementation of all the topological relations of the DE-9IM topological model and (b) on producing datasets large enough to stress the systems under test. The supported relations are: Equals, Disjoint, Touches, Contains/Within, Covers/CoveredBy, Intersects, Crosses, Overlaps. The test case comprises tasks for all the DE-9IM relations and for LineString/LineString and LineString/Polygon cases, for both TomTom and Spaten datasets, ranging from 200 to 2K instances. 
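For readers unfamiliar with DE-9IM, the short sketch below illustrates a subset of the relations listed above on a LineString/Polygon pair using the shapely library; it is only an illustration of the relation semantics, not the benchmark generator or the HOBBIT-based evaluation itself (the Covers/CoveredBy pair is omitted here).

```python
from shapely.geometry import LineString, Polygon

# A trajectory (sequence of longitude/latitude pairs) and a polygonal area.
trajectory = LineString([(23.71, 37.97), (23.73, 37.99), (23.76, 38.01)])
area = Polygon([(23.72, 37.96), (23.77, 37.96), (23.77, 38.00), (23.72, 38.00)])

# Raw DE-9IM intersection matrix between the two geometries.
print(trajectory.relate(area))

# Named topological predicates corresponding to relations used in the track.
for name, holds in [
    ("Equals", trajectory.equals(area)),
    ("Disjoint", trajectory.disjoint(area)),
    ("Touches", trajectory.touches(area)),
    ("Within", trajectory.within(area)),
    ("Contains", trajectory.contains(area)),
    ("Intersects", trajectory.intersects(area)),
    ("Crosses", trajectory.crosses(area)),
    ("Overlaps", trajectory.overlaps(area)),
]:
    print(f"{name}: {holds}")
```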
We did not exceed 64 KB per instance, due to a limitation of the Silk system32 , and ran all the systems using a single core in order to enable a fair comparison of the systems participating in this track. We note, however, that Silk and DS-JedAI also have multi-core versions, and that DS-JedAI's time performance also includes the Spark start-up time. The evaluation was carried out using the HOBBIT platform. 32 https://github.com/silk-framework/silk/issues/57 4. Results and Discussion 4.1. Participation Following an initial period of growth, the number of OAEI participants has remained approximately constant since 2012, at slightly over 20. This year, 18 systems participated. Table 8 lists the participants and the tracks in which they competed. It is worth mentioning that the first-year Bio-ML track has four additional participants (e.g., BERTMap [52] and BERTSubs [53]) that are not listed in Table 8. This is because they need training and validation, which are not yet fully supported by the OAEI evaluation platforms, and thus they were tested locally with Bio-ML results reported, but without an OAEI system submission. Some matching systems participated with different variants (Matcha and LogMap) whereas others were evaluated with different configurations, as requested by developers (see test case sections for details). The following sections summarise the results for each track.
Table 8
Participants and the status of their submissions (the per-system participation matrix is not reproduced here). Participants: ALIN, ALIOn, AMD, ATMatcher, CIDER-LM, DLinker, DS-JedAI, GraphMatcher, KGMatcher+, LogMap, LogMap-Bio, LogMapLt, LSMatch, LSMatchMulti, Matcha, SEBMatcher, TOMATO and WomboCombo (Total=18). Participants per track: anatomy 10, conference 12, multifarm 5, complex 2, food 4, interactive 2, bio-ML 6, biodiv 4, mse 4, commonKG 7, crosswalks 4, spimbench 2, link discovery 2, knowledge graph 7; 71 track entries in total.
4.2. Anatomy The results for the Anatomy track are shown in Table 9. Of the 10 systems participating in the Anatomy track, 8 achieved an F-measure higher than the StringEquiv baseline. Three systems were first-time participants (Matcha, ALIOn and SEBMatcher). Long-term participating systems showed few changes in comparison with previous years with respect to alignment quality (precision, recall, F-measure, and recall+), size and run time. The exceptions were ALIN, which increased in size (from 1119 to 1159), F-measure (from 0.835 to 0.852), recall (from 0.726 to 0.752) and recall+ (from 0.438 to 0.501), and LogMapBio, which increased in size (from 1586 to 1596), recall (from 0.914 to 0.919) and recall+ (from 0.773 to 0.787). In terms of run time, 4 out of 10 systems computed an alignment in less than 100 seconds. LogMapLt remains the system with the shortest runtime. Regarding quality, Matcha achieved the highest F-measure (0.941) and recall+ (0.817), but four other systems obtained an F-measure above 0.88 (SEBMatcher, LogMapBio, LogMap, and AMD), which is at least as good as the best systems in OAEI 2007-2010. Like in previous years, there is no significant correlation between the quality of the generated alignment and the run time.
Two systems produced coherent alignments (LogMapBio and LogMap). 4.3. Conference The conference evaluation results using the sharp reference alignment rar2 are shown in Table 10. For the sake of brevity, only results with this reference alignment and considering both classes and properties are shown. For more detailed evaluation results, please check conference track’s web page. With regard to two baselines we can group tools according to system’s position: six systems outperformed above both baselines (ALIN, ATMatcher, GraphMatcher, LogMap, LogMapLt, and SEBMatcher); two systems performed better than StringEquiv baseline (AMD, LSMatch), and four systems performed worse than both baselines (ALIOn, KGMatcher+, TOMATO, and Matcha). Seven matchers (AMD, ALIN, ALIOn, ATMatcher, KGMatcher+, LSMatch, and SEBMatcher) do not match properties at all. On the other side, Matcha does not match classes at all, while it dominates in matching properties. Naturally, this has a negative effect on their overall performance. The performance of all matching systems regarding their precision, recall and F1 -measure is Table 9 Anatomy results, ordered by F-measure. Runtime is measured in seconds; “size” is the number of correspondences in the generated alignment. System Runtime Size Precision F-measure Recall Recall+ Coherent Matcha 37 1482 0.951 0.941 0.93 0.817 - SEBMatcher 35602 1402 0.945 0.908 0.874 0.674 - √ LogMapBio 1183 1596 0.873 0.895 0.919 0.787 √ LogMap 9 1402 0.917 0.881 0.848 0.602 AMD 160 1299 0.953 0.88 0.817 0.522 - ALIN 374 1159 0.984 0.852 0.752 0.501 - LogMapLt 3 1147 0.962 0.828 0.728 0.288 - ATMatcher 156 1037 0.978 0.794 0.669 0.133 - StringEquiv - 946 0.997 0.766 0.622 0.000 - LSMatch 20 1009 0.952 0.761 0.634 0.037 - ALIOn 26134 1913 0.364 0.407 0.46 0.136 - Table 10 The highest average F[0.5|1|2] -measure and their corresponding precision and recall for each matcher with its F1 -optimal threshold (ordered by F1 -measure). Inc.Align. means number of incoherent alignments. Conser.V. means total number of all conservativity principle violations. Consist.V. means total number of all consistency principle violations. System Prec. F0.5 -m. F1 -m. F2 -m. Rec. Inc.Align. Conser.V. Consist.V. LogMap 0.76 0.71 0.64 0.59 0.56 0 21 0 GraphMatcher 0.75 0.7 0.63 0.58 0.55 6 21 61 SEBMatcher 0.79 0.7 0.6 0.52 0.48 4 6 50 ATMatcher 0.69 0.64 0.59 0.54 0.51 1 72 8 ALIN 0.82 0.7 0.57 0.48 0.44 0 2 0 LogMapLt 0.68 0.62 0.56 0.5 0.47 3 97 18 edna 0.74 0.66 0.56 0.49 0.45 AMD 0.82 0.68 0.55 0.46 0.41 1 2 6 LSMatch 0.83 0.69 0.55 0.46 0.41 0 2 0 StringEquiv 0.76 0.65 0.53 0.45 0.41 KGMatcher+ 0.83 0.67 0.52 0.43 0.38 0 1 0 ALIOn 0.66 0.44 0.3 0.22 0.19 3 17 49 TOMATO 0.09 0.11 0.16 0.28 0.6 15 4 777 Matcha 0.37 0.2 0.12 0.08 0.07 2 3 24 plotted in Figure 1. Systems are represented as squares or triangles, whereas the baselines are represented as circles. The Conference evaluation results using the uncertain reference alignments are presented in Table 11. Out of the 12 alignment systems, 8 (ALIN, ALION, AMD, KGMatcher+, LogMapLt, LSMatch, SEBMatcher, TOMATO) use 1.0 as the confidence value for all matches they identify. The remaining 4 systems (ATMatcher, GraphMatcher, LogMap, Matcha) have a wide variation ALIN AMD ATMatcher KGMatcher+ LogMap F1 -measure=0.7 LogMapLt LSMatch F1 -measure=0.6 SEBMatcher GraphMatcher F1 -measure=0.5 edna StringEquiv rec=1.0 rec=.8 rec=.6 pre=.6 pre=.8 pre=1.0 Figure 1: Precision/recall triangular graph for the conference test case. 
Dotted lines depict level of precision/recall while values of F1 -measure are depicted by areas bordered by corresponding lines F1 -measure=0.[5|6|7]. Table 11 F-measure, precision, and recall of the different matchers when evaluated using the sharp (ra1), discrete uncertain and continuous uncertain metrics. Sharp Discrete Continuous System Prec F-ms Rec Prec F-ms Rec Prec F-ms Rec ALIN 0.88 0.61 0.47 0.88 0.70 0.59 0.87 0.71 0.60 ALION 0.75 0.34 0.22 0.75 0.40 0.27 0.75 0.41 0.28 AMD 0.87 0.58 0.43 0.87 0.66 0.53 0.86 0.67 0.55 ATMatcher 0.74 0.62 0.53 0.77 0.67 0.59 0.76 0.68 0.62 GraphMatcher 0.80 0.67 0.57 0.76 0.72 0.68 0.75 0.72 0.68 KGMatcher+ 0.88 0.55 0.40 0.88 0.64 0.50 0.88 0.65 0.51 LogMap 0.81 0.68 0.58 0.81 0.70 0.62 0.80 0.66 0.57 LogMapLt 0.73 0.59 0.50 0.73 0.67 0.62 0.72 0.67 0.63 LSMatch 0.88 0.57 0.42 0.87 0.66 0.53 0.88 0.67 0.54 Matcha 0.38 0.13 0.08 0.35 0.14 0.09 0.35 0.12 0.08 SEBMatcher 0.84 0.63 0.50 0.81 0.70 0.61 0.81 0.71 0.62 TOMATO 0.09 0.16 0.63 0.08 0.15 0.74 0.08 0.15 0.73 of confidence values. When comparing the performance of the matchers on the uncertain reference alignments versus that on the sharp version, we see that in the discrete case all matchers performed the same or better in terms of F-measure. Changes in F-measure of discrete cases ranged from 3 to 18 percent over the sharp reference alignment. ALION is the system whose performance surges Table 12 Threshold, F-measure, precision, and recall of systems when evaluated using reference alignment for (filtered) DBpedia to OntoFarm ontologies System Thres. Prec. F0 .5-m. F1 -m. F2 -m. Rec. LogMap 0.59 0.52 0.55 0.61 0.68 0.73 ATMatcher 0.76 0.5 0.52 0.55 0.58 0.6 KGMatcher+ 0 0.5 0.52 0.55 0.58 0.6 LSMatch 0 0.5 0.52 0.55 0.58 0.6 edna 0.91 0.34 0.38 0.45 0.56 0.67 StringEquiv 0 0.32 0.35 0.42 0.51 0.6 LogMapLt 0 0.23 0.26 0.34 0.48 0.67 Matcha 0.8 0.07 0.08 0.09 0.11 0.13 most (18%), followed by KGMatcher+ (16%) and LSMatch (16%). This was predominantly driven by increased recall, which is a result of the presence of fewer ’controversial’ matches in the uncertain version of the reference alignment. The performance of the matchers with confidence values always 1.0 is very similar regardless of whether a discrete or continuous evaluation methodology is used, because many of the matches they find are the ones that the experts had high agreement about, while the ones they missed were the more controversial matches. GraphMatcher produces the highest F-measure under both the continuous (72%) and discrete (72%) evaluation methodologies, indicating that this system’s confidence evaluation does a good job of reflecting cohesion among experts on this task. Of the remaining systems, LogMap has relatively small drops in F-measure when moving from discrete to continuous evaluation, while Matcha drops 14 percent in F-measure. Overall, in comparison with last year, the F-measures of most returning matching systems essentially held constant when evaluated against the uncertain reference alignments. ALIN, ALION, GraphMatcher, Matcha, SEBMatcher are 6 new systems participating in this year. ALION’s performance increases 18 percent in discrete case and 20 percent in continuous case in terms of F-measure over the sharp reference alignment from 0.34 to 0.40 and 0.41 respectively, which it is mainly driven by increased recall. ALIN, GraphMatcher, and SEBMatcher also perform significantly better in both discrete and continuous cases compared to sharp case in term of F-measure. 
This is also mostly driven by increased recall. From the results, Matcha obtains low precision and recall across all three versions of the reference alignment, mainly because it sets its threshold to zero and assigns relatively high confidence to matches even when the labels of the two entities have low string similarity; for example, "hasBid" and "hasPart" have a similarity over 0.63, and "addedBy" and "awarded by" have a similarity over 0.66. As expected, it achieves slightly better recall when moving from the sharp to the discrete case (13%), but precision and F-measure both drop slightly. TOMATO achieves better recall in both the discrete and continuous cases, but its precision is significantly lower than that of the other systems, because it outputs multiple matches for the same entity and assigns them a confidence value of 1.0. This year we again conducted the experiment of matching the cross-domain DBpedia ontology to three OntoFarm ontologies. The DBpedia ontology has been filtered to the dbpedia namespace, since we focused only on entities of the DBpedia ontology (dbo). In order to evaluate the resulting alignments, we prepared a reference alignment of DBpedia to three OntoFarm ontologies (ekaw, sigkdd and confOf), as explained in [54]. Out of 12 systems, 6 (ATMatcher, KGMatcher+, LogMap, LogMapLt, LSMatch, and Matcha) managed to match DBpedia to the OntoFarm ontologies. We evaluated the alignments from the systems and the results are in Table 12. Additionally, we added two baselines: StringEquiv, a string matcher based on string equality applied on lowercased local names of entities, and edna, a string editing distance matcher. We can see that four systems (LogMap, ATMatcher, KGMatcher+, and LSMatch) perform better than the two baselines. LogMap dominates with an F1-measure of 0.61. With the exception of KGMatcher+, most systems achieve lower scores than in the case of matching the domain ontologies. This shows that these test cases are more difficult for traditional ontology matching systems. 4.4. Multifarm This year, 5 systems have registered to participate in the MultiFarm track: CIDER-LM, LSMatch, LSMatch Multilingual, LogMap and LogMapLt. The number of participating tools is stable with respect to the last 4 campaigns (6 in 2021, 6 in 2020, 5 in 2019, 6 in 2018, 8 in 2017, 7 in 2016, 5 in 2015, 3 in 2014, 7 in 2013, and 7 in 2012). This year, we lost the participation of ALOD2Vec, AML, ATMatcher and Wiktionary, but we received new participation from CIDER-LM, LSMatch and LSMatch Multilingual. The reader can refer to the OAEI papers for a detailed description of the strategies adopted by each system. The MultiFarm evaluation results based on the blind dataset are presented in Table 13, which shows the aggregated results for the matching tasks.
Table 13
MultiFarm aggregated results per matcher, for the different-ontologies matching tasks (type i). Time is measured in minutes.
System Time(Min) Prec. F-m. Rec.
CIDER-LM 157 .16 .25 .58
LSMatch 33 .24 .038 .21
LSMatch Multilingual 69 .68 .47 .36
LogMap 9 .72 .44 .31
LogMapLt 175 .24 .038 .02
The results have been computed using the MELT framework without applying any threshold to the results. They are measured in terms of macro precision and recall. The results of non-specific systems are not reported here, as we could observe in the last campaigns that they can have intermediate results in tests of type ii) (same ontologies task) and poor performance in tests of type i) (different ontologies task).
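Since MultiFarm reports macro-averaged precision and recall over many language-pair tasks, while other evaluations in this paper (e.g., the interactive Conference results) use micro-averages over ontology pairs, the following minimal sketch (our illustration, with made-up task data) contrasts the two aggregation schemes.

```python
# Macro vs. micro averaging of precision/recall over several matching tasks.
# Each task is a (system_alignment, reference_alignment) pair of correspondence sets.
def macro_micro(tasks):
    precisions, recalls = [], []
    tp_total = sys_total = ref_total = 0
    for system, reference in tasks:
        system, reference = set(system), set(reference)
        tp = len(system & reference)
        precisions.append(tp / len(system) if system else 0.0)
        recalls.append(tp / len(reference) if reference else 0.0)
        tp_total += tp
        sys_total += len(system)
        ref_total += len(reference)
    macro = (sum(precisions) / len(precisions), sum(recalls) / len(recalls))
    micro = (tp_total / sys_total if sys_total else 0.0,
             tp_total / ref_total if ref_total else 0.0)
    return macro, micro   # macro weights each task equally, micro each correspondence

# Two tasks of very different sizes: macro and micro averages diverge.
small = ({("a", "x"), ("b", "y")}, {("a", "x"), ("b", "z")})
large = ({(f"c{i}", f"w{i}") for i in range(100)}, {(f"c{i}", f"w{i}") for i in range(80)})
print(macro_micro([small, large]))
```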
In terms of runtime, the results are not comparable to those from last year, as the systems have been run in a different environment in terms of memory and number of processors. In addition, this year the MELT framework was used instead of SEALS. The systems have been executed on an Ubuntu Linux machine with 32GB of RAM and an Intel Core CPU at 2.00GHz with 8 processors. All measurements are based on a single run. As in each campaign, we observed large differences in the time required for a system to complete the 55 x 24 matching tasks: CIDER-LM (157 minutes), LSMatch (33 min), LSMatch Multilingual (69 min), LogMap (9 minutes) and LogMapLt (175 minutes). When we compare the times to last year's campaign, we can see that LogMap has a stable 9 min execution, whereas LogMapLt improved its timing from 212 min to 175 min. Since the other tools are participating for the first time, their timings are not comparable. These measurements are only indicative of the time the systems required for finishing the task in a common environment. LSMatch Multilingual outperformed all other systems in terms of F-measure (0.47), whereas CIDER-LM outperformed all other systems in terms of recall (0.58) and LogMap outperformed all other systems in terms of precision (0.72). A similar number of systems has participated in the campaign through the years; however, the set of participating tools varies from year to year. In 2022, we had 7 systems participating in the campaign, of which 5 were new systems and 2 were long-term participating systems. As observed in several campaigns, all systems still privilege precision to the detriment of recall (recall below 0.50), and the results are below the ones obtained for the original Conference dataset. 4.5. Complex Matching Unfortunately, this track has not attracted many participants since last year. This year, only MatchaC and AMD (for some complex subtracks) registered to participate. We lost AMLC, AROA and CANARD, and gained the newcomer MatchaC. The Conference subtrack of the complex track had only one participant, MatchaC. However, MatchaC failed to generate alignments. The Hydrography, GeoLink, and Populated Enslaved datasets, as introduced before, have also been discontinued after the announcement of the datasets. For the Taxon dataset, as for the Conference dataset, MatchaC and AMD failed to generate alignments. Contrary to last year, this year we did not run the systems that generate simple alignments, as simple alignments for this task are usually rather obvious. 4.6. Food This is the first year of the track and four systems were registered: AMD, LogMap, LogMapLite and Matcha. The evaluation results are presented in Table 14.
Table 14
Food results per matcher. Time is measured in seconds.
System Corresp. Precision Recall F1-measure Time(s)
AMD 0
LogMap 71 0.0779 0.0822 0.0805 8
LogMapLite 658 0 0 0 14
Matcha 288 0.0833 0.3287 0.13296 38
The test case evaluates matching systems regarding their capability to find "equal" (=) correspondences between the CIQUAL ontology and the SIREN ontology. None of the evaluated systems finds correspondences other than "equal" (=). All evaluated systems compute the alignment in less than a minute. LogMapLite stands out for its high number of correspondences, with performance indicators all equal to zero. LogMap stands out for its very fast computation time of 8s. LogMap and Matcha have similar results for precision.
However, LogMap's recall is four times lower than Matcha's. Matcha is the best performing participant in the Food test case in terms of precision, recall and F1-measure.

4.7. Interactive matching
This year, two systems (ALIN and LogMap) participated in the Interactive matching track. Their results are shown in Table 15 and Figure 2 for both the Anatomy and Conference datasets.

Table 15
Interactive matching results for the Anatomy and Conference datasets.

Tool     Error   Prec.   Rec.    F-m.    Rec.+   Prec.    Rec.     F-m.     Tot.    Dist.   Pos.    Neg.
                                                 oracle   oracle   oracle   Reqs.   Mapps   Prec.   Prec.
Anatomy Dataset
ALIN     NI      0.983   0.726   0.835   0.438   –        –        –        –       –       –       –
         0.0     0.987   0.92    0.952   0.787   0.987    0.92     0.952    579     1453    1.0     1.0
         0.1     0.91    0.9     0.905   0.754   0.987    0.921    0.953    551     1383    0.661   0.976
         0.2     0.847   0.883   0.865   0.727   0.988    0.924    0.955    536     1350    0.472   0.947
         0.3     0.793   0.865   0.827   0.698   0.988    0.925    0.956    529     1325    0.346   0.914
LogMap   NI      0.915   0.847   0.88    0.602   –        –        –        –       –       –       –
         0.0     0.988   0.846   0.912   0.595   0.988    0.846    0.912    388     1164    1.0     1.0
         0.1     0.967   0.831   0.893   0.565   0.971    0.803    0.879    388     1164    0.749   0.965
         0.2     0.949   0.823   0.882   0.553   0.948    0.761    0.844    388     1164    0.564   0.925
         0.3     0.938   0.817   0.873   0.543   0.93     0.727    0.816    388     1164    0.439   0.88
Conference Dataset
ALIN     NI      0.874   0.456   0.599   –       –        –        –        –       –       –       –
         0.0     0.919   0.744   0.822   –       0.919    0.744    0.822    309     815     1.0     1.0
         0.1     0.704   0.706   0.705   –       0.935    0.775    0.847    300     787     0.507   0.991
         0.2     0.569   0.663   0.612   –       0.944    0.796    0.863    291     764     0.307   0.972
         0.3     0.476   0.636   0.545   –       0.951    0.812    0.876    283     741     0.209   0.958
LogMap   NI      0.801   0.58    0.67    –       –        –        –        –       –       –       –
         0.0     0.886   0.61    0.723   –       0.886    0.61     0.723    82      246     1.0     1.0
         0.1     0.852   0.598   0.703   –       0.861    0.574    0.689    82      246     0.688   0.978
         0.2     0.81    0.584   0.679   –       0.828    0.546    0.658    82      246     0.494   0.94
         0.3     0.799   0.587   0.677   –       0.804    0.516    0.629    82      246     0.366   0.901

NI stands for non-interactive, and refers to the results obtained by the matching system in the original track. The table includes the following information (column names within parentheses):
– The performance of the system: Precision (Prec.), Recall (Rec.) and F-measure (F-m.) with respect to the fixed reference alignment, as well as Recall+ (Rec.+) for the Anatomy task. To facilitate the assessment of the impact of user interactions, we also provide the performance results from the original tracks, without interaction (line with Error NI).
– To ascertain the impact of the oracle errors, we provide the performance of the system with respect to the oracle (i.e., the reference alignment as modified by the errors introduced by the oracle): Precision oracle (Prec. oracle), Recall oracle (Rec. oracle) and F-measure oracle (F-m. oracle). For a perfect oracle these values match the actual performance of the system.
– Total requests (Tot. Reqs.) represents the number of distinct user interactions with the tool, where each interaction can contain one to three conflicting correspondences that could be analysed simultaneously by a user.
– Distinct correspondences (Dist. Mapps) counts the total number of correspondences for which the oracle gave feedback to the user (regardless of whether they were submitted simultaneously or separately).
– Finally, the performance of the oracle itself with respect to the errors it introduced can be gauged through the positive precision (Pos. Prec.) and negative precision (Neg. Prec.), which measure respectively the fraction of positive and negative answers given by the oracle that are correct. For a perfect oracle these values are equal to 1 (or 0, if no questions were asked).
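To make the positive and negative precision of the oracle concrete, the following minimal Python sketch simulates an error-prone oracle and computes both measures. This is illustrative only and is not the evaluation code used in the track; the correspondence identifiers below are hypothetical.

import random

def simulate_oracle(candidates, reference, error_rate, seed=0):
    # Illustrative only: answer whether each candidate correspondence is in the
    # reference alignment, flipping each answer with probability error_rate.
    rng = random.Random(seed)
    return {c: ((c in reference) if rng.random() >= error_rate else (c not in reference))
            for c in candidates}

def oracle_precision(answers, reference):
    # Pos./Neg. precision: fraction of the oracle's positive (resp. negative)
    # answers that agree with the reference; 0 if no such answers were given.
    pos = [c for c, a in answers.items() if a]
    neg = [c for c, a in answers.items() if not a]
    pos_prec = sum(c in reference for c in pos) / len(pos) if pos else 0.0
    neg_prec = sum(c not in reference for c in neg) / len(neg) if neg else 0.0
    return pos_prec, neg_prec

# Hypothetical correspondences for illustration only
reference = {("mouse:limb", "human:limb"), ("mouse:brain", "human:brain")}
candidates = [("mouse:limb", "human:limb"), ("mouse:tail", "human:coccyx"),
              ("mouse:brain", "human:brain")]
answers = simulate_oracle(candidates, reference, error_rate=0.2)
print(oracle_precision(answers, reference))

With error_rate set to 0.0, the sketch yields 1.0 for both measures, corresponding to the perfect-oracle case described above.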
The figure shows the time intervals between the questions to the user/oracle for the different systems and error rates. Different runs are depicted with different colors.
The matching systems that participated in this track employ different user-interaction strategies. While LogMap makes use of user interactions exclusively in the post-matching steps to filter its candidate correspondences, ALIN can also add new candidate correspondences to its initial set. LogMap requests feedback only on selected correspondence candidates (based on their similarity patterns or their involvement in unsatisfiabilities). ALIN and LogMap can both ask the oracle to analyze several conflicting correspondences simultaneously.
The performance of the systems usually improves when interacting with a perfect oracle in comparison with no interaction. ALIN is the system that improves the most, because of its high number of oracle requests and because its non-interactive performance was the lowest of the interactive systems, and thus the easiest to improve. Although system performance deteriorates when the error rate increases, there are still benefits from the user interaction: some of the systems' measures stay above their non-interactive values even for the larger error rates. Naturally, the more a system relies on the oracle, the more its performance tends to be affected by the oracle's errors. The impact of the oracle's errors is linear for ALIN in most tasks, as the F-measure according to the oracle remains approximately constant across all error rates. It is supra-linear for LogMap in all datasets.
Another aspect that was assessed was the response time of systems, i.e., the time between requests. Two models of system response time are frequently used in the literature [55]: Shneiderman and Seow take different approaches to categorizing response times, taking a task-centered view and a user-centered view, respectively. According to task complexity, Shneiderman defines response time in four categories: typing and mouse movement (50-150 ms), simple frequent tasks (1 s), common tasks (2-4 s) and complex tasks (8-12 s). Seow's definition of response time, in turn, is based on the user's expectations towards the execution of a task: instantaneous (100-200 ms), immediate (0.5-1 s), continuous (2-5 s) and captive (7-10 s). Ontology alignment is a cognitively demanding task and can fall into the third or fourth categories in both models.

Figure 2: Time intervals between requests to the user/oracle for the Anatomy (top 4 plots) and Conference (bottom 4 plots) datasets. Whiskers: Q1-1.5IQR, Q3+1.5IQR, IQR=Q3-Q1. The labels under the system names show the average number of requests and the mean time between the requests for the ten runs.

Table 16
Equivalence matching results for OMIM-ORDO (Disease) in the Bio-ML track

              Unsupervised (90% Test Mappings)                 Semi-supervised (70% Test Mappings)
System        Precision   Recall   F-score   MRR     Hits@1    Precision   Recall   F-score   MRR     Hits@1
LogMap        0.827       0.498    0.622     0.803   0.742     0.783       0.547    0.644     0.821   0.743
LogMap-Lite   0.935       0.259    0.405     -       -         0.932       0.519    0.667     -       -
AMD           0.664       0.565    0.611     -       -         0.792       0.528    0.633     -       -
BERTMap       0.730       0.572    0.641     0.873   0.817     0.575       0.784    0.664     0.965   0.947
BERTMap-Lite  0.819       0.499    0.620     0.776   0.729     0.775       0.713    0.743     0.900   0.876
Matcha        0.743       0.508    0.604     -       -         0.704       0.564    0.626     -       -
Matcha-DL     -           -        -         -       -         0.956       0.615    0.748     0.654   0.640
ATMatcher     0.940       0.247    0.391     -       -         0.835       0.286    0.426     -       -
LSMatch       0.650       0.221    0.329     -       -         0.877       0.238    0.374     -       -
In this regard, the response times (request intervals, as we call them above) observed in all datasets fall within tolerable and acceptable response times, and even into the first categories, in both models. The request intervals for LogMap and ALIN stay at a few milliseconds for most datasets. It could be the case, however, that a user would not be able to take advantage of these low response times, because the task complexity may result in a higher user response time (i.e., the time the user needs to respond to the system after the system is ready).

4.8. Bio-ML
Our results include ten tables in total: five tables for equivalence matching and five tables for subsumption matching, where each table corresponds to an OM pair and includes results of both the unsupervised and semi-supervised settings. See Table 16 for the equivalence matching results of OMIM-ORDO (Disease) and Table 17 for the subsumption matching results of SNOMED-NCIT (Neoplas). For the full results, please refer to the OAEI 2022 website or the Bio-ML website33.
Briefly, we have the following participants for equivalence matching: (i) machine learning-based systems including BERTMap [52], BERTMap-Lite [52], AMD [56], Matcha [57] and Matcha-DL [57]; and (ii) traditional systems including LogMap [58], LogMap-Lite, ATMatcher [59] and LSMatch [60]. Note that -Lite means a lightweight version using literal matching. Machine learning-based systems generally perform better, with Matcha-DL attaining the best F1 on 4 out of 5 semi-supervised tasks, and BERTMap (and BERTMap-Lite) attaining the best F1 on 4 out of 5 unsupervised tasks as well as the best ranking scores on all tasks. For subsumption matching, there are three participants: Word2Vec plus Random Forest (RF), OWL2Vec* [61] plus RF, and BERTSubs (IC) [53]. BERTSubs performs the best on 2 out of 5 subsumption tasks, while OWL2Vec* performs the best on the remaining 3. Overall, subsumption matching is more challenging than equivalence matching, as seen from the generally lower scores. This is intuitive because simple string similarity patterns can play a role in equivalence matching but are less likely to do so in subsumption matching.
33 https://www.cs.ox.ac.uk/isg/projects/ConCur/oaei/2022/

Table 17
Subsumption matching results for SNOMED-NCIT (Neoplas) in the Bio-ML track

                Unsupervised (90% Test Mappings)       Semi-supervised (70% Test Mappings)
System          MRR     Hits@1   Hits@5   Hits@10      MRR     Hits@1   Hits@5   Hits@10
Word2Vec+RF     0.512   0.368    0.694    0.834        0.577   0.433    0.773    0.880
OWL2Vec*+RF     0.603   0.461    0.782    0.860        0.666   0.547    0.827    0.880
BERTSubs (IC)   0.530   0.333    0.786    0.948        0.638   0.463    0.859    0.953

4.9. Biodiversity and Ecology
This year, only the LogMap family systems (LogMap, LogMapBio and LogMapLt), alongside Matcha, managed to generate an output for at least one of the track tasks. As in previous editions, we used precision, recall and F-measure to evaluate the performance of the participating systems. The results for the Biodiversity and Ecology track are shown in Table 18. In comparison to the previous year, a smaller number of systems succeeded in generating alignments for the track tasks. The results of the participating systems are comparable to last year in terms of F-measure. In terms of run time, LogMapBio took longer due to the loading of mediating ontologies from BioPortal. Regarding the ENVO-SWEET task, only the LogMap family systems completed it, with a similar performance to last year. The ANAEETHES-GEMET and AGROVOC-NALT matching tasks have the particularity of involving resources developed in SKOS.
Only LogMapLt could handle these tasks, based on ontology files resulting from an automatic transformation of the SKOS files into OWL. For the transformation, we made use of source code directly derived from the AML ontology parsing module, kindly provided to us by its developers. LogMap, LogMapLt and Matcha performed well on most NCBITAXON-TAXREF-LD subtasks, with roughly the same levels of precision and recall; the largest subtask could only be handled by LogMapLt. Overall, in this fifth evaluation, the number of participating systems decreased considerably and the performance of the successful ones remained similar.

Table 18
Results for the Biodiversity & Ecology track.

System      Time (HH:MM:SS)   Number of mappings   Precision   Recall   F-measure
ENVO-SWEET task
LogMap      00:00:25          676                  0.781       0.656    0.713
LogMapBio   01:00:03          697                  0.753       0.652    0.699
LogMapLt    00:07:32          576                  0.829       0.594    0.692
ANAEETHES-GEMET task
LogMapLt    00:00:03          182                  0.840       0.458    0.593
AGROVOC-NALT task
LogMapLt    00:00:10          19185                0.744       0.953    0.836
NCBITAXON-TAXREFLD Animalia task
LogMapLt    00:00:43          72010                0.665       0.993    0.796
NCBITAXON-TAXREFLD Bacteria task
LogMap      00:00:01          304                  0.575       1.0      0.730
LogMapLt    00:00:00          290                  0.6         0.994    0.748
Matcha      00:00:05          303                  0.577       1.0      0.732
NCBITAXON-TAXREFLD Chromista task
LogMap      00:00:04          2218                 0.623       0.985    0.764
LogMapLt    00:00:01          2165                 0.637       0.982    0.773
Matcha      00:00:15          2219                 0.623       0.984    0.763
NCBITAXON-TAXREFLD Fungi task
LogMap      00:00:39          12949                0.783       0.998    0.878
LogMapLt    00:00:07          12929                0.783       0.997    0.877
Matcha      00:00:51          12936                0.785       0.999    0.879
NCBITAXON-TAXREFLD Plantae task
LogMapLt    00:00:17          26359                0.746       0.987    0.849
Matcha      00:01:15          26675                0.741       0.992    0.848
NCBITAXON-TAXREFLD Protozoa task
LogMap      00:00:01          496                  0.719       1.0      0.837
LogMapLt    00:00:00          477                  0.746       0.997    0.853
Matcha      00:00:11          494                  0.722       1.0      0.839

4.10. Material Sciences and Engineering (MSE)
This year five systems registered on the MSE track, each of which was used for evaluation with the three test cases of the MSE benchmark. AMD produced errors and an empty alignment file, so results are only available for four of the matchers: A-LIOn, LogMap, LogMapLt, Matcha. The evaluation results are shown in Table 19.
The first test case evaluates matching systems regarding their capability to find "equal" (=), "superclass" (>) and "subclass" (<) correspondences between the mid-sized MatOnto and the small-sized (since reduced) MaterialInformation ontology. None of the evaluated systems finds correspondences other than "equal" (=). All evaluated systems compute the alignment in less than a minute. LogMap stands out for its very fast calculation time of 9s and the maximum precision of 1.0. However, since LogMap found only one correspondence, its recall and hence its F1-measure are low (0.083). In direct comparison, LogMapLt takes three times as long to compute the alignment and achieves much lower precision (0.4), but due to a greater number of correctly found correspondences its F1-measure is the best of the tested systems in the first test case, although still low at 0.142. A-LIOn finds the highest number of correspondences, but of those 23 found correspondences 20 are false positives, which results in the second best F1-measure at the slowest pace. Matcha finds 4 incorrect correspondences and is thus the worst performing participant in the first MSE test case. Investigating the alignment produced by Matcha, classes are wrongly matched to object properties, e.g. "Temperature" = "hasTemperature".
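The kind of class-to-property mismatch observed for Matcha can be detected with a simple post-hoc check. The sketch below is illustrative Python using rdflib and is not part of the MSE evaluation; the file names and the correspondence list are hypothetical. It flags correspondences whose two entities are declared with different kinds.

from rdflib import Graph, URIRef
from rdflib.namespace import OWL, RDF

def entity_kind(graph, uri):
    # Rough typing based on declared rdf:type: 'class', 'property' or 'other'.
    types = set(graph.objects(URIRef(uri), RDF.type))
    if OWL.Class in types:
        return "class"
    if types & {OWL.ObjectProperty, OWL.DatatypeProperty, RDF.Property}:
        return "property"
    return "other"

def cross_kind_matches(source_file, target_file, correspondences):
    # Return correspondences whose source and target entities have different
    # kinds, e.g. a class matched to an object property.
    src, tgt = Graph().parse(source_file), Graph().parse(target_file)
    return [(s, t) for s, t in correspondences
            if entity_kind(src, s) != entity_kind(tgt, t)]

# Hypothetical usage:
# suspicious = cross_kind_matches("MaterialInformation.owl", "MatOnto.owl",
#                                 [("...#Temperature", "...#hasTemperature")])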
The second test case evaluates the matching systems on finding correspondences between the large-sized MaterialInformation ontology and the mid-sized BFO-based MatOnto. In comparison to the first test case, two of the four evaluated systems (A-LIOn, Matcha) need much longer to calculate the alignment. Surprisingly, two of the systems (LogMap, LogMapLight) are even quicker than in the first test case. A-LIOn finds a large number of correspondences and hence has the highest recall of the evaluated systems, but 100 out of the 163 found correspondences are incorrect. This results in a moderate F1-measure of 0.271 and a rather slow calculation time of over 3 minutes. LogMap stands out again for its very fast computation time of only 3s at a high precision of 0.881. Since LogMap found only 59 correct correspondences out of the 302 reference correspondences, the recall is rather low, but the F1-measure is still the highest of the tested systems. LogMapLt is almost 30 times slower than LogMap but finds the same number of correspondences with 2 additional false positives, so it achieves a slightly lower overall F1-measure than LogMap. Matcha finds 6 wrong correspondences where classes are matched to object properties, as in the first test case.
The third test case evaluates matching systems on finding correspondences between the large-sized MaterialInformation and the mid-sized EMMO. All evaluated systems compute the alignments in under 3 minutes. Surprisingly, A-LIOn takes the longest to compute the alignments but does not find any correspondences, which might be due to reasoning errors that were produced for EMMO. LogMap again stands out for its fast computation time and high precision, with 53 correct correspondences out of the 56 found in total. Although LogMap misses 10 reference correspondences, its F1-measure of 0.891 is the best of the whole MSE track. LogMapLt is 6 times slower than LogMap, with a slightly lower precision and 2 additional false positives. Due to those, LogMapLt achieves a slightly worse F1-measure of 0.857, which is still the second best of the whole MSE track. Matcha finds 2 correct correspondences out of the 63 reference correspondences, which results in the only non-zero recall for Matcha in the MSE track, along with a fair precision of 0.5 at a rather fast calculation time of 21s.

Table 19
Results for the three test cases of the MSE track.

System     Corresp.   Precision   Recall   F1-Measure   Time [s]
First Test Case
A-LIOn     23         0.130       0.130    0.130        38
LogMap     1          1.000       0.043    0.083        9
LogMapLt   5          0.400       0.087    0.143        27
Matcha     4          0.000       0.000    0.000        22
Second Test Case
A-LIOn     163        0.387       0.209    0.271        208
LogMap     67         0.881       0.195    0.320        3
LogMapLt   67         0.851       0.189    0.309        83
Matcha     6          0.000       0.000    0.000        15
Third Test Case
A-LIOn     0          0.000       0.000    0.000        135
LogMap     56         0.946       0.841    0.891        14
LogMapLt   56         0.911       0.810    0.857        84
Matcha     4          0.500       0.032    0.060        21

In summary, LogMap stands out for its very fast computing speed combined with very high precision. LogMapLt is significantly slower in every test case and almost consistently shows worse results; only in the first test case is the recall of LogMapLight higher than that of LogMap. In our opinion, LogMap is definitely recommended for MSE applications where high precision is demanded. In comparison, LogMapLight does not appear to bring any decisive advantage over LogMap. Matcha in its current implementation is not recommended for MSE applications, since it matches classes to properties. A-LIOn produces moderate results but does not bring any advantage over LogMap.
Furthermore, A-LIOn produces errors while reasoning on EMMO. The latter is the only one of the MSE ontologies used with a significant proportion of essential axioms. According to the annotations in EMMO, this ontology can only be reasoned over with the FaCT++ reasoner, which might be a cause of the reasoning errors of A-LIOn and its bad results in the third test case. None of the evaluated matchers finds all reference correspondences correctly.

4.11. Common Knowledge Graphs
We evaluated all the participating systems that were packaged as SEALS packages or as web services using Docker (even those not registered to participate in this new track). However, not all systems were able to complete the task, as some systems finished with an empty alignment file. Here, we include the results of the 8 systems that were able to finish the task within the 24-hour time limit with a non-empty alignment file: LogMap, ATMatcher, Matcha, KGMatcher+, LogMapLite, LogMapKG, LsMatch, and AMD.
Table 20 shows the aggregated results on the two datasets for systems that produced non-empty alignment files. The size column indicates the total number of class alignments discovered by each system. While the majority of the systems discovered alignments at both schema and instance levels, we have only evaluated class alignments, as the two gold standards do not include any instance-level ground truth. Further, not all systems were able to handle the original dataset versions (i.e., those with all annotated instances). On the NELL-DBpedia test case, LogMap, ATMatcher, KGMatcher+, and AMD were able to generate results when applied to the full-size dataset, while on the YAGO-Wikidata dataset, which is large-scale compared to the first one, only ATMatcher, KGMatcher+, and AMD were able to generate alignments with the original dataset. Other systems either failed to complete the task within the allocated 24-hour time limit (LogMapLite, Matcha, and LsMatch) or produced an empty alignment file (Matcha and LogMap, only on the YAGO-Wikidata dataset). LogMapKG, on the other hand, tends to only align instances when it is applied to the full-size datasets. Similar to the 2021 evaluation results, AMD does generate schema alignments but in the wrong format; therefore, they cannot be evaluated. The resulting alignment files from all the participating systems are available for download on the track's results webpage34. On the NELL-DBpedia dataset, all systems were able to outperform the basic string matcher in terms of F-measure, except for LogMapLite.
On the YAGO-Wikidata dataset, two systems, LogMapLite and LsMatch, were not able to outperform the baseline. Similar to last year, KGMatcher+ outperforms the other systems in terms of F-measure, achieving an F-measure of 0.95 on the NELL-DBpedia test case and 0.91 on the YAGO-Wikidata test case. While other systems were not able to improve on last year's results, Matcha improved on last year's AML results, with an F-measure of 0.90 on the NELL-DBpedia test case and 0.89 on YAGO-Wikidata. Table 20 also reports the run time as HH:MM:SS, where we can observe that all matchers were able to finish the task in less than 30 minutes except for KGMatcher+ and LogMapLite. Finally, the dataset size column identifies whether the system was able to perform on the original dataset or only on the smaller version.
34 https://oaei.ontologymatching.org/2022/results/commonKG/index.html

Table 20
Results for the Common Knowledge Graphs track

NELL-DBpedia
Matcher           Size   Precision   Recall   F1-measure   Time       Dataset Size
LogMap            105    0.99        0.80     0.88         00:03:17   original
ATMatcher         104    1.00        0.80     0.89         00:03:10   original
Matcha            104    1.00        0.81     0.90         00:01:00   small
KGMatcher+        117    1.00        0.91     0.95         02:43:50   original
LogMapLite        77     1.00        0.60     0.75         00:26:19   small
LogMapKG          104    0.98        0.80     0.88         00:00:00   small
AMD               102    0.00        0.00     0.00         00:00:23   original
LsMatch           101    0.96        0.75     0.84         00:00:52   small
String Baseline   78     1.00        0.60     0.75         00:00:37   original
YAGO-Wikidata
LogMap            233    1.00        0.76     0.86         00:01:19   small
ATMatcher         233    1.00        0.77     0.87         00:19:04   original
Matcha            243    1.00        0.80     0.89         00:03:18   small
KGMatcher+        253    0.99        0.83     0.91         02:07:59   original
LogMapLite        211    1.00        0.70     0.81         00:48:19   small
LogMapKG          232    1.00        0.76     0.83         00:00:10   small
AMD               125    0.00        0.00     0.00         00:29:04   original
LsMatch           196    0.96        0.63     0.76         00:02:28   small
String Baseline   212    1.00        0.70     0.82         00:00:02   original

4.12. Crosswalks Data Schema Matching
For this first version of the track, four systems registered to participate: AMD, LogMap, LogMapLt and Matcha. Table 21 shows the results for the systems that generated correspondences. AMD was not able to generate any correspondences. LogMap, LogMapLt and Matcha are the only systems able to generate (a few) correct correspondences. The generated correspondences involved mostly classes and properties whose labels are the same, for instance https://schema.org/distribution and http://www.w3.org/ns/dcat#distribution. With respect to the pairs of schemas, the systems generated a higher number of correspondences for DCAT-v3 and RIFCS. LogMapLt is the system able to deal with the highest number of matching pairs. None of the participating systems generated outputs for DC.

Table 21
Results for the Crosswalk task.

Matcha      correct   output   expected
dcat3       3         17       42
datacity    0         4        34

LogMap      correct   output   expected
dcat3       0         12       42
datacity    0         3        34
rifcs       0         11       24
dcat-ap     0         2        34

LogMapLt    correct   output   expected
dcat3       3         41       42
datacity    0         4        34
rifcs       0         9        24
dcat-ap     0         4        34
iso         0         2        42

This task mostly deals with properties of metadata schemas. Still, dealing with properties is a challenging task. This year, as introduced above, we have used the schemas for which an OWL/RDFS serialization is available, as OAEI matching systems are accustomed to that format. However, this does not reflect the reality of the field, as schemas are not usually exposed in such a structured format. This opens the possibility of providing a dedicated task next year.

4.13. Link Discovery
This year the Link Discovery track counted four participants: DS-JedAI, Silk, RADON and DLinker.
DLinker participated for the first time. We divided the Spatial test cases into four suites. In the first two suites (SLL and LLL), the systems were asked to match LineStrings to LineStrings considering a given relation, for 200 and 2K instances of the TomTom and Spaten datasets. In the other two suites (SLP, LLP), the systems were asked to match LineStrings to Polygons (or Polygons to LineStrings, depending on the relation), again for both datasets. Since the precision, recall and F-measure results from all systems were equal to 1.0, we only present results regarding the time performance. The time performance of the matching systems in the SLL, LLL, SLP and LLP suites is shown in Figures 3-4.35 The detailed results can also be found in the HOBBIT git.36 Silk and DS-JedAI did not participate in COVERED BY, and Silk also did not participate in COVERS. DLinker only participated in the EQUALS and OVERLAPS tasks, and only for LineStrings to LineStrings.
In the SLL suite, RADON has the best performance in most cases, except for the Touches and Intersects relations. DS-JedAI seems to need the most time, while Silk has the second best performance. DLinker performs well regarding Overlaps, and also Equals for the Spaten dataset. In the LLL suite, with the increase in the number of instances, we have a clearer view of the capabilities of the systems. In this case, RADON and Silk behave similarly to the small dataset, but it is clearer that the systems need much more time to match instances from the TomTom dataset. On the other hand, DS-JedAI scales quite well on larger datasets, as the Spark start-up time is negligible in comparison to the matching time. RADON still has the best performance in most cases. DLinker scales quite well for Overlaps, and also Equals for the Spaten dataset, following its performance in the SLL suite. In the SLP suite, in contrast to the first two suites, RADON has the best performance for all relations. Silk has the second best time performance, while DS-JedAI needs the most time to complete the matching tasks. All systems need more time for the TomTom dataset, but due to the small size of the instances the time difference is minor. In the LLP suite, RADON again has the best performance in all cases. Again, DS-JedAI scales better on large datasets, thus it needs less time than Silk.
Taking into account the executed test cases, we can identify the capabilities of the tested systems as well as suggest some improvements. Three of the systems participated in most of the test cases, with the exceptions that Silk did not participate in the Covers and Covered By test cases and DS-JedAI did not participate in the Covered By test cases. Some of those systems did not manage to complete some test cases, mostly Disjoint. One system, DLinker, only participated for the Equals and Overlaps relations and only for LineStrings to LineStrings test cases. RADON was the only system that successfully addressed all the tasks, and it had the best performance for the SLP and LLP suites, but it can be improved for the Touches and Intersects relations in the SLL and LLL suites. DS-JedAI addressed most of the tasks, scales better on larger datasets, and can be improved for Overlaps, Touches and Within. Silk can be improved for the Touches, Intersects and Overlaps relations in the SLL and LLL tasks, and for the Disjoint relation in the SLP and LLP tasks. DLinker can be improved with respect to the number of supported relations as well as the supported geometries.
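For readers unfamiliar with the topological relations used in these suites, the following minimal Python sketch illustrates how such relations can be checked for a toy LineString/Polygon pair using the shapely library. It is purely illustrative: it is neither the benchmark generator nor any participant's implementation, and the two geometries are hypothetical.

from shapely.geometry import LineString, Polygon

# Two toy geometries standing in for dataset traces (hypothetical data)
line = LineString([(0, 0), (2, 2)])
poly = Polygon([(1, 1), (3, 1), (3, 3), (1, 3)])

# DE-9IM-style relations corresponding to those evaluated in the Spatial suites
# (covered_by requires shapely >= 1.8)
relations = {
    "Equals":     line.equals(poly),
    "Disjoint":   line.disjoint(poly),
    "Touches":    line.touches(poly),
    "Contains":   poly.contains(line),
    "Within":     line.within(poly),
    "Covers":     poly.covers(line),
    "Covered By": line.covered_by(poly),
    "Intersects": line.intersects(poly),
    "Crosses":    line.crosses(poly),
    "Overlaps":   line.overlaps(poly),
}
for name, holds in relations.items():
    print(f"{name:11s} {holds}")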
In general, all systems needed more time to match the TomTom dataset than the Spaten one, due to the smaller number of points per instance in the latter. Comparing the LineString/LineString to the LineString/Polygon tasks, we can say that all systems needed less time for the former for the Contains, Within, Covers and Covered By relations, more time for the Touches, Intersects and Crosses relations, and approximately the same time for the Disjoint relation.
35 In order to make the diagrams more comprehensible we have excluded the extreme values.
36 https://hobbit-project.github.io/OAEI_2022.html

Figure 3: Time performance for TomTom & Spaten SLL (top) and LLL (bottom) suites for DLinker, RADON, Silk and DS-JedAI.

Figure 4: Time performance for TomTom & Spaten SLP (top) and LLP (bottom) suites for RADON, Silk and DS-JedAI.

4.14. SPIMBENCH
This year, the SPIMBENCH track counted two participants: LogMap and DLinker. DLinker participated for the first time, but only for the Sandbox task, while LogMap participates every year. The evaluation results of the track are shown in Table 22. The results can also be found in the HOBBIT git.37
37 https://hobbit-project.github.io/OAEI_2022.html

Table 22
Results for the SPIMBENCH task.

Sandbox Dataset (380 instances, 10000 triples)
System    F-measure   Precision   Recall   Time (in ms)
LogMap    0.8413      0.9382      0.7625   5699
DLinker   0.7026      0.7907      0.6321   15555

Mainbox Dataset (1800 instances, 50000 triples)
System    F-measure   Precision   Recall   Time (in ms)
LogMap    0.7856      0.8801      0.7094   27140

LogMap has the best performance overall, both in terms of F-measure and run time. Its run time scaled very well with the increase in the number of instances, while we do not have scaling information for DLinker as it did not participate in the large dataset.

4.15. Knowledge Graph
This year we evaluated all participants with the MELT framework in order to cover all possible submission formats, i.e., SEALS and the Web format. First, all systems (even those not registered for the track) were evaluated on a very small matching task38. This revealed that not all systems were able to handle the task, and in the end, 5 matchers could provide results for at least one test case. Similar to previous years, some systems like AMD need a post-processing step on the resulting alignment file before it can be parsed. The reason is that the KGs in the Knowledge Graph track contain special characters, e.g. ampersands. These characters need to be encoded in order to parse the XML-formatted files correctly. The resulting alignments are available for download39.
Table 23 shows the results for all systems, divided into class, property, instance, and overall results. This also includes the number of tasks in which they were able to generate a non-empty alignment (#tasks) and the average number of generated correspondences (size). We report the macro averaged precision, F-measure, and recall results, where we do not distinguish between empty and erroneous (or not generated) alignments. The values in parentheses show the results when considering only non-empty alignments. This year's best overall system is ATMatcher. Its result of 0.85 is still behind the top result over all years, which was 0.87. In terms of F-measure, all other systems are below the baseline that includes the alternative labels. The highest recall is achieved by Matcha (0.88), a new system participating this year. It returns more correspondences than all others (32,844.2 on average) but is only able to match instances in this track.
AMD returned some correspondences but achieved an overall F-measure of 0.0 for all test cases; thus the system is not included in the final table. Detailed results for each test case can be found on the OAEI results page of the track40. Property matches are still not created by all systems. KGMatcher, LogMap, and Matcha do not return any of those mappings. One reason might be that the properties are typed as rdf:Property and not distinguished into owl:ObjectProperty or owl:DatatypeProperty. ATMatcher reaches the best score with an F-measure of 0.96. The number of instance correspondences is in the same range (3,641 - 5,872) for all systems except Matcha (32,844), which thus reaches a high recall value. Regarding runtime, LSMatch (4:17:13) was the slowest system. In comparison to last year, with more than 12 hours, the runtimes of this campaign are rather good and show the scalability of the systems. Besides the baselines (which need around 12 minutes for all test cases), ATMatcher (00:18:48) and LogMap (00:55:52) were the fastest systems.
38 http://oaei.ontologymatching.org/2019/results/knowledgegraph/small_test.zip
39 http://oaei.ontologymatching.org/2022/results/knowledgegraph/knowledgegraph-alignments.zip
40 http://oaei.ontologymatching.org/2022/results/knowledgegraph/index.html

Table 23
Knowledge Graph track results, divided into class, property, instance, and overall performance. For matchers that were not capable of completing all tasks, the numbers in parentheses denote the performance when only averaging across tasks that were completed.

System             Time       # tasks   Size      Prec.         F-m.          Rec.
class performance
ATMatcher          00:18:48   5         25.6      0.97 (0.97)   0.87 (0.87)   0.79 (0.79)
BaselineAltLabel   00:11:37   5         16.4      1.00 (1.00)   0.74 (0.74)   0.59 (0.59)
BaselineLabel      00:11:27   5         16.4      1.00 (1.00)   0.74 (0.74)   0.59 (0.59)
KGMatcher          03:01:17   5         21.2      1.00 (1.00)   0.79 (0.79)   0.66 (0.66)
LogMap             00:55:52   5         19.4      0.93 (0.93)   0.81 (0.81)   0.71 (0.71)
LSMatch            04:17:13   5         23.6      0.97 (0.97)   0.78 (0.78)   0.64 (0.64)
Matcha             02:40:21   4         0.0       0.00 (0.00)   0.00 (0.00)   0.00 (0.00)
property performance
ATMatcher          00:18:48   5         78.8      0.97 (0.97)   0.96 (0.96)   0.95 (0.95)
BaselineAltLabel   00:11:37   5         47.8      0.99 (0.99)   0.79 (0.79)   0.66 (0.66)
BaselineLabel      00:11:27   5         47.8      0.99 (0.99)   0.79 (0.79)   0.66 (0.66)
KGMatcher          03:01:17   5         0.0       0.00 (0.00)   0.00 (0.00)   0.00 (0.00)
LogMap             00:55:52   5         0.0       0.00 (0.00)   0.00 (0.00)   0.00 (0.00)
LSMatch            04:17:13   5         85.6      0.73 (0.73)   0.71 (0.71)   0.69 (0.69)
Matcha             02:40:21   4         0.0       0.00 (0.00)   0.00 (0.00)   0.00 (0.00)
instance performance
ATMatcher          00:18:48   5         4856.6    0.89 (0.89)   0.84 (0.84)   0.80 (0.80)
BaselineAltLabel   00:11:37   5         4674.8    0.89 (0.89)   0.84 (0.84)   0.80 (0.80)
BaselineLabel      00:11:27   5         3641.8    0.95 (0.95)   0.81 (0.81)   0.71 (0.71)
KGMatcher          03:01:17   5         3789.6    0.94 (0.94)   0.82 (0.82)   0.74 (0.74)
LogMap             00:55:52   5         4012.4    0.90 (0.90)   0.78 (0.78)   0.69 (0.69)
LSMatch            04:17:13   5         5872.2    0.66 (0.66)   0.63 (0.63)   0.60 (0.60)
Matcha             02:40:21   4         32844.2   0.53 (0.66)   0.61 (0.76)   0.72 (0.90)
overall performance
ATMatcher          00:18:48   5         4961.0    0.89 (0.89)   0.85 (0.85)   0.80 (0.80)
BaselineAltLabel   00:11:37   5         4739.0    0.89 (0.89)   0.84 (0.84)   0.80 (0.80)
BaselineLabel      00:11:27   5         3706.0    0.95 (0.95)   0.81 (0.81)   0.71 (0.71)
KGMatcher          03:01:17   5         3810.8    0.94 (0.94)   0.82 (0.82)   0.72 (0.72)
LogMap             00:55:52   5         4031.8    0.90 (0.90)   0.77 (0.77)   0.68 (0.68)
LSMatch            04:17:13   5         5981.4    0.66 (0.66)   0.63 (0.63)   0.61 (0.61)
Matcha             02:40:21   4         32844.2   0.53 (0.66)   0.60 (0.76)   0.70 (0.88)

For further analysis of the results, we also provide an online dashboard41 generated with MELT [62].
It allows inspecting the results at the level of individual correspondences. Due to the large number of these correspondences, it can take some time to load the full dashboard.
41 http://oaei.ontologymatching.org/2022/results/knowledgegraph/knowledge_graph_dashboard.html

5. Conclusions and Lessons Learned
In 2022 we witnessed a healthy mix of new and returning systems. Like last year, the distribution of participants by tracks was uneven. The schema matching tracks saw abundant participation, but, as has been the trend in recent years, little substantial progress in terms of the quality of the results or the run time of the top matching systems, judging from the long-standing tracks. On the one hand, this may be a sign of a performance plateau being reached by existing strategies and algorithms, which would suggest that new technology is needed to obtain significant improvements. On the other hand, it is also true that established matching systems tend to focus more on new tracks and datasets than on improving their performance in long-standing tracks, whereas new systems typically struggle to compete with established ones.
According to the Conference track, there is still a need for improvement with regard to the ability of matching systems to match properties (indeed, the majority of systems, 7 out of 12, only match classes). Next, far fewer systems than last year managed to match the DBpedia ontology to the conference ontologies (85% vs. 50%). The first point has been corroborated in the new track (Crosswalks Data Schema Matching), which concerns properties in particular. In fact, systems still struggle in dealing with this kind of ontology entity.
Since the creation of the Material Sciences and Engineering track, a large number of new ontologies have been developed and utilized in various MSE applications. In contrast to the early development stages of this track, those ontologies are now easily accessible on the new MatPortal42. In the future, the MSE track should be updated with the currently most used top- and mid-level MSE ontologies, which include BWMD-mid43, MSEO44, PMDco45 and PROV-O46. Apart from also considering frequently used domain and application ontologies, multi-ontology matching, knowledge graph matching and the usage of background knowledge should be considered in this domain, where such applications become increasingly important.
42 https://matportal.org/
43 https://matportal.org/ontologies/BWMD-MID
44 https://matportal.org/ontologies/MSEO
45 https://github.com/materialdigital/core-ontology/
46 https://www.ebi.ac.uk/ols/ontologies/prov
With respect to the cross-lingual version of Conference, the MultiFarm track still attracts too few participants. Despite this fact, this year new participants came with alternative strategies (i.e., deep learning) with respect to the last campaigns.
In the Food track, none of the evaluated systems finds all the reference correspondences. The usage of background knowledge available in the CIQUAL and SIREN ontologies, in terms of food descriptions based on FoodOn concepts, should be considered in future OAEI campaigns.
The Bio-ML track is new to the OAEI. It has several participants for equivalence matching but very few for subsumption matching, which is more challenging. The best performing systems are not consistent across tasks and settings, demonstrating the diversity of our datasets. To encourage more participants and provide more convincing results in the future, we will consider creating a new set of OM pairs (different ontologies and/or semantic types from up-to-date sources) and anonymizing the test mappings.
In the Biodiversity and Ecology track, none of the systems has been able to detect mappings established by domain experts. Detecting such correspondences requires the use of domain-specific core knowledge that captures biodiversity concepts. In addition, this year we confirmed, on the one hand, the inability of most systems to handle SKOS as an input format and, on the other hand, their inability to handle very large ontologies and thesauri. We plan to reuse techniques from the Large Biomedical Ontologies track as well as expert knowledge to provide manageable subsets.
The Interactive matching track also witnessed a small number of participants. Two systems participated this year. This is puzzling considering that this track is based on the Anatomy and Conference test cases, and those tracks had 10 and 12 participants, respectively. The process of programmatically querying the Oracle class used to simulate user interactions is simple enough that it should not be a deterrent for participation, but perhaps we should look at facilitating the process further in future OAEI editions by providing implementation examples.
The Complex matching track tackles a challenging task that still attracts very few participants. This year, no system was able to complete the task. While some sub-tracks have been discontinued, the Taxon sub-track has to evolve, in particular considering new versions of the resources used (TAXREF-LD) and additional resources such as NCBI and DBpedia.
In the instance matching tracks, participation decreased this year for SPIMBENCH and increased for the Spatial benchmark. Regarding the Spatial benchmark, some systems had newer versions. Automatic instance-matching benchmark generation algorithms have been gaining popularity, as evidenced by the fact that they are used in all three instance matching tracks of this OAEI edition. One aspect that has not been addressed in such algorithms is that, if the transformation is too extreme, the correspondence may be unrealistic and impossible to detect even by humans. As such, we argue that human-in-the-loop techniques can be exploited to do preventive quality-checking of the generated correspondences, and to refine the set of correspondences included in the final reference alignment.
In the Knowledge Graph track, there is a slight decrease in the number of systems able to solve all test cases. The overall best scores are still unbeaten. Furthermore, the proportion of matchers not able to produce property alignments is high. This might change next year with new and improved systems.
This is the second year of the Common Knowledge Graphs track, which challenges matching systems to map the schemas of large-scale, automatically constructed, cross-domain knowledge graphs. The number of participants is similar to last year, i.e., 8 systems. While a number of systems were able to finish the task, other systems still faced problems coping with scalability. Some of the systems were only able to produce alignments when applied to smaller versions of the KG datasets. Therefore, we still expect those systems to be adapted to the task, and we look forward to having more participants in the next OAEI campaign.
Like in previous OAEI editions, most participants provided a description of their systems and their experience in the evaluation, in the form of OAEI system papers.
These papers, like the present one, have not been peer reviewed. However, they are full contributions to this evaluation exercise, reflecting the effort and insight of matching system developers, and providing details about those systems and the algorithms they implement.
As every year, fruitful discussions at the Ontology Matching workshop point out different directions for future improvements in the OAEI, in particular in terms of new use cases, using SSSOM as an alignment format towards making alignments FAIR, and alternative ways of running resource-consuming systems. The Ontology Alignment Evaluation Initiative will strive to remain a reference for the ontology matching community by improving both the test cases and the testing methodology to better reflect actual needs, as well as to promote progress in this field. More information can be found at: http://oaei.ontologymatching.org.

Acknowledgements
We warmly thank the participants of this campaign. We know that they have worked hard to have their matching tools executable in time, and they provided useful reports on their experience. The best way to learn about the results remains to read the papers that follow.
We are also grateful to Martin Ringwald and Terry Hayamizu for providing the reference alignment for the anatomy ontologies, and we thank Elena Beisswanger for her thorough support on improving the quality of the dataset. We thank Andrea Turbati and the AGROVOC team for their very appreciated help with the preparation of the AGROVOC subset ontology. We are also grateful to Catherine Roussey and Nathalie Hernandez for their help on the Taxon alignment.
We also thank for their support the past members of the Ontology Alignment Evaluation Initiative steering committee: Jérôme Euzenat (INRIA, FR), Yannis Kalfoglou (Ricoh laboratories, UK), Miklos Nagy (The Open University, UK), Natasha Noy (Google Inc., USA), Yuzhong Qu (Southeast University, CN), York Sure (Leibniz Gemeinschaft, DE), Jie Tang (Tsinghua University, CN), Heiner Stuckenschmidt (Mannheim Universität, DE), George Vouros (University of the Aegean, GR).
Daniel Faria and Catia Pesquita were supported by the FCT through the LASIGE Research Unit (UIDB/00408/2020 and UIDP/00408/2020) and by the KATY project funded by the European Union's Horizon 2020 research and innovation program under grant agreement No 101017453. Ernesto Jimenez-Ruiz has been partially supported by the SIRIUS Centre for Scalable Data Access (Research Council of Norway, project no.: 237889). Irini Fundulaki and Tzanina Saveta were supported by the EU's Horizon 2020 research and innovation programme under grant agreement No 688227 (Hobbit). Patrick Lambrix, Huanyu Li, Mina Abd Nikooie Pour and Ying Li have been supported by the Swedish e-Science Research Centre (SeRC), the Swedish Research Council (Vetenskapsrådet, dnr 2018-04147) and the Swedish National Graduate School in Computer Science (CUGS). Beyza Yaman has been supported by the ADAPT SFI Research Centre [grant 13/RC/2106 P2]. Jiaoyan Chen, Hang Dong, Yuan He and Ian Horrocks have been supported by Samsung Research UK (SRUK) and the EPSRC project ConCur (EP/V050869/1).
The Biodiversity and Ecology track has been partially funded by the German Research Foundation in the context of the NFDI4BioDiversity project (number 442032008) and the CRC 1076 AquaDiva. In 2021, the track was also supported by the Data to Knowledge in Agronomy and Biodiversity (D2KAB) project that received funding from the French National Research Agency (ANR-18-CE23-0017).
We would like to thank FAO AIMS and US NAL as well as the GACS project for providing mappings between AGROVOC and NALT. We would like to thank Christian Pichot and the ANAEE France project for providing mappings between ANAEETHES and GEMET.

References
[1] J. Euzenat, C. Meilicke, P. Shvaiko, H. Stuckenschmidt, C. Trojahn dos Santos, Ontology alignment evaluation initiative: six years of experience, Journal on Data Semantics XV (2011) 158–192.
[2] J. Euzenat, P. Shvaiko, Ontology matching, 2nd ed., Springer-Verlag, 2013.
[3] Y. Sure, O. Corcho, J. Euzenat, T. Hughes (Eds.), Proceedings of the Workshop on Evaluation of Ontology-based Tools (EON), Hiroshima (JP), 2004.
[4] B. Ashpole, M. Ehrig, J. Euzenat, H. Stuckenschmidt (Eds.), Proc. K-Cap Workshop on Integrating Ontologies, Banff (Canada), 2005. URL: http://ceur-ws.org/Vol-156/.
[5] M. Abd Nikooie Pour, A. Algergawy, F. Amardeilh, R. Amini, O. Fallatah, D. Faria, I. Fundulaki, I. Harrow, S. Hertling, P. Hitzler, M. Huschka, L. Ibanescu, E. Jiménez-Ruiz, N. Karam, A. Laadhar, P. Lambrix, H. Li, Y. Li, F. Michel, E. Nasr, H. Paulheim, C. Pesquita, J. Portisch, C. Roussey, T. Saveta, P. Shvaiko, A. Splendiani, C. Trojahn, J. Vatascinová, B. Yaman, O. Zamazal, L. Zhou, Results of the ontology alignment evaluation initiative 2021, in: P. Shvaiko, J. Euzenat, E. Jiménez-Ruiz, O. Hassanzadeh, C. Trojahn (Eds.), Proceedings of the 16th International Workshop on Ontology Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, October 25, 2021, volume 3063 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 62–108. URL: http://ceur-ws.org/Vol-3063/oaei21_paper0.pdf.
[6] M. Abd Nikooie Pour, A. Algergawy, R. Amini, D. Faria, I. Fundulaki, I. Harrow, S. Hertling, E. Jiménez-Ruiz, C. Jonquet, N. Karam, A. Khiat, A. Laadhar, P. Lambrix, H. Li, Y. Li, P. Hitzler, H. Paulheim, C. Pesquita, T. Saveta, P. Shvaiko, A. Splendiani, É. Thiéblin, C. Trojahn, J. Vatascinová, B. Yaman, O. Zamazal, L. Zhou, Results of the ontology alignment evaluation initiative 2020, in: P. Shvaiko, J. Euzenat, E. Jiménez-Ruiz, O. Hassanzadeh, C. Trojahn (Eds.), Proceedings of the 15th International Workshop on Ontology Matching co-located with the 19th International Semantic Web Conference (ISWC 2020), Virtual conference (originally planned to be in Athens, Greece), November 2, 2020, volume 2788 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 92–138. URL: http://ceur-ws.org/Vol-2788/oaei20_paper0.pdf.
[7] A. Algergawy, D. Faria, A. Ferrara, I. Fundulaki, I. Harrow, S. Hertling, E. Jiménez-Ruiz, N. Karam, A. Khiat, P. Lambrix, H. Li, S. Montanelli, H. Paulheim, C. Pesquita, T. Saveta, P. Shvaiko, A. Splendiani, É. Thiéblin, C. Trojahn, J. Vatascinová, O. Zamazal, L. Zhou, Results of the ontology alignment evaluation initiative 2019, in: Proceedings of the 14th International Workshop on Ontology Matching, Auckland, New Zealand, 2019, pp. 46–85.
[8] A. Algergawy, M. Cheatham, D. Faria, A. Ferrara, I. Fundulaki, I. Harrow, S. Hertling, E. Jiménez-Ruiz, N. Karam, A. Khiat, P. Lambrix, H. Li, S. Montanelli, H. Paulheim, C. Pesquita, T. Saveta, D. Schmidt, P. Shvaiko, A. Splendiani, É. Thiéblin, C. Trojahn, J. Vatascinová, O. Zamazal, L. Zhou, Results of the ontology alignment evaluation initiative 2018, in: Proceedings of the 13th International Workshop on Ontology Matching, Monterey (CA, US), 2018, pp. 76–116.
[9] M. Achichi, M. Cheatham, Z. Dragisic, J. Euzenat, D. Faria, A. Ferrara, G. Flouris, I. Fundulaki, I.
Harrow, V. Ivanova, E. Jiménez-Ruiz, K. Kolthoff, E. Kuss, P. Lambrix, H. Leopold, H. Li, C. Meilicke, M. Mohammadi, S. Montanelli, C. Pesquita, T. Saveta, P. Shvaiko, A. Splendiani, H. Stuckenschmidt, É. Thiéblin, K. Todorov, C. Trojahn, O. Zamazal, Results of the ontology alignment evaluation initiative 2017, in: Proceedings of the 12th International Workshop on Ontology Matching, Vienna, Austria, 2017, pp. 61–113. URL: http://ceur-ws.org/Vol-2032/oaei17_paper0.pdf.
[10] M. Achichi, M. Cheatham, Z. Dragisic, J. Euzenat, D. Faria, A. Ferrara, G. Flouris, I. Fundulaki, I. Harrow, V. Ivanova, E. Jiménez-Ruiz, E. Kuss, P. Lambrix, H. Leopold, H. Li, C. Meilicke, S. Montanelli, C. Pesquita, T. Saveta, P. Shvaiko, A. Splendiani, H. Stuckenschmidt, K. Todorov, C. Trojahn, O. Zamazal, Results of the ontology alignment evaluation initiative 2016, in: Proceedings of the 11th International Ontology matching workshop, Kobe (JP), 2016, pp. 73–129.
[11] M. Cheatham, Z. Dragisic, J. Euzenat, D. Faria, A. Ferrara, G. Flouris, I. Fundulaki, R. Granada, V. Ivanova, E. Jiménez-Ruiz, P. Lambrix, S. Montanelli, C. Pesquita, T. Saveta, P. Shvaiko, A. Solimando, C. Trojahn, O. Zamazal, Results of the ontology alignment evaluation initiative 2015, in: Proceedings of the 10th International Ontology matching workshop, Bethlehem (PA, US), 2015, pp. 60–115.
[12] Z. Dragisic, K. Eckert, J. Euzenat, D. Faria, A. Ferrara, R. Granada, V. Ivanova, E. Jiménez-Ruiz, A. O. Kempf, P. Lambrix, S. Montanelli, H. Paulheim, D. Ritze, P. Shvaiko, A. Solimando, C. T. dos Santos, O. Zamazal, B. C. Grau, Results of the ontology alignment evaluation initiative 2014, in: Proceedings of the 9th International Ontology matching workshop, Riva del Garda (IT), 2014, pp. 61–104. URL: http://ceur-ws.org/Vol-1317/oaei14_paper0.pdf.
[13] B. Cuenca Grau, Z. Dragisic, K. Eckert, J. Euzenat, A. Ferrara, R. Granada, V. Ivanova, E. Jiménez-Ruiz, A. Kempf, P. Lambrix, A. Nikolov, H. Paulheim, D. Ritze, F. Scharffe, P. Shvaiko, C. Trojahn dos Santos, O. Zamazal, Results of the ontology alignment evaluation initiative 2013, in: P. Shvaiko, J. Euzenat, K. Srinivas, M. Mao, E. Jiménez-Ruiz (Eds.), Proceedings of the 8th International Ontology matching workshop, Sydney (NSW, AU), 2013, pp. 61–100. URL: http://oaei.ontologymatching.org/2013/results/oaei2013.pdf.
[14] J. Aguirre, B. Cuenca Grau, K. Eckert, J. Euzenat, A. Ferrara, R. van Hague, L. Hollink, E. Jiménez-Ruiz, C. Meilicke, A. Nikolov, D. Ritze, F. Scharffe, P. Shvaiko, O. Sváb-Zamazal, C. Trojahn, B. Zapilko, Results of the ontology alignment evaluation initiative 2012, in: Proceedings of the 7th International Ontology matching workshop, Boston (MA, US), 2012, pp. 73–115. URL: http://oaei.ontologymatching.org/2012/results/oaei2012.pdf.
[15] J. Euzenat, A. Ferrara, R. van Hague, L. Hollink, C. Meilicke, A. Nikolov, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Sváb-Zamazal, C. Trojahn dos Santos, Results of the ontology alignment evaluation initiative 2011, in: Proceedings of the 6th International Ontology matching workshop, Bonn (DE), 2011, pp. 85–110.
[16] J. Euzenat, A. Ferrara, C. Meilicke, A. Nikolov, J. Pane, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Sváb-Zamazal, V. Svátek, C. Trojahn dos Santos, Results of the ontology alignment evaluation initiative 2010, in: Proceedings of the 5th International Ontology matching workshop, Shanghai (CN), 2010, pp. 85–117. URL: http://oaei.ontologymatching.org/2010/results/oaei2010.pdf.
[17] J. Euzenat, A. Ferrara, L. Hollink, A.
Isaac, C. Joslyn, V. Malaisé, C. Meilicke, A. Nikolov, J. Pane, M. Sabou, F. Scharffe, P. Shvaiko, V. Spiliopoulos, H. Stuckenschmidt, O. Sváb-Zamazal, V. Svátek, C. Trojahn dos Santos, G. Vouros, S. Wang, Results of the ontology alignment evaluation initiative 2009, in: Proceedings of the 4th International Ontology matching workshop, Chantilly (VA, US), 2009, pp. 73–126.
[18] C. Caracciolo, J. Euzenat, L. Hollink, R. Ichise, A. Isaac, V. Malaisé, C. Meilicke, J. Pane, P. Shvaiko, H. Stuckenschmidt, O. Sváb-Zamazal, V. Svátek, Results of the ontology alignment evaluation initiative 2008, in: Proceedings of the 3rd Ontology matching workshop, Karlsruhe (DE), 2008, pp. 73–120.
[19] J. Euzenat, A. Isaac, C. Meilicke, P. Shvaiko, H. Stuckenschmidt, O. Svab, V. Svatek, W. van Hage, M. Yatskevich, Results of the ontology alignment evaluation initiative 2007, in: Proceedings of the 2nd International Ontology matching workshop, Busan (KR), 2007, pp. 96–132. URL: http://ceur-ws.org/Vol-304/paper9.pdf.
[20] J. Euzenat, M. Mochol, P. Shvaiko, H. Stuckenschmidt, O. Svab, V. Svatek, W. R. van Hage, M. Yatskevich, Results of the ontology alignment evaluation initiative 2006, in: Proceedings of the 1st International Ontology matching workshop, Athens (GA, US), 2006, pp. 73–95. URL: http://ceur-ws.org/Vol-225/paper7.pdf.
[21] S. Hertling, J. Portisch, H. Paulheim, MELT - matching evaluation toolkit, in: M. Acosta, P. Cudré-Mauroux, M. Maleshkova, T. Pellegrini, H. Sack, Y. Sure-Vetter (Eds.), Semantic Systems. The Power of AI and Knowledge Graphs, Springer International Publishing, Cham, 2019, pp. 231–245.
[22] E. Jiménez-Ruiz, T. Saveta, O. Zamazal, S. Hertling, M. Röder, I. Fundulaki, A.-C. N. Ngomo, M. A. Sherif, A. Annane, Z. Bellahsene, S. B. Yahia, G. Diallo, D. Faria, M. Kachroudi, A. Khiat, P. Lambrix, H. Li, M. Mackeprang, M. Mohammadi, M. Rybinski, B. S. Balasubramani, C. Trojahn, Introducing the HOBBIT platform into the Ontology Alignment Evaluation Campaign, in: Proceedings of the 13th International Workshop on Ontology Matching, 2018.
[23] Z. Dragisic, V. Ivanova, H. Li, P. Lambrix, Experiences from the anatomy track in the ontology alignment evaluation initiative, Journal of Biomedical Semantics 8 (2017) 56:1–56:28. URL: https://doi.org/10.1186/s13326-017-0166-5. doi:10.1186/s13326-017-0166-5.
[24] O. Zamazal, V. Svátek, The ten-year ontofarm and its fertilization within the onto-sphere, Web Semantics: Science, Services and Agents on the World Wide Web 43 (2017) 46–53.
[25] C. Meilicke, R. García Castro, F. Freitas, W. van Hage, E. Montiel-Ponsoda, R. Ribeiro de Azevedo, H. Stuckenschmidt, O. Sváb-Zamazal, V. Svátek, A. Tamilin, C. Trojahn, S. Wang, MultiFarm: A benchmark for multilingual ontology matching, Journal of web semantics 15 (2012) 62–68. URL: http://www.sciencedirect.com/science/article/pii/S157082681200039X. doi:10.1016/j.websem.2012.04.001.
[26] P. Buche, J. Cufi, S. Dervaux, J. Dibie, L. Ibanescu, A. Oudot, M. Weber, How to manage incompleteness of nutritional food sources?: A solution using foodon as pivot ontology, Int. J. Agric. Environ. Inf. Syst. 12 (2021) 1–26. URL: https://doi.org/10.4018/ijaeis.20211001.oa4. doi:10.4018/ijaeis.20211001.oa4.
[27] H. Paulheim, S. Hertling, D. Ritze, Towards evaluating interactive ontology matching tools, in: Proceedings of the 10th Extended Semantic Web Conference, Montpellier (FR), 2013, pp. 31–45. URL: http://dx.doi.org/10.1007/978-3-642-38288-8_3.
[28] Z. Dragisic, V. Ivanova, P. Lambrix, D. Faria, E. Jiménez-Ruiz, C.
Pesquita, User validation in ontology alignment, in: Proceedings of the 15th International Semantic Web Conference, Kobe (JP), 2016, pp. 200–217. URL: http://dx.doi.org/10.1007/978-3-319-46523-4_13. doi:10.1007/978-3-319-46523-4_13.
[29] H. Li, Z. Dragisic, D. Faria, V. Ivanova, E. Jiménez-Ruiz, P. Lambrix, C. Pesquita, User validation in ontology alignment: functional assessment and impact, The Knowledge Engineering Review 34 (2019) e15. doi:10.1017/S0269888919000080.
[30] V. Ivanova, P. Lambrix, J. Åberg, Requirements for and evaluation of user support for large-scale ontology alignment, in: Proceedings of the European Semantic Web Conference, 2015, pp. 3–20.
[31] Y. He, J. Chen, H. Dong, E. Jiménez-Ruiz, A. Hadian, I. Horrocks, Machine learning-friendly biomedical datasets for equivalence and subsumption ontology matching, in: U. Sattler, A. Hogan, C. M. Keet, V. Presutti, J. P. A. Almeida, H. Takeda, P. Monnin, G. Pirrò, C. d'Amato (Eds.), The Semantic Web - ISWC 2022 - 21st International Semantic Web Conference, Virtual Event, October 23-27, 2022, Proceedings, volume 13489 of Lecture Notes in Computer Science, Springer, 2022, pp. 575–591. URL: https://doi.org/10.1007/978-3-031-19433-7_33. doi:10.1007/978-3-031-19433-7_33.
[32] K. A. Shefchek, N. L. Harris, M. A. Gargano, N. Matentzoglu, D. R. Unni, M. H. Brush, D. Keith, T. Conlin, N. A. Vasilevsky, X. A. Zhang, J. P. Balhoff, L. Babb, S. M. Bello, H. Blau, Y. M. Bradford, S. Carbon, L. Carmody, L. E. Chan, V. Cipriani, A. Cuzick, M. G. D. Rocca, N. A. Dunn, S. Essaid, P. Fey, C. A. Grove, J.-P. F. Gourdine, A. Hamosh, M. A. Harris, I. Helbig, M. E. Hoatlin, M. P. Joachimiak, S. Jupp, K. B. Lett, S. E. Lewis, C. McNamara, Z. M. Pendlington, C. Pilgrim, T. Putman, V. Ravanmehr, J. T. Reese, E. R. Riggs, S. M. C. Robb, P. Roncaglia, J. Seager, E. Segerdell, M. N. Similuk, A. L. Storm, C. Thaxon, A. E. Thessen, J. O. B. Jacobsen, J. A. McMurry, T. Groza, S. Köhler, D. Smedley, P. N. Robinson, C. J. Mungall, M. A. Haendel, M. C. Munoz-Torres, D. Osumi-Sutherland, The Monarch Initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species, Nucleic Acids Research (2020).
[33] O. Bodenreider, The unified medical language system (umls): integrating biomedical terminology, Nucleic acids research (2004).
[34] N. Karam, C. Müller-Birn, M. Gleisberg, D. Fichtmüller, R. Tolksdorf, A. Güntsch, A terminology service supporting semantic annotation, integration, discovery and analysis of interdisciplinary research data, Datenbank-Spektrum 16 (2016) 195–205. URL: https://doi.org/10.1007/s13222-016-0231-8. doi:10.1007/s13222-016-0231-8.
[35] F. Klan, E. Faessler, A. Algergawy, B. König-Ries, U. Hahn, Integrated semantic search on structured and unstructured data in the adonis system, in: Proceedings of the 2nd International Workshop on Semantics for Biodiversity, 2017.
[36] N. Karam, A. Khiat, A. Algergawy, M. Sattler, C. Weiland, M. Schmidt, Matching biodiversity and ecology ontologies: challenges and evaluation results, Knowl. Eng. Rev. 35 (2020) e9. URL: https://doi.org/10.1017/S0269888920000132. doi:10.1017/S0269888920000132.
[37] F. Michel, O. Gargominy, S. Tercerie, C. Faron-Zucker, A Model to Represent Nomenclatural and Taxonomic Information as Linked Data. Application to the French Taxonomic Register, TAXREF, in: A. Algergawy, N. Karam, F. Klan, C.
[38] T. Ashino, Materials Ontology: An Infrastructure for Exchanging Materials Information and Knowledge, Data Science Journal 9 (2010) 54–61.
[39] M. Wu, P. Hagan, B. Cecconi, S. M. Richard, C. Verhey, R. R. M. S. WG, A collection of crosswalks from fifteen research data schemas to schema.org, 2022. URL: https://doi.org/10.15497/RDA00069. doi:10.15497/RDA00069.
[40] M. Wu, S. M. Richard, C. Verhey, L. J. Castro, B. Cecconi, N. Juty, An analysis of crosswalks from research data schemas to schema.org, Data Intelligence (2022) 1–21. URL: https://doi.org/10.1162/dint_a_00186. doi:10.1162/dint_a_00186.
[41] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, C. Bizer, DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia, Semantic Web (2012) 1–5.
[42] F. M. Suchanek, G. Kasneci, G. Weikum, Yago: a core of semantic knowledge, in: Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 697–706.
[43] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, T. M. Mitchell, Toward an architecture for never-ending language learning, in: Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[44] O. Fallatah, Z. Zhang, F. Hopfgartner, The impact of imbalanced class distribution on knowledge graphs matching, in: Proceedings of the 17th International Workshop on Ontology Matching (OM 2022), CEUR-WS.org, 2022.
[45] O. Fallatah, Z. Zhang, F. Hopfgartner, A gold standard dataset for large knowledge graphs matching, in: Ontology Matching 2020: Proceedings of the 15th International Workshop on Ontology Matching co-located with the 19th International Semantic Web Conference (ISWC 2020), 2020.
[46] P. Krauss, schemaorg-wikidata-map, https://github.com/okfn-brasil/schemaOrg-Wikidata-Map, 2017.
[47] S. Hertling, H. Paulheim, DBkWik: extracting and integrating knowledge from thousands of wikis, Knowledge and Information Systems (2019).
[48] S. Hertling, H. Paulheim, DBkWik: A consolidated knowledge graph from thousands of wikis, in: Proceedings of the International Conference on Big Knowledge, 2018.
[49] T. Saveta, E. Daskalaki, G. Flouris, I. Fundulaki, M. Herschel, A.-C. Ngonga Ngomo, Pushing the limits of instance matching systems: A semantics-aware benchmark for linked data, in: Proceedings of the 24th International Conference on World Wide Web, ACM, New York, NY, USA, 2015, pp. 105–106. URL: http://doi.acm.org/10.1145/2740908.2742729. doi:10.1145/2740908.2742729.
[50] T. D. Doudali, I. Konstantinou, N. Koziris, Spaten: a Spatio-Temporal and Textual Big Data Generator, in: IEEE Big Data, 2017, pp. 3416–3421.
[51] C. Strobl, Encyclopedia of GIS, Springer, 2008, pp. 240–245.
[52] Y. He, J. Chen, D. Antonyrajah, I. Horrocks, BERTMap: a BERT-based ontology alignment system, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2022, pp. 5684–5691.
[53] J. Chen, Y. He, Y. Geng, E. Jimenez-Ruiz, H. Dong, I. Horrocks, Contextual semantic embeddings for ontology subsumption prediction, arXiv preprint arXiv:2202.09791 (2022).
[54] M. Šatra, O. Zamazal, Towards matching of domain ontologies to cross-domain ontology: Evaluation perspective, in: Proceedings of the 15th International Workshop on Ontology Matching, 2020.
[55] J. Dabrowski, E. V. Munson, 40 years of searching for the best computer system response time, Interacting with Computers 23 (2011) 555–564. URL: http://www.sciencedirect.com/science/article/pii/S0953543811000579. doi:10.1016/j.intcom.2011.05.008.
[56] Z. Wang, I. F. Cruz, AgreementMakerDeep results for OAEI 2021, in: OM@ISWC, 2021, pp. 124–130.
[57] D. Faria, M. C. Silva, P. Cotovio, P. Eugénio, C. Pesquita, Matcha and Matcha-DL results for OAEI 2022, 2022.
[58] E. Jiménez-Ruiz, B. Cuenca Grau, LogMap: Logic-based and scalable ontology matching, in: Proceedings of the 10th International Semantic Web Conference, Bonn (DE), 2011, pp. 273–288.
[59] S. Hertling, H. Paulheim, ATBox results for OAEI 2021, in: CEUR Workshop Proceedings, volume 3063, RWTH Aachen, 2021, pp. 137–143.
[60] A. Sharma, A. Patel, S. Jain, LSMatch results for OAEI 2021, in: OM@ISWC, 2021, pp. 178–184.
[61] J. Chen, P. Hu, E. Jimenez-Ruiz, O. M. Holter, D. Antonyrajah, I. Horrocks, OWL2Vec*: Embedding of OWL ontologies, Machine Learning 110 (2021) 1813–1845.
[62] J. Portisch, S. Hertling, H. Paulheim, Visual analysis of ontology matching results with the MELT dashboard, in: European Semantic Web Conference, 2020, pp. 186–190.

Linköping, Jena, Lisboa, Heraklion, Mannheim, Montpellier, Oslo, London, Berlin, Trento, Toulouse, Prague, Manhattan, Dublin, Oxford, December 2022