=Paper=
{{Paper
|id=Vol-3324/om2022_STpaper1
|storemode=property
|title=Too big to match: a strategy around matching tasks for large taxonomies
|pdfUrl=https://ceur-ws.org/Vol-3324/om2022_STpaper1.pdf
|volume=Vol-3324
|authors=Alsayed Algergawy,Naouel Karam,Amir Laadhar,Franck Michel
|dblpUrl=https://dblp.org/rec/conf/semweb/AlgergawyKLM22
}}
==Too big to match: a strategy around matching tasks for large taxonomies==
<pdf width="1500px">https://ceur-ws.org/Vol-3324/om2022_STpaper1.pdf</pdf>
<pre>
Too Big to Match: a Strategy Around Matching Tasks
for Large Taxonomies
Alsayed Algergawy1,*, Naouel Karam2, Amir Laadhar3 and Franck Michel4
1 Institute for Computer Science, University of Jena, Germany
2 Fraunhofer FOKUS, InfAI e.V., Berlin, Germany
3 University of Stuttgart, Germany
4 University Côte d’Azur, CNRS, Inria, I3S, Nice, France


Abstract
Following the introduction of a new matching task at the biodiversity and ecology track of the Ontology
Alignment Evaluation Initiative (OAEI), namely aligning two large taxonomies, we acknowledged the
fact that large ontologies or taxonomies still cannot be efficiently tackled by state-of-the-art ontology
matching systems. In this paper, we take advantage of the structural specificities of taxonomies to devise a
strategy for dividing large-scale taxonomy matching tasks into smaller, more manageable subtasks. Our
modularization approach is based on a locality-based module extraction technique. We conducted a first
assessment of the coverage of the obtained modules as well as a preliminary evaluation using a set of
tools from OAEI.

Keywords
Ontology matching, Large taxonomies, NCBITAXON, TAXREF-LD


1. Introduction
At the 2021 edition of the Ontology Alignment Evaluation Initiative (OAEI), we introduced a
new matching task to align the NCBI Organismal Classification [1] (NCBITaxonomy
for short) with TAXREF-LD [2]. Both are large biological taxonomies, containing
1,983,907 and 285,863 classes, respectively. No matching system succeeded in matching NCBITaxonomy
and TAXREF-LD within the given time frame and with constrained computing resources. Indeed, the
largest search space that needs to be considered by matching systems is the Cartesian product of
the entities of the two input taxonomies, which for the task at hand represents almost 529 billion
candidate correspondences. The same issue has been addressed by the OAEI Large BioMed
Track (largebio) for many years; the biggest ontology considered there is in
the same size range as TAXREF-LD, with around 300,000 classes. The largebio track organizers proposed
and applied a strategy to divide the ontologies into smaller fragments [3], so that the track consists
of manageable subtasks with different fragment sizes. Furthermore, several approaches have
been proposed in the literature to support matching large ontologies [4, 5, 6]; reduction
of the search space and parallel matching are two common strategies. A state of the art and a
comparison of the approaches based on the employed technique have been presented in [7].

OM2022: the International Workshop on Ontology Matching, October 23, 2022, Hangzhou, China
* Corresponding author.
Email: alsayed.algergawy@uni-jena.de (A. Algergawy); naouel.karam@fokus.fraunhofer.de (N. Karam);
amirl@cs.aau.dk (A. Laadhar); fmichel@i3s.unice.fr (F. Michel)
ORCID: 0000-0002-8550-4720 (A. Algergawy); 0000-0002-0877-7063 (N. Karam); 0000-0001-7116-9338 (A. Laadhar);
0000-0002-9421-8566 (F. Michel)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

   In this paper, we propose a method to split the large matching task into a set of smaller,
more manageable subtasks through the use of modularization. Each subtask will cover one
self-contained module from each of the two taxonomies. Unlike state-of-the-art approaches
that operate on ontologies and hence need to deal with issues around preserving the coverage
of the relevant ontology alignments, as well as an accurate and self-contained division, our
approach is tailored for taxonomies and their specificities. Indeed, structural characteristics
of biological taxonomies allow us to divide based on a specific hierarchy level, the so-called
taxonomic rank, and thus obtain independent and logically relevant modules. We performed a
preliminary evaluation of the new matching tasks based on the obtained taxonomy modules
and their corresponding subset of the reference alignments using a set of matching systems
from OAEI.
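
   The size of the search space quoted in this section can be checked with a few lines of
arithmetic. The sketch below is only illustrative: the function name is hypothetical, and which
TAXREF-LD count underlies the 529 billion estimate is an assumption. The 266,846 taxa reported in
Section 3 give a product close to that figure, whereas the 285,863 classes give a larger one.

```python
# Counts quoted in the paper. Which TAXREF-LD figure was used for the
# "almost 529 billion" estimate is an assumption made for illustration.
NCBI_CLASSES = 1_983_907    # NCBITaxonomy classes (Sections 1 and 3)
TAXREF_CLASSES = 285_863    # TAXREF-LD classes (Section 1)
TAXREF_TAXA = 266_846       # TAXREF-LD taxa (Section 3)


def search_space(n_source: int, n_target: int) -> int:
    """Size of the Cartesian product of two entity sets."""
    return n_source * n_target


pairs_by_classes = search_space(NCBI_CLASSES, TAXREF_CLASSES)
pairs_by_taxa = search_space(NCBI_CLASSES, TAXREF_TAXA)

print(f"{pairs_by_classes / 1e9:.1f} billion candidate pairs (class counts)")
print(f"{pairs_by_taxa / 1e9:.1f} billion candidate pairs (taxa count)")
```

Either way, the upper bound is several hundred billion pairwise comparisons, which is what
motivates splitting the task before running any matcher.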


2. Characteristics of Biological Taxonomies
In this section, we describe some characteristics of biological taxonomies and the particular
structure that serves as the basis for our modularization approach.
   Large size. Biological taxonomies are used to name, define and classify groups of biological
organisms based on common characteristics. They result from the huge effort that
taxonomists have dedicated to studying living organisms and publishing their knowledge for
over two centuries. They are consequently inherently large. For instance, at the time of
writing this paper, the NCBI taxonomy contains over 2 million species. TAXREF-LD, although
restricted to the French territories, still counts over 200,000 species to date. In addition
to the constant growth in size, progress in the study of species leads to frequent changes such as
the redefinition, reclassification and merging of taxon concepts, making taxonomies highly
dynamic [8].
   Structure and naming conventions. Biological taxonomies consist of a hierarchy where
each level is assigned a taxonomic rank, such as species, genus, family, etc., up to kingdom. In the
hierarchy, the parent of a taxon is a taxon with a higher taxonomic rank (e.g. "Delphinus delphis"
is of rank species whereas its parent class "Delphinus" is of rank genus). Different taxonomies
adopt a certain perspective or are meant for a certain purpose, thus covering complementary
and possibly overlapping regions, epochs or domains. For instance, NCBITaxonomy covers
the organisms referenced in the NCBI nucleotide and protein sequence databases. Biological
taxonomies consist of two coexisting yet distinct realities: the taxonomy (the description and
characterization of biological entities called taxa) and the nomenclature (the rules specifying
how to define scientific names and assign them to taxa).
   High dynamicity. The circumscription of a taxon, i.e. the set of individuals that it actually
consists of, is provided by the set of names that may be used to refer to it: the reference name
(called valid name in zoology or accepted name in botany), and possibly multiple synonyms.
These multiple names stem from the fact that scientific consensus constantly evolves, for example in light
of new evidence, which leads to recombinations of taxa (merging, splitting, moving from
one parent to another, etc.). The way these recombinations must be handled is specified by
nomenclatural rules that are compiled in several codes of nomenclature: for animals, for plants
and fungi, and for bacteria.


3. Alignment Task at Hand
The NCBI Taxonomy is the standard nomenclature and classification repository for the source
organisms in the sequence databases of the International Nucleotide Sequence Database
Collaboration (INSDC). The NCBITaxonomy ontology is an automatic translation of the NCBI
taxonomy database into OWL. The translation treats each taxon as a class whose instances
would be individual organisms. The NCBI Taxonomy is updated daily but the releases of the
OWL counterpart are triggered manually by an OBO administrator bi-annually. The OAEI
biodiv track uses the release 2021-02-15, containing 1,983,907 classes.
   TAXREF is the French taxonomic register for fauna, flora and fungus, maintained and curated
by the National Museum of Natural History of Paris. TAXREF is available in multiple formats,
in particular as a knowledge graph based on the linked data principles, called TAXREF-LD [2].
To account for the distinction between taxonomy and nomenclature, TAXREF-LD holds two
distinct levels of modeling. At the taxonomic level, each biological taxon is modeled as an OWL
class whose members are the biological individuals in that taxon. At the nomenclatural level,
scientific names are represented as the concepts of a SKOS thesaurus. The OAEI biodiv task
currently relies on TAXREF-LD version 13, which registers 266,846 taxa and 657,609 scientific
names. Since the matching tools cannot deal with OWL and SKOS at the same time, we turned the SKOS
part into simple name labels attached to the OWL classes.
   TAXREF-LD comes with alignments to NCBITaxonomy, computed using the SILK framework
[9], which the authors extended with a plugin that implements rules for the alignment of
scientific names. These rules are designed to work around mistakes that are commonly
made when spelling scientific names. This typically pertains to the use of parentheses and
abbreviations, accented characters, or the transcription of letters using Latin ligatures
(e.g. "Æ" may be spelled "AE"). Furthermore, only taxa of species or infra-specific ranks were
considered to compute the reference alignments. One reason is that this is where the vast majority
of taxa are. A more pragmatic reason is that the names of species or infra-specific ranks consist
of at least two terms, the genus followed by an epithet, which is quite discriminating.
Conversely, names in higher ranks are usually single-worded and some of them may be very similar,
sometimes varying by only one letter (e.g. sub-family Tenrecinae belongs to family Tenrecidae),
such that lexical alignment methods tend to produce many false positives.
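
   To make the kind of spelling rules mentioned above more concrete, here is a minimal sketch of
a name normalization step. It is not the SILK plugin: the function name, the ligature table and
the decision to drop parenthesised parts are assumptions made for illustration, and abbreviations
are not handled at all.

```python
import re
import unicodedata

# Hypothetical ligature table; the actual plugin rules are not given in the paper.
LIGATURES = {"Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe"}


def normalize_name(name: str) -> str:
    """Return a comparison key for a scientific name (illustrative rules only)."""
    # Expand Latin ligatures, which Unicode normalization alone does not split.
    for ligature, expansion in LIGATURES.items():
        name = name.replace(ligature, expansion)
    # Strip accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: parenthesised parts (e.g. a subgenus) are ignored.
    name = re.sub(r"\([^)]*\)", " ", name)
    # Collapse whitespace and ignore case.
    return " ".join(name.split()).lower()


print(normalize_name("Æchmea fasciata"))
print(normalize_name("Helix (Cornu) aspersa"))
```

Two names are then candidates for alignment when their keys are equal. Note that such a key
still keeps Tenrecinae and Tenrecidae apart, so the false positives discussed above come from
approximate string similarity rather than from this kind of exact normalization.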


4. Modularization of the Taxonomies
In general, there are two ways to split an ontology into a set of partitions: module extraction
and ontology partitioning [10]. Module extraction aims to extract from the given ontology
a small fragment that captures the intended meaning of the input terms [11, 12], while ontology
partitioning splits the given ontology into a set of modules. In this work, we make use of a
module extraction technique.
   Locality-based module extraction. Locality-based module extraction is the process
that extracts a meaningful subset of an ontology given a set of terms (the signature). The
extracted module is guaranteed to completely capture the meaning of the given set of terms. In the
context of the alignment task at hand, we first prepared the set of terms that will be used as input
for the module extraction process. The current implementation of the locality-based approach
supports extracting three types of syntactic-locality-based modules: the bottom module, the top module
and the star module [10]. In this work, we applied the bottom and top locality-based module extraction
strategies, thus covering the relevant hierarchy information needed by matching systems.
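
   For intuition, the sketch below shows what bottom and top modules amount to in the special
case of a taxonomy that contains only atomic SubClassOf axioms: the bottom module of a signature
then reduces to the signature plus all its ancestors, and the top module to the signature plus all
its descendants. This is a simplification under that assumption, not the implementation of [10];
the function names and the toy axioms are hypothetical.

```python
from collections import defaultdict

# Toy (child, parent) SubClassOf axioms; a deliberately shortened hierarchy.
TOY_AXIOMS = [
    ("Delphinus delphis", "Delphinus"),
    ("Delphinus", "Delphinidae"),
    ("Delphinidae", "Animalia"),
    ("Tenrecinae", "Tenrecidae"),
    ("Tenrecidae", "Animalia"),
    ("Amanita muscaria", "Amanita"),
    ("Amanita", "Fungi"),
]


def build_index(subclass_axioms):
    """Index (child, parent) pairs in both directions."""
    parents, children = defaultdict(set), defaultdict(set)
    for child, parent in subclass_axioms:
        parents[child].add(parent)
        children[parent].add(child)
    return parents, children


def closure(seeds, neighbours):
    """All terms reachable from the seeds via the relation, seeds included."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        for nxt in neighbours.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bottom_module(signature, subclass_axioms):
    """Signature terms plus all their ancestors (upward closure)."""
    parents, _ = build_index(subclass_axioms)
    return closure(signature, parents)


def top_module(signature, subclass_axioms):
    """Signature terms plus all their descendants (downward closure)."""
    _, children = build_index(subclass_axioms)
    return closure(signature, children)


print(sorted(bottom_module({"Delphinus delphis"}, TOY_AXIOMS)))
print(sorted(top_module({"Delphinus"}, TOY_AXIOMS)))
```

In this reading, a species-level signature pulls in its whole chain of parent ranks while leaving
unrelated branches, such as other kingdoms, outside the module.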
   Input terms preparation. The specific structure of the taxonomy hierarchies, based on the
taxonomic rank levels described in Section 2, constitutes the perfect basis to divide each taxonomy
into meaningful and independent modules. We chose to start at the highest taxonomic rank,
namely the kingdom. We made use of the TAXREF-LD API to extract the kingdom of all entities
appearing in the original set of reference alignments, then we grouped the alignments by
kingdom. We obtained 6 groups corresponding to the kingdoms Animalia, Bacteria, Chromista,
Fungi, Plantae and Protozoa. We then used each obtained set of terms together with its original
taxonomy as input for the locality-based module extraction tool.
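
   The grouping step can be sketched as follows. The identifiers and the kingdom lookup table
are made up for illustration; in the paper the kingdom of each entity is obtained through the
TAXREF-LD API, which is replaced here by a plain dictionary.

```python
from collections import defaultdict

# Hypothetical identifiers and kingdom lookup, for illustration only.
REFERENCE = [
    ("ncbi:A1", "taxref:T1"),
    ("ncbi:A2", "taxref:T2"),
    ("ncbi:F1", "taxref:T3"),
]
KINGDOM_OF = {
    "taxref:T1": "Animalia",
    "taxref:T2": "Animalia",
    "taxref:T3": "Fungi",
}


def group_by_kingdom(alignments, kingdom_of):
    """Partition (source, target) correspondences by the kingdom of the target."""
    groups = defaultdict(list)
    for source, target in alignments:
        groups[kingdom_of[target]].append((source, target))
    return dict(groups)


def signatures(group):
    """Split one group into the input term sets for the two taxonomies."""
    sources = {source for source, _ in group}
    targets = {target for _, target in group}
    return sources, targets


groups = group_by_kingdom(REFERENCE, KINGDOM_OF)
for kingdom, group in sorted(groups.items()):
    ncbi_terms, taxref_terms = signatures(group)
    print(kingdom, len(ncbi_terms), len(taxref_terms))
```

Each pair of term sets then serves as the signature for extracting one module from each
taxonomy, which yields one matching subtask per kingdom.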
   After applying the locality-based module extraction to both taxonomies given the input set
of terms for each kingdom, we get six modules for each taxonomy. The number of concepts
within each module is presented in Table 1. The matching task has thus been split into six
matching subtasks, each pairing two modules of similar size. This balance is due to the nature
of the extraction process, since we used
the same set of input terms and the taxonomies share a similar rank-based structure.

Table 1
Sizes of obtained modules (number of concepts) for each taxonomy
         Matching task       T1         T2          T3         T4        T5         T6
                          Animalia    Bacteria   Chromista    Fungi    Plantae   Protozoa
         NCBITAXON         74729        326        2344       13149     27013      538
         TAXREF-LD         73528        312        2290       12732     26302      501
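
   The figures of Table 1 can be used to check how close the two modules of each subtask are in
size, and how much the worst-case search space shrinks. The sketch below assumes that the search
space of a subtask is measured, as in the introduction, by the Cartesian product of the two
module sizes; the function names are hypothetical.

```python
# Module sizes copied from Table 1: task -> (NCBITaxonomy module, TAXREF-LD module).
MODULE_SIZES = {
    "Animalia": (74729, 73528),
    "Bacteria": (326, 312),
    "Chromista": (2344, 2290),
    "Fungi": (13149, 12732),
    "Plantae": (27013, 26302),
    "Protozoa": (538, 501),
}


def size_ratio(ncbi_size, taxref_size):
    """How much larger the NCBITaxonomy module is than its TAXREF-LD counterpart."""
    return ncbi_size / taxref_size


def candidate_pairs(sizes):
    """Sum of the Cartesian products over all subtasks."""
    return sum(n * t for n, t in sizes.values())


for task, (n, t) in MODULE_SIZES.items():
    print(f"{task:10s} ratio={size_ratio(n, t):.3f} pairs={n * t:,}")
print(f"total pairs over all subtasks: {candidate_pairs(MODULE_SIZES):,}")
```

With these numbers the two sides of every subtask differ by less than 10%, and the summed
search space is on the order of a few billion pairs instead of several hundred billion.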


   In preparation for this year’s OAEI edition, we conducted a preliminary evaluation of the
obtained modules using a set of OAEI matching tools. We ran three different matching systems,
namely AML [13], LogMap [14] and ATBox [15]. Our goal with this first evaluation was to
ensure that matching systems are able to deal with all subtasks. All systems completed the tasks
successfully. We will perform a full evaluation of the systems participating in this year’s edition
according to the OAEI schedule.
   In Table 2, we show the number of mappings computed by each system for each subtask.
All systems computed nearly the same number of mappings; these are, however, considerably more
numerous than those in our reference alignments. This is probably due to the fact that the
reference alignments cover only species and
infra-specific ranks (cf. Section 3). For the final evaluation, we will ignore the set of mappings
not covered by the reference alignment and perform a manual assessment of the mappings
produced by two or more systems, to be potentially added to our reference alignment.
Table 2
Number of computed mappings by each system on the matching subtasks
  Task (size of reference)   T1 (48220)   T2 (175)   T3 (1405)   T4 (10162)   T5 (19914)   T6 (357)
            AML                71269        303        2219        12937        26671        496
          LogMap               72838        302        2219        12937        26862        496
           ATBox               71383        295        2192        12623        25862        478
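
   The selection of mappings for manual assessment can be sketched as a simple voting filter.
The function name and the toy mappings below are hypothetical; only the threshold of two systems
comes from the text.

```python
from collections import Counter


def review_candidates(system_mappings, reference, min_systems=2):
    """Mappings outside the reference that at least `min_systems` systems produced."""
    reference = set(reference)
    votes = Counter()
    for mappings in system_mappings.values():
        # Each system votes at most once per mapping it found outside the reference.
        votes.update(set(mappings) - reference)
    return {mapping for mapping, count in votes.items() if count >= min_systems}


# Toy data: (NCBITaxonomy entity, TAXREF-LD entity) pairs with made-up identifiers.
REFERENCE = {("n1", "t1")}
SYSTEM_MAPPINGS = {
    "AML": {("n1", "t1"), ("n2", "t2"), ("n3", "t3")},
    "LogMap": {("n1", "t1"), ("n2", "t2")},
    "ATBox": {("n1", "t1"), ("n4", "t4")},
}

print(sorted(review_candidates(SYSTEM_MAPPINGS, REFERENCE)))
```

Mappings found by a single system stay out of the review queue, which keeps the manual effort
focused on correspondences that several matchers agree on.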


5. Conclusion
We have presented an approach to divide a taxonomy matching task into subtasks based on
taxonomic ranks and locality-based module extraction. The obtained modules are consequently
logically coherent and independent from each other. The bottom-up module extraction strategy
guarantees the inclusion of relevant information required by matching systems while ensuring
the coverage of the initial reference alignment. We tested a set of systems on the subtasks and
we will be performing a full evaluation at the 2022 OAEI edition.

Acknowledgements
This work has been partially funded by the German Research Foundation (DFG) as part of the
CRC 1076 Aquadiva, NFDI4Biodiversity (442032008) and NFDI-MatWerk (460247524) projects.


References
 [1] S. Federhen, The NCBI Taxonomy database, Nucleic Acids Research 40 (2012).
 [2] F. Michel, O. Gargominy, S. Tercerie, C. Faron-Zucker, A Model to Represent Nomenclatural
     and Taxonomic Information as Linked Data. Application to the French Taxonomic Register,
     TAXREF, in: Proceedings of the ISWC2017 workshop on Semantics for Biodiversity
     (S4BioDiv), volume 1933, Vienna, Austria, 2017.
 [3] E. Jiménez-Ruiz, A. Agibetov, J. Chen, M. Samwald, V. Cross, Dividing the ontology
     alignment task with semantic embeddings and logic-based modules, in: ECAI 2020,
     Santiago de Compostela, Spain, volume 325, 2020.
 [4] E. Rahm, Towards large-scale schema and ontology matching, in: Schema matching and
     mapping, 2011.
 [5] F. Hamdi, B. Safar, C. Reynaud, H. Zargayouna, Alignment-Based Partitioning of Large-
     Scale Ontologies, 2010.
 [6] W. Hu, Y. Qu, G. Cheng, Matching large ontologies: A divide-and-conquer approach, Data
     & Knowledge Engineering 67 (2008).
 [7] P. Ochieng, S. Kyanda, Large-scale ontology matching: State-of-the-art analysis, ACM
     Computing Surveys (CSUR) 51 (2018).
 [8] A. Kohlbecker, N. Karam, A. Paschke, A. Güntsch, Preserving taxonomic change and
     subsequent taxon relationships over time, in: Proceedings of the Joint Ontology Workshops
     2021 Episode VII: The Bolzano Summer of Knowledge, volume 2969, 2021.
 [9] J. Volz, C. Bizer, M. Gaedke, G. Kobilarov, Silk - A Link Discovery Framework for the Web
     of Data, in: 2nd Workshop about Linked Data on the Web, Madrid, Spain, 2009.
[10] A. Algergawy, S. Babalou, F. Klan, B. König-Ries, Ontology modularization with OAPT,
     Journal on Data Semantics (2020).
[11] B. C. Grau, I. Horrocks, Y. Kazakov, U. Sattler, Just the right amount: extracting modules
     from ontologies, in: Proceedings of the 16th international conference on World Wide Web,
     2007.
[12] A. A. Romero, M. Kaminski, B. C. Grau, I. Horrocks, Module extraction in expressive
     ontology languages via datalog reasoning, J. Artif. Intell. Res. (2016).
[13] D. Faria, C. Pesquita, E. Santos, M. Palmonari, I. F. Cruz, F. M. Couto, The AgreementMakerLight
     ontology matching system, in: OTM Confederated International Conferences "On
     the Move to Meaningful Internet Systems", 2013.
[14] E. Jiménez-Ruiz, B. Cuenca Grau, LogMap: Logic-based and scalable ontology matching,
     in: International Semantic Web Conference, 2011.
[15] S. Hertling, H. Paulheim, ATBox results for OAEI 2020, in: CEUR Workshop Proceedings,
     volume 2788, 2020.

</pre>