<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Faria</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta C. Silva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro Cotovio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucas Ferraz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Balbi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Catia Pesquita</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>1.1. State</institution>
          ,
          <addr-line>Purpose, General Statement</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>INESC-ID, Instituto Superior Técnico, Universidade de Lisboa</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LASIGE, Faculdade de Ciências, Universidade de Lisboa</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Matcha is an ontology matching system designed to tackle long-standing challenges such as complex and holistic ontology matching. It incorporates all of the key algorithms from AgreementMakerLight within a novel, broader core architecture that includes several new algorithms. In this year's edition, some strategies were revised to address gaps identified last year, and a few new strategies debuted, most notably the inclusion of Language Models in two of our algorithms. Matcha performed well overall, achieving the highest F-measure in 15 out of 43 distinct OAEI tasks and ranking in the top three in ten others.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1.2. Specific Techniques Used</title>
      <p>
        Matcha includes all of AML’s lexical and structural matching algorithms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], as well as some of its background knowledge strategies [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For this year’s OAEI, some matching techniques were
revised, and some were newly developed.
      </p>
      <p>
        One of the new matching algorithms uses a Language Model (LM) in order to go beyond
the information that is explicitly stated in the ontology and exploit the context that labels and
synonyms can provide when represented through a language model. The matching algorithm
uses the LM to represent the entities’ labels and synonyms as embeddings, which are
subsequently compared through cosine similarity. Similarly to last year, we used the pre-trained
sentence-BERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] all-MiniLM-L6-v2 model without fine-tuning.
      </p>
      <p>Class and ontology matching (Table 1): matches classes based on overlapping individuals that instantiate them, computed through conservative instance matching algorithms; matches ontologies by finding literal full-name matches between their lexicons, weighing matches according to the provenance of the names; matches ontologies by computing the cosine similarity between the language model embeddings of their lexicons; matches ontologies by using cross-references and/or exact lexical matches between them and a third mediating ontology; matches ontologies by measuring the maximum string similarity, using one of the four available string similarity measures; and matches ontologies by measuring word similarity, using a weighted Jaccard index.</p>
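      <p>As a sketch of the embedding comparison step above: the embeddings themselves would come from the all-MiniLM-L6-v2 sentence-BERT model (384 dimensions); the toy 4-dimensional vectors below are purely illustrative.</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real sentence embeddings.
emb_heart = [0.9, 0.1, 0.3, 0.0]
emb_cardiac = [0.8, 0.2, 0.4, 0.1]
emb_femur = [0.0, 0.9, 0.1, 0.8]

# Labels with related meanings should score higher than unrelated ones.
print(cosine_similarity(emb_heart, emb_cardiac) >
      cosine_similarity(emb_heart, emb_femur))  # → True
```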
      <p>Instance matching (Table 1): matches individuals by finding literal matches between the values of their annotation and data properties; maps individuals by comparing their values through the ISub string similarity metric; maps individuals by comparing the lexicon entries of one with the values of the other, using a combination of string and word matching algorithms; and maps individuals by comparing sentence representations of the source and target labels, obtained with a LM trained in a multilingual setting.</p>
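      <p>The weighted Jaccard word similarity named above can be sketched as follows; the per-word weights here are illustrative assumptions, not Matcha's actual weighting scheme.</p>

```python
def weighted_jaccard(words_a, words_b, weights):
    """Weighted Jaccard index over two sets of words.

    `weights` maps a word to an evidence weight (illustrative here);
    plain Jaccard is the special case where every weight is 1.
    """
    union = words_a | words_b
    inter = words_a & words_b
    w_union = sum(weights.get(w, 1.0) for w in union)
    w_inter = sum(weights.get(w, 1.0) for w in inter)
    return w_inter / w_union if w_union else 0.0

a = {"heart", "valve", "disease"}
b = {"cardiac", "valve", "disease"}
# Down-weight a very common word so it carries less evidence.
print(round(weighted_jaccard(a, b, {"disease": 0.5}), 3))  # → 0.429
```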
      <p>
        Additionally, for any task that requires translation, we constructed a new translation module
that uses a pre-trained multilingual translation LM, the "M2M100" [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], with 1.2B parameters and
trained on 100 languages. The model uses a Transformer-based encoder-decoder architecture,
consisting of an encoder and a decoder network. The matching algorithm uses the encoder to
map each of the source and target ontologies’ labels to an embedding representation, followed
by a computation of the cosine similarity between the embeddings to generate a mapping score.
      </p>
      <p>Matcha’s matching algorithms are described in Table 1.</p>
    </sec>
    <sec id="sec-2">
      <title>1.3. Adaptations Made for the Evaluation</title>
      <p>
        The MELT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] web-based evaluation package required by the OAEI was integrated into Matcha. Given two ontologies
and a set of parameters, Matcha generates a complete alignment
between them according to the type of entities to be matched. For local alignment tasks, where
each entity in the test set has a predetermined list of candidate matches, Matcha calculates
scores for each candidate. These candidates are then ranked based on the highest score obtained
from the various matching algorithms.
      </p>
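      <p>The candidate scoring and ranking step for local alignment tasks can be sketched as below; the two toy matcher functions are hypothetical stand-ins for Matcha's actual matching algorithms.</p>

```python
def rank_candidates(entity, candidates, matchers):
    """Rank candidate matches for one entity.

    Each matcher maps (entity, candidate) to a similarity in [0, 1];
    a candidate's final score is the maximum over all matchers.
    """
    scored = [(c, max(m(entity, c) for m in matchers)) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

# Hypothetical toy matchers: exact-label match and token overlap.
def exact(a, b):
    return 1.0 if a == b else 0.0

def token_overlap(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

ranked = rank_candidates("heart valve",
                         ["mitral valve", "heart valve", "femur"],
                         [exact, token_overlap])
print(ranked[0])  # → ('heart valve', 1.0)
```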
      <p>Matcha was packaged in a docker container for ease of sharing and running the evaluation,
which included, for example, the files necessary for some of the algorithms, such as background
knowledge ontologies used in some tracks.</p>
      <sec id="sec-2-1">
        <title>2. Results</title>
        <p>Matcha’s results for OAEI are summarized in Table 2, with the exception of the results for the
Bio-ML track, which are presented in Table 3. Matcha performed well overall, achieving the
highest F-measure out of all systems in 15 out of the 43 distinct OAEI tasks, while ranking in
the top 3 in ten others.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2.1. Anatomy track</title>
      <p>Matcha continues to excel in this track, placing first among all systems, with all evaluation
metrics above 0.9 (0.951 precision, 0.931 recall, 0.941 F-measure). While not ranking
first in precision, both its precision and recall are very high, resulting in a high F-measure. Notably,
the second-best system achieves only 0.903 in F-measure.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Archaeology Multilingual track</title>
      <p>This task had two participants: Matcha and LogMap with its three variants. Matcha achieved
first place in F-measure in three out of ten tasks, but in some tasks both systems achieved
low performance scores, including one task where all systems failed to return results. The results
are heterogeneous, with some tasks achieving high precision (including perfect precision for
the de-de task), while others achieve values close to zero (six out of ten). In terms of recall, the
results are also fairly heterogeneous, with values varying between close to zero and 0.75.</p>
      <p>As this is the first year that Matcha debuts the integrated MLLM-based translation
module, we plan to explore other multilingual pretrained models and possibly perform a
statistical analysis of the differences in language coverage and depth to sustain our
future choice of a multilingual model for Matcha’s translation module.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3. Biodiversity and Ecology track</title>
      <p>This task also had two participants: Matcha and LogMap with its three variants. Matcha
achieved first place in the F-measure in three of the nine tasks, while in most of the other
tasks, it achieved scores that were very close to those obtained by LogMap. It is interesting to
note that, in the NCBITAXON-TAXREFLD group, Matcha achieves perfect recall in two tasks,
while in four others the value is very close to 1.0 (the lowest being 0.984). Precision is mostly
consistent, oscillating between 0.57 and 0.74, with poorer results in the
MACROALGAE-MACROZOOBENTHOS and FISH-ZOOPLANKTON tasks, which are around 0.2.</p>
    </sec>
    <sec id="sec-6">
      <title>2.4. Circular Economy track</title>
      <p>In this new track, Matcha placed first out of all systems by F-measure. The results are moderate
and close to other competing systems, with 0.393 precision and 0.611 recall. According to the
organizers and an additional assessment performed, Matcha’s optimal threshold could be set to
0.9, which would capture most true positives. On a less positive note, a manual evaluation by the
organizers shows that Matcha finds a fair number of false-positive mappings, probably due to
entities sharing the same name or the same words, an interesting insight
that could be used to further improve our strategies.</p>
    </sec>
    <sec id="sec-7">
      <title>2.5. Conference track</title>
      <p>Matcha tied in first place with another competing system, improving over last year’s placement
in all measures (precision, recall, and F-measure).</p>
      <p>An additional evaluation was run to assess differences in results across the sharp, discrete, and
continuous settings. From this assessment, it is noted that Matcha performs well in the sharp
evaluation in terms of recall (0.67), but in the discrete uncertain setting, while its precision
drops, recall improves to 0.77, indicating that it is successful at identifying uncertain matches.
Matcha also appears to adapt well to the uncertain framework in the continuous setting, as its
recall and F-measure remain relatively high at 0.75 and 0.71.</p>
      <p>Regarding the evaluation performed based on logical reasoning, Matcha has 86 conservativity
principle violations and 72 consistency principle violations in an alignment of 21 mappings.
However, as the organizers note, conservativity principle violations can simply be false positives.</p>
    </sec>
    <sec id="sec-8">
      <title>2.6. Digital Humanities track</title>
      <p>Matcha achieves overall good results in this track, even if somewhat heterogeneous. Matcha
ranks first in four out of the eight tasks, and in the top 3 in one other. Precision is high in some
tasks (reaching 1.0 in one of them); however, in some others the value is close to zero,
with Matcha yielding no results in one of these tasks. Recall suffers less from this variability,
with only the failed task having a value close to zero, and with good values for all others.</p>
      <p>Similarly to the Archaeology Multilingual track, this track uses the MLLM-based translation
module, which will be further explored and reviewed.</p>
    </sec>
    <sec id="sec-9">
      <title>2.7. Food Nutritional Composition track</title>
      <p>Matcha only competed in the “equal” relation test case, placing first against competing systems,
albeit with an F-measure of 0.1016, lower than in other tracks where it also places
first. While Matcha is less precise than other systems (0.0611 against 0.1333), it compensates
with its ten times higher recall (0.3013 against 0.0274). This track poses challenges that current
systems are clearly not well equipped to handle.</p>
    </sec>
    <sec id="sec-10">
      <title>2.8. Knowledge Graph track</title>
      <p>Matcha places last in this track when assessing the aggregated results. Looking at class mappings,
Matcha performs well overall, with 0.97 precision, 0.8 recall, and 0.87 F-measure,
outperforming both competing systems and the baselines. All systems fail at finding property
mappings. As for instance mappings, Matcha has lower performance, with 0.55 precision,
0.86 recall, and 0.63 F-measure, finding far more mappings than the other systems (249,510
mappings versus 6,653.8 by the next system), which decreases precision significantly.</p>
      <p>Looking at each of the test cases, a pattern emerges in which Matcha has lower precision
and higher recall compared to all other systems. However, the precision values are too low
to be compensated by the high recall, leading Matcha to place last by F-measure
in four of the five test cases and second in the remaining one.</p>
      <p>In this track, two main problems arise which need to be assessed and corrected: the lack of
property mappings and the excessive amount of instance mappings produced, which directly
influence the system’s precision.</p>
    </sec>
    <sec id="sec-11">
      <title>2.9. Multifarm track</title>
      <p>Matcha’s performance in this track is fairly balanced, considering that a new strategy of using LLMs
for multilingual machine translation debuted this year.</p>
      <p>This year Matcha ranked second out of four systems, improving on last year’s results, when
Matcha competed without the LLM module and ranked fourth out of four, with a clear improvement
in recall and F-measure. Although Matcha’s running time is within the same order of magnitude
as other competing systems, we recognize that it is very time-consuming in its current iteration
and could be optimized in future versions of the system.</p>
    </sec>
    <sec id="sec-12">
      <title>2.10. Bio-ML track</title>
      <p>This year marks Matcha’s first time competing in the local alignment challenges of this track.
While Matcha’s Bio-ML rankings based on F-score were moderate compared to the other
participating models, Matcha demonstrated a stronger relative performance when considering
MRR, especially in the unsupervised setting. Matcha’s middle-ranking F-scores stemmed from
generally high precision paired with relatively low recall, a trend also evident among most other
participating systems, highlighting that improving recall without compromising
precision remains an open issue. Notable Matcha results in the Bio-ML track include:
a top-3 MRR ranking in 3 of the 5 tasks in the unsupervised setting; first and second
place in the MRR ranking in the unsupervised and supervised settings of the SNOMED-FMA (body)
task, respectively; and second place in the F-score ranking in the unsupervised SNOMED-NCIT
(pharm) task.</p>
      <sec id="sec-12-1">
        <title>3. Conclusions</title>
        <p>Matcha achieved the highest F-measure in 15 out of the 43 distinct OAEI tasks and ranked in
the top 3 in ten others, making it overall the second-best system that competed this year.</p>
        <p>This year a new approach for the translation module debuted, allowing Matcha to
improve its rank in the Multifarm track and place fairly well in the new Archaeology
Multilingual and Digital Humanities tracks. Moreover, across all tasks, Matcha tends to outperform
other systems in recall, while tending to underperform in precision, sometimes due to an
exaggerated number of mappings that turn out to be false positives. Some tracks require
further review, such as the Knowledge Graph track, where Matcha fails to find any property
mappings.</p>
      </sec>
      <sec id="sec-12-2">
        <title>Acknowledgements</title>
        <p>This work was supported by FCT through fellowships 2022.11895.BD (Marta
Silva), 2022.10557.BD (Pedro Cotovio), the KATY project fellowship R881.7 (Laura Balbi),
and the LASIGE Research Unit, ref.
UIDB/00408/2020 (https://doi.org/10.54499/UIDB/00408/2020) and ref. UIDP/00408/2020
(https://doi.org/10.54499/UIDP/00408/2020). It was partially supported by the KATY project
which has received funding from the European Union’s Horizon 2020 research and innovation
program under grant agreement No 101017453, and it was also partially supported by project
41, HfPT: Health from Portugal, funded by the Portuguese Plano de Recuperação e Resiliência.
</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>É.</given-names>
            <surname>Thiéblin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Haemmerlé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Trojahn</surname>
          </string-name>
          , Survey on complex ontology matching,
          <source>Semantic Web</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>689</fpage>
          -
          <lpage>727</lpage>
          . URL: https://doi.org/10.3233/SW-190366. doi:10.3233/SW-190366.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Megdiche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Teste</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Trojahn</surname>
          </string-name>
          ,
          <article-title>An extensible linear approach for holistic ontology matching</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2016</year>
          , pp.
          <fpage>393</fpage>
          -
          <lpage>410</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Faria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Balasubramani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pesquita</surname>
          </string-name>
          ,
          <article-title>AgreementMakerLight</article-title>
          ,
          <source>Semantic Web</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Faria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pesquita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          ,
          <article-title>The AgreementMakerLight Ontology Matching System</article-title>
          ,
          <source>in: OTM Conferences - ODBASE</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>527</fpage>
          -
          <lpage>541</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Faria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pesquita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          ,
          <article-title>Automatic Background Knowledge Selection for Matching Biomedical Ontologies</article-title>
          ,
          <source>PLoS One</source>
          <volume>9</volume>
          (
          <year>2014</year>
          )
          <article-title>e111226</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          , arXiv preprint arXiv:1908.10084 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Kishky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Celebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Liptchinsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <article-title>Beyond english-centric multilingual machine translation</article-title>
          ,
          <year>2020</year>
          . arXiv:2010.11125.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hertling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Portisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          ,
          <article-title>MELT - matching evaluation toolkit</article-title>
          ,
          <source>in: Semantic Systems. The Power of AI and Knowledge Graphs - 15th International Conference, SEMANTiCS 2019, Karlsruhe, Germany, September 9-12, 2019, Proceedings</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>245</lpage>
          . URL: https://doi.org/10.1007/978-3-030-33220-4_17. doi:10.1007/978-3-030-33220-4_17.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>