=Paper=
{{Paper
|id=Vol-3324/oaei22_paper5
|storemode=property
|title=Cross-lingual ontology matching with CIDER-LM: results for OAEI 2022
|pdfUrl=https://ceur-ws.org/Vol-3324/oaei22_paper5.pdf
|volume=Vol-3324
|authors=Javier Vela,Jorge Gracia
|dblpUrl=https://dblp.org/rec/conf/semweb/VelaG22
}}
==Cross-lingual ontology matching with CIDER-LM: results for OAEI 2022==
Javier Vela (775593@unizar.es, https://github.com/javiervela/, ORCID 0000-0002-6818-9191), Jorge Gracia (jogracia@unizar.es, http://jogracia.url.ph/web/, ORCID 0000-0001-6452-7627)

Department of Computer Science and Systems Engineering, University of Zaragoza, María de Luna 1, 50018 Zaragoza, Spain

Ontology Alignment Evaluation Initiative (OAEI) 2022 campaign, October 23rd or 24th, 2022, Hangzhou, China. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: In this paper, the CIDER-LM cross-lingual matching system is presented, as well as the results it achieved during the OAEI (Ontology Alignment Evaluation Initiative) 2022 campaign. This is the first appearance of CIDER-LM in OAEI, where it participated only in MultiFarm, the track for cross-lingual ontology alignment evaluation. The matching system uses a pre-trained multilingual language model based on transformers, fine-tuned using the openly available portion of the MultiFarm dataset. The model calculates the vector embeddings of the labels associated with every ontology entity and its context. The confidence degree between matching entities is computed as the cosine similarity between their associated embeddings. CIDER-LM is novel in its use of multilingual language models for cross-lingual ontology matching. Its initial version obtained promising results in the OAEI'22 MultiFarm track, attaining a modest precision but the best overall performance in recall.

Keywords: Cross-lingual Ontology Matching, Ontology Alignment, Natural Language Processing, Language Models, Transformers, CIDER-LM, Sentence-BERT

1. Presentation of the system

CIDER-LM is largely inspired by an earlier system called CIDER-CL [1], the cross-lingual version of CIDER (Context and Inference basED ontology alignER), although CIDER-LM completely redesigns and reimplements it. CIDER-LM stands for Language Model-based CIDER. Unlike its predecessor, which based its cross-lingual capabilities on cross-lingual explicit semantic analysis [2], CIDER-LM uses language models such as BERT [3], which is based on the Transformer architecture [4]. Transformers are encoder-decoder neural networks with a self-attention mechanism. They are very popular for solving Natural Language Processing (NLP) tasks. BERT is a language model that leverages the transformer encoder to predict tokens in a sentence when given a context. Numerous studies have been published exploring how BERT can be fine-tuned to solve a wide variety of NLP tasks. The original BERT was pre-trained using a document-level corpus in English, extracted from the BooksCorpus (800M words; https://huggingface.co/datasets/bookcorpus) and the English Wikipedia (2,500M words; https://huggingface.co/datasets/wikipedia).

The use of BERT and similar models has proven to be useful in ontology matching, for example in the biomedical domain [5], but they had remained largely unexplored for cross-lingual matching until now. Multilingual language models, which can represent tokens from several languages in the same embedding space, can be used to that end. Sentence-BERT (SBERT) [6], a modification of the pre-trained BERT, produces semantically meaningful embeddings for sentences. These sentence embeddings are aligned in the same embedding space, in contrast to other models that create embeddings at the token level. In addition, SBERT monolingual models can be made multilingual using knowledge distillation [7]. The resulting models produce vector embeddings that can be compared to find the similarity between sentences in two different languages. CIDER-LM uses SBERT to associate embeddings to the labels of the ontology entities (and their context), which are compared using cosine similarity to create a matching confidence score between them. The choice of sentence embeddings instead of token embeddings (as used by other multilingual language models) is motivated by the fact that many ontology labels are multi-word expressions, so embeddings that represent whole sentences (and not only atomic tokens) are more suitable to capture their semantic content and to compute similarities between them [8, 6].
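As an illustration of this idea, the following is a minimal sketch (not taken from the CIDER-LM code base; the example labels are invented) of how a multilingual SBERT model from the SentenceTransformers library can score the similarity of two labels written in different languages:

```python
# Minimal sketch: cross-lingual similarity with a multilingual SBERT model.
# The model name is the one mentioned in Section 1.2; the labels are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v2")

labels = ["conference participant", "participante de la conferencia"]  # English / Spanish
embeddings = model.encode(labels, convert_to_tensor=True)

# Cosine similarity of the two sentence embeddings, used as the matching confidence
confidence = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"confidence = {confidence:.3f}")
```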
1.1. State, purpose, general statement

CIDER-LM is a cross-lingual ontology matching system that utilizes a transformer-based pre-trained multilingual language model, fine-tuned using the openly available portion of the MultiFarm dataset (https://www.irit.fr/recherches/MELODI/multifarm/). The model calculates the vector embeddings of the labels associated with every ontology entity and its context. The confidence degree between two matching entities is computed as the cosine similarity between their associated embeddings. The generated alignments are one-to-one mappings between entities from the two input ontologies. The input ontologies must be in OWL or RDF-S format, and the output is expressed in the Alignment Format (http://alignapi.gforge.inria.fr/format.html). The type of discovered correspondence is "equivalence", with a confidence degree in [0, 1]. CIDER-LM works with ontology classes and properties, not yet with instances.

1.2. Specific techniques used

CIDER-LM integrates a fine-tuned version of distiluse-base-multilingual-cased-v2 (https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2), a multilingual pre-trained model from Sentence Transformers (https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models). This ontology aligner is implemented in Python and is wrapped using the Matching EvaLuation Toolkit (MELT), a framework for ontology matching and evaluation [9]. The system is packaged as a Docker image implementing the Web Interface (https://dwslab.github.io/melt/matcher-packaging/web#web-interface-http-matching-interface).

An overall view of the CIDER-LM matching pipeline is shown in Figure 1.

Figure 1: CIDER-LM architecture. Given two ontologies O1 and O2, they are expanded by a reasoner. The labels characterising the compared entities are extracted, verbalized, and an embedding is obtained for every one of them. The embeddings are compared using cosine similarity, passed through the Maximum Weight Bipartite Extractor and the Threshold Filter, obtaining the final alignment.

In short, the process is as follows. First, the aligner receives a source (O1) and a target (O2) ontology. Both ontologies are read into Python objects using the Owlready2 library (https://owlready2.readthedocs.io/en/v0.37/) and fed individually into a semantic reasoner, which extends the ontologies by inferring semantic relations not initially declared. The class and property labels extracted from both ontologies are verbalized using their ontological context. Then, the verbalized labels are passed to the Transformer model, which obtains an embedding for each of the entities in the ontologies (i.e., a vector capturing the semantics of the entity). Each embedding from O1 is compared to every embedding from O2 using cosine similarity, forming a bipartite graph. The Maximum Weight Bipartite Extractor algorithm obtains an initial alignment, which is reduced by a threshold filter, producing the final alignment. The following paragraphs describe each of the involved techniques in some more detail.

1.2.1. Ontology Reasoning

CIDER-LM performs a preliminary reasoning of both the source and the target ontologies using the HermiT OWL Reasoner (http://www.hermit-reasoner.com/). This step expands the semantic relations in the ontologies, inferring new knowledge not initially asserted. However, including the reasoning in the pipeline increases the execution time of CIDER-LM considerably. This step is optional, but we kept it for OAEI'22.
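As a minimal sketch of this preliminary step (the file path and the label-selection helper are illustrative, not taken from the CIDER-LM repository), Owlready2 can load an ontology and run its bundled HermiT reasoner, which requires a Java runtime, to materialise the inferred relations:

```python
# Minimal sketch: load an ontology with Owlready2 and expand it with the HermiT reasoner.
# The file path is illustrative; sync_reasoner() invokes HermiT and needs Java installed.
from owlready2 import get_ontology, sync_reasoner

onto = get_ontology("file:///data/multifarm/edas-en.owl").load()

with onto:
    sync_reasoner()  # adds inferred class/property relations to the loaded ontology

def first_label(entity):
    """Return the first rdfs:label if present, otherwise fall back to the entity name."""
    return entity.label[0] if entity.label else entity.name

class_labels = {c: first_label(c) for c in onto.classes()}
property_labels = {p: first_label(p) for p in onto.properties()}
print(len(class_labels), "classes,", len(property_labels), "properties")
```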
1.2.2. Verbalization of entities

CIDER-LM considers ontology classes and properties. For both types of entities, their associated labels are extracted to build the corresponding sentence embeddings. In order to make the calculated embedding more representative of the semantics of the analyzed ontology entity, the labels of other entities that are part of its ontological context are also considered. The verbalization process is needed so that the language model can process the set of labels (coming from the entity and its ontological context) as an individual sentence.

In its current version, CIDER-LM builds the ontological context of an entity from its neighboring entities, treating classes and properties differently (we use '+' as the string concatenation operator):

Class Verbalization. For a class 𝑎 and its label 𝑙(𝑎), the sets of parent classes and child classes in the hierarchy of the ontology are 𝑃𝑎 and 𝐶𝑎, respectively. The verbalization of the class is the concatenation of the following:
• 𝑙(𝑎) + ', '
• for every class 𝑝 in 𝑃𝑎: 𝑙(𝑎) + ' is a ' + 𝑙(𝑝) + ', '
• for every class 𝑐 in 𝐶𝑎: 𝑙(𝑐) + ' is a ' + 𝑙(𝑎) + ', '

Property Verbalization. For a property 𝑎 and its label 𝑙(𝑎), the sets of its domain and range classes in the ontology are 𝐷𝑎 and 𝑅𝑎, respectively. The verbalization of the property is the concatenation of the following:
• 𝑙(𝑎) + ', '
• for every class 𝑑 in 𝐷𝑎: 𝑙(𝑎) + ' has domain ' + 𝑙(𝑑) + ', '
• for every class 𝑟 in 𝑅𝑎: 𝑙(𝑎) + ' has range ' + 𝑙(𝑟) + ', '

If a certain entity has more than one label assigned, the first label is chosen for the verbalization. The verbalized sentences built from the labels are concatenated with particles in English, independently of the language of the ontology. During the preliminary evaluation of the system, we found evidence that the particular language used for concatenating the labels was not very relevant, even if it differed from the language of the ontology. This will require further exploration in the future.
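The rules above can be written down as the following short sketch (the helper names and the way the context sets are passed in are illustrative, not the actual CIDER-LM implementation):

```python
# Minimal sketch of the class and property verbalization rules described above.
# first_label(e) is assumed to return the first label of an entity (see Section 1.2.2).

def verbalize_class(cls, parents, children, first_label):
    """'label, label is a parent, ..., child is a label, ...'"""
    label = first_label(cls)
    parts = [label + ", "]
    parts += [f"{label} is a {first_label(p)}, " for p in parents]
    parts += [f"{first_label(c)} is a {label}, " for c in children]
    return "".join(parts)

def verbalize_property(prop, domains, ranges, first_label):
    """'label, label has domain ..., label has range ...'"""
    label = first_label(prop)
    parts = [label + ", "]
    parts += [f"{label} has domain {first_label(d)}, " for d in domains]
    parts += [f"{label} has range {first_label(r)}, " for r in ranges]
    return "".join(parts)
```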
1.2.3. Fine-tuned language model

CIDER-LM relies on distiluse-base-multilingual-cased-v2, which is pre-trained on Semantic Textual Similarity (STS) and uses the SBERT architecture. The model is the knowledge-distilled version of the Universal Sentence Encoder and supports more than 50 languages. Given a sentence, the model produces an embedding in a 512-dimensional dense vector space.

A checkpoint of the distiluse-base-multilingual-cased-v2 model was downloaded from HuggingFace using the SentenceTransformers Python framework. In fact, CIDER-LM uses a fine-tuned version of the model, specialized on the task of obtaining similarities between two entity labels in different languages. Given a set of pairs of entity labels obtained from the training set, together with a true confidence of 1 or 0 indicating whether the entities are a match or not, the model is trained to reduce the CosineSimilarityLoss between the predicted and the true confidence. We use the SentenceTransformers framework and a basic training pipeline to obtain the fine-tuned model used in the matching system.

The cosine similarity metric is used to measure the distance between the vector embeddings associated with the entities coming from the two different ontologies in the common embedding space. This offers a measure of how similar two sentences (the verbalised sets of entity labels) are, resulting in the confidence degree associated to the possible matching between the two entities.
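A basic fine-tuning loop of this kind can be sketched as follows with the SentenceTransformers framework; the training pairs and hyperparameters shown here are purely illustrative, not the ones used for the submitted model:

```python
# Minimal sketch: fine-tuning the multilingual model with CosineSimilarityLoss.
# The training examples are invented; the real pairs come from the open MultiFarm subset.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v2")

train_examples = [
    # verbalized labels in two languages; label = 1.0 for a match, 0.0 otherwise
    InputExample(texts=["conference, conference is a event, ",
                        "conférence, conférence is a évènement, "], label=1.0),
    InputExample(texts=["author, ", "lieu, "], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("ciderlm-finetuned")  # illustrative output path
```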
1.2.4. Maximum Weight Bipartite Extractor

CIDER-LM obtains a matching confidence for every pair of entities from the source and target ontologies. Once the confidence degrees have been determined, the alignment can be considered as a bipartite graph with an edge (weighted by the confidence in a match) from every node (entity) of the source ontology to every node of the target ontology. Using the implementation of the Hungarian algorithm in the scipy Python library (https://scipy.org/), the maximum weight matching problem in this bipartite graph is solved.

1.2.5. Threshold filter

Once a complete alignment has been obtained, a threshold filter is used to remove all the correspondences with low confidence from the alignment. Every correspondence with a confidence lower than the default threshold value is removed. The threshold value can be used to direct the results obtained by the system: a higher value will promote precision, while a lower value will favour recall. CIDER-LM applies a default threshold value of 0.5 that, according to the results, promotes recall.
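A minimal, self-contained sketch of these two steps, using scipy's linear_sum_assignment (an implementation of the Hungarian algorithm) on an invented similarity matrix, could look as follows:

```python
# Minimal sketch: maximum weight bipartite extraction plus threshold filtering.
# The similarity values are invented; rows index O1 entities, columns index O2 entities.
import numpy as np
from scipy.optimize import linear_sum_assignment

def extract_alignment(similarity, threshold=0.5):
    """One-to-one alignment from a |O1| x |O2| cosine-similarity matrix."""
    rows, cols = linear_sum_assignment(similarity, maximize=True)  # max-weight matching
    return [(i, j, similarity[i, j]) for i, j in zip(rows, cols)
            if similarity[i, j] >= threshold]                      # threshold filter

sim = np.array([[0.91, 0.20, 0.10],
                [0.15, 0.40, 0.75],
                [0.30, 0.85, 0.25]])
print(extract_alignment(sim))  # keeps (0,0), (1,2) and (2,1); all above the 0.5 threshold
```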
1.3. Adaptations made for the evaluation

To participate in the OAEI campaign, we have wrapped the Python implementation of CIDER-LM with the MatcherCLI Java class from the MELT framework. Wrapping enables MELT evaluation and the packaging plugin, which is used to create the Docker container image for submission to the OAEI. CIDER-LM performs a preliminary substitution of '&xsd;date' by '&xsd;dateTime' and of 'http://www.w3.org/2001/XMLSchema#date' by 'http://www.w3.org/2001/XMLSchema#dateTime', because the dataset of ontologies used for the evaluation was not initially recognized by the HermiT reasoner as OWL Version 2, which is the only version of the format accepted by the reasoner. Performing the substitution fixes the error when reading the ontologies.

1.4. Link to the system and parameters file

The implementation of the system is hosted in a GitHub repository (https://github.com/sid-unizar/CIDER-LM). The container image with the matching system is also available on GitHub in the Packages section.

2. Results

In its first participation in the OAEI campaign (2022), CIDER-LM contributed to the MultiFarm track only. The reason is that this tool is primarily aimed at cross-lingual ontology matching. However, the method is able to produce monolingual mappings as well, so we do not discard participation in other tracks in future OAEI editions.

The MultiFarm evaluation involves two ontologies from the Conference domain (edas and ekaw), which are translated from English into eight other languages, resulting in 55 × 24 matching tasks. Details about the test ontologies, the evaluation process, and the complete results for the MultiFarm track can be found on the OAEI'22 website (https://oaei.ontologymatching.org/2022/multifarm/index.html). The results reported by the OAEI organisers on October 9th of 2022 describe the precision, recall and F-measure of the alignments produced by each of the participant systems. The aggregated results for the matching task are shown in Table 1.

Table 1: MultiFarm aggregated results per matcher for different ontologies

System | Time (min) | Precision | F-measure | Recall
CIDER-LM | ~157 | 0.16 | 0.25 | 0.58
LSMatch | ~33 | 0.24 | 0.038 | 0.021
LSMatch Multilingual | ~69 | 0.68 | 0.47 | 0.36
LogMap | ~9 | 0.72 | 0.44 | 0.31
LogMapLt | ~175 | 0.24 | 0.038 | 0.02

3. General comments

The following sections contain some remarks and comments on the results obtained and the evaluation process.

3.1. Comments on the results

The results obtained in MultiFarm are intermediate in terms of F-measure (third-best result out of five participants) and very good in terms of recall, attaining the best result of any OAEI edition for the MultiFarm "different ontologies" sub-task (higher values can only be found in OAEI'12, from two matchers that returned nearly every possible combination as a result, thus obtaining close to zero values of precision and F-measure). The results of CIDER-LM largely improve those obtained by its predecessor tool CIDER-CL [1] for the MultiFarm different ontologies sub-task, which were P = 0.16, R = 0.19, and F = 0.17. The results of CIDER-LM in the OAEI multilingual track are still modest, especially in precision, but the fact that even the best systems do not score very high illustrates the difficulty of the problem. For instance, the F-measure attained in MultiFarm was never higher than 0.47 in any OAEI edition.

3.2. Discussions on the way to improve the proposed system

The analysis of the OAEI'22 results shows that the CIDER-LM matching system has the potential to improve its current results in several ways. For instance, the current version builds embeddings based on the entity labels, the labels of parents and children for classes, and the labels of domain and range for properties. A future version could include more features in the verbalization of labels. The current results show a clear imbalance between precision and recall. A higher threshold value would help promote the precision measure, thus achieving a higher F-measure. The results also show that the execution time of the tool is greater than that of the other participants. Removing the use of a reasoner would greatly reduce the time, with only a small impact on the quality of the results, as seen in our internal experiments. Furthermore, a more careful study of the fine-tuning process could lead to an improvement of the CIDER-LM alignment results. Involving more sophisticated techniques in the training and validation of the fine-tuned model can reduce overfitting and provide a more general model that behaves better with ontologies different from the ones seen in training.

3.3. Comments on the OAEI procedure

The MELT wrapper for the Python matching system has proven to be easy to comprehend and implement, and useful for encapsulating the tool and later creating the web server container image for the OAEI submission. We consider the inclusion of the MELT framework a significant advancement in OAEI, since it allows the integration and participation of non-Java-based implementations.

4. Conclusion

This paper presented the first version of CIDER-LM, which explores for the first time the potential of multilingual language models for the task of finding cross-lingual alignments between ontologies in different languages. The system uses SBERT, a multilingual language model based on the Transformer architecture. It was evaluated on the OAEI'22 MultiFarm track, achieving intermediate results in terms of F-measure and very good results in terms of recall. Although there is much room for further improvement, we consider that the CIDER-LM results have proved the viability of using multilingual language models for this task. In future versions, more features will be considered to build the ontological context, new verbalization strategies will be analyzed, and a more careful study of the fine-tuning process will be carried out, to attain a better and more general model for cross-lingual ontology matching.

Acknowledgments

This article is the result of a collaboration grant 2021-22 at the Department of Computer Science and Systems Engineering, University of Zaragoza, funded by the Ministry of Education and Professional Training (Spain). It has also been partially supported by the Engineering Research Institute of Aragon (I3A), by the Spanish project PID2020-113903RB-I00 (AEI/FEDER, UE), by DGA/FEDER, and by the Agencia Estatal de Investigación of the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the "Ramón y Cajal" program (RYC2019-028112-I).

References

[1] J. Gracia, K. Asooja, Monolingual and cross-lingual ontology matching with CIDER-CL: Evaluation report for OAEI 2013, in: Proc. of 8th Ontology Matching Workshop (OM'13), at 12th International Semantic Web Conference (ISWC'13), volume 1111, CEUR-WS, ISSN 1613-0073, Sydney (Australia), 2013.

[2] P. Sorg, P. Cimiano, Exploiting Wikipedia for cross-lingual and multilingual information retrieval, Data & Knowledge Engineering 74 (2012) 26–45. doi:10.1016/j.datak.2012.02.003.

[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Stroudsburg, PA, USA, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.

[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention Is All You Need, in: Proc. of 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.

[5] Y. He, J. Chen, D. Antonyrajah, I. Horrocks, Biomedical Ontology Alignment with BERT, in: Proc. of 16th International Workshop on Ontology Matching (OM'21) co-located with the 20th International Semantic Web Conference (ISWC 2021), CEUR-WS, 2021, pp. 1–12.
[6] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.

[7] N. Reimers, I. Gurevych, Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2020. URL: https://arxiv.org/abs/2004.09813.

[8] S. Neutel, M. H. de Boer, Towards Automatic Ontology Alignment using BERT, in: Proc. of the AAAI 2021 Spring Symposium on Combining Machine Learning and Knowledge Engineering (AAAI-MAKE 2021), CEUR-WS, Palo Alto, California, USA, 2021.

[9] S. Hertling, J. Portisch, H. Paulheim, MELT - Matching EvaLuation Toolkit, in: Semantic Systems. The Power of AI and Knowledge Graphs - 15th International Conference, SEMANTiCS 2019, Karlsruhe, Germany, September 9-12, 2019, Proceedings, 2019, pp. 231–245. doi:10.1007/978-3-030-33220-4_17.