
SORBETMatcher Results for OAEI 2023

Francis Gosselin

Amal Zouaq

LAMA-WeST Lab, Department of Computer Engineering and Software Engineering, Polytechnique Montreal, 2500 Chem. de Polytechnique, Montréal, QC H3T 1J4, Canada

This paper presents the results of SORBETMatcher in the OAEI 2023 competition. SORBETMatcher is a schema matching system for both equivalence matching and subsumption matching. SORBETMatcher is largely based on SORBET Embeddings, a novel ontology embedding method that leverages large language models, random walks, and a regression loss to construct a latent space that encapsulates ontology structures. Despite recognizing certain limitations inherent in SORBET Embeddings, SORBETMatcher performed well in the OAEI competition. It emerged as the leading system in three out of the five subsumption matching challenges within the Bio-ML track, as well as in the equivalence matching problem involving ORDO-DOID.

Keywords: Ontology alignment, Schema matching, Representation Learning, ISWC-2023

1.2. Specific techniques used

1.2.1. Candidate Selection

The first step of the matching process is to determine which concepts are likely to be matched. Since fetching SORBET Embeddings can be a long process for large ontologies, reducing the number of candidate concepts can greatly improve the runtime. There are three strategies to obtain a smaller set of candidate classes. Firstly, we employ a string matcher that identifies pairs of concepts with matching labels or synonyms as alignments. Concepts originating from these high-precision alignments are pruned from the set of considered classes. Secondly, some classes in the Bio-ML track have the use_in_alignment tag indicating whether they should be used or not. Finally, in the local ranking of the Bio-ML track, candidate mappings are suggested from a test.cands file. We identify each unique class in the candidates, and consider them as the sole relevant classes.
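The first two pruning strategies can be illustrated with a minimal sketch. It is not the SORBETMatcher implementation: the dictionary layout (`names`, `use_in_alignment`), the `normalize` helper, and the exact-match rule on normalized names are assumptions made for illustration only.

```python
def normalize(text):
    """Lowercase and unify underscores/whitespace so names compare loosely."""
    return " ".join(text.replace("_", " ").lower().split())


def select_candidates(source, target):
    """Return (string_alignments, remaining_source, remaining_target).

    `source` and `target` map a class id to a dict with hypothetical keys:
    "names" (label plus synonyms) and "use_in_alignment" (bool, default True).
    """
    # Index every normalized target name by the classes that carry it.
    index = {}
    for t_id, info in target.items():
        for name in info["names"]:
            index.setdefault(normalize(name), set()).add(t_id)

    # A pair is aligned when any source name equals any target name.
    alignments = set()
    for s_id, info in source.items():
        for name in info["names"]:
            for t_id in index.get(normalize(name), ()):
                alignments.add((s_id, t_id))

    matched_s = {s for s, _ in alignments}
    matched_t = {t for _, t in alignments}

    def keep(class_id, info, matched):
        # Drop classes already aligned by strings or flagged as unusable.
        return class_id not in matched and info.get("use_in_alignment", True)

    remaining_s = [c for c, i in source.items() if keep(c, i, matched_s)]
    remaining_t = [c for c, i in target.items() if keep(c, i, matched_t)]
    return alignments, remaining_s, remaining_t


source = {
    "s1": {"names": ["Heart"]},
    "s2": {"names": ["Renal artery", "kidney artery"]},
    "s3": {"names": ["Obsolete term"], "use_in_alignment": False},
    "s4": {"names": ["Lung lobe"]},
}
target = {
    "t1": {"names": ["heart"]},
    "t2": {"names": ["Kidney_Artery"]},
    "t3": {"names": ["Lobe of lung"]},
}
alignments, remaining_s, remaining_t = select_candidates(source, target)
```

In this toy example only `s4` and `t3` would still need embeddings, which is where the runtime saving comes from.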

1.2.2. SORBET Embeddings

SORBET is an Ontology Embedding method that has the goal of obtaining rich BERT embeddings while rearranging the latent space based on the ontology’s structure. To achieve this, SORBET fine-tunes SentenceBERT, a pre-trained siamese BERT model, with a regression loss based on the distance between classes:

where $M$ is a training dataset containing pairs of classes $(c_1, c_2)$, $\hat{y}$ is the predicted similarity, and $A$ is a hyperparameter representing the distance between 2 classes. Intuitively, the parameter $A$ controls the sparsity of the ontology’s classes in the latent space. The bigger the value of $A$, the larger the distance between neighbor classes. The distance $d$ is defined by the number of subClassOf relationships between $c_1$ and $c_2$.
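The distance d can be illustrated with a short breadth-first search. This is a sketch under assumptions, not SORBET's code: the `parents` mapping is a hypothetical representation of the subClassOf relations, and treating those relations as undirected edges (so that two siblings are at distance 2) is an assumption.

```python
from collections import deque


def subclass_distance(parents, a, b):
    """Count subClassOf edges on the shortest path between a and b.

    `parents` maps a class to the list of its direct superclasses.
    Returns None when the two classes are not connected.
    """
    # Build an undirected adjacency map from the child -> parents relation.
    neighbors = {}
    for child, supers in parents.items():
        for parent in supers:
            neighbors.setdefault(child, set()).add(parent)
            neighbors.setdefault(parent, set()).add(child)

    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


parents = {"Dog": ["Mammal"], "Cat": ["Mammal"], "Mammal": ["Animal"]}
d_siblings = subclass_distance(parents, "Dog", "Cat")
d_ancestor = subclass_distance(parents, "Dog", "Animal")
```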

To obtain SORBET embeddings representing classes, the input of the SentenceBERT model is a random walk describing each class, providing context to classes given their neighbor subclasses, parent classes, and classes related by object properties. Both at training and inference time, a new random walk is created to describe a concept. The fine-tuning of SentenceBERT is achieved with pairs of concepts composed of positive samples, semi-negative samples and negative samples.

By sampling a class and its neighbors, then computing a similarity score relative to their distance, SORBET Embeddings attempts to replicate the structure of the ontology in the latent space. Therefore, similar classes from different ontologies get restricted into the same region of the latent space, making embeddings well-suited for the alignment or matching task.

1.2.3. Compute cosine similarity matrix

Using the embedding of all relevant classes, a similarity matrix is constructed using the cosine similarity measure as highlighted by equation 2.

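$$S_{ij} = \frac{\Omega(c^{s}_{i}) \cdot \Omega(c^{t}_{j})}{\lVert \Omega(c^{s}_{i}) \rVert \, \lVert \Omega(c^{t}_{j}) \rVert} \qquad (2)$$

where Ω is a function that transforms a concept into its SORBET Embedding, and $c^{s}_{i}$ and $c^{t}_{j}$ represent the source and target concepts respectively.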

The i-th row represents the i-th concept from the source ontology and the j-th column represents the j-th concept from the target ontology. This matrix is initialized with a few values. The similarity of the alignments output by the string matcher (described in the candidate selection in section 1.2.1) is set to 1.0, while the columns and rows of the concepts whose use_in_alignment property is False are set to 0. For local rankings, the similarity between pairs of classes that are not in the candidates is set to 0. All the remaining cells are filled with the cosine similarity values.
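The construction of this matrix can be sketched in plain Python. The function and parameter names are hypothetical, embeddings are given as plain lists of floats, and the order in which the overrides are applied (candidate mask, then excluded rows and columns, then string matches) is an assumption rather than a documented detail of SORBETMatcher.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_matrix(src_emb, tgt_emb, string_matches=(),
                            excluded_src=(), excluded_tgt=(), candidates=None):
    """Rows index source concepts, columns index target concepts."""
    matrix = [[cosine(u, v) for v in tgt_emb] for u in src_emb]

    # Local ranking: pairs outside the candidate set get similarity 0.
    if candidates is not None:
        for i, row in enumerate(matrix):
            for j in range(len(row)):
                if (i, j) not in candidates:
                    row[j] = 0.0

    # Concepts with use_in_alignment == False: zero the whole row or column.
    for i in excluded_src:
        matrix[i] = [0.0] * len(tgt_emb)
    for j in excluded_tgt:
        for row in matrix:
            row[j] = 0.0

    # Alignments found by the string matcher are fixed at 1.0.
    for i, j in string_matches:
        matrix[i][j] = 1.0
    return matrix


src_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt_emb = [[1.0, 0.0], [1.0, 1.0]]
matrix = build_similarity_matrix(src_emb, tgt_emb,
                                 string_matches=[(1, 1)], excluded_src=[2])
```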

1.2.4. Greedy Matcher with threshold

To determine which mappings from the similarity matrix will be retained, we utilize a straightforward greedy algorithm, akin to approaches in related works such as [7] and [1]. This simple algorithm sorts the similarity values and then iterates through each element in descending order of score, selecting a mapping provided that neither its source nor its target concept has already been chosen. The algorithm continues until the similarity value goes below the threshold of 0.75. Because the neighbors of equivalent classes also tend to have high similarities, the goal of the greedy matching algorithm is to reduce these false positive alignments and to produce 1:1 alignments.
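A minimal sketch of such a greedy 1:1 matcher is given below. The function name and the list-of-lists matrix are illustrative assumptions; only the descending iteration, the 1:1 constraint, and the 0.75 threshold come from the description above.

```python
def greedy_match(matrix, threshold=0.75):
    """Select 1:1 mappings from a similarity matrix, best scores first."""
    # Flatten the matrix into (score, source index, target index) triples.
    scored = sorted(
        ((score, i, j)
         for i, row in enumerate(matrix)
         for j, score in enumerate(row)),
        key=lambda triple: triple[0],
        reverse=True,
    )

    used_sources, used_targets, mappings = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # every remaining score is lower, so stop here
        if i in used_sources or j in used_targets:
            continue  # keep the alignment 1:1
        used_sources.add(i)
        used_targets.add(j)
        mappings.append((i, j, score))
    return mappings


matrix = [
    [0.95, 0.80, 0.10],
    [0.85, 0.78, 0.20],
    [0.30, 0.76, 0.74],
]
mappings = greedy_match(matrix)
```

In this toy matrix the pairs (1, 0), (0, 1) and (2, 1) all score above the threshold but are rejected because one of their concepts is already matched, which illustrates how the greedy step filters high-similarity neighbors of true matches.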

1.2.5. Local Ranking

The local ranking evaluation method requires only the target candidates for a concept to be sorted in descending order. The algorithm then iterates through each non-null row i, and applies an index sort to indicate to which j-th concept (from the j-th column) the source concept is most likely to be matched with.
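The index sort can be sketched as follows. The function name is hypothetical, and returning the full column order for each non-null row (including columns whose similarity is 0) is an assumption that mirrors a plain argsort.

```python
def local_ranking(matrix):
    """Map each non-null source row to target indices sorted by similarity."""
    rankings = {}
    for i, row in enumerate(matrix):
        if not any(row):
            continue  # null row: the source concept has no candidates
        # Index sort: column indices ordered by descending similarity.
        rankings[i] = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    return rankings


matrix = [
    [0.0, 0.6, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.4, 0.0, 0.0, 0.7],
]
rankings = local_ranking(matrix)
```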

1.3. Specific settings and Hyperparameters

For the OAEI competition, two models with different hyperparameters were used: the MELT [6] submission model and the semi-supervised Bio-ML model. For both models, SORBET Embeddings were trained starting from the pre-trained SentenceBERT model sentence-transformers/all-MiniLM-L6-v2. The MELT model was trained simultaneously on the conference and anatomy tracks with a value of A equal to 5, while no other changes were made to the original hyperparameters used in SORBET [4]. This model was also used for the evaluation of the unsupervised equivalence matching of the Bio-ML track. This was done to show how SORBETMatcher performs in a zero-shot learning context.

For the remaining results of the Bio-ML track, SORBET was individually fine-tuned on the sub-tracks, using the train reference alignments as positive samples in the SORBET training. The hyper-parameters of SORBET’s semi-supervised version on Bio-ML were the following. For the A value, our experiments hinted that shallow ontologies are better embedded with a low A value. Therefore, the OMIM-ORDO and NCIT-DOID sub-tracks had a value of A kept at 4, while for the rest of the sub-tracks A was reduced to 3. Other experiments have also shown that the generation of negative samples during training led to worse results, which caused us to remove them completely. This may be due to the fact that negative samples are normally used to increase the precision, but at the cost of reducing recall. However, since the precision is high in most sub-tracks, the trade-off can be counter-productive.

2. Results

2.1. Anatomy

Full results for all SORBETMatcher’s alignments are shown in Table 1. The anatomy track involves aligning the Adult Mouse Anatomy (MA) with the NCI Thesaurus, which describes Human Anatomy (NCI). SORBETMatcher achieved an F1-score of 0.909, with a precision of 0.923 and a recall of 0.895. In comparison to other systems in this year’s competition, SORBETMatcher obtained the 3rd position out of 9 based on the F1 score. However, it is worth noting that SORBETMatcher’s performance lagged in terms of runtime, having a total time of 4032 seconds.
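As a quick consistency check, the reported F1-score can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The helper below is only an illustration and is not part of the evaluation tooling.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the anatomy track.
anatomy_f1 = f1_score(0.923, 0.895)
```

Rounded to three decimals, this value agrees with the reported F1-score of 0.909.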

2.2. Conference

The conference track involves aligning a set of ontologies that describe the domain of conference organization. This track encompasses multiple reference alignment sets, with M1 alignments focusing solely on classes, M2 on properties, and M3 containing both classes and properties. Given that SORBET is presently able to embed only classes, its performance is less robust when applied to the M3 reference alignments.

2.3. Bio-ML

The Bio-ML track consists of 5 different reference alignments across multiple ontologies. It is separated into equivalence matching and subsumption matching. SORBETMatcher participated in both sub-tasks. The equivalence matching is also decomposed into 2 categories, one with the unsupervised test set (100% of reference alignments) and one with the semi-supervised test set (70% of reference alignments).

3. General comments and Conclusion

Overall, SORBETMatcher achieved a top performance in some of the tracks while still having some improvements to be made on others.

The Bio-ML subsumption track is the task where SORBETMatcher scored the strongest, with three first places and one second place. However, SORBETMatcher scored last in the OMIM-ORDO sub-track by a large margin, especially for higher Hits@K. This may indicate a flaw in the SORBET Embeddings obtained on the OMIM or ORDO ontologies. The nature of this problem is still to be further investigated, but our initial hypothesis is that it might be due to the restriction axioms, which are numerous in these ontologies and which are not considered by SORBET in its semi-negative sampling.

The results of the Bio-ML equivalence matching track are mixed. SORBETMatcher scored the best in the NCIT-DOID subtrack, where it achieved first place in both unsupervised and supervised test sets. Considering the subsumption results for the NCIT-DOID subtrack, where SORBETMatcher largely outperformed other systems, we hypothesize that SORBET Embeddings are much more representative of ontologies with higher depths such as DOID. Another conclusion we can draw from these results is the capability of SORBET Embeddings to work in zero-shot learning tasks. Indeed, the unsupervised results all come from the MELT packaging of the system, in which SORBET is frozen after being trained on the conference and anatomy tracks. Therefore, at inference time, the BERT model has never seen the concept to embed, hence our conclusion about its zero-shot capability. As for datasets like Pharm and Neoplas, SORBETMatcher has yielded disappointing results. The problem may be of the same nature as for the OMIM-ORDO dataset in subsumption matching, but it could also be because of the lack of hyper-parameter tuning, to which the results can be very sensitive.

As for the performance of SORBETMatcher on the conference and anatomy tracks, SORBETMatcher was able to obtain good results by reaching the second and third place respectively.

4. Acknowledgements

This research has been funded by Canada’s NSERC Discovery Research Program.

[1] Alexandre Bento, Amal Zouaq, and Michel Gagnon. “Ontology Matching Using Convolutional Neural Networks”. English. In: Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, May 2020, pp. 5648-5653. isbn: 979-10-95546-34-4. url: https://aclanthology.org/2020.lrec-1.693.

[2] Jiaoyan Chen et al. “Contextual semantic embeddings for ontology subsumption prediction”. In: World Wide Web 26.5 (Sept. 2023), pp. 2569-2591. issn: 1573-1413. doi: 10.1007/s11280-023-01169-9. url: https://doi.org/10.1007/s11280-023-01169-9.

[3] Francis Gosselin and Amal Zouaq. “SEBMatcher Results for OAEI 2022”. In: Ontology Matching 2022: Proceedings of the 17th International Workshop on Ontology Matching (OM 2022) co-located with the 21th International Semantic Web Conference (ISWC 2022), Hangzhou, China, virtual conference, October 23, 2022. Vol. 3324. CEUR Workshop Proceedings. CEUR-WS.org, 2022, pp. 202-209.

[4] Francis Gosselin and Amal Zouaq. “SORBET: A Siamese Network for Ontology Embeddings Using a Distance-Based Regression Loss and BERT”. In: International Semantic Web Conference. Springer. 2023, pp. 561-578.

[5] Yuan He et al. “Bertmap: A bert-based ontology alignment system”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. 5. 2022, pp. 5684-5691.

[6] Sven Hertling, Jan Portisch, and Heiko Paulheim. “MELT - Matching EvaLuation Toolkit”. In: Semantic Systems. The Power of AI and Knowledge Graphs - 15th International Conference, SEMANTiCS 2019, Karlsruhe, Germany, September 9-12, 2019, Proceedings. 2019, pp. 231-245. doi: 10.1007/978-3-030-33220-4_17. url: https://doi.org/10.1007/978-3-030-33220-4_17.

[7] Vivek Iyer, Arvind Agarwal, and Harshit Kumar. “VeeAlign: Multifaceted Context Representation Using Dual Attention for Ontology Alignment”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Ed. by Marie-Francine Moens et al. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 10780-10792. doi: 10.18653/v1/2021.emnlp-main.842. url: https://aclanthology.org/2021.emnlp-main.842.

[8] Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”. In: Conference on Empirical Methods in Natural Language Processing. 2019.

[9] Jifang Wu et al. “Daeom: A deep attentional embedding approach for biomedical ontology matching”. In: Applied Sciences 10.21 (2020), p. 7909.