Exploring Large Language Models for Ontology Alignment

Yuan He¹, Jiaoyan Chen²,¹, Hang Dong¹ and Ian Horrocks¹
¹ Department of Computer Science, University of Oxford
² Department of Computer Science, The University of Manchester

Contact: yuan.he@cs.ox.ac.uk (Y. He); jiaoyan.chen@manchester.ac.uk (J. Chen); hang.dong@cs.ox.ac.uk (H. Dong); ian.horrocks@cs.ox.ac.uk (I. Horrocks)
ORCID: 0000-0002-4486-1262 (Y. He); 0000-0003-4643-6750 (J. Chen); 0000-0001-6828-6891 (H. Dong); 0000-0002-2685-7462 (I. Horrocks)

ISWC 2023 Posters and Demos: 22nd International Semantic Web Conference, November 6–10, 2023, Athens, Greece
© 2023 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

Abstract
This work investigates the applicability of recent generative Large Language Models (LLMs), such as the GPT series and Flan-T5, to ontology alignment for identifying concept equivalence mappings across ontologies. To test the zero-shot performance of Flan-T5-XXL and GPT-3.5-turbo (here, "zero-shot" refers to using pre-trained LLMs without fine-tuning), we leverage challenging subsets from two equivalence matching datasets of the OAEI Bio-ML track, taking into account concept labels and structural contexts. Preliminary findings suggest that LLMs have the potential to outperform existing ontology alignment systems like BERTMap, given careful framework and prompt design. Our code and datasets will be made available at: https://github.com/KRR-Oxford/LLMap-Prelim

Keywords
Ontology Alignment, Ontology Matching, Large Language Model, GPT, Flan-T5

1. Introduction

Ontology alignment, also known as ontology matching (OM), is the task of identifying semantic correspondences between ontologies. It plays a crucial role in knowledge representation, knowledge engineering, and the Semantic Web, particularly in facilitating semantic interoperability across heterogeneous sources. This study focuses on equivalence matching for named concepts. Previous research has effectively utilised pre-trained language models like BERT and T5 for OM [1, 2], but recent advancements in large language models (LLMs) such as ChatGPT [3] and Flan-T5 [4] necessitate further exploration. These LLMs, characterised by larger parameter sizes and task-specific fine-tuning, are typically guided by task-oriented prompts in a zero-shot setting, or by a small set of examples in a few-shot setting, when applied to downstream tasks.

This work explores the feasibility of employing LLMs for zero-shot OM. Given the significant computational demands of LLMs, it is crucial to conduct experiments with smaller yet representative datasets before full deployment. To this end, we extract two challenging subsets from the NCIT-DOID and the SNOMED-FMA (Body) equivalence matching datasets, both part of Bio-ML [5], a track of the Ontology Alignment Evaluation Initiative (OAEI) that is compatible with machine learning-based OM systems (https://www.cs.ox.ac.uk/isg/projects/ConCur/oaei/). Notably, the extracted subsets exclude "easy" mappings, i.e., concept pairs that can be aligned through string matching. We mainly evaluate the open-source LLM Flan-T5-XXL, the largest version of Flan-T5, containing 11B parameters [4].
We assess its performance, factoring in the use of concept labels, score thresholding, and structural contexts. As baselines, we adopt the previous top-performing OM system BERTMap and its lighter version, BERTMapLt. Preliminary tests are also conducted on GPT-3.5-turbo; however, due to its high cost, only initial results are reported. Our findings suggest that LLM-based OM systems have the potential to outperform existing ones, but they require careful prompt design and further exploration of how best to present ontology contexts.

2. Methodology

Task Definition. The task of OM can be defined as follows. Given source and target ontologies, denoted $\mathcal{O}_{src}$ and $\mathcal{O}_{tgt}$, and their respective sets of named concepts $\mathcal{C}_{src}$ and $\mathcal{C}_{tgt}$, the objective is to generate a set of mappings of the form $(c, c', s_{c \equiv c'})$, where $c \in \mathcal{C}_{src}$ and $c' \in \mathcal{C}_{tgt}$ are concepts from the two ontologies, and $s_{c \equiv c'} \in [0, 1]$ is a score that reflects the likelihood of the equivalence $c \equiv c'$. From this definition, we can see that a paramount component of an OM system is its mapping scoring function $s: \mathcal{C}_{src} \times \mathcal{C}_{tgt} \rightarrow [0, 1]$. In the following, we formulate a sub-task for LLMs regarding this objective.

Concept Identification. This is essentially a binary classification task that determines whether two concepts, given their names (multiple labels per concept are possible) and/or additional structural contexts, are identical. As LLMs typically work in a chat-like manner, we need to provide a task prompt that incorporates the available information about the two input concepts, and gather classification results from the LLM's responses. To avoid excessive prompt engineering, we present the task description (as in the previous sentences) and the available input information (such as concept labels and structural contexts) to ChatGPT based on GPT-4 (https://chat.openai.com/?model=gpt-4), and ask it to generate a task prompt for an LLM like itself. The resulting template is as follows:

    Given the lists of names and hierarchical relationships associated with two concepts, your task is to determine whether these concepts are identical or not. Consider the following:
    Source Concept Names:
    Parent Concepts of the Source Concept:
    Child Concepts of the Source Concept:
    ... (same for the target concept)
    Analyze the names and the hierarchical information provided for each concept and provide a conclusion on whether these two concepts are identical or different ("Yes" or "No") based on their associated names and hierarchical relationships.

where the parts concerning parent and child concepts (italicised in the original prompt) were generated in a second round, after we informed ChatGPT that parent/child contexts can be considered. Since the prompt poses a yes/no question, we anticipate the generation of "Yes" or "No" tokens in the LLM responses. For simplicity, we use the generation probability of the "Yes" token as the classification score. Note that this score is proportional to the final mapping score but is not normalised. For ranking-based evaluation, given a source concept, we also consider candidate target concepts with the "No" answer, together with their "No" scores, placing them after the candidate target concepts with the "Yes" answer and in ascending order of the "No" score: a larger "No" score implies a lower rank.
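This scoring scheme can be implemented directly on top of an off-the-shelf sequence-to-sequence LLM. The following is a minimal sketch using the HuggingFace Transformers library; it is our illustration of the described scoring, not the paper's released code. It loads a small Flan-T5 checkpoint for convenience (the paper evaluates flan-t5-xxl), and the helper names (`yes_no_probs`, `rank_candidates`) are ours.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "google/flan-t5-small"  # illustration only; the paper uses flan-t5-xxl (11B)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
model.eval()

def yes_no_probs(prompt: str) -> tuple[float, float]:
    """Probabilities of "Yes" and "No" being the first generated answer token."""
    enc = tokenizer(prompt, return_tensors="pt")
    # Score a single decoding step starting from the decoder start token.
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**enc, decoder_input_ids=start).logits[0, -1]
    probs = logits.softmax(dim=-1)
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    return probs[yes_id].item(), probs[no_id].item()

def rank_candidates(prompts_by_target: dict[str, str]) -> list[str]:
    """Rank target candidates for one source concept: "Yes" answers first
    (higher "Yes" probability first), then "No" answers in ascending order
    of their "No" probability (a larger "No" score implies a lower rank)."""
    scored = {t: yes_no_probs(p) for t, p in prompts_by_target.items()}
    yes_side = sorted((t for t, (y, n) in scored.items() if y >= n),
                      key=lambda t: scored[t][0], reverse=True)
    no_side = sorted((t for t, (y, n) in scored.items() if y < n),
                     key=lambda t: scored[t][1])
    return yes_side + no_side
```

In this sketch, a candidate counts as a "Yes" answer when its "Yes" probability exceeds its "No" probability; the threshold setting of Section 3 would additionally filter "Yes" answers whose scores fall below a cut-off.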
3. Evaluation

Dataset Construction. Evaluating LLMs with current OM datasets of normal or large scale can be time- and resource-intensive. To yield insightful results prior to full implementation, we leverage two challenging subsets extracted from the NCIT-DOID and the SNOMED-FMA (Body) equivalence matching datasets of the OAEI Bio-ML track. We opt for Bio-ML because its ground truth mappings are curated by humans and derived from dependable sources, namely Mondo and UMLS. We choose NCIT-DOID and SNOMED-FMA (Body) from the five available options because their ontologies are richer in hierarchical contexts. For each original dataset, we first randomly select 50 matched concept pairs from the ground truth mappings, excluding pairs that can be aligned by direct string matching (i.e., pairs having at least one shared label) so as to restrict the efficacy of conventional lexical matching. Next, fixing the source concept of each selected pair, we select 99 unmatched target ontology concepts, forming a total of 100 candidate mappings per source concept (inclusive of the ground truth mapping). This selection is guided by the sub-word inverted index-based idf scores as in He et al. [1], which can produce target ontology concepts lexically akin to the fixed source concept. Finally, we randomly choose 50 source concepts that have no matched target concept according to the ground truth mappings, and create 100 candidate mappings for each. Therefore, each subset comprises 50 source ontology concepts with a match and 50 without, each associated with 100 candidate mappings, culminating in a total of (50+50)*100 = 10,000 extracted concept pairs.

Evaluation Metrics. From all 10,000 concept pairs in a given subset, the OM system is expected to predict the true mappings, which are compared against the 50 available ground truth mappings using Precision, Recall, and F-score, defined as:

$$P = \frac{|\mathcal{M}_{pred} \cap \mathcal{M}_{ref}|}{|\mathcal{M}_{pred}|}, \quad R = \frac{|\mathcal{M}_{pred} \cap \mathcal{M}_{ref}|}{|\mathcal{M}_{ref}|}, \quad F_1 = \frac{2PR}{P + R}$$

where $\mathcal{M}_{pred}$ refers to the set of concept pairs (among the 10,000 pairs) that are predicted as true mappings by the system, and $\mathcal{M}_{ref}$ refers to the 50 ground truth (reference) mappings. Given that each source concept is associated with 100 candidate mappings, we can also calculate ranking-based metrics over their scores. Specifically, we calculate Hits@1 for the 50 matched source concepts, counting a hit when the top-scored candidate mapping is a ground truth mapping. The MRR score is also computed for these matched source concepts, averaging the inverses of the ground truth mappings' ranks among their candidate mappings. These two scores are formulated as:

$$Hits@K = \sum_{(c, c') \in \mathcal{M}_{ref}} \mathbb{I}_{rank_{c'} \leq K} \Big/ |\mathcal{M}_{ref}|, \qquad MRR = \sum_{(c, c') \in \mathcal{M}_{ref}} rank_{c'}^{-1} \Big/ |\mathcal{M}_{ref}|$$

For the 50 unmatched source concepts, we compute the Rejection Rate (RR), counting a successful rejection when all of a concept's candidate mappings are predicted as false mappings by the system. Each unmatched source concept is assigned a "null" match, denoted $c_{null}$, which results in a set of "unreferenced" mappings, represented as $\mathcal{M}_{unref}$. We can then define RR as:

$$RR = \sum_{(c, c_{null}) \in \mathcal{M}_{unref}} \prod_{d \in \mathcal{T}_c} (1 - \mathbb{I}_{c \equiv d}) \Big/ |\mathcal{M}_{unref}|$$

where $\mathcal{T}_c$ is the set of target candidate concepts for a source concept $c$, and $\mathbb{I}_{c \equiv d}$ is a binary indicator that outputs 1 if the system predicts a match between $c$ and $d$, and 0 otherwise. It is worth noting that the product term becomes 1 only when all target candidate concepts are predicted as false matches, i.e., $\forall d \in \mathcal{T}_c.\, \mathbb{I}_{c \equiv d} = 0$.
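For concreteness, these metrics reduce to a few lines of code. The sketch below is ours (function and variable names are illustrative, not from the paper's codebase); it assumes predictions and references are given as sets of (source, target) identifier pairs, and that a ranked candidate list is available for each matched source concept.

```python
def precision_recall_f1(pred: set, ref: set) -> tuple[float, float, float]:
    """Precision, Recall, and F1 over predicted vs. reference mapping pairs."""
    tp = len(pred & ref)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def hits_at_k_and_mrr(ranked: dict, ref: set, k: int = 1) -> tuple[float, float]:
    """`ranked` maps each matched source concept to its ranked target candidates;
    the gold target is guaranteed to be among the 100 candidates by construction."""
    hits, rr_sum = 0, 0.0
    for src, tgt in ref:
        rank = ranked[src].index(tgt) + 1  # 1-based rank of the gold target
        hits += int(rank <= k)
        rr_sum += 1.0 / rank
    return hits / len(ref), rr_sum / len(ref)

def rejection_rate(pred: set, candidates: dict, unmatched: list) -> float:
    """Fraction of unmatched source concepts for which every target candidate
    is rejected, i.e. no (src, d) pair is predicted as a true mapping."""
    rejected = sum(
        all((src, d) not in pred for d in candidates[src]) for src in unmatched
    )
    return rejected / len(unmatched)
```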
Model Settings. We examine Flan-T5-XXL under four settings: (i) the vanilla setting, where a mapping is deemed true if it is associated with a "Yes" answer; (ii) the threshold setting, which filters out "Yes" mappings with scores below a certain threshold (the thresholds were empirically set to 0.650, 0.999, and 0.900 for Flan-T5-XXL, BERTMap, and BERTMapLt, respectively, in a pilot experiment on small ontology fragments); (iii) the parent/child setting, where sampled parent and child concept names are included as additional contexts; and (iv) the threshold & parent/child setting, incorporating both structural contexts and thresholding. We also conduct experiments with GPT-3.5-turbo, the most capable variant of the GPT-3.5 series, using the same prompt; however, only setting (i) is reported due to the high cost of this model. As baseline models, we consider BERTMap and BERTMapLt [1, 6], where the former uses a fine-tuned BERT model for classification and the latter uses normalised edit similarity. Note that both BERTMap and BERTMapLt inherently adopt setting (ii).

Table 1: Results on the challenging subset of the NCIT-DOID equivalence matching dataset of Bio-ML.

System                       Precision  Recall  F-score  Hits@1  MRR    RR
Flan-T5-XXL                  0.643      0.720   0.679    0.860   0.927  0.860
+ threshold                  0.861      0.620   0.721    0.860   0.927  0.940
+ parent/child               0.597      0.740   0.661    0.880   0.926  0.760
+ threshold & parent/child   0.750      0.480   0.585    0.880   0.926  0.920
GPT-3.5-turbo                0.217      0.560   0.313    -       -      -
BERTMap                      0.750      0.540   0.628    0.900   0.940  0.920
BERTMapLt                    0.196      0.180   0.187    0.460   0.516  0.920

Table 2: Results on the challenging subset of the SNOMED-FMA (Body) equivalence matching dataset of Bio-ML.

System                       Precision  Recall  F-score  Hits@1  MRR    RR
Flan-T5-XXL                  0.257      0.360   0.300    0.500   0.655  0.640
+ threshold                  0.452      0.280   0.346    0.500   0.655  0.820
+ parent/child               0.387      0.240   0.296    0.540   0.667  0.900
+ threshold & parent/child   0.429      0.120   0.188    0.540   0.667  0.940
GPT-3.5-turbo                0.075      0.540   0.132    -       -      -
BERTMap                      0.485      0.640   0.552    0.540   0.723  0.920
BERTMapLt                    0.516      0.320   0.395    0.340   0.543  0.960

Results. As shown in Tables 1 and 2, Flan-T5-XXL (+ threshold) obtains the best F-score among its settings. It outpaces BERTMap by 0.093 in F-score on the NCIT-DOID subset, but lags behind BERTMap and BERTMapLt by 0.206 and 0.049, respectively, on the SNOMED-FMA (Body) subset. Regarding MRR, BERTMap leads on both subsets. Among the Flan-T5-XXL settings, using a threshold enhances precision but reduces recall. Incorporating parent/child contexts does not enhance matching results; this underscores the need for a more in-depth examination of strategies for leveraging ontology contexts. GPT-3.5-turbo does not perform well with the given prompt (experimental trials with text-davinci-003 and GPT-4 also showed suboptimal results). One possible reason is the model's tendency to furnish extended explanations in its responses, making it challenging to extract straightforward yes/no answers; a simple post-processing heuristic is sketched below. Besides, no ranking scores are presented for GPT-3.5-turbo because it does not support extracting generation probabilities. The suboptimal performance of BERTMapLt is expected, because we exclude concept pairs that can be string-matched from the extracted datasets, while BERTMapLt relies on the edit similarity score.
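To make the chat-based models' verbose outputs usable, one could normalise each response to a binary label before scoring. The following heuristic is our assumption-laden sketch, not part of the paper's method: it takes the first standalone "yes"/"no" token in the response and would need validation against real model outputs.

```python
import re
from typing import Optional

def parse_yes_no(response: str) -> Optional[bool]:
    """Map a free-form LLM response to True ("Yes"), False ("No"),
    or None when no unambiguous answer token is found."""
    match = re.search(r"\b(yes|no)\b", response.lower())
    return None if match is None else match.group(1) == "yes"
```

For example, parse_yes_no("Yes, these concepts are identical because ...") returns True, while a response that never commits to either token yields None and can be treated as a rejection.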
4. Conclusion and Future Work

This study presents an exploration of LLMs for OM in a zero-shot setting. Results on two challenging subsets of OM datasets suggest that LLMs can be a promising direction for OM, but various problems need to be addressed, including, but not limited to, the design of prompts and of the overall framework, and the incorporation of ontology contexts. This work focuses on mapping scoring; the searching (or candidate selection) part of OM is also crucial, especially considering that LLMs are highly computationally expensive. Future studies include refining prompt-based approaches, investigating efficient few-shot tuning, and exploring structure-informed LLMs. The lessons gleaned from these OM studies can also offer insights into other ontology engineering tasks, such as ontology completion and embedding, and pave the way for a broader study on the integration of LLMs with structured data.

References
[1] Y. He, J. Chen, D. Antonyrajah, I. Horrocks, BERTMap: A BERT-based ontology alignment system, in: AAAI, 2022.
[2] M. Amir, M. Baruah, M. Eslamialishah, S. Ehsani, A. Bahramali, S. Naddaf-Sh, S. Zarandioon, Truveta Mapper: A zero-shot ontology alignment framework, arXiv (2023).
[3] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, in: NeurIPS, 2022.
[4] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv (2022).
[5] Y. He, J. Chen, H. Dong, E. Jiménez-Ruiz, A. Hadian, I. Horrocks, Machine learning-friendly biomedical datasets for equivalence and subsumption ontology matching, in: ISWC, 2022.
[6] Y. He, J. Chen, H. Dong, I. Horrocks, C. Allocca, T. Kim, B. Sapkota, DeepOnto: A Python package for ontology engineering with deep learning, arXiv preprint arXiv:2307.03067 (2023).