<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Athens, Greece</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Exploring Large Language Models for Ontology Alignment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuan He</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiaoyan Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hang Dong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ian Horrocks</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, The University of Manchester</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Oxford</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>This work investigates the applicability of recent generative Large Language Models (LLMs), such as the GPT series and Flan-T5, to ontology alignment for identifying concept equivalence mappings across ontologies. To test the zero-shot performance of Flan-T5-XXL and GPT-3.5-turbo, we leverage challenging subsets from two equivalence matching datasets of the OAEI Bio-ML track, taking into account concept labels and structural contexts. Preliminary findings suggest that LLMs have the potential to outperform existing ontology alignment systems like BERTMap, given careful framework and prompt design.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology Alignment</kwd>
        <kwd>Ontology Matching</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>GPT</kwd>
        <kwd>Flan-T5</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Ontology alignment, also known as ontology matching (OM), is the task of identifying semantic correspondences between ontologies. It plays a crucial role in knowledge representation, knowledge engineering and the Semantic Web, particularly in facilitating semantic interoperability across heterogeneous sources. This study focuses on equivalence matching for named concepts. Previous research has effectively utilised pre-trained language models like BERT and T5 for OM [<xref ref-type="bibr" rid="ref1">1</xref>, <xref ref-type="bibr" rid="ref2">2</xref>], but recent advancements in large language models (LLMs) such as ChatGPT [<xref ref-type="bibr" rid="ref3">3</xref>] and Flan-T5 [<xref ref-type="bibr" rid="ref4">4</xref>] necessitate further exploration. These LLMs, characterised by larger parameter sizes and task-specific fine-tuning, are typically guided by task-oriented prompts in a zero-shot setting, or by a small set of examples in a few-shot setting, when applied to downstream tasks.</p>
      <p>This work explores the feasibility of employing LLMs for zero-shot OM. Given the significant computational demands of LLMs, it is crucial to conduct experiments with smaller yet representative datasets before full deployment. To this end, we extract two challenging subsets from the NCIT-DOID and the SNOMED-FMA (Body) equivalence matching datasets, both part of Bio-ML [<xref ref-type="bibr" rid="ref5">5</xref>] – a track of the Ontology Alignment Evaluation Initiative (OAEI) that is compatible with machine learning-based OM systems. Notably, the extracted subsets exclude “easy” mappings, i.e., concept pairs that can be aligned through string matching.</p>
      <p>
        We mainly evaluate the open-source LLM, Flan-T5-XXL, the largest version of Flan-T5
containing 11B parameters [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We assess its performance factoring in the use of concept labels,
score thresholding, and structural contexts. For baselines, we adopt the previous top-performing
OM system BERTMap and its lighter version, BERTMapLt. Preliminary tests are also conducted
on GPT-3.5-turbo; however, due to its high cost, only initial results are reported. Our findings
suggest that LLM-based OM systems hold the potential to outperform existing ones, but require
efforts in prompt design and exploration of optimal presentation methods for ontology contexts.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
        Task Definition The task of OM can be defined as follows. Given the source and target
ontologies, denoted as O and O′, and their respective sets of named concepts C and C′,
the objective is to generate a set of mappings in the form of (c ∈ C, c′ ∈ C′, P(c ≡ c′)), where c
and c′ are concepts from O and O′, respectively, and P(c ≡ c′) ∈ [0, 1] is a score that reflects
the likelihood of the equivalence c ≡ c′. From this definition, we can see that a paramount
component of an OM system is its mapping scoring function f : C × C′ → [0, 1]. In the
following, we formulate a sub-task for LLMs regarding this objective.
      </p>
      <p>Concept Identification This is essentially a binary classification task that determines whether two
concepts, given their names (multiple labels per concept are possible) and/or additional structural
contexts, are identical. As LLMs typically work in a chat-like manner, we need to provide
a task prompt that incorporates the available information about the two input concepts, and gather
classification results from the responses of the LLM. To avoid excessive prompt engineering, we
present the task description (as in the previous sentences) and the available input information (such
as concept labels and structural contexts) to ChatGPT based on GPT-4 (https://chat.openai.com/?model=gpt-4),
and ask it to generate a task prompt for an LLM like itself. The resulting template is as follows:</p>
      <p>Given the lists of names and hierarchical relationships associated with two concepts, your task is to determine whether
these concepts are identical or not. Consider the following:
Source Concept Names: &lt;list of concept names&gt;
Parent Concepts of the Source Concept: &lt;list of concept names&gt;
Child Concepts of the Source Concept: &lt;list of concept names&gt;
... (same for the target concept)
Analyze the names and the hierarchical information provided for each concept and provide a conclusion on whether these
two concepts are identical or different (“Yes” or “No”) based on their associated names and hierarchical relationships.</p>
      <p>The lines about parent and child concepts were generated in a second round, when we informed ChatGPT
that parent/child contexts can be considered. Since the prompt indicates a yes/no question, we anticipate the
generation of “Yes” or “No” tokens in the LLM responses. For simplicity, we use the generation
probability of the “Yes” token as the classification score. Note that this score is proportional to
the final mapping score but is not normalised. For ranking-based evaluation, given a source
concept, we also consider candidate target concepts with the “No” answer as well as their “No”
scores, placing them after the candidate target concepts with the “Yes” answer in ascending
order – a larger “No” score implies a lower rank.</p>
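<p>To make the ranking rule concrete, the following is a minimal Python sketch (a hypothetical helper, not the authors' code): candidates answered “Yes” are ranked by their “Yes”-token probability in descending order, followed by candidates answered “No”, ranked by their “No”-token probability in ascending order.</p>

```python
# Sketch of the ranking rule for candidate target concepts (hypothetical
# helper, not the authors' code). "Yes" candidates come first, ranked by
# their "Yes" score in descending order; "No" candidates follow, ranked
# by their "No" score in ascending order, since a larger "No" score
# implies a lower rank.

def rank_candidates(candidates):
    """candidates: list of (target_concept, answer, score) triples, where
    answer is "Yes" or "No" and score is the generation probability of
    that answer token."""
    yes = sorted((c for c in candidates if c[1] == "Yes"),
                 key=lambda c: c[2], reverse=True)
    no = sorted((c for c in candidates if c[1] == "No"),
                key=lambda c: c[2])
    return [c[0] for c in yes + no]

ranking = rank_candidates([
    ("t1", "No", 0.9),   # most confidently rejected -> last
    ("t2", "Yes", 0.7),
    ("t3", "Yes", 0.8),  # most confidently accepted -> first
    ("t4", "No", 0.6),
])
# ranking == ["t3", "t2", "t4", "t1"]
```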
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>
        Dataset Construction Evaluating LLMs with the current OM datasets of normal or large
scales can be time and resource intensive. To yield insightful results prior to full implementation,
we leverage two challenging subsets extracted from the NCIT-DOID and the SNOMED-FMA
(Body) equivalence matching datasets of the OAEI Bio-ML track. We opt for Bio-ML as its ground
truth mappings are curated by humans and derived from dependable sources, Mondo and UMLS.
We choose NCIT-DOID and SNOMED-FMA (Body) from five available options because their
ontologies are richer in hierarchical contexts. For each original dataset, we first randomly select
50 matched concept pairs from the ground truth mappings, excluding pairs that can be aligned
with direct string matching (i.e., having at least one shared label) to restrict the efficacy of
conventional lexical matching. Next, with a fixed source ontology concept, we further select 99
unmatched target ontology concepts, thus forming a total of 100 candidate mappings (inclusive
of the ground truth mapping). This selection is guided by the sub-word inverted index-based idf
scores as in He et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which are capable of producing target ontology concepts lexically akin
to the fixed source concept. We finally randomly choose 50 source concepts that do not have
a matched target concept according to the ground truth mappings, and create 100 candidate
mappings for each. Therefore, each subset comprises 50 source ontology concepts with a match
and 50 without. Each concept is associated with 100 candidate mappings, culminating in a total
extraction of 10,000, i.e., (50+50)*100, concept pairs.
      </p>
      <p>Evaluation Metrics From all the 10,000 concept pairs in a given subset, the OM system is
expected to predict the true mappings, which can be compared against the 50 available ground
truth mappings using Precision, Recall, and F-score, defined as:</p>
      <p>P = |ℳ ∩ ℳ_ref| / |ℳ|,  R = |ℳ ∩ ℳ_ref| / |ℳ_ref|,  F1 = 2PR / (P + R)</p>
      <p>where ℳ refers to the set of concept pairs (among the 10,000 pairs) that are predicted as
true mappings by the system, and ℳ_ref refers to the 50 ground truth (reference) mappings.</p>
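<p>The set-based definitions above can be sketched as follows (an illustrative helper, not the evaluation code itself):</p>

```python
# Illustrative Precision/Recall/F-score over a predicted mapping set M
# and a reference set M_ref (a sketch of the stated definitions, not the
# authors' evaluation code).

def prf1(predicted, reference):
    """predicted, reference: sets of (source_concept, target_concept) pairs."""
    tp = len(predicted & reference)                # |M ∩ M_ref|
    p = tp / len(predicted) if predicted else 0.0  # Precision
    r = tp / len(reference) if reference else 0.0  # Recall
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0   # F-score
    return p, r, f1

p, r, f1 = prf1({("a", "x"), ("b", "y")}, {("a", "x"), ("c", "z")})
# p == 0.5, r == 0.5, f1 == 0.5
```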
      <p>Given that each source concept is associated with 100 candidate mappings, we can calculate
ranking-based metrics based on their scores. Specifically, we calculate Hits@1 for the 50
matched source concepts, counting a hit when the top-scored candidate mapping is a ground
truth mapping. The MRR score is also computed for these matched source concepts, summing
the inverses of the ground truth mappings’ relative ranks among candidate mappings. These
two scores are formulated as:</p>
      <p>Hits@1 = Σ_{(c, c′) ∈ ℳ_ref} I(rank(c′) ≤ 1) / |ℳ_ref|,  MRR = Σ_{(c, c′) ∈ ℳ_ref} (1 / rank(c′)) / |ℳ_ref|</p>
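<p>Under these definitions, Hits@1 and MRR can be computed from the 1-based ranks of the ground-truth target concepts, e.g. (an illustrative sketch, not the authors' code):</p>

```python
# Illustrative Hits@1 and MRR over the matched source concepts. Each
# entry of `ranks` is the 1-based rank of a ground-truth target concept
# among its source concept's candidate mappings.

def hits_at_1_and_mrr(ranks):
    hits1 = sum(1 for r in ranks if r <= 1) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return hits1, mrr

hits1, mrr = hits_at_1_and_mrr([1, 2, 1, 4])
# hits1 == 0.5, mrr == (1 + 0.5 + 1 + 0.25) / 4 == 0.6875
```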
      <p>[Tables 1–2 (table content lost in extraction): Precision, Recall, F-score, Hits@1, MRR, and RR results for Flan-T5-XXL (vanilla, +threshold, +parent/child, +threshold &amp; parent/child), GPT-3.5-turbo, BERTMap, and BERTMapLt on the two subsets.]</p>
      <p>For the 50 unmatched source concepts, we compute the Rejection Rate (RR), considering a
successful rejection when all the candidate mappings are predicted as false mappings by the
system. Each unmatched source concept is assigned a “null” match, denoted as c_null. This
results in a set of “unreferenced” mappings, represented as ℳ_unref. We can then define RR as:</p>
      <p>RR = Σ_{(c, c_null) ∈ ℳ_unref} Π_{c′ ∈ T_c} (1 − I(c ≡ c′)) / |ℳ_unref|</p>
      <p>where T_c is the set of target candidate classes for a source concept c, and I(c ≡ c′) is a binary
indicator that outputs 1 if the system predicts a match between c and c′, and 0 otherwise. It
is worth noting that the product term becomes 1 only when all target candidate concepts are
predicted as false matches, i.e., ∀c′ ∈ T_c, I(c ≡ c′) = 0.</p>
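<p>The RR definition can be sketched as follows (hypothetical helper; each inner list holds the binary match indicators over one unmatched source concept's candidate targets):</p>

```python
# Illustrative Rejection Rate following the definition above (not the
# authors' evaluation code). A rejection succeeds for a source concept
# only if every candidate indicator is 0, matching the product term.

def rejection_rate(predictions):
    rejected = sum(1 for indicators in predictions
                   if all(i == 0 for i in indicators))
    return rejected / len(predictions)

rr = rejection_rate([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
# rr == 2/3: the second concept has a predicted match, so it is not rejected
```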
      <p>
        Model Settings We examine Flan-T5-XXL under various settings: (i) the vanilla setting, where
a mapping is deemed true if it is associated with a “Yes” answer; (ii) the threshold setting,
which filters out the “Yes” mappings with scores below a certain threshold (the thresholds are
empirically set to 0.650, 0.999, and 0.900 for Flan-T5-XXL, BERTMap, and BERTMapLt, respectively,
in a pilot experiment on small fragments); (iii) the parent/child
setting, where sampled parent and child concept names are included as additional contexts; and
(iv) the threshold &amp; parent/child setting, incorporating both structural contexts and thresholding.
We also conduct experiments for GPT-3.5-turbo, the most capable variant in the GPT-3.5 series,
using the same prompt. However, only setting (i) is reported due to the high cost of this model.
For the baseline models, we consider BERTMap and BERTMapLt [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ], where the former uses a
fine-tuned BERT model for classification and the latter uses normalised edit similarity. Note
that both BERTMap and BERTMapLt inherently adopt setting (ii).
      </p>
      <p>Results As shown in Tables 1-2, Flan-T5-XXL (+threshold) obtains the
best F-score among its settings. It outpaces BERTMap by 0.093 in F-score on the
NCIT-DOID subset, but lags behind BERTMap and BERTMapLt by 0.206 and 0.049, respectively, on
the SNOMED-FMA (Body) subset. Regarding MRR, BERTMap leads on both subsets. Among the
Flan-T5-XXL settings, using a threshold enhances precision but reduces recall. Incorporating
parent/child contexts does not enhance matching results – this underscores the need for a more
in-depth examination of strategies for leveraging ontology contexts. GPT-3.5-turbo does not
perform well with the given prompt. One possible reason is the model’s tendency to furnish
extended explanations for its responses, making it challenging to extract straightforward yes/no
answers. Besides, no ranking scores are presented for GPT-3.5-turbo because it does not support
extracting generation probabilities. The suboptimal performance of BERTMapLt is as expected
because we exclude concept pairs that can be string-matched from the extracted datasets, while
BERTMapLt relies on the edit similarity score.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>This study presents an exploration of LLMs for OM in a zero-shot setting. Results on two
challenging subsets of OM datasets suggest that using LLMs can be a promising direction for
OM, but various problems need to be addressed, including, but not limited to, the design of
prompts and of the overall framework (this work focuses on mapping scoring, but the searching,
or candidate selection, part of OM is also crucial, especially considering that LLMs are highly
computationally expensive), and the incorporation of ontology contexts. Future studies
include refining prompt-based approaches, investigating efficient few-shot tuning, and exploring
structure-informed LLMs. The lessons gleaned from these OM studies can also offer insights
into other ontology engineering tasks such as ontology completion and embedding, and pave
the way for a broader study on the integration of LLMs with structured data. (Experimental
trials for text-davinci-003 and GPT-4 also showed suboptimal results.)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Antonyrajah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <article-title>BERTMap: A BERT-based ontology alignment system</article-title>
          ,
          <source>in: AAAI</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baruah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eslamialishah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ehsani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bahramali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Naddaf-Sh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zarandioon</surname>
          </string-name>
          ,
          <article-title>Truveta mapper: A zero-shot ontology alignment framework</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brahma</surname>
          </string-name>
          , et al.,
          <article-title>Scaling instruction-finetuned language models</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <article-title>Machine learning-friendly biomedical datasets for equivalence and subsumption ontology matching</article-title>
          ,
          <source>in: ISWC</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Allocca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sapkota</surname>
          </string-name>
          ,
          <article-title>Deeponto: A python package for ontology engineering with deep learning</article-title>
          ,
          <source>arXiv preprint arXiv:2307.03067</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>