<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Can Language Models Align Biomedical Ontologies? Evaluating Retrieval-Augmented Prompt Strategies in Bio-ML</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lucas Ferraz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro Giesteira Cotovio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Catia Pesquita</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LASIGE, Faculdade de Ciências, Universidade de Lisboa</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>36</volume>
      <fpage>0009</fpage>
      <lpage>0009</lpage>
      <abstract>
        <p>Aligning biomedical ontologies presents a significant challenge due to their complexity and the highly domain-specific nature of their vocabulary. Recent advancements in Language Models (LMs) have led to their increasing application in ontology alignment tasks, offering promising results. However, a systematic evaluation of semantics-based prompting strategies for leveraging LMs in this context remains unexplored. This study investigates the effectiveness of different prompting techniques to enhance biomedical ontology alignment performance. We have developed a framework to support the design of LM-based queries to assess the semantic similarity between ontology classes. The framework interrogates the ontologies to align, extracting relevant contextual information to inject into the LM prompts, thereby enabling Retrieval-Augmented Generation (RAG). We conduct preliminary experiments on selected hard cases from the biomedical ontologies that compose the Ontology Alignment Evaluation Initiative Bio-ML track and provide insights into the effectiveness, reliability, and limitations of prompt-based approaches in ontology matching.</p>
      </abstract>
      <kwd-group>
        <kwd>Language Models</kwd>
        <kwd>Ontology Alignment</kwd>
        <kwd>Knowledge Representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Ontologies have become increasingly popular in various fields due to their ability to provide structured,
formal representations of knowledge. These knowledge structures are particularly valuable in areas
such as Artificial Intelligence (AI), Natural Language Processing (NLP), and Semantic Web technologies.
An ontology represents a set of concepts within a domain and the relationships between them, allowing
for more effective data sharing, discovery, and reasoning across different systems and applications.</p>
      <p>As individual ontologies grow and evolve independently from each other, any given concept will
inevitably display conceptual, linguistic, and structural differences when modelled in different ontologies,
in different contexts and by different creators. These differences often arise from varying domain
perspectives, terminologies and modelling choices across the maintainers and communities that develop
and use the ontologies.</p>
      <p>
        Ontology alignment addresses this issue through the generation of a set of mappings
(correspondences) between entities in different ontologies to establish semantic interoperability [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However,
automatically identifying these correspondences is a highly complex task. Ontologies are
typically designed within a specific context, relying on implicit background knowledge that is not
explicitly captured in their schema definitions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Most ontology alignment techniques perform entity mapping by leveraging lexical, structural,
semantic, and external information about the entities being matched [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ]. Lexical information has
proved to be the most successful source for biomedical ontology alignment [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], with algorithms
based on exploring the lexical component of ontologies outperforming other approaches by a good
margin [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Lexical information can be extracted from entity labels, facilitating the exploration of
word sense disambiguation and the inference of lexical relationships between entities. However, the
lexical component of biomedical ontologies is typically restricted to the labels of concepts, which results
in limitations in capturing mappings that require more contextual information beyond the simple
similarity of labels.
      </p>
      <p>
        The dawn of Large Language Models (LLMs) marked a turning point in our ability to capture and
understand deep semantic relationships between terms. Traditional NLP techniques were insufficient
for extracting contextual meaning, relying on simpler models that could not fully grasp language
nuances. However, with the advent of LLMs such as GPT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and other transformer-based
architectures [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we currently have the ability to process and model complex relationships between
words, phrases, and even complete documents.
      </p>
      <p>
        LLMs are trained on massive datasets containing billions of tokens and are capable of understanding
and representing not just the meaning of individual terms but how they interact in context. This allows us
to capture subtle semantic relationships, such as synonyms, antonyms, hyponyms, and hypernyms,
which are extremely useful for tasks such as translation, summarization, and question-answering. These
capabilities are likely to translate to the ontology alignment scenario provided a suitable formulation
of the problem is achieved. The success of prompt-based strategies in ontology alignment [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] has
motivated us to explore whether LLMs are able to tackle the mapping of classes from biomedical
ontologies and how well they are able to handle the more difficult cases. However, recent works
have highlighted the difficulties in applying prompt-based strategies to real-world ontologies in other
domains [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        In this paper, we present a preliminary study that focuses on investigating the impact of including
hierarchical relations in the prompt, exploring different design patterns for their verbalisation. We
performed an evaluation of prompt-design strategies using a carefully selected set of challenging
mappings extracted from Bio-ML [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], highlighting the pitfalls and strengths of each strategy.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Ontology alignment is the process of identifying correspondences between entities in two distinct
ontologies, typically referred to as the source ontology (O1) and the target ontology (O2). The goal of
ontology alignment is to establish meaningful mappings between entities, ensuring interoperability
between heterogeneous data sources. The resulting alignment consists of a set of mappings, often
represented as tuples &lt;e1, e2, r, c&gt;, where e1 and e2 are entities from O1 and O2 respectively, r denotes a
semantic relation (e.g., equivalence, subsumption), and c represents the confidence score of the mapping.
These mappings are crucial for tasks such as data integration, Knowledge Graph fusion, and semantic
interoperability in domains like healthcare, biology, and the Semantic Web.</p>
      <p>Ontology matching systems are predominantly unsupervised, relying on heuristics and rules instead
of deriving a mapping function through learning. These systems typically include three stages:
preprocessing (the identification and retrieval of entities from the ontologies based on specific criteria);
candidate generation (the use of diverse matching techniques to generate possible correspondences
based on ontology features); and filtering (the refinement of initial matches through discarding of
unlikely mappings).</p>
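      <p>As an illustrative sketch of this three-stage pipeline (the toy lexical similarity heuristic, the thresholds, and the data layout are our own assumptions, not taken from any specific matching system):</p>

```python
from difflib import SequenceMatcher

def preprocess(ontology):
    # Stage 1: identify and retrieve candidate entities (here, all labelled classes).
    return [e for e in ontology if e.get("label")]

def generate_candidates(sources, targets, threshold=0.6):
    # Stage 2: propose correspondences via a simple lexical similarity heuristic.
    candidates = []
    for s in sources:
        for t in targets:
            sim = SequenceMatcher(None, s["label"].lower(), t["label"].lower()).ratio()
            if sim >= threshold:
                candidates.append((s["label"], t["label"], "=", sim))
    return candidates

def filter_mappings(candidates, cutoff=0.8):
    # Stage 3: refine by discarding unlikely mappings below a confidence cutoff.
    return [m for m in candidates if m[3] >= cutoff]

source = [{"label": "Bone Necrosis"}, {"label": "Heart"}]
target = [{"label": "bone necrosis"}, {"label": "ischemic bone disease"}]
mappings = filter_mappings(generate_candidates(preprocess(source), preprocess(target)))
```

      <p>Note how a purely lexical candidate generator keeps "Bone Necrosis - bone necrosis" but never reaches pairs like "Bone Necrosis - ischemic bone disease", which is exactly the limitation the prompt-based approaches below target.</p>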
      <p>
        In recent years, with the advent of language models, more attention has been devoted to Machine
Learning-based ontology alignment, with several systems incorporating it [13, 14, 15, 16] and the
creation of the Bio-ML track at the Ontology Alignment Evaluation Initiative. ML introduces a
data-focused approach to ontology alignment, shifting away from heuristic and rule-based approaches.
Unlike traditional OM systems, ML-based methods aim to learn a mapping function using labelled
reference alignments, enabling more adaptive and scalable matching solutions. In principle, this allows
for improved candidate generation, better matcher combination strategies, and more effective filtering
techniques. Most approaches employ BERT-like methods [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and typically sacrifice recall in favour of
precision [17].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>To investigate the impact of semantic context (in the form of verbalized hierarchical relations) on
prompt-based ontology matching tasks, we developed a simple framework to design prompts based
on combinations of relevant elements into meaningful patterns. Prompts built using this framework
were evaluated using different language models of different sizes. Our approach takes as input
two candidate entities (typically classes), one from each ontology to align, designs a prompt to evaluate the
validity of a mapping between them based on different parameters, interrogates the language model
using the prompt, and evaluates its output. In our study, we assume candidates are already selected,
and focus only on prompt design and evaluation.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Prompt design for matching</title>
        <p>We present a two-stage framework for generating context-aware prompts designed for tasks such as
ontology alignment. This framework decomposes prompt generation into a static stage — where
invariant templates (static skeletons) are constructed from a base template using task-specific configuration
parameters — and a dynamic stage — where these skeletons are enriched with instance-specific data.
This modular design allows a small number of configurable templates to be efficiently adapted to large
datasets.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Static Stage: Template Construction</title>
          <p>The static stage begins with a base template, denoted by B, which contains symbolic placeholders
indicating the different elements that compose a prompt, where specific types of information will be
inserted. The elements are listed in Table 1.</p>
          <p>Table 1. Prompt elements.
$TC (Task Context): A description of the task the model should perform.
$I (Instruction): A description of the nature of the question that will be asked and the expected answer format.
$CONF (Confidence): The type of confidence that the model should output (if any).
$S (Source): The main label(s) of the source ontology entity.
$CTX_S (Source Context): Labels of potentially meaningful entities related to the source entity.
$T (Target): The main label(s) of the target ontology entity.
$CTX_T (Target Context): Labels of potentially meaningful entities related to the target entity.
$TYPE (Equivalence Type): The type of equivalence to be assessed.</p>
          <p>A brief description of each of these elements is presented in Table 1 and their possible values are
presented in Table 2.</p>
          <p>Table 2. Possible values of the prompt elements.
Task Context: "You are doing an ontology alignment task."
Instruction: "I am going to ask you a question and you should answer ’yes’ or ’no’."
Confidence (float): "Followed by confidence as a score from 0 to 1 (e.g., ’yes:0.8’)."
Confidence (categorical): "Followed by confidence as ’Not Confident’, ’Confident’, or ’Very Confident’ (e.g., ’yes:Confident’)."
Context (subclass_of): "a subclass of $SC", with ’$SC’ being a superclass of either $S or $T.
Context (kind_of): "a kind of $SC", with ’$SC’ being a superclass of either $S or $T.
Equivalence Type (Equivalent): "equivalent".</p>
          <p>Figure 1. A prompt without hierarchical context: You are doing an ontology alignment task. I am going to ask you a question and you should answer ’yes’ or ’no’,
followed by your confidence in your answer as a score from 0 to 1, like this: ’yes:0.8’.
Question: Are ’Neuraminidase Deficiency’ and ’glycoproteinosis’ equivalent?</p>
          <p>A set of configuration parameters indexed by k ∈ {1, . . . , K} governs the substitution of these placeholders
with specific verbalizations. We define four vectors:
t = (t_1, t_2, . . . , t_K), g = (g_1, g_2, . . . , g_K),
s = (s_1, s_2, . . . , s_K), l = (l_1, l_2, . . . , l_K),
where:
• t_k ∈ {True, False} indicates whether to include a task context.
• g_k ∈ Γ specifies the comparison prompt.
• s_k ∈ Σ represents the semantic context prompt.
• l_k ∈ ℒ denotes the confidence type (e.g., float or categorical).</p>
          <p>Then, for each configuration k ∈ {1, . . . , K}, the static base template B corresponds to a unique
pattern combining the four elements &lt;t_k, g_k, s_k, l_k&gt; by replacing each placeholder in B with
the appropriate string. The output of this stage is the set {P_k}, k = 1, . . . , K, of static templates that capture the
invariant, configuration-specific aspects of the prompt.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Dynamic Stage: Instance-Specific Enrichment</title>
          <p>Let {(s_i, t_i)}, i = 1, . . . , N, be the set of source–target entity pairs in the dataset. In the dynamic stage, each static
template P_k is enriched with instance-specific information to produce a dynamically-built prompt.</p>
          <p>For each entity pair (s_i, t_i) and each static template P_k, a dynamic prompt d_ik is generated according
to
d_ik = dynamic(P_k; s_i, t_i, L, C),
where L and C denote the label-formatting and context-formatting functions described below:
1. Label Formatting: The entities s_i and t_i provide label sets, which are formatted (e.g., truncated
to a specified cardinality and concatenated with a given delimiter) to yield L(s_i) and L(t_i). These
formatted labels replace the placeholders $S and $T, respectively.
2. Contextual Enrichment: Additional contextual information is extracted from the ontology and
formatted as C(s_i) and C(t_i), replacing the placeholders $CTX_S and $CTX_T. In cases of absent
context, extraneous semantic tokens may be removed. In this work, we focused on subsumption
relations to include hierarchical context.</p>
          <p>Thus, the dynamic prompt for each static skeleton and entity pair is obtained via the function dynamic,
and for each k, the complete set of dynamic prompts is given by
D_k = {d_ik}, i = 1, . . . , N.</p>
          <p>These dynamically enriched prompts form the final output features for the dataset.</p>
          <p>In summary, the static stage produces a family of invariant templates and the dynamic stage adapts
these templates to each instance ( ,  ). Figures 1 and 2 illustrate two prompt examples with and
without hierarchical context.</p>
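          <p>The dynamic stage can be sketched as below. The skeleton literal, helper names, and label-formatting defaults are illustrative assumptions of ours; superclass slots are resolved before entity labels because the token ’$SC_T’ contains the substring ’$S’ and would otherwise be corrupted by a naive substitution:</p>

```python
# A static skeleton as the static stage might produce it (illustrative wording).
SKELETON = ("You are doing an ontology alignment task. I am going to ask you a "
            "question and you should answer 'yes' or 'no', followed by your "
            "confidence in your answer as a score from 0 to 1, like this: 'yes:0.8'. "
            "Question: Are '$S' (a kind of $SC_S) and '$T' (a kind of $SC_T) equivalent?")

def format_labels(labels, max_labels=1, delimiter="; "):
    # Label formatting L(.): truncate the label set and join with a delimiter.
    return delimiter.join(labels[:max_labels])

def dynamic(skeleton, source, target):
    """Enrich one static skeleton with instance-specific data for a pair (s_i, t_i)."""
    prompt = skeleton
    # Contextual enrichment C(.): resolve superclass slots first ('$SC_T'
    # contains the substring '$S', so these must go before the entity labels).
    for ph, entity in (("S", source), ("T", target)):
        sc = entity.get("superclass")
        if sc:
            prompt = prompt.replace("$SC_" + ph, sc)
        else:
            # Absent context: strip the extraneous semantic tokens.
            for verb in ("a kind of", "a subclass of"):
                prompt = prompt.replace("(%s $SC_%s)" % (verb, ph), "")
    # Label formatting: fill $S and $T.
    for ph, entity in (("S", source), ("T", target)):
        prompt = prompt.replace("$" + ph, format_labels(entity["labels"]))
    return " ".join(prompt.split())

source = {"labels": ["Neuraminidase Deficiency"], "superclass": "Mucolipidosis"}
target = {"labels": ["glycoproteinosis"], "superclass": "lysosomal storage disease"}
prompt = dynamic(SKELETON, source, target)
```

          <p>Running this on the pair above reproduces the hierarchical-context prompt shown in Figure 2; omitting the superclass keys yields the context-free variant of Figure 1.</p>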
          <p>Figure 2. A prompt with hierarchical context: You are doing an ontology alignment task. I am going to ask you a question and you should answer ’yes’
or ’no’, followed by your confidence in your answer as a score from 0 to 1, like this: ’yes:0.8’. Question: Are
’Neuraminidase Deficiency’ (a kind of Mucolipidosis) and ’glycoproteinosis’ (a kind of lysosomal storage disease)
equivalent?</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Models</title>
        <p>Our experiments evaluated the prompts in five different language models with varying numbers of
parameters and reasoning capabilities. The Flan-T5-Base model [18] (with 220 million parameters)
is a lightweight transformer model developed by Google, tailored for instruction-based tasks and
without any reasoning capabilities. The Claude 3.7 Sonnet model [19] was developed by Anthropic and
is a significantly larger model than lightweight models such as Flan-T5-Base, possessing 137 billion
parameters but also lacking reasoning capabilities. Our experiments also incorporate GPT4 [20], a
large-scale model comprising 1.76 trillion parameters, which represents a significant milestone in enhancing
linguistic fluency and contextual comprehension within generative language models. Additionally, we
also analyse the performance of two state-of-the-art reasoning models: GPT4o [21], a multimodal model
with 200 billion parameters, and OpenAIo1 [22], another multimodal model with 175 billion parameters.</p>
        <p>Table 3. Models evaluated.
Model | Source | Number of Parameters | Reasoning
Flan-T5-base | Google | 220 million | No
Claude 3.7 Sonnet | Anthropic | 137 billion | No
OpenAIo1 | OpenAI | 175 billion | Yes
GPT4o | OpenAI | 200 billion | Yes
GPT4 | OpenAI | 1.76 trillion | No</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Parsing Model Responses</title>
        <p>Let f denote a parsing function that maps a textual response r, generated by a predictive model, into a
numerical confidence score c ∈ [−1, 1]. This confidence score quantifies the certainty associated with a
binary classification decision, indicating either a positive or negative outcome.</p>
        <p>Parsing Process: The function f is defined through the sequential application of the following
procedures:
1. Text Normalization:
• Transform the textual response r into lowercase form: r ← lowercase(r).
• Remove leading and trailing white space: r ← trim(r).
2. Numeric Confidence Extraction:
• Use regular expressions to search for numeric confidence values within r.
• If a numeric value is found, convert it to a float and clip its value to the range [0, 1]. For
instance, a response including "0.85" would yield c = 0.85.
3. Default Uncertainty Handling:
• In the absence of numeric values, assign a default confidence c = 1.0. This default ensures
reliance exclusively on the binary signal derived from keyword polarity.
4. Solution Polarity Determination:
• Adjust the polarity of the mapping based on explicit binary indicators:</p>
        <p>p = 0, if "no" (negative) is detected;
p = 1, if "yes" (positive) is detected;
c = 0.0, if neither or both indicators ("yes" and "no") are detected.</p>
        <p>This parsing approach enables consistent extraction of numerical confidence scores from multiple
textual responses generated for each query.</p>
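        <p>A Python sketch of this parsing procedure follows. Mapping a "no" answer to the negative half of [−1, 1] (i.e., returning −c for negative polarity) is our reading of how the polarity combines with the extracted confidence; the regular expressions are illustrative:</p>

```python
import re

def parse_response(response: str) -> float:
    """Map a model's textual answer to a signed confidence score in [-1, 1]."""
    # 1. Text normalization: lowercase and trim surrounding whitespace.
    r = response.lower().strip()
    # 2. Numeric confidence extraction: first number in the text, clipped to [0, 1].
    m = re.search(r"\d+(?:\.\d+)?", r)
    # 3. Default uncertainty handling: fall back to 1.0 (pure keyword polarity).
    conf = min(max(float(m.group()), 0.0), 1.0) if m else 1.0
    # 4. Solution polarity determination from explicit binary indicators.
    has_yes = bool(re.search(r"\byes\b", r))
    has_no = bool(re.search(r"\bno\b", r))
    if has_yes == has_no:  # neither or both indicators detected
        return 0.0
    return conf if has_yes else -conf

print(parse_response("Yes:0.8"))   # 0.8
print(parse_response("no"))        # -1.0 (default confidence, negative polarity)
print(parse_response("maybe"))     # 0.0
```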
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Evaluation</title>
        <p>
          Our preliminary experiments focused on a subset of the mappings for the NCIT-DOID task of
Bio-ML [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. This track includes a special dataset, Bio-ML LLM, which contains 50 randomly selected
matched class pairs from ground-truth mappings, excluding pairs that can be aligned with direct string
matching (i.e., having at least one shared label). This restricts the efficacy of conventional lexical
matching. Of these 50 pairs, we selected the six which were considered particularly hard to detect.
For each source class in these "very hard" mappings, we created an additional "hard" negative (i.e., a
target class with some lexical similarity to the source). The mappings are listed in Table 4.
        </p>
        <p>Table 4. Selected "very hard" mappings and their "hard" negatives (Status 1 = true mapping, 0 = negative).
Source | Target | Status
Esophageal Verrucous Carcinoma | esophagus verrucous carcinoma | 1
Esophageal Verrucous Carcinoma | esophageal varix | 0
Diabetic Vascular Disorder | diabetic angiopathy | 1
Diabetic Vascular Disorder | diabetic encephalopathy | 0
Malignant Hypopharyngeal Neoplasm | hypopharynx cancer | 1
Malignant Hypopharyngeal Neoplasm | malignant granular cell skin tumor | 0
Neuraminidase Deficiency | glycoproteinosis | 1
Neuraminidase Deficiency | biotinidase | 0
Bone Necrosis | ischemic bone disease | 1
Bone Necrosis | dysbaric osteonecrosis | 0
Microcystic Adnexal Carcinoma | malignant syringoma | 1
Microcystic Adnexal Carcinoma | nasopharynx carcinoma | 0</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Table 5 presents the confusion matrices for the preliminary experiments. When the prompt does not
include hierarchical contextual information, the best-performing models are OpenAIo1 and GPT4o,
which, despite being smaller than GPT4, have improved reasoning capabilities. These reasoning
capabilities may help the models perform better when there is less information available. In fact, GPT4 ranks
fourth despite being the largest model.</p>
      <p>When semantic contextual information is given to the models, we observed very different behaviours
between the "kind_of" prompt and the "subclass_of" prompt. While the "kind_of" prompt improved
results for the non-reasoning models, for the reasoning models it had either no impact or a small
negative impact. The "subclass_of" prompt, however, did not perform as well, having a negative impact
on most models. These results demonstrate that hierarchical contextual information should be considered
when designing prompts for biomedical ontology alignment. It is worth noting that the second
best-performing approach was the pairing of Claude 3.7 Sonnet with the "kind_of" prompt, which
achieved nearly identical results to GPT4 while being 10% of its size.</p>
      <p>Table 5. Per-model confusion matrices (Pred: 1/0 vs. Actual: 1/0), without hierarchical context (w/o HC) and with hierarchical context (w/ HC, "kind of"), for flan-t5-base, Claude 3.7 Sonnet, OpenAIo1, GPT4o, and GPT4. [Cell values not recoverable from the source.]</p>
      <p>We also investigated in more depth some false negative cases, depicted in Table 6. Some mappings,
such as "Neuraminidase Deficiency - glycoproteinosis", are missed by all models, regardless of the
context that is imparted in the prompt. Curiously, some sources indicate that this may not actually be
an equivalence but rather a subsumption, with the corresponding diseases being modelled as such in
ICD-10 (categories E77 and E77.1). However, including the hierarchical context in the form of "kind_of"
prompts mitigates these issues, with most models, especially the mid- to large-sized ones, improving their
recall of hard-to-find positive mappings.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study explored the effectiveness of semantic prompting strategies, particularly the use of
hierarchical contextual information, in enhancing biomedical ontology alignment with language models.
Our experiments revealed that the impact of the inclusion of hierarchical context depended on the
prompt wording. While the "kind_of" prompt — which more closely aligns with everyday language
— improved the performance for non-reasoning models, the "subclass_of" prompt generally led to
decreased performance. These findings highlight that the value of adding semantic context is heavily
influenced by the verbalisation used when designing prompts.</p>
      <p>We also found that smaller models like OpenAIo1 and GPT4o outperformed larger models like GPT4
when no hierarchical context was included in the prompt. This suggests that smaller models with better
reasoning capabilities may perform more effectively when limited information is provided. Interestingly,
the pairing of Claude 3.7 Sonnet with the "kind_of" prompt delivered nearly identical results to GPT4,
despite being only 10% of its size, showing that less resource-intensive models can still achieve strong
performance when combined with the right prompting strategies.</p>
      <p>Additionally, the inclusion of hierarchical context through the "kind_of" prompt improved the recall
of hard-to-find mappings, especially for mid- to large-sized models. However, some mappings remained
challenging for all models, indicating that certain biomedical ontology mappings require more advanced
approaches.</p>
      <p>Future work will focus on extending the prompt design framework to include in-context learning
based on positive and negative examples and developing additional strategies to extract semantic context
by exploring common biomedical ontology features such as partonomy, rich synonyms and logical
definitions.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by FCT through the fellowships 2022.10557.BD (Pedro Cotovio), and the
LASIGE Research Unit, ref. UID/00408/2025. It was also partially supported by the KATY project which
has received funding from the European Union’s Horizon 2020 research and innovation program under
grant agreement No 101017453. This work was also supported partially by project 41, HfPT: Health
from Portugal, funded by the Portuguese Plano de Recuperação e Resiliência.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4o for grammar and spelling
checking. After using this tool, the author(s) reviewed and edited the content as needed and
take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Euzenat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shvaiko</surname>
          </string-name>
          , Ontology matching, 2nd ed., Springer-Verlag, Heidelberg (DE),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Portisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hladik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          ,
          <article-title>Background knowledge in ontology matching: A survey</article-title>
          ,
          <source>Semantic Web</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>2639</fpage>
          -
          <lpage>2693</lpage>
          . doi:10.3233/SW-223085.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <article-title>Ontology matching: State of the art, future challenges, and thinking based on utilized information</article-title>
          ,
          <source>in: Proceedings of the 19th International Workshop on Ontology Matching.</source>
          , volume
          <volume>9</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>91235</fpage>
          -
          <lpage>91243</lpage>
          . doi:10.1109/ACCESS.2021.3057081.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Faria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Balasubramani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pesquita</surname>
          </string-name>
          , AgreementMakerLight,
          <source>Semantic Web</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Cuenca</given-names>
            <surname>Grau</surname>
          </string-name>
          ,
          <article-title>Logmap: Logic-based and scalable ontology matching</article-title>
          ,
          <source>in: The Semantic Web-ISWC</source>
          <year>2011</year>
          : 10th International Semantic Web Conference, Bonn, Germany,
          <source>October 23-27</source>
          ,
          <year>2011</year>
          , Proceedings,
          <source>Part I 10</source>
          , Springer,
          <year>2011</year>
          , pp.
          <fpage>273</fpage>
          -
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Faria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pesquita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <article-title>Tackling the challenges of matching biomedical ontologies</article-title>
          ,
          <source>Journal of Biomedical Semantics</source>
          <volume>9</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Improving language understanding by generative pre-training</article-title>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , volume
          <volume>1</volume>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <article-title>Exploring large language models for ontology alignment</article-title>
          ,
          <source>arXiv preprint arXiv:2309.07172</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Macilenti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fiorelli</surname>
          </string-name>
          ,
          <article-title>Prompting is not all you need: Evaluating GPT-4 performance on a real-world ontology alignment use case</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>246</volume>
          (
          <year>2024</year>
          )
          <fpage>1289</fpage>
          -
          <lpage>1298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <article-title>Machine learning-friendly biomedical datasets for equivalence and subsumption ontology matching</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2022</year>
          , pp.
          <fpage>575</fpage>
          -
          <lpage>591</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>