<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Conversational Ontology Alignment with ChatGPT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sanaz Saki Norouzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Saeid Mahdavinejad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pascal Hitzler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Kansas State University</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This study evaluates the applicability and efficiency of ChatGPT for ontology alignment using a naive approach. ChatGPT's output is compared to the results of the Ontology Alignment Evaluation Initiative 2022 campaign on the conference track ontologies. This comparison is intended to provide insight into the capabilities of a conversational large language model when used in a naive way for ontology matching, and to investigate the potential advantages and disadvantages of this approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology alignment</kwd>
        <kwd>ChatGPT</kwd>
        <kwd>Schema matching</kwd>
        <kwd>Ontology matching</kwd>
        <kwd>Large language models</kwd>
        <kwd>Prompt engineering</kwd>
        <kwd>LLM behavior</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Ontology alignment (OA), also referred to as ontology matching, is a central task in semantic
web technologies that aims to find semantic correspondences between two ontologies with
overlapping domains. As the use of ontologies extends to ever more fields, the task's
importance keeps growing: ontology matching is needed to bridge the semantic gap
between heterogeneous ontologies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Although OA looks back on many years of research,
the task remains challenging, often requiring expert intervention to ensure accurate results.
Expert-driven matching can be both time-consuming and subject to human biases, so even
then absolute precision remains elusive [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. To tackle this challenge, a variety
of ontology matching systems, incorporating natural language processing (NLP) techniques
that handle grammatical variation and different similarity measures, machine learning, fuzzy
lexical matching, and other advanced methodologies, have been proposed in the Ontology
Alignment Evaluation Initiative (OAEI) 2022 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Each approach attempts to automate the matching process
and alleviate the need for extensive human involvement.
      </p>
      <p>
        With the emergence of large language models (LLMs), we have seen impressive results on many
NLP downstream tasks. The use of LLMs for human-centric tasks has recently increased, and
models like ChatGPT by OpenAI have attracted attention for tasks such as logical
reasoning [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], question answering [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and mental health analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Prompt engineering is a
skill required to work with LLMs efficiently: a prompt can be seen as a directive for
interacting with an LLM that adjusts and controls its output [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Generally, there
are three main approaches to using LLMs: fine-tuning, few-shot prompting, and zero-shot
prompting. Fine-tuning is helpful for downstream tasks because it adapts the knowledge the
LLM acquired during pre-training to the specific task. However, models like GPT-3 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] have been reported to
generate useful responses for tasks they were never trained on, which has made prompt
engineering more popular. In few-shot prompting, a few examples of the task and of the
input/output format are given to the model so that it can produce output in that format,
whereas in zero-shot prompting the model's performance is evaluated from a single prompt,
using only its pre-trained knowledge. Thus, prompt patterns strongly influence the results
these LLMs produce.
      </p>
      <p>In this paper, we conduct a comparative analysis of ChatGPT’s performance in ontology
alignment when prompted with different strategies. We compare ChatGPT's output with the
reference alignments provided by the Ontology Alignment Evaluation Initiative (OAEI) 2022
campaign, which uses conference-related ontologies. By evaluating ChatGPT’s performance
in a zero-shot manner, we aim to shed light on the capabilities and limitations of using a
conversational large language model for ontology matching. Furthermore, we discuss the
implications of our findings and propose potential directions for future research in this exciting
area.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>Data</title>
        <p>
          Our evaluation focuses on conference track ontologies provided by the OAEI [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], encompassing
seven ontologies: cmt, conference, sigkdd, iasted, ekaw, edas, and confOf. This selection yields
21 pairs of ontologies to match. We use the original reference alignment, known as ra1
(https://oaei.ontologymatching.org/2023/conference/data), for our evaluation. The OAEI notes
that M3 evaluation means both properties and classes are considered for matching; we therefore
use the ra1-M3 OAEI 2022 results for comparison.
        </p>
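<p>The count of 21 matching tasks follows directly from taking every unordered pair of the seven ontologies. A minimal sketch (variable names are ours, not from the paper):</p>

```python
from itertools import combinations

# The seven OAEI conference-track ontologies used in the evaluation.
ontologies = ["cmt", "conference", "sigkdd", "iasted", "ekaw", "edas", "confOf"]

# Each unordered pair of distinct ontologies is one matching task: C(7, 2) = 21.
pairs = list(combinations(ontologies, 2))
print(len(pairs))  # 21
```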
      </sec>
      <sec id="sec-2-2">
        <title>Prompts and Formatting</title>
        <p>An essential aspect of this evaluation involves designing prompts that effectively incorporate
the triples from the conference track ontologies. We explore different approaches to including
ontology triples in the prompts, with two primary methods considered: converting triples into
sentences and transforming them into formatted text following the pattern Predicate(Subject,
Object).</p>
        <p>After conducting experiments and assessing the effectiveness of different prompt
approaches, we chose to adopt the formatted-text approach for our prompts, which aligns well
with suggestions from OpenAI. This formatting presents triples in a structured manner, making
it easier for ChatGPT to comprehend them and generate appropriate responses. For instance, an
original triple such as "track subclassOf conference_part" can be represented as "Is-a(track,
conference part)" using the formatted-text approach. Similarly, properties are expressed in the
same structured format, such as "authorOf(Person, Document)".
</p>
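<p>The triple-to-text conversion described above can be sketched as a small helper. The function name and the underscore handling are our illustrative choices; the paper only specifies the Predicate(Subject, Object) pattern and the Is-a rendering of subclassOf:</p>

```python
def format_triple(subject: str, predicate: str, obj: str) -> str:
    """Render an ontology triple as Predicate(Subject, Object) text.

    'subClassOf' is rewritten as the more readable 'Is-a', and underscores
    in entity names are replaced with spaces, following the paper's example.
    """
    if predicate.lower() == "subclassof":
        predicate = "Is-a"
    return f"{predicate}({subject.replace('_', ' ')}, {obj.replace('_', ' ')})"

print(format_triple("track", "subClassOf", "conference_part"))
# Is-a(track, conference part)
print(format_triple("Person", "authorOf", "Document"))
# authorOf(Person, Document)
```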
        <p>The context-length limitation of the basic version of ChatGPT (v3.5), which we elaborate on
in the discussion section, led us to divide the input into smaller parts instead of using one long
prompt. This approach allowed us to maintain essential context throughout the interaction,
resulting in better understanding by the model and more accurate responses.</p>
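<p>Splitting the input into smaller parts can be sketched as simple batching of the formatted triples. The batch size of 50 is an illustrative value, not the one used in the experiments:</p>

```python
def chunk_triples(triples, max_per_prompt=50):
    """Split a long list of formatted triples into smaller batches so that
    each prompt stays within the model's context window."""
    return [triples[i:i + max_per_prompt]
            for i in range(0, len(triples), max_per_prompt)]

# 120 placeholder triples split into batches of at most 50.
batches = chunk_triples([f"t{i}" for i in range(120)], max_per_prompt=50)
print([len(b) for b in batches])  # [50, 50, 20]
```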
        <p>In our early experiments, we found that adding more complex ontology axioms made it more
difficult for ChatGPT to capture the best possible matches between two ontologies. Therefore,
we decided to include only axioms that can be directly expressed as triples. We formulated our
prompt with a structured approach as follows:</p>
        <p>&lt;Problem Definition&gt;
In this task, we are given two ontologies in the form of Relation(Subject, Object), which
consist of classes and properties.</p>
        <p>&lt;Ontologies Triples&gt;</p>
        <p>Ontology 1:
Ontology 1 Triples</p>
        <p>Ontology 2:
Ontology 2 Triples</p>
        <p>&lt;Objective&gt;
Our objective is to provide ontology mapping for the provided ontologies based on
their semantic similarities.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Analysis</title>
      <p>In this section, we present the results of our evaluation. The objective was to gain insight
into this approach's potential advantages and disadvantages. Among the prompts, "prompt 7"
demonstrated the highest recall. However, it generated considerably more statements than
"prompt 1", since it is repeated for each class/property name and tries to find the best match
for each of them. The increased recall thus came at the cost of reduced precision; some of the
generated statements were deemed irrelevant even by non-expert evaluators. Nonetheless,
"prompt 7" exhibited the highest F1-score among all the prompts, striking a balance between
recall and precision.</p>
      <p>The first three prompts are similar in essence but have different objectives, and their
F1-scores are almost the same. Asking for a complete and comprehensive matching gives the
highest recall, but also the lowest precision. On average, the first prompt achieved the best
balance between recall and precision. Interestingly, prompts that explicitly asked for matching
classes or properties, such as prompts 4 and 5, yielded higher recall but lower precision and
F1-scores. This drawback can, however, be mitigated by domain experts, who can easily filter
out irrelevant generated statements. For a more comprehensive evaluation, we compare our
results with the OAEI 2022 results in Table 2; the prompts' results are shown in Table 3.</p>
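<p>The precision, recall, and F1 figures discussed above follow the standard set-based alignment evaluation: generated correspondences are compared against the reference alignment. A minimal sketch with toy correspondences (the entity names below are illustrative, not results from the paper):</p>

```python
def precision_recall_f1(predicted, reference):
    """Score a predicted alignment (set of correspondences) against the
    reference alignment using precision, recall, and F1."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: two generated matches, one of which is in the reference.
pred = {("cmt:Paper", "ekaw:Paper"), ("cmt:Author", "ekaw:Writer")}
ref = {("cmt:Paper", "ekaw:Paper")}
p, r, f = precision_recall_f1(pred, ref)  # high recall, lower precision
```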
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Our evaluation highlighted a significant challenge related to precision. The generated statements
often introduced errors that caused a decrease in precision. We identified several factors
contributing to this issue:</p>
      <p>ChatGPT context length limit: We used ChatGPT (v4.0) in our experiments because
ChatGPT (v3.5) struggled to retain context when the input was lengthy, affecting its performance
on ontology alignment tasks. ChatGPT (v4.0) has improved contextual understanding and adapts
better to long inputs, and its maximum context length of 8192 tokens accommodates both
ontologies' triples within the prompt.</p>
      <p>Inverse Functional Properties: These properties can lead to imprecise matches if they are
not properly accounted for. For example, ChatGPT matched the statement hasBeenAssigned(Reviewer,
Paper) to hasReviewer(Paper, Possible_Reviewer). However, the correct entity for this match is
ReviewerOfPaper, which is the inverse of hasReviewer. Properly accounting for this inverse
relationship enhances precision by reducing the number of false positives.</p>
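<p>One way to account for inverse relationships, sketched under our own naming (the inverse table here is illustrative), is to canonicalize each matched property before scoring, flipping subject and object whenever a declared inverse is used:</p>

```python
# Illustrative table of declared inverse properties (not from the paper).
INVERSES = {"hasReviewer": "ReviewerOfPaper"}

def normalize(match):
    """Canonicalize a (predicate, subject, object) triple by rewriting a
    declared inverse property, so hasReviewer(Paper, Reviewer) and
    ReviewerOfPaper(Reviewer, Paper) count as the same correspondence."""
    pred, subj, obj = match
    if pred in INVERSES:
        return (INVERSES[pred], obj, subj)
    return match

a = normalize(("hasReviewer", "Paper", "Possible_Reviewer"))
print(a)  # ('ReviewerOfPaper', 'Possible_Reviewer', 'Paper')
```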
      <p>Matches with Subclasses: The generated alignments sometimes matched a class in one
ontology to one class and all its subclasses in the other, leading to unintended matches.
For instance, in the conference-edas matching, "active_conference_participant" and
"passive_conference_participant", which are subclasses of conf_participant, are both matched
to the same class.</p>
      <p>Uncertain Matching: In certain cases, even though ChatGPT acknowledges that a matching
is unlikely, it still generates such matches and proposes new entities to be included in the graph.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper, we have evaluated the applicability and efficiency of ChatGPT for ontology
alignment using a naive approach. Our evaluation showed that ChatGPT can achieve high
recall but suffers from low precision. We identified several factors contributing to this issue,
including ChatGPT's context length limit, the handling of inverse functional properties, matches
with subclasses, unseen alignments, and uncertain matchings. Despite these challenges, we
believe that ChatGPT has the potential to be a valuable tool for ontology alignment. Its high
recall means that it can be used to identify a large number of potential matches, which can then
be filtered by domain experts. Additionally, its ability to generate new entities suggests that it
could be used to expand reference ontologies. In future work, we plan to address the precision
issues identified in this paper and to explore other ways of using ChatGPT for ontology
alignment, such as generating prompts for more sophisticated alignment algorithms. Overall,
the results of this paper demonstrate the potential of ChatGPT to improve the efficiency and
effectiveness of ontology alignment tasks.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the National Science Foundation (NSF) under Grant 2033521 A1.
Any opinions, findings, conclusions, or recommendations expressed in this material are those
of the authors and do not necessarily reflect the views of the NSF.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Shvaiko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Euzenat</surname>
          </string-name>
          ,
          <article-title>Ontology matching: state of the art and future challenges</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>25</volume>
          (
          <year>2011</year>
          )
          <fpage>158</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Trojahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vieira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pease</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Guizzardi</surname>
          </string-name>
          ,
          <article-title>Foundational ontologies meet ontology matching: A survey</article-title>
          ,
          <source>Semantic Web</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>685</fpage>
          -
          <lpage>704</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Matentzoglu</surname>
          </string-name>
          ,
          <article-title>Measuring expert performance at manually classifying domain entities under upper ontology classes</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>57</volume>
          (
          <year>2019</year>
          )
          <fpage>100469</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cheatham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <article-title>Conference v2.0: An uncertain version of the OAEI conference benchmark</article-title>
          , in: P. Mika, T. Tudorache, A. Bernstein, C. Welty, C. A. Knoblock, D. Vrandecic, P. Groth, N. F. Noy, K. Janowicz, C. A. Goble (Eds.),
          <source>The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part II</source>
          , volume
          <volume>8797</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2014</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>48</lpage>
          . doi:10.1007/978-3-319-11915-1_3.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. A. N.</given-names>
            <surname>Pour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Algergawy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Fallatah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Faria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Fundulaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hertling</surname>
          </string-name>
          , et al.,
          <article-title>Results of the ontology alignment evaluation initiative 2022</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Teng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Evaluating the logical reasoning ability of chatgpt and gpt-4</article-title>
          ,
          <source>arXiv preprint arXiv:2304.03439</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <article-title>Evaluation of chatgpt as a question answering system for answering complex questions</article-title>
          ,
          <source>arXiv preprint arXiv:2303.07992</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <article-title>On the evaluations of chatgpt and emotion-enhanced prompting for mental health analysis</article-title>
          ,
          <source>arXiv preprint arXiv:2304.03347</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hays</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gilbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnashar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Spencer-Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <article-title>A prompt pattern catalog to enhance prompt engineering with chatgpt</article-title>
          ,
          <source>arXiv preprint arXiv:2302.11382</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zamazal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Svátek</surname>
          </string-name>
          ,
          <article-title>The ten-year ontofarm and its fertilization within the onto-sphere</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>43</volume>
          (
          <year>2017</year>
          )
          <fpage>46</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>