LLM-driven Ontology Evaluation: Verifying Ontology Restrictions with ChatGPT

Stefani Tsaneva∗, Stefan Vasic and Marta Sabou

Vienna University of Economics and Business, Welthandelsplatz 1, 1020, Vienna, Austria

Abstract
Recent advancements in artificial intelligence, particularly in large language models (LLMs), have sparked interest in their application to knowledge engineering (KE) tasks. While existing research has primarily explored the utilisation of LLMs for constructing and completing semantic resources such as ontologies and knowledge graphs, the evaluation of these resources, which addresses quality issues, has not yet been thoroughly investigated. To address this gap, we propose an LLM-driven approach for the verification of ontology restrictions. We replicate our previously conducted human-in-the-loop experiment using ChatGPT-4 instead of human contributors to assess whether comparable ontology verification results can be obtained. We find that (1) ChatGPT-4 achieves intermediate-to-expert scores on an ontology modelling qualification test; (2) the model performs ontology restriction verification with an accuracy of 92.22%; (3) combining model answers on the same ontology axiom represented in different formalisms improves the accuracy to 96.67%; and (4) higher accuracy is observed in identifying defects related to the incompleteness of ontology axioms than in identifying errors due to restriction misuse. Our results highlight the potential of LLMs in supporting knowledge engineering tasks and outline future research directions in the area.

Keywords
ontology evaluation, large language models, defect detection

1. Introduction

Knowledge graphs (KGs) conceptualise real-world knowledge and act as a foundational component in many advanced intelligent applications harnessing human knowledge [1].
With the emergence of the 3rd wave of AI [2], KGs and other semantic resources such as taxonomies and ontologies have been explored for their potential benefit to machine learning models [3, 4, 5]. Nevertheless, ensuring the quality of the knowledge corpus is crucial for preventing incorrect outputs, bias and potential harm caused by the enabled systems.

The evaluation of semantic resources plays a key role in ensuring the quality of these resources, yet it is a time- and cost-intensive task [6, 7]. While automated approaches can detect some quality issues, such as logical inconsistencies, there is also a family of errors that require human-centric judgement to be detected (e.g., concepts not aligned with human cognition, inaccurately represented domain facts) [6]. While the traditional approach of involving a domain expert for the evaluation has been complemented by human computation & crowdsourcing approaches, which leverage the wisdom of the crowds at a reduced cost, the evaluation of large semantic resources remains a challenge. Large language models (LLMs) have shown performance similar to humans on a number of natural language tasks that typically require commonsense or domain knowledge, thus reducing the needed human intervention [8].

Data Quality meets Machine Learning and Knowledge Graphs, DQMLKG Workshop at ESWC 2024
∗ Corresponding author.
Email: stefani.tsaneva@wu.ac.at (S. Tsaneva); h12012581@wu.ac.at (S. Vasic); marta.sabou@wu.ac.at (M. Sabou)
ORCID: 0000-0002-0895-6379 (S. Tsaneva); 0000-0001-9301-8418 (M. Sabou)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

With recent advances of LLMs and their application in a broad range of tasks, interest in the synergy between LLMs and knowledge engineering (KE) has emerged: in [9] a road map of current and future research directions combining LLMs and KGs is proposed; LLM-based support for knowledge engineering tasks that are part of the CommonKADS [10] knowledge engineering methodology is discussed in [11]; and changes to the ontology engineer’s role and potential benefits from LLM advancements are presented in [12]. While LLM-enabled KG construction and completion have gained considerable research attention (e.g., [13, 14, 15, 16]), other knowledge engineering tasks such as the quality assessment of semantic resources with LLMs have not yet been sufficiently explored.

In this paper, we address this gap by performing an experimental investigation of the capabilities of LLMs in identifying quality issues of semantic resources. Since ontologies serve as a basis for KGs and capture more complex structures than taxonomies, we focus on them in this paper. Building on our prior work on human-in-the-loop (HiL) ontology evaluation [17], we perform a differentiated replication, substituting ChatGPT-4 for semi-experts, for one particular ontology evaluation task: the verification of ontology restrictions. We explore to what extent LLMs’ capabilities to verify ontology axioms are comparable to the judgements obtained from human contributors. We experiment with several settings varying the representational formalism in which ontology axioms are included in the prompt and show that ChatGPT-4 reaches a verification accuracy of up to 96.67%, nearly matching the benchmark of 100% accuracy attained through human majority votes.
Our investigation further reveals that: (1) the ontology axiom representation used in the prompts influences the verification scores; (2) a majority-voting strategy combining responses from differently designed prompts can yield recall improvements; and (3) incomplete axioms are easily detected by the model while axioms containing improperly used restrictions are more challenging to identify.

The rest of the paper is structured as follows: In Sect. 2 we discuss related work. In Sect. 3 we give an overview of the performed replication study and of how each component from the original experiment was adapted. In Sect. 4 we describe our LLM-based ontology verification approach and discuss results in Sect. 5. Limitations, lessons learned and future work are summarised in Sect. 6.

2. Related Work

Our study intersects with two main research areas. First, we discuss the application of (L)LMs within knowledge engineering tasks that incorporate an evaluation component. Second, we present studies approaching human-centric evaluation tasks using LLMs or exploring the extent of knowledge that LLMs possess.

(L)LM-augmented knowledge engineering tasks. The support of LLMs for KG construction and completion has attracted much research interest in recent years (e.g., the LM-KBC challenge¹). However, other knowledge engineering tasks such as the evaluation of semantic resources have not yet received much attention or have been included only as secondary tasks. The identification of incorrect KG triples has been briefly addressed in [8] as part of a KG generation process. However, no concrete quantitative results or comparison with other automatic or manual approaches are provided. As part of a KG link prediction approach, PKGC [18] includes an LM-based binary classification of predicted triples as correct or incorrect, where triples are represented as natural language sentences. The triple classification model reaches up to 86.2% accuracy, suggesting the potential of (L)LM-assisted KG evaluation.
In this paper we focus on the LLM-based evaluation of ontology restrictions and the detection of concrete defect types, a KE task that, to the best of our knowledge, has not yet been explored with LLMs. In addition, we investigate the effects of different ontology representations (i.e., when axioms are presented in a machine-readable format versus as natural language sentences) on the verification performance.

Human-in-the-loop vs LLM-in-the-loop. Recently, the authors of [19] compared the performance of LLMs and human contributors when presented with the task of evaluating the quality of (automatically) generated text and found that models such as ChatGPT provided ratings similar to the experts’ judgements. Additionally, the authors identified that LLMs bring some additional benefits: (1) compared to human judgements, which may vary across groups and time points, LLMs provided more reproducible results; (2) each text sample was independently evaluated by the models, while human contributors tend to draw comparisons between different samples; and (3) LLMs offer cheaper and faster task completion. Nevertheless, the paper also outlines current LLM challenges such as potentially presenting incorrect factual knowledge or biased perspectives.

Additionally, several approaches have been taken to assess LLMs using qualification exams. For instance, in [20] a comparison between the scores of LLMs and post-graduate students is presented on multiple-choice questions in the clinical chemistry domain. The authors show that ChatGPT-4’s scores match the best student scores while ChatGPT-3.5, Bing and Bard scored above average.

Inspired by the results in other domains, in this work we perform a comparison between ChatGPT-4’s and human contributors’ skills in verifying ontology axioms, a task requiring logical reasoning, which LLMs have previously been shown to mostly lack [12].

3. Method

We investigate an LLM-enhanced ontology restriction verification approach by performing a differentiated replication [21] of our prior experiment [17], in which we tackled the verification problem from a human-in-the-loop perspective. In this section we summarise the main objectives of the original HiL experiment and how each experiment component was adapted to fit an LLM solution utilising ChatGPT-4 in place of human intelligence.

¹ Knowledge Base Construction from Pre-trained Language Models (LM-KBC): https://lm-kbc.github.io/challenge2023/

3.1. Human-in-the-loop ontology verification experiment

In [17] we performed an experimental investigation of a human-in-the-loop ontology restriction verification approach. On the one hand, the study aimed at understanding the effect of prior background knowledge on the verification results. On the other hand, we explored the influence of the ontology axiom representation on the quality of the collected judgements. In particular, we investigated the textual formalisms proposed by Rector [22] and Warren [23] and the visual notation VOWL [24]. The HiL experiment contained three main components (a pre-study, the experiment itself and a post-study), which we briefly describe next.

Pre-study. The pre-study consisted of the assessment of human contributors’ background knowledge both subjectively (self-assessment test) and objectively through a qualification test². Based on the test scores, participants were classified into four skill groups having no/little/some/expert knowledge. The self-assessment test covered several background areas: English, formal logics, general modelling skills, ontology modelling skills, and crowdsourcing experience. The qualification test aimed at assessing only ontology modelling skills and, more concretely, the modelling of ontology restrictions.
It included ontology axioms represented in each of the three formalisms (i.e., Rector, Warren, VOWL) in order not to bias the investigated influence of the formalism on the final verification results. As part of the pre-study stage, the experiment included a short tutorial in order to familiarise the participants with the crowdsourcing platform used.

Experiment. The main study component consisted of the verification of 30 ontology axioms from the well-known Pizza Ontology³. Half of these axioms were correct while the other half were either incomplete (i.e., missing a universal or existential restriction) or included a misused restriction (i.e., a universal restriction incorrectly used in place of an existential one). As part of the verification, axioms are either classified as correct or a specific defect is selected from a set of possible answers based on a defect taxonomy. Additional context is provided in the form of a pizza menu item. Each axiom is verified independently from the rest, as a separate Human Intelligence Task (HIT)⁴, and the order of axioms is randomised for each participant.

Post-study. Once the study participants completed the ontology axiom verifications, they were asked to complete a feedback questionnaire. The form included questions about their experience and preferences towards an ontology representational formalism.

² The utilised self-assessment and qualification tests are available in [25].
³ Pizza Ontology: https://protege.stanford.edu/ontologies/pizza/pizza.owl.
⁴ The Human Intelligence Tasks designed for the original experiment are available in [25].

3.2. Differentiated replication experiment utilising ChatGPT

In this work, we replicate the pre-study and experiment stages of the HiL experiment described in Sect. 3.1, using ChatGPT-4 instead of human contributors. We aim at gathering insights into whether LLMs have some ontology modelling skills and into which ontology representation leads to the best verification results.
During the experiment replication we encountered issues with ChatGPT-4’s functionality to interpret the graphical ontology models represented in VOWL. Thus, we opted for Turtle⁵ as an alternative for the replication. Given that the original experiment separately investigated the different representational formalisms, we believe that this substitution does not impact the outcomes of this replication study.

Figure 1 provides an overview of the conducted replication and the adaptation of the experiment components, which we discuss next.

[Figure 1: Overview of the main stages of the experiment replication: a pre-study and an experiment. Pre-study panels: a self-assessment test (prompt with 8 self-assessment questions) and four qualification tests (Rector, Warren, Turtle and combined; each a prompt with 11 qualification test questions). Experiment panels: Rector, Warren and Turtle verification, each consisting of a prompt with instructions and correct & incorrect examples, followed by verification prompts for Pizza Axioms 1 to 30 in randomised order.]

Pre-study. As a first step of the pre-study replication, we prompted ChatGPT to assess its level of background knowledge using the same self-assessment test developed for human contributors.
An example question from the ontology modelling category is shown in Fig. 2. As in the original experiment, for each assessed area additional context was provided describing what each knowledge level entails (2 in Fig. 2).

Next, we conducted the qualification test in 4 different setups, which varied by the axiom representation in the test questions: 3 setups used a single formalism (Rector, Warren or Turtle) while the last setup included all 3 alternative representations for each ontology axiom. Figure 3 shows an example question from the qualification test. For each question, instructions to follow are included (1 in Fig. 3), together with one or more ontology axioms (2) and the question to be answered based on the provided axioms (3). Following the qualification classification schema designed for the HiL experiment, we categorise ChatGPT’s ontology modelling knowledge as no/little/some/expert according to the achieved test scores.

⁵ Terse RDF Triple Language: https://www.w3.org/TR/turtle/

[Figure 2: An example question from the self-assessment test prompt showing: (1) the assessed area, (2) skill level descriptions of the no/little/some/expert knowledge expertise on ontology modelling, and (3) the assessment question.]

[Figure 3: An example from the qualification test prompt on ontology modelling displaying (1) the question instructions, (2) ontology axioms in the Rector formalism, and (3) the qualification question.]

The pre-study replication omitted the tutorial component since its main objective was to familiarise human contributors with the verification platform used. Nevertheless, the examples from the tutorial were used in the investigation of the prompting strategy for the experiment, as described next.

Experiment.
For the replication experiment we used the same 30 pizza axioms as in the original study. The verification of the axioms is performed in 3 settings (in 3 separate ChatGPT conversations) where prompts utilise either the Rector, Warren or Turtle representational formalism. The pizza axioms are sent in a randomised order and each axiom is verified independently from the rest as a separate prompt.

We investigated different in-context learning prompt strategies prior to the verification of the 30 pizza axioms until promising results were obtained. For this purpose we used axiom examples which were included in the HIT instructions and tutorial from the original experiment (assets available in [25]). We first attempted a zero-shot approach, for which the prompt included the instructions, the ontology axiom, context and verification question from the HITs used in the HiL experiment. Several prompt formulations were tested, e.g., adding “Think step by step” to the prompt or adding theoretical explanations to the instructions. Nevertheless, this approach did not deliver satisfactory results.

Human intelligence tasks typically provide human contributors not only with a set of rules to follow but also with a number of examples. Similarly, a few-shot approach provides the model with a few annotated examples together with the instructions to use for completing the task [26]. Therefore, we continued the investigation with a few-shot strategy providing additional examples taken from the HIT instructions. In Sect. 4 we provide an in-depth description of the LLM-driven approach utilising the few-shot strategy for the verification of ontology restrictions.

4. LLM-Enhanced Ontology Verification

We propose an LLM-based approach towards the verification of ontology restrictions through the identification of concrete defects. Our approach builds on top of our prior work on HiL ontology verification and consists of the following main steps, visualised in Fig. 4:

[Figure 4: Overview of the LLM-based approach for ontology restriction verification. The figure shows the pipeline from an ontology through ontology axioms extraction, ontology axiom formalisation (e.g., “PetLoverA has some Cats” / “PetLoverA has at least one Cat”) and ontology verification ([Instructions][Examples], Verify Axiom 1 ... Verify Axiom n) to a set of identified defects. It also shows the defect types taxonomy: misuse (existential restriction used instead of universal restriction; universal restriction used instead of existential restriction) and incompleteness (missing existential restriction; missing universal restriction).]

Step 1: Ontology axioms extraction. Modelling defects are typically not related to a single triple but instead result from the incorrect modelling of a set of logical constraints describing an ontology relation. Therefore, as a first step, ontology axioms, each describing a specific ontology relation, are extracted.

Step 2: Ontology axiom formalisation. In this step the extracted axioms are translated into the formalism of choice in which they will be used in the prompts. One possibility is to use the original formalism of the axioms (e.g., Turtle). Two alternative textual representations of ontology axioms are proposed by Rector [22] and Warren [23]. For Steps 1 & 2 we reuse our implementations developed in the context of HiL ontology verification [7, 27].

Step 3: Ontology verification. A few-shot approach, using assets from [25], is employed to verify the ontology axioms. An example prompt representing the ontology axioms in the Warren formalism is shown in Fig. 5. The prompt includes the verification question (1 in Fig. 5) and possible answer options (2) corresponding to a defect taxonomy.
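
As a rough illustration of how answer options can be tied to a defect taxonomy, the sketch below encodes the four leaf defect types named in Fig. 4, plus a "correct" option, as a Python enum and renders them as lettered options. The enum, the option wording, the lettering and the helper names (`Defect`, `render_options`, `parse_answer`) are assumptions made for illustration; the option texts actually used in the prompts belong to the assets in [25] and are not reproduced here.

```python
from enum import Enum
from typing import Optional

LETTERS = "abcdefghijklmnopqrstuvwxyz"


class Defect(Enum):
    """Hypothetical answer options: 'correct' plus four leaf defect types."""

    CORRECT = "The axiom is correct"
    MISSING_EXISTENTIAL = "An existential restriction is missing"
    MISSING_UNIVERSAL = "A universal restriction is missing"
    EXISTENTIAL_FOR_UNIVERSAL = "An existential restriction is used instead of a universal one"
    UNIVERSAL_FOR_EXISTENTIAL = "A universal restriction is used instead of an existential one"

    @property
    def category(self) -> Optional[str]:
        """Top-level taxonomy branch, or None for a correct axiom."""
        if self is Defect.CORRECT:
            return None
        if self in (Defect.MISSING_EXISTENTIAL, Defect.MISSING_UNIVERSAL):
            return "incompleteness"
        return "misuse"


def render_options() -> str:
    """Render the options as lettered lines, e.g. for inclusion in a prompt."""
    return "\n".join(f"({LETTERS[i]}) {d.value}" for i, d in enumerate(Defect))


def parse_answer(letter: str) -> Defect:
    """Map a single-letter reply such as 'c' or '(C)' back to a Defect."""
    index = LETTERS.index(letter.strip("() .").lower())
    return list(Defect)[index]
```

Keeping the taxonomy in one place means the prompt text and the interpretation of the model's reply cannot drift apart, which matters when the same options are reused across the Rector, Warren and Turtle settings.
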
Additionally, as context a real-world entity is included (3) together with four annotated examples with a justification of their correctness or an explanation of the included defect (4). In Fig. 5 two examples have been omitted for better readability. Afterwards, each axiom is sent for verification in a single prompt containing only (1) the context, (2) the ontology axiom model and (3) the verification question, as exemplified in Fig. 6.

5. ChatGPT-4 Replication Study Results

In this section, we describe the results of the differentiated replication experiment, for which we used ChatGPT-4 for the verification of ontology axioms. In Sect. 5.1 we present our findings from the pre-study, while the verification scores are discussed in Sect. 5.2.

5.1. Background knowledge assessment

On all questions of the self-assessment test ChatGPT-4 rated its skills at the highest level provided, that is, expert knowledge. Fig. 7 shows the response to the example question from Fig. 2, including a short justification of the selection.

The qualification test classified ChatGPT-4 in the intermediate category in the setups where axioms were provided in a single formalism, while the combined test categorised ChatGPT-4 as an expert. The mistakes made vary among the different representations, with the exception of one question (shown in Fig. 3), which was answered incorrectly in every test instance. The prerequisite for answering this question correctly is to know that a universal restriction can be trivially satisfied, that is: there can be a common instance of PetLoverTypeG and PetLoverTypeF that has no pets at all, and therefore the classes are not disjoint. ChatGPT-4’s response (Fig. 8) indicates that the model relies on common-sense thinking rather than applying such ontology modelling knowledge.
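
The trivial-satisfaction argument can be made concrete with a small sketch. The definitions of PetLoverTypeG and PetLoverTypeF appear only in Fig. 3, so the two class definitions below (one restricted to "only cats", the other to "only dogs") are assumptions chosen to illustrate the point, as are all function names. The sketch checks a single hand-built individual, which is enough to show that a common instance is possible; it is not an OWL reasoner.

```python
from typing import List


def has_only(pets: List[str], pet_class: str) -> bool:
    """Universal restriction: every pet belongs to pet_class.

    all() over an empty list is True, so an individual without pets
    satisfies the restriction trivially.
    """
    return all(pet == pet_class for pet in pets)


def has_some(pets: List[str], pet_class: str) -> bool:
    """Existential restriction: at least one pet belongs to pet_class."""
    return any(pet == pet_class for pet in pets)


# Hypothetical definitions that use only a universal restriction.
def is_type_g(pets: List[str]) -> bool:
    return has_only(pets, "Cat")


def is_type_f(pets: List[str]) -> bool:
    return has_only(pets, "Dog")


# Hypothetical variants that additionally require an existential restriction.
def is_type_g_closed(pets: List[str]) -> bool:
    return has_some(pets, "Cat") and has_only(pets, "Cat")


def is_type_f_closed(pets: List[str]) -> bool:
    return has_some(pets, "Dog") and has_only(pets, "Dog")


petless: List[str] = []

# The individual without pets is an instance of both universal-only classes.
shared_instance = is_type_g(petless) and is_type_f(petless)

# With the existential restriction added, it belongs to neither class.
shared_instance_closed = is_type_g_closed(petless) and is_type_f_closed(petless)
```

Under these assumed definitions, `shared_instance` evaluates to `True` and `shared_instance_closed` to `False`, which mirrors the reasoning step the model missed.
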
Additionally, we apply a majority vote aggregation of the three single-formalism test answers, which leads to results equivalent to those of the combined qualification test and to the classification of the model as an expert. These findings indicate that alternative formulations can be added to the same prompt, or results from different prompts can be aggregated to overcome the prompt character limit, in order to improve the model’s performance. These insights could be applied to other domains where different phrasings of the tasks can be generated to potentially achieve better results on LLM-supported tasks.

[Figure 5: A part of the initial verification prompt including the (1) verification question, (2) possible answer options corresponding to the defect taxonomy, (3) a context item (real-life entity) and (4) examples of correct and incorrect ontology axioms (Warren formalism).]

[Figure 6: Verification prompt of the “Spicy Pizza” ontology axiom (Warren formalism), showing (1) the context (real-life entity), (2) the ontology axiom and (3) the verification question.]

[Figure 7: ChatGPT-4’s self-assessment of its ontology modelling skills as a response to the self-assessment question shown in Fig. 2.]

[Figure 8: ChatGPT-4’s incorrect answer to Question 8 shown in Fig. 3.]

Based on the observations gathered in the pre-study, we argue that ChatGPT-4’s knowledge of ontology modelling is comparable to that of the junior experts who participated in our original HiL experiment, where most participants were classified in the intermediate and expert categories (for more details see [17]).

5.2. Axiom verification performance

Overall results. We used ChatGPT-4 to verify a total of 90 axioms (30 axioms, each represented in 3 formalisms) and achieved a verification accuracy of 92.22%.
In comparison, in the human-in-the-loop approach we collected 2629 verifications (90 axioms, each verified by several human contributors) with an overall accuracy of 92.58%. These findings show that ChatGPT-4 performs as well as an average human evaluator. However, in the human computation & crowdsourcing domain it is rather rare that tasks are performed by a single contributor. Instead, each task is sent to a number of participants (the crowd) and the collected answers are aggregated, e.g., through majority voting. After a majority vote aggregation of the human judgements in the original HiL experiment, a verification accuracy of 100% was achieved. Since the qualification test results showed that aggregating results of different formalisms for each axiom leads to improved scores, we applied the majority vote strategy to ChatGPT-4’s axiom verifications. For the 30 axioms the verification accuracy improves to 96.67%. Additionally, this aggregated approach leads to a recall of 100% (see Table 1).

Formalism-based results. The verification accuracy of ChatGPT-4 varies across the representational formalisms used. In Table 1 we present the achieved performance in each setting with a comparison to the HiL approach.

While the qualification test scores did not indicate a difference between the textual representations (Rector and Warren) and the machine-readable format Turtle, the accuracy of the Turtle-based verification of the axioms is considerably lower (86.67%). The highest accuracy was achieved when the prompt included the Warren ontology representations: the accuracy is equivalent to that of the ChatGPT majority aggregation approach (96.67%), outperforming the correctness of the individual human judgements (91.74%). Moreover, the precision of this setting reaches 100%, thus matching the crowd majority vote. The results also indicate that the ChatGPT aggregated majority judgements reach 100% recall while having a slightly lower precision.
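
To make the aggregation and the reported scores easier to follow, the sketch below shows a per-axiom majority vote over the three formalism answers together with the usual accuracy, precision, recall and F1 formulas. The function names, the binary "defect"/"correct" labels, the toy answers and the tie-breaking rule are assumptions. The confusion counts passed to `scores` (41, 3, 4, 42) are back-calculated so that they reproduce the "overall" ChatGPT-4 row of Table 1; they are an inference made for illustration, not counts reported in the paper, and the mapping of multi-option answers to the four cells is likewise an assumption.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def scores(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard binary metrics; assumes all denominators are non-zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Toy answers for two axioms in the Rector, Warren and Turtle settings.
toy_answers = {
    "axiom_1": ["defect", "defect", "correct"],
    "axiom_2": ["correct", "correct", "correct"],
}
aggregated = {axiom: majority_vote(votes) for axiom, votes in toy_answers.items()}

# Inferred counts for the 90 single-formalism verifications (see lead-in).
overall = {name: round(100 * value, 2) for name, value in scores(41, 3, 4, 42).items()}
```

With three votes over two labels a tie cannot occur; the tie-breaking rule only matters if the vote is taken over the full set of answer options, where three different answers are possible.
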
One possible future work direction would be to design a Find-Verify workflow (e.g., as in the HiL approach from [28]) including (1) a defect detection stage following a ChatGPT majority vote strategy and (2) a round of verification with the Warren formalism (or a human in the loop).

Table 1
Overview of the ontology verification scores achieved with ChatGPT-4 compared to the human-in-the-loop approach from [17]: overall performance across all verified axioms, results based on the selected formalism, and scores from the ChatGPT formalism majority aggregation.

                             ChatGPT-4                               Human Contributor
                             accuracy  precision  recall   F1        individual    accuracy = precision = recall = F1
                                                                     judgements    (majority vote)
overall                      92.22%    93.18%     91.11%   92.13%    92.58%        100%
Rector                       93.33%    93.33%     93.33%   93.33%    92.28%        100%
Warren                       96.67%    100%       93.33%   96.55%    91.74%        100%
Turtle                       86.67%    86.67%     86.67%   86.67%    -             -
VOWL                         -         -          -        -         93.76%        100%
aggregated (majority vote)   96.67%    93.33%     100%     96.55%

Defect-based results. ChatGPT-4 showed varying levels of performance in identifying different types of defects in the ontology axioms. Correct axioms were identified as correct with an accuracy of 93.33%, while all incompleteness-related defects were correctly detected (100% accuracy). In contrast, the misuse of restrictions was more challenging to detect, with only 73.33% of the misuse defects correctly identified. In the inaccurate verifications the wrong defect type was selected; nevertheless, the axioms were still identified as incorrect. These results strengthen the idea of a Find-Verify workflow, where potential defect candidates could be selected and sent for further verification.

6. Conclusion

The evaluation of semantic resources such as knowledge graphs, ontologies and taxonomies is traditionally a time-intensive and expensive task since it requires the involvement of domain experts or crowd-workers.
In this paper we explore the capabilities of LLMs, in particular ChatGPT-4, for evaluating ontology restrictions by replicating our previously conducted human-in-the-loop experiment [17].

We used our previously developed ontology modelling qualification test (available in [25]) and report that ChatGPT achieved intermediate to expert scores. In particular, when a single axiom representation (either Rector [22], Warren [23], or Turtle) was provided in the prompts, the results were intermediate. However, when provided with a combination of the three representations for each ontology axiom, the model was classified as an expert with 10/11 correctly answered questions.

Additionally, ChatGPT-4 correctly verified 92.22% of the ontology axioms across the different representation settings. We show that the answers on the same ontology model sent in different representational formats can be combined and that, with a majority voting strategy, the accuracy can be improved up to 96.67%. These results are comparable to semi-experts’ responses, which provided 92.58% correct judgements and 100% majority vote accuracy. Moreover, we observe a difference in ChatGPT-4’s performance based on the ontology representations used: while the Warren textual representation delivers the best results in terms of precision (100%), combining the model responses on different representations improves the recall (to 100%). Lastly, we look at the accuracy in identifying different defect types and find that the model correctly identified a missing restriction in the axiom every time. In contrast, identifying the misuse of restrictions proved to be a more challenging task for ChatGPT-4, which achieved 73.33% accuracy.

Study insights. We gained several interesting insights that can potentially be applied to other knowledge engineering tasks where LLMs are included:

• Resource verbalisation. We achieved the highest verification results when the ontology axioms were represented in natural language.
The concrete language used also played a role in the performance. Therefore, the verbalisation of semantic resources in LLM-supported knowledge engineering tasks should be carefully considered.
• Turtle as a complementary asset. The results obtained when using Turtle were considerably lower; however, when combined with natural language they led to improved results.
• HiL inspiration. Overall, there are many similarities between human intelligence tasks and LLM prompts. Tasks designed following human computation & crowdsourcing methodologies can be applied to LLM prompting with little to no modification. As such, the nascent field of LLM-based KE could benefit from earlier findings in the human computation & crowdsourcing field.

Limitations and open research questions. While this paper presents first insights into the verification of ontology restrictions with LLMs, the following limitations can lead to further research:

• Pizza ontology. We used a very simple ontology where no particular domain knowledge was required. Further investigations are needed to understand whether comparable results can be obtained when a less-known or more complex resource is verified. Nevertheless, by using the same ontology we utilised in our prior work, we could provide a clear comparison between human contributors and ChatGPT-4.
• Extended experiments. While in this work we only focused on a single ontology and a small set of defect types, further exploration is needed as to whether LLMs’ capabilities could support other modelling verification tasks (e.g., identifying the incorrect use of “some not” in place of “not some”) and domain-dependent assessments (e.g., detection of incorrect domain knowledge). Moreover, a comparison of the performance of different LLMs on the verification tasks could provide a better overview of the feasibility of the tasks.
• Verification workflows. We identify that different prompting settings have certain benefits for the overall performance scores.
Further explorations are needed on how to best combine different LLM settings when the number and types of verification tasks increase.

In this paper we present first insights into the strengths and weaknesses of large language models for ontology evaluation tasks compared to human contributors. We plan to conduct a number of follow-up studies to explore the generalizability of the findings to further verification tasks and ontology domains and to formalise a human-LLM evaluation workflow addressing the scalability challenge of current HiL evaluation approaches.

Acknowledgments
This work was supported by the FWF HOnEst project (V 745) and the PERKS project (101120323) co-funded by the European Union. Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

References
[1] H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semantic Web J. 8 (2017) 489–508.
[2] A. d. Garcez, L. C. Lamb, Neurosymbolic AI: The 3rd wave, Artificial Intelligence Review (2023) 1–20.
[3] A. Breit, L. Waltersdorfer, F. J. Ekaputra, M. Sabou, A. Ekelhart, A. Iana, H. Paulheim, J. Portisch, A. Revenko, A. ten Teije, F. van Harmelen, Combining machine learning and semantic web: A systematic mapping study, ACM Comput. Surv. (2023).
[4] F. van Harmelen, A. ten Teije, A boxology of design patterns for hybrid learning and reasoning systems, Journal of Web Engineering 18 (2019) 97–124.
[5] M. Kulmanov, F. Z. Smaili, X. Gao, R. Hoehndorf, Semantic similarity and machine learning with ontologies, Briefings in Bioinformatics 22 (2020).
[6] M. P. Villalón, A. G. Pérez, Ontology evaluation: a pitfall-based approach to ontology diagnosis, PhD Thesis, Universidad Politecnica de Madrid, Escuela Tecnica Superior de Ingenieros Informaticos (2016).
[7] S. Tsaneva, K. Käsznar, M.
Sabou, Human-centric ontology evaluation: Process and tool support, in: O. Corcho, L. Hollink, O. Kutz, N. Troquard, F. J. Ekaputra (Eds.), Knowledge Engineering and Knowledge Management, Springer International Publishing, Cham, 2022, pp. 182–197.
[8] H. Khorashadizadeh, N. Mihindukulasooriya, S. Tiwari, J. Groppe, S. Groppe, Exploring in-context learning capabilities of foundation models for generating knowledge graphs from text, 2023. arXiv:2305.08804.
[9] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, X. Wu, Unifying large language models and knowledge graphs: A roadmap, IEEE Transactions on Knowledge and Data Engineering (2024).
[10] G. T. Schreiber, H. Akkermans, Knowledge engineering and management: the CommonKADS methodology, MIT Press, Cambridge, MA, USA, 2000.
[11] B. P. Allen, L. Stork, P. Groth, Knowledge engineering using large language models, arXiv preprint arXiv:2310.00637 (2023).
[12] F. Neuhaus, Ontologies in the era of large language models - a perspective, Applied Ontology 18 (2023) 399–407. doi:10.3233/ao-230072.
[13] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, N. Zhang, LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities, arXiv preprint arXiv:2305.13168 (2023).
[14] M. Trajanoska, R. Stojanov, D. Trajanov, Enhancing knowledge graph construction using large language models, 2023. arXiv:2305.04676.
[15] S. Carta, A. Giuliani, L. Piano, A. S. Podda, L. Pompianu, S. G. Tiddia, Iterative zero-shot LLM prompting for knowledge graph construction, arXiv preprint arXiv:2307.01128 (2023).
[16] B. Zhang, I. Reklos, N. Jain, A. M. Peñuela, E. Simperl, Using large language models for knowledge engineering (LLMKE): A case study on Wikidata, arXiv preprint arXiv:2309.08491 (2023).
[17] S. Tsaneva, M. Sabou, Enhancing human-in-the-loop ontology curation results through task design, J. Data and Information Quality (2023). URL: https://doi.org/10.1145/3626960. doi:10.1145/3626960.
[18] X.
Lv, Y. Lin, Y. Cao, L. Hou, J. Li, Z. Liu, P. Li, J. Zhou, Do pre-trained models benefit knowledge graph completion? A reliable evaluation and a reasonable approach, Association for Computational Linguistics, 2022.
[19] C.-H. Chiang, H.-y. Lee, Can large language models be an alternative to human evaluations?, arXiv preprint arXiv:2305.01937 (2023).
[20] M. Sallam, K. Al-Salahat, H. Eid, J. Egger, B. Puladi, Human versus artificial intelligence: ChatGPT-4 outperforming Bing, Bard, ChatGPT-3.5, and humans in clinical chemistry multiple-choice questions, medRxiv (2024). doi:10.1101/2024.01.08.24300995.
[21] R. M. Lindsay, A. Ehrenberg, The design of replicated studies, American Statistician 47 (1993) 217–228. doi:10.1080/00031305.1993.10475983.
[22] A. Rector, N. Drummond, M. Horridge, J. Rogers, H. Knublauch, R. Stevens, H. Wang, C. Wroe, OWL pizzas: Practical experience of teaching OWL-DL: Common errors & common patterns, in: Int. Conf. on Knowledge Engineering and Knowledge Management, Springer, 2004, pp. 63–81.
[23] P. Warren, P. Mulholland, T. Collins, E. Motta, Improving comprehension of knowledge representation languages: A case study with description logics, Int. J. of Human-Computer Studies 122 (2019) 145–167.
[24] S. Lohmann, S. Negru, F. Haag, T. Ertl, VOWL 2: User-oriented visualization of ontologies, in: K. Janowicz, S. Schlobach, P. Lambrix, E. Hyvönen (Eds.), Knowledge Engineering and Knowledge Management, Springer International Publishing, Cham, 2014, pp. 266–281.
[25] S. Tsaneva, K. Käsznar, M. Sabou, HERO - a human-centric ontology evaluation process, 2023. URL: https://doi.org/10.5281/zenodo.7643357.
[26] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I.
Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901.
[27] S. Tsaneva, Human-Centric Ontology Evaluation, Master’s thesis, Technische Universität Wien, 2021. URL: https://repositum.tuwien.at/handle/20.500.12708/17249.
[28] M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, F. Flöck, J. Lehmann, Detecting Linked Data quality issues via crowdsourcing: A DBpedia study, Semantic Web 9 (2016) 303–335. doi:10.3233/sw-160239.
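
Illustrative sketch. The majority-voting strategy summarised in the conclusions (combining the model's answers on the same axiom across the Rector, Warren, and Turtle representations) can be illustrated with a minimal Python sketch. The label names, the toy answers, and the helper functions `majority_vote` and `accuracy` are hypothetical and chosen only for illustration; they are not the experiment's data or code, and the numbers they produce are unrelated to the reported results.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the given votes.

    With three voters and two possible labels a tie cannot occur.
    """
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold-standard labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical gold standard for four axioms (NOT data from the study).
gold = ["correct", "defect", "defect", "correct"]

# Hypothetical model answers per representation (NOT data from the study).
answers_by_format = {
    "rector": ["correct", "defect", "correct", "correct"],
    "warren": ["correct", "defect", "defect", "correct"],
    "turtle": ["defect", "defect", "defect", "defect"],
}

# Combine the three answers given for each axiom into a single verdict.
combined = [majority_vote(votes) for votes in zip(*answers_by_format.values())]

for name, answers in answers_by_format.items():
    print(f"{name}: {accuracy(answers, gold):.2f}")
print(f"majority vote: {accuracy(combined, gold):.2f}")
```

In this toy setting each single representation makes different mistakes, so the combined verdict can be more accurate than a weaker individual format. Whether this holds in practice depends on how strongly the errors across representations are correlated.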