<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Capabilities of LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefani Tsaneva</string-name>
          <email>stefani.tsaneva@wu.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guntur Budi Herwanto</string-name>
          <email>guntur.budi.herwanto@wu.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta Sabou</string-name>
          <email>marta.sabou@wu.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Data, Process and Knowledge Management, Vienna University of Economics and Business</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>With the advent of Generative AI, numerous approaches exploring large language models (LLMs) have been proposed for addressing a number of Knowledge Engineering (KE) tasks. Yet, the status of this research field is rather preliminary and there is, for now, no systematic and comprehensive understanding on how LLMs perform on selected knowledge engineering tasks (e.g., what is their expertise level in understanding ontology modeling concepts). Such insights would be crucial for researchers working in this field to support with selecting the most suitable LLMs during experiment design. This situation is exacerbated by the rapid expansion in the number of available LLMs. We therefore see the need for methodologies and tools that allow (comparatively) assessing LLM capabilities. To address this need, we propose the creation of an assessment test benchmark for evaluating the LLM knowledge engineering skills. We present ongoing work and preliminary results on assessing the expertise of LLMs in terms of a concrete KE task, namely ontology validation. Our experiments highlight the superiority of proprietary models on this task, particularly GPT-4o and Claude-Sonnet-3.5, over open source models. Lastly, we identify the need of a community-driven comparative LLM assessment platform that facilitates resource sharing and experience exchange, while protecting the integrity and privacy of the envisioned benchmark. We share (i) the current version of the qualification tests and (ii) its implementation for assessing LLM capabilities for ontology validation.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge engineering</kwd>
        <kwd>ontology validation</kwd>
        <kwd>LLMs</kwd>
        <kwd>expertise evaluation</kwd>
        <kwd>assessment tests</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Context and Research Need</title>
      <p>
        Context. Advances in Generative AI, and specifically large language models (LLMs), offer many
opportunities for enhancing Knowledge Engineering (KE) activities [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Numerous LLM-based solutions
have already been successfully implemented for KE tasks. For instance, the construction and completion
of knowledge graphs (KGs) have gained considerable research attention [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ]. Several approaches
have also been proposed supporting the evaluation of semantic resources (i.e., KGs, ontologies, etc.):
ontology requirements compliance has been approached through LLM-powered competency questions
validation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and coverage testing [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]; KG triple validation with LLMs has been explored in [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]; and
ontology modeling error detection and correction through LLMs have been proposed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>As research on the use of LLMs in KE is intensifying, it is essential to investigate whether
LLM-produced results can be replicated with other models or are heavily dependent on the LLM used. Thus,
it is important to test each approach with several LLMs with different characteristics. Yet, LLMs are
being published at a staggering rate. In this situation, it is increasingly challenging to decide which
LLM to choose for which task, and to understand the “expertise” level of each LLM relevant for a given
task. Understanding which LLM is capable of which KE task, and to what level, would
enable adaptations of LLM-based solutions across different use cases with various data privacy needs
and available budgets. Moreover, it would allow the construction of more complex workflows, involving
several LLMs, each responsible for a particular KE task.</p>
      <p>Research Need. We therefore see a research need for the (comparative) assessment of LLM capabilities
in terms of performing a variety of KE tasks. This requires the community to develop a collection
of qualification tests as an instrument to assess the expertise level of LLMs in various KE tasks. For
example, in an ontology validation scenario, a basic understanding of modeling constructs or the ability to
validate the correctness of ontology restrictions would be required. Having such assessment tests for LLMs
is a prerequisite for a straightforward LLM selection process for experimentation.</p>
      <p>
        Related Work. The need to understand the strengths and limitations of numerous LLMs in concrete
tasks has become apparent across domains. For instance, a systematic LLM evaluation framework is
designed in the chemistry domain [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The framework relies on a number of diverse chemistry tasks
selected to assess the capabilities of the LLM in terms of reasoning, understanding, and explaining within
the domain. The deductive, inductive and abductive reasoning skills of LLMs are systematically assessed
through a collection of evaluation methods across different dimensions (answer correctness, explanation
redundancy, etc.) in [
        <xref ref-type="bibr" rid="ref15">15</xref>
         ]. Similarly, LLMs have been assessed in terms of their psychological profile
by recording and analysing their answers to standard psychometric inventories [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Approach: a Collection of Assessment Tests for LLM Ontology Validation Skills</title>
      <p>We take inspiration from research in the human-in-the-loop (HiL) area, where qualification tests have
frequently been applied to select participants with the skills required to solve a particular knowledge
validation task. Typically, a set of questions is created to assess contributors’ background knowledge,
and only those who score above a certain threshold qualify to work on the available tasks.</p>
      <p>
        In our earlier work, we developed a qualification test for assessing the expertise level of students in
understanding basic ontology representations with focus on the meaning of ontology restrictions [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
To the best of our knowledge, this is the only assessment test for knowledge engineering skills that
is publicly available in full. Indeed, authors of other similar resources opted to present only selected
example questions to prevent participant bias (e.g., [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]). Moreover, unlike similar tests, our
qualification test classifies examinees according to their expertise level (novice, beginner, intermediate, or
expert) rather than using a binary classification of qualified/unqualified, providing a more comprehensive
understanding of their skill levels.
      </p>
      <p>
        We propose adopting HiL qualification testing approaches for LLM skills assessment. In particular,
we focus on assessing ontology validation capabilities in terms of the detection of modelling errors. As
a starting point, we utilise the qualification test from [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The test should be extended and diversified
to also assess other skill aspects relevant for ontology defect detection, such as understanding of
disjointness axioms or the correct usage of intersections and unions. Creating a common collection of
KE assessment tests allows researchers to evaluate available LLMs and select those that best fit their
research needs. While we focus on ontology validation skills, a similar approach can be followed for
assessing other knowledge engineering skills as well.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Ongoing Work and First Results</title>
      <p>We briefly describe the current version of the qualification test and subsequently the evaluation of a
number of LLMs when they were administered this test.</p>
      <sec id="sec-4-1">
        <title>3.1. Qualification test for LLM assessment in evaluating ontology restrictions</title>
        <p>
          We start from a qualification test on ontology restrictions modeling initially developed in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The test
classifies the examinee as a novice, beginner, intermediate, or expert based on the achieved scores across
questions with varying levels of difficulty. It consists of 11 questions grouped into three categories:
• 4 beginner-level questions, which test the ability to identify ontology components (i.e., classes,
relations and restrictions) in graphical and textual representations of an ontology.
• 3 intermediate-level questions, assessing the understanding of the implications of ontology axioms
containing universal and existential restrictions.
• 4 expert-level questions, examining the ability to reason with ontology models, as well as compare and
relate these ontology models to each other.
        </p>
        <p>
          The expertise classification takes into account both the number of correctly answered questions from
a specific category and the correct answers overall, as explained in detail in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
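        <p>The classification logic described above can be sketched in a few lines of code. The concrete thresholds below are illustrative assumptions for the sketch only; the exact scoring rules are given in detail in [17]:</p>
        <preformat>
```python
# Illustrative sketch of an expertise classification combining per-category
# scores with the overall score. The thresholds here are hypothetical
# assumptions, not the exact rules from the cited qualification test.

def classify_expertise(beginner: int, intermediate: int, expert: int) -> str:
    """Classify an examinee from the number of correct answers per category
    (out of 4 beginner, 3 intermediate, and 4 expert questions)."""
    total = beginner + intermediate + expert  # out of 11 questions
    if expert >= 3 and total >= 9:
        return "expert"
    if intermediate >= 2 and total >= 6:
        return "intermediate"
    if beginner >= 2:
        return "beginner"
    return "novice"
```
        </preformat>
        <p>With such a function, an examinee answering all 11 questions correctly would be classified as an expert, while one answering only a single beginner question correctly would remain a novice.</p>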
        <p>An example question, requiring intermediate modelling skills, is shown in Figure 1. To answer the
question correctly, one needs to (1) identify the usage of a universal restriction; (2) understand that the
universal restriction implies that instances of PetLoverTypeA cannot have pets that are not dogs; (3) be
aware that the universal restriction does not imply that instances of PetLoverTypeA must have a dog.</p>
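        <p>The distinction behind points (2) and (3) can be made concrete with a minimal sketch; the names pets and is_dog are illustrative stand-ins for the instance data of the example, not part of the test itself:</p>
        <preformat>
```python
# Minimal sketch of the semantics behind the example question: a universal
# ("only", owl:allValuesFrom) restriction holds when every pet is a dog --
# including the case of no pets at all -- while an existential ("some",
# owl:someValuesFrom) restriction additionally demands at least one dog.

def satisfies_universal(pets, is_dog):
    # Vacuously true for an empty set of pets
    return all(is_dog(p) for p in pets)

def satisfies_existential(pets, is_dog):
    # Requires at least one matching pet
    return any(is_dog(p) for p in pets)

is_dog = lambda pet: pet.startswith("dog")

# A pet lover with no pets still satisfies the universal restriction...
print(satisfies_universal([], is_dog))                # True
# ...but not the existential one, which is exactly point (3) above.
print(satisfies_existential([], is_dog))              # False
# A non-dog pet violates the universal restriction (point (2)).
print(satisfies_universal(["dog1", "cat1"], is_dog))  # False
```
        </preformat>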
        <p>
          The complete test is designed using three ontology axiom representation modalities: the graphical
representation VOWL, and two natural language formalisms proposed by Rector [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and Warren [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Comparative assessment of LLM expertise in evaluating ontology restrictions</title>
        <p>
          In previous work [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], we tested ChatGPT-4’s capabilities with the qualification test described above
and found an expertise level comparable to that of an expert. Subsequently, we ran the test on a variety
of LLMs with different characteristics. We consider both proprietary and open-source foundational
LLMs. While proprietary models, such as ChatGPT-4, have demonstrated strong performance [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], they
bring high costs and reliance on cloud services. In contrast, open models offer the opportunity for local
installation, ensuring data privacy and the potential for fine-tuning on specific tasks. Thus, we evaluate
models across three size categories: small (Llama3-8b, Gemma-7b), medium (Llama3-70b, Mixtral-8x22b,
Qwen2-72b), and large (the DeepSeek models deepseek-coder-v2 and deepseek-v2, both at 236b). We
compare the performance of these open-source models against the state-of-the-art proprietary LLMs
GPT-4o (https://openai.com) and Claude-3.5-Sonnet (https://claude.ai). The open models are available
via Hugging Face (https://huggingface.co/meta-llama, https://huggingface.co/google,
https://huggingface.co/mistralai, https://huggingface.co/Qwen/Qwen2-72B) and https://deepseek.com.
We report the results of each LLM on the qualification test in Table 1.
        </p>
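        <p>Administering the qualification test to an LLM can be sketched as follows; ask_llm is a hypothetical stand-in for whichever chat-completion API is used, and the questions shown are illustrative examples, not items from the actual test:</p>
        <preformat>
```python
# Hedged sketch: administering a multiple-choice qualification test to an LLM
# and counting correct answers per category. ask_llm is a placeholder for a
# real chat-completion call; the questions are illustrative examples only.
from collections import Counter

QUESTIONS = [
    {"category": "beginner",
     "prompt": "Which element of the diagram denotes a class? Answer with one letter.",
     "answer": "A"},
    {"category": "intermediate",
     "prompt": "What does the 'only' restriction imply? Answer with one letter.",
     "answer": "C"},
]

def administer_test(questions, ask_llm):
    """Return the number of correctly answered questions per category."""
    correct = Counter()
    for q in questions:
        reply = ask_llm(q["prompt"]).strip().upper()
        if reply.startswith(q["answer"]):  # expect a single option letter
            correct[q["category"]] += 1
    return correct

# Usage with a stubbed model that always answers "A":
scores = administer_test(QUESTIONS, lambda prompt: "A")
```
        </preformat>
        <p>The per-category counts returned by such a routine can then be mapped to the novice/beginner/intermediate/expert levels described in Section 3.1.</p>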
        <p>Notably, no models fell into the lowest (novice) expertise category, indicating that all tested LLMs
possess at least a basic level of ontology understanding competence. In fact, the expertise level tends to
increase with the size of the model. The smallest models (Llama3-8b, Gemma-7b) showed beginner
expertise, while medium-sized LLMs performed slightly better and received beginner (Mixtral-8x7b)
to intermediate qualifications (Llama3-70b, Qwen2-72b). The open-source large models (DeepSeek-V2,
DeepSeek-Coder-V2) also achieved an intermediate score, while Claude-Sonnet-3.5 and GPT-4o
answered the most qualification questions correctly, showcasing expert skills. The qualification test is
included within the Zenodo resource available at https://zenodo.org/records/7643357.</p>
        <p>The experiment was carried out on June 20th, 2024. We share the translation of the qualification test
into a format suitable for LLMs; the implementation is available at
https://github.com/wu-semsys/llm-ontology-qualification-test.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Outlook</title>
      <p>First results of assessing LLM ontology validation skills indicate the superiority of proprietary
models, in particular GPT-4o and Claude-Sonnet-3.5, over smaller open-source models.
Nevertheless, small and medium-sized models should not be ignored, since they offer other benefits, such as
the processing of privacy-sensitive tasks on local installations. Further investigation is needed into whether
providing additional instructions and context would improve the achieved results, and whether these
smaller LLMs could achieve comparable or better results on other knowledge validation tasks.</p>
      <p>In the future, we plan to design and evaluate additional qualification tests focusing on complementary
ontology validation aspects to allow for a comprehensive LLM profiling according to their capabilities.
We would further like to motivate fellow researchers to publish previously created ontology engineering
qualification tests, utilised in human-in-the-loop experiments or other assessment settings, to allow for
the reuse of these valuable resources.</p>
      <p>Motivated by recent work assessing LLM capabilities with regard to their understanding of SPARQL
and Turtle over time [21], we believe the collection of assessment tests would support the re-evaluation
of LLMs upon new releases, allowing a historical analysis of selected capabilities.</p>
      <p>Moreover, we see a need for a community-driven platform where researchers can share their LLM
assessment tests and contribute LLM capability profiles from previously conducted skill assessments.
This approach would not only promote the reuse of resources and facilitate experience sharing, but
would also support a more sustainable approach to advancing LLM assessment by preventing similar
tests from being run by several researchers in parallel, at high computational cost.</p>
      <p>A crucial aspect to consider when making LLM assessment tests accessible to the community is
the potential risk that these tests could be included in future model training datasets. Therefore, a
community-wide strategy is needed to mitigate this risk and avoid evaluation biases. A community-driven
assessment platform could allow this by requiring authorised access to view tests, while providing
functionality for “blind” LLM assessments.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was funded by the Austrian Science Fund (FWF) Bilateral AI (Grant Nr. 10.55776/COE12)
and HOnEst (V 745) projects.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Unifying large language models and knowledge graphs: A roadmap</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Neuhaus</surname>
          </string-name>
          ,
          <article-title>Ontologies in the Era of Large Language Models – a Perspective</article-title>
          ,
          <source>Applied Ontology</source>
          <volume>18</volume>
          (
          <year>2023</year>
          )
          <fpage>399</fpage>
          -
          <lpage>407</lpage>
          . doi:10.3233/ao-230072.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Stork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <article-title>Knowledge Engineering using Large Language Models</article-title>
          ,
          <source>arXiv preprint arXiv:2310.00637</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities</article-title>
          ,
          <source>arXiv preprint arXiv:2305.13168</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Trajanoska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stojanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Trajanov</surname>
          </string-name>
          ,
          <source>Enhancing Knowledge Graph Construction Using Large Language Models</source>
          ,
          <year>2023</year>
          . arXiv:2305.04676.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Carta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Giuliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Piano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Podda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pompianu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Tiddia</surname>
          </string-name>
          ,
          <article-title>Iterative zero-shot LLM prompting for knowledge graph construction</article-title>
          ,
          <source>arXiv preprint arXiv:2307.01128</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , I. Reklos,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Peñuela</surname>
          </string-name>
          , E. Simperl,
          <article-title>Using Large Language Models for Knowledge Engineering (LLMKE): A Case Study on Wikidata</article-title>
          ,
          <source>arXiv preprint arXiv:2309.08491</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tufek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S. S.</given-names>
            <surname>Thuluva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. P.</given-names>
            <surname>Just</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Ekaputra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <article-title>Validating Semantic Artefacts With Large Language Models</article-title>
          ,
          <source>in: The Semantic Web: ESWC Satellite Events</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Carriero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schreiberhuber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          , J. de Berardinis,
          <article-title>OntoChat: a Framework for Conversational Ontology Engineering using Language Models</article-title>
          ,
          <source>in: The Semantic Web: ESWC Satellite Events</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. T.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <article-title>Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models</article-title>
          ,
          <source>in: The Semantic Web: ESWC Satellite Events</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Khorashadizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Groppe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Groppe</surname>
          </string-name>
          ,
          <article-title>Exploring In-Context Learning Capabilities of Foundation Models for Generating Knowledge Graphs from Text</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.08804.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vasic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          ,
          <article-title>LLM-driven Ontology Evaluation: Verifying Ontology Restrictions with ChatGPT</article-title>
          ,
          <source>in: The Semantic Web: ESWC Satellite Events</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fathallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>De Giorgis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poltronieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haase</surname>
          </string-name>
          , L. Kovriguina,
          <article-title>NeOn-GPT: A Large Language Model-Powered Pipeline for Ontology Learning</article-title>
          ,
          <source>in: The Semantic Web: ESWC Satellite Events</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Wiest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.18365.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cambria</surname>
          </string-name>
          ,
          <article-title>Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond</article-title>
          ,
          <year>2024</year>
          . arXiv:2306.09841.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pellert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Lechner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rammstedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strohmaier</surname>
          </string-name>
          ,
          <article-title>AI Psychometrics: Assessing the Psychological Profiles of Large Language Models Through Psychometric Inventories</article-title>
          ,
          <source>Perspectives on Psychological Science</source>
          <volume>19</volume>
          (
          <year>2024</year>
          )
          <fpage>808</fpage>
          -
          <lpage>826</lpage>
          . doi:10.1177/17456916231214460.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          ,
          <article-title>Enhancing Human-in-the-Loop Ontology Curation Results through Task Design</article-title>
          ,
          <source>J. Data and Information Quality</source>
          (
          <year>2023</year>
          ). doi:10.1145/3626960.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Mortensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Musen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <article-title>Crowdsourcing the verification of relationships in biomedical ontologies</article-title>
          ,
          <source>in: AMIA Annual symposium proceedings</source>
          , volume
          <volume>2013</volume>
          , American Medical Informatics Association,
          <year>2013</year>
          , pp.
          <fpage>1020</fpage>
          -
          <lpage>1029</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rector</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Drummond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Horridge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Knublauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wroe</surname>
          </string-name>
          ,
          <article-title>OWL Pizzas: Practical Experience of Teaching OWL-DL: Common Errors &amp; Common Patterns</article-title>
          ,
          <source>in: International Conference on Knowledge Engineering and Knowledge Management</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2004</year>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Warren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulholland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Motta</surname>
          </string-name>
          ,
          <article-title>Improving comprehension of knowledge representation languages: A case study with description logics</article-title>
          ,
          <source>International Journal of Human-Computer Studies</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>