<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Comprehensive Benchmark for Evaluating LLM-Generated Ontologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Julien Plu</string-name>
          <email>julien@lettria.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Moreno Escobar</string-name>
          <email>oscar@lettria.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edouard Trouillez</string-name>
          <email>edouard@lettria.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Axelle Gapin</string-name>
          <email>axelle@lettria.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphaël Troncy</string-name>
          <email>raphael.troncy@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Campus SophiaTech, 450 Route des Chappes, 06410 Biot</addr-line>
          ,
          <country country="FR">FRANCE</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LettrIA</institution>
          ,
          <addr-line>13 Rue des Petits Hôtels, 75010 Paris</addr-line>
          ,
          <country country="FR">FRANCE</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents a methodology for evaluating ontologies that are automatically generated by Large Language Models (LLMs). Our approach combines quantitative metrics that compare generated ontologies against a human-made reference with qualitative user assessments across diverse domains. We apply this methodology to evaluate the ontologies produced by various LLMs, including Claude 3.5 Sonnet, GPT-4o, and GPT-4o-mini. The results demonstrate the benchmark's effectiveness in identifying strengths and weaknesses of LLM-generated ontologies, providing valuable insights for improving automated ontology generation techniques.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The advent of LLMs has opened new avenues for automated ontology generation, such as in [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1, 2, 3, 4, 5</xref>
        ].
However, evaluating the quality and utility of these generated ontologies presents a significant challenge.
This paper introduces a comprehensive benchmark methodology designed to assess LLM-generated
ontologies through both quantitative and qualitative measures based on 30 criteria. Our benchmark
aims to provide a standardized approach for comparing different LLM-generated ontologies, evaluating
their accuracy, completeness, and practical utility across various domains. All results and documents
are available on GitHub: https://github.com/jplu/ontology-benchmark.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Ontology-Toolkit</title>
      <p>
        The Ontology-Toolkit is our LLM-based tool for ontology generation, designed to evolve with the
benchmark, facilitate experimentation, support domain-specific ontologies, and promote wider adoption.
It features a modular process with a user-friendly interface, allowing the refinement of the results
at each step. Our primary goal was to minimize user interactions, as our target users preferred a
non-conversational application for quick ontology generation, making approaches such as OntoChat [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
unsuitable. The Ontology-Toolkit functions as follows:
      </p>
      <p>1. Generate Classes: This initial step produces essential ontology components based on input
documents, specified domain, and use case. For example, given medical results analysis documents,
an appropriate domain could be medicine and pharmacology. Users can manually refine the generated
classes. The number of classes is crucial; too many can lead to overly specific classes (e.g., City vs Place)
instead of desired hierarchies like Place -&gt; PopulatedPlace -&gt; City. Overly specific classes can narrow
the ontology’s applicability.</p>
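<p>The hierarchy trade-off described above (deep chains such as Place -&gt; PopulatedPlace -&gt; City versus many flat, overly specific classes) can be checked mechanically. A minimal sketch, assuming subclass relations are available as a simple child-to-parent mapping (a hypothetical representation, not the Ontology-Toolkit's internal one):</p>

```python
def hierarchy_depth(subclass_of):
    """Length of the longest subclass chain in a {child: parent} mapping."""
    def depth(cls):
        d = 1
        while cls in subclass_of:
            cls = subclass_of[cls]
            d += 1
        return d
    return max((depth(c) for c in subclass_of), default=1)

# Hypothetical hierarchy from the paper's example.
edges = {"City": "PopulatedPlace", "PopulatedPlace": "Place"}
print(hierarchy_depth(edges))  # 3: Place, PopulatedPlace, City
```

<p>A depth of 1 over many classes would signal the flat, overly specific structure that narrows the ontology's applicability.</p>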
<p>2. Generate Questions: This step creates competency questions to guide ontology development
and deduce class relations. The questions are based on the input documents and classes from Step 1.</p>
      <p>†Author contributed during her internship.</p>
<p>Users can manually add, remove, or update these questions. Our experiments showed that questions
should be as generic as possible, cover the entire scope of the document, and refer to classes rather than
specific instances. Given the example "Barack Obama was born in Hawaii," appropriate questions are
"Where was [Person] born?" or "Is [Place] the birthplace of [Person]?"</p>
      <p>3. Generate Ontology: The final step instructs an LLM to produce the ontology in RDF using
Turtle syntax. This process uses only the classes, questions, domain definition, and use case; original
documents are not needed at this stage.</p>
      <p>The Ontology-Toolkit allows for refinement at each step, balancing automation with user control. The
objectives are to streamline the ontology generation process while maintaining flexibility for diverse
use cases. A demo is available at https://tinyurl.com/iswc-demo.</p>
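<p>The three steps above can be sketched as a small pipeline. This is an illustrative outline only: the function names and signatures are hypothetical, not the Ontology-Toolkit's actual API, and the LLM calls are replaced by stubs.</p>

```python
def generate_classes(documents, domain, use_case, n=50):
    # Step 1: an LLM proposes up to n candidate classes (stubbed here);
    # users may refine the list before moving on.
    return ["Event", "Company", "MergerEvent"]

def generate_questions(documents, classes, n=50):
    # Step 2: class-level competency questions, e.g. "Where was [Person] born?"
    return ["Which [Company] participated in [MergerEvent]?"]

def generate_ontology(classes, questions, domain, use_case):
    # Step 3: only classes, questions, domain, and use case are needed;
    # the original documents are no longer used. Output is Turtle RDF.
    return "ex:MergerEvent rdfs:subClassOf ex:Event ."

def build_ontology(documents, domain, use_case):
    classes = generate_classes(documents, domain, use_case)
    questions = generate_questions(documents, classes)
    return generate_ontology(classes, questions, domain, use_case)
```

<p>The 50-class and 50-question defaults mirror the numbers the paper later reports as yielding the best results.</p>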
    </sec>
    <sec id="sec-3">
      <title>3. Benchmark</title>
<p>Our benchmark is applicable to any automated ontology generation approach. We developed our own
benchmark instead of using existing tools like OWLUnit (https://github.com/luigi-asprino/owl-unit) for three main reasons: 1) Our target users,
including non-technical individuals, lack the required expertise in Semantic Web technologies; 2) The
unpredictable structure of generated ontologies makes pre-defined query test suites challenging and
potentially biased; 3) Existing tools cannot effectively test ontology coverage, such as class hierarchy,
properties, or missing classes. Our benchmark is divided into two evaluations: 1) Quantitative: assesses
the quality of generation parameters; 2) Qualitative: enables non-technical users to evaluate ontology
coverage.</p>
      <p>This approach addresses the limitations of existing tools and provides a more suitable evaluation
method for our use case and target audience.</p>
      <sec id="sec-3-1">
        <title>3.1. Quantitative Evaluation</title>
        <p>We present one example of our quantitative evaluation on a single ontology, focusing on a single
parameter (the utility of providing a use case). This evaluation has been applied to 19 other ontologies.</p>
<p>Our results show whether providing a use case to the model improves the generated ontology; the
answer depends on the selected LLM. Prior evaluations determined that 50 classes and 50 competency
questions yield the best results for this use case, which is why these numbers are set for this evaluation.
Ontology. We evaluate using a real-world use case by comparing the ontology generated by the
Ontology-Toolkit with a reference ontology manually developed by professional knowledge engineers.
This reference ontology belongs to the financial domain and provides information about
companies as well as specific financial events. It comprises 37 classes, 54 object properties, and 48 data
properties. It also follows an event-based design pattern, primarily created to capture information on
events affecting companies such as mergers, acquisitions, settlements, and legal disputes. The Event
superclass plays a central role, with no fewer than 15 subclasses, including MergerEvent, SettlementEvent,
and LegalDisputeEvent. The ontology also includes 20 object properties indicating the participation
of companies and organizations in these events, such as biddingParticipant, regulatorParticipant, and
victimParticipant.</p>
<p>Experiments. The generated ontology with Ontology-Toolkit used a corpus of 30 documents on
financial events from BBC and Reuters, averaging 900 tokens each. The domain is defined as
market-moving events, and the use case is Assess and analyze the impact of market-moving events, the parties
involved, and their subsequent effects.</p>
<p>Two configurations named with_use_case_50 and no_use_case_50 are evaluated. Six ontologies were
generated for this experiment, specifying a use case or not and using three LLMs: Claude 3.5 Sonnet,
GPT-4o, and GPT-4o-mini. This setup allowed for a comparative analysis of the models’ performance
based on the presence or absence of an explicit use case.</p>
        <p>Evaluation Results. For each generated ontology, a manual analysis was conducted, comparing
the presence of classes and properties between the generated and reference ontologies. We also
assessed the presence of hallucinations. We had a degree of tolerance in the comparison: missing
classes or properties may correspond to elements with broader or narrower meanings, which were
noted separately. For example, the object property hasLocation is broader than country, while the data
property usesCryptocurrency is narrower than currency. Two sets of evaluations were conducted using
different metrics: 1) Accuracy; and 2) Precision, Recall, and F1 Score. Accuracy: The overall proportion
of matching classes and properties between the generated and reference ontologies. Precision: The ratio
of concepts or relations in the generated ontology that are also present in the reference ontology. Recall: The
ratio of concepts or relations in the reference ontology that are also present in the generated ontology.
F1 Score: The harmonic mean of precision and recall.</p>
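<p>These set-based metrics can be computed directly once the matching classes and properties have been identified (with the broader/narrower tolerance applied by the annotator beforehand). A minimal sketch with illustrative sets, not the paper's actual data:</p>

```python
def precision_recall_f1(generated, reference):
    """Precision, recall, and F1 over sets of concepts or relations."""
    matched = generated & reference
    precision = len(matched) / len(generated) if generated else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative class sets (not the paper's results).
generated = {"Event", "Company", "MergerEvent", "CryptoAsset"}
reference = {"Event", "Company", "MergerEvent", "SettlementEvent"}
p, r, f1 = precision_recall_f1(generated, reference)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.75 0.75 0.75
```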
<p>[Table: class and property match results for Claude 3.5 Sonnet, GPT-4o, and GPT-4o-mini, each with and without a use case (with_use_case_50 / no_use_case_50); the numeric values are not recoverable here and are available in the GitHub repository.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Qualitative Evaluation</title>
        <p>We performed a user-based evaluation in which participants were asked to use Ontology-Toolkit with
Claude 3.5 Sonnet to develop an ontology on the domain of their choice, selecting relevant documents,
and then assessing its quality using an evaluation grid. A detailed How-To guide was provided to ensure
users could independently navigate the process of both generating and evaluating the ontology.</p>
        <p>We successfully generated eight ontologies, covering topics ranging from API documentation to the
Capetian dynasty. The generated ontologies have labels in English or French depending on the source
document language without modifying the prompt. To ensure optimal visualization and exploration
of the ontologies, we used Protégé as our primary tool. The original text documents, along with
information on their domains and use cases, as well as the resulting generated ontologies are available
in the GitHub repository.</p>
        <p>
          Evaluation Grid. Building on the ontology evaluation criteria provided by Ouyang et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we
developed an evaluation grid tailored to assess ontology quality composed of 30 criteria. The grid
is divided into three sections: the first evaluates classes, the second focuses on properties, and the
third provides an overall assessment. Each section uses criteria related to accuracy (e.g. classes
accurately represent entities from the input text), completeness (e.g. key relations between classes
are captured comprehensively by properties), clarity (e.g. properties are described using clear and
consistent terminology), coverage (e.g. no significant information from the source material, relevant to
the ontology, is omitted) and relevance (e.g. the ontology focuses on classes relevant to the intended
domain) ensuring a thorough and structured evaluation of the ontology.
        </p>
        <p>The grid uses an evaluation scale where each criterion is rated from 1 to 5, where 1 represents
poor performance, 2 indicates needs improvement, 3 reflects satisfactory performance, 4 denotes good
performance, and 5 represents excellent performance. At the end of each section, users are encouraged
to provide honest feedback, highlighting areas for improvement and noting particularly well-executed
aspects.</p>
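<p>Per-section averages over the 1-to-5 grid scores are straightforward to compute. A sketch with made-up scores (the real per-user grids are in the GitHub repository):</p>

```python
def section_averages(grid):
    """grid maps a section name to its list of 1-to-5 criterion scores."""
    return {section: sum(scores) / len(scores)
            for section, scores in grid.items()}

# Hypothetical single-user grid over the three sections.
grid = {
    "Classes": [4, 4, 3, 5],
    "Properties": [5, 4, 4, 4],
    "Overall": [3, 4, 4, 3],
}
avgs = section_averages(grid)
print({s: round(v, 2) for s, v in avgs.items()})
```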
        <p>Evaluation Grid Results. Results are reported in Table 3. The average score by user varies from
3.0 to 5.0. The average score per section also shows some variation, with better results observed for
the Properties section compared to the others. The complete grid results are available in the GitHub
repository. Generally, performance is good, with most evaluations showing balanced scores across
sections, as seen in the cases of users 1, 3, 7, and 8. However, some scores fell below the average, notably
in the Overall section, highlighting concerns with the ontology’s accuracy and consistency. User 2,
for instance, gave lower scores, reflecting these concerns. The Properties section received the highest
average score (4.19), while Classes also performed well with an average of 3.99, suggesting that the
ontology’s class structure is satisfactory. However, the Overall section had the lowest average score
(3.74), indicating that the general usability of the ontology still needs optimization according to users.</p>
        <p>To better understand these results, we identified the lowest-scoring criteria in each section and
analyzed the user feedback for each area. The feedback for the class section highlights both strengths
and weaknesses. The lowest scores are observed on criteria 1.1, 1.9, and 1.10: users suggest that a more
refined hierarchical structure and a reduction in class overload would be beneficial. While the generated
ontology performs well overall, it lacks classes for specific use cases and occasionally contains irrelevant
classes. The class hierarchy also needs improvement, as some related classes could be grouped under
broader categories.</p>
<p>Users generally found the generated properties sufficient, with no major areas for improvement
identified. Only criterion 2.10 received a low score. Key properties are present, providing logical links
between classes, even those not directly relevant to the use case. However, property specificity was a
concern, with users noting that overly detailed distinctions between concepts (such as zone and zone
policy) could hinder usability. Visualization challenges were also noted, though these concerns relate
more to the tool used for visualization than the generated ontology itself. Some users struggled to
navigate the ontology and understand the relationships between property domains and ranges.</p>
        <p>For the overall section, users found that criteria 3.5, 3.7, and 3.8 did not meet their expectations, giving
them the lowest scores. Feedback indicates that while the automatically generated ontology offers a
strong and well-structured foundation, its shallow, flat design is limiting. There are too many top-level
classes and insufficient depth in the hierarchy. To enhance its usefulness, we should group related
classes under parent categories and create more specific subclasses. Additionally, the ontology currently
lacks focus on specific use cases and is missing some crucial classes and properties. By addressing these
issues, we can create a more cohesive, practical, and comprehensive ontology.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
<p>Our proposed benchmark offers a robust framework for evaluating LLM-generated ontologies. The
combination of quantitative comparisons against professional references and qualitative user
assessments provides a holistic view of ontology quality. Results highlight the potential of LLMs in ontology
generation, with Claude 3.5 Sonnet showing particular promise. The benchmark also reveals areas
for improvement, notably in overall coherence and accuracy. This work contributes to the field by
providing a standardized evaluation method, paving the way for advancements in automated ontology
generation techniques.</p>
      <p>For future work, to enhance the validity of the generated ontology and verify its accuracy, we
propose developing a text-to-graph approach. This method would enable us to assess the quality of data
extracted by the generated ontology. Additionally, we are currently investigating the implementation
of a GraphRAG approach, which would allow users to ask natural language questions as a final testing
endpoint.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Caufield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Emonet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Joachimiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Matentzoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Moxon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Reese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Haendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Mungall</surname>
          </string-name>
          ,
          <article-title>Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>40</volume>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.1093/bioinformatics/btae104.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mateiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Groza</surname>
          </string-name>
          ,
          <source>Ontology engineering with Large Language Models</source>
          ,
          <year>2023</year>
. URL: https://arxiv.org/abs/2307.16699. arXiv:2307.16699.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Kommineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>König-Ries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Samuel</surname>
          </string-name>
          ,
<article-title>From human experts to machines: An LLM supported approach to ontology and knowledge graph construction</article-title>
          ,
          <year>2024</year>
. URL: https://arxiv.org/abs/2403.08345. arXiv:2403.08345.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bischof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Filtz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Parreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Steyskal</surname>
          </string-name>
          ,
          <article-title>LLM-based Guided Generation of Ontology Term Definitions</article-title>
          , in: European Semantic Web Conference (ESWC),
          <source>Industry Track</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rebboud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tailhardat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
,
          <article-title>Can LLMs Generate Competency Questions?</article-title>
          , in:
          <source>European Semantic Web Conference (ESWC)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Carriero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schreiberhuber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
,
          <string-name>
            <given-names>J.</given-names>
            <surname>de Berardinis</surname>
          </string-name>
          ,
          <article-title>Ontochat: a framework for conversational ontology engineering using language models</article-title>
          ,
          <source>arXiv preprint arXiv:2403.05921</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>A method of ontology evaluation based on coverage, cohesion and coupling</article-title>
          ,
<source>in: 8th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD)</source>
          , volume
          <volume>4</volume>
          ,
          <year>2011</year>
          , pp.
          <fpage>2451</fpage>
          -
          <lpage>2455</lpage>
. doi:10.1109/FSKD.2011.6020046.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>