<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Journal of Web Semantics 85 (2025) 100844</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/SYNASC61333.2023.00038</article-id>
      <title-group>
        <article-title>Taming Hallucinations: A Semantic Matching Evaluation Framework for LLM-Generated Ontologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nadeen Fathallah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steffen Staab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alsayed Algergawy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Analytic Computing, Institute for Artificial Intelligence, University of Stuttgart</institution>
          ,
          <addr-line>Stuttgart</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data and Knowledge Engineering, University of Passau</institution>
          ,
          <addr-line>Passau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for Informatics, Friedrich-Schiller-University Jena</institution>
          ,
          <addr-line>Jena</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Southampton</institution>
          ,
          <addr-line>Southampton</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>14265</volume>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Ontology learning using Large Language Models (LLMs) has shown promise yet remains challenged by hallucinations—spurious or inaccurate concepts and relationships that undermine domain validity. This issue is particularly critical in highly specialized fields such as life sciences, where ontology accuracy directly impacts knowledge representation and decision-making. In this work, we introduce an automated evaluation framework that systematically assesses the quality of LLM-generated ontologies by comparing their concepts and relationship triples against domain knowledge (i.e., expert-curated domain ontologies). Our approach leverages transformer-based semantic similarity methods to detect hallucinations, ensuring that generated ontologies align with real-world knowledge. We evaluate our framework using six LLM-generated ontologies, validating them against three reference ontologies with increasing domain specificity. This work establishes a scalable, automated approach for validating LLM-generated ontologies, paving the way for their broader adoption in complex, knowledge-intensive domains.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Life Science Domain</kwd>
        <kwd>NeOn-GPT</kwd>
        <kwd>Ontology Learning</kwd>
        <kwd>Ontology Matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Ontologies provide structured frameworks for representing domain knowledge, enabling interoperability,
reasoning, and information organization. Large Language Models (LLMs) have shown promise in tasks
like ontology generation and ontology population [
        <xref ref-type="bibr" rid="ref1">1, 2, 3, 4, 5</xref>
        ]. However, one major challenge is the
tendency of LLMs to produce hallucinations—instances where they generate concepts or relationships
that either do not exist or are irrelevant to the domain [6, 7]. This issue can lead to significant errors in
fields like life sciences, where ontologies support decision-making and knowledge representation. The
tendency of LLMs to hallucinate is particularly pronounced when tasked to model highly specialized
domains like ecology and biology, as the lack of domain-specific training data increases the likelihood
of generating inaccurate or irrelevant concepts and relationships. Although manual validation of
LLM-generated ontologies by domain experts is effective, it is resource-intensive and does not scale. This work
addresses the need for an automated framework to evaluate LLM-generated ontologies against domain
knowledge, ultimately reducing the manual verification effort required by domain experts. Importantly,
our evaluation framework is not limited solely to mitigating hallucinations in LLM-generated ontologies;
it can also be adapted for other knowledge engineering tasks performed by LLMs. For instance, the
framework can validate LLM-generated knowledge graphs, ensure accuracy in semantic annotations,
and verify consistency in automated taxonomy creation. To achieve these goals, our proposed evaluation
framework is based on semantic ontology matching, identifying correspondences between concepts
and relationships across ontologies [8].
      </p>
      <p>A pressing question emerges in this context: How well can LLMs model domain-specific concepts
and relationships that align with real-world domain knowledge? To address this question, we leverage
six LLM-generated ontologies as a case study; those ontologies were generated in our previous work [9]
using our enhanced NeOn-GPT pipeline for ontology learning proposed in [10, 9]. We validate concepts
and relationship triples generated by the LLM against three domain-specific ontologies recommended by
domain experts using our automated evaluation framework. These ontologies increase in relevance to
the domain, allowing us to assess whether LLM-generated knowledge aligns with established domain
knowledge rather than being generic.</p>
      <p>Our results demonstrate that LLM-generated ontologies exhibit increasing domain alignment,
supporting their use as automated ontology generation and population tools in highly specialized domains,
with concepts and relationship triples aligning more strongly as the reference ontology becomes more
domain-specific. Furthermore, our findings show that our automated evaluation framework effectively
captures these alignments while significantly reducing the manual effort required for validation by
domain experts. The paper is structured as follows: Section 2 reviews related work, Section 3 outlines
our methodology, Section 4 presents results, Section 5 discusses findings, and Section 6 concludes with
future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Recent work shows that LLMs hold considerable promise for knowledge engineering tasks [11, 12,
13, 14, 15], particularly in the realm of ontology creation [
        <xref ref-type="bibr" rid="ref1">1, 2, 3, 4, 5</xref>
        ]. Several recent approaches
employ structured prompting to facilitate ontology creation tasks. Notable works such as OntoChat [5],
OntoGenix [16], Ontogenia [17] and our own NeOn-GPT [10] illustrate the promising capabilities of
LLMs in generating ontologies. These works identify challenges with ontology generation using LLMs,
such as syntax errors, logical inconsistencies, common modeling pitfalls, and hallucinations, where
LLMs generate incorrect or irrelevant ontology elements to the domain due to sparse domain-specific
training data. Unlike other methods, our NeOn-GPT framework is designed to address syntax and
logical consistency issues and common pitfalls internally. It integrates detection tools such as RDFLib
for syntax checking, reasoners such as Pellet and HermiT to verify logical consistency, and a pitfall
detection tool. Error messages from these tools, which describe the problems encountered, prompt the
LLM to fix these issues automatically. However, while these mechanisms effectively handle syntactic
errors, logical inconsistencies, and common pitfalls (e.g., wrong inverse relations, cycles in the class
hierarchy), reducing hallucinations remains a significant challenge. This motivates our current work,
where we propose an automatic evaluation framework to mitigate hallucinations and reduce the manual
effort required to validate LLM-generated ontologies.
      </p>
      <p>Recent literature underscores the necessity of rigorous evaluation frameworks for systematically
assessing semantic accuracy and detecting LLM-induced errors [18, 19]. Lavrinovics et al. [7] categorize
various hallucination types, illustrating their negative impacts on the reliability and trustworthiness
of ontology outputs. Agrawal et al. [20] survey knowledge-augmented LLM methods, showing how
incorporating Knowledge Graphs (KGs) can mitigate hallucinations by grounding model outputs
in validated, domain-specific knowledge. However, despite these advances, hallucinations—i.e., incorrect
or irrelevant ontology elements—remain challenging, especially in highly specialized domains with
sparse training data. Our current work proposes a novel automated framework that targets this gap. To
determine whether a generated ontology contains hallucinations—such as non-existent or irrelevant
concepts and triples—we compare it against expert-curated, domain-specific ontologies. This process,
known as ontology matching, aims to identify correspondences between semantically related concepts
across different ontologies [21]. Our framework performs ontology matching by leveraging transformer-based
embedding techniques to semantically align concepts and triples between the generated and
reference ontologies. The percentage of matched elements serves as an indicator of semantic accuracy
and domain relevance, helping to flag potential hallucinations in the generated output.</p>
      <p>In the broader context of ontology matching, lexical and heuristic methods such as PROMPT [22]
and COMA [23] have been widely used for their ability to identify matches based on name or string
similarity. However, these methods often fail to capture conceptual equivalence when the same concept is
expressed with different terms. Recent approaches use embedding-based models and LLMs to identify semantically
equivalent concepts based on meaning and context rather than wording alone. Embedding-based
models like BERTMap [24] fine-tune BERT on ontology texts and apply logic-based constraints—such as
disjointness and hierarchy rules—to ensure consistent alignment of equivalent concepts. Unsupervised
methods like TEXTO [25], PropMatch [26], and [27] enhance matching by combining transformer
embeddings with structural features, such as class hierarchies and property relationships, allowing them
to identify concept equivalence beyond string similarity. The LLMs4OM framework [28] systematically
evaluates LLMs in ontology matching, employing retrieval-augmented generation (RAG) to combine
semantic retrieval with LLM-based classification. It explores multiple retrieval models (e.g., Sentence-BERT,
OpenAI’s text-embedding-ada) and evaluates LLMs across 20 datasets, demonstrating competitive
performance against traditional systems like LogMap [29] and AML [30]. These studies show the
promise of transformer-based models in identifying semantic similarities and capturing deep semantic
relationships and contextual nuances. Consequently, we adopt a transformer-based methodology as the
cornerstone of our evaluation framework.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this study, we introduce an automated evaluation framework 1 designed to assess the reliability of
LLM-generated ontologies by systematically comparing their concepts and relationships with established
domain knowledge (i.e. expert-curated domain ontologies). We evaluate LLM-generated ontologies
at both the concept level (to assess entity correctness) and the relationship-triple level (to validate
relational integrity). By matching these elements against expert-curated ontologies, we ensure that
generated knowledge aligns with established domain standards rather than being artificially constructed.
The framework leverages semantic ontology matching techniques—sentence embeddings and
similarity-based alignment—to quantify the degree of conceptual and relational consistency between LLM-generated
ontologies and expert-curated reference ontologies. An overview of our proposed framework is shown
in Figure 1. We utilize six LLM-generated ontologies that were previously developed in [9] using our
enhanced NeOn-GPT pipeline [10, 9] for ontology learning with GPT-4o [31] as a case study. These
ontologies represent different aspects of the AquaDiva 2 research domain [32, 33], which investigates
microbial ecology, biogeochemical cycles, and environmental processes in subsurface ecosystems:
• AquaDiva Ontology (Version 1): Represents concepts in groundwater ecosystems, including
aquifers, microbial communities, and biogeochemical processes, but with limited structural depth.
• AquaDiva Ontology (Version 2): Expands the AquaDiva domain representation by
incorporating a deeper class hierarchy and more object properties, improving relational depth between
entities.
• AquaDiva Ontology (Version 3): Merges previous AquaDiva ontology versions 1 and 2.
• Habitat Ontology: A module of the AquaDiva ontology that captures knowledge about different
habitat types within groundwater ecosystems.
• Role Ontology: A module of the AquaDiva ontology that models the functional roles of biological,
chemical, and environmental agents in groundwater systems.
• Carbon &amp; Nitrogen Cycling Ontology: A module of the AquaDiva ontology that represents
biochemical processes related to carbon and nitrogen cycles in groundwater.</p>
      <p>These ontologies serve as test cases for our framework, allowing us to assess how well LLM-generated
knowledge aligns with domain-specific standards.</p>
      <p>1 Our code base is publicly available for research and development purposes, accessible at: https://github.com/NadeenAhmad/TamingHallucinations</p>
      <p>2 https://www.aquadiva.uni-jena.de/</p>
      <p>[Figure 1: Overview of the proposed evaluation framework. Concepts (C1, C2) and triples (T1, T2) are extracted from the input ontologies (O1, O2); the sentence transformer model all-MiniLM-L6-v2 generates the embeddings (EC1, EC2) and (ET1, ET2); cosine similarity scores at or above the threshold Ƭ are accepted as matches, while scores below Ƭ are flagged as hallucinations.]</p>
      <sec id="sec-3-5">
        <p>To validate the accuracy and domain relevance of these ontologies, we compare their concepts and triples against three expert-recommended reference
ontologies:
• OBOE-SBC (Santa Barbara Coastal Observation Ontology) [34]: Describes environmental
observations specific to the Santa Barbara Coastal Long Term Ecological Research Project.
• ENVO (Environmental Ontology) [35]: Provides a controlled vocabulary for describing
environmental entities, including ecosystems, environmental processes, and qualities.
• CHEBI (Chemical Entities of Biological Interest) [36]: Provides a structured classification of
chemical compounds of biological relevance.</p>
        <sec id="sec-3-5-1">
          <title>3.1. Ontology Concept and Triple Extraction</title>
          <p>Our evaluation framework starts by extracting concepts and triples from LLM-generated and
expert-curated reference ontologies.</p>
          <p>Concept Extraction: We use transformer-based models that rely on textual semantics to compute
embeddings, so extracting human-readable labels for ontology concepts is crucial for accurate matching,
interpretation, and validation. These labels retain natural language structure, helping models understand
relationships instead of treating concepts as meaningless identifiers. Without readable labels,
embeddings fail to reflect the actual meaning of concepts and triples, leading to misalignment. For example,
an identifier like OBO:0003742 provides no semantic value, whereas its label: Microbial Biomass
enables a model to contextualize the concept within biological and ecological domains, improving
similarity computation and alignment accuracy. Concept labels alone can be ambiguous without context.
For example, "cell" refers to a biological unit or a prison room, highlighting the need to extract labels
and definitions. Definitions provide crucial disambiguation, improving alignment with expert-curated
ontologies. For instance, an LLM-generated concept labeled "Microbial Activity" without a definition
may be difficult to align with the ENVO ontology’s "Microbial Biogeochemical Process," defined in
the ENVO ontology as "A process mediated by microbial activity influencing the transformation of
chemical compounds in an ecosystem." Extracting both labels and definitions ensures a more accurate
semantic comparison.</p>
          <p>Our framework extracts ontology concepts by parsing class labels and their associated definitions.
Definitions are key in concept matching by resolving ambiguities and standardizing variations.
Disambiguation ensures that terms with multiple meanings are classified correctly (e.g., "cell" as a biological
unit vs. a prison room). At the same time, standardization aligns different representations of the same
concept (e.g., "CO2" vs. "carbon dioxide").</p>
          <p>We developed an automated extraction pipeline to extract concepts and their definitions from LLM
ontologies represented in Turtle (TTL) format. The pipeline applies regular expressions to identify
ontology classes (owl:Class) and extract their corresponding labels and definitions (rdfs:comment).
The extracted concepts and their definitions are stored in a structured dictionary; each class is paired
with its corresponding definition or labeled as an empty string if missing. In our analysis, we observed that
LLM-generated ontologies sometimes contain duplicate classes with identical labels, leading to
redundant entries. To prevent inflating the results, we implemented a filtering step to remove these duplicates
before storing the processed data in JSON format for further analysis. For example, in the Carbon &amp;
Nitrogen Cycling Ontology, we extracted the concept: "Forest Ecosystem": "An ecosystem
dominated by trees and other vegetation, playing a key role in carbon and
nitrogen cycling."</p>
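          <p>A minimal, self-contained sketch of this extraction step is given below. It is illustrative only: it assumes simple Turtle statements with double-quoted rdfs:label and rdfs:comment literals, whereas the actual pipeline handles the full TTL syntax.</p>
          <preformat>
```python
import re

def extract_concepts(ttl_text):
    """Map each owl:Class label to its rdfs:comment definition ("" if missing),
    skipping duplicate labels so repeated classes do not inflate results."""
    concepts = {}
    # Turtle statements end with " ."; keep only owl:Class declarations.
    for stmt in ttl_text.split(" ."):
        if "owl:Class" not in stmt:
            continue
        label = re.search(r'rdfs:label\s+"([^"]+)"', stmt)
        comment = re.search(r'rdfs:comment\s+"([^"]+)"', stmt)
        if label and label.group(1) not in concepts:
            concepts[label.group(1)] = comment.group(1) if comment else ""
    return concepts
```
          </preformat>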
          <p>Similarly, a pipeline was developed to extract concepts and their definitions from reference ontologies
using the BioPortal API. Our pipeline retrieves ontology classes from OBOE-SBC, ENVO, and CHEBI
repositories by making iterative API requests. The data extraction process involves querying the
API, parsing JSON responses to extract concept labels and definitions, and handling pagination to
ensure the retrieval of all available entries. To ensure meaningful semantic content in the extracted
concepts, we excluded blank nodes (BNodes), as they often lack clear labels or definitions. Additionally,
we removed UUID-like alphanumeric strings using regular expression filtering, as these randomly
generated identifiers do not contribute to the ontology’s conceptual structure. Concepts containing
ORCID IDs, ontology prefixes (e.g., foodon:01234 or CHEBI:12345), or database-specific notations
were replaced with readable terms by retrieving the class label from their corresponding data sources.
For example, instead of retaining CHEBI:15377, we used API-based label retrieval to replace it with its
human-readable name, "Water." This ensured that the extracted concepts remained interpretable and
useful for semantic matching. The extracted data is stored in structured JSON files for further analysis.</p>
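          <p>The filtering described above can be approximated as follows; this is a sketch, and label_lookup is a hypothetical stand-in for the API-based label retrieval.</p>
          <preformat>
```python
import re

# UUID-like random identifiers carry no semantic content and are dropped.
UUID_RE = re.compile(r'^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-'
                     r'[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$')
# CURIE-style ids such as CHEBI:15377 or foodon:01234 are replaced by labels.
CURIE_RE = re.compile(r'^[A-Za-z_]+:\d+$')

def clean_concepts(raw_terms, label_lookup):
    """Drop UUID-like strings and swap CURIE identifiers for readable labels."""
    cleaned = []
    for term in raw_terms:
        if UUID_RE.match(term):
            continue
        if CURIE_RE.match(term):
            term = label_lookup.get(term, term)  # e.g. CHEBI:15377 becomes "Water"
        cleaned.append(term)
    return cleaned
```
          </preformat>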
          <p>The extracted set C1 from the LLM-generated ontology consists of concepts paired with their
corresponding definitions. The set from the expert-curated reference ontology is denoted as C2 (as
shown in Figure 1).</p>
          <p>Triple Extraction: We extract subject–predicate–object (SPO) triples from the ontology and obtain
their human-readable forms by extracting labels for each entity in the triple. For example, the triple
(OBO:0003742) - [obo:RO_0002234] → (OBO:0000270) is transformed into (Microbial
Biomass) - [is affected by] → (Dissolved Organic Carbon).</p>
          <p>We developed an automated extraction pipeline to extract triples from LLM-generated ontologies
represented in Turtle (TTL) format. The extracted triples include (a) Class Hierarchies (subClassOf
and is a relationships), (b) Object Properties (links between ontology concepts), and (c) Data
Properties (attributes associated with ontology entities). The extraction process begins with domain and
range identification to determine property domain and range constraints, specifying the types of
entities a property can connect. Using this structured information, we then proceed to construct
SPO triples; for instance, if the property "is consumed by" is defined with "Trace Gas" as its
domain and "Microbial Community" as its range, the extracted triple would be: (Trace Gas)
-[is consumed by]-&gt; (Microbial Community). The final set of triples is stored in structured
CSV files for further ontology matching. Additionally, the pipeline identifies hierarchical
relationships, extracting subClassOf relationships that define taxonomic structures within the ontology:
(Methane Production) -[subClassOf]-&gt; (Carbon Cycling Process). We also capture "is
a" (rdf:type) relationships, which categorize entities into specific classes, such as (North Sea)
-[is a]-&gt; (Marine Ecosystem). The examples presented above were extracted from the Carbon
and Nitrogen Cycling Ontology.</p>
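          <p>The construction of SPO triples from domain/range constraints and class hierarchies can be sketched as follows; the data structures are illustrative, not our exact pipeline.</p>
          <preformat>
```python
def build_spo_triples(object_props, subclass_pairs):
    """Assemble SPO triples from object-property (domain, range) constraints
    and subClassOf pairs, mirroring the two triple kinds described above."""
    triples = [(dom, prop, rng) for prop, (dom, rng) in object_props.items()]
    triples += [(child, "subClassOf", parent) for child, parent in subclass_pairs]
    return triples

# Example inputs modeled on the Carbon and Nitrogen Cycling ontology:
example = build_spo_triples(
    {"is consumed by": ("Trace Gas", "Microbial Community")},
    [("Methane Production", "Carbon Cycling Process")],
)
```
          </preformat>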
          <p>Similarly, a pipeline was developed to extract triples from reference ontologies using the BioPortal
API. Our pipeline retrieves ontology triples from the OBOE-SBC, ENVO, and CHEBI repositories by
making iterative API requests. The data extraction process involves querying the API and handling the
same types of triples: (a) Class Hierarchies (subClassOf and is a relationships), (b) Object Properties,
and (c) Data Properties. We applied the same filtering mechanisms as in concept extraction to ensure
semantic relevance. The extracted set of triples from the LLM-generated ontology is T1, while the
expert-curated reference ontology set is T2 (see Figure 1).</p>
        </sec>
        <sec id="sec-3-5-2">
          <title>3.2. Ontology Concept and Triple Matching</title>
          <p>Concept Matching: We match ontology concepts by comparing class labels and their definitions across
LLM-generated ontologies and reference ontologies. To achieve this, we employ a
concatenation-based embedding strategy, where the concept name and its definition are merged into a single
text representation before generating an embedding. Each concept is formatted as: "concept
tokenizer.sep_token definition". This approach allows the model to process the concept
and its associated definition simultaneously. Including a separator token explicitly signals the boundary
between the concept label and its definition, helping the embedding model distinguish and appropriately
weigh the semantic contributions of each component. Thus, instead of merging labels and definitions
into one potentially ambiguous sentence, our concatenation approach ensures accurate
contextualization and improved embedding quality. Our concept matching pipeline utilizes all-MiniLM-L6-v2
[37], a pre-trained sentence transformer model, to generate fixed-size vector embeddings for both
LLM-generated and reference ontology concepts. We selected all-MiniLM-L6-v2 as our embedding
model due to its lightweight architecture, efficiency, and strong performance in semantic similarity
tasks. This model generates 384-dimensional sentence embeddings, effectively capturing the semantic
meaning of the text while maintaining a compact size of 22MB. Its efficiency makes it suitable for handling
large-scale ontology matching without requiring extensive computational resources. Furthermore,
all-MiniLM-L6-v2 has demonstrated strong semantic search, clustering, and sentence similarity
performance [38, 39]. In this process, the sets of concepts C1 and C2 are transformed into the sets of
embeddings EC1 and EC2, respectively.</p>
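          <p>The concatenation strategy itself reduces to a small formatting step, sketched below; "[SEP]" is the separator token used by BERT-style models such as all-MiniLM-L6-v2, obtained in practice from tokenizer.sep_token.</p>
          <preformat>
```python
def format_concept(concept, definition, sep_token="[SEP]"):
    """Join a concept label and its definition around the separator token so
    the embedding model sees both parts with an explicit boundary."""
    return f"{concept} {sep_token} {definition}" if definition else concept
```
          </preformat>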
          <p>Embeddings are then compared using cosine similarity, a mathematical measure that calculates
the angle between two vectors in a high-dimensional space [40]. Unlike Euclidean distance, which
measures absolute diferences, cosine similarity evaluates how directionally similar two vectors are,
making it suited for semantic comparisons. A score of 1 indicates identical meanings, while 0 suggests
no similarity. In this process, each concept-and-definition embedding from EC1 is compared against
all concept embeddings in EC2 using cosine similarity (see Figure 1). Concepts that exceed a similarity threshold
(Ƭ) (e.g., 0.50) are retained as valid matches. Concepts that fail to find a meaningful match are flagged
as hallucinations for domain experts to verify. For example, in the AquaDiva Ontology (Version 2),
the concept: "GroundwaterPrecipitation": "The process where water precipitates,
either through chemical means within groundwater systems or as a part of the
hydrological cycle impacting groundwater recharge." was matched with the concept
"PrecipitationWaterSample": "PrecipicationWater falls from the atmosphere to
earth, as rain or snow. Also, see the process called Precipitation." in OBOE-SBC
ontology with a similarity score of 0.66. Concepts such as "Trace Gas Consumption" that lacked
strong matches were flagged as hallucinations for experts to review. The final output consists of three
main components: (a) Accepted Matches - LLM-generated concepts successfully aligned with reference
ontology concepts; (b) Hallucinated Concepts - Concepts with no meaningful match, indicating potential
LLM errors that need manual verification by domain experts; and (c) Match Confidence Statistics - A
breakdown of how many LLM concepts were validated and their match distribution across reference
ontologies.</p>
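          <p>The similarity computation and thresholding can be sketched in plain Python as follows; in practice the embedding vectors come from all-MiniLM-L6-v2, and the parameter tau mirrors the threshold Ƭ = 0.50.</p>
          <preformat>
```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_concepts(gen_embeddings, ref_embeddings, tau=0.50):
    """Split generated concepts into accepted matches and flagged candidates.
    Each argument maps a concept label to its embedding vector."""
    accepted, flagged = {}, []
    for label, emb in gen_embeddings.items():
        score, ref_label = max(
            (cosine_similarity(emb, r), name) for name, r in ref_embeddings.items()
        )
        if score >= tau:
            accepted[label] = (score, ref_label)
        else:
            flagged.append(label)  # potential hallucination for expert review
    return accepted, flagged
```
          </preformat>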
          <p>Triple Matching: We match ontology triples by comparing Subject-Predicate-Object (SPO)
relationships across LLM-generated ontologies and reference ontologies. We employ a sentence-based
embedding strategy to achieve this, where each SPO triple is converted into a natural language sentence
representation before generating an embedding. For example, (TraceGas) - [is consumed by] -&gt;
(MicrobialCommunity) is transformed to "TraceGas is consumed by MicrobialCommunity".
This approach ensures that the semantic relationships within triples are preserved, allowing the model
to process them holistically rather than as disjointed components. Our triple matching pipeline
utilizes the same model all-MiniLM-L6-v2, transforming the sets of triples T1 and T2 into the sets
of embeddings ET1 and ET2, respectively. These embeddings are then compared using cosine
similarity as well. Each triple embedding from ET1 is compared against all triples in ET2 using cosine
similarity (see Figure 1). Triples that exceed a similarity threshold (Ƭ) (e.g., 0.50) are retained as valid
matches. In contrast, triples that fail to find a meaningful match are flagged as hallucinations for
domain experts to verify. For example, the triple "Karst Groundwater is a Water" extracted from
Carbon &amp; Nitrogen Cycling ontology was matched with the following triple from the ENVO ontology:
"freshwater subclass of water" with a similarity score of 0.58 and "TraceGas is consumed
by MicrobialCommunity" was matched with "methane has role bacterial metabolite"
from CHEBI ontology with similarity score 0.52.</p>
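          <p>The sentence rendering of a triple is a direct verbalization, sketched below.</p>
          <preformat>
```python
def verbalize_triple(spo):
    """Render an SPO triple as the natural-language sentence that is embedded."""
    subject, predicate, obj = spo
    return f"{subject} {predicate} {obj}"

verbalize_triple(("TraceGas", "is consumed by", "MicrobialCommunity"))
# → "TraceGas is consumed by MicrobialCommunity"
```
          </preformat>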
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>To evaluate the alignment of LLM-generated ontologies with domain-specific reference ontologies, each
LLM-generated ontology was matched against three reference ontologies ranked by domain experts (i.e.
ecologists) in ascending order of relevance to the AquaDiva ontology domain. The matching process
proceeded in the following stages: (1) Matching with the least relevant reference ontology (OBOE-SBC),
(2) Matching with the combination of the least and second least relevant reference ontologies
(OBOE-SBC + ENVO), and (3) Matching with all three reference ontologies together (OBOE-SBC + ENVO
+ CHEBI). This stepwise approach reveals whether alignment improves with more domain-specific
references—indicating higher semantic relevance in the LLM-generated ontologies.</p>
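      <p>The stepwise procedure can be sketched as follows; match_fn is a hypothetical placeholder for the embedding-based matcher of Section 3.2.</p>
      <preformat>
```python
def stepwise_match_rates(gen_concepts, reference_stages, match_fn):
    """Match a generated ontology against cumulatively larger reference sets,
    returning the match percentage after each stage (e.g. OBOE-SBC alone,
    then +ENVO, then +CHEBI)."""
    rates, pool = [], set()
    for stage in reference_stages:  # stages in ascending domain relevance
        pool |= set(stage)
        matched = sum(1 for c in gen_concepts if match_fn(c, pool))
        rates.append(100.0 * matched / len(gen_concepts))
    return rates
```
      </preformat>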
      <sec id="sec-4-1">
        <title>4.1. Concept Matching Results</title>
        <p>The percentage of matched concepts across diferent reference ontology combinations is summarized
in Table 1. The results show that matching only with OBOE-SBC resulted in relatively low concept
match percentages across all ontologies (e.g., 46.27% for AquaDiva (Version1) and 36.94% for Carbon
and Nitrogen Cycling). Adding ENVO significantly increased the rate of matched concepts, almost
doubling the first-stage percentages. Incorporating all three reference ontologies (OBOE-SBC + ENVO + CHEBI)
led to marginal improvements beyond the second stage, with all ontologies exceeding 90% alignment.
The Carbon &amp; Nitrogen Cycling ontology achieved the highest match percentages, likely due to its
alignment with the CHEBI ontology, which classifies biologically relevant chemical compounds. Since
this ontology focuses on biochemical processes related to carbon and nitrogen cycles, its terminology
closely matches CHEBI’s structured vocabulary.</p>
        <p>Table 1. Percentage of matched concepts with OBOE-SBC for each LLM-generated ontology via NeOn-GPT: AquaDiva (Version1): 46.27%; AquaDiva (Version2): 43.36%; AquaDiva (Version3): 41.72%; Habitat: 32.53%; Role: 39.31%; Carbon and Nitrogen Cycling: 36.94%.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Triple Matching Results</title>
        <p>The percentage of matched triples across diferent reference ontology combinations is summarized in
Table 2. The results show that matching only with OBOE-SBC resulted in significantly lower match
percentages for triples compared to concepts (e.g., 15.98% for AquaDiva (Version1) and 13.29% for
Carbon and Nitrogen Cycling). Adding ENVO led to a substantial improvement in triple alignment,
with match percentages increasing by more than 40 percentage points in all cases. Including all three
reference ontologies (OBOE-SBC + ENVO + CHEBI) further improved the match percentages, though
the gain was less pronounced than in the second stage.</p>
        <p>Table 2. Percentage of matched triples with OBOE-SBC and with OBOE-SBC + ENVO for each LLM-generated ontology.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The high concept matching rates indicate that LLMs are efective at generating widely accepted
entitylevel knowledge, likely due to their ability to synthesize common terms from large-scale training
corpora. The incremental ontology matching approach revealed that as more relevant ontologies were
included, the match rate increased significantly, especially for concepts. Despite the high concept
alignment observed in our matching process, some LLM-generated concepts remained unmatched,
highlighting challenges in semantic consistency and structured representation. A manual review of
the unmatched concepts suggests that many terms were either highly specialized (i.e., highly relevant to
the AquaDiva ontology domain but not represented in reference ontologies) or too generic to align
with structured reference vocabularies. Highly specialized concepts such as "Hainich Critical
Zone" from the AquaDiva (Version 3) ontology represent valid scientific terms that reflect
domain-specific knowledge. Their absence from reference ontologies does not imply inaccuracy but rather
illustrates the potential of LLMs to surface novel or underrepresented entities relevant to the target
domain. On the other hand, generic terms like "Extreme Weather Event" in the AquaDiva (Version
1) ontology are meaningful but often not formalized in structured vocabularies.</p>
      <p>This is where human-in-the-loop validation becomes essential, enabling domain experts to assess
such unmatched concepts’ correctness, relevance, and potential value, as shown in Figure 1. For this
reason, our approach complements—rather than replaces—expert validation, helping reduce the manual
effort required. It flags unmatched concepts and triples for expert review, acknowledging that they may
represent legitimate and valuable domain knowledge that falls outside the scope of existing ontologies.</p>
      <p>Unlike concept matching, triple matching showed lower alignment rates. Similar to highly
specialized unmatched concepts, some triples remained unmatched because they were highly relevant
to the AquaDiva ontology domain only, such as the triple: (TriassicLimestone) - [is a] -&gt;
(GeologicalFormation) from the AquaDiva (Version 3) ontology. Many unmatched triples lacked
clear hierarchical or property constraints, making them dificult to align. For example, the unmatched
triple (reflects changes in) - [is a] -&gt; (ObjectProperty), (reflects changes in)
suggests a causal relationship, but standard ontologies often use more rigid property constraints, such
as has Process or affects. The absence of standardized predicates in LLM-generated ontologies
makes direct alignment with structured ontologies challenging. Unlike traditional ontology engineering
methods that rely on formal logic and domain expertise, LLMs rely on statistical correlations and
vector-based search methods rather than deductive reasoning. As a result, LLMs struggle to generate
subject-relation-object triples that conform to well-defined ontological structures. This explains why
concept alignment is significantly higher than triple alignment: while LLMs can extract and generate
entity-level knowledge effectively, they struggle to formalize structured semantic relationships.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future work</title>
      <p>In this study, we proposed an evaluation framework for assessing LLM-generated ontologies by matching
their concepts and triples against domain-specific reference ontologies, aiming to reduce the manual
verification efforts required from domain experts. The results demonstrate that while LLMs excel at
generating domain-relevant concepts, their performance declines when it comes to producing structured
relationships, as reflected in the lower triple alignment rates. Our stepwise ontology matching strategy
further confirmed that the relevance of the reference ontology significantly influences the alignment
quality, with higher alignment percentages achieved when using more domain-specific ontologies.
Future work should also investigate the potential of leveraging LLMs as the domain expert in this
pipeline, inspired by previous works that use LLM-as-a-judge [41, 42]. In our previous work, we evaluated
LLM-generated ontologies for syntactic correctness, logical consistency, common modeling pitfalls,
and structural properties [10, 9]; this work extends that evaluation to the semantic level, assessing
the alignment of concepts and relationship triples with expert-curated reference ontologies. In future
work, we plan to assess the practical utility of these ontologies through task-based evaluations, such as
their ability to support competency questions and other real-world applications, providing a deeper
understanding of their functional value. Beyond the current use case, we aim to use this framework
to evaluate and compare different ontology learning pipelines and LLMs. Additionally, we plan to
adapt the framework to support other knowledge engineering tasks, such as validating LLM-generated
knowledge graphs, semantic annotations, or taxonomy construction, helping to ensure consistency
and domain relevance across a wider range of automated knowledge modeling scenarios. Finally, the
results of this semantic evaluation suggest that future ontology generation may benefit from models
with improved contextual understanding; thus, we intend to explore the potential of Large Context
Models (LCMs) to improve hierarchical structuring in LLM-generated ontologies [43].</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>The authors thank Mr. Yihang Zhao (Department of Informatics, King’s College London) for kindly
presenting this paper at SemTech4STLD @ ESWC 2025 on their behalf.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4 and Grammarly for grammar and spelling
checks. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed
and take(s) full responsibility for the publication's content.
</p>
    </sec>
  </body>
  <back>
  </back>
</article>