<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Journal of Web Semantics 85 (2025) 100844</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/SYNASC61333.2023.00038</article-id>
      <title-group>
        <article-title>Taming Hallucinations: A Semantic Matching Evaluation Framework for LLM-Generated Ontologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nadeen Fathallah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steffen Staab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alsayed Algergawy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Analytic Computing, Institute for Artificial Intelligence, University of Stuttgart</institution>
          ,
          <addr-line>Stuttgart</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data and Knowledge Engineering, University of Passau</institution>
          ,
          <addr-line>Passau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for Informatics, Friedrich-Schiller-University Jena</institution>
          ,
          <addr-line>Jena</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Southampton</institution>
          ,
          <addr-line>Southampton</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>14265</volume>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Ontology learning using Large Language Models (LLMs) has shown promise yet remains challenged by hallucinations—spurious or inaccurate concepts and relationships that undermine domain validity. This issue is particularly critical in highly specialized fields such as life sciences, where ontology accuracy directly impacts knowledge representation and decision-making. In this work, we introduce an automated evaluation framework that systematically assesses the quality of LLM-generated ontologies by comparing their concepts and relationship triples against domain knowledge (i.e., expert-curated domain ontologies). Our approach leverages transformer-based semantic similarity methods to detect hallucinations, ensuring that generated ontologies align with real-world knowledge. We evaluate our framework using six LLM-generated ontologies, validating them against three reference ontologies with increasing domain specificity. This work establishes a scalable, automated approach for validating LLM-generated ontologies, paving the way for their broader adoption in complex, knowledge-intensive domains.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Life Science Domain</kwd>
        <kwd>NeOn-GPT</kwd>
        <kwd>Ontology Learning</kwd>
        <kwd>Ontology Matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Ontologies provide structured frameworks for representing domain knowledge, enabling interoperability,
reasoning, and information organization. Large Language Models (LLMs) have shown promise in tasks
like ontology generation and ontology population [
        <xref ref-type="bibr" rid="ref1">1, 2, 3, 4, 5</xref>
        ]. However, one major challenge is the
tendency of LLMs to produce hallucinations—instances where they generate concepts or relationships
that either do not exist or are irrelevant to the domain [6, 7]. This issue can lead to significant errors in
fields like life sciences, where ontologies support decision-making and knowledge representation. The
tendency of LLMs to hallucinate is particularly pronounced when tasked to model highly specialized
domains like ecology and biology, as the lack of domain-specific training data increases the likelihood
of generating inaccurate or irrelevant concepts and relationships. Although manual validation of
LLM-generated ontologies by domain experts is effective, it is resource-intensive and does not scale. This work
addresses the need for an automated framework to evaluate LLM-generated ontologies against domain
knowledge, ultimately reducing the manual verification effort required by domain experts. Importantly,
our evaluation framework is not limited solely to mitigating hallucinations in LLM-generated ontologies;
it can also be adapted for other knowledge engineering tasks performed by LLMs. For instance, the
framework can validate LLM-generated knowledge graphs, ensure accuracy in semantic annotations,
and verify consistency in automated taxonomy creation. To achieve these goals, our proposed evaluation
framework is based on semantic ontology matching, identifying correspondences between concepts
and relationships across ontologies [8].
      </p>
      <p>A pressing question emerges in this context: How well can LLMs model domain-specific concepts
and relationships that align with real-world domain knowledge? To address this question, we leverage
six LLM-generated ontologies as a case study; those ontologies were generated in our previous work [9]
using our enhanced NeOn-GPT pipeline for ontology learning proposed in [10, 9]. We validate concepts
and relationship triples generated by the LLM against three domain-specific ontologies recommended by
domain experts using our automated evaluation framework. These ontologies increase in relevance to
the domain, allowing us to assess whether LLM-generated knowledge aligns with established domain
knowledge rather than being generic.</p>
      <p>Our results demonstrate that LLM-generated ontologies exhibit increasing domain alignment,
supporting their use as automated ontology generation and population tools in highly specialized domains,
with concepts and relationship triples aligning more strongly as the reference ontology becomes more
domain-specific. Furthermore, our findings show that our automated evaluation framework effectively
captures these alignments while significantly reducing the manual effort required for validation by
domain experts. The paper is structured as follows: Section 2 reviews related work, Section 3 outlines
our methodology, Section 4 presents results, Section 5 discusses findings, and Section 6 concludes with
future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Recent work shows that LLMs hold considerable promise for knowledge engineering tasks [11, 12,
13, 14, 15], particularly in the realm of ontology creation [
        <xref ref-type="bibr" rid="ref1">1, 2, 3, 4, 5</xref>
        ]. Several recent approaches
employ structured prompting to facilitate ontology creation tasks. Notable works such as OntoChat [5],
OntoGenix [16], Ontogenia [17] and our own NeOn-GPT [10] illustrate the promising capabilities of
LLMs in generating ontologies. These works identify challenges with ontology generation using LLMs,
such as syntax errors, logical inconsistencies, common modeling pitfalls, and hallucinations, where
LLMs generate incorrect or irrelevant ontology elements to the domain due to sparse domain-specific
training data. Unlike other methods, our NeOn-GPT framework is designed to address syntax and
logical consistency issues and common pitfalls internally. It integrates detection tools such as RDFLib
for syntax checking, reasoners such as Pellet and HermiT to verify logical consistency, and a pitfall
detection tool. Error messages from these tools, which describe the problems encountered, prompt the
LLM to fix these issues automatically. However, while these mechanisms effectively handle syntactic
errors, logical inconsistencies, and common pitfalls (e.g., wrong inverse relations, cycles in the class
hierarchy), reducing hallucinations remains a significant challenge. This motivates our current work,
where we propose an automatic evaluation framework to mitigate hallucinations and reduce the manual
effort required to validate LLM-generated ontologies.
      </p>
      <p>Recent literature underscores the necessity of rigorous evaluation frameworks for systematically
assessing semantic accuracy and detecting LLM-induced errors [18, 19]. Lavrinovics et al. [7] categorize
various hallucination types, illustrating their negative impacts on the reliability and trustworthiness
of ontology outputs. Agrawal et al. [20] survey knowledge-augmented LLM methods, showing how
incorporating Knowledge Graphs (KGs) can mitigate hallucinations by grounding model outputs
in validated, domain-specific knowledge. However, despite these advances, hallucinations—i.e., incorrect
or irrelevant ontology elements—remain challenging, especially in highly specialized domains with
sparse training data. Our current work proposes a novel automated framework that targets this gap. To
determine whether a generated ontology contains hallucinations—such as non-existent or irrelevant
concepts and triples—we compare it against expert-curated, domain-specific ontologies. This process,
known as ontology matching, aims to identify correspondences between semantically related concepts
across different ontologies [21]. Our framework performs ontology matching by leveraging transformer-based
embedding techniques to semantically align concepts and triples between the generated and
reference ontologies. The percentage of matched elements serves as an indicator of semantic accuracy
and domain relevance, helping to flag potential hallucinations in the generated output.</p>
      <p>In the broader context of ontology matching, lexical and heuristic methods such as PROMPT [22]
and COMA [23] have been widely used for their ability to identify matches based on name or string
similarity. However, these methods often fail to capture conceptual equivalence when the same concept is
expressed with different terms. Recent approaches use embedding-based models and LLMs to identify semantically
equivalent concepts based on meaning and context rather than wording alone. Embedding-based
models like BERTMap [24] fine-tune BERT on ontology texts and apply logic-based constraints—such as
disjointness and hierarchy rules—to ensure consistent alignment of equivalent concepts. Unsupervised
methods like TEXTO [25], PropMatch [26], and [27] enhance matching by combining transformer
embeddings with structural features, such as class hierarchies and property relationships, allowing them
to identify concept equivalence beyond string similarity. The LLMs4OM framework [28] systematically
evaluates LLMs in ontology matching, employing retrieval-augmented generation (RAG) to combine
semantic retrieval with LLM-based classification. It explores multiple retrieval models (e.g., Sentence-BERT,
OpenAI’s text-embedding-ada) and evaluates LLMs across 20 datasets, demonstrating competitive
performance against traditional systems like LogMap [29] and AML [30]. These studies show the
promise of transformer-based models in identifying semantic similarities and capturing deep semantic
relationships and contextual nuances. Consequently, we adopt a transformer-based methodology as the
cornerstone of our evaluation framework.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this study, we introduce an automated evaluation framework 1 designed to assess the reliability of
LLM-generated ontologies by systematically comparing their concepts and relationships with established
domain knowledge (i.e. expert-curated domain ontologies). We evaluate LLM-generated ontologies
at both the concept level (to assess entity correctness) and the relationship-triple level (to validate
relational integrity). By matching these elements against expert-curated ontologies, we ensure that
generated knowledge aligns with established domain standards rather than being artificially constructed.
The framework leverages semantic ontology matching techniques—sentence embeddings and
similarity-based alignment—to quantify the degree of conceptual and relational consistency between LLM-generated
ontologies and expert-curated reference ontologies. An overview of our proposed framework is shown
in Figure 1. We utilize six LLM-generated ontologies that were previously developed in [9] using our
enhanced NeOn-GPT pipeline [10, 9] for ontology learning with GPT-4o [31] as a case study. These
ontologies represent different aspects of the AquaDiva 2 research domain [32, 33], which investigates
microbial ecology, biogeochemical cycles, and environmental processes in subsurface ecosystems:
• AquaDiva Ontology (Version 1): Represents concepts in groundwater ecosystems, including
aquifers, microbial communities, and biogeochemical processes, but with limited structural depth.
• AquaDiva Ontology (Version 2): Expands the AquaDiva domain representation by
incorporating a deeper class hierarchy and more object properties, improving relational depth between
entities.
• AquaDiva Ontology (Version 3): Merges previous AquaDiva ontology versions 1 and 2.
• Habitat Ontology: A module of the AquaDiva ontology that captures knowledge about different
habitat types within groundwater ecosystems.
• Role Ontology: A module of the AquaDiva ontology that models the functional roles of biological,
chemical, and environmental agents in groundwater systems.
• Carbon &amp; Nitrogen Cycling Ontology: A module of the AquaDiva ontology that represents
biochemical processes related to carbon and nitrogen cycles in groundwater.</p>
      <p>These ontologies serve as test cases for our framework, allowing us to assess how well LLM-generated
knowledge aligns with domain-specific standards.</p>
      <p>1 Our code base is publicly available for research and development purposes, accessible at: https://github.com/NadeenAhmad/TamingHallucinations</p>
      <p>2 https://www.aquadiva.uni-jena.de/</p>
      <p>[Figure 1: Overview of the proposed evaluation framework. Concepts (C1, C2) and triples (T1, T2) are extracted from the input ontologies (O1, O2); the sentence transformer model all-MiniLM-L6-v2 generates the embeddings (EC1, EC2) and (ET1, ET2); cosine similarity scores at or above the threshold Ƭ are accepted as matches, while scores below Ƭ are flagged as hallucinations.]</p>
      <sec id="sec-3-5">
        <p>To validate the accuracy and domain relevance of these ontologies, we compare their concepts and triples against three expert-recommended reference
ontologies:
• OBOE-SBC (Santa Barbara Coastal Observation Ontology) [34]: Describes environmental
observations specific to the Santa Barbara Coastal Long Term Ecological Research Project.
• ENVO (Environmental Ontology) [35]: Provides a controlled vocabulary for describing
environmental entities, including ecosystems, environmental processes, and qualities.
• CHEBI (Chemical Entities of Biological Interest) [36]: Provides a structured classification of
chemical compounds of biological relevance.</p>
        <sec id="sec-3-5-1">
          <title>3.1. Ontology Concept and Triple Extraction</title>
          <p>Our evaluation framework starts by extracting concepts and triples from LLM-generated and
expert-curated reference ontologies.</p>
          <p>Concept Extraction: We use transformer-based models that rely on textual semantics to compute
embeddings, so extracting human-readable labels for ontology concepts is crucial for accurate matching,
interpretation, and validation. These labels retain natural language structure, helping models understand
relationships instead of treating concepts as meaningless identifiers. Without readable labels,
embeddings fail to reflect the actual meaning of concepts and triples, leading to misalignment. For example,
an identifier like OBO:0003742 provides no semantic value, whereas its label: Microbial Biomass
enables a model to contextualize the concept within biological and ecological domains, improving
similarity computation and alignment accuracy. Concept labels alone can be ambiguous without context.
For example, "cell" refers to a biological unit or a prison room, highlighting the need to extract labels
and definitions. Definitions provide crucial disambiguation, improving alignment with expert-curated
ontologies. For instance, an LLM-generated concept labeled "Microbial Activity" without a definition
may be difficult to align with the ENVO ontology’s "Microbial Biogeochemical Process," defined in
the ENVO ontology as "A process mediated by microbial activity influencing the transformation of
chemical compounds in an ecosystem." Extracting both labels and definitions ensures a more accurate
semantic comparison.</p>
          <p>Our framework extracts ontology concepts by parsing class labels and their associated definitions.
Definitions are key in concept matching by resolving ambiguities and standardizing variations.
Disambiguation ensures that terms with multiple meanings are classified correctly (e.g., "cell" as a biological
unit vs. a prison room). At the same time, standardization aligns different representations of the same
concept (e.g., "CO2" vs. "carbon dioxide").</p>
          <p>We developed an automated extraction pipeline to extract concepts and their definitions from LLM
ontologies represented in Turtle (TTL) format. The pipeline applies regular expressions to identify
ontology classes (owl:Class) and extract their corresponding labels and definitions (rdfs:comment).
The extracted concepts and their definitions are stored in a structured dictionary; each class is paired
with its corresponding definition or labeled as an empty string if missing. In our analysis, we observed that
LLM-generated ontologies sometimes contain duplicate classes with identical labels, leading to
redundant entries. To prevent inflating the results, we implemented a filtering step to remove these duplicates
before storing the processed data in JSON format for further analysis. For example, in the Carbon &amp;
Nitrogen Cycling Ontology, we extracted the concept: "Forest Ecosystem": "An ecosystem
dominated by trees and other vegetation, playing a key role in carbon and
nitrogen cycling."</p>
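          <p>A minimal, self-contained sketch of this extraction step is given below. It is illustrative only: it assumes simple Turtle statements with double-quoted rdfs:label and rdfs:comment literals, whereas the actual pipeline handles the full TTL syntax.</p>
          <preformat>
```python
import re

def extract_concepts(ttl_text):
    """Map each owl:Class label to its rdfs:comment definition ("" if missing),
    skipping duplicate labels so repeated classes do not inflate results."""
    concepts = {}
    # Turtle statements end with " ."; keep only owl:Class declarations.
    for stmt in ttl_text.split(" ."):
        if "owl:Class" not in stmt:
            continue
        label = re.search(r'rdfs:label\s+"([^"]+)"', stmt)
        comment = re.search(r'rdfs:comment\s+"([^"]+)"', stmt)
        if label and label.group(1) not in concepts:
            concepts[label.group(1)] = comment.group(1) if comment else ""
    return concepts
```
          </preformat>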
          <p>Similarly, a pipeline was developed to extract concepts and their definitions from reference ontologies
using the BioPortal API. Our pipeline retrieves ontology classes from OBOE-SBC, ENVO, and CHEBI
repositories by making iterative API requests. The data extraction process involves querying the
API, parsing JSON responses to extract concept labels and definitions, and handling pagination to
ensure the retrieval of all available entries. To ensure meaningful semantic content in the extracted
concepts, we excluded blank nodes (BNodes), as they often lack clear labels or definitions. Additionally,
we removed UUID-like alphanumeric strings using regular expression filtering, as these randomly
generated identifiers do not contribute to the ontology’s conceptual structure. Concepts containing
ORCID IDs, ontology prefixes (e.g., foodon:01234 or CHEBI:12345), or database-specific notations
were replaced with readable terms by retrieving the class label from their corresponding data sources.
For example, instead of retaining CHEBI:15377, we used API-based label retrieval to replace it with its
human-readable name, "Water." This ensured that the extracted concepts remained interpretable and
useful for semantic matching. The extracted data is stored in structured JSON files for further analysis.</p>
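          <p>The filtering described above can be approximated as follows; this is a sketch, and label_lookup is a hypothetical stand-in for the API-based label retrieval.</p>
          <preformat>
```python
import re

# UUID-like random identifiers carry no semantic content and are dropped.
UUID_RE = re.compile(r'^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-'
                     r'[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$')
# CURIE-style ids such as CHEBI:15377 or foodon:01234 are replaced by labels.
CURIE_RE = re.compile(r'^[A-Za-z_]+:\d+$')

def clean_concepts(raw_terms, label_lookup):
    """Drop UUID-like strings and swap CURIE identifiers for readable labels."""
    cleaned = []
    for term in raw_terms:
        if UUID_RE.match(term):
            continue
        if CURIE_RE.match(term):
            term = label_lookup.get(term, term)  # e.g. CHEBI:15377 becomes "Water"
        cleaned.append(term)
    return cleaned
```
          </preformat>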
          <p>The extracted set C1 from the LLM-generated ontology consists of concepts paired with their
corresponding definitions. The set from the expert-curated reference ontology is denoted as C2 (as
shown in Figure 1).</p>
          <p>Triple Extraction: We extract subject–predicate–object (SPO) triples from the ontology and obtain
their human-readable forms by extracting labels for each entity in the triple. For example, the triple
(OBO:0003742) - [obo:RO_0002234] → (OBO:0000270) is transformed into (Microbial
Biomass) - [is affected by] → (Dissolved Organic Carbon).</p>
          <p>We developed an automated extraction pipeline to extract triples from LLM-generated ontologies
represented in Turtle (TTL) format. The extracted triples include (a) Class Hierarchies (subClassOf
and is a relationships), (b) Object Properties (links between ontology concepts), and (c) Data
Properties (attributes associated with ontology entities). The extraction process begins with domain and
range identification to determine property domain and range constraints, specifying the types of
entities a property can connect. Using this structured information, we then proceed to construct
SPO triples; for instance, if the property "is consumed by" is defined with "Trace Gas" as its
domain and "Microbial Community" as its range, the extracted triple would be: (Trace Gas)
-[is consumed by]-&gt; (Microbial Community). The final set of triples is stored in structured
CSV files for further ontology matching. Additionally, the pipeline identifies hierarchical
relationships, extracting subClassOf relationships that define taxonomic structures within the ontology:
(Methane Production) -[subClassOf]-&gt; (Carbon Cycling Process). We also capture "is
a" (rdf:type) relationships, which categorize entities into specific classes, such as (North Sea)
-[is a]-&gt; (Marine Ecosystem). The examples presented above were extracted from the Carbon
and Nitrogen Cycling Ontology.</p>
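          <p>The construction of SPO triples from domain/range constraints and class hierarchies can be sketched as follows; the data structures are illustrative, not our exact pipeline.</p>
          <preformat>
```python
def build_spo_triples(object_props, subclass_pairs):
    """Assemble SPO triples from object-property (domain, range) constraints
    and subClassOf pairs, mirroring the two triple kinds described above."""
    triples = [(dom, prop, rng) for prop, (dom, rng) in object_props.items()]
    triples += [(child, "subClassOf", parent) for child, parent in subclass_pairs]
    return triples

# Example inputs modeled on the Carbon and Nitrogen Cycling ontology:
example = build_spo_triples(
    {"is consumed by": ("Trace Gas", "Microbial Community")},
    [("Methane Production", "Carbon Cycling Process")],
)
```
          </preformat>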
          <p>Similarly, a pipeline was developed to extract triples from reference ontologies using the BioPortal
API. Our pipeline retrieves ontology triples from the OBOE-SBC, ENVO, and CHEBI repositories by
making iterative API requests. The data extraction process involves querying the API and handling the
same types of triples: (a) Class Hierarchies (subClassOf and is a relationships), (b) Object Properties,
and (c) Data Properties. We applied the same filtering mechanisms as in concept extraction to ensure
semantic relevance. The extracted set of triples from the LLM-generated ontology is T1, while the
expert-curated reference ontology set is T2 (see Figure 1).</p>
        </sec>
        <sec id="sec-3-5-2">
          <title>3.2. Ontology Concept and Triple Matching</title>
          <p>Concept Matching: We match ontology concepts by comparing class labels and their definitions across
LLM-generated ontologies and reference ontologies. To achieve this, we employ a
concatenation-based embedding strategy, where the concept name and its definition are merged into a single
text representation before generating an embedding. Each concept is formatted as: "concept
tokenizer.sep_token definition". This approach allows the model to process the concept
and its associated definition simultaneously. Including a separator token explicitly signals the boundary
between the concept label and its definition, helping the embedding model distinguish and appropriately
weigh the semantic contributions of each component. Thus, instead of merging labels and definitions
into one potentially ambiguous sentence, our concatenation approach ensures accurate
contextualization and improved embedding quality. Our concept matching pipeline utilizes all-MiniLM-L6-v2
[37], a pre-trained sentence transformer model, to generate fixed-size vector embeddings for both
LLM-generated and reference ontology concepts. We selected all-MiniLM-L6-v2 as our embedding
model due to its lightweight architecture, efficiency, and strong performance in semantic similarity
tasks. This model generates 384-dimensional sentence embeddings, effectively capturing the semantic
meaning of the text while maintaining a compact size of 22MB. Its efficiency makes it suitable for handling
large-scale ontology matching without requiring extensive computational resources. Furthermore,
all-MiniLM-L6-v2 has demonstrated strong semantic search, clustering, and sentence similarity
performance [38, 39]. In this process, the sets of concepts C1 and C2 are transformed into the sets of
embeddings EC1 and EC2, respectively.</p>
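          <p>The concatenation strategy itself reduces to a small formatting step, sketched below; "[SEP]" is the separator token used by BERT-style models such as all-MiniLM-L6-v2, obtained in practice from tokenizer.sep_token.</p>
          <preformat>
```python
def format_concept(concept, definition, sep_token="[SEP]"):
    """Join a concept label and its definition around the separator token so
    the embedding model sees both parts with an explicit boundary."""
    return f"{concept} {sep_token} {definition}" if definition else concept
```
          </preformat>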
          <p>Embeddings are then compared using cosine similarity, a mathematical measure that calculates
the angle between two vectors in a high-dimensional space [40]. Unlike Euclidean distance, which
measures absolute diferences, cosine similarity evaluates how directionally similar two vectors are,
making it suited for semantic comparisons. A score of 1 indicates identical meanings, while 0 suggests
no similarity. In this process, each concept-and-definition embedding from EC1 is compared against
all concept embeddings in EC2 using cosine similarity (see Figure 1). Concepts that exceed a similarity threshold
(Ƭ) (e.g., 0.50) are retained as valid matches. Concepts that fail to find a meaningful match are flagged
as hallucinations for domain experts to verify. For example, in the AquaDiva Ontology (Version 2),
the concept: "GroundwaterPrecipitation": "The process where water precipitates,
either through chemical means within groundwater systems or as a part of the
hydrological cycle impacting groundwater recharge." was matched with the concept
"PrecipitationWaterSample": "PrecipicationWater falls from the atmosphere to
earth, as rain or snow. Also, see the process called Precipitation." in OBOE-SBC
ontology with a similarity score of 0.66. Concepts such as "Trace Gas Consumption" that lacked
strong matches were flagged as hallucinations for experts to review. The final output consists of three
main components: (a) Accepted Matches - LLM-generated concepts successfully aligned with reference
ontology concepts; (b) Hallucinated Concepts - Concepts with no meaningful match, indicating potential
LLM errors that need manual verification by domain experts; and (c) Match Confidence Statistics - A
breakdown of how many LLM concepts were validated and their match distribution across reference
ontologies.</p>
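          <p>The similarity computation and thresholding can be sketched in plain Python as follows; in practice the embedding vectors come from all-MiniLM-L6-v2, and the parameter tau mirrors the threshold Ƭ = 0.50.</p>
          <preformat>
```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match_concepts(gen_embeddings, ref_embeddings, tau=0.50):
    """Split generated concepts into accepted matches and flagged candidates.
    Each argument maps a concept label to its embedding vector."""
    accepted, flagged = {}, []
    for label, emb in gen_embeddings.items():
        score, ref_label = max(
            (cosine_similarity(emb, r), name) for name, r in ref_embeddings.items()
        )
        if score >= tau:
            accepted[label] = (score, ref_label)
        else:
            flagged.append(label)  # potential hallucination for expert review
    return accepted, flagged
```
          </preformat>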
          <p>Triple Matching: We match ontology triples by comparing Subject-Predicate-Object (SPO)
relationships across LLM-generated ontologies and reference ontologies. We employ a sentence-based
embedding strategy to achieve this, where each SPO triple is converted into a natural language sentence
representation before generating an embedding. For example, (TraceGas) - [is consumed by] -&gt;
(MicrobialCommunity) is transformed to "TraceGas is consumed by MicrobialCommunity".
This approach ensures that the semantic relationships within triples are preserved, allowing the model
to process them holistically rather than as disjointed components. Our triple matching pipeline
utilizes the same model all-MiniLM-L6-v2, transforming the sets of triples T1 and T2 into the sets
of embeddings ET1 and ET2, respectively. These embeddings are then compared using cosine
similarity as well. Each triple embedding from ET1 is compared against all triples in ET2 using cosine
similarity (see Figure 1). Triples that exceed a similarity threshold (Ƭ) (e.g., 0.50) are retained as valid
matches. In contrast, triples that fail to find a meaningful match are flagged as hallucinations for
domain experts to verify. For example, the triple "Karst Groundwater is a Water" extracted from
Carbon &amp; Nitrogen Cycling ontology was matched with the following triple from the ENVO ontology:
"freshwater subclass of water" with a similarity score of 0.58 and "TraceGas is consumed
by MicrobialCommunity" was matched with "methane has role bacterial metabolite"
from CHEBI ontology with similarity score 0.52.</p>
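          <p>The sentence rendering of a triple is a direct verbalization, sketched below.</p>
          <preformat>
```python
def verbalize_triple(spo):
    """Render an SPO triple as the natural-language sentence that is embedded."""
    subject, predicate, obj = spo
    return f"{subject} {predicate} {obj}"

verbalize_triple(("TraceGas", "is consumed by", "MicrobialCommunity"))
# → "TraceGas is consumed by MicrobialCommunity"
```
          </preformat>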
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>To evaluate the alignment of LLM-generated ontologies with domain-specific reference ontologies, each
LLM-generated ontology was matched against three reference ontologies ranked by domain experts (i.e.
ecologists) in ascending order of relevance to the AquaDiva ontology domain. The matching process
proceeded in the following stages: (1) Matching with the least relevant reference ontology (OBOE-SBC),
(2) Matching with the combination of the least and second least relevant reference ontologies
(OBOE-SBC + ENVO), and (3) Matching with all three reference ontologies together (OBOE-SBC + ENVO
+ CHEBI). This stepwise approach reveals whether alignment improves with more domain-specific
references—indicating higher semantic relevance in the LLM-generated ontologies.</p>
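      <p>The stepwise procedure can be sketched as follows; match_fn is a hypothetical placeholder for the embedding-based matcher of Section 3.2.</p>
      <preformat>
```python
def stepwise_match_rates(gen_concepts, reference_stages, match_fn):
    """Match a generated ontology against cumulatively larger reference sets,
    returning the match percentage after each stage (e.g. OBOE-SBC alone,
    then +ENVO, then +CHEBI)."""
    rates, pool = [], set()
    for stage in reference_stages:  # stages in ascending domain relevance
        pool |= set(stage)
        matched = sum(1 for c in gen_concepts if match_fn(c, pool))
        rates.append(100.0 * matched / len(gen_concepts))
    return rates
```
      </preformat>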
      <sec id="sec-4-1">
        <title>4.1. Concept Matching Results</title>
        <p>The percentage of matched concepts across diferent reference ontology combinations is summarized
in Table 1. The results show that matching only with OBOE-SBC resulted in relatively low concept
match percentages across all ontologies (e.g., 46.27% for AquaDiva (Version1) and 36.94% for Carbon
and Nitrogen Cycling). Adding ENVO significantly increased the rate of matched concepts, almost
doubling the first-stage percentages. Incorporating all three reference ontologies (OBOE-SBC + ENVO + CHEBI)
led to marginal improvements beyond the second stage, with all ontologies exceeding 90% alignment.
The Carbon &amp; Nitrogen Cycling ontology achieved the highest match percentages, likely due to its
alignment with the CHEBI ontology, which classifies biologically relevant chemical compounds. Since
this ontology focuses on biochemical processes related to carbon and nitrogen cycles, its terminology
closely matches CHEBI’s structured vocabulary.</p>
        <p>Table 1. Percentage of matched concepts with OBOE-SBC for each LLM-generated ontology via NeOn-GPT: AquaDiva (Version1): 46.27%; AquaDiva (Version2): 43.36%; AquaDiva (Version3): 41.72%; Habitat: 32.53%; Role: 39.31%; Carbon and Nitrogen Cycling: 36.94%.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Triple Matching Results</title>
        <p>The percentage of matched triples across diferent reference ontology combinations is summarized in
Table 2. The results show that matching only with OBOE-SBC resulted in significantly lower match
percentages for triples compared to concepts (e.g., 15.98% for AquaDiva (Version1) and 13.29% for
Carbon and Nitrogen Cycling). Adding ENVO led to a substantial improvement in triple alignment,
with match percentages increasing by more than 40 percentage points in all cases. Including all three
reference ontologies (OBOE-SBC + ENVO + CHEBI) further improved the match percentages, though
the gain was less pronounced than in the second stage.</p>
        <p>Table 2. Percentage of matched triples with OBOE-SBC and with OBOE-SBC + ENVO for each LLM-generated ontology.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The high concept matching rates indicate that LLMs are efective at generating widely accepted
entitylevel knowledge, likely due to their ability to synthesize common terms from large-scale training
corpora. The incremental ontology matching approach revealed that as more relevant ontologies were
included, the match rate increased significantly, especially for concepts. Despite the high concept
alignment observed in our matching process, some LLM-generated concepts remained unmatched,
highlighting challenges in semantic consistency and structured representation. A manual review of
the unmatched concepts suggests that many terms were either highly specialized (i.e., highly relevant to
the AquaDiva ontology domain but not represented in reference ontologies) or too generic to align
with structured reference vocabularies. Highly specialized concepts such as "Hainich Critical
Zone" from the AquaDiva (Version 3) ontology represent valid scientific terms that reflect
domain-specific knowledge. Their absence from reference ontologies does not imply inaccuracy but rather
illustrates the potential of LLMs to surface novel or underrepresented entities relevant to the target
domain. On the other hand, generic terms like "Extreme Weather Event" in the AquaDiva (Version
1) ontology are meaningful but often not formalized in structured vocabularies.</p>
      <p>This is where human-in-the-loop validation becomes essential, enabling domain experts to assess
such unmatched concepts’ correctness, relevance, and potential value, as shown in Figure 1. For this
reason, our approach complements—rather than replaces—expert validation, helping reduce the manual
effort required. It flags unmatched concepts and triples for expert review, acknowledging that they may
represent legitimate and valuable domain knowledge that falls outside the scope of existing ontologies.</p>
      <p>Unlike concept matching, triple matching showed lower alignment rates. Similar to highly
specialized unmatched concepts, some triples remained unmatched because they were highly relevant
to the AquaDiva ontology domain only, such as the triple: (TriassicLimestone) - [is a] -&gt;
(GeologicalFormation) from the AquaDiva (Version 3) ontology. Many unmatched triples lacked
clear hierarchical or property constraints, making them dificult to align. For example, the unmatched
triple (reflects changes in) - [is a] -&gt; (ObjectProperty), (reflects changes in)
suggests a causal relationship, but standard ontologies often use more rigid property constraints, such
as has Process or affects. The absence of standardized predicates in LLM-generated ontologies
makes direct alignment with structured ontologies challenging. Unlike traditional ontology engineering
methods that rely on formal logic and domain expertise, LLMs rely on statistical correlations and
vector-based search methods rather than deductive reasoning. As a result, LLMs struggle to generate
subject-relation-object triples that conform to well-defined ontological structures. This explains why
concept alignment is significantly higher than triple alignment: while LLMs can extract and generate
entity-level knowledge effectively, they struggle to formalize structured semantic relationships.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future work</title>
      <p>In this study, we proposed an evaluation framework for assessing LLM-generated ontologies by matching
their concepts and triples against domain-specific reference ontologies, aiming to reduce the manual
verification efforts required from domain experts. The results demonstrate that while LLMs excel at
generating domain-relevant concepts, their performance declines when it comes to producing structured
relationships, as reflected in the lower triple alignment rates. Our stepwise ontology matching strategy
further confirmed that the relevance of the reference ontology significantly influences the alignment
quality, with higher alignment percentages achieved when using more domain-specific ontologies.
Future work should also investigate the potential of leveraging LLMs as the domain expert in this
pipeline, inspired by previous works that use LLM-as-a-judge [41, 42]. In our previous work, we evaluated
LLM-generated ontologies for syntactic correctness, logical consistency, common modeling pitfalls,
and structural properties [10, 9]; this work extends that evaluation to the semantic level, assessing
the alignment of concepts and relationship triples with expert-curated reference ontologies. In future
work, we plan to assess the practical utility of these ontologies through task-based evaluations, such as
their ability to support competency questions and other real-world applications, providing a deeper
understanding of their functional value. Beyond the current use case, we aim to use this framework
to evaluate and compare different ontology learning pipelines and LLMs. Additionally, we plan to
adapt the framework to support other knowledge engineering tasks, such as validating LLM-generated
knowledge graphs, semantic annotations, or taxonomy construction, helping to ensure consistency
and domain relevance across a wider range of automated knowledge modeling scenarios. Finally, the
results of this semantic evaluation suggest that future ontology generation may benefit from models
with improved contextual understanding; thus, we intend to explore the potential of Large Context
Models (LCMs) to improve hierarchical structuring in LLM-generated ontologies [43].</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>The authors thank Mr. Yihang Zhao (Department of Informatics, King’s College London) for kindly
presenting this paper at SemTech4STLD @ ESWC 2025 on their behalf.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4 and Grammarly for grammar and spelling
checks. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed
and take(s) full responsibility for the publication's content.
</p>
    </sec>
  </body>
  <back>
  </back>
</article>