<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Experts to LLMs: Evaluating the Quality of Automatically Generated Ontologies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Majlinda Llugiqi</string-name>
          <email>majlinda.llugiqi@wu.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fajar J. Ekaputra</string-name>
          <email>fajar.ekaputra@wu.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta Sabou</string-name>
          <email>marta.sabou@wu.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vienna University of Economics and Business</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Ontologies play a crucial role in knowledge representation, yet their manual construction requires domain expertise and effort. While previous work has focused on using large language models (LLMs) for assessing ontology creation, fully automated ontology generation remains underexplored. As a consequence, most research relies on a limited set of well-known ontologies or knowledge graphs, which constrains the evaluation of various tasks such as link prediction and knowledge graph completion. This highlights the need for diverse ontology benchmarks with varying characteristics, such as the number of concepts and the hierarchy depth, to evaluate these tasks effectively. In this work, we investigate the feasibility of generating ontologies using LLMs and evaluate whether they can produce ontologies of comparable quality to human-built ones. Given a seed set of concepts, a target number of concepts, relations, and maximum hierarchy depth, we employ three different LLMs to generate ontologies within the heart disease domain. Defining a seed set of concepts is particularly important for modeling the features of tabular datasets, enabling structured knowledge representation for downstream tasks. We systematically evaluate the generated ontologies by analyzing their structural integrity, semantic coherence, and suitability for downstream tasks. Our results show that while LLM-generated ontologies differ structurally from human-built ones, they remain comparable in semantic similarity and downstream ML performance, with LLaMA-generated ontologies proving to be the most effective. These findings highlight the potential of LLM-generated ontologies not only to support automated knowledge representation but also to enhance ontology benchmarks by introducing diverse structural characteristics, enabling more comprehensive evaluations of machine learning tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology Generation</kwd>
        <kwd>Domain-Specific Ontologies</kwd>
        <kwd>Large-Language Models</kwd>
        <kwd>Ontology Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Ontologies play a crucial role in structuring knowledge across various domains, enabling tasks such as
semantic reasoning, data integration, and knowledge representation. However, ontology construction
methods rely heavily on expert knowledge and manual effort, which can be time-consuming, costly and
difficult to scale [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Moreover, there is a lack of automated methods for generating domain-specific
ontologies that accurately represent datasets’ features [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Most research and applications depend on
well-known ontologies and knowledge graphs, which, while useful, do not always offer the flexibility
needed for evaluating diverse tasks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In particular, there is a need for ontologies with varying
structural characteristics to enable a more comprehensive evaluation of Machine Learning (ML) tasks
such as link prediction and knowledge graph completion [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Several efforts have explored the use of large language models (LLMs) to assist ontology engineering
[
        <xref ref-type="bibr" rid="ref1 ref5 ref6 ref7">1, 5, 6, 7</xref>
        ], aiming to address the gap mentioned above. These include generating competency questions,
completing missing ontology components, and supporting ontology alignment. However, existing
approaches do not provide a fully automated pipeline for generating high-quality ontologies with
varying characteristics from scratch. To address the need for ontologies with different characteristics,
recent work has introduced tools such as PyGraft [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which can generate synthetic ontologies with
predefined structural properties such as the number of their concepts and relations. Although these
tools are valuable for benchmarking and experimentation, they produce domain-agnostic ontologies,
making them less suitable for practical applications where domain knowledge is essential.</p>
      <p>Addressing these challenges, our work investigates the potential of LLMs to generate domain-specific
ontologies. Given a set of seed concepts, along with constraints on the number of concepts, relations,
and maximum depth, our approach aims to construct structured ontologies that accurately capture
domain knowledge, modeling datasets’ features while allowing flexibility in their characteristics. Our
research is guided by the following research questions:</p>
      <list list-type="bullet">
        <list-item><p>RQ1: To what extent do LLM-generated ontologies align with human-built ontologies in terms of
domain relevance, hierarchical organization and performance on downstream tasks?</p></list-item>
        <list-item><p>RQ2: How can LLM-generated ontologies be evaluated?</p></list-item>
        <list-item><p>RQ3: Which LLM performs best in generating ontologies that closely resemble human-built ones?</p></list-item>
      </list>
      <p>To address these research questions, we used a methodology that systematically compares
LLM-generated ontologies with a human-built ontology in the heart disease domain. Specifically, we focused
on an ontology designed to represent the features of the Heart Disease dataset<sup>1</sup>, which contains clinical and
diagnostic features relevant to cardiovascular conditions (e.g., chest pain, heart rate). We generated three
ontologies using three different LLMs, each prompted with the same request to produce an ontology
incorporating predefined medical terms, a specified number of concepts and relations, and a maximum
hierarchical depth. These were evaluated against the human-built ontology using structural (e.g.,
average degree, path length, branching factor) and semantic (e.g., information content, Jaccard similarity
for classes and relations, ontology embeddings) comparisons, as well as task-based evaluation.</p>
      <p>
        In the task-based evaluation, we followed the approach of [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], transforming the ontologies into knowledge
graphs by populating them with dataset instances and generating knowledge graph embeddings using
four different embedding methods. We then computed two embedding-based metrics, which
were used to augment the original tabular dataset, and evaluated four ML models for binary
classification. While these previous works used exclusively human-built ontologies, our study
extends this evaluation to LLM-generated ontologies to assess whether they can support downstream
tasks as effectively as human-built ones. Performance was measured using accuracy and F2-score,
providing insights into their applicability in machine learning contexts.
      </p>
      <p>Our main contribution is a new method for ontology engineering that allows for the generation of
ontologies with diverse structural and semantic characteristics, along with a systematic evaluation
framework for assessing LLM-generated ontologies. Our framework evaluates these ontologies through
structural, semantic, and task-based comparisons. We analyze graph-based properties to evaluate
hierarchical organization, use Jaccard similarity, information content, and ontology embeddings for
semantic alignment, and assess real-world applicability by evaluating their impact on a binary
classification task. Results show that while LLM-generated ontologies exhibit structural differences from
human-built ones, they remain comparable in semantic similarity and task-based performance. Among
them, LLaMA-generated ontologies prove to be the most effective, in some cases even outperforming
human-built ontologies in downstream tasks. Additionally, both human-built and LLaMA-generated
ontologies tend to form more structured knowledge hierarchies, whereas GPT and DeepSeek-generated
ontologies show different structural biases, with GPT favoring flatter structures and DeepSeek generating
deeper, more linear taxonomies. Furthermore, ontology embeddings from LLaMA closely align with
those of human-built ontologies, whereas GPT and DeepSeek-generated embeddings show greater
discrepancies, suggesting a weaker preservation of conceptual relationships.</p>
      <p>The rest of the paper is organized as follows. In Section 2, we discuss the related work, followed
by Section 3, where we present the methodology that we use for ontology generation and evaluation.
In Section 4 we describe the experiment setup, followed by the results in Section 5. Finally, we
summarize our findings and outline directions for future work in Section 6.</p>
      <p><sup>1</sup> https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In this section we review related work on ontology evaluation methods and LLMs for ontology
generation.</p>
      <p>
        Ontology Evaluation Methods. Ontology evaluation has been widely studied in knowledge
representation, with various methodologies developed to assess their quality, consistency, and applicability.
Traditional methods include crowdsourcing approaches [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and expert evaluation, both of which offer
high-quality assessments but can become costly and impractical when evaluating large and multiple
ontologies. To mitigate this, automated reasoning techniques such as HermiT [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and OntoClean [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
have been employed to ensure logical consistency and conceptual clarity in ontology development.
These approaches allow for automated verification of taxonomic and logical constraints, reducing the
reliance on human intervention. Additional evaluation metrics include completeness, graph-based
structural properties, and domain coherence.
      </p>
      <p>
        More recently, LLMs have been leveraged for ontology evaluation. Tsaneva et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposed
an LLM-driven approach for verifying ontology restrictions, demonstrating that LLMs can achieve
intermediate-to-expert performance levels on ontology modeling qualification tests.
      </p>
      <p>
        In addition to assessing structural correctness, ontology evaluation also considers the suitability
of semantic resources, such as ontologies and knowledge graphs (KGs), for downstream applications
such as knowledge graph embeddings (KGEs) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. This suggests that ontology evaluation should
extend beyond correctness checks and incorporate empirical validation through task-based performance
assessments.
      </p>
      <p>
        Raad et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] categorized ontology evaluation methods into four groups: gold standard-based,
corpus-based, task-based, and criteria-based approaches. The integration of LLMs into these
methodologies represents a promising direction for improving ontology validation, particularly in cases where
conventional expert-driven evaluation is infeasible.
      </p>
      <p>
        In this paper, we use three different ways to evaluate the ontologies generated with LLMs. Given
our focus on the structure of ontologies generated by LLMs under predefined structural constraints,
we utilize graph-based structural evaluation metrics, including average degree, path length, branching
factor, and degree distribution. Furthermore, since we aim to generate ontologies containing specific
terms for modeling datasets and possess a gold-standard ontology, we incorporate semantic metrics to
compare human-generated and LLM-generated ontologies. These metrics include Jaccard similarity,
embedding similarity, and information content. Finally, we conduct a task-based evaluation of the
ontologies, drawing on recent studies that explore the use of ontologies and KG information to enhance
ML prediction [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ].
      </p>
      <p>
        LLMs for supporting knowledge engineering. LLMs are increasingly used to support knowledge
engineering tasks. One application is the generation of Competency Questions, as demonstrated by
[
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. Another growing research area is the automatic generation of ontologies from textual or
structured data using LLMs [
        <xref ref-type="bibr" rid="ref1 ref18 ref19 ref20">1, 18, 19, 20</xref>
        ]. For instance, [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] evaluate multiple LLMs and prompting
strategies for generating OWL ontologies directly from ontological requirements, identifying GPT-4 as
the most effective among the models tested. Additionally, LLMs have been explored for knowledge
graph completion, where they assist in inferring missing links and enriching structured knowledge
bases [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ].
      </p>
      <p>
        Beyond these applications, LLMs have also been leveraged for ontology alignment, where they
facilitate the matching of concepts across different ontologies [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], and for ontology population, where
they help instantiate knowledge bases with factual data extracted from unstructured text [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. In
the realm of visual activity understanding, hybrid approaches that integrate LLMs with symbolic
reasoning have been developed to enhance explainability, generalization, and data efficiency. For
instance, the Symbol-LLM framework leverages LLMs to generate broad-coverage symbols and rational
rules, facilitating fuzzy logic-based reasoning over visual inputs [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
      </p>
      <p>Most of the existing work discussed focuses on supporting specific knowledge engineering tasks
rather than constructing the entire ontology. In contrast, our work aims to generate ontologies given
only a domain specification, a set of seed concepts, and structural constraints. This approach enables the
creation of ontologies tailored to represent tabular datasets with defined features. Moreover, it serves as
a first step toward systematically generating ontologies with varying structural and semantic properties,
which can be used to evaluate different downstream tasks, such as link prediction, knowledge graph
completion, and ML tasks.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology for LLM-Based Ontology Generation and Evaluation</title>
      <p>This section presents our methodology, with Section 3.1 focusing on the generation of ontologies using
LLMs, while Section 3.2 describes the evaluation approaches.</p>
      <sec id="sec-3-1">
        <title>3.1. Ontology Generation</title>
        <p>As shown in Figure 1, the ontology generation process is organized in three steps: prompt construction,
LLM-based ontology generation and checking for constraint satisfaction.</p>
        <p>Step 1: Prompt Construction. The process begins with the selection of the desired characteristics, which
define the structure of the ontology, and of the seed concepts, which are the concepts that we want
to model and which may be extracted from the features of a tabular dataset.</p>
        <p>
          For our experiment, we used the Heart Disease ontology from [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] (as discussed in Section 4) and
the Heart Disease Prediction dataset<sup>2</sup>. The characteristics selected for ontology generation were
the number of classes (29), the number of relations (6), and the maximum depth (5). These are the
characteristics of the human-built ontology, so that it can serve as a reference for comparison. For the seed concepts, the
heart disease dataset contains 14 features, so we created a corresponding list of 14 terms, using their
full descriptive names rather than the abbreviated versions found in the dataset (e.g., as presented on
Kaggle), as follows:
        </p>
        <p>terms = [ "age", "sex", "chest pain type", "resting blood pressure", "serum cholestoral", "fasting
blood sugar", "resting electrocardiographic results", "maximum heart rate achieved", "exercise
induced angina", "oldpeak = ST depression induced by exercise relative to rest", "the slope of the
peak exercise ST segment", "number of major vessels colored by fluoroscopy", "thallium stress",
"heartDisease" ]</p>
        <p><sup>2</sup> https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset</p>
        <p>With the selected characteristics and seed concepts defined, we constructed a prompt to guide the
LLM-based ontology generation, formatted as follows:</p>
        <p>You are a knowledge engineer specializing in ontology design. Your task is to generate a
domain-specific ontology for heart disease in Turtle format, with the following properties:
- Use the following terms as seed concepts: {terms}
- It must have exactly {num_classes} classes, no more, no less.
- It must have {num_relations} unique relations, no more, no less, with their domain and
range specified.
- The maximum hierarchy depth should be {max_depth}.
- If you struggle to fit terms while maintaining {num_classes} classes, create additional generic
but relevant classes.
- No additional commentary—output, only the ontology.
- Ensure meaningful and logically coherent class relationships.</p>
        <p>Think step by step and enumerate the number of classes and relations while structuring the
ontology, but in your response, include **only** the final ontology without any explanation, and
please make sure that it has the requested amount of classes, relations and max depth.</p>
        <p>Step 2: LLM-based Ontology Generation. After the prompt construction, we provide the
prompt to three different LLMs (detailed in Section 4) to generate ontologies.</p>
        <p>Step 3: Constraint Checking. Each generated ontology undergoes an initial validation step to ensure
it meets the predefined constraints (e.g., number of classes, relations, and maximum hierarchy depth).
If the constraints are not satisfied, the ontology is regenerated iteratively until either the requirements
are met or a threshold is reached. Once validated, the final ontology is saved for further evaluation and
comparison.</p>
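        <p>The validate-and-regenerate loop of Step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the Ontology container, the helper names (max_depth, satisfies, generate_until_valid), the convention that a root class has depth 1, and the default of five attempts are all hypothetical. The text above only states that regeneration stops once the constraints hold or a threshold is reached.</p>
        <preformat>
```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Ontology:
    """Hypothetical, already-parsed view of a generated ontology."""
    classes: set        # class names
    relations: set      # relation (object property) names
    subclass_of: list   # (subclass, superclass) pairs


def max_depth(onto: Ontology) -> int:
    """Number of classes on the longest subclass chain (assumed: roots have depth 1)."""
    children = defaultdict(list)
    has_parent = set()
    for child, parent in onto.subclass_of:
        children[parent].append(child)
        has_parent.add(child)

    def depth(node, seen):
        # 'seen' guards against cycles in a malformed hierarchy
        below = [depth(c, seen | {c}) for c in children[node] if c not in seen]
        return 1 + max(below, default=0)

    roots = onto.classes - has_parent
    return max((depth(r, frozenset({r})) for r in roots), default=0)


def satisfies(onto: Ontology, n_classes: int, n_relations: int, depth_limit: int) -> bool:
    """Exact class and relation counts, and a hierarchy no deeper than the limit."""
    return (len(onto.classes) == n_classes
            and len(onto.relations) == n_relations
            and depth_limit >= max_depth(onto))


def generate_until_valid(generate: Callable[[], Ontology],
                         check: Callable[[Ontology], bool],
                         max_attempts: int = 5) -> Optional[Ontology]:
    """Call generate() until check() passes or the attempt budget is used up."""
    for _ in range(max_attempts):
        candidate = generate()
        if check(candidate):
            return candidate
    return None  # no valid ontology within the budget
```
</preformat>
        <p>In practice, generate would wrap an LLM call followed by Turtle parsing; in this sketch it is any zero-argument callable, so the loop can be exercised with canned candidates.</p>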
        <p>The LLM-generated ontologies used in this study are publicly available on Zenodo<sup>3</sup> to facilitate
reproducibility and future research.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Ontology Evaluation</title>
        <p>To assess the quality of the generated ontologies, as illustrated in Figure 1, we employ a three-tier
evaluation approach consisting of structural, semantic, and task-based evaluation methods. These
evaluations measure structural integrity, conceptual alignment, and practical usability. Finally, the
generated ontologies are compared against a human-built ontology.</p>
        <p>The metrics used for each evaluation method are detailed in Table 1. Below, we describe each metric
in detail.</p>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Category</th><th>Metric</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>Structural</td><td>Avg Degree</td><td>Avg. edges per node: ∑ degrees / nodes</td></tr>
              <tr><td>Structural</td><td>Path Length</td><td>Mean shortest path between nodes.</td></tr>
              <tr><td>Structural</td><td>Branching Factor</td><td>Avg. subclasses per class: ∑ children / parents</td></tr>
              <tr><td>Structural</td><td>Degree Dist.</td><td>Node degree distribution (connections per class)</td></tr>
              <tr><td>Task</td><td>DistAugTab</td><td>Tabular dataset enrichment with Euclidean distance of embeddings to target class centroids.</td></tr>
              <tr><td>Task</td><td>EmbedClustAugTab</td><td>Tabular dataset enrichment with embedding vectors and k-means cluster membership.</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Structural-Based Methods: Structural-based evaluation focuses on assessing the structural
properties of the ontology as a graph. These metrics capture aspects such as connectivity, hierarchy, and
distribution of nodes and edges. The structural-based metrics include:</p>
        <list list-type="bullet">
          <list-item><p>Average degree, which measures the average number of edges (relations) per node (concept),
providing insights into overall connectivity of the ontology.</p></list-item>
          <list-item><p>Path length, which measures the mean shortest path between ontology concepts, showing how
the concepts are connected. If the ontology is not connected, it computes the average path over
weakly connected components.</p></list-item>
          <list-item><p>Branching factor, which computes the average number of subclasses per class, indicating the
hierarchy’s complexity.</p></list-item>
          <list-item><p>Degree distribution, which measures the spread of connections per class, helping understand the
potential structural imbalances.</p></list-item>
        </list>
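        <p>The four structural metrics can be illustrated on a toy hierarchy. The sketch below is not the evaluation code of the study; it assumes the ontology is reduced to a list of (subclass, superclass) pairs treated as undirected edges, and it simplifies the disconnected case by averaging over all reachable node pairs. All function names are illustrative.</p>
        <preformat>
```python
from collections import Counter, defaultdict, deque


def degrees(nodes, edges):
    """Degree of every node, counting each edge at both endpoints."""
    deg = Counter({n: 0 for n in nodes})
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def average_degree(nodes, edges):
    """Sum of degrees divided by the number of nodes."""
    return sum(degrees(nodes, edges).values()) / len(nodes)


def degree_distribution(nodes, edges):
    """Maps a degree value to the number of nodes having that degree."""
    return Counter(degrees(nodes, edges).values())


def branching_factor(subclass_of):
    """Subclass links divided by the number of classes that have a subclass."""
    parents = {parent for _, parent in subclass_of}
    return len(subclass_of) / len(parents) if parents else 0.0


def mean_shortest_path(nodes, edges):
    """Mean BFS distance over all reachable pairs of distinct nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = pairs = 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy hierarchy: Root has subclasses A and B; A has subclass C.
toy_nodes = ["Root", "A", "B", "C"]
toy_edges = [("A", "Root"), ("B", "Root"), ("C", "A")]

avg_deg = average_degree(toy_nodes, toy_edges)       # 6 / 4 = 1.5
branching = branching_factor(toy_edges)              # 3 / 2 = 1.5
path_len = mean_shortest_path(toy_nodes, toy_edges)  # 10 / 6, about 1.67
```
</preformat>
        <p>A flat ontology in which every class hangs directly under one root would have a large branching factor and a short path length, which is the kind of pattern these metrics are meant to expose.</p>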
        <p>These metrics help compare the structure of generated ontologies against human-built ones, ensuring
that LLM-generated ontologies maintain a reasonable level of hierarchy and connectivity.</p>
        <p>Semantic-Based Methods: Semantic-based evaluation assesses the conceptual alignment between
the generated ontology and its human-built counterpart. This ensures that the LLM-generated ontology
retains meaningful relationships and terminology relevant to the domain. The semantic-based metrics
include:</p>
        <list list-type="bullet">
          <list-item><p>Jaccard similarity (classes &amp; relations), which computes the overlap of concepts and of relations
between ontologies using set-based similarity measures.</p></list-item>
          <list-item><p>Ontology embedding similarity, which applies Node2Vec to transform the ontologies into an
embedding space and then computes cosine similarity to capture semantic coherence beyond exact
term matching.</p></list-item>
          <list-item><p>Information content (IC), which measures how general or specific a concept is, with more specific
concepts carrying higher IC values. The notion was originally derived from information theory [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. Resnik [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]
introduced a corpus-based approach to computing IC, where the IC of a concept <inline-formula><tex-math>c</tex-math></inline-formula> is defined
as <inline-formula><tex-math>-\log p(c)</tex-math></inline-formula>, with <inline-formula><tex-math>p(c)</tex-math></inline-formula> the probability
of encountering concept <inline-formula><tex-math>c</tex-math></inline-formula> in a given corpus. Resnik’s approach relies on external corpora to
estimate concept probabilities, making it dependent on domain-specific datasets and prone to data
sparsity issues. To overcome the dependency on external text corpora, Seco et al. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] proposed
an intrinsic IC measure that relies solely on the hierarchical structure of an ontology. Instead
of using corpus frequency, their IC metric is based on the number of hyponyms (i.e., subsumed
concepts in the taxonomy). An adaptation of their formula is shown below, where
<inline-formula><tex-math>\mathrm{descendants}(c)</tex-math></inline-formula> is the set of concepts subsumed by <inline-formula><tex-math>c</tex-math></inline-formula>
and <inline-formula><tex-math>C</tex-math></inline-formula> is the set of all concepts in the ontology:
          <disp-formula id="eq-1">
            <label>(1)</label>
            <tex-math>IC(c) = 1 - \frac{\log(|\mathrm{descendants}(c)| + 1)}{\log(|C|)}</tex-math>
          </disp-formula></p></list-item>
        </list>
        <p>
          Recent work [
          <xref ref-type="bibr" rid="ref27 ref31">31, 27</xref>
          ] has applied IC to ontology-based reasoning and explainability in AI.
        </p>
        <p>These methods provide insights into how closely the generated ontologies align with human-built
ones, capturing both direct class and relation overlap and deeper latent relationships using
embeddings.</p>
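        <p>Two of the semantic metrics are simple enough to work through by hand. The sketch below computes a Jaccard similarity over name sets and the intrinsic IC of Equation (1) on a toy hierarchy. It is an illustration only: the function names, the exact-string matching of class names, and the reading of the normaliser as the total number of classes are assumptions rather than details taken from the study.</p>
        <preformat>
```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Size of the intersection over size of the union of two name collections."""
    a, b = set(a), set(b)
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 1.0


def intrinsic_ic(classes, subclass_of):
    """IC(c) = 1 - log(|descendants(c)| + 1) / log(|classes|); needs at least 2 classes."""
    children = defaultdict(set)
    for child, parent in subclass_of:
        children[parent].add(child)

    def descendants(node):
        # Iterative traversal collecting every class below 'node'
        found, stack = set(), list(children[node])
        while stack:
            cur = stack.pop()
            if cur not in found:
                found.add(cur)
                stack.extend(children[cur])
        return found

    norm = math.log(len(classes))
    return {c: 1 - math.log(len(descendants(c)) + 1) / norm for c in classes}


# Exact-name overlap between two small class sets: 2 shared names out of 4 in total.
overlap = jaccard({"Patient", "ChestPain", "HeartRate"},
                  {"Patient", "HeartRate", "Age"})   # 0.5

# Toy hierarchy: Root has subclasses A and B; A has subclass C.
ic = intrinsic_ic({"Root", "A", "B", "C"},
                  [("A", "Root"), ("B", "Root"), ("C", "A")])
# Root subsumes everything, so IC is 0; leaves B and C get IC 1; A gets 0.5.
mean_ic = sum(ic.values()) / len(ic)                  # 0.625
```
</preformat>
        <p>Because leaves always receive an IC of 1 under this formula, an ontology with many leaf classes tends to obtain a high average IC regardless of how its upper levels are organised.</p>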
        <p>
          Task-Based Methods: Task-based evaluation assesses the practical usability of the generated
ontology by testing its effectiveness in augmenting tabular datasets for Machine Learning (ML) applications.
The idea, introduced in [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ], is to leverage information from ontologies to enrich tabular data by
adding features computed in the embedding space. These features capture latent semantic
relationships between concepts, potentially improving predictive performance in downstream tasks.
The process begins by populating the ontologies with tabular instances to obtain knowledge graphs,
associating data points with relevant ontology classes. The embeddings are then computed. To
systematically evaluate the impact of these ontology-based augmentations, we conduct experiments using
four different ML models on a binary classification task. We assess model performance using accuracy
and F2 score, with the latter being particularly relevant in disease prediction tasks, where maximizing
true positive cases is critical for effectively identifying patients at risk. The evaluation considers two
augmentation scenarios:
        </p>
        <list list-type="bullet">
          <list-item><p>DistAugTab, which computes the Euclidean distance between each instance embedding and the
centroids of target classes (e.g., disease, no disease), enriching tabular data with proximity-based
features.</p></list-item>
          <list-item><p>EmbedClustAugTab, which applies k-means clustering on the embeddings, incorporating cluster
assignments and vector embeddings as additional features to improve dataset representation.</p></list-item>
        </list>
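        <p>The DistAugTab idea can be sketched in a few lines: compute one centroid per target class in the embedding space and append each instance's distance to every centroid as a new column. The code below is a simplified illustration with made-up two-dimensional embeddings; the function names are hypothetical, and a real experiment would presumably derive the centroids from training instances only, so that test labels do not leak into the features.</p>
        <preformat>
```python
import math
from collections import defaultdict


def class_centroids(embeddings, labels):
    """Mean embedding vector per target label."""
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()}


def dist_aug_tab(rows, embeddings, labels):
    """Append the Euclidean distance to each class centroid to every tabular row."""
    cents = class_centroids(embeddings, labels)
    order = sorted(cents)  # fixed column order for the new distance features
    return [list(row) + [math.dist(vec, cents[label]) for label in order]
            for row, vec in zip(rows, embeddings)]


# Four instances with one tabular feature (age) and invented 2-d embeddings.
rows = [[63], [45], [50], [58]]
embeddings = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]  # 0 = no disease, 1 = disease

augmented = dist_aug_tab(rows, embeddings, labels)
# The centroids are (0, 1) and (4, 1), so the first row becomes
# [63, 1.0, sqrt(17)], i.e. its age plus the two distances.
```
</preformat>
        <p>EmbedClustAugTab follows the same pattern, except that the appended columns are the embedding vector itself and a k-means cluster index rather than centroid distances.</p>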
        <p>This evaluation aims to determine whether LLM-generated ontologies provide comparable results to
human-built ones when used for tabular data augmentation, and whether certain embedding methods
or ML models perform better with LLM-generated ontologies.</p>
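        <p>For reference, the F2 score used in the task-based evaluation is the F-beta score with beta = 2, which weights recall more heavily than precision. The sketch below computes it directly from true and predicted labels; the function name and the toy labels are illustrative and not taken from the experiments.</p>
        <preformat>
```python
def fbeta_score(y_true, y_pred, beta=2.0, positive=1):
    """F-beta = (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def accuracy(y_true, y_pred):
    """Share of instances whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
# TP = 2, FN = 1, FP = 1, so F2 = 10 / 15 while accuracy = 3 / 5.
f2 = fbeta_score(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```
</preformat>
        <p>With beta = 2, each missed positive case costs four times as much as a false alarm in the denominator, which matches the stated priority of identifying patients at risk.</p>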
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment Setup</title>
      <p>In this section we discuss the language models that are used to generate ontologies, followed by the
details of the human-built ontology that serves as the gold standard and of the dataset used for the task-based evaluation.</p>
      <p>LLMs used. To generate ontologies, we use three LLMs: GPT-4o<sup>4</sup>, LLaMA 3.3<sup>5</sup>, and DeepSeek R1<sup>6</sup>,
accessed via their respective APIs. These models were selected to provide a diverse comparison based
on their training methodologies, sizes and performance in structured knowledge generation tasks.
Table 2 gives an overview of the LLMs used in the experiment, their versions, sizes (in terms of
parameters) and the dates of the conducted experiments.</p>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Model</th><th>Version</th><th>Size/Parameters</th><th>Experiment Date</th></tr>
          </thead>
          <tbody>
            <tr><td>GPT-4o</td><td>GPT-4o-2024-11-20</td><td>Not disclosed</td><td></td></tr>
            <tr><td>LLaMA 3.3</td><td>Llama-3.3-70B-Instruct</td><td>70B</td><td></td></tr>
            <tr><td>DeepSeek R1</td><td>DeepSeek R1<sup>7</sup></td><td>685B</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Heart disease ontology. In the experiments, we compare the LLM-generated ontologies against
a human-built ontology (the Heart Disease Ontology) to evaluate their structural and semantic quality.
The Heart Disease Ontology is a manually crafted ontology derived from Trepan Reloaded [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. It
is designed to represent the key features found in the Heart Disease dataset<sup>8</sup>, ensuring a structured
and meaningful representation of relevant medical concepts. The ontology consists of 29 classes (e.g.,
Patient, HeartRate, ChestPain), 6 object properties (e.g., hasChestPain), and 10 data properties (e.g.,
hasAge), providing a well-defined knowledge structure for heart disease-related information.
      </p>
      <p>Heart Disease dataset. The purpose of generating ontologies with LLMs from seed concepts is to
represent the features of tabular datasets. For the task-based evaluation of the generated
ontologies, we chose the two methods for augmenting tabular data for ML
binary classification, which require the ontology to be populated into a knowledge graph with
the instances from the tabular data. For this we used the Heart Disease dataset on Kaggle, which consists of
303 instances with 14 features capturing various patient health indicators relevant to diagnosing heart
disease, such as heart rate and cholesterol.</p>
      <p><sup>4</sup> https://openai.com/index/gpt-4o-system-card</p>
      <p><sup>5</sup> https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3</p>
      <p><sup>6</sup> https://api-docs.deepseek.com/news/news250120</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this section, we present the results of the structural, semantic and task-based evaluations
of the generated ontologies. We analyze their structural characteristics, their semantic similarity to
the human-built ontology and their impact on downstream tasks.</p>
      <sec id="sec-5-1">
        <title>5.1. Structural-based results</title>
        <p>A key difference is observed in the path length, which measures the average shortest distance between
nodes/concepts. The GPT-generated ontology has the shortest path length (1.20), meaning concepts are
closer to each other and the root, reinforcing its flatter structure. The LLaMA-generated ontology comes
closer to the human-built (HB) ontology in terms of path length. In contrast, the DeepSeek-generated ontology (2.02) exceeds the HB ontology
(1.78), suggesting that it organizes concepts in a more hierarchical manner, distributing them across
multiple levels rather than clustering them near the root.</p>
        <p><sup>8</sup> https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset</p>
        <p>The branching factor, which indicates how widely concepts spread at each level, further supports
these observations. The GPT-generated ontology has the highest branching factor (4.00), meaning it
favors breadth over depth, potentially due to the model’s tendency to generate broad taxonomies when
not strictly constrained. On the other hand, the DeepSeek-generated ontology (2.00) has the lowest branching factor,
aligning closely with the human-built ontology. This suggests that DeepSeek organizes concepts into
deeper, more refined structures rather than spreading them widely.</p>
        <p>Overall, these findings highlight key differences in the way LLMs structure ontologies. The
DeepSeek-generated ontology most closely resembles the HB ontology, balancing depth and path length, likely
due to its ability to generate a more structured hierarchy. In contrast, the GPT-generated ontology has
a much flatter structure with fewer hierarchical levels and wider branching, possibly reflecting a
generative preference for more direct, less deeply nested relationships. The LLaMA-generated
ontology represents a middle ground, with a more structured hierarchy than GPT but still falling
short of the depth and organization of the HB ontology.</p>
        <p>Figure 2 presents the degree distribution of concepts across different ontologies, highlighting
structural differences. The HB ontology shows a well-balanced hierarchy, with a mix of low-degree leaf
concepts and high-degree core concepts. The GPT-generated ontology is highly skewed, with most
concepts having low degrees (1-2) and a few high-degree nodes (12), leading to a broad but shallow
structure. The LLaMA-generated ontology exhibits a more gradual degree distribution (1-7), balancing
depth and breadth better than GPT while preserving hierarchical organization. The DeepSeek-generated
ontology, while showing mostly low-degree nodes (1-4), aligns with its previously observed higher path
length, suggesting that it prioritizes deeper hierarchical structuring over broad interconnectivity rather
than being purely weakly connected. These patterns indicate that the HB and LLaMA-generated
ontologies create more structured knowledge hierarchies, whereas GPT and DeepSeek-generated ontologies
exhibit diferent structural biases—GPT towards flatter structures and DeepSeek towards deeper, more
linear taxonomies.</p>
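        <p>A degree distribution of this kind can be tabulated as in the following Python sketch, which counts, for each degree value, how many concepts have that many incident edges. The edge lists are hypothetical, and treating every relation as a single undirected edge is an assumption of the sketch.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def degree_distribution(edges):
    """Return {degree: number of concepts with that degree}, sorted by degree."""
    degree = Counter()
    for a, b in set(edges):  # exact duplicate edges are counted once
        degree[a] += 1
        degree[b] += 1
    return dict(sorted(Counter(degree.values()).items()))


# Hypothetical toy hierarchies with five concepts each.
star = [(leaf, "Root") for leaf in ("A", "B", "C", "D")]
chain = [("A", "Root"), ("B", "A"), ("C", "B"), ("D", "C")]

print(degree_distribution(star))   # many degree-1 concepts, one hub
print(degree_distribution(chain))  # only low degrees, no hub
```
]]></preformat>
        <p>The star produces a skewed distribution with one hub, while the chain produces only low degrees without a hub, which corresponds to the two structural biases contrasted above.</p>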
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Semantic-based results</title>
        <p>The last column of Table 3 shows the average Information Content (IC) values for the HB and
LLM-generated ontologies. We observe that the GPT-generated ontology has the highest overall IC value
(0.868), surpassing the human-built ontology (0.776). This suggests that GPT tends to generate concepts
with higher specificity, potentially due to its tendency to introduce highly distinct categories, though in
a flatter structure (as reflected in its low depth and high branching factor). However, this high IC does
not necessarily translate to a well-organized hierarchy, as seen in its limited depth and high-degree
generalizations.</p>
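        <p>Why a flat ontology can reach a high average IC is easier to see with a small example. The Python sketch below uses a structure-based (intrinsic) IC in which a concept with fewer descendants is more specific, IC(c) = 1 - log(descendants(c) + 1) / log(N). This formula is an assumption chosen for illustration and may differ from the IC definition used in this paper; all names and toy hierarchies are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from collections import defaultdict


def intrinsic_ic(edges):
    """Per-concept IC = 1 - log(descendants + 1) / log(N), with N concepts.

    Assumed formula for this sketch; requires at least two concepts.
    """
    children = defaultdict(set)
    nodes = set()
    for child, parent in edges:
        children[parent].add(child)
        nodes.update((child, parent))

    def descendants(node):
        found, stack = set(), [node]
        while stack:
            for c in children.get(stack.pop(), ()):
                if c not in found:
                    found.add(c)
                    stack.append(c)
        return found

    n = len(nodes)
    return {c: 1 - math.log(len(descendants(c)) + 1) / math.log(n) for c in nodes}


def mean_ic(edges):
    ic = intrinsic_ic(edges)
    return sum(ic.values()) / len(ic)


# Hypothetical toy hierarchies with five concepts each, as (child, parent) pairs.
flat = [(leaf, "Root") for leaf in ("A", "B", "C", "D")]
chain = [("A", "Root"), ("B", "A"), ("C", "B"), ("D", "C")]

print(mean_ic(flat))
print(mean_ic(chain))
```
]]></preformat>
        <p>Under this assumed formula, the flat hierarchy consists mostly of leaves with maximal IC, so its average IC is higher than that of the chain even though it is less organized hierarchically.</p>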
        <p>The LLaMA-generated ontology (IC = 0.824) also exhibits a higher IC than the HB ontology. This
aligns with its more structured hierarchy (moderate depth and balanced branching), suggesting that
LLaMA maintains a reasonable trade-off between depth and specificity, producing an ontology that is
more hierarchical while still preserving semantic richness.</p>
        <p>In contrast, the DeepSeek-generated ontology (IC = 0.757) has the lowest IC value. This suggests
that DeepSeek produces a more generalized ontology with less concept differentiation, aligning with
its deeper but less interconnected structure. The low branching factor (2.0) and longer path length
(2.02) indicate that while it organizes concepts hierarchically, it does so with a more constrained level
of semantic granularity, leading to lower IC.</p>
        <p>Figure 3 further illustrates how IC varies across depth levels, showing that LLM-generated ontologies,
particularly GPT and LLaMA, tend to reach higher IC values at mid-level depths, suggesting more
specific categorizations even within shallow structures.</p>
        <p>Figure 4 compares ontology similarities across the HB and LLM-generated ontologies.
For Jaccard similarity, there is a large gap between the LLM-generated ontologies
and the HB one. However, Jaccard similarity is a purely lexical metric
that does not account for synonyms or paraphrased labels. As a result, it may underestimate true
semantic overlap when LLMs represent equivalent concepts using different terminology. Despite this, the
LLM-generated ontologies share some classes with each other, which is due to the
predefined terms provided during ontology construction. This suggests that while LLMs incorporate
the given terms, their overall structural organization deviates from HB taxonomies.</p>
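        <p>The lexical nature of Jaccard similarity can be illustrated with the following Python sketch. The class labels are hypothetical examples rather than labels from the evaluated ontologies, and comparing exact label strings is an assumption about how the sets are formed.</p>
        <preformat><![CDATA[
```python
def jaccard(labels_a, labels_b):
    """Size of the intersection divided by size of the union of two label sets."""
    a, b = set(labels_a), set(labels_b)
    union = a.union(b)
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a.intersection(b)) / len(union)


# Hypothetical class labels; the last two pairs are near-synonyms with different names.
human_built = {"ChestPain", "Cholesterol", "BloodPressure", "HeartDisease"}
generated = {"ChestPain", "Cholesterol", "Hypertension", "CardiacDisease"}

print(jaccard(human_built, generated))
```
]]></preformat>
        <p>Only the two identically spelled labels count as shared, so related concepts with different names lower the score, which is the underestimation effect noted above.</p>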
        <p>In contrast, the Node2Vec-based cosine similarity of embeddings suggests a moderate alignment
between the HB and LLM-generated ontologies, despite their structural differences. Notably, the HB
ontology exhibits higher similarity to LLaMA (0.8332) and DeepSeek (0.8252) than to GPT (0.7734).
This finding implies that although LLM-generated ontologies do not strictly replicate human-defined
structures, they capture meaningful semantic relationships in their embeddings. As a result, these
ontologies may still hold value for downstream tasks such as knowledge graph completion and link
prediction, where semantic coherence in the embedding space is crucial.</p>
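        <p>For reference, cosine similarity between two embedding vectors can be computed as in the Python sketch below. How a single vector per ontology is obtained from the Node2Vec node embeddings is not described in this section; averaging the node vectors (mean pooling) is used here purely as an assumption, and the embedding values are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors (assumed pooling step)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Hypothetical 2-d node embeddings for two small ontologies.
ontology_a = [[1.0, 0.0], [0.0, 1.0]]
ontology_b = [[2.0, 1.0], [1.0, 2.0]]

similarity = cosine_similarity(mean_vector(ontology_a), mean_vector(ontology_b))
print(similarity)
```
]]></preformat>
        <p>Cosine similarity depends only on the direction of the vectors, not on their length, so two ontologies with different node sets can still score as similar in embedding space.</p>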
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Task-based results</title>
        <p>For task-based evaluation, we examine performance across two augmentation scenarios: DistAugTab
and EmbedClustAugTab, which are further discussed in this subsection.</p>
        <sec id="sec-5-3-1">
          <title>DistAugTab Scenario.</title>
          <p>Table 4 presents the accuracy and F2-score of various ML models trained
on datasets augmented with features derived from the Euclidean distance between each instance
embedding and the target class centroid. The centroid is computed as the mean of the embedding vectors
of all instances belonging to the target class. These embeddings are generated using different
embedding methods. This evaluation assesses how effectively LLM-generated ontologies contribute to
tabular data augmentation in downstream classification tasks.</p>
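          <p>The distance-based augmentation described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation used in the experiments: the function names, the toy embeddings and the single tabular feature per row are hypothetical.</p>
          <preformat><![CDATA[
```python
import math


def class_centroid(embeddings, labels, target):
    """Mean embedding vector of all instances whose label equals `target`."""
    members = [e for e, y in zip(embeddings, labels) if y == target]
    return [sum(col) / len(members) for col in zip(*members)]


def dist_aug_tab(rows, embeddings, centroid):
    """Append each instance's Euclidean distance to the centroid as a new feature."""
    return [row + [math.dist(e, centroid)] for row, e in zip(rows, embeddings)]


# Hypothetical data: one tabular feature per row and a 2-d embedding per instance.
rows = [[63.0], [45.0], [50.0]]
embeddings = [[0.0, 0.0], [2.0, 0.0], [4.0, 3.0]]
labels = [1, 1, 0]

centroid = class_centroid(embeddings, labels, target=1)
augmented = dist_aug_tab(rows, embeddings, centroid)
print(centroid)
print(augmented)
```
]]></preformat>
          <p>In practice the centroid would normally be computed from training instances only, so that labels of test instances do not leak into the added feature.</p>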
          <p>As expected, across all embedding methods (Node2Vec, RDF2Vec, DistMult and TransH), the HB ontology
consistently achieves the highest or near-highest accuracy and F2-score.</p>
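          <p>For readers unfamiliar with the second metric, the Python sketch below computes an F-beta score from confusion-matrix counts. It assumes that the F2-score reported here is the standard F-beta measure with beta = 2, which weights recall more heavily than precision; the counts are hypothetical.</p>
          <preformat><![CDATA[
```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts where recall (0.8) is higher than precision (about 0.67).
print(f_beta(tp=8, fp=4, fn=2))            # F2
print(f_beta(tp=8, fp=4, fn=2, beta=1.0))  # F1, for comparison
```
]]></preformat>
          <p>Because recall exceeds precision in this example, the F2 value is higher than the F1 value, which shows how the metric rewards models that miss fewer positive cases.</p>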
          <p>Among the LLM-generated ontologies, the LLaMA-generated ontology generally outperforms GPT
and DeepSeek, with DeepSeek consistently ranking the lowest. While the LLaMA-generated ontology comes
closest to the HB ontology, making it the most suitable for downstream ML tasks, the GPT-generated
ontology exhibits inconsistent performance, with lower overall accuracy and F2-score, likely due to
its flatter structure and lack of deeper hierarchical relationships. The DeepSeek-generated ontology
frequently achieves the lowest performance, particularly when DistMult and TransH are used to create
embeddings, suggesting that its structural and semantic characteristics, as well as the resulting embedding vectors,
do not translate effectively into tabular data augmentation.</p>
          <p>Interestingly, the XGBoost model, which typically performs well in tabular classification tasks, shows
greater variability in performance across different ontologies, indicating that it is more sensitive to
ontology quality than other ML models. Notably, when embeddings are generated using DistMult,
XGBoost achieves higher accuracy and F2-score with LLM-generated ontologies, especially GPT-generated,
compared to the HB ontology. This suggests that in certain cases, LLM-generated ontologies may
capture latent semantic relationships that benefit specific ML models, despite their structural limitations.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>EmbedClustAugTab Scenario.</title>
          <p>Table 5 shows the accuracy and F2-score of various ML models trained
on datasets augmented using cluster-based features derived from embeddings. Unlike the DistAugTab
scenario, where features were computed based on Euclidean distances to target class centroids, the
EmbedClustAugTab scenario enriches the dataset by incorporating K-means cluster assignments and
embedding vectors for each instance. This evaluation analyzes whether LLM-generated ontologies
provide useful representations for downstream classification tasks when cluster-based features are used
instead of distance-based features.</p>
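          <p>The cluster-based augmentation can be sketched in the same spirit. The Python example below uses a deliberately small K-means (Lloyd's algorithm with the first k points as initial centroids, an initialization chosen only to keep the sketch deterministic) and appends the cluster assignment and the embedding vector to each row. The function names, the data and the value of k are hypothetical, and the actual experiments may rely on a library implementation with different settings.</p>
          <preformat><![CDATA[
```python
import math


def kmeans(points, k, iterations=20):
    """Minimal Lloyd's algorithm; returns one cluster index per point."""
    centroids = [list(p) for p in points[:k]]  # deterministic init (assumption)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def embed_clust_aug_tab(rows, embeddings, k):
    """Append the cluster index and the embedding vector to each tabular row."""
    clusters = kmeans(embeddings, k)
    return [row + [c] + list(e) for row, c, e in zip(rows, clusters, embeddings)]


# Hypothetical data: one tabular feature per row and a 2-d embedding per instance.
rows = [[63.0], [45.0], [50.0], [58.0]]
embeddings = [[0.0, 0.0], [5.0, 5.0], [0.0, 1.0], [5.0, 6.0]]

augmented = embed_clust_aug_tab(rows, embeddings, k=2)
print(augmented)
```
]]></preformat>
          <p>Because the added cluster feature depends on how well the embedding space separates instances, this kind of augmentation is plausibly more sensitive to ontology structure than a single distance feature.</p>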
          <p>Table 5: Performance comparison of ML models across ontologies using different embedding methods:
accuracy and F2-score (mean ± std) in the EmbedClustAugTab scenario.</p>
          <p>When comparing the two augmentation approaches, performance gaps between the HB and
LLM-generated ontologies appear more pronounced in the EmbedClustAugTab scenario, especially when the
embeddings are generated using TransH. This suggests that distance-based features (DistAugTab) are
more robust to variations in ontology structure, while cluster-based features (EmbedClustAugTab) are
more sensitive to how well an ontology organizes and differentiates concepts. As a result, models in this
scenario rely more heavily on ontology quality to provide meaningful cluster assignments, highlighting
the importance of well-structured taxonomies.</p>
          <p>Regarding ML models, NNs and SVMs generally perform better across embeddings, confirming that
they benefit more from additional embedding-based features. XGBoost continues to exhibit higher
sensitivity to ontology quality, with noticeable performance variations across ontologies and embedding
methods. KNN performs relatively stably across different embedding methods but shows stronger
dependence on ontology structure, performing best with HB and LLaMA-generated ontologies.</p>
          <p>When analyzing embedding methods, Node2Vec and RDF2Vec embeddings continue to yield the
highest accuracy and F2-score, consistent with the DistAugTab scenario. DistMult and TransH
embeddings perform worse, particularly when used to generate embeddings for LLM-generated ontologies.
Additionally, the performance gap between HB and LLM-generated ontologies is wider in this scenario,
suggesting that clustering-based augmentation methods are more reliant on well-structured ontologies
than distance-based methods.</p>
          <p>In conclusion, HB ontologies still lead in performance, confirming that human-built ontologies provide
the best augmentation for ML tasks. LLaMA-generated ontologies remain the best LLM-generated
alternative, showing stronger alignment with HB than GPT and DeepSeek.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>In this paper, we present an evaluation strategy for ontologies generated with LLMs. We investigate the
potential of LLMs to generate domain-specific ontologies, given a set of concepts and characteristics, in
the domain of heart disease. We evaluated the LLM-generated ontologies using structural, semantic
and task-based metrics.</p>
      <p>For RQ1, the results indicate that LLM-generated ontologies partially align with human-built ones but
deviate in hierarchical organization and conceptual relationships. Structurally, the ontology generated
using LLaMA has more structured knowledge hierarchies, whereas the GPT- and DeepSeek-generated
ontologies exhibit different structural biases: GPT towards flatter structures and DeepSeek towards
deeper, more linear taxonomies. Semantically, they fail to maintain the same classes and relations
as the human-built ontology, as reflected in low Jaccard similarity. However, they show moderate
Node2Vec-based cosine similarity, especially the ontology generated using LLaMA. This suggests that
although they do not closely resemble human-built ontologies structurally, they can still be valuable
in embedding-based applications such as knowledge graph completion and link prediction, where
structural precision is less critical than semantic coherence. In the task-based evaluation, LLaMA-generated
ontologies perform best among the LLMs, suggesting that some LLMs may produce useful
ontologies for dataset augmentation. Notably, LLMs can be effectively used to generate ontologies when
the primary goal is to model dataset features (given as a set of terms) or to enforce specific structural
constraints, such as defining the number of classes, relations, or maximum depth.</p>
      <p>Regarding RQ2, we evaluate LLM-generated ontologies through structural, semantic, and task-based
evaluations. Graph-theoretic metrics (depth, degree, path length) evaluate hierarchical organization,
while Jaccard similarity and Node2Vec-based cosine similarity measure semantic alignment. In the
task-based evaluation, ontologies are converted into knowledge graphs, embedded using Node2Vec, RDF2Vec,
DistMult, and TransH, and used for tabular data augmentation in ML models. Performance is tested in
two augmentation scenarios: DistAugTab and EmbedClustAugTab.</p>
      <p>For RQ3, the evaluation shows that the LLaMA-generated ontology is the closest to the human-built ontology,
producing more structured and semantically meaningful ontologies. It achieves deeper hierarchies,
better preserves conceptual relationships, and consistently outperforms GPT and DeepSeek in task-based
evaluation. However, it still does not fully replicate human-built ontologies, especially in taxonomic
structuring and relation modeling. GPT-generated ontologies are inconsistent, favoring broad, flat
structures, while DeepSeek struggles the most, generating weakly interconnected hierarchies with poor
downstream usability.</p>
      <p>Limitations. While this paper provides valuable insights, its focus on a single domain limits the
generalizability of the findings to other domains. Moreover, potential risks such as LLM training data
leakage and sensitivity to prompt phrasing were not investigated, though they may influence the quality
and reliability of the generated ontologies.</p>
      <p>Future work. Future research will explore Competency Question (CQ) evaluation to assess how
well LLM-generated ontologies support domain-specific queries. Schema validation using
SHACL (Shapes Constraint Language) will also be incorporated to check structure and consistency,
providing a more rigorous evaluation of ontology quality. Additionally, we plan to extend the study
beyond the medical domain to evaluate LLM performance in other domains. Further experiments will
include a broader range of LLMs and investigate their effectiveness in generating larger ontologies,
incorporating techniques such as ontology modularization to improve scalability. We will also assess
alternative task-based evaluation approaches to better understand the practical usability of AI-generated
ontologies.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>This work was supported by the FFG SENSE (894802) and FAIR-AI (904624) projects, as well as by the
Austrian Science Fund (FWF) BILAI 10.55776/COE12 and HOnEst (V 745-N) projects. For open access
purposes, the author has applied a CC BY public copyright license to any author accepted manuscript
version arising from this submission.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4o to brainstorm ideas about the
title and to refactor code. After using this tool, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Babaei Giglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <article-title>LLMs4OL: Large language models for ontology learning</article-title>
          , in: International Semantic Web Conference, Springer,
          <year>2023</year>
          , pp.
          <fpage>408</fpage>
          -
          <lpage>427</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Confalonieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Guizzardi</surname>
          </string-name>
          ,
          <article-title>On the multiple roles of ontologies in explanations for neurosymbolic AI</article-title>
          ,
          <source>Neurosymbolic Artificial Intelligence</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hubert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Monnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>d'Aquin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Monticolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brun</surname>
          </string-name>
          ,
          <article-title>PyGraft: Configurable generation of synthetic schemas and knowledge graphs at your fingertips</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Melo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          ,
          <article-title>Synthesizing knowledge graphs for link and type prediction benchmarking</article-title>
          ,
          <source>in: The Semantic Web: 14th International Conference, ESWC</source>
          <year>2017</year>
          , Portorož, Slovenia, May 28-June 1,
          <year>2017</year>
          , Proceedings,
          <source>Part I 14</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Doumanas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soularidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spiliotopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vassilakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kotis</surname>
          </string-name>
          ,
          <article-title>Fine-tuning large language models for ontology engineering: A comparative analysis of GPT-4 and Mistral</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>15</volume>
          (
          <year>2025</year>
          )
          <fpage>2146</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alharbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Payne</surname>
          </string-name>
          ,
          <article-title>Investigating open source LLMs to retrofit competency questions in ontology engineering</article-title>
          ,
          <source>in: Proceedings of the AAAI Symposium Series</source>
          , volume
          <volume>4</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>188</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Kommineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>König-Ries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Samuel</surname>
          </string-name>
          ,
          <article-title>From human experts to machines: An LLM supported approach to ontology and knowledge graph construction</article-title>
          ,
          <source>arXiv preprint arXiv:2403.08345</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Llugiqi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Ekaputra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          ,
          <article-title>Enhancing machine learning predictions through knowledge graph embeddings</article-title>
          ,
          <source>in: International Conference on Neural-Symbolic Learning and Reasoning</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>279</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Llugiqi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Ekaputra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          ,
          <article-title>Semantic-based data augmentation for machine learning prediction enhancement</article-title>
          ,
          <source>Neurosymbolic Artificial Intelligence (under review)</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <article-title>TripleCheckMate: A tool for crowdsourcing the quality assessment of linked data</article-title>
          ,
          <source>in: Knowledge Engineering and the Semantic Web: 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013, Proceedings 4</source>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>265</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Glimm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Motik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Stoilos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>HermiT: An OWL 2 reasoner</article-title>
          ,
          <source>Journal of Automated Reasoning</source>
          <volume>53</volume>
          (
          <year>2014</year>
          )
          <fpage>245</fpage>
          -
          <lpage>269</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Guarino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Welty</surname>
          </string-name>
          ,
          <article-title>An overview of OntoClean</article-title>
          ,
          <source>Handbook on Ontologies</source>
          (
          <year>2009</year>
          )
          <fpage>201</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vasic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          ,
          <article-title>LLM-driven ontology evaluation: Verifying ontology restrictions with ChatGPT</article-title>
          ,
          <source>The Semantic Web: ESWC Satellite Events</source>
          <year>2024</year>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kejriwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Knoblock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szekely</surname>
          </string-name>
          ,
          <article-title>Knowledge graphs: Fundamentals, techniques, and applications</article-title>
          , MIT Press,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Raad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <article-title>A survey on ontology evaluation methods</article-title>
          ,
          <source>in: International Conference on Knowledge Engineering and Ontology Development</source>
          , volume
          <volume>2</volume>
          , SciTePress,
          <year>2015</year>
          , pp.
          <fpage>179</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alharbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Payne</surname>
          </string-name>
          ,
          <article-title>An experiment in retrofitting competency questions for existing ontologies</article-title>
          ,
          <source>in: Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1650</fpage>
          -
          <lpage>1658</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rebboud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tailhardat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <article-title>Can LLMs generate competency questions?</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mateiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Groza</surname>
          </string-name>
          ,
          <article-title>Ontology engineering with large language models</article-title>
          ,
          <source>in: 2023 25th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fathallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Giorgis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poltronieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kovriguina</surname>
          </string-name>
          ,
          <article-title>NeOn-GPT: A large language model-powered pipeline for ontology learning</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Lippolis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceriani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zuppiroli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Nuzzolese</surname>
          </string-name>
          ,
          <article-title>Ontogenia: Ontology generation with metacognitive prompting in large language models</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>265</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
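          <string-name>
            <given-names>M. J.</given-names>
            <surname>Saeedizade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Blomqvist</surname>
          </string-name>
          ,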
          <article-title>Navigating ontology development with large language models</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>161</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Making large language models perform better in knowledge graph completion</article-title>
          ,
          <source>in: Proceedings of the 32nd ACM International Conference on Multimedia</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>233</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Hughes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Llugiqi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Polat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Ekaputra</surname>
          </string-name>
          , et al.,
          <article-title>Knowledge-centric prompt composition for knowledge base construction from pre-trained language models</article-title>
          ,
          <source>in: KBC-LM/LMKBC@ISWC</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H.</given-names>
            <surname>Babaei Giglou</surname>
          </string-name>
          ,
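          <string-name>
            <given-names>J.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Engel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,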
          <article-title>LLMs4OM: Matching ontologies with large language models</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>C.</given-names>
            <surname>Saetia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Phruetthiset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chalothorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lertsutthiwong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Taerungruang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buabthong</surname>
          </string-name>
          ,
          <article-title>Financial product ontology population with large language models</article-title>
          ,
          <source>in: Proceedings of TextGraphs17: Graph-based Methods for Natural Language Processing</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Symbol-LLM: leverage language models for symbolic system in visual human activity reasoning</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>29680</fpage>
          -
          <lpage>29691</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>R.</given-names>
            <surname>Confalonieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Weyde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Besold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>del Prado Martín</surname>
          </string-name>
          ,
          <article-title>Using ontologies to enhance human understandability of global post-hoc explanations of black-box models</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>296</volume>
          (
          <year>2021</year>
          )
          <fpage>103471</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <source>A First Course in Probability</source>
          , Macmillan,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Resnik</surname>
          </string-name>
          ,
          <article-title>Using information content to evaluate semantic similarity in a taxonomy</article-title>
          ,
          <source>arXiv preprint cmp-lg/9511007</source>
          (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>N.</given-names>
            <surname>Seco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Veale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <article-title>An intrinsic information content metric for semantic similarity in WordNet</article-title>
          ,
          <source>in: ECAI</source>
          , volume
          <volume>16</volume>
          ,
          <year>2004</year>
          , p.
          <fpage>1089</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Batet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Isern</surname>
          </string-name>
          ,
          <article-title>Ontology-based information content computation</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>24</volume>
          (
          <year>2011</year>
          )
          <fpage>297</fpage>
          -
          <lpage>303</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>