<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Evaluating Knowledge Graph Construction and Ontology Learning with LLMs without Test Data Leakage</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The use of Large Language Models (LLMs) is becoming increasingly popular for knowledge graph construction and ontology learning. Very often, methods and tools using LLMs for those tasks are evaluated on existing knowledge graphs and ontologies, which are publicly available on the Web. For very popular ontologies and knowledge graphs, there may be additional material such as tutorials and publications. Thus, it can be assumed that the test data has been seen by the LLM, and it is questionable whether the results transfer to the case of unseen data (which is where those models are intended to be employed). In this paper, we propose a different method of evaluating LLMs for knowledge graph construction and ontology learning. We suggest using a secondary LLM to create test data for one-time use on the fly. This also allows for repeating experiments and computing standard deviations and confidence intervals, which facilitates additional statements about the robustness of different approaches. We demonstrate our suggested approach on two original ontologies, and discuss different observations when comparing results between original and generated test data.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Ontology Learning</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Data Leakage</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>While a considerable amount of work has been conducted on using LLMs for ontology learning, the
evaluation is most often performed on well-known, public ontologies.</p>
      <p>The LLMs4OL challenge, conducted for the first time at the International Semantic Web Conference
in 2024, uses WordNet, GeoNames, UMLS, the Gene Ontology (GO), FoodOn, and schema.org as test
ontologies [15]. The 2025 edition¹ added the Ontology for Biomedical Investigations (OBI), the Material
Ontology (MatOnto), the Semantic Web for Earth and Environmental Terminology (SWEET), the
Human Disease Ontology (DOID), the PROcess Chemistry Ontology (PROCO), and the Plant Ontology
(PO), all of which are publicly available and widely used ontologies. Similarly, the OntoURL benchmark
uses a larger set of publicly available ontologies, including many of the aforementioned ones [16].</p>
      <p>The KBC-LM challenge, conducted for the first time at the International Semantic Web
Conference, has used relations from Wikidata throughout all iterations [17]. Evaluation datasets
proposed in other papers use taxonomies such as those from arxiv.org and Wikipedia [18], or public
ontologies such as DOREMUS, Polifonia, DemCar, Odeuropa, NORIA-O, or FIBO [19].</p>
      <p>Another strand of work (e.g., [20, 21]) does not evaluate the generated ontologies against a ground
truth, but rather uses quality metrics such as those defined by the ontology pitfall scanner (OOPS!) [22].
While this avoids the data leakage problem, it can only rate the compliance of LLM-based approaches
with ontology engineering guidelines and find general issues such as taxonomy cycles; it does not
take the actual semantics of the generated ontologies into account.</p>
      <p>Overall, we see that there is no easy way to evaluate how well LLM-based approaches work for
ontology learning on unseen data.</p>
      <p>The approach in this paper proposes to use synthetic ontologies as benchmarks for ontology learning,
which are created dynamically for an experiment and not reused afterwards. While synthetic
ontologies have been proposed for other benchmarking purposes, such as reasoning [23, 24], knowledge graph
completion [25], machine learning over knowledge graphs [26], or querying [27, 28, 29], the approach
pursued in this paper differs in two aspects: such approaches do not exist for ontology learning,
and generation at runtime has not been the focus so far (in fact, most synthetic benchmarks
are public, and usually, researchers reuse public synthetic benchmarks instead of recreating fresh ones).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>In order to overcome the data leakage problem in evaluating LLM-based ontology learning tools, we
propose a scheme based on the GET methodology [14], as shown in Figure 1. It foresees using a
large language model to generate synthetic ontologies for one-time usage. In detail, the pipeline has
the following steps:
1. From an original ontology, we extract key characteristics, such as the number of classes and
properties.
2. The extracted characteristics are used to prompt an LLM to generate a set of synthetic ontologies
resembling the original one. We propose two variants: (a) generating ontologies in the same
domain, and (b) generating ontologies in related domains.
3. The result is a set of synthetic ontologies generated on the fly. We assume that
they were not part of the LLM training data.
4. The synthetic ontologies are used as benchmarks for testing LLM-based ontology learning tools.
5. The results are collected. Since multiple similar ontologies can be generated, the approach also
allows for assessing the stability of the results in addition to metrics such as precision and recall
(e.g., by computing standard deviations across all generated ontologies).
6. After running the experiments, the synthetic ontologies should not be reused, but they can be
made public in a research data repository to foster reproducibility.
¹https://sites.google.com/view/llms4ol2025/home</p>
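      <p>As an illustrative sketch (not the paper's implementation; the function names, the dictionary-based ontology representation, and the placeholder F1 score are our assumptions), the six steps could be orchestrated as follows, with stubs standing in for the two LLM calls:</p>
      <p>
```python
import statistics

# Step 1: extract key characteristics from the original ontology
# (assumption: the ontology is a plain dict with "classes" and "properties" lists).
def extract_characteristics(ontology):
    return {"classes": len(ontology["classes"]),
            "properties": len(ontology["properties"])}

# Steps 2-3: stub standing in for prompting a generator LLM (temperature > 0).
def generate_synthetic_ontologies(characteristics, n):
    return [f"synthetic ontology {i} with ~{characteristics['classes']} classes"
            for i in range(n)]

# Step 4: stub standing in for running the LLM-based ontology learning tool
# and scoring its output against the synthetic gold ontology.
def run_tool_and_score(synthetic_ontology):
    return 0.5  # placeholder F1 score

# Steps 5-6: aggregate over all generated ontologies, then discard them.
def evaluate(original, n=3):
    chars = extract_characteristics(original)
    scores = [run_tool_and_score(o)
              for o in generate_synthetic_ontologies(chars, n)]
    return statistics.mean(scores), (statistics.stdev(scores) if n > 1 else 0.0)
```
      </p>
      <p>Since step 5 aggregates over several generated ontologies, the return value carries both a mean score and a standard deviation, matching the stability assessment described above.</p>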
      <p>[Figure 1: Overview of the proposed pipeline. The original ontology yields key characteristics, from which an LLM generates synthetic ontologies; these are fed to the LLM-based tool under test, whose outputs are scored with evaluation metrics. The numbered edges correspond to steps 1–6 above.]</p>
      <p>In step 2, in order to generate different ontologies, we propose using a temperature above 0. Moreover,
we propose to use an LLM which is not used by any tool in step 4.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In order to test the proposed approach, we conducted experiments with two ontology learning tasks,
i.e., taxonomy induction and domain/range induction.</p>
      <sec id="sec-4-1">
        <title>4.1. Ontology Generation</title>
        <p>To test the proposed approach, we started from two well-known ontologies, the Pizza ontology² and
the Wine ontology³. For each of those, we had an LLM create three replicas within the same domain,
and three in adjacent domains (pasta, sushi, and curry dishes for the pizza ontology, and beer, whiskey⁴,
and gin for the wine domain).⁵ The prompts used for generating the synthetic ontologies, as well as for
learning subclass and domain/range axioms, are shown in the appendix.</p>
        <p>Statistics on the generated ontologies are shown in Table 1. We can make multiple observations here.
First, while the LLM does a good job at creating an exact small number of items (here: properties), there
is more variation for the larger numbers (here: classes). Second, while in the original ontologies some
properties do not have a defined domain or range, this never occurs in the generated ones, even though
this has been explicitly permitted in the prompt used to generate the ontologies. Moreover, in most
cases, the domain of all properties is the central class (e.g., pizza or wine).
²https://protege.stanford.edu/ontologies/pizza/pizza.owl
³https://www.w3.org/TR/owl-guide/wine.rdf
⁴Running the experiment with whisky and comparing the results to those with whiskey is left as an exercise to the reader.
⁵While we selected the adjacent domains by hand, it would also be possible to prompt an LLM for those for full automation.</p>
        <p>[Table 1: # classes, # properties, # subclass axioms, # domain axioms, and # range axioms for the original and generated ontologies.]</p>
        <p>Table 2 shows the similarity of the original and generated ontologies in terms of overlapping classes
and properties. It can be observed that the generated ontologies are very different from the original
ontologies in that respect, and that the different generated ontologies are also reasonably different from
one another.</p>
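        <p>Such an overlap can be quantified, for instance, with the Jaccard similarity of the class (or property) name sets; a minimal sketch (the class names below are invented examples, not taken from the generated ontologies):</p>
        <p>
```python
def jaccard(a, b):
    """Jaccard similarity of two sets of class or property names."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

original  = {"Pizza", "Topping", "CheeseTopping", "Base"}
generated = {"Pizza", "Topping", "Sauce", "Crust"}
print(jaccard(original, generated))  # 2 shared of 6 distinct names -> 0.3333333333333333
```
        </p>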
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Ontology Learning Evaluation</title>
        <p>We evaluate two tasks, i.e., subclass axiom induction and domain/range induction by LLMs, and we use
three LLMs of different sizes for that task: Llama-8B, Llama-70B, and Mistral Large Instruct (123B) at a
temperature of 0 for reproducibility. The ontologies themselves are generated using Gemma-27B with
a temperature of 0.5 to ensure variance in the generated ontologies.</p>
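        <p>For both tasks, scoring can be done by exact matching of gold and predicted axiom sets (a generic sketch; the actual matching and normalization used in the paper may differ, and the axiom triples below are invented examples):</p>
        <p>
```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of axioms."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Axioms as (subject, axiom type, object) triples.
gold = {("Margherita", "subClassOf", "Pizza"), ("CheeseTopping", "subClassOf", "Topping")}
pred = {("Margherita", "subClassOf", "Pizza"), ("Margherita", "subClassOf", "Food")}
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```
        </p>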
        <p>Since the original ontologies were not fully materialized, we (a) materialized the domain/range axioms
for inverse properties, and (b) added subclass axioms for equivalent restriction definitions (i.e., for
A ≡ B ⊓ C, we added A ⊑ B and A ⊑ C) before evaluating the generated domain/range and subclass
axioms. Both sets of materialized axioms are included in the counts in Table 1. All generated ontologies
and axioms output by the different models are available online.⁶</p>
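        <p>The two materialization steps can be sketched over a toy axiom representation (this is an illustration under our own data-structure assumptions, not the code used for the experiments): for a declared inverse property, domain and range are swapped; for A ≡ B ⊓ C, the subclass axioms A ⊑ B and A ⊑ C are emitted.</p>
        <p>
```python
def materialize_inverse(domains, ranges, inverses):
    """If q is declared the inverse of p, materialize domain(q) = range(p)
    and range(q) = domain(p) (only where q has no axiom yet)."""
    for p, q in inverses:
        if p in ranges:
            domains.setdefault(q, ranges[p])
        if p in domains:
            ranges.setdefault(q, domains[p])
    return domains, ranges

def materialize_equivalences(equivalences):
    """For each A == B AND C (equivalent-class definition with a conjunction),
    emit the subclass axioms A <= B and A <= C."""
    return [(cls, parent) for cls, conjuncts in equivalences for parent in conjuncts]

doms, rngs = materialize_inverse({"hasTopping": "Pizza"},
                                 {"hasTopping": "Topping"},
                                 [("hasTopping", "isToppingOf")])
subs = materialize_equivalences([("CheesyPizza", ["Pizza", "HasCheeseRestriction"])])
print(doms["isToppingOf"], rngs["isToppingOf"])  # Topping Pizza
```
        </p>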
        <p>The results are shown in Table 3 for subclass induction, Table 4 for domain induction, and Table 5 for range
induction.</p>
        <p>Before analyzing the results in more detail, we want to point out two cases observed frequently
throughout the entire evaluation:
• Results where both recall and precision are 0 are in most cases due to the LLM answering in a
completely different format than the one requested. A second cause can be observed in particular
for Llama-8B, which often mixes up domain and range and outputs property ranges instead of
domains, which is why Llama-8B often has zeros in the domain induction task.
• Results with a very high recall and very low precision: occasionally, the LLMs output cross
products of properties and classes for domains or ranges, or redundantly include all subclasses of
the actual domain/range class.⁷
⁶https://github.com/HeikoPaulheim/llm-ontology-learning</p>
        <p>[Tables 3, 4, and 5: precision, recall, and F1 of the three models for subclass, domain, and range induction on the original and generated ontologies.]</p>
        <p>When looking into the results in more detail, we can make a number of further observations:
• The results on the original ontologies are often worse than those on the generated ones. There are
at least three possible explanations: (a) the “mental models” of the generating and the evaluating
LLMs are more aligned (i.e., LLMs have a certain shared understanding of a given domain), (b)
the original ontologies, which were created for instructive purposes, contain more corner cases,
and (c) in contrast to most generated ontologies, the original ones contain properties without
explicit domain and range definitions, while the LLM almost always returns a definition for each
property, despite being explicitly prompted that this is optional, leading to a larger number of false
positives on the original ontologies.
• The results in related domains are generally worse than those in the original domain, especially
in the tasks based on the wine ontology (i.e., the beer, gin, and whiskey ontologies). This may hint at
the LLMs having gathered a part of their ontology engineering knowledge from the wine ontology
and related tutorial materials.
• The order of tools by performance is not the same across test sets. For example, while Llama-70B is
superior to Mistral Large on almost all tasks on the original ontologies, Mistral Large outperforms
Llama-70B on many of the generated ontologies (both in the same and in similar domains). This may
hint at Llama-70B’s results being an effect of memorization to a larger extent than Mistral Large’s.
• The standard deviation is often considerable, showing that the approaches are not very stable,
that good results can also be the result of a lucky coincidence, and that results of the same quality
cannot be guaranteed on unseen data.</p>
        <p>Overall, we see that with the proposed methodology, we can obtain more in-depth results than by only
evaluating on the two original ontologies.</p>
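        <p>The aggregation into mean, standard deviation, and a confidence interval over the repeated runs can be sketched as follows (the paper reports standard deviations; the normal-approximation 95% confidence interval and the F1 values below are our illustrative additions):</p>
        <p>
```python
import math
import statistics

def summarize(scores, z=1.96):
    """Mean, sample standard deviation, and approximate 95% CI of repeated scores."""
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)            # sample std dev across generated ontologies
    half = z * sd / math.sqrt(len(scores))   # normal approximation of the 95% CI
    return m, sd, (m - half, m + half)

f1_runs = [0.61, 0.55, 0.64]                 # invented F1 scores on three synthetic ontologies
m, sd, ci = summarize(f1_runs)
print(round(m, 3), round(sd, 3))  # 0.6 0.046
```
        </p>
        <p>For the small numbers of repetitions used here, a t-based interval would be wider and more appropriate; the normal approximation is kept only for simplicity.</p>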
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Outlook</title>
      <p>Test data leakage is an overlooked issue when evaluating LLM-based tools for ontology learning on
publicly available ontologies and knowledge graphs. In this paper, we have proposed an alternative
methodology: instead of evaluating against publicly available ontologies, we propose to generate test
ontologies on the fly for one-time evaluations. We have demonstrated the approach on the tasks of
taxonomy and domain/range induction, showing that it is possible to evaluate, and also assess the
robustness of, LLM-based induction mechanisms.</p>
      <p>While we assume that the generated ontologies are not seen by the LLMs during training, this
assumption may be partially wrong – in case the generating LLM reproduces parts of an existing
ontology, those may in fact have been seen by the LLM. One important task of future work is therefore
applying data leakage metrics [30] to the generated data to assess the degree of freshness of the generated
synthetic ontologies.</p>
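      <p>One very simple freshness check (our illustrative sketch, not one of the metrics of [30]) is measuring which fraction of a generated ontology's terms already occurs verbatim in a known public ontology; the term sets below are invented examples:</p>
      <p>
```python
def term_overlap(generated_terms, public_terms):
    """Fraction of generated class/property names that also occur in a public
    ontology; a high value suggests the generator may have reproduced seen material."""
    generated, public = set(generated_terms), set(public_terms)
    return len(generated & public) / len(generated) if generated else 0.0

print(term_overlap({"Pizza", "MargheritaPizza", "SpicyBeefTopping"},
                   {"Pizza", "MargheritaPizza", "Country"}))  # 0.6666666666666666
```
      </p>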
      <p>One of the striking observations of this work was that the results observed on synthetic ontologies
are often better than those on the original, human-generated ones. This deserves a deeper analysis.
One possible reason we postulated was that different LLMs have a stronger alignment of their “mental
models” of a domain, an assumption that deserves further analysis, e.g., by swapping the LLMs used
for generation and evaluation, and comparing the generated ontologies to one another. Moreover, we
will think of approaches to assess the difficulty of the ontology learning tasks on the original and the
generated ontologies, respectively, and to experiment with different prompts for controlling the task
difficulty.
⁷In future work, we will catch the latter issue programmatically, and filter out those correct, but redundant axioms.</p>
      <p>On the practical side, future work will consist of wrapping the approach in an end-to-end evaluation
pipeline. Further experimentation will go into the generation of ontologies, e.g., controlling the
complexity and difficulty of the generated ontologies, and conducting experiments with different
generation models.</p>
      <p>So far, we have looked into taxonomy and domain/range induction, but the approach might be
interesting for various other tasks, such as the learning of more complex restrictions, detection of
property characteristics (transitivity, inverse properties, etc.), or entity typing.</p>
      <p>Overall, we hope that this mode of evaluation will be used at least in addition to the currently dominant
paradigm of evaluating using publicly available ontologies, in order to analyze tool performance in a
setup free from test data leakage.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>The experiments have been run using the Chat AI service provided by GWDG [31].</p>
    </sec>
    <sec id="sec-7a">
      <title>References [12]–[31]</title>
      <p>[12] Y. Cheng, Y. Chang, Y. Wu, A survey on data contamination for large language models, arXiv preprint arXiv:2502.14425 (2025).
[13] H. T. Mai, C. X. Chu, H. Paulheim, Do LLMs really adapt to domains? An ontology learning perspective, in: International Semantic Web Conference, Springer, 2024, pp. 126–143.
[14] H. Paulheim, Ontologies, Knowledge Graphs, and LLMs: How Do We GET Evaluations Done Right?, in: International Semantic Web Conference, Posters and Demos, 2025.
[15] H. B. Giglou, J. D’Souza, S. Auer, LLMs4OL 2024 overview: The 1st large language models for ontology learning challenge, arXiv preprint arXiv:2409.10146 (2024).
[16] X. Zhang, H. Lai, Q. Meng, J. Bos, OntoURL: A benchmark for evaluating large language models on symbolic ontological understanding, reasoning and learning, arXiv preprint arXiv:2505.11031 (2025).
[17] J.-C. Kalo, T.-P. Nguyen, S. Razniewski, B. Zhang, Preface: LM-KBC challenge 2024, in: 2nd Workshop on Knowledge Base Construction from Pre-Trained Language Models, CEUR-WS.org, 2024.
[18] A. Lo, A. Q. Jiang, W. Li, M. Jamnik, End-to-end ontology learning with large language models, Advances in Neural Information Processing Systems 37 (2024) 87184–87225.
[19] Y. Rebboud, P. Lisena, L. Tailhardat, R. Troncy, Benchmarking LLM-based ontology conceptualization: A proposal, in: ISWC 2024, 23rd International Semantic Web Conference, 2024.
[20] M. A. Cappelli, G. Di Marzo Serugendo, Methodological exploration of ontology generation with a dedicated large language model, Electronics 14 (2025) 2863.
[21] A. S. Lippolis, M. J. Saeedizade, R. Keskisärkkä, S. Zuppiroli, M. Ceriani, A. Gangemi, E. Blomqvist, A. G. Nuzzolese, Ontology generation using large language models, in: European Semantic Web Conference, Springer, 2025, pp. 321–341.
[22] M. Poveda-Villalón, M. C. Suárez-Figueroa, A. Gómez-Pérez, Validating ontologies with OOPS!, in: International Conference on Knowledge Engineering and Knowledge Management, Springer, 2012, pp. 267–281.
[23] M. Ebrahimi, M. K. Sarker, F. Bianchi, N. Xie, D. Doran, P. Hitzler, Reasoning over RDF knowledge bases using deep learning, arXiv preprint arXiv:1811.04132 (2018).
[24] A. Eberhart, M. Ebrahimi, L. Zhou, C. Shimizu, P. Hitzler, Completion reasoning emulation for the description logic EL+, in: Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice, 2020.
[25] A. Melo, H. Paulheim, Synthesizing knowledge graphs for link and type prediction benchmarking, in: European Semantic Web Conference, Springer, 2017, pp. 136–151.
[26] J. Portisch, H. Paulheim, The DLCC node classification benchmark for analyzing knowledge graph embeddings, in: International Semantic Web Conference, Springer, 2022, pp. 592–609.
[27] Y. Guo, Z. Pan, J. Heflin, LUBM: A benchmark for OWL knowledge base systems, Journal of Web Semantics 3 (2005) 158–182.
[28] M. Schmidt, T. Hornung, G. Lausen, C. Pinkel, SP^2Bench: A SPARQL performance benchmark, in: 2009 IEEE 25th International Conference on Data Engineering, IEEE, 2009, pp. 222–233.
[29] C. Bizer, A. Schultz, The Berlin SPARQL benchmark, in: Semantic Services, Interoperability and Web Applications: Emerging Concepts, IGI Global Scientific Publishing, 2011, pp. 81–103.
[30] S. Ni, X. Kong, C. Li, X. Hu, R. Xu, J. Zhu, M. Yang, Training on the benchmark is not all you need, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 2025, pp. 24948–24956.
[31] A. Doosthosseini, J. Decker, H. Nolte, J. M. Kunkel, Chat AI: A seamless Slurm-native solution for HPC-based services, 2024. URL: https://arxiv.org/abs/2407.00110. arXiv:2407.00110.</p>
    </sec>
    <sec id="sec-8">
      <title>Appendix</title>
      <p>This section documents the prompts used in the experiments. In all cases, we used the following system
prompt:
“You are an ontology engineer”</p>
      <p>For the generation of ontologies, we used Gemma-27B with a temperature of 0.5 and the following
prompt:
Here, domain is one out of {pizza,pasta,sushi,curry dishes,wine,beer,whiskey,gin},
and N and M are numbers extracted from the original pizza and wine ontologies. The output format is
chosen to ease the evaluation.</p>
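      <p>Filling the described placeholders can be sketched as follows (the template text and the counts 23 and 8 are invented stand-ins; the paper's actual prompt wording and the extracted N and M differ):</p>
      <p>
```python
# Illustrative stand-in template; the paper's actual prompt wording is different.
TEMPLATE = ("Create an ontology of {domain} with {n} classes and {m} properties. "
            "Output one axiom per line.")

DOMAINS = {"pizza", "pasta", "sushi", "curry dishes", "wine", "beer", "whiskey", "gin"}

def build_prompt(domain, n, m):
    """Fill the placeholders; n and m would be the class and property counts
    extracted from the original pizza or wine ontology."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    return TEMPLATE.format(domain=domain, n=n, m=m)

print(build_prompt("pizza", 23, 8))
```
      </p>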
      <p>For the generation of subclass axioms, the following prompt is used:</p>
      <p>For the generation of domain and range axioms, analogous prompts are used; the range prompt reads:
I want to build an ontology of {domain}. I give you the classes and
properties I defined, please provide a list of range definitions in the
format
In the latter three prompts, {classes} and {properties} are lists of the classes and properties of the ontology at hand, provided one per line.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fathallah</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>S. D.</given-names>
          </string-name>
          <string-name>
            <surname>Giorgis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Poltronieri</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Haase</surname>
          </string-name>
          , L. Kovriguina,
          <article-title>Neon-gpt: a large language model-powered pipeline for ontology learning</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Unifying large language models and knowledge graphs: A roadmap, IEEE Transactions on Knowledge and Data Engineering (TKDE) (</article-title>
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shimizu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <article-title>Accelerating knowledge graph and ontology engineering with large language models</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>85</volume>
          (
          <year>2025</year>
          )
          <fpage>100862</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Varró</surname>
          </string-name>
          ,
          <article-title>Prompting or fine-tuning? a comparative study of large language models for taxonomy construction</article-title>
          , in: 2023 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems
          <string-name>
            <surname>Companion (MODELS-C)</surname>
          </string-name>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>588</fpage>
          -
          <lpage>596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Kommineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>König-Ries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Samuel</surname>
          </string-name>
          ,
          <article-title>From human experts to machines: An llm supported approach to ontology and knowledge graph construction</article-title>
          ,
          <source>arXiv preprint arXiv:2403.08345</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vetter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Aryan</surname>
          </string-name>
          ,
          <article-title>Using large language models for ontoclean-based ontology refinement</article-title>
          ,
          <source>arXiv preprint arXiv:2403.15864</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vasic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          ,
          <article-title>Llm-driven ontology evaluation: Verifying ontology restrictions with chatgpt, The semantic web: ESWC satellite events</article-title>
          <year>2024</year>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Zhang,</surname>
          </string-name>
          <article-title>Knowledge graph large language model (kg-llm) for link prediction</article-title>
          ,
          <source>Proceedings of Machine Learning Research</source>
          <volume>260</volume>
          (
          <year>2024</year>
          )
          <fpage>143</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Christou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Eells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shimizu</surname>
          </string-name>
          ,
          <article-title>Ontology population using llms, in: Handbook on Neurosymbolic AI and Knowledge Graphs</article-title>
          , IOS Press,
          <year>2025</year>
          , pp.
          <fpage>421</fpage>
          -
          <lpage>438</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hertling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          , Olala:
          <article-title>Ontology matching with large language models</article-title>
          ,
          <source>in: Proceedings of the 12th knowledge capture conference</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>139</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          , J. Han,
          <article-title>Don't make your llm an evaluation benchmark cheater</article-title>
          ,
          <source>arXiv preprint arXiv:2311</source>
          .
          <year>01964</year>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>