<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Graphs, and LLMs: How Do We GET Evaluations Done Right?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <email>heiko.paulheim@uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>The use of Large Language Models (LLMs) is becoming increasingly popular for many tasks in the semantic web and knowledge graph community, e.g., knowledge graph (KG) construction, ontology learning, and ontology matching. Methods and tools using LLMs for those tasks are often evaluated on existing KGs and ontologies, which are publicly available on the Web. Thus, it is a reasonable assumption that the test data may have been seen by the LLM, and it is questionable whether the results transfer to the case of unseen data (which is where those models are intended to be employed).</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Ontologies</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Data Leakage</kwd>
        <kwd>Taxonomy Induction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>http://www.heikopaulheim.com/ (H. Paulheim)</p>
      <p>© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org).</p>
      <p>[Figure 1: The vicious circle of AI evaluation. A researcher publishes a benchmark dataset; as web data, it is used for training AI models (or retrieved at runtime); the AI model is then evaluated on that same benchmark dataset.]</p>
    </sec>
    <sec id="sec-7">
      <title>2. Proposed Approach</title>
      <p>In order to overcome the data leakage problem in evaluation, we propose a schema as shown in figure 2. We coin this the GET methodology (generate–evaluate–trash). It foresees the use of a large language model to generate synthetic ontologies for one-time usage. In detail, the pipeline has the following steps:
1. From an original ontology, we extract key characteristics, such as the number of classes and properties.
2. The extracted characteristics are used to prompt an LLM to generate a set of synthetic ontologies resembling the original one. We propose two variants: (a) generating ontologies in the same domain, and (b) generating ontologies in related domains.
3. The result is a set of generated synthetic ontologies which have not been seen by any LLM during pre-training.
4. The synthetic ontologies are used as benchmarks for testing LLM-based tools, e.g., for ontology learning.
5. The results are collected. Since multiple similar ontologies can be generated, the approach also allows for assessing the stability of the results in addition to metrics such as precision and recall (e.g., by computing standard deviations across all generated ontologies).
6. After running the experiments, the synthetic ontologies should not be reused, but they can be made public in a research data repository to foster reproducibility.</p>
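      <p>The six steps above can be sketched as a minimal evaluation loop. The following is an illustrative skeleton only: the extract, generate, and evaluate callables are hypothetical placeholders to be supplied by the experimenter, not part of the paper's tooling.</p>

```python
import statistics

def get_evaluate_trash(original, extract, generate, evaluate, n_synthetic=3):
    """One round of the GET (generate-evaluate-trash) methodology.

    extract, generate, and evaluate are placeholder callables supplied
    by the experimenter; this skeleton only fixes the control flow.
    """
    chars = extract(original)                                  # step 1: key characteristics
    synthetic = [generate(chars) for _ in range(n_synthetic)]  # steps 2-3: synthetic ontologies
    scores = [evaluate(onto) for onto in synthetic]            # step 4: run the tool under test
    report = {                                                 # step 5: collect results,
        "mean": statistics.mean(scores),                       # including stability
        "stdev": statistics.stdev(scores) if n_synthetic > 1 else 0.0,
    }
    del synthetic                                              # step 6: one-time use, then trash
    return report
```

      <p>Returning only aggregate statistics and discarding the generated ontologies mirrors the one-time-use idea of step 6.</p>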
      <p>In step 2, in order to generate different ontologies, we propose using a temperature above 0. Moreover, we propose to use an LLM in this step which is not used by any of the tools in step 4.</p>
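      <p>The role of the temperature in step 2 can be illustrated with a small sketch. The llm(prompt, temperature) callable and the characteristics keys are hypothetical stand-ins; no specific client API or prompt wording is prescribed here.</p>

```python
def generate_synthetic_ontologies(llm, characteristics, n=3):
    """Sample n distinct synthetic ontologies at a temperature above 0.

    llm(prompt, temperature) is a hypothetical stand-in for the client of
    the generator model, which should differ from the models under test.
    """
    prompt = (
        "Generate an OWL ontology in Turtle syntax with about "
        f"{characteristics['classes']} classes and "
        f"{characteristics['properties']} properties "
        f"in the domain of {characteristics['domain']}."
    )
    # A temperature above 0 (here 0.5, as in section 3) makes repeated
    # calls yield different ontologies; temperature 0 would return
    # (near-)identical outputs every time.
    return [llm(prompt, temperature=0.5) for _ in range(n)]
```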
      <p>[Figure 2: The GET pipeline. (1) Key characteristics are extracted from the original ontology by an LLM; (2) the LLM generates synthetic ontologies from those characteristics; (3) the synthetic ontologies (4) serve as benchmarks for the LLM-based tool under test; (5) evaluation metrics are collected; (6) the synthetic ontologies are discarded after use.]</p>
      <sec id="sec-7-4">
        <title>3. Example: Taxonomy Induction with LLMs</title>
        <p>To test the proposed approach, we run experiments with taxonomy induction on two well-known ontologies, the Pizza ontology2 and the Wine ontology3. For each of those, we asked an LLM to create three replicas within the same domain, and three in adjacent domains (pasta, sushi, and curry dishes for the pizza ontology, and beer, whiskey, and gin for the wine domain). Details can be found in [15].</p>
        <p>For each of those ontologies, we provide a list of all classes to an LLM and ask it to return the subclass axioms holding between those classes. The returned subclass axioms are then compared to the ones in the original ontology to compute recall, precision, and F-measure. The prompts used for generating the synthetic ontologies and for learning subclass axioms, as well as the generated ontologies, are available online.4</p>
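        <p>Treating each subclass axiom as a (subclass, superclass) pair, the comparison against the original ontology reduces to set operations; a minimal sketch follows, where the pair representation of axioms is an assumption for illustration.</p>

```python
def precision_recall_f(predicted, gold):
    """Precision, recall, and F-measure for predicted subclass axioms,
    where each axiom is a (subclass, superclass) pair."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted.intersection(gold))  # correctly recovered axioms
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall > 0 else 0.0)
    return precision, recall, f
```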
        <p>In our experiment, we use three LLMs of different sizes for taxonomy induction, i.e., Llama 8B, Llama 70B, and Mistral Large (123B), at a temperature of 0. The ontologies themselves are generated using Gemma-27B at a temperature of 0.5 (in order to create different test ontologies). The results are shown in table 1. We can make multiple observations:
1. The results on the original ontologies are often worse than those on the generated ones. There are at least two possible explanations: (a) the “mental models” of the generating and the evaluating LLMs are more aligned (i.e., LLMs, even different ones, have a certain shared understanding of a given domain), and (b) the original ontologies were created for instructive purposes, with the goal of displaying many different OWL constructs rather than providing a complete domain ontology.5
2. The results in related domains are generally worse than those in the original domain, especially in the tasks based on the wine ontology (i.e., the beer, gin, and whiskey ontologies). This may hint at the LLMs having gathered part of their ontology engineering knowledge from the wine ontology and related tutorial materials.
2https://protege.stanford.edu/ontologies/pizza/pizza.owl
3https://www.w3.org/TR/owl-guide/wine.rdf
4https://github.com/HeikoPaulheim/llm-ontology-learning
5For example, the generated pizza ontologies, on average, contain three times more different types of pizza than the original pizza ontology.</p>
        <p>[Table 1: Recall (r), precision (p), and F-measure (f), with standard deviations, for taxonomy induction with Llama 8B, Llama 70B, and Mistral Large on the original pizza and wine ontologies and on the synthetic ontologies generated in the same and in related domains.]</p>
        <p>3. The order of tools by performance is not the same. For example, while Llama 70B is superior to Mistral Large on almost all tasks on the original ontologies, Mistral Large outperforms Llama 70B on many of the generated ontologies (both in the same and in similar domains). This may hint at Llama 70B's results being an effect of memorization to a larger extent than Mistral Large's. This change of ordering demonstrates that evaluating on synthetic ontologies can reveal additional information that the evaluation on the original ontologies does not provide.
4. The standard deviation is often considerable, showing that the approaches are not very stable, that good results can also be the result of a lucky coincidence, and that results of the same quality cannot be guaranteed on unseen data.</p>
        <p>Overall, the results demonstrate that with the GET methodology, we can obtain more in-depth results than by only evaluating on the two original ontologies.</p>
      </sec>
      <sec id="sec-8">
        <title>4. Conclusion and Outlook</title>
        <p>Test data leakage is an overlooked issue when running LLM-based tools and evaluating them on public ontologies and knowledge graphs. In this paper, we have proposed the GET (generate–evaluate–trash) methodology as an alternative: instead of evaluating against publicly available knowledge graphs and ontologies, we propose to generate them on the fly for one-time evaluations. We have demonstrated the approach on the task of taxonomy induction, showing that it is possible to evaluate, and also to assess the robustness of, LLM-based taxonomy induction mechanisms.</p>
        <p>First and foremost, future work will consist of wrapping the approach in an end-to-end evaluation pipeline. Further experimentation will go into controlling the complexity and difficulty of the generated ontologies, and into conducting experiments on tasks other than taxonomy induction.</p>
      </sec>
      <sec id="sec-9">
        <title>Acknowledgments</title>
        <p>The experiments have been run using the Chat AI service provided by GWDG [16].</p>
      </sec>
      <sec id="sec-10">
        <title>Declaration on Generative AI</title>
        <p>The author(s) have not employed any Generative AI tools.</p>
      </sec>
      <sec id="sec-11">
        <title>References (continued)</title>
        <p>[8] D. Shu, T. Chen, M. Jin, C. Zhang, M. Du, Y. Zhang, Knowledge graph large language model (KG-LLM) for link prediction, Proceedings of Machine Learning Research 260 (2024) 143–158.</p>
        <p>[9] S. S. Norouzi, A. Barua, A. Christou, N. Gautam, A. Eells, P. Hitzler, C. Shimizu, Ontology population using LLMs, in: Handbook on Neurosymbolic AI and Knowledge Graphs, IOS Press, 2025, pp. 421–438.</p>
        <p>[10] S. Hertling, H. Paulheim, OLaLa: Ontology matching with large language models, in: Proceedings of the 12th Knowledge Capture Conference 2023, 2023, pp. 131–139.</p>
        <p>[11] K. Zhou, Y. Zhu, Z. Chen, W. Chen, W. X. Zhao, X. Chen, Y. Lin, J.-R. Wen, J. Han, Don't make your LLM an evaluation benchmark cheater, arXiv preprint arXiv:2311.01964 (2023).</p>
        <p>[12] Y. Cheng, Y. Chang, Y. Wu, A survey on data contamination for large language models, arXiv preprint arXiv:2502.14425 (2025).</p>
        <p>[13] S. Ni, X. Kong, C. Li, X. Hu, R. Xu, J. Zhu, M. Yang, Training on the benchmark is not all you need, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 2025, pp. 24948–24956.</p>
        <p>[14] H. T. Mai, C. X. Chu, H. Paulheim, Do LLMs really adapt to domains? An ontology learning perspective, in: International Semantic Web Conference, Springer, 2024, pp. 126–143.</p>
        <p>[15] H. Paulheim, Towards evaluating knowledge graph construction and ontology learning with LLMs without test data leakage, in: 3rd Workshop on Knowledge Base Construction from Pre-Trained Language Models (KBC-LM), 2025.</p>
        <p>[16] A. Doosthosseini, J. Decker, H. Nolte, J. M. Kunkel, Chat AI: A seamless Slurm-native solution for HPC-based services, 2024. URL: https://arxiv.org/abs/2407.00110. arXiv:2407.00110.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>N.</given-names> <surname>Fathallah</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Das</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>De Giorgis</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Poltronieri</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Haase</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Kovriguina</surname></string-name>
          ,
          <article-title>Neon-gpt: a large language model-powered pipeline for ontology learning</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Unifying large language models and knowledge graphs: A roadmap</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering (TKDE)</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shimizu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <article-title>Accelerating knowledge graph and ontology engineering with large language models</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>85</volume>
          (
          <year>2025</year>
          )
          <fpage>100862</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Varró</surname>
          </string-name>
          ,
          <article-title>Prompting or fine-tuning? A comparative study of large language models for taxonomy construction</article-title>
          , in:
          <source>2023 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>588</fpage>
          -
          <lpage>596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Kommineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>König-Ries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Samuel</surname>
          </string-name>
          ,
          <article-title>From human experts to machines: An llm supported approach to ontology and knowledge graph construction</article-title>
          ,
          <source>arXiv preprint arXiv:2403.08345</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vetter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Aryan</surname>
          </string-name>
          ,
          <article-title>Using large language models for ontoclean-based ontology refinement</article-title>
          ,
          <source>arXiv preprint arXiv:2403.15864</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vasic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          ,
          <article-title>LLM-driven ontology evaluation: Verifying ontology restrictions with ChatGPT</article-title>
          ,
          <source>The Semantic Web: ESWC 2024 Satellite Events</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>