<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Graphs, and LLMs: How Do We GET Evaluations Done Right?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <email>heiko.paulheim@uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>The use of Large Language Models (LLMs) is becoming increasingly popular for many tasks in the semantic web and knowledge graph community, e.g., knowledge graph (KG) construction, ontology learning, and ontology matching. Methods and tools using LLMs for those tasks are often evaluated on existing KGs and ontologies, which are publicly available on the Web. Thus, it is a reasonable assumption that the test data may have been seen by the LLM, and it is questionable whether the results transfer to the case of unseen data (which is where those models are intended to be employed).</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Ontologies</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Data Leakage</kwd>
        <kwd>Taxonomy Induction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>http://www.heikopaulheim.com/ (H. Paulheim)</p>
      <p>© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org).</p>
      <p>[Figure 1: The vicious circle of AI evaluation. A researcher publishes a benchmark dataset; as web data, it is used for training AI models (or retrieved at runtime); the AI model is then evaluated on that same benchmark dataset.]</p>
    </sec>
    <sec id="sec-7">
      <title>2. Proposed Approach</title>
      <p>In order to overcome the data leakage problem in evaluation, we propose a schema as shown in figure 2. We coin this the GET methodology (generate–evaluate–trash). It foresees the use of a large language model to generate synthetic ontologies for one-time usage. In detail, the pipeline has the following steps:
1. From an original ontology, we extract key characteristics, such as the number of classes and properties.
2. The extracted characteristics are used to prompt an LLM to generate a set of synthetic ontologies resembling the original one. We propose two variants: (a) generating ontologies in the same domain, and (b) generating ontologies in related domains.
3. The result is a set of generated synthetic ontologies which have not been seen by any LLM during pre-training.
4. The synthetic ontologies are used as benchmarks for testing LLM-based tools, e.g., for ontology learning.
5. The results are collected. Since multiple similar ontologies can be generated, the approach also allows for assessing the stability of the results in addition to metrics such as precision and recall (e.g., by computing standard deviations across all generated ontologies).
6. After running the experiments, the synthetic ontologies should not be reused, but they can be made public in a research data repository to foster reproducibility.</p>
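      <p>The six steps above can be sketched as a minimal evaluation loop. The following is an illustrative skeleton only: the extract, generate, and evaluate callables are hypothetical placeholders to be supplied by the experimenter, not part of the paper's tooling.</p>

```python
import statistics

def get_evaluate_trash(original, extract, generate, evaluate, n_synthetic=3):
    """One round of the GET (generate-evaluate-trash) methodology.

    extract, generate, and evaluate are placeholder callables supplied
    by the experimenter; this skeleton only fixes the control flow.
    """
    chars = extract(original)                                  # step 1: key characteristics
    synthetic = [generate(chars) for _ in range(n_synthetic)]  # steps 2-3: synthetic ontologies
    scores = [evaluate(onto) for onto in synthetic]            # step 4: run the tool under test
    report = {                                                 # step 5: collect results,
        "mean": statistics.mean(scores),                       # including stability
        "stdev": statistics.stdev(scores) if n_synthetic > 1 else 0.0,
    }
    del synthetic                                              # step 6: one-time use, then trash
    return report
```

      <p>Returning only aggregate statistics and discarding the generated ontologies mirrors the one-time-use idea of step 6.</p>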
      <p>In step 2, in order to generate different ontologies, we propose using a temperature above 0. Moreover, we propose to use an LLM in this step which is not used by any of the tools in step 4.</p>
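      <p>The role of the temperature in step 2 can be illustrated with a small sketch. The llm(prompt, temperature) callable and the characteristics keys are hypothetical stand-ins; no specific client API or prompt wording is prescribed here.</p>

```python
def generate_synthetic_ontologies(llm, characteristics, n=3):
    """Sample n distinct synthetic ontologies at a temperature above 0.

    llm(prompt, temperature) is a hypothetical stand-in for the client of
    the generator model, which should differ from the models under test.
    """
    prompt = (
        "Generate an OWL ontology in Turtle syntax with about "
        f"{characteristics['classes']} classes and "
        f"{characteristics['properties']} properties "
        f"in the domain of {characteristics['domain']}."
    )
    # A temperature above 0 (here 0.5, as in section 3) makes repeated
    # calls yield different ontologies; temperature 0 would return
    # (near-)identical outputs every time.
    return [llm(prompt, temperature=0.5) for _ in range(n)]
```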
      <p>[Figure 2: The GET pipeline. (1) Key characteristics are extracted from the original ontology by an LLM; (2) the LLM generates synthetic ontologies from those characteristics; (3) the synthetic ontologies (4) serve as benchmarks for the LLM-based tool under test; (5) evaluation metrics are collected; (6) the synthetic ontologies are discarded after use.]</p>
      <sec id="sec-7-4">
        <title>3. Example: Taxonomy Induction with LLMs</title>
        <p>To test the proposed approach, we run experiments with taxonomy induction on two well-known ontologies, the Pizza ontology2 and the Wine ontology3. For each of those, we asked an LLM to create three replicas within the same domain, and three in adjacent domains (pasta, sushi, and curry dishes for the pizza ontology, and beer, whiskey, and gin for the wine domain). Details can be found in [15].</p>
        <p>For each of those ontologies, we provide a list of all classes to an LLM and ask it to return the subclass axioms holding between those classes. The returned subclass axioms are then compared to the ones in the original ontology to compute recall, precision, and F-measure. The prompts used for generating the synthetic ontologies and for learning subclass axioms, as well as the generated ontologies, are available online.4</p>
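        <p>Treating each subclass axiom as a (subclass, superclass) pair, the comparison against the original ontology reduces to set operations; a minimal sketch follows, where the pair representation of axioms is an assumption for illustration.</p>

```python
def precision_recall_f(predicted, gold):
    """Precision, recall, and F-measure for predicted subclass axioms,
    where each axiom is a (subclass, superclass) pair."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted.intersection(gold))  # correctly recovered axioms
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall > 0 else 0.0)
    return precision, recall, f
```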
        <p>In our experiment, we use three LLMs of different sizes for taxonomy induction, i.e., Llama 8B, Llama 70B, and Mistral Large (123B), at a temperature of 0. The ontologies themselves are generated using Gemma-27B at a temperature of 0.5 (in order to create different test ontologies). The results are shown in table 1. We can make multiple observations:
1. The results on the original ontologies are often worse than those on the generated ones. There are at least two possible explanations: (a) the “mental models” of the generating and the evaluating LLMs are more aligned (i.e., LLMs, even different ones, have a certain shared understanding of a given domain), and (b) the original ontologies were created for instructive purposes, with the goal of displaying many different OWL constructs rather than providing a complete domain ontology.5
2. The results in related domains are generally worse than those in the original domain, especially in the tasks based on the wine ontology (i.e., the beer, gin, and whiskey ontologies). This may hint at the LLMs having gathered part of their ontology engineering knowledge from the wine ontology and related tutorial materials.
2https://protege.stanford.edu/ontologies/pizza/pizza.owl
3https://www.w3.org/TR/owl-guide/wine.rdf
4https://github.com/HeikoPaulheim/llm-ontology-learning
5For example, the generated pizza ontologies, on average, contain three times more different types of pizza than the original pizza ontology.</p>
        <p>[Table 1: Recall (r), precision (p), and F-measure (f), with standard deviations, for taxonomy induction with Llama 8B, Llama 70B, and Mistral Large on the original pizza and wine ontologies and on the synthetic ontologies generated in the same and in related domains.]</p>
        <p>3. The order of tools by performance is not the same. For example, while Llama 70B is superior to Mistral Large on almost all tasks on the original ontologies, Mistral Large outperforms Llama 70B on many of the generated ontologies (both in the same and in similar domains). This may hint at Llama 70B's results being an effect of memorization to a larger extent than Mistral Large's. This change of ordering demonstrates that evaluating on synthetic ontologies can reveal additional information that the evaluation on the original ontologies does not provide.
4. The standard deviation is often considerable, showing that the approaches are not very stable, that good results can also be the result of a lucky coincidence, and that results of the same quality cannot be guaranteed on unseen data.</p>
        <p>Overall, the results demonstrate that with the GET methodology, we can obtain more in-depth results than by only evaluating on the two original ontologies.</p>
      </sec>
      <sec id="sec-8">
        <title>4. Conclusion and Outlook</title>
        <p>Test data leakage is an overlooked issue when running LLM-based tools and evaluating them on public ontologies and knowledge graphs. In this paper, we have proposed the GET (generate–evaluate–trash) methodology as an alternative: instead of evaluating against publicly available knowledge graphs and ontologies, we propose to generate them on the fly for one-time evaluations. We have demonstrated the approach on the task of taxonomy induction, showing that it is possible to evaluate, and also to assess the robustness of, LLM-based taxonomy induction mechanisms.</p>
        <p>First and foremost, future work will consist of wrapping the approach in an end-to-end evaluation pipeline. Further experimentation will go into controlling the complexity and difficulty of the generated ontologies, and into conducting experiments on tasks other than taxonomy induction.</p>
      </sec>
      <sec id="sec-9">
        <title>Acknowledgments</title>
        <p>The experiments have been run using the Chat AI service provided by GWDG [16].</p>
      </sec>
      <sec id="sec-10">
        <title>Declaration on Generative AI</title>
        <p>The author(s) have not employed any Generative AI tools.</p>
      </sec>
      <sec id="sec-11">
        <title>References (continued)</title>
        <p>[8] D. Shu, T. Chen, M. Jin, C. Zhang, M. Du, Y. Zhang, Knowledge graph large language model (KG-LLM) for link prediction, Proceedings of Machine Learning Research 260 (2024) 143–158.</p>
        <p>[9] S. S. Norouzi, A. Barua, A. Christou, N. Gautam, A. Eells, P. Hitzler, C. Shimizu, Ontology population using LLMs, in: Handbook on Neurosymbolic AI and Knowledge Graphs, IOS Press, 2025, pp. 421–438.</p>
        <p>[10] S. Hertling, H. Paulheim, OLaLa: Ontology matching with large language models, in: Proceedings of the 12th Knowledge Capture Conference 2023, 2023, pp. 131–139.</p>
        <p>[11] K. Zhou, Y. Zhu, Z. Chen, W. Chen, W. X. Zhao, X. Chen, Y. Lin, J.-R. Wen, J. Han, Don't make your LLM an evaluation benchmark cheater, arXiv preprint arXiv:2311.01964 (2023).</p>
        <p>[12] Y. Cheng, Y. Chang, Y. Wu, A survey on data contamination for large language models, arXiv preprint arXiv:2502.14425 (2025).</p>
        <p>[13] S. Ni, X. Kong, C. Li, X. Hu, R. Xu, J. Zhu, M. Yang, Training on the benchmark is not all you need, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 2025, pp. 24948–24956.</p>
        <p>[14] H. T. Mai, C. X. Chu, H. Paulheim, Do LLMs really adapt to domains? An ontology learning perspective, in: International Semantic Web Conference, Springer, 2024, pp. 126–143.</p>
        <p>[15] H. Paulheim, Towards evaluating knowledge graph construction and ontology learning with LLMs without test data leakage, in: 3rd Workshop on Knowledge Base Construction from Pre-Trained Language Models (KBC-LM), 2025.</p>
        <p>[16] A. Doosthosseini, J. Decker, H. Nolte, J. M. Kunkel, Chat AI: A seamless Slurm-native solution for HPC-based services, 2024. URL: https://arxiv.org/abs/2407.00110. arXiv:2407.00110.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>N.</given-names> <surname>Fathallah</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Das</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>De Giorgis</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Poltronieri</surname></string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Haase</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Kovriguina</surname></string-name>
          ,
          <article-title>Neon-gpt: a large language model-powered pipeline for ontology learning</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Unifying large language models and knowledge graphs: A roadmap</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering (TKDE)</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shimizu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <article-title>Accelerating knowledge graph and ontology engineering with large language models</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>85</volume>
          (
          <year>2025</year>
          )
          <fpage>100862</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Varró</surname>
          </string-name>
          ,
          <article-title>Prompting or fine-tuning? A comparative study of large language models for taxonomy construction</article-title>
          , in:
          <source>2023 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>588</fpage>
          -
          <lpage>596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Kommineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>König-Ries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Samuel</surname>
          </string-name>
          ,
          <article-title>From human experts to machines: An llm supported approach to ontology and knowledge graph construction</article-title>
          ,
          <source>arXiv preprint arXiv:2403.08345</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vetter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Aryan</surname>
          </string-name>
          ,
          <article-title>Using large language models for ontoclean-based ontology refinement</article-title>
          ,
          <source>arXiv preprint arXiv:2403.15864</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vasic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          ,
          <article-title>LLM-driven ontology evaluation: Verifying ontology restrictions with ChatGPT</article-title>
          ,
          <source>The Semantic Web: ESWC 2024 Satellite Events</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>