<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Evaluating Knowledge Graph Construction and Ontology Learning with LLMs without Test Data Leakage</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The use of Large Language Models (LLMs) is becoming increasingly popular for knowledge graph construction and ontology learning. Very often, methods and tools using LLMs for those tasks are evaluated on existing knowledge graphs and ontologies, which are publicly available on the Web. For very popular ontologies and knowledge graphs, there may be additional material such as tutorials and publications. Thus, it can be assumed that the test data has been seen by the LLM, and it is questionable whether the results transfer to the case of unseen data (which is where those models are intended to be employed). In this paper, we propose a different method of evaluating LLMs for knowledge graph construction and ontology learning. We suggest using a secondary LLM to create test data for one-time use on the fly. This also allows for repeating experiments and computing standard deviations and confidence intervals, which facilitates additional statements about the robustness of different approaches. We demonstrate our suggested approach on two original ontologies, and discuss different observations when comparing results between original and generated test data.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Ontology Learning</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Data Leakage</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>While a considerable amount of work has been conducted on using LLMs for ontology learning, the
evaluation is most often performed on well-known, public ontologies.</p>
      <p>The LLMs4OL challenge, conducted for the first time at the International Semantic Web Conference
in 2024, uses WordNet, GeoNames, UMLS, the Gene Ontology (GO), FoodOn, and schema.org as test
ontologies [15]. The 2025 edition¹ added the Ontology for Biomedical Investigations (OBI), the Material
Ontology (MatOnto), the Semantic Web for Earth and Environmental Terminology (SWEET), the
Human Disease Ontology (DOID), the PROcess Chemistry Ontology (PROCO), and the Plant Ontology
(PO), all of which are publicly available and widely used ontologies. Similarly, the OntoURL benchmark
uses a larger set of publicly available ontologies, including many of the aforementioned ones [16].</p>
      <p>The KBC-LM challenge, conducted for the first time at the International Semantic Web
Conference, has used relations from Wikidata throughout all iterations [17]. Evaluation datasets
proposed in other papers use taxonomies such as those from arxiv.org and Wikipedia [18], or public
ontologies such as DOREMUS, Polifonia, DemCar, Odeuropa, NORIA-O, or FIBO [19].</p>
      <p>Another strand of work (e.g., [20, 21]) does not evaluate the generated ontologies against a ground
truth, but rather uses quality metrics such as those defined by the ontology pitfall scanner (OOPS!) [22].
While this avoids the data leakage problem, it can only rate the compliance of LLM-based approaches
with ontology engineering guidelines and find general issues such as taxonomy cycles; it does not
take the actual semantics of the generated ontologies into account.</p>
      <p>Overall, we see that there is no easy way to evaluate how well LLM-based approaches work for
ontology learning on unseen data.</p>
      <p>The approach in this paper proposes to use synthetic ontologies as benchmarks for ontology learning,
which are created dynamically for an experiment and not reused afterwards. While synthetic
ontologies have been proposed for other benchmarking purposes, such as reasoning [23, 24], knowledge graph
completion [25], machine learning over knowledge graphs [26], or querying [27, 28, 29], the approach
pursued in this paper differs in two aspects: such approaches do not exist for ontology learning,
and generation at runtime has not been the focus so far (in fact, most synthetic benchmarks
are public, and usually, researchers reuse public synthetic benchmarks instead of recreating fresh ones).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>In order to overcome the data leakage problem in evaluating LLM-based ontology learning tools, we
propose a scheme based on the GET methodology [14], as shown in Figure 1. It foresees using a
large language model to generate synthetic ontologies for one-time usage. In detail, the pipeline has
the following steps:
1. From an original ontology, we extract key characteristics, such as the number of classes and
properties.
2. The extracted characteristics are used to prompt an LLM to generate a set of synthetic ontologies
resembling the original one. We propose two variants: (a) generating ontologies in the same
domain, and (b) generating ontologies in related domains.
3. The result is a set of synthetic ontologies generated on the fly. We assume that
they were not part of the LLM training data.
4. The synthetic ontologies are used as benchmarks for testing LLM-based ontology learning tools.
5. The results are collected. Since multiple similar ontologies can be generated, the approach also
allows for assessing the stability of the results in addition to metrics such as precision and recall
(e.g., by computing standard deviations across all generated ontologies).
6. After running the experiments, the synthetic ontologies should not be reused, but they can be
made public in a research data repository to foster reproducibility.
¹https://sites.google.com/view/llms4ol2025/home</p>
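      <p>As an illustrative sketch (not the paper's implementation; the function names, the dictionary-based ontology representation, and the placeholder F1 score are our assumptions), the six steps could be orchestrated as follows, with stubs standing in for the two LLM calls:</p>
      <p>
```python
import statistics

# Step 1: extract key characteristics from the original ontology
# (assumption: the ontology is a plain dict with "classes" and "properties" lists).
def extract_characteristics(ontology):
    return {"classes": len(ontology["classes"]),
            "properties": len(ontology["properties"])}

# Steps 2-3: stub standing in for prompting a generator LLM (temperature > 0).
def generate_synthetic_ontologies(characteristics, n):
    return [f"synthetic ontology {i} with ~{characteristics['classes']} classes"
            for i in range(n)]

# Step 4: stub standing in for running the LLM-based ontology learning tool
# and scoring its output against the synthetic gold ontology.
def run_tool_and_score(synthetic_ontology):
    return 0.5  # placeholder F1 score

# Steps 5-6: aggregate over all generated ontologies, then discard them.
def evaluate(original, n=3):
    chars = extract_characteristics(original)
    scores = [run_tool_and_score(o)
              for o in generate_synthetic_ontologies(chars, n)]
    return statistics.mean(scores), (statistics.stdev(scores) if n > 1 else 0.0)
```
      </p>
      <p>Since step 5 aggregates over several generated ontologies, the return value carries both a mean score and a standard deviation, matching the stability assessment described above.</p>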
      <p>[Figure 1: Overview of the proposed pipeline. The original ontology yields key characteristics, from which an LLM generates synthetic ontologies; these are fed to the LLM-based tool under test, whose outputs are scored with evaluation metrics. The numbered edges correspond to steps 1–6 above.]</p>
      <p>In step 2, in order to generate different ontologies, we propose using a temperature above 0. Moreover,
we propose to use an LLM which is not used by any tool in step 4.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In order to test the proposed approach, we conducted experiments with two ontology learning tasks,
i.e., taxonomy induction and domain/range induction.</p>
      <sec id="sec-4-1">
        <title>4.1. Ontology Generation</title>
        <p>To test the proposed approach, we started from two well-known ontologies, the Pizza ontology² and
the Wine ontology³. For each of those, we had an LLM create three replicas within the same domain,
and three in adjacent domains (pasta, sushi, and curry dishes for the pizza ontology, and beer, whiskey⁴,
and gin for the wine domain).⁵ The prompts used for generating the synthetic ontologies, as well as for
learning subclass and domain/range axioms, are shown in the appendix.</p>
        <p>Statistics on the generated ontologies are shown in Table 1. We can make multiple observations here.
First, while the LLM does a good job at creating an exact small number of items (here: properties), there
is more variation for the larger numbers (here: classes). Second, while in the original ontologies some
properties do not have a defined domain or range, this never occurs in the generated ones, even though
this has been explicitly permitted in the prompt used to generate the ontologies. Moreover, in most
cases, the domain of all properties is the central class (e.g., pizza or wine).
²https://protege.stanford.edu/ontologies/pizza/pizza.owl
³https://www.w3.org/TR/owl-guide/wine.rdf
⁴Running the experiment with whisky and comparing the results to those with whiskey is left as an exercise to the reader.
⁵While we selected the adjacent domains by hand, it would also be possible to prompt an LLM for those for full automation.</p>
        <p>[Table 1: # classes, # properties, # subclass axioms, # domain axioms, and # range axioms for the original and generated ontologies.]</p>
        <p>Table 2 shows the similarity of the original and generated ontologies in terms of overlapping classes
and properties. It can be observed that the generated ontologies are very different from the original
ontologies in that respect, and that the different generated ontologies are also reasonably different from
one another.</p>
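        <p>Such an overlap can be quantified, for instance, with the Jaccard similarity of the class (or property) name sets; a minimal sketch (the class names below are invented examples, not taken from the generated ontologies):</p>
        <p>
```python
def jaccard(a, b):
    """Jaccard similarity of two sets of class or property names."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

original  = {"Pizza", "Topping", "CheeseTopping", "Base"}
generated = {"Pizza", "Topping", "Sauce", "Crust"}
print(jaccard(original, generated))  # 2 shared of 6 distinct names -> 0.3333333333333333
```
        </p>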
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Ontology Learning Evaluation</title>
        <p>We evaluate two tasks, i.e., subclass axiom induction and domain/range induction by LLMs, and we use
three LLMs of different sizes for that task: Llama-8B, Llama-70B, and Mistral Large Instruct (123B) at a
temperature of 0 for reproducibility. The ontologies themselves are generated using Gemma-27B with
a temperature of 0.5 to ensure variance in the generated ontologies.</p>
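        <p>For both tasks, scoring can be done by exact matching of gold and predicted axiom sets (a generic sketch; the actual matching and normalization used in the paper may differ, and the axiom triples below are invented examples):</p>
        <p>
```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of axioms."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Axioms as (subject, axiom type, object) triples.
gold = {("Margherita", "subClassOf", "Pizza"), ("CheeseTopping", "subClassOf", "Topping")}
pred = {("Margherita", "subClassOf", "Pizza"), ("Margherita", "subClassOf", "Food")}
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```
        </p>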
        <p>Since the original ontologies were not fully materialized, we (a) materialized the domain/range axioms
for inverse properties, and (b) added subclass axioms for equivalent restriction definitions (i.e., for
A ≡ B ⊓ C, we added A ⊑ B and A ⊑ C) before evaluating the generated domain/range and subclass
axioms. Both sets of materialized axioms are included in the counts in Table 1. All generated ontologies
and axioms output by the different models are available online.⁶</p>
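        <p>The two materialization steps can be sketched over a toy axiom representation (this is an illustration under our own data-structure assumptions, not the code used for the experiments): for a declared inverse property, domain and range are swapped; for A ≡ B ⊓ C, the subclass axioms A ⊑ B and A ⊑ C are emitted.</p>
        <p>
```python
def materialize_inverse(domains, ranges, inverses):
    """If q is declared the inverse of p, materialize domain(q) = range(p)
    and range(q) = domain(p) (only where q has no axiom yet)."""
    for p, q in inverses:
        if p in ranges:
            domains.setdefault(q, ranges[p])
        if p in domains:
            ranges.setdefault(q, domains[p])
    return domains, ranges

def materialize_equivalences(equivalences):
    """For each A == B AND C (equivalent-class definition with a conjunction),
    emit the subclass axioms A <= B and A <= C."""
    return [(cls, parent) for cls, conjuncts in equivalences for parent in conjuncts]

doms, rngs = materialize_inverse({"hasTopping": "Pizza"},
                                 {"hasTopping": "Topping"},
                                 [("hasTopping", "isToppingOf")])
subs = materialize_equivalences([("CheesyPizza", ["Pizza", "HasCheeseRestriction"])])
print(doms["isToppingOf"], rngs["isToppingOf"])  # Topping Pizza
```
        </p>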
        <p>The results are shown in Table 3 for subclass induction, Table 4 for domain induction, and Table 5 for range
induction.</p>
        <p>Before analyzing the results in more detail, we want to point out two cases observed frequently
throughout the entire evaluation:
• Results where both recall and precision are 0 are in most cases due to the LLM answering in a
completely different format than the one requested. A second cause can be observed in particular
for Llama-8B, which often mixes up domain and range and outputs property ranges instead of
domains, which is why Llama-8B often has zeros in the domain induction task.
• Results with a very high recall and very low precision: occasionally, the LLMs output cross
products of properties and classes for domains or ranges, or redundantly include all subclasses of
the actual domain/range class.⁷
⁶https://github.com/HeikoPaulheim/llm-ontology-learning</p>
        <p>[Tables 3, 4, and 5: precision, recall, and F1 of the three models for subclass, domain, and range induction on the original and generated ontologies.]</p>
        <p>When looking into the results in more detail, we can make a number of further observations:
• The results on the original ontologies are often worse than those on the generated ones. There are
at least three possible explanations: (a) the “mental models” of the generating and the evaluating
LLMs are more aligned (i.e., LLMs have a certain shared understanding of a given domain), (b)
the original ontologies, which were created for instructive purposes, contain more corner cases,
and (c) in contrast to most generated ontologies, the original ones contain properties without
explicit domain and range definitions, while the LLM almost always returns a definition for each
property, despite being explicitly prompted that this is optional, leading to a larger number of false
positives on the original ontologies.
• The results in related domains are generally worse than those in the original domain, especially
in the tasks based on the wine ontology (i.e., the beer, gin, and whiskey ontologies). This may hint at
the LLMs having gathered a part of their ontology engineering knowledge from the wine ontology
and related tutorial materials.
• The order of tools by performance is not the same across test sets. For example, while Llama-70B is
superior to Mistral Large on almost all tasks on the original ontologies, Mistral Large outperforms
Llama-70B on many of the generated ontologies (both in the same and in similar domains). This may
hint at Llama-70B’s results being an effect of memorization to a larger extent than Mistral Large’s.
• The standard deviation is often considerable, showing that the approaches are not very stable,
that good results can also be the result of a lucky coincidence, and that results of the same quality
cannot be guaranteed on unseen data.</p>
        <p>Overall, we see that with the proposed methodology, we can obtain more in-depth results than by only
evaluating on the two original ontologies.</p>
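        <p>The aggregation into mean, standard deviation, and a confidence interval over the repeated runs can be sketched as follows (the paper reports standard deviations; the normal-approximation 95% confidence interval and the F1 values below are our illustrative additions):</p>
        <p>
```python
import math
import statistics

def summarize(scores, z=1.96):
    """Mean, sample standard deviation, and approximate 95% CI of repeated scores."""
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)            # sample std dev across generated ontologies
    half = z * sd / math.sqrt(len(scores))   # normal approximation of the 95% CI
    return m, sd, (m - half, m + half)

f1_runs = [0.61, 0.55, 0.64]                 # invented F1 scores on three synthetic ontologies
m, sd, ci = summarize(f1_runs)
print(round(m, 3), round(sd, 3))  # 0.6 0.046
```
        </p>
        <p>For the small numbers of repetitions used here, a t-based interval would be wider and more appropriate; the normal approximation is kept only for simplicity.</p>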
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Outlook</title>
      <p>Test data leakage is an overlooked issue when evaluating LLM-based tools for ontology learning on
publicly available ontologies and knowledge graphs. In this paper, we have proposed an alternative
methodology: instead of evaluating against publicly available ontologies, we propose to generate test
ontologies on the fly for one-time evaluations. We have demonstrated the approach on the tasks of
taxonomy and domain/range induction, showing that it is possible to evaluate, and also assess the
robustness of, LLM-based induction mechanisms.</p>
      <p>While we assume that the generated ontologies are not seen by the LLMs during training, this
assumption may be partially wrong – in case the generating LLM reproduces parts of an existing
ontology, those may in fact have been seen by the LLM. One important task of future work is therefore
applying data leakage metrics [30] to the generated data to assess the degree of freshness of the generated
synthetic ontologies.</p>
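      <p>One very simple freshness check (our illustrative sketch, not one of the metrics of [30]) is measuring which fraction of a generated ontology's terms already occurs verbatim in a known public ontology; the term sets below are invented examples:</p>
      <p>
```python
def term_overlap(generated_terms, public_terms):
    """Fraction of generated class/property names that also occur in a public
    ontology; a high value suggests the generator may have reproduced seen material."""
    generated, public = set(generated_terms), set(public_terms)
    return len(generated & public) / len(generated) if generated else 0.0

print(term_overlap({"Pizza", "MargheritaPizza", "SpicyBeefTopping"},
                   {"Pizza", "MargheritaPizza", "Country"}))  # 0.6666666666666666
```
      </p>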
      <p>One of the striking observations of this work was that the results observed on synthetic ontologies
are often better than those on the original, human-generated ones. This deserves a deeper analysis.
One possible reason we postulated was that different LLMs have a stronger alignment of their “mental
models” of a domain, an assumption that deserves further analysis, e.g., by swapping the LLMs used
for generation and evaluation, and comparing the generated ontologies to one another. Moreover, we
will think of approaches to assess the difficulty of the ontology learning tasks on the original and the
generated ontologies, respectively, and to experiment with different prompts for controlling the task
difficulty.
⁷In future work, we will catch the latter issue programmatically, and filter out those correct, but redundant axioms.</p>
      <p>On the practical side, future work will consist of wrapping the approach in an end-to-end evaluation
pipeline. Further experimentation will go into the generation of ontologies, e.g., controlling the
complexity and difficulty of the generated ontologies, and conducting experiments with different
generation models.</p>
      <p>So far, we have looked into taxonomy and domain/range induction, but the approach might be
interesting for various other tasks, such as the learning of more complex restrictions, detection of
property characteristics (transitivity, inverse properties, etc.), or entity typing.</p>
      <p>Overall, we hope that this mode of evaluation will be used at least in addition to the currently dominant
paradigm of evaluating using publicly available ontologies, in order to analyze tool performance in a
setup free from test data leakage.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>The experiments have been run using the Chat AI service provided by GWDG [31].</p>
    </sec>
    <sec id="sec-7a">
      <title>References [12]–[31]</title>
      <p>[12] Y. Cheng, Y. Chang, Y. Wu, A survey on data contamination for large language models, arXiv preprint arXiv:2502.14425 (2025).
[13] H. T. Mai, C. X. Chu, H. Paulheim, Do LLMs really adapt to domains? An ontology learning perspective, in: International Semantic Web Conference, Springer, 2024, pp. 126–143.
[14] H. Paulheim, Ontologies, Knowledge Graphs, and LLMs: How Do We GET Evaluations Done Right?, in: International Semantic Web Conference, Posters and Demos, 2025.
[15] H. B. Giglou, J. D’Souza, S. Auer, LLMs4OL 2024 overview: The 1st large language models for ontology learning challenge, arXiv preprint arXiv:2409.10146 (2024).
[16] X. Zhang, H. Lai, Q. Meng, J. Bos, OntoURL: A benchmark for evaluating large language models on symbolic ontological understanding, reasoning and learning, arXiv preprint arXiv:2505.11031 (2025).
[17] J.-C. Kalo, T.-P. Nguyen, S. Razniewski, B. Zhang, Preface: LM-KBC challenge 2024, in: 2nd Workshop on Knowledge Base Construction from Pre-Trained Language Models, CEUR-WS.org, 2024.
[18] A. Lo, A. Q. Jiang, W. Li, M. Jamnik, End-to-end ontology learning with large language models, Advances in Neural Information Processing Systems 37 (2024) 87184–87225.
[19] Y. Rebboud, P. Lisena, L. Tailhardat, R. Troncy, Benchmarking LLM-based ontology conceptualization: A proposal, in: ISWC 2024, 23rd International Semantic Web Conference, 2024.
[20] M. A. Cappelli, G. Di Marzo Serugendo, Methodological exploration of ontology generation with a dedicated large language model, Electronics 14 (2025) 2863.
[21] A. S. Lippolis, M. J. Saeedizade, R. Keskisärkkä, S. Zuppiroli, M. Ceriani, A. Gangemi, E. Blomqvist, A. G. Nuzzolese, Ontology generation using large language models, in: European Semantic Web Conference, Springer, 2025, pp. 321–341.
[22] M. Poveda-Villalón, M. C. Suárez-Figueroa, A. Gómez-Pérez, Validating ontologies with OOPS!, in: International Conference on Knowledge Engineering and Knowledge Management, Springer, 2012, pp. 267–281.
[23] M. Ebrahimi, M. K. Sarker, F. Bianchi, N. Xie, D. Doran, P. Hitzler, Reasoning over RDF knowledge bases using deep learning, arXiv preprint arXiv:1811.04132 (2018).
[24] A. Eberhart, M. Ebrahimi, L. Zhou, C. Shimizu, P. Hitzler, Completion reasoning emulation for the description logic EL+, in: Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice, 2020.
[25] A. Melo, H. Paulheim, Synthesizing knowledge graphs for link and type prediction benchmarking, in: European Semantic Web Conference, Springer, 2017, pp. 136–151.
[26] J. Portisch, H. Paulheim, The DLCC node classification benchmark for analyzing knowledge graph embeddings, in: International Semantic Web Conference, Springer, 2022, pp. 592–609.
[27] Y. Guo, Z. Pan, J. Heflin, LUBM: A benchmark for OWL knowledge base systems, Journal of Web Semantics 3 (2005) 158–182.
[28] M. Schmidt, T. Hornung, G. Lausen, C. Pinkel, SP^2Bench: A SPARQL performance benchmark, in: 2009 IEEE 25th International Conference on Data Engineering, IEEE, 2009, pp. 222–233.
[29] C. Bizer, A. Schultz, The Berlin SPARQL benchmark, in: Semantic Services, Interoperability and Web Applications: Emerging Concepts, IGI Global Scientific Publishing, 2011, pp. 81–103.
[30] S. Ni, X. Kong, C. Li, X. Hu, R. Xu, J. Zhu, M. Yang, Training on the benchmark is not all you need, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 2025, pp. 24948–24956.
[31] A. Doosthosseini, J. Decker, H. Nolte, J. M. Kunkel, Chat AI: A seamless Slurm-native solution for HPC-based services, 2024. URL: https://arxiv.org/abs/2407.00110. arXiv:2407.00110.</p>
    </sec>
    <sec id="sec-8">
      <title>Appendix</title>
      <p>This section documents the prompts used in the experiments. In all cases, we used the following system
prompt:
“You are an ontology engineer”</p>
      <p>For the generation of ontologies, we used Gemma-27B with a temperature of 0.5 and the following
prompt:
Here, domain is one out of {pizza,pasta,sushi,curry dishes,wine,beer,whiskey,gin},
and N and M are numbers extracted from the original pizza and wine ontologies. The output format is
chosen to ease the evaluation.</p>
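      <p>Filling the described placeholders can be sketched as follows (the template text and the counts 23 and 8 are invented stand-ins; the paper's actual prompt wording and the extracted N and M differ):</p>
      <p>
```python
# Illustrative stand-in template; the paper's actual prompt wording is different.
TEMPLATE = ("Create an ontology of {domain} with {n} classes and {m} properties. "
            "Output one axiom per line.")

DOMAINS = {"pizza", "pasta", "sushi", "curry dishes", "wine", "beer", "whiskey", "gin"}

def build_prompt(domain, n, m):
    """Fill the placeholders; n and m would be the class and property counts
    extracted from the original pizza or wine ontology."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    return TEMPLATE.format(domain=domain, n=n, m=m)

print(build_prompt("pizza", 23, 8))
```
      </p>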
      <p>For the generation of subclass axioms, the following prompt is used:</p>
      <p>For the generation of domain and range axioms, analogous prompts are used; the range prompt reads:
I want to build an ontology of {domain}. I give you the classes and
properties I defined, please provide a list of range definitions in the
format
In the latter three prompts, {classes} and {properties} are lists of the classes and properties of the ontology at hand, provided one per line.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fathallah</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>S. D.</given-names>
          </string-name>
          <string-name>
            <surname>Giorgis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Poltronieri</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Haase</surname>
          </string-name>
          , L. Kovriguina,
          <article-title>Neon-gpt: a large language model-powered pipeline for ontology learning</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Unifying large language models and knowledge graphs: A roadmap, IEEE Transactions on Knowledge and Data Engineering (TKDE) (</article-title>
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shimizu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <article-title>Accelerating knowledge graph and ontology engineering with large language models</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>85</volume>
          (
          <year>2025</year>
          )
          <fpage>100862</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Varró</surname>
          </string-name>
          ,
          <article-title>Prompting or fine-tuning? a comparative study of large language models for taxonomy construction</article-title>
          , in: 2023 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems
          <string-name>
            <surname>Companion (MODELS-C)</surname>
          </string-name>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>588</fpage>
          -
          <lpage>596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Kommineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>König-Ries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Samuel</surname>
          </string-name>
          ,
          <article-title>From human experts to machines: An llm supported approach to ontology and knowledge graph construction</article-title>
          ,
          <source>arXiv preprint arXiv:2403.08345</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vetter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Aryan</surname>
          </string-name>
          ,
          <article-title>Using large language models for ontoclean-based ontology refinement</article-title>
          ,
          <source>arXiv preprint arXiv:2403.15864</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsaneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vasic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabou</surname>
          </string-name>
          ,
          <article-title>Llm-driven ontology evaluation: Verifying ontology restrictions with chatgpt, The semantic web: ESWC satellite events</article-title>
          <year>2024</year>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Zhang,</surname>
          </string-name>
          <article-title>Knowledge graph large language model (kg-llm) for link prediction</article-title>
          ,
          <source>Proceedings of Machine Learning Research</source>
          <volume>260</volume>
          (
          <year>2024</year>
          )
          <fpage>143</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Christou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Eells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shimizu</surname>
          </string-name>
          ,
          <article-title>Ontology population using llms, in: Handbook on Neurosymbolic AI and Knowledge Graphs</article-title>
          , IOS Press,
          <year>2025</year>
          , pp.
          <fpage>421</fpage>
          -
          <lpage>438</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hertling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          , Olala:
          <article-title>Ontology matching with large language models</article-title>
          ,
          <source>in: Proceedings of the 12th knowledge capture conference</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>139</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          , J. Han,
          <article-title>Don't make your llm an evaluation benchmark cheater</article-title>
          ,
          <source>arXiv preprint arXiv:2311</source>
          .
          <year>01964</year>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>