Actively Learning Ontologies from LLMs: First Results (Extended Abstract)

Matteo Magnini¹, Ana Ozaki²,³ and Riccardo Squarcialupi¹

¹ Department of Computer Science and Engineering, University of Bologna, Via dell'Università 50, Cesena, Italy
² Department of Informatics, University of Oslo, Gaustadalléen 23B, Oslo, Norway
³ Department of Informatics, University of Bergen, Thormøhlensgate 55, Bergen, Norway

DL 2024: 37th International Workshop on Description Logics, June 18–21, 2024, Bergen, Norway
matteo.magnini@unibo.it (M. Magnini); anaoz@uio.no (A. Ozaki); riccard.squarcialupi@studio.unibo.it (R. Squarcialupi)

Abstract
In active learning a learner attempts to acquire some kind of knowledge by posing questions to a teacher. Here we consider that the teacher is a language model and study the case in which the knowledge is expressed as an ontology. To evaluate the approach, we present first results testing logical consistency and the performance of GPT and other language models when answering whether concept inclusions from existing ℰℒ ontologies are 'true' or 'false'.

Keywords
Active Learning, Ontologies, Language Models

1. Introduction

Large language models (LLMs) have accumulated so much information, and improved their question-answering capabilities to such an extent, that we are now willing to interact with them and learn from them. Prompts to these models range from questions about general knowledge, such as basic definitions and historical events, to more domain-specific questions, e.g., scientific facts related to health and medicine. What can we learn from LLMs? And, since it is known that they can give false information, is there an automated way of discovering whether responses are incorrect, or at least inconsistent?

In this work we explore an active learning approach to learn from LLMs. In active learning [1], a learner attempts to learn some kind of knowledge by posing questions to a teacher. The questions posed by the learner are called membership queries and are answered with 'yes' or 'no' (or, equivalently, with 'true' or 'false') [2]. Here we consider that the teacher is an LLM and study the case in which the knowledge to be learned is expressed as an ontology. We use the Manchester OWL Syntax [3] in our prompts, as this syntax is closer to natural language. We present preliminary results showing the performance of GPT and other language models when answering whether concept inclusions created by an ontology engineer on prototypical ℰℒ ontologies are 'true' or 'false'.

2. Probing Language Models

Here we briefly describe challenges encountered when probing LLMs with ontology axioms and how we handled them.

Input Format and Unexpected Responses. One important factor is the format of the query. To systematically query an LLM with the goal of learning an ontology, it is useful to standardise the questions. For the membership query task, we investigate the use of the Manchester OWL syntax [3], as this is an ontology syntax designed to be closer to natural language. Another aspect to consider is that, in principle, there are no constraints on the answers returned by the language model. An LLM may answer with an arbitrary and unexpected response, even if the expected answer is just a single word, as in the case of membership queries in the exact learning model. To mitigate this issue, one can explicitly tell the LLM to answer with 'true' or 'false'. This request can be made in the question itself (e.g., by appending "Answer with 'true' or 'false'." after the query) or by exploiting hyper-parameters of the LLM's API. In the second case, one can use a system prompt, that is, text integrated into each query within the chat session, to provide the model with additional information and to steer its responses. We highlight that other hyper-parameters can also help drive the LLM's response towards the desired format (e.g., maximum number of tokens, temperature, etc.). Even with all these precautions, the model may return an unexpected response. For example: (i) the answer contains more text than just 'true' or 'false', (ii) both 'true' and 'false' appear in the answer, (iii) the answer contains neither 'true' nor 'false'. While in the first scenario trivial parsing determines the correct classification, in the remaining cases, since there is some ambiguity, we considered a third value, which we called 'unknown'.
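The following minimal Python sketch illustrates one way to implement this classification and to constrain the output format via a system prompt; the helper name and the prompt wording are illustrative assumptions, not necessarily the exact strings used in our experiments.

```python
def classify_response(raw_answer: str) -> str:
    """Map a raw LLM reply to 'true', 'false' or 'unknown'.

    Case (i):   extra text around a single verdict -> keep the verdict.
    Case (ii):  both 'true' and 'false' appear     -> ambiguous, 'unknown'.
    Case (iii): neither word appears               -> 'unknown'.
    """
    text = raw_answer.strip().lower()
    has_true, has_false = "true" in text, "false" in text
    if has_true and not has_false:
        return "true"
    if has_false and not has_true:
        return "false"
    return "unknown"


# Illustrative system prompt constraining the output format
# (an assumption, not the exact wording used in the experiments).
SYSTEM_PROMPT = (
    "You are a knowledgeable ontology engineer. "
    "Answer only with 'true' or 'false'."
)

if __name__ == "__main__":
    print(classify_response("True, every cat is an animal."))  # -> true
    print(classify_response("It depends on the context."))     # -> unknown
```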
Correctness and Logical Consistency. We also need to deal with challenges regarding the correctness of the responses (assuming that the format of the responses returned by the language model is as expected, see above). Actively learning ontologies has been investigated for various fragments of ℰℒ [4, 5, 6], though without using LLMs as teachers. If an LLM plays the role of the teacher, then there is no guarantee that the responses are correct [7] (in the sense of reflecting the 'truth' about the real world), nor that they are logically consistent with any ℰℒ ontology. Indeed, it is known that LLMs can learn statistical features instead of performing logical reasoning [8]. So, we need to consider the following kinds of errors:

1. 𝐶 ⊑ 𝐷 should be 'false' (w.r.t. the real world) but the LLM answers 'true';
2. 𝐶 ⊑ 𝐷 should be 'true' (w.r.t. the real world) but the LLM answers 'false';
3. all concept inclusions in 𝒯 = {𝐶1 ⊑ 𝐷1, . . . , 𝐶𝑛 ⊑ 𝐷𝑛} are answered with 'true' and 𝒯 |= 𝐶 ⊑ 𝐷, but 𝐶 ⊑ 𝐷 is classified as 'false'.

The last case is a logical inconsistency. One strategy to handle this issue is to consider the closure under logical consequence [9]. That is, in Point 3, one could consider 𝐶 ⊑ 𝐷 as 'true'.
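For inclusions between concept names, where entailment reduces to reachability over the positively answered inclusions, the check for this third kind of error can be sketched as follows (an illustrative simplification, not the procedure used in the experiments):

```python
from itertools import product


def entailed_atomic(true_inclusions: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Reflexive-transitive closure of inclusions between concept names.

    For atomic inclusions A ⊑ B this coincides with logical entailment,
    which is enough to detect errors of the third kind above.
    """
    names = {name for pair in true_inclusions for name in pair}
    closure = set(true_inclusions) | {(a, a) for a in names}
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(closure, repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def logical_inconsistencies(answers: dict[tuple[str, str], str]) -> list[tuple[str, str]]:
    """Inclusions answered 'false'/'unknown' although entailed by the 'true' ones."""
    positives = {ax for ax, verdict in answers.items() if verdict == "true"}
    closure = entailed_atomic(positives)
    return [ax for ax, verdict in answers.items()
            if verdict != "true" and ax in closure]


# Toy example: Cat ⊑ Mammal and Mammal ⊑ Animal are accepted,
# but the entailed Cat ⊑ Animal is rejected -- an error of kind 3.
answers = {("Cat", "Mammal"): "true",
           ("Mammal", "Animal"): "true",
           ("Cat", "Animal"): "false"}
print(logical_inconsistencies(answers))  # [('Cat', 'Animal')]
```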
3. Experiments

The experiments consist of performing a number of membership queries with multiple LLMs on prototypical ontologies. These are small ontologies taken from ontology repositories and used for experiments in the ExactLearner project [6] (https://github.com/bkonev/ExactLearner/), which focuses on ℰℒ ontologies. Generations, University, and Cell were also part of the Protégé Ontology Library, which is no longer maintained at https://protegewiki.stanford.edu/wiki/Protege_Ontology_Library but is still accessible via the web archive at https://web.archive.org/web/20210226123540/https://protegewiki.stanford.edu/wiki/Protege_Ontology_Library. In all ontologies considered, the logical closure is finite. We consider the following ontologies:

1. Animals contains knowledge related to the animal realm, including actual animals, subphyla, classes, orders, etc. The ontology has 12 (explicit) logical axioms in ℰℒ and 20 logical axioms in the logical closure (that is, taking into account inferred axioms).
2. Cell provides information about different cells based on their type, development stage and organism. The ontology has 24 logical axioms in ℰℒ and 24 in the logical closure.
3. Football is a minimal ontology that describes the relations between football games, teams, players and managers. It has 9 logical axioms in ℰℒ and 12 in the logical closure.
4. Generations describes the members and relations within a family. This ontology has 18 (explicit) logical axioms in ℰℒ and 42 in the logical closure.
5. University is a small ontology, focusing on the professor role, with 4 logical axioms in ℰℒ and 8 in the logical closure.

Table 1: Results for the experiments testing correctness w.r.t. the axioms in the ontologies. Labels T, F and U give the counts of 'true', 'false' and 'unknown' responses. The number of parameters of each model is indicated in parentheses (e.g., Mistral has 7 billion); the number of parameters of GPT 3.5 is not publicly known.

Model           Animals        University     Generations    Football       Cell
                 T   F   U      T   F   U      T   F   U      T   F   U      T   F   U
Mistral (7b)     9   1   2      2   0   2      5  10   3      7   2   0     17   1   6
Mixtral (47b)   11   1   0      4   0   0      3   6   9      9   0   0     15   9   0
Llama2 (7b)     11   1   0      4   0   0     16   1   1      9   0   0     24   0   0
Llama2 (13b)    11   1   0      4   0   0     16   1   1      9   0   0     23   1   0
Gpt3.5          10   2   0      4   0   0     13   4   1      9   0   0     21   3   0

Table 2: Results for the experiment testing logical consistency. The rows correspond to the same models, in the same order, as in Table 1; the number of parameters of each model and the meaning of T, F, U are also as in Table 1. L stands for logical inconsistencies (an axiom answered as 'false' or 'unknown' which can be inferred from the set of axioms answered as 'true', see Section 2).

Model           Animals            University         Generations        Football           Cell
                 T   F   U   L      T   F   U   L      T   F   U   L      T   F   U   L      T   F   U   L
Mistral (7b)    14   2   4   2      5   1   2   0     10  27   5   2      9   3   0   0     18   1   5   0
Mixtral (47b)   18   2   0   0      8   0   0   0     19  13  10   0     12   0   0   0     17   7   0   0
Llama2 (7b)     20   0   0   0      8   0   0   0     40   1   1   1     12   0   0   0     24   0   0   0
Llama2 (13b)    18   2   0   1      7   1   0   0     35   6   1   4     11   1   0   1     21   3   0   0
Gpt3.5          20   0   0   0      7   1   0   0     36   5   1   0     12   0   0   0     18   6   0   0

Table 3: Results for the experiments testing negative examples. Labels A, P and R mean 'Accuracy', 'Precision' and 'Recall', respectively [10]. The rows correspond to the same models, in the same order, as in Table 1.

Model           Animals             University          Generations         Football            Cell
                 A     P     R       A     P     R       A     P     R       A     P     R       A     P     R
Mistral (7b)    0.87  0.52  0.72    0.57  0.67  0.5     0.84  0.71  0.23    0.74  0.44  0.65    0.65  0.48  0.81
Mixtral (47b)   0.89  0.57  0.69    0.57  0.48  0.92    0.82  0.64  0.66    0.72  0.43  0.76    0.7   0.32  0.64
Llama2 (7b)     0.51  0.2   1       0.24  0.24  1       0.4   0.22  0.88    0.21  0.21  1       0.27  0.18  1
Llama2 (13b)    0.73  0.31  0.94    0.45  0.3   0.92    0.63  0.32  0.74    0.44  0.26  0.88    0.44  0.21  0.91
Gpt3.5          0.71  0.3   1       0.69  0.44  1       0.74  0.41  1       0.68  0.4   1       0.61  0.28  0.91

We use a total of 5 LLMs: OpenAI's GPT 3.5 Turbo [11], Mistral [12], Mixtral [13] and two Llama 2 [14] models, accessed through Ollama's API (https://github.com/ollama/ollama). Both Mistral and Mixtral are open models. Llama 2 is free of charge for research, while GPT can be expensive as it charges for each query. The source code of the experiments is publicly available at https://github.com/MatteoMagnini/ExactLearner.

For each logical axiom in an ontology we generate a membership query to an LLM using the Manchester OWL syntax. The goal is to test how well an LLM can correctly answer membership queries on different domains and without any fine-tuning, where 'correctly' means that it answers 'true' for the axioms in the ontology (even though ontologies may not match the real world, we expect them to be mostly correct). The results are in Table 1.
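The following sketch illustrates how such a query can be built from a simple representation of ℰℒ concepts; the concept encoding and the question template are illustrative assumptions, not necessarily the exact format used in the experiments.

```python
def manchester(concept) -> str:
    """Render an EL concept in Manchester OWL syntax.

    Concepts are concept names (str), conjunctions ("and", C, D),
    or existential restrictions ("some", r, C).
    """
    if isinstance(concept, str):
        return concept
    if concept[0] == "and":
        return f"({manchester(concept[1])} and {manchester(concept[2])})"
    if concept[0] == "some":
        return f"({concept[1]} some {manchester(concept[2])})"
    raise ValueError(f"unsupported constructor: {concept[0]!r}")


def membership_query(lhs, rhs) -> str:
    """Build the question sent to the LLM for the concept inclusion lhs ⊑ rhs."""
    return (f"Is the following axiom true or false? "
            f"{manchester(lhs)} SubClassOf: {manchester(rhs)} "
            f"Answer with 'true' or 'false'.")


# Hypothetical axiom for illustration: Cat ⊓ ∃eats.Mouse ⊑ Carnivore
print(membership_query(("and", "Cat", ("some", "eats", "Mouse")), "Carnivore"))
```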
We generate all the inferred axioms using the HermiT [15] reasoner (as mentioned above, the logical closure of the ontologies is finite) and we repeat the experiments with the new ontologies. Probing the LLMs on ontologies with inferred axioms is useful to test logical consistency. While it is possible that the LLMs could have seen these ontologies during their training (since they are available online), it is unlikely that this is the case for the inferred axioms, since they are not explicitly present in the ontologies. The results are in Table 2.

We perform a third experiment where we actively learn ontologies by means of a naive learning algorithm that asks all concept inclusions of the form 𝐴 ⊑ 𝐵 with 𝐴, 𝐵 concept names in a given signature (the ontologies contain complex ℰℒ concepts, but in this experiment we only consider concept names, to reduce the number of membership queries). The results are in Table 3.
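A sketch of this naive algorithm, together with an accuracy/precision/recall evaluation against the logical closure, is given below; the `ask_llm` oracle is a placeholder rather than an actual API of any of the models, and treating every answer other than 'true' as a negative prediction is one possible convention, used here only for illustration.

```python
from itertools import permutations


def naive_learning(concept_names, ask_llm):
    """Ask the oracle about every atomic inclusion A ⊑ B over the signature."""
    return {(a, b): ask_llm(f"Is the following axiom true or false? "
                            f"{a} SubClassOf: {b} Answer with 'true' or 'false'.")
            for a, b in permutations(concept_names, 2)}


def evaluate(answers, closure):
    """Accuracy, precision and recall of 'true' answers w.r.t. the logical closure."""
    tp = sum(1 for ax, v in answers.items() if v == "true" and ax in closure)
    fp = sum(1 for ax, v in answers.items() if v == "true" and ax not in closure)
    fn = sum(1 for ax, v in answers.items() if v != "true" and ax in closure)
    tn = len(answers) - tp - fp - fn
    accuracy = (tp + tn) / len(answers)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Toy usage with a hard-coded "oracle" and a two-axiom closure.
closure = {("Cat", "Mammal"), ("Cat", "Animal")}
oracle = lambda question: "true" if "Cat SubClassOf:" in question else "false"
answers = naive_learning(["Cat", "Mammal", "Animal"], oracle)
print(evaluate(answers, closure))  # -> (1.0, 1.0, 1.0)
```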
We applied the Chi-squared test to check the relationship between the answers of the LLMs and the ontologies, with the null hypothesis being that there is no correlation. We rejected the null hypothesis in every case (p-value lower than 0.05) except the ones highlighted in yellow. Mistral/Mixtral were competitive with GPT 3.5 and performed better than the Llama 2 models. The LLMs performed well on ontologies with general knowledge (e.g., Animals, Generations). As future work, we would like to build on these first results and extend the experiments to larger ontologies. Moreover, we plan to investigate the task of actively learning ontologies from LLMs using the ExactLearner [6].

Acknowledgements

Ana Ozaki is supported by the Research Council of Norway, project number 316022.

References

[1] D. Angluin, Computational learning theory: Survey and selected bibliography, in: S. R. Kosaraju, M. Fellows, A. Wigderson, J. A. Ellis (Eds.), Proceedings of the 24th Annual ACM Symposium on Theory of Computing, ACM, 1992, pp. 351–369. doi:10.1145/129712.129746.
[2] D. Angluin, Queries and concept learning, Mach. Learn. 2 (1987) 319–342. doi:10.1007/BF00116828.
[3] M. Horridge, N. Drummond, J. Goodwin, A. L. Rector, R. Stevens, H. Wang, The Manchester OWL syntax, in: B. C. Grau, P. Hitzler, C. Shankey, E. Wallace (Eds.), Proceedings of the OWLED*06 Workshop on OWL: Experiences and Directions, Athens, Georgia, USA, November 10-11, 2006, volume 216 of CEUR Workshop Proceedings, CEUR-WS.org, 2006. URL: https://ceur-ws.org/Vol-216/submission_9.pdf.
[4] B. Konev, C. Lutz, A. Ozaki, F. Wolter, Exact learning of lightweight description logic ontologies, J. Mach. Learn. Res. 18 (2017) 201:1–201:63. URL: http://jmlr.org/papers/v18/16-256.html.
[5] A. Ozaki, C. Persia, A. Mazzullo, Learning query inseparable ELH ontologies, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, AAAI Press, 2020, pp. 2959–2966. doi:10.1609/AAAI.V34I03.5688.
[6] M. R. C. Duarte, B. Konev, A. Ozaki, ExactLearner: A tool for exact learning of EL ontologies, in: M. Thielscher, F. Toni, F. Wolter (Eds.), Principles of Knowledge Representation and Reasoning: Proceedings of the Sixteenth International Conference, KR 2018, Tempe, Arizona, 30 October - 2 November 2018, AAAI Press, 2018, pp. 409–414. URL: https://aaai.org/ocs/index.php/KR/KR18/paper/view/18006.
[7] M. Funk, S. Hosemann, J. C. Jung, C. Lutz, Towards ontology construction with language models, in: S. Razniewski, J. Kalo, S. Singhania, J. Z. Pan (Eds.), Joint proceedings of the 1st workshop on Knowledge Base Construction from Pre-Trained Language Models (KBC-LM) and the 2nd challenge on Language Models for Knowledge Base Construction (LM-KBC) co-located with the 22nd International Semantic Web Conference (ISWC 2023), Athens, Greece, November 6, 2023, volume 3577 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3577/paper16.pdf.
[8] H. Zhang, L. H. Li, T. Meng, K. Chang, G. Van den Broeck, On the paradox of learning to reason from data, in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China, ijcai.org, 2023, pp. 3365–3373. doi:10.24963/ijcai.2023/375.
[9] S. Blum, R. Koudijs, A. Ozaki, S. Touileb, Learning horn envelopes via queries from language models, International Journal of Approximate Reasoning (2023) 109026. doi:10.1016/j.ijar.2023.109026.
[10] M. Grandini, E. Bagli, G. Visani, Metrics for multi-class classification: an overview, CoRR abs/2008.05756 (2020). URL: https://arxiv.org/abs/2008.05756. arXiv:2008.05756.
[11] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[12] A. Q. Jiang, et al., Mistral 7B, CoRR abs/2310.06825 (2023). doi:10.48550/ARXIV.2310.06825.
[13] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mixtral of experts, CoRR abs/2401.04088 (2024). doi:10.48550/ARXIV.2401.04088.
[14] H. Touvron, et al., Llama 2: Open foundation and fine-tuned chat models, CoRR abs/2307.09288 (2023). doi:10.48550/ARXIV.2307.09288.
[15] B. Glimm, I. Horrocks, B. Motik, G. Stoilos, Z. Wang, HermiT: An OWL 2 reasoner, J. Autom. Reason. 53 (2014) 245–269. doi:10.1007/S10817-014-9305-1.