Learning Domain Ontologies Based On Top-Level Ontology Concepts Using Language Models And Informal Definitions Alcides Lopes1,∗,† , Joel Carbonera1,† and Mara Abel1,† 1 Instituto de Informática, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brazil Abstract Ontology development is a challenging task that encompasses many time-consuming activities. One of these activities is the classification of the domain entities (concepts and instances) according to top-level concepts. This activity is usually performed manually by an ontology engineer. However, when the set of entities increases in size, associating each domain entity to the proper top-level ontological concept becomes challenging and requires a high level of expertise in both the target domain and ontology engineering. In this context, this work describes an approach for learning domain ontologies based on top-level ontology concepts using informal definitions as input. In our approach, we used informal definitions of the domain entities as text input of a language model that predicts their proper top-level concepts. Also, we present a methodology to extract datasets from existing domain ontologies to evaluate the proposed approach. Our experiments show that we have promising results in classifying domain entities into top-level ontology concepts. Keywords Ontology learning, Deep neural network, Text classification 1. Introduction Over the years, ontologies have proved valuable in many domains, such as geology [1, 2] and biomedicine [3, 4]. In the literature, some methodologies for ontology development stands on a more abstract ontology, called top-level ontology [5, 6, 7], to explicitly define the ontological nature of the domain entities through the specialization of top-level concepts. The domain ontologies developed based on top-level ontologies have the advantage of adhering to a philo- sophically well-founded meaning. However, identifying which top-level concept generalizes a domain entity in complex domains is a laborious and time-consuming task that requires manual work and a high level of expertise in both the target domain and ontology engineering [8]. In this work, we proposed an approach to learning domain ontologies based on top-level ontology concepts using language models and the informal definitions of the domain entities. In our view, automatizing the task of learning which top-level concept a domain entity specializes ONTOBRAS 2022 - 15th Seminar on Ontology Research in Brazil, 22–25 November 2022, Online ∗ Corresponding author. † These authors contributed equally. Envelope-Open agljunior@inf.ufrgs.br (A. Lopes); jlcaronera@inf.ufrgs.br (J. Carbonera); marabel@inf.ufrgs.br (M. Abel) Orcid 0000-0003-0622-6847 (A. Lopes); 0000-0002-4499-3601 (J. Carbonera); 0000-0002-9589-2616 (M. Abel) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) has many benefits to ontology engineering, mainly because allowing a more rational resource allocation in ontology development processes since the ontology engineer can invest more time in more complex tasks. Also, we hypothesize that knowing the top-level concept of the domain entities can help in automatically discovering the relationships between that domain entities. In order to evaluate our proposal, we proposed to extract datasets from existing domain ontologies developed based on two top-level ontologies: Dolce-Lite (DL) and Dolce-Lite-Plus (DLP) [9]. Our experiments show that language models have promising results in classifying domain entities into top-level ontology concepts, with 94% of micro F1-score. The paper is organized as follows: firstly, we describe the background notions that support this proposal. Secondly, we present the current state of our research, describing the dataset extraction and the proposed approach for learning domain ontologies based on top-level ontology concepts using language models and informal definitions. After that, we show the current results achieved from our experiments. Finally, we present our future works. 2. Related Works The manual development of ontologies is complex and laborious, bringing a significant challenge to the natural language processing area on automatically creating ontologies as sophisticated as those made by humans. In this context, several approaches propose automatizing specific activi- ties in the ontology development process, such as domain entity identification and classification [10, 11, 8, 12, 13], semantic relation identification and classification [14, 12], etc. In [11], the authors proposed an automatic methodology to extract domain entities from text corpora based on a domain-specific thesaurus and rank them based on their frequency in the corpora. Afterward, the geology engineers manually provided definitions to the selected entities, and domain experts manually classified them according to their respective concepts in the GeoCore ontology [15]. In [12], the authors proposed a framework to automatize domain ontology learning. Their framework has four steps: 1) Extract sentences from text corpora and extract their triples using NLP tools. 2) Matches the extracted sentences with public knowledge bases to extract the concept pairs. 3) Labels common concept pairs between the outputs of steps 1 and 2. 4) Learns the relationship between each concept pair using a bi-directional long-short term memory (Bi-LSTM) model and a convolutional neural network (CNN). In [10], the authors proposed a methodology for ontology learning from big data. They used word embedding and hierarchical clustering to improve the quality of the ontology entities extraction from textual corpora and reduce the processing time. Their methodology started by extracting the most relevant word of the domain. After that, they identified the concepts and their properties using a set of pre-defined rules based on the Part-Of-Speech (POS) tag. The authors created the ontology hierarchy from the identified concepts by combining a word embedding model and a hierarchical clustering algorithm. Finally, they used the word embedding of the properties and the concepts to identify if the properties are data properties or object properties. The authors tested the proposed methodology using gold-standard datasets for ontology learning. In [13], the authors proposed a semi-automatic methodology for ontology learning based on a domain-specific corpus. The first step of this methodology is setting the main classes of the ontology and selecting the terms that specialize them. The approach accomplished the term selection by extracting all adjectives, nouns, and verbs from the domain corpus and requiring user intervention to classify some of these terms under the main classes. Thus, the methodology used a similarity function to classify the remaining entities according to the most similar previously classified terms. Finally, the authors used hierarchical clustering algorithms to build the final ontology hierarchy. The main drawback of this methodology is the necessity of human intervention in all of those steps. Here, we revised some recent works on the ontology learning domain, but many more works propose solutions to learn ontologies from text [16, 17]. However, no one of them focuses on the task of learning the top-level concepts of the domain entities. In our view, this task is essential to create more powerful domain ontologies by adhering to a philosophically well-founded meaning. Also, in this context, there is a lack of datasets to evaluate new proposals on this line. Thus, in this work, in addition to proposing an approach to accomplish this learning task, we also propose several datasets extracted from existing domain ontologies developed under top-level concepts. 3. Current work In this section, we first describe the methodology used to select the domain ontologies developed under top-level ontology concepts and the process performed to extract the datasets from these ontologies. After that, we present our proposal for learning domain ontologies based on top-level ontology concepts using language models and informal definitions. 3.1. Dataset Extraction In our work, we aim to find domain ontologies developed under top-level ontologies containing informal definitions of their domain entities to build the datasets used for evaluating our proposed approach. In this context, in order to extract the datasets for predicting the top-level concepts of Dolce-Lite and Dolce-Lite-Plus ontologies, we select the OntoWordNet ontology [9]. This general domain ontology aligns 86,982 entities obtained from the WordNet synsets [18] with the Dolce-Lite-Plus (DLP) ontology structure. This top-level ontology is an extension of the Dolce-Lite (DL) ontology with several modules for representing information, communication, plans, and domain information, for example, legal and biomedical notions. Thus, for each domain entity extracted from OntoWordNet ontology, we took the lowest DLP and DL top-level concepts that it specializes, its informal definition, and the set of its labels. For each label in this set, we created an instance inside both DLP and DL datasets containing the label, the informal definition, and the respective top-level concept referent to the dataset. Finally, the Dolce-Lite-Plus dataset contains 90 classes, the Dolce-lite dataset contains 20 classes, and both datasets have 120,489 instances. 3.2. Learning the top-level concepts of domain entities In our view, developing an approach that automatizes the prediction of the top-level concepts of the domain entities using only the informal definitions of these domain entities has many benefits for ontology engineering and artificial intelligence fields. For example, we can use this approach as a decision support system, thus allowing a more rational resource allocation in ontology development processes since the ontology engineer can invest more time in more complex tasks. Also, we can insert the notion of top-level ontology concepts in natural language processing tasks (e.g., named entity recognition, text classification, relationship prediction, etc.). As the last example, we can use this approach to predict the concepts of any top-level ontology, thus allowing the development of a classification system for multiple top-level ontologies. We explored several machine-learning approaches to develop architectures for classifying domain entities into top-level ontology concepts using the informal definition of these domain entities. In this context, we have used three kinds of architecture. In the first kind, we explored the word embeddings of the terms that represent the domain entities and input these word embeddings in several machine learning models, such as Random Forest (RF), Linear Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), Bernoulli Naive Bayes (BNB), Gaussian Naive Bayes (GNB), Feed-Forward neural network (FNN), and a bi-LSTM neural network. However, this architecture fails to deal with polysemic words (words with more than one sense) as the pre-trained word embedding models. In the second kind, we insert the informal definitions into a bi-LSTM neural network and concatenate its hidden layers with the hidden layer of the previous architecture before outputting the predicted class. Thus, we solve the problem of polysemic words. In the third architecture, we adopted language models (e.g., BERT, ELECTRA, ALBERT, etc.) and combined the term and the informal definitions in a single string before inputting it into the model. We achieved our best results using the third architecture. 4. Current experiments and results We develop two experiments to evaluate the three proposed architectures in classifying domain entities into concepts specified by top-level ontologies. In both experiments, we applied the stratified k-fold cross-validation approach (with 𝑘 = 10) to split the DLP dataset into train and test folds. In the first experiment, we compared the first proposed architecture against the second proposed architecture. Also, we selected the 30 most populated classes in the DLP dataset. From these classes, we downsample the training fold according to the size of the less populated class and insert the removed instances into the test fold. In the second experiment, we compared the second architecture with the proposed third architecture using the BERT-Base language model. In this experiment, we fined-tuned the BERT-Base model for our classification task and do not apply any sampling strategy in the train and test folds. Also, we selected the top 82 classes of the DLP dataset. According to Table 1, in the first experiment, the second architecture achieved 59% of the F1-micro score against 57% of the first architecture with the SVM model. These results suggest that combining the informal definitions with the term representing the domain entities improves the performance of classifying domain entities into top-level concepts and also solves the first architecture’s polysemy problem. Thus, making it possible to use more instances in the training Experiment Architecture F1-micro First architecture .57 1 with SVM model Second architecture .59 Second architecture .54 2 Third architecture .94 with BERT-Base model Table 1 The comparison between each architecture in the two performed experiments and test folds. However, in the second experiment, the third architecture with the BERT-Base model reached an outstanding result compared with the second architecture. In this experiment, the third architecture achieves 94% of the micro F1-score against 54% of the second architecture. One reason for this is due to combining the term and the informal definition into a single textual sentence. Another reason is those language models are state-of-the-art for natural language processing tasks. 5. Future works Extracting novel datasets. Nowadays, we have datasets for the Dolce-Lite, Dolce-Lite-Plus, and BFO top-level ontologies. Nevertheless, we aim to increase the number of instances in each dataset by exploring other domain ontologies developed under these top-level ontologies, or knowledge graphs developed from WordNet, such as BabelNet. From the BabelNet, we have access to multiple sources of informal definitions in many languages. Learning the relationships between domain entities. From the presented architectures, we aim to combine the task of learning the top-level concepts of domain entities with the task of predicting the relationship between domain entities. We hypothesize that we can improve the results of the latter task by previously classifying the top-level concept of the evaluated domain entities. Acknowledgements The authors gratefully acknowledges the financial support of the Brazil Federal Agencies CAPES and CNPQ, and the grant and scientific cooperation of PETROBRAS company. References [1] L. F. Garcia, M. Abel, M. Perrin, R. dos Santos Alvarenga, The geocore ontology: A core ontology for general use in geology, Computers & Geosciences 135 (2020) 104387. [2] F. Cicconeto, L. V. Vieira, M. Abel, R. dos Santos Alvarenga, J. L. Carbonera, L. F. Garcia, Georeservoir: An ontology for deep-marine depositional system geometry description, Computers & Geosciences 159 (2022) 105005. [3] K. Degtyarenko, P. De Matos, M. Ennis, J. Hastings, M. Zbinden, A. McNaught, R. Alcántara, M. Darsow, M. Guedj, M. Ashburner, Chebi: a database and ontology for chemical entities of biological interest, Nucleic acids research 36 (2007) D344–D350. [4] G. O. Consortium, The gene ontology resource: 20 years and still going strong, Nucleic acids research 47 (2019) D330–D338. [5] N. Guarino, C. A. Welty, An overview of ontoclean, Handbook on ontologies (2004) 151–171. [6] G. Guizzardi, Ontological foundations for structural conceptual models, Ph.D. thesis, University of Twente, 2005. [7] R. Arp, B. Smith, A. D. Spear, Building ontologies with basic formal ontology, Mit Press, 2015. [8] A. G. L. Junior, J. L. Carbonera, D. Schimidt, M. Abel, Predicting the top-level ontologi- cal concepts of domain entities using word embeddings, informal definitions, and deep learning, Expert Systems with Applications 203 (2022) 117291. [9] A. Gangemi, R. Navigli, P. Velardi, The ontowordnet project: Extension and axiomatization of conceptual relations in wordnet, in: R. Meersman, Z. Tari, D. C. Schmidt (Eds.), On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, Springer Berlin Heidelberg, Berlin, Heidelberg, 2003, pp. 820–838. [10] N. Mahmoud, H. Elbeh, H. M. Abdlkader, Ontology learning based on word embeddings for text big data extraction, in: 2018 14th International Computer Engineering Conference (ICENCO), 2018, pp. 183–188. [11] L. F. Garcia, F. H. Rodrigues, A. Lopes, R. d. S. A. Kuchle, M. Perrin, M. Abel, What geologists talk about: Towards a frequency-based ontological analysis of petroleum domain terms., in: ONTOBRAS, 2020, pp. 190–203. [12] J. Chen, J. Gu, Adol: a novel framework for automatic domain ontology learning, The Journal of Supercomputing 77 (2021) 152–169. [13] F. ten Haaf, C. Claassen, R. Eschauzier, J. Tjan, D. Buijs, F. Frasincar, K. Schouten, Web-soba: Word embeddings-based semi-automatic ontology building for aspect-based sentiment classification, in: R. Verborgh, K. Hose, H. Paulheim, P.-A. Champin, M. Maleshkova, O. Corcho, P. Ristoski, M. Alam (Eds.), The Semantic Web, Springer International Publishing, Cham, 2021, pp. 340–355. [14] A. Lamurias, D. Sousa, L. A. Clarke, F. M. Couto, Bo-lstm: classifying relations via long short-term memory networks along biomedical ontologies, BMC bioinformatics 20 (2019) 1–12. [15] L. F. Garcia, M. Abel, M. Perrin, R. dos Santos Alvarenga, The geocore ontology: a core ontology for general use in geology, Computers & Geosciences 135 (2020) 104387. [16] A. Konys, Knowledge repository of ontology learning tools from text, Procedia Computer Science 159 (2019) 1614–1628. [17] J. Watrobski, Ontology learning methods from text-an extensive knowledge-based ap- proach, Procedia Computer Science 176 (2020) 3356–3368. [18] G. A. Miller, Wordnet: a lexical database for english, Communications of the ACM 38 (1995) 39–41.