Automated and explainable ontology extension based on deep learning: A case study in the chemical domain

Adel Memariani (1), Martin Glauer (1), Fabian Neuhaus (1,2), Till Mossakowski (1) and Janna Hastings (1,3)
(1) Otto von Guericke University Magdeburg, Germany
(2) Free University of Bozen-Bolzano, Italy
(3) University College London, UK

Abstract
Reference ontologies provide a shared vocabulary and knowledge resource for their domain. Manual construction enables them to maintain a high quality, allowing them to be widely accepted across their community. However, the manual development process does not scale for large domains. We present a new methodology for automatic ontology extension and apply it to the ChEBI ontology, a prominent reference ontology for life sciences chemistry. We trained a Transformer-based deep learning model on the leaf node structures from the ChEBI ontology and the classes to which they belong. The model is then capable of automatically classifying previously unseen chemical structures. The proposed model achieved an overall F1 score of 0.80, an improvement of 6 percentage points over our previous results on the same dataset. Additionally, we demonstrate how visualizing the model's attention weights can help to explain the results by providing insight into how the model made its decisions.

Keywords
ontology extension, ontology generation, ontology learning, chemical ontology, Transformers, automated classification, transfer learning, multi-label classification

1. Introduction

Ontologies represent knowledge in a way that is both accessible to humans and machine interpretable. Reference ontologies provide a shared vocabulary for a community, and are successfully being used in a range of different domains. Examples include the OBO ontologies in the life sciences [1], the Financial Industry Business Ontology for the financial domain [2], and the Open Energy Ontology in the energy domain [3]. While these ontologies differ in many respects, they share one important feature: they are manually created by experts using a process by which each term is manually added to the ontology, including a textual definition, relevant axioms, and ideally some additional documentation. Often, this process involves extensive discussions about individual terms. Hence, developing such ontologies is a time-intensive and expensive process.

This leads to a challenge for ontologies that cover a large domain. For example, the ChEBI (Chemical Entities of Biological Interest) ontology [4] is the largest and most widely used ontology for the domain of biologically relevant chemistry in the public domain. It currently (as of June 2021) contains 59,122 fully curated classes, which makes it large in comparison to other reference ontologies. ChEBI is largely manually maintained by a team of expert curators. This is an essential prerequisite for its success, because it enables it to capture the terminology and classification logic shared by chemistry experts. However, the number of chemicals covered by ChEBI is dwarfed by the 110 million chemicals in the PubChem database [5], which itself is not comprehensive.
The manually curated portion of ChEBI only grows at a rate of around 100 entries per month, and will thus only ever be able to cover a small fraction of the chemicals that are in its domain. ChEBI tries to navigate this dilemma by extending the manually curated core part of the ontology automatically using the ClassyFire tool [6]. This approach has tripled ChEBI's coverage to 165,000 classes (as of June 2021). However, there are limitations to this approach. Firstly, ClassyFire uses a different underlying classification approach than ChEBI (e.g. conjugate bases and acids are not distinguished); thus, mapping to ChEBI loses classification precision. More importantly, ClassyFire is rule-based, and while the extension of the ontology is automated, the creation and curation of ClassyFire's rules is not. This limits the scalability of the approach.

Somewhat inspired by ChEBI's workflow, we suggest navigating the ontology scaling dilemma with a new kind of approach to ontology extension, which transfers the design decisions of an existing ontology analogously to new classes and relations. Our starting point is an existing, manually curated reference ontology. We suggest the use of machine learning methods to learn some of the criteria that the ontology developers adopted in the development of the ontology, and then use the learned model to extend the ontology to entities that have not yet been covered by the manual ontology development process. We will illustrate this approach in this paper for the chemistry use case by training an artificial neural network (with a Transformer-based architecture) to automate the extension of ChEBI with new classes of chemical entities. The approach has several benefits: since it builds on top of the existing ontology, the extension preserves the manually created consensus. Moreover, the model is trained solely on the content of the ontology itself and does not rely on any external sources. Finally, as we will see, the chosen architecture allows explanation of the choices of the neural network and, thus, validation of the trained model to some degree by manual inspection.

In the next two sections we discuss related work and the overall methodology that we use to train a model for classifying new classes of chemical entity as subclasses of existing classes in ChEBI.

2. Related Work

In this paper, we present a methodology for ontology extension, which can be considered a kind of ontology learning. Ontology learning has been an active area of research for more than two decades [7, 8, 9, 10, 11], and a number of automated ontology generators have been developed. A recent publication [11] defined a list of six desirable goals for ontology learning methods: they should support expressive languages, require a small amount of time and training data, require limited or no human intervention, support unsupervised learning, handle inconsistencies and noise, and their results should be interpretable.

The fundamental make-up of the resulting ontologies varies largely, in part due to different notions of what constitutes an ontology. A survey-based study by Biemann [9] defines three classes of ontologies: formal, prototype-based and terminological ontologies. Most early and data-driven approaches resulted in prototype-based ontologies, in which concepts are not defined in natural language or by logical formulae, but solely by their members. New concepts are often derived from metric-based aggregations such as hierarchical clustering [12].
The quality of the resulting classification depends strongly on the chosen representation of individuals and the criteria for similarity, and may not agree with distinctions that are used by domain experts.

Advances in natural language processing led to a different class of approaches: terminological ontologies. Here, artificial intelligence is used to analyse corpora of relevant literature in order to extract important terms and their relations. Yet, these approaches reflect rather than resolve the inherent ambiguities and differences in language use that exist between different communities of domain experts or even within single communities. The resolution of these ambiguities is an essential part of the ontology development process that involves extensive in-depth communication with and between domain experts [3]. Finally, formal ontologies place a strong emphasis on definitions distinguishing entities, and a rich logical axiomatisation that yields a powerful foundation for reasoning and data integration [13, 14, 15].

While the majority of existing approaches in ontology learning focus on creating new ontologies from scratch, the ones that are dedicated to ontology extension use the ontology as a seed to identify terms that are important for the target domain [16, 17, 18, 19, 20]. These are used to guide approaches that are similar to those that are applied to learn ontologies 'from scratch'. Hence, the resulting extensions are not necessarily based on the principles that have been employed to develop the ontology in the first place, and may potentially introduce biases from the literature into the ontology. Some approaches involve several manual steps, in which experts evaluate concepts and related phrases to sort out these potential issues [21]. Involving human experts has the advantage of providing quality control, but is labour-intensive and costly.

Our approach differs from the existing work in that it employs machine learning techniques but does not rely on text corpora. Rather, it relies only on the content of the ontology that is being extended, in particular on structured annotations. Our specific application domain is chemical ontology. One characteristic of chemical ontologies is the fact that many classes of chemical entities are annotated with information about their chemical structure. Particularly important for our purposes are annotations in the Simplified Molecular-Input Line-Entry System (SMILES) [22], which is used to represent chemical entities as a linear sequence of characters. The SMILES notation is analogous to a language for describing atoms and their bonds within a chemical entity.

In our approach, we train a deep learning classifier on an existing chemical ontology with structural annotations. The learning method biases the classifier towards the ontology's internal structure, yielding a model that is in line with the domain experts' conceptualisation as represented in the existing ontology. The resulting model is then used to integrate previously unseen classes into the ontology. This is a novel approach to the problem of chemical classification, a task that has historically been approached in multiple different ways [23]. Solutions that involve deep-learning methods have been successfully employed for many other applications in chemistry [24], such as the prediction of properties of chemicals [25] or of reaction behaviour [26]. Yet, the automated classification of chemicals using deep learning according to an existing ontology has been largely unexplored.
The ClassyFire tool [6] is, at the time of writing, the most comprehensive method for structure-based automated chemical ontology extension. However, it uses a rule-based and algorithmic implementation that is cumbersome to maintain and is not able to adapt as the underlying ontology changes. In our previous work [27], we evaluated several classifiers for this task, including a long short-term memory (LSTM) model, which was the best-performing overall. The results of this effort were satisfactory as a whole, but several specific limitations were identified. In particular, the model failed to provide any prediction for a subset of input molecules, and the system as a whole offered no explainability. The current contribution harnesses a Transformer-based architecture and describes how the attention weights of the resulting model can provide insights into how the model made its decisions. Furthermore, the use of transfer learning makes this data- and compute-hungry method computationally more feasible and thus more broadly applicable.

3. Methodology

Our goal is to train a system that automatically extends the ChEBI ontology with new classes of chemical entities (such as molecules) based on the design decisions that are implicitly reflected in the structure of ChEBI. Thus, for our work we take the 'upper level' of the ontology, which contains generic distinctions, as given. Our focus is the extension of the ChEBI ontology with classes of chemical entities that may be characterised by a SMILES string, i.e., are associated with a specific chemical structure. (These classes are not necessarily leaf nodes in the ontological hierarchy, but nevertheless tend to be in the 'lower' part of the hierarchy.) The learning task for ontology extension may thus be characterised as follows: Given a class of chemical entities (characterised by a SMILES string), what are its optimal direct superclasses in ChEBI? While our goal, from an ontological point of view, is to extend the ChEBI ontology with new classes (i.e., adding new subsumptions), from a machine learning perspective we turned this problem into a classification task, for which we prepare an appropriate learning dataset from the ontology.

Hierarchical chemical classifications should group chemical compounds in a scientifically valid and meaningful way [23, 28]. Each chemical entity has many structural features which contribute to its potential structure-based classification, and structures that determine different classes may occur in a single molecule. Thus, ChEBI contains classes that overlap (i.e. share members). The ChEBI ontology provides two separate classification hierarchies for chemical entities: one based on their structures and another based on their functions or uses. In the current work, we focus on the structure-based sub-ontology. Entities in the structure-based sub-ontology are often associated with specifications of their molecular structures, particularly, but not exclusively, the leaf nodes within the classification hierarchy. In ChEBI, a chemical entity with a defined structure can be the classification parent of another structurally defined entity, since all entities are classes according to the ChEBI ontology, and there can be different levels of specificity even amongst structurally defined classes.
To formulate a supervised machine learning problem, however, we need to create a distinction between those entities with chemical structures that form the input for learning, and the chemical classes that they belong to that form the learning target. This distinction is created by sampling structurally defined entities only from the ontology leaf nodes.

Figure 1: Fentin hydroxide and its hierarchical classes. Blue lines indicate the sub-class relationships.

As mentioned above, the SMILES notation is analogous to a language for describing atoms and their bonds within a chemical structure. Intuitively, this leads to a correspondence between the processing of chemical structures in this type of representation and natural language processing [29]. Therefore, architectures that have been successfully applied to language-based problems can also be employed for this multi-label prediction task. One of these successful architectures is Bidirectional Encoder Representations from Transformers (BERT) [30], a precursor of the RoBERTa architecture that our approach is based on. The BERT architecture offers a learning paradigm that enables pre-training the model on unlabeled data and then fine-tuning it for the ultimately desired task. Fine-tuning can be done by adding one additional layer to the pre-trained model, without requiring major modifications to the model's architecture. BERT is pre-trained on two unsupervised tasks: Masked Language Modeling (MLM), in which some tokens are randomly removed from the input sequences and the model is trained to predict the masked tokens, and Next Sentence Prediction (NSP), a binary classification task that predicts whether or not the second sentence in the input sequence follows the first sentence in the original text.

The Robustly optimized BERT approach (RoBERTa) [31] is an extension of the BERT model and offers several improvements with minor changes in the pre-training strategy. RoBERTa does not include the NSP part of BERT, and it employs a dynamic masking approach as a replacement for the original masking scheme of BERT. While the original BERT model only applies masking once during data preprocessing, the RoBERTa model dynamically changes the masking pattern on each training sequence in every epoch. As a result, the model is exposed to different versions of the same input data with masks at various locations.

Since chemical structures in ChEBI typically belong to several ontology classes, the problem of automated chemical entity categorization can be viewed as a multi-label prediction task. Figure 1 shows the fentin hydroxide molecule and its parents in the ChEBI ontology: organotin compound and hydroxides. Our approach (code available at https://github.com/adelmemariani/chebi-roberta) pre-trains a RoBERTa model on SMILES strings, and then predicts multiple chemical class memberships. The overall architecture is shown in Figure 3. This architecture is similar to the one that was used for molecular property prediction by Chithrananda et al. [32].

Figure 2: Left: class counts in the dataset. Right: number of members per number of assigned classes.
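To make the pre-training/fine-tuning setup concrete, the following sketch shows how such a model can be assembled with the Hugging Face transformers library. The checkpoint path is a placeholder, and the snippet is an illustration of the architecture described here rather than a verbatim excerpt of our implementation (see the repository linked above and the hyper-parameters in Table 1).

```python
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaForSequenceClassification

NUM_CLASSES = 500  # number of ChEBI target classes in our dataset

# Pre-training: a RoBERTa encoder trained from scratch with masked language modelling on SMILES.
config = RobertaConfig(
    vocab_size=1395,                      # size of the BPE vocabulary learned from SMILES (Table 1)
    num_hidden_layers=6,
    num_attention_heads=12,
    attention_probs_dropout_prob=0.1,
    hidden_act="gelu",
)
mlm_model = RobertaForMaskedLM(config)
# ... pre-train mlm_model with 15% dynamic token masking on the SMILES corpus, then save it:
# mlm_model.save_pretrained("chebi-roberta-pretrained")   # placeholder path

# Fine-tuning: reuse the pre-trained encoder with a multi-label classification head;
# problem_type selects sigmoid outputs trained with a BCEWithLogitsLoss objective.
classifier = RobertaForSequenceClassification.from_pretrained(
    "chebi-roberta-pretrained",
    num_labels=NUM_CLASSES,
    problem_type="multi_label_classification",
)
```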
3.1. Dataset

To use the existing ontology classification as input to the learning task, the ontology first has to be transformed into an appropriate form. The ontology classification is inherently unbalanced, as different classes have different numbers of members and are partially overlapping. It is therefore necessary to define a sampling strategy for selecting leaf node entities and classes that minimizes the impact of this imbalance on the training. In order to be able to compare our results to our earlier findings, we have used the same dataset (https://doi.org/10.5281/zenodo.4519815) and sampling strategy as was used in Hastings et al. [27]. Using only the hierarchical sub-class relations in the ChEBI ontology, this dataset was created by randomly sampling leaf node molecular entities from higher-level classes that they are subclasses of, using an algorithm that aimed to minimize (as far as possible) class overlap, described in Section 3 of [27]. The resulting dataset contained a total of 500 molecule classes and 31,280 molecules.

Despite these balancing measures, the dataset still suffers from certain imbalances. Figure 2 (left) illustrates the number of times each class appears in the training and test datasets. As illustrated, some of the classes appear more frequently than others. Figure 2 (right) shows the number of members per number of associated classes. For example, 7,864 members have just one assigned class, whereas three members have 17 classes assigned. To train, validate and test our model, we divided the dataset into three subsets: a training set containing 21,896 molecules, a validation set of 2,815 molecules, and a test set of 6,569 molecules.

3.2. Input encodings

Tokenization is a pre-processing step used to create a vocabulary from textual data. It is applicable at the character, word, or sub-word level. Pre-trained large-scale word embeddings such as Word2Vec [33] and GloVe [34] employ word tokenization to generate vector representations for words that can encapsulate their meanings, semantic connections, and the contexts in which they are used. Transformer-based models rely on a subword tokenization algorithm that counts the occurrences of each character pair in the dataset and incrementally adds the most frequently occurring pairs to the vocabulary. In our previous work, Hastings et al. [27], we used two strategies to encode the input sequences for the LSTM model: a character-level tokenization and an atom-wise tokenization, in which letter combinations that represent an atom were encoded as a single token. In the current work, we use the Byte Pair Encoding (BPE) algorithm as a sub-word tokenization method with a RoBERTa architecture, as sketched below.

Table 1: Hyper-parameters of the RoBERTa model

  Parameter                                  Value
  Number of attention heads                  12
  Number of hidden layers                    6
  Dropout for attention probabilities        0.1
  Activation function in the encoder         gelu
  Activation for the classification layer    sigmoid
  Number of epochs in pre-training           100
  Number of epochs in fine-tuning            30
  Masked language modeling probability       15%
  Batch size                                 4
  Loss function for pre-training             BCELoss
  Loss function for fine-tuning              BCEWithLogitsLoss
  Optimizer                                  Adam with weight decay
  Vocabulary size (number of tokens)         1395
  Number of trainable parameters             45,577,728
  Tokenizer                                  BPE

Figure 3: Architecture of our ontology extension approach.

Figure 4: Train and validation loss: (a) pre-training (masked language modeling); (b) fine-tuning (class prediction); (c) F1 score for the validation dataset during the fine-tuning step.
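The following sketch illustrates training such a BPE tokenizer on the SMILES strings with the tokenizers library; the input file name and the choice of special tokens are assumptions for illustration, not a verbatim excerpt of our pipeline.

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Train a byte-pair-encoding vocabulary on the SMILES strings of the dataset
# ("smiles.txt", one SMILES string per line, is an assumed file name).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["smiles.txt"],
    vocab_size=1395,                                   # vocabulary size from Table 1
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("chebi-bpe-tokenizer", exist_ok=True)
tokenizer.save_model("chebi-bpe-tokenizer")            # writes vocab.json and merges.txt

# Example: tokenize the SMILES string of aspirin into sub-word tokens.
encoding = tokenizer.encode("CC(=O)Oc1ccccc1C(=O)O")
print(encoding.tokens)
```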
3.3. Experiment

To train the model, we used a single GPU. Table 1 shows the hyper-parameters for our model. We first pre-trained our model with masked language modelling for 100 epochs (unsupervised). The pre-training step allows the model to discover common patterns in the SMILES strings by attempting to predict the masked tokens using the unmasked tokens. As discussed in Section 3, the pre-trained model, i.e. its trained weights, provides a suitable starting point for training a model on a related target task. Furthermore, we validated the model on a separate dataset after each training epoch. The validation during training has no effect on the model's trained weights; nevertheless, it helps in adjusting the model's hyper-parameters. Figure 4 (a) illustrates the loss values for the training and validation sets during the pre-training phase. For the final multi-label classification task, we loaded the pre-trained model and trained it for 30 epochs with the class labels (supervised). Figure 4 (b) shows the training and validation loss during the fine-tuning step. Similarly, Fig. 4 (c) shows the F1 score for the validation dataset during the fine-tuning.

4. Results and Evaluation

For our evaluations during and after training, we used the F1 score as the main measure. The F1 score may be computed in different ways depending on the averaging scheme: (1) samples: calculates the F1 score for each molecule in the test dataset and then computes their average; (2) micro: collects the total number of true positives, false positives, and false negatives over all classes and calculates the overall F1 score; (3) macro: calculates the F1 score for each class and then computes their average; (4) weighted: similar to the macro F1 score, but each class is weighted by its number of true members. The table below and Fig. 5 compare the results of our proposed model with those previously obtained by the LSTM model. The raw output values of our model are the probabilities of a sigmoid function. Therefore, a threshold value must be applied to these probabilities to produce a binary vector indicating the final classifications. These results are based on a threshold value of 0.5. The precision, in our classification task, reflects the ability of the model not to wrongly assign a label to a molecule, while the recall score reflects the ability of the model to discover all labels that were assigned to a molecule.

Comparison of the LSTM and RoBERTa models on the test dataset:

              Samples          Macro            Micro            Weighted
              LSTM   RoBERTa   LSTM   RoBERTa   LSTM   RoBERTa   LSTM   RoBERTa
  F1          0.66   0.76      0.71   0.77      0.74   0.80      0.73   0.79
  Recall      0.66   0.75      0.68   0.76      0.70   0.78      0.70   0.78
  Precision   0.67   0.77      0.77   0.80      0.79   0.82      0.79   0.82
  ROC-AUC     0.83   0.87      0.84   0.89      0.85   0.88      0.85   0.89

Figure 5: F1 score on the test dataset. Left: kernel density diagram based on the samples (molecules). Right: histogram diagram based on the labels (classes in the ontology).
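As an illustration of this evaluation protocol (the prediction and label matrices below are random placeholders, not our actual model outputs), the four averaging schemes can be computed with scikit-learn as follows:

```python
import numpy as np
from sklearn.metrics import f1_score

# Placeholder arrays: y_prob stands for the sigmoid outputs of the classifier and
# y_true for the binary ground-truth label matrix, both of shape (n_molecules, n_classes).
rng = np.random.default_rng(0)
y_prob = rng.random((6569, 500))
y_true = (rng.random((6569, 500)) > 0.99).astype(int)

# Threshold the class probabilities at 0.5 to obtain the final binary label vectors.
y_pred = (y_prob >= 0.5).astype(int)

for average in ("samples", "micro", "macro", "weighted"):
    score = f1_score(y_true, y_pred, average=average, zero_division=0)
    print(f"{average:>8s} F1: {score:.2f}")
```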
Self-attention in Transformer-based models enables the model to explore several locations in the input sequence to produce a better embedding for the tokens. As a result, the embeddings encode different contextual information for the same token in different positions (and different sequences). The architecture of the RoBERTa model contains a stack of Transformer encoders, each consisting of multiple attention heads. Since the attention heads do not share parameters, each head learns a unique set of attention weights. Intuitively, attention weights determine the importance of each token for the embeddings of the next layers [35]. In this sense, visualizing the attention weights of Transformer-based models helps to interpret the model with respect to the relative importance of different input items for making classifications [36]. While the benefit of attention visualization may be limited in explaining particular predictions, depending on the task, attention can be quite useful in explaining the model's predictions overall [37, 38, 39]. In fact, attention heads can reveal a wide variety of model behaviors, and some of these heads may be more significant for model interpretation than others [36].

Our proposed model comprises six layers, each with twelve heads, producing a total of 72 unique attention mechanisms. We examined how attention corresponds to different chemical structural elements, at both the token and the molecule level. Figure 6 shows the averaged attention weights of all heads in the last encoder of the model. The most attended sub-structures for each molecule are highlighted with green circles in the molecular graphs. It can be observed that often, the most attention (darker green) is given to the heaviest atoms, for example bromine, iron and sulfur in Fig. 6 (a), (b) and (c) respectively. This corresponds to the broad principles of classification in organic chemistry as captured in ChEBI. The predicted classes in Fig. 6 demonstrate that the model learned to assign appropriate labels to the chemical compounds. As illustrated in Fig. 6 (d), the model assigned the barbiturates class to the corresponding molecule; this class refers to the family of chemicals that contain a six-membered ring structure, which was also the structural element given the most attention. Similarly, Fig. 6 (e) shows that the model focused most on the phosphate substructure when assigning the phosphatidylinositol class to the molecule.

Figure 6: The model predicted class labels for these molecules by attending to influential sub-structures (highlighted in green): (a) organobromine compound, (b) iron molecular entity, (c) arenesulfonic acid, (d) barbiturates, (e) phosphatidylinositol.

Figure 7: Attention to the oxygen, nitrogen and carbon atoms. Each cell represents the percentage of all attention (by each head) that was given to the corresponding token. For example, head 5-6 in (b) dedicated 41.6% of its attention to the nitrogen atom.
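The averaged attention weights behind Figure 6 can be obtained from the fine-tuned model roughly as follows. The checkpoint and tokenizer paths and the example SMILES string are placeholders, and averaging all heads of the last encoder layer is one simple aggregation choice; rendering the highlighted molecular graphs shown in Figure 6 is omitted here.

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

# Placeholder paths for the fine-tuned model and the SMILES tokenizer.
model = RobertaForSequenceClassification.from_pretrained(
    "chebi-roberta-finetuned", output_attentions=True
)
tokenizer = RobertaTokenizerFast.from_pretrained("chebi-bpe-tokenizer")

smiles = "OC(=O)CCCCC"  # example SMILES string (hexanoic acid)
inputs = tokenizer(smiles, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per encoder layer,
# each of shape (batch, num_heads, seq_len, seq_len).
last_layer = outputs.attentions[-1]        # attention weights of the last encoder
avg_heads = last_layer.mean(dim=1)[0]      # average over the twelve heads
token_scores = avg_heads.mean(dim=0)       # mean attention received by each token

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, token_scores.tolist()):
    print(f"{token:>10s} {score:.3f}")
```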
The presented model takes a given class of molecules, represented by a SMILES string, and assigns the corresponding superclasses from the ChEBI ontology. ChEBI already makes use of an automated tool, namely ClassyFire, to extend its coverage beyond the manually curated core; our model can be integrated into the ChEBI development process in the same way. The resulting system can then be used to integrate the given class into the ontology by translating the classification results into subsumption relations. Figure 8 shows the result of this process. The resulting workflow, depicted in Figure 3, allows for the fully automated extension of the ChEBI ontology.

Figure 8: The extended ontology. Existing subsumption relations (black) have been enriched with new subclasses, shown with dashed borders. Correct subclass predictions are depicted with cyan, dashed arrows, while red, dotted arrows indicate misclassifications.

5. Discussion

ChEBI uses ClassyFire, a rule-based system, to extend its manually curated reference ontology to chemicals that are not yet covered. This approach has limitations, notably that ClassyFire is structured around a different chemical ontology with only a partial mapping to ChEBI, and that ClassyFire's rules are manually maintained. The deep-learning-based approach that we presented can overcome the limitations of rule-based approaches by allowing the dynamic creation of classifiers based on a given existing ontology structure. Yet, for optimal applicability, the approach must meet certain quality criteria. Ozaki [11] defined six goals for ontology learning, which we use to structure the discussion of our results.

Handling of inconsistencies and noise. Our model is trained on information that originates solely from the ontology itself. This design decision eliminates external sources of inconsistencies and noise. The comparison of the F1 scores in the table in Section 4 shows that this classification outperforms the current state-of-the-art approaches, including the formerly leading LSTM-based model. In particular, for those chemical classes that were the most challenging in the previous approach, the current approach performed almost twice as well, as shown in Figure 5. It should be noted that there nevertheless remain some chemical classes that perform worse than others. For example, classes that are based on cyclic structures pose challenges, as their information may be scattered across the respective SMILES strings. Alternative input formats and network architectures may be explored in the future to better handle these structures. The model may also benefit from a larger amount of data. The distribution of class memberships depicted in Figure 2 indicates that the dataset features some classes far more often than others. These classes are more prominent, often by virtue of being higher in the ontology subclass hierarchy, and therefore represent broader classes of chemicals that may share members with other classes. Such an imbalance can skew the training in favour of those classes. Different sampling and regularization techniques may be explored in the future to address this issue.

Unsupervised learning. The presented approach is a variant of ontology extension. The ontology is therefore a mandatory input, from which the information that is needed for the ontology extension is extracted. The resulting dataset does include labels for each molecule; strictly speaking, it is thus a supervised learning approach. However, these labels are extracted fully automatically from the input, i.e. the ontology. Therefore, no additional annotation by experts or other manual data pre-processing is necessary.

Human interaction. As the ontology is extended automatically, no interaction is required.

Expressivity. The system extends the given ontology using the same ontology language that has been used to build it. ChEBI is developed as an OWL ontology, which comes with expressive OWL-DL semantics.

Interpretability. The formerly best classifier was based on an LSTM architecture. This approach outperformed ClassyFire, but this performance came with a disadvantage: the reason for a specific classification was not transparent. This is problematic, because the experts who check the ontology extension need insights into the system's decision processes in order to evaluate the classifications. An explainable approach is therefore crucial. The attention mechanism of the RoBERTa architecture that has been used in the present approach helps to address this issue. Attention weights can be seen as a measure of how much focus is put on an individual token.
A homogeneous distribution of attention indicates that nothing has been focused on in particular, whilst high attention on a head shows that a particular token had a high impact. Figure 7 shows that carbon atoms, which are very common in organic chemistry, trigger a low general focus. At the same time, a high focus is put on oxygen atoms, which often indicate functional groups of high classificatory relevance, such as carboxy groups. Figure 6 shows which parts of a particular molecule have been focused on during the classification process. This information can be used to explain the decisions made by the model, raise trust in the prediction system, and aid the experts during the ontology extension process.

Efficiency. In [11], 'efficiency' is defined as the time it takes to build the ontology. Once the model is fully trained, the classification which leads to the ontology extension only takes a few minutes. As an example, the classification of the 6,569 chemical entities in our test dataset took around 10 minutes. While extending the ontology itself is fast, the training of the model requires more time. Training is divided into pre-training and fine-tuning. The pre-training with 100 epochs took around 10 hours. This time is only invested once: after a model has been pre-trained, it can be fine-tuned repeatedly for several large sets of molecules and their corresponding classes comparatively quickly. Our final fine-tuning for 30 epochs took around 2 hours.

This analysis shows that the presented approach achieves the goals of ontology learning stipulated in [11]. One additional issue that needs to be addressed is applicability. At the heart of the presented approach is a neural network that is trained on the annotations of the ontology. In the same way as any text analysis approach to ontology generation depends on the existence of suitable text corpora, our approach requires that the ontology contains enough information to train a model to predict the superclasses of a new class. ChEBI is an ideal use case, because SMILES annotations provide rich, structured information that we could use for training the model. Another potential application domain for our approach in biology is proteins, which are also classified based on structures, features of which can be annotated in the relevant ontology. Moreover, our approach is not limited to ontologies with structural information represented in annotations. For example, for ontologies in materials science one could consider training the model on physical properties (e.g., density, hardness, thermal conductivity), which are typically represented as data properties. In short, our approach to ontology extension is applicable to reference ontologies that associate classes with sufficient information for a neural network to learn the classification criteria that the ontology developers are using.

6. Conclusion and Future Work

We have presented a novel approach to the problem of ontology extension, applied to the chemical domain. Instead of extending the ontology using external resources, we created a model using the ontology's own structured annotations. This Transformer-based model can not only classify previously unseen chemical entities (such as molecules) into the appropriate classes, but also provides information about the relevant aspects of their chemical structure on which its decisions are based. At the same time, it was able to outperform previously existing approaches to ontology-based chemical classification in terms of predictive performance.
However, the trained model still struggles with several chemical classes that depend on specific structural features. For example, classes that exhibit cyclic structures are often found in the lower quantile of classification quality. This behaviour can be traced back to the way molecules are encoded in the SMILES notation. This weakness might be addressed by using architectures that operate directly on the molecular structures, such as Graph Neural Networks [40]. We have illustrated our approach by applying it to the chemical domain, but as we discussed in Section 5, the approach is applicable to any ontology that contains classes that are annotated with information that is relevant to their position in the class hierarchy.

While our approach supports the automatic extension of an ontology, it can also be used in a semi-automated fashion to help ontology developers in their manual curation of the ontology. Since the model is trained on the content of a manually curated ontology, improving and extending this ontology will lead to better quality training data and, thus, enable better predictions. Hence, there is a potential for a positive feedback loop between manual development and the AI-based extension.

One limitation of our current approach is that it does not use most of the logical axioms of the ontology during the learning process. One strategy to address this gap would be to represent the axioms in the form of Logical Neural Networks [41] in order to detect possible inconsistencies already during the learning process and to penalise them accordingly. Overall, there is still a great need for research in the field of (semi-)automatic ontology extension. Here, the growing field of neuro-symbolic integration can serve as the interface between formal ontologies and the potent solutions of deep learning. This may further the understanding of the inner workings of artificial intelligence and, therefore, raise trust in these systems.

References

[1] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L. J. Goldberg, K. Eilbeck, A. Ireland, C. J. Mungall, et al., The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration, Nature Biotechnology 25 (2007) 1251–1255.
[2] D. Allemang, P. Garbacz, P. Grądzki, E. Kendall, R. Trypuz, An analysis of the debate over structural universals, in: F. Neuhaus, B. Brodaric (Eds.), Formal Ontology in Information Systems - Proceedings of the 11th International Conference, FOIS 2021, Bozen-Bolzano, Italy, Frontiers in Artificial Intelligence and Applications, in print.
[3] M. Booshehri, L. Emele, S. Flügel, H. Förster, J. Frey, U. Frey, M. Glauer, J. Hastings, C. Hofmann, C. Hoyer-Klick, et al., Introducing the Open Energy Ontology: Enhancing data interpretation and interfacing in energy systems analysis, Energy and AI 5 (2021) 100074.
[4] J. Hastings, G. Owen, A. Dekker, M. Ennis, N. Kale, V. Muthukrishnan, S. Turner, N. Swainston, P. Mendes, C. Steinbeck, ChEBI in 2016: Improved services and an expanding collection of metabolites, Nucleic Acids Research 44 (2016) D1214–D1219. doi:10.1093/nar/gkv1031.
[5] Y. Wang, J. Xiao, T. Suzek, J. Zhang, J. Wang, S. Bryant, PubChem: a public information system for analyzing bioactivities of small molecules, Nucleic Acids Research 37 (2009) W623–W633. doi:10.1093/nar/gkp456.
[6] Y. Djoumbou Feunang, R. Eisner, C. Knox, L. Chepelev, J. Hastings, G. Owen, E. Fahy, C. Steinbeck, S. Subramanian, E. Bolton, R. Greiner, D. S. Wishart, ClassyFire: automated chemical classification with a comprehensive, computable taxonomy, Journal of Cheminformatics 8 (2016) 61. URL: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-016-0174-y. doi:10.1186/s13321-016-0174-y.
[7] H. Assadi, Construction of a regional ontology from text and its use within a documentary system, in: FOIS'98 - 1st International Conference on Formal Ontology in Information Systems, volume 46 of Frontiers in Artificial Intelligence and Applications, IOS Press, Trento, Italy, 1998, pp. 236–252. URL: https://hal.archives-ouvertes.fr/hal-01617868.
[8] A. Maedche, S. Staab, Ontology learning for the semantic web, IEEE Intelligent Systems 16 (2001) 72–79.
[9] C. Biemann, Ontology learning from text: A survey of methods, in: LDV Forum, volume 20, 2005, pp. 75–93.
[10] M. N. Asim, M. Wasim, M. U. G. Khan, W. Mahmood, H. M. Abbasi, A survey of ontology learning techniques and applications, Database 2018 (2018) bay101. URL: https://doi.org/10.1093/database/bay101. doi:10.1093/database/bay101.
[11] A. Ozaki, Learning description logic ontologies: Five approaches. Where do they stand?, KI-Künstliche Intelligenz 34 (2020) 317–327.
[12] L. Karoui, M.-A. Aufaure, N. Bennacer, Contextual concept discovery algorithm, in: FLAIRS Conference, 2007, pp. 460–465.
[13] G. Xiao, D. Calvanese, R. Kontchakov, D. Lembo, A. Poggi, R. Rosati, M. Zakharyaschev, Ontology-based data access: A survey, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, International Joint Conferences on Artificial Intelligence Organization, 2018, pp. 5511–5519. URL: https://doi.org/10.24963/ijcai.2018/777. doi:10.24963/ijcai.2018/777.
[14] J. Hastings, Primer on ontologies, in: C. Dessimoz, N. Škunca (Eds.), The Gene Ontology Handbook, volume 1446 of Methods in Molecular Biology, Springer New York, New York, NY, 2017, pp. 3–13. URL: http://link.springer.com/10.1007/978-1-4939-3743-1_1. doi:10.1007/978-1-4939-3743-1_1.
[15] R. Fikes, P. Hayes, I. Horrocks, OWL-QL - a language for deductive query answering on the Semantic Web, Journal of Web Semantics 2 (2004) 19–29.
[16] W. Liu, A. Weichselbraun, A. Scharl, E. Chang, Semi-automatic ontology extension using spreading activation, Journal of Universal Knowledge Management (2005) 50–58.
[17] S. Althubaiti, Ş. Kafkas, M. Abdelhakim, R. Hoehndorf, Combining lexical and context features for automatic ontology extension, Journal of Biomedical Semantics 11 (2020) 1–13.
[18] Y. Zhou, L. Zhang, S. Niu, The research of concept extraction in ontology extension based on extended association rules, in: 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS), IEEE, 2016, pp. 111–114.
[19] P. H. Barchi, E. R. Hruschka, Never-ending ontology extension through machine reading, in: 2014 14th International Conference on Hybrid Intelligent Systems, IEEE, 2014, pp. 266–272.
[20] A. Schutz, P. Buitelaar, RelExt: A tool for relation extraction from text in ontology extension, in: International Semantic Web Conference, Springer, 2005, pp. 593–606.
[21] H. Li, R. Armiento, P. Lambrix, A method for extending ontologies with application to the materials science domain, Data Science Journal 18 (2019) 1–21.
[22] D. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, Journal of Chemical Information and Computer Sciences 28 (1988) 31–36.
[23] J. Hastings, D. Magka, C. Batchelor, L. Duan, R. Stevens, M. Ennis, C. Steinbeck, Structure-based classification and ontology in chemistry, Journal of Cheminformatics 4 (2012) 8. doi:10.1186/1758-2946-4-8.
[24] A. C. Mater, M. L. Coote, Deep learning in chemistry, Journal of Chemical Information and Modeling 59 (2019) 2545–2559.
[25] G. B. Goh, N. O. Hodas, C. Siegel, A. Vishnu, SMILES2Vec: An interpretable general-purpose deep neural network for predicting chemical properties, arXiv preprint arXiv:1712.02034 (2017).
[26] C. W. Coley, W. Jin, L. Rogers, T. F. Jamison, T. S. Jaakkola, W. H. Green, R. Barzilay, K. F. Jensen, A graph-convolutional neural network model for the prediction of chemical reactivity, Chemical Science 10 (2019) 370–377.
[27] J. Hastings, M. Glauer, A. Memariani, F. Neuhaus, T. Mossakowski, Learning chemistry: exploring the suitability of machine learning for the task of structure-based chemical ontology classification, Journal of Cheminformatics 13 (2021) 1–20.
[28] C. Bobach, T. Böhme, U. Laube, A. Püschel, L. Weber, Automated compound classification using a chemical ontology, Journal of Cheminformatics 4 (2012) 1–12.
[29] P. Schwaller, T. Gaudin, D. Lanyi, C. Bekas, T. Laino, "Found in translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models, Chemical Science 9 (2018) 6091–6098.
[30] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[31] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[32] S. Chithrananda, G. Grand, B. Ramsundar, ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction, arXiv preprint arXiv:2010.09885 (2020).
[33] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[34] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[35] J. Vig, A. Madani, L. Varshney, C. Xiong, N. Rajani, BERTology meets biology: Interpreting attention in protein language models, arXiv preprint arXiv:2006.15222 (2020).
[36] J. Vig, A multiscale visualization of attention in the Transformer model, arXiv preprint arXiv:1906.05714 (2019).
[37] P. Moradi, N. Kambhatla, A. Sarkar, Interrogating the explanatory power of attention in neural machine translation, arXiv preprint arXiv:1910.00139 (2019).
[38] D. Pruthi, M. Gupta, B. Dhingra, G. Neubig, Z. C. Lipton, Learning to deceive with attention-based explanations, arXiv preprint arXiv:1909.07913 (2019).
[39] S. Serrano, N. A. Smith, Is attention interpretable?, arXiv preprint arXiv:1906.03731 (2019).
[40] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network model, IEEE Transactions on Neural Networks 20 (2008) 61–80.
[41] R. Riegel, A. Gray, F. Luus, N. Khan, N. Makondo, I. Y. Akhalwaya, H. Qian, R. Fagin, F. Barahona, U. Sharma, et al., Logical neural networks, arXiv preprint arXiv:2006.13155 (2020).