Using Artificial Neural Networks to Determine Ontologies Most Relevant to Scientific Texts

Lukáš Korel1, Alexander S. Behr2, Norbert Kockmann2 and Martin Holeňa1,3,4
1 Faculty of Information Technology, CTU, Prague, Czech Republic
2 Faculty of Biochemical and Chemical Engineering, TU Dortmund University, Germany
3 Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic
4 Leibniz Institute for Catalysis, Rostock, Germany

ITAT'22: Information Technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
lukas.korel@fit.cvut.cz (L. Korel); alexander.behr@tu-dortmund.de (A. S. Behr); norbert.kockmann@tu-dortmund.de (N. Kockmann); martin@cs.cas.cz, martin.holena@catalysis.de (M. Holeňa)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
This paper provides an insight into the possibility of finding the ontologies most relevant to scientific texts using artificial neural networks. The basic idea of the presented approach is to select a representative paragraph from a source text file, embed it into a vector space by a pre-trained, fine-tuned transformer, and classify the embedded vector according to its relevance to a target ontology. We have considered different classifiers to categorize the output of the transformer, in particular random forest, support vector machine, multilayer perceptron, k-nearest neighbors, and Gaussian process classifiers. Their suitability has been evaluated in a use case with ontologies and scientific texts concerning catalysis research. The results show that the random forest performed worst, whereas the best results in this task were achieved by the support vector machine classifier.

Keywords
ontology, text data, text preprocessing, text representation learning, text classification

1. Introduction

A domain ontology defines a set of representational primitives with which to model a domain of knowledge or discourse. The representational primitives are typically classes, attributes, and relationships. The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application. Classes can be defined in two ways: by annotating their definitions, or by connecting classes with each other and with properties. Each domain ontology typically uses domain-specific definitions of the terms denoting its primitives.

FAIR research data management (Findable, Accessible, Interoperable, and Reusable) needs a consistent data representation in ontologies, particularly for representing the data structure of the specific domain [34]. Since different ontologies are written by different people, they are often incompatible, even within the same domain.
As systems that rely on domain ontologies expand, it is often necessary to merge domain ontologies by manual tuning. The same is true for enhancing an ontology with information available in domain-related texts. Merging and enhancing ontologies is thus a largely manual process and therefore time-consuming and expensive.

The need to find a suitable ontology for an input text can help in classifying the information presented within the text as well as in connecting the input text with data. This would allow for an automated selection of ontologies and a respective classification of the text. Different text data could thus be compared automatically in an understandable way and connected with the corresponding research data. Ontologies represent "a formal specification of a shared conceptualization" [7] and can thus be used to express knowledge and data in a formalized, standardized description language that specifies terms and the relations between those terms.

Current ontology recommenders, such as the NCBO ontology recommender [8], score annotations based on words similar to the preferred and alternate labels of ontology classes and on term frequency. In contrast to this, our work aims to use text representation learning in order not only to search for words also contained in ontologies, but also to find concepts with a similar semantic meaning shared between the text and the ontology.

This paper is devoted to a specific problem encountered while enhancing ontologies and sometimes during their merging: to decide which of several available ontologies is most relevant to a given domain-related piece of text. Our solution to the problem relies primarily on artificial neural networks (ANNs), in particular on natural language processing (NLP).

The next section surveys the applicability of artificial neural networks to ontologies. Section 3 recalls the employed methods of text preprocessing: modules for extracting text from PDF files, for transforming the extracted files into plain text, and for eliminating irrelevant paragraphs. That section also describes text representation learning, as well as the principles of the employed classifiers. In Section 4, an application of the proposed methodology to catalysis is described and evaluated.

With regard to the sources we have studied, surveyed in Section 2 of this article, we are not aware that classifiers trained on the results of representation learning have ever been used to determine the most relevant of a given set of ontologies.

2. Applicability of Artificial Neural Networks to Ontologies

In connection with learning and extending ontologies, artificial neural networks (ANNs) have been primarily used for the identification of concepts, relations and attributes [10, 15, 18]. With respect to relations, some ANN-based methods have been developed specifically for the subsumption relations needed for the construction of taxonomies [11, 14, 21, 30]. In connection with the integration of ontologies, ANNs have been primarily used for ontology matching, also known as ontology alignment [12, 13, 16, 33]. The variety of employed kinds of ANNs is rather large. It includes traditional multilayer perceptrons (MLPs) [19], adaptive resonance theory (ART) networks [17] and associative memories [23], as well as the modern deep convolutional networks (CNNs) [12, 20], deep belief networks [10], long short-term memory (LSTM) networks together with their bidirectional variant (BiLSTM) [24], and gated recurrent unit (GRU) networks [28, 29]. The dependence of ontologies on texts led to the use of networks developed for text and natural language representation learning, most importantly BERT [22, 26], the bidirectional encoder representations from transformers, and word2vec [25], the most traditional network for embedding text into a Euclidean space. The close relationship of ontologies to knowledge graphs led to the use of RDF2Vec [21, 29], which was originally proposed for knowledge graphs [31]. Based on principles similar to word2vec and RDF2Vec, the network OWL2Vec was proposed for the embedding of ontologies [32]. Finally, the graph-like structure of ontologies brought about the use of graph neural networks (GNNs) [16, 33].

Closest to the proposed project is the way ANNs have recently been used in connection with translation into OWL [27, 28], with predicate chaining and restriction [23], and with taxonomy extraction from knowledge graphs [21]. In [27], ontology learning is tailored as a transductive reasoning task that uses two recurrent neural networks to translate text in natural language into OWL specifications in description logic. That approach was further developed in [28], resulting in a system based on a single recurrent network of the GRU type. It uses an encoder-decoder configuration and translates, through syntactic transformation, a subset of natural language into the description logic language ALLQ. Moreover, the system generalizes over different syntactic structures and has the ability to tolerate unknown words by copying input words as extralogical symbols to the output, as well as the ability to enrich the training set with new annotated examples. In [23], a mapping is established between ontologies and a pair of interacting associative memories: one of them stores assertions, and the other stores entailment rules. The most recent work [21] describes a method for the specific task of extracting a taxonomy from an embedding of a knowledge graph. Over that embedding, which can be obtained for example with RDF2Vec, hierarchical agglomerative clustering is performed, first without using type information and then injecting types into the hierarchical clustering tree. In addition, an axiom induction algorithm is applied to each cluster in the resulting tree, which allows the identification of new classes corresponding to those axioms that describe their respective clusters accurately enough.
Neural networks are often used due to their strengths in natural language processing tasks. Ontology construction relies very much on texts, which suggests the applicability of ANNs in this context.

3. Methodological Background

This section describes the methods employed to reach the stated goal. In the first part, the content of the source text files has to be obtained, parsed into paragraphs, and filtered so that only paragraphs fulfilling a minimal length and with content relevant to the topic of the document are kept. The second part describes the selected transformer and the embedding of the input paragraphs into numeric vectors for classification. The final part describes the employed classifiers, which use the outputs of the transformer for the final classification with respect to the target ontologies.

3.1. Text Preprocessing

For the problem of classifying scientific texts to the most relevant existing ontology, we have been using documents in the portable document format (PDF). An issue with PDFs is that they are optimized for printing on a physical printer; thus they contain meta-information about the contained text related to its position on the page. Therefore, it is not easy to address a single paragraph. If the file is read using a basic library for PDF files and the newline mark is used as the splitter, it returns only a single row, not the whole paragraph. Another issue is connected with multi-column documents. If the document does not include information about where the text continues, software libraries for text extraction from PDFs usually continue with the next letter on the same row.

One solution to get text data from a multi-column PDF is to use the Microsoft Word engine, which is able to solve both problems and parse the text properly. It identifies structural information in the text, such as headings, paragraphs and sentences. Each document may also contain text irrelevant to the topic of interest, for example references, acknowledgements, etc.
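Translated into code, this preprocessing step could look roughly like the following minimal sketch. It assumes the python-docx library for reading the converted documents; the helper name, the set of skipped headings and the 100-character threshold (taken over from Section 4.2) are illustrative rather than the authors' exact implementation.

```python
# A minimal sketch of paragraph extraction from a converted .docx file,
# assuming the python-docx library; names and thresholds are illustrative.
from docx import Document

SKIPPED_HEADINGS = {"references", "acknowledgement", "acknowledgements"}
MIN_LENGTH = 100  # paragraphs shorter than 100 characters are skipped


def extract_relevant_paragraphs(docx_path: str) -> list[str]:
    """Return body paragraphs of a .docx file, dropping irrelevant sections."""
    doc = Document(docx_path)
    paragraphs, skip_section = [], False
    for par in doc.paragraphs:
        text = par.text.strip()
        if par.style.name.startswith("Heading"):
            # Skip everything under headings such as References or Acknowledgements.
            skip_section = text.lower() in SKIPPED_HEADINGS
            continue
        if skip_section or len(text) < MIN_LENGTH:
            continue
        paragraphs.append(text)
    return paragraphs
```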
Specifications of ontologies are most often stored in OWL files. OWL [2] is a specific kind of XML for ontologies. The text that describes classes and relations may be stored in different tags, depending on the decision of the ontology designer.

3.2. Text Representation Learning

For typical data analysis tasks like classification or clustering, it is suitable to represent words or other parts of a text by vectors in a Euclidean space. Such a representation is mostly the result of representation learning by ANNs. In the area of text analysis and processing, probably the most successful representation learning algorithm is BERT (Bidirectional Encoder Representations from Transformers) [3].

BERT needs to be trained using a large amount of text. That is why some pretrained version is typically used, often subsequently fine-tuned using texts concerning the considered topic. Such fine-tuning is often performed even if the pretrained network was trained, apart from general texts, also with texts from some broader relevant domain (biology, medicine, chemistry, etc.).

The basic schema of BERT is given in Figure 1. The tokenized input first passes through the encoder, which embeds sentences to elements of a Euclidean space. These vectors are used as input to the BERT decoder. BERT returns one vector for each input token. Each input sequence contains a special token at the beginning, marked as CLS. The vectors embedding the tokens of an input sequence can be arranged into a matrix, whose first row is the embedding of the whole input. Details of BERT are described in [3] and at https://huggingface.co/docs/transformers/model_doc/bert. These embeddings of every input paragraph are taken into account for the final assignment of the most relevant ontology to the paragraph.

Figure 1: BERT (Bidirectional Encoder Representations from Transformers) architecture [3]. An input sentence is divided into tokens and each token is encoded to a number. The BERT output contains one numeric vector per token. The output marked as C is used for the final classification.
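As an illustration, the following is a minimal sketch of how a paragraph can be embedded with a pre-trained (and possibly fine-tuned) BERT through the Hugging Face transformers library. The checkpoint name is the one reported in Section 4.2; the helper name is ours, not part of the original pipeline.

```python
# A sketch of embedding one paragraph with BERT and keeping the [CLS] vector,
# i.e., the first row of the output matrix, as the paragraph embedding.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "recobo/chemical-bert-uncased"  # checkpoint named in Section 4.2
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)


def embed_paragraph(paragraph: str) -> torch.Tensor:
    """Return the 768-dimensional embedding of the whole input paragraph."""
    inputs = tokenizer(paragraph, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # First row of the output matrix = embedding of the whole input ([CLS]).
    return outputs.last_hidden_state[0, 0]
```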
3.3. Classification

The embeddings obtained with BERT are used as inputs for classifiers that classify a given input part of a text (e.g., a paragraph) with respect to its relevance to the considered ontologies. Those classifiers have been trained on the embeddings of the annotations from the considered ontologies, because for them the ground truth (i.e., the ontology to which the annotation belongs) is known. We have decided to select five classifiers implemented in scikit-learn [4]; they are the following (a training sketch follows the list):

1. Random forest (RF): An ensemble classifier that fits a number of classification trees on various sub-samples of the training data and uses some aggregation function to improve the predictive accuracy and control over-fitting. Usually, each tree in the ensemble is built using a sample drawn with replacement (i.e., a bootstrap sample) from the training set. Furthermore, when splitting each node during the construction of a tree, the best split is found using either all input features or a random subset of a given size. The purpose is to decrease the variance of the forest estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit. The randomness injected into forests yields decision trees with somewhat decoupled prediction errors, and by taking an average of those predictions, some errors can cancel out. RFs achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. Typically, the variance reduction yields an overall better model [35].

2. Support vector machine (SVM): A classifier designed specifically to achieve the lowest possible predictive error, using a known relationship between the generalization error and the margin of the separating hyperplane. It uses only the training points lying on the two support hyperplanes of the margin (the support vectors), so it is also memory efficient. A simple SVM can be used only for linearly separable classes. For linearly nonseparable classes, the data must first be transformed into linearly separable sets in a high-dimensional vector space of functions using a suitable kernel. The SVM classification has multiclass support handled according to a one-vs-one or one-vs-rest scheme [36].

3. Gaussian process (GP): It has been designed primarily for regression problems. A Gaussian process classifier (GPC) implements a collection of random variables indexed by a Euclidean space for classification purposes by placing a GP prior on latent functions. Its purpose is to allow a convenient formulation of the classification through a logistic link function. GPCs support multi-class classification by performing either one-versus-rest or one-versus-one training and prediction. A crucial ingredient of each GPC is the covariance function of the underlying GP. It encodes the assumptions about the similarity of the Gaussian distributions corresponding to different points [37].
4. K-nearest neighbors (KNN): Neighbors-based classification simply stores the instances of the training data. A query point is assigned the class that has the most representatives among the nearest neighbors of the point. The nearest neighbors classification can use uniform weights, which means that the value assigned to a query point is computed from a simple majority vote of the nearest neighbors. In some cases, it is better to weight the neighbors in such a way that nearer neighbors contribute more to the fit. For example, when an unknown point's class is computed from the two nearest points and one of the two is nearer than the other, in the weighted case the resulting class is the same as that of the nearer point. The distance $d$ between two points can be computed as $d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^c\right)^{1/c}$, where $n$ is the dimension of each point and $c \geq 1$; if $c = 1$, this is the Manhattan distance, and in the case $c = 2$, it is the Euclidean distance [38].

5. Multi-layer perceptron (MLP): Given a set of features and a target, it can learn a non-linear function approximator for either classification or regression. It differs from logistic regression in that between the input and the output layer there can be one or more non-linear hidden layers. The input layer consists of a set of neurons representing the input features. Each neuron in a hidden layer transforms the values from the previous layer with a weighted linear summation, followed by a non-linear activation function. The output layer receives the values from the last hidden layer and transforms them into output values. The advantages of the MLP are the capability to learn non-linear models and the capability to learn models in real time (on-line learning). However, an MLP with hidden layers has a non-convex loss function with more than one local minimum, so different random weight initializations can lead to different validation accuracy. An MLP also requires tuning a number of hyperparameters, such as the number of hidden neurons, layers, and iterations. Moreover, it is sensitive to feature scaling [39].
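The following is a minimal sketch of how these five classifiers can be trained in scikit-learn on the BERT embeddings of the ontology annotations. The hyperparameter values follow the Selected column of Table 2; the exact mapping onto scikit-learn arguments, and the probability=True flag needed for the confidence analysis of Section 4.4, are our assumptions.

```python
# A sketch of training the five scikit-learn classifiers on annotation
# embeddings X (shape (n_annotations, 768)) with ontology labels y.
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import Matern
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC


def build_classifiers():
    """Instantiate the five classifiers with the values selected in Table 2."""
    return {
        "Random forest": RandomForestClassifier(
            n_estimators=20, max_depth=11, criterion="gini", max_features=0.5),
        "Support vector machine": SVC(
            C=100, kernel="rbf", gamma=0.001, probability=True),
        "Gaussian process": GaussianProcessClassifier(kernel=Matern()),
        "K-nearest neighbors": KNeighborsClassifier(
            n_neighbors=9, weights="distance", p=2),
        "Multi-layer perceptron": MLPClassifier(
            hidden_layer_sizes=(4,), activation="tanh", solver="adam", alpha=0.05),
    }


def train_classifiers(X, y):
    """Fit every classifier on the annotation embeddings and their ontologies."""
    return {name: clf.fit(X, y) for name, clf in build_classifiers().items()}
```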
4. Case Study in Catalysis

A catalyst is a chemical that is not consumed in the process of a chemical reaction. Using a catalyst in a chemical reaction usually allows the reaction to take place faster and under more moderate reaction conditions. Catalysis-based chemical synthesis is applied in roughly 90% of the processes in the chemical industry. The scientific domain of catalysis is highly interconnected with other sciences and thus spans many topics from material sciences to process design [5, 6].

4.1. Used Data

The texts that have been used for fine-tuning BERT have been taken from scientific papers on catalysis. These articles have been converted by a PowerShell script into Microsoft Word documents. Thanks to the Word engine, paragraphs and titles are marked properly, so paragraphs with relevant texts have been extracted and, after BERT embedding, prepared for classification.

We conduct our experiments on a set of five ontologies from the chemical domain (Table 1), gathered within the NFDI4Cat project [34]. The ontologies NCIT, CHMO and Allotrope have a close connection to the chemical domain. However, according to their names, the Chemical Entities of Biological Interest (CHEBI) and the Systems Biology Ontology (SBO) are expected to be further away from the chemical domain. This does not necessarily hold true for CHEBI, as it describes a plethora of chemical entities that are also relevant in the chemical and not only the biological domain. The SBO was selected as it contains some general laboratory and computational contexts. It can also be seen as a kind of test of whether the tools used can also identify ontologies not fitting the text content. Hence, these ontologies are the classes to which the classifiers assign new parts of text. The data have been divided into training and testing datasets in a stratified proportion of 1:1. The testing dataset has been divided into 20 disjoint subsets, assuming that disjointness is a sufficient condition for their independence. The training dataset has been under-sampled in order to mitigate overfitting during the training part.

Table 1: Types and counts of labels in the used OWL files
Ontology name | XML classes | Number of classes
Allotrope | Literal, rdfs:comment, rdfs:label | 2773
NCIT | rdfs:comment, rdfs:label | 1169
SBO | Literal, rdfs:comment, rdfs:label | 534
CHEBI | obo:IAO_0000115, rdfs:label | 35067
CHMO | obo:IAO_0000115, rdfs:comment, rdfs:label | 2521

4.2. Experimental Setting

At first, the PDFs were transformed into Microsoft Word documents using PowerShell scripts. The output files have been processed by a Python library for parsing docx files. As a result, the relevant paragraphs have been extracted for classification according to the most relevant ontology. The irrelevant paragraphs, containing acknowledgements, references and titles, as well as too short paragraphs (shorter than 100 characters), have been skipped.

The annotations in the specifications of the given ontologies have been extracted using the XML parser for Python named BeautifulSoup. The extracted paragraphs have also been used for BERT fine-tuning. The chosen version of BERT was recobo/chemical-bert-uncased from the Huggingface portal [9]. Using the fine-tuned BERT, every paragraph has been transformed into a 768-dimensional numeric vector.

The extraction of annotations from OWL files has been performed using a Python XML parser. The individual annotations have again been embedded into the 768-dimensional vector space using the fine-tuned BERT.
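The following is a rough sketch of such an annotation extraction with BeautifulSoup, matching the annotation tags listed in Table 1 by their local names; the helper name and the exact matching strategy are illustrative, not the authors' original script.

```python
# A sketch of pulling class annotations out of an OWL (XML) file with
# BeautifulSoup; the tag set corresponds to the annotation types in Table 1.
from bs4 import BeautifulSoup

# Local names of rdfs:label, rdfs:comment and obo:IAO_0000115 (definition).
ANNOTATION_TAGS = {"label", "comment", "IAO_0000115"}


def is_annotation(tag):
    """Match annotation tags regardless of how the namespace prefix is kept."""
    return tag.name.split(":")[-1] in ANNOTATION_TAGS


def extract_annotations(owl_path: str) -> list[str]:
    """Return the non-empty annotation texts found in an OWL specification."""
    with open(owl_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "xml")
    texts = [node.get_text(strip=True) for node in soup.find_all(is_annotation)]
    return [t for t in texts if t]
```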
For the employed classifiers, their implementations in scikit-learn [4] have been used. The optimal values of the hyperparameters of each classifier were determined using a 5-fold cross-validation applied to a grid search over the values listed in Table 2 (a sketch of this tuning step follows the table). In order to mitigate overfitting, the training data have been undersampled. Statistical computations have used the scipy, statsmodels and pingouin Python libraries.

Table 2: Hyperparameters of the individual classifiers that were determined through a grid search over combinations of the considered values. The column Selected lists the values that have been selected using a random stratified 5-fold cross-validation applied to a grid search over the considered values.
Classifier | Hyperparameter | Considered values | Selected
Random forest | maximal depth | {5, 7, 9, 11} | 11
Random forest | criterion | {entropy, gini} | gini
Random forest | count of estimators | {5, 10, 15, 20, 25, 30} | 20
Random forest | fraction of features used in each split | {0.5, 0.7} | 0.5
Random forest | bootstrap samples | {false, true} | true
Support vector machine | slack trade-off constant (C) | {1, 10, 100, 1000} | 100
Support vector machine | kernel type | {linear, radial basis} | radial basis
Support vector machine | kernel coefficient gamma | {0.001, 0.0001} | 0.001
Gaussian process | kernel | {radial basis, dot product, Matern, rational quadratic, white kernel} | Matern
Gaussian process | random state | {0, 50} | not applicable
K-nearest neighbors | number of considered neighbors | {1, 5, 9, 13, 17} | 9
K-nearest neighbors | weights | {uniform, distance} | distance
K-nearest neighbors | algorithm | {auto, ball tree, kd tree, brute} | auto
K-nearest neighbors | distance metric exponent | {1, 2, 3, 4, 5} | 2
Multi-layer perceptron | random state | {0, 1} | 0
Multi-layer perceptron | activation function | {identity, logistic, tanh, relu} | tanh
Multi-layer perceptron | optimizer | {lbfgs, sgd, adam} | adam
Multi-layer perceptron | hidden layer size | {1, 4, 16, 64} | 4
Multi-layer perceptron | strength of L2 regularization term | {0.0001, 0.05} | 0.05
Multi-layer perceptron | learning rate for weights update | {constant, adaptive} | constant
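A minimal sketch of this tuning step for one classifier (the SVM), assuming the annotation embeddings X and the ontology labels y, is given below; the scoring choice and the helper name are ours.

```python
# A sketch of the stratified 5-fold cross-validated grid search over the SVM
# values from Table 2; X holds the annotation embeddings, y the ontology labels.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

PARAM_GRID = {
    "C": [1, 10, 100, 1000],
    "kernel": ["linear", "rbf"],
    "gamma": [0.001, 0.0001],
}


def tune_svm(X, y):
    """Return the best SVM found by the grid search and its parameters."""
    search = GridSearchCV(
        SVC(),
        PARAM_GRID,
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
        scoring="accuracy",
        n_jobs=-1,
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```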
4.3. Comparison of Important Classifiers on Considered Ontologies

Summary statistics of the predictive accuracy of classifying all 20 testing datasets are given in Table 3. The table is complemented with boxplots (Figure 2), where the following quality measures are presented for each classifier: accuracy, F1 score, precision and recall. The random forest classifier had the worst results of all experiments. The other models had significantly better results. The best accuracy was achieved by the Gaussian process; its mean accuracy was 97.5% with a very low standard deviation.

The differences between the considered classifiers were tested for significance by the Friedman test. The basic null hypothesis that the mean accuracy of all 5 classifiers coincides was strongly rejected, with the achieved significance $p = 3.02 \times 10^{-12}$. For the post-hoc analysis, we employed the Wilcoxon signed-rank test with a two-sided alternative for all 10 pairs of the compared classifiers, because of the inconsistency of the more common mean-ranks post-hoc test, as pointed out in [40]. For the correction for multiple hypotheses testing, we used the Holm method. The results are given in Table 4; good results were achieved by the support vector machine and the Gaussian process classifier (a sketch of this testing procedure follows Table 4).

Table 3: Quality measures of the considered classifiers aggregated over all 20 testing datasets (mean [%] ± standard deviation [%]), where $Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$, $Precision = \frac{TP}{TP+FP}$, $Recall = \frac{TP}{TP+FN}$ and $F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$.
Classifier | Accuracy | F1 | Precision | Recall
Gaussian process | 97.46 ± 0.39 | 89.48 ± 1.38 | 85.70 ± 1.35 | 95.88 ± 1.21
K-nearest neighbors | 96.66 ± 0.67 | 87.60 ± 2.41 | 84.36 ± 2.69 | 92.73 ± 2.04
Multi-layer perceptron | 96.99 ± 0.67 | 87.84 ± 1.59 | 84.03 ± 1.54 | 94.97 ± 1.58
Random forest | 94.63 ± 0.69 | 82.00 ± 2.29 | 76.30 ± 2.34 | 90.76 ± 2.58
Support vector machine | 97.16 ± 0.53 | 88.72 ± 1.85 | 84.64 ± 1.89 | 95.85 ± 1.69

Figure 2: Box plots comparing the distribution of the quality measures of the considered classifiers on the testing datasets.

Table 4: Comparison of accuracy results on all 20 testing sets with ontology annotations. The values in the table are the counts of datasets in which the model in the row has a higher accuracy than the model in the column. If the difference is not significant in the Wilcoxon test, the count is in italics; if the difference is significant, the higher count is in bold.
Model | Random forest | Support vector machine | Gaussian process | K-nearest neighbors | Multi-layer perceptron | Summary score
Random forest | - | 0 | 0 | 0 | 0 | 0
Support vector machine | 20 | - | 2 | 15 | 13 | 50
Gaussian process | 20 | 16 | - | 17 | 19 | 72
K-nearest neighbors | 20 | 3 | 3 | - | 5 | 31
Multi-layer perceptron | 20 | 4 | 1 | 14 | - | 39
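The following sketch illustrates this statistical comparison, assuming the accuracies of the 5 classifiers on the 20 testing subsets are collected in a 20 × 5 array. It uses scipy for the Friedman and Wilcoxon tests and statsmodels for the Holm correction, two of the libraries named in Section 4.2, but the exact calls are our assumption.

```python
# A sketch of the significance testing: a Friedman test over the accuracy
# matrix (rows = testing subsets, columns = classifiers), followed by pairwise
# two-sided Wilcoxon signed-rank tests with Holm correction.
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests


def compare_classifiers(acc, names):
    """Return the Friedman p-value and Holm-corrected pairwise Wilcoxon results."""
    acc = np.asarray(acc)
    _, p_friedman = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])
    pairs = list(combinations(range(acc.shape[1]), 2))
    p_values = [wilcoxon(acc[:, i], acc[:, j], alternative="two-sided").pvalue
                for i, j in pairs]
    reject, p_holm, _, _ = multipletests(p_values, alpha=0.05, method="holm")
    results = {(names[i], names[j]): (p, bool(r))
               for (i, j), p, r in zip(pairs, p_holm, reject)}
    return p_friedman, results
```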
4.4. Classification of Scientific Texts with respect to Relevant Ontologies

For this experiment, we had no ground truth as to which of the available ontologies is the most relevant for each considered paragraph of text. We employed two collections of scientific papers from the area of catalysis. The small one consists of papers dealing with the topic of methanation of CO2; it comprises 28 PDFs, from which we have extracted 1,485 relevant paragraphs. The large one is the digital archive of papers (co-)authored by scientists from the Leibniz Institute for Catalysis (with the exception of very few papers with read protection); it consists of 3,450 PDFs, from which we have extracted 144,490 relevant paragraphs. The BERT embeddings of those paragraphs were classified by the five trained classifiers. The confidence is the probability, over all classes, that the source paragraph fits into the target class. Every paragraph can be classified into more than one target class with a specific confidence, and the confidences of each paragraph sum to one. In this experiment, the models trained in the previous experiment were used.
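The following sketch shows how such confidences and the margins between the top two of them can be obtained with scikit-learn's predict_proba, assuming clf is one of the trained classifiers (for the SVC this requires probability=True, as in the earlier sketch) and X_new holds the embeddings of the new paragraphs; the helper name is ours.

```python
# A sketch of computing the per-paragraph confidences and the margin between
# the two most confident ontologies for a trained scikit-learn classifier.
import numpy as np


def predict_with_margin(clf, X_new):
    """Return predicted ontologies, their confidences and the top-two margins."""
    proba = clf.predict_proba(X_new)          # confidences; each row sums to one
    best = np.argmax(proba, axis=1)           # index of the predicted ontology
    top_two = np.sort(proba, axis=1)[:, -2:]  # two highest confidences per row
    margin = top_two[:, 1] - top_two[:, 0]    # margin between the top two
    confidence = proba[np.arange(len(best)), best]
    return clf.classes_[best], confidence, margin
```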
4.4.1. Results for the small dataset

Figure 3 shows how many paragraphs each classifier assigned to each ontology. The Gaussian process, k-nearest neighbors, MLP and SVM assigned almost all paragraphs to the NCIT ontology. The random forest is the most uncertain among all classifiers, assigning most paragraphs to the CHEBI ontology, but some paragraphs also to each of the remaining four.

Figure 4 uses the confidences of the class predictions instead of their counts. The confidence of the SVM and MLP is very high, whereas that of the Gaussian process and random forest is substantially lower. The k-nearest neighbors classifier also has a rather high confidence.

In Figure 5, the margin between the confidence of the predicted ontology and the second highest class confidence is shown. Again, the highest values are achieved by the SVM and MLP, whereas the Gaussian process and random forest have only a small margin between the predicted and the second most confident ontology; the k-nearest neighbors classifier has quite a high margin, but not as high as the SVM or MLP.

Figure 3: Counts of paragraphs of the small collection of scientific papers predicted with the highest confidence for each target class.
Figure 4: Sums of prediction confidences for the small collection of scientific papers.
Figure 5: Sums of margins between the top two confidences for the small collection of scientific papers.

4.4.2. Results for the large dataset

Figure 6 depicts the count of paragraphs from the large dataset that each classifier assigned to each ontology. The Gaussian process, k-nearest neighbors, MLP and SVM assigned almost all paragraphs to the NCIT ontology. The random forest is the most uncertain among all classifiers, assigning most paragraphs to the CHMO ontology, but some paragraphs also to each of the remaining four.

Figure 7, using the confidences of the class predictions, shows that the confidence of the SVM is very high, whereas that of the Gaussian process and random forest is substantially lower. The MLP and the k-nearest neighbors classifier also have a rather high confidence.

In Figure 8, the margin between the confidence of the predicted ontology and the second highest class confidence is shown. Again, the highest values are achieved by the SVM, whereas the Gaussian process and random forest have only a small margin between the predicted and the second most confident ontology; the MLP and k-nearest neighbors have quite a high margin, but not as high as the SVM.

Figure 6: Counts of paragraphs of the large collection of scientific papers predicted with the highest confidence for each target class.
Figure 7: Sums of prediction confidences for the large collection of scientific papers.
Figure 8: Sums of margins between the top two confidences for the large collection of scientific papers.

4.4.3. Summary results for both datasets

From the results of the first experiment, we can say that the SVM achieved good results on the testing data with respect to many metrics. The results for both datasets show that the SVM classifier has very high confidences and very high margins between the top two confidences. Hence, the results indicate that for a large majority of the unknown scientific texts, the most relevant ontology is NCIT.

5. Conclusion

This paper provides an insight into the possibility of automatically determining the ontologies most relevant to scientific texts. Successful processing of input texts and ontologies often requires quite hard and laborious work. Here, classifiers have been used in combination with representation learning by BERT, which may help to make this process faster. Our idea was to use the embedding of each paragraph from the PDFs as input to classifiers. We used a pretrained BERT that had been fine-tuned using chemical articles. The output embeddings from the fine-tuned BERT were used as input to the classifiers. We have experimented with five different classifiers, in particular random forest, support vector machine, multilayer perceptron, k-nearest neighbors, and Gaussian process. The random forest was not successful; its accuracy was the worst of all models. The best results were achieved by the Gaussian process and the support vector machine.

In the second experiment, the considered classifiers have been tested and compared on scientific papers from the domain of catalysis. The ground truth was not known there. The k-nearest neighbors and Gaussian process classifiers had a very low margin between the first and second highest confidence. The highest confidence among all classifiers was achieved by the support vector machine, which also had the highest margin among them.

The biggest weakness of this article is the lack of ground truth for the classification of scientific articles, which makes it impossible to evaluate this classification. Therefore, we plan to use methods for reducing the impact of the unknown ground truth. Our idea is to use interpolation between annotations using the GPT-2 and GPT-3 networks. GPT (Generative Pre-trained Transformer) [41] stands for a series of pre-trained language models developed by OpenAI. They have been trained with a large dataset of textual information and can be applied to deal with specific language-related tasks. BERT, which was trained with Wikipedia and book data containing over 3.3 billion tokens, is popular in natural language understanding tasks, e.g., text classification. However, BERT, as a masked language model, can only learn contextual representations of words but cannot organize and generate language, which makes it unsuitable for the design concept generation task. On the other hand, GPTs are autoregressive language models that are trained to predict the next token based on all tokens before it. In future research, it is desirable to try different transformers. We would also like to extract knowledge from ANNs in the context of learning. The main direction of our research is extending and integrating ontologies. We further plan to use graph neural networks to incorporate them into representation learning.

Acknowledgments

The research reported in this paper has been supported by the German Research Foundation (DFG) funded projects NFDI2/12020 and 467401796, and by the Grant Agency of the Czech Technical University in Prague, grant No. SGS20/208/OHK3/3T/18.

References

[1] Gruber T.: Ontology. In: Liu L., Özsu M. T. (eds.), Encyclopedia of Database Systems, Springer-Verlag, 2009. https://tomgruber.org/writing/definition-of-ontology.pdf
[2] OWL Working Group: OWL. W3C Semantic Web, 2012. https://www.w3.org/OWL/
[3] Devlin J., Chang M., Lee K., Toutanova K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019. https://aclanthology.org/N19-1423
[4] Pedregosa F. et al.: Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830, 2011.
[5] Benvenuto M. A., Plaumann H.: Industrial Catalysis. De Gruyter (De Gruyter STEM), Berlin, Boston, 2021.
[6] American Chemical Society Report: Technology Vision 2020: The Chemical Industry. December 1996.
[7] Borst W. N.: Construction of Engineering Ontologies for Knowledge Sharing and Reuse. Enschede, Centre for Telematics and Information Technology (CTIT), 1997.
[8] Martínez-Romero M., Jonquet C., O'Connor M. J., Graybeal J., Pazos A., Musen M. A.: NCBO Ontology Recommender 2.0: An enhanced approach for biomedical ontology recommendation. Journal of Biomedical Semantics 8(1), p. 21, 2017.
[9] Recobo AI Company: BERT for Chemical Industry. HuggingFace – Transformers – BERT, 2022. https://huggingface.co/recobo/chemical-bert-uncased
[10] Al-Aswadi F. N., Chan H. Y., Gan K. H.: Extracting semantic concepts and relations from scientific publications by using deep learning. Proceedings of IRICT 2020, pp. 374–383, 2021.
[11] Althubaiti S., Kafkas S., Abdelhakim M., Hoehndorf R.: Combining lexical and context features for automatic ontology extension. Journal of Biomedical Semantics 11, article no. 1, 2020.
[12] Bento A., Zouaq A., Gagnon M.: Ontology matching using convolutional neural networks. LREC, pp. 5648–5653, 2020.
[13] Chakraborty J., Yaman B., Virgili L., Konar K., Bansal S. K.: OntoConnect: Results for OAEI 2020. OM@ISWC, pp. 204–210, 2020.
[14] Espinoza-Anke L., Ronzano F., Saggion H.: Hypernym extraction: Combining machine learning and dependency grammar. CICLing, pp. 372–383, 2015.
[15] Gupta N., Podder S., Annervaz K. M., Sengupta S.: Domain ontology induction using word embeddings. ICMLA, pp. 115–119, 2016.
[16] Hao L., Lei C., Efthymiou V., Quamar A., Özcan F., et al.: MEDTO: Medical data to ontology matching using hybrid graph neural networks. KDD'21, pp. 2946–2954, 2021.
[17] Hourali M., Montazer G. A.: Using ART2 neural network and Bayesian network for automating the ontology constructing process. Procedia Engineering 29, pp. 3914–3923, 2012.
[18] Katyshev A., Anikin A., Denisov M., Petrova T.: Intelligent approaches for the automated domain ontology extraction. International Congress on Information and Communication Technology, pp. 410–417, 2021.
[19] Kolozali S., Fazekas G., Barthet M., Sandler M. B.: A framework for automatic ontology generation based on semantic audio analysis. Audio Engineering Society International Conference, pp. 87–96, 2014.
[20] Li G.: CNN based ontology learning algorithm and applied in PE data. IAENG International Journal of Computer Science 48, pp. 1–8, 2021.
[21] Martel F., Zouaq A.: Taxonomy extraction using knowledge graph embeddings and hierarchical clustering. SAC'21, pp. 836–844, 2021.
[22] Memariani A., Glauer M., Neuhaus F., Mossakowski T., Hastings J.: Automated and explainable ontology extension based on deep learning: A case study in the chemical domain. 3rd International Workshop on Data Meets Applied Ontologies, pp. 1–16, 2021.
[23] Mercier C., Chateau-Laurent H., Alexandre F., Viéville T.: Ontology as neuronal-space manifold: Towards symbolic and numerical artificial embedding. Workshop on Knowledge Representation for Hybrid and Compositional AI, pp. 1–11, 2021.
[24] Mueller R. M., Abdullaev S.: Deep cause: Hypothesis extraction from information systems papers with deep learning for theory ontology learning. Annual Hawaii International Conference on System Sciences, pp. 6250–6259, 2019.
[25] Teslya N., Savosin S.: Matching ontologies with Word2Vec-based neural network. ICCSA, pp. 745–756, 2019.
[26] Oba A., Paik I., Kuwana A.: Automatic classification for ontology generation by pretrained language model. International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pp. 210–221, 2021.
[27] Petrucci G., Ghidini C., Rospocher M.: Ontology learning in the deep. EKAW, pp. 480–495, 2016.
[28] Petrucci G., Rospocher M., Ghidini C.: Expressive ontology learning as neural machine translation. Journal of Web Semantics 52–53, pp. 66–82, 2018.
[29] Potoniec J.: Learning OWL 2 property characteristics as an explanation for an RNN. Bulletin of the Polish Academy of Sciences, Technical Sciences 68, pp. 1481–1490, 2020.
[30] Navarro-Almanza R., Juárez-Ramírez R., Castro J. R.: Automated ontology extraction from unstructured texts using deep learning. In: Intuitionistic and Type-2 Fuzzy Logic Enhancements in Neural and Optimization Algorithms: Theory and Applications, pp. 727–755, Springer, 2020.
[31] Ristoski P., Paulheim H.: RDF2Vec: RDF graph embeddings for data mining. International Semantic Web Conference, pp. 498–514, 2016.
[32] Ritchie A., Chen J., Castro L. J., Rebholz-Schuhmann D., Jiménez-Ruiz E.: Ontology clustering with OWL2Vec. DeepOntoNLP, pp. 54–61, 2021.
[33] Wu J., Lv J., Guo H., Daeom S. M.: A deep attentional embedding approach for biomedical ontology matching. Applied Sciences 10, article no. 7909, 2020.
[34] Wulf C., Beller M., Boenisch T., Hanf S., Deutschmann O., et al.: A Unified Research Data Infrastructure for Catalysis Research – Challenges and Concepts. ChemCatChem 13, pp. 3223–3236, 2021.
[35] Ho T. K.: Random decision forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995. doi: 10.1109/ICDAR.1995.598994
[36] Schölkopf B., Smola A. J.: Learning with Kernels. MIT Press, 2002.
[37] Rasmussen C. E., Williams C. K. I.: Classification. In: Gaussian Processes for Machine Learning, pp. 33–77, MIT Press, 2005.
[38] Kramer O.: K-Nearest Neighbors. In: Dimensionality Reduction with Unsupervised Nearest Neighbors, Intelligent Systems Reference Library, vol. 51, Springer, Berlin, Heidelberg, 2013. doi: 10.1007/978-3-642-38652-7_2
[39] Vang-Mata R.: Multilayer Perceptrons: Theory and Applications. Computer Science, Technology and Applications Series, Nova Science Publishers, 2020. ISBN: 978-1-536-17365-9
[40] Benavoli A., Corani G., Mangili F.: Should We Really Use Post-Hoc Tests Based on Mean-Ranks? Journal of Machine Learning Research 17, pp. 1–10, 2016.
[41] Zhu Q., Luo J.: Generative pre-trained transformer for design concept generation: An exploration. Proceedings of the Design Society 2, pp. 1825–1834, 2022.