Using Artificial Neural Networks to Determine Ontologies Most Relevant to Scientific Texts

Lukáš Korel1, Alexander S. Behr2, Norbert Kockmann2 and Martin Holeňa1,3,4
1 Faculty of Information Technology, CTU, Prague, Czech Republic
2 Faculty of Biochemical and Chemical Engineering, TU Dortmund University, Germany
3 Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic
4 Leibniz Institute for Catalysis, Rostock, Germany

ITAT'22: Information Technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
lukas.korel@fit.cvut.cz (L. Korel); alexander.behr@tu-dortmund.de (A. S. Behr); norbert.kockmann@tu-dortmund.de (N. Kockmann); martin@cs.cas.cz, martin.holena@catalysis.de (M. Holeňa)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
This paper provides an insight into the possibility of finding the ontologies most relevant to scientific texts using artificial neural networks. The basic idea of the presented approach is to select a representative paragraph from a source text file, embed it into a vector space by a pre-trained, fine-tuned transformer, and classify the embedded vector according to its relevance to a target ontology. We have considered different classifiers to categorize the output of the transformer, in particular random forest, support vector machine, multilayer perceptron, k-nearest neighbors, and Gaussian process classifiers. Their suitability has been evaluated in a use case with ontologies and scientific texts concerning catalysis research. The results show that the random forest performed worst, whereas the best results in this task were achieved by the support vector machine classifier.

Keywords
ontology, text data, text preprocessing, text representation learning, text classification

1. Introduction

A domain ontology defines a set of representational primitives with which to model a domain of knowledge or discourse. The representational primitives are typically classes, attributes, and relationships. The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application. Classes can be defined in two ways: by annotating their definitions, or by connecting classes with each other and with properties. Each domain ontology typically uses domain-specific definitions of the terms denoting its primitives.

FAIR research data management (Findable, Accessible, Interoperable, and Reusable) needs a consistent data representation in ontologies, particularly for representing the data structure of the specific domain [34]. Since different ontologies are written by different people, they are often incompatible, even within the same domain.
As systems that rely on domain ontologies expand, it is often necessary to merge domain ontologies by manual tuning. The same is true for enhancing an ontology with information available in domain-related texts. Merging and enhancing ontologies is thus a largely manual process and therefore time-consuming and expensive.

The need to find a suitable ontology for an input text can help in classifying the information presented within the text as well as in connecting the input text with data. This would allow for an automated selection of ontologies and a respective classification of the text. Different text data could thus be compared automatically in an understandable way and connected with the corresponding research data. Ontologies represent "a formal specification of a shared conceptualization" [7] and can thus be used to express knowledge and data in a formalized, standardized description language that specifies terms and the relations between those terms.

Current ontology recommenders, such as the NCBO ontology recommender [8], score annotations based on words similar to the preferred and alternate labels of ontology classes and on term frequency. In contrast to this, our work aims to use text representation learning in order not only to search for words also contained in ontologies, but also to find concepts with a similar semantic meaning shared between the text and the ontology.

This paper is devoted to a specific problem encountered while enhancing ontologies and sometimes during their merging: to decide which of several available ontologies is most relevant to a given domain-related piece of text. Our solution to the problem relies primarily on artificial neural networks (ANNs), in particular on natural language processing (NLP).

The next section surveys the applicability of artificial neural networks to ontologies. Section 3 recalls the employed methods of text preprocessing: modules for extracting text from PDF files, for transforming the extracted files into plain text, and for eliminating irrelevant paragraphs. That section also describes text representation learning, as well as the principles of the employed classifiers. In Section 4, an application of the proposed methodology to catalysis is described and evaluated.

With regard to the sources we have studied, surveyed in Section 2 of this article, we are not aware that classifiers trained on the results of representation learning have ever been used to determine the most relevant of a given set of ontologies.

2. Applicability of Artificial Neural Networks to Ontologies

In connection with learning and extending ontologies, artificial neural networks (ANNs) have been primarily used for the identification of concepts, relations and attributes [10, 15, 18]. With respect to relations, some ANN-based methods have been developed specifically for the subsumption relations needed for the construction of taxonomies [11, 14, 21, 30]. In connection with the integration of ontologies, ANNs have been primarily used for ontology matching, also known as ontology alignment [12, 13, 16, 33]. The variety of employed kinds of ANNs is rather large. It includes traditional multilayer perceptrons (MLPs) [19], adaptive resonance theory (ART) networks [17] and associative memories [23], as well as the modern deep convolutional networks (CNNs) [12, 20], deep belief networks [10], long short-term memory (LSTM) networks together with their bidirectional variant (BiLSTM) [24], and gated recurrent unit (GRU) networks [28, 29]. The dependence of ontologies on texts led to the use of networks developed for text and natural language representation learning, most importantly BERT [22, 26], the bidirectional encoder representations from transformers, and word2vec [25], the most traditional network for embedding text into a Euclidean space. The close relationship of ontologies to knowledge graphs led to the use of RDF2Vec [21, 29], which was originally proposed for knowledge graphs [31]. Based on principles similar to word2vec and RDF2Vec, the network OWL2Vec was proposed for the embedding of ontologies [32]. Finally, the graph-like structure of ontologies brought about the use of graph neural networks (GNNs) [16, 33].

Closest to the proposed project is the way ANNs have recently been used in connection with translation into OWL [27, 28], with predicate chaining and restriction [23], and with taxonomy extraction from knowledge graphs [21]. In [27], ontology learning is tailored as a transductive reasoning task that uses two recurrent neural networks to translate text in natural language into OWL specifications in description logic. That approach was further developed in [28], resulting in a system based on a single recurrent network of the GRU type. It uses an encoder-decoder configuration and translates, through syntactic transformation, a subset of natural language into the description logic language ALLQ. Moreover, the system generalizes over different syntactic structures and has the ability to tolerate unknown words by copying input words as extralogical symbols to the output, as well as the ability to enrich the training set with new annotated examples. In [23], a mapping is established between ontologies and a pair of interacting associative memories: one of them stores assertions, and the other stores entailment rules. The most recent work [21] describes a method for the specific task of extracting a taxonomy from an embedding of a knowledge graph. Over that embedding, which can be obtained for example with RDF2Vec, hierarchical agglomerative clustering is performed, first without using type information and then injecting types into the hierarchical clustering tree. In addition, an axiom induction algorithm is applied to each cluster in the resulting tree, which allows the identification of new classes corresponding to those axioms that describe their respective clusters accurately enough.
Neural networks are often used due to their strengths in natural language processing tasks. Ontology construction relies very much on texts, which suggests the applicability of ANNs in this context.

3. Methodological Background

This section describes the methods employed to reach the stated goal. In the first part, the content of the source text files has to be obtained, parsed into paragraphs, and filtered so that only paragraphs fulfilling a minimal length and with content relevant to the topic of the document are kept. The second part describes the selected transformer and the embedding of the input paragraphs into numeric vectors for classification. The final part describes the employed classifiers, which use the outputs of the transformer for the final classification with respect to the target ontologies.

3.1. Text Preprocessing

For the problem of classifying scientific texts to the most relevant existing ontology, we have been using documents in the portable document format (PDF). An issue with PDFs is that they are optimized for printing on a physical printer; thus they contain meta-information about the contained text related to its position on the page. Therefore, it is not easy to address a single paragraph. If the file is read using a basic library for PDF files and the newline mark is used as the splitter, it returns only a single row, not the whole paragraph. Another issue is connected with multi-column documents. If the document does not include information about where the text continues, software libraries for text extraction from PDFs usually continue with the next letter on the same row.

One solution to get text data from a multi-column PDF is to use the Microsoft Word engine, which is able to solve both problems and parse the text properly. It identifies structural information in the text, such as headings, paragraphs and sentences. Each document may also contain text irrelevant to the topic of interest, for example references, acknowledgements, etc.
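Translated into code, this preprocessing step could look roughly like the following minimal sketch. It assumes the python-docx library for reading the converted documents; the helper name, the set of skipped headings and the 100-character threshold (taken over from Section 4.2) are illustrative rather than the authors' exact implementation.

```python
# A minimal sketch of paragraph extraction from a converted .docx file,
# assuming the python-docx library; names and thresholds are illustrative.
from docx import Document

SKIPPED_HEADINGS = {"references", "acknowledgement", "acknowledgements"}
MIN_LENGTH = 100  # paragraphs shorter than 100 characters are skipped


def extract_relevant_paragraphs(docx_path: str) -> list[str]:
    """Return body paragraphs of a .docx file, dropping irrelevant sections."""
    doc = Document(docx_path)
    paragraphs, skip_section = [], False
    for par in doc.paragraphs:
        text = par.text.strip()
        if par.style.name.startswith("Heading"):
            # Skip everything under headings such as References or Acknowledgements.
            skip_section = text.lower() in SKIPPED_HEADINGS
            continue
        if skip_section or len(text) < MIN_LENGTH:
            continue
        paragraphs.append(text)
    return paragraphs
```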
Specifications of ontologies are most often stored in OWL files. OWL [2] is a specific kind of XML for ontologies. The text that describes classes and relations may be stored in different tags, depending on the decision of the ontology designer.

3.2. Text Representation Learning

For typical data analysis tasks like classification or clustering, it is suitable to represent words or other parts of a text by vectors in a Euclidean space. Such a representation is mostly the result of representation learning by ANNs. In the area of text analysis and processing, probably the most successful representation learning algorithm is BERT (Bidirectional Encoder Representations from Transformers) [3].

BERT needs to be trained using a large amount of text. That is why some pretrained version is typically used, often subsequently fine-tuned using texts concerning the considered topic. Such fine-tuning is often performed even if the pretrained network was trained, apart from general texts, also with texts from some broader relevant domain (biology, medicine, chemistry, etc.).

The basic schema of BERT is given in Figure 1. The tokenized input first passes through the encoder, which embeds sentences to elements of a Euclidean space. These vectors are used as input to the BERT decoder. BERT returns one vector for each input token. Each input sequence contains a special token at the beginning, marked as CLS. The vectors embedding the tokens of an input sequence can be arranged into a matrix, whose first row is the embedding of the whole input. Details of BERT are described in [3] and at https://huggingface.co/docs/transformers/model_doc/bert. These embeddings of every input paragraph are taken into account for the final assignment of the most relevant ontology to the paragraph.

Figure 1: BERT (Bidirectional Encoder Representations from Transformers) architecture [3]. An input sentence is divided into tokens and each token is encoded to a number. The BERT output contains one numeric vector per token. The output marked as C is used for the final classification.
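As an illustration, the following is a minimal sketch of how a paragraph can be embedded with a pre-trained (and possibly fine-tuned) BERT through the Hugging Face transformers library. The checkpoint name is the one reported in Section 4.2; the helper name is ours, not part of the original pipeline.

```python
# A sketch of embedding one paragraph with BERT and keeping the [CLS] vector,
# i.e., the first row of the output matrix, as the paragraph embedding.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "recobo/chemical-bert-uncased"  # checkpoint named in Section 4.2
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)


def embed_paragraph(paragraph: str) -> torch.Tensor:
    """Return the 768-dimensional embedding of the whole input paragraph."""
    inputs = tokenizer(paragraph, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # First row of the output matrix = embedding of the whole input ([CLS]).
    return outputs.last_hidden_state[0, 0]
```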
3.3. Classification

The embeddings obtained with BERT are used as inputs for classifiers that classify a given input part of a text (e.g., a paragraph) with respect to its relevance to the considered ontologies. Those classifiers have been trained on the embeddings of the annotations from the considered ontologies, because for them the ground truth (i.e., the ontology to which the annotation belongs) is known. We have decided to select five classifiers implemented in scikit-learn [4]; they are the following (a training sketch follows the list):

1. Random forest (RF): An ensemble classifier that fits a number of classification trees on various sub-samples of the training data and uses some aggregation function to improve the predictive accuracy and control over-fitting. Usually, each tree in the ensemble is built using a sample drawn with replacement (i.e., a bootstrap sample) from the training set. Furthermore, when splitting each node during the construction of a tree, the best split is found using either all input features or a random subset of a given size. The purpose is to decrease the variance of the forest estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit. The randomness injected into forests yields decision trees with somewhat decoupled prediction errors, and by taking an average of those predictions, some errors can cancel out. RFs achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. Typically, the variance reduction yields an overall better model [35].

2. Support vector machine (SVM): A classifier designed specifically to achieve the lowest possible predictive error, using a known relationship between the generalization error and the margin of the separating hyperplane. It uses only the training points lying on the two support hyperplanes of the margin (the support vectors), so it is also memory efficient. A simple SVM can be used only for linearly separable classes. For linearly nonseparable classes, the data must first be transformed into linearly separable sets in a high-dimensional vector space of functions using a suitable kernel. The SVM classification has multiclass support handled according to a one-vs-one or one-vs-rest scheme [36].

3. Gaussian process (GP): It has been designed primarily for regression problems. A Gaussian process classifier (GPC) implements a collection of random variables indexed by a Euclidean space for classification purposes by placing a GP prior on latent functions. Its purpose is to allow a convenient formulation of the classification through a logistic link function. GPCs support multi-class classification by performing either one-versus-rest or one-versus-one training and prediction. A crucial ingredient of each GPC is the covariance function of the underlying GP. It encodes the assumptions about the similarity of the Gaussian distributions corresponding to different points [37].
4. K-nearest neighbors (KNN): Neighbors-based classification simply stores the instances of the training data. A query point is assigned the class that has the most representatives among the nearest neighbors of the point. The nearest neighbors classification can use uniform weights, which means that the value assigned to a query point is computed from a simple majority vote of the nearest neighbors. In some cases, it is better to weight the neighbors in such a way that nearer neighbors contribute more to the fit. For example, when an unknown point's class is computed from the two nearest points and one of the two is nearer than the other, in the weighted case the resulting class is the same as that of the nearer point. The distance $d$ between two points can be computed as $d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^c\right)^{1/c}$, where $n$ is the dimension of each point and $c \geq 1$; if $c = 1$, this is the Manhattan distance, and in the case $c = 2$, it is the Euclidean distance [38].

5. Multi-layer perceptron (MLP): Given a set of features and a target, it can learn a non-linear function approximator for either classification or regression. It differs from logistic regression in that between the input and the output layer there can be one or more non-linear hidden layers. The input layer consists of a set of neurons representing the input features. Each neuron in a hidden layer transforms the values from the previous layer with a weighted linear summation, followed by a non-linear activation function. The output layer receives the values from the last hidden layer and transforms them into output values. The advantages of the MLP are the capability to learn non-linear models and the capability to learn models in real time (on-line learning). However, an MLP with hidden layers has a non-convex loss function with more than one local minimum, so different random weight initializations can lead to different validation accuracy. An MLP also requires tuning a number of hyperparameters, such as the number of hidden neurons, layers, and iterations. Moreover, it is sensitive to feature scaling [39].
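The following is a minimal sketch of how these five classifiers can be trained in scikit-learn on the BERT embeddings of the ontology annotations. The hyperparameter values follow the Selected column of Table 2; the exact mapping onto scikit-learn arguments, and the probability=True flag needed for the confidence analysis of Section 4.4, are our assumptions.

```python
# A sketch of training the five scikit-learn classifiers on annotation
# embeddings X (shape (n_annotations, 768)) with ontology labels y.
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import Matern
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC


def build_classifiers():
    """Instantiate the five classifiers with the values selected in Table 2."""
    return {
        "Random forest": RandomForestClassifier(
            n_estimators=20, max_depth=11, criterion="gini", max_features=0.5),
        "Support vector machine": SVC(
            C=100, kernel="rbf", gamma=0.001, probability=True),
        "Gaussian process": GaussianProcessClassifier(kernel=Matern()),
        "K-nearest neighbors": KNeighborsClassifier(
            n_neighbors=9, weights="distance", p=2),
        "Multi-layer perceptron": MLPClassifier(
            hidden_layer_sizes=(4,), activation="tanh", solver="adam", alpha=0.05),
    }


def train_classifiers(X, y):
    """Fit every classifier on the annotation embeddings and their ontologies."""
    return {name: clf.fit(X, y) for name, clf in build_classifiers().items()}
```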
4. Case Study in Catalysis

A catalyst is a chemical that is not consumed in the process of a chemical reaction. Using a catalyst in a chemical reaction usually allows the reaction to take place faster and under more moderate reaction conditions. Catalysis-based chemical synthesis is applied in roughly 90% of the processes in the chemical industry. The scientific domain of catalysis is highly interconnected with other sciences and thus spans many topics from material sciences to process design [5, 6].

4.1. Used Data

The texts that have been used for fine-tuning BERT have been taken from scientific papers on catalysis. These articles have been converted by a PowerShell script into Microsoft Word documents. Thanks to the Word engine, paragraphs and titles are marked properly, so paragraphs with relevant texts have been extracted and, after BERT embedding, prepared for classification.

We conduct our experiments on a set of five ontologies from the chemical domain (Table 1), gathered within the NFDI4Cat project [34]. The ontologies NCIT, CHMO and Allotrope have a close connection to the chemical domain. However, according to their names, the Chemical Entities of Biological Interest (CHEBI) and the Systems Biology Ontology (SBO) are expected to be further away from the chemical domain. This does not necessarily hold true for CHEBI, as it describes a plethora of chemical entities that are also relevant in the chemical and not only the biological domain. The SBO was selected as it contains some general laboratory and computational contexts. It can also be seen as a kind of test of whether the tools used can also identify ontologies not fitting the text content. Hence, these ontologies are the classes to which the classifiers assign new parts of text. The data have been divided into training and testing datasets in a stratified proportion of 1:1. The testing dataset has been divided into 20 disjoint subsets, assuming that disjointness is a sufficient condition for their independence. The training dataset has been under-sampled in order to mitigate overfitting during the training part.

Table 1: Types and counts of labels in the used OWL files
Ontology name | XML classes | Number of classes
Allotrope | Literal, rdfs:comment, rdfs:label | 2773
NCIT | rdfs:comment, rdfs:label | 1169
SBO | Literal, rdfs:comment, rdfs:label | 534
CHEBI | obo:IAO_0000115, rdfs:label | 35067
CHMO | obo:IAO_0000115, rdfs:comment, rdfs:label | 2521

4.2. Experimental Setting

At first, the PDFs were transformed into Microsoft Word documents using PowerShell scripts. The output files have been processed by a Python library for parsing docx files. As a result, the relevant paragraphs have been extracted for classification according to the most relevant ontology. The irrelevant paragraphs, containing acknowledgements, references and titles, as well as too short paragraphs (shorter than 100 characters), have been skipped.

The annotations in the specifications of the given ontologies have been extracted using the XML parser for Python named BeautifulSoup. The extracted paragraphs have also been used for BERT fine-tuning. The chosen version of BERT was recobo/chemical-bert-uncased from the Huggingface portal [9]. Using the fine-tuned BERT, every paragraph has been transformed into a 768-dimensional numeric vector.

The extraction of annotations from OWL files has been performed using a Python XML parser. The individual annotations have again been embedded into the 768-dimensional vector space using the fine-tuned BERT.
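The following is a rough sketch of such an annotation extraction with BeautifulSoup, matching the annotation tags listed in Table 1 by their local names; the helper name and the exact matching strategy are illustrative, not the authors' original script.

```python
# A sketch of pulling class annotations out of an OWL (XML) file with
# BeautifulSoup; the tag set corresponds to the annotation types in Table 1.
from bs4 import BeautifulSoup

# Local names of rdfs:label, rdfs:comment and obo:IAO_0000115 (definition).
ANNOTATION_TAGS = {"label", "comment", "IAO_0000115"}


def is_annotation(tag):
    """Match annotation tags regardless of how the namespace prefix is kept."""
    return tag.name.split(":")[-1] in ANNOTATION_TAGS


def extract_annotations(owl_path: str) -> list[str]:
    """Return the non-empty annotation texts found in an OWL specification."""
    with open(owl_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "xml")
    texts = [node.get_text(strip=True) for node in soup.find_all(is_annotation)]
    return [t for t in texts if t]
```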
For the employed classifiers, their implementations in scikit-learn [4] have been used. The optimal values of the hyperparameters of each classifier were determined using a 5-fold cross-validation applied to a grid search over the values listed in Table 2 (a sketch of this tuning step follows the table). In order to mitigate overfitting, the training data have been undersampled. Statistical computations have used the scipy, statsmodels and pingouin Python libraries.

Table 2: Hyperparameters of the individual classifiers that were determined through a grid search over combinations of the considered values. The column Selected lists the values that have been selected using a random stratified 5-fold cross-validation applied to a grid search over the considered values.
Classifier | Hyperparameter | Considered values | Selected
Random forest | maximal depth | {5, 7, 9, 11} | 11
Random forest | criterion | {entropy, gini} | gini
Random forest | count of estimators | {5, 10, 15, 20, 25, 30} | 20
Random forest | fraction of features used in each split | {0.5, 0.7} | 0.5
Random forest | bootstrap samples | {false, true} | true
Support vector machine | slack trade-off constant (C) | {1, 10, 100, 1000} | 100
Support vector machine | kernel type | {linear, radial basis} | radial basis
Support vector machine | kernel coefficient gamma | {0.001, 0.0001} | 0.001
Gaussian process | kernel | {radial basis, dot product, Matern, rational quadratic, white kernel} | Matern
Gaussian process | random state | {0, 50} | not applicable
K-nearest neighbors | number of considered neighbors | {1, 5, 9, 13, 17} | 9
K-nearest neighbors | weights | {uniform, distance} | distance
K-nearest neighbors | algorithm | {auto, ball tree, kd tree, brute} | auto
K-nearest neighbors | distance metric exponent | {1, 2, 3, 4, 5} | 2
Multi-layer perceptron | random state | {0, 1} | 0
Multi-layer perceptron | activation function | {identity, logistic, tanh, relu} | tanh
Multi-layer perceptron | optimizer | {lbfgs, sgd, adam} | adam
Multi-layer perceptron | hidden layer size | {1, 4, 16, 64} | 4
Multi-layer perceptron | strength of L2 regularization term | {0.0001, 0.05} | 0.05
Multi-layer perceptron | learning rate for weights update | {constant, adaptive} | constant
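A minimal sketch of this tuning step for one classifier (the SVM), assuming the annotation embeddings X and the ontology labels y, is given below; the scoring choice and the helper name are ours.

```python
# A sketch of the stratified 5-fold cross-validated grid search over the SVM
# values from Table 2; X holds the annotation embeddings, y the ontology labels.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

PARAM_GRID = {
    "C": [1, 10, 100, 1000],
    "kernel": ["linear", "rbf"],
    "gamma": [0.001, 0.0001],
}


def tune_svm(X, y):
    """Return the best SVM found by the grid search and its parameters."""
    search = GridSearchCV(
        SVC(),
        PARAM_GRID,
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
        scoring="accuracy",
        n_jobs=-1,
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```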
4.3. Comparison of Important Classifiers on Considered Ontologies

Summary statistics of the predictive accuracy of classifying all 20 testing datasets are given in Table 3. The table is complemented with boxplots (Figure 2), where the following quality measures are presented for each classifier: accuracy, F1 score, precision and recall. The random forest classifier had the worst results of all experiments. The other models had significantly better results. The best accuracy was achieved by the Gaussian process; its mean accuracy was 97.5% with a very low standard deviation.

The differences between the considered classifiers were tested for significance by the Friedman test. The basic null hypothesis that the mean accuracy of all 5 classifiers coincides was strongly rejected, with the achieved significance $p = 3.02 \times 10^{-12}$. For the post-hoc analysis, we employed the Wilcoxon signed-rank test with a two-sided alternative for all 10 pairs of the compared classifiers, because of the inconsistency of the more common mean-ranks post-hoc test, as pointed out in [40]. For the correction for multiple hypotheses testing, we used the Holm method. The results are given in Table 4; good results were achieved by the support vector machine and the Gaussian process classifier (a sketch of this testing procedure follows Table 4).

Table 3: Quality measures of the considered classifiers aggregated over all 20 testing datasets (mean [%] ± standard deviation [%]), where $Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$, $Precision = \frac{TP}{TP+FP}$, $Recall = \frac{TP}{TP+FN}$ and $F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$.
Classifier | Accuracy | F1 | Precision | Recall
Gaussian process | 97.46 ± 0.39 | 89.48 ± 1.38 | 85.70 ± 1.35 | 95.88 ± 1.21
K-nearest neighbors | 96.66 ± 0.67 | 87.60 ± 2.41 | 84.36 ± 2.69 | 92.73 ± 2.04
Multi-layer perceptron | 96.99 ± 0.67 | 87.84 ± 1.59 | 84.03 ± 1.54 | 94.97 ± 1.58
Random forest | 94.63 ± 0.69 | 82.00 ± 2.29 | 76.30 ± 2.34 | 90.76 ± 2.58
Support vector machine | 97.16 ± 0.53 | 88.72 ± 1.85 | 84.64 ± 1.89 | 95.85 ± 1.69

Figure 2: Box plots comparing the distribution of the quality measures of the considered classifiers on the testing datasets.

Table 4: Comparison of accuracy results on all 20 testing sets with ontology annotations. The values in the table are the counts of datasets in which the model in the row has a higher accuracy than the model in the column. If the difference is not significant in the Wilcoxon test, the count is in italics; if the difference is significant, the higher count is in bold.
Model | Random forest | Support vector machine | Gaussian process | K-nearest neighbors | Multi-layer perceptron | Summary score
Random forest | - | 0 | 0 | 0 | 0 | 0
Support vector machine | 20 | - | 2 | 15 | 13 | 50
Gaussian process | 20 | 16 | - | 17 | 19 | 72
K-nearest neighbors | 20 | 3 | 3 | - | 5 | 31
Multi-layer perceptron | 20 | 4 | 1 | 14 | - | 39
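The following sketch illustrates this statistical comparison, assuming the accuracies of the 5 classifiers on the 20 testing subsets are collected in a 20 × 5 array. It uses scipy for the Friedman and Wilcoxon tests and statsmodels for the Holm correction, two of the libraries named in Section 4.2, but the exact calls are our assumption.

```python
# A sketch of the significance testing: a Friedman test over the accuracy
# matrix (rows = testing subsets, columns = classifiers), followed by pairwise
# two-sided Wilcoxon signed-rank tests with Holm correction.
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests


def compare_classifiers(acc, names):
    """Return the Friedman p-value and Holm-corrected pairwise Wilcoxon results."""
    acc = np.asarray(acc)
    _, p_friedman = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])
    pairs = list(combinations(range(acc.shape[1]), 2))
    p_values = [wilcoxon(acc[:, i], acc[:, j], alternative="two-sided").pvalue
                for i, j in pairs]
    reject, p_holm, _, _ = multipletests(p_values, alpha=0.05, method="holm")
    results = {(names[i], names[j]): (p, bool(r))
               for (i, j), p, r in zip(pairs, p_holm, reject)}
    return p_friedman, results
```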
4.4. Classification of Scientific Texts with respect to Relevant Ontologies

For this experiment, we had no ground truth as to which of the available ontologies is the most relevant for each considered paragraph of text. We employed two collections of scientific papers from the area of catalysis. The small one consists of papers dealing with the topic of methanation of CO2; it comprises 28 PDFs, from which we have extracted 1,485 relevant paragraphs. The large one is the digital archive of papers (co-)authored by scientists from the Leibniz Institute for Catalysis (with the exception of very few papers with read protection); it consists of 3,450 PDFs, from which we have extracted 144,490 relevant paragraphs. The BERT embeddings of those paragraphs were classified by the five trained classifiers. The confidence is the probability, over all classes, that the source paragraph fits into the target class. Every paragraph can be classified into more than one target class with a specific confidence, and the confidences of each paragraph sum to one. In this experiment, the models trained in the previous experiment were used.
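The following sketch shows how such confidences and the margins between the top two of them can be obtained with scikit-learn's predict_proba, assuming clf is one of the trained classifiers (for the SVC this requires probability=True, as in the earlier sketch) and X_new holds the embeddings of the new paragraphs; the helper name is ours.

```python
# A sketch of computing the per-paragraph confidences and the margin between
# the two most confident ontologies for a trained scikit-learn classifier.
import numpy as np


def predict_with_margin(clf, X_new):
    """Return predicted ontologies, their confidences and the top-two margins."""
    proba = clf.predict_proba(X_new)          # confidences; each row sums to one
    best = np.argmax(proba, axis=1)           # index of the predicted ontology
    top_two = np.sort(proba, axis=1)[:, -2:]  # two highest confidences per row
    margin = top_two[:, 1] - top_two[:, 0]    # margin between the top two
    confidence = proba[np.arange(len(best)), best]
    return clf.classes_[best], confidence, margin
```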
4.4.1. Results for the small dataset

Figure 3 shows how many paragraphs each classifier assigned to each ontology. The Gaussian process, k-nearest neighbors, MLP and SVM assigned almost all paragraphs to the NCIT ontology. The random forest is the most uncertain among all classifiers, assigning most paragraphs to the CHEBI ontology, but some paragraphs also to each of the remaining four.

Figure 4 uses the confidences of the class predictions instead of their counts. The confidence of the SVM and MLP is very high, whereas that of the Gaussian process and random forest is substantially lower. The k-nearest neighbors classifier also has a rather high confidence.

In Figure 5, the margin between the confidence of the predicted ontology and the second highest class confidence is shown. Again, the highest values are achieved by the SVM and MLP, whereas the Gaussian process and random forest have only a small margin between the predicted and the second most confident ontology; the k-nearest neighbors classifier has quite a high margin, but not as high as the SVM or MLP.

Figure 3: Counts of paragraphs of the small collection of scientific papers predicted with the highest confidence for each target class.
Figure 4: Sums of prediction confidences for the small collection of scientific papers.
Figure 5: Sums of margins between the top two confidences for the small collection of scientific papers.

4.4.2. Results for the large dataset

Figure 6 depicts the count of paragraphs from the large dataset that each classifier assigned to each ontology. The Gaussian process, k-nearest neighbors, MLP and SVM assigned almost all paragraphs to the NCIT ontology. The random forest is the most uncertain among all classifiers, assigning most paragraphs to the CHMO ontology, but some paragraphs also to each of the remaining four.

Figure 7, using the confidences of the class predictions, shows that the confidence of the SVM is very high, whereas that of the Gaussian process and random forest is substantially lower. The MLP and the k-nearest neighbors classifier also have a rather high confidence.

In Figure 8, the margin between the confidence of the predicted ontology and the second highest class confidence is shown. Again, the highest values are achieved by the SVM, whereas the Gaussian process and random forest have only a small margin between the predicted and the second most confident ontology; the MLP and k-nearest neighbors have quite a high margin, but not as high as the SVM.

Figure 6: Counts of paragraphs of the large collection of scientific papers predicted with the highest confidence for each target class.
Figure 7: Sums of prediction confidences for the large collection of scientific papers.
Figure 8: Sums of margins between the top two confidences for the large collection of scientific papers.

4.4.3. Summary results for both datasets

From the results of the first experiment, we can say that the SVM achieved good results on the testing data with respect to many metrics. The results for both datasets show that the SVM classifier has very high confidences and very high margins between the top two confidences. Hence, the results indicate that for a large majority of the unknown scientific texts, the most relevant ontology is NCIT.

5. Conclusion

This paper provides an insight into the possibility of automatically determining the ontologies most relevant to scientific texts. Successful processing of input texts and ontologies often requires quite hard and laborious work. Here, classifiers have been used in combination with representation learning by BERT, which may help to make this process faster. Our idea was to use the embedding of each paragraph from the PDFs as input to classifiers. We used a pretrained BERT that had been fine-tuned using chemical articles. The output embeddings from the fine-tuned BERT were used as input to the classifiers. We have experimented with five different classifiers, in particular random forest, support vector machine, multilayer perceptron, k-nearest neighbors, and Gaussian process. The random forest was not successful; its accuracy was the worst of all models. The best results were achieved by the Gaussian process and the support vector machine.

In the second experiment, the considered classifiers have been tested and compared on scientific papers from the domain of catalysis. The ground truth was not known there. The k-nearest neighbors and Gaussian process classifiers had a very low margin between the first and second highest confidence. The highest confidence among all classifiers was achieved by the support vector machine, which also had the highest margin among them.

The biggest weakness of this article is the lack of ground truth for the classification of scientific articles, which makes it impossible to evaluate this classification. Therefore, we plan to use methods for reducing the impact of the unknown ground truth. Our idea is to use interpolation between annotations using the GPT-2 and GPT-3 networks. GPT (Generative Pre-trained Transformer) [41] stands for a series of pre-trained language models developed by OpenAI. They have been trained with a large dataset of textual information and can be applied to deal with specific language-related tasks. BERT, which was trained with Wikipedia and book data containing over 3.3 billion tokens, is popular in natural language understanding tasks, e.g., text classification. However, BERT, as a masked language model, can only learn contextual representations of words but cannot organize and generate language, which makes it unsuitable for the design concept generation task. On the other hand, GPTs are autoregressive language models that are trained to predict the next token based on all tokens before it. In future research, it is desirable to try different transformers. We would also like to extract knowledge from ANNs in the context of learning. The main direction of our research is extending and integrating ontologies. We further plan to use graph neural networks to incorporate them into representation learning.

Acknowledgments

The research reported in this paper has been supported by the German Research Foundation (DFG) funded projects NFDI2/12020 and 467401796, and by the Grant Agency of the Czech Technical University in Prague, grant No. SGS20/208/OHK3/3T/18.

References

[1] Gruber T.: Ontology. In: Liu L., Özsu M. T. (eds.), Encyclopedia of Database Systems, Springer-Verlag, 2009. https://tomgruber.org/writing/definition-of-ontology.pdf
[2] OWL Working Group: OWL. W3C Semantic Web, 2012. https://www.w3.org/OWL/
[3] Devlin J., Chang M., Lee K., Toutanova K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019. https://aclanthology.org/N19-1423
[4] Pedregosa F. et al.: Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830, 2011.
[5] Benvenuto M. A., Plaumann H.: Industrial Catalysis. De Gruyter (De Gruyter STEM), Berlin, Boston, 2021.
[6] American Chemical Society Report: Technology Vision 2020: The Chemical Industry. December 1996.
[7] Borst W. N.: Construction of Engineering Ontologies for Knowledge Sharing and Reuse. Enschede, Centre for Telematics and Information Technology (CTIT), 1997.
[8] Martínez-Romero M., Jonquet C., O'Connor M. J., Graybeal J., Pazos A., Musen M. A.: NCBO Ontology Recommender 2.0: An enhanced approach for biomedical ontology recommendation. Journal of Biomedical Semantics 8(1), p. 21, 2017.
[9] Recobo AI Company: BERT for Chemical Industry. HuggingFace – Transformers – BERT, 2022. https://huggingface.co/recobo/chemical-bert-uncased
[10] Al-Aswadi F. N., Chan H. Y., Gan K. H.: Extracting semantic concepts and relations from scientific publications by using deep learning. Proceedings of IRICT 2020, pp. 374–383, 2021.
[11] Althubaiti S., Kafkas S., Abdelhakim M., Hoehndorf R.: Combining lexical and context features for automatic ontology extension. Journal of Biomedical Semantics 11, article no. 1, 2020.
[12] Bento A., Zouaq A., Gagnon M.: Ontology matching using convolutional neural networks. LREC, pp. 5648–5653, 2020.
[13] Chakraborty J., Yaman B., Virgili L., Konar K., Bansal S. K.: OntoConnect: Results for OAEI 2020. OM@ISWC, pp. 204–210, 2020.
[14] Espinoza-Anke L., Ronzano F., Saggion H.: Hypernym extraction: Combining machine learning and dependency grammar. CICLing, pp. 372–383, 2015.
[15] Gupta N., Podder S., Annervaz K. M., Sengupta S.: Domain ontology induction using word embeddings. ICMLA, pp. 115–119, 2016.
[16] Hao L., Lei C., Efthymiou V., Quamar A., Özcan F., et al.: MEDTO: Medical data to ontology matching using hybrid graph neural networks. KDD'21, pp. 2946–2954, 2021.
[17] Hourali M., Montazer G. A.: Using ART2 neural network and Bayesian network for automating the ontology constructing process. Procedia Engineering 29, pp. 3914–3923, 2012.
[18] Katyshev A., Anikin A., Denisov M., Petrova T.: Intelligent approaches for the automated domain ontology extraction. International Congress on Information and Communication Technology, pp. 410–417, 2021.
[19] Kolozali S., Fazekas G., Barthet M., Sandler M. B.: A framework for automatic ontology generation based on semantic audio analysis. Audio Engineering Society International Conference, pp. 87–96, 2014.
[20] Li G.: CNN based ontology learning algorithm and applied in PE data. IAENG International Journal of Computer Science 48, pp. 1–8, 2021.
[21] Martel F., Zouaq A.: Taxonomy extraction using knowledge graph embeddings and hierarchical clustering. SAC'21, pp. 836–844, 2021.
[22] Memariani A., Glauer M., Neuhaus F., Mossakowski T., Hastings J.: Automated and explainable ontology extension based on deep learning: A case study in the chemical domain. 3rd International Workshop on Data Meets Applied Ontologies, pp. 1–16, 2021.
[23] Mercier C., Chateau-Laurent H., Alexandre F., Viéville T.: Ontology as neuronal-space manifold: Towards symbolic and numerical artificial embedding. Workshop on Knowledge Representation for Hybrid and Compositional AI, pp. 1–11, 2021.
[24] Mueller R. M., Abdullaev S.: Deep cause: Hypothesis extraction from information systems papers with deep learning for theory ontology learning. Annual Hawaii International Conference on System Sciences, pp. 6250–6259, 2019.
[25] Teslya N., Savosin S.: Matching ontologies with Word2Vec-based neural network. ICCSA, pp. 745–756, 2019.
[26] Oba A., Paik I., Kuwana A.: Automatic classification for ontology generation by pretrained language model. International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pp. 210–221, 2021.
[27] Petrucci G., Ghidini C., Rospocher M.: Ontology learning in the deep. EKAW, pp. 480–495, 2016.
[28] Petrucci G., Rospocher M., Ghidini C.: Expressive ontology learning as neural machine translation. Journal of Web Semantics 52–53, pp. 66–82, 2018.
[29] Potoniec J.: Learning OWL 2 property characteristics as an explanation for an RNN. Bulletin of the Polish Academy of Sciences, Technical Sciences 68, pp. 1481–1490, 2020.
[30] Navarro-Almanza R., Juárez-Ramírez R., Castro J. R.: Automated ontology extraction from unstructured texts using deep learning. In: Intuitionistic and Type-2 Fuzzy Logic Enhancements in Neural and Optimization Algorithms: Theory and Applications, pp. 727–755, Springer, 2020.
[31] Ristoski P., Paulheim H.: RDF2Vec: RDF graph embeddings for data mining. International Semantic Web Conference, pp. 498–514, 2016.
[32] Ritchie A., Chen J., Castro L. J., Rebholz-Schuhmann D., Jiménez-Ruiz E.: Ontology clustering with OWL2Vec. DeepOntoNLP, pp. 54–61, 2021.
[33] Wu J., Lv J., Guo H., Daeom S. M.: A deep attentional embedding approach for biomedical ontology matching. Applied Sciences 10, article no. 7909, 2020.
[34] Wulf C., Beller M., Boenisch T., Hanf S., Deutschmann O., et al.: A Unified Research Data Infrastructure for Catalysis Research – Challenges and Concepts. ChemCatChem 13, pp. 3223–3236, 2021.
[35] Ho T. K.: Random decision forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995. doi: 10.1109/ICDAR.1995.598994
[36] Schölkopf B., Smola A. J.: Learning with Kernels. MIT Press, 2002.
[37] Rasmussen C. E., Williams C. K. I.: Classification. In: Gaussian Processes for Machine Learning, pp. 33–77, MIT Press, 2005.
[38] Kramer O.: K-Nearest Neighbors. In: Dimensionality Reduction with Unsupervised Nearest Neighbors, Intelligent Systems Reference Library, vol. 51, Springer, Berlin, Heidelberg, 2013. doi: 10.1007/978-3-642-38652-7_2
[39] Vang-Mata R.: Multilayer Perceptrons: Theory and Applications. Computer Science, Technology and Applications Series, Nova Science Publishers, 2020. ISBN: 978-1-536-17365-9
[40] Benavoli A., Corani G., Mangili F.: Should We Really Use Post-Hoc Tests Based on Mean-Ranks? Journal of Machine Learning Research 17, pp. 1–10, 2016.
[41] Zhu Q., Luo J.: Generative pre-trained transformer for design concept generation: An exploration. Proceedings of the Design Society 2, pp. 1825–1834, 2022.