Text-to-Ontology Mapping via Natural Language Processing Models

Uladzislau Yorsh¹, Alexander S. Behr², Norbert Kockmann² and Martin Holeňa¹,³,⁴

¹ Faculty of Information Technology, CTU, Prague, Czech Republic
² Faculty of Biochemical and Chemical Engineering, TU Dortmund University, Germany
³ Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic
⁴ Leibniz Institute for Catalysis, Rostock, Germany

ITAT'22: Information technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
yorshula@fit.cvut.cz (U. Yorsh); alexander.behr@tu-dortmund.de (A. S. Behr); norbert.kockmann@tu-dortmund.de (N. Kockmann); martin@cs.cas.cz (M. Holeňa)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
The paper presents work in progress attempting to solve a text-to-ontology mapping problem. While ontologies are created as formal specifications of shared conceptualizations of application domains, different users often create different ontologies to represent the same domain. For better reasoning about concepts in scientific papers, it is desirable to pick the ontology that best matches the concepts present in the input text. We have started to automate this process and attack the problem by utilizing state-of-the-art NLP tools and neural networks. Given a specific set of ontologies, we experiment with different training pipelines for NLP machine learning models with the aim of constructing representative embeddings for the text-to-ontology matching task. We assess the final result through visualizing the latent space and exploring the mappings between an input text and ontology classes.

Keywords
text analysis, language models, fastText, BERT, matching text to ontologies

1. Introduction

The FAIR (Findable, Accessible, Interoperable and Reusable) research data management needs a consistent data representation in ontologies, particularly for representing the data structure in a specific domain [1]. The application of ontologies ranges from a domain-specific vocabulary and a translation reference up to an environment for logical reasoning and property inference.

Despite their purpose of standardizing knowledge conceptualization, several ontologies may still exist within the same domain [2]. Creating and managing an ontology is a manual process often performed by many domain experts. As each expert works on different problems, they also might have different conceptualizations of their respective knowledge. However, approaches to automate knowledge conceptualization also face their own challenges, as a machine cannot easily create semantics without human input (e.g. scientific theses, which are created by humans). A constant demand for knowledge database expansion and the utilization of already available knowledge leads to the problem of ontology alignment and merging, which is a research field of its own.

Another problem faced by domain experts is how to choose a proper ontology for a certain task. Different ontologies can focus on different sub-domains as well as on different levels of abstraction. Choosing the ontology which best corresponds to an input text is an important step towards reasoning about it.

In the reported work in progress, we focus on the latter problem. One of the possible ways to address the task is to consider it as matching input texts with an existing text collection. Such a formulation allows us to employ already existing rich text processing pipelines, as well as powerful pretrained models.

2. Related Work

2.1. Entity linking

The problem is closely related to the concept normalization and entity linking tasks. The algorithms encountered in this context include dictionary lookup [3, 4], conditional random fields and tf-idf vector similarity [5], and word embeddings with syntactic similarity [6].

The vector similarity approaches employ either tf-idf vectors or dense word embeddings. A tf-idf vector is a document vector of the size of the considered vocabulary, where each element is the number of occurrences of a term in the document, multiplied by the logarithmized reciprocal of the number of documents in which this term appears. These vectors are well interpretable (a high value indicates a rare term which appears often in the particular document), but very sparse, which impedes the performance of machine learning algorithms. On the contrary, word embeddings generated by representation learning algorithms are dense, but provide no direct interpretation.
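As a minimal illustration of the weighting just described, the following sketch computes such a vector for a toy corpus; the documents, the whitespace tokenization and the absence of smoothing are our own simplifications, not taken from the cited systems:

import math
from collections import Counter

docs = [
    "carbon dioxide methanation over nickel catalysts",
    "nickel catalysts for carbon dioxide conversion",
    "an ontology class annotated with a textual definition",
]

def tfidf_vector(doc: str, corpus: list[str]) -> dict[str, float]:
    counts = Counter(doc.split())                        # term occurrences in the document
    vec = {}
    for term, count in counts.items():
        df = sum(term in d.split() for d in corpus)      # documents containing the term
        vec[term] = count * math.log(len(corpus) / df)   # tf times logarithmized 1/df
    return vec

print(tfidf_vector(docs[0], docs))   # rare terms such as "methanation" get high weights

Terms occurring in every document receive weight zero, while terms unique to one document receive the highest weights, which is exactly the interpretability property noted above.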
The mentioned systems share a common pipeline: in the first step, they use an external algorithm to find potential concepts in a scientific text. After that, they can link the proposals with concepts using retrieval techniques, such as dictionary lookup or vector distance.

2.2. Natural Language Processing

Entity linking techniques relying on vector similarity may use either tf-idf vectors or word embeddings. The latter may be beneficial due to the dense vector structure and the ability to be produced by high-capacity language models trained on large corpora.

fastText [7] is a representation learning algorithm producing word-level embeddings. A neural network with a single hidden layer is trained to predict a word given its context, and the learned word representations are then used as word embeddings.

Another widely used representation learning algorithm is BERT [8]. A deep sequence processing neural network is trained on two objectives: predicting a masked word in a sentence and predicting the order of two given sentences.

Compared to fastText, BERT embeds the whole input sequence at once and produces contextual embeddings for each token: the same token in different contexts will be embedded differently. This allows it to achieve state-of-the-art results in text classification [8] and named entity recognition [9] tasks. Another benefit of BERT is that its Transformer architecture demonstrates impressive transfer-learning capabilities [10], which can be useful for fine-tuning the model for tasks lying outside the pretraining data distribution.
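The two kinds of embeddings can be obtained as follows; this is a sketch only, with general-domain English checkpoints and mean pooling as illustrative assumptions (the experiments below use domain-specific models):

import fasttext
import torch
from transformers import AutoModel, AutoTokenizer

# Static, context-free word embedding from a pretrained fastText binary.
ft = fasttext.load_model("cc.en.300.bin")
word_vec = ft.get_word_vector("methanation")          # 300-dimensional vector

# Contextual token embeddings from BERT, pooled into one sentence vector.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
batch = tok("Carbon dioxide is hydrogenated to methane.", return_tensors="pt")
with torch.no_grad():
    token_vecs = bert(**batch).last_hidden_state      # one vector per token, in context
sentence_vec = token_vecs.mean(dim=1)                 # simple mean pooling over tokens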
3. Matching Texts to Ontologies

3.1. Problem definition

Within the proposed framework, we define an ontology O as a directed attributed multi-graph, where vertices represent classes, edges represent relationships between them, and both vertices and edges can have attributes.

Given a set of specific ontologies K = {O₁, . . . , Oₙ} and an input text T ∈ 𝕋, the task is to predict the ontology that best matches the content of T. A predictor may be either a "hard" mapping f : 𝕋 → K or a scoring function f : 𝕋 × K → ℝ which allows ordering the ontologies by relevance. There are several complications of the task:

The given ontologies are the only source of supervision. No text-to-ontology mapping labels are provided. This is the key difference from many other works, which rely on ground truth either for training or evaluation.

Ontologies may significantly differ in size. This can lead to very unbalanced datasets when generating them from ontologies.

These difficulties should be considered in the first place when choosing a solution method.

3.2. Text Similarity Strategy

Ontologies typically provide annotations for most of their classes and relations, potentially generating supervised datasets for ML algorithms. But before employing a text similarity approach, we have to make several strong assumptions:

• The distribution of input texts is the same as the distribution of annotation texts. It means that the input sentences should follow the same general structure, length and vocabulary as ontology annotations, to avoid prediction skewing for irrelevant reasons.

• The best matching ontology is the one which provides the annotations most similar to the input text. Since the considered methods are text-based, they do not rely on the structures or hierarchies created by ontology classes and input text terms.

For the methods mentioned below in this subsection, we employ fastText and BERT models trained on texts from related domains, which serve as a backbone for further processing. Following the notation introduced in Subsection 3.1, we consider a "hard" mapping f : 𝕋 → K directly to the space of ontologies of interest.

3.2.1. Zero-shot classification

The method consists of assigning an ontology according to the similarity between annotation embeddings and the embedding of an input text. The method is simple and does not require model fine-tuning, which allows us to quickly establish a baseline for other experiments. The common choices of similarity measures are the Euclidean and cosine distances; we choose the latter in our experiments. The reason is that for some embedding algorithms the vector length may be influenced by the input text size, so vectors corresponding to semantically close texts may generally point in the same direction but be dissimilar in terms of Euclidean distance.
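A sketch of this zero-shot assignment, under the assumption that the text and annotation vectors come from one of the backbones above and that each annotation carries the label of its source ontology:

import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_match(text_vec: np.ndarray,
                    annotation_vecs: np.ndarray,      # shape (n, d)
                    annotation_labels: list[str]) -> str:
    # The "hard" mapping f : T -> K, realized as a nearest-annotation lookup.
    distances = [cosine_distance(text_vec, v) for v in annotation_vecs]
    return annotation_labels[int(np.argmin(distances))]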
3.2.2. Supervised classification based on ontology annotations

This method relies on the supervision provided by ontology annotation attributes. Given an ontology set K, we can generate a dataset of annotation-ontology label pairs and use it for supervised training. Under the aforementioned assumptions, we can directly assign input texts to ontologies using the trained model.

3.2.3. Negative sampling

This method extends the method above by adding a "None" class, denoting that the input text does not relate to any of the given ontologies. The annotation dataset is extended by:

• Sentences extracted from scientific papers from unrelated domains, labeled with the "None" label.

• Sentences extracted from papers from related domains, with a different objective during training. For related input texts, instead of maximizing the model output scores for a ground truth class, we minimize the output score for the "None" class. This method is intended to partially counter the possible input distribution difference between ontology annotations and scientific texts.

4. Experiments

4.1. Setup

We conduct our experiments on a set of five ontologies related to the chemical domain (Table 1). The ontologies NCIT, CHMO and Allotrope are considered to be the closest to it, while Chemical Entities of Biological Interest (CHEBI) has only a subset of relevant entities. The SBO was selected as it contains some general laboratory and computational contexts, which can be seen as a kind of test of whether the tools used can also identify ontologies not fitting the text content.

We also selected 28 scientific papers as inputs for assessment, consisting of 25 research and 3 review papers. Those papers deal with the topic of methanation of CO2 and consist in sum of 1.3M symbols.

Table 1
Sizes of the considered ontologies

Ontology        Classes   Annotations
CHEBI [11]      171058    51095
NCIT [12]       170300    133478
Allotrope [13]  2893      2677
CHMO [14]       3084      2895
SBO [15]        693       692

We use the pretrained fastText model by [16] and the recobo/chemical-bert-uncased [17] checkpoint of a BERT implementation [18] from the HuggingFace repository. For preprocessing we use spaCy [19] with the scispaCy [20] model en_ner_bc5cdr_md. For the remaining machine learning models, PyTorch implementations were used. For the 3D visualization method Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [21], we used the implementation described in [22].

Due to the lack of ground truth matching data, we assess the performance primarily through inspecting the resulting input sentence-annotation pairs.

4.1.1. Text preprocessing

We employ the following text preprocessing pipeline before constructing input embeddings (steps 1–3 are sketched in code after the list):

1. *Split an input text into sentences with a spaCy model.
2. *Filter valid sentences, which contain at least two nouns and a verb.
3. *Filter out sentences with non-paired parentheses and ill-parsed formulas or composed terms.
4. (BERT) Tokenize with the tokenizer coming with the model; (fastText) convert to lowercase and split into words.

The points marked with an asterisk are meant to be applied only to new sentences from scientific papers.
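The sentence splitting and validity filtering could look as follows; the part-of-speech counts implement steps 1–3, while the exact pipeline name and the parenthesis check are illustrative choices of ours rather than the paper's code:

import spacy

nlp = spacy.load("en_core_web_sm")   # any spaCy pipeline with a POS tagger

def valid_sentences(text: str) -> list[str]:
    out = []
    for sent in nlp(text).sents:                       # step 1: sentence split
        nouns = sum(t.pos_ in ("NOUN", "PROPN") for t in sent)
        verbs = sum(t.pos_ == "VERB" for t in sent)
        balanced = sent.text.count("(") == sent.text.count(")")
        if nouns >= 2 and verbs >= 1 and balanced:     # steps 2 and 3: filters
            out.append(sent.text.strip())
    return out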
4.2. Text Similarity

Zero-shot setup. We start with representation learning of the annotations using the fastText and BERT algorithms and inspecting the produced embeddings. For the dimensionality reduction, we use the UMAP algorithm with the number of neighbors set to 15, minimum distance 0.5 and the cosine metric. We have found that 3-dimensional embeddings preserve substantially more information, allowing us to separate clusters that may be inseparable in 2D. The result is illustrated in Figure 1; three example sentences together with the annotations assigned to them by fastText and BERT are shown in Table 3.

Figure 1: A 3-dimensional projection of annotation embeddings produced by (a) fastText and (b) BERT. In the case of fastText, SBO, Allotrope, and CHMO annotations are located in tiny areas, primarily close to the center of the image.

Table 2
Zero-shot statistics for the distances of sentences to the closest ontology annotations

Embeddings   Mean closest distance   Standard deviation
fastText     0.846                   0.086
BERT         0.605                   0.038

Table 3
Sentence pairs of a new sentence from the scientific papers and the closest ontology annotation. The "carbon dioxide" annotation was assigned by BERT to all three example new sentences. While BERT embeddings are more discriminative for the ontology classification task, the assigned sentences and the low-dimensional embeddings in Figure 3 indicate that this approach is more sensitive to the distribution shift problem.

New sentence: "Also there is an upper limit of operation above which thermal decomposition will occur."
  fastText closest: "An end event specification is an event specification that is about the end of some process."
  BERT closest: "Carbon dioxide gas is a gas that is composed of carbon dioxide molecules."

New sentence: "The difference is the main adsorption species during the reaction."
  fastText closest: "Reaction scheme where the products are created from the reactants [...]"
  BERT closest: "Carbon dioxide gas is a gas that is composed of carbon dioxide molecules."

New sentence: "This enhancement of the Ni dispersion is very relevant because as reported in the literature [78] NiO sites [...]"
  fastText closest: "The name of the individual working for the sponsor responsible for overseeing the activities of the study."
  BERT closest: "Carbon dioxide gas is a gas that is composed of carbon dioxide molecules."

Those visualizations and Table 2 suggest that the model embeds input papers separately from ontology annotations, which may indicate a distribution shift between sentences and annotations.
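The projections in Figure 1 and the later figures were produced with UMAP; a sketch with the hyperparameters stated above, where the random matrix stands in for the real stacked annotation and sentence vectors:

import numpy as np
import umap

embeddings = np.random.rand(1000, 768).astype(np.float32)  # stand-in for real vectors

reducer = umap.UMAP(n_neighbors=15, min_dist=0.5,
                    metric="cosine", n_components=3)
coords_3d = reducer.fit_transform(embeddings)               # (1000, 3) array for plotting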
Ontology matching as text classification. As mentioned in Subsection 3.2, another potential strategy is to treat the problem as a classification task. If the distributions of input texts and corresponding ontologies are the same, we can train a classifier on ontology annotations and apply it to input texts.

We implement this by embedding ontology annotations with BERT and training over them a shallow fully-connected multilayer perceptron (MLP) with a single 768-dimensional hidden layer. Due to the significant difference in sizes between the ontologies, we proportionally oversample minority data points. The classifier reaches 0.987 validation accuracy after the single-shot validation on the annotations from all the classes, which indicates their good separability for different ontologies, cf. Figures 2 and 3.

However, if we preprocess input texts and embed them in this way, inspection shows that their distribution significantly differs from the distribution of ontology annotations. The visualizations in Figures 2 and 3 show a dense separate cluster of sentences parsed from scientific papers.

Figure 2: Training plot and a 3-dimensional projection of the embeddings produced by BERT in the classification approach: (a) the progress of training and validation accuracy during training, where the blue (above) and orange (below) lines indicate the training and validation accuracy, respectively; (b) annotations from ontologies and sentences from the 14 scientific papers embedded by BERT. The visualization gives an intuition of the distribution gap between the scientific texts for which we would like to find the most relevant ontology and the ontology annotations.

Negative sampling. As an attempt to counter this issue, we introduced scientific texts into the training data. We sampled 400 scientific texts from the chemical domain (as positive examples) and 400 from unrelated domains (as negatives). During training, the model is trained on two objectives (sketched in code at the end of this subsection):

1. Cross-entropy loss if the input is an ontology annotation (same as before).
2. Binary cross-entropy loss if the input is a sentence from a scientific paper. The model minimizes the probability of a special "Negative" class output for a related scientific text, and maximizes it for an unrelated one.

In this setting we first train the head over BERT until convergence, leaving the backbone frozen. Considering only ontology annotations and leaving aside the sampled sentences, the model reaches 0.984 validation accuracy, which is very similar to the performance of the classifier described above.

After that, we fine-tune the whole BERT model. The model reaches 0.958 validation accuracy after single-shot validation on the combined annotation and paper sentence dataset, with the confusion matrix shown in Figure 4. As we will show later, mixing in sampled sentences from both relevant and irrelevant scientific texts allowed us to improve classification accuracy over the classifier on top of BERT.

Figure 3: Visualization of the BERT embedding phase and the MLP classification phase of the ontology classification task with the fine-tuned BERT in the negative sampling setting: (a) a 3-dimensional projection of the embeddings produced by the fine-tuned BERT; (b) a 3-dimensional projection of the activities of the hidden layer of the MLP trained over BERT.

Figure 4: Confusion matrix of the MLP classification over the fine-tuned BERT for a dataset consisting of the annotations from all five considered ontologies and the sentences of the additional 400 related and 400 unrelated scientific papers.

Despite the good separability of individual ontologies and the additional optimization criterion, the UMAP embeddings look similar to the previous setup in terms of clustering input sentences into a separate subspace.

It is worth noting that the classifier and negative sampling models produce softmax scores, which can be interpreted as a class probability distribution. However, neural networks tend to be overconfident in their outputs [23], so additional calibration is needed before using the outputs for relevance estimation.
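A sketch of the classification head and the two training objectives as we read them; the hidden width follows the text, while the loss wiring and names are our interpretation rather than the exact training code:

import torch
import torch.nn as nn
import torch.nn.functional as F

N_ONTOLOGIES = 5
head = nn.Sequential(                        # shallow MLP over 768-d BERT embeddings
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, N_ONTOLOGIES + 1),        # extra output for the "Negative" class
)

def annotation_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Objective 1: ordinary cross-entropy on ontology labels.
    return F.cross_entropy(logits, labels)

def sentence_loss(logits: torch.Tensor, related: torch.Tensor) -> torch.Tensor:
    # Objective 2: binary cross-entropy on the "Negative" class probability;
    # target 0 for related scientific sentences, target 1 for unrelated ones.
    p_neg = logits.softmax(dim=-1)[:, -1]
    return F.binary_cross_entropy(p_neg, (~related).float())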
Statistical results. To compare the models, we first conduct the Friedman test to check whether all models perform the same. We perform a stratified split of the validation dataset with the ontology annotations into 50 samples and test the following hypothesis:

Hypothesis H1 (Null): All six models perform the same on the validation splits.

The Friedman test resulted in the rejection of the null hypothesis at the 5% significance level. To further compare the models, we perform the Wilcoxon signed-rank test on each pair of models; the testing procedure is sketched in code at the end of this section. We make the following assumptions about the algorithms:

• For a larger k, the kNN classifier can work the same or better than the 1NN.
• The neural network model can fit the training data the same or better than the kNN.
• The negative sampling results in a non-decrease or an improvement of the model generalization.

Hypothesis H2 (Null for the kNN models): The 10NN models perform the same as their 1NN variants.

While 1NN is a common setting for many NLP systems, it may produce complex decision boundaries and lead to overfitting. We test a larger k versus one to determine whether this is an issue in our setup.

Hypothesis H3 (Null for the neural network classifier): The NN classifier performs the same as the kNN models, both on BERT and fastText embeddings.

The assumption behind this hypothesis is that a neural network, as a universal approximator, can fit the data better than a nearest-neighbour classifier.

Hypothesis H4 (Null for the fine-tuned model with negative sampling): The fine-tuned BERT with negative sampling performs the same as the other considered models.

We suppose that the additional sampled sentences improve the model performance and help to avoid overfitting when fine-tuning the whole model instead of the head only.

Hypothesis H5 (Null for the rest): In each remaining pair, both models have the same performance.

Figure 5: The comparison matrix of the six considered models. The (i, j)-th element indicates the number of splits on which the i-th model performed better than the j-th. Except for the one- and ten-nearest-neighbour classifiers over BERT embeddings, all the models demonstrate statistically significant differences. BERT NN denotes a neural network classifier trained over BERT embeddings.

We indicate the relative model performance in Figure 5. At the 5% significance level, the tests rejected all the null hypotheses except H2, which was rejected only for the fastText embeddings. To explain this, we can note that there is a relatively sharp boundary between individual classes in the UMAP embeddings. If the same holds in the original space, a larger k may suppress outlier noise but decrease classification accuracy near the boundary.
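The testing procedure above can be sketched with SciPy as follows; the accuracy matrix is a random stand-in for the real per-split results of the six models:

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
accs = rng.uniform(0.8, 1.0, size=(50, 6))   # 50 validation splits x 6 models

stat, p = friedmanchisquare(*accs.T)         # H1: all six models perform the same
print(f"Friedman p-value: {p:.4f}")

if p < 0.05:                                 # pairwise follow-up tests
    for i in range(6):
        for j in range(i + 1, 6):
            _, p_ij = wilcoxon(accs[:, i], accs[:, j])
            print(f"models {i} vs {j}: p = {p_ij:.4f}")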
5. Conclusion and Further Research

We are not aware of other works on unsupervised text-to-ontology mapping, so we are not able to discuss them and compare the proposed approach with previous methods.

The reported work in progress revealed that the distribution of the scientific texts substantially differs from that of the ontology annotations. In spite of the high classification accuracy both for the annotations from the considered ontologies and for the sentences of the additional 800 scientific papers, this leads to mapping into separate subsets of the embedding space. This is true even for the most sophisticated of the three investigated settings – with the BERT fine-tuned using both the ontology annotations and scientific texts from (un-)related domains.

To avoid such a loss of generality, future research could include an intermediate step of entity recognition. Using such recognized entities instead of raw text can help to separate the information in scientific papers that is directly related to concepts from ontologies from the unrelated words, sentences and other parts of text not eliminated during preprocessing.

Acknowledgments

The research reported in this paper has been supported by the German Research Foundation (DFG) funded projects 467401796 and NFDI2/12020.

References

[1] M. Wolf, J. Logan, K. Mehta, D. Jacobson, M. Cashman, A. M. Walker, G. Eisenhauer, P. Widener, A. Cliff, Reusability first: Toward FAIR workflows, in: 2021 IEEE International Conference on Cluster Computing (CLUSTER), 2021, pp. 444–455. doi:10.1109/Cluster48925.2021.00053.
[2] J. Grühn, A. S. Behr, T. H. Eroglu, V. Trögel, K. Rosenthal, N. Kockmann, From coiled flow inverter to stirred tank reactor – bioprocess development and ontology design, Chemie Ingenieur Technik 94 (2022) 852–863. doi:10.1002/cite.202100177.
[3] L. Hirschman, M. Krallinger, A. Valencia, J. Fluck, H.-T. Mevissen, H. Dach, M. Oster, M. Hofmann-Apitius, ProMiner: Recognition of human gene and protein names using regularly updated dictionaries, Proceedings of the Second BioCreAtIvE Challenge Evaluation Workshop (2007) 149–151.
[4] A. A. Morgan, Z. Lu, X. Wang, A. M. Cohen, J. Fluck, P. Ruch, A. Divoli, K. Fundel, R. Leaman, J. Hakenberg, C. Sun, H.-H. Liu, R. Torres, M. Krauthammer, W. W. Lau, H. Liu, C.-N. Hsu, M. Schuemie, K. B. Cohen, L. Hirschman, Overview of BioCreative II gene normalization, Genome Biol 9 Suppl 2 (2008) S3.
[5] R. Leaman, R. Islamaj Dogan, Z. Lu, DNorm: disease name normalization with pairwise learning to rank, Bioinformatics 29 (2013) 2909–2917.
[6] İ. Karadeniz, A. Özgür, Linking entities through an ontology using word embeddings and syntactic re-ranking, BMC Bioinformatics 20 (2019) 156. doi:10.1186/s12859-019-2678-8.
[7] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606 (2016).
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the NAACL, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[9] Z. Liu, F. Jiang, Y. Hu, C. Shi, P. Fung, NER-BERT: A pre-trained model for low-resource entity tagging, CoRR abs/2112.00405 (2021). URL: https://arxiv.org/abs/2112.00405. arXiv:2112.00405.
[10] K. Lu, A. Grover, P. Abbeel, I. Mordatch, Pretrained transformers as universal computation engines, CoRR abs/2103.05247 (2021). URL: https://arxiv.org/abs/2103.05247. arXiv:2103.05247.
[11] J. Hastings, G. Owen, A. Dekker, M. Ennis, N. Kale, V. Muthukrishnan, S. Turner, N. Swainston, P. Mendes, C. Steinbeck, ChEBI in 2016: Improved services and an expanding collection of metabolites, Nucleic Acids Res 44 (2015) D1214–9.
[12] National Cancer Institute Thesaurus, 2022. URL: https://bioportal.bioontology.org/ontologies/NCIT.
[13] Allotrope Foundation ontologies, 2022. URL: https://www.allotrope.org/ontologies.
[14] Chemical methods ontology, 2022. URL: https://obofoundry.org/ontology/chmo.html.
[15] Systems biology ontology, 2022. URL: https://github.com/EBI-BioModels/SBO.
[16] E. Kim, Z. Jensen, A. van Grootel, K. Huang, M. Staib, S. Mysore, H.-S. Chang, E. Strubell, A. McCallum, S. Jegelka, E. Olivetti, Inorganic materials synthesis planning with literature-trained neural networks, Journal of Chemical Information and Modeling 60 (2020) 1194–1201. doi:10.1021/acs.jcim.9b00995.
[17] BERT for chemical industry, 2022. URL: https://huggingface.co/recobo/chemical-bert-uncased.
[18] BERT, 2022. URL: https://huggingface.co/docs/transformers/model_doc/bert.
[19] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, 2017.
[20] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and robust models for biomedical natural language processing, in: Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 319–327. URL: https://aclanthology.org/W19-5034. doi:10.18653/v1/W19-5034.
[21] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, 2018. URL: https://arxiv.org/abs/1802.03426. doi:10.48550/ARXIV.1802.03426.
[22] L. McInnes, J. Healy, N. Saul, L. Grossberger, UMAP: Uniform manifold approximation and projection, The Journal of Open Source Software 3 (2018) 861.
[23] Y. Gal, Uncertainty in Deep Learning, Ph.D. thesis, University of Cambridge, 2016.