Text-to-Ontology Mapping via Natural Language Processing Models

Uladzislau Yorsh¹, Alexander S. Behr², Norbert Kockmann² and Martin Holeňa¹,³,⁴

¹ Faculty of Information Technology, CTU, Prague, Czech Republic
² Faculty of Biochemical and Chemical Engineering, TU Dortmund University, Germany
³ Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic
⁴ Leibniz Institute for Catalysis, Rostock, Germany

ITAT'22: Information technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
yorshula@fit.cvut.cz (U. Yorsh); alexander.behr@tu-dortmund.de (A. S. Behr); norbert.kockmann@tu-dortmund.de (N. Kockmann); martin@cs.cas.cz (M. Holeňa)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
The paper presents work in progress attempting to solve a text-to-ontology mapping problem. While ontologies are created as formal specifications of shared conceptualizations of application domains, different users often create different ontologies to represent the same domain. For better reasoning about concepts in scientific papers, it is desirable to pick the ontology that best matches the concepts present in the input text. We have started to automate this process and attack the problem by utilizing state-of-the-art NLP tools and neural networks. Given a specific set of ontologies, we experiment with different training pipelines for NLP machine learning models with the aim of constructing representative embeddings for the text-to-ontology matching task. We assess the final result through visualizing the latent space and exploring the mappings between an input text and ontology classes.

Keywords
text analysis, language models, fastText, BERT, matching text to ontologies

1. Introduction

The FAIR (Findable, Accessible, Interoperable and Reusable) research data management needs a consistent data representation in ontologies, particularly for representing the data structure in a specific domain [1]. The application of ontologies ranges from a domain-specific vocabulary and a translation reference up to an environment for logical reasoning and property inference.

Despite their purpose of standardizing knowledge conceptualization, several ontologies may still exist within the same domain [2]. Creating and managing an ontology is a manual process often performed by many domain experts. As each expert works on different problems, they also might have different conceptualizations of their respective knowledge. However, approaches to automate knowledge conceptualization also face their own challenges, as a machine cannot easily create semantics without human input (e.g. scientific theses, which are created by humans). A constant demand for knowledge database expansion and the utilization of already available knowledge leads to the problem of ontology alignment and merging, which is a research field of its own.

Another problem faced by domain experts is how to choose a proper ontology for a certain task. Different ontologies can focus on different sub-domains as well as on different levels of abstraction. Choosing the ontology which best corresponds to an input text is an important step towards reasoning about it.

In the reported work in progress, we focus on the latter problem. One of the possible ways to address the task is to consider it as matching input texts with an existing text collection. Such a formulation allows us to employ already existing rich text processing pipelines, as well as powerful pretrained models.

2. Related Work

2.1. Entity linking

The problem is closely related to the concept normalization and entity linking tasks. The algorithms encountered in this context include dictionary lookup [3, 4], conditional random fields and tf-idf vector similarity [5], and word embeddings with syntactic similarity [6].

The vector similarity approaches employ either tf-idf vectors or dense word embeddings. A tf-idf vector is a document vector of the size of the considered vocabulary, where each element is the number of occurrences of a term in the document, multiplied by the logarithmized reciprocal of the number of documents in which this term appears. These vectors are well interpretable (a high value indicates a rare term which appears often in the particular document), but very sparse, which impedes the performance of machine learning algorithms. On the contrary, word embeddings generated by representation learning algorithms are dense, but provide no direct interpretation.
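As a minimal illustration of the weighting just described, the following sketch computes such a vector for a toy corpus; the documents, the whitespace tokenization and the absence of smoothing are our own simplifications, not taken from the cited systems:

import math
from collections import Counter

docs = [
    "carbon dioxide methanation over nickel catalysts",
    "nickel catalysts for carbon dioxide conversion",
    "an ontology class annotated with a textual definition",
]

def tfidf_vector(doc: str, corpus: list[str]) -> dict[str, float]:
    counts = Counter(doc.split())                        # term occurrences in the document
    vec = {}
    for term, count in counts.items():
        df = sum(term in d.split() for d in corpus)      # documents containing the term
        vec[term] = count * math.log(len(corpus) / df)   # tf times logarithmized 1/df
    return vec

print(tfidf_vector(docs[0], docs))   # rare terms such as "methanation" get high weights

Terms occurring in every document receive weight zero, while terms unique to one document receive the highest weights, which is exactly the interpretability property noted above.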
The mentioned systems share a common pipeline: in the first step, they use an external algorithm to find potential concepts in a scientific text. After that, they can link the proposals with concepts using retrieval techniques, such as dictionary lookup or vector distance.

2.2. Natural Language Processing

Entity linking techniques relying on vector similarity may use either tf-idf vectors or word embeddings. The latter may be beneficial due to the dense vector structure and the ability to be produced by high-capacity language models trained on large corpora.

fastText [7] is a representation learning algorithm producing word-level embeddings. A neural network with a single hidden layer is trained to predict a word given its context, and the learned word representations are then used as word embeddings.

Another widely used representation learning algorithm is BERT [8]. A deep sequence processing neural network is trained on two objectives: predicting a masked word in a sentence and predicting the order of two given sentences.

Compared to fastText, BERT embeds the whole input sequence at once and produces contextual embeddings for each token: the same token in different contexts will be embedded differently. This allows it to achieve state-of-the-art results in text classification [8] and named entity recognition [9] tasks. Another benefit of BERT is that its Transformer architecture demonstrates impressive transfer-learning capabilities [10], which can be useful for fine-tuning the model for tasks lying outside the pretraining data distribution.
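The two kinds of embeddings can be obtained as follows; this is a sketch only, with general-domain English checkpoints and mean pooling as illustrative assumptions (the experiments below use domain-specific models):

import fasttext
import torch
from transformers import AutoModel, AutoTokenizer

# Static, context-free word embedding from a pretrained fastText binary.
ft = fasttext.load_model("cc.en.300.bin")
word_vec = ft.get_word_vector("methanation")          # 300-dimensional vector

# Contextual token embeddings from BERT, pooled into one sentence vector.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
batch = tok("Carbon dioxide is hydrogenated to methane.", return_tensors="pt")
with torch.no_grad():
    token_vecs = bert(**batch).last_hidden_state      # one vector per token, in context
sentence_vec = token_vecs.mean(dim=1)                 # simple mean pooling over tokens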
3. Matching Texts to Ontologies

3.1. Problem definition

Within the proposed framework, we define an ontology O as a directed attributed multi-graph, where vertices represent classes, edges represent relationships between them, and both vertices and edges can have attributes.

Given a set of specific ontologies K = {O₁, . . . , Oₙ} and an input text T ∈ 𝕋, the task is to predict the ontology that best matches the content of T. A predictor may be either a "hard" mapping f : 𝕋 → K or a scoring function f : 𝕋 × K → ℝ which allows ordering the ontologies by relevance. There are several complications of the task:

The given ontologies are the only source of supervision. No text-to-ontology mapping labels are provided. This is the key difference from many other works, which rely on ground truth either for training or evaluation.

Ontologies may significantly differ in size. This can lead to very unbalanced datasets when generating them from ontologies.

These difficulties should be considered in the first place when choosing a solution method.

3.2. Text Similarity Strategy

Ontologies typically provide annotations for most of their classes and relations, potentially generating supervised datasets for ML algorithms. But before employing a text similarity approach, we have to make several strong assumptions:

• The distribution of input texts is the same as the distribution of annotation texts. It means that the input sentences should follow the same general structure, length and vocabulary as ontology annotations, to avoid prediction skewing for irrelevant reasons.

• The best matching ontology is the one which provides the annotations most similar to the input text. Since the considered methods are text-based, they do not rely on the structures or hierarchies created by ontology classes and input text terms.

For the methods mentioned below in this subsection, we employ fastText and BERT models trained on texts from related domains, which serve as a backbone for further processing. Following the notation introduced in Subsection 3.1, we consider a "hard" mapping f : 𝕋 → K directly to the space of ontologies of interest.

3.2.1. Zero-shot classification

The method consists of assigning an ontology according to the similarity between annotation embeddings and the embedding of an input text. The method is simple and does not require model fine-tuning, which allows us to quickly establish a baseline for other experiments. The common choices of similarity measures are the Euclidean and cosine distances; we choose the latter in our experiments. The reason is that for some embedding algorithms the vector length may be influenced by the input text size, so vectors corresponding to semantically close texts may generally point in the same direction but be dissimilar in terms of Euclidean distance.
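A sketch of this zero-shot assignment, under the assumption that the text and annotation vectors come from one of the backbones above and that each annotation carries the label of its source ontology:

import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_match(text_vec: np.ndarray,
                    annotation_vecs: np.ndarray,      # shape (n, d)
                    annotation_labels: list[str]) -> str:
    # The "hard" mapping f : T -> K, realized as a nearest-annotation lookup.
    distances = [cosine_distance(text_vec, v) for v in annotation_vecs]
    return annotation_labels[int(np.argmin(distances))]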
3.2.2. Supervised classification based on ontology annotations

This method relies on the supervision provided by ontology annotation attributes. Given an ontology set K, we can generate a dataset of annotation-ontology label pairs and use it for supervised training. Under the aforementioned assumptions, we can directly assign input texts to ontologies using the trained model.

3.2.3. Negative sampling

This method extends the method above by adding a "None" class, denoting that the input text does not relate to any of the given ontologies. The annotation dataset is extended by:

• Sentences extracted from scientific papers from unrelated domains, labeled with the "None" label.

• Sentences extracted from papers from related domains, with a different objective during training. For related input texts, instead of maximizing the model output scores for a ground truth class, we minimize the output score for the "None" class. This method is intended to partially counter the possible input distribution difference between ontology annotations and scientific texts.

4. Experiments

4.1. Setup

We conduct our experiments on a set of five ontologies related to the chemical domain (Table 1). The ontologies NCIT, CHMO and Allotrope are considered to be the closest to it, while Chemical Entities of Biological Interest (CHEBI) has only a subset of relevant entities. The SBO was selected as it contains some general laboratory and computational contexts, which can be seen as a kind of test of whether the tools used can also identify ontologies not fitting the text content.

We also selected 28 scientific papers as inputs for assessment, consisting of 25 research and 3 review papers. Those papers deal with the topic of methanation of CO2 and consist in sum of 1.3M symbols.

Table 1
Sizes of the considered ontologies

Ontology        Classes   Annotations
CHEBI [11]      171058    51095
NCIT [12]       170300    133478
Allotrope [13]  2893      2677
CHMO [14]       3084      2895
SBO [15]        693       692

We use the pretrained fastText model by [16] and the recobo/chemical-bert-uncased [17] checkpoint of a BERT implementation [18] from the HuggingFace repository. For preprocessing we use spaCy [19] with the scispaCy [20] model en_ner_bc5cdr_md. For the remaining machine learning models, PyTorch implementations were used. For the 3D visualization method Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [21], we used the implementation described in [22].

Due to the lack of ground truth matching data, we assess the performance primarily through inspecting the resulting input sentence-annotation pairs.

4.1.1. Text preprocessing

We employ the following text preprocessing pipeline before constructing input embeddings (steps 1–3 are sketched in code after the list):

1. *Split an input text into sentences with a spaCy model.
2. *Filter valid sentences, which contain at least two nouns and a verb.
3. *Filter out sentences with non-paired parentheses and ill-parsed formulas or composed terms.
4. (BERT) Tokenize with the tokenizer coming with the model; (fastText) convert to lowercase and split into words.

The points marked with an asterisk are meant to be applied only to new sentences from scientific papers.
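The sentence splitting and validity filtering could look as follows; the part-of-speech counts implement steps 1–3, while the exact pipeline name and the parenthesis check are illustrative choices of ours rather than the paper's code:

import spacy

nlp = spacy.load("en_core_web_sm")   # any spaCy pipeline with a POS tagger

def valid_sentences(text: str) -> list[str]:
    out = []
    for sent in nlp(text).sents:                       # step 1: sentence split
        nouns = sum(t.pos_ in ("NOUN", "PROPN") for t in sent)
        verbs = sum(t.pos_ == "VERB" for t in sent)
        balanced = sent.text.count("(") == sent.text.count(")")
        if nouns >= 2 and verbs >= 1 and balanced:     # steps 2 and 3: filters
            out.append(sent.text.strip())
    return out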
4.2. Text Similarity

Zero-shot setup. We start with representation learning of the annotations using the fastText and BERT algorithms and inspecting the produced embeddings. For the dimensionality reduction, we use the UMAP algorithm with the number of neighbors set to 15, minimum distance 0.5 and the cosine metric. We have found that 3-dimensional embeddings preserve substantially more information, allowing us to separate clusters that may be inseparable in 2D. The result is illustrated in Figure 1; three example sentences together with the annotations assigned to them by fastText and BERT are shown in Table 3.

Figure 1: A 3-dimensional projection of annotation embeddings produced by (a) fastText and (b) BERT. In the case of fastText, SBO, Allotrope, and CHMO annotations are located in tiny areas, primarily close to the center of the image.

Table 2
Zero-shot statistics for the distances of sentences to the closest ontology annotations

Embeddings   Mean closest distance   Standard deviation
fastText     0.846                   0.086
BERT         0.605                   0.038

Table 3
Sentence pairs of a new sentence from the scientific papers and the closest ontology annotation. The "carbon dioxide" annotation was assigned by BERT to all three example new sentences. While BERT embeddings are more discriminative for the ontology classification task, the assigned sentences and the low-dimensional embeddings in Figure 3 indicate that this approach is more sensitive to the distribution shift problem.

New sentence: "Also there is an upper limit of operation above which thermal decomposition will occur."
  fastText closest: "An end event specification is an event specification that is about the end of some process."
  BERT closest: "Carbon dioxide gas is a gas that is composed of carbon dioxide molecules."

New sentence: "The difference is the main adsorption species during the reaction."
  fastText closest: "Reaction scheme where the products are created from the reactants [...]"
  BERT closest: "Carbon dioxide gas is a gas that is composed of carbon dioxide molecules."

New sentence: "This enhancement of the Ni dispersion is very relevant because as reported in the literature [78] NiO sites [...]"
  fastText closest: "The name of the individual working for the sponsor responsible for overseeing the activities of the study."
  BERT closest: "Carbon dioxide gas is a gas that is composed of carbon dioxide molecules."

Those visualizations and Table 2 suggest that the model embeds input papers separately from ontology annotations, which may indicate a distribution shift between sentences and annotations.
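The projections in Figure 1 and the later figures were produced with UMAP; a sketch with the hyperparameters stated above, where the random matrix stands in for the real stacked annotation and sentence vectors:

import numpy as np
import umap

embeddings = np.random.rand(1000, 768).astype(np.float32)  # stand-in for real vectors

reducer = umap.UMAP(n_neighbors=15, min_dist=0.5,
                    metric="cosine", n_components=3)
coords_3d = reducer.fit_transform(embeddings)               # (1000, 3) array for plotting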
Ontology matching as text classification. As mentioned in Subsection 3.2, another potential strategy is to treat the problem as a classification task. If the distributions of input texts and corresponding ontologies are the same, we can train a classifier on ontology annotations and apply it to input texts.

We implement this by embedding ontology annotations with BERT and training over them a shallow fully-connected multilayer perceptron (MLP) with a single 768-dimensional hidden layer. Due to the significant difference in sizes between the ontologies, we proportionally oversample minority data points. The classifier reaches 0.987 validation accuracy after the single-shot validation on the annotations from all the classes, which indicates their good separability for different ontologies, cf. Figures 2 and 3.

However, if we preprocess input texts and embed them in this way, inspection shows that their distribution significantly differs from the distribution of ontology annotations. The visualizations in Figures 2 and 3 show a dense separate cluster of sentences parsed from scientific papers.

Figure 2: Training plot and a 3-dimensional projection of the embeddings produced by BERT in the classification approach: (a) the progress of training and validation accuracy during training, where the blue (above) and orange (below) lines indicate the training and validation accuracy, respectively; (b) annotations from ontologies and sentences from the 14 scientific papers embedded by BERT. The visualization gives an intuition of the distribution gap between the scientific texts for which we would like to find the most relevant ontology and the ontology annotations.

Negative sampling. As an attempt to counter this issue, we introduced scientific texts into the training data. We sampled 400 scientific texts from the chemical domain (as positive examples) and 400 from unrelated domains (as negatives). During training, the model is trained on two objectives (sketched in code at the end of this subsection):

1. Cross-entropy loss if the input is an ontology annotation (same as before).
2. Binary cross-entropy loss if the input is a sentence from a scientific paper. The model minimizes the probability of a special "Negative" class output for a related scientific text, and maximizes it for an unrelated one.

In this setting we first train the head over BERT until convergence, leaving the backbone frozen. Considering only ontology annotations and leaving aside the sampled sentences, the model reaches 0.984 validation accuracy, which is very similar to the performance of the classifier described above.

After that, we fine-tune the whole BERT model. The model reaches 0.958 validation accuracy after single-shot validation on the combined annotation and paper sentence dataset, with the confusion matrix shown in Figure 4. As we will show later, mixing in sampled sentences from both relevant and irrelevant scientific texts allowed us to improve classification accuracy over the classifier on top of BERT.

Figure 3: Visualization of the BERT embedding phase and the MLP classification phase of the ontology classification task with the fine-tuned BERT in the negative sampling setting: (a) a 3-dimensional projection of the embeddings produced by the fine-tuned BERT; (b) a 3-dimensional projection of the activities of the hidden layer of the MLP trained over BERT.

Figure 4: Confusion matrix of the MLP classification over the fine-tuned BERT for a dataset consisting of the annotations from all five considered ontologies and the sentences of the additional 400 related and 400 unrelated scientific papers.

Despite the good separability of individual ontologies and the additional optimization criterion, the UMAP embeddings look similar to the previous setup in terms of clustering input sentences into a separate subspace.

It is worth noting that the classifier and negative sampling models produce softmax scores, which can be interpreted as a class probability distribution. However, neural networks tend to be overconfident in their outputs [23], so additional calibration is needed before using the outputs for relevance estimation.
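A sketch of the classification head and the two training objectives as we read them; the hidden width follows the text, while the loss wiring and names are our interpretation rather than the exact training code:

import torch
import torch.nn as nn
import torch.nn.functional as F

N_ONTOLOGIES = 5
head = nn.Sequential(                        # shallow MLP over 768-d BERT embeddings
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, N_ONTOLOGIES + 1),        # extra output for the "Negative" class
)

def annotation_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Objective 1: ordinary cross-entropy on ontology labels.
    return F.cross_entropy(logits, labels)

def sentence_loss(logits: torch.Tensor, related: torch.Tensor) -> torch.Tensor:
    # Objective 2: binary cross-entropy on the "Negative" class probability;
    # target 0 for related scientific sentences, target 1 for unrelated ones.
    p_neg = logits.softmax(dim=-1)[:, -1]
    return F.binary_cross_entropy(p_neg, (~related).float())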
Statistical results. To compare the models, we first conduct the Friedman test to check whether all models perform the same. We perform a stratified split of the validation dataset with the ontology annotations into 50 samples and test the following hypothesis:

Hypothesis H1 (Null): All six models perform the same on the validation splits.

The Friedman test resulted in the rejection of the null hypothesis at the 5% significance level. To further compare the models, we perform the Wilcoxon signed-rank test on each pair of models; the testing procedure is sketched in code at the end of this section. We make the following assumptions about the algorithms:

• For a larger k, the kNN classifier can work the same or better than the 1NN.
• The neural network model can fit the training data the same or better than the kNN.
• The negative sampling results in a non-decrease or an improvement of the model generalization.

Hypothesis H2 (Null for the kNN models): The 10NN models perform the same as their 1NN variants.

While 1NN is a common setting for many NLP systems, it may produce complex decision boundaries and lead to overfitting. We test a larger k versus one to determine whether this is an issue in our setup.

Hypothesis H3 (Null for the neural network classifier): The NN classifier performs the same as the kNN models, both on BERT and fastText embeddings.

The assumption behind this hypothesis is that a neural network, as a universal approximator, can fit the data better than a nearest-neighbour classifier.

Hypothesis H4 (Null for the fine-tuned model with negative sampling): The fine-tuned BERT with negative sampling performs the same as the other considered models.

We suppose that the additional sampled sentences improve the model performance and help to avoid overfitting when fine-tuning the whole model instead of the head only.

Hypothesis H5 (Null for the rest): In each remaining pair, both models have the same performance.

Figure 5: The comparison matrix of the six considered models. The (i, j)-th element indicates the number of splits on which the i-th model performed better than the j-th. Except for the one- and ten-nearest-neighbour classifiers over BERT embeddings, all the models demonstrate statistically significant differences. BERT NN denotes a neural network classifier trained over BERT embeddings.

We indicate the relative model performance in Figure 5. At the 5% significance level, the tests rejected all the null hypotheses except H2, which was rejected only for the fastText embeddings. To explain this, we can note that there is a relatively sharp boundary between individual classes in the UMAP embeddings. If the same holds in the original space, a larger k may suppress outlier noise but decrease classification accuracy near the boundary.
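The testing procedure above can be sketched with SciPy as follows; the accuracy matrix is a random stand-in for the real per-split results of the six models:

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
accs = rng.uniform(0.8, 1.0, size=(50, 6))   # 50 validation splits x 6 models

stat, p = friedmanchisquare(*accs.T)         # H1: all six models perform the same
print(f"Friedman p-value: {p:.4f}")

if p < 0.05:                                 # pairwise follow-up tests
    for i in range(6):
        for j in range(i + 1, 6):
            _, p_ij = wilcoxon(accs[:, i], accs[:, j])
            print(f"models {i} vs {j}: p = {p_ij:.4f}")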
5. Conclusion and Further Research

We are not aware of other works on unsupervised text-to-ontology mapping, so we are not able to discuss them and compare the proposed approach with previous methods.

The reported work in progress revealed that the distribution of the scientific texts substantially differs from that of the ontology annotations. In spite of the high classification accuracy both for the annotations from the considered ontologies and for the sentences of the additional 800 scientific papers, this leads to mapping into separate subsets of the embedding space. This is true even for the most sophisticated of the three investigated settings – with the BERT fine-tuned using both the ontology annotations and scientific texts from (un-)related domains.

To avoid such a loss of generality, future research could include an intermediate step of entity recognition. Using such recognized entities instead of raw text can help to separate the information in scientific papers that is directly related to concepts from ontologies from the unrelated words, sentences and other parts of text not eliminated during preprocessing.

Acknowledgments

The research reported in this paper has been supported by the German Research Foundation (DFG) funded projects 467401796 and NFDI2/12020.

References

[1] M. Wolf, J. Logan, K. Mehta, D. Jacobson, M. Cashman, A. M. Walker, G. Eisenhauer, P. Widener, A. Cliff, Reusability first: Toward FAIR workflows, in: 2021 IEEE International Conference on Cluster Computing (CLUSTER), 2021, pp. 444–455. doi:10.1109/Cluster48925.2021.00053.
[2] J. Grühn, A. S. Behr, T. H. Eroglu, V. Trögel, K. Rosenthal, N. Kockmann, From coiled flow inverter to stirred tank reactor – bioprocess development and ontology design, Chemie Ingenieur Technik 94 (2022) 852–863. doi:10.1002/cite.202100177.
[3] L. Hirschman, M. Krallinger, A. Valencia, J. Fluck, H.-T. Mevissen, H. Dach, M. Oster, M. Hofmann-Apitius, ProMiner: Recognition of human gene and protein names using regularly updated dictionaries, Proceedings of the Second BioCreAtIvE Challenge Evaluation Workshop (2007) 149–151.
[4] A. A. Morgan, Z. Lu, X. Wang, A. M. Cohen, J. Fluck, P. Ruch, A. Divoli, K. Fundel, R. Leaman, J. Hakenberg, C. Sun, H.-H. Liu, R. Torres, M. Krauthammer, W. W. Lau, H. Liu, C.-N. Hsu, M. Schuemie, K. B. Cohen, L. Hirschman, Overview of BioCreative II gene normalization, Genome Biol 9 Suppl 2 (2008) S3.
[5] R. Leaman, R. Islamaj Dogan, Z. Lu, DNorm: disease name normalization with pairwise learning to rank, Bioinformatics 29 (2013) 2909–2917.
[6] İ. Karadeniz, A. Özgür, Linking entities through an ontology using word embeddings and syntactic re-ranking, BMC Bioinformatics 20 (2019) 156. doi:10.1186/s12859-019-2678-8.
[7] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606 (2016).
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the NAACL, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[9] Z. Liu, F. Jiang, Y. Hu, C. Shi, P. Fung, NER-BERT: A pre-trained model for low-resource entity tagging, CoRR abs/2112.00405 (2021). URL: https://arxiv.org/abs/2112.00405. arXiv:2112.00405.
[10] K. Lu, A. Grover, P. Abbeel, I. Mordatch, Pretrained transformers as universal computation engines, CoRR abs/2103.05247 (2021). URL: https://arxiv.org/abs/2103.05247. arXiv:2103.05247.
[11] J. Hastings, G. Owen, A. Dekker, M. Ennis, N. Kale, V. Muthukrishnan, S. Turner, N. Swainston, P. Mendes, C. Steinbeck, ChEBI in 2016: Improved services and an expanding collection of metabolites, Nucleic Acids Res 44 (2015) D1214–9.
[12] National Cancer Institute Thesaurus, 2022. URL: https://bioportal.bioontology.org/ontologies/NCIT.
[13] Allotrope Foundation ontologies, 2022. URL: https://www.allotrope.org/ontologies.
[14] Chemical methods ontology, 2022. URL: https://obofoundry.org/ontology/chmo.html.
[15] Systems biology ontology, 2022. URL: https://github.com/EBI-BioModels/SBO.
[16] E. Kim, Z. Jensen, A. van Grootel, K. Huang, M. Staib, S. Mysore, H.-S. Chang, E. Strubell, A. McCallum, S. Jegelka, E. Olivetti, Inorganic materials synthesis planning with literature-trained neural networks, Journal of Chemical Information and Modeling 60 (2020) 1194–1201. doi:10.1021/acs.jcim.9b00995.
[17] BERT for chemical industry, 2022. URL: https://huggingface.co/recobo/chemical-bert-uncased.
[18] BERT, 2022. URL: https://huggingface.co/docs/transformers/model_doc/bert.
[19] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, 2017.
[20] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and robust models for biomedical natural language processing, in: Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 319–327. URL: https://aclanthology.org/W19-5034. doi:10.18653/v1/W19-5034.
[21] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, 2018. URL: https://arxiv.org/abs/1802.03426. doi:10.48550/ARXIV.1802.03426.
[22] L. McInnes, J. Healy, N. Saul, L. Grossberger, UMAP: Uniform manifold approximation and projection, The Journal of Open Source Software 3 (2018) 861.
[23] Y. Gal, Uncertainty in Deep Learning, Ph.D. thesis, University of Cambridge, 2016.