=Paper=
{{Paper
|id=Vol-3226/paper3
|storemode=property
|title=Text-to-Ontology Mapping via Natural Language Processing Models
|pdfUrl=https://ceur-ws.org/Vol-3226/paper3.pdf
|volume=Vol-3226
|authors=Uladzislau Yorsh,Alexander Behr,Norbert Kockmann,Martin Holeňa
|dblpUrl=https://dblp.org/rec/conf/itat/YorshBKH22
}}
==Text-to-Ontology Mapping via Natural Language Processing Models==
Uladzislau Yorsh¹, Alexander S. Behr², Norbert Kockmann² and Martin Holeňa¹,³,⁴

¹ Faculty of Information Technology, CTU, Prague, Czech Republic
² Faculty of Biochemical and Chemical Engineering, TU Dortmund University, Germany
³ Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic
⁴ Leibniz Institute for Catalysis, Rostock, Germany
Abstract
The paper presents work in progress attempting to solve a text-to-ontology mapping problem. While ontologies are created as formal specifications of shared conceptualizations of application domains, different users often create different ontologies to represent the same domain. For better reasoning about concepts in scientific papers, it is desirable to pick the ontology which best matches the concepts present in the input text.
We have started to automate this process and attack the problem by utilizing state-of-the-art NLP tools and neural networks. Given a specific set of ontologies, we experiment with different training pipelines for NLP machine learning models with the aim of constructing representative embeddings for the text-to-ontology matching task. We assess the final result through visualizing the latent space and exploring the mappings between an input text and ontology classes.
Keywords
text analysis, language models, fastText, BERT, matching text to ontologies
ITAT'22: Information technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
yorshula@fit.cvut.cz (U. Yorsh); alexander.behr@tu-dortmund.de (A. S. Behr); norbert.kockmann@tu-dortmund.de (N. Kockmann); martin@cs.cas.cz (M. Holeňa)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

FAIR (Findable, Accessible, Interoperable and Reusable) research data management needs a consistent data representation in ontologies, particularly for representing the data structure in a specific domain [1]. The application of ontologies ranges from a domain-specific vocabulary and a translation reference up to an environment for logical reasoning and property inference.

Despite their purpose of standardizing knowledge conceptualization, several ontologies may still exist within the same domain [2]. Creating and managing an ontology is a manual process often performed by many domain experts. As each expert works on different problems, they may also have different conceptualizations of their respective knowledge. However, approaches to automate knowledge conceptualization also face their own challenges, as a machine cannot easily create semantics without human input (e.g. scientific theses, which are created by humans). The constant demand for expanding knowledge databases and for utilizing already available knowledge leads to the problems of ontology alignment and merging, which are research fields of their own.

Another problem faced by domain experts is how to choose a proper ontology for a certain task. Different ontologies can focus on different sub-domains as well as on different levels of abstraction. Choosing the ontology which best corresponds to an input text is an important step towards reasoning about it.

In the reported work in progress, we focus on the latter problem. One possible way to address the task is to consider it as matching input texts against an existing text collection. Such a formulation allows us to employ already existing rich text processing pipelines, as well as powerful pretrained models.

2. Related Work

2.1. Entity linking

The problem is closely related to the concept normalization and entity linking tasks. Algorithms encountered in this context include dictionary lookup [3, 4], conditional random fields and tf-idf vector similarity [5], and word embeddings with syntactic similarity [6].

The vector similarity approaches employ either tf-idf vectors or dense word embeddings. A tf-idf vector is a document vector of the size of the considered vocabulary, where each element is the number of occurrences of a term in the document, multiplied by the logarithm of the reciprocal of the number of documents in which this term appears. These vectors are well interpretable (a high value indicates a rare term which appears often in the particular document), but very sparse, which impedes the performance of machine learning algorithms. On the contrary, word embeddings generated by representation
learning algorithms are dense, but provide no direct interpretation.

The mentioned systems share a common pipeline: in the first step, they use an external algorithm to find potential concepts in a scientific text. After that, they link the proposals to concepts using retrieval techniques such as dictionary lookup or vector distance.

2.2. Natural Language Processing

Entity linking techniques relying on vector similarity may use either tf-idf vectors or word embeddings. The latter may be beneficial due to their dense structure and their ability to be produced by high-capacity language models trained on large corpora.

fastText [7] is a representation learning algorithm producing word-level embeddings. A neural network with a single hidden layer is trained to predict a word given its context, and the learned word representations are then used as word embeddings.

Another widely used representation learning algorithm is BERT [8]. A deep sequence processing neural network is trained on two objectives: predicting a masked word in a sentence and predicting the order of two given sentences.

Compared to fastText, BERT embeds the whole input sequence at once and produces contextual embeddings for each token: the same token in different contexts will be embedded differently. This allows it to achieve state-of-the-art results in text classification [8] and named entity recognition [9] tasks. Another benefit of BERT is that its Transformer architecture demonstrates impressive transfer-learning capabilities [10], which can be useful when fine-tuning the model for tasks lying outside the pretraining data distribution.

3. Matching Texts to Ontologies

3.1. Problem definition

Within the proposed framework, we define an ontology 𝑂 as a directed attributed multi-graph, where vertices represent classes, edges represent relationships between them, and both vertices and edges can have attributes.

Given a set of specific ontologies 𝐾 = {𝑂1, …, 𝑂𝑛} and an input text 𝑇 ∈ T, the task is to predict the ontology that best matches the content of 𝑇. A predictor may be either a "hard" mapping 𝑓 : T → 𝐾 or a scoring function 𝑓 : T × 𝐾 → ℝ which allows ordering the ontologies by relevance.

There are several complications to the task:

• Given ontologies are the only source of supervision. No text-to-ontology mapping labels are provided. This is the key difference from many other works, which rely on ground truth either for training or for evaluation.
• Ontologies may significantly differ in size. This can lead to very unbalanced datasets when generating them from the ontologies.

These difficulties should be considered in the first place when choosing a solution method.

3.2. Text Similarity Strategy

Ontologies typically provide annotations for most of their classes and relations, potentially yielding supervised datasets for ML algorithms. But before employing a text similarity approach, we have to make several strong assumptions:

• The distribution of input texts is the same as the distribution of annotation texts. This means that the input sentences should follow the same general structure, length and vocabulary as the ontology annotations, to avoid skewing the predictions for irrelevant reasons.
• The best matching ontology is the one which provides the annotations most similar to the input text. Since the considered methods are text-based, they will not rely on structures or hierarchies created by ontology classes and input text terms.

For the methods mentioned below in this subsection, we will employ fastText and BERT models trained on texts from related domains, which will serve as a backbone for further processing. Following the notation introduced in Subsection 3.1, we consider a "hard" mapping 𝑓 : T → 𝐾 directly to the space of ontologies of interest.

3.2.1. Zero-shot classification

The method consists of assigning an ontology according to the similarity between the annotation embeddings and the embedding of an input text. The method is simple and does not require model fine-tuning, which allows us to quickly establish a baseline for other experiments. The common choices of similarity measure are the Euclidean and cosine distances; we choose the latter in our experiments. The reason is that for some embedding algorithms the vector length may be influenced by the input text size, so vectors corresponding to semantically close texts may generally point in the same direction but be dissimilar in terms of Euclidean distance.

3.2.2. Supervised classification based on ontology annotations

This method relies on the supervision provided by ontology annotation attributes. Given an ontology set 𝐾, we
can generate a dataset of annotation–ontology label pairs and use it for supervised training. Under the aforementioned assumptions, we can directly assign input texts to ontologies using the trained model.

3.2.3. Negative sampling

This method extends the method above by adding a "None" class, denoting that the input text does not relate to any of the given ontologies. The annotation dataset is extended by:

• Sentences extracted from scientific papers from unrelated domains, labeled with the "None" label.
• Sentences extracted from papers from related domains, with a different objective during training: for related input texts, instead of maximizing the model output scores for a ground-truth class, we minimize the output score for the "None" class.

This method is intended to partially counter the possible input distribution difference between ontology annotations and scientific texts.

4. Experiments

4.1. Setup

We conduct our experiments on a set of five ontologies related to the chemical domain (Table 1). The ontologies NCIT, CHMO and Allotrope are considered to be the closest to it, while Chemical Entities of Biological Interest (CHEBI) has only a subset of relevant entities. SBO was selected because it contains some general laboratory and computational contexts, which can serve as a test of whether the tools used can also identify ontologies not fitting the text content.

We also selected 28 scientific papers as inputs for assessment, consisting of 25 research and 3 review papers. Those papers deal with the topic of methanation of CO2 and consist in sum of 1.3M symbols.

Table 1
Sizes of the considered ontologies

Ontology        Classes   Annotations
CHEBI [11]      171058    51095
NCIT [12]       170300    133478
Allotrope [13]  2893      2677
CHMO [15]       3084      2895
SBO [14]        693       692

We use the pretrained fastText model by [16] and the recobo/chemical-bert-uncased [17] checkpoint of a BERT implementation [18] from the HuggingFace repository. For preprocessing we use spaCy [19] with the scispaCy [20] model en_ner_bc5cdr_md. For the remaining machine learning models, PyTorch implementations were used. For the 3D visualization method Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [21], we used the implementation described in [22].

Due to the lack of ground-truth matching data, we assess the performance primarily through inspecting the resulting input sentence–annotation pairs.

4.1.1. Text preprocessing

We employ the following text preprocessing pipeline before constructing input embeddings:

1. *Split an input text into sentences with a spaCy model.
2. *Filter valid sentences, which contain at least two nouns and a verb.
3. *Filter out sentences with non-paired parentheses and ill-parsed formulas or composed terms.
4. (BERT) Tokenize with the tokenizer coming with the model.
4. (fastText) Convert to lowercase and split into words.

The points marked with an asterisk are applied to new sentences from scientific papers only.

4.2. Text Similarity

Zero-shot setup. We start with representation learning on the annotations using the fastText and BERT algorithms and inspect the produced embeddings. For the dimensionality reduction, we use the UMAP algorithm with the number of neighbors set to 15, minimum distance 0.5 and the cosine metric. We have found that 3-dimensional embeddings preserve substantially more information, allowing us to separate clusters that may be inseparable in 2D. The result is illustrated in Figure 1; three example sentences, together with the annotations assigned to them by fastText and BERT, are shown in Table 3.

Table 2
Zero-shot statistics for the distances of sentences to the closest ontology annotations

Embeddings   Mean of closest distance   Std. dev. of closest distance
fastText     0.846                      0.086
BERT         0.605                      0.038

Those visualizations and Table 2 suggest that the model embeds input papers separately from the ontology annotations, which may indicate a distribution shift.
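Two of the model-independent preprocessing filters from Subsection 4.1.1 (the noun/verb count and the paired-parentheses check) can be sketched as follows. This is a simplified illustration: the part-of-speech tags are assumed to come from the spaCy pipeline (`token.pos_`) rather than being computed here.

```python
def is_valid_sentence(pos_tags):
    """Step 2 of the pipeline: keep a sentence only if it contains
    at least two nouns and a verb. The coarse POS tags are assumed
    to come from a spaCy model."""
    return pos_tags.count("NOUN") >= 2 and pos_tags.count("VERB") >= 1

def has_paired_parentheses(sentence):
    """Step 3: reject sentences with non-paired parentheses, a symptom
    of ill-parsed formulas or composed terms."""
    depth = 0
    for ch in sentence:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing parenthesis without an opener
                return False
    return depth == 0

assert is_valid_sentence(["NOUN", "VERB", "ADP", "NOUN"])
assert not is_valid_sentence(["NOUN", "VERB"])
assert has_paired_parentheses("CO2 methanation (over Ni) was studied.")
assert not has_paired_parentheses("decomposition will occur (see Fig. 3")
```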
Table 3
Pairs of a new sentence from the scientific papers and the closest ontology annotation. The "carbon dioxide" annotation was assigned by BERT to all three example new sentences. While BERT embeddings are more discriminative for the ontology classification task, the assigned sentences and the low-dimensional embeddings in Figure 3 indicate that this approach is more sensitive to the distribution shift problem.

New sentence: "Also there is an upper limit of operation above which thermal decomposition will occur."
fastText closest: "An end event specification is an event specification that is about the end of some process."
BERT closest: "Carbon dioxide gas is a gas that is composed of carbon dioxide molecules."

New sentence: "The difference is the main adsorption species during the reaction."
fastText closest: "Reaction scheme where the products are created from the reactants [...]"
BERT closest: "Carbon dioxide gas is a gas that is composed of carbon dioxide molecules."

New sentence: "This enhancement of the Ni dispersion is very relevant because as reported in the literature [78] NiO sites [...]"
fastText closest: "The name of the individual working for the sponsor responsible for overseeing the activities of the study."
BERT closest: "Carbon dioxide gas is a gas that is composed of carbon dioxide molecules."
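Closest-annotation pairs such as those in Table 3 result from a nearest-neighbour search under the cosine distance, which can be sketched as follows. The two-dimensional vectors below are placeholders for illustration only; the actual pipeline uses fastText and BERT embeddings.

```python
import numpy as np

def closest_annotation(sentence_vec, annotation_vecs):
    """Return the index of the closest ontology annotation and its cosine
    distance. Cosine distance is preferred over Euclidean because embedding
    norms may depend on the input text length."""
    a = annotation_vecs / np.linalg.norm(annotation_vecs, axis=1, keepdims=True)
    s = sentence_vec / np.linalg.norm(sentence_vec)
    sims = a @ s                        # cosine similarities to every annotation
    idx = int(np.argmax(sims))
    return idx, 1.0 - float(sims[idx])  # cosine distance = 1 - similarity

# Placeholder 2-dimensional "embeddings", for illustration only.
annotations = np.array([[1.0, 0.0], [0.0, 1.0]])
idx, dist = closest_annotation(np.array([3.0, 0.1]), annotations)
assert idx == 0 and dist < 0.01
```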
Figure 1: A 3-dimensional projection of annotation embeddings produced by (a) fastText and (b) BERT. In the case of fastText, the SBO, Allotrope, and CHMO annotations are located in tiny areas, primarily close to the center of the image.

Ontology matching as text classification. As mentioned in Subsection 3.2, another potential strategy to solve the problem is to treat it as a classification task. If the distributions of input texts and corresponding ontologies are the same, we can train a classifier on ontology annotations and apply it to input texts.

We implement this by embedding ontology annotations with BERT and training over them a shallow fully-connected multilayer perceptron (MLP) with a single 768-dimensional hidden layer. Due to the significant difference in sizes between the ontologies, we proportionally oversample minority data points. The classifier reaches 0.987 validation accuracy after single-shot validation on the annotations from all the classes, which indicates their good separability for different ontologies, cf. Figures 2 and 3.

However, if we preprocess input texts and embed them in this way, inspection shows that their distribution significantly differs from the distribution of ontology annotations. The visualizations in Figures 2 and 3 show a dense separate cluster of sentences parsed from scientific papers.

Negative sampling. As an attempt to counter this issue, we introduced scientific texts into the training data. We sampled 400 scientific texts from the chemical domain (as positive examples) and 400 from unrelated domains (as negatives). The model is trained on two objectives:

1. Cross-entropy loss if the input is an ontology annotation (same as before).
2. Binary cross-entropy loss if the input is a sentence from a scientific paper. The model minimizes the probability of a special "Negative" class output for a related scientific text, and maximizes it for an unrelated one.

In this setting we first train the head over BERT until convergence, leaving the backbone frozen. Considering only the ontology annotations and leaving aside the sampled sentences, the model reaches 0.984 validation accuracy, which is very similar to the performance of the classifier described above.

After that, we fine-tune the whole BERT model. The model reaches 0.958 validation accuracy after single-shot validation on the combined annotation and paper sentence dataset, with the confusion matrix in Figure 4. As
we will show later, mixing in sampled sentences from both relevant and irrelevant scientific texts allowed us to improve classification accuracy over the classifier on top of BERT.

(a) The progress of training and validation accuracy during training. The blue (above) and orange (below) lines indicate the training and validation accuracy, respectively.
(b) Annotations from ontologies and sentences from 14 scientific papers embedded by BERT.
Figure 2: Training plot and a 3-dimensional projection of the embeddings produced by BERT in the classification approach. The visualization gives an intuition of the distribution gap between the scientific texts for which we would like to find the most relevant ontology, and the ontology annotations.

(a) 3-dimensional projection of the embeddings produced by the fine-tuned BERT.
(b) 3-dimensional projection of the activations of the hidden layer of the MLP trained over BERT.
Figure 3: Visualization of the BERT embedding phase and the MLP classification phase of the ontology classification task with the fine-tuned BERT in the negative sampling setting.

Despite the good separability of individual ontologies and the additional optimization criterion, the UMAP embeddings look similar to the previous setup in terms of clustering the input sentences into a separate subspace.

It is worth noting that the classifier and negative sampling models produce softmax scores, which can be interpreted as a class probability distribution. However, neural networks tend to be overconfident in their outputs [23], so additional calibration is needed before using the outputs for relevance estimation.

Statistical results. To compare the models, we first conduct the Friedman test to check whether the models perform the same. We perform a stratified split of the validation dataset with the ontology annotations into 50 samples and test the following hypothesis:

Hypothesis H1 (Null): All six models perform the same on the validation splits.

The Friedman test resulted in rejection of the null hypothesis at the significance level of 5%. To further compare the models, we perform the Wilcoxon signed-rank test on each pair of models. We make the following assumptions about the algorithms:

• For a larger 𝑘, the 𝑘NN classifier can work the same or better than the 1NN.
• The neural network model can fit training data the same or better than the 𝑘NN.
• Negative sampling results in a non-decrease or an improvement in the model generalization.

Hypothesis H2 (Null for 𝑘NN models): The 10NN models perform the same as their 1NN variants.

While 1NN is a common setting for many NLP systems, it may produce complex decision boundaries and lead to overfitting. We test a larger 𝑘 versus one to determine whether this is an issue in our setup.
Figure 4: Confusion matrix of the MLP classification over the fine-tuned BERT for a dataset consisting of the annotations from all five considered ontologies and the sentences of the additional 400 related and 400 unrelated scientific papers.

Hypothesis H3 (Null for the neural network classifier): The NN classifier performs the same as the 𝑘NN models on both BERT and fastText embeddings.

The assumption behind this hypothesis is that a neural network, as a universal approximator, can fit the data better than a nearest-neighbour classifier.

Hypothesis H4 (Null for the fine-tuned model with negative sampling): The fine-tuned BERT with negative sampling performs the same as the other considered models.

We suppose that the additional sampled sentences improve the model performance and help to avoid overfitting when fine-tuning the whole model instead of the head only.

Hypothesis H5 (Null for the rest): In each remaining pair, both models have the same performance.

We indicate the relative model performance in Figure 5. Considering the 5% significance level, the test rejected all the null hypotheses except H2, which was rejected only for the fastText embeddings. To explain that, we can note that there is a relatively sharp boundary between individual classes in the UMAP embeddings. If this also holds in the original space, a larger 𝑘 may suppress outlier noise but decrease classification accuracy near the boundary.

Figure 5: The comparison matrix of the six considered models. The (𝑖, 𝑗)-th element indicates the number of splits where the 𝑖-th model performed better than the 𝑗-th. Except for the one- and ten-nearest-neighbor classifiers over BERT embeddings, all the models demonstrate statistically significant differences. BERT NN denotes a neural network classifier trained over BERT embeddings.

5. Conclusion and Further Research

We are not aware of other works on unsupervised text-to-ontology mapping, so we are not able to discuss them and compare the proposed approach with previous methods.

The reported work in progress revealed that the distribution of the scientific texts substantially differs from that of the ontology annotations. In spite of the high classification accuracy both for the annotations from the considered ontologies and for the sentences of the additional 800 scientific papers, this leads to mapping into separate subsets of the embedding space. This is true even for the most sophisticated of the three investigated settings, with BERT fine-tuned using both the ontology annotations and scientific texts from related and unrelated domains.

To avoid such a loss of generality, future research could include an intermediate step of entity recognition. Using such recognized entities instead of raw text can help to separate the information in scientific papers that is directly related to concepts from ontologies from unrelated words, sentences and other parts of text not eliminated during preprocessing.
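In its simplest form, the entity-recognition step proposed above could reduce an input sentence to the tokens that match ontology class labels before embedding. The deliberately naive sketch below illustrates the idea only: the term set and the sentence are made up, and a real pipeline would use a trained NER model such as the scispaCy model mentioned in Subsection 4.1.

```python
def extract_known_entities(sentence, ontology_terms):
    """Toy stand-in for the proposed entity-recognition step: keep only
    tokens that match ontology class labels. A real pipeline would use a
    trained NER model instead of exact string matching."""
    tokens = sentence.lower().replace(",", " ").split()
    return [t for t in tokens if t in ontology_terms]

# Hypothetical term set for illustration only.
terms = {"carbon", "dioxide", "methanation", "nickel"}
ents = extract_known_entities("Methanation of carbon dioxide over nickel catalysts", terms)
assert ents == ["methanation", "carbon", "dioxide", "nickel"]
```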
Acknowledgments

The research reported in this paper has been supported by the German Research Foundation (DFG) funded projects 467401796 and NFDI2/12020.

References

[1] M. Wolf, J. Logan, K. Mehta, D. Jacobson, M. Cashman, A. M. Walker, G. Eisenhauer, P. Widener, A. Cliff, Reusability first: Toward FAIR workflows, in: 2021 IEEE International Conference on Cluster Computing (CLUSTER), 2021, pp. 444–455. doi:10.1109/Cluster48925.2021.00053.
[2] J. Grühn, A. S. Behr, T. H. Eroglu, V. Trögel, K. Rosenthal, N. Kockmann, From coiled flow inverter to stirred tank reactor – bioprocess development and ontology design, Chemie Ingenieur Technik 94 (2022) 852–863. doi:10.1002/cite.202100177.
[3] L. Hirschman, M. Krallinger, A. Valencia, J. Fluck, H.-T. Mevissen, H. Dach, M. Oster, M. Hofmann-Apitius, ProMiner: Recognition of human gene and protein names using regularly updated dictionaries, Proceedings of the Second BioCreAtIvE Challenge Evaluation Workshop (2007) 149–151.
[4] A. A. Morgan, Z. Lu, X. Wang, A. M. Cohen, J. Fluck, P. Ruch, A. Divoli, K. Fundel, R. Leaman, J. Hakenberg, C. Sun, H.-H. Liu, R. Torres, M. Krauthammer, W. W. Lau, H. Liu, C.-N. Hsu, M. Schuemie, K. B. Cohen, L. Hirschman, Overview of BioCreative II gene normalization, Genome Biology 9 Suppl 2 (2008) S3.
[5] R. Leaman, R. Islamaj Dogan, Z. Lu, DNorm: disease name normalization with pairwise learning to rank, Bioinformatics 29 (2013) 2909–2917.
[6] İ. Karadeniz, A. Özgür, Linking entities through an ontology using word embeddings and syntactic re-ranking, BMC Bioinformatics 20 (2019) 156. doi:10.1186/s12859-019-2678-8.
[7] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606 (2016).
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the NAACL, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[9] Z. Liu, F. Jiang, Y. Hu, C. Shi, P. Fung, NER-BERT: A pre-trained model for low-resource entity tagging, CoRR abs/2112.00405 (2021). arXiv:2112.00405.
[10] K. Lu, A. Grover, P. Abbeel, I. Mordatch, Pretrained transformers as universal computation engines, CoRR abs/2103.05247 (2021). arXiv:2103.05247.
[11] J. Hastings, G. Owen, A. Dekker, M. Ennis, N. Kale, V. Muthukrishnan, S. Turner, N. Swainston, P. Mendes, C. Steinbeck, ChEBI in 2016: Improved services and an expanding collection of metabolites, Nucleic Acids Research 44 (2015) D1214–9.
[12] National Cancer Institute Thesaurus, 2022. URL: https://bioportal.bioontology.org/ontologies/NCIT.
[13] Allotrope Foundation ontologies, 2022. URL: https://www.allotrope.org/ontologies.
[14] Systems Biology Ontology, 2022. URL: https://github.com/EBI-BioModels/SBO.
[15] Chemical Methods Ontology, 2022. URL: https://obofoundry.org/ontology/chmo.html.
[16] E. Kim, Z. Jensen, A. van Grootel, K. Huang, M. Staib, S. Mysore, H.-S. Chang, E. Strubell, A. McCallum, S. Jegelka, E. Olivetti, Inorganic materials synthesis planning with literature-trained neural networks, Journal of Chemical Information and Modeling 60 (2020) 1194–1201. doi:10.1021/acs.jcim.9b00995.
[17] BERT for the chemical industry, 2022. URL: https://huggingface.co/recobo/chemical-bert-uncased.
[18] BERT, 2022. URL: https://huggingface.co/docs/transformers/model_doc/bert.
[19] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, 2017.
[20] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and robust models for biomedical natural language processing, in: Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 319–327. doi:10.18653/v1/W19-5034.
[21] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, 2018. arXiv:1802.03426.
[22] L. McInnes, J. Healy, N. Saul, L. Grossberger, UMAP: Uniform manifold approximation and projection, The Journal of Open Source Software 3 (2018) 861.
[23] Y. Gal, Uncertainty in Deep Learning, Ph.D. thesis, University of Cambridge, 2016.