Biomedical Ontology Alignment with BERT

Yuan He¹, Jiaoyan Chen¹, Denvar Antonyrajah², and Ian Horrocks¹

¹ Department of Computer Science, University of Oxford, UK
² Samsung Research, UK

Abstract. Existing machine learning-based ontology alignment systems often adopt complicated feature engineering or traditional non-contextual word embeddings. However, despite their model complexity, they are often outperformed by rule-based systems. This paper proposes a novel ontology alignment system based on a contextual embedding model named BERT, aiming to sufficiently utilize the text semantics implied by ontologies. Our results on two biomedical alignment tasks demonstrate that, despite using the to-be-aligned classes alone as the input, our system outperforms the leading systems LogMap and AML.

Keywords: Ontology Alignment · Contextual Embeddings · BERT

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Ontology alignment refers to matching semantically related entities from different ontologies, with the vision of integrating data from heterogeneous resources. The resulting mappings usually indicate equivalence or subsumption relationships, consequently providing a convenient means for merging two ontologies. Moreover, through alignment with other ontologies, we can introduce additional semantics for augmenting an individual ontology's quality assurance [4, 6, 13].

Independent development of ontologies results in different naming schemes, leading to a challenge in alignment. For example, the class named "lanugo" in the SNOMED ontology is named "primary hair" in the Foundational Model of Anatomy (FMA) ontology. Besides, real-world ontologies typically contain a large number of classes, which causes scalability issues during mapping discovery and makes it more difficult to distinguish classes with similar names (e.g., with overlapping sub-words) but distinct meanings, especially for systems that adopt string similarity-based lexical matching.

Leading systems such as LogMap [8] and AgreementMakerLight (AML) [5] approach alignment as a sequential process, with lexical matching typically being the first stage, followed by mapping extension and mapping repair. However, their lexical matching parts mostly take the text's surface form, without considering the semantics of words. More recent machine learning-based approaches such as DeepAlignment [10] and OntoEmma [15] adopt the word embedding technique, which projects words into vectors such that pairs of words with closer semantic meanings have a smaller Euclidean distance in the vector space. Nevertheless, these methods adopt traditional non-contextual word embedding methods, which assign each word a single representation and thus cannot exploit word-level contexts to help resolve ambiguity.

To tackle this problem, we propose an ontology alignment system based on BERT, a contextual word embedding model that has demonstrated its strength (through fine-tuning) in a wide range of Natural Language Processing (NLP) tasks such as question answering, named entity recognition, and sentiment analysis [2, 16, 18], but has not yet been fully investigated in ontology alignment.
The fundamental challenge of applying deep learning techniques to ontology alignment is that the number of reference mappings is often several orders of magnitude smaller than the number of candidates (i.e., class pairs) to predict, resulting in a lack of labelled data and an imbalance between positive and negative samples. Thus, previous research into supervised learning schemes usually involves complicated feature engineering and needs to address the extra noise brought by silver data (i.e., labelled data that are automatically generated by certain heuristics) [7, 15]. In contrast, fine-tuning the pretrained BERT on downstream tasks typically necessitates only a moderate amount of training data and avoids complex hand-crafted features. Furthermore, we also consider a critical issue in mapping prediction, i.e., reducing the quadratic complexity of searching all possible mappings.

To the best of our knowledge, we are among the first to develop a robust and general ontology alignment system using contextual embeddings. In comparison to the previous work by Neutel et al. [12], a preliminary study that employs BERT to match two domain ontologies, we establish a concrete and flexible pipeline that fits both the unsupervised and semi-supervised settings; we improve their mean token embedding and class token embedding models and use them as baselines; we optimize the mapping search; and we conduct more extensive experiments to examine our approaches.

We refer to our system as BERTMap³. As shown in Figure 1, it consists of the following steps (a structural sketch in code follows at the end of this section):

1. Corpora construction. We extract synonym and non-synonym pairs from various sources, including the input ontologies, known mappings and/or external knowledge. Such construction exploits the text semantics and avoids the need for hand-crafted features.
2. Fine-tuning. We then choose a suitable pretrained BERT variant and fine-tune it on our corpora for a classifier, which takes a moderate amount of data and training resources.
3. Mapping prediction. For each class pair, we take their labels as input to the classifier. Since one class may have multiple labels, we use the average of the output probabilities over all label combinations as the mapping value. To reduce the search space while keeping the recall, we use a sub-word inverted index based on BERT's tokenizer.

³ Code is available at: https://github.com/KRR-Oxford/BERTMap.

Fig. 1. Illustration of the BERTMap system: corpora (intra-ontology, cross-ontology, and complementary) are used to fine-tune a binary classifier whose weights are transferred from pretrained BERT; mapping prediction applies an optional string match first and retrieves candidates with a sub-word inverted index.

We evaluate BERTMap on (i) the FMA-SNOMED small fragment task of the OAEI Large BioMed Track (LargeBio)⁴, and (ii) its extended version FMA-SNOMED+, where the missing labels of SNOMED are augmented with labels from a more recent version of SNOMED. We compare BERTMap with four internal baselines (two lexical matching-based methods and two BERT token embedding-based models) and two leading systems, LogMap [8] and AML [5]. Our results demonstrate that BERTMap, despite using the to-be-aligned classes alone as the input, outperforms all the baselines on both tasks.

⁴ http://www.cs.ox.ac.uk/isg/projects/SEALS/oaei/.
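To make the three steps concrete, the following is a hypothetical top-level flow of the pipeline (src2tgt direction). All names are illustrative rather than the actual BERTMap API, and the step functions are passed in as parameters so the skeleton stays self-contained; concrete sketches of them appear with Section 3. The default threshold value is illustrative.

```python
def bertmap_align(src_classes, tgt_onto, build_corpora, fine_tune,
                  select_candidates, mapping_value, k=200, lam=0.999):
    # Step 1: synonym / non-synonym label pairs from the available sources.
    corpora = build_corpora()
    # Step 2: fine-tune a BERT binary (synonym) classifier on the corpora.
    classifier = fine_tune(corpora)
    # Step 3: for each source class, score the top-k candidates retrieved
    # via the sub-word inverted index, keep the best, and filter by lam.
    mappings = []
    for c in src_classes:
        scored = [(c2, mapping_value(c, c2, classifier))
                  for c2 in select_candidates(c, tgt_onto, k)]
        if scored:
            best, value = max(scored, key=lambda p: p[1])
            if value >= lam:
                mappings.append((c, best, value))
    return mappings
```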
2 Preliminaries

2.1 Problem Formulation

An ontology is typically defined as an explicit specification of a conceptualization. It often uses representational vocabularies to describe a domain of interest, with the main components being entities and axioms. Note that entities include classes, instances and properties. Ontology alignment involves matching cross-ontology entities with equivalence, subsumption or other more complicated relationships. In this work, we focus on equivalence alignment between classes.

The ontology alignment system takes as input a pair of ontologies, O and O', with class sets C and C', respectively. It first generates a set of scored mappings, which are triples of the form (c, c', P(c ≡ c')) with c ∈ C and c' ∈ C', where P(c ≡ c') ∈ [0, 1] denotes the probability score (a.k.a. mapping value) that c and c' are equivalent. In this paper, we determine the final output by preserving mappings with a score larger than a certain threshold λ ∈ [0, 1].

We further clarify some notations used in this paper. Note that a class of an ontology typically contains a list of labels (via annotation properties such as rdfs:label) that serve as alternative class names. We lowercase these aliases and remove any underscores before tokenization. We denote a preprocessed label as ω and the set of preprocessed labels of a class c as Ω(c).

Fig. 2. BERT pretraining (left) and fine-tuning (right) for binary classification.

2.2 BERT: Pretraining and Fine-tuning

BERT is a language representation model built on the bidirectional transformer encoder [14]. As shown in Figure 2, its input is a sequence composed of a special token [CLS], the tokens of two sentences A and B, and a special token [SEP] that separates A and B. Each token's initial embedding encodes its content, its position in the sequence, and the sentence it belongs to. The model has L successive layers of an identical architecture. Its main component is the multi-head self-attention block, which computes a contextual hidden representation for each token by considering the whole sequence output from the previous layer. The output of layer l is denoted as:

  f_bert(x, l) = (v_CLS^(l), v_1^(l), ..., v_N^(l), v_SEP^(l), v'_1^(l), ..., v'_{N'}^(l)) ∈ R^{(N+N'+2)×d}    (1)

where x is the input sequence, and the v_i^(l)s and v'_j^(l)s are the d-dimensional vectors of the corresponding tokens of sentences A and B, respectively. The last layer (l = L) outputs can be used as the input of downstream tasks or as the token embeddings. In contrast to traditional non-contextual word embedding techniques such as Word2Vec [11], which assign each token in the vocabulary only one embedding, BERT can distinguish different occurrences of the same token. For instance, given an input sentence "the bank robber was seen on the river bank", BERT computes different embeddings for the two occurrences of "bank", while a traditional model yields a single embedding that is biased towards the most frequent meaning of "bank" (probably the financial sense) in the training corpora.
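A minimal sketch of this behaviour, using the Hugging Face transformers library; the general-domain bert-base-uncased checkpoint is chosen purely for illustration (the experiments later use a biomedical variant), and the input is a single sentence, so the sequence is [CLS] A [SEP] rather than a full pair as in Equation (1):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "the bank robber was seen on the river bank"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    # Last-layer (l = L) token embeddings, one d-dimensional row per token.
    hidden = model(**inputs).last_hidden_state[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

# The two occurrences of "bank" receive different contextual vectors,
# so their cosine similarity is noticeably below 1.
i, j = [k for k, t in enumerate(tokens) if t == "bank"]
print(torch.cosine_similarity(hidden[i], hidden[j], dim=0).item())
```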
The framework of BERT involves pre-training and fine-tuning, where pre-training develops a multi-purpose model that learns vast background knowledge, and fine-tuning adjusts the parameters of the pre-trained BERT by further training on a downstream task. On the left of Figure 2, we illustrate that BERT is pre-trained on two tasks: Masked Language Modelling (MLM), which predicts tokens that are randomly masked in sentences A and B, and Next Sentence Prediction (NSP), which predicts whether sentence B follows A. On the right of Figure 2, we present the example of fine-tuning on a downstream paraphrasing task, where the pre-trained BERT is attached to an additional binary classification layer that outputs the probability that A and B are synonymous. By minimizing the cross-entropy loss on the training samples, the parameters of the pre-trained BERT are adjusted and the parameters of the additional layer are learnt. Pre-trained BERT models are usually publicly available and can be re-used for fine-tuning on various downstream tasks. In this paper, we conduct no pre-training but instead fine-tune an existing pre-trained BERT that has learnt biomedical background knowledge (see Section 4 for details).

3 BERTMap

3.1 Corpora Construction and Fine-tuning

Real-world ontologies are typically abundant in labels that serve as aliases of their classes. Labels of the same class or of semantically equivalent classes are intuitively synonymous in the domain of the input ontologies. On the other hand, non-synonymous pairs can be extracted from classes that are semantically distinct. For convenience, we use "synonym" and "non-synonym" to describe a synonymous and a non-synonymous label pair, respectively. The corpora of these pairs are divided into the following three categories (a construction sketch for the intra-ontology corpus follows the descriptions):

Intra-ontology corpus. For each input ontology, we regard each pair of labels associated with the same class as synonymous. We also construct identity synonyms to encode each label as a synonym of itself. For non-synonyms, we consider soft non-synonyms, which are labels from two classes chosen at random, and hard non-synonyms, which are labels from disjoint classes. Since class disjointness is rarely defined in an ontology, we infer it from the structure of the input ontology; in this paper, we simply assume that sibling classes are disjoint.

Cross-ontology corpus. The lack of annotated mappings makes it infeasible to apply fully supervised learning to ontology alignment. However, we can optionally employ a semi-supervised setting by assuming that a small portion of the mappings have been created by human experts. For each known mapping, we extract label pairs from its two classes as synonyms. Meanwhile, soft non-synonyms are extracted by matching a source class to a random target class, whereas hard non-synonyms are not available at the cross-ontology level because we have no predefined disjointness in the mappings. Also, we do not create identity synonyms here because they have already been considered in the intra-ontology corpus.

Complementary corpus. Besides the input ontologies, we can expand the synonym and non-synonym sets from external sources, especially other ontologies in the relevant domain. To reduce the potential noise and the corpus size, we can truncate the auxiliary ontology by considering only the classes whose labels can be matched to some class of the input ontologies.
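Below is a minimal sketch of the intra-ontology corpus construction under stated assumptions: classes are given as a dict from class id to preprocessed labels, sibling classes are taken as disjoint, and the sampling ratios are simplified relative to the settings in Section 4. All names are illustrative, not the BERTMap code itself.

```python
import random

def intra_ontology_corpus(labels, siblings, with_ids=True):
    # labels:   dict mapping a class id to its list of preprocessed labels
    # siblings: dict mapping a class id to the ids of its sibling classes
    synonyms, non_synonyms = set(), set()
    classes = list(labels)
    for c, ws in labels.items():
        # Same-class label pairs; both orders are kept so the classifier can
        # learn symmetry, and identity pairs (w, w) encode identity synonyms.
        synonyms.update((w1, w2) for w1 in ws for w2 in ws
                        if with_ids or w1 != w2)
        # Hard non-synonyms: labels of classes assumed disjoint (siblings).
        for s in siblings.get(c, []):
            non_synonyms.add((random.choice(ws), random.choice(labels[s])))
    # Soft non-synonyms: labels of two classes picked at random.
    for _ in range(2 * len(synonyms)):
        c1, c2 = random.sample(classes, 2)
        non_synonyms.add((random.choice(labels[c1]), random.choice(labels[c2])))
    # Random pairs may accidentally also be synonyms; such pairs are deleted.
    return synonyms, non_synonyms - synonyms
```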
The intra-ontology corpus, cross-ontology corpus and complementary corpus are denoted as io, co and cp, respectively. io is essential to BERTMap, while co and cp are optional. The identity synonyms are denoted as ids. For convenience, we use + to denote the combination of different corpora/synonyms; for example, io+ids refers to the intra-ontology corpus with identity synonyms considered, and io+co+cp refers to including all three corpora without identity synonyms. To learn the symmetry of synonymy, we also append reversed synonyms, i.e., if (ω1, ω2) is in the synonym set, (ω2, ω1) is also added as a synonym. Given a corpus setting, we obtain the corresponding synonym and non-synonym sets, and then fine-tune a pretrained BERT on them as introduced in Section 2.2. We evaluate various corpus settings in Section 4. Finally, since some non-synonym pairs are extracted by random class combination, they can occasionally appear in the synonym set; in such cases we delete the offending non-synonym pair.

3.2 Candidate Selection with Sub-word Inverted Index

Given input ontologies O and O', with class sets C and C', a naive algorithm for computing the alignment is to look up c' = argmax_{c' ∈ C'} P(c ≡ c') for every class c ∈ C, which results in a time complexity of O(n²). To reduce this, we use a sub-word inverted index based on BERT's WordPiece tokenizer [17]. The tokenizer's training algorithm first initializes the vocabulary with the single characters present in the training corpus and incrementally merges them into sub-words, adding the most likely combination at each iteration. We opt to use the built-in sub-word tokenizer rather than re-training it on our corpora because it has already been fitted to an enormous corpus (with 3.3 billion words) that covers various topics [3], and in this context we consider generality preferable to task specificity.

We build sub-word inverted indices for O and O' separately. Each entry of an index is a sub-word, and its values are the classes whose labels contain this sub-word after tokenization. With the indices, we can implement candidate selection very efficiently in the following way. We first restrict the search space of each source class c to the target classes that share at least one sub-word token with c. Next, we rank these target classes by a scoring metric based on inverse document frequency (idf), and the top k scored classes are chosen for subsequent mapping prediction. For a target class c', this scoring metric is computed as:

  s(c, c') = Σ_{t ∈ T(c) ∩ T(c')} idf(t) = Σ_{t ∈ T(c) ∩ T(c')} log₁₀(|C'| / |C'(t)|),

where T(·) is the set of sub-word tokens obtained from tokenizing all the labels of a class, C'(t) is the set of target classes that have token t after tokenization, and |·| denotes set cardinality. In this way, we reduce the search space from O(n²) to O(kn), where k is a constant.
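A minimal sketch of the index and the idf-based ranking, assuming class labels are given as a dict from class id to preprocessed labels and `tokenizer` is any transformers WordPiece tokenizer (e.g. AutoTokenizer.from_pretrained("bert-base-uncased")); the names here are illustrative:

```python
import math
from collections import defaultdict

def build_inverted_index(labels, tokenizer):
    # Each entry maps a sub-word token to the classes whose labels contain it.
    index = defaultdict(set)
    for c, ws in labels.items():
        for w in ws:
            for t in tokenizer.tokenize(w):   # e.g. "lanugo" -> "lan", "##ugo"
                index[t].add(c)
    return index

def select_candidates(src_labels, tgt_index, n_tgt, tokenizer, k=200):
    # Score every target class sharing a sub-word with the source class by the
    # summed idf over shared tokens, then keep the top-k scored classes.
    tokens = {t for w in src_labels for t in tokenizer.tokenize(w)}
    scores = defaultdict(float)
    for t in tokens:
        posting = tgt_index.get(t, set())
        if posting:
            idf = math.log10(n_tgt / len(posting))
            for c in posting:
                scores[c] += idf
    return sorted(scores, key=scores.get, reverse=True)[:k]
```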
Compared to the traditional word-level inverted index, our approach has the following advantages: (i) it captures various forms of words without requiring additional processing such as stemming or consulting a dictionary; (ii) it interprets unknown words by parsing them into consecutive known sub-words rather than treating them all as the same (unknown) token.

3.3 Mapping Prediction

Given the ranked candidate classes of a source class c, we first perform a string-match check to see whether any candidate class has at least one exactly matched label (after preprocessing as described in Section 2.1) with c, i.e., we search, following the ranking, for a c' such that Ω(c) ∩ Ω(c') ≠ ∅. We assign a mapping value of 1.0 to the first c' that satisfies this condition. If no such candidate class exists, we apply the fine-tuned BERT classifier: for each candidate class c', we predict the synonym probabilities of all label pairs (ω, ω') ∈ Ω(c) × Ω(c'), and take their average as the mapping value between c and c'. We return the top scored mapping for each c. A sketch of the per-pair scoring follows.

We can generate three mapping sets from the input source and target ontologies: (i) src2tgt, by looking for a target class c' ∈ C' for each source class c ∈ C; (ii) tgt2src, by looking for a source class c ∈ C for each target class c' ∈ C'; and (iii) combined, by merging src2tgt and tgt2src with duplicates removed. We finally output the determined mappings by filtering out mappings whose values are lower than a certain threshold λ ∈ [0, 1].
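A minimal sketch of the per-pair scoring, assuming `classifier` is a fine-tuned sequence-classification model (e.g. transformers' AutoModelForSequenceClassification) whose label index 1 is the synonym class, and `tokenizer` is its matching tokenizer; the names and the label-index convention are our assumptions:

```python
import itertools
import torch

def mapping_value(src_labels, tgt_labels, classifier, tokenizer):
    # String-match check first: any exactly matched preprocessed label
    # gives a mapping value of 1.0.
    if set(src_labels) & set(tgt_labels):
        return 1.0
    # Otherwise average the classifier's synonym probability over all
    # label combinations of the two classes.
    pairs = list(itertools.product(src_labels, tgt_labels))
    inputs = tokenizer([a for a, _ in pairs], [b for _, b in pairs],
                       padding=True, truncation=True, max_length=128,
                       return_tensors="pt")
    with torch.no_grad():
        logits = classifier(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[:, 1]   # P(synonym) per label pair
    return probs.mean().item()
```

Looping this over each source class's ranked candidates, keeping the top scored mapping, and filtering by λ yields src2tgt; tgt2src and combined follow analogously.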
4 Experiments

4.1 Datasets and Experiment Settings

Datasets and Tasks. We first evaluate BERTMap on the FMA-SNOMED small fragment task of the OAEI LargeBio Track. The input FMA and SNOMED ontologies (segments) have 10,157 and 13,412 classes, respectively. The dataset also includes a set of UMLS-based reference (ground truth) mappings for evaluating the systems, with 6,026 of them marked by "=" and 2,982 marked by "?". Mappings marked by "?" would cause logical conflicts after alignment, so they are regarded as neither positive nor negative in the evaluation. We construct the complementary corpus for the FMA-SNOMED task by utilizing class labels from the most recent version of the original SNOMED⁵. Note that the complementary labels are used in fine-tuning but not in prediction. To examine the scenario where the baseline systems can also use these additional labels, we consider the extended task FMA-SNOMED+, where the input ontology SNOMED is extended to SNOMED+ by incorporating these labels.

⁵ The version of 20210131 from https://www.nlm.nih.gov/healthit/snomedct/index.html.

BERTMap Settings. We set up different corpus settings for training. Recalling the corpus notations of Section 3.1, the unsupervised and semi-supervised learning settings are distinguished by whether co (the cross-ontology corpus) is included, while io (the intra-ontology corpus) is always considered. In the unsupervised learning setting, 80% of the fine-tuning corpus is used for training and 20% for validation; the final mapping prediction is evaluated on the full set of reference mappings. In the semi-supervised learning setting, the training data is formed by incorporating all the unsupervised fine-tuning data and co constructed from 20% of the reference mappings. We use an additional co constructed from 10% of the reference mappings as the validation set, and take the remaining 70% as the test mappings for evaluating mapping prediction. Note that validation here differs from testing: the former concerns fine-tuning while the latter concerns mapping prediction. We also examine the impact of ids (identity synonyms) on both tasks, and the impact of cp (the complementary corpus) on FMA-SNOMED.

In the implementation, we consider all the synonyms in the positive sample set, and randomly sample 2 soft non-synonyms and 2 hard non-synonyms for each synonym in io, and 4 soft non-synonyms for each synonym in co. We perform the same negative sampling procedure on cp as on io because cp is also a corpus derived from a single (external) ontology. As a result, the positive-negative ratio is consistently 1:4 for all the corpus settings. For settings that consider ids, we sample the corresponding number of non-synonyms to keep this ratio. We adopt Bio-Clinical BERT, which has been pretrained on biomedical and clinical domain corpora [1]. We fine-tune the BERT model for 3 epochs with a batch size of 32, evaluating it on the validation set every 0.1 epoch and selecting the best checkpoint (on the cross-entropy loss) for testing. The input maximum length is set to 128. In prediction, the number of candidates selected using the sub-word inverted index is set to 200. Our implementation uses owlready2⁶ and transformers⁷.

⁶ https://owlready2.readthedocs.io/en/latest/.
⁷ https://huggingface.co/transformers/.

Baselines. We compare BERTMap with the following baselines:

1. String-match. It sets the mapping value of two classes c and c' to 1.0 if Ω(c) ∩ Ω(c') ≠ ∅, and to 0 otherwise.
2. Edit-similarity. Given a source class c, it predicts the target class c' and the corresponding mapping value by argmax_{c' ∈ ζ(c)} nes(Ω(c), Ω(c')), where nes(·, ·) refers to the maximum normalized edit similarity between the labels of c and c', and ζ(c) denotes the candidates of c, selected in the same way as in BERTMap. Note that string-match is a special case of edit-similarity.
3. Mean-embeds and Cls-embeds. BERT outputs token embeddings f_bert(x, l) at layer l (see (1)). Mean-embeds (the mean token embedding model) takes the mean of all the token embeddings of the last layer L as the embedding of a class label, and calculates the cosine similarity of two classes' embeddings as their mapping value; the embeddings of multiple labels of a class are averaged. Cls-embeds (the class token embedding model) is the same except that it takes the class token embedding of the last layer, i.e., v_CLS^(L), as a class label's embedding. As in BERTMap, string-match is applied first, before calculating the cosine similarity (a sketch of both models follows this list).
4. LogMap, AML and LogMapLt. LogMap and AML are two lexical matching and reasoning based systems with leading performance in many OAEI tracks and other tasks. Since LogMap and AML consider the neighbourhoods and relevant logical axioms of two classes, while BERTMap at the current stage only considers class labels, we additionally introduce LogMapLt, which uses only the lexical matching part of LogMap, for comparison.
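The two embedding baselines can be sketched as follows. The Hugging Face checkpoint id for Bio-Clinical BERT (emilyalsentzer/Bio_ClinicalBERT) is our assumption of the released model from [1], and the padding-masked mean is one reasonable reading of "mean of all the token embeddings":

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

def class_embedding(labels, use_cls=False):
    # Mean-embeds: mean of all last-layer token embeddings, averaged over labels.
    # Cls-embeds (use_cls=True): the last-layer [CLS] embedding instead.
    inputs = tokenizer(labels, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # labels x tokens x d
    if use_cls:
        per_label = hidden[:, 0]                        # [CLS] vector per label
    else:
        mask = inputs["attention_mask"].unsqueeze(-1)   # ignore padding tokens
        per_label = (hidden * mask).sum(1) / mask.sum(1)
    return per_label.mean(0)

def embed_mapping_value(src_labels, tgt_labels, use_cls=False):
    e1 = class_embedding(src_labels, use_cls)
    e2 = class_embedding(tgt_labels, use_cls)
    return torch.cosine_similarity(e1, e2, dim=0).item()
```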
Fig. 3. Precision, Recall and Macro-F1 of BERTMap as the mapping value threshold λ ranges from 0 to 1. The top three plots are for the io+co+ids setting on the FMA-SNOMED task, and the bottom three are for the io+ids setting on the FMA-SNOMED+ task; the maximum F1 is indicated by a red vertical line.

4.2 Results

We report the resulting Precision, Recall and Macro-F1 scores in Tables 1 and 2. Note that, in the semi-supervised learning setting, we measure the performance only on the test mappings. For the semi-supervised models, we search for the optimal combination of mapping set (src2tgt, tgt2src or combined) and corresponding mapping value threshold λ that leads to the best Macro-F1 score on the validation mappings (10%). For the unsupervised models, we select the best combination on the full mappings due to the shortage of validation mappings; nevertheless, as shown later, the BERTMap models are robust to mapping threshold selection, and similar results would be obtained if a reasonable validation set were provided. For the baselines, we also select the best mapping set-threshold combination for each of them.

The overall results show that BERTMap attains the best F1 score (typically with high recall) among all the systems on both tasks. On the FMA-SNOMED task, the best unsupervised BERTMap model surpasses AML (resp. LogMap) by 2.0% (resp. 4.6%) in F1, while the best semi-supervised BERTMap model exceeds AML (resp. LogMap) by 3.7% (resp. 6.1%). The corresponding figures are 1.8% (resp. 1.0%) and 2.9% (resp. 2.3%) on the FMA-SNOMED+ task.

Table 1. BERTMap and baseline results on the FMA-SNOMED task.

                                 Full Mappings                 Test Mappings
  System                   Precision  Recall  Macro-F1   Precision  Recall  Macro-F1
  Unsupervised
    io                       0.321    0.625    0.424       0.248    0.621    0.354
    io+ids                   0.635    0.727    0.678       0.561    0.704    0.625
    io+cp                    0.862    0.822    0.842       0.867    0.786    0.825
    io+cp+ids                0.860    0.824    0.842       0.866    0.782    0.822
  Semi-supervised
    io+co                      NA       NA       NA        0.822    0.773    0.797
    io+co+ids                  NA       NA       NA        0.821    0.747    0.782
    io+co+cp                   NA       NA       NA        0.839    0.824    0.832
    io+co+cp+ids               NA       NA       NA        0.875    0.813    0.843
  Baselines
    string-match             0.988    0.196    0.328       0.983    0.192    0.321
    edit-similarity          0.523    0.386    0.444       0.430    0.378    0.402
    mean-embeds              0.464    0.500    0.481       0.422    0.450    0.436
    cls-embeds               0.522    0.242    0.331       0.970    0.192    0.321
    AML                      0.902    0.758    0.824       0.865    0.754    0.806
    LogMap                   0.942    0.689    0.796       0.918    0.681    0.782
    LogMapLt                 0.969    0.208    0.342       0.956    0.204    0.336

Table 2. BERTMap and baseline results on the FMA-SNOMED+ task.

                                 Full Mappings                 Test Mappings
  System                   Precision  Recall  Macro-F1   Precision  Recall  Macro-F1
  Unsupervised
    io                       0.893    0.874    0.883       0.911    0.834    0.871
    io+ids                   0.932    0.833    0.880       0.906    0.832    0.868
  Semi-supervised
    io+co                      NA       NA       NA        0.913    0.841    0.875
    io+co+ids                  NA       NA       NA        0.913    0.836    0.873
  Baselines
    string-match             0.975    0.686    0.805       0.964    0.678    0.796
    edit-similarity          0.965    0.750    0.844       0.950    0.746    0.836
    mean-embeds              0.972    0.690    0.807       0.960    0.683    0.798
    cls-embeds               0.972    0.686    0.805       0.963    0.678    0.796
    AML                      0.905    0.828    0.865       0.868    0.825    0.846
    LogMap                   0.880    0.865    0.873       0.838    0.868    0.852
    LogMapLt                 0.958    0.718    0.821       0.940    0.709    0.808

The string-match and edit-similarity baselines perform much better on the FMA-SNOMED+ task than on the FMA-SNOMED task because they rely on the sufficiency of class labels in the input ontologies, whereas BERTMap can learn from external resources. Edit-similarity is consistently better than string-match because it already covers all the string-match cases, but it is still worse than LogMap and AML. Note that we also apply string-match for the mean-embeds and cls-embeds models before calculating the cosine similarity between the source and target classes' embeddings. Still, their results are barely better than the string-match baseline. This suggests that directly using pretrained BERT to encode class embeddings and calculating their distance in vector space is not adequate: fine-tuning is needed to utilize the BERT embeddings effectively.
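For reference, a minimal sketch of the evaluation used throughout this section, with the "?" mappings counted as neither positive nor negative (our reading of the OAEI convention; all names are illustrative):

```python
def evaluate(predicted, reference, ignored=frozenset()):
    # predicted / reference: sets of (source_class, target_class) pairs;
    # `ignored` holds the "?" mappings, excluded from the evaluation entirely.
    predicted = set(predicted) - set(ignored)
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```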
Compared to LogMap's lexical matcher, LogMapLt, BERTMap performs better by over 50% in F1 on FMA-SNOMED and by 6% on FMA-SNOMED+, implying the potential of BERTMap to become more powerful when it is extended to incorporate structural and logical information. For example, we can adjust the mapping values by taking the alignment of neighbouring classes into account. We can also apply a reasoning-based ontology repair module [9] to prune the mapping set.

Regarding the BERTMap settings, we observe that when the input ontologies have sufficient labels (i.e., on the FMA-SNOMED+ task), considering the intra-ontology corpus alone already yields promising results. BERTMap also performs better when it incorporates the cross-ontology and complementary corpora, especially on the FMA-SNOMED task, where the SNOMED ontology is deficient in labels. Including the identity synonyms can also improve the performance, though the gains are less prominent.

Finally, in Figure 3, we illustrate the effect of the mapping value threshold λ on BERTMap under two settings, i.e., io+co+ids on FMA-SNOMED and io+ids on FMA-SNOMED+. We observe that in all cases the highest F1 scores are achieved when λ is very close to 1.0, and that as λ increases, Precision grows significantly while Recall does not drop much. This suggests that BERTMap is robust with respect to the selection of an appropriate λ.

5 Conclusion and Discussion

In this study, we investigate ontology alignment using a contextual embedding-based system, BERTMap, that exploits the text semantics of ontologies. Rather than using a complex combination of machine learning and hand-crafted features, we construct a straightforward BERT fine-tuning task that learns the meanings of class labels, and we apply the resulting classifier to mapping prediction, which leads to promising results on two biomedical ontology alignment tasks. BERTMap is suitable for real-world applications because it supports both unsupervised and semi-supervised modes, and can readily incorporate external materials when the input ontologies are incomplete in class labels. As part of our future work, we aim to develop mapping extension and repair modules so as to make BERTMap a full-fledged ontology alignment system.

Acknowledgments

This work was supported by the SIRIUS Centre for Scalable Data Access (Research Council of Norway, project 237889), Samsung Research UK, Siemens AG, and the EPSRC projects AnaLOG (EP/P025943/1), OASIS (EP/S032347/1), UK FIRES (EP/S019111/1) and the AIDA project (Alan Turing Institute).

References

1. Alsentzer, E., Murphy, J., Boag, W., Weng, W.H., Jindi, D., Naumann, T., McDermott, M.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. pp. 72–78 (Jun 2019)
2. Clark, C., Lee, K., Chang, M.W., Kwiatkowski, T., Collins, M., Toutanova, K.: BoolQ: Exploring the surprising difficulty of natural yes/no questions. In: Proceedings of NAACL-HLT. pp. 2924–2936 (2019)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT. pp. 4171–4186 (2019)
4. Euzenat, J., Meilicke, C., Stuckenschmidt, H., Shvaiko, P., Santos, C.: Ontology alignment evaluation initiative: Six years of experience. J. Data Semant. 15, 158–192 (2011)
5. Faria, D., Pesquita, C., Santos, E., Palmonari, M., Cruz, I.F., Couto, F.M.: The AgreementMakerLight ontology matching system.
In: Meersman, R., Panetto, H., Dillon, T., Eder, J., Bellahsene, Z., Ritter, N., De Leenheer, P., Dou, D. (eds.) On the Move to Meaningful Internet Systems: OTM 2013 Conferences. pp. 527–541. Springer, Berlin, Heidelberg (2013)
6. Horrocks, I., Chen, J., Lee, J.: Tool support for ontology design and quality assurance (2020)
7. Iyer, V., Agarwal, A., Kumar, H.: VeeAlign: A supervised deep learning approach to ontology alignment. In: OM@ISWC (2020)
8. Jiménez-Ruiz, E., Cuenca Grau, B.: LogMap: Logic-based and scalable ontology matching. In: Aroyo, L., Welty, C., Alani, H., Taylor, J., Bernstein, A., Kagal, L., Noy, N., Blomqvist, E. (eds.) The Semantic Web – ISWC 2011. pp. 273–288. Springer, Berlin, Heidelberg (2011)
9. Jiménez-Ruiz, E., Meilicke, C., Grau, B.C., Horrocks, I.: Evaluating mapping repair systems with large biomedical ontologies. In: Description Logics (2013)
10. Kolyvakis, P., Kalousis, A., Kiritsis, D.: DeepAlignment: Unsupervised ontology matching with refined word vectors. In: Proceedings of NAACL-HLT. pp. 787–798 (Jun 2018)
11. Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR (2013)
12. Neutel, S., Boer, M.D.: Towards automatic ontology alignment using BERT. In: AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering (2021)
13. Shvaiko, P., Euzenat, J.: Ontology matching: State of the art and future challenges. IEEE Transactions on Knowledge and Data Engineering 25(1), 158–176 (2013)
14. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017)
15. Wang, L., Bhagavatula, C., Neumann, M., Lo, K., Wilhelm, C., Ammar, W.: Ontology alignment in the biomedical domain using entity definitions and context. In: Proceedings of the BioNLP 2018 Workshop. pp. 47–55 (Jul 2018)
16. Wu, Q., Lin, Z., Karlsson, B., Lou, J.G., Huang, B.: Single-/multi-source cross-lingual NER via teacher-student learning on unlabeled data in target language. In: Proceedings of ACL. pp. 6505–6514 (Jul 2020)
17. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., Dean, J.: Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR (2016)
18. Yin, D., Meng, T., Chang, K.W.: SentiBERT: A transferable transformer-based architecture for compositional sentiment semantics. In: Proceedings of ACL. pp. 3695–3706 (Jul 2020)