Biomedical Entity Normalization based on Pre-trained Model with Enhanced Information

Lu Fang, Yiling Cao, and Zhongguang Zheng
Fujitsu R&D Center Co., Ltd., Beijing 100022, China
{fanglu,caoyiling,zhengzhg}@fujitsu.com

Abstract. Biomedical entity normalization, which links entity mentions in biomedical texts to their corresponding standard concepts in a knowledge base (KB) or an ontology, is an important task in biomedical text mining. A prevalent solution is to generate the most similar concepts and then rank those concepts with semantic models. Herein, to improve candidate concept ranking for entity normalization, we rank the candidates by fine-tuning the domain-specific pre-trained BioBERT model and enhancing the representation with information from the entity mention and its candidates. We achieve significant improvement over the state-of-the-art methods on the Bacteria Biotope data of BioNLP-OST19 (https://sites.google.com/view/bb-2019/home).

Keywords: Biomedical Entity Normalization · KB · BioBERT

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Mapping entity mentions in texts to a standard knowledge base (KB) or an ontology is a fundamental task that links unstructured text to structured data. Ambiguity and variation are the main challenges of this task. Unlike in the general domain, variation is much more common than ambiguity in the biomedical domain. A variety of methods have therefore been proposed to deal with this challenge, including rule-based methods [1] as well as machine learning and deep learning based methods [2, 3].

Recently, pre-trained models have been applied to many NLP tasks in the biomedical domain, such as named entity recognition and relation classification, resulting in significant improvements. BioBERT [4], which is based on the BERT model and pre-trained on large-scale biomedical articles, has been applied to many biomedical NLP tasks to improve state-of-the-art performance, but few researchers have used such models for entity normalization.

In this paper, we propose a biomedical entity normalization approach based on fine-tuning the BioBERT model. We enhance the representation by using the embedding of the special first token as well as the embeddings of the entity mention and its candidate. Our approach achieves significant improvement over the best scores in the BB-norm shared task of BioNLP-OST19.

2 Approach

Our approach includes two principal steps: (1) candidate concept generation: for a given biomedical entity mention, generate candidate concepts from the KB or ontology; (2) candidate concept ranking: rank those candidate concepts. Further details are provided in the following sections.

2.1 Candidate Concept Generation

We first pre-process all entity mentions and concept names in the KB with abbreviation resolution and tokenization. The Ab3P tool (https://github.com/ncbi-nlp/Ab3P) is used to identify abbreviations in documents and replace abbreviated entity mentions with their corresponding full names. The Snowball toolkit (http://www.nltk.org/_modules/nltk/stem/snowball.html) is used to tokenize all the entity mentions and concept names. We then apply two types of methods to generate the candidate concepts.

Similarity based method: We calculate the cosine similarity between vector representations of each concept name and the mention, as well as the Jaccard similarity between them [5]. We choose the top n1 concept names whose cosine similarity is greater than or equal to a threshold t1 as a set C1, and the top n2 concept names whose Jaccard similarity is greater than or equal to a threshold t2 as a set C2. The final candidate set is the union C1 ∪ C2. In our experiments on the training data, we set t1 = 0.7, t2 = 0.1, n1 = 3, n2 = 7.
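To make the selection concrete, the following Python sketch illustrates the thresholding and union logic under stated assumptions: the tokenize argument stands in for the Snowball-based tokenization above, and cosine similarity is computed over simple token-count vectors, since the paper does not specify which vector representation it uses.

```python
# Minimal sketch of the similarity-based candidate generation step.
from collections import Counter
from math import sqrt

def cosine(a_tokens, b_tokens):
    """Cosine similarity between token-count vectors of two name strings."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def jaccard(a_tokens, b_tokens):
    """Jaccard similarity between the token sets of two name strings."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def generate_candidates(mention, concept_names, tokenize,
                        t1=0.7, t2=0.1, n1=3, n2=7):
    """Return the union C1 ∪ C2 of top-n1 cosine and top-n2 Jaccard candidates."""
    m = tokenize(mention)
    cos_scored = [(c, cosine(m, tokenize(c))) for c in concept_names]
    jac_scored = [(c, jaccard(m, tokenize(c))) for c in concept_names]
    c1 = [c for c, s in sorted(cos_scored, key=lambda x: -x[1]) if s >= t1][:n1]
    c2 = [c for c, s in sorted(jac_scored, key=lambda x: -x[1]) if s >= t2][:n2]
    return list(dict.fromkeys(c1 + c2))  # ordered union of the two sets
```

The returned set (at most n1 + n2 names) is what the ranking step described below operates on.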
Information Retrieval based method: It is inefficient to calculate the similarity between a given mention and every concept name in the KB when the number of concepts is very large. In this case, information retrieval (IR) is a more efficient way to obtain similar concept names. We implement the IR system using Lucene (https://lucene.apache.org/). First, we index all mentions in the training data and all concept names together with their identifiers. Second, we retrieve the top 20 concept names for each mention; the final candidate concept names are then obtained from the retrieved results using the similarity based method described above.

2.2 Candidate Concept Ranking

We rank the candidate concepts by fine-tuning the pre-trained BioBERT model. Inspired by the work of [6], for each entity mention m and one of its candidates c, we feed the sequence [CLS] m [SEP] c to BioBERT for fine-tuning, where [CLS] marks the beginning of the sequence and [SEP] is a special token used to separate m and c. Let V be the final hidden states produced by the BioBERT model and d the dimension of the hidden states. $V_0 \in \mathbb{R}^d$ is the output for the first token [CLS], while $V_m = [V_i, \dots, V_j]$ and $V_c = [V_l, \dots, V_n]$ are the final hidden vectors for m and c respectively.

We obtain the representation $V'_m$ of m and the representation $V'_c$ of c by averaging the corresponding hidden vectors, applying a tanh activation and adding a fully connected layer, where m contains $j - i + 1$ tokens and c contains $n - l + 1$ tokens:

$$V'_m = W\Big[\tanh\Big(\frac{1}{j-i+1}\sum_{t=i}^{j} V_t\Big)\Big] + b, \qquad V'_c = W\Big[\tanh\Big(\frac{1}{n-l+1}\sum_{t=l}^{n} V_t\Big)\Big] + b \quad (1)$$

For $V_0$, an activation layer and a fully connected layer are also added:

$$V'_0 = W_0[\tanh(V_0)] + b_0 \quad (2)$$

We then concatenate $V'_0$, $V'_m$ and $V'_c$ and add a fully connected layer to generate the final representation for a mention and one of its candidates:

$$r = W_{con}[\mathrm{concat}(V'_0, V'_m, V'_c)] + b_{con} \quad (3)$$

In Equations (1)-(3), $W, W_0 \in \mathbb{R}^{d \times d}$, $W_{con} \in \mathbb{R}^{3d \times 3d}$, and $b$, $b_0$ and $b_{con}$ are bias vectors. Suppose there are K candidates for each entity mention, and let $r_k$ denote the final vector output by our model for the k-th candidate name. To rank the K candidates, we define $R = [r_1, r_2, \dots, r_K]$ and compute the probability that the k-th candidate is the normalized concept as

$$p(k \mid R) = \mathrm{sigmoid}(W_r R + b_r) \quad (4)$$

where $W_r \in \mathbb{R}^{3d}$ and $b_r$ is a bias. We use binary cross entropy as the loss function.
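A minimal PyTorch sketch of this ranking head is given below. It assumes the final BioBERT hidden states for one [CLS] m [SEP] c pair have already been computed (for example with a transformers-based BioBERT encoder) and that boolean masks marking the mention and candidate tokens are available; the class and argument names are ours, not from the authors' code, and Equation (4) is applied here candidate by candidate rather than to the stacked matrix R.

```python
import torch
import torch.nn as nn

class CandidateRankingHead(nn.Module):
    """Sketch of the ranking head of Section 2.2 (Equations (1)-(4)).

    Maps the BioBERT hidden states of one "[CLS] mention [SEP] candidate"
    sequence to a score in (0, 1) for that candidate.
    """

    def __init__(self, hidden_dim):
        super().__init__()
        self.fc_span = nn.Linear(hidden_dim, hidden_dim)            # shared W, b of Eq. (1)
        self.fc_cls = nn.Linear(hidden_dim, hidden_dim)             # W_0, b_0 of Eq. (2)
        self.fc_concat = nn.Linear(3 * hidden_dim, 3 * hidden_dim)  # W_con, b_con of Eq. (3)
        self.scorer = nn.Linear(3 * hidden_dim, 1)                  # W_r, b_r of Eq. (4)

    def forward(self, hidden_states, mention_mask, candidate_mask):
        # hidden_states: (seq_len, hidden_dim) final BioBERT outputs for one pair
        # mention_mask, candidate_mask: (seq_len,) boolean masks over the tokens of m and c
        v0 = hidden_states[0]                                # [CLS] vector V_0
        v_m = hidden_states[mention_mask].mean(dim=0)        # average of V_i .. V_j
        v_c = hidden_states[candidate_mask].mean(dim=0)      # average of V_l .. V_n

        v0_p = self.fc_cls(torch.tanh(v0))                   # Eq. (2)
        v_m_p = self.fc_span(torch.tanh(v_m))                # Eq. (1), mention side
        v_c_p = self.fc_span(torch.tanh(v_c))                # Eq. (1), candidate side

        r = self.fc_concat(torch.cat([v0_p, v_m_p, v_c_p]))  # Eq. (3)
        return torch.sigmoid(self.scorer(r))                 # Eq. (4), per-candidate score
```

At training time one such score is produced for each of the K candidates of a mention, and the binary cross entropy loss pushes the gold candidate towards 1 and the others towards 0; at prediction time the candidate with the highest score would be selected.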
3 Experiments

Dataset: We use the Bacteria Biotope (BB) data to evaluate our approach. Three types of entities are involved: microorganism, habitat and phenotype. Microorganism entities are normalized to taxa from the NCBI taxonomy (ftp://ftp.ncbi.nih.gov/pub/taxonomy), which contains 903,191 taxa plus synonyms, while habitat and phenotype entities are normalized to concepts from the OntoBiotope ontology (http://agroportal.lirmm.fr/ontologies/ONTOBIOTOPE), which includes 3,601 concepts plus synonyms. Table 1 shows the number of mentions, unique mentions and concepts for each entity type.

Table 1. Statistics for each entity type.

                          Habitat   Phenotype   Microorganism
Entity mentions             3,506       1,102           2,487
Unique entity mentions      1,774         498             950
Concepts                      440         141             491

In the candidate generation step, we use the IR based method to generate candidates for microorganism entities, and the similarity based method to generate candidates for habitat and phenotype entities.

Metrics: Since the entity mentions in the data are given and every mention is normalized to a concept, we evaluate the performance of our biomedical concept normalization approach with precision, following the BB-norm task. The official on-line evaluation platform (http://bibliome.jouy.inra.fr/demo/BioNLP-OST-2019-Evaluation/index.html) is used to calculate scores on the test data.

Parameters Settings: For fine-tuning, the parameters are the same as those of the pre-trained BioBERT model. We set the learning rate to 5e-5, the batch size to 16, and the number of training epochs to 16. Early stopping is employed according to the precision on the validation set.
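The sketch below shows one way such a fine-tuning configuration could look in PyTorch. The hyperparameter values are those reported above; the optimizer choice (AdamW), the patience value, the batch layout and the evaluate_precision helper are our assumptions, as the paper only lists the values themselves.

```python
import copy
import torch

# Hyperparameter values reported in the paper.
LEARNING_RATE = 5e-5
BATCH_SIZE = 16      # batch size used when building train_loader
MAX_EPOCHS = 16

def fine_tune(model, train_loader, val_data, evaluate_precision, patience=3):
    """Fine-tune BioBERT plus the ranking head with early stopping on validation precision.

    `model` is assumed to bundle BioBERT and the ranking head and to return
    candidate scores in (0, 1); `evaluate_precision` is a hypothetical helper
    that normalizes the validation mentions and returns precision.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
    loss_fn = torch.nn.BCELoss()                 # binary cross entropy, as in the paper
    best_precision, best_state, epochs_without_gain = 0.0, None, 0

    for epoch in range(MAX_EPOCHS):
        model.train()
        for batch in train_loader:               # encoded "[CLS] m [SEP] c" pairs with 0/1 labels
            scores = model(batch["inputs"])
            loss = loss_fn(scores, batch["labels"].float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        precision = evaluate_precision(model, val_data)
        if precision >= best_precision:
            best_precision = precision
            best_state = copy.deepcopy(model.state_dict())  # keep the best checkpoint
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:  # stop when precision stops improving
                break

    model.load_state_dict(best_state)
    return model
```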
Experimental Results: The performance of our method is shown in Table 2. Compared to the methods of the official teams that participated in the BB-norm shared task, our method improves on the best scores by about +4, +6 and +4 points for habitat, phenotype and microorganism normalization respectively, and by about +10 points for all types combined. We also evaluate a variant that discards the hidden vectors of the entity mention and the candidate and ranks with the hidden vector of the special first token only; the results show that combining the entity and candidate vectors further enriches the representation and improves accuracy.

Table 2. Comparison of our biomedical entity normalization approach with the results in the BioNLP-OST19 challenge.

                               All Type   Habitat   Phenotype   Microorganism
Baseline                          0.531     0.559       0.581           0.470
BOUN-ISIK-2                       0.679     0.687       0.566           0.711
BLAIR GMU-2                       0.678     0.615       0.646           0.783
PADIA BacReader-1                 0.633     0.684       0.758           0.511
Our approach (without entity)     0.762     0.708       0.821           0.817
Our approach                      0.778     0.733       0.823           0.825

4 Conclusion

In this paper, we develop an approach for biomedical entity normalization by fine-tuning a pre-trained model, and we leverage the embeddings of entity mentions and their candidates to enrich the representation and improve performance. We conduct experiments on the BB dataset provided by the BioNLP-OST challenge, and our results significantly outperform the state-of-the-art methods. In the future, we will try to add the context information of a mention to further improve performance on this task.

References

1. Jennifer D'Souza, Vincent Ng. Sieve-Based Entity Linking for the Biomedical Domain. In: Proceedings of ACL-IJCNLP 2015, pp. 279-302, Beijing, China (2015)
2. Robert Leaman, Rezarta Islamaj Dogan, Zhiyong Lu. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics, vol. 29, pp. 2909-2917 (2013)
3. Haodi Li, Qingcai Chen, Buzhou Tang, et al. CNN-based ranking for biomedical entity normalization. BMC Bioinformatics (2017)
4. Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (2019)
5. Ishani Mondal, Sukannya Purkayastha, Sudeshna Sarkar, et al. Medical Entity Linking using Triplet Network. pp. 95-100 (2019)
6. Shanchan Wu, Yifan He. Enriching Pre-trained Language Model with Entity Information for Relation Classification. CoRR (2019)