=Paper=
{{Paper
|id=Vol-2664/cantemist_paper3
|storemode=property
|title=NCU-IISR: Pre-trained Language Model for CANTEMIST Named Entity Recognition
|pdfUrl=https://ceur-ws.org/Vol-2664/cantemist_paper3.pdf
|volume=Vol-2664
|authors=Jen-Chieh Han,Richard Tzong-Han Tsai
|dblpUrl=https://dblp.org/rec/conf/sepln/HanT20
}}
==NCU-IISR: Pre-trained Language Model for CANTEMIST Named Entity Recognition==
Jen-Chieh Han, Richard Tzong-Han Tsai*
Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
*Corresponding author. Email: joyhan@cc.ncu.edu.tw (J. Han), thtsai@csie.ncu.edu.tw (R. T.-H. Tsai). ORCID: 0000-0003-0998-4539 (J. Han), 0000-0003-0513-107X (R. T.-H. Tsai).
Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020).
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Since BERT brought large improvements to a wide range of NLP tasks, well-constructed pre-trained language models have also shown their power when fine-tuned on downstream tasks. In this paper, the NCU-IISR team adopted the Spanish BERT, BETO, as our pre-trained model and fine-tuned it on the CANTEMIST Named Entity Recognition (NER) data. We also compared it with a second version that was further trained on external Spanish medical text before fine-tuning. Our best run achieved an F1-measure of 0.85 on the official test set of the CANTEMIST-NER task.

Keywords

Electronic Health Records, Named Entity Recognition, Deep Learning, Pre-trained Language Model

1. Introduction

Biomedical text mining techniques can extract crucial information from literature in different languages, and work on Spanish clinical data has recently become a novel research direction. Clinical texts describe patients' actual symptoms, and the demand for mining them comes not only from medical researchers but also from front-line hospital users such as doctors and patients. Because of the complexity of biomedical knowledge, deciding which key items to extract and analyze from electronic health records (EHRs) forms the basis of this kind of task.

Cancer was the second-leading cause of death globally in 2018, and healthcare expenditures on oncological treatments are growing. The organizers of the CANTEMIST (CANcer TExt Mining Shared Task) track [1] have run successful shared tasks in the past, and that experience motivated the new task. This year, they focus on named entity recognition (NER) of a critical type of cancer-related concept, tumor morphology, and use a standard classification resource, the International Classification of Diseases for Oncology (ICD-O; eCIE-O-3.1 in Spanish), to code diagnoses. Spanish is the fourth-most spoken language in the world, and technologies developed for it have the potential to transfer to Italian, German, French, or even English. Furthermore, tens of thousands of Spanish EHRs are produced every 10 minutes, which is why the task corpus was collected in Spanish.

The CANTEMIST task comprises three sub-tasks (CANTEMIST-NER, CANTEMIST-NORM, and CANTEMIST-CODING), each with a particularly important use-case scenario. In summary, CANTEMIST explores the automatic assignment of eCIE-O-3.1 codes (Morfología neoplasia) to health-related documents in Spanish. We participated only in the CANTEMIST-NER sub-task, which requires automatically finding tumor morphology mentions, that is, their start and end character offsets in medical documents.

2. Pre-processing

The official dataset provides a set of text files with their corresponding annotation files. It includes 501 clinical cases in the training set, 500 documents in the dev(elopment) set, and 5231 documents in the test and background set (only the 300 test set documents are used to evaluate the predictions). The dev set is divided into two subsets of 250 documents each; the second subset (dev-2) was released later than the first (dev-1). We used only the dev-1 set because we did not have sufficient time to retrain our system on the whole dev set.

Before training a model, we had to transform these input data into an NER format. Given a text with annotated named entities (NEs), we divided it into tokens and recorded whether each token was inside an NE (the gold label). First, we used the GENIA Tagger's tokenizer (https://github.com/saffsd/geniatagger), which is often used in biomedical tasks, to split the input text into tokens. If a token was inside an NE, we gave it a label according to its position within the NE, following the IOB2 tag scheme [2]. Because we trained the NER model with the BERT [3] method, the input then passes through a second tokenizer, WordPiece (https://github.com/google-research/bert/blob/master/tokenization.py). Each input token is therefore tokenized again into word pieces according to the vocabulary dictionary, and the label sequence is expanded accordingly, with the continuation pieces receiving a default label id. The final NER input format is shown in Table 1, and a minimal conversion sketch follows the table.

Table 1
Transforming text with an NE (bold in the original) into the NER input format for BERT. Example input text: "Carcinoma microcítico de pulmón." Labels follow the IOB2 scheme: B(eginning), I(nside), O(utside).

  Token    Label
  Car      B
  ##cino   None
  ##ma     None
  micro    I
  ##cí     None
  ##tico   None
  de       O
  pulmón   O
  .        O
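The snippet below is a minimal sketch of this conversion, not the authors' exact code: it assumes the publicly available BETO checkpoint dccuchile/bert-base-spanish-wwm-cased and starts from an already tokenized sentence (the paper's first-pass tokenization used the GENIA Tagger, which is not reproduced here).

```python
# Minimal sketch: expand a pre-tokenized, IOB2-labelled sentence into
# WordPiece inputs of the kind shown in Table 1. Checkpoint name is an
# assumption; any BERT-style Spanish checkpoint would work the same way.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

tokens = ["Carcinoma", "microcítico", "de", "pulmón", "."]
labels = ["B", "I", "O", "O", "O"]  # IOB2 tags for the tumor morphology mention

PLACEHOLDER = "None"  # continuation word pieces get a default label, as in Table 1

wordpieces, wp_labels = [], []
for token, label in zip(tokens, labels):
    pieces = tokenizer.tokenize(token)
    wordpieces.extend(pieces)
    # Only the first word piece keeps the gold IOB2 tag; the rest are
    # ignored when the token-classification loss is computed.
    wp_labels.extend([label] + [PLACEHOLDER] * (len(pieces) - 1))

for wp, lab in zip(wordpieces, wp_labels):
    print(f"{wp}\t{lab}")
```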
3. Methods

For the CANTEMIST-NER task, we submitted two runs; the following subsections describe their methods.

3.1. Pre-trained Language Model

Bidirectional Encoder Representations from Transformers (BERT) [3] is a method of pre-training language representations that can be adapted to many natural language processing (NLP) tasks with relatively modest computational resources. The same architecture can be fine-tuned for various NLP tasks while achieving state-of-the-art results. Although the original BERT models were trained on English, the authors also provide Multilingual BERT (https://github.com/google-research/bert/blob/master/multilingual.md), which covers over 100 languages. Because CANTEMIST-NER is a Spanish dataset, Multilingual BERT may not be sufficient. We therefore surveyed pre-trained language models for Spanish and found BETO [4], a Spanish BERT released on GitHub that outperforms Multilingual BERT. BETO is a BERT model trained on a Spanish corpus of over 300M lines, including Spanish wikis. We adopted it as our pre-trained language model and fine-tuned it with the Transformers library [5], which adds a classification layer on top of a dropout applied to the BETO sequence output. CANTEMIST-NER dev-1 results for BETO are shown in Table 2. We tuned the number of training epochs from 3 to 50 and chose the model with the best F1 as the parameter setting of our first submitted run; a minimal fine-tuning sketch is given after Table 2.

Table 2
CANTEMIST-NER dev-1 results of BETO.

  Epoch   Precision   Recall   F1
  3       0.775       0.800    0.787
  5       0.793       0.791    0.792
  10      0.804       0.801    0.802
  20      0.806       0.808    0.807
  30      0.816       0.819    0.818
  40      0.809       0.813    0.811
  50      0.818       0.812    0.815
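The sketch below illustrates this fine-tuning setup with the Transformers library: BertForTokenClassification places a dropout layer and a linear classifier on top of the BERT sequence output, matching the description above. The checkpoint name, learning rate, and toy batch are assumptions; the paper only reports tuning the number of epochs.

```python
# Minimal fine-tuning sketch (one training step on a toy batch), assuming
# the BETO checkpoint and an assumed learning rate of 3e-5.
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"  # BETO checkpoint (assumed)
LABELS = ["O", "B", "I"]  # IOB2 label set for tumor morphology mentions

tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(LABELS))

# Toy batch standing in for a CANTEMIST training sentence.
encoding = tokenizer(["Carcinoma microcítico de pulmón ."], return_tensors="pt")
seq_len = encoding["input_ids"].shape[1]

# Dummy labels: every position tagged "O" except the special tokens,
# which get -100 so the loss ignores them.
labels = torch.full((1, seq_len), LABELS.index("O"))
labels[0, 0] = -100            # [CLS]
labels[0, seq_len - 1] = -100  # [SEP]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
outputs = model(**encoding, labels=labels)
outputs.loss.backward()
optimizer.step()
```

In practice the same loop runs over all training and dev-1 sentences for the chosen number of epochs, and the predicted token labels are mapped back to character offsets for the official evaluation.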
3.2. Spanish Biomedical Data

Because BETO is trained on general-domain Spanish text, we expected that further training it on clinical data before the NER fine-tuning might improve the final performance. In addition, we had seen one team use the Spanish Health Corpus and reach 90% F1 in last year's related task [6]. We therefore looked for such a resource and decided to take all Spanish text files from MedlinePlus in TEI format [7] rather than the other available XML formats. We then fed over 37 thousand lines of this raw text into BETO as an additional training step before fine-tuning; we call the resulting model BETO+Bio (a sketch of one possible realization of this step follows Table 3). CANTEMIST-NER dev-1 results for BETO+Bio are shown in Table 3. As with the previous model, we trained with different numbers of epochs and selected the best one as the parameter setting of our second submitted run.

Table 3
CANTEMIST-NER dev-1 results of BETO+Bio.

  Epoch   Precision   Recall   F1
  3       0.760       0.788    0.774
  5       0.765       0.792    0.778
  10      0.782       0.803    0.792
  20      0.790       0.797    0.794
  30      0.802       0.803    0.803
  40      0.803       0.805    0.804
  50      0.810       0.802    0.806
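The paper does not spell out exactly how the MedlinePlus text was fed into BETO; a common realization of such a step is to continue masked language model pre-training on the raw text before the NER fine-tuning. The sketch below illustrates that interpretation with the Transformers Trainer; the file path, checkpoint name, and hyperparameters are assumptions, not the authors' reported settings.

```python
# Sketch of one way to build "BETO+Bio": continue masked-LM pre-training of
# BETO on MedlinePlus raw text, then fine-tune the saved checkpoint for NER.
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"  # BETO checkpoint (assumed)
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForMaskedLM.from_pretrained(MODEL_NAME)

# One sentence per line, extracted from the MedlinePlus TEI files
# (hypothetical file name).
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="medlineplus_es.txt",
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="beto_bio",
                           num_train_epochs=1,            # assumed
                           per_device_train_batch_size=16),  # assumed
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("beto_bio")  # this checkpoint is then fine-tuned on CANTEMIST-NER
```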
4. Results

We trained the two models on the training and dev-1 sets with different hyperparameters. Our final CANTEMIST-NER test results are shown in Table 4. Somewhat to our surprise, the better system was plain BETO, which achieved an F1-measure of 0.850, higher than BETO+Bio by 0.005. A possible reason is that time constraints prevented us from trying different ways of transferring the MedlinePlus text into the model.

Table 4
CANTEMIST-NER test results.

  System     Precision   Recall   F1
  BETO       0.849       0.851    0.850
  BETO+Bio   0.840       0.850    0.845

5. Conclusion

Our NCU-IISR team built two models to predict CANTEMIST-NER labels, and both achieved above 0.84 F1-measure, showing that a pre-trained language model can reach high performance on this task. We initially expected that adding a Spanish biomedical training step would give better results, but it scored slightly lower. This might be because the additional training data was not large enough, or because we did not process the text input properly before training. We still believe that incorporating biomedical data before CANTEMIST-NER training has the potential to make the model fit the task better, and we would like to explore further approaches, and to analyze and fix our errors, in future work.

Acknowledgments

We thank Po-Ting Lai for giving us suggestions during the challenge and for revising this working note.

References

[1] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normalization and clinical coding: Overview of the CANTEMIST track for cancer text mining in Spanish, corpus, guidelines, methods and results, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
[2] H.-J. Dai, P.-T. Lai, Y.-C. Chang, R. T.-H. Tsai, Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization, Journal of Cheminformatics 7 (2015) S14.
[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[4] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[5] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, HuggingFace's Transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).
[6] M. Stoeckel, W. Hemati, A. Mehler, When specialization helps: Using pooled contextualized embeddings to detect chemical and biomedical entities in Spanish, in: Proceedings of the International Workshop on BioNLP Open Shared Tasks (BioNLP-OST), Association for Computational Linguistics, 2019.
[7] M. Villegas, A. Intxaurrondo, A. Gonzalez-Agirre, M. Marimon, M. Krallinger, The MeSpEN resource for English-Spanish medical machine translation and terminologies: Census of parallel corpora, glossaries and term translations, in: LREC MultilingualBIO: Multilingual Biomedical Text Processing (Melero M, Krallinger M, Gonzalez-Agirre A, eds.), 2018.