Deep Neural Model with Contextualized-word Embeddings for Named Entity Recognition in Spanish Clinical Text

Renzo Rivera-Zavala, Paloma Martínez
Carlos III University of Madrid, Avda. Universidad, 30, 28911 Leganés, Madrid
email: renzomauricio.rivera@alumnos.uc3m.es (R. Rivera-Zavala); pmf@inf.uc3m.es (P. Martínez)
orcid: 0000-0002-0877-7063 (R. Rivera-Zavala)

Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020). IberLEF 2020, September 2020, Málaga, Spain. CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

In this work, we introduce a deep learning architecture for the recognition of cancer-related named entities in Spanish clinical case texts. We propose a deep neural approach based on two Bidirectional Long Short-Term Memory (Bi-LSTM) layers and a Conditional Random Field (CRF) layer, using character and contextualized-word embeddings to capture semantic, syntactic, and morphological features. The approach was evaluated on the CANTEMIST corpus, obtaining an F-measure of 82.3% for NER.

Keywords: Natural Language Processing, Clinical Texts, Deep Learning, Contextual Information

1. Introduction

Currently, the volume of biomedical literature is growing at an exponential rate. Therefore, efficient access to information on biological, chemical, and biomedical data described in scientific articles, patents, or e-health reports is of growing interest to the biomedical industry and research community. In this context, improved access to biomedical concept mentions in biomedical texts is a crucial step prior to downstream tasks such as the extraction of drug and protein interactions, chemical compounds, and adverse drug reactions, among others. Named Entity Recognition (NER) is a crucial task in biomedical Information Extraction (IE), which aims to automatically extract and identify mentions of concepts of interest in running text, typically through their mention offsets or by classifying individual tokens according to whether they belong to entity mentions or not.

There are different approaches to address the NER task. Dictionary-based methods are limited by the size of the dictionary, spelling errors, the use of synonyms, and the constant growth of vocabulary. Rule-based and machine learning methods usually require both syntactic and semantic features as well as language- and domain-specific features. One of the most effective methods is Conditional Random Fields (CRF) [1], one of the most reliable sequence labeling approaches. Recently, deep learning-based methods have also demonstrated state-of-the-art performance for English texts [2, 3, 4] by automatically learning relevant patterns from corpora, which provides language and domain independence. However, pre-trained models for Spanish, and even more so for the biomedical domain, are limited: to the best of our knowledge, there is only one publicly available work that addresses the generation of Spanish biomedical word embeddings [5, 6].

In this paper, we propose a deep neural model with two Bi-LSTM layers and a CRF layer. To do this, we adapt the NeuroNER model proposed in [7] for track 1 (NER offset and entity classification) of the CANTEMIST task [8]. Specifically, we have extended NeuroNER by adding contextualized-word information and information about overlapping or nested entities.
Moreover, in this work, we use existing pre-trained word models as well as models trained from scratch: i) the word2vec/FastText Spanish Billion Word Embeddings models [9], which were trained on the 2014 dump of Wikipedia; ii) our medical word embeddings for Spanish, trained using the FastText model; and iii) a sense-disambiguation embedding model [10]. Experimental results on the CANTEMIST task showed that our combined feature representation improves over each separate representation, implying that the LSTM-based compositions play different roles in capturing token-level features for NER and therefore complement each other when combined. Moreover, the use of domain-specific contextualized-word representations outperforms general-domain word representations.

2. Materials and Methods

In this section, we first describe the corpora used to generate our from-scratch contextualized-word representations, the training procedure, and the pre-trained contextualized-word models used in our study. Then, we describe our system architecture for offset detection and entity classification. Finally, we present the datasets used for training, validating, and evaluating our deep learning model.

2.1. Corpora

The corpora were gathered from Spanish biomedical texts from different multilingual biomedical sources:

1. The Spanish Bibliographical Index in Health Sciences (IBECS - http://ibecs.isciii.es) corpus, which collects scientific journals covering multiple fields in health sciences;
2. The Scientific Electronic Library Online (SciELO - https://scielo.org/es/) corpus, which gathers electronic publications of complete full-text articles from scientific journals of Latin America, South Africa, and Spain;
3. The MedlineNLM corpus, obtained from the PubMed free search engine (https://www.ncbi.nlm.nih.gov/pubmed/);
4. The MedlinePlus corpus (an online information service provided by the U.S. National Library of Medicine - https://medlineplus.gov/), consisting of health topics, drugs and supplements, the medical encyclopedia, and laboratory test information; and
5. The UFAL corpus (https://ufal.mff.cuni.cz/ufal_medical_corpus), a collection of parallel corpora of medical and general-domain texts.

Table 1: Biomedical Spanish corpus details.

Collection \ Corpus   IBECS        SciELO       MedlineNLM   MedlinePlus   UFAL
Documents             168,198      161,710      330,928      1,063         265,410
Words                 23,648,768   26,169,655   4,710,191    217,515       41,604,517
Unique words          184,936      159,997      20,942       5,099         198,424

Figure 1: Dublin Core format for the biomedical corpus.

Source corpus details are described in Table 1. All the corpora are provided as XML (Dublin Core format) and TXT files. XML files were processed to extract only raw text from specific XML tags, such as "title" and "description" with Spanish labels, following the Dublin Core format shown in Figure 1. TXT files were not processed. Raw texts from all files were compiled into a single TXT file. Texts were lowercased; punctuation, trailing spaces, and stop words were removed; and the result was used as input to generate our word embeddings. Sentence preprocessing (splitting and tokenization) was performed with spaCy (https://spacy.io/), an open-source Python library for advanced multi-language natural language processing.
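The snippet below is a minimal sketch of this kind of preprocessing pipeline, assuming spaCy's small Spanish model (es_core_news_sm) is installed; the exact cleaning rules (e.g., the stop-word list) used to build the embedding corpus are not published, so the filters shown here are illustrative.

```python
import spacy

# Assumes the small Spanish pipeline is installed:
#   python -m spacy download es_core_news_sm
nlp = spacy.load("es_core_news_sm")

def preprocess(raw_text: str):
    """Sentence-split and tokenize with spaCy, then lowercase and drop
    punctuation, whitespace, and stop words (illustrative cleaning rules)."""
    doc = nlp(raw_text)
    for sent in doc.sents:
        tokens = [
            tok.lower_
            for tok in sent
            if not (tok.is_punct or tok.is_space or tok.is_stop)
        ]
        if tokens:
            yield tokens

if __name__ == "__main__":
    sample = ("Orienta hacia carcinoma indiferenciado de origen mamario. "
              "Se solicita una nueva biopsia.")
    for sentence_tokens in preprocess(sample):
        print(" ".join(sentence_tokens))
```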
2.2. Contextualized-word models

The use of word representations from pre-trained unsupervised methods is a common practice and a crucial step in NER pipelines. Previous word embedding models such as Word2Vec [11], GloVe [12], and FastText [13] focused on context-independent word representations. However, in the last few years models have focused on learning context-dependent word representations, such as ELMo [14], CoVe [15], and the state-of-the-art BERT model [16].

BERT is a context-dependent word representation model based on a masked language model and trained using the Transformer architecture [16]. Previous models such as RNNs (LSTM, GRU) combine two unidirectional layers (i.e., Bi-LSTM) to address the sequential nature of natural language; as a replacement for this sequential approach, the BERT model employs a much faster attention-based approach. BERT is pre-trained on two unsupervised "artificial" tasks: i) masked language modeling, which predicts randomly masked words in a sequence and can therefore be used to learn bidirectional representations by jointly conditioning on both left and right contexts in all layers, and ii) next sentence prediction, in order to train a model that understands sentence relationships. The Transformer layer has two sub-layers, a multi-head self-attention mechanism and a position-wise fully connected feed-forward network, followed by a normalization layer. Even though BERT learns a lot about language through pre-training, it is possible to adapt the model by adding a customized layer on top of the BERT outputs and then training again with task-specific data (a phase called fine-tuning). We refer readers to [16] for a more detailed description of BERT. An overview of the BERT architecture can be seen in Figure 2.

Figure 2: BERT pre-training and fine-tuning architecture overview (source [16]).

Due to the benefits of the BERT model, we adopted the multilingual cased [16] and the BETO [17] pre-trained BERT models. Moreover, we trained from scratch a biomedical Spanish model (BSC-BERT) with 12 Transformer layers (12-layer, 768-hidden, 12-heads, 110M parameters) and a SoftMax output layer to perform the NER task. First, we replaced the WordPiece tokenizer with the SentencePiece implementation [18] and the spaCy [19] Spanish tokenizer for sentence and subword segmentation. We trained with a batch size of 128 sequences for 1,000,000 steps, which is approximately 40 epochs over the 4 million word corpus. We used Adam with a learning rate of 1e-4, a dropout probability of 0.15 on all layers, and a GELU activation function. Training of BSC-BERT was performed on 1 Cloud TPU with 8 vCPUs (Intel(R) Xeon(R) CPU @ 2.30GHz) and 16 GB of memory. Details of the trained and pre-trained models can be seen in Table 2.

Table 2: Contextualized-word models details.

Detail        BSC-BERT          bert-base-multilingual-cased   BETO cased
Language      Spanish           104 languages                  Spanish
Domain        Biomedical        General                        General
Type          Contextual word   Contextual word                Contextual word
Corpus size   6 billion         3,300M                         3 billion
Vocab size    200k              120k                           31k
Hidden size   768               768                            1024
Algorithm     BERT training     BERT training                  BERT training
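As an illustration of how such pre-trained models are consumed downstream, the sketch below loads a cased multilingual BERT checkpoint with the Hugging Face transformers library and extracts contextual vectors for the subword pieces of a Spanish clinical sentence. The library and the checkpoint name are choices made for this example and are not necessarily the tooling used to train or serve BSC-BERT.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any BERT-style checkpoint works here; multilingual BERT is used as an example.
MODEL_NAME = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

sentence = "Orienta hacia carcinoma indiferenciado de origen mamario."
encoded = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)

# One contextual vector per subword piece (batch size 1, hidden size 768).
subword_vectors = outputs.last_hidden_state.squeeze(0)
pieces = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
for piece, vector in zip(pieces, subword_vectors):
    print(f"{piece:>15s}  dim={vector.shape[0]}")
```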
Figure 3: The architecture of the Bi-LSTM CRF model for tumor morphology identification.

2.3. System Description

Our approach is an adaptation of the state-of-the-art NER model NeuroNER proposed in [7], based on a deep learning network with a preprocessing step, transfer learning from pre-trained models, two recurrent neural network layers, and a final CRF layer (see Figure 3). The input to the first Bi-LSTM layer is character embeddings. In the second layer, we concatenate the character-based representations from the first layer with contextualized-word representations, which form the input to the second Bi-LSTM layer. Finally, the last CRF layer obtains the most suitable label for each token using a tag encoding format. For more details about NeuroNER, please refer to [7].

Our contribution consists of extending the NeuroNER system with additional features. In particular, contextualized-word representations and the extended BMEWO-V encoding format have been added to the network. The BMEWO-V encoding format distinguishes the B tag for entity start, the M tag for entity continuity, the E tag for entity end, the W tag for a single-token entity, and the O tag for tokens that do not belong to any entity. The V tag allows us to represent nested entities. BMEWO-V is similar to previous encoding formats [20]; however, it allows the representation of nested and discontinuous entities. As a result, we obtain our sentences annotated in the CoNLL-2003 format [21]. An example of the BMEWO-V encoding format applied to the sentence "Orienta hacia carcinoma indiferenciado de origen mamario." ("Orients towards undifferentiated carcinoma of mammary origin.") can be seen in Figure 4 and Table 3.

Figure 4: BRAT annotation example from a CANTEMIST corpus sentence.

Table 3: Tokens annotated with BMEWO-V encoding in the CoNLL-2003 format.

token            offset start   offset end   nested-tag               tag
Orienta          0              6            O                        O
hacia            8              13           O                        O
carcinoma        15             25           V-MORFOLOGIA_NEOPLASIA   B-MORFOLOGIA_NEOPLASIA
indiferenciado   27             32           O                        E-MORFOLOGIA_NEOPLASIA
de               33             35           O                        O
origen           36             42           O                        O
mamario          43             50           O                        O
.                51             52           O                        O

2.3.1. First Bi-LSTM layer using character embeddings

Word embedding models are able to capture syntactic and semantic information. However, other linguistic information, such as morphological information and orthographic transcription, is not exploited. According to [22], the use of character embeddings improves learning for specific domains and is useful for morphologically rich languages such as Spanish. For this reason, we decided to include character-level representations to obtain morphological and orthographic information from words. Each word is decomposed into its character n-grams and initialized with a random dense vector, which is then learned. We used a 25-feature vector to represent each character. In this way, tokens in sentences are represented by their corresponding character embeddings, which are the input to our first Bi-LSTM network.

2.3.2. Second Bi-LSTM layer using word and sense embeddings

The input to the second Bi-LSTM layer is the concatenation of the character-based representations from the first layer with the pre-trained contextualized-word representations of the tokens in a given input sentence. The goal of the second layer is to obtain a sequence of probabilities for each tag in the BMEWO-V encoding format. In this way, for each input token, this layer returns six probabilities (one for each tag in BMEWO-V). The final tag is the one with the highest probability for each token.
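The following PyTorch sketch illustrates the character/word composition described in Sections 2.3.1 and 2.3.2: a character-level Bi-LSTM builds a vector per token, which is concatenated with a pre-computed contextualized-word vector and fed to a token-level Bi-LSTM that emits one score per BMEWO-V tag. The dimensions, class names, and the use of PyTorch are illustrative assumptions; the actual system extends NeuroNER and is not reproduced here.

```python
import torch
import torch.nn as nn

class CharWordBiLSTMTagger(nn.Module):
    """Illustrative two-level Bi-LSTM encoder producing per-tag scores (before the CRF)."""

    def __init__(self, n_chars: int, n_tags: int = 6,
                 char_dim: int = 25, char_hidden: int = 25,
                 word_dim: int = 768, word_hidden: int = 100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, word_hidden,
                                 bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Linear(2 * word_hidden, n_tags)

    def forward(self, char_ids, word_vectors):
        # char_ids: (n_tokens, max_chars) character indices for every token of one sentence
        # word_vectors: (n_tokens, word_dim) contextualized-word vectors (e.g., from BERT)
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1)            # (n_tokens, 2*char_hidden)
        token_repr = torch.cat([word_vectors, char_repr], dim=-1)  # concatenate word + char
        sent_repr, _ = self.word_lstm(token_repr.unsqueeze(0))     # token-level Bi-LSTM
        return self.hidden2tag(sent_repr.squeeze(0))               # (n_tokens, n_tags) scores

# Toy usage: 7 tokens, at most 15 characters each, 768-dimensional word vectors.
model = CharWordBiLSTMTagger(n_chars=100)
scores = model(torch.randint(1, 100, (7, 15)), torch.randn(7, 768))
print(scores.shape)  # torch.Size([7, 6]) -> one score per BMEWO-V tag
```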
2.3.3. Last layer based on Conditional Random Fields (CRF)

To improve the accuracy of the predictions, a Conditional Random Field (CRF) [1] model is trained, which takes as input the label probabilities for each independent token from the previous layer and obtains the most probable sequence of predicted labels based on the correlations between labels and their context. Handling labels independently for each word ignores such sequence constraints: for example, in this sequence labeling problem, an "M-MORFOLOGIA_NEOPLASIA" tag cannot appear before a "B-MORFOLOGIA_NEOPLASIA" tag, nor can a "B-MORFOLOGIA_NEOPLASIA" tag appear immediately after an "M-MORFOLOGIA_NEOPLASIA" tag.

Finally, once tokens have been annotated with their corresponding labels in the BMEWO-V encoding format, the entity mentions must be transformed into the BRAT format. V tags, which identify nested or overlapping entities, are generated as new annotations within the scope of other mentions.

3. Evaluation

As described above, our system is based on a deep network with two Bi-LSTM layers and a final CRF layer. We evaluate our NER system using the train, validation, and test subsets of the CANTEMIST dataset provided by the CANTEMIST task organizers [8]. Detailed information for each subset can be seen in Table 4.

Table 4: CANTEMIST subsets details.

Dataset     Subset   Documents   Sentences   Entities
CANTEMIST   Train    501         19,359      6,529
            Valid    250         9,489       3,246
            Test     5,231       88,406      -

The CANTEMIST dataset is a manually annotated corpus of 5,982 clinical cases written in Spanish and annotated with mentions of concepts related to cancer, namely tumor morphology. The CANTEMIST NER task can be addressed as a binary classification task, since only one entity type, "MORFOLOGIA_NEOPLASIA" ("tumor morphology"), is considered. The CANTEMIST task comprises three subtasks; in this work, we address the first track, covering offset recognition and entity classification of tumor morphology mentions. The F-measure is used as the main metric, where true positives are predicted entities that match the gold standard entity boundaries and type. A detailed description of the evaluation can be found on the CANTEMIST web page (https://temu.bsc.es/cantemist/?p=3975).

The NER task is addressed as a sequence labeling task. For the NER track, we tested different configurations with various pre-trained contextualized-word models; the pre-trained models and their parameters are summarized in Table 2. In Table 5, we compare the different pre-trained models on the validation subset. As shown in Table 5, domain-specific contextualized-word models outperform general-domain models by almost one point.

Table 5: Out-of-task results of the contextualized-word models for entity classification on the CANTEMIST test subset.

Model                          Precision (%)   Recall (%)   F-score (%)
bert-base-multilingual-cased   80.41           76.78        78.55
BETO cased                     84.68           79.02        81.66
BSC-BERT                       83.59           81.63        82.60

For the test subset, we applied our best system configuration, NeuroNER + BSC-BERT, obtaining an F-score of 82.30% for offset detection and entity classification. Although we produce predictions for more than 5K documents, the evaluation is performed on only 300 of them.
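As a reference for how this metric behaves, the sketch below computes strict entity-level precision, recall, and F-measure, where a prediction counts as a true positive only if both its character offsets and its type match a gold annotation. This is our own minimal illustration of the criterion described above, not the official CANTEMIST evaluation script.

```python
from typing import List, Tuple

# An entity is (start_offset, end_offset, type), e.g. (15, 25, "MORFOLOGIA_NEOPLASIA").
Entity = Tuple[int, int, str]

def strict_prf(gold: List[Entity], predicted: List[Entity]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 with exact boundary and type matching."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)                       # exact offset + type matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example with one exact match and one boundary mismatch.
gold = [(15, 25, "MORFOLOGIA_NEOPLASIA"), (60, 75, "MORFOLOGIA_NEOPLASIA")]
pred = [(15, 25, "MORFOLOGIA_NEOPLASIA"), (60, 70, "MORFOLOGIA_NEOPLASIA")]
print(strict_prf(gold, pred))  # (0.5, 0.5, 0.5)
```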
4. Conclusions

In this work, we propose a system for the detection of concepts related to cancer in clinical narrative written in Spanish. We address named entity recognition as a sequence labeling task and propose a deep learning approach that relies only on dense vector representations as features instead of hand-crafted word-based features. We confirmed that, as in other NLP tasks, dense word-level and character-level representations are helpful for named entity recognition. The proposed system achieves satisfactory performance, with an F-score over 82.3% on the test subset. The extension of the NeuroNER network is domain-independent and could be used in other fields; although generic pre-trained contextualized-word representations are also used, a new pre-trained biomedical Spanish contextualized-word model has been generated for this work.

We found that the text preprocessing step had a significant impact on entity offset recognition and classification. Separating words at the hyphen '-' caused some errors. Abbreviation recognition is a difficult task due to ambiguity and length, even more so for very short abbreviations (1-2 letters) because of their high level of ambiguity. Long entities consisting of more than 5 tokens are hard to identify correctly. Moreover, words not present in the pre-trained model's vocabulary are not recognized during entity offset recognition and classification; therefore, we decided to initialize words not present in the vocabulary with random vectors.

As future work, we plan to generate contextualized-word representations that integrate biomedical knowledge sources such as SNOMED-CT or UMLS into our systems. The motivation is to see whether contextualized-word representations generated with biomedical knowledge can help improve the results and provide a deep learning model for biomedical NER and concept indexing.

Acknowledgments

This work was supported by the Research Program of the Ministry of Economy and Competitiveness - Government of Spain (DeepEMR project TIN2017-87548-C2-1-R).

References

[1] J. D. Lafferty, A. McCallum, F. C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001, pp. 282–289. URL: http://dl.acm.org/citation.cfm?id=645530.655813.
[2] W. Hemati, A. Mehler, LSTMVoter: chemical named entity recognition using a conglomerate of sequence labeling tools, Journal of Cheminformatics 11 (2019) 3. URL: https://doi.org/10.1186/s13321-018-0327-2. doi:10.1186/s13321-018-0327-2.
[3] M. Pérez-Pérez, O. Rabal, G. Pérez-Rodríguez, M. Vazquez, F. Fdez-Riverola, J. Oyarzábal, A. Valencia, A. Lourenço, M. Krallinger, Evaluation of chemical and gene/protein entity recognition systems at BioCreative V.5: the CEMP and GPRO patents tracks, 2017.
[4] V. Suárez-Paniagua, R. M. R. Zavala, I. Segura-Bedmar, P. Martínez, A two-stage deep learning approach for extracting entities and relationships from medical texts, Journal of Biomedical Informatics 99 (2019) 103285. URL: http://www.sciencedirect.com/science/article/pii/S1532046419302047. doi:10.1016/j.jbi.2019.103285.
[5] J. Armengol-Estapé, F. Soares, M. Marimon, M. Krallinger, PharmaCoNER Tagger: a deep learning-based tool for automatically finding chemicals and drugs in Spanish medical texts, Genomics & Informatics 17 (2019) e15. URL: http://genominfo.org/journal/view.php?number=557. doi:10.5808/GI.2019.17.2.e15.
[6] F. Soares, M. Villegas, A. Gonzalez-Agirre, M. Krallinger, J. Armengol-Estapé, Medical word embeddings for Spanish: Development and evaluation, in: Proceedings of the 2nd Clinical Natural Language Processing Workshop, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 124–133. URL: https://www.aclweb.org/anthology/W19-1916.
[7] F. Dernoncourt, J. Y. Lee, P. Szolovits, NeuroNER: an easy-to-use program for named-entity recognition based on neural networks, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 97–102. URL: https://www.aclweb.org/anthology/D17-2017. doi:10.18653/v1/D17-2017.
[8] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normalization and clinical coding: Overview of the CANTEMIST track for cancer text mining in Spanish, corpus, guidelines, methods and results, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
[9] C. Cardellino, Spanish Billion Words Corpus and Embeddings, 2016. URL: http://crscardellino.me/SBWCE/.
[10] A. Trask, P. Michalak, J. Liu, sense2vec - A fast and accurate method for word sense disambiguation in neural word embeddings, CoRR abs/1511.06388 (2015). URL: http://arxiv.org/abs/1511.06388. arXiv:1511.06388.
[11] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, CoRR abs/1310.4546 (2013). URL: http://arxiv.org/abs/1310.4546. arXiv:1310.4546.
[12] J. Pennington, R. Socher, C. Manning, GloVe: Global Vectors for Word Representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. URL: http://aclweb.org/anthology/D14-1162. doi:10.3115/v1/D14-1162.
[13] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, CoRR abs/1607.04606 (2016). URL: http://arxiv.org/abs/1607.04606. arXiv:1607.04606.
[14] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), volume 1, Association for Computational Linguistics, 2018, pp. 2227–2237. doi:10.18653/v1/n18-1202. arXiv:1802.05365.
[15] B. McCann, J. Bradbury, C. Xiong, R. Socher, Learned in translation: Contextualized word vectors, Technical Report, 2017. arXiv:1708.00107.
[16] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, Technical Report, 2019. URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[17] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[18] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, 2018. arXiv:1808.06226.
[19] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, 2017. To appear.
[20] A. Borthwick, J. Sterling, E. Agichtein, R. Grishman, Exploiting diverse knowledge sources via maximum entropy in named entity recognition, in: Sixth Workshop on Very Large Corpora, 1998. URL: https://www.aclweb.org/anthology/W98-1118.
[21] E. F. Tjong Kim Sang, F. De Meulder, Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition, in: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147. URL: https://www.aclweb.org/anthology/W03-0419.
[22] W. Ling, C. Dyer, A. W. Black, I. Trancoso, R. Fermandez, S. Amir, L. Marujo, T. Luis, Finding function in form: Compositional character models for open vocabulary word representation, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 1520–1530. URL: https://www.aclweb.org/anthology/D15-1176. doi:10.18653/v1/D15-1176.