Exploring Deep Learning for Named Entity Recognition of Tumor Morphology Mentions

Gema de Vargas Romero, Isabel Segura-Bedmar
Computer Science Department, Universidad Carlos III de Madrid (UC3M), Leganés, 28911, Madrid, Spain
Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)
email: 100426253@alumnos.uc3m.es (G.d.V. Romero); isegura@inf.uc3m.es (I. Segura-Bedmar)
url: http://hulat.inf.uc3m.es/nosotros/miembros/isegura (I. Segura-Bedmar)
orcid: 0000-0002-0877-7063 (G.d.V. Romero); 0000-0002-7810-2360 (I. Segura-Bedmar)

Abstract

This paper describes the development of a Named Entity Recognition (NER) system that automatically detects tumor morphology mentions in medical documents, encoded as ICD-O codes (International Classification of Diseases for Oncology). The study was developed as part of the Cantemist track of the Plan for the Advancement of Language Technologies (Plan TL). This work aims to contribute to the existing NER technologies that focus on Spanish health-related documents. Such systems are needed given the amount of health-related research written in Spanish and the benefits they could bring to the medical domain; since most NER techniques are developed for English, this research cannot be fully exploited. We explore different machine learning techniques, namely a CRF, a Bidirectional Long Short-Term Memory (Bi-LSTM) network and Bidirectional Encoder Representations from Transformers (BERT), to address the task of detecting tumor morphology mentions in clinical texts written in Spanish.

Keywords
Named Entity Recognition, BiLSTM, BERT

1. Introduction

Natural Language Processing (NLP) has become vital because the amount of information to which people have access nowadays cannot be easily managed. NLP offers tools that vary with the intention of the user, such as translating, summarizing or extracting information from a text [1]. Named Entity Recognition (NER) is a specific NLP task that focuses on information extraction.

Many current NER technologies achieve state-of-the-art performance. However, two key aspects make continued development in this field necessary. On the one hand, the performance of NER systems depends on the domain, which makes it necessary to build domain-specific NER systems. This task becomes crucial in the medical domain given the amount of research in this field and the advantages it could bring to patient diagnosis, prognosis and further research [2]. On the other hand, despite the generalization that some NER systems achieve, their performance also depends on the language they were built for, so systems must be built for each language. For instance, there is a huge amount of biomedical research written in Spanish; however, since most NER techniques are developed for English, this research cannot be fully exploited.

Disease recognition and normalization in medical texts is a challenging task given the variability and complexity of disease mentions.
Currently, the main technique used in this context is clinical coding, which consists of assigning standard codes for diseases, signs, symptoms and medical procedures. These codes are provided by ICD-10, the International Classification of Diseases (CIE-10 in Spanish). In particular, the ICD-O codes (CIE-O-3.1 in Spanish), from the International Classification of Diseases for Oncology, make it possible to code tumor morphology mentions in health-related documents written in Spanish [2].

In this context, this study proposes a NER system that automatically detects tumor morphology mentions in medical documents. It was developed as part of the Cantemist track [2], which is sponsored by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL) and is part of the IberLEF 2020 evaluation campaign. The Cantemist track is organized into three subtasks:

1. NER, which consists of finding tumor morphology mentions;
2. NORM, which involves finding tumor morphology mentions and assigning them an ICD-O code;
3. CODING, which consists of assigning ICD-O codes to whole documents.

We focus on the NER subtask. Our aim is to build a system that identifies tumor morphology mentions in the corpus provided by the Cantemist organizers and thereby contribute to the existing NER technologies and resources that focus on Spanish health-related documents. Our study explores three different machine learning approaches: 1) a CRF classifier as baseline, 2) a Bidirectional Long Short-Term Memory (Bi-LSTM) network and 3) Bidirectional Encoder Representations from Transformers (BERT), to address the task of detecting tumor morphology mentions in clinical texts written in Spanish. This study focuses on domain-specific named entities, more specifically on a single entity type: "Morfología Neoplasia".

2. Related Work

NER aims to identify and categorize words or expressions in a text that represent entities. NER systems must confront many challenges, such as the construction of a generic entity tagger given the variability of the entity set across domains, the diversity of languages and the ambiguity within each of them [3], the dependency on the quality of the annotations and the existence of infrequent entities. Two key aspects of NER systems are the use of context information to identify entities and the assumption of inter-dependency between the words in the text [3].

Since entities can be words or phrases, they are usually labeled following the IOB format, which stands for "Inside, Outside, Beginning": a token is labeled "B" if it is at the beginning of an entity, "I" if it is inside an entity and "O" if it is not part of an entity, with the entity type appended to the "B" and "I" tags [4].

Over the years, several approaches have been employed to address the NER task. The main ones are:

1. Rule-based;
2. Unsupervised learning;
3. Feature-based supervised learning;
4. Deep learning.

Rule-based. This approach relies on hand-crafted rules, which can be either semantic or syntactic. It does not require annotated data, is highly dependent on the domain and can rely on dictionaries. It usually provides high precision but low recall [1].

Unsupervised learning. This approach involves training models without labeled data. The main unsupervised learning technique employed in NER is clustering.
For this, the system clusters the data into groups based on "context similarity" and assigns them a named entity by means of "entity extraction" [1].

Feature-based supervised learning. This approach involves prior feature engineering to construct features (usually vectors) that represent each instance (word or sentence). Features can be word-level features or document and corpus features, among others. This approach needs a dataset of annotations, which is used to train a model. In this scenario, the NER task is treated as a multi-class classification problem, where the entity types form the set of classes. Common supervised methods employed in NER are Hidden Markov Models, Decision Trees, Maximum Entropy Models, Support Vector Machines and Conditional Random Fields [1]. The last of these, CRF, considers context information and outperforms the previous ones [1]. Therefore, we adopt CRF as our baseline system.

Deep learning. The use of deep learning in NER has become more common in recent years. In contrast to previous techniques, feature extraction is learned automatically by the model, without complex feature engineering. This is done by stacking several layers in a neural network, where different abstractions of the data are obtained. Another advantage of deep learning is the use of non-linear activation functions, which make it possible to learn more complex features [1].

As previously mentioned, NER is a challenging task for several reasons, such as the variability of the entity sets across domains and the dependency on the quality of the annotations. In the biomedical domain, however, the NER task becomes even more challenging given the domain-specific terminology, the use of acronyms or abbreviations and non-standard terms. Furthermore, Bio-NER systems are affected by the lack of comprehensive dictionaries of biomedical entities [5].

Many studies have shown that the combination of a Bidirectional Long Short-Term Memory (BiLSTM) network followed by a CRF achieves state-of-the-art performance in NER [6]. Zhai et al. [6] compared several systems combining deep learning models such as Bi-LSTM and Convolutional Neural Networks (CNN) to recognize chemical and disease mentions. To initialize the deep learning networks, the authors used a Word2Vec word embedding model trained over MEDLINE abstracts. They employed the BioCreative V CDR corpus [7], a manually annotated dataset: the training set contains 1,965 disease entities, 1,467 chemical entities and 1,038 chemical-disease relations (CDRs); the development set contains 1,865 disease entities, 1,507 chemical entities and 1,012 CDRs; and the test set contains 1,988 disease entities, 1,435 chemical entities and 1,066 CDRs [7]. Focusing on the task of disease recognition, the best performance (F1=83.01%) was provided by a hybrid architecture combining a Bi-LSTM, a CRF classifier and a CNN initialized with character embeddings [6].

Wei et al. [8] proposed several systems for disease named entity recognition combining Conditional Random Fields (CRF) and a Bidirectional Recurrent Neural Network (Bi-RNN). For this purpose, the authors also employed the BioCreative V CDR corpus [7]. They found that the best performance (F1=82.88%) was obtained when combining the results of a Bi-RNN with the CRF-based model through a Support Vector Machine (SVM).
On the one hand, the Bi-RNN was trained using vectors pre-trained on a PubMed corpus. On the other hand, the CRF-based model employed input features such as the PoS tag, word shape, prefix and suffix, among others [7].

Moreover, Cho and Lee [9] proposed CLSTM (Contextual LSTM with CRF), a system that focuses on "capturing local context information based on n-gram characters" and employs word embeddings. The system was evaluated over three biomedical corpora: the National Center for Biotechnology Information (NCBI) corpus [10], the BioCreative II Gene Mention (GM) corpus [11] and the BioCreative V corpus [7]. Their analysis showed that, under strict entity matching, the CLSTM system trained using word and character embeddings achieved the best performance (F1=86.44%) over the BioCreative V corpus. On the same corpus, CLSTM with word-level embeddings and CLSTM with character-level embeddings achieved F1 scores of 86.36% and 85.92% respectively, followed by GRAM-CNN (F1=85.79%), BERT [12] (F1=85.72%), BiLSTM-CRF (F1=85.50%) and BiLSTM (F1=81.88%) [9].

3. Methods

3.1. The Cantemist corpus

The Cantemist corpus [2] is formed by 3,000 clinical cases stored in separate plain-text files with UTF-8 encoding. Each text file is associated with an annotation file (see Figure 1) in BRAT format, which has been manually annotated by clinical experts. These annotations are tumor morphology mentions, more specifically ICD-O codes. The corpus is divided into four sets: a training set, two development sets and a test set. A detailed description of this dataset can be found in [2].

In the BRAT format, each line represents an entity mention. Each entity annotation includes an ID ('T1', 'T2', ...), where 'T' stands for "text-bound annotation", followed by the entity type ('MORFOLOGIA NEOPLASIA'), the start-end offsets and the text of the annotation. The start-end offsets indicate the positions within the file of the first and last characters of the entity [4].

Entities can be formed by several tokens (words); in fact, the average number of words per entity is 2. However, the BRAT annotations are given as continuous text-bound annotations, where only the start offset of the first word and the end offset of the last word of the entity are provided (see Figure 2). For this reason, a pre-processing stage that converts such annotations into discontinuous text-bound annotations has been applied (see Figure 3) [4].

Figure 1: Example of an ann file.
Figure 2: Example of a continuous text-bound annotation in BRAT format.
Figure 3: Example of a discontinuous text-bound annotation in BRAT format.

The IOB format is the most widely used scheme to tag each token. However, after analyzing the Cantemist corpus, we noticed the presence of nested entities, i.e., entities embedded in another entity (see Figure 4); entities that do not appear inside other annotations are known as flat entities. To handle this, we have used an extension of the IOB format, the BIOES-V format [6, 13]. In this scheme, the last token of a multi-token entity is labeled "E", a single-token entity is labeled "S" and a token that is part of a nested entity is labeled "V".

To preprocess the texts and represent their tokens in the BIOES-V format, we have used spaCy, a popular NLP library. spaCy allows us to perform several tasks such as sentence splitting, tokenization and PoS tagging.
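As an illustration of this preprocessing step, the following minimal sketch shows how spaCy can produce sentences, tokens, PoS tags and character offsets, which can then be compared against the BRAT start-end offsets to assign a BIOES-V label to each token. The model name es_core_news_sm and the example sentence are assumptions for illustration; they are not taken from the Cantemist corpus.

    import spacy

    # Load a Spanish spaCy pipeline (es_core_news_sm is an assumed example model).
    nlp = spacy.load("es_core_news_sm")

    text = "Se observa una neoplasia maligna. El paciente inicia tratamiento."
    doc = nlp(text)

    for sent in doc.sents:            # sentence splitting
        for token in sent:            # tokenization
            # token.idx is the character offset of the token within the text;
            # comparing it with the BRAT start-end offsets allows a BIOES-V
            # label to be assigned to each token.
            print(token.text, token.pos_, token.idx)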
Regarding the training dataset, over which the following methods are trained, the distribution of tokens belonging to each BIOES-V tag is presented in Table 1. Table 1 also includes the distribution of tokens in the test set.

Figure 4: Example of nested entities.

Table 1
BIOES-V statistics in the Cantemist corpus

Label      Training      Test
B             3,071     1,709
I             5,193     2,791
O           439,863   239,701
E             3,070     1,707
S             3,313     1,908
V                20        22
Total       454,530   247,838

Figure 5 shows the main preprocessing tasks involved in our study.

Figure 5: Preprocessing steps involved in the NER task.

3.2. Machine Learning models

As a baseline, we propose the use of a CRF classifier, a type of probabilistic discriminative model that predicts an output variable while giving great weight to the sequence of predictions. The main difference with generative models is that generative models rely on strict dependency assumptions, whereas this discriminative model relaxes such assumptions; this way, it can employ features between which dependencies exist. Until the arrival of deep learning methods, CRF provided the best performance among machine learning methods for NER tasks. A CRF models the conditional distribution p(y|x), i.e., the probability of a label sequence given an observation sequence, according to the following formula [3]:

p_\theta(y|x) = \frac{1}{Z_\theta(x)} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t, x_t) \right)    (1)

Here, \theta_k are the parameters of the distribution and f_k is the feature function defined over the transition from state y_{t-1} to state y_t, where y_{t-1} and y_t are labels of the sequence. Z_\theta is the normalization factor, defined as follows [3]:

Z_\theta(x) = \sum_{y} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t, x_t) \right)    (2)

The model is trained with sentences as input sequences. In addition, for every token in a sequence, the model uses the position of the current token, its PoS (Part-of-Speech) tag, the BIOES-V label of the current token and the label of the previous token; in this way, context information is also taken into account [14].

We also implement a BiLSTM (Bidirectional Long Short-Term Memory) network followed by a CRF classifier. The LSTM is a type of RNN (Recurrent Neural Network). The main characteristic of RNNs is that they employ context information, more specifically information from past observations, to make predictions about the current observation [15]. In this study, the observations are tokens from the text, which are fed to the network as sequential inputs. This ability to "remember" information is crucial in NER, since it is a sequence labelling problem in which the tokens of a text are interdependent. However, RNNs suffer from the vanishing gradient problem, caused by the increasing amount of information to be "remembered". To address this, LSTMs differ from plain RNNs by "forgetting" information and "making space" for more important information [15].

In addition, a bidirectional LSTM is particularly appropriate for NER, since it not only remembers information from past observations but also considers future information. This is achieved by employing two independent LSTMs working in opposite directions. As a result, the final layer of the network (a CRF classifier) receives two vectors, the outputs of the two LSTM layers, as input. The CRF then considers the label predicted for the previous observation in order to predict the label of the current one, a behavior that is necessary in NER given the interdependency between tokens.
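Before detailing the three configurations we explored, the following sketch illustrates the BiLSTM tagger architecture just described, written with Keras. It is a minimal sketch under stated assumptions, not our exact implementation: the vocabulary size and the number of LSTM units are illustrative values, and a softmax decoder stands in for the final CRF layer, which in practice would come from an add-on package (e.g., keras-contrib or TensorFlow Addons).

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

    MAX_LEN = 75        # fixed sentence length, as in our approaches
    VOCAB_SIZE = 20000  # illustrative vocabulary size
    EMB_DIM = 300       # embedding dimension (300 for the pre-trained Spanish embeddings)
    N_TAGS = 7          # BIOES-V tags plus a padding tag

    model = Sequential([
        # Token indices are mapped to dense vectors; the weights of this layer are either
        # random (approach 1) or taken from a pre-trained word embedding (approach 2).
        Embedding(input_dim=VOCAB_SIZE, output_dim=EMB_DIM,
                  input_length=MAX_LEN, mask_zero=True),
        # Two LSTMs read the sentence in opposite directions; their outputs are concatenated.
        Bidirectional(LSTM(units=100, return_sequences=True)),
        # One prediction per token; in the full system a CRF layer replaces this softmax.
        TimeDistributed(Dense(N_TAGS, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()

Padded sequences of 75 token indices would be fed to such a model, with one BIOES-V tag index per position as the target.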
We have implemented three different approaches based on BiLSTM-CRF.

Approach 1: BiLSTM-CRF with random initialization of vectors. This approach consists of building a word vocabulary from the words present in the training dataset and mapping each word to a numeric vector of 40 components. The length of each sentence is set to a fixed size of 75 tokens. Each token in the vocabulary is then assigned a random vector. Unfortunately, this representation does not capture relationships or similarities between tokens. In addition, some tokens in the test dataset may not appear in the training dataset; regardless of their similarity to other tokens in the vocabulary, these tokens are mapped to the value that identifies the "Unknown" label [15].

Approach 2: BiLSTM-CRF with a pre-trained word embedding in Spanish. Word embedding is a technique that represents the words of a vocabulary as vectors of real numbers while capturing semantic and syntactic information [6]. However, word embeddings are not able to capture other features such as PoS tags. Word embedding methods include Word2Vec [16], a neural network-based model that can be trained with two different algorithms, the Continuous Bag of Words model (CBOW) and the Skip-gram model (SG). Other word embedding methods are Global Vectors (GloVe) [17] and fastText [18]. The main disadvantage of using word embeddings is that there may be words in the test set that are not represented in the initial vocabulary; these are out-of-vocabulary (OOV) words.

For this study, the pre-trained word embedding employed is the Scielo + Wikipedia health cased skip-gram model developed by the Plan TL [19]. This model was trained with the fastText implementation over a corpus combining the SciELO database, which contains articles in Spanish, English and Portuguese, and a health-related subset of Wikipedia that includes documents related to Pharmacology, Pharmacy, Medicine and Biology. A very common pre-trained word embedding model employed in Spanish NLP tasks is the Spanish Billion Words Corpus and Embeddings (SBWCE) [20]. However, given the specificity of the domain studied here, the Scielo + Wikipedia health cased skip-gram model [19] seems more appropriate, since it contains specific medical terms that may be found in the Cantemist corpus. Thus, this approach represents each token in the training dataset as a 300-element numeric vector using the vocabulary of the pre-trained Scielo + Wikipedia health cased word embedding model. As in the previous approach, the maximum sentence length is fixed to 75.

Approach 3: BiLSTM-CRF with character embedding. This approach uses the pre-trained word embedding described in approach 2. As a novelty with respect to the previous approach, it also represents each character of a word as a vector of a fixed dimension; for this, a character embedding layer is added to represent words as sequences of characters. This approach is useful when working with out-of-vocabulary (OOV) words, i.e., words that do not appear in the vocabulary the model was trained with. In sum, this approach combines two feature representations of the text as inputs to the system. On the one hand, every token in the text is encoded into a 300-element vector using the vocabulary from the pre-trained word embedding.
On the other hand, every character that forms part of a token is encoded using a character vocabulary, also learned from the pre-trained word embedding. This approach therefore uses two fixed parameters: the maximum sequence length, fixed to 75, and the maximum word length, fixed to 10.

Finally, we also explore the use of Bidirectional Encoder Representations from Transformers (BERT) [12]. BERT is a more recent deep learning approach for NLP that has outperformed prior language models [12]. The main difference between BERT and other strategies is that BERT overcomes the unidirectional constraint of standard language models and considers context information from both directions without the need for two independent LSTMs [12].

The BERT process consists of two stages. The first stage is pre-training the model over unlabeled data; the second stage is addressed either through a feature-based approach or through fine-tuning [12]. In our work, we employ a pre-trained BERT model and then fine-tune it to find the optimal combination of parameters for this specific NER task. Fine-tuning an already trained model is less time consuming and generalizes well, since only a small number of parameters have to be learned from scratch. Since this task involves working with clinical cases in Spanish, a multilingual cased version of BERT is employed instead of the original English model [21]. It has fixed model sizes: L=12, H=768, A=12, with 110M parameters in total, where L denotes the number of layers (or transformer blocks), H is the hidden size and A is the number of self-attention heads. This way, the only parameters to tune during fine-tuning are the maximum sequence length (the context length to consider), the batch size, the learning rate and the number of epochs [12].

In sum, we use an already pre-trained cased BERT model, so no preprocessing steps have been applied to the corpus. Regarding the method's parameters, the maximum sequence length has been fixed to 75, as in the previous approaches. For the training of the model, a batch size of 32, a learning rate of 3e-5 and 3 epochs have been used. The number of epochs specifies the number of times the model goes through all the data; to avoid overfitting, this value must not be too large.

4. Results

The performance of the models has been evaluated using several metrics, the primary ones being precision, recall and F1 score. We report micro-average scores instead of macro-average scores, since the number of tokens belonging to each entity type is imbalanced. The main difference between them is that micro-average scores are computed jointly over all the entity types, whereas macro-average scores are computed individually for each entity type and then averaged.

The following methods have been trained on the training dataset and tuned on the labeled development datasets. Once the optimal parameter combination was found, each final model was trained on the combined training and development datasets. First, we present and discuss the performance of our approaches in terms of their micro-average results calculated over the BIOES-V tags used to represent the tokens. Finally, we also present the general results provided by the Cantemist organizers.
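As a small illustration of the micro- versus macro-average distinction used in this section, the following sketch computes both F1 scores with scikit-learn over a toy set of token labels; the label sequences are invented for illustration and are not taken from the Cantemist corpus.

    from sklearn.metrics import f1_score

    # Toy gold and predicted token labels (illustrative only).
    y_true = ["O", "B", "I", "E", "O", "S", "O", "O"]
    y_pred = ["O", "B", "I", "E", "O", "B", "O", "S"]

    # Micro average: true/false positives are counted globally over all classes,
    # so frequent classes (here "O") dominate the score.
    print("micro F1:", f1_score(y_true, y_pred, average="micro"))

    # Macro average: F1 is computed per class and then averaged, so rare classes
    # weigh as much as frequent ones.
    print("macro F1:", f1_score(y_true, y_pred, average="macro"))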
First, we evaluate the CRF classifier, which has shown good performance in the literature [3, 14] and serves as our baseline. Focusing on its ability to identify tokens that belong to an entity, this baseline achieved micro-average F1 scores of 0.74 and 0.77 on development datasets 1 and 2 respectively, and of 0.78 on the test set, as shown in Figure 6.

BiLSTM-CRF with random initialization achieves micro-average F1 scores of 0.71 and 0.73 on development datasets 1 and 2 respectively when identifying tokens that belong to an entity. On the test dataset, it achieves a micro-average F1 score of 0.78, as shown in Figure 7.

BiLSTM-CRF with a pre-trained word embedding in Spanish obtains micro-average F1 scores of 0.75 and 0.78 on development datasets 1 and 2 respectively when identifying tokens that belong to an entity, and an F1 score of 0.81 on the test dataset, as shown in Figure 8.

BiLSTM-CRF with character embedding obtains micro-average F1 scores of 0.75, 0.72 and 0.76 on the training set and development datasets 1 and 2 respectively when identifying tokens that belong to an entity. On the test dataset, it obtains an F1 score of 0.78, as shown in Figure 9.

Finally, BERT achieves micro-average F1 scores of 0.78 and 0.80 on development datasets 1 and 2 respectively when identifying tokens that belong to an entity, and a micro-average F1 score of 0.82 on the test set (see Figure 10).

In short, CRF established a baseline with a micro-average F1 score of 0.78 on the test set. Regarding the deep learning methods, three strategies combining a Bidirectional LSTM with a CRF were explored. The first strategy, based on a random initialization of the vectors in the vocabulary, did not outperform the baseline (see Figure 11). The second BiLSTM-CRF strategy used a pre-trained word embedding in Spanish for token representation and achieved better performance than both the first BiLSTM-CRF approach and the CRF baseline (see Figure 11). The third BiLSTM-CRF strategy, which incorporated a character embedding along with the token representation, did not achieve better performance than the second BiLSTM-CRF approach or the baseline (see Figure 11). Finally, the last deep learning approach assessed was BERT, which outperformed both the baseline (CRF) and the previous Bi-LSTM approaches (see Figure 11). Therefore, in terms of token-level results, BERT is the best option to recognize tumor morphology mentions in clinical texts.

However, the performance of the models must also be assessed based on the correctly identified entities in each clinical case. This assessment does not consider whether a token has been correctly identified as part of an entity, but whether an entity, which can be formed by several tokens, has been identified in its complete form. The results of this evaluation are captured in Table 2.

Table 2
Precision, Recall and F1 scores obtained over the test set

Model                                            Precision   Recall   F1 Score
CRF                                                  0.800    0.768      0.783
BiLSTM-CRF with random initialization                0.771    0.773      0.772
BiLSTM-CRF with pre-trained word embeddings          0.828    0.769      0.797
BiLSTM-CRF with word and character embeddings        0.784    0.759      0.771
BERT                                                 0.756    0.775      0.765
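Entity-level scores such as those in Table 2 count a prediction as correct only when the complete span of an entity is recovered. The following sketch shows this kind of evaluation with the seqeval library; it is an illustration rather than the official Cantemist scorer, the tag sequences are invented, and the BIOES-V labels used in our system would first have to be mapped to the IOB2 scheme shown here.

    from seqeval.metrics import precision_score, recall_score, f1_score

    # Toy gold and predicted tag sequences in IOB2 format (one inner list per sentence).
    # The entity type mirrors the Cantemist annotation label.
    y_true = [["O", "B-MORFOLOGIA_NEOPLASIA", "I-MORFOLOGIA_NEOPLASIA", "O", "O"],
              ["B-MORFOLOGIA_NEOPLASIA", "O", "O"]]
    y_pred = [["O", "B-MORFOLOGIA_NEOPLASIA", "O", "O", "O"],   # partial span -> counted as an error
              ["B-MORFOLOGIA_NEOPLASIA", "O", "O"]]             # exact span  -> counted as correct

    # seqeval scores entities rather than tokens: only exact span (and type) matches
    # count as true positives, mirroring the entity-level evaluation in Table 2.
    print("precision:", precision_score(y_true, y_pred))
    print("recall:", recall_score(y_true, y_pred))
    print("F1:", f1_score(y_true, y_pred))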
Table 2 shows that the second BiLSTM-CRF approach, initialized with a pre-trained word embedding model, has the best performance among the different methods explored.

5. Conclusion

This study has focused on developing a NER system to automatically identify tumor morphology mentions in health-related documents written in Spanish. As previously mentioned, it was developed within the Cantemist track, from which the corpus employed was obtained. For this purpose, we have explored different machine learning approaches: CRF, Bi-LSTM and BERT. Although the approaches show very similar performance, we can conclude that the Bi-LSTM with pre-trained word embeddings obtains the top entity-level F1 (0.797). However, if we consider the micro-average F1 calculated over the BIOES-V tags used to represent the tokens, BERT provides better results (micro F1=0.82) than the other approaches. As future work, we plan to extend our deep learning models by incorporating semantic information from biomedical dictionaries (such as entity embeddings). We will also explore other hybrid deep learning architectures.

6. Acknowledgments

This work was supported by the Research Program of the Ministry of Economy and Competitiveness - Government of Spain (DeepEMR project, TIN2017-87548-C2-1-R) and by the Interdisciplinary Projects Program for Young Researchers at Universidad Carlos III de Madrid, funded by the Community of Madrid (NLP4Rare-CM-UC3M).

References

[1] J. Li, A. Sun, J. Han, C. Li, A survey on deep learning for named entity recognition, IEEE Transactions on Knowledge and Data Engineering (2020).
[2] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normalization and clinical coding: Overview of the Cantemist track for cancer text mining in Spanish, corpus, guidelines, methods and results, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020.
[3] T. Nguyen, D. Nguyen, P. Rao, Adaptive name entity recognition under highly unbalanced data, arXiv preprint arXiv:2003.10296 (2020).
[4] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, brat: a web-based tool for NLP-assisted text annotation, in: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, pp. 102–107.
[5] V. Cotik, H. Rodríguez, J. Vivaldi, Spanish named entity recognition in the biomedical domain, in: Annual International Symposium on Information Management and Big Data, Springer, 2018, pp. 233–248.
[6] Z. Zhai, D. Q. Nguyen, K. Verspoor, Comparing CNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognition, arXiv preprint arXiv:1808.08450 (2018).
[7] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C.-H. Wei, R. Leaman, A. P. Davis, C. J. Mattingly, T. C. Wiegers, Z. Lu, BioCreative V CDR task corpus: a resource for chemical disease relation extraction, Database 2016 (2016).
[8] Q. Wei, T. Chen, R. Xu, Y. He, L. Gui, Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks, Database 2016 (2016).
[9] H. Cho, H. Lee, Biomedical named entity recognition using deep neural networks with contextual information, BMC Bioinformatics 20 (2019) 735.
[10] R. I. Doğan, R. Leaman, Z. Lu, NCBI disease corpus: a resource for disease name recognition and concept normalization, Journal of Biomedical Informatics 47 (2014) 1–10.
[11] L. Smith, L. K. Tanabe, R. J. nee Ando, C.-J. Kuo, I.-F. Chung, C.-N. Hsu, Y.-S. Lin, R. Klinger, C. M. Friedrich, K. Ganchev, et al., Overview of BioCreative II gene mention recognition, Genome Biology 9 (2008) S2.
[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[13] R. M. R. Zavala, P. Martínez, I. Segura-Bedmar, A hybrid Bi-LSTM-CRF model for knowledge recognition from eHealth documents, in: TASS @ SEPLN, 2018, pp. 65–70.
[14] N. Patil, A. Patil, B. Pawar, Named entity recognition using conditional random fields, Procedia Computer Science 167 (2020) 1181–1188.
[15] X. Ma, E. Hovy, End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF, arXiv preprint arXiv:1603.01354 (2016).
[16] Y. Goldberg, O. Levy, word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method, arXiv preprint arXiv:1402.3722 (2014).
[17] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[18] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.
[19] F. Soares, M. Villegas, A. Gonzalez-Agirre, M. Krallinger, J. Armengol-Estapé, Medical word embeddings for Spanish: Development and evaluation, in: Proceedings of the 2nd Clinical Natural Language Processing Workshop, 2019, pp. 124–133.
[20] C. Cardellino, Spanish Billion Words Corpus and Embeddings (March 2016), URL: http://crscardellino.me/SBWCE (2016).
[21] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: Practical ML for Developing Countries Workshop @ ICLR 2020, 2020.

Figure 6: Evaluation of CRF over the test dataset.
Figure 7: Evaluation of BiLSTM-CRF approach 1 over the test dataset.
Figure 8: Evaluation of BiLSTM-CRF approach 2 over the test dataset.
Figure 9: Evaluation of BiLSTM-CRF approach 3 over the test dataset.
Figure 10: Evaluation of BERT over the test dataset.
Figure 11: Comparison of the evaluation metrics of the NER approaches over the test dataset.