TASS 2018: Workshop on Semantic Analysis at SEPLN, September 2018, pages 65-70

A Hybrid Bi-LSTM-CRF Model for Knowledge Recognition from eHealth Documents

Un Modelo Híbrido Bi-LSTM-CRF para el Reconocimiento de Conocimiento a partir de documentos electrónicos de eSalud

Renzo M. Rivera Zavala, Paloma Martínez, Isabel Segura-Bedmar
Computer Science Department, University Carlos III of Madrid
100371920@alumnos.uc3m.es, pmf@inf.uc3m.es, isegura@inf.uc3m.es

Abstract: In this work, we describe a deep learning architecture for Named Entity Recognition (NER) in biomedical texts. The architecture has two bidirectional Long Short-Term Memory (LSTM) layers and a final layer based on Conditional Random Fields (CRF). Our system obtained first place in subtask A (identification) of TASS-2018-Task 3 eHealth Knowledge Discovery, with an F1 of 87.2%.
Keywords: NER, Bi-LSTM, CRF, Information Extraction

Resumen: En este trabajo, describimos una arquitectura Deep Learning para el reconocimiento de entidades nombradas (NER) en textos biomédicos. La arquitectura se compone de dos capas bidireccionales LSTM (Long Short-Term Memory) y una última capa basada en Conditional Random Fields (CRF). Nuestro sistema obtuvo el primer puesto en la subtarea A (identificación) de la competición TASS-2018-Task 3 eHealth Knowledge Discovery, con una F1 de 87.2%.
Palabras clave: NER, Bi-LSTM, CRF, Extracción de Información

ISSN 1613-0073. Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

1 Introduction

Currently, the amount of biomedical literature is growing at an exponential rate. The substantial number of research works makes it extremely difficult for researchers to keep up with new developments in their research areas. Therefore, the effective management of large amounts of information, and the accuracy of the extracted knowledge, is a vital task. Named Entity Recognition (NER) is one of the fundamental tasks of biomedical text mining; its aim is to identify pieces of text that refer to specific entities of interest.

There are different approaches to the NER problem. Among them, we find dictionary-based methods, which are limited by the size of the dictionary, spelling errors, the use of synonyms, and the constant growth of the vocabulary. Rule-based methods and classical machine learning methods usually require both syntactic and semantic features, as well as characteristics of the language of the specific domain. One of the most effective methods is Conditional Random Fields (CRF) (Lafferty, McCallum, and Pereira, 2001). Recently, deep learning-based methods have also demonstrated state-of-the-art performance by automatically learning relevant patterns from corpora, which makes them largely independent of a specific language or domain. However, until now, deep learning methods have not been able to provide better results than those obtained by classical machine learning methods (Limsopatham and Collier, 2016).

In this paper, we propose a hybrid model combining two bidirectional Long Short-Term Memory (Bi-LSTM) layers with a CRF layer. To do this, we adapt the NeuroNER model proposed in (Dernoncourt, Lee, and Szolovits, 2017) for subtask A (identification) of TASS-2018-Task 3 eHealth Knowledge Discovery (Martínez-Cámara et al., 2018). Specifically, we have extended NeuroNER by adding context information, Part-of-Speech (PoS) tags, and information about overlapping or nested entities. Moreover, in this work, we use two pre-trained embedding models: i) a word2vec model (Spanish Billion Word Embeddings (Cardellino, 2016)), which was trained on the 2014 dump of Wikipedia, and ii) a sense-disambiguation embedding model (Trask, Michalak, and Liu, 2015).
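To make the difference between the two embedding models concrete, a toy lookup can be sketched in a few lines: a plain word-embedding table assigns one vector per surface form, while a sense2vec-style table keys its vectors on the word together with its PoS tag. All tokens and vector values below are invented for the illustration; they are not taken from the pre-trained models cited above.

```python
# Toy illustration: one vector per word vs. one vector per (word, PoS) sense.
# All vectors here are tiny made-up examples, not real model weights.

word_vectors = {
    "cura": [0.1, 0.9],            # a single vector, with all senses conflated
}

sense_vectors = {
    ("cura", "NOUN"): [0.8, 0.1],  # "cure"/"priest" as a noun
    ("cura", "VERB"): [0.0, 0.7],  # "(he/she) cures" as a verb
}

def lookup(token, pos):
    """Prefer the sense-specific vector; fall back to the plain word vector."""
    return sense_vectors.get((token, pos), word_vectors.get(token))

print(lookup("cura", "VERB"))  # -> [0.0, 0.7], the verb-sense vector
print(lookup("cura", "ADJ"))   # -> [0.1, 0.9], fallback to the word vector
```

In the real system the sense-specific vectors come from the pre-trained Reddit sense2vec model described in Section 2.2.2, and both kinds of vectors are concatenated into the network input rather than used as alternatives.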
The rest of the paper is organized as follows. In Section 2, we describe the architecture of our system. Section 3 presents the results. In Section 4, we provide the conclusions.

2 System Description

2.1 Pre-processing

All texts were preprocessed in four steps. First, sentences were split using spaCy (Space.io, 2018), an open-source library for advanced natural language processing with support for 26 languages. Second, sentences and their annotated entities were transformed to the BRAT format (http://brat.nlplab.org/standoff.html), a standoff format similar to the BioNLP Shared Task standoff format. Then, sentences were tokenized. Finally, each token in a sentence was annotated using the BMEWO-V extended tag encoding, to capture information about the sequence of tokens in a given sentence.

The BIOES label scheme, introduced in the work of (Borthwick et al., 1998), arose to overcome the limitations of the BIO scheme for the representation of discontinuous entities: BIOES coding distinguishes the end of an entity through the E (End) tag and adds the S (Single) tag to denote entities composed of a single token. The BMEWO-V encoding used here distinguishes the B tag to indicate the start of an entity, the M tag to indicate the continuation of an entity, the E tag to indicate the end of an entity, the W tag to indicate a single-token entity, and the O tag for tokens that do not belong to any entity. The V tag represents overlapping entities. This encoding scheme therefore allows the representation of discontinuous entities as well as overlapping or nested entities.

2.2 Learning Transfer

In our work, we propose two different embeddings as input to our network: word embeddings and sense-disambiguation embeddings. Below we describe them in more detail.

2.2.1 Word Embeddings

Word embedding is an approach to represent words as vectors of real numbers. There are different methods to obtain these vectors, such as probabilistic models and neural networks. In recent years, neural networks for training word embedding models have gained a lot of popularity in the NLP community because they are able to capture syntactic and semantic information among words. The most popular methods are word2vec (Le and Mikolov, 2014), GloVe, the global model of aggregated word-word co-occurrence statistics (Pennington, Socher, and Manning, 2014), and the morphological representations of fastText (Bojanowski et al., 2017).

In this work, we used the Spanish Billion Words model (Cardellino, 2016), a pre-trained word embedding model trained on different text corpora written in Spanish (such as the AnCora corpus (Taulé, Martí, and Recasens, 2008) and Wikipedia). The details of the pre-trained model are the following:

• Corpus size: approximately 1.5 billion words
• Vocabulary size: 1,000,653
• Vector size: 300
• Algorithm: Skip-gram Bag of Words

2.2.2 Sense-Disambiguation Embedding

We also used the sense2vec model (Trask, Michalak, and Liu, 2015), which provides multiple embeddings for each word based on the sense of the word: it analyzes the context of a word and then assigns its most adequate vector. In this work, we used a pre-trained model generated with the sense2vec tool: the Reddit Vector model presented by (Trask, Michalak, and Liu, 2015), with 22 million words represented as 128-feature vectors, trained on a collection of comments published on Reddit during 2015. The pre-trained Reddit vectors support "senses" that are either partial or full PoS tags or entity tags. The details of the pre-trained model are the following:

• Corpus size: approximately 2 billion words
• Vocabulary size: 1 million
• Vector size: 128
• Algorithm: sense2vec

Figure 1: Overview architecture of our hybrid Bi-LSTM-CRF model.

2.3 The network

2.3.1 Character Embedding Bi-LSTM layer

Although word embeddings are able to capture syntactic and semantic information, other linguistic information such as morphology, orthographic transcription or PoS tags is not exploited. According to (Ling et al., 2015), the use of character embeddings improves learning for specific domains and is useful for morphologically rich languages. For this reason, we decided to include a character embedding representation in our system. We used a vector of 25 dimensions to represent each character. The character alphabet includes all 121 unique characters in the TASS-2018-Task 3 eHealth Knowledge Discovery training, development and test datasets, plus the PADDING token. In this way, tokens in sentences are represented by their corresponding character embeddings, which are the input to the first Bi-LSTM network.

2.3.2 Word and Sense embedding Bi-LSTM layer

The output of the first layer is concatenated with the word embeddings and the sense-disambiguation embeddings of the tokens in a given input sentence. This concatenation of features is the input to the second Bi-LSTM layer. The goal of this layer is to obtain a sequence of probabilities corresponding to each label of the BMEWO-V encoding format. In this way, for each input token, this layer returns six probabilities (one for each tag in BMEWO-V). The final tag should be the one with the highest probability.

The parameters and hyperparameters of the model are the following:

• Word embedding dimension: 300
• Character embedding dimension: 25
• Hidden layer dimension: 100 (for each LSTM, i.e., the forward and the backward layer)
• Learning method: SGD, learning rate: 0.005
• Dropout: 0.5
• Epochs: 100

2.3.3 Conditional Random Fields (CRF) layer

To improve the accuracy of the predictions, we also used a trained CRF model, which takes as input the output of the previous layer and obtains the most probable sequence of predicted labels.

2.4 Post-processing

Once tokens have been annotated with their corresponding labels in the BMEWO-V encoding format, the entity mentions must be transformed back to the BRAT format. V tags, which identify nested or overlapping entities, are generated as new annotations within the scope of other mentions.

3 Evaluation

3.1 Datasets

The evaluation of the proposed model was carried out using the annotated corpus of TASS-2018-Task 3 eHealth Knowledge Discovery (https://github.com/tass18-task3/data). The training set is made up of 5 documents with 3276 entity annotations. The development set consists of 1 document with 1958 entity annotations. The test set consists of 1 document (see Table 2). There are two types of entities: concepts and actions. For this reason, tokens can be annotated with different labels (see Table 1) following the BMEWO-V encoding format.

Entity    Tags
Concept   B/M/E/W/V-Concept
Action    B/M/E/W/V-Action
Others    O

Table 1: Token tags in a sentence

Datasets     Files  Concept  Action
Train        5      2427     849
Development  1      1525     434
Test         1      0        0

Table 2: Dataset statistics

In our experiments, we used precision, recall and F1 score to evaluate the performance of our system. TASS-2018-Task 3 considers two different criteria: partial matching (a tagged entity name is correct only if there is some overlap between it and a gold entity name) and exact matching (a tagged entity name is correct only if its boundaries exactly match those of a gold entity name). A detailed description of the evaluation is available on the web (http://www.sepln.org/workshops/tass/2018/task-3/evaluation.html). Moreover, we used the evaluation script (https://github.com/TASS18-Task3/data/blob/master/score_training.py) provided by the shared task organizers to evaluate our system.

3.2 Results

As described above, our system is based on a network with two Bi-LSTM layers and a final CRF layer. In the first Bi-LSTM layer, we consider the character embeddings. In the second layer, we concatenate the output of the first layer with the word embeddings and the sense-disambiguation embeddings. Finally, the last layer uses a CRF to obtain the most suitable label for each token.

Table 3 compares the results obtained by the original NeuroNER system with our extended version using the pre-trained embedding models and the BMEWO-V encoding format. Our extended version of NeuroNER achieves a significant improvement of 6.8 points of F1 (from 0.804 to 0.872).

System         P      R      F1
NeuroNER       0.824  0.785  0.804
ext. NeuroNER  0.862  0.882  0.872

Table 3: Comparison of NeuroNER and our extended version.

In subtask A (identification of key phrases), our system obtained the top micro F1 (0.872) (see Table 4), significantly outperforming the rest of the participating systems. We will review the proposed systems in greater depth in order to establish comparisons and possible improvements to our implementation.

System             P      R      F
Extended NeuroNER  0.862  0.882  0.872
plubeda            0.77   0.81   0.79
upf-upc            0.86   0.75   0.80
VSP                0.31   0.32   0.32
Marcelo            0.11   0.32   0.17

Table 4: Results of the participating systems in subtask A.

4 Conclusions

Named Entity Recognition (NER) is a crucial task in text mining. In this work, we propose a hybrid Bi-LSTM and CRF model that adds sense-disambiguation embeddings and an extended tag encoding format to detect discontinuous entities, as well as overlapping or nested entities. Our system achieves satisfactory performance without requiring specific domain knowledge or hand-crafted features. It is also important to highlight its language independence, which is key for multi-language tasks. Our results demonstrate that the extended BMEWO-V encoding improves the predictions. Moreover, the pre-trained models help to reduce training time and increase labeling accuracy, achieving the highest F1 for subtask A.

We plan to try other embedding models such as fastText, which contains morphological information. Moreover, we will extend the encoding format to capture distinct types of overlapping or nested entities.

Acknowledgement

This work was supported by the Research Program of the Ministry of Economy and Competitiveness, Government of Spain (project DeepEMR: Clinical information extraction using deep learning and big data techniques, TIN2017-87548-C2-1-R).

References

Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Borthwick, A., J. Sterling, E. Agichtein, and R. Grishman. 1998. Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. Technical report.

Cardellino, C. 2016. Spanish Billion Words Corpus and Embeddings.

Dernoncourt, F., J. Y. Lee, and P. Szolovits. 2017. NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 97–102. Association for Computational Linguistics.

Lafferty, J., A. McCallum, and F. C. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282–289.

Le, Q. and T. Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Limsopatham, N. and N. Collier. 2016. Learning orthographic features in bi-directional LSTM for biomedical named entity recognition. In Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016), pages 10–19.

Ling, W., T. Luís, L. Marujo, R. F. Astudillo, S. Amir, C. Dyer, A. W. Black, and I. Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1520–1530.

Martínez-Cámara, E., Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo-Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, and J. Villena-Román. 2018. Overview of TASS 2018: Opinions, health and emotions. In E. Martínez-Cámara, Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Pennington, J., R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Space.io. 2018. spaCy · Industrial-strength Natural Language Processing in Python.

Taulé, M., M. A. Martí, and M. Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In LREC 2008, pages 96–101.

Trask, A., P. Michalak, and J. Liu. 2015.
sense2vec - a fast and accurate method for word sense disambiguation in neural word embeddings. CoRR, abs/1511.06388.
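As a supplementary illustration of the CRF layer described in Section 2.3.3, whose job is to pick a globally consistent tag sequence rather than the best tag for each token independently, the following self-contained sketch decodes a sequence with the Viterbi algorithm over a reduced BMEWO tag set. All scores and the transition constraints are invented for the example; they are not the weights of the trained model, and the V tag is omitted for brevity.

```python
# Minimal Viterbi decoder sketch: given per-token tag scores (as the second
# Bi-LSTM layer would emit) and tag-transition scores (as a CRF learns),
# recover the highest-scoring tag sequence. All numbers are illustrative.

TAGS = ["B", "M", "E", "W", "O"]  # reduced BMEWO set (V omitted for brevity)

def transition(prev, cur):
    """Heavily penalize structurally impossible moves such as O -> E."""
    impossible = {("O", "M"), ("O", "E"), ("B", "B"), ("B", "O")}
    return -100.0 if (prev, cur) in impossible else 0.0

def viterbi(emissions):
    """emissions: list of {tag: score} dicts, one per token."""
    # best[tag] = (score of best path ending in tag, that path)
    best = {t: (emissions[0][t], [t]) for t in TAGS}
    for em in emissions[1:]:
        new_best = {}
        for cur in TAGS:
            prev, (score, path) = max(
                ((p, best[p]) for p in TAGS),
                key=lambda x: x[1][0] + transition(x[0], cur),
            )
            new_best[cur] = (score + transition(prev, cur) + em[cur], path + [cur])
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]

# Made-up token scores for a three-token entity mention: choosing each tag
# greedily would give B, O, E, which is not a well-formed BMEWO sequence.
ems = [
    {"B": 0.9, "M": 0.1, "E": 0.0, "W": 0.3, "O": 0.1},
    {"B": 0.1, "M": 0.6, "E": 0.2, "W": 0.0, "O": 0.7},
    {"B": 0.0, "M": 0.1, "E": 0.8, "W": 0.1, "O": 0.2},
]
print(viterbi(ems))  # -> ['B', 'M', 'E'], a valid entity span
```

The trained CRF plays the same role with learned, real-valued transition weights instead of the hard -100 penalties used here.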