<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VSP at MEDDOCAN 2019</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Carlos III University of Madrid. Leganes 28911</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>654</fpage>
      <lpage>662</lpage>
      <abstract>
        <p>This work presents the participation of the VSP team in the MEDDOCAN Task with a neural model for Named Entity Recognition in medical documents in Spanish. The Neural Network consists of a two-layer model that creates a feature vector for each word of the sentences. The first layer uses the character information of each word, and its output is aggregated in the second layer together with the word embedding in order to create the feature vector of the word. Both layers are implemented with a bidirectional Recurrent Neural Network with LSTM cells. Moreover, a Conditional Random Field layer classifies the word vectors into one of the 29 types of Protected Health Information (PHI). The system obtains a performance of 86.01%, 87.03%, and 89.12% in F1 for the classification of the entity types, the sensitive span detection, and both tasks merged, respectively. The model shows very high and promising results for a basic approach that uses neither pretrained word embeddings nor any hand-crafted feature.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Recurrent Neural Network</kwd>
        <kwd>Medical Documents</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Víctor Suarez-Paniagua</p>
      <p>Nowadays, healthcare professionals deal with a large amount of unstructured documents, which makes finding the essential data in medical documents very difficult. Reducing the time spent on retrieving the most relevant information can help doctors generate a diagnosis for their patients faster. Although a vast amount of information is available as Electronic Health Records (EHR), manually annotating them is impracticable because of the rapidly increasing number of documents generated per day, and also because they contain sensitive data and Protected Health Information (PHI). For this reason, the development of an automatic system that identifies sensitive information in medical documents is vital for helping doctors and preserving patient confidentiality.</p>
      <p>
        The i2b2 shared task was the first Natural Language Processing (NLP) challenge for identifying PHI in clinical narratives [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The second edition of the i2b2 shared task, Track 1 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], created a gold standard dataset with annotations of the PHI categories from 1,304 medical records in English. In this competition, the highest-ranking system used a Conditional Random Field (CRF) classifier together with hand-written rules for the de-identification of clinical narratives, obtaining very promising results with 97.68% in F1 [14].
      </p>
      <p>
        The goal of the Iberian Languages Evaluation Forum (IberLEF) 2019, which
includes the TASS and IberEval workshops, is to create NLP challenges using
corpora written in one of the Iberian languages (Spanish, Portuguese, Catalan,
Basque or Galician). Following the i2b2 de-identification task, the Medical
Document Anonymization task (MEDDOCAN) encourages the research community
to design NLP systems for the identification of PHI in clinical texts in Spanish
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. For this purpose, a corpus of 1,000 clinical case studies with PHI phrases
was manually annotated by health documentalists.
      </p>
      <p>
        Currently, Deep Learning approaches outperform traditional machine learning
systems on the majority of NLP tasks, such as text classification [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], language
modeling [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and machine translation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Moreover, these models have the
advantage of automatically learning the most relevant features without defining rules
by hand. Concretely, the state-of-the-art performance on the Named Entity
Recognition (NER) task is obtained by the LSTM-CRF model proposed by [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The main idea of
this system is to create a word vector representation using a bidirectional
Recurrent Neural Network with LSTM cells (BiLSTM) with character information
encoded in another BiLSTM layer, in order to classify the tag of each word in the
sentence with a CRF classifier. Following this approach, the system proposed
in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] uses a BiLSTM-CRF model with character and word levels for the
de-identification of patient notes using the i2b2 dataset. This approach outperforms
the top-ranking system in that task, reaching 97.88% in F1.
      </p>
      <p>
        This paper presents the participation of the VSP team at the tasks proposed
by MEDDOCAN: the classification of PHI types and sensitive span
detection from medical documents in Spanish. The proposed system follows the
same approaches as [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with some modifications for the Spanish language,
implemented with the NeuroNER tool [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>The corpus of the MEDDOCAN task contains 1,000 clinical cases with PHI entities manually annotated by health documentalists. The documents are randomly divided into the training, validation and test sets for creating, developing and ranking the different systems, respectively.</p>
      <p>Similarly to the annotation schema of the i2b2 de-identification tasks, the named entities are annotated with their offsets and their type for the detection and classification tasks (see Figure 1). The 29 types of annotated PHI mentions follow the Health Insurance Portability and Accountability Act (HIPAA) guidelines adapted to Spanish health records, aggregating some PHI entities.</p>
    </sec>
    <sec id="sec-method">
      <title>Method</title>
      <p>This section presents the neural architecture for the classification of the PHI entity types and the sensitive span detection in medical documents in Spanish. Figure 2 shows the entire process of the model, which uses two BiLSTMs at the character and token levels in order to create each word representation, up to its classification by a CRF.</p>
      <sec id="sec-2-1">
        <title>Data preprocessing</title>
        <p>
          Before using the system, the documents of the corpus are preprocessed in order to prepare the inputs for the neural model. Firstly, the clinical cases are separated into sentences by a sentence splitter, and the words of these sentences are extracted by a tokenizer; both were adapted to the Spanish language. Once the sentences are divided into words, the BIOES tag schema encodes each token with an entity type. The B tag defines the beginning token of a mention, the I tag defines an inside token of a mention, the E tag defines the ending token of a mention, the S tag indicates that the mention has a single token, and the O tag indicates the outside tokens that do not belong to any mention. In many previous NER tasks, this encoding performed better than the BIO tag scheme [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], but the number of labels increases because there are two additional tags for each class. Thus, the number of possible classes for the MEDDOCAN corpus is the 4 tags times the 29 PHI classes, plus the O tag. For the experiments, all the previous processes are performed with the spaCy tool in Python [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
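        <p>As an illustration of the tagging step above, the following sketch (a hypothetical helper, not the actual spaCy/NeuroNER pipeline) converts character-offset PHI annotations into BIOES tags over a tokenized sentence; the example tokens and labels are invented:</p>
        <preformat>
```python
# Hypothetical BIOES encoder: maps character-offset annotations onto
# tokens, as described in the preprocessing step (illustrative only).

def bioes_encode(tokens, spans):
    """tokens: list of (text, start, end); spans: list of (start, end, label)."""
    tags = ["O"] * len(tokens)
    for s_start, s_end, label in spans:
        # indices of tokens fully covered by the annotated span
        inside = [i for i, (_, t0, t1) in enumerate(tokens)
                  if t0 >= s_start and s_end >= t1]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "S-" + label      # single-token mention
        else:
            tags[inside[0]] = "B-" + label      # beginning of the mention
            tags[inside[-1]] = "E-" + label     # end of the mention
            for i in inside[1:-1]:
                tags[i] = "I-" + label          # inside of the mention
    return tags

tokens = [("Paciente", 0, 8), ("Juan", 9, 13), ("Perez", 14, 19),
          ("Garcia", 20, 26), ("en", 27, 29), ("Madrid", 30, 36)]
spans = [(9, 26, "NOMBRE_SUJETO_ASISTENCIA"), (30, 36, "TERRITORIO")]
tags = bioes_encode(tokens, spans)
print(tags)
```
        </preformat>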
      </sec>
      <sec id="sec-2-2">
        <title>BiLSTM layers</title>
        <p>RNNs are very effective at feature learning when the inputs are sequences. This Deep Learning model uses two different weight matrices, one for the input and one for the previous output:</p>
        <p>h(t) = f(W x(t) + U h(t-1) + b)</p>
        <p>where h(t) is the output at time t for the input x, f is a non-linear function, W are the weights for the current input, U are the weights for the previous output, and b is the bias term of the Neural Network. However, the basic RNN cannot capture long dependencies, because it loses the information of the gradients as back-propagation is applied through the previous states. For this reason, the incorporation of cell units into the RNN computation solves this long gradient propagation problem.</p>
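        <p>The recurrence above can be sketched in a few lines of NumPy, with tanh as the non-linear function f (toy dimensions and random weights, not the model's actual configuration):</p>
        <preformat>
```python
import numpy as np

# The recurrence h(t) = f(W x(t) + U h(t-1) + b) with f = tanh.
# Toy dimensions and random weights, not the paper's configuration.

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W = rng.normal(size=(d_h, d_in))   # weights for the current input
U = rng.normal(size=(d_h, d_h))    # weights for the previous output
b = np.zeros(d_h)                  # bias term

def rnn_forward(xs):
    h = np.zeros(d_h)              # initial hidden state h(0)
    outputs = []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        outputs.append(h)
    return outputs

xs = [rng.normal(size=d_in) for _ in range(5)]   # 5-step input sequence
hs = rnn_forward(xs)
print(len(hs), hs[-1].shape)
```
        </preformat>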
        <p>
          The Long Short-Term Memory cell (LSTM) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] defines four gates for creating a word representation, taking the information of the current and previous cells. The input gate i_t, the forget gate f_t and the output gate o_t of the current step t transform the input vector x_t, taking the previous output h_{t-1}, using their corresponding weights and biases computed with a sigmoid function. The cell state c_t takes the information from the previous cell state c_{t-1}, regulated by the forget gate, and the information from the current candidate cell state c̃_t, regulated by the input gate, using the element-wise product, as represented below:
        </p>
        <p>f_t = σ(W_f [h_{t-1}; x_t] + b_f)</p>
        <p>i_t = σ(W_i [h_{t-1}; x_t] + b_i)</p>
        <p>c̃_t = tanh(W_c [h_{t-1}; x_t] + b_c)</p>
        <p>c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t</p>
        <p>o_t = σ(W_o [h_{t-1}; x_t] + b_o)</p>
        <p>h_t = o_t ⊙ tanh(c_t)</p>
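        <p>A minimal NumPy sketch of one LSTM step following these gate equations (σ is the sigmoid and the products are element-wise; toy sizes and random weights, not the trained model):</p>
        <preformat>
```python
import numpy as np

# One LSTM step following the gate equations above: concatenated
# [h_{t-1}; x_t] input, sigmoid gates, element-wise products.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d_x, d_h = 4, 3
d = d_h + d_x
Wf, Wi, Wc, Wo = [rng.normal(size=(d_h, d)) for _ in range(4)]
bf = bi = bc = bo = np.zeros(d_h)

def lstm_step(h_prev, c_prev, x):
    z = np.concatenate([h_prev, x])      # [h_{t-1}; x_t]
    f = sigmoid(Wf @ z + bf)             # forget gate f_t
    i = sigmoid(Wi @ z + bi)             # input gate i_t
    c_tilde = np.tanh(Wc @ z + bc)       # candidate cell state
    c = f * c_prev + i * c_tilde         # element-wise cell update
    o = sigmoid(Wo @ z + bo)             # output gate o_t
    h = o * np.tanh(c)                   # current output h_t
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in [rng.normal(size=d_x) for _ in range(3)]:
    h, c = lstm_step(h, c, x)
print(h.shape, c.shape)
```
        </preformat>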
        <p>Finally, the current output h_t is the hyperbolic tangent of the cell state, controlled by the output gate. Furthermore, another LSTM can be applied in the other direction, from the end of the sequence to the start. Computing the two representations is beneficial for extracting the relevant features of each word, because words have dependencies in both directions.</p>
        <p>Character level. The first layer takes each word of the sentences individually. These tokens are decomposed into characters, which are the input of the BiLSTM. Once all the inputs are computed by the network, the last output vectors of both directions are concatenated in order to create the vector representation of the word according to its characters.</p>
        <p>Token level. The second layer takes the embedding of each word in the sentence and concatenates it with the character representation output by the first BiLSTM. In addition, a Dropout layer is applied to the word representation in order to prevent overfitting during the training phase. In this case, the outputs of both directions for each token are concatenated for the classification layer.</p>
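        <p>The token-level input described above can be sketched as follows: the word embedding is concatenated with the character-level representation, and inverted dropout is applied during training (toy dimensions; the dropout rate is a hyper-parameter, not the paper's exact setting):</p>
        <preformat>
```python
import numpy as np

# Token-level input: word embedding concatenated with the word's
# character-level representation, then inverted dropout in training.
# Dimensions and the dropout rate are illustrative.

rng = np.random.default_rng(2)
d_word, d_char = 6, 4
word_emb = rng.normal(size=d_word)        # word embedding
char_repr = rng.normal(size=d_char)       # output of the char BiLSTM

x = np.concatenate([word_emb, char_repr]) # input to the token BiLSTM

def dropout(v, p_drop, training=True):
    if not training:
        return v                          # no dropout at test time
    keep = (rng.random(v.shape) >= p_drop).astype(v.dtype)
    return v * keep / (1.0 - p_drop)      # inverted dropout scaling

y = dropout(x, 0.5)
print(x.shape, y.shape)
```
        </preformat>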
      </sec>
      <sec id="sec-2-3">
        <title>Conditional Random Field Classifier</title>
        <p>
          CRF [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is the sequential version of the Softmax classifier that aggregates the label predicted at the previous output as part of the input. In NER tasks, the CRF shows better results than the Softmax because it assigns a higher probability to correctly labelled sequences. For instance, by definition the I tag cannot appear before a B tag or after an E tag. In the proposed system, the CRF classifies the output vector of the token-level BiLSTM layer into one of the classes.
        </p>
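        <p>A toy illustration of why the CRF helps: transition scores can forbid tag sequences that are invalid under the BIOES scheme (e.g. an I tag directly after an O tag), which a per-token Softmax cannot express. The scores below are made up:</p>
        <preformat>
```python
import numpy as np

# Transition scores give invalid BIOES tag bigrams a huge penalty,
# so a correctly ordered sequence always scores higher (toy scores).

tags = ["O", "B", "I", "E", "S"]
ix = {t: k for k, t in enumerate(tags)}

trans = np.zeros((5, 5))                 # transition scores trans[a, b]
trans[ix["O"], ix["I"]] = -1e9           # I cannot follow O
trans[ix["O"], ix["E"]] = -1e9           # E cannot follow O
trans[ix["E"], ix["I"]] = -1e9           # I cannot follow E

def score(seq, emissions):
    # sequence score = sum of emission scores + sum of transition scores
    s = sum(emissions[k][ix[t]] for k, t in enumerate(seq))
    s += sum(trans[ix[a], ix[b]] for a, b in zip(seq, seq[1:]))
    return s

emissions = np.ones((3, 5))              # uniform emissions, 3 tokens
valid, invalid = ["B", "I", "E"], ["O", "I", "E"]
print(score(valid, emissions), score(invalid, emissions))
```
        </preformat>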
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
      <p>The architecture was trained on the training set for 25 epochs with shuffled
mini-batches, choosing the best performance on the validation set. The
values of the two BiLSTM and CRF parameters for generating the prediction
of the test set are presented in Table 1. The embeddings of the characters and
words are randomly initialized and learned during the training of the network.</p>
      <p>Additionally, gradient clipping keeps the weights of the network in a low range, preventing the exploding gradient problem.</p>
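      <p>The gradient clipping mentioned above can be sketched as clipping by global norm (the threshold here is illustrative, not the value used in the experiments):</p>
      <preformat>
```python
import numpy as np

# Gradient clipping by global norm: if the combined norm of all
# gradients exceeds a threshold, rescale them to that threshold.

def clip_by_norm(grads, max_norm):
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads

grads = [np.full(3, 100.0), np.full(2, -50.0)]
clipped = clip_by_norm(grads, 5.0)
norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm)
```
      </preformat>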
      <p>The results were measured with precision (P), recall (R) and F-measure (F1)
using the True Positives (TP), False Positives (FP) and False Negatives (FN)
for its calculation. Table 2 presents the results of the Neural Model with the
two BiLSTM levels and the CRF classifier on the test set of the MEDDOCAN
task. The performance on the NER offset and entity type classification (Task 1) is 86.01% in F1, and the performance on sensitive span detection (Task 2) is 87.03% in F1, considering only entities with an exact boundary match and entity type (Strict). The results for both tasks merged reach 89.12% in F1.</p>
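      <p>For reference, the metrics are computed from the TP, FP and FN counts as follows (the counts in this sketch are invented for illustration; they are not the system's actual confusion counts):</p>
      <preformat>
```python
# Precision, recall and F1 from TP, FP, FN (illustrative counts).

def prf1(tp, fp, fn):
    p = tp / (tp + fp)            # precision: TP / (TP + FP)
    r = tp / (tp + fn)            # recall:    TP / (TP + FN)
    f1 = 2 * p * r / (p + r)      # harmonic mean of P and R
    return p, r, f1

p, r, f1 = prf1(tp=430, fp=70, fn=70)
print(round(p, 3), round(r, 3), round(f1, 3))
```
      </preformat>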
      <p>From the table, it can be observed that the numbers of FN and FP are very similar, giving very similar Precision and Recall results for all the classes. On the one hand, there are classes with very high performance, such as CORREO ELECTRONICO, EDAD SUJETO ASISTENCIA, FECHAS, NOMBRE SUJETO ASISTENCIA and PAIS, which reach more than 95% in F1, because the data appears in the same location across documents and these classes are easy to disambiguate from the remaining ones. On the other hand, the classes OTROS SUJETO ASISTENCIA and PROFESION show very low performance because they have a very small number of instances in the training set, making it hard to learn their representation in the network. In order to alleviate this problem, the use of oversampling techniques is proposed to increase the number of instances of the less represented classes and make the dataset more balanced.</p>
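      <p>The proposed oversampling can be sketched as randomly duplicating sentences that contain under-represented classes until a minimum count is reached (toy data; one label per sentence is a simplification, since real sentences may contain several PHI mentions):</p>
      <preformat>
```python
import random

# Random oversampling sketch: duplicate sentences whose PHI label is
# under-represented until each label reaches min_count (toy data).

def oversample(sentences, labels, min_count):
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    out = list(zip(sentences, labels))
    rng = random.Random(0)
    for lab, c in counts.items():
        pool = [pair for pair in out if pair[1] == lab]
        for _ in range(max(0, min_count - c)):
            out.append(rng.choice(pool))
    return out

data = oversample(["s1", "s2", "s3"], ["PROFESION", "FECHAS", "FECHAS"], 3)
print(len(data))
```
      </preformat>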
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future work</title>
      <p>This work proposes a neural model for the detection and classification of PHI in clinical texts in Spanish. The architecture is based on RNNs over both directions of the sentences, using LSTM cells for the computation of the outputs. Finally, a CRF classifier performs the classification for tagging the PHI entity types. The results show a performance of 86.01% and 87.03% in F1 for the classification of the entity types and the sensitive span detection over the MEDDOCAN corpus, giving 89.12% in F1 for the merged tasks as the official result. The results are very similar in Precision and Recall for all the classes, with low performance on the less represented classes and higher performance on the well-structured PHI entities, such as NOMBRE SUJETO ASISTENCIA, EDAD SUJETO ASISTENCIA, CORREO ELECTRONICO, FECHAS, and PAIS.</p>
      <p>As future work, exploring the contribution of each representation individually and fine-tuning the parameters of the model will be useful in order to increase the performance. In addition, aggregating embeddings from different external information, such as Part-of-Speech tags, syntactic parse trees or semantic tags, could enrich the representation of each word and improve its classification. Moreover, the sentence splitter of spaCy seems to divide sentences when some abbreviations appear, such as 'Dr.', 'Dra.', 'Sr.' or 'Sra.' (Spanish honorific prefixes). For this reason, creating simple rules to avoid these cases could be beneficial for increasing the performance. Furthermore, adding more layers to each BiLSTM is proposed as an extension of the architecture.</p>
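      <p>The rule proposed above could be implemented, for example, as a simple post-processing step that re-joins fragments wrongly split after a Spanish honorific (a sketch, not spaCy's actual API):</p>
      <preformat>
```python
# Post-processing rule sketch: re-join sentence fragments that were
# wrongly split after Spanish honorifics (not spaCy's actual API).

ABBREVS = ("Dr.", "Dra.", "Sr.", "Sra.")

def merge_bad_splits(sentences):
    merged = []
    for sent in sentences:
        if merged and merged[-1].rstrip().endswith(ABBREVS):
            # previous fragment ends with an honorific: glue them back
            merged[-1] = merged[-1].rstrip() + " " + sent.lstrip()
        else:
            merged.append(sent)
    return merged

split = ["Atendido por la Dra.", "Garcia en el hospital.", "Alta el lunes."]
merged = merge_bad_splits(split)
print(merged)
```
      </preformat>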
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cho</surname>
          </string-name>
          , K.,
          <string-name>
            <surname>van Merrienboer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulcehre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bougares</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwenk</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          .
          <source>In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>1724</fpage>
          -
          <lpage>1734</lpage>
          . Association for Computational Linguistics, Doha, Qatar (Oct
          <year>2014</year>
          ). https://doi.org/10.3115/v1/
          <fpage>D14</fpage>
          -1179, https://www.aclweb. org/anthology/D14-1179
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dernoncourt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szolovits</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>NeuroNER: an easy-to-use program for named-entity recognition based on neural networks</article-title>
          .
          <source>In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          . pp.
          <fpage>97</fpage>
          -
          <lpage>102</lpage>
          . Association for Computational Linguistics, Copenhagen, Denmark (Sep
          <year>2017</year>
          ). https://doi.org/10.18653/v1/
          <fpage>D17</fpage>
          -2017, https:// www.aclweb.org/anthology/D17-2017
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dernoncourt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Young</surname>
            <given-names>Lee</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Uzuner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            ,
            <surname>Szolovits</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>De-identification of patient notes with recurrent neural networks</article-title>
          .
          <source>Journal of the American Medical Informatics Association : JAMIA</source>
          <volume>24</volume>
          (
          <year>06 2016</year>
          ). https://doi.org/10.1093/jamia/ocw156
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Explosion</surname>
            <given-names>AI</given-names>
          </string-name>
          : spaCy - Industrial-strength
          <source>Natural Language Processing in Python (</source>
          <year>2017</year>
          ), https://spacy.io/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9(8)</source>
          ,
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>1746</fpage>
          -
          <lpage>1751</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.C.N.</given-names>
          </string-name>
          :
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          . pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          (
          <year>2001</year>
          ), http://dl.acm.org/citation.cfm?id=
          <volume>645530</volume>
          .
          <fpage>655813</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballesteros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawakami</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Neural architectures for named entity recognition</article-title>
          .
          <source>In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>260</fpage>
          -
          <lpage>270</lpage>
          . Association for Computational Linguistics, San Diego, California (Jun
          <year>2016</year>
          ). https://doi.org/10.18653/v1/
          <fpage>N16</fpage>
          -1030, https://www.aclweb.org/anthology/N16-1030
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Marimon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Intxaurrondo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez Martin</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results</article-title>
          .
          <source>In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2019</year>
          ). vol.
          <source>TBA</source>
          , p.
          <source>TBA. CEUR Workshop Proceedings (CEUR-WS.org)</source>
          , Bilbao,
          <source>Spain (Sep</source>
          <year>2019</year>
          ), TBA
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <volume>3111</volume>
          {
          <issue>3119</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ratinov</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Design challenges and misconceptions in named entity recognition</article-title>
          .
          <source>In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)</source>
          . pp.
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          .
          Association for Computational Linguistics (
          <year>2009</year>
          ), http://aclweb.org/anthology/W09-1119
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Stubbs</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotfila</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>58</volume>
          (
          <year>2015</year>
          ). https://doi.org/10.1016/j.jbi.2015.06.007
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>Ö.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szolovits</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Evaluating the state-of-the-art in automatic de-identification</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>14</volume>
          (
          <issue>5</issue>
          ),
          <fpage>550</fpage>
          -
          <lpage>563</lpage>
          (
          <year>2007</year>
          ). https://doi.org/10.1197/jamia.M2444, http://www.sciencedirect.com/science/article/pii/S106750270700179X
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garibaldi</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>Automatic detection of protected health information from clinic narratives</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>58</volume>
          ,
          <fpage>S30</fpage>
          -
          <lpage>S38</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>