<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NLNDE: The Neither-Language-Nor-Domain-Experts' Way of Spanish Medical Document De-Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lukas Lange</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heike Adel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jannik Strötgen</string-name>
          <email>Jannik.Stroetgeng@de.bosch.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bosch Center for Artificial Intelligence, Robert-Bosch-Campus 1</institution>
          ,
          <addr-line>71272 Renningen, Germany</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Saarbrucken Graduate School of Computer Science Saarland Informatics Campus, Saarland University</institution>
          ,
          <addr-line>Saarbrucken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Spoken Language Systems</institution>
          ,
          <addr-line>LSV</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>671</fpage>
      <lpage>678</lpage>
      <abstract>
        <p>Natural language processing has huge potential in the medical domain, which recently led to a lot of research in this field. However, a prerequisite for the secure processing of medical documents, e.g., patient notes and clinical trials, is the proper de-identification of privacy-sensitive information. In this paper, we describe our NLNDE system, with which we participated in the MEDDOCAN competition, the medical document anonymization task of IberLEF 2019. We address the task of detecting and classifying protected health information in Spanish data as a sequence-labeling problem and investigate different embedding methods for our neural network. Despite the non-standard language and domain setting, the NLNDE system achieves promising results in the competition.</p>
      </abstract>
      <kwd-group>
        <kwd>De-Identification</kwd>
        <kwd>Embeddings</kwd>
        <kwd>Recurrent Neural Networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The anonymization of privacy-sensitive information is of increasing importance
in the age of digitalization and machine learning. It is particularly relevant for
texts from the medical domain, which by nature contain a large amount of sensitive
information. The shared task MEDDOCAN (Medical Document
Anonymization) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] aims at automatically detecting protected health information (PHI)
from Spanish medical documents. Following the past de-identification task on
English PubMed abstracts [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], it is the first competition on this topic on Spanish
data.
      </p>
      <p>In this paper, we describe our submissions to MEDDOCAN and their
results. We, as Neither Language Nor Domain Experts (NLNDE), address the
anonymization task as a sequence-labeling problem and use a combination of
different state-of-the-art approaches from natural language processing to tackle
its challenges.</p>
    </sec>
    <sec id="sec-2">
      <p>
        We train recurrent neural networks with conditional random field output
layers, which are state of the art for different sequence-labeling tasks, such as named
entity recognition [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], part-of-speech tagging [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or de-identification [
        <xref ref-type="bibr" rid="ref10 ref8">8,10</xref>
        ].
Recently, the field of natural language processing has seen another boost in
performance from context-aware language representations that are pre-trained
on large amounts of unlabeled text [
        <xref ref-type="bibr" rid="ref1 ref12 ref4">1,4,12</xref>
        ]. Therefore, we experiment with
FLAIR embeddings for Spanish [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to represent the input of our networks. In
our different runs, we further explore the advantages of domain-specific fastText
embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that have been pre-trained on SciELO and Wikipedia articles
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>From a natural-language-processing perspective, the MEDDOCAN task is
interesting due to the non-standard domain (medicine) and language (Spanish) of
the documents. The results of our submissions show that state-of-the-art
architectures for sequence-labeling tasks can be directly transferred to these settings
and that domain-specific embeddings are helpful but not necessary.</p>
      <sec id="sec-2-1">
        <title>Methods</title>
        <p>In this section, we first give an overview of the different embedding methods we
use in our system. Second, we describe the architecture of our system.</p>
        <p>[Fig. 1. Overview of the input representations: character embeddings fed to a BiLSTM, fastText n-gram embeddings (e.g., &lt;añ, año, ños, os&gt; for "años"), and FLAIR embeddings from a pretrained BiLSTM language model.]</p>
        <p>
Character Embedding: The characters of a word are represented by randomly
initialized embeddings. Those are passed to a bi-directional long short-term
memory network (BiLSTM). The last hidden states of the forward and backward pass
are concatenated to represent the word [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
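As an illustration of this layer, the following sketch concatenates the last forward and backward states into a word representation. It is a simplified stand-in, assuming a plain tanh RNN cell with fixed pseudo-random weights in place of the trained LSTM; `char_vec`, `rnn_pass`, and the 4-dimensional toy size are our own choices, not the paper's.

```python
import math
import random

DIM = 4  # toy size; the paper uses 50-dimensional character embeddings

def char_vec(c):
    # Stand-in for a learned character embedding: a fixed pseudo-random
    # vector derived from the character's code point.
    rnd = random.Random(ord(c))
    return [rnd.uniform(-1, 1) for _ in range(DIM)]

def rnn_pass(chars):
    # One directional pass with a plain tanh RNN cell (the real system
    # uses a trained LSTM); returns the last hidden state.
    h = [0.0] * DIM
    for c in chars:
        x = char_vec(c)
        h = [math.tanh(0.5 * h[i] + 0.5 * x[i]) for i in range(DIM)]
    return h

def char_word_embedding(word):
    # Concatenate the last hidden states of the forward and backward passes.
    return rnn_pass(list(word)) + rnn_pass(list(reversed(word)))

print(len(char_word_embedding("años")))  # 8, i.e., 2 * DIM
```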
        <p>
          FastText Embedding: The fastText embeddings represent a word by the
normalized sum of the embeddings for the n-grams of the word [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
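A minimal sketch of this scheme: character n-gram extraction with fastText's word-boundary markers and a normalized sum over a hypothetical n-gram vector table (real fastText additionally hashes n-grams into buckets and includes the full word itself).

```python
import math

def char_ngrams(word, n_min=3, n_max=6):
    # fastText wraps the word in boundary markers before extracting
    # character n-grams, e.g. "años" -> "<años>" -> <añ, año, ños, os>, ...
    marked = "<" + word + ">"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def fasttext_embedding(word, ngram_vectors, dim):
    # Sum the vectors of the word's n-grams, then L2-normalize the result.
    total = [0.0] * dim
    for ng in char_ngrams(word):
        vec = ngram_vectors.get(ng)
        if vec:
            for i, v in enumerate(vec):
                total[i] += v
    norm = math.sqrt(sum(v * v for v in total)) or 1.0
    return [v / norm for v in total]

print(char_ngrams("años", 3, 3))  # ['<añ', 'año', 'ños', 'os>']
```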
        <p>
          FLAIR Embedding: FLAIR computes character-based embeddings for each word
depending on all words in the context [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. For this, the complete sentence is used
as the input to the BiLSTM instead of only a single word. The BiLSTM of
FLAIR is pretrained using a character-level language model objective, i.e., given
a sequence of characters, compute the probability of the next character.
        </p>
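To illustrate the training objective only (not the model: FLAIR trains a BiLSTM language model, whereas this sketch estimates next-character probabilities from simple bigram counts):

```python
from collections import Counter, defaultdict

def train_char_lm(corpus):
    # Count character bigrams to estimate P(next char | current char).
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts

def next_char_prob(counts, prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

lm = train_char_lm("diecisiete años sin antecedentes")
print(next_char_prob(lm, "a", "ñ"))  # 0.5: in this corpus "a" is followed by "ñ" or "n"
```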
        <sec id="sec-2-1-1">
          <title>NLNDE System</title>
          <p>[Fig. 2. The NLNDE system: embeddings of the input tokens (e.g., "Pedro de diecisiete años sin antecedentes ...") are fed into a BiLSTM with a CRF output layer; its BIO predictions (O, B-NOM_SA, B-EDAD, I-EDAD, ...) are converted by postprocessing into annotations for task 1 (T1) and task 2 (T2), e.g., NOMBRE_SUJETO_ASISTENCIA 0 5 "Pedro" and EDAD_SUJETO_ASISTENCIA 9 24 "diecisiete años".]</p>
          <p>In Figure 2, the architecture of our model is depicted. In the following, we
explain the different layers.</p>
          <p>
            Input Representation. We tokenize the input using the tokenizer provided by
the shared task organizers [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. Then, we represent each token with embeddings.
In our runs, we investigate the impact of the following kinds of embeddings:
the output of an LSTM over character embeddings (50 dimensions, randomly
initialized and fine-tuned during training), domain-independent fastText
embeddings (300 dimensions, pre-trained on Spanish text [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]), domain-specific fastText
embeddings (100 dimensions, pre-trained on Spanish SciELO and Wikipedia
articles [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]), and FLAIR embeddings (4096 dimensions, pre-trained on Spanish
text [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]). For FLAIR embeddings, we also test their pooled version (8192
dimensions, using min pooling) [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. Note that except for the character embeddings, we
do not fine-tune any of the embeddings.
          </p>
          <p>
            BiLSTM-CRF Layers. The embeddings are fed into a BiLSTM with a
conditional random field (CRF) output layer, similar to Lample et al. [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ].
The CRF output layer is a linear-chain CRF, i.e., it learns transition scores
between the output classes. For training, the forward algorithm is used to sum
the scores for all possible sequences. During decoding, the Viterbi algorithm is
applied to obtain the sequence with the maximum score. Note that the
hyperparameters are the same across all runs. We use a BiLSTM hidden size of 256 and
train the network with mini-batch stochastic gradient descent using a learning
rate of 0.1 and a batch size of 32. For regularization, we employ early stopping
on the development set and apply dropout with probability 0.5 on the input
representations.
          </p>
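The decoding step can be sketched as follows: a generic Viterbi search over given emission and transition scores. The toy labels and scores are illustrative, not taken from the trained model.

```python
def viterbi(emissions, transitions, labels):
    # emissions: list of {label: score} per token; transitions: {(prev, cur): score}.
    # Returns the label sequence with the maximum total score, as in
    # linear-chain CRF decoding.
    best = {l: (emissions[0][l], [l]) for l in labels}
    for emit in emissions[1:]:
        new_best = {}
        for cur in labels:
            # Pick the best predecessor for the current label.
            prev, (score, path) = max(
                ((p, best[p]) for p in labels),
                key=lambda item: item[1][0] + transitions[(item[0], cur)],
            )
            new_best[cur] = (score + transitions[(prev, cur)] + emit[cur], path + [cur])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]

labels = ["O", "B-EDAD"]
emissions = [{"O": 0.2, "B-EDAD": 1.0}, {"O": 1.5, "B-EDAD": 0.1}]
transitions = {(a, b): 0.0 for a in labels for b in labels}
print(viterbi(emissions, transitions, labels))  # ['B-EDAD', 'O']
```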
          <p>
            Postprocessing. The output of the model is further adjusted by a
post-processing layer, similar to Yang et al. [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] and Liu et al. [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. As some classes
from the annotation guidelines 4 do not occur in the training data, we tackle
them with pattern matching. For this, we use regular expressions for URLs,
IP addresses, and MAC addresses to detect the classes URL WEB and DIREC PROT INTERNET,
overwriting the results of the neural classifier.
          </p>
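A sketch of such a rule layer. The regular expressions below are our own simplified guesses in the spirit of the description, not the expressions used in the system.

```python
import re

# Simplified illustrative patterns for the two classes handled by rules.
PATTERNS = {
    "URL_WEB": re.compile(r"https?://\S+"),
    "DIREC_PROT_INTERNET": re.compile(
        r"\d{1,3}(?:\.\d{1,3}){3}"                # IPv4 address
        r"|(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}"  # MAC address
    ),
}

def postprocess(tokens, predictions):
    # Overwrite the neural classifier's prediction wherever a pattern
    # matches the whole token.
    out = list(predictions)
    for i, tok in enumerate(tokens):
        for label, pattern in PATTERNS.items():
            if pattern.fullmatch(tok):
                out[i] = label
    return out

tokens = ["Pedro", "http://example.org", "00:1A:2B:3C:4D:5E"]
print(postprocess(tokens, ["B-NOM_SA", "O", "O"]))
# ['B-NOM_SA', 'URL_WEB', 'DIREC_PROT_INTERNET']
```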
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Submissions</title>
        <p>We submitted five runs to the MEDDOCAN competition. All of them are based
on the architecture described in Section 2.2. They only differ in the
input representations used.</p>
        <p>S1 (Char+fastText+Domain): Our first run uses a combination of character
embeddings, domain-independent fastText embeddings as well as
domain-specific fastText embeddings to represent tokens. The resulting
representation for each token has 450 dimensions.</p>
        <p>S2 (FLAIR+fastText): In contrast to all other runs, the second run uses only
domain-independent embeddings, i.e., embeddings that have been trained
on standard narrative and news data from Common Crawl and Wikipedia.
In particular, it uses a combination of domain-independent fastText
embeddings and FLAIR embeddings.</p>
        <p>S3 (FLAIR+fastText+Domain): The third run adds domain-specific fastText
embeddings to the system of the second run in order to investigate the impact
of domain knowledge.
4 http://temu.bsc.es/meddocan/index.php/annotation-guidelines/
S4 (PooledFLAIR): The fourth run is equal to the third run, except that we use
the minimum-pooling version of the FLAIR embeddings.</p>
        <p>S5 (Ensemble): The fifth run is an ensemble of the previous four runs using
weighted voting: Each classifier C<sub>i</sub> is assigned a weight w<sub>i</sub> ∈ [0.5, 3]. For each
output label, the weights of the classifiers predicting it are summed. Then,
the label with the highest score is chosen if it exceeds a specific threshold t ∈
[1, 5], or O (no PHI class) otherwise. The weights and threshold are selected
based on results on the development set as follows: w<sub>1</sub> = 0.5, w<sub>2</sub> = 2.0,
w<sub>3</sub> = 2.5, w<sub>4</sub> = 0.5 and t = 3. With these parameters, a label needs votes
from at least two classifiers (w<sub>i</sub> &lt; t for all i ∈ {1, 2, 3, 4}). However, the models of
the submissions S2 and S3 are assigned higher weights than S1 and S4. This
reflects their performance (see next section).</p>
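For a single token, this voting scheme can be sketched as follows, using the weights and threshold stated above; reading "exceeds" as a >= comparison is our assumption.

```python
def ensemble_vote(predictions, weights=(0.5, 2.0, 2.5, 0.5), threshold=3.0):
    # predictions: the label predicted by each of the four classifiers for
    # one token. Sum the weights per label; return the best label if its
    # score reaches the threshold, otherwise O (no PHI class).
    scores = {}
    for label, weight in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + weight
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "O"

# S2 and S3 alone carry 2.0 + 2.5 = 4.5 >= 3, so their joint vote wins.
print(ensemble_vote(["O", "B-EDAD", "B-EDAD", "O"]))  # B-EDAD
```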
      </sec>
      <sec id="sec-2-3">
        <title>Results and Analysis</title>
        <p>
          This section describes our results and analysis. We report the results on the
MEDDOCAN test set using the official shared task evaluation measures [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>Results for Task 1: NER Offset and Entity Type Classification
In the first sub-task, the systems need to find spans for de-identification and
categorize them into one of 29 classes. Table 1 presents our results on this
sub-task.</p>
        <p>While the domain-independent system (run 2 with FLAIR and
domain-independent fastText embeddings) leads to the highest recall values, the third
run, which also uses domain-specific fastText embeddings, achieves the highest F1
scores. This shows that integrating domain knowledge into the token
representation is beneficial. However, the differences among the five runs are rather small,
indicating that the architecture itself is already strong enough for the given
dataset and the impact of different input representations is minor.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title/>
      <p>Since the official evaluation measure for this task is the strict one, we focus
our explanation on Table 2. The main ranking of our models is the same as
the ranking for sub-task 1: the addition of domain-specific input representations
performs best. Interestingly, the domain-specific input representations (run 3)
now also perform best in terms of recall, while the domain-independent input
representations (run 2) perform best in terms of precision.</p>
      <p>In both sub-tasks, FLAIR embeddings outperform standard character
embeddings (except for the evaluation type merge in Table 3). Also, for both
sub-tasks, pooling of FLAIR embeddings leads to worse results. Surprisingly, run 5,
i.e., the ensemble of the models from runs 1–4, does not improve the results over
single models.</p>
      <sec id="sec-3-1">
        <title>Confusion Matrix Analysis</title>
        <p>Table 4 shows the confusion matrix of our best performing system (run 3). It is
similar to the identity matrix, i.e., confusions between classes happen very rarely.
The most confusions happen with O, the label we assign to all non-PHI terms,
which might be caused by the high number of occurrences of this class in the
training dataset. Confusions among PHI classes happen mostly between related
classes. For example, Hospital (HOS) and Institution (INST) are confused quite
often, as Hospital is a subclass of Institution and other medical institutions
are tagged with Hospital and vice versa, e.g., Clinica Gnation is an institution
tagged as a hospital.2 Analogously, Streets (CALLE) and Territories (TER)
are often confused, as both classes are related and typically consist
of multiple tokens. In contrast, Countries (PAIS) are tagged correctly
almost every time, as there is only a very limited number of countries and they
are usually single-token expressions.</p>
        <p>2 Abbreviations for entity types: CALLE (CALLE), CENTRO SALUD (CS), CORREO ELECTRONICO (EMAIL), EDAD SUJETO ASISTENCIA (EDAD), FAMILIARES SUJETO ASISTENCIA (FAM), FECHAS (FECHA), HOSPITAL (HOS), INSTITUCION (INST), the ID classes (ID ...), NOMBRE PERSONAL SANITARIO (NOM PS), NOMBRE SUJETO ASISTENCIA (NOM SA), NUMERO FAX (#FAX), NUMERO TELEFONO (#TEL), OTROS SUJETO ASISTENCIA (OTRO), PAIS (PAIS), PROFESION (PROF), SEXO SUJETO ASISTENCIA (SEXO), TERRITORIO (TER)</p>
        <p>
As mentioned above, the performance difference between our systems is rather
small. This may be caused by the synthetic augmentation of the MEDDOCAN
data, which was used to extend the texts with header and footer information
containing many PHI terms. In fact, 85% of PHI terms appear in the augmented
text parts. While this extension is necessary to cover more classes and PHI terms,
the synthetic nature of these extensions may have an impact on the performance
of automatic classifiers. Therefore, we perform a case study in which we remove
these parts from the test set and compare only the predictions found in the real
text. Only 838 out of 5661 (14.8%) annotations and only 13 out of 29 classes
remain in this experiment. The performance of our systems decreases to F1
scores around 0.90, which is still rather high. This shows that our systems have
learned more than just to reproduce the synthetic data augmentation. However,
the performance differences among our systems are still small, indicating that
the data augmentation was not the reason for this behavior. Note, however, that
we did not retrain our models without the synthetic augmentation.</p>
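The confusion-matrix analysis above can be reproduced from gold and predicted label sequences with a simple counter; the example labels are illustrative.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    # Count (gold, predicted) label pairs; diagonal entries are correct
    # predictions, off-diagonal entries are confusions.
    return Counter(zip(gold, predicted))

gold = ["HOS", "HOS", "INST", "O", "CALLE"]
pred = ["HOS", "INST", "INST", "O", "TER"]
cm = confusion_matrix(gold, pred)
print(cm[("HOS", "INST")])  # 1
```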
        <sec id="sec-3-1-1">
          <title>Conclusions</title>
          <p>In this paper, we described the system with which we participated in the
MEDDOCAN competition on automatically detecting protected health information
from Spanish medical documents. As neither language nor domain experts, we
addressed the task with a sequence-labeling model. In particular, we trained
a bi-directional long short-term memory network and explored different input
representations. All of our runs achieved high performance, with F1 scores of
about 97%.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Akbik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vollgraf</surname>
          </string-name>
          , R.:
          <article-title>Pooled contextualized embeddings for named entity recognition</article-title>
          .
          <source>In: Proc. of NAACL</source>
          . pp.
          <volume>724</volume>
          –
          <issue>728</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Akbik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blythe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vollgraf</surname>
          </string-name>
          , R.:
          <article-title>Contextual string embeddings for sequence labeling</article-title>
          .
          <source>In: Proc. of COLING</source>
          . pp.
          <volume>1638</volume>
          –
          <issue>1649</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>5</volume>
          ,
          <issue>135</issue>
          –
          <fpage>146</fpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1162/tacl_a_00051
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: Proc. of NAACL</source>
          . pp.
          <volume>4171</volume>
          –
          <issue>4186</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Learning word vectors for 157 languages</article-title>
          .
          <source>In: Proc. of LREC</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Intxaurrondo</surname>
            ,
            <given-names>A.:</given-names>
          </string-name>
          <article-title>SPACCC (Spanish Clinical Case Corpus) tokenizer</article-title>
          (
          <year>Mar 2019</year>
          ). https://doi.org/10.5281/zenodo.2586978
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kemos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Schutze, H.:
          <article-title>Neural semi-Markov conditional random elds for robust character-based part-of-speech tagging</article-title>
          .
          <source>In: Proc. of NAACL</source>
          . pp.
          <volume>2736</volume>
          –
          <issue>2743</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Khin</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burckhardt</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Padman</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>A deep learning architecture for de-identification of patient notes: Implementation and evaluation</article-title>
          . CoRR abs/1810.01570 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballesteros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawakami</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Neural architectures for named entity recognition</article-title>
          .
          <source>In: Proc. of NAACL</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>De-identification of clinical notes via recurrent neural network and conditional random field</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>75</volume>
          ,
          <issue>S34</issue>
          –
          <fpage>S42</fpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1016/j.jbi.2017.05.023
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Marimon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Intxaurrondo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez</surname>
            <given-names>Martin</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.A.</given-names>
            ,
            <surname>Villegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results</article-title>
          .
          <source>In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2019</year>
          ). vol.
          <source>TBA</source>
          , p.
          <source>TBA. CEUR Workshop Proceedings (CEUR-WS.org)</source>
          , Bilbao,
          <source>Spain (Sep</source>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In: Proc. of NAACL</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Armengol-Estape</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Medical word embeddings for Spanish: Development and evaluation</article-title>
          (
          <year>Jun 2019</year>
          ). https://www.aclweb.org/anthology/W19-1916
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Stubbs</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>O</given-names>
          </string-name>
          .
          <article-title>: Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus</article-title>
          .
          <source>Journal of Biomedical Informatics 58, S20–S29</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garibaldi</surname>
            ,
            <given-names>J.M.:</given-names>
          </string-name>
          <article-title>Automatic detection of protected health information from clinic narratives</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>58</volume>
          ,
          <issue>S30</issue>
          –
          <fpage>S38</fpage>
          (
          <year>2015</year>
          ). https://doi.org/10.1016/j.jbi.2015.06.015
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>