E2EJ: Anonymization of Spanish Medical Records using End-to-End Joint Neural Networks

Mohammed Jabreel1, Fadi Hassan2, Najlaa Maaroof1, David Sánchez2, Josep Domingo-Ferrer2, and Antonio Moreno1

1 iTAKA: Intelligent Technologies for Advanced Knowledge Acquisition, Department of Computer Science and Mathematics
2 CYBERCAT-Center for Cybersecurity Research of Catalonia, UNESCO Chair in Data Privacy
Universitat Rovira i Virgili, Av. Països Catalans 26, E-43007 Tarragona, Catalonia
{mohammed.jabreel, fadi.hassan, najlaa.maaroof, david.sanchez, josep.domingo, antonio.moreno}@urv.cat

Abstract. This paper describes E2EJ, the system that we developed to participate in the Medical Document Anonymization challenge, a shared task of IberLEF 2019. E2EJ is a data-driven, end-to-end neural network that does not rely on external resources such as part-of-speech taggers. It solves two problems jointly: the first is to automatically identify whether a token is sensitive, whereas the second is to identify the type of the token. E2EJ shows results comparable to the state-of-the-art systems and outperforms the baseline systems. The F1 scores of our system on the test set are 96.61% and 95.83% for the sensitivity detection and the token type identification tasks, respectively.

Keywords: Anonymization · CRF · Medical Documents · Deep Learning

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 September 2019, Bilbao, Spain.

1 Introduction

Patient notes in electronic health records (EHRs) contain critical information that may be useful for medical investigations. However, due to privacy concerns, the vast majority of medical investigators can only access anonymized or de-identified notes, to protect the confidentiality of patients [1]. Anonymization can be either manual or automated. Manual anonymization means that human annotators label protected health information (PHI). This approach has some drawbacks. First, only a limited set of individuals is allowed to access the identified patient notes, so the task cannot be crowd-sourced. Second, humans are prone to mistakes. Third, manual anonymization is impractical given the size of EHR databases. Therefore, a reliable automated anonymization system would be of high value [14, 8].

In the literature there are many systems for EHR anonymization, which can be categorized as rule-based, feature-engineering-based, or deep-learning-based approaches. Starting from a seed collection of sensitive tokens, rule-based systems manually engineer rules, based on regular expressions or on syntactic or dependency structures, to expand the collection iteratively [13, 9]. Feature-engineering-based systems train a sequence tagger with rich, hand-crafted features based on linguistic or syntactic information from an annotated corpus to predict a label (e.g., O, B-<entity> or I-<entity>) for each token in a sentence [5].
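To make the BIO labeling scheme just mentioned concrete, here is a minimal, hypothetical example. The sentence and the entity type names NOMBRE and FECHA are ours, for illustration only; they are not taken from the MEDDOCAN corpus:

```python
# Illustrative only: BIO labels over a tokenized Spanish sentence.
# "B-" marks the first token of a sensitive entity, "I-" a continuation,
# and "O" a non-sensitive token.
tokens = ["Paciente", "Juan",     "Pérez",    "ingresado", "el", "3",       "de",      "mayo"]
labels = ["O",        "B-NOMBRE", "I-NOMBRE", "O",         "O",  "B-FECHA", "I-FECHA", "I-FECHA"]
```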
Despite some promising results, rule-based and feature-engineering-based approaches suffer from two main issues. First, engineering rules and features from linguistic and syntactic information is labor-intensive and time-consuming, and rules constantly need to be updated. Second, the systems of these two categories depend on external components, such as a parser that analyzes the syntactic and dependency structure of sentences; hence, their performance relies on the quality of the parsing results [14, 9]. To avoid these issues, deep learning has been used to develop systems that learn high-level representations of each token, on which a classifier or sequence tagger can be trained [6].

Medical Document Anonymization (MEDDOCAN) [7] is a shared task of IberLEF 2019 dedicated to EHRs in the Spanish language. It comprises two sub-tasks: "sensitive token detection" and "NER offset and entity type classification". The first sub-task aims to identify the sensitive tokens in a document. It can be solved as a token-level binary classification problem, in which a system takes a document as input and classifies each token as sensitive or not. The second sub-task aims at identifying the type of each token in a document. It can be modeled as a sequence tagging problem: the input is a sequence of tokens, and the output is their corresponding labels.

We participated in the MEDDOCAN challenge by developing E2EJ, a joint, end-to-end neural network-based system for the two sub-tasks. The proposed system provides an end-to-end solution and does not require any parsers or other linguistic resources. Specifically, it is a multilayer neural network in which the first three layers learn a high-level representation of a sequence of tokens; their output is then passed jointly to two sub-models that are learned interactively, one for extracting the sensitive tokens and the other for identifying their types.

The rest of the paper is structured as follows: Section 2 presents the methodology; Section 3 explains the dataset, baselines, and experimental settings; Section 4 presents and discusses the results; finally, Section 5 concludes the paper.

2 System Description

The main distinction between our model and the deep-learning-based systems in the literature is that ours considers the interaction between the two tasks of sensitivity detection and token type identification. In this section, we introduce E2EJ and its implementation in detail. Fig. 1 depicts the architecture of our model.

[Fig. 1. E2EJ Architecture. For each token, a word-level embedding and a character-level embedding (the latter produced by a Conv1D over forward and backward character LSTM inputs) feed a shared BiLSTM encoder; its outputs are passed to a Conv1D + MLP head that answers "Sensitive?" and to a Conv1D + CRF head that emits the O / B-? / I-? tags.]

2.1 Embedding Layer

The goal of the embedding layer is to represent each word w_i of an input sentence S by a low-dimensional vector v_i ∈ R^d, where d is the size of the embedding layer. We use two levels of embedding: word-level and character-level. For the word-level embedding, we replace w_i with its pre-trained GloVe word embedding vector v_i^w [11]. For the character-level embedding, we use a single-layer one-dimensional convolutional neural network (Conv1D) with max-over-time pooling, as follows. Suppose that w_i is made up of a sequence of characters [c_1, c_2, ..., c_n], where n is the length of w_i. First, we pass the sequence of characters of w_i to a randomly initialized character embedding layer to obtain the matrix C^i ∈ R^{r×n}, the character-level representation of w_i, whose j-th column is the character embedding of c_j. After that, we apply a narrow convolution between C^i and a filter (or kernel) H ∈ R^{r×k} of width k, add a bias, and apply a nonlinearity to obtain a feature map f^i ∈ R^{n−k+1}. Specifically, the m-th element of f^i is given by:

    f^i[m] = tanh(⟨C^i[∗, m : m+k−1], H⟩ + b)    (1)

where C^i[∗, m : m+k−1] denotes the m-th to (m+k−1)-th columns of C^i and ⟨A, B⟩ is the Frobenius inner product. Finally, we take the max over time

    v_i^c = max_m f^i[m]    (2)

as the feature corresponding to the filter H (when applied to word w_i). A filter essentially picks out a character n-gram, where the size of the n-gram corresponds to the filter width. The final representation of the word w_i is obtained by concatenating the word-level vector and the character-level vector:

    v_i = [v_i^w ; v_i^c]    (3)
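The following is a minimal PyTorch sketch of this embedding layer, assuming the hyper-parameter defaults of Table 1; the module and argument names are ours, not taken from the authors' code.

```python
import torch
import torch.nn as nn

class WordCharEmbedding(nn.Module):
    """Word-level (frozen, pre-trained) plus character-level (Conv1D with
    max-over-time pooling) embeddings, concatenated per token (Eq. 1-3)."""

    def __init__(self, word_vectors, n_chars, char_dim=50, n_filters=100, k=3):
        super().__init__()
        # Pre-trained word vectors (e.g., GloVe), kept fixed during training.
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=True)
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_conv = nn.Conv1d(char_dim, n_filters, kernel_size=k)

    def forward(self, word_ids, char_ids):
        # word_ids: (B, L); char_ids: (B, L, N), with N >= kernel width k.
        w = self.word_emb(word_ids)                        # (B, L, d_w)
        B, L, N = char_ids.shape
        c = self.char_emb(char_ids.reshape(B * L, N))      # (B*L, N, r)
        c = torch.tanh(self.char_conv(c.transpose(1, 2)))  # (B*L, F, N-k+1), Eq. (1)
        c = c.max(dim=2).values                            # max-over-time, Eq. (2)
        return torch.cat([w, c.view(B, L, -1)], dim=-1)    # [v^w ; v^c], Eq. (3)
```

For example, with 300-dimensional GloVe vectors and 100 character filters, each token is represented by a 400-dimensional vector that is fed to the BiLSTM encoder described next.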
2.2 BiLSTM Layer

The goal of the encoder layer is to represent the sequence of word representations {v_1, v_2, ..., v_l} obtained from the embedding layer at a higher level of abstraction and to model sequential phenomena. In this work we use a bidirectional RNN (BiRNN) as our encoder. A BiRNN consists of a forward recurrent neural network (RNN) →φ and a backward RNN ←φ. The former reads the input sequence in the forward direction and produces a sequence of forward hidden states (→h_1, ..., →h_l), whereas the latter reads the sequence in reverse order (v_l, ..., v_1), resulting in a sequence of backward hidden states (←h_l, ..., ←h_1). We obtain a representation for each word by concatenating the corresponding forward hidden state →h_t and backward hidden state ←h_t. The following equations illustrate the main ideas:

    →h_t = →φ(v_t, →h_{t−1})    (4)
    ←h_t = ←φ(v_t, ←h_{t+1})    (5)
    h_t = [→h_t ; ←h_t]    (6)

In practice, RNNs are challenging to train, as gradients may explode or vanish over long sequences [10]. To overcome these problems, we use Long Short-Term Memory (LSTM) networks [3], a more sophisticated variant of regular RNNs.

2.3 Sensitivity Detection Sub-Model

The input to this sub-model is the sequence of vectors obtained from the BiLSTM layer, and the output is the probability of each token being sensitive. As shown in Fig. 1, it comprises two units: a single-layer Conv1D and a multi-layer perceptron (MLP) with one hidden layer and one sigmoid neuron as the output layer. The goal of the Conv1D layer is to enrich the representation of each token with information about a fixed-size context determined by the kernel width k. Formally, we obtain the final representation of the input sequence as follows:

    [v_1^s, v_2^s, ..., v_l^s] = Conv1D([h_1, h_2, ..., h_l])    (7)

where Conv1D refers to the same operations as in Equations 1 and 2. Given that, for each v_t^s we obtain the final output as follows:

    x_t^s = tanh(v_t^s · W_1^s + b_1^s)    (8)
    ȳ_t^s = sigmoid(x_t^s · W_2^s + b_2^s)    (9)

Here, W_1^s ∈ R^{d_s×d_x}, b_1^s ∈ R^{d_x}, W_2^s ∈ R^{d_x×1} and b_2^s ∈ R are the MLP parameters, where d_s is the dimensionality of the output vectors of the Conv1D layer and d_x is the dimensionality of the output vectors of the hidden layer.
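A minimal PyTorch sketch of this sub-model is given below. The names and the "same" padding choice are ours (the paper does not specify how sequence boundaries are padded). It also returns the hidden activations x_t^s, since Section 2.4 feeds them to the NER sub-model.

```python
import torch
import torch.nn as nn

class SensitivityHead(nn.Module):
    """Conv1D context layer followed by a one-hidden-layer MLP with a
    sigmoid output neuron (Eq. 7-9)."""

    def __init__(self, d_in, n_filters=200, k=3, d_hidden=200):
        super().__init__()
        # padding=k//2 keeps one output vector per token (for odd k).
        self.conv = nn.Conv1d(d_in, n_filters, kernel_size=k, padding=k // 2)
        self.hidden = nn.Linear(n_filters, d_hidden)
        self.out = nn.Linear(d_hidden, 1)

    def forward(self, h):
        # h: (B, L, d_in) -- the BiLSTM outputs h_1..h_l.
        v = torch.tanh(self.conv(h.transpose(1, 2))).transpose(1, 2)  # Eq. (7)
        x = torch.tanh(self.hidden(v))                                # Eq. (8)
        p = torch.sigmoid(self.out(x)).squeeze(-1)                    # Eq. (9)
        return p, x
```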
2.4 NER Type Detection Sub-Model

Similarly, the input to this sub-model is the sequence of vectors obtained from the BiLSTM layer. The output, in this case, is the tag of each token. Formally, let H = [v_1^t, v_2^t, ..., v_l^t] be the sequence of vectors to be labeled, produced by concatenating the output of the MLP hidden layer of the sensitivity detection sub-model and the output of the Conv1D layer of this sub-model, and let Y^t = [y_1^t, y_2^t, ..., y_l^t] be the corresponding tag sequence. Each element y_i^t is one of the B-<entity>, I-<entity> or O tags. Both H and Y^t are treated as random variables and are jointly modeled using a conditional random field (CRF).

2.5 Training

We train our model to minimize the joint objective function J:

    J = J_s + J_t    (10)

where J_s is the sigmoid cross-entropy and J_t is the negative log-probability of the correct tag sequence:

    J_s = −Σ_t [ y_t^s log(ȳ_t^s) + (1 − y_t^s) log(1 − ȳ_t^s) ]    (11)
    J_t = −log p(Y^t | H)    (12)

Here, y_t^s is the gold label, ȳ_t^s is the predicted one, and Y^t is the sequence of tags. As optimization algorithm, we use Adam [4], a stochastic gradient descent (SGD)-based method, with a learning rate of 0.001. To avoid overfitting, we apply dropout [12] with a rate of 0.3 on the embeddings and the decoder outputs.
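The sketch below shows how the joint loss can be computed, assuming the two heads produce per-token sensitivity probabilities and tag emissions. The CRF layer comes from the third-party pytorch-crf package; this is our choice for illustration, as the paper does not name its CRF implementation.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (our choice of CRF layer)

def joint_loss(sens_probs, emissions, y_sens, y_tags, mask, crf):
    """J = J_s + J_t (Eq. 10).

    sens_probs: (B, L) sigmoid outputs of the sensitivity sub-model
    emissions:  (B, L, n_tags) per-token tag scores of the NER sub-model
    y_sens:     (B, L) gold 0/1 sensitivity labels
    y_tags:     (B, L) gold BIO tag indices
    mask:       (B, L) bool, True on real (non-padding) tokens
    """
    # Eq. (11): sigmoid cross-entropy over the non-padding tokens.
    j_s = nn.functional.binary_cross_entropy(sens_probs[mask], y_sens[mask].float())
    # Eq. (12): negative log-likelihood of the gold tag sequence under the CRF.
    j_t = -crf(emissions, y_tags, mask=mask, reduction='mean')
    return j_s + j_t

# Smoke test with random tensors. In the full model, all sub-module
# parameters would be optimized jointly with Adam at lr=0.001 (Section 2.5).
B, L, n_tags = 2, 10, 5
crf = CRF(n_tags, batch_first=True)
loss = joint_loss(torch.rand(B, L), torch.randn(B, L, n_tags),
                  torch.randint(0, 2, (B, L)), torch.randint(0, n_tags, (B, L)),
                  torch.ones(B, L, dtype=torch.bool), crf)
optimizer = torch.optim.Adam(crf.parameters(), lr=1e-3)
loss.backward()
optimizer.step()
```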
3 Experiments

In this section, we discuss the dataset used and the experimental settings devised to evaluate our system.

3.1 Dataset Details

We trained and fine-tuned our system on the training and development sets, respectively, provided by the organizers of the MEDDOCAN challenge. After that, we submitted the labels predicted by our system for the test set to evaluate its performance; the organizers withheld the gold labels of the test set. The training set contains 500 documents, and the development and test sets contain 250 documents each.

3.2 Hyper-parameters

We used grid search over the development set to obtain the best hyper-parameter values. We list these values in Table 1.

Table 1. The chosen hyper-parameter values.

    Word embedding   Dimension size: 300; Initialization: GloVe; Trainable: no
    Char embedding   Dimension size: 50; Conv1D filters: 100; Kernel width: 3; Initialization: uniform [-0.1, 0.1]
    BiLSTM           Hidden units: 256; Layers: 2
    Sub-model (1)    Conv1D filters: 200; Kernel width: 3; Hidden size: 200
    Sub-model (2)    Conv1D filters: 200; Kernel width: 3

4 Results

We evaluated the performance of our system by comparing it against the following baseline systems:

– RegEx: a rule-based system using only regular expressions.
– CRF: a CRF-based system trained on a set of features such as unigrams, part-of-speech tags, word shapes, affixes, etc. [12]
– E2E-LSTM: a version of our system trained only to identify the token types, i.e., without the sensitivity detection sub-model.

Table 2 shows the results of our submitted system (E2EJ) and the compared systems. The evaluation metrics are precision (P), recall (R), and F1 scores. From the reported results, we note that E2EJ gives performance comparable to the state-of-the-art CRF system and outperforms all the compared systems in terms of recall. One remarkable observation is that, unlike the other systems, our system gives similar performance across all the evaluation metrics, which shows its consistency. Hence, further error analysis and performance inspection could lead to additional improvements. The CRF-based system gives the best precision and F1 scores on the NER sub-task, and the best precision on the span detection sub-task. We attribute this to its use of the external MEDDOCAN gazetteer resources provided by the organizers of the task.

Table 2. The performance of our system compared to various methods. The best value per column is marked with *.

                Sub-Task 1 (NER)           Sub-Task 2 (Spans)
    System      P       R       F1         P       R       F1
    RegEx       91.06   81.01   85.74      91.32   81.24   85.99
    CRF         97.02*  94.93   95.96*     97.47*  95.37   96.41
    E2E-LSTM    94.78   93.64   94.21      95.80   94.65   95.22
    E2EJ        95.98   95.69*  95.83      96.76   96.45*  96.61*

5 Conclusion

We have developed a system, called E2EJ, that automatically detects sensitive entities and identifies their types in Spanish electronic health records. It contains two sub-models that are trained jointly: the first detects the sensitive entities and guides the second to accurately predict the types of the detected tokens. E2EJ provides an end-to-end solution and does not require any external tools or other linguistic resources. The effectiveness of the proposed system has been evaluated by participating in the Medical Document Anonymization challenge for electronic health records in Spanish, obtaining results comparable to the state-of-the-art systems and outperforming the baseline systems. The reported results show that the proposed system is stable and consistent. In future work, we plan to perform an extensive error analysis, inspect the performance of the system, and improve it. For example, we plan to use a transformer-based model such as BERT [2] as a pre-trained sentence encoder instead of the BiLSTM.

Acknowledgements

The authors acknowledge the support of Universitat Rovira i Virgili through a Martí i Franqués PhD grant, the assistant/teaching grant of the Department of Computer Engineering and Mathematics, and the Research Support Funds 2019PFR-URV-B2-60.

References

1. Act, A.: Health insurance portability and accountability act of 1996. Public law 104, 191 (1996)
2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
3. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
4. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
5. Liu, Z., Chen, Y., Tang, B., Wang, X., Chen, Q., Li, H., Wang, J., Deng, Q., Zhu, S.: Automatic de-identification of electronic medical records using token-level and character-level conditional random fields. Journal of Biomedical Informatics 58, S47–S52 (2015)
6. Liu, Z., Tang, B., Wang, X., Chen, Q.: De-identification of clinical notes via recurrent neural network and conditional random field. Journal of Biomedical Informatics 75, S34–S42 (2017)
7. Marimon, M., Gonzalez-Agirre, A., Intxaurrondo, A., Rodríguez, H., Lopez Martin, J.A., Villegas, M., Krallinger, M.: Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019). vol. TBA, p. TBA. CEUR Workshop Proceedings (CEUR-WS.org), Bilbao, Spain (Sep 2019), TBA
8. Meystre, S.M., Friedlin, F.J., South, B.R., Shen, S., Samore, M.H.: Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Medical Research Methodology 10(1), 70 (2010)
9. Neamatullah, I., Douglass, M.M., Li-wei, H.L., Reisner, A., Villarroel, M., Long, W.J., Szolovits, P., Moody, G.B., Mark, R.G., Clifford, G.D.: Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making 8(1), 32 (2008)
10. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: ICML (2013)
11. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
12. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)
13. Sweeney, L.: Replacing personally-identifying information in medical records, the Scrub system. In: Proceedings of the AMIA Annual Fall Symposium. p. 333. American Medical Informatics Association (1996)
14. Uzuner, Ö., Luo, Y., Szolovits, P.: Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association 14(5), 550–563 (2007)