<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>E2EJ: Anonymization of Spanish Medical Records using End-to-End Joint Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammed Jabreel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fadi Hassan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Najlaa Maarrof</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Sánchez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josep Domingo-Ferrer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Moreno</string-name>
          <email>antonio.morenog@urv.cat</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CYBERCAT-Center for Cybersecurity Research of Catalonia. UNESCO Chair in Data Privacy. Universitat Rovira i Virgili</institution>
          ,
          <addr-line>Av. Països Catalans 26, E-43007 Tarragona, Catalonia</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>iTAKA: Intelligent Technologies for Advanced Knowledge Acquisition. Department of Computer Science and Mathematics</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>712</fpage>
      <lpage>719</lpage>
      <abstract>
        <p>This paper describes E2EJ, the system that we have developed to participate in the Medical Document Anonymization challenge in the shared task of IberLEF 2019. E2EJ is a data-driven, end-to-end neural network. It does not rely on external resources such as a part-of-speech tagger. It solves two problems jointly: the first problem is to automatically identify whether a token is sensitive, whereas the second one is to identify the type of the token. E2EJ shows results comparable to the state-of-the-art systems and outperforms the baseline systems. The F1 scores of our system on the test set are 96.61% and 95.83% for the sensitivity detection and the token type identification tasks, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Anonymization</kwd>
        <kwd>CRF</kwd>
        <kwd>Medical Documents</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Patient notes in electronic health records (EHRs) contain critical information
that may be useful for medical investigations. However, due to privacy
concerns, the vast majority of medical investigators can only access anonymized or
de-identified notes to protect the confidentiality of patients [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Anonymization
can be either manual or automated. Manual anonymization means that human
annotators label protected health information (PHI). This approach has some
drawbacks. First, only a limited set of individuals is allowed to access the
identified patient notes. Thus, the task cannot be crowd-sourced. Second, humans
are prone to mistakes. Third, manual anonymization is impractical given the
size of EHR databases. Therefore, a reliable automated anonymization system
would be of high value [
        <xref ref-type="bibr" rid="ref14 ref8">14, 8</xref>
        ]. In the literature, there are many systems for EHR anonymization,
which can be categorized as rule-based, feature-engineering-based, or
deep-learning-based approaches.
      </p>
      <p>
        Starting from a seed collection of sensitive tokens, the idea of rule-based systems
is to manually engineer rules based on regular expressions, syntactic structures, or
dependency structures to expand the collection iteratively [
        <xref ref-type="bibr" rid="ref13 ref9">13, 9</xref>
        ].
      </p>
      <p>
        Feature-engineering-based systems aim to train a sequence tagger with
rich, hand-crafted features based on linguistic or syntactic information from an
annotated corpus to predict a label (e.g., O, B&lt;entity&gt; or I&lt;entity&gt;)
for each token in a sentence [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
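<p>As a small illustration of the tagging scheme just described, the helper below converts entity spans into O / B / I labels. The function name, the example tokens, and the spans are hypothetical and not taken from the shared task tooling.</p>

```python
# Minimal sketch (hypothetical example): turn a token list plus entity
# spans into O / B-entity / I-entity tags as used by sequence taggers.
def bio_tags(tokens, spans):
    """spans: list of (start_idx, end_idx_exclusive, entity_type)."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # remaining tokens of the entity
    return tags

tokens = ["El", "paciente", "Juan", "Garcia", "vive", "en", "Tarragona"]
spans = [(2, 4, "NOMBRE"), (6, 7, "TERRITORIO")]
print(bio_tags(tokens, spans))
```

Here the entity type strings are illustrative placeholders; the actual MEDDOCAN label set is defined by the challenge guidelines.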
      <p>
        Rule-based and feature-engineering-based approaches are labor-intensive, since
rules or features must be constructed from linguistic and syntactic information. Despite
some promising results, there are two main issues with these approaches. First,
the engineering of rules and features is a time-consuming task; moreover, rules
constantly need to be updated. Second, the systems of these two categories
depend on external requirements, such as a parser analyzing the syntactic
and dependency structure of sentences, so their performance
relies on the quality of the parsing results [
        <xref ref-type="bibr" rid="ref14 ref9">14, 9</xref>
        ]. To avoid these issues,
deep learning is used to develop systems that learn high-level representations for each
token, on which a classifier or sequence tagger can be trained [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Medical Document Anonymization (MEDDOCAN) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a challenge in the
shared task of IberLEF 2019 dedicated to EHRs in the Spanish language.
It has two structured sub-tasks: "sensitive token detection" and "NER offset
and entity type classification". The first sub-task aims to identify the sensitive
tokens in a document. We can treat this sub-task as a token-level binary
classification problem, in which we develop a system that takes a document as input
and classifies each token as sensitive or not. The second sub-task aims at
identifying the type of each token in a document. We can model this problem as a
sequence tagging problem: the input is a sequence of tokens, and the output is
their corresponding labels.
      </p>
      <p>We participated in the MEDDOCAN challenge by developing E2EJ, a joint
and end-to-end neural network-based system for the two sub-tasks. The
proposed system provides an end-to-end solution and does not require any parsers
or other linguistic resources. Specifically, the proposed system is a multilayer
neural network in which the first three layers learn high-level representations for
a sequence of tokens; the output of these layers is then passed, jointly, to two
sub-models that are learned interactively. One extracts the sensitive tokens,
while the other identifies their types.</p>
      <p>The rest of the paper is structured as follows: Section 2 presents the
methodology; Section 3 explains the dataset, baselines, and experimental settings; Section
4 presents and discusses the results; finally, Section 5 concludes this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>System Description</title>
      <p>The main distinction between our model and the deep-learning-based
systems in the literature is the consideration of the interaction between the two tasks of sensitivity
detection and token type identification. In this section, we introduce E2EJ
and its implementation steps in detail. Fig. 1 depicts the architecture of our
model.</p>
      <p>[Fig. 1: Architecture of E2EJ. Each token passes through word-level and
char-level embeddings and a BiLSTM encoder (forward and backward LSTMs);
Conv1D layers feed two output heads, an MLP predicting whether the token is
sensitive and a CRF predicting its O / B / I tag.]</p>
      <p>
        The goal of the embedding layer is to represent each word w_i ∈ S by a
low-dimensional vector v_i ∈ R^d. Here, d is the size of the embedding layer.
We use two levels of embedding: word-level and character-level. For the
word-level embedding, we replace w_i with its pre-trained GloVe word embedding
vector v_i^w [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We use a single-layer 1-dimensional convolutional neural
network (Conv1D) with max-over-time pooling to represent the word at
character level, as follows. Suppose that w_i is made up of a sequence of characters
[c_1, c_2, ..., c_n], where n is the length of w_i. First, we pass the sequence of
characters of the word w_i to a randomly initialized character embedding layer to get
the matrix C_i ∈ R^{r×n}, which is the character-level representation of w_i. Here,
the j-th column corresponds to the character embedding of c_j. After that, we
apply a narrow convolution between C_i and a filter (or kernel) H ∈ R^{r×k} of
width k, after which we add a bias and apply a nonlinearity to obtain a feature
map f_i ∈ R^{n−k+1}. Specifically, the m-th element of f_i is given by:

f_i[m] = tanh(⟨C_i[*, m : m + k − 1], H⟩ + b)   (1)

where C_i[*, m : m + k − 1] is the m-to-(m + k − 1)-th column slice of C_i and ⟨A, B⟩
is the Frobenius inner product. Finally, we take the max over time,

v_i^c = max_m f_i[m]   (2)

as the feature corresponding to the filter H (when applied to word w_i). A
filter is essentially picking out a character n-gram, where the size of the n-gram
corresponds to the filter width.
      </p>
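<p>The narrow convolution with max-over-time pooling described above can be sketched in NumPy. The shapes, the random filter values, and the `char_conv_max` helper name are illustrative assumptions, and a single filter is used for brevity (a real layer applies many filters of several widths).</p>

```python
import numpy as np

# Sketch of the character-level feature: a narrow convolution of width k
# over the character embedding matrix C (one column per character),
# followed by max-over-time pooling into a single scalar feature.
def char_conv_max(C, H, b, k):
    """C: (r, n) char embeddings; H: (r, k) filter; returns the pooled feature."""
    r, n = C.shape
    f = np.empty(n - k + 1)
    for m in range(n - k + 1):
        # Frobenius inner product of the filter with a width-k slice, plus bias
        f[m] = np.tanh(np.sum(C[:, m:m + k] * H) + b)
    return f.max()  # max-over-time pooling

rng = np.random.default_rng(0)
r, n, k = 8, 5, 3                 # char embedding dim, word length, filter width
C = rng.normal(size=(r, n))
H = rng.normal(size=(r, k))
v_c = char_conv_max(C, H, b=0.1, k=k)
```

With many such filters, the pooled scalars are stacked into the character-level vector of the word.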
      <p>The final representation of the word w_i is given by concatenating the
word-level vector and the character-level vector:</p>
      <p>v_i = [v_i^w; v_i^c]   (3)</p>
      <sec id="sec-2-1">
        <title>BiLSTM Layer</title>
        <p>The goal of the encoder layer is to represent the sequence of word
representations {v_1, v_2, ..., v_l} obtained from the embedding layer at a higher level of
abstraction and to model sequential phenomena. In this work we use a BiRNN
to design our encoder. A BiRNN consists of forward and backward
recurrent neural networks (RNNs). The first one reads the input sequence in the forward
direction and produces a sequence of forward hidden states (→h_1, ..., →h_l), whereas
the latter reads the sequence in the reverse order (v_l, ..., v_1), resulting in a
sequence of backward hidden states (←h_l, ..., ←h_1).</p>
        <p>We obtain a representation for each word v_t by concatenating the
corresponding forward hidden state →h_t and the backward one ←h_t. The following
equations illustrate the main ideas, where f_fw and f_bw denote the forward and
backward RNNs:

→h_t = f_fw(v_t, →h_{t−1})   (4)

←h_t = f_bw(v_t, ←h_{t+1})   (5)

h_t = [→h_t; ←h_t]   (6)</p>
        <p>
          In practice, RNNs are challenging to train. Gradients may explode or vanish
over long sequences [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. To overcome these problems, we use Long Short-Term
Memory (LSTM) [
          <xref ref-type="bibr" rid="ref3">3</xref>
] networks, which are a more sophisticated variant of regular RNNs.
        </p>
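<p>The bidirectional encoding above can be sketched as follows. To keep the example short, a plain tanh RNN cell stands in for the LSTM used by the system, and all names, shapes, and parameter values are illustrative.</p>

```python
import numpy as np

# One directional pass of a simple tanh RNN: h_t = tanh(W x_t + U h_{t-1} + b)
def rnn_pass(xs, W, U, b):
    h = np.zeros(U.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        out.append(h)
    return out

# Bidirectional encoding: run forward on xs, backward on reversed xs,
# re-align the backward states, and concatenate per time step.
def birnn(xs, params_fw, params_bw):
    fw = rnn_pass(xs, *params_fw)              # forward hidden states
    bw = rnn_pass(xs[::-1], *params_bw)[::-1]  # backward states, re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fw, bw)]

rng = np.random.default_rng(1)
d, h, l = 4, 3, 5                              # input dim, hidden dim, length
xs = [rng.normal(size=d) for _ in range(l)]
pf = (rng.normal(size=(h, d)), rng.normal(size=(h, h)), np.zeros(h))
pb = (rng.normal(size=(h, d)), rng.normal(size=(h, h)), np.zeros(h))
H = birnn(xs, pf, pb)                          # l vectors of size 2h
```

Each output vector concatenates a summary of the left context with a summary of the right context, which is exactly what the tagging heads consume.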
      </sec>
      <sec id="sec-2-2">
        <title>Sensitivity Detection Sub-Model</title>
        <p>The input to this sub-model is the sequence of vectors obtained from the BiLSTM
layer, and the output is the probability of each token being sensitive. As shown in
Fig. 1, it comprises two units: a single-layer Conv1D and a multi-layer
perceptron (MLP) with one hidden layer and one sigmoid neuron, i.e., the output
layer. The goal of the Conv1D layer is to enrich the representation of each token
with information about a fixed-size context depending on a kernel width k.
Formally, we get the final representation of the input sequence as follows:

[v_1^s, v_2^s, ..., v_l^s] = Conv1D([v_1, v_2, ..., v_l])   (7)

where Conv1D refers to the same operations as in Equations 1 and 2. Given that,
for each v_t^s, we obtain the final output as follows.</p>
        <p>x_t^s = tanh(v_t^s W_1^s + b_1^s)   (8)

y_t^s = sigmoid(x_t^s W_2^s + b_2^s)   (9)

Here, W_1^s ∈ R^{d_s×d_x}, b_1^s ∈ R^{d_x}, W_2^s ∈ R^{d_x×1} and b_2^s ∈ R are the MLP parameters,
where d_s is the dimensionality of the output vector from Conv1D and d_x is the
dimensionality of the output vector from the hidden layer.</p>
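<p>The sensitivity head, a tanh hidden layer followed by a single sigmoid output neuron, can be sketched per token as below. The `sensitivity_head` name and the parameter shapes are illustrative assumptions.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-token sensitivity head: tanh hidden layer, then one sigmoid neuron.
def sensitivity_head(v, W1, b1, W2, b2):
    x = np.tanh(v @ W1 + b1)        # hidden representation of the token
    return sigmoid(x @ W2 + b2)     # probability that the token is sensitive

rng = np.random.default_rng(2)
ds, dx = 6, 4                       # Conv1D output dim, hidden layer dim
v = rng.normal(size=ds)
W1, b1 = rng.normal(size=(ds, dx)), np.zeros(dx)
W2, b2 = rng.normal(size=(dx, 1)), np.zeros(1)
p = sensitivity_head(v, W1, b1, W2, b2)
```

The sigmoid keeps the output in (0, 1), so thresholding at 0.5 yields the binary sensitive/not-sensitive decision.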
      </sec>
      <sec id="sec-2-3">
        <title>NER Type Detection Sub-Model</title>
        <p>Similarly, the input to this sub-model is the sequence of vectors obtained from
the BiLSTM layer. The output, in this case, is the type tag of each
token. Formally, let H = [v_1^t, v_2^t, ..., v_l^t] be the sequence of vectors to be
labeled, which is produced by concatenating the output of the MLP layer in the Sensitivity
Detection sub-model and the output of the Conv1D layer in this sub-model, and let
Y^t = [y_1^t, y_2^t, ..., y_l^t] be the corresponding tag sequence. Each element y_i^t of Y^t is one
of the B&lt;entity&gt;, I&lt;entity&gt; or O tags. Both H and Y^t are assumed to
be random variables, and they are jointly modeled using a conditional random
field (CRF).</p>
      </sec>
      <sec id="sec-2-4">
        <title>Training</title>
        <p>We train our model to minimise the joint objective function J.</p>
        <p>J = J^s + J^t   (10)

where J^s is the sigmoid cross-entropy and J^t is the negative log-probability of
the correct tag sequence:

J^s = −Σ_t [ŷ_t^s log(y_t^s) + (1 − ŷ_t^s) log(1 − y_t^s)]   (11)

J^t = −log p(Y^t | H)   (12)

Here, ŷ_t^s is the golden label and y_t^s is the predicted one, and Y^t refers to
the sequence of tags. As optimization algorithm, we used the Stochastic Gradient
Descent (SGD)-based ADAM algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with a learning rate of 0.001. To avoid
overfitting, we used dropout on the embeddings and decoder outputs with
a rate of 0.3 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].</p>
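<p>The joint objective can be illustrated numerically. The sketch below computes the sigmoid cross-entropy term over a few tokens; the CRF term is represented by a stand-in probability of the gold tag sequence, since a full CRF forward pass is beyond this example, and all values shown are hypothetical.</p>

```python
import numpy as np

# Sigmoid cross-entropy over per-token sensitivity predictions:
# Js = -sum_t [ gold_t * log(pred_t) + (1 - gold_t) * log(1 - pred_t) ]
def sigmoid_cross_entropy(gold, pred):
    pred = np.clip(pred, 1e-7, 1.0 - 1e-7)   # numerical safety
    return -np.sum(gold * np.log(pred) + (1.0 - gold) * np.log(1.0 - pred))

gold = np.array([1.0, 0.0, 1.0])             # golden sensitivity labels
pred = np.array([0.9, 0.2, 0.8])             # predicted probabilities
Js = sigmoid_cross_entropy(gold, pred)

Jt = -np.log(0.7)   # stand-in for the CRF negative log-probability term
J = Js + Jt         # joint objective minimized during training
```

Because both terms are summed into one scalar, a single backward pass trains the shared encoder and both heads jointly.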
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>In this section, we discuss the dataset used and the different experimental settings
devised to evaluate our system.</p>
      <sec id="sec-3-1">
        <title>Dataset Details</title>
        <p>
          We trained and fine-tuned our system on the training and
development sets, respectively, provided by the organizers of the MEDDOCAN challenge. After
that, we submitted the predicted labels of the test set produced by our
system to evaluate its performance. The organizers withheld the golden labels of
the test set. The training set contains 500 documents, and the development and test
sets contain 250 documents each.
We used grid search to obtain the best hyper-parameter values based on the
development set. We list these values in Table 1.
We evaluated the performance of our system by comparing it against the
following baseline systems:
- RegEx: a rule-based system using only regular expressions.
- CRF: a CRF-based system trained on a set of features such as unigrams,
part-of-speech tags, word shape, affixes, etc. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
- E2E-LSTM: a version of our system that is trained only to identify the
type of tokens.
We have developed a system, called E2EJ, that automatically detects
sensitive entities and identifies their types in Spanish electronic health records. It
contains two sub-models that are trained jointly. The first one aims to detect
the sensitive entities and guides the second one to accurately predict the type
of the detected tokens. E2EJ provides an end-to-end solution and does not
require any external tools or other linguistic resources. The effectiveness of the
proposed system has been evaluated by participating in the Medical Document
Anonymization challenge for electronic health records in the Spanish language,
obtaining results comparable to the state-of-the-art systems
and outperforming the baseline systems. The reported results show that the
proposed system is stable and consistent. In our future work, we plan to perform
extensive error analysis to inspect the performance of the system and improve
it. For example, we plan to use a transformer-based interpretable model like
BERT [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] as a pre-trained sentence encoder instead of the BiLSTM.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>The authors acknowledge the support of Universitat Rovira i Virgili through a Martí
i Franquès PhD grant, the assistant/teaching grant of the Department of
Computer Engineering and Mathematics, and the Research Support Funds
2019PFR-URV-B2-60.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Act</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Health insurance portability and accountability act of 1996</article-title>
          . Public law
          <volume>104</volume>
          ,
          <issue>191</issue>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9(8)</source>
          ,
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
          </string-name>
          , J.:
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Automatic de-identification of electronic medical records using token-level and character-level conditional random fields</article-title>
          .
          <source>Journal of biomedical informatics 58, S47-S52</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>De-identification of clinical notes via recurrent neural network and conditional random field</article-title>
          .
          <source>Journal of biomedical informatics 75, S34-S42</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Marimon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Intxaurrondo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez Martin</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results</article-title>
          .
          <source>In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2019</year>
          ). vol.
          <source>TBA</source>
          , p.
          <source>TBA. CEUR Workshop Proceedings (CEUR-WS.org)</source>
          , Bilbao,
          <source>Spain (Sep</source>
          <year>2019</year>
          ), TBA
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Meystre</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedlin</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>South</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samore</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          :
          <article-title>Automatic de-identification of textual documents in the electronic health record: a review of recent research</article-title>
          .
          <source>BMC medical research methodology 10(1)</source>
          ,
          <volume>70</volume>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Neamatullah</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douglass</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li-wei</surname>
            ,
            <given-names>H.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reisner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villarroel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Long</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szolovits</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Moody, G.B.,
          <string-name>
            <surname>Mark</surname>
            ,
            <given-names>R.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clifford</surname>
            ,
            <given-names>G.D.</given-names>
          </string-name>
          :
          <article-title>Automated de-identification of free-text medical records</article-title>
          .
          <source>BMC medical informatics and decision making 8(1)</source>
          ,
          <volume>32</volume>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pascanu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>On the difficulty of training recurrent neural networks</article-title>
          .
          <source>In: ICML</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
          </string-name>
          , R.:
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Sweeney</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Replacing personally-identifying information in medical records, the scrub system</article-title>
          .
          <source>In: Proceedings of the AMIA annual fall symposium</source>
          . p.
          <fpage>333</fpage>
          . American Medical Informatics Association (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szolovits</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Evaluating the state-of-the-art in automatic de-identification</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>14</volume>
          (
          <issue>5</issue>
          ),
          <fpage>550</fpage>
          -
          <lpage>563</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>