
ReCRF : Spanish Medical Document Anonymization using Automatically-crafted Rules and CRF

Fadi Hassan

Mohammed Jabreel

Najlaa Maaroof

David Sanchez

Josep Domingo-Ferrer

Antonio Moreno

antonio.morenog@urv.cat

0 CYBERCAT-Center for Cybersecurity Research of Catalonia, UNESCO Chair in Data Privacy
1 iTAKA: Intelligent Technologies for Advanced Knowledge Acquisition
Department of Computer Science and Mathematics, Universitat Rovira i Virgili, Av. Països Catalans 26, E-43007 Tarragona, Catalonia

2019

727-734

This paper describes ReCRF, a named-entity recognition system submitted to the Medical Document Anonymization (MEDDOCAN) challenge in the IberLEF 2019 Workshop. We propose a general method based on a data-driven rule generator and Conditional Random Fields (CRFs) to automatically detect protected health information (PHI) in Spanish medical documents. The reported experiments show that our system achieves a micro-F1 of 96.33% on the test dataset for the first sub-task and a micro-F1 of 96.86% and 97.50% on strict and merged metrics, respectively, on the test dataset for the second sub-task.

Anonymization CRF

Medical documents containing detailed patient data are of utmost importance for research. When healthcare data are associated with individuals, they are considered protected health information (PHI). The new European General Data Protection Regulation (GDPR) [6] states that explicit consent from the affected individuals is needed to use personally identifiable information (PII), and PHI in particular, for secondary purposes. This means that the data collector should strive to gather such consent. To avoid the need for consent, data used for secondary purposes should no longer be personally identifiable. Document anonymization provides a way to turn PII into information that can no longer be linked to a specific identified individual, so that it is no longer subject to privacy regulations.

In 2006 and 2014, i2b2 organized two shared tasks on document anonymization [2]. The i2b2 effort had a significant impact on the medical natural language processing (NLP) community, but that effort was focused on English documents only. IberLEF 2019 organizes the first community challenge task specifically devoted to the anonymization of Spanish medical documents, called the MEDDOCAN task [5]. The purpose of the MEDDOCAN task is to detect and remove PHI from Spanish plain text medical records. The task is structured into two sub-tasks: "NER offset and entity type classification" and "sensitive token detection". The first sub-task aims at detecting entity types and locations in the text and the second sub-task aims at detecting just entity locations.

The remainder of the paper is organized as follows. In Section 2 we briefly describe the data to be anonymized. Section 3 describes the methodology we propose. Results and discussions are presented in Section 4. Section 5 presents the conclusions and outlines some lines of future work.

2 Data Description

The MEDDOCAN challenge task aims at identifying and extracting several types of PHI from plain text medical documents. The PHI categories are grouped into eight main categories with 22 sub-categories. The corpus released for the task consists of 1000 documents, divided into 500 for training, 250 for development and 250 for testing. The distributions of PHI categories and sub-categories in the training, development and test data are shown in Table 1.

3 Methodology

We developed an automatic system to detect PHI categories in Spanish medical documents. The next subsections describe the steps followed to train and use the system.

3.1 Text Tokenization

In this step, we tokenize the text at two levels: sentence level and word level. First, a sentence tokenizer takes a single document as input and produces a list of sentences. Afterwards, we split each sentence into a list of tokens. The sentence tokenizer splits on the newline delimiter, whereas a manually-crafted regular expression based tokenizer and a spaCy pre-trained model for Spanish [1] are used sequentially to perform the word-level tokenization.
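The two levels can be illustrated with a minimal sketch that uses only the Python standard library. The word pattern `WORD_RE` and the function names are assumptions made for illustration; they are not the regular expression used in our system, and the spaCy pass that follows it is omitted here.

```python
import re

# Hypothetical word pattern (an assumption, not the system's actual expression):
# a run of letters, a run of digits, or any single non-space symbol.
WORD_RE = re.compile(r"[^\W\d_]+|\d+|\S")


def split_sentences(document):
    """Sentence level: one sentence per non-empty line."""
    return [line.strip() for line in document.split("\n") if line.strip()]


def split_words(sentence):
    """Word level: (token, start, end) triples; offsets are relative to the sentence."""
    return [(m.group(), m.start(), m.end()) for m in WORD_RE.finditer(sentence)]


def tokenize(document):
    """Return one list of (token, start, end) triples per sentence."""
    return [split_words(sentence) for sentence in split_sentences(document)]
```

Keeping the character offsets next to each token makes it possible to align tokens with the annotated PHI spans later on.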

3.2 Rules Generation

In this step, we developed a data-driven regular expression generator, so that we avoid implementing hand-crafted regular expression rules. This generator analyses all the occurrences of the PHI categories in the training data set and, from them, generates rules to detect those categories. These rules are later used to extract pseudo-labelled tokens that guide the CRF tagger in taking the final decision.
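One simple way to derive rules from annotated examples is sketched below. It is an illustrative assumption rather than the generator used in our system: each annotated value is generalized into character classes, and the text to its left on the same line (typically a field label) is kept as a literal anchor. The function names and the label `FECHAS` in the usage note are chosen for illustration.

```python
import re
from itertools import groupby

LETTER = r"[^\W\d_]"               # any Unicode letter
GENERIC = {r"\d", LETTER, r"\s"}   # classes that are generalized with "+"


def char_class(ch):
    """Map a character to a generic class, or to itself if it is a symbol."""
    if ch.isdigit():
        return r"\d"
    if ch.isalpha():
        return LETTER
    if ch.isspace():
        return r"\s"
    return ch


def generalize(value):
    """Turn a literal PHI value into a pattern, e.g. '12/03/1980' -> digits/digits/digits."""
    parts = []
    for cls, run in groupby(value, key=char_class):
        run = "".join(run)
        parts.append((cls + "+") if cls in GENERIC else re.escape(run))
    return "".join(parts)


def generate_rules(examples):
    """examples: iterable of (line, start, end, label), with the PHI span inside `line`."""
    patterns = {}
    for line, start, end, label in examples:
        left = line[:start].strip()
        if not left:
            continue  # no field label to anchor the rule on
        pattern = re.escape(left) + r"\s*(?P<phi>" + generalize(line[start:end]) + ")"
        patterns.setdefault(label, set()).add(pattern)
    return {label: [re.compile(p) for p in sorted(ps)] for label, ps in patterns.items()}
```

For instance, a rule learned from "Fecha de nacimiento: 12/03/1980" with the label `FECHAS` would also match "Fecha de nacimiento: 01/11/1975". Rules of this kind depend on the exact field label, which is why they are only used as a first pass.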

3.3 Feature Extraction

We extract a wide variety of linguistic features, similarly to previous studies [8, 9]. These features characterize the semantics of PHI terms. The main types of features are:

- Lexical features: the target word itself, its prefix and suffix, the word lemma, and the Part-of-Speech (POS) tag.
- Orthographic features: word form information, e.g. target word length, word shape (CAPITALIZED, ALL UPPER, ALL LOWER, MIX), ends with s, contains alpha and contains number.
- RegEx features: a RegEx model is used as a first-pass recognizer for the PHI entities in the text. We use the output of the RegEx model to mark the position of each token: at the beginning, in the middle, at the end or outside of a PHI entity.
- External resource features: whether a token appears in one or several external resources, which include lists of English and Spanish names of countries and cities, names and abbreviations of time expressions (e.g. 'año', 'mes'), and names and abbreviations of places (e.g. 'plaza', 'av.'). Additional resources include lists of Spanish last names, Spanish first names, addresses, hospitals, cities and towns, professions, autonomous communities and provinces.

Extracting these features from the target word alone ignores the context in which the word appears, which may lead to misclassified tokens due to language ambiguity. To tackle this, we consider a window of 5 words centered at the target word (i.e., the two words on the left and the two words on the right).
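The window can be sketched as follows. This is a simplified illustration under stated assumptions: only features computable with the standard library are included (lemma, POS tag and external resource lookups are left out), and the feature names, the 3-character prefix/suffix length and the optional `regex_tags` argument are hypothetical choices.

```python
def word_shape(word):
    """Coarse shape of a word, following the four classes listed above."""
    if word.isupper():
        return "ALL_UPPER"
    if word.islower():
        return "ALL_LOWER"
    if word[:1].isupper() and word[1:].islower():
        return "CAPITALIZED"
    return "MIX"


def basic_features(word):
    """Lexical and orthographic features of a single word."""
    return {
        "word": word.lower(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "length": len(word),
        "shape": word_shape(word),
        "ends_with_s": word.endswith("s"),
        "has_alpha": any(c.isalpha() for c in word),
        "has_digit": any(c.isdigit() for c in word),
    }


def window_features(tokens, i, regex_tags=None, size=2):
    """Features of token i plus its `size` neighbours on each side (5 words for size=2)."""
    feats = {"bias": 1.0}
    for offset in range(-size, size + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            for name, value in basic_features(tokens[j]).items():
                feats[f"{offset:+d}:{name}"] = value
            if regex_tags is not None:
                feats[f"{offset:+d}:regex"] = regex_tags[j]
        else:
            feats[f"{offset:+d}:pad"] = True  # position falls outside the sentence
    return feats
```

Each token is described by a flat dictionary whose keys are prefixed with the relative position (`-2` to `+2`), so the same feature of a neighbouring word is kept distinct from that of the target word.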

3.4 Training the System

We used both a set of automatically-crafted rules (the RegEx model) and Conditional Random Fields [4] (the CRF model) to identify PHI in medical documents. The system is implemented in Python 3.7 with the sklearn-crfsuite package [3], and the spaCy package [1] is used for tokenization. We use the BIO tagging scheme to label the tokens [7]: each word token in the document receives one of three possible tags, B, I, or O, which indicate whether the word is at the beginning of, inside, or outside a PHI entity.
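As an illustration of the BIO scheme, the sketch below converts character-level entity annotations into one tag per token. The function name, the input layout and the convention of appending the PHI category to the B and I tags are assumptions made for this example.

```python
def bio_tags(tokens, entities):
    """Assign a BIO tag to every token.

    tokens:   list of (text, start, end) with character offsets
    entities: list of (start, end, label) annotated PHI spans
    """
    tags = []
    for _, tok_start, tok_end in tokens:
        tag = "O"
        for ent_start, ent_end, label in entities:
            # A token belongs to an entity if it lies fully inside its span.
            if tok_start >= ent_start and tok_end <= ent_end:
                tag = ("B-" if tok_start == ent_start else "I-") + label
                break
        tags.append(tag)
    return tags
```

For the sentence "Nombre: Ana García." with a single name annotation covering "Ana García", the tokens "Ana" and "García" would receive a B tag and an I tag respectively, and all other tokens would receive O.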

[Fig. 2: training workflow (diagram not recovered). Box labels: Annotated Documents, Rules Generation, RegEx Rules, External Resources, Tags, Text Tokenization, Feature Extraction, Training the System, CRF Model.]

Fig. 2 shows that our system has two outputs: the RegEx model and the CRF model. The automatic rule extractor builds the RegEx model by analyzing the PHI categories that appear in well-structured contexts (e.g. "Nombre: Xxxxx." or "Fecha de nacimiento: dd/mm/yyyy.").

The CRF model is trained on all the features extracted from the tokens plus the decision of the RegEx model, which adds extra information and makes the decision easier for the CRF model.

[Fig. 3: annotation workflow (diagram not recovered). Box labels: Unannotated Documents, RegEx Rules, External Resources, Text Tokenization, Feature Extraction, CRF Model, Annotated Documents.]

Fig. 3 shows how both the RegEx and CRF models are used to make the annotations. Even though the RegEx model is accurate enough to detect well-structured entities, it is not robust to small changes in the text format. Therefore, we use the RegEx model to perform a preliminary annotation, which is then passed to the CRF model that makes the final decision.
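The preliminary annotation step can be sketched as below. The rule in `RULES` is a hand-written stand-in for an automatically generated one, and `TOKEN_RE`, the function name and the one-letter position tags (B, M, E, O for beginning, middle, end, outside) are assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the generated rules; the PHI value is the group "phi".
RULES = {
    "FECHAS": [re.compile(r"Fecha de nacimiento:\s*(?P<phi>\d+/\d+/\d+)")],
}
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|\S")


def regex_position_tags(sentence, rules=RULES):
    """Return (tokens, tags), where each tag gives the token's position in a rule match."""
    tokens = [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(sentence)]
    spans = [
        match.span("phi")
        for patterns in rules.values()
        for pattern in patterns
        for match in pattern.finditer(sentence)
    ]
    tags = []
    for _, start, end in tokens:
        tag = "O"
        for span_start, span_end in spans:
            if span_start <= start and end <= span_end:
                tag = "B" if start == span_start else "E" if end == span_end else "M"
                break
        tags.append(tag)
    return [text for text, _, _ in tokens], tags
```

The resulting tags are not final annotations: they would be added to the feature set of each token, and the CRF model remains free to overrule them when the surrounding context disagrees.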

4 Results and Discussions

The detection of PHI categories has been evaluated using Precision, Recall and F1 scores at the entity level. The results of our system on the test set for the different PHI categories are shown in Table 2; the confusion matrix is shown in Table 3. Notice that categories with low frequency in the training dataset obtain lower F1 scores (e.g. NUMERO_FAX appears only 15 times in the training set and CENTRO_SALUD appears six times). This result is expected, because the model did not see enough examples to learn how to detect them accurately.
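For reference, micro-averaged entity-level scores can be computed as in the following sketch. It assumes that an entity counts as correct only when its offsets and category both match the gold annotation; the official evaluation script of the task may differ in details, and the function name and input layout are illustrative.

```python
def micro_prf(gold_docs, pred_docs):
    """Micro-averaged precision, recall and F1 over entities.

    Each argument is a list with one set of (start, end, label) tuples per document.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)   # entities predicted with exact span and label
        fp += len(pred - gold)   # predicted entities absent from the gold standard
        fn += len(gold - pred)   # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because the counts are pooled over all documents and categories before the ratios are taken, frequent categories dominate the micro-averaged score, while rare categories have little influence on it.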

The overall results of our system for the two sub-tracks of the competition on the development and test datasets are shown in Table 4. We see very small differences in F1 scores between the development and test datasets, which suggests that our system generalizes well to new data.

5 Conclusion and Future Work

We presented a hybrid system that automatically detects PHI entities in plain text medical documents. The system consists of an automatically constructed RegEx model and a trained CRF model. The design of the system, which uses a variety of linguistic and semantic features to increase accuracy, helps it generalize well to new data.

Finally, because the rules produced by the automatic RegEx generator are not fully generalized, in future work we plan to implement an automatic optimizer to obtain better results.

Acknowledgments

This work was partly supported by the European Commission (project H2020-700540 "CANVAS"), the Government of Catalonia (ICREA Academia Prize to J. Domingo-Ferrer and grant 2017 SGR 705) and the Spanish Government (projects RTI2018-095094-B-C2 "CONSENT" and TIN2016-80250-R "Sec-MCloud"). While the authors are with the UNESCO Chair in Data Privacy, the opinions expressed in this paper are the authors' own and do not necessarily reflect the views of UNESCO.

References

1. Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear (2017)

2. I2B2: i2b2: Informatics for integrating biology & the bedside. https://www.i2b2.org/, last accessed: 25-June-2019

3. Korobov, M.: sklearn-crfsuite. https://sklearn-crfsuite.readthedocs.io/en/latest/, last accessed: 31-May-2019

4. Lafferty, J., McCallum, A., Pereira, F.C.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML'01

5. Marimon, M., Gonzalez-Agirre, A., Intxaurrondo, A., Rodríguez, H., Lopez Martin, J.A., Villegas, M., Krallinger, M.: Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019). vol. TBA, p. TBA. CEUR Workshop Proceedings (CEUR-WS.org), Bilbao, Spain (Sep 2019), TBA

6. Regulation, G.D.P.: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46. Official Journal of the European Union (OJ) 59(1-88), 294 (2016)

7. Sang, E.F., Veenstra, J.: Representing text chunks. In: Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics. pp. 173-179. Association for Computational Linguistics (1999)

8. Stubbs, A., Kotfila, C., Uzuner, O.: Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task track 1. Journal of Biomedical Informatics 58, S11-S19 (2015)

9. Yang, H., Garibaldi, J.M.: Automatic detection of protected health information from clinic narratives. Journal of Biomedical Informatics 58, S30-S38 (2015)