
Anonymization of Clinical Reports in Spanish: a Hybrid Method Based on Machine Learning and Rules.

Pilar López-Úbeda

Manuel C. Díaz-Galiano

L. Alfonso Ureña-López

M. Teresa Martín-Valdivia

Department of Computer Science, Advanced Studies Center in ICT (CEATIC), Universidad de Jaén, Campus Las Lagunillas, 23071, Jaén, Spain

2019

687 695

Biomedicine is an ideal environment for the use of Natural Language Processing, due to the huge amount of information processed and stored in electronic format. This information cannot be shared while it contains confidential patient data. To address this problem, the Medical Document Anonymization workshop was created. In this paper, we present an automated anonymization system for clinical reports written in Spanish. Three different methods are evaluated and compared. The first method is rule-based, the second uses machine learning, and the third is a hybrid of the first two. The evaluation showed that the hybrid method obtained the best results. The results are as expected: we obtained an F1 score of 90% in sub-task 1 and 95% in sub-task 2.

Keywords: Anonymization · Named Entity Recognition · CRF · Machine Learning · Regular Expressions

1 Introduction

Named Entity Recognition (NER) is a key component of many natural language applications, such as the anonymization of clinical records. This task is crucial because a hospital cannot freely publish information related to a patient [11].

NER consists of the automatic identification of text fragments, called entities, which refer to information units such as persons, geographical locations, sex, names of organizations, dates, occupations or references to documents [2].

The OTG de Sanidad of the Plan TL, in collaboration with the National Center for Oncological Research and the Hospital 12 de Octubre in Madrid, has organized a task to contribute to the studies mentioned above. This task is part of IberLEF (Iberian Languages Evaluation Forum) at SEPLN (Sociedad Española para el Procesamiento del Lenguaje Natural) 2019.

The Medical Document Anonymization (MEDDOCAN) task is the first community challenge task specifically devoted to the anonymization of medical documents in Spanish [7]. The MEDDOCAN task is structured into two sub-tasks: NER offset and entity type classification, and sensitive span detection.

The first sub-task requires matching exactly the beginning and end locations of each Protected Health Information (PHI) entity tag, as well as correctly detecting the annotation type. On the other hand, the second sub-task is closer to the practical scenario of releasing de-identified clinical documents, where the main goal is to identify and be able to obfuscate or mask sensitive data, regardless of the actual type of entity or the correct offset identification of multi-token sensitive phrase mentions.

2 Dataset

The MEDDOCAN corpus was selected manually by a practicing physician and augmented with PHI phrases by health documentalists, adding PHI information from discharge summaries and medical genetics clinical records. The clinical records we deal with are usually structured, and most of them follow a common format.

The organizers provided us with the following datasets:

- The training set consists of 500 documents.
- The validation set consists of 250 documents.
- The test set contains 3751 documents.

The MEDDOCAN annotation scheme defines a total of 29 entity types (see footnote 1), and the official annotation guidelines used to annotate the MEDDOCAN datasets are available online (see footnote 2).

3 Strategies

In this section we describe the methods and strategies followed to address the tasks. These methods are used for both sub-tasks: NER offset and entity type classification, and sensitive span detection.

3.1 Rule-based method

An initial survey of the training dataset showed that the majority of the values for a given field were recorded in certain patterns. Some of these patterns are shown in Table 1, where we can see that some named entities are easy to identify. Therefore, we designed rules based on regular expressions (REs or regex) for each field, according to the description of the field types [1].

There have been many studies on the use of regular expressions in different areas of medicine [8, 5]. Some of them are very recent, which shows that the method is still applied in the area of Natural Language Processing (NLP).

1 http://temu.bsc.es/meddocan/index.php/annotation-guidelines/
2 http://temu.bsc.es/meddocan/wp-content/uploads/2019/02/gu%C3%ADas-deanotaci%C3%B3n-de-informaci%C3%B3n-de-salud-protegida.pdf

The first step in this experiment was to define the rules to extract the required entities. These rules were derived from the annotation guide provided by the task organizers. The text is converted to lower case to make it homogeneous. A total of 25 rules were defined in this method. Some of these rules are shown in Table 2.
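As a rough illustration of how such rules can be written, the following sketch applies three regular expressions to lower-cased text. The patterns and the helper `apply_rules` are assumptions made for this example only; they are not the 25 rules used in our system, and the labels are taken from the MEDDOCAN annotation scheme.

```python
import re

# Hypothetical rules for illustration; the real system defines 25 of them.
RULES = [
    ("CORREO_ELECTRONICO", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("FECHAS", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
    ("TERRITORIO", re.compile(r"\b\d{5}\b")),  # five-digit zip code
]


def apply_rules(text):
    """Return (label, start, end, phrase) tuples sorted by start offset."""
    # Lower-casing is assumed to keep character offsets unchanged,
    # which holds for the Spanish alphabet.
    lowered = text.lower()
    found = []
    for label, pattern in RULES:
        for m in pattern.finditer(lowered):
            # Report the phrase with its original casing.
            found.append((label, m.start(), m.end(), text[m.start():m.end()]))
    return sorted(found, key=lambda item: item[1])
```

Matching on the lower-cased copy while slicing the original text keeps the offsets usable for the output annotations.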

In addition, to obtain greater precision in recognizing PHI within the text, we used some resources to correctly identify territories and people. First, we used a list of Spanish provinces to identify territories. Second, we used the Stanford Named Entity Recognizer [6] with a pre-trained Spanish model. The Spanish models are trained on a combination of two corpora, after very heavy modifications: the AnCora Spanish 3.0 corpus (footnote 3) and the DEFT Spanish Treebank V2 (footnote 4). With these Stanford models, people and territories are identified in the text.

3.2 CRF

The Conditional Random Fields (CRF) [4] classifier is a stochastic model commonly used to label and segment data sequences or to extract information from medical documents [3]. We used CRFsuite, the implementation provided by Okazaki [9], as it is fast and provides a simple interface for training and modifying the input features.

3 http://clic.ub.edu/corpus/ancora
4 https://catalog.ldc.upenn.edu/LDC2018T01

We incorporate some basic features of each word, such as isLower, isUpper, isTitle, isDigit, isAlpha, isBeginOfSentence and isEndOfSentence.

Like most machine learning-based de-identification systems, the token-level CRF first requires a tokenization module. The tokenizer used is the WordPunctTokenizer of the NLTK library (footnote 5) in Python.
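The tokenization and feature extraction steps can be sketched as follows. To keep the example self-contained, it uses the regular expression on which NLTK's WordPunctTokenizer is based instead of importing NLTK; the function names are hypothetical, and the feature set is limited to the basic features listed above.

```python
import re

# Pattern used by NLTK's WordPunctTokenizer: runs of word characters
# or runs of non-space punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenize(sentence):
    """Split a sentence into alphanumeric and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


def token_features(tokens, i):
    """Basic orthographic features for the token at position i."""
    word = tokens[i]
    return {
        "isLower": word.islower(),
        "isUpper": word.isupper(),
        "isTitle": word.istitle(),
        "isDigit": word.isdigit(),
        "isAlpha": word.isalpha(),
        "isBeginOfSentence": i == 0,
        "isEndOfSentence": i == len(tokens) - 1,
    }


def sentence_features(sentence):
    """Return the tokens of a sentence and one feature dict per token."""
    tokens = tokenize(sentence)
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]
```

A list of feature dictionaries per sentence is the input shape commonly expected by Python wrappers around CRFsuite.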

Below are a few short lines from the training file (S1130-01082009000500012-1).

S1130-01082009000500012-1  Medico     0
S1130-01082009000500012-1  :          0
S1130-01082009000500012-1  David      NOMBRE_PERSONAL_SANITARIO
S1130-01082009000500012-1  Hernandez  NOMBRE_PERSONAL_SANITARIO
S1130-01082009000500012-1  Alcaraz    NOMBRE_PERSONAL_SANITARIO
S1130-01082009000500012-1  .          0

S1130-01082009000500012-1  NCol       0
S1130-01082009000500012-1  :          0
S1130-01082009000500012-1  29         ID_TITULACION_PERSONAL_SANITARIO
S1130-01082009000500012-1  29585      ID_TITULACION_PERSONAL_SANITARIO

The CRF was trained with the following parameters: algorithm = lbfgs, c1 = 0.1, c2 = 0.1, max_iterations = 100, all_possible_transitions = False.

The output provided by this method is shown below. As we can see, CRF returns tokens and their predicted annotation, so a post-processing step is necessary to join the different tokens that belong to the same concept.

[("Medico",0), (":",0), ("David",NOMBRE_PERSONAL_SANITARIO), ("Hernandez",NOMBRE_PERSONAL_SANITARIO), ("Alcaraz",NOMBRE_PERSONAL_SANITARIO), (".",0), ("NCol",0), (":",0), ("29",ID_TITULACION_PERSONAL_SANITARIO), ("29585",ID_TITULACION_PERSONAL_SANITARIO)]
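A minimal sketch of how contiguous tokens with the same predicted label can be joined is given below. The function name `merge_contiguous` and the use of the string "0" for the outside label are assumptions of this example. Tokens are joined with a single space for simplicity, whereas the real output file needs the character offsets of the original text.

```python
def merge_contiguous(tagged_tokens, outside="0"):
    """Join neighbouring tokens that share the same non-outside label."""
    entities = []
    current_words, current_label = [], None
    for word, label in tagged_tokens:
        if label == current_label and label != outside:
            # Same entity continues: extend the current phrase.
            current_words.append(word)
            continue
        if current_label not in (None, outside):
            # The label changed: close the entity collected so far.
            entities.append((" ".join(current_words), current_label))
        current_words, current_label = [word], label
    if current_label not in (None, outside):
        entities.append((" ".join(current_words), current_label))
    return entities
```

Applied to the CRF output shown above, this yields one phrase for the three name tokens and one phrase for the two identifier tokens.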

This post-processing consists of joining all the contiguous tokens of the same category. In this way, we create the correct output file as shown below:

David Hernandez Alcaraz  NOMBRE_PERSONAL_SANITARIO
29 29585                 ID_TITULACION_PERSONAL_SANITARIO

3.3 Hybrid method

The last method combines the two methods described above: the rule-based method and machine learning with CRF.

After running the machine learning method, we carried out an error analysis and found some inconsistencies in the PHI phrases that had been annotated. This analysis was performed on the development dataset.

5 https://www.nltk.org/

Most of the errors we observed involved annotations such as TERRITORIO, ID_CONTACTO_ASISTENCIAL and ID_ASEGURAMIENTO.

The main problem with the TERRITORIO annotation is that our system joined several phrases into a single annotation, as shown below:

AV. San Francisco 7, 3D 50006 Zaragoza  TERRITORIO

The correct annotation should be:

AV. San Francisco 7, 3D  TERRITORIO
50006                    TERRITORIO
Zaragoza                 TERRITORIO

These errors can be solved with regular expressions that split the annotation into separate entries. In this case, we used a regular expression to find the zip code, together with the list of cities in Spain, to separate the different PHI phrases.
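The splitting step can be sketched as follows, assuming that the phrase should be divided into the text before the zip code, the zip code itself, and the city that follows it. The function `split_territory` and the three-entry `CITIES` set are hypothetical stand-ins for our rule and for the full list of Spanish cities.

```python
import re

ZIP_CODE = re.compile(r"\b\d{5}\b")
# Tiny stand-in for the full list of cities in Spain.
CITIES = {"zaragoza", "madrid", "jaen"}


def split_territory(phrase):
    """Split an over-long TERRITORIO phrase around its zip code and city."""
    match = ZIP_CODE.search(phrase)
    if match is None:
        return [phrase]
    before = phrase[:match.start()].strip()
    after = phrase[match.end():].strip()
    if after.lower() not in CITIES:
        # Leave the phrase unchanged when no known city follows the zip code.
        return [phrase]
    return [part for part in (before, match.group(), after) if part]
```

Requiring a known city after the zip code keeps the rule conservative, so that phrases which merely contain a five-digit number are not split.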

Another error that we could avoid with the use of regex is the erroneous annotation of ID_CONTACTO_ASISTENCIAL, which the algorithm identified as ID_SUJETO_ASISTENCIA.

4 Results and discussion

For both sub-tasks, the primary de-identification metrics are standard measures from the NLP community, namely micro-averaged precision, recall, and balanced F-score.

In addition, the leak score is also used for sub-task 1. This measure is related to the detection of leaks (non-redacted PHI remaining after de-identification) and is computed as (#false negatives / #sentences present).
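These measures can be computed as in the sketch below. It assumes that annotations are represented as sets of (document, start, end, label) tuples and that a prediction counts as correct only on an exact match; the function name and this representation are choices made for the example, not the official evaluation script.

```python
def micro_scores(gold, predicted, n_sentences):
    """Micro-averaged precision, recall, F1 and leak score.

    gold and predicted are sets of (doc_id, start, end, label) tuples
    pooled over all documents, which makes the averages micro-averages.
    """
    tp = len(gold & predicted)   # exact matches
    fp = len(predicted - gold)   # spurious annotations
    fn = len(gold - predicted)   # missed annotations (leaks)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    leak = fn / n_sentences if n_sentences else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "leak": leak}
```

For a span-only evaluation such as sub-task 2, the same function can be used after dropping the label from each tuple.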

The results obtained by our team for sub-task 1 (NER offset and entity type classification) are shown in Table 3.

These results show that we improved on our baseline (the rule-based method) by using machine learning algorithms. The improvement is substantial: the rule-based method obtains an F1 score of 0.59, whereas CRF obtains 0.86. Adding some rules improved all measures further, giving a precision of 0.92 and a recall of 0.88.

Table 4 and Table 5 show the evaluation of the systems for sub-task 2 (Sensitive span detection) with strict and merged spans evaluation respectively.

In this second sub-task we obtain higher values than in the previous one, because the objective is only to identify confidential data. This is a span-based evaluation, regardless of the actual type of entity or the correct offset identification of multi-token sensitive phrase mentions.

In this sub-task we get a higher baseline than in the previous one, and the results improve further thanks to CRF and the rules applied. We achieve roughly equal results in the strict and merged evaluations. The precision is very high: our third method is correct in 97% of cases.

We do see a difference between the two evaluations for systems 2 and 3. In the strict evaluation the machine learning algorithm obtains 90% F1, and in the merged evaluation it obtains 95% with the second method. The second method therefore works best when the evaluation is not strict.

Finally, it is interesting to see that the two evaluations of sub-task 2 give similar values for the third method proposed by our group. This means that our third system marks the span boundaries almost exactly.

5 Error analysis

The main purpose of this section is to carry out an error analysis to identify the weaknesses of our best system: the hybrid method (run 3). To this end, we obtained some basic statistics for the 250 gold test files. The test files taken into account are those of sub-task 1, described in Section ??. These files contain 5661 annotated key phrases.

We have described three basic types of errors for this analysis:

1. Does not have the same annotated label.

In this case, our system marks the positions of the key phrase correctly but the associated annotation is incorrect. We found a total of 221 errors of this type. The most frequent confusions made by our system are described in Table 6, which shows the annotation found in the gold test files, the annotation produced by our system and the number of errors found. These types of errors should be relatively easy to solve in future work. The most frequent confusion is between the territory and the ID of the subject of assistance, because zip codes have 5 digits and the IDs usually consist of digits as well. To improve on this, we should take into account the context in which the digits appear in order to annotate them correctly.

2. Incorrect positions.

In this case, our system incorrectly marks the start or end position of the key phrase. To make this situation clearer, some examples are shown below.

In the first example, the first frame shows the correct annotations. Here the name of the health employee and the street are annotated separately, with consecutive positions (340-360 and 361-374), but our system (second frame) takes everything as a full name with positions from 340 to 374.

NOMBRE_PERSONAL_SANITARIO  340  360  Raquel Ridruejo Saez
CALLE                      361  374  Paseo Calanda

NOMBRE_PERSONAL_SANITARIO  340  374  Raquel Ridruejo Saez Paseo Calanda

The following example shows a similar error: the first frame shows the correct output and the next frame shows the output of our system. Our system annotates everything as a street when it should annotate the institution and the street separately.

INSTITUCION  1923  1944  Ciutat de la Justícia
CALLE        1945  1974  Gran Via Corts Catalanes, 111

CALLE        1923  1974  Ciutat de la Justícia Gran Via Corts Catalanes, 111

The number of errors of this type in our system is 215.

3. Not found.

Finally, we found 203 cases that are annotated in the gold test files but that our system does not detect.
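The three error types above can be counted with a procedure like the one sketched below. This is an illustration under stated assumptions rather than the script used for our analysis: annotations are (label, start, end) tuples for a single file, and a gold annotation is counted as having incorrect positions when a system annotation overlaps it without matching its offsets exactly.

```python
def classify_errors(gold, system):
    """Count the three error types for one file.

    gold and system are lists of (label, start, end) annotations.
    """
    system_by_span = {(start, end): label for label, start, end in system}
    counts = {"wrong_label": 0, "wrong_position": 0, "not_found": 0}
    for label, start, end in gold:
        if (start, end) in system_by_span:
            # Same offsets: only the label can be wrong.
            if system_by_span[(start, end)] != label:
                counts["wrong_label"] += 1
        elif any(s < end and start < e for _, s, e in system):
            # Some system span overlaps, but with different offsets.
            counts["wrong_position"] += 1
        else:
            counts["not_found"] += 1
    return counts
```

Under this counting, the first example of type 2 contributes two position errors, one for each gold annotation covered by the single system span.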

The total number of errors found is 639, which is 11% of the gold test annotations. It is a small number that could possibly be reduced in later systems.

6 Conclusions

We have observed an increase in the number of studies associated with the identification of PHI phrases in electronic medical records. Statistical analyses or machine learning, followed by NLP techniques, have been gaining popularity over the years in comparison with rule-based systems [10]. In this study we try to verify that traditional and automatic methods can still coexist and that they can be of great help if we use them together.

The SINAI group presents its first participation in this type of task, where the main objective is to find PHI in clinical records. The results are good: in both sub-tasks we reached more than 90% F1 with our best method, and in sub-task 2 (sensitive span detection) we obtained a precision of more than 97%.

7 Acknowledgements

This work has been partially supported by Fondo Europeo de Desarrollo Regional (FEDER), LIVING-LANG project (RTI2018-094653-B-C21) and REDES project (TIN2015-65136-C2-1-R) from the Spanish Government.

1. Chen, L., Song, L., Shao, Y., Li, D., Ding, K.: Using natural language processing to extract clinically useful information from Chinese electronic medical records. International Journal of Medical Informatics 124, 6-12 (2019)

2. Graliński, F., Jassem, K., Marcińczuk, M., Wawrzyniak, P.: Named entity recognition in machine anonymization. Recent Advances in Intelligent Information Systems pp. 247-260 (2009)

3. He, Y., Kayaalp, M.: Biological entity recognition with conditional random fields. In: AMIA Annual Symposium Proceedings. vol. 2008, p. 293. American Medical Informatics Association (2008)

4. Lafferty, J., McCallum, A., Pereira, F.C.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data (2001)

5. Liang, Z., Chen, J., Xu, Z., Chen, Y., Hao, T.: A pattern-based method for medical entity recognition from Chinese diagnostic imaging text. Frontiers in Artificial Intelligence 2, 1 (2019)

6. Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. pp. 55-60 (2014)

7. Marimon, M., Gonzalez-Agirre, A., Intxaurrondo, A., Rodríguez, H., Lopez Martin, J.A., Villegas, M., Krallinger, M.: Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019). vol. TBA, p. TBA. CEUR Workshop Proceedings (CEUR-WS.org), Bilbao, Spain (Sep 2019), TBA

8. Nguyen, A.N., Lawley, M.J., Hansen, D.P., Bowman, R.V., Clarke, B.E., Duhig, E.E., Colquist, S.: Symbolic rule-based classification of lung cancer stages from free-text pathology reports. Journal of the American Medical Informatics Association 17(4), 440-445 (2010)

9. Okazaki, N.: CRFsuite: a fast implementation of conditional random fields (CRFs) (2007)

10. Shivade, C., Raghavan, P., Fosler-Lussier, E., Embi, P.J., Elhadad, N., Johnson, S.B., Lai, A.M.: A review of approaches to identifying patient phenotype cohorts using electronic health records. Journal of the American Medical Informatics Association 21(2), 221-230 (2013)

11. Szarvas, G., Farkas, R., Busa-Fekete, R.: State-of-the-art anonymization of medical records using an iterative machine learning framework. Journal of the American Medical Informatics Association 14(5), 574-580 (2007)