
Automatic De-Identification of Medical Texts in Spanish: the MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results

Montserrat Marimon

Aitor Gonzalez-Agirre

0 1

Ander Intxaurrondo

0 1

Heidy Rodríguez

Jose Antonio Lopez Martin

Marta Villegas

0 1

Martin Krallinger

0 1

0 Barcelona Supercomputing Center (BSC)
1 Centro Nacional de Investigaciones Oncológicas (CNIO)
2 Hospital 12 de Octubre, Madrid

2019

618 638

There is increasing interest in exploiting the content of electronic health records by means of natural language processing and text-mining technologies, as they can yield resources for improving patient health and safety, aid clinical decision making, and facilitate drug repurposing or precision medicine. To share, redistribute and make clinical narratives accessible for text-mining research, it is key to fulfill legal conditions and address restrictions related to data protection and patient privacy. Thus, clinical records cannot be shared directly "as is". A necessary precondition for accessing clinical records outside of hospitals is their de-identification, i.e., the exhaustive removal or replacement of all mentioned privacy-related protected health information phrases. Providing a proper evaluation scenario for automatic anonymization tools is key for the approval of data redistribution. The construction of manually de-identified medical records is currently the main rate- and cost-limiting step for secondary use applications of clinical data. This paper summarizes the settings, data and results of the first shared track on anonymization of medical documents in Spanish, the MEDDOCAN (Medical Document Anonymization) track. This track relied on a carefully constructed synthetic corpus of clinical case documents, the MEDDOCAN corpus, following annotation guidelines for sensitive data based on the analysis of the EU General Data Protection Regulation. A total of 18 teams (from the 51 registrations) submitted 63 runs for the first sub-track and 61 systems for the second sub-track. The top scoring systems were based on sophisticated deep learning approaches, representing strategies that can significantly reduce the time and costs associated with accessing textual data containing privacy-related sensitive information.
The results of this track might help in lowering the clinical data access hurdle for Spanish language technology developers, and also show potential for similar settings using data in other languages or from different domains.

Introduction

There is increasing interest in exploiting the content of unstructured clinical narratives by means of language technologies. For this reason, and because the language technology industry has a clear interest in the health sector, one of the flagship projects of the Spanish National Plan for the Advancement of Language Technology (Plan TL⁴) is related to the clinical and biomedical field. The Plan TL has promoted the generation of a collection of resources for Spanish biomedical NLP⁵, including corpora [26], gazetteers [26], components [2, 19] and tools, as well as evaluation efforts [18, 11, 12]. Due to their central role in fostering language technology resources, the promotion of shared tasks and evaluation campaigns is of particular relevance for the Plan TL, being considered a key instrument for: (1) independent quality evaluation of components, (2) promotion of standards, interoperability and harmonization of resources, (3) generation of new systems, tools and software components, (4) promotion of confidence in language technologies among end users, investors and commercial partners, (5) promotion of new start-ups and innovative ideas, (6) improvement of access to data, (7) creation of collaborative research interactions and networks and (8) knowledge transfer and learning experiences engaging both academia and industry.

Structured clinical data, in the form of codified clinical information using controlled indexing vocabularies such as ICD10, only covers a fraction of the medically relevant information stored in electronic health records (EHRs) and clinical texts. Complex relations such as drug-related allergies, which constitute a serious health risk, cannot be captured well by the coding schemes typically followed by clinical documentalists and thus require direct processing of clinical narrative texts.

Being able to automatically transform clinical documents into structured representations is nonetheless needed to enable secondary use of EHRs, for example to carry out population and epidemiological studies, to detect medication-related adverse events or to systematically monitor treatment-related responses.

To be able to share, redistribute and make clinical narratives accessible for text mining and natural language processing (NLP) purposes, it is key to fulfill legal conditions and address restrictions related to data protection and patient privacy legislation [5]. Some efforts have been made to examine GDPR demands for the construction of de-identified textual corpora for research purposes [15]. Thus, clinical records with protected health information (PHI) cannot be directly shared "as is", due to privacy constraints, making it particularly cumbersome to carry out NLP research in the medical domain. A necessary precondition for accessing clinical records outside of hospitals is their de-identification, i.e., the exhaustive removal (or replacement) of all mentioned PHI phrases.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 September 2019, Bilbao, Spain.

⁴ https://www.plantl.gob.es
⁵ https://github.com/PlanTL-SANIDAD

Studies describing services for pseudonymization of EHRs based on standards such as ISO/EN 13606 were previously published for data in Spanish [4], but they are generally limited to the structured fields of the clinical documents, have not been evaluated against any particular Gold Standard dataset (i.e., they lack proper evaluation) and, most importantly, are not accessible or released on public software repositories, making it impossible to carry out a proper independent benchmark study. Providing a proper evaluation scenario for automatic anonymization tools, with well-defined sensitive data types, is crucial for the approval of data redistribution consents signed by ethical committees of healthcare institutions. It is important to highlight that the construction of manually de-identified medical records is currently the main rate- and cost-limiting step for secondary use applications. Moreover, such settings also require very carefully designed annotation guidelines and interfaces to ensure that there is no leak of sensitive information from clinical records and that the resulting de-identified datasets are compliant with all legal constraints.

The practical relevance of anonymization or de-identification of clinical texts motivated the proposal of two shared tasks, the 2006 and 2014 de-identification tracks [24, 21], organized under the umbrella of the i2b2 (i2b2.org) community evaluation effort. The i2b2 effort has deeply influenced the clinical NLP community worldwide, but was focused on documents in English and covered characteristics of US healthcare data providers. Systems used for de-identifying English clinical texts, such as Carafe, based on Conditional Random Fields, or MIST (the MITRE Identification Scrubber Toolkit), have benefited from the i2b2 shared tasks, which served to improve, evaluate and analyze these tools. The interest in automated de-identification and anonymization systems is not limited to data in English, and there is also a growing awareness of the need to develop such systems for other languages, such as French [9, 7], German [22], Dutch [20], Portuguese [13], Danish [17], Swedish [1] or Norwegian [23].

In the case of texts in Spanish, there have so far been only limited attempts to develop and characterize automatic de-identification strategies [10, 14, 25, 6], even though some in-house tools, such as the AEMPS anonymizer, or a recent publication by Medina and Turmo [14] show that efforts in this direction are being made and that such tools are already being explored in practice. We therefore organized the first community challenge track specifically devoted to the anonymization of medical documents in Spanish, called the MEDDOCAN (Medical Document Anonymization) track, as part of the IberLEF evaluation initiative.

Methods

Track Description

The MEDDOCAN track was one of the nine challenge tracks of the Iberian Languages Evaluation Forum (IberLEF 2019)⁶ evaluation campaign, which had the goal of promoting the development of language technologies for Iberian languages. MEDDOCAN was the first community challenge track specifically devoted to the anonymization of medical documents in Spanish, and it evaluated the performance of systems for identifying and classifying sensitive information in clinical case studies written in Spanish.

The evaluation of automatic predictions for this track had two different scenarios or sub-tracks:

1. NER offset and entity type classification: the first sub-track focused on the identification and classification of sensitive information (e.g., patient names, telephones, addresses, etc.).
2. Sensitive span detection: the second sub-track focused on the detection of sensitive text and was closer to the practical scenario required for the release of de-identified clinical documents, where the objective is to identify and mask confidential data, regardless of the actual entity type or the correct identification of the PHI type.

Track data

For this track, we prepared a synthetic corpus of clinical cases enriched with PHI expressions, named the MEDDOCAN corpus. The MEDDOCAN corpus, of 1,000 clinical case studies, was selected manually by a practicing physician and augmented with PHI phrases by health documentalists, adding PHI information from discharge summaries and medical genetics clinical records.

To carry out the manual annotation, we constructed the first public guidelines for PHI in Spanish [16], following the specifications derived from the General Data Protection Regulation (GDPR) of the EU, as well as the annotation guidelines and types defined by the i2b2 de-identification tracks, based on the US Health Insurance Portability and Accountability Act (HIPAA). The construction of these annotation guidelines involved active feedback over a six-month period from a hybrid team of nine persons with expertise in both healthcare and NLP, resulting in a 28-page document that has been distributed along with the corpus. Along with the annotation rules, illustrative examples were provided to make the interpretation and use of the guidelines as easy as possible.

The MEDDOCAN corpus was randomly split into three subsets: the train set, which contained 500 clinical cases, and the development and test sets of 250 clinical cases each. These clinical cases were manually annotated using a customized version of AnnotateIt. Then, the BRAT annotation toolkit was used to correct errors and add missing annotations, achieving an inter-annotator agreement (IAA) of 98% (calculated with 50 documents). Together with the test set, we released an additional collection of 3,501 documents (background set⁷) to make sure that participating teams were not able to do manual corrections and also to encourage systems that could potentially scale to larger data collections.

⁶ http://hitz.eus/sepln2019/?q=node/21

The MEDDOCAN annotation guidelines defined a total of 29 entity types. Table 1 summarizes the entity types defined for the MEDDOCAN track and the number of occurrences among the training, development and test sets.

The MEDDOCAN corpus was distributed in plain text in UTF-8 encoding, where each clinical case was stored as a single file, while PHI annotations were released in the BRAT format, which makes visualization of results straightforward, as shown in Fig. 1. For this track, we also prepared a conversion script⁸ between the BRAT annotation format and the annotation format used by the previous i2b2 effort, to make comparison and adaptation of previous systems used for English texts easier.

⁷ The background set included the train, development and test sets, and an additional collection of 2,751 clinical cases (totalling 3,751 clinical cases).
⁸ https://github.com/PlanTL-SANIDAD/MEDDOCAN-Format-Converter-Script

We developed an evaluation script that supported the evaluation of the predictions of the participating teams. For both sub-tracks the primary evaluation metrics were standard measures from the NLP community, namely micro-averaged precision, recall, and balanced F-score, the last of which was the only official evaluation measure of both sub-tracks:

Precision: $P = \frac{TP}{TP + FP}$

Recall: $R = \frac{TP}{TP + FN}$

F-score: $F_1 = \frac{2 \cdot P \cdot R}{P + R}$
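As an illustration of how these measures apply to the two sub-tracks, the following is a minimal sketch and not the official evaluation script. It assumes contiguous text-bound annotations in BRAT standoff format, pools all documents before scoring (micro-averaging), and treats sub-track 2 as strict span matching only. The names `Entity`, `parse_brat`, `prf` and `evaluate`, as well as the sample annotations, are hypothetical.

```python
from __future__ import annotations

from typing import NamedTuple


class Entity(NamedTuple):
    label: str
    start: int
    end: int


def parse_brat(ann_text: str) -> set[Entity]:
    """Parse the text-bound ("T") lines of a BRAT standoff .ann file."""
    entities = set()
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, notes and other annotation kinds
        fields = line.split("\t")
        # Second field is "LABEL start end"; a discontinuous span such as
        # "LABEL 0 5;6 10" is collapsed to its outer bounds (simplification).
        label, start, *_, end = fields[1].replace(";", " ").split()
        entities.add(Entity(label, int(start), int(end)))
    return entities


def prf(gold: set, pred: set) -> tuple[float, float, float]:
    """Precision, recall and balanced F-score from two sets of items."""
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def evaluate(gold_docs: dict, pred_docs: dict, typed: bool = True):
    """Micro-average: pool (doc_id, annotation) items, then score once.

    typed=True compares offsets and entity type (sub-track 1 style);
    typed=False compares offsets only (sub-track 2 style, strict spans).
    """
    def pool(docs):
        return {
            (doc_id, e if typed else (e.start, e.end))
            for doc_id, ents in docs.items()
            for e in ents
        }
    return prf(pool(gold_docs), pool(pred_docs))


# Made-up example: the prediction finds both spans but mislabels one of them.
GOLD = {"case_001": parse_brat(
    "T1\tNOMBRE_SUJETO_ASISTENCIA 0 4\tJuan\n"
    "T2\tHOSPITAL 20 42\tHospital 12 de Octubre\n")}
PRED = {"case_001": parse_brat(
    "T1\tNOMBRE_PERSONAL_SANITARIO 0 4\tJuan\n"
    "T2\tHOSPITAL 20 42\tHospital 12 de Octubre\n")}

print("typed  :", evaluate(GOLD, PRED, typed=True))
print("untyped:", evaluate(GOLD, PRED, typed=False))
```

In this toy case the mislabelled name counts as both a false positive and a false negative when types are compared, whereas it is a true positive when only spans are compared, which is the practical difference between the two evaluation scenarios.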

Participation and Results

Participation

To participate in the MEDDOCAN track it was necessary to register both on the official website⁹ and in the CodaLab competition¹⁰. Training and development sets were made available for download on the official website¹¹, and the evaluation script was uploaded to GitHub¹² to ensure a transparent evaluation.

Submissions had to be provided in a predefined prediction format (BRAT or i2b2). The participants had a period of almost two months to develop their systems. In the middle of this period, the test and background sets were released with the 3,751 documents that the participants had to process and label, although the final evaluation was done on the 250 documents of the test set. The participants could submit a maximum of 5 system runs and, once the submission deadline expired, we published the Gold Standard annotations of the test set in order to ensure a transparent evaluation process.

A total of 18 teams participated in the track, submitting a total of 63 systems for sub-track 1 and 61 systems for sub-track 2. Teams from eight different countries participated in the track: ten from Spain, two from the United States, and one each from Argentina, China, Germany, Italy, Japan, and Russia. Among all the participants, only one belonged to an institution of a commercial nature. Table 2 summarizes the most relevant information about the participants.

Baseline system

We produced a baseline system using a vocabulary transfer approach. Each annotation from the train and development datasets was transferred to the test dataset using strict string matching. For those cases where the text was the same but the entity type was different, we annotated the text with all the entity types that matched it.

⁹ http://temu.bsc.es/meddocan/
¹⁰ https://competitions.codalab.org/competitions/22643
¹¹ http://temu.bsc.es/meddocan/index.php/data/
¹² https://github.com/PlanTL-SANIDAD/MEDDOCAN-CODALAB-EvaluationScript
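The vocabulary transfer baseline can be sketched as follows. This is an illustrative reconstruction rather than the organizers' code: it assumes that "strict string matching" means exact, case-sensitive substring matching without any tokenization, and the function names and the sample sentences are hypothetical.

```python
import re
from collections import defaultdict


def build_vocabulary(annotated_docs):
    """Map each annotated surface string to the set of labels it received.

    annotated_docs: iterable of (text, [(label, start, end), ...]) pairs,
    e.g. the train and development documents with their gold annotations.
    """
    vocab = defaultdict(set)
    for text, entities in annotated_docs:
        for label, start, end in entities:
            vocab[text[start:end]].add(label)
    return vocab


def transfer(vocab, text):
    """Annotate every exact occurrence of a known string with all its labels."""
    predictions = set()
    for surface, labels in vocab.items():
        if not surface:
            continue  # ignore empty strings, which would match everywhere
        for match in re.finditer(re.escape(surface), text):
            for label in labels:
                predictions.add((label, match.start(), match.end()))
    return predictions


# Made-up training sentence where the same string carries two entity types.
TRAIN = [(
    "Paciente atendido en Madrid por el Dr. Madrid.",
    [("TERRITORIO", 21, 27), ("NOMBRE_PERSONAL_SANITARIO", 39, 45)],
)]
VOCAB = build_vocabulary(TRAIN)
PREDICTIONS = transfer(VOCAB, "Reside en Madrid.")
print(sorted(PREDICTIONS))
```

Because the string "Madrid" was seen with two different types, both types are predicted for its occurrence in the new text, mirroring the decision described above to annotate all matching entity types.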

[Results table residue: sub-track 1, NER offset and entity type classification.]

[...] with a recall of 0.98335; lukas.lange, with a recall of 0.98264; and mhjabreel, with a recall of 0.97471.

An analysis of errors showed that some of the annotations in the Gold Standard (GS) corpus were not detected by any of the systems (at least not exactly). Some of them are listed here:

- HOSPITAL: Hospital General de Agudos P. Piñero
- FAMILIARES_SUJETO_ASISTENCIA: tres hermanos varones sordomudos y otro con baja visión
- OTROS_SUJETO_ASISTENCIA: estudiante de administración de empresas

On the contrary, some systems annotated entities that were not in the GS but probably should be. For instance, "ex-operario de la industria textil" was annotated as PROFESION by jiangdehuan, jimblair, and Jordi, but this annotation was not in the GS.

One of the primary goals of this track was to develop systems capable of completely de-identifying sensitive information from clinical documents. However, none of the submitted systems managed to obfuscate all the sensitive information. In this section, we present two experiments that evaluated the performance of combined systems to de-identify the test dataset without leaks. The first experiment was based on a joint system, the second on a voting system.

Joint system

The goal of this experiment was to find the combination of individual systems that achieved the best possible performance. To do so, we first ranked all the systems by F-score and then joined the annotations of the two best systems. If the performance of the joint system improved, we continued with the next best system; if not, we kept the previous system (or the previous joint system). We repeated this until no systems were left. We measured the performance of the joint system using three metrics:

1. Best F1: If the F-score of the joint system improved when we added the annotations from the next system, we updated the joint system with the new one. If the F-score did not improve, but it was maintained and the recall was better, we also updated the joint system with the new one (same F-score, better recall, worse precision).
2. Best Recall: If the recall of the joint system improved, we updated the joint system, regardless of the drop in the F-score. This metric tried to maximize the chances of completely de-identifying the documents.
3. Balanced: If the recall of the joint system improved, we updated the joint system only if the decrease of the F-score was at most four times the increase of the recall. That is, for every point of increase in recall, we allowed 4 points of decrease in F-score, but not more. This metric tried to increase the recall without hurting the F-score too much.
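The greedy procedure and its three update rules can be sketched as follows. This is a hedged reconstruction from the description above, not the code used for the experiments: annotations are represented as arbitrary hashable items, the joint system starts empty with all scores at 0, and the names `joint_system`, `criterion` and `max_tradeoff` are hypothetical.

```python
def prf(gold, pred):
    """Precision, recall and balanced F-score from two sets of annotations."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def joint_system(gold, systems, criterion="best_f1", max_tradeoff=4.0):
    """Greedily join system outputs, visiting systems in decreasing F-score.

    systems: dict mapping a run name to its set of annotations.
    Returns the names of the runs that were joined and the final (P, R, F).
    """
    ranked = sorted(systems.items(),
                    key=lambda kv: prf(gold, kv[1])[2], reverse=True)
    joint, used = set(), []
    best_r, best_f = 0.0, 0.0
    for name, annotations in ranked:
        candidate = joint | annotations  # union of all accepted annotations
        _, r, f = prf(gold, candidate)
        if criterion == "best_f1":
            accept = f > best_f or (f == best_f and r > best_r)
        elif criterion == "best_recall":
            accept = r > best_r
        elif criterion == "balanced":
            # Allow an F-score drop of at most max_tradeoff times the recall gain.
            accept = r > best_r and (best_f - f) <= max_tradeoff * (r - best_r)
        else:
            raise ValueError(f"unknown criterion: {criterion}")
        if accept:
            joint, best_r, best_f = candidate, r, f
            used.append(name)
    return used, prf(gold, joint)


# Toy data: run B adds one missing annotation but also two false positives.
GOLD = {1, 2, 3, 4}
SYSTEMS = {"A": {1, 2, 3}, "B": {4, 8, 9}}

for criterion in ("best_f1", "best_recall", "balanced"):
    print(criterion, joint_system(GOLD, SYSTEMS, criterion))
```

In the toy data, joining run B raises recall from 0.75 to 1.0 while lowering the F-score, so it is rejected under Best F1 but accepted under Best Recall and, with the default trade-off of 4, under Balanced.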

The systems that were used to achieve the best results for these metrics were the following:

- Best F1:
  - lukas.lange/run3 improves the F-score from 0 to 0.96961.
  - lukas.lange/run2 improves the F-score from 0.96961 to 0.96997.
  - lukas.lange/run1 improves the F-score from 0.96997 to 0.97033.
- Best Recall:
  - lukas.lange/run3 improves the recall from 0 to 0.96944.
  - lukas.lange/run2 improves the recall from 0.96944 to 0.97209.
  - lukas.lange/run1 improves the recall from 0.97209 to 0.97492.
  - lukas.lange/run4 improves the recall from 0.97492 to 0.97562.
  - Fadi/15-7 improves the recall from 0.97562 to 0.97898.
  - Fadi/14-5 improves the recall from 0.97898 to 0.97951.
  - Fadi/17-3 improves the recall from 0.97951 to 0.98022.
  - Fadi/16-3 improves the recall from 0.98022 to 0.98039.
  - nperez/ncrfpp improves the recall from 0.98039 to 0.98181.
  - FSL/run1 improves the recall from 0.98181 to 0.98393.
  - FSL/run2 improves the recall from 0.98393 to 0.9841.
  - nperez/sp-test-03-empty improves the recall from 0.9841 to 0.98516.
  - mhjabreel/run3 improves the recall from 0.98516 to 0.98551.
  - mhjabreel/run2 improves the recall from 0.98551 to 0.98569.
  - jiangdehuan/run3 improves the recall from 0.98569 to 0.98693.
  - jiangdehuan/run2 improves the recall from 0.98693 to 0.9871.
  - jimblair/run2 improves the recall from 0.9871 to 0.98763.
  - jimblair/run3 improves the recall from 0.98763 to 0.98781.
  - jiangdehuan/run1 improves the recall from 0.98781 to 0.98816.
  - Jordi/run3 improves the recall from 0.98816 to 0.98869.
  - vcotik/run5 improves the recall from 0.98869 to 0.98887.
- Balanced:
  - lukas.lange/run3 improves the recall from 0 to 0.96944 (+0.96944) without losing too much F-score: 0.96961 (-0.96961).
  - lukas.lange/run2 improves the recall from 0.96944 to 0.97209 (+0.00265) without losing too much F-score: 0.96841 (0.00112).
  - lukas.lange/run1 improves the recall from 0.97209 to 0.97492 (+0.00283) without losing too much F-score: 0.96647 (0.00194).
  - Fadi/15-7 improves the recall from 0.97492 to 0.97863 (+0.00371) without losing too much F-score: 0.96181 (0.00466).
  - Fadi/17-3 improves the recall from 0.97863 to 0.97951 (+0.00088) without losing too much F-score: 0.95868 (0.00313).
  - nperez/ncrfpp improves the recall from 0.97951 to 0.98128 (+0.00177) without losing too much F-score: 0.95308 (0.00560).
  - FSL/run1 improves the recall from 0.98128 to 0.98375 (+0.00247) without losing too much F-score: 0.94342 (0.00966).

Voting

The combination of individual systems in the previous experiment was selected directly on the test set. It is very difficult for a given combination of systems to be transferable from one data set to another. Therefore, it should be taken only as an approximation of the upper bound that can be obtained by combining individual systems. In this experiment, we combined the systems using a voting scenario: we accepted as correct the annotations that had been predicted by at least N systems.

We created 50 systems for sub-track 1. The first system accepted all the annotations predicted by at least one of the systems, while the last one accepted only the annotations that were predicted by at least 50 systems. The results of this experiment are shown in Table 9. As expected, as the value of N increased (that is, as we increased the number of required votes), the recall got worse and the precision improved. The maximum F-score on the train and development sets was obtained with N = 17 (F-score of 0.9942). When we used the train and development sets to select the optimal value of N and applied this value to the test set, we obtained an F-score of 0.9757. This score was lower than the best one that could be obtained (0.9768, with N = 23), but the difference was negligible in practice.
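The voting scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the organizers' implementation: each run is a set of hashable annotations, an annotation is kept when at least N runs predict it, and N is selected by F-score on held-out gold data. The function names and the toy data are hypothetical.

```python
from collections import Counter


def prf(gold, pred):
    """Precision, recall and balanced F-score from two sets of annotations."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def vote(system_outputs, n):
    """Keep the annotations predicted by at least n of the systems."""
    counts = Counter(a for annotations in system_outputs
                     for a in set(annotations))
    return {a for a, c in counts.items() if c >= n}


def best_threshold(gold, system_outputs):
    """Score every N from 1 to the number of systems and pick the best F-score.

    Ties are broken in favour of the smaller N (an arbitrary choice here).
    """
    scores = {n: prf(gold, vote(system_outputs, n))[2]
              for n in range(1, len(system_outputs) + 1)}
    best_n = max(scores, key=lambda n: (scores[n], -n))
    return best_n, scores


# Toy development data: three runs, each with one spurious annotation.
DEV_GOLD = {1, 2, 3}
DEV_RUNS = [{1, 2, 3, 7}, {1, 2, 8}, {1, 3, 9}]

BEST_N, SCORES = best_threshold(DEV_GOLD, DEV_RUNS)
print("selected N:", BEST_N)
print("F-score per N:", SCORES)
```

In practice the selected N would then be applied with `vote` to the outputs of the same runs on the test set, which corresponds to the setting reported above where N was chosen on the train and development sets.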

Comparing the results of the two experiments, we see that the voting system outperformed the joint system by 0.54 points (0.9757 vs. 0.9703). In addition, as shown in Table 9, the values were very stable, and a non-optimal choice of N did not change the result much. The downside was that the voting scenario required many systems to obtain this result (17 systems out of 63 had to agree in order to accept an annotation), while the joint system was a combination of only 3 systems. The voting system matched the performance of the joint system when N was 13, scoring 0.9701 (the joint system scored 0.9703).

For reasons of space, we do not include the results of this experiment for sub-tracks 2A and 2B, but they showed very similar behavior.

In this section we analyze the performance of the systems on the different data sets. As mentioned above, the background set included the train set and the development set, which allowed us to measure the F-score of all the systems on the train, development and test sets, and to analyze their behavior.

All the scores of this analysis are shown in Table 10, where the drop column indicates the difference in performance on the test set with respect to the development set (a negative value indicates lower performance on the test set). Two teams achieved an F-score of 1.0 on both the train and development sets: jimblair (in all sub-tracks) and m.domrachev (in sub-tracks 1 and 2A). The former had a performance drop of 6.25 points and the latter of 9.99 points on the test set, probably because the systems of both teams memorized the train and development data, obtaining a perfect score there and thus overfitting. This also suggests that they may have used the development set to train the system, and not just to tune it.

In contrast, lukas.lange, which was the first team on the test set for sub-track 1, was also first on the development set but third on the train set (in both cases without taking into account the teams that scored 1.0). The performance of their system dropped by only 0.14 points on the test set with respect to the development set. They probably used the train set to build the system and the development set only for tuning, thus avoiding overfitting. This shows that the ability of the systems to generalize was very important.

Taking into account all the sub-tracks, the maximum performance drop was suffered by m.domrachev, who lost 9.99 points in sub-track 1. Leaving aside the teams that scored 1.0 on the development set, the system that lost the most points was the one submitted by Jordi, which lost 5.25 points in sub-track 2B (0.33 points in sub-track 1 and 0.29 points in sub-track 2A). The next participants with the highest loss of performance were VSP and FSL.

The maximum improvement on the test set with respect to the development set was 3.32 points, corresponding to the system submitted by jiangdehuan in sub-track 2A.

As a curiosity, ccolon obtained exactly the same score on the development and test sets. However, its performance decreased with respect to the train set (by 3.77 points).

Discussion

The MEDDOCAN track attracted a considerable number of teams, not only from Spain but also from other countries, underlining the global interest in solving clinical data access hurdles and meeting patient data privacy requirements. Compared to previous efforts for English, namely the i2b2 de-identification tracks, MEDDOCAN even reached a higher number of participants. It is important to point out that the MEDDOCAN track benefited significantly from the experiences, setting and annotation process pioneered by the i2b2 efforts.

In the case of the 2006 i2b2 shared task [24], a total of 7 teams participated in the track, providing 16 systems. The five best systems scored above 0.95 for the entity detection track and equaled or exceeded an F-score of 0.95 for the token-based evaluation. The 2014 i2b2 de-identification shared task [21] had 10 teams, submitting 22 runs. The top team reached an F-score of 0.9360 for the entity detection track, and 0.9611 for the evaluation based on tokens. It is important to mention that MEDDOCAN used a synthetic corpus, so the results might not be directly comparable to those of i2b2. Also, it is well known that there is considerable variability in the density, distribution and characteristics of sensitive information, even between different types of clinical records.

De-identification is still a very hard task because of the special characteristics of clinical texts and the importance of recall, i.e., avoiding leakage of sensitive information. The top three teams are above 0.96 in F-score for the track based on entity detection.

The top scoring systems make use of cutting-edge NLP techniques, in particular deep learning. Their results are comparable to a single pass of manual anonymization done by humans. Automatic anonymization with manual revision to detect potential leakages might result in anonymized Spanish clinical records that allow data redistribution. Nevertheless, a follow-up task, using real EHRs from various healthcare institutions and assessing the practical user scenario with experts in the loop, would be desirable to also quantify the cost reduction and quality benefits of anonymization strategies assisted by automated tools.

Conclusions

The results of the MEDDOCAN shared task and evaluation effort on automatic de-identification of sensitive information from texts in Spanish show that advanced deep learning approaches in combination with rule-based systems and gazetteer resources can provide very competitive results when a high-quality manually labeled dataset is available. The construction of Gold Standard corpora is key and requires very detailed annotation guidelines and a carefully designed corpus generation process with the involvement of clinical domain experts. We expect that such corpora and evaluations will also be produced for data in other languages, and that automatic anonymization and de-identification systems will be beneficial beyond EHRs, for instance for medical surveys [8] or legal-financial documents [3]. In order to improve the impact of future shared tasks on anonymization, involvement should not be limited to academic groups working on language technologies, but should also directly include data providers (health institutions), legal experts and national and European institutions. For instance, the European Medicines Agency (EMA) has launched a Technical Anonymisation Group (TAG) consisting of a group of experts in data anonymisation to help further develop best practices for the anonymisation of clinical reports. Moreover, we would also like to stress the key importance of making the code of the systems or the tools developed by participants accessible, and the need to explore strategies to promote start-ups and the commercialization of solutions resulting from shared tasks and evaluation campaigns.

Acknowledgements

We acknowledge the Encargo of Plan TL (SEAD) to CNIO and BSC for funding, and the scientific committee for their valuable comments and guidance. We would also like to thank Siamak Barzegar for his help in setting up MEDDOCAN at CodaLab, and Felipe Soares for input in preparing the manuscript and task.

References

1. Alfalahi, A., Brissman, S., Dalianis, H.: Pseudonymisation of personal names and other PHIs in an annotated clinical Swedish corpus. In: Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012) Held in Conjunction with LREC. pp. 49-54 (2012)

2. Armengol-Estapé, J., Soares, F., Marimon, M., Krallinger, M.: PharmacoNER tagger: a deep learning-based tool for automatically finding chemicals and drugs in Spanish medical texts. Genomics & Informatics 17(2) (2019)

3. Bick, E., Barreiro, A.: Automatic anonymisation of a new Portuguese-English parallel corpus in the legal-financial domain. Oslo Studies in Language 7(1) (2015)

4. Cristóbal, R.S., Carrero, A.M., Carrasco, M.P., Rodríguez, M.C., Méndez, J.F., de Mingo, M.G., Tello, J.C., de Madariaga, R.S., Serrano, A.C., Aza, I.V., et al.: Sistema anonimizador conforme a la norma UNE-EN ISO 13606 (2012)

5. Fernández-Alemán, J.L., Señor, I.C., Lozoya, P.A.O., Toval, A.: Security and privacy in electronic health records: A systematic literature review. Journal of Biomedical Informatics 46(3), 541-562 (2013)

6. García Sardiña, L.: Automating the anonymisation of textual corpora (2018)

7. Gaudet-Blavignac, C., Foufi, V., Wehrli, E., Lovis, C.: De-identification of French medical narratives. Swiss Medical Informatics 34(00) (2018)

8. Gentili, M., Hajian, S., Castillo, C.: A case study of anonymization of medical surveys. In: Proceedings of the 2017 International Conference on Digital Health. pp. 77-81. ACM (2017)

9. Grouin, C., Névéol, A.: De-identification of clinical notes in French: towards a protocol for reference corpus development. Journal of Biomedical Informatics 50, 151-161 (2014)

10. Hassan, F., Domingo-Ferrer, J., Soria-Comas, J.: Anonimización de datos no estructurados a través del reconocimiento de entidades nominadas. In: Actas de la XV Reunión Española sobre Criptología y Seguridad de la Información - RECSI 2018. pp. 102-106 (2018)

11. Intxaurrondo, A., Marimon, M., Gonzalez-Agirre, A., Lopez-Martin, J.A., Rodriguez, H., Santamaria, J., Villegas, M., Krallinger, M.: Finding mentions of abbreviations and their definitions in Spanish clinical cases: The BARR2 shared task evaluation results. In: IberEval@SEPLN. pp. 280-289 (2018)

12. Intxaurrondo, A., Pérez-Pérez, M., Pérez-Rodríguez, G., López-Martín, J.A., Santamaría, J., de la Peña, S., Villegas, M., Akhondi, S.A., Valencia, A., Lourenço, A., Krallinger, M.: The biomedical abbreviation recognition and resolution (BARR) track: benchmarking, evaluation and importance of abbreviation recognition systems applied to Spanish biomedical abstracts. SEPLN (2017)

13. Mamede, N., Baptista, J., Dias, F.: Automated anonymization of text documents. In: 2016 IEEE Congress on Evolutionary Computation (CEC). pp. 1287-1294. IEEE (2016)

14. Medina, S., Turmo, J.: Building a Spanish/Catalan health records corpus with very sparse protected information labelled. In: LREC 2018: Workshop MultilingualBIO: Multilingual Biomedical Text Processing: proceedings. pp. 1-7 (2018)

15. Megyesi, B., Granstedt, L., Johansson, S., Prentice, J., Rosen, D., Schenstrom, C.J., Sundberg, G., Wiren, M., Volodina, E.: Learner corpus anonymization in the age of GDPR: Insights from the creation of a learner corpus of Swedish. In: Proceedings of the 7th Workshop on NLP for Computer Assisted Language Learning. pp. 47-56 (2018)

16. Mota, E., Martín, N., Moreno, A., Ferrete, E., Santamaría, J., Marimon, M., Intxaurrondo, A., Gonzalez-Agirre, A., Villegas, M., Krallinger, M.: Guías de anotación de información de salud protegida (Oct 2018), http://temu.bsc.es/meddocan/wp-content/uploads/2019/02/guías-de-anotación-de-información-de-salud-protegida.pdf

17. Pantazos, K., Lauesen, S., Lippert, S.: Preserving medical correctness, readability and consistency in de-identified health records. Health Informatics Journal 23(4), 291-303 (2017)

18. Pérez-Pérez, M., Pérez-Rodríguez, G., Blanco-Míguez, A., Fdez-Riverola, F., Valencia, A., Krallinger, M., Lourenço, A.: Next generation community assessment of biomedical entity recognition web servers: metrics, performance, interoperability aspects of BeCalm. Journal of Cheminformatics 11(1), 42 (2019)

19. Santamaría, J., Krallinger, M.: Construcción de recursos terminológicos médicos para el español: el sistema de extracción de términos CUTEXT y los repositorios de términos biomédicos. Procesamiento del Lenguaje Natural 61 (2018)

20. Scheurwegs, E., Luyckx, K., Van der Schueren, F., Van den Bulcke, T.: De-identification of clinical free text in Dutch with limited training data: a case study. In: Proceedings of the Workshop on NLP for Medicine and Biology associated with RANLP 2013. pp. 18-23 (2013)

21. Stubbs, A., Kotfila, C., Uzuner, Ö.: Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task track 1. Journal of Biomedical Informatics 58 Suppl, S11-9 (2015)

22. Tomanek, K., Daumke, P., Enders, F., Huber, J., Theres, K., Müller, M.: An interactive de-identification-system. In: Proceedings of SMBM 2012 - The 5th International Symposium on Semantic Mining in Biomedicine. pp. 82-86 (2012)

23. Tveit, A., Edsberg, O., Rost, T., Faxvaag, A., Nytro, O., Nordgard, T., Ranang, M.T., Grimsmo, A.: Anonymization of general practioner medical records. In: Second HelsIT Conference (2004)

24. Uzuner, Ö., Luo, Y., Szolovits, P.: Evaluating the State-of-the-Art in Automatic De-identification. Journal of the American Medical Informatics Association 14(5), 550-563 (09 2007). https://doi.org/10.1197/jamia.M2444

25. Vico, H., et al.: Definición de una arquitectura de referencia para anonimizar documentos (2013)

26. Villegas, M., Intxaurrondo, A., Gonzalez-Agirre, A., Marimon, M., Krallinger, M.: The MeSpEN resource for English-Spanish medical machine translation and terminologies: census of parallel corpora, glossaries and term translations. In: Proceedings of the LREC 2018 Workshop MultilingualBIO: Multilingual Biomedical Text Processing, Paris, France. European Language Resources Association (ELRA) (2018)