Automatic De-Identification of Medical Texts in Spanish: the MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results

Montserrat Marimon2, Aitor Gonzalez-Agirre1,2, Ander Intxaurrondo1,2, Heidy Rodríguez1, Jose Antonio Lopez Martin3, Marta Villegas1,2, and Martin Krallinger*1,2

1 Centro Nacional de Investigaciones Oncológicas (CNIO)
2 Barcelona Supercomputing Center (BSC)
3 Hospital 12 de Octubre - Madrid
{montserrat.marimon, aitor.gonzalez, marta.villegas, martin.krallinger}@bsc.es



      Abstract. There is an increasing interest in exploiting the content of
      electronic health records by means of natural language processing and
      text-mining technologies, as they can result in resources for improving
      patient health/safety, aid in clinical decision making, and facilitate drug
      repurposing or precision medicine. To share, re-distribute and make clinical
      narratives accessible for text mining research purposes, it is key to fulfill
      legal conditions and address restrictions related to data protection and
      patient privacy. Thus, clinical records cannot be shared directly "as is".
      A necessary precondition for accessing clinical records outside of hospitals
      is their de-identification, i.e., the exhaustive removal or replacement of all
      mentioned privacy-related protected health information phrases. Providing
      a proper evaluation scenario for automatic anonymization tools is
      key for approval of data redistribution. The construction of manually
      de-identified medical records is currently the main rate- and cost-limiting
      step for secondary use applications of clinical data. This paper summarizes
      the settings, data and results of the first shared track on anonymization
      of medical documents in Spanish, the MEDDOCAN (Medical Document
      Anonymization) track. This track relied on a carefully constructed
      synthetic corpus of clinical case documents, the MEDDOCAN corpus,
      following annotation guidelines for sensitive data based on the analysis
      of the EU General Data Protection Regulation. A total of 18 teams (from
      the 51 registrations) submitted 63 runs for the first sub-track and 61 for
      the second sub-track. The top scoring systems were based on sophisticated
      deep learning approaches, representing strategies that can significantly
      reduce the time and costs associated with accessing textual data
      containing privacy-related sensitive information. The results of this track
      may help lower the clinical data access hurdle for Spanish language
      technology developers, and also show potential for similar settings
      using data in other languages or from different domains.

       Keywords: GDPR · IberLEF · de-identification · anonymization · sen-
       sitive data · data privacy · named entity recognition · deep learning ·
       Gold Standard corpus · NLP · Plan TL · text mining · EHR.




1    Introduction

There is an increasing interest in exploiting the content of unstructured clinical
narratives by means of language technologies. Therefore, and because there is
clear interest in the health sector by the language technology industry, one of
the flagship projects of the Spanish National Plan for the Advancement of Lan-
guage Technology (Plan TL4 ) is related to the clinical and biomedical field. The
Plan TL has promoted the generation of a collection of resources for Spanish
biomedical NLP5 , including corpora [26], gazetteers [26], components [2, 19] and
tools, as well as evaluation efforts [18, 11, 12]. Due to their central role in foster-
ing language technology resources, the promotion of shared tasks and evaluation
campaigns is of particular relevance for the Plan TL, being considered a key in-
strument for: (1) independent quality evaluation of components, (2) promotion
of standards, interoperability and harmonization of resources, (3) generation of
new systems, tools and software components, (4) promotion of confidence by end
users, investors and commercial partners in language technologies, (5) promotion
of new start-ups and innovative ideas, (6) improved access to data, (7) creation of
collaborative research interactions and networks, and (8) provision of a knowledge-
transfer and learning experience engaging both academia and industry. Struc-
tured clinical data, in the form of codified clinical information using controlled
indexing vocabulary such as ICD10, only covers a fraction of the medically rel-
evant information stored in electronic health records (EHRs) and clinical texts.
Complex relations such as drug-related allergies, constituting a serious health
risk, cannot be captured well by the coding schemes followed typically by clini-
cal documentalists and, thus, require direct processing of clinical narrative texts.
    Being able to automatically transform clinical documents into structured
representations is nonetheless needed to enable secondary use of EHRs to
carry out population and epidemiological studies, to detect medication-related
adverse events or for monitoring systematically treatment-related responses, just
to name a few.
    To be able to share, re-distribute and make clinical narratives accessible for
text mining and natural language processing (NLP) purposes, it is key to fulfill
legal conditions and address restrictions related to data protection and patient
privacy legislation [5]. Some efforts have been made to examine GDPR demands

  Copyright © 2019 for this paper by its authors. Use permitted under Creative Com-
  mons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 Septem-
  ber 2019, Bilbao, Spain.
4
  https://www.plantl.gob.es
5
  https://github.com/PlanTL-SANIDAD





for the construction of de-identified textual corpora for research purposes [15].
Thus, clinical records with protected health information (PHI) cannot be directly
shared "as is", due to privacy constraints, making it particularly cumbersome
to carry out NLP research in the medical domain. A necessary precondition for
accessing clinical records outside of hospitals is their de-identification, i.e., the
exhaustive removal (or replacement) of all mentioned PHI phrases.

    Studies describing services for pseudonymization of EHRs based on stan-
dards such as the ISO/EN 13606 were previously published for data in Spanish
[4], but are generally limited to the structured fields of the clinical documents,
have not been evaluated against any particular Gold Standard dataset (i.e. lack
proper evaluation), and, most importantly, are not accessible or released on
public software repositories, making it impossible to actually carry out a proper
independent benchmark study. Providing a proper evaluation scenario for auto-
matic anonymization tools, with well-defined sensitive data types, is crucial for
approval of data redistribution consents signed by ethical committees of health-
care institutions. It is important to highlight that the construction of manually
de-identified medical records is currently the main rate and cost-limiting step for
secondary use applications. Moreover, such settings also require very carefully
designed annotation guidelines and interfaces to assure that there is no leak of
sensitive information from clinical records and that the resulting de-identified
datasets are compliant with all legal constraints.

    The practical relevance of anonymization or de-identification of clinical texts
motivated the proposal of two shared tasks, the 2006 and 2014 de-identification
tracks [24, 21], organized under the umbrella of the i2b2 (i2b2.org) community
evaluation effort. The i2b2 effort has deeply influenced the clinical NLP com-
munity worldwide, but was focused on documents in English and covering char-
acteristics of US-healthcare data providers. Systems used for de-identifying En-
glish clinical texts like Carafe, based on Conditional Random Fields or MIST
(the MITRE Identification Scrubber Toolkit) have benefited from i2b2 shared
tasks to improve, evaluate and analyze these tools. The interest in automated
de-identification and anonymization systems is not limited to data in English,
and there is also a growing awareness in developing such systems for other lan-
guages, such as French [9, 7], German [22], Dutch [20], Portuguese [13], Danish
[17], Swedish [1] or Norwegian [23].

    In the case of texts in Spanish, there have so far been rather limited attempts
at developing and characterizing automatic de-identification strategies [10, 14,
25, 6], even though some in-house tools, such as the AEMPS anonymizer, and
a recent publication by Medina and Turmo [14] show that efforts in this di-
rection are being made and that such tools are already explored in practice. We,
therefore, organized the first community challenge track specifically devoted to
the anonymization of medical documents in Spanish, called the MEDDOCAN
(Medical Document Anonymization) track, as part of the IberLEF evaluation
initiative.





2     Methods

2.1    Track Description

The MEDDOCAN track was one of the nine challenge tracks of the Iberian
Languages Evaluation Forum (IberLEF 2019)6 evaluation campaign, which had
the goal of promoting the development of language technologies for Iberian lan-
guages. MEDDOCAN was the first community challenge track specifically de-
voted to the anonymization of medical documents in Spanish and it evaluated the
performance of the systems for identifying and classifying sensitive information
in clinical case studies written in Spanish.
    The evaluation of automatic predictions for this track had two different sce-
narios or sub-tracks:

 1. NER offset and entity type classification: the first sub-track was focused
    on the identification and classification of sensitive information (e.g., patient
    names, telephones, addresses, etc.).
 2. Sensitive span detection: the second sub-track focused on the detection of
    sensitive text spans, a scenario closer to the practical setting required for the
    release of de-identified clinical documents, where the objective is to identify
    and mask confidential data, regardless of the actual entity type or the
    correct identification of the PHI type.


2.2    Track data

For this track, we prepared a synthetic corpus of clinical cases enriched with
PHI expressions, named the MEDDOCAN corpus. The MEDDOCAN corpus,
comprising 1,000 clinical case studies, was selected manually by a practicing
physician and augmented with PHI phrases by health documentalists, who added
PHI information from discharge summaries and medical genetics clinical records.
    To carry out the manual annotation, we constructed the first public guide-
lines for PHI in Spanish [16], following the specifications derived from the Gen-
eral Data Protection Regulation (GDPR) of the EU, as well as the annotation
guidelines and types defined by the i2b2 de-identification tracks, based on the US
Health Insurance Portability and Accountability Act (HIPAA). The construc-
tion of these annotation guidelines involved active feedback over a six-month
period from a hybrid team of nine persons with expertise in both healthcare and
NLP, resulting in a 28-page document that has been distributed along with the
corpus. Along with the annotation rules, illustrative examples were provided to
make the interpretation and use of the guidelines as easy as possible.
    The MEDDOCAN corpus was randomly sampled into three subsets: the train
set, which contained 500 clinical cases, and the development and test sets of 250
clinical cases each. These clinical cases were manually annotated using a cus-
tomized version of AnnotateIt. Then, the BRAT annotation toolkit was used to
6
    http://hitz.eus/sepln2019/?q=node/21





correct errors and add missing annotations, achieving an inter-annotator agree-
ment (IAA) of 98% (calculated with 50 documents). Together with the test set,
we released an additional collection of 3,501 documents (background set7 ) to
make sure that participating teams were not able to do manual corrections and
also to encourage systems that could potentially scale to larger data collections.
    The MEDDOCAN annotation guidelines defined a total of 29 entity types.
Table 1 summarizes the list of sensitive entity types defined for the MEDDOCAN
track and the number of occurrences among the training, development and test
sets.

              Table 1. Entity type distribution among the data sets.

    Type                                     Train    Dev     Test    Total
    TERRITORIO                                1875    987      956    3818
    FECHAS                                    1231    724      611    2566
    EDAD SUJETO ASISTENCIA                    1035    521      518    2074
    NOMBRE SUJETO ASISTENCIA                  1009    503      502    2014
    NOMBRE PERSONAL SANITARIO                 1000    497      501    1998
    SEXO SUJETO ASISTENCIA                     925    455      461    1841
    CALLE                                      862    434      413    1709
    PAIS                                       713    347      363    1423
    ID SUJETO ASISTENCIA                       567    292      283    1142
    CORREO ELECTRONICO                         469    241      249     959
    ID TITULACION PERSONAL SANITARIO           471    226      234     931
    ID ASEGURAMIENTO                           391    194      198     783
    HOSPITAL                                   255    140      130     525
    FAMILIARES SUJETO ASISTENCIA               243     92       81     416
    INSTITUCION                                 98     72       67     237
    ID CONTACTO ASISTENCIAL                     77     32       39     148
    NUMERO TELEFONO                             58     25       26     109
    PROFESION                                   24      4        9      37
    NUMERO FAX                                  15      6        7      28
    OTROS SUJETO ASISTENCIA                      9      6        7      22
    CENTRO SALUD                                 6      2        6      14
    ID EMPLEO PERSONAL SANITARIO                 0      1        0        1
    IDENTIF VEHICULOS NRSERIE PLACAS             0      0        0        0
    IDENTIF DISPOSITIVOS NRSERIE                 0      0        0        0
    NUMERO BENEF PLAN SALUD                      0      0        0        0
    URL WEB                                      0      0        0        0
    DIREC PROT INTERNET                          0      0        0        0
    IDENTF BIOMETRICOS                           0      0        0        0
    OTRO NUMERO IDENTIF                          0      0        0        0




    The MEDDOCAN corpus was distributed in plain text in UTF-8 encoding,
where each clinical case was stored as a single file, while PHI annotations were
released in the BRAT format, which makes visualization of results straightfor-
ward, as shown in Fig. 1. For this track, we also prepared a conversion script8
between the BRAT annotation format and the annotation format used by the
7
  The background set included the train, development and test sets, and an additional
  collection of 2,751 clinical cases (totalling 3,751 clinical cases).
8
  https://github.com/PlanTL-SANIDAD/MEDDOCAN-Format-Converter-Script





previous i2b2 effort, to make comparison and adaptation of previous systems
used for English texts easier.
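    As an illustration only (this is not part of the released tooling), a minimal
reader for BRAT standoff .ann files might look as follows; it assumes contiguous
text-bound annotations and skips everything else:

```python
def read_brat_ann(path: str):
    """Read text-bound annotations from a BRAT standoff .ann file.

    Each entity line has the form:
        T1 <TAB> TYPE START END <TAB> surface text
    Discontinuous spans (offsets joined by ';') are skipped.
    """
    entities = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.startswith("T"):
                continue  # skip relations, events, annotator notes
            _, type_and_span, surface = line.rstrip("\n").split("\t", 2)
            parts = type_and_span.split()
            if len(parts) != 3:  # discontinuous span, not handled here
                continue
            etype, start, end = parts
            entities.append((etype, int(start), int(end), surface))
    return entities
```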




Fig. 1. An example of MEDDOCAN annotation visualized using the BRAT annotation
interface.




2.3   Evaluation metrics
We developed an evaluation script that supported the evaluation of the pre-
dictions of the participating teams. For both sub-tracks the primary evaluation
metrics used consisted of standard measures from the NLP community, namely
micro-averaged precision, recall, and balanced F-score, the last being the only
official evaluation measure for both sub-tracks:
                              Precision: P = TP / (TP + FP)

                              Recall: R = TP / (TP + FN)

                              F-score: F1 = 2 · (P · R) / (P + R)

    where TP = true positives, FP = false positives and FN = false negatives.
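    As a minimal illustration of these measures (not the official evaluation
script), assuming TP, FP and FN have been accumulated over the whole test set:

```python
def micro_prf(tp: int, fp: int, fn: int):
    """Micro-averaged precision, recall and balanced F-score computed
    from true/false positive and false negative counts summed over
    all documents (guarding against division by zero)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```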
    In addition, in the case of the first sub-track, the leak score, i.e., the number
of false negatives divided by the number of sentences, previously proposed for
the i2b2 challenges, was also computed. In the case of the second sub-track, we
additionally computed another evaluation in which the spans of PHI connected
by non-alphanumerical characters were merged (a sketch of this merging is given
below).
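    A rough sketch of this merging step, assuming spans are given as (start, end)
character offsets into the document text (the official script may differ in details):

```python
def merge_phi_spans(spans, text):
    """Merge PHI spans whose intervening text contains no alphanumeric
    character, approximating the relaxed evaluation of sub-track 2B."""
    merged = []
    for start, end in sorted(spans):
        if merged and not any(c.isalnum() for c in text[merged[-1][1]:start]):
            # only punctuation/whitespace in between: extend previous span
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```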
    Teams could submit up to five prediction files (runs) in a predefined predic-
tion format (BRAT or i2b2).


3     Participation and Results
3.1   Participation
To participate in the MEDDOCAN track it was necessary to register both on
the official website9 and in the CodaLab competition10 . Training and develop-
ment sets were made available for download on the official website11 , and the
evaluation script was uploaded to GitHub12 , to ensure a transparent evaluation.
    Submissions had to be provided in a predefined prediction format (BRAT
or i2b2). The participants had a period of almost two months to develop their
systems. In the middle of this period, the test and background sets were released
with the 3,751 documents that the participants had to process and label, al-
though the final evaluation was done on the 250 documents of the test set. As
we have mentioned, the participants could submit a maximum of 5 system runs,
and, once the submission deadline expired, we published the Gold Standard
annotations of the test set, in order to ensure a transparent evaluation process.
    A total of 18 teams participated in the track, submitting a total of 63 systems
for sub-track 1 and 61 systems for sub-track 2. Teams from eight different
countries participated in the track: ten from Spain, two from the United States,
and one from Argentina, China, Germany, Italy, Japan, and Russia. Among all
the participants, only one belonged to an institution of a commercial nature.
Table 2 summarizes the most relevant information about the participants.

3.2   Baseline system
We produced a baseline system using a vocabulary transfer approach. Each an-
notation from the train and development datasets was transferred to the test
dataset using strict string matching. For those cases where the text was the
same but the entity type differed, we annotated all entity types observed for
that text.
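    A minimal sketch of this vocabulary-transfer baseline (names are illustrative
and the actual script may differ):

```python
import re
from collections import defaultdict

def vocabulary_transfer(seen_annotations, test_text):
    """Project every (surface, type) pair seen in the train and
    development sets onto a test document by strict string matching;
    an ambiguous surface is tagged with all of its observed types."""
    lexicon = defaultdict(set)            # surface string -> entity types
    for surface, etype in seen_annotations:
        lexicon[surface].add(etype)
    predictions = []
    for surface, types in lexicon.items():
        for m in re.finditer(re.escape(surface), test_text):
            for etype in types:
                predictions.append((etype, m.start(), m.end()))
    return predictions
```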
9
   http://temu.bsc.es/meddocan/
10
   https://competitions.codalab.org/competitions/22643
11
   http://temu.bsc.es/meddocan/index.php/data/
12
   https://github.com/PlanTL-SANIDAD/MEDDOCAN-CODALAB-Evaluation-
   Script





        Table 2. Overview of Team Participation in the MEDDOCAN track.

    Username                  Organization/Institution/Company                  Members Country Comm.
     Aspie96                          University of Turin                         1       Italy   No
      ccolon                    Carlos III University of Madrid                   3       Spain   No
       Fadi               Universitat Rovira i Virgili, CRISES group              6       Spain   No
       FSL                                Unaffiliated                            1       Spain   No
      gauku                        University of Pennsylvania                     2       USA     No
  jiangdehuan                    Harbin Institute of Technology                   9      China    No
     jimblair                        University of Maryland                       2       USA     No
       Jordi           Centro de Estudios de la Real Academia Española            1       Spain   No
     lsi uned               National Distance Education University                4       Spain   No
    lsi2 uned               National Distance Education University                2       Spain   No
   lukas.lange               Bosch Center for Artificial Intelligence             3     Germany   Yes
  m.domrachev                             Unaffiliated                            3      Russia   No
    mhjabreel        Universitat Rovira i Virgili, iTAKA Research Group           5       Spain   No
      nperez                               Vicomtech                              4       Spain   No
     plubeda               Advanced Studies Center in ICT, SINAI                  4       Spain   No
      sohrab   National Institute of Advanced Industrial Science and Technology   3      Japan    No
      vcotik                      Universidad de Buenos Aires                     3     Argentina No
       VSP                      Carlos III University of Madrid                   1       Spain   No




3.3    Results
Table 3 shows the results for sub-track 1 (NER offset and entity type classifi-
cation), ordered by team performance (first column), then system performance
(second column). Note that almost all of the systems were well above the
baseline, which would have ranked 18th.
    The top scoring system was submitted by lukas.lange, with an F-score of
0.96961, relatively close to the next two participants: Fadi, ranked 2nd with
an F-score of 0.96327, and nperez, ranked 3rd with an F-score of 0.96018. If we
focus on the recall (a crucial metric for de-identification) obtained by the
systems, we see that the best performing systems were lukas.lange, with a recall
of 0.96944, FSL, with a recall of 0.96043, and mhjabreel, with a recall of 0.95707.
    Tables 6 and 7 show the results for sub-track 2A (Sensitive token detec-
tion with strict spans) and sub-track 2B (Sensitive token detection with merged
spans), respectively, ordered by team performance (first column), then system
performance (second column). As in sub-track 1, almost all of the systems were
well above the baseline.
    The top scoring system for sub-track 2A was submitted by lukas.lange, with
an F-score of 0.97491. The second team was Fadi, with an F-score of 0.96861, and
the third team was nperez, with an F-score of 0.96799. The best results in terms
of recall were obtained by lukas.lange, with a recall of 0.97474, mhjabreel, with
a recall of 0.96591, and FSL, with a recall of 0.96520.
    The results for sub-track 2B were quite surprising. The top scoring system
was submitted by lukas.lange, with an F-score of 0.98530, but the second team
for this sub-track was jiangdehuan, with an F-score of 0.98184, very close to the
best team. Note that jiangdehuan ranked 7th for sub-tracks 1 and 2A (their
best system ranked 25th). This boost in performance was quite surprising and
probably needs further analysis. The third team was nperez, with an F-score of
0.97593. Finally, the best results in terms of recall were obtained by jiangdehuan,





Table 3. Results for sub-track 1: NER offset and entity type classification.

  Team Rank   System Rank       User              Leak    Precision    Recall      F1
                    1                           0.02299    0.96978    0.96944   0.96961
                    2                           0.02378    0.97078    0.96838   0.96958
      1             3        lukas.lange        0.02365    0.97044    0.96856   0.96950
                    4                           0.02432    0.96956    0.96767   0.96861
                    5                           0.02724    0.96720    0.96379   0.96549
                    6                           0.03255    0.96991    0.95672   0.96327
                    7                           0.03388    0.97160    0.95495   0.96321
      2             8           Fadi            0.03508    0.97191    0.95337   0.96255
                    9                           0.03322    0.96867    0.95584   0.96221
                   10                           0.03402    0.96933    0.95478   0.96200
                   11                           0.03282    0.96403    0.95637   0.96018
                   15                           0.03946    0.96823    0.94754   0.95777
      3            19          nperez           0.03946    0.96492    0.94754   0.95615
                   20                           0.04146    0.96570    0.94489   0.95518
                   21                           0.04770    0.97124    0.93658   0.95360
                   12                           0.02976    0.95857    0.96043   0.95950
      4            16           FSL             0.03096    0.95597    0.95884   0.95740
                   18                           0.03096    0.95547    0.95884   0.95715
                   13                           0.03242    0.95978    0.95690   0.95834
                   14                           0.03282    0.95976    0.95637   0.95806
      5            17         mhjabreel         0.03229    0.95741    0.95707   0.95724
                   22                           0.03734    0.95610    0.95036   0.95322
                   24                           0.04783    0.94779    0.93641   0.94207
      6            23          lsi uned         0.05381    0.95877    0.92846   0.94337
                   25                           0.03574    0.92806    0.95248   0.94011
                   26                           0.03681    0.92892    0.95107   0.93986
      7            28        jiangdehuan        0.04106    0.92868    0.94542   0.93697
                   30                           0.03747    0.92217    0.95019   0.93597
                   58                           0.16835    0.91580    0.77619   0.84023
                   27                           0.06617    0.96451    0.91203   0.93753
                   29                           0.06604    0.96164    0.91221   0.93627
      8            33          jimblair         0.05395    0.93306    0.92828   0.93067
                   35                           0.05567    0.93125    0.92598   0.92861
                   36                           0.05594    0.92547    0.92563   0.92555
                   31                           0.05421    0.93653    0.92793   0.93221
      9                        ccolon
                   34                           0.05195    0.92700    0.93093   0.92896
                   32                           0.07002    0.95676    0.90691   0.93117
                   39                           0.08026    0.94119    0.89331   0.91662
     10            40          sohrab           0.07348    0.92553    0.90231   0.91377
                   41                           0.06325    0.90997    0.91592   0.91293
                   42                           0.08570    0.93252    0.88606   0.90870
                   37                           0.07095    0.93150    0.90567   0.91841
     11            38           Jordi           0.06218    0.91912    0.91733   0.91822
                   57                           0.12091    0.86571    0.83925   0.85227
                   43                           0.08491    0.92113    0.88712   0.90381
     12            52          plubeda          0.11998    0.89369    0.84049   0.86627
                   62                           0.34600    0.66457    0.54001   0.59585
                   44                           0.08318    0.91098    0.88942   0.90007
     13            47       m.domrachev         0.07813    0.89313    0.89613   0.89463
                   48                           0.08225    0.87824    0.89066   0.88441
                   45                           0.12052    0.96902    0.83978   0.89978
     14                       lsi2 uned
                   59                           0.18164    0.91929    0.75852   0.83120
                   46                           0.09022    0.91413    0.88006   0.89677
                   49                           0.07308    0.86568    0.90284   0.88387
     15            50           vcotik          0.07308    0.86568    0.90284   0.88387
                   51                           0.07308    0.86568    0.90284   0.88387
                   60                           0.13540    0.76223    0.82000   0.79006
                   53                           0.10165    0.85535    0.86486   0.86008
                   54                           0.10165    0.85535    0.86486   0.86008
     16                         VSP
                   55                           0.10058    0.84639    0.86628   0.85622
                   56                           0.10058    0.84639    0.86628   0.85622
     17            61           gauku           0.31464    0.90841    0.58170   0.70924
      -             -       *Baseline-VT*       0.37351   0.37023     0.50344   0.42668
     18            63          Aspie96          0.35384    0.18829    0.52959   0.27781





 Table 4. Results by label for sub-track 1: NER offset and entity type classification.
       Category               Sub-category              Best Team(s)    Leak    Precision   Recall     F1
        AGE             EDAD SUJETO ASISTENCIA          jiangdehuan    0.0004    0.9828     0.9942   0.9885
                                                         lukas.lange
                          CORREO ELECTRONICO                           0.0001    0.9920     0.9960   0.9940
                                                            nperez
                                                           jimblair
      CONTACT
                              NUMERO FAX                jiangdehuan    0.0000    1.0000     1.0000   1.0000
                                                           lsi uned
                           NUMERO TELEFONO              jiangdehuan    0.0000    1.0000     1.0000   1.0000
                                                        jiangdehuan
        DATE                    FECHAS                                 0.0004    0.9935     0.9951   0.9943
                                                         lukas.lange
                                                             FSL
                                                        jiangdehuan
                                                           jimblair
                                                           lsi uned
                           ID ASEGURAMIENTO              lukas.lange   0.0001    1.0000     0.9950   0.9975
                                                        m.domrachev
                                                          mhjabreel
                                                            nperez
                                                            sohrab
                                                          lsi2 uned
                                                         lukas.lange
                                                          mhjabreel
         ID             ID CONTACTO ASISTENCIAL                        0.0000    1.0000     1.0000   1.0000
                                                            nperez
                                                            sohrab
                                                            vcotik
                          ID SUJETO ASISTENCIA          jiangdehuan    0.0001    0.9758     0.9965   0.9860
                                                        jiangdehuan
                                                           jimblair
                                                           lsi uned
                                                          lsi2 uned
                    ID TITULACION PERSONAL SANITARIO                   0.0000    0.9957     1.0000   0.9979
                                                         lukas.lange
                                                          mhjabreel
                                                            nperez
                                                            sohrab
                                 CALLE                   lukas.lange   0.0031    0.9353     0.9443   0.9398
                                                             FSL
                                                        jiangdehuan
                             CENTRO SALUD                 lsi2 uned    0.0001    1.0000     0.8333   0.9091
                                                         lukas.lange
      LOCATION
                                                          mhjabreel
                                  HOSPITAL                   FSL       0.0016    0.9672     0.9077   0.9365
                                INSTITUCION             jiangdehuan    0.0036    0.6061     0.5970   0.6015
                                    PAIS                jiangdehuan    0.0004    0.9890     0.9917   0.9904
                                TERRITORIO               lukas.lange   0.0035    0.9759     0.9728   0.9743
                       NOMBRE PERSONAL SANITARIO         lukas.lange   0.0003    0.9960     0.9960   0.9960
       NAME
                        NOMBRE SUJETO ASISTENCIA        jiangdehuan    0.0000    1.0000     1.0000   1.0000
                      FAMILIARES SUJETO ASISTENCIA       lukas.lange   0.0017    0.8293     0.8395   0.8344
       OTHER             OTROS SUJETO ASISTENCIA            nperez     0.0008    1.0000     0.1429   0.2500
                          SEXO SUJETO ASISTENCIA             FSL       0.0004    0.9892     0.9935   0.9913
     PROFESSION                  PROFESION               lukas.lange   0.0004    1.0000     0.6667   0.8000




with a recall of 0.98335, lukas.lange, with a recall of 0.98264, and mhjabreel, with
a recall of 0.97471.
    An analysis of errors showed that some of the annotations in the Gold Stan-
dard (GS) corpus were not detected by any of the systems (at least not exactly).
Some of them are listed here:

 – HOSPITAL: Hospital General de Agudos P. Piñero
 – FAMILIARES SUJETO ASISTENCIA: tres hermanos varones sordomudos
   y otro con baja visión
 – OTROS SUJETO ASISTENCIA: estudiante de administración de empresas

    Conversely, some systems annotated entities that were not in the GS but
probably should have been. For instance, "ex-operario de la industria textil" was
annotated as PROFESION by jiangdehuan, jimblair, and Jordi, but this annotation
was not in the GS.





                            Table 5. Statistics by track.

         Track    Measure       Leak        Precision      Recall        F1
                   Min         0.02299       0.18829      0.52959     0.27781
                   Mean        0.07594       0.90219      0.89327     0.89410
           1      Median       0.05567       0.93252      0.92598     0.93117
                   Max         0.35384       0.97191      0.96944     0.96961
                    Std        0.06857       0.10736      0.09116     0.10223
                   Min            -          0.19771      0.55609     0.29171
                    Mean           -          0.92907      0.91058     0.91724
          2A      Median          -          0.95965      0.92616     0.94118
                    Max            -          0.97747      0.97474     0.97491
                    Std           -          0.10200      0.08190     0.09535
                   Min            -          0.19780      0.55626     0.29183
                   Mean           -          0.94661      0.92494     0.93320
          2B      Median          -          0.97180      0.95001     0.95774
                    Max            -          0.98749      0.98335     0.98530
                    Std           -          0.10260      0.08247     0.09624


3.4   Combination of systems
One of the primary goals of this track was to develop systems capable of com-
pletely de-identifying sensitive information in clinical documents. However,
none of the submitted systems managed to obfuscate all the sensitive information.
In this section, we present two experiments that evaluated the performance
of combined systems in de-identifying the test dataset without leaks. The first
experiment was based on a joint system; the second, on a voting scheme.

Joint system The goal of this experiment was to find the combination of
individual systems that achieved the best possible performance. For this, we
first ranked all the systems by F-score, and then joined the annotations of the
two best systems. If the performance of the joint system improved, we continued
with the next best system; if not, we kept the previous (joint) system. We
repeated this until no systems were left (a sketch of this greedy procedure is
given after the list of criteria). We measured the performance of the joint system
using three metrics:

1. Best F1: If the F-score of the joint system improved when we added the
   annotations from the next system, we updated the joint system with the
   new one. If the F-score did not improve, but it was maintained and the
   recall was better, we also updated the joint system with the new one (same
   F-score, better recall, worse precision).
2. Best Recall: If the recall of the joint system improved, we updated the joint
   system, regardless of the drop in the F-score. It tried to maximize the chances
   of completely de-identifying the documents.
3. Balanced: If the recall of the joint system improved, we updated the joint
   system only if the decrease of the F-score was at most four times the increase
   of the recall. That is, for every point of increase in recall, we allowed 4 points
   of decrease in F-score, but not more. It tried to increase the recall without
   hurting the F-score too much.






Table 6. Results for sub-track 2A: Sensitive token detection (strict spans).

      Team Rank    System Rank       User        Precision    Recall      F1
                         1                        0.97508    0.97474   0.97491
                         2                        0.97574    0.97333   0.97453
          1              3        lukas.lange     0.97540    0.97350   0.97445
                         4                        0.97522    0.97333   0.97427
                         5                        0.97217    0.96873   0.97045
                         6                        0.97529    0.96202   0.96861
                         8                        0.97507    0.96043   0.96770
          2              9           Fadi         0.97556    0.95884   0.96713
                        10                        0.97351    0.96061   0.96701
                        11                        0.97569    0.95707   0.96629
                         7                        0.97187    0.96414   0.96799
                        15                        0.97491    0.95407   0.96438
          3             20          nperez        0.97093    0.95001   0.96036
                        21                        0.96703    0.95337   0.96015
                        22                        0.97747    0.94259   0.95971
                        12                        0.96758    0.96467   0.96612
                        13                        0.96625    0.96591   0.96608
          4             14         mhjabreel      0.96720    0.96379   0.96549
                        19                        0.96463    0.95884   0.96173
                        23                        0.95798    0.94648   0.95219
                        16                        0.96315    0.96502   0.96409
          5             17           FSL          0.96231    0.96520   0.96375
                        18                        0.96180    0.96520   0.96350
          6             24          lsi uned      0.96406    0.93358   0.94858
                        25                        0.93356    0.95813   0.94569
                        26                        0.93392    0.95619   0.94492
          7             30        jiangdehuan     0.92817    0.95637   0.94206
                        31                        0.93285    0.94966   0.94118
                        57                        0.91976    0.77954   0.84387
                        27                        0.96167    0.92616   0.94358
          8             45                        0.93858    0.88271   0.90979
                        59          plubeda       0.86594    0.70288   0.77594
                        28                        0.96782    0.91910   0.94283
                        32                        0.96806    0.91539   0.94098
          9             33          jimblair      0.96646    0.91609   0.94060
                        34                        0.96536    0.91556   0.93980
                        36                        0.95965    0.91592   0.93727
                        29                        0.94705    0.93835   0.94268
          10                        ccolon
                        35                        0.93650    0.94047   0.93848
                        37                        0.96086    0.91079   0.93516
                        40                        0.93568    0.91221   0.92379
          11            41          sohrab        0.92639    0.92033   0.92335
                        43                        0.94752    0.89931   0.92278
                        44                        0.91962    0.92563   0.92262
                        38                        0.94771    0.91238   0.92971
          12            50           vcotik       0.87229    0.90973   0.89062
                        51                        0.87229    0.90973   0.89062
                        39                        0.93732    0.91132   0.92414
          13            42           Jordi        0.92407    0.92228   0.92317
                        56                        0.87136    0.84473   0.85783
                        46                        0.91424    0.89260   0.90329
          14            48       m.domrachev      0.89754    0.90055   0.89904
                        49                        0.88521    0.89772   0.89142
                        47                        0.97187    0.84225   0.90243
          15                       lsi2 uned
                        58                        0.92207    0.76082   0.83372
                        52                        0.86548    0.87511   0.87027
                        53                        0.86548    0.87511   0.87027
          16                         VSP
                        54                        0.85658    0.87670   0.86652
                        55                        0.85658    0.87670   0.86652
          17            60           gauku        0.91421    0.58541   0.71376
           -             -       *Baseline-VT*   0.44174     0.50627   0.47181
          18            61          Aspie96       0.19771    0.55609   0.29171






Table 7. Results for sub-track 2B: Sensitive token detection (merged spans).

       Team Rank   System Rank       User        Precision    Recall      F1
                         1                        0.98749    0.98311   0.98530
                         2                        0.98566    0.98264   0.98415
           1             3        lukas.lange     0.98648    0.98145   0.98396
                         4                        0.98598    0.98162   0.98380
                         7                        0.98182    0.97730   0.97956
                         5                        0.98033    0.98335   0.98184
                         6                        0.98029    0.98282   0.98155
           2             8        jiangdehuan     0.97496    0.98199   0.97846
                         9                        0.97962    0.97625   0.97793
                        56                        0.96913    0.80565   0.87986
                        10                        0.97954    0.97235   0.97593
                        20                        0.97724    0.96666   0.97192
           3            21          nperez        0.98253    0.96136   0.97183
                        22                        0.98159    0.95890   0.97011
                        27                        0.98329    0.95001   0.96636
                        11                        0.98128    0.96886   0.97503
                        14                        0.98110    0.96734   0.97417
           4            16           Fadi         0.97939    0.96750   0.97341
                        17                        0.98120    0.96573   0.97340
                        18                        0.98186    0.96419   0.97294
                        12                        0.97471    0.97471   0.97471
                        13                        0.97517    0.97350   0.97434
           5            15         mhjabreel      0.97481    0.97297   0.97389
                        19                        0.97457    0.96957   0.97207
                        28                        0.97125    0.95955   0.96536
                        23                        0.96694    0.96942   0.96818
           6            24           FSL          0.96708    0.96890   0.96799
                        25                        0.96645    0.96942   0.96793
                        26                        0.96515    0.96826   0.96670
           7            29       m.domrachev      0.95890    0.96768   0.96327
                        33                        0.96702    0.94718   0.95700
                        30                        0.97295    0.94370   0.95810
           8            35          plubeda       0.96825    0.93575   0.95173
                        59                        0.87549    0.70752   0.78259
                        31                        0.96308    0.95246   0.95774
           9                        ccolon
                        34                        0.95648    0.95631   0.95639
           10           32          lsi uned      0.97280    0.94201   0.95716
                        36                        0.95950    0.93908   0.94918
                        38                        0.97695    0.92028   0.94777
           11           43          sohrab        0.96234    0.92242   0.94196
                        45                        0.94907    0.92815   0.93849
                        46                        0.96924    0.90909   0.93820
                        37                        0.97424    0.92310   0.94798
                        39                        0.97505    0.91915   0.94627
           12           40          jimblair      0.97327    0.92001   0.94589
                        41                        0.97180    0.92008   0.94524
                        42                        0.96985    0.92059   0.94458
                        44                        0.95591    0.92367   0.93951
           13           50           vcotik       0.88734    0.92089   0.90381
                        51                        0.88734    0.92089   0.90381
                        47                        0.93267    0.93590   0.93428
           14           48           Jordi        0.94357    0.92149   0.93240
                        57                        0.87986    0.85150   0.86545
                        49                        0.98284    0.85568   0.91486
           15                      lsi2 uned
                        58                        0.93509    0.77562   0.84792
                        52                        0.88881    0.89356   0.89118
                        53                        0.88881    0.89356   0.89118
           16                        VSP
                        54                        0.88361    0.89685   0.89018
                        55                        0.88361    0.89685   0.89018
           17           60           gauku        0.92299    0.59848   0.72613
            -            -       *Baseline-VT*   0.50594     0.51363   0.50976
           18           61          Aspie96       0.19780    0.55626   0.29183





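    A sketch of this greedy combination under the Best F1 criterion (the other
two criteria change only the acceptance test); here evaluate is assumed to return
precision, recall and F1 for a set of annotations:

```python
def greedy_join(ranked_runs, evaluate):
    """Greedily union runs ordered by individual F-score, keeping a
    candidate union only if it passes the acceptance criterion:
    higher F1, or equal F1 with higher recall."""
    joint = set(ranked_runs[0])
    _, best_r, best_f = evaluate(joint)
    for run in ranked_runs[1:]:
        candidate = joint | set(run)
        _, r, f = evaluate(candidate)
        if f > best_f or (f == best_f and r > best_r):
            joint, best_r, best_f = candidate, r, f
    return joint
```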
   The systems that were used to achieve the best results for these metrics were
the following:
 – Best F1:
   lukas.lange/run3 improves the F-score from 0 to 0.96961.
   lukas.lange/run2 improves the F-score from 0.96961 to 0.96997.
   lukas.lange/run1 improves the F-score from 0.96997 to 0.97033.
 – Best Recall:
   lukas.lange/run3 improves the recall from 0 to 0.96944.
   lukas.lange/run2 improves the recall from 0.96944 to 0.97209.
   lukas.lange/run1 improves the recall from 0.97209 to 0.97492.
   lukas.lange/run4 improves the recall from 0.97492 to 0.97562.
   Fadi/15-7 improves the recall from 0.97562 to 0.97898.
   Fadi/14-5 improves the recall from 0.97898 to 0.97951.
   Fadi/17-3 improves the recall from 0.97951 to 0.98022.
   Fadi/16-3 improves the recall from 0.98022 to 0.98039.
   nperez/ncrfpp improves the recall from 0.98039 to 0.98181.
   FSL/run1 improves the recall from 0.98181 to 0.98393.
   FSL/run2 improves the recall from 0.98393 to 0.9841.
   nperez/sp-test-03-empty improves the recall from 0.9841 to 0.98516.
   mhjabreel/run3 improves the recall from 0.98516 to 0.98551.
   mhjabreel/run2 improves the recall from 0.98551 to 0.98569.
   jiangdehuan/run3 improves the recall from 0.98569 to 0.98693.
   jiangdehuan/run2 improves the recall from 0.98693 to 0.9871.
   jimblair/run2 improves the recall from 0.9871 to 0.98763.
   jimblair/run3 improves the recall from 0.98763 to 0.98781.
   jiangdehuan/run1 improves the recall from 0.98781 to 0.98816.
   Jordi/run3 improves the recall from 0.98816 to 0.98869.
   vcotik/run5 improves the recall from 0.98869 to 0.98887.
 – Balanced:
   lukas.lange/run3 improves the recall from 0 to 0.96944 (+0.96944)
           without losing too much F-score: 0.96961 (-0.96961).
   lukas.lange/run2 improves the recall from 0.96944 to 0.97209 (+0.00265)
           without losing too much F-score: 0.96841 (0.00112).
   lukas.lange/run1 improves the recall from 0.97209 to 0.97492 (+0.00283)
           without losing too much F-score: 0.96647 (0.00194).
   Fadi/15-7 improves the recall from 0.97492 to 0.97863 (+0.00371)
           without losing too much F-score: 0.96181 (0.00466).
   Fadi/17-3 improves the recall from 0.97863 to 0.97951 (+0.00088)
           without losing too much F-score: 0.95868 (0.00313).
   nperez/ncrfpp improves the recall from 0.97951 to 0.98128 (+0.00177)
           without losing too much F-score: 0.95308 (0.00560).
   FSL/run1 improves the recall from 0.98128 to 0.98375 (+0.00247)
           without losing too much F-score: 0.94342 (0.00966).





   Table 8. Combining systems using finding the best combination (sub-track 1).

                         Criteria     Precision    Recall      F1
                         Best F1       0.96999    0.97068   0.97033
                         Balanced      0.90627    0.98375   0.94342
                        Best Recall    0.71230    0.98887   0.82811




   Table 8 summarizes the results of this experiment. The joint system trying
to maximize the F-score improved the result of the best individual system, but
by a very narrow margin. The balanced system improved the recall by 1.4 points
at the cost of decreasing the F-score by 2.6 points, probably a desirable trade-off.


Voting The combination of individual systems from the previous experiment
was done directly on the test set. It is very difficult for a given combination of
systems to be transferable from one data set to another. Therefore, it should
be taken as only an approximation of the upper bound that can be obtained
by combining individual systems. In this experiment, we combined the systems
using a voting scheme: we accepted as good the annotations that had been
predicted by at least N systems (as sketched below).
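    A sketch of this voting scheme, where each run is assumed to be a set of
(doc_id, type, start, end) annotation tuples:

```python
from collections import Counter

def vote(runs, n_votes):
    """Keep only the annotations predicted by at least n_votes runs."""
    counts = Counter(ann for run in runs for ann in set(run))
    return {ann for ann, c in counts.items() if c >= n_votes}
```

With n_votes = 1 this reduces to the union of all runs; the threshold is then
tuned on the train and development sets, as described next.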
    We created 50 systems for sub-track 1. The first system accepted all the
annotations predicted by at least one of the systems, while the last accepted
only the annotations predicted by at least 50 systems. The results of this
experiment are shown in Table 9. As expected, as the value of N increased (i.e.,
as we increased the number of required votes), the recall got worse and the
precision improved. The maximum F-score on the train and development sets
was obtained by combining 17 systems (F-score of 0.9942). When we used the
train and development sets to select the optimal value of N and applied this
value to the test set, we obtained an F-score of 0.9757. This score was lower than
the best one that could be obtained (0.9768, with N = 23), but the difference
was (in practice) negligible.
    Comparing the results of the two experiments, we see that the voting system
improved on the joint system by 0.54 points. In addition, as Table 9 shows, the
values were very stable and a non-optimal choice of the value of N did not change
the result much. The downside is that the voting scenario required many systems
to obtain this result (17 systems out of 63 had to agree in order to accept an
annotation), while the joint system was a combination of only 3 systems. The
voting system matched the performance of the joint system when N was 13,
scoring 0.9701 (the joint system scored 0.9703).
    For reasons of space, we do not include the results of this experiment for
sub-tracks 2A and 2B, but they showed a very similar behavior.


3.5   Performance drop

In this section we analyze the performance of the systems on the different data
sets. As we have said, the background set included the train set and the devel-








 Table 9. Combining systems using a voting scheme (sub-track 1).

       #            Train+Dev                      Test
               P          R      F1         P         R        F1
      1    1.0000     0.2331  0.3781    0.9947    0.2084    0.3446
      2    1.0000     0.7374  0.8489    0.9922    0.6054    0.7519
      3    1.0000     0.8253  0.9043    0.9915    0.6789    0.8059
      4    1.0000     0.8809  0.9367    0.9899    0.7575    0.8583
      5    1.0000     0.9170  0.9567    0.9882    0.8477    0.9126
      6    1.0000     0.9340  0.9659    0.9869    0.8739    0.9270
      7    1.0000     0.9427  0.9705    0.9862    0.8989    0.9405
      8    0.9997     0.9571  0.9779    0.9852    0.9170    0.9498
      9    0.9995     0.9620  0.9804    0.9845    0.9244    0.9535
      10   0.9994     0.9678  0.9834    0.9838    0.9349    0.9587
      11   0.9992     0.9804  0.9897    0.9823    0.9483    0.9650
      12   0.9989     0.9845  0.9916    0.9818    0.9530    0.9672
      13   0.9985     0.9879  0.9932    0.9815    0.9591    0.9701
      14   0.9982     0.9893  0.9937    0.9802    0.9652    0.9727
      15   0.9974     0.9906  0.9940    0.9797    0.9699    0.9748
      16   0.9966     0.9914  0.9940    0.9777    0.9731    0.9754
      17   0.9962     0.9922  0.9942    0.9769    0.9745    0.9757
      18   0.9953     0.9928  0.9941    0.9758    0.9768    0.9763
      19   0.9946     0.9933  0.9939    0.9740    0.9791    0.9765
      20   0.9938     0.9938  0.9938    0.9724    0.9802    0.9763
      21   0.9931     0.9943  0.9937    0.9714    0.9818    0.9766
      22   0.9925     0.9949  0.9937    0.9698    0.9837    0.9767
      23   0.9918     0.9952  0.9935    0.9686    0.9851    0.9768
      24   0.9913     0.9954  0.9933    0.9663    0.9863    0.9762
      25   0.9906     0.9956  0.9931    0.9647    0.9879    0.9761
      26   0.9898     0.9961  0.9930    0.9636    0.9884    0.9759
      27   0.9892     0.9964  0.9928    0.9626    0.9891    0.9757
      28   0.9883     0.9967  0.9924    0.9601    0.9896    0.9746
      29   0.9877     0.9969  0.9923    0.9587    0.9905    0.9743
      30   0.9865     0.9972  0.9918    0.9571    0.9912    0.9739
      31   0.9855     0.9974  0.9914    0.9539    0.9917    0.9725
      32   0.9846     0.9976  0.9911    0.9511    0.9917    0.9710
      33   0.9833     0.9979  0.9905    0.9477    0.9919    0.9693
      34   0.9821     0.9980  0.9900    0.9465    0.9922    0.9688
      35   0.9806     0.9981  0.9893    0.9444    0.9924    0.9678
      36   0.9788     0.9982  0.9884    0.9412    0.9927    0.9663
      37   0.9767     0.9983  0.9873    0.9343    0.9934    0.9630
      38   0.9743     0.9983  0.9862    0.9313    0.9938    0.9615
      39   0.9715     0.9984  0.9847    0.9270    0.9941    0.9594
      40   0.9674     0.9986  0.9828    0.9223    0.9947    0.9571
      41   0.9632     0.9987  0.9806    0.9193    0.9950    0.9557
      42   0.9568     0.9988  0.9773    0.9147    0.9952    0.9532
      43   0.9529     0.9990  0.9754    0.9108    0.9952    0.9511
      44   0.9493     0.9990  0.9735    0.9071    0.9955    0.9493
      45   0.9449     0.9991  0.9712    0.9020    0.9957    0.9465
      46   0.9411     0.9992  0.9693    0.8975    0.9959    0.9442
      47   0.9378     0.9992  0.9675    0.8924    0.9959    0.9413
      48   0.9338     0.9992  0.9654    0.8850    0.9960    0.9372
      49   0.9286     0.9996  0.9628    0.8760    0.9962    0.9322
      50   0.9214     0.9998  0.9590    0.8679    0.9964    0.9277





    Table 10. F-score of the systems on the train, development and test sets, and
    the performance drop from the development set to the test set.

 Sub-track      Team         Train      Dev      Test     Drop
     1       lukas.lange    0.9959   0.9710   0.9696   -0.0014
             Fadi           0.9977   0.9640   0.9633   -0.0007
             nperez         0.9906   0.9545   0.9602   +0.0057
             FSL            0.9655   0.9690   0.9595   -0.0095
             mhjabreel      0.9960   0.9643   0.9583   -0.0060
             lsi uned       0.9713   0.9500   0.9434   -0.0066
             jiangdehuan    0.9625   0.9096   0.9401   +0.0305
             jimblair       1.0000   1.0000   0.9375   -0.0625
             ccolon         0.9780   0.9356   0.9322   -0.0034
             sohrab         0.9529   0.9274   0.9312   +0.0038
             Jordi          0.9844   0.9217   0.9184   -0.0033
             plubeda        0.9808   0.8933   0.9038   +0.0105
             m.domrachev    1.0000   1.0000   0.9001   -0.0999
             lsi2 uned      0.9278   0.8944   0.8998   +0.0054
             vcotik         0.9689   0.8953   0.8968   +0.0015
             VSP            0.8981   0.8999   0.8601   -0.0398
             gauku          0.7250   0.7108   0.7092   -0.0016
             Aspie96        0.2840   0.2716   0.2778   +0.0062
    2A       lukas.lange    0.9961   0.9756   0.9749   -0.0007
             Fadi           0.9990   0.9681   0.9686   +0.0005
             nperez         0.9942   0.9604   0.9680   +0.0076
             mhjabreel      0.9972   0.9698   0.9661   -0.0037
             FSL            0.9715   0.9740   0.9641   -0.0099
             lsi uned       0.9740   0.9539   0.9486   -0.0053
             jiangdehuan    0.9638   0.9139   0.9457   +0.0318
             plubeda        0.9843   0.9327   0.9436   +0.0109
             jimblair       1.0000   1.0000   0.9428   -0.0572
             ccolon         0.9804   0.9427   0.9427    0.0000
             sohrab         0.9563   0.9308   0.9352   +0.0044
             vcotik         0.9719   0.9275   0.9297   +0.0022
             Jordi          0.9853   0.9270   0.9241   -0.0029
             m.domrachev    1.0000   1.0000   0.9033   -0.0967
             lsi2 uned      0.9294   0.8977   0.9024   +0.0047
             VSP            0.9013   0.9020   0.8703   -0.0317
             gauku          0.7270   0.7132   0.7138   +0.0006
             Aspie96        0.2943   0.2854   0.2917   +0.0063
    2B       lukas.lange    0.9970   0.9805   0.9853   +0.0048
             jiangdehuan    0.9934   0.9486   0.9818   +0.0332
             nperez         0.9953   0.9697   0.9759   +0.0062
             Fadi           0.9990   0.9745   0.9750   +0.0005
             mhjabreel      0.9986   0.9810   0.9747   -0.0063
             FSL            0.9836   0.9855   0.9682   -0.0173
             m.domrachev    0.9800   0.9664   0.9667   +0.0003
             plubeda        0.9900   0.9485   0.9581   +0.0096
             ccolon         0.9868   0.9549   0.9577   +0.0028
             lsi uned       0.9772   0.9617   0.9572   -0.0045
             sohrab         0.9715   0.9468   0.9492   +0.0024
             jimblair       1.0000   1.0000   0.9480   -0.0520
             vcotik         0.9749   0.9382   0.9395   +0.0013
             Jordi          0.9878   0.9868   0.9343   -0.0525
             lsi2 uned      0.9350   0.9117   0.9149   +0.0032
             VSP            0.9155   0.9165   0.8912   -0.0253
             gauku          0.7406   0.7288   0.7261   -0.0027
             Aspie96        0.2946   0.2856   0.2918   +0.0062





    All the scores of this analysis are shown in Table 10, where the drop column
indicates the difference in performance on the test set with respect to the
development set (a negative value indicates a lower performance on the test set).
Two teams achieved an F-score of 1.0 on both the train and development sets:
jimblair (in all sub-tracks) and m.domrachev (in sub-tracks 1 and 2A). The former
suffered a performance drop of 6.25 points on the test set, and the latter of 9.99
points, probably because both systems memorized the train and development data,
obtaining a perfect score at the cost of overfitting. This also suggests that they
may have used the development set to train the system, and not just to tune it.
    In contrast, lukas.lange, the first team on the test set for sub-track 1, was
also first on the development set and third on the train set (in both cases leaving
aside those who scored 1.0). The performance of their system dropped only 0.14
points on the test set with respect to the development set. They probably used the
train set to build the system and the development set only for tuning, thereby
avoiding overfitting. This shows that the ability of the systems to generalize was
very important.
    Taking into account all the sub-tracks, the maximum performance drop was
suffered by m.domrachev, losing 9.99 points in sub-track 1. Leaving aside those
who scored 1.0 on the development set, the system that lost the most points was
the one submitted by Jordi, which lost 5.25 points in sub-track 2B (against 0.33
points in sub-track 1 and 0.29 points in sub-track 2A). The next participants
with the highest loss of performance were VSP and FSL.
    The maximum improvement on the test set with respect to the development
set was 3.32 points, corresponding to the system submitted by jiangdehuan in
sub-track 2B.
    As a curiosity, ccolon scored exactly the same on the development and test
sets in sub-track 2A. However, its performance decreased with respect to the
train set (by 3.77 points).
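
    The drop column of Table 10 is simply the test F-score minus the development
F-score; the following toy Python snippet reproduces it for three sub-track 1
systems, using the values from the table:

    # Development and test F-scores (sub-track 1, from Table 10).
    scores = {
        "lukas.lange": (0.9710, 0.9696),
        "jimblair":    (1.0000, 0.9375),
        "m.domrachev": (1.0000, 0.9001),
    }

    for team, (dev, test) in scores.items():
        drop = test - dev
        flag = "  <- possible overfitting" if dev == 1.0 else ""
        print(f"{team:12s} drop = {drop:+.4f}{flag}")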


4   Discussion

The MEDDOCAN track attracted a considerable number of teams, not only from
Spain but also from other countries, underlining the global interest in overcoming
clinical data access hurdles while meeting patient data privacy requirements.
Compared to previous efforts for English, namely the i2b2 de-identification tracks,
MEDDOCAN reached an even higher level of participation. It is important to
point out that the MEDDOCAN track benefited significantly from the experiences,
settings and annotation process pioneered by the i2b2 efforts.
    In the case of the 2006 i2b2 shared task [24], a total of 7 teams participated in
the track, providing 16 systems. The five best systems scored above 0.95 for the
entity detection track and equaled or exceeded an F-score of 0.95 for the token-
based evaluation. The 2014 i2b2 de-identification shared task [21] had 10 teams
submitting 22 runs; the top team reached an F-score of 0.9360 for the entity
detection track and 0.9611 for the token-based evaluation. It is important to
mention that MEDDOCAN used a synthetic corpus, so the results may not be
directly comparable to those of i2b2. Moreover, it is well known that there is
considerable variability in the density, distribution and characteristics of sensitive
information even between different types of clinical records.
    De-identification remains a very hard task because of the special characteris-
tics of clinical texts and the importance of recall, i.e. avoiding the leakage of
sensitive information. The top three teams scored above 0.96 in F-score for the
sub-track based on entity detection.
    The top-scoring systems made use of cutting-edge NLP techniques, in partic-
ular deep learning. Their results are comparable to anonymization performed by
a single human annotator. Automatic anonymization followed by manual revision
to detect potential leakages might therefore yield anonymized Spanish clinical
records suitable for data redistribution. Nevertheless, a follow-up task using real
EHRs from various healthcare institutions, assessing the practical user scenario
with experts in the loop, would be desirable in order to quantify the cost reduc-
tions and quality benefits of anonymization strategies assisted by automated tools.


5    Conclusions

The results of the MEDDOCAN shared task and evaluation effort on the auto-
matic de-identification of sensitive information from texts in Spanish show that
advanced deep learning approaches, in combination with rule-based systems and
gazetteer resources, can provide very competitive results when a high-quality
manually labeled dataset is available. The construction of gold-standard corpora
is key and requires very detailed annotation guidelines and a carefully designed
corpus generation process with the involvement of clinical domain experts. We
expect that such corpora and evaluations will also be produced for data in other
languages, and that automatic anonymization and de-identification systems will
be beneficial beyond EHRs, for instance for medical surveys [8] or legal-financial
documents [3]. In order to increase the impact of future shared tasks on anonymiz-
ation, involvement should not be limited to academic language-technology groups,
but should also directly include data providers (health institutions), legal experts,
and national and European institutions. For instance, the European Medicines
Agency (EMA) has launched a Technical Anonymisation Group (TAG), a group
of experts in data anonymisation, to help further develop best practices for the
anonymisation of clinical reports. Moreover, we would also like to stress the key
importance of making the systems' code and the tools developed by participants
accessible, and the need to explore strategies to promote start-ups and the com-
mercialization of solutions resulting from shared tasks and evaluation campaigns.





Acknowledgements
We acknowledge the Encargo of Plan TL (SEAD) to CNIO and BSC for funding,
and the scientific committee for their valuable comments and guidance. We would
also like to thank Siamak Barzegar for his help in setting up MEDDOCAN at
CodaLab, and Felipe Soares for input in preparing the manuscript and task.


References
 1. Alfalahi, A., Brissman, S., Dalianis, H.: Pseudonymisation of personal names and
    other PHIs in an annotated clinical Swedish corpus. In: Third Workshop on Building
    and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012) Held in
    Conjunction with LREC. pp. 49–54 (2012)
 2. Amengol-Estapé, J., Soares, F., Marimon, M., Krallinger, M.: PharmaCoNER Tag-
    ger: a deep learning-based tool for automatically finding chemicals and drugs in
    Spanish medical texts. Genomics & Informatics 17(2) (2019)
 3. Bick, E., Barreiro, A.: Automatic anonymisation of a new Portuguese-English par-
    allel corpus in the legal-financial domain. Oslo Studies in Language 7(1) (2015)
 4. Cristóbal, R.S., Carrero, A.M., Carrasco, M.P., Rodríguez, M.C., Méndez, J.F.,
    de Mingo, M.G., Tello, J.C., de Madariaga, R.S., Serrano, A.C., Aza, I.V., et al.:
    Sistema anonimizador conforme a la norma UNE-EN ISO 13606 (2012)
 5. Fernández-Alemán, J.L., Señor, I.C., Lozoya, P.Á.O., Toval, A.: Security and pri-
    vacy in electronic health records: A systematic literature review. Journal of biomed-
    ical informatics 46(3), 541–562 (2013)
 6. Garcı́a Sardiña, L.: Automating the anonymisation of textual corpora (2018)
 7. Gaudet-Blavignac, C., Foufi, V., Wehrli, E., Lovis, C.: De-identification of French
    medical narratives. Swiss Medical Informatics 34(00) (2018)
 8. Gentili, M., Hajian, S., Castillo, C.: A case study of anonymization of medical
    surveys. In: Proceedings of the 2017 International Conference on Digital Health.
    pp. 77–81. ACM (2017)
 9. Grouin, C., Névéol, A.: De-identification of clinical notes in French: towards a
    protocol for reference corpus development. Journal of biomedical informatics 50,
    151–161 (2014)
10. Hassan, F., Domingo-Ferrer, J., Soria-Comas, J.: Anonimización de datos no es-
    tructurados a través del reconocimiento de entidades nominadas. In: Actas de la
    XV Reunión Española sobre Criptología y Seguridad de la Información - RECSI
    2018. pp. 102–106 (2018)
11. Intxaurrondo, A., Marimon, M., Gonzalez-Agirre, A., Lopez-Martin, J.A., Ro-
    driguez, H., Santamaria, J., Villegas, M., Krallinger, M.: Finding mentions of ab-
    breviations and their definitions in Spanish clinical cases: the BARR2 shared task
    evaluation results. In: IberEval@SEPLN. pp. 280–289 (2018)
12. Intxaurrondo, A., Pérez-Pérez, M., Pérez-Rodríguez, G., López-Martín, J.A., San-
    tamaria, J., de la Pena, S., Villegas, M., Akhondi, S.A., Valencia, A., Lourenço,
    A., Krallinger, M.: The Biomedical Abbreviation Recognition and Resolution
    (BARR) track: benchmarking, evaluation and importance of abbreviation recog-
    nition systems applied to Spanish biomedical abstracts. SEPLN (2017)
13. Mamede, N., Baptista, J., Dias, F.: Automated anonymization of text documents.
    In: 2016 IEEE Congress on Evolutionary Computation (CEC). pp. 1287–1294.
    IEEE (2016)




                                           637
           Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)




       MEDDOCAN: Automatic de-identification of medical texts in Spanish

14. Medina, S., Turmo, J.: Building a Spanish/Catalan health records corpus with very
    sparse protected information labelled. In: LREC 2018: Workshop MultilingualBIO:
    Multilingual Biomedical Text Processing: proceedings. pp. 1–7 (2018)
15. Megyesi, B., Granstedt, L., Johansson, S., Prentice, J., Rosén, D., Schenström,
    C.J., Sundberg, G., Wirén, M., Volodina, E.: Learner corpus anonymization in the
    age of GDPR: insights from the creation of a learner corpus of Swedish. In: Proceed-
    ings of the 7th Workshop on NLP for Computer Assisted Language Learning. pp.
    47–56 (2018)
16. Mota, E., Martín, N., Moreno, A., Ferrete, E., Santamaría, J., Marimon, M.,
    Intxaurrondo, A., Gonzalez-Agirre, A., Villegas, M., Krallinger, M.: Guías de
    anotación de información de salud protegida (Oct 2018),
    http://temu.bsc.es/meddocan/wp-content/uploads/2019/02/guías-de-anotación-de-información-de-salud-protegida.pdf
17. Pantazos, K., Lauesen, S., Lippert, S.: Preserving medical correctness, readability
    and consistency in de-identified health records. Health informatics journal 23(4),
    291–303 (2017)
18. Pérez-Pérez, M., Pérez-Rodríguez, G., Blanco-Míguez, A., Fdez-Riverola, F., Va-
    lencia, A., Krallinger, M., Lourenço, A.: Next generation community assessment of
    biomedical entity recognition web servers: metrics, performance, interoperability
    aspects of BeCalm. Journal of Cheminformatics 11(1), 42 (2019)
19. Santamarı́a, J., Krallinger, M.: Construcción de recursos terminológicos médicos
    para el español: el sistema de extracción de términos cutext y los repositorios de
    términos biomédicos. Procesamiento del Lenguaje Natural 61 (2018)
20. Scheurwegs, E., Luyckx, K., Van der Schueren, F., Van den Bulcke, T.: De-
    identification of clinical free text in Dutch with limited training data: a case study.
    In: Proceedings of the Workshop on NLP for Medicine and Biology associated with
    RANLP 2013. pp. 18–23 (2013)
21. Stubbs, A., Kotfila, C., Uzuner, Ö.: Automated systems for the de-identification of
    longitudinal clinical narratives: overview of 2014 i2b2/UTHealth shared task track
    1. Journal of biomedical informatics 58 Suppl, S11–9 (2015)
22. Tomanek, K., Daumke, P., Enders, F., Huber, J., Theres, K., Müller, M.: An inter-
    active de-identification system. In: Proceedings of SMBM 2012 - The 5th Inter-
    national Symposium on Semantic Mining in Biomedicine. pp. 82–86 (2012)
23. Tveit, A., Edsberg, O., Rost, T., Faxvaag, A., Nytro, O., Nordgard, T., Ranang,
    M.T., Grimsmo, A.: Anonymization of general practitioner medical records. In:
    Second HelsIT Conference (2004)
24. Uzuner, Ö., Luo, Y., Szolovits, P.: Evaluating the state-of-the-art in automatic
    de-identification. Journal of the American Medical Informatics Association 14(5),
    550–563 (2007). https://doi.org/10.1197/jamia.M2444
25. Vico, H., et al.: Definición de una arquitectura de referencia para anonimizar doc-
    umentos (2013)
26. Villegas, M., Intxaurrondo, A., Gonzalez-Agirre, A., Marimon, M., Krallinger, M.:
    The MeSpEn resource for English-Spanish medical machine translation and ter-
    minologies: census of parallel corpora, glossaries and term translations. In: Pro-
    ceedings of the LREC 2018 Workshop MultilingualBIO: Multilingual Biomedical
    Text Processing, Paris, France. European Language Resources Association (ELRA)
    (2018)



