<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Window Classifiers and Conditional Random Fields for Medical Report De-Identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viviana Cotik</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franco M. Luque</string-name>
          <email>francolq@famaf.unc.edu.ar</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan Manuel Perez</string-name>
          <email>jmperezg@dc.uba.ar</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CONICET</institution>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Departamento de Computacion, FCEyN, Universidad de Buenos Aires</institution>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>FAMAF, Universidad Nacional de Cordoba</institution>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Instituto de Ciencias de la Computacion</institution>
          ,
          <addr-line>CONICET</addr-line>
          ,
          <country country="AR">Argentina</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>744</fpage>
      <lpage>754</lpage>
      <abstract>
        <p>Information extraction from medical reports is key to the timely discovery of findings and helps improve decisions about medical treatments and budgets. Developing information extraction methods requires medical data to be available. Since this data is extremely sensitive due to the presence of personal information, report de-identification is needed. We present two methods, a window classifier and an implementation of conditional random fields (CRF), to de-identify personal information in Spanish medical records provided by the MEDDOCAN challenge. CRF obtained the best results, with an F1-measure of 0.897 for named entity recognition with exact match (subtask 1), and 0.930 and 0.940 for inexact match (subtask 2, strict and merged respectively).</p>
      </abstract>
      <kwd-group>
        <kwd>report anonymization</kwd>
        <kwd>report de-identification</kwd>
        <kwd>named entity recognition</kwd>
        <kwd>BioNLP</kwd>
        <kwd>Spanish medical reports</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The exponential growth of available biomedical texts in recent years, together
with the value of structured information for triggering automatic alerts about urgent
situations, supporting better diagnoses, and feeding clinical decision support
systems, among other uses, has created the need to extract information
automatically using disciplines such as natural language processing (NLP)
and information extraction (IE). The area of study that deals with information
extraction from biomedical texts is called BioNLP or biomedical text mining.</p>
      <p>Medical data is sensitive, due to the presence of personal
information. Therefore, medical reports have to be anonymized before they can be processed.</p>
      <p>In the clinical domain, anonymization or de-identification is the process of
removing from medical records all information that could identify a patient or the
physician performing the study or diagnosis. In some cases, names and patients</p>
      <p>* All authors contributed equally to the work.</p>
      <p>
        Many factors have to be taken into account when anonymizing medical
records. See [
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ] for a discussion of perturbative and non-perturbative methods.
Moreover, the concept of data anonymization is ambiguous, as explained in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
which presents three forms of sharing data (public, quasi-public, non-public) and
different ways of perturbing data for de-identification purposes, together with their
effects on the possibility of a meaningful analysis.
      </p>
      <p>
        Some de-identification challenges have been organized in the past, for
example the i2b2 2006 De-identification and Smoking Challenge [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and the 2014
De-identification and Heart Disease Risk Factors Challenge.<sup>8</sup>
      </p>
      <p>
        In this paper we present two methods we have developed for MEDDOCAN
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],<sup>9</sup> a medical document anonymization task of IberLEF 2019 (Iberian
Languages Evaluation Forum).<sup>10</sup> MEDDOCAN is specifically devoted to the
anonymization of medical documents in Spanish.
      </p>
      <p>The MEDDOCAN task was structured into two sub-tasks: 1) NER offset and
entity type classification, and 2) sensitive token detection.</p>
      <p>
        A synthetic corpus of 1000 clinical case studies augmented with protected
health information (PHI) from discharge summaries and medical genetics
clinical records was provided. Additionally, a number of linguistic resources were
supplied:<sup>11</sup>
        <list list-type="bullet">
          <list-item><p>AbreMES-DB,<sup>12</sup> a Spanish medical abbreviation database that contains
abbreviations and their potential definitions, automatically extracted from
the metadata of titles and abstracts of biomedical publications written in
Spanish,</p></list-item>
          <list-item><p>MEDDOCAN-Gazetteer, composed of MEDDOCAN-related entities. It
includes names, surnames, addresses, hospitals, professions, and different
types of locations (provinces, cities and towns, among others),</p></list-item>
          <list-item><p>SPACCC POS-TAGGER,<sup>13</sup> a Part-of-Speech tagger for Spanish texts in
the medical domain based on FreeLing [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and a</p></list-item>
          <list-item><p>Sentence-split test-set computed using the SPACCC POS-TAGGER.</p></list-item>
        </list>
      </p>
      <p>5 https://www.hhs.gov/hipaa/index.html
6 https://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31995L0046:en:HTML
7 http://servicios.infoleg.gob.ar/infolegInternet/anexos/160000-164999/160432/texact.htm, http://servicios.infoleg.gob.ar/infolegInternet/verNorma.do?id=19429, http://servicios.infoleg.gob.ar/infolegInternet/anexos/60000-64999/64790/texact.htm
8 https://www.i2b2.org/NLP/HeartDisease/
9 MEDDOCAN: http://temu.bsc.es/meddocan/.
10 IBERLEF 2019: https://sites.google.com/view/iberlef-2019/.
11 http://temu.bsc.es/meddocan/index.php/resources/
12 https://github.com/PlanTL-SANIDAD/AbreMES-DB</p>
      <p>The rest of the paper is organized as follows. Section 2 presents previous work
on anonymization in the biomedical domain. Section 3 presents the data set,
the implemented methods, and the performance evaluation metrics. Section
4 presents the results of our methods and their analysis. Finally, Section 5 presents
conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Previous work</title>
      <p>In this section we describe the previous work in anonymization in the biomedical
domain.</p>
      <p>
        Emam [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] wrote a technical report describing how patient health data was
being anonymized in Canada in 2006 and discussing the adequacy of these
practices. Some de-identification challenges have been organized in the past,
for example the i2b2 2006 De-identification and Smoking Challenge [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and the 2014
De-identification and Heart Disease Risk Factors Challenge.<sup>14</sup> The annotation
process performed for the latter is explained in Stubbs and Özlem Uzuner [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
The i2b2 effort focused on documents in English and covered characteristics
of US healthcare data providers. Gkoulalas-Divanis and Loukides [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] describe
algorithms to anonymize while preserving patient demographics (non-perturbative
anonymization) and to anonymize diagnosis codes.
      </p>
      <p>
        A tutorial on Privacy Challenges and Solutions for Medical Data Sharing was
organized by IBM Research Zurich and Cardiff University in 2011.<sup>15</sup>
Gkoulalas-Divanis et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] present a survey of more than 45 algorithms that have been
proposed for publishing EHR data while preserving patient privacy.
      </p>
      <p>
        Emam et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] mention the ambiguity of the concept of anonymous data
and different ways of data perturbation with de-identification goals and their
effects with regard to the possibility of a meaningful analysis.
      </p>
      <p>
        Many other analyses and surveys on the subject [
        <xref ref-type="bibr" rid="ref12 ref20 ref23">12, 20, 23</xref>
        ] and
anonymization implementations [
        <xref ref-type="bibr" rid="ref18 ref21 ref22 ref4 ref7">4, 7, 18, 21, 22</xref>
        ] have been published.
      </p>
      <p>
        Rules and regular expressions have been used, e.g., for medical reports in
Spanish [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Token-level and character-level conditional random fields and long
short-term memory networks have obtained good results in the de-identification of
medical records [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
      <p>13 https://github.com/PlanTL-SANIDAD/SPACCC_POS-TAGGER
14 https://www.i2b2.org/NLP/HeartDisease/
15 Tutorial on Privacy Challenges and Solutions for Medical Data Sharing. Slides.
https://www.zurich.ibm.com/medical-privacy-tutorial/.</p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>This section presents the datasets used, the evaluation metrics, the preprocessing
of data, and the two proposed methods. A flow chart of the processing
pipeline can be seen in Figure 1.</p>
      <p><bold>Data</bold></p>
      <p>We used the MEDDOCAN corpus, a synthetic corpus composed of 1,000
clinical cases manually selected by a practicing physician and enriched with PHI
expressions extracted from discharge summaries and genetics clinical records. It
has around 33,000 sentences and 495,000 words, with an average of around 33
sentences and 494 words per clinical case.</p>
      <p>The corpus was divided into three subsets. The training set comprises 500
clinical cases, and the development and test sets 250 clinical cases each. The test
set was not provided until the end of the challenge.</p>
      <p>An additional test set of 3,751 clinical cases, which includes the actual test
set, the training set, the development set, and a background set of 2,000 documents,
was provided in order to test the submissions.<sup>16</sup> The actual test
set (250 clinical cases) is called test set with Gold Standard Annotations and the
test set provided before the end of the challenge (3,751 clinical cases) is called
test set (including background set).</p>
      <p>A conversion script between the BRAT standoff annotation format<sup>17</sup> and the i2b2
annotation format was provided.<sup>18</sup></p>
      <p>The annotation schema used by the challenge organizers to create the dataset
defines 29 entities and was inspired by the annotation schema used for the i2b2
de-identification tracks, adapted to the specificities of the MEDDOCAN
document collection. An iterative process was followed until the final guidelines
were obtained. The inter-annotator agreement (IAA) calculated on a set of
50 double-annotated records is reported as 98% in exact match.<sup>19</sup> The dataset
annotation guidelines can be downloaded.<sup>20</sup></p>
      <p><bold>Evaluation metrics</bold></p>
      <p>Given that different uses might require a different balance between precision
and recall, two sub-tasks were proposed: sub-task 1 (named entity
recognition offset and entity type classification) and sub-task 2 (sensitive span
detection). Sub-task 1 requires an exact match (i.e., prediction and gold standard
have to begin and end at the same offsets and also have the same classification).
Sub-task 2 allows inexact match and possibly wrong entity types, evaluating
whether spans belonging to sensitive phrases are detected correctly. Sub-task 2
has two evaluations: strict and merged.</p>
      <p>16 Datasets can be found at http://temu.bsc.es/meddocan/index.php/data/.
17 https://brat.nlplab.org/standoff.html
18 https://github.com/PlanTL-SANIDAD/MEDDOCAN-Format-Converter-Script
19 An abstract of the annotation process and schema can be seen at http://temu.bsc.es/meddocan/index.php/annotation-guidelines/.
20 http://temu.bsc.es/meddocan/wp-content/uploads/2019/02/guADas-de-anotacinde-informacin-de-salud-protegida.pdf</p>
      <p>Micro-averaged precision, recall, and F1 score are used to evaluate both
subtasks (see definitions below). Micro-averaged metrics pool the counts of
all the classes together before computing each metric.</p>
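      <p>
        <disp-formula><tex-math>\mathrm{Precision}\ (P) = \frac{TP}{TP + FP}</tex-math></disp-formula>
        <disp-formula><tex-math>\mathrm{Recall}\ (R) = \frac{TP}{TP + FN}</tex-math></disp-formula>
        <disp-formula><tex-math>F_1 = \frac{2PR}{P + R}</tex-math></disp-formula>
        <disp-formula><tex-math>\mathrm{Leaks} = \frac{FN}{\#\mathrm{sentences}}</tex-math></disp-formula>
      </p>
      <p>where TP is the number of true positives, FP the number of false positives,
and FN the number of false negatives.</p>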
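      <p>As a minimal sketch of the definitions above, the following Python snippet computes micro-averaged precision, recall, F1 and the leaks score from sets of annotations. The function names, the tuple layout (document, start, end, type) and the toy data are assumptions made for illustration; this is not the official MEDDOCAN evaluation script.</p>
      <preformat><![CDATA[
```python
def micro_scores(gold, predicted):
    """Return micro-averaged precision, recall, F1 and the FN count.

    Both arguments are iterables of hashable annotations, e.g.
    (document id, start offset, end offset, entity type) tuples.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold.intersection(predicted))   # exact matches
    fp = len(predicted.difference(gold))     # spurious predictions
    fn = len(gold.difference(predicted))     # missed annotations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, fn


def leaks(fn, n_sentences):
    """Leaks = FN / #sentences, following the definition above."""
    return fn / n_sentences


# Toy data, invented for illustration (not taken from the corpus).
gold = [("doc1", 0, 5, "CALLE"), ("doc1", 10, 16, "TERRITORIO"), ("doc2", 3, 9, "CALLE")]
predicted = [("doc1", 0, 5, "CALLE"), ("doc1", 10, 16, "CALLE")]

precision, recall, f1, fn = micro_scores(gold, predicted)
leak_score = leaks(fn, n_sentences=4)
print(precision, recall, f1, leak_score)
```
]]></preformat>
      <p>Because all classes are pooled into a single set before counting, frequent entity types weigh more in the final score than rare ones, which is the defining property of micro-averaging.</p>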
      <p>Two additional metrics were computed to evaluate the best methods of the
tasks: leaks for sub-task 1 and merged for sub-task 2. The first computes the
proportion of PHIs that were not detected. The merged evaluation of sub-task 2
merges the spans of PHIs connected by non-alphanumerical characters before
evaluating the results.</p>
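      <p>The merging step of the merged evaluation can be sketched as follows. This is an illustrative reading of the description above, assuming spans are (start, end) character offsets with an exclusive end; the function name and the example sentence are hypothetical, and the official evaluation script may differ in its details.</p>
      <preformat><![CDATA[
```python
def merge_spans(text, spans):
    """Merge (start, end) spans separated only by non-alphanumeric characters."""
    merged = []
    for start, end in sorted(spans):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = text[prev_end:start]  # empty when the spans touch or overlap
            if not any(ch.isalnum() for ch in gap):
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


# Invented example sentence: "Juan" and "Perez" are separated by a space only,
# while "Central" is separated from them by the word "Hospital".
text = "Dr. Juan Perez, Hospital Central"
spans = [(4, 8), (9, 14), (25, 32)]
merged = merge_spans(text, spans)
print(merged)
```
]]></preformat>
      <p>In this example the first two spans collapse into one covering "Juan Perez", while the last span stays separate because alphanumeric text lies between them.</p>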
      <p><bold>Data Preparation</bold></p>
      <p>We first apply standard text preprocessing techniques such as sentence
segmentation and word tokenization. Then, we use the BIO encoding, a standard
technique that models Named Entity Recognition as a sequence tagging problem. In
the BIO encoding, each entity type &lt;T&gt; has labels B-&lt;T&gt; and I-&lt;T&gt; to mark,
respectively, tokens at the beginning or the continuation of an entity. The
label O is used to tag "outside" tokens, that is, tokens that do not belong to any
entity.</p>
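      <p>To make the BIO encoding concrete, here is a small sketch that converts character-offset entity annotations into one tag per token. It assumes tokens carry their own offsets and that entity boundaries coincide with token boundaries; the function name and the example sentence are hypothetical and are not taken from the authors' code.</p>
      <preformat><![CDATA[
```python
def bio_encode(tokens, entities):
    """Assign a BIO tag to every token.

    tokens:   list of (text, start, end) with character offsets
    entities: list of (start, end, entity_type) with character offsets
    """
    tags = []
    for _, tok_start, tok_end in tokens:
        tag = "O"  # default: the token is outside any entity
        for ent_start, ent_end, ent_type in entities:
            if tok_start >= ent_start and ent_end >= tok_end:
                # The first token of the entity gets B-, later ones get I-.
                tag = ("B-" if tok_start == ent_start else "I-") + ent_type
                break
        tags.append(tag)
    return tags


# Invented example: "Paciente Juan Perez ingresa"
tokens = [("Paciente", 0, 8), ("Juan", 9, 13), ("Perez", 14, 19), ("ingresa", 20, 27)]
entities = [(9, 19, "NOMBRE_SUJETO_ASISTENCIA")]
tags = bio_encode(tokens, entities)
print(list(zip([t[0] for t in tokens], tags)))
```
]]></preformat>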
      <p>
        We also perform Part-of-Speech tagging using our own classifier-based PoS
tagger.<sup>21</sup> The tagger was trained on the AnCora 3.0.1 Spanish corpus [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], which
uses the EAGLES tagset. Instead of using the full tagset, we used a simplified
85-tag version, as done in the Stanford CoreNLP tagger.<sup>22</sup>
      </p>
      <p><bold>Window Classifier</bold></p>
      <p>The window classifier uses a classifier to tag each token in the
text based on information from the token itself and from tokens in a fixed window
surrounding it. Our current system uses the two previous tokens and the next
token, that is, a window of size four with the center word at the third position.</p>
      <p>21 Unpublished work.
22 https://stanfordnlp.github.io/CoreNLP/</p>
      <p>
        Features for all tokens include the lowercased token, a prefix and a suffix of
length 2, the PoS tag, and boolean features indicating whether the word is in upper
case and whether it is composed only of digits. For the center word, a word embedding
is also included as a feature. Here we use 300-dimensional fastText embeddings,<sup>23</sup>
pre-trained on Spanish Wikipedia [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
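      <p>The window features described above can be sketched as a dictionary per token. The snippet below is an illustrative assumption about how such features might be laid out: the feature names, the padding marker, and the short PoS tags in the example are hypothetical; "upper case" is interpreted as the whole word being upper case; and the fastText embedding of the center word is left out to keep the example self-contained.</p>
      <preformat><![CDATA[
```python
def token_features(token, pos_tag, prefix):
    """Features of one token; `prefix` encodes its position in the window."""
    return {
        prefix + "lower": token.lower(),
        prefix + "prefix2": token[:2],
        prefix + "suffix2": token[-2:],
        prefix + "pos": pos_tag,
        prefix + "is_upper": token.isupper(),
        prefix + "is_digit": token.isdigit(),
    }


def window_features(tokens, pos_tags, i, before=2, after=1):
    """Features for token i using `before` previous and `after` next tokens."""
    features = {}
    for offset in range(-before, after + 1):
        j = i + offset
        prefix = format(offset, "+d") + ":"  # "-2:", "-1:", "+0:", "+1:"
        if j in range(len(tokens)):
            features.update(token_features(tokens[j], pos_tags[j], prefix))
        else:
            features[prefix + "pad"] = True  # position falls outside the sentence
    return features


# Invented example; the PoS tags are placeholders, not real tagger output.
tokens = ["Dr.", "Juan", "Perez", "ingresa"]
pos_tags = ["nc", "np", "np", "vm"]
feats = window_features(tokens, pos_tags, 1)
print(sorted(feats))
```
]]></preformat>
      <p>Dictionaries of this kind can be turned into sparse vectors by a one-hot feature vectorizer before being passed to a linear classifier.</p>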
      <p>
        For the classifier, in this work, we use a Support Vector Machine (SVM) with
a linear kernel, as provided by scikit-learn [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Each token is tagged
independently by the classifier, leading to a very fast system.
      </p>
      <p>As tags are independent of each other, the tagger may output sequences
inconsistent with the BIO representation, such as an I-&lt;T&gt; tag after an O tag.
To fix this, we introduce a post-processing step that converts all inconsistent
I-&lt;T&gt; tags to B-&lt;T&gt; tags.</p>
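      <p>A minimal sketch of such a post-processing step is shown below. The text only gives the case of an I-&lt;T&gt; tag following an O tag; treating an I-&lt;T&gt; tag that follows a tag of a different entity type as inconsistent too is an assumption of this sketch, and the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
def repair_bio(tags):
    """Turn every I-<T> tag that does not continue a <T> entity into B-<T>."""
    fixed = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            ent_type = tag[2:]
            # An I- tag is consistent only right after B- or I- of the same type.
            if prev not in ("B-" + ent_type, "I-" + ent_type):
                tag = "B-" + ent_type
        fixed.append(tag)
        prev = tag  # compare against the already repaired tag
    return fixed


raw = ["O", "I-CALLE", "I-CALLE", "B-TERRITORIO", "I-CALLE"]
repaired = repair_bio(raw)
print(repaired)
```
]]></preformat>
      <p>Comparing against the repaired previous tag rather than the raw one lets a run of I- tags after an O tag become one entity (B- followed by I-) instead of several single-token entities.</p>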
      <p><bold>Conditional Random Fields</bold></p>
      <p>Conditional Random Fields (CRFs) are probabilistic models used to predict
sequences of labels based on sequences of input samples.</p>
      <p>Each token has an associated vector of features, such as the word's
part-of-speech tag and the word's suffix of a given length. The input of a CRF is the
sequence of tokens of the text. The features of a token and the pattern of labels
assigned to previous words are used to determine the most likely label for the
current token. In a linear-chain CRF, only the label of the previous token is
used.</p>
      <p>23 https://fasttext.cc/</p>
      <p>As features we used the word in lower case, a prefix of length 3, a suffix of
length 4, boolean values indicating whether the word is written entirely in upper case
and whether it consists only of digits, the PoS tag, and a reduced PoS tag of length 2.
One word before and one word after were also considered.</p>
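      <p>The CRF features listed above could be represented as one dictionary per token and one list of dictionaries per sentence, which is the input layout accepted by sklearn-crfsuite. The sketch below is an assumption about that layout, not the authors' code: feature names, sentence-boundary markers, and the example tags are hypothetical, the reduced PoS tag is taken to be the first two characters of the tag, and the neighbouring words are assumed to contribute the same features as the current word.</p>
      <preformat><![CDATA[
```python
def crf_token_features(tokens, pos_tags, i):
    """Feature dictionary for token i, including the previous and next word."""

    def describe(j, prefix):
        word, pos = tokens[j], pos_tags[j]
        return {
            prefix + "lower": word.lower(),
            prefix + "prefix3": word[:3],
            prefix + "suffix4": word[-4:],
            prefix + "is_upper": word.isupper(),
            prefix + "is_digit": word.isdigit(),
            prefix + "pos": pos,
            prefix + "pos2": pos[:2],  # reduced PoS tag of length 2
        }

    features = describe(i, "")
    if i > 0:
        features.update(describe(i - 1, "-1:"))
    else:
        features["BOS"] = True  # beginning of sentence
    if i != len(tokens) - 1:
        features.update(describe(i + 1, "+1:"))
    else:
        features["EOS"] = True  # end of sentence
    return features


# Invented example; the EAGLES-style tags are placeholders.
tokens = ["Paciente", "Juan", "Perez"]
pos_tags = ["nc0s000", "np00000", "np00000"]
sentence_features = [crf_token_features(tokens, pos_tags, i) for i in range(len(tokens))]
print(sentence_features[1])
```
]]></preformat>
      <p>A list of such sentences, paired with the corresponding BIO tag sequences, would then serve as training input for a linear-chain CRF.</p>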
      <p>
        For the implementation, CRFsuite [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and the sklearn-crfsuite wrapper were
used.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>
        A single run was submitted for each method. Results for both methods can
be seen in Table 1. Our results are compared to the results obtained by other
MEDDOCAN participating teams in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>Table 2 displays the classes with the lowest F1 among those with at least 50 instances, and
Figure 2 presents the confusion matrix of a subset of the named entities.</p>
      <p>We can observe some common errors: e.g., TERRITORIO (territory) is frequently
tagged as CALLE (street) or as one of the ID entities (postal codes are
annotated as TERRITORIO). Another common error is to misclassify
NOMBRE_SUJETO_ASISTENCIA as NOMBRE_PERSONAL_SANITARIO
(and vice versa); this is probably because both refer to proper nouns.
A model with more, and more sophisticated, features (such as word embeddings)
and possibly a different window size would probably overcome these problems.</p>
      <p>We also noticed some annotation errors, such as Madrid and Mendieta
Espinosa, which were not annotated as entities. These errors can negatively
influence the performance of the learning algorithm. Moreover, entities
that are correctly discovered are erroneously counted as errors.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future work</title>
      <p>As future work we plan to include in our models some of the linguistic resources
provided by the challenge, namely AbreMES-DB, a Spanish medical
abbreviation database, and the MEDDOCAN gazetteer, which includes names,
surnames, addresses, hospitals, professions, and different types of locations, such
as provinces, cities and towns. We tried to integrate them into our system, but
this degraded performance. We also plan to use regular
expressions for detecting e-mails and phone numbers. More feature engineering
for the CRF should also be performed.</p>
      <p>Another line of work to tackle NER could be the use of recurrent neural
networks (RNN); in particular, bidirectional RNNs. We trained a biLSTM but,
due to the small size of the dataset, the model failed to converge.</p>
      <p>Since we used neither the Spanish abbreviation database nor the MEDDOCAN
gazetteer, the only components that take language into account are our PoS
tagger (trained on the AnCora Spanish corpus) and the word embeddings.
Therefore, we think that our methods could be easily adapted to other languages.
Regarding Spanish, although there are many differences between countries
(e.g., the Rioplatense Spanish of Argentina and Uruguay), the entities of interest
in this task that were recognized by our methods should yield similar results
for Spanish from different regions of the world.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>5</volume>
          ,
          <fpage>135</fpage>
          –
          <lpage>146</lpage>
          (
          <year>2017</year>
          ), ISSN 2307-387X
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Cotik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Information extraction from Spanish radiology reports</article-title>
          .
          <source>In: PhD Thesis</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Cotik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Filippo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roller</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Annotation of entities and relations in Spanish radiology reports</article-title>
          .
          <source>In: RANLP</source>
          , pp.
          <fpage>177</fpage>
          –
          <lpage>184</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Dalianis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velupillai</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields</article-title>
          .
          <source>J. Biomedical Semantics</source>
          <volume>1</volume>
          ,
          <issue>6</issue>
          (
          <year>2010</year>
          ), https://doi.org/10.1186/2041-1480-1-6
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Emam</surname>
            ,
            <given-names>K.E.</given-names>
          </string-name>
          :
          <article-title>Data Anonymization Practices in Clinical Research. A descriptive study</article-title>
          . University of Ottawa (
          <year>2006</year>
          ), URL http://www.ehealthinformation.ca/wp-content/uploads/2014/07/2006-Data-Anonymization-Practices.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Emam</surname>
            ,
            <given-names>K.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodgers</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Anonymising and sharing individual patient data</article-title>
          .
          <source>BMJ</source>
          <volume>350</volume>
          (
          <year>2015</year>
          ), URL http://www.bmj.com/content/350/bmj.h1139
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Gentili</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castillo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A case study of anonymization of medical surveys</article-title>
          .
          <source>In: Proceedings of the 2017 International Conference on Digital Health, London, United Kingdom, July 2-5, 2017</source>
          , pp.
          <fpage>77</fpage>
          –
          <lpage>81</lpage>
          (
          <year>2017</year>
          ), https://doi.org/10.1145/3079452.3079490
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Gkoulalas-Divanis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loukides</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Overview of Patient Data Anonymization</article-title>
          , pp.
          <fpage>9</fpage>
          –
          <lpage>30</lpage>
          . Springer New York, New York, NY (
          <year>2013</year>
          ), ISBN 978-1-4614-5668-1
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Gkoulalas-Divanis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loukides</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Publishing data from electronic health records while preserving privacy: A survey of algorithms</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>50</volume>
          ,
          <fpage>4</fpage>
          –
          <lpage>19</lpage>
          (
          <year>2014</year>
          ), ISSN 1532-0464, https://doi.org/10.1016/j.jbi.2014.06.002, URL http://www.sciencedirect.com/science/article/pii/S1532046414001403, special issue on Informatics Methods in Medical Privacy
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>De-identification of medical records using conditional random fields and long short-term memory networks</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>75</volume>
          ,
          <fpage>S43</fpage>
          –
          <lpage>S53</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Automatic de-identification of electronic medical records using token-level and character-level conditional random fields</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>58</volume>
          ,
          <fpage>S47</fpage>
          –
          <lpage>S52</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Malin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karp</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scheuermann</surname>
            ,
            <given-names>R.H.</given-names>
          </string-name>
          :
          <article-title>Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research</article-title>
          .
          <source>Journal of Investigative Medicine</source>
          <volume>58</volume>
          (
          <issue>1</issue>
          ),
          <fpage>11</fpage>
          –
          <lpage>18</lpage>
          (
          <year>2010</year>
          ), ISSN 1081-5589, https://doi.org/10.2310/JIM.0b013e3181c9b2ea, URL http://jim.bmj.com/content/58/1/11
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Marimon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Intxaurrondo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez Martin</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results</article-title>
          .
          <source>In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)</source>
          , vol. TBA, p. TBA, CEUR Workshop Proceedings (CEUR-WS.org), Bilbao, Spain (Sep
          <year>2019</year>
          ), URL TBA
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Okazaki</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>CRFsuite: a fast implementation of conditional random fields (CRFs)</article-title>
          (
          <year>2007</year>
          ), URL http://www.chokkan.org/software/crfsuite/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Padró</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanilovsky</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>FreeLing 3.0: Towards wider multilinguality</article-title>
          .
          <source>In: Proceedings of the Language Resources and Evaluation Conference (LREC 2012)</source>
          , ELRA, Istanbul, Turkey (May
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          –
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Stubbs</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , O zlem Uzuner:
          <article-title>Annotating longitudinal clinical narratives for de-identi cation: The 2014 i2b2/UTHealth corpus</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>58</volume>
          ,
          <issue>20</issue>
          {
          <fpage>29</fpage>
          (
          <year>2015</year>
          ), ISSN 1532-0464, https://doi.org/https://doi.org/10.1016/j.jbi.
          <year>2015</year>
          .
          <volume>07</volume>
          .020, URL http: //www.sciencedirect.com/science/article/pii/S1532046415001823
          <source>, proceedings of the 2014 i2b2/UTHealth Shared-Tasks and Workshop on Challenges in Natural Language Processing for Clinical Data</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Szarvas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farkas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Busa-Fekete</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>State-of-the-art Anonymization of Medical Records Using an Iterative Machine Learning Framework</article-title>
          .
          <source>In: J Am Med Inform Assoc.</source>
          , vol.
          <volume>14</volume>
          , pp.
          <fpage>574</fpage>
          –
          <lpage>580</lpage>
          (
          <year>2007</year>
          ), URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1975791/
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Taulé</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martí</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Recasens</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>AnCora: Multilevel annotated corpora for Catalan and Spanish</article-title>
          .
          <source>In: Proceedings of the International Conference on Language Resources and Evaluation</source>
          , European Language Resources Association, Marrakech, Morocco (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>O .</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szolovits</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Viewpoint paper: Evaluating the state-of-the-art in automatic de-identification</article-title>
          .
          <source>JAMIA</source>
          <volume>14</volume>
          (
          <issue>5</issue>
          ),
          <fpage>550</fpage>
          –
          <lpage>563</lpage>
          (
          <year>2007</year>
          ), https://doi.org/10.1197/jamia.M2444, URL https://doi.org/10.1197/jamia.M2444
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>O .</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sibanda</surname>
            ,
            <given-names>T.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szolovits</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A de-identifier for medical discharge summaries</article-title>
          .
          <source>Artificial Intelligence in Medicine</source>
          <volume>42</volume>
          (
          <issue>1</issue>
          ),
          <fpage>13</fpage>
          –
          <lpage>35</lpage>
          (
          <year>2008</year>
          ), https://doi.org/10.1016/j.artmed.2007.10.001, URL https://doi.org/10.1016/j.artmed.2007.10.001
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Velupillai</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dalianis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nilsson</surname>
            ,
            <given-names>G.H.</given-names>
          </string-name>
          :
          <article-title>Developing a standard for de-identifying electronic patient records written in Swedish: Precision, recall and F-measure in a manual and computerized annotation trial</article-title>
          .
          <source>I. J. Medical Informatics</source>
          <volume>78</volume>
          (
          <issue>12</issue>
          ),
          <fpage>19</fpage>
          –
          <lpage>26</lpage>
          (
          <year>2009</year>
          ), https://doi.org/10.1016/j.ijmedinf.2009.04.005, URL https://doi.org/10.1016/j.ijmedinf.2009.04.005
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Wellner</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huyck</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mardis</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aberdeen</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morgan</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peshkin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yeh</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hitzeman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirschman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Research paper: Rapidly retargetable approaches to de-identification in medical records</article-title>
          .
          <source>JAMIA</source>
          <volume>14</volume>
          (
          <issue>5</issue>
          ),
          <fpage>564</fpage>
          –
          <lpage>573</lpage>
          (
          <year>2007</year>
          ), https://doi.org/10.1197/jamia.M2435, URL https://doi.org/10.1197/jamia.M2435
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>