<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ReCRF: Spanish Medical Document Anonymization using Automatically-crafted Rules and CRF</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fadi Hassan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammed Jabreel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Najlaa Maaroof</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Sanchez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josep Domingo-Ferrer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Moreno</string-name>
          <email>antonio.morenog@urv.cat</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CYBERCAT-Center for Cybersecurity Research of Catalonia. UNESCO Chair in Data Privacy</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>iTAKA: Intelligent Technologies for Advanced Knowledge Acquisition. Department of Computer Science and Mathematics Universitat Rovira i Virgili</institution>
          ,
          <addr-line>Av. Pasos Catalans 26, E-43007 Tarragona, Catalonia</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>727</fpage>
      <lpage>734</lpage>
      <abstract>
        <p>This paper describes ReCRF, a named-entity recognition system submitted to the Medical Document Anonymization (MEDDOCAN) challenge in the IberLEF 2019 Workshop. We propose a general method based on a data-driven rule generator and Conditional Random Fields (CRFs) to automatically detect protected health information (PHI) in Spanish medical documents. The reported experiments show that our system achieves a micro-F1 of 96.33% on the test dataset for the first sub-task and a micro-F1 of 96.86% and 97.50% on strict and merged metrics, respectively, on the test dataset for the second sub-task.</p>
      </abstract>
      <kwd-group>
        <kwd>Anonymization</kwd>
        <kwd>CRF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Medical documents containing detailed patient data are of utmost importance
for research. When healthcare data are associated with individuals, they are
considered protected health information (PHI). The new European General Data
Protection Regulation (GDPR) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] states that explicit consent from the affected
individuals is needed to use personally identifiable information (PII), and PHI in
particular, for secondary purposes. That is, the data collector should strive
to gather such consent. To avoid the need for consent, data used for secondary
purposes should no longer be personally identifiable. Document anonymization
provides a way to turn PII into information that can no longer be linked to a specific
identified individual, so that it is no longer subject to privacy regulations.
      </p>
      <p>
        In 2006 and 2014, i2b2 organized two shared tasks on document
anonymization [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The i2b2 effort had a significant impact on the medical natural language
processing (NLP) community, but that effort was focused on English documents
      </p>
    </sec>
    <sec id="sec-2">
      <title>F. Hassan et al.</title>
      <p>
        only. IberLEF 2019 organizes the first community challenge task specifically
devoted to the anonymization of Spanish medical documents, called the
MEDDOCAN task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The purpose of the MEDDOCAN task is to detect and remove
PHI from Spanish plain text medical records. The task is structured into two
sub-tasks: "NER offset and entity type classification" and "sensitive token
detection". The first sub-task aims at detecting entity types and locations in the
text and the second sub-task aims at detecting just entity locations.
      </p>
      <p>The remainder of the paper is organized as follows. In Section 2 we briefly
describe the data to be anonymized. Section 3 describes the methodology we
propose. Results and discussion are presented in Section 4. Section 5 presents
the conclusions and outlines some lines of future work.</p>
      <sec id="sec-2-1">
        <title>Data Description</title>
        <p>The MEDDOCAN challenge task aims at identifying and extracting several types
of PHI from plain text medical documents. The PHI categories are
grouped into eight main categories with 22 sub-categories. The corpus released
for the tasks consists of 1000 documents, divided into 500 training documents, 250
development documents and 250 test documents. The distributions of PHI categories and
sub-categories in the training, development and test data are shown in Table 1.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>ReCRF: Spanish Medical Document Anonymization</title>
      <sec id="sec-3-1">
        <title>Methodology</title>
        <p>We developed an automatic system to detect PHI categories in Spanish
medical documents. The next subsections describe the steps followed to train and
use the system.</p>
        <sec id="sec-3-1-1">
          <title>Text Tokenization</title>
          <p>
            In this step, we tokenize the text at two levels: sentence-level and word-level.
First, a sentence tokenizer takes a single document as input and produces a list
of sentences. Afterwards, we split each sentence into a list of tokens. The
sentence tokenizer splits on the newline delimiter, whereas a manually-crafted
regular-expression-based tokenizer and a spaCy pre-trained model for Spanish [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]
are used sequentially to perform the word-level tokenization.
          </p>
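          <p>The two-level tokenization can be sketched as follows. This is a minimal illustration, not the tokenizer used in the system: the pattern TOKEN_RE and the function names are assumptions made for this example, and the spaCy pass that the system applies after the regular-expression tokenizer is omitted so that the sketch runs without external models.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical word-level pattern (not the authors' expression):
# digit groups joined by '/', '.' or '-' stay together (dates, phone numbers),
# then runs of word characters, then any single non-space symbol.
TOKEN_RE = re.compile(r"\d+(?:[/.\-]\d+)*|\w+|[^\w\s]")


def split_sentences(document):
    """Sentence level: every non-empty line is treated as one sentence."""
    return [line.strip() for line in document.split("\n") if line.strip()]


def tokenize(sentence):
    """Word level: (token, start, end) triples, offsets relative to the sentence."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(sentence)]


def tokenize_document(document):
    """Apply both levels: a list of sentences, each a list of token triples."""
    return [tokenize(sentence) for sentence in split_sentences(document)]
```
]]></preformat>
          <p>Keeping character offsets next to each token makes it possible to map predictions back to entity locations, which both sub-tasks require. Note that the offsets in this sketch are relative to the stripped sentence, so a real pipeline would also need to track where each sentence starts in the document.</p>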
        </sec>
        <sec id="sec-3-1-2">
          <title>Rules Generation</title>
          <p>In this step, we developed a data-driven regular expression generator so that
we avoid implementing hand-crafted regular expression rules. This generator
analyses all the occurrences of the PHI categories in the training data set and,
from them, generates rules to detect those categories. These rules are later used
to extract pseudo-labelled tokens that guide the CRF tagger in making
the final decision.</p>
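          <p>The paper does not detail the generation algorithm, so the following is only one plausible sketch of a data-driven rule generator, under stated assumptions: each training occurrence is a value that follows a field label and a colon, and a value is generalized by replacing runs of digits, letters and spaces with character classes. The function names and the named group phi are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import re
from itertools import groupby


def char_class(ch):
    """Coarse class of a character; punctuation is its own class."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    if ch.isspace():
        return "space"
    return ch


def generalize(value):
    """Turn one concrete PHI string into a regular-expression fragment."""
    parts = []
    for cls, run in groupby(value, key=char_class):
        n = len(list(run))
        if cls == "digit":
            parts.append(r"\d{%d}" % n)       # keep the exact digit count
        elif cls == "alpha":
            parts.append(r"[^\W\d_]+")        # any run of letters, accents included
        elif cls == "space":
            parts.append(r"\s+")
        else:
            parts.append(re.escape(cls) * n)  # punctuation is matched literally
    return "".join(parts)


def build_rule(field_label, values):
    """Compile one rule: field label, colon, then any of the observed value shapes."""
    shapes = sorted({generalize(v) for v in values})
    # The lookahead rejects partial matches, so a longer shape can still be tried.
    pattern = r"%s\s*:\s*(?P<phi>%s)(?!\w)" % (re.escape(field_label), "|".join(shapes))
    return re.compile(pattern)
```
]]></preformat>
          <p>For instance, under these assumptions the values "12/03/1970" and "01/11/1985" collapse into a single shape of two digits, two digits and four digits separated by slashes, so the rule also matches dates that never appeared in the training data.</p>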
        </sec>
        <sec id="sec-3-1-3">
          <title>Feature Extraction</title>
          <p>
            We extract a wide variety of linguistic features, similarly to previous studies [
            <xref ref-type="bibr" rid="ref8">8, 9</xref>
            ]. These features characterize the semantics of PHI terms. The main types of
features are:
            <list list-type="bullet">
              <list-item>
                <p>Lexical features: they include the target word itself, its prefix and suffix,
word lemma, and Part-of-Speech (POS) tag.</p>
              </list-item>
              <list-item>
                <p>Orthographic features: they detail word form information, e.g. target
word length, word shape (CAPITALIZED, ALL UPPER, ALL LOWER,
MIX), ends with s, contains alpha and contains number.</p>
              </list-item>
              <list-item>
                <p>RegEx features: a RegEx model is used as a first-pass recognizer for the PHI
entities in the text. We use the output of the RegEx model to detect the
location of the token, either at the beginning, middle or end of a PHI entity, or
outside of any PHI entity.</p>
              </list-item>
              <list-item>
                <p>External resource features: we also consider whether a token appears in
one or several external resources, which include lists of English and Spanish
names of countries and cities, names and abbreviations of time expressions
(e.g. 'año', 'mes'), or names and abbreviations of places (e.g. 'plaza', 'av.').
Additional resources include lists of Spanish last names, Spanish first names,
addresses, hospitals, cities and towns, professions, autonomous
communities, and provinces.</p>
              </list-item>
            </list>
          </p>
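          <p>The lexical, orthographic and external-resource features listed above can be illustrated with a small feature function. This is a sketch under assumptions: the feature names, the three-character prefix and suffix length, and the gazetteers argument (a mapping from a resource name to a set of lowercase entries) are choices made for this example and are not specified in the paper. The lemma and POS tag are passed in by the caller, since in the described system they would come from the spaCy model.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def word_shape(token):
    """Map a token to one of the four shapes named in the feature list."""
    if token.isupper():
        return "ALL_UPPER"
    if token.islower():
        return "ALL_LOWER"
    if token[:1].isupper() and token[1:].islower():
        return "CAPITALIZED"
    return "MIX"


def token_features(token, lemma=None, pos=None, gazetteers=None):
    """Build the feature dictionary of a single token (no context yet)."""
    feats = {
        # Lexical features
        "word": token,
        "prefix3": token[:3],
        "suffix3": token[-3:],
        # Orthographic features
        "length": len(token),
        "shape": word_shape(token),
        "ends_with_s": token.lower().endswith("s"),
        "contains_alpha": any(c.isalpha() for c in token),
        "contains_number": any(c.isdigit() for c in token),
    }
    if lemma is not None:
        feats["lemma"] = lemma
    if pos is not None:
        feats["pos"] = pos
    # External-resource features: one boolean per gazetteer
    for name, entries in (gazetteers or {}).items():
        feats["in_" + name] = token.lower() in entries
    return feats
```
]]></preformat>
          <p>A call such as token_features("Pedro", pos="PROPN", gazetteers={"first_names": {"pedro"}}) yields a flat dictionary, which is the per-token representation that CRF toolkits such as sklearn-crfsuite accept.</p>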
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <p>Extracting these features from the target word alone ignores the
context in which the word appears, which may lead to misclassified tokens due
to language ambiguity. To tackle this, we consider a window of 5 words centered
at the target word (i.e., the target word plus the two words on its left and the two words on its
right).</p>
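      <p>The 5-word window can be sketched as a post-processing step over the per-token feature dictionaries of one sentence. The key format (an offset such as -2 or +1 followed by the feature name) and the PAD marker for positions beyond the sentence boundary are assumptions made for this example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def add_context(sentence_feats, left=2, right=2):
    """Copy the features of neighbouring tokens into each token's dictionary.

    sentence_feats is a list of feature dicts, one per token of a sentence.
    The input dicts are not modified; new dicts are returned.
    """
    n = len(sentence_feats)
    windowed = []
    for i in range(n):
        feats = dict(sentence_feats[i])
        for offset in range(-left, right + 1):
            if offset == 0:
                continue  # the target token's own features are already present
            j = i + offset
            if 0 <= j < n:
                for key, value in sentence_feats[j].items():
                    feats["%+d:%s" % (offset, key)] = value
            else:
                # Position falls outside the sentence
                feats["%+d:PAD" % offset] = True
        windowed.append(feats)
    return windowed
```
]]></preformat>
      <p>With the default arguments, each token sees two neighbours on each side, which matches the window of 5 words described above.</p>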
      <sec id="sec-4-1">
        <title>Training the system</title>
        <p>
          We used both a set of automatically-crafted rules (RegEx model) and
Conditional Random Fields [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] (CRF model) to identify PHI in medical documents.
The system is implemented in Python 3.7 with the sklearn-crfsuite package [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
and the spaCy package [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] for tokenization. We also use the BIO tagging scheme
to set the labels of the tokens [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Each word token in the document is labeled
with one of three possible tags: B, I, or O, which indicate whether the word is at the
beginning of, inside, or outside a PHI entity.
        </p>
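        <p>As an illustration of the BIO scheme, the sketch below converts character-level entity annotations into one tag per token. It assumes that tokens carry character offsets and that entity boundaries coincide with token boundaries; appending the entity type to the B and I tags (e.g. B-NOMBRE) is a common convention assumed here, and NOMBRE is a hypothetical label used only for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def bio_encode(tokens, entities):
    """Assign a BIO tag to every token.

    tokens:   list of (text, start, end) triples
    entities: list of (start, end, label) character spans, non-overlapping
    """
    tags = []
    previous = None  # index of the entity that contained the previous token
    for _, tok_start, tok_end in tokens:
        current = None
        for idx, (ent_start, ent_end, _) in enumerate(entities):
            if tok_start >= ent_start and tok_end <= ent_end:
                current = idx
                break
        if current is None:
            tags.append("O")
        else:
            # First token of an entity gets B, the following ones get I
            prefix = "I-" if current == previous else "B-"
            tags.append(prefix + entities[current][2])
        previous = current
    return tags
```
]]></preformat>
        <p>The resulting tag sequences, paired with the feature dictionaries of the same tokens, form the training examples of a sequence labeller.</p>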
        <p>[Figure 2: training workflow. Diagram labels recovered from the original figure: Annotated Documents, Rules Generation, RegEx Rules, Tags, Text Tokenization, Feature Extraction, External Resources, Training the System, CRF Model.]</p>
        <p>Fig 2 shows that our system has two outputs: the RegEx model and the CRF
model. The RegEx model is built by the automatic rule extractor by analyzing
the PHI categories that appear in well-structured contexts (e.g. Nombre: Xxxxx.,
Fecha de nacimiento: dd/mm/yyyy.).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <p>The CRF model is trained on all the features extracted from the
tokens plus the decision of the RegEx model, which adds extra information and
makes the decision easier for the CRF model.</p>
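      <p>One way to pass the RegEx decision to the CRF is to encode it as two extra features per token: the position of the token within the matched entity and the category of the rule that matched. The sketch below assumes that each rule is a compiled pattern with a named group phi covering the entity, that tokens carry character offsets, and that a single-token match is marked as B; the feature names regex_pos and regex_label are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re


def regex_positions(sentence, tokens, rules):
    """Return one (position, label) pair per token.

    position is "B", "M" or "E" for the beginning, middle or end of a
    RegEx match, and "O" when no rule covers the token.
    rules is a list of (label, compiled_pattern) pairs.
    """
    spans = []
    for label, rule in rules:
        for match in rule.finditer(sentence):
            spans.append((match.start("phi"), match.end("phi"), label))

    positions = []
    for _, tok_start, tok_end in tokens:
        hit = ("O", "NONE")
        for span_start, span_end, label in spans:
            if tok_start >= span_start and tok_end <= span_end:
                if tok_start == span_start:
                    hit = ("B", label)
                elif tok_end == span_end:
                    hit = ("E", label)
                else:
                    hit = ("M", label)
                break
        positions.append(hit)
    return positions


def add_regex_features(features, positions):
    """Merge the RegEx pseudo-labels into copies of the token feature dicts."""
    merged = []
    for feats, (pos, label) in zip(features, positions):
        feats = dict(feats)
        feats["regex_pos"] = pos
        feats["regex_label"] = label
        merged.append(feats)
    return merged
```
]]></preformat>
      <p>Because the pseudo-labels are ordinary features, the CRF remains free to overrule a RegEx match when the other features disagree with it.</p>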
      <p>[Figure 3: annotation workflow. Diagram labels recovered from the original figure: Unannotated Documents, RegEx Rules, External Resources, Text Tokenization, Feature Extraction, CRF Model, Annotated Documents.]</p>
      <p>Fig 3 shows how both the RegEx and CRF models are used to make the annotations.
Even though the RegEx model is accurate enough to detect well-structured
entities, it is not robust to small changes in the text format. So, we
decided to use the RegEx model to perform a preliminary annotation, which is
then passed to the CRF model that makes the final decision.</p>
      <sec id="sec-5-1">
        <title>Results and Discussions</title>
        <p>The performance of the detection of PHI categories has been evaluated using
Precision, Recall and F1 scores at the entity level. The results of our system on
the test set for the different PHI categories are shown in Table 2; the confusion
matrix is shown in Table 3. Notice that categories that have a low frequency in
the training dataset obtain lower F1 scores (e.g. NUMERO_FAX appears only 15
times in the training set and CENTRO_SALUD appears six times). This result
is expected because the model did not see enough examples to learn how
to detect them accurately.</p>
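        <p>For reference, entity-level micro-averaged scores can be computed as in the sketch below. It assumes that every entity is represented as a (document id, start offset, end offset, label) tuple and that a prediction counts as correct only when all four fields match a gold entity; this mirrors a strict matching criterion and is not the official evaluation script.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over sets of entity tuples.

    gold and predicted are sets of (doc_id, start, end, label) tuples
    pooled over all documents and all categories.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```
]]></preformat>
        <p>Pooling all entities before computing the scores means that frequent categories dominate the micro-average, so low scores on rare categories have a limited effect on the overall figure.</p>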
        <p>The overall results of our system for the two sub-tracks of the competition
on the development and test datasets are shown in Table 4. We see very small
differences in F1 scores between the development and test datasets. This suggests
that our system generalizes well to new data.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Conclusion and Future Work</title>
        <p>We presented a hybrid system that automatically detects PHI entities in plain
text medical documents. The system consists of an automatically constructed
RegEx model and a trained CRF model. The design of the system, which
includes a variety of linguistic and semantic features to increase the accuracy,
helps it generalize well to new data.</p>
        <p>Finally, because the rules produced by the automatic RegEx generator
are not fully general, we plan, in future work, to implement an automatic
optimizer to improve them.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <sec id="sec-6-1">
        <title>Acknowledgments</title>
        <p>This work was partly supported by the European Commission (project
H2020-700540 "CANVAS"), the Government of Catalonia (ICREA Academia Prize to J.
Domingo-Ferrer and grant 2017 SGR 705) and the Spanish Government (projects
RTI2018-095094-B-C2 "CONSENT" and TIN2016-80250-R "Sec-MCloud"). While
the authors are with the UNESCO Chair in Data Privacy, the opinions expressed
in this paper are the authors' own and do not necessarily reflect the views of
UNESCO.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Honnibal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montani</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <article-title>: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing</article-title>
          . To appear (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. I2B2: i2b2:
          <article-title>Informatics for Integrating Biology and the Bedside</article-title>
          . https://www.i2b2.org/, last accessed: 25-June-2019
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Korobov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>: sklearn-crfsuite</article-title>
          . https://sklearn-crfsuite.readthedocs.io/en/latest/, last accessed: 31-May-2019
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.C.</given-names>
          </string-name>
          :
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML'01</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Marimon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Intxaurrondo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez Martin</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>: Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results</article-title>
          .
          <source>In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2019</year>
          ). vol.
          <source>TBA</source>
          , p.
          <source>TBA. CEUR Workshop Proceedings (CEUR-WS.org)</source>
          , Bilbao,
          <source>Spain (Sep</source>
          <year>2019</year>
          ), TBA
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Regulation</surname>
            ,
            <given-names>G.D.P.:</given-names>
          </string-name>
          <article-title>Regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data</article-title>
          ,
          <source>and repealing directive 95/46. Official Journal of the European Union (OJ) 59(1-88)</source>
          ,
          <volume>294</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sang</surname>
            ,
            <given-names>E.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veenstra</surname>
          </string-name>
          , J.:
          <article-title>Representing text chunks</article-title>
          .
          <source>In: Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics</source>
          . pp.
          <volume>173</volume>
          -
          <fpage>179</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Stubbs</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotfila</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>O</given-names>
          </string-name>
          .
          <article-title>: Automated systems for the de-identi cation of longitudinal clinical narratives: Overview of 2014 i2b2/uthealth shared task track 1</article-title>
          .
          <source>Journal of biomedical informatics 58, S11-S19</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garibaldi</surname>
            ,
            <given-names>J.M.:</given-names>
          </string-name>
          <article-title>Automatic detection of protected health information from clinic narratives</article-title>
          .
          <source>Journal of biomedical informatics 58, S30-S38</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>