<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hadoken: a BERT-CRF Model for Medical Document Anonymization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jihang Mao</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wanli Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>TAJ Technologies, Inc.</institution>
          ,
          <addr-line>7910 Woodmont Ave, Bethesda, MD 20814</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University Blvd E</institution>
          ,
          <addr-line>Silver Spring, MD 20901</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>720</fpage>
      <lpage>726</lpage>
      <abstract>
        <p>In this paper we report our participation in the 2019 MEDDOCAN task, in which participants were provided with a synthetic corpus of clinical cases and required to de-identify the protected health information (PHI) in the corpus. Our system Hadoken utilizes BERT, a multi-layer bidirectional transformer encoder that learns deep bi-directional representations, and fine-tunes the pre-trained BERT model on the MEDDOCAN training data. It then feeds the representation of a clinical case into a CRF output layer for token-level classification. To make the results more accurate, we apply several post-processing techniques based on a thorough error analysis. By utilizing linguistic resources, we achieved higher recall. Our best F-scores on the test set are 0.9375 for Task1, and 0.9428 [strict] and 0.9480 [merged] for Task2. We find that Hadoken is a robust and competitive system applicable to multilingual NER tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>BERT</kwd>
        <kwd>Conditional Random Field</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Multilingual Model</kwd>
        <kwd>Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recently, there has been rapid growth in the use of medical documents across a wide
variety of health domains. With the application of AI techniques, clinical records can be
used to aid clinical decision making, expand knowledge about diseases,
improve patients’ healthcare experiences, etc. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, there are growing concerns
about patient privacy due to the widespread practice of health information sharing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Clinical records containing protected health information (PHI) cannot be shared “as
is” due to privacy constraints. Despite the abundance of research on
privacy preservation for structured data [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3-5</xref>
        ], it remains particularly cumbersome to carry out NLP research
in the medical domain, where the data are largely unstructured text.
      </p>
      <p>
        A necessary precondition for accessing clinical records outside of hospitals is their
de-identification, i.e., the exhaustive removal, or replacement, of all mentioned PHI
phrases. Studies have shown that such a simple de-identification strategy lacks the
flexibility to adequately meet the diverse needs of data users [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. The practical relevance
of anonymization or de-identification of clinical texts motivated the proposal of two
shared tasks, the 2006 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and 2014 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] de-identification tracks, organized under the
umbrella of the i2b2 (i2b2.org) community evaluation effort. The i2b2 project has
collected multiple sets of clinical documents from different healthcare organizations and
made them available for research, but it focused on documents in English
covering characteristics of US healthcare data providers [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        MEDDOCAN (Medical Document Anonymization) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is the first community
challenge task specifically devoted to the anonymization of medical documents in Spanish.
The goal of MEDDOCAN is to set new state-of-the-art results for automatically
recognizing sensitive entity mentions. The MEDDOCAN challenge task comprises two
subtasks: NER (Named Entity Recognition) offset and entity type classification (Sub-Track
1) and sensitive token detection (Sub-Track 2). We participated in both sub-tracks this
year. Participating teams were provided with a synthetic corpus of clinical cases
enriched with PHI expressions, the MEDDOCAN corpus, which was selected manually
by a practicing physician and augmented with PHI information from discharge
summaries and medical genetics clinical records.
      </p>
      <p>A brief description of our method for the MEDDOCAN task is presented in Section 2. In
Section 3 we show the results of our method on the official MEDDOCAN test datasets.
In Section 4 we present a discussion of the results and the conclusions of our participation
in this challenge.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        For the MEDDOCAN task, our system Hadoken builds on BERT-NER
(github.com/kyzhouhzau/BERT-NER), which is built on BERT, a model that has recently
obtained state-of-the-art performance on most NLP tasks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. BERT utilizes a
multi-layer bidirectional transformer encoder that can learn deep bi-directional
representations and can later be fine-tuned for a variety of tasks such as NER. Before BERT, deep
learning models such as Long Short-Term Memory (LSTM) and Conditional Random
Field (CRF) networks had greatly improved NER performance over the preceding years
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. ELMo [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] demonstrated the effectiveness of contextualized representations
from deeper structures for transfer learning.
      </p>
      <p>
        Given a clinical case, our method first obtains its token representations from the
pre-trained BERT model using a case-preserving WordPiece model, including the maximal
document context provided by the data. Next, we formulate the task as sequence tagging by
feeding the representations into a CRF [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] output layer, which is a token-level classifier
over the NER label set. Finally, we apply several post-processing techniques to generate
the final result.</p>
      <p>The pre-trained BERT models are trained on a large corpus (Wikipedia + BookCorpus),
and several pre-trained models have been released. For MEDDOCAN, we chose the BERT-Base
multilingual cased model for the following reasons. First, the multilingual model is better
suited to the Spanish documents in MEDDOCAN, because the English-only model splits
tokens that are not in its vocabulary into sub-tokens, which hurts the accuracy of the NER
task. Second, although BERT-Large generally outperforms BERT-Base on English
NLP tasks, BERT-Large versions of the multilingual models have not been released. Third,
the cased model is preferable because case information is important for NER.
For the MEDDOCAN task, we represent the input passage as a single packed sequence of
BERT embeddings and use a CRF layer as the tag decoder (Figure 1).
      </p>
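      <p>The CRF decoding step can be illustrated with a small sketch. The snippet below is not the authors’ implementation; it is a minimal NumPy Viterbi decoder over per-token emission scores (such as a BERT encoder head would produce) and tag-to-tag transition scores, assuming a toy tag set of two labels.</p>
      <preformat>
```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions:   (seq_len, num_tags) per-token tag scores
    transitions: (num_tags, num_tags) tag-to-tag scores
    """
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()        # best score ending in each tag so far
    backpointers = []
    for t in range(1, seq_len):
        # total[i, j] = score of ending in tag i and then moving to tag j
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    # Trace the best path backwards from the best final tag.
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return path[::-1]

# Toy example with tags 0 = "O" and 1 = "B-TERRITORIO" over three tokens.
emissions = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]])
transitions = np.array([[0.5, -0.5], [-0.5, 0.5]])
print(viterbi_decode(emissions, transitions))  # [0, 1, 1]
```
      </preformat>
      <p>In the full system, the emission scores come from the fine-tuned BERT encoder and the transition scores are learned jointly with the rest of the model.</p>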
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
      <p>Figure 1. Architecture of Hadoken: the input text ([CLS], Tok1, Tok2, …, TokN) is encoded by BERT into contextual token representations, which a CRF layer decodes into NER tags (e.g., O, B-TERRITORIO).</p>
      <sec id="sec-3-2">
        <title>Post-processing and Results</title>
        <p>Next, we conducted an error analysis on the development set to find out which types of
PHI entity were not predicted well by the above model. Since high precision (0.97)
but low recall (0.75) was observed, we investigated all types of PHI entity that were
not recognized. The PHI entity types with the highest numbers of false negatives
are shown in Table 1. We then applied several post-processing techniques to improve the
results. For “TERRITORIO”, “CALLE” and “HOSPITAL”, we utilize the information
in the Gazetteer of MEDDOCAN-related entities provided by the organizers as a linguistic
resource to identify hospitals and different types of locations (provinces, cities,
towns, …). For “NOMBRE_PERSONAL_SANITARIO”, we perform an exhaustive
search to label all names in the clinical case that match tokens already recognized
as “NOMBRE_PERSONAL_SANITARIO” by the model. For
“CORREO_ELECTRÓNICO”, we use regular expressions to find email addresses in
the clinical case. After post-processing, recall is greatly improved (e.g., all
“CORREO_ELECTRÓNICO” entities in Table 1 can be identified) while precision
remains stable.</p>
        <p>The MEDDOCAN corpus was randomly sampled into three subsets: the training, the
development, and the test set. The training set contains 500 clinical cases, and the
development and test sets 250 clinical cases each.</p>
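        <p>To make the post-processing concrete, the sketch below shows one possible implementation of the three techniques. The code and helper names are illustrative, not the authors’ actual implementation; the label strings follow the MEDDOCAN tag set, with the accent dropped from “CORREO_ELECTRÓNICO” for ASCII safety.</p>
        <preformat>
```python
import re

# Regular expression used to recover email addresses.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def postprocess(text, spans, gazetteer, known_names):
    """Augment model predictions with rule-based PHI spans.

    spans:       list of (start, end, label) predicted by the BERT-CRF model
    gazetteer:   phrase-to-label map (territories, streets, hospitals)
    known_names: names the model already tagged as NOMBRE_PERSONAL_SANITARIO
    """
    out = list(spans)
    # 1. Regular expressions recover email addresses.
    for m in EMAIL_RE.finditer(text):
        out.append((m.start(), m.end(), "CORREO_ELECTRONICO"))
    # 2. Gazetteer lookup for TERRITORIO / CALLE / HOSPITAL entries.
    for phrase, label in gazetteer.items():
        for m in re.finditer(re.escape(phrase), text):
            out.append((m.start(), m.end(), label))
    # 3. Exhaustive search: label every further occurrence of a known name.
    for name in known_names:
        for m in re.finditer(re.escape(name), text):
            out.append((m.start(), m.end(), "NOMBRE_PERSONAL_SANITARIO"))
    return sorted(set(out))

text = "Remitido por Dra. Ana Lopez (alopez@hospital.es) en Madrid."
spans = [(18, 27, "NOMBRE_PERSONAL_SANITARIO")]
result = postprocess(text, spans, {"Madrid": "TERRITORIO"}, ["Ana Lopez"])
```
        </preformat>
        <p>Because these rules only add spans, precision stays roughly stable while recall improves, matching the behaviour observed on the development set.</p>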
        <p>
          In both MEDDOCAN sub-tracks, the official evaluation and the ranking of the
submitted systems were based exclusively on the F-score (F1) measure (labeled as
“SubTrack 1 [NER]” and “SubTrack 2 [strict]” in the evaluation script). Here we present the
results on the test set. The background set is composed of 3,751 clinical cases, within which
the test set with gold-standard annotations comprises 250 clinical cases.</p>
        <p>We submitted five prediction results from our system. For “Submission1”, “Submission2”, and
“Submission3”, the BERT-CRF model was fine-tuned using the hyperparameter values
suggested in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]: learning rate=2e-5, number of epochs=3, max sequence length=256,
384, 512, and batch size=16, 14, 10, respectively. For “Submission4” and
“Submission5”, the BERT model was further pre-trained on the background corpus (the
unlabeled 3,751 clinical cases), and the maximum sequence length and batch size were set
to 320 and 14, respectively. However, the new model did not yield any
improvement, probably because the background corpus is too small compared
to the original pre-training corpora (Wikipedia + BookCorpus).
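</p>
        <p>The five submitted configurations can be summarized in a small sketch (hyperparameter values are taken from the text; the dictionary layout is illustrative, not the authors’ code):</p>
        <preformat>
```python
# Shared fine-tuning hyperparameters across all five submissions.
COMMON = {"learning_rate": 2e-5, "num_epochs": 3}

SUBMISSIONS = {
    "Submission1": {**COMMON, "max_seq_length": 256, "batch_size": 16},
    "Submission2": {**COMMON, "max_seq_length": 384, "batch_size": 14},
    "Submission3": {**COMMON, "max_seq_length": 512, "batch_size": 10},
    # Submissions 4-5 additionally pre-train BERT on the 3,751
    # unlabeled background clinical cases before fine-tuning.
    "Submission4": {**COMMON, "max_seq_length": 320, "batch_size": 14,
                    "extra_pretraining": "background corpus"},
    "Submission5": {**COMMON, "max_seq_length": 320, "batch_size": 14,
                    "extra_pretraining": "background corpus"},
}
```
        </preformat>
        <p>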
As shown in Table 2, “Submission1” outperformed all other submissions in precision
and F-measure for Task1, while the maximum sequence length of 512 in
“Submission3” yielded the best leak and recall values for Task1. We
also note that “Submission3” consistently achieved the best F1[strict] and
F1[merged] performance for Task2, suggesting that positional embeddings are very useful for
specific token detection tasks.</p>
        <p>We have described our approach Hadoken, which participated in the MEDDOCAN Medical
Document Anonymization task at IberLEF 2019. Compared to previous methods,
Hadoken differs significantly in both system architecture and processing
flow. It is a general and robust framework and showed competitive performance in
the MEDDOCAN evaluations. The performance of the proposed methodology can be
improved with additional Spanish linguistic resources, such as the
MEDDOCAN gazetteer. However, in our experiments, the accuracy for “HOSPITAL” entities
remained low even after utilizing the hospital information in the gazetteer, which is interesting
and worth further research. As more training corpora become available, we also
plan to explore the application of Hadoken to other NER tasks in future work.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>The authors would like to thank Dr. Yutao Zhang for providing Jihang Mao an internship
opportunity at George Mason University and for valuable suggestions and comments on
the manuscript. The authors would also like to thank the MEDDOCAN task organizers
for providing the task data and tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Jensen</surname>
            ,
            <given-names>P.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jensen</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brunak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Mining electronic health records: Towards better research applications and clinical care</article-title>
          .
          <source>Nature Rev Genetics</source>
          <volume>13</volume>
          (
          <issue>6</issue>
          ),
          <fpage>395</fpage>
          -
          <lpage>405</lpage>
          . (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Safran</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bloomrosen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammond</surname>
            ,
            <given-names>W.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labkoff</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markel-Fox</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>P.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Detmer</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          <article-title>Toward a national framework for the secondary use of health data: An American Medical Informatics Association white paper</article-title>
          .
          <source>J Amer Medical Informatics Assoc</source>
          .
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          <source>Privacy-Preserving Data Mining: Models and Algorithms</source>
          . Springer, New York (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fung</surname>
            ,
            <given-names>B.C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          <article-title>Privacy-preserving data publishing: A survey of recent developments</article-title>
          .
          <source>ACM Comput Surveys</source>
          <volume>42</volume>
          (
          <issue>4</issue>
          ) Article 14. (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Melville</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McQuaid</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Generating shareable statistical databases for business value: Multiple imputation with multimodal perturbation</article-title>
          .
          <source>Inform Systems Res</source>
          .
          <volume>23</volume>
          (
          <issue>2</issue>
          ):
          <fpage>559</fpage>
          -
          <lpage>574</lpage>
          . (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>Ö.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sibanda</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szolovits</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>A De-identifier for Medical Discharge Summaries</article-title>
          .
          <source>International Journal Artificial Intelligence in Medicine</source>
          .
          <volume>42</volume>
          (
          <issue>1</issue>
          ):
          <fpage>13</fpage>
          -
          <lpage>35</lpage>
          . (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Meystre</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedlin</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>South</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samore</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          <article-title>Automatic de-identification of textual documents in the electronic health record: A review of recent research</article-title>
          .
          <source>BMC Medical Res Methodology</source>
          .
          <volume>10</volume>
          : Article 70
          . (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>Ö.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szolovits</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Evaluating the state-of-the-art in automatic de-identification</article-title>
          .
          <source>J Am Med Inform Assoc</source>
          .
          <volume>14</volume>
          (
          <issue>5</issue>
          ):
          <fpage>550</fpage>
          -
          <lpage>63</lpage>
          . (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Stubbs</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kotfila</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>Ö.</given-names>
          </string-name>
          <article-title>Automated systems for the deidentification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1</article-title>
          .
          <source>Journal of biomedical informatics</source>
          <volume>58</volume>
          :
          <fpage>S11</fpage>
          -
          <lpage>S19</lpage>
          . (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Stubbs</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>Ö.</given-names>
          </string-name>
          <article-title>Annotating longitudinal clinical narratives for deidentification: The 2014 i2b2/UTHealth corpus</article-title>
          .
          <source>Journal of biomedical informatics</source>
          <volume>58</volume>
          (
          <year>2015</year>
          ):
          <fpage>S20</fpage>
          -
          <lpage>S29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Marimon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Intxaurrondo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez Martin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results</article-title>
          .
          <source>In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), CEUR Workshop Proceedings (CEUR-WS.org)</source>
          . (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Bidirectional LSTM-CRF models for sequence tagging</article-title>
          .
          <source>arXiv preprint arXiv:1508.01991</source>
          . (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1802.05365</source>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lafferty</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In Proceedings of the International Conference on Machine Learning (ICML)</source>
          , pages
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          . (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>