<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Anonymization of Sensitive Information in Medical Health Records</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bhavna Saluja*</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaurav Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>João Sedoc</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Callison-Burch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Pennsylvania</institution>
          ,
          <addr-line>Philadelphia PA 19104</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>647</fpage>
      <lpage>653</lpage>
      <abstract>
        <p>Due to privacy constraints, clinical records with protected health information (PHI) cannot be directly shared. De-identification, i.e., the exhaustive removal or replacement of all mentioned PHI phrases, has to be performed before clinical records can be made available outside of hospitals. In this paper, we identify PHI in medical records written in Spanish, applying two approaches to their anonymization. In the first approach, we gathered various token-level features and built a LinearSVC model, which achieved an F1 score of 0.861 on test data. In the second approach, we built a neural network based on an LSTM-CRF model, which achieved a higher F1 score of 0.935, an improvement over the first approach.</p>
      </abstract>
      <kwd-group>
        <kwd>PHI</kwd>
        <kwd>Neural Networks</kwd>
        <kwd>LSTM-CRF</kwd>
        <kwd>Anonymization</kwd>
        <kwd>Computational Linguistics</kwd>
        <kwd>Privacy</kwd>
        <kwd>De-identification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Clinical records with protected health information (PHI) cannot be directly
shared as is, due to privacy constraints, making it particularly cumbersome to
carry out NLP research in the medical domain. A necessary precondition for
accessing clinical records outside of hospitals is their de-identification, i.e., the
exhaustive removal, or replacement, of all mentioned PHI phrases.</p>
      <p>PHI stands for Protected Health Information and is any information in a
medical record that can be used to identify an individual and that was created, used,
or disclosed in the course of providing a health care service, such as a diagnosis or
treatment. In other words, PHI is personally identifiable information in medical
records, including conversations between doctors and nurses about treatment.
PHI also includes billing information and any patient-identifiable information in
a health insurance company's computer system. Examples of PHI include names,
surnames, addresses, hospitals, professions, different types of locations
(provinces, cities, towns, ...), billing information, email addresses, and
phone records.</p>
      <p>* Equal contribution. Listing is in random order.</p>
    </sec>
    <sec id="sec-2">
      <title>Literature Review</title>
      <p>
        In paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the authors have presented a deep learning architecture that uses
bi-directional long short-term memory networks (Bi-LSTMs) with variational
dropout and deep contextualized word embeddings, along with components
such as traditional word embeddings (GloVe), character-level LSTM embeddings, and
conditional random fields. The paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] aims to develop techniques and
methods for semi-automated anonymization of medical record information. The paper
proposes methods such as the utilization of database structure, dictionaries, heuristics,
and natural language processing for anonymizing patient records in general.
The major challenges posited are the differences in identity markers (e.g.,
Dr. and Mrs.) and hyphenation patterns in Norwegian, unstructured text, and the lack of
strictly enforced guidelines for how the data should be encoded. In paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
the authors have developed a de-identification model that can successfully
remove personal health information (PHI) from discharge records to make them
conform to the guidelines of HIPAA. The authors used a feature set of five
different categories: word-level features, frequency information, offline
dictionaries, contextual information, and phrasal information. They trained
three different classifiers that used three different contextual features and a
voting-based mechanism to decide whether a word belonged to a named entity. If any two
classifiers predicted the same label, the word was assigned that tag; otherwise
it was not considered a named entity. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors proposed a three-step approach to
extract personal information from medical records. First, they split the document
into terms and extract local and external features. Then they build multiple
independent classifiers from the extracted features. Finally, they combine
the results of the independent classifiers to obtain the final tags of the words. In paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
the authors have experimented with SVMs to recognize named entities. They
built a feature set using various token-level features such as orthographic
features, length, POS tag, and token kind, as well as features like date, ID,
and phone number. They used the ANNIE Web API to identify hospitals, people,
locations, etc.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Data Preparation</title>
      <p>
        We created dictionary representations of the train, dev and test datasets from the
clinical records given in text files [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We represented each word as a tuple:
(word, NER tag, start index, Spanish POS tag)
For each clinical record, we store a list of sentences. Each sentence is in turn a
list containing the tuple representations of its words. The records are then stored as
(key, value) pairs in a dictionary, where the key is the docId and the value is the list of
sentences as lists of word tuples, and dumped as pickle files. These pickles are
used as input to all our models. We also created vocabularies for the words,
tags and characters present in our dataset and prepared a numpy array that
contains the embeddings for tokens using fastText Spanish embeddings [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
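<p>The per-word tuple representation and the pickle dump described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the helper name, the example docId, and the example NER/POS values are all hypothetical, and the tokenization and POS tagging of the Spanish records are not shown.</p>

```python
import pickle

# Each word is stored as (word, NER tag, start index, Spanish POS tag);
# a record maps its docId to a list of sentences, and each sentence is a
# list of such word tuples.
def build_record(doc_id, sentences):
    """sentences: list of lists of (word, ner_tag, start_index, pos_tag)."""
    return {doc_id: sentences}

records = build_record(
    "doc-001",  # hypothetical document id
    [[("Juan", "B-NOMBRE_SUJETO_ASISTENCIA", 0, "PROPN"),
      ("ingresó", "O", 5, "VERB")]],
)

# Dump the dictionary to a pickle file so all models can share the input.
with open("train_records.pkl", "wb") as f:
    pickle.dump(records, f)

# Models later load the same pickle as their input.
with open("train_records.pkl", "rb") as f:
    loaded = pickle.load(f)
```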
      <p>The given dataset was divided into a training set, a development set and a test
set; the distribution is shown in Table 1.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Distribution of documents across dataset splits.</p></caption>
        <table>
          <thead>
            <tr><th>Type</th><th>Documents</th></tr>
          </thead>
          <tbody>
            <tr><td>Training Set</td><td>401</td></tr>
            <tr><td>Development Set</td><td>193</td></tr>
            <tr><td>Test Set</td><td>156</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Evaluation of the model is done by measuring the F-score, a metric widely
used in the natural language processing literature, for example in the evaluation
of named entity recognition and word segmentation. The published papers
mentioned in the literature review section also used the F-score to evaluate the
performance of their models.</p>
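<p>The F-score used throughout this evaluation is the harmonic mean of precision and recall, computed from true positives, false positives and false negatives; a minimal sketch with illustrative counts:</p>

```python
def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision (tp/(tp+fp)) and recall (tp/(tp+fn))."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 86 correctly tagged PHI spans, 14 spurious, 14 missed,
# so precision = recall = 0.86 and F1 = 0.86.
score = f1_score(86, 14, 14)
```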
    </sec>
    <sec id="sec-4">
      <title>Approaches</title>
      <sec id="sec-4-1">
        <title>LinearSVC</title>
        <p>
          We have built a baseline model implementing a linear SVM on our dataset,
focusing on token-level features such as the inclusion of punctuation, the presence of
uppercase or lowercase letters in the token, whether the word is a Roman numeral,
fax-related features, etc.
For each word we look at the window [-1, 0, 1, 2] and create a feature vector
including the features of the words in this window around the target word. We have a total
of 401 training documents, 193 dev documents and 156 test documents, and we
tokenized the files into sentences with the following counts:
Train sentences: 8300 (401 docs)
Dev sentences: 4048 (193 docs)
Test sentences: 3231 (156 docs)
Our model used LinearSVC and the results are presented in Table 2. As shown, we
achieved an F1 score of 86%.
        </p>
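<p>The token-level features and the [-1, 0, 1, 2] context window can be sketched as follows. This is a simplified illustration, not the paper's exact feature set: the feature names are hypothetical, and the resulting dictionaries would in practice be vectorized and fed to scikit-learn's LinearSVC.</p>

```python
def token_features(tok):
    # A few illustrative token-level features.
    return {
        "lower": tok.lower(),
        "is_upper": tok.isupper(),
        "is_title": tok.istitle(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_punct": any(not c.isalnum() for c in tok),
    }

def window_features(sentence, i, window=(-1, 0, 1, 2)):
    """Collect features of the tokens at the given offsets around position i."""
    feats = {}
    for off in window:
        j = i + off
        if 0 <= j < len(sentence):
            for name, val in token_features(sentence[j]).items():
                feats[f"{off}:{name}"] = val
        else:
            # Mark out-of-sentence positions explicitly.
            feats[f"{off}:pad"] = True
    return feats

sent = ["Paciente", ":", "Juan", "Pérez"]
fv = window_features(sent, 2)  # feature vector for "Juan" and its context
```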
      </sec>
      <sec id="sec-4-2">
        <title>LSTM-CRF</title>
        <p><bold>Model Description.</bold> We have built a named entity recognizer, which is
often also considered a sequence tagger. The model architecture involves a Bi-LSTM
and a CRF. Additionally, it makes use of fastText word embeddings for Spanish, and
it also builds word embeddings from character encodings. While training the model,
we used the fastText word embeddings concatenated with model word embeddings
produced by a character-based Bi-LSTM. We then extract a contextual representation
of each word in a given sentence by running a Bi-LSTM over the sentence. In the
end, we use a CRF to decode the output and obtain the category of each word.</p>
        <p><bold>Training Details.</bold> In this section, we list all the hyper-parameters we
used when training our model, including the optimizer, dropout, layer size,
learning rate, learning rate decay, size of word embeddings, and size of character
embeddings, among others. These parameters can be found in Table 3.
During training, we evaluated the model on the development set at each epoch and
analyzed its performance through the F1 score and the generated confusion matrix.
The number of epochs for which the model ran was decided by examining the F1
score of the dev-set evaluation after every epoch; training stopped when there was
no improvement in performance for consecutive epochs.</p>
        <p><bold>Results.</bold> We ran the model with the above-mentioned hyper-parameters and,
after running it for 5 epochs, obtained the confusion plot shown in Fig. 1.</p>
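<p>The CRF decoding step mentioned above is typically done with the Viterbi algorithm, which finds the highest-scoring tag sequence given per-token emission scores (here standing in for Bi-LSTM outputs) and learned tag-transition scores. A minimal sketch with toy scores and a toy O/B/I tag set; the actual model's tags and scores differ:</p>

```python
def viterbi_decode(emissions, transition):
    """emissions: list over tokens of {tag: score};
    transition: {(prev_tag, tag): score}. Returns the best-scoring tag path."""
    tags = list(emissions[0])
    # best[t] = (score of best path ending in tag t, that path)
    best = {t: (emissions[0][t], [t]) for t in tags}
    for em in emissions[1:]:
        new_best = {}
        for t in tags:
            # Pick the previous tag that maximizes path score + transition.
            prev, (score, path) = max(
                ((p, best[p]) for p in tags),
                key=lambda kv: kv[1][0] + transition[(kv[0], t)],
            )
            new_best[t] = (score + transition[(prev, t)] + em[t], path + [t])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]

# Toy example: transitions strongly penalize jumping from O straight to I,
# so the decoded path stays label-consistent (B before I).
emissions = [{"O": 0.1, "B": 0.9, "I": 0.0},
             {"O": 0.4, "B": 0.1, "I": 0.5},
             {"O": 0.8, "B": 0.1, "I": 0.1}]
transition = {("O", "O"): 0.5, ("O", "B"): 0.2, ("O", "I"): -2.0,
              ("B", "O"): 0.1, ("B", "B"): -1.0, ("B", "I"): 0.8,
              ("I", "O"): 0.3, ("I", "B"): -1.0, ("I", "I"): 0.4}
path = viterbi_decode(emissions, transition)
```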
        <p>Table 4 shows the precision, recall, and F1-scores we obtained on running
the model on the training, development and test data sets. Analyzing the errors,
we found two main issues:
1. Skewed distribution for the category `ID SUJETO ASISTENCIA': On analysis,
we found that this category has a skewed distribution in the training set. In the
gold files, the category `ID SUJETO ASISTENCIA' has been assigned to numbers
in the majority of the documents and only sometimes to text. Therefore, our
model learnt to assign this category to numbers only.
2. No heterogeneity in the data for the category `HOSPITAL': On analyzing
the errors related to the category `HOSPITAL', we found that the label
`HOSPITAL' is assigned to those terms in the training data which contain
the term `HOSPITAL'. This means that we do not have a variety of
examples for this category in our training set. Thus, the model learnt to predict
the category `HOSPITAL' only if the token itself contains this term.</p>
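<p>The epoch-selection rule described in the training details (stop when the dev-set F1 no longer improves for consecutive epochs) can be sketched as follows. The function and parameter names are hypothetical, and `evaluate_dev_f1` stands in for the actual per-epoch training-plus-evaluation step:</p>

```python
def train_with_early_stopping(evaluate_dev_f1, max_epochs=50, patience=2):
    """Run epochs until dev F1 fails to improve `patience` times in a row.
    evaluate_dev_f1(epoch) -> F1 on the development set after that epoch."""
    best_f1, best_epoch, bad_epochs = -1.0, 0, 0
    for epoch in range(1, max_epochs + 1):
        f1 = evaluate_dev_f1(epoch)
        if f1 > best_f1:
            best_f1, best_epoch, bad_epochs = f1, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_f1

# Simulated dev F1 per epoch: improvement stalls after epoch 5,
# so training halts and epoch 5's model is kept.
dev_scores = {1: 0.80, 2: 0.88, 3: 0.91, 4: 0.93, 5: 0.935, 6: 0.934, 7: 0.933}
epoch, f1 = train_with_early_stopping(lambda e: dev_scores[e])
```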
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we tried different approaches for the task of de-identification of
PHI in Spanish clinical records. We started by building a LinearSVC model
using various token-level features and static dictionaries of Spanish names and
locations. We then proposed a neural network that uses a Bi-LSTM and a CRF for
named entity recognition. The neural model performed best on the given dataset.</p>
    </sec>
    <sec id="sec-6">
      <title>Future Work</title>
      <p>As a next step, we plan to build a system that combines a rule-based
model with our best-performing neural model. Since the dataset consists of some
structured and some unstructured text in the medical documents, for the
structured text we will use the predictions made by a simple rule-based system,
and for the unstructured text we will use the neural model's predictions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Amund</given-names>
            <surname>Tveit</surname>
          </string-name>
          , Ole Edsberg, Thomas Brox Røst, Arild Faxvaag, Øystein Nytrø, Torbjørn Nordgård,
          <source>Martin Thorsen Ranang and Anders Grimsmo: Anonymization of General Practitioner Medical Records. HelsIT</source>
          , 5 pages (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>György</given-names>
            <surname>Szarvas</surname>
          </string-name>
          , Richárd Farkas,
          <article-title>Róbert Busa-Fekete: State-of-the-art Anonymization of Medical Records Using an Iterative Machine Learning Framework</article-title>
          .
          <source>Journal of the American Medical Informatics Association Volume 14 Number 5</source>
          , 7 pages (Sept / Oct 2007)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Kaung</given-names>
            <surname>Khin</surname>
          </string-name>
          , Philipp Burckhardt, Rema Padman:
          <article-title>A Deep Learning Architecture for De-identification of Patient Notes: Implementation and Evaluation</article-title>
          . arXiv:1810.01570 [cs.CL], 15 pages (3 Oct 2018)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Xiao-Bai Li</surname>
          </string-name>
          and
          <article-title>Jialun Qin: Anonymizing and Sharing Medical Text Records</article-title>
          .
          <source>Inf Syst Res. Author manuscript</source>
          , 47 pages (
          <issue>19 March 2018</issue>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Yikun</given-names>
            <surname>Guo</surname>
          </string-name>
          , Robert Gaizauskas, Ian Roberts, George Demetriou, Mark Hepple:
          <article-title>Identifying Personal Health Information Using Support Vector Machines</article-title>
          , 5 pages (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Montserrat</given-names>
            <surname>Marimon</surname>
          </string-name>
          , Aitor Gonzalez-Agirre, Ander Intxaurrondo, Heidy Rodríguez, Jose A Lopez Martin,
          <string-name>
            <surname>Marta Villegas</surname>
          </string-name>
          , Martin Krallinger:
          <article-title>Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results</article-title>
          .
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Edouard</given-names>
            <surname>Grave</surname>
          </string-name>
          , Piotr Bojanowski, Prakhar Gupta, Armand Joulin,
          <source>Tomas Mikolov: Learning Word Vectors for 157 Languages. Proceedings of the International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>