=Paper= {{Paper |id=Vol-2421/MEDDOCAN_paper_6 |storemode=property |title=Protected Health Information Recognition by BiLSTM-CRF |pdfUrl=https://ceur-ws.org/Vol-2421/MEDDOCAN_paper_6.pdf |volume=Vol-2421 |authors=Cristóbal Colón-Ruiz,Isabel Segura-Bedmar |dblpUrl=https://dblp.org/rec/conf/sepln/Colon-RuizS19a }} ==Protected Health Information Recognition by BiLSTM-CRF== https://ceur-ws.org/Vol-2421/MEDDOCAN_paper_6.pdf
       Protected Health Information Recognition
                  byBiLSTM-CRF

           Cristóbal Colón-Ruiz[0000−0002−9167−809X] and Isabel Segura-
                             Bedmar[0000−0002−7810−2360]

                   Universidad Carlos III de Madrid, Leganés, Spain
                           {ccolon,isegura}@inf.uc3m.es
                              http://hulat.inf.uc3m.es/



        Abstract. Medical records contain relevant information about patients,
        which can be beneficial to improving healthcare and research in the clin-
        ical domain. Due to this, there is a growing interest in developing au-
        tomatic methods to extract and exploit the information from medical
        records. However, medical records also contain protected health infor-
        mation about patients. To protect the confidentiality and privacy of pa-
        tients, this sensitive information should be removed prior to any process-
        ing of these documents. In this paper, we describe an architecture for the
        detection and identification of protected health information from medical
        records. The architecture is composed of two bidirectional Long Short-
        Term Memory layers and a final layer based on Conditional Random
        Fields. Our system participated in the Meddocan shared task, obtaining
        a micro-F1 of 93.22%.

        Keywords: Anonymization· De-identification · Deep Learning · Long
        Short-Term Memory


1     Introduction
In recent years, the number of electronic medical records (EHRs) has been in-
creasing massively. These registries are very useful resources to perform studies
focused on detection/prevention of diseases and medical decision making, among
others. However, health records contain protected health information (PHI). For
instance, information about family history, treatment tracking, and data that
may help to identify a given patient (for example, patient’s name, address,
telephone number, zip code, etc). Due to this protected information, medical
records cannot be shared without a previous de-identification process. This pro-
cess consists of detecting and subsequently replacing or removing all protected
information from the records.
    The interest in addressing de-identification problems motivated the proposal
of two tasks, the 2006 [12] and 2014 [10] de-identification tracks, organized by
Integrating Biology and the Bedside (i2b2). These tracks have significantly influ-
enced the Natural Language Processing (NLP) community in the medical field,
and in particular, for the task of automated text anonymization. Nevertheless,
    Copyright c 2019 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 Septem-
    ber 2019, Bilbao, Spain.
          Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)




        C. Colón-Ruiz, I. Segura-Bedmar

the tasks only focused on records written in English. The Meddocan 2019 has
proposed the first task for the anonymization of medical records in Spanish [6].
    The identification of PHI inside medical documents can be addressed as a
Named Entity Recognition (NER) problem. This problem has been widely stud-
ied and the approaches normally used can be divided into several main cat-
egories: dictionaries and rule-based systems, machine learning, deep learning,
and hybrid systems. Dictionary-based methods are limited by the size of the
dictionaries themselves, in addition to the constant growth of vocabulary and
spelling errors. Rule-based approaches usually provide high precision, however,
they do not usually contemplate all existing cases as a result of the complexity
of the language. Furthermore, rule-based and machine learning methods require
a previous generation of syntactic and semantic features, as well as domain-
specific information. Approaches based on deep learning methods automatically
learn relevant patterns, allowing a certain grade of independence of language
and domain. Moreover, these approaches have been shown to achieve better re-
sults than the best hybrid systems in i2b2 tasks. The system described in [2],
which was based on Long-short Term Memory (LSTM) layers combined with
Conditional Random Field (CRF) layers, scored 97.87% of F1 surpassing the
winning ib2b 2014 system [13] with 96.11%, which was based on a hybrid model
combining conditional random fields with keyword and rule-based approaches.
    Considering the above, in this paper, we propose the use of an adaptation
of the NeuroNer tool [1] for the sub-tasks 1 and 2 of MEDDOCAN-2019 on
Spanish records. This tool uses the combination of two bidirectional Long-short
Term Memory (BiLSTM) layers with a final Conditional Random Fields layer.
The rest of the paper is organized as follows. Section 2 briefly describes the
datasets provided for the MEDDOCAN-2019 task. In Section 3, we describe
the architecture of our system. Section 4 presents the results obtained for our
system. In Section 5, we provide the conclusions.



2   Dataset


The training and development sets provided in the MEDDOCAN-2019 task are
composed of 500 and 250 clinical cases respectively and contain 22 different types
of PHI entities listed in Table 5. In both sets, the representation of the different
types of entity is proportional and unbalanced.


                Table 1. Number of clinical cases and PHI entities.

                 Number of clinical cases Number of PHI entities
Training set    500                        23618
Development set 250                        12180




                                          680
          Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)




                  Protected health information recognition by BiLSTM-CRF

   The clinical cases are initially provided in BRAT format1 , a standoff format
where the different annotations are stored separately from the original text in a
similar way to the BioNLP Shared Task standoff format2 .


3     Methods and system description

3.1   Pre-processing

We pre-process the text of the clinical cases taking into account different steps.
First, sentences are split using the Spacy (Spacy.io), an open-source library that
provides support for texts in several languages, including Spanish. Due to the
nature of the language used in this type of text, we have defined a set of rules
to avoid detecting dots corresponding to acronyms or abbreviations as sentence
separators.
    Afterward, the resulting sentences are tokenized by using Spacy. However,
some tokens meet the following pattern ”field:value”. For instance, the token
”cp:28007”, which refers to a postal code, should be split into the corresponding
field (”cp”) and its value (”28007”). Due to this, we have included a set of rules
to correctly split this kind of tokens.
    Finally, the text and its annotations are transformed into the CoNLL-2003
format3 using the BIOES schema [9]. In this schema, tokens are annotated using
the following tags: The B tag represents a token that is the beginning of an entity,
The I tag indicates that the token belongs to an entity, the O tag represents that
the token does not belong to any entity, the E tag marks a token as the end of
a given entity, and the S tag indicates that an entity is comprised of a single
token.


3.2   Network description

As we can appreciate in section 1, approaches composed of Bidirectional LSTMs
layers in conjunction with CRF, provide good results in NER tasks [5, 2]. Bi-
directional LSTMs are a type of recurrent neural network (RNN) that takes into
account the context of words in the sentence by capturing past (previous words)
and future (next words) information. In addition, to improve the accuracy of pre-
dictions provided by the BiLSTM layer, the CRF layer uses information from
the neighbor tags (at sentence level) in order to predict current tags. Consid-
ering the above, in order to address the de-identification problem described in
the MEDDOCAN-2019 task, we propose to use the NeuroNer tool [1], which is
composed of three main layers:

 1. Token representation with character-enhanced and token embedding layer.
 2. BiLSTM prediction layer
1
  http://brat.nlplab.org/standoff.html
2
  http://2011.bionlp-st.org/home/file-formats
3
  https://www.clips.uantwerpen.be/conll2003/ner/




                                          681
          Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)




        C. Colón-Ruiz, I. Segura-Bedmar

 3. CRF sequence optimization layer

    The first layer aims to generate vector representations of the tokens that
conform the input sequences. The direct representation of token to vector (word
embedding) can be pre-trained or can be learned in conjunction with the rest
of the model by adjusting its weights. Pre-trained models can be obtained from
a large amount of unlabeled data with methods such as word2vec or GloVe [7,
8]. However, the different word embedding models do not contain representation
for those tokens not included in their vocabularies. The first layer addresses this
problem by incorporating a representation of tokens based on their characters
(character embeddings). Each token character is represented by its own vector,
allowing the network to learn morphological information even from tokens that
are not included in the vocabulary of the word embedding model [4].
    The character embedding sequence of each token is passed as input to a
BiLSTM to obtain character-based word embedding as output. Finally, the rep-
resentation of word embeddings and character-based word embedding are con-
catenated for each token, which will be the input for the second BiLSTM layer.
This BiLSTM layer aims to obtain the sequence of probabilities for each token
to pertain to a given label using the BIOES coding. The label for each token
will be the one with the highest probability.
    The last layer consists of a conditional random fields layer. This layer receives
as input the sequence of probabilities of the previous layer in order to improve
predictions. This is due to the ability of the layer to take into account the
dependencies between the different labels. The output of this layer provides the
most probable sequence of labels.
    The parameters of the embedding and hyperparameters of our model used
for the MEDDOCAN-2019 task are listed below:

 – Word Embeddings: randomly initialized and adjusted during training.
   The dimension of the vectors is 100.
 – Character Embeddings dimension: randomly initialized and adjusted
   during training. The dimension of the vectors is 25.
 – First BiLSTM hidden state dimension: 25 for the forward and backward
   layers
 – Second BiLSTM hidden state dimension: 100 for the forward and back-
   ward layers
 – Optimizer: Stochastic gradient descent (SGD), learning rate: 0.01
 – Dropout: 0.5
 – Number of Epochs: 100



4   Results
The MEDDOCAN-2019 evaluation process consists of two different scenarios.
The first scenario consists of detecting exactly the location in the text of each
PHI, as well as the type of entity. The second scenario consists of identifying




                                          682
          Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)




                  Protected health information recognition by BiLSTM-CRF




                    Fig. 1. Overview architecture of our system



sensitive data in order to be replaced, regardless of the type of entity. In the last
scenario, two different types of evaluations are performed: (1) Strict evaluation of
the spans of PHI that belong to sensitive phrases. (2) Merge evaluation where the
spans of PHI connected by non-alphanumerical characters are merged. To eval-
uate both scenarios, the metrics proposed by the organizers are micro-averaged
precision, recall, and F1-score.
   To evaluate the trained models, as well as their hyperparameters, we per-
formed a set of experiments with the development dataset provided by the
MEDDOCAN-2019 organizers. We used grid search to adjust the word embed-
dings dimension, the number of units in the BiLSTM hidden layer, the optimizer
and the learning rate.
   We can observe in Tables 2 and 3 that our best results on the overall devel-
opment set do not present significant differences. The run0 model was trained
using the hyperparameters mentioned in section 3.2. The run1 model was trained
using ADAM [3] as optimizer, with a word embeddings dimension of 300 and
200 units in the second layer of the BiLSTM. The run2 model was trained using
ADAM employing a word embeddings dimension of 200 and 100 units in the
second layer of the BiLSTM.


    Table 2. Results of Subtask 1 on development set . The top scores are bold.

               Subtask 1
      Precision Recall     F1
run 0 0.9435   0.9279      0.9356
run 1 0.9391   0.9284      0.9337
run 2 0.9364   0.9267      0.9315




                                          683
          Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)




       C. Colón-Ruiz, I. Segura-Bedmar

    Table 3. Results of Subtask 2 on development set . The top scores are bold.

               Subtask 2 Strict                     Subtask 2 Merged
     Precision Recall             F1      Precision Recall              F1
run 0 0.9507   0.9350             0.9428 0.9635     0.9465              0.9549
run 1 0.9461   0.9353             0.9407 0.9649     0.9519              0.9584
run 2 0.9449   0.9351             0.9400 0.9590     0.9463              0.9526



    Considering that both the models run0 and run1 achieved the best results
in the two subtasks, these models were used to process the test set provided
by MEDDOCAN-2019 in scenarios 1 and 2. The results obtained in both tasks
can be seen in Table 4. Moreover, we can verify that the run0 model is the
one that provides the best results in all scenarios (F1 of 93.22% in sub-task 1,
F1 of 94.26% in sub-task 2 for strict evaluation and F1 of 95.77% for merged
evaluation).


     Table 4. Results of all subtasks on the test set . The top scores are bold.

                         run0                       run1
               Precision Recall F1        Precision Recall F1
Task 1        0.9365     0.9279 0.9322 0.927        0.9309 0.9289
Task 2 Strict 0.9470     0.9383 0.9426 0.9365       0.9404 0.9384
Task 2 Merged 0.9630     0.9524 0.9577 0.9564       0.9563 0.9563



    Once the run 0 model has been selected as our best model, it is interesting
to perform a more detailed study for each of the PHI due to the unbalance of
the problem. As we can see in Table 5, not all types of entities are classified
so easily. For example, entities such as ID EMPLEO PERSONAL SANITARIO
and OTROS SUJETO ASISTENCIA are never correctly classified. This may be
due to their low representation in the training set, as well as the confusion it may
cause with other similar types of entities such as FAMILIARES SUJETO ASISTENCIA.
On the other hand, the types of entity with the best results are those with the
highest representation in the training set or those with a specific structure such
as ID ASEGURAMIENTO and ID CONTACTO ASISTENCIAL.


5   Conclusions

Medical records are sources of a wide range of studies to detect and prevent
diseases, as well as to provide information for medical decision making. The
growing volume of these types of records, in addition to the studies associated
with them, increases the importance of de-identification. This type of process




                                          684
           Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)




                   Protected health information recognition by BiLSTM-CRF

Table 5. Results of subtask 1 on the test set. Micro precision, recall and F1 for entity.

                                                 Precision Recall    F1
CALLE                            0.825287 0.869249 0.846698
CENTRO SALUD                     0.250000 0.166667 0.200000
CORREO ELECTRONICO               0.948000 0.951807 0.949900
EDAD SUJETO ASISTENCIA           0.980315 0.961390 0.970760
FAMILIARES SUJETO ASISTENCIA     0.658537 0.666667 0.662577
FECHAS                           0.964111 0.967267 0.965686
HOSPITAL                         0.873950 0.800000 0.835341
ID ASEGURAMIENTO                 0.929293 0.929293 0.929293
ID CONTACTO ASISTENCIAL          0.795455 0.897436 0.843373
ID EMPLEO PERSONAL SANITARIO     0.000000 0.000000 0.000000
ID SUJETO ASISTENCIA             0.964413 0.957597 0.960993
ID TITULACION PERSONAL SANITARIO 0.931330 0.927350 0.929336
INSTITUCION                      0.290909 0.238806 0.262295
NOMBRE PERSONAL SANITARIO        0.937500 0.928144 0.932798
NOMBRE SUJETO ASISTENCIA         0.994024 0.994024 0.994024
NUMERO FAX                       0.666667 0.571429 0.615385
NUMERO TELEFONO                  0.645161 0.769231 0.701754
OTROS SUJETO ASISTENCIA          0.000000 0.000000 0.000000
PAIS                             0.969014 0.947658 0.958217
PROFESION                        0.571429 0.444444 0.500000
SEXO SUJETO ASISTENCIA           0.982646 0.982646 0.982646
TERRITORIO                       0.966595 0.938285 0.952229



has the goal of eliminating or replacing sensitive data to allow access to medical
information without compromising the identification of the patients involved.
    Most previous efforts in the task of anonymization medical records have been
focused mostly on texts written in English. Meddocan is the first shared task
devoted to the anonymization of medical records in Spanish. One of the ma-
jor challenges of this shared task is that there are a large number of sensitive
data categories (22 different types of entities). In addition, these data are often
unbalanced in the text. This results in difficulties to classify them correctly.
    In this paper, we describe our participating system in this task. It exploits
the NeuroNer tool, a tool based on deep learning with bi-directional LSTM and
CRF layers for the task of NER. In spite of the challenges above described, our
system obtains a micro-F1 of 93.22% on the test set.
    For future works, we plan to explore other deep learning architectures as
well as exploiting pre-trained word embedding models, as well as other types of
embeddings such as sense embeddings [11]. Due to the unbalanced data, we also
plan to explore how the weighting of different classes in training can affect the
performance, as well as the use of different sampling methods. Furthermore, the
identification of certain types of entities (table 5)(such as PROFESION, INSTI-
TUCION, and CENTRO SALUD) might be improved by using dictionary-based
approaches in addition to our system.




                                           685
           Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019)




        C. Colón-Ruiz, I. Segura-Bedmar

Acknowledgements

This work was supported by the Research Program of the Ministry of Economy
and Competitiveness - Government of Spain (DeepEMR project TIN2017-87548-
C2-1-R).


References
 1. Dernoncourt, F., Lee, J.Y., Szolovits, P.: Neuroner: an easy-to-use pro-
    gram for named-entity recognition based on neural networks. arXiv preprint
    arXiv:1705.05487 (2017)
 2. Dernoncourt, F., Lee, J.Y., Uzuner, O., Szolovits, P.: De-identification of patient
    notes with recurrent neural networks. Journal of the American Medical Informatics
    Association 24(3), 596–606 (2017)
 3. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
    arXiv:1412.6980 (2014)
 4. Ling, W., Luı́s, T., Marujo, L., Astudillo, R.F., Amir, S., Dyer, C., Black, A.W.,
    Trancoso, I.: Finding function in form: Compositional character models for open
    vocabulary word representation. arXiv preprint arXiv:1508.02096 (2015)
 5. Lyu, C., Chen, B., Ren, Y., Ji, D.: Long short-term memory rnn for biomedical
    named entity recognition. BMC bioinformatics 18(1), 462 (2017)
 6. Marimon, M., Gonzalez-Agirre, A., Intxaurrondo, A., Rodrguez, H., Lopez Mar-
    tin, J.A., Villegas, M., Krallinger, M.: Automatic de-identification of medical texts
    in spanish: the meddocan track, corpus, guidelines, methods and evaluation of re-
    sults. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019).
    vol. TBA, p. TBA. CEUR Workshop Proceedings (CEUR-WS.org), Bilbao, Spain
    (Sep 2019), TBA
 7. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
    sentations of words and phrases and their compositionality. In: Advances in neural
    information processing systems. pp. 3111–3119 (2013)
 8. Pennington, J., Socher, R., Manning, C.: Glove: Global vectors for word repre-
    sentation. In: Proceedings of the 2014 conference on empirical methods in natural
    language processing (EMNLP). pp. 1532–1543 (2014)
 9. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recog-
    nition. In: Proceedings of the thirteenth conference on computational natural lan-
    guage learning. pp. 147–155. Association for Computational Linguistics (2009)
10. Stubbs, A., Kotfila, C., Uzuner, Ö.: Automated systems for the de-identification of
    longitudinal clinical narratives: Overview of 2014 i2b2/uthealth shared task track
    1. Journal of biomedical informatics 58, S11–S19 (2015)
11. Trask, A., Michalak, P., Liu, J.: sense2vec-a fast and accurate method for word
    sense disambiguation in neural word embeddings. arXiv preprint arXiv:1511.06388
    (2015)
12. Uzuner, O., Szolovits, P., Kohane, I.: i2b2 workshop on natural language process-
    ing challenges for clinical records. In: Proceedings of the Fall Symposium of the
    American Medical Informatics Association. Citeseer (2006)
13. Yang, H., Garibaldi, J.M.: Automatic detection of protected health information
    from clinic narratives. Journal of biomedical informatics 58, S30–S38 (2015)




                                           686