<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Cristobal Colon-Ruiz</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Hulat-TaskAB at eHealth-KD Challenge 2019</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Carlos III de Madrid</institution>
          ,
          <addr-line>Leganes</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>0000</year>
      </pub-date>
      <volume>0002</volume>
      <fpage>35</fpage>
      <lpage>42</lpage>
      <abstract>
        <p>The number of electronic health documents is increasing exponentially, and with it the interest in developing automatic systems to extract relevant information from these texts. In this paper, we describe a deep learning architecture for the identification and classification of named entities of interest in health documents. The architecture consists of two bidirectional Long Short-Term Memory layers and a final layer based on Conditional Random Fields. Our system (Hulat-TaskAB) participated in the eHealthkd-2019 sub-task A and obtained a micro-F1 of 76.63%.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Extraction</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Long Short-Term Memory</kwd>
        <kwd>Conditional Random Fields</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Electronic health records (EHRs) are one of the richest resources for
epidemiological studies that plan and evaluate strategies to identify and prevent
diseases, among other uses. However, researchers have to manually review a large
amount of information, which is a very costly and time-consuming task.
Therefore, providing systems capable of automatically identifying and classifying
entities or key phrases of interest in these records is a vital task.</p>
      <p>The identification of specific entities of interest inside medical documents
can be addressed as a Named Entity Recognition (NER) problem. This problem
has been widely studied, and the approaches normally used can be divided into
several main categories: dictionaries and rule-based systems, machine learning,
deep learning, and hybrid systems.</p>
      <p>
        Dictionary-based methods are limited by the size of the dictionaries
themselves, by the constant growth of vocabulary, and by spelling errors.
Rule-based approaches usually provide high precision, but they rarely
cover all existing cases because of the complexity of the language. For
example, the authors of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] combined direct matching with fuzzy matching
and stemmed matching. This approach was tested on the i2b2 "Heart Disease
Risk Factors Challenge" dataset and achieved an F1 of 60.1%, aiming to maximize
recall with a limited impact on precision.
      </p>
      <p>
        Furthermore, rule-based and machine learning methods require the prior
generation of syntactic and semantic features, as well as domain-specific
information. For example, the authors of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] used an ensemble of support vector machines
with different text annotation schemes, orthographic features, n-grams, and
part-of-speech (PoS) tags, among others. Their system was tested on the i2b2 2010
clinical data set, obtaining an F1 of 77.63.
      </p>
      <p>
        Approaches based on deep learning automatically learn relevant
patterns, which allows a certain degree of independence from language and domain.
Moreover, these approaches have been shown to achieve better results than the
best hybrid systems in i2b2 tasks. The system described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which was based
on Long Short-Term Memory (LSTM) layers combined with Conditional Random
Field (CRF) layers, scored an F1 of 97.87%, surpassing the 96.11% of the winning i2b2 2014 system
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], a hybrid model combining conditional
random fields with keyword and rule-based approaches.
      </p>
      <p>
        Considering the above, in this paper we propose an adaptation
of the NeuroNer tool [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for subtask A of the eHealthkd-2019 task [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] on health
records in Spanish. This tool combines two bidirectional
Long Short-Term Memory (BiLSTM) layers with a final layer based on Conditional
Random Fields.
      </p>
      <p>The rest of the paper is organized as follows. In Section 2, we briefly describe
the datasets provided for the eHealthkd-2019 task. In Section 3, we describe
the architecture of our system. Section 4 presents the results obtained by our
system. In Section 5, we provide the conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>The data set consists of the annotated corpus of Spanish electronic health
documents provided for the eHealthkd-2019 sub-task A<sup>1</sup>. For this task, all the
documents are provided in BRAT format<sup>2</sup>, a standoff format where the
annotations are stored separately from the original text, in a similar way to the
BioNLP Shared Task standoff format<sup>3</sup>.</p>
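      <p>To make the standoff idea concrete, the following is a minimal sketch of how the text-bound annotations of a BRAT file could be read. It assumes the usual BRAT layout (an identifier, a label followed by character offsets, and the covered text, separated by tabs); the names KeyPhrase and parse_ann, as well as the example sentence and its annotations, are hypothetical and are not part of the task data or of our system.</p>
      <preformat>
```python
from typing import List, NamedTuple, Tuple


class KeyPhrase(NamedTuple):
    """One text-bound annotation (a 'T' line) from a BRAT .ann file."""
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # character offsets, end exclusive
    text: str


def parse_ann(ann_content: str) -> List[KeyPhrase]:
    """Collect the text-bound annotations; other line types are skipped."""
    phrases = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # relations, attributes, notes, blank lines
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous phrases list several fragments separated by ';'.
        spans = []
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        phrases.append(KeyPhrase(ann_id, label, spans, text))
    return phrases


# Hypothetical sentence and annotations, written for this example only.
sentence = "El asma afecta las vias respiratorias."
ann = (
    "T1\tConcept 3 7\tasma\n"
    "T2\tAction 8 14\tafecta\n"
    "T3\tConcept 19 37\tvias respiratorias\n"
)
for phrase in parse_ann(ann):
    start, end = phrase.spans[0]
    print(phrase.label, repr(sentence[start:end]))
```
      </preformat>
      <p>Because the offsets point into the untouched sentence, the annotation file can change without the text itself being modified, which is the property the standoff format is designed for.</p>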
      <p>As Table 1 shows, the training set is composed of 600 manually annotated
sentences and contains a total of 4350 key phrases distributed among
four categories. The development set is composed of 100
manually annotated sentences with a total of 4350 key phrases. The four categories
are listed below:
        <list list-type="bullet">
          <list-item><p>Concept: Key phrase that indicates a relevant term or idea in the sentence.</p></list-item>
          <list-item><p>Action: Indicates a process or modification of concepts.</p></list-item>
          <list-item><p>Predicate: Represents a function or filter over a set of elements.</p></list-item>
          <list-item><p>Reference: Element that refers to a concept that is not stated explicitly.</p></list-item>
        </list>
      </p>
      <p>1 https://github.com/knowledge-learning/ehealthkd-2019/tree/master/data</p>
      <p>2 http://brat.nlplab.org/standoff.html</p>
      <p>3 http://2011.bionlp-st.org/home/file-formats</p>
      <p>In addition, as Table 2 shows, the categories are unbalanced, with
similar proportions in both sets.</p>
    </sec>
    <sec id="sec-3">
      <title>Methods and system description</title>
      <sec id="sec-3-1">
        <title>Pre-processing</title>
        <p>We pre-process the text of the clinical cases in several steps.
First, the texts are split into sentences and tokens using spaCy<sup>4</sup>, an
open-source library that provides support for texts in several languages, including
Spanish.</p>
        <p>
          Finally, the text and its annotations are transformed into the CoNLL-2003
format<sup>5</sup> using the BIOES schema [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In this schema, tokens are annotated
using the following tags:
          <list list-type="bullet">
            <list-item><p>B: marks the token that begins an entity.</p></list-item>
            <list-item><p>I: indicates that the token is inside an entity.</p></list-item>
            <list-item><p>O: indicates that the token does not belong to any entity.</p></list-item>
            <list-item><p>E: marks the token that ends an entity.</p></list-item>
            <list-item><p>S: indicates that the entity consists of a single token.</p></list-item>
          </list>
        </p>
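        <p>The tagging schema can be illustrated with a short sketch. It assumes that entities are already aligned to token indices and do not overlap; the function name to_bioes and the example sentence are hypothetical and serve only as an illustration, not as the conversion code used in our system.</p>
        <preformat>
```python
from typing import List, Tuple


def to_bioes(n_tokens: int, entities: List[Tuple[int, int, str]]) -> List[str]:
    """Tag tokens with BIOES given entities as (first, last, label) token indices.

    Indices are inclusive. Entities are assumed not to overlap, since one tag
    per token cannot encode nested or overlapping phrases.
    """
    tags = ["O"] * n_tokens
    for first, last, label in entities:
        if first == last:
            tags[first] = "S-" + label  # single-token entity
            continue
        tags[first] = "B-" + label
        for i in range(first + 1, last):
            tags[i] = "I-" + label
        tags[last] = "E-" + label
    return tags


# Hypothetical sentence, for illustration only.
tokens = ["El", "asma", "afecta", "las", "vias", "respiratorias", "."]
entities = [(1, 1, "Concept"), (2, 2, "Action"), (4, 5, "Concept")]
for token, tag in zip(tokens, to_bioes(len(tokens), entities)):
    print(token + "\t" + tag)  # one token and its tag per line
```
        </preformat>
        <p>In this example the two-token phrase receives a B tag followed by an E tag, while the single-token phrases receive S tags and all remaining tokens receive O.</p>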
        <sec id="sec-3-1-1">
          <p>4 https://spacy.io/</p>
          <p>5 https://www.clips.uantwerpen.be/conll2003/ner/</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Network description</title>
        <p>
          Bidirectional LSTMs are a type of recurrent neural network (RNN) in which the
context of each word in the sentence is captured using information from both the
preceding and the following words. In addition, to improve the
accuracy of the predictions provided by the BiLSTM layer, the CRF layer uses
information from neighboring tags at the sentence level to predict the current tag. Given
the performance of this type of architecture in entity recognition tasks, we
propose the use of the NeuroNer [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] tool for eHealthkd-2019 subtask A. This
tool is composed of three main layers (see Figure 1):
          <list list-type="order">
            <list-item><p>Character-enhanced and token embedding layer</p></list-item>
            <list-item><p>BiLSTM prediction layer</p></list-item>
            <list-item><p>CRF optimization layer</p></list-item>
          </list>
        </p>
        <p>
          The first layer (Character-enhanced and token embedding layer) generates
vector representations of the tokens that make up the input sequences.
The direct mapping from token to vector (word embedding) can be pre-trained
or learned jointly with the rest of the model by adjusting its
weights. Pre-trained models can be obtained from large amounts of unlabeled
data with methods such as word2vec, FastText or GloVe [
          <xref ref-type="bibr" rid="ref1 ref6 ref8">6, 1, 8</xref>
          ]. However,
word embedding models contain no representation for tokens
outside their vocabularies. The first layer addresses this problem by
incorporating a representation of tokens based on their characters (character
embeddings). Each character of a token is represented by its own vector, allowing
the network to learn morphological information even from tokens that are not
included in the vocabulary of the word embedding model [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>The character embedding sequence of each token is passed as input to a
BiLSTM, which outputs a character-based word embedding. The
word embedding and the character-based word embedding of each token are then
concatenated and passed as input to the second BiLSTM layer
(BiLSTM prediction layer). This BiLSTM layer produces, for each token, the sequence of
probabilities of belonging to each label of the BIOES coding.
The label for each token will be the one with the highest probability.</p>
        <p>The last layer (CRF optimization layer) consists of a conditional random
fields layer. This layer receives as input the sequence of probabilities from the
previous layer and improves the predictions by
taking into account the dependencies between the different labels. The
output of this layer is the most probable sequence of labels.</p>
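        <p>The effect of taking label dependencies into account can be sketched with Viterbi decoding, the standard way of finding the best label sequence in a linear-chain CRF. The snippet below is only an illustration under stated assumptions: the per-token scores and the transition scores are hypothetical hand-picked values (in a trained model both are learned), and the function name viterbi is ours rather than part of the NeuroNer tool.</p>
        <preformat>
```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the label sequence with the highest total score.

    emissions[t][label] is the score of `label` at position t;
    transitions[prev][label] is the score of moving from `prev` to `label`.
    """
    labels = list(emissions[0])
    best = {label: emissions[0][label] for label in labels}
    backpointers = []
    for scores in emissions[1:]:
        step_best, step_back = {}, {}
        for label in labels:
            prev = max(labels, key=lambda p: best[p] + transitions[p][label])
            step_best[label] = best[prev] + transitions[prev][label] + scores[label]
            step_back[label] = prev
        best = step_best
        backpointers.append(step_back)
    # Follow the backpointers from the best final label.
    path = [max(labels, key=lambda l: best[l])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    return path[::-1]


OK, NO = 0.0, -100.0  # hypothetical scores: allowed vs. strongly penalized
transitions = {
    "O": {"O": OK, "B-Concept": OK, "E-Concept": NO},
    "B-Concept": {"O": NO, "B-Concept": NO, "E-Concept": OK},
    "E-Concept": {"O": OK, "B-Concept": OK, "E-Concept": NO},
}
emissions = [  # hypothetical per-token scores for a three-token sentence
    {"O": 1.0, "B-Concept": 0.8, "E-Concept": 0.0},
    {"O": 0.2, "B-Concept": 0.4, "E-Concept": 1.0},
    {"O": 1.0, "B-Concept": 0.1, "E-Concept": 0.2},
]
greedy = [max(scores, key=scores.get) for scores in emissions]
print(greedy)                           # ['O', 'E-Concept', 'O']
print(viterbi(emissions, transitions))  # ['B-Concept', 'E-Concept', 'O']
```
        </preformat>
        <p>Choosing the best label for each token independently yields an E tag with no preceding B tag, which is not a valid BIOES sequence. Decoding over the whole sequence instead prefers a slightly lower-scored first label in order to obtain a consistent sequence.</p>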
        <p>The hyperparameters of our model used for the eHealthkd-2019 subtask A
are listed below:
          <list list-type="bullet">
            <list-item><p>Word embeddings: randomly initialized and adjusted during training. The dimension of the vectors is 200.</p></list-item>
            <list-item><p>Character embeddings: randomly initialized and adjusted during training. The dimension of the vectors is 25.</p></list-item>
            <list-item><p>First BiLSTM hidden state dimension: 25 for the forward and backward layers</p></list-item>
            <list-item><p>Second BiLSTM hidden state dimension: 200 for the forward and backward layers</p></list-item>
            <list-item><p>Optimizer: ADAM optimizer [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], learning rate: 0.001</p></list-item>
            <list-item><p>Dropout: 0.5</p></list-item>
            <list-item><p>Number of epochs: 100</p></list-item>
          </list>
        </p>
        <p>
          In our experiments, we use precision, recall, and F1 score to evaluate the
performance of our system. In eHealthkd-2019 subtask A, several criteria are taken
into account:
          <list list-type="bullet">
            <list-item><p>Correct matches: The span and the label of the predicted key phrase match exactly with its entry in the gold file.</p></list-item>
            <list-item><p>Partial matches: The label of the predicted phrase matches its entry in the gold file, but the spans only partially overlap.</p></list-item>
            <list-item><p>Incorrect matches: The span of the predicted key phrase matches its entry in the gold file exactly, but the labels differ.</p></list-item>
            <list-item><p>Missing matches: Entries that appear in the gold file but not in the predicted file.</p></list-item>
            <list-item><p>Spurious matches: Entries that appear in the predicted file but not in the gold file.</p></list-item>
          </list>
        </p>
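        <p>As an illustration of how these five counts can be combined into precision, recall and F1, the following sketch assumes that a partial match is credited as half of a correct match, which is our reading of the evaluation described on the task website. The function name subtask_a_scores and the counts in the usage example are hypothetical and are not results of our system.</p>
        <preformat>
```python
def subtask_a_scores(correct: int, partial: int, incorrect: int,
                     missing: int, spurious: int) -> dict:
    """Precision, recall and F1 with partial matches counted as half a hit."""
    hits = correct + 0.5 * partial
    # Every predicted entry is correct, partial, incorrect or spurious.
    predicted = correct + partial + incorrect + spurious
    # Every gold entry is correct, partial, incorrect or missing.
    gold = correct + partial + incorrect + missing
    precision = hits / predicted if predicted else 0.0
    recall = hits / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to show the arithmetic.
scores = subtask_a_scores(correct=8, partial=2, incorrect=1, missing=1, spurious=3)
for name, value in scores.items():
    print(name, round(value, 4))
```
        </preformat>
        <p>With these counts, spurious entries enlarge only the denominator of precision and missing entries only that of recall, so a system that over-generates key phrases loses precision while its recall is unaffected.</p>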
        <p>A more detailed description of the evaluation can be found on the task website<sup>6</sup>.</p>
        <p>6 https://knowledge-learning.github.io/ehealthkd-2019/evaluation</p>
        <sec id="sec-3-2-1">
          <title>Results</title>
          <p>To evaluate the trained models and select their hyperparameters, we
performed a set of experiments with the development dataset provided by the
eHealthkd-2019 organizers. We used grid search to adjust the word embeddings
dimension, the number of units in the BiLSTM hidden layer, the optimizer and
the learning rate.</p>
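          <p>A grid search of this kind can be sketched as follows. The snippet is a simplified illustration and not our training code: the grid values beyond those reported in this paper (the "sgd" optimizer and the 0.01 learning rate) are hypothetical, and placeholder_evaluate is a stand-in that returns an arbitrary number instead of training a model and measuring its F1 on the development set.</p>
          <preformat>
```python
from itertools import product
from typing import Callable, Dict, Tuple


def grid_search(grid: Dict[str, list],
                evaluate: Callable[[dict], float]) -> Tuple[dict, float]:
    """Try every combination in `grid` and keep the best-scoring one."""
    names = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)  # e.g. micro-F1 on the development set
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def placeholder_evaluate(config: dict) -> float:
    # Stand-in so the sketch runs: NOT a real measurement. A real version
    # would train the tagger with `config` and return its development-set F1.
    return float(config["word_embedding_dim"] + config["lstm_units"])


grid = {
    "word_embedding_dim": [100, 200],
    "lstm_units": [100, 200],
    "optimizer": ["adam", "sgd"],    # "sgd" is a hypothetical alternative
    "learning_rate": [0.001, 0.01],  # 0.01 is a hypothetical alternative
}
best_config, best_score = grid_search(grid, placeholder_evaluate)
print(best_config)
```
          </preformat>
          <p>The number of trained models grows with the product of the option counts (16 combinations for this grid), which is why only a few values per hyperparameter are usually explored.</p>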
          <p>Table 3 shows our best results on the development set. The run0
model was trained using the hyperparameters listed in Section 3.2. The
run1 model was trained using ADAM as optimizer, with a word embeddings
dimension of 200 and 100 units in the second BiLSTM layer. The run2
model was trained using ADAM with a word embeddings dimension of 100
and 100 units in the second BiLSTM layer.</p>
          <p>Considering that the model run0 achieved the best results, this model was
used to process the test set provided by eHealthkd-2019 in subtask A. The results
obtained can be seen in Table 4.</p>
          <p>As Table 4 shows, the test set of subtask A contains 646 key phrases.
In total, the spans and labels match correctly in 476 of them and partially in
58. One of the factors that most affects our results is the number of spurious
phrases, which lowers precision.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>Electronic health records (EHRs) are sources for a wide range of studies to plan
and evaluate strategies for identifying and preventing disease. However, due to
the exponential increase in health records, manually reviewing that amount of
information is a very expensive and time-consuming task. Therefore,
providing systems capable of automatically identifying and classifying entities or key
phrases of interest in these records is a vital task.</p>
      <p>One of the biggest challenges of this shared task is that there are a large
number of entities or key phrases of interest, but they are often unbalanced in the
text. All this, together with the presence of nested, discontinuous or overlapping
entities, results in difficulties in classifying them correctly.</p>
      <p>In this paper, we describe the system with which we took part in sub-task A
of eHealthkd-2019. It exploits the NeuroNer tool, a tool based on deep learning
with bidirectional LSTM and CRF layers for the NER task. Considering the
challenges described above, our system achieves a micro-F1 of 76.63% on the
test set.</p>
      <p>
        For future work, we plan to explore other deep learning architectures,
pre-trained word embedding models, and other types of
embeddings such as sense embeddings [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We also plan to study the
performance of our system when extending the BIOES annotation scheme [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to address
the problem of overlapping, nested or discontinuous key phrases. This approach
could reduce the number of partial matches and increase the number of exact
matches. In addition, because the data are unbalanced, we plan to explore how
weighting the different classes during training affects performance, as well
as the use of different sampling methods.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>Funding: This work was supported by the Research Program of the Ministry
of Economy and Competitiveness - Government of Spain (DeepEMR project
TIN2017-87548-C2-1-R).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>5</volume>
          ,
          <fpage>135</fpage>
          –
          <lpage>146</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dernoncourt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szolovits</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Neuroner: an easy-to-use program for named-entity recognition based on neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1705.05487</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dernoncourt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uzuner</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szolovits</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>De-identification of patient notes with recurrent neural networks</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          <volume>24</volume>
          (
          <issue>3</issue>
          ),
          <fpage>596</fpage>
          –
          <lpage>606</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luís</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marujo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Astudillo</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amir</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Black</surname>
            ,
            <given-names>A.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trancoso</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Finding function in form: Compositional character models for open vocabulary word representation</article-title>
          .
          <source>arXiv preprint arXiv:1508.02096</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>3111</fpage>
          –
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Nayel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shashirekha</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Improving ner for clinical texts by ensemble approach using segment representations</article-title>
          .
          <source>In: Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)</source>
          . pp.
          <fpage>197</fpage>
          –
          <lpage>204</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Piad-Morffis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Consuegra-Ayala</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Estevez-Velarde</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Almeida-Cruz</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muñoz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montoyo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2019</article-title>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Quimbaya</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Munera</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rivera</surname>
            ,
            <given-names>R.A.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>J.C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velandia</surname>
            ,
            <given-names>O.M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peña</surname>
            ,
            <given-names>A.A.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labbe</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Named entity recognition over electronic health records through a combined dictionary-based approach</article-title>
          .
          <source>Procedia Computer Science</source>
          <volume>100</volume>
          ,
          <fpage>55</fpage>
          –
          <lpage>61</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ratinov</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Design challenges and misconceptions in named entity recognition</article-title>
          .
          <source>In: Proceedings of the thirteenth conference on computational natural language learning</source>
          . pp.
          <fpage>147</fpage>
          –
          <lpage>155</lpage>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Trask</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michalak</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>sense2vec-a fast and accurate method for word sense disambiguation in neural word embeddings</article-title>
          .
          <source>arXiv preprint arXiv:1511.06388</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garibaldi</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>Automatic detection of protected health information from clinic narratives</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>58</volume>
          ,
          <fpage>S30</fpage>
          –
          <lpage>S38</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>