<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hulat-TaskA at eHealth-KD Challenge 2019</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alejandro Ruiz-de-laCuadra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose Luis Lopez-Cuadrado</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Israel Gonzalez-Carrasco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Belen Ruiz-Mezcua</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Carlos III de Madrid, Computer Science Department, Av de la Universidad</institution>
          ,
          <addr-line>30, 28911, Leganes, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>26</fpage>
      <lpage>34</lpage>
      <abstract>
<p>Key phrase recognition is one of the open issues in Natural Language Processing. These entities are relevant for identifying relations between phrases and allow extracting knowledge from unstructured text. This paper combines Recurrent Neural Networks and Conditional Random Fields in a trending architecture to solve the identification and classification of key phrases (Scenario 2) at the eHealth-KD Challenge 2019. With an F-score of 0.7903, the team HULAT-TaskA achieved the fourth position.</p>
      </abstract>
      <kwd-group>
        <kwd>Bi-LSTM</kwd>
        <kwd>CRF</kwd>
        <kwd>NER</kwd>
        <kwd>Knowledge Discovery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>With the exponential growth of clinical documents, structuring this data
manually has become complex and unfeasible. Automating this process
solves the scalability problem and keeps knowledge mining available
for unstructured texts.</p>
      <p>In the clinical domain, extracting key concepts and their relationships
allows a better understanding of the diagnosis and a better
follow-up of similar or related cases.</p>
      <p>This language processing starts with the decisive task of named entity
recognition (NER). Currently, the approaches to the NER problem
are divided into methods based on dictionaries, rules or machine
learning. Firstly, dictionaries are limited by the size and diversity of the vocabulary,
misspellings, and the use of synonyms and abbreviations. Secondly, although
rule-based methods are at the peak of performance for this task, the domain
dependency required to build effective rules makes NER laborious and difficult to
extend. Finally, machine learning approaches have been steadily progressing. Their
competitive advantage lies in the simplicity of building and
configuring the systems, the performance in extracting characteristics or syntactic and
semantic patterns, and their versatility across domains and languages.</p>
      <p>While machine learning systems were long dominated by Conditional
Random Fields (CRF) methods, the emergence of hybrid systems combining deep
learning and CRF was the boost needed to reach the level of rule-based
methods.</p>
      <p>
        This paper describes the participation of the team HULAT-TaskA in the
IberLEF 2019 eHealth-KD challenge [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This challenge is oriented to the
identification of key phrases and their corresponding relationships in eHealth records
in Spanish. The challenge is structured in two different subtasks: A and B. The
goal of subtask A is to identify the key phrases per document and their classes.
The goal of subtask B is to link the key phrases detected and labelled in each
document. These two subtasks lead to three different scenarios. Scenario 1 covers
the two subtasks as a pipeline, Scenario 2 evaluates subtask A and Scenario
3 evaluates subtask B. The team HULAT-TaskA takes part only in Scenario
2 (subtask A).
      </p>
      <p>
        The core of the proposed tagger system is an adaptation of the bidirectional
Long Short-Term Memory (bi-LSTM)-CRF model of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], successfully applied
previously to temporal expression recognition. This tagger combines several neural
network architectures for the extraction of characteristics at a contextual level
with a CRF for the decoding of labels.
      </p>
      <p>The results obtained in this task by the HULAT-TaskA team, an F1 of 0.7903,
show a performance similar to that obtained for temporal expressions,
applying a different approach at the character level. These results demonstrate the
versatility of hybrid systems for the extraction of entities.</p>
    </sec>
    <sec id="sec-3">
      <title>Dataset</title>
      <p>
        The dataset is described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The provided corpus was divided into three parts:
training, development and validation. The training set contained 600 sentences
manually annotated using the brat standoff format and post-processed to match
the input format. The development set consisted of 100 additional sentences
for evaluating machine learning systems and tuning their hyperparameters. For
scenarios 2 and 3, only the 100 valid sentences are published. Participants could
also freely use additional resources from other corpora to improve their systems.
Although the team HULAT-TaskA used the previous year's challenge dataset to
extend the word and character vocabulary with more vectors and to start testing
possible architectures, the team had not participated in the previous challenge.
      </p>
      <p>Table 1 describes the statistics of the dataset relevant for the proposed
system. A total of 2626 words were provided in the dataset, with 76 different
characters and 8 labels for classifying the key phrases. Figure 3 represents the
relevant statistics related to the labels of the corpus, from the point of view of the
IOB (Inside, Outside, Begin) format. As shown in the figure, some concepts are
inside other ones. This is relevant for the configuration of the system, since we
decided not to look for these elements in order to improve the results. We found
that the F-measure lost by omitting these multi-word concepts was lower than
the F-measure lost through the system's difficulty in learning this type of
concept. So the system was only trained with one-word concepts.</p>
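      <p>As a minimal illustration of this simplification, the following Python sketch relabels every token belonging to a multi-word concept as O, so that only one-word concepts remain as training targets. The function name, the tag labels and the example sentence are assumptions for illustration, not the system's actual code.</p>

```python
from collections import Counter

def keep_one_word_concepts(tagged_tokens):
    """Relabel every token of a multi-word concept as O (hypothetical rule)."""
    out = []
    for i, (tok, tag) in enumerate(tagged_tokens):
        is_last = (i == len(tagged_tokens) - 1)
        next_tag = "O" if is_last else tagged_tokens[i + 1][1]
        if tag.startswith("I-"):
            out.append((tok, "O"))        # continuation of a multi-word span
        elif tag.startswith("B-") and next_tag.startswith("I-"):
            out.append((tok, "O"))        # first token of a multi-word span
        else:
            out.append((tok, tag))
    return out

sent = [("asma", "B-Concept"),
        ("vias", "B-Concept"),
        ("respiratorias", "I-Concept")]
simplified = keep_one_word_concepts(sent)
counts = Counter(tag for _, tag in simplified)
```

      <p>In the example, the one-word concept "asma" is kept while the two-word concept "vias respiratorias" is relabelled as O.</p>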
    </sec>
    <sec id="sec-4">
      <title>System Description</title>
      <sec id="sec-4-1">
        <title>Pre-processing</title>
        <p>
          Since the main architecture (Figure 1) has been designed based on the
architecture of [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] for the recognition of multiple entities, it is necessary to include
a pre-processing module for adapting the input data to the format expected by
the system. This modular system allows adapting the proposed architecture to
other problems.
        </p>
        <p>In the context of this workshop, data is provided in the format:
ID \tab START END ; START END \tab LABEL \tab TEXT</p>
      <p>This module transforms this format into a different token-oriented
organization with its corresponding tags, as shown below:
docId \tab sentId \tab tokId \tab tokTxt \tab tag \tab tagId
\tab type \tab val</p>
        <p>This new structure allows processing the tokens using the customized tagger
module while minimizing the loss of information. At the same time, it saves the
information required for reverting the process.</p>
        <p>Finally, after adapting the data to the new format, the system can
automatically generate the training and labelling files following the IOB format.</p>
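        <p>A minimal sketch of this conversion step is shown below: character-offset annotations are mapped to token-level IOB tags. The whitespace tokenization, the function name and the example spans are assumptions for illustration; the actual module also tracks document, sentence and token identifiers.</p>

```python
def spans_to_iob(sentence, annotations):
    """annotations: list of (start, end, label) character spans."""
    tokens = []
    offset = 0
    for tok in sentence.split():
        start = sentence.index(tok, offset)   # character offset of the token
        end = start + len(tok)
        offset = end
        tag = "O"
        for a_start, a_end, label in annotations:
            if start >= a_start and a_end >= end:
                # the first token of the span gets B-, the rest I-
                tag = ("B-" if start == a_start else "I-") + label
                break
        tokens.append((tok, tag))
    return tokens

example = spans_to_iob(
    "El asma es una enfermedad",
    [(3, 7, "Concept"), (15, 25, "Concept")],
)
```

        <p>Keeping the original offsets alongside each token is what allows the post-processing module to revert the process later.</p>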
      </sec>
      <sec id="sec-4-2">
        <title>Model Architecture</title>
        <p>Input The input of the model is composed of two levels: character and word.</p>
        <p>The character-level representation provides additional lexical information relative
to each word n in sentence s. Figure 2 depicts the design of the Recurrent Neural
Network (RNN) used to obtain the 50-dimensional vector. A conversion table provides
a numerical representation of each character n of the sentence s (c_{n,s}).
This vector is the input of the RNN. By means of a many-to-one architecture, the
two LSTM layers in opposite directions represent more complex characteristics
than the Convolutional Neural Network (CNN) applied in previous research on
temporal expressions. Finally, the outputs of the LSTM layers (dimension 25)
are concatenated and a dropout layer is included in order to avoid overfitting.</p>
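        <p>The conversion table can be sketched as follows; the vocabulary, the index scheme (0 reserved for padding/unknown characters) and the example words are assumptions for illustration.</p>

```python
def build_char_table(words):
    """Map every character seen in the corpus to an integer index."""
    chars = sorted(set("".join(words)))
    return {c: i + 1 for i, c in enumerate(chars)}   # 0 reserved for padding

def encode_word(word, table):
    """Turn a word into the index sequence fed to the char-level RNN."""
    return [table.get(c, 0) for c in word]

table = build_char_table(["asma", "enfermedad"])
ids = encode_word("asma", table)
```

        <p>In the real system the table covers the 76 distinct characters of the corpus, and the resulting index sequences are the input to the character-level bi-LSTM.</p>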
        <p>
          Figure 2 represents the inputs of the bi-LSTM layer. At the word level, a
conversion table has been applied, based on the numerical values calculated in
the Spanish Billion Words word embeddings [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which returns a 300-dimensional
vector (w_{n,s}).
        </p>
        <p>LSTM Layer Weight training is focused on this layer of recurrent networks,
with a high number of nodes in order to address the complexity of this problem.
The network takes the combination of both representations:</p>
        <p>x_{n,s} = c_{n,s} + w_{n,s}</p>
        <p>As in the representation of characters, the two layers are concatenated and
a dropout layer is applied. As a result, a word representation is obtained in the
sentence context (h_1, ..., h_n):
h_t = [→h_t ; ←h_t]</p>
        <p>This system allows us to capture multiple dependencies between the tags.
Thanks to the use of two layers in opposite directions, the system
exploits the potential of the LSTM and captures contextual information in both
directions.</p>
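        <p>A minimal, framework-free sketch of this bidirectional combination is given below: the forward and backward hidden states of each token are concatenated into h_t. The dimensions (2 per direction here, versus 25 at the character level in the real model) and the values are illustrative only.</p>

```python
def concat_bidirectional(forward_states, backward_states):
    """Concatenate per-token forward and backward hidden states: h_t = [fwd; bwd]."""
    return [f + b for f, b in zip(forward_states, backward_states)]

fwd = [[0.1, 0.2], [0.3, 0.4]]   # forward LSTM output, one vector per token
bwd = [[0.5, 0.6], [0.7, 0.8]]   # backward LSTM output, one vector per token
h = concat_bidirectional(fwd, bwd)
# each h[t] now carries both left and right context for token t
```
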
        <p>Conditional Random Fields To adjust the combined output of the LSTM
layers, a CRF layer has been included just after a dense layer. This layer allows
us to decode the output considering the neighbouring tags, in contrast to the Softmax
function, which makes each tagging decision independently.</p>
        <p>As loss function, the log-likelihood of tag sequences in a CRF
has been used.</p>
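        <p>The advantage of CRF decoding over independent Softmax decisions can be illustrated with a small Viterbi sketch in plain Python. The emission and transition scores below are invented for the example and do not come from the trained model.</p>

```python
def viterbi(emissions, transitions, tags):
    """emissions: list of dicts tag-to-score; transitions: dict (prev, cur)-to-score."""
    paths = {t: ([t], emissions[0][t]) for t in tags}
    for emit in emissions[1:]:
        new_paths = {}
        for cur in tags:
            # pick the previous tag that maximises score + transition
            best_prev = max(tags, key=lambda p: paths[p][1] + transitions[(p, cur)])
            path, score = paths[best_prev]
            new_paths[cur] = (path + [cur],
                              score + transitions[(best_prev, cur)] + emit[cur])
        paths = new_paths
    return max(paths.values(), key=lambda v: v[1])[0]

tags = ["O", "B", "I"]
trans = {(p, c): 0.0 for p in tags for c in tags}
trans[("O", "I")] = -10.0   # penalise the incoherent transition O to I
emissions = [{"O": 1.0, "B": 0.5, "I": 0.0},
             {"O": 0.0, "B": 0.1, "I": 0.9}]
best = viterbi(emissions, trans, tags)
```

        <p>In this example the transition penalty makes the incoherent sequence O, I unlikely, so the decoder prefers B followed by I even though the first token's highest emission score is O; tagging each token independently would have produced the invalid pair.</p>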
      </sec>
      <sec id="sec-4-3">
        <title>Post-processing</title>
        <p>The post-processing module takes the output from the tagger and the sentence
metadata from pre-processing in order to transform the tagged data into brat format.
Additionally, a set of rules has been applied to discard any incoherent IOB
(Inside, Outside, Begin) labels. In fact, these rules have been created using data
statistics such as the label distribution (Figure 3).</p>
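        <p>The exact rule set is not published; the sketch below shows one plausible coherence rule of this kind, which promotes a stray I- tag (one that does not continue a span of the same label) to B-.</p>

```python
def repair_iob(tags):
    """Fix incoherent IOB sequences (hypothetical rule, not the published set)."""
    fixed = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            label = tag[2:]
            if prev not in ("B-" + label, "I-" + label):
                tag = "B-" + label   # a stray I- starts a new span instead
        fixed.append(tag)
        prev = tag
    return fixed

fixed = repair_iob(["O", "I-Concept", "I-Concept"])
```

        <p>Here the first I-Concept follows an O tag, which is invalid in IOB, so it is rewritten as B-Concept while the genuine continuation is kept.</p>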
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>
        Five different models were tested for the competition, but only the best result
was submitted. Table 2 summarises the different architectures tested. During the
training phase, F-score, precision and recall measures were applied. TensorBoard [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
was used for analysing undesired behaviours. For the final results, the
measure proposed by the organization of the task was applied
(https://knowledgelearning.github.io/ehealthkd-2019/evaluation).
      </p>
      <p>Models 1 and 2 (Table 3) obtain similar results (around a 1% difference). The
batch size produces a higher cost in the training phase, but reduces the number
of false positives.</p>
      <p>The performance of the rest of the models does not improve on the results of
Model 1, so it was the one selected for the competition.</p>
      <p>In future work, Model 1 can be improved by modifying the labelling format
or adjusting the dropout of the different layers.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>This paper has presented RNN-ICK, a bi-LSTM architecture for recognizing
key phrases in the context of Scenario 2 (subtask A) of the IberLEF 2019
eHealth-KD challenge. The proposed system combines an RNN architecture with a
CRF at the output. The architecture was adapted from a previous one
applied to temporal expression recognition, modifying the input processing
and adapting the labels to be processed. The proposed system reached the 4th
position in the competition. The results obtained were similar to those in the temporal
scenario. These results show the ability of hybrid systems to adapt to several
NER scenarios.</p>
      <p>
        Moreover, the results show the importance of generating more information
at the character level, the configurations with LSTM being at all times superior
to the configurations with CNN. This detail also makes clear that it is necessary to
give more knowledge to the model, since the pre-trained embeddings by themselves are
not enough, whether due to the Skip-Gram algorithm or to the specific domain. In
the previous edition (2018), the winning system used a very similar architecture
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], although achieving a result 8 percentage points above this F-score. In future
work, the use of BIOES-V labelling to differentiate the beginning and end of an
entity, together with embeddings at sentence level, will add enough information
to address the complexity of the problem. These modifications are feasible
thanks to the modular design of the proposed system. Further research will also
include the use of specific word embeddings based on clinical data in Spanish
as well as the testing of other architectures.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Funding: This work was supported by the Research Program of the Ministry of
Economy and Competitiveness, Government of Spain (Project DeepEMR:
Clinical information extraction using deep learning and big data techniques,
TIN201787548-C2-1-R).</p>
      <p>We also thank the organization committee of the eHealth-KD Challenge 2019
for providing all resources.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brevdo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Citro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <source>TensorFlow: Large-scale machine learning on heterogeneous systems</source>
          (
          <year>2015</year>
          ), tensorflow.org
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cardellino</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <source>Spanish Billion Words Corpus and Embeddings (March</source>
          <year>2016</year>
          ), https://crscardellino.github.io/SBWCE/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Genthial</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>TensorFlow - named entity recognition</article-title>
          (
          <year>2018</year>
          ), https://github.com/guillaumegenthial/tf_ner
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Piad-Morffis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Consuegra-Ayala</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Estevez-Velarde</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Almeida-Cruz</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muñoz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montoyo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2019</article-title>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zavala</surname>
            ,
            <given-names>R.M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Segura-Bedmar</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>A hybrid Bi-LSTM-CRF model for knowledge recognition from eHealth documents</article-title>
          .
          <source>Proceedings of TASS 2172</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>