<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>uhKD4 at eHealth-KD Challenge 2021: Deep Learning Approaches for Knowledge Discovery from Spanish Biomedical Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dayany Alfaro-Gonzalez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dalianys Perez-Perera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gilberto Gonzalez-Rodríguez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Jesus Otaño-Barrera</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rocío Cruz-Linares</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Math and Computer Science, University of Havana</institution>
          ,
          <addr-line>La Habana</addr-line>
          ,
          <country country="CU">Cuba</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the system presented by team uhKD4 at the IberLEF eHealth Knowledge Discovery Challenge 2021. The challenge proposes two tasks devoted to extracting the semantic meaning of mainly health-related sentences in Spanish: Task A (entity recognition) and Task B (relation extraction). Performing both tasks in sequence constitutes the main evaluation scenario of the challenge. The system is built upon two independent deep-learning architectures, one for each task. Task A is addressed as a sequence labelling problem with a model that uses Long Short-Term Memory layers to encode context information and a linear-chain Conditional Random Field as tag decoder. Task B is approached as a multi-class classification problem using a Convolutional Neural Network that consists mainly of convolutional layers to recognize n-grams, pooling layers to determine the most relevant features, and a logistic regression layer at the end to perform classification. The system obtained the fourth position in the main evaluation scenario of the competition. In the individual evaluation of the tasks, the model for Task A showed average results, while the Task B model reached the third position.</p>
      </abstract>
      <kwd-group>
        <kwd>eHealth</kwd>
        <kwd>Knowledge Discovery</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Information Extraction</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Relation Extraction</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper describes the solution submitted by team uhKD4
to the IberLEF eHealth Knowledge Discovery Challenge 2021. The challenge
proposes two tasks devoted to extracting the semantic meaning of mainly
health-related sentences in Spanish: Task A (entity recognition) aims to
identify all the entities in a document together with their types, and Task B
(relation extraction) seeks to recognize all relevant semantic relationships
between the recognized entities. Performing both tasks in sequence constitutes
the main evaluation scenario of the challenge [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        The proposed system consists of two independent components, one for each
task. To solve the named entity recognition (NER) problem associated with
Task A, we present a model that uses Long Short-Term Memory (LSTM) layers
to encode context information, motivated by the remarkable results that this
kind of layer has achieved in modelling sequential data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. On top of the LSTM layers we
add a dense layer and a Conditional Random Field (CRF) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] layer. CRFs have
been widely used as tag decoders that take the context-dependent representations
and produce a sequence of tags corresponding to the input sequence [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The
relation extraction (RE) problem framed in Task B is approached using a
Convolutional Neural Network (CNN) that consists mainly of convolutional layers
to recognize n-grams, pooling layers to determine the most relevant features,
and a fully connected network with a softmax at the end to perform
classification [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The rest of this paper is organized as follows. Section 2 describes in detail the
architectures used by the system. Section 3 presents the official results achieved
in each scenario of the challenge. Section 4 shares some insights derived from
experimentation. Finally, Section 5 states the conclusions and recommendations
for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>System Description</title>
      <p>
        Our system is built upon two independent deep-learning architectures.
Accordingly, two different models are defined and each task is carried out
separately. Task A is approached as a sequence labelling problem in which each
token of an input sequence is assigned a label that combines a tag from the
BILUOV entity tagging scheme with one of the possible entity types.
The BILUOV tags are: Begin, for the start
of an entity; Inner, for its continuation; Last, for its end;
Unit, for single-word entities; Other, for words that are not
part of any entity; and oVerlapping, for words that belong to
multiple entities [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example, in the sentence "El cáncer de la cavidad nasal y
de los senos paranasales no es común", each word should be labelled as stated
in parentheses: El (O) cáncer (V-Concept) de (I-Concept) la (I-Concept)
cavidad (I-Concept) nasal (L-Concept) y (O) de (I-Concept) los (I-Concept)
senos (I-Concept) paranasales (L-Concept) no (O) es (O) común (U-Concept).
Thus, the model outputs 21 different labels: the O label plus every
combination of the remaining tags (B, I, L, U, V) with the entity types (Concept,
Action, Predicate and Reference). Task B is approached as
a multi-class classification problem in which, given a sentence and a highlighted
pair of entities, one of the predefined relations is assigned from the first
entity toward the second one. An artificial relation class, none, is defined to
represent the absence of any relation between a pair of entities.
      </p>
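      <p>As an illustration of the label space used for Task A, the following minimal Python sketch enumerates the O label and the tag/type combinations. The names TAGS, ENTITY_TYPES, build_labels and LABEL_TO_ID are assumptions made for this example and are not taken from the system's code.</p>
      <preformat><![CDATA[
```python
# Illustrative sketch only: all names below are assumptions for this example.
TAGS = ["B", "I", "L", "U", "V"]  # BILUOV tags without the O tag
ENTITY_TYPES = ["Concept", "Action", "Predicate", "Reference"]


def build_labels():
    """Return the O label followed by every tag/type combination."""
    return ["O"] + [f"{tag}-{etype}" for tag in TAGS for etype in ENTITY_TYPES]


LABELS = build_labels()
# Integer ids of the kind a sequence labelling model would predict.
LABEL_TO_ID = {label: idx for idx, label in enumerate(LABELS)}
```
]]></preformat>
      <p>With five non-O tags and four entity types, the list contains 1 + 5 x 4 = 21 labels, which matches the number of output labels stated above.</p>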
      <sec id="sec-2-1">
        <title>Preprocessing</title>
        <p>
          The initial step to extract useful information from the raw input text is
the tokenization of each sentence, since both tasks require the analysis
of the sequence of words in the sentence. A fixed sentence length is
defined as a parameter of the models, and each sequence of tokens is trimmed or
padded to fit that length. The features considered to obtain the input
representation for each model are listed below; some are common to both tasks
and others are task-specific.
        </p>
        <list list-type="bullet">
          <list-item>
            <p>Word embedding: Pre-trained word2vec [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] embeddings of dimensionality 300, trained on the Spanish Billion Words
Corpus with the skip-gram model with negative sampling. The
weights are kept unchanged during the training phase.</p>
          </list-item>
          <list-item>
            <p>POS-tag embedding: Embedding that encodes the part-of-speech tag of the token.</p>
          </list-item>
          <list-item>
            <p>Character representation: Every token is trimmed or padded so that
all tokens have the same predefined number of characters. An embedding layer
maps each character of a word, drawn from the ASCII letters, digits and
punctuation symbols, to a vector. The resulting vectors are fed into a
Bidirectional Long Short-Term Memory (BiLSTM) network to obtain a
character-level representation of the token.</p>
          </list-item>
        </list>
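        <p>The trimming and padding steps can be sketched in plain Python as follows. This is only an illustration: the lengths MAX_TOKENS and MAX_CHARS, the padding token and the helper names are assumptions, and reserving index 0 for both padding and out-of-vocabulary characters is a simplification chosen for this example.</p>
        <preformat><![CDATA[
```python
import string

# Assumed values for illustration; the real lengths are model hyperparameters.
MAX_TOKENS = 6
MAX_CHARS = 5
PAD_TOKEN = "[PAD]"

# Character vocabulary: ASCII letters, digits and punctuation symbols.
# Index 0 is reserved for padding and for characters outside the vocabulary.
CHAR_VOCAB = string.ascii_letters + string.digits + string.punctuation
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(CHAR_VOCAB)}


def fit_length(items, length, pad):
    """Trim or right-pad a list so that it has exactly `length` items."""
    return list(items[:length]) + [pad] * max(0, length - len(items))


def encode_chars(token):
    """Map a token to a fixed-length list of character ids."""
    ids = [CHAR_TO_ID.get(ch, 0) for ch in token]
    return fit_length(ids, MAX_CHARS, 0)


def encode_sentence(tokens):
    """Return the padded token list and its character-id matrix."""
    padded = fit_length(tokens, MAX_TOKENS, PAD_TOKEN)
    chars = [encode_chars("" if tok == PAD_TOKEN else tok) for tok in padded]
    return padded, chars
```
]]></preformat>
        <p>For instance, encode_sentence(["El", "cancer"]) returns six tokens and a 6 x 5 matrix of character ids, which is the kind of fixed-shape input an embedding layer expects.</p>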
      </sec>
      <sec id="sec-2-3">
        <title>Task B</title>
        <p>The features above are grouped under Common and Task A. Those listed for Task B are the following:</p>
        <list list-type="bullet">
          <list-item>
            <p>BILUOV and entity type embedding: Embedding that encodes
the label of each word, given by the combination of its BILUOV tag
and its entity type.</p>
          </list-item>
          <list-item>
            <p>Position embeddings: Embeddings that encode the relative distance
between each word and the two target entities in the sentence. For a
multi-word entity, the distance to its first word is used.</p>
          </list-item>
        </list>
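        <p>The relative distances behind the position embeddings can be sketched as follows. The function names and the shift that turns signed distances into non-negative embedding indices are assumptions made for this example; the text only states that the distance to the first word of each target entity is used.</p>
        <preformat><![CDATA[
```python
# Illustrative sketch: function names and the index shift are assumptions.

def relative_positions(num_tokens, entity_start):
    """Signed distance from each token to the first token of an entity."""
    return [i - entity_start for i in range(num_tokens)]


def to_embedding_index(distance, max_len):
    """Shift a distance in [-(max_len - 1), max_len - 1] to a non-negative index."""
    return distance + max_len - 1


def position_features(num_tokens, first_start, second_start, max_len):
    """One pair of embedding indices per token, one index per target entity."""
    first = relative_positions(num_tokens, first_start)
    second = relative_positions(num_tokens, second_start)
    return [(to_embedding_index(a, max_len), to_embedding_index(b, max_len))
            for a, b in zip(first, second)]
```
]]></preformat>
        <p>For a six-token sentence whose target entities start at token indices 1 and 4, relative_positions gives the distances -1 to 4 for the first entity and -4 to 1 for the second one.</p>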
      </sec>
      <sec id="sec-2-4">
        <title>Named Entity Recognition Model</title>
        <p>Figure 1 shows the architecture of the model. As stated in the previous
subsection, the input of the model is a sequence of tokens, each one represented
as the concatenation of its word embedding, its POS-tag embedding and its
character-level features. The sequence of word vectors
is processed in both directions by a BiLSTM layer, and the features extracted
by the forward and backward passes are concatenated. Combining both
directions increases the amount of context available to the
network (e.g. knowing which words immediately precede and follow a given
word in the sentence). Afterward, the
sequence is processed by a unidirectional LSTM layer to extract the most important
features. Finally, a dense layer with a linear activation function followed by
a linear-chain CRF is used to output the most probable sequence of labels
for the tokens. The CRF layer uses sentence-level tag information
to add constraints to the final predicted labels so that they form a valid sequence.
These constraints are learned automatically from the dataset during
training.</p>
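        <p>To illustrate how transition scores constrain the predicted label sequence, the following sketch implements generic Viterbi decoding for a linear-chain model in plain Python. It is not the decoding code of the CRF layer used by the system; the function name, the three-tag label set and all scores are made up for this example.</p>
        <preformat><![CDATA[
```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence of a linear-chain model.

    emissions[t][k]   : score of tag k at position t
    transitions[j][k] : score of moving from tag j to tag k
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for k in range(num_tags):
            # Best previous tag for ending in tag k at this position.
            best_prev = max(range(num_tags),
                            key=lambda j: scores[j] + transitions[j][k])
            step_back.append(best_prev)
            step_scores.append(scores[best_prev] + transitions[best_prev][k]
                               + emission[k])
        scores = step_scores
        backpointers.append(step_back)
    # Follow the back-pointers from the best final tag.
    path = [max(range(num_tags), key=lambda k: scores[k])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path


# Hypothetical tags and scores: O -> L and B -> O are heavily penalised.
TAG_NAMES = ["O", "B", "L"]
TRANSITIONS = [[0.0, 0.0, -10.0],
               [-10.0, -10.0, 0.0],
               [0.0, 0.0, -10.0]]
EMISSIONS = [[1.0, 0.8, 0.0],
             [0.0, 0.0, 2.0]]
DECODED = [TAG_NAMES[k] for k in viterbi(EMISSIONS, TRANSITIONS)]
```
]]></preformat>
        <p>Choosing the best tag for each token independently would give the invalid sequence O, L in this toy case, whereas the decoded sequence is B, L because the transition scores penalise an L tag that follows an O tag.</p>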
        <p>
          Since the goal is to recognize entities of only four types, whereas the CRF
layer outputs one of 21 labels per token, a subsequent phase that decodes the
output of the CRF layer is needed. This transformation
is similar to the process described by team UH-MatCom
in the previous edition of the challenge [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The process is accomplished in two
steps. First, rules are used to discover the entities that share overlapping
words or are not formed by contiguous words. Then, the remaining entities are
assumed to be contiguous sequences of tokens and are detected iteratively.
        </p>
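        <p>The second decoding step, which collects contiguous entities, could look like the sketch below. It is an assumption-laden illustration rather than the procedure of the system: the function name is hypothetical, and tokens tagged as overlapping (V) are simply skipped here on the assumption that the earlier rule-based step has already handled them.</p>
        <preformat><![CDATA[
```python
def decode_contiguous(tokens, labels):
    """Collect (entity_text, entity_type) pairs from B/I/L/U labels.

    Tokens labelled O or V, and tags that do not continue the current
    entity, close any partially built entity without emitting it.
    """
    entities, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        tag, _, etype = label.partition("-")
        if tag == "U":
            entities.append((token, etype))
            current, current_type = [], None
        elif tag == "B":
            current, current_type = [token], etype
        elif tag == "I" and current and etype == current_type:
            current.append(token)
        elif tag == "L" and current and etype == current_type:
            current.append(token)
            entities.append((" ".join(current), etype))
            current, current_type = [], None
        else:
            current, current_type = [], None
    return entities
```
]]></preformat>
        <p>For example, the tokens la, cavidad, nasal, no, es, comun with labels O, B-Concept, L-Concept, O, O, U-Concept yield the two Concept entities "cavidad nasal" and "comun".</p>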
      </sec>
      <sec id="sec-2-5">
        <title>Relation Extraction Model</title>
        <p>The architecture defined for Task B is shown in Figure 2. The relation extraction
system is provided only with raw sentences marked with the positions of the two
entities of interest and the type of each one. Exploiting the
elements that can be derived from that input, each relation mention is
represented by a matrix <inline-formula><tex-math>X = [w_1, w_2, \ldots, w_n]</tex-math></inline-formula>, where <inline-formula><tex-math>n</tex-math></inline-formula> is the fixed sentence
length and <inline-formula><tex-math>w_i</tex-math></inline-formula> is the concatenation, for the <inline-formula><tex-math>i</tex-math></inline-formula>-th token, of the embeddings
described before.</p>
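        <p>
          The matrix <inline-formula><tex-math>X</tex-math></inline-formula> is processed by the convolutional layer in order to extract
high-level features. A filter with window size <inline-formula><tex-math>s</tex-math></inline-formula> can be denoted as <inline-formula><tex-math>F = [f_1, f_2, \ldots, f_s]</tex-math></inline-formula>.
Applying the convolution operation to the two matrices <inline-formula><tex-math>X</tex-math></inline-formula> and <inline-formula><tex-math>F</tex-math></inline-formula> produces a
score sequence <inline-formula><tex-math>T = [t_1, t_2, \ldots, t_{n-s+1}]</tex-math></inline-formula>:
          <disp-formula>
            <tex-math>t_i = g\left(\sum_{j=1}^{s} f_j^{\top} w_{i+j-1} + b\right), \qquad 1 \le i \le n - s + 1,</tex-math>
          </disp-formula>
where <inline-formula><tex-math>g</tex-math></inline-formula> is some non-linear function and <inline-formula><tex-math>b</tex-math></inline-formula> is a bias term. This process is
replicated for various filters with distinct window sizes to explore the contribution
of different n-grams. Then, a pooling layer is applied to aggregate the scores
of each filter, ensuring invariance to the absolute positions while retaining the
relative positions among the n-grams and the entities. Specifically, a global max
pooling layer is used to aggressively summarize the most relevant
features of each score sequence. Dropout is applied to the resulting feature
vector for regularization, and the vector is then fed into a fully connected layer
followed by a softmax layer that carries out the classification [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>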
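        <p>The convolution and global max pooling steps can be made concrete with a small plain-Python sketch. It assumes that each score is obtained by applying g to the sum of dot products between the s filter rows and the s consecutive token vectors starting at position i, plus the bias b, following the usual formulation of this kind of network. The function names, the choice of tanh as default non-linearity and the toy matrices are assumptions for illustration only.</p>
        <preformat><![CDATA[
```python
import math


def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def convolve(X, F, b=0.0, g=math.tanh):
    """Score sequence with one score per window of len(F) consecutive tokens."""
    s, n = len(F), len(X)
    return [g(sum(dot(F[j], X[i + j]) for j in range(s)) + b)
            for i in range(n - s + 1)]


def global_max_pool(T):
    """Keep only the largest score of a filter, wherever it occurs."""
    return max(T)


def softmax(z):
    """Turn raw class scores into a probability distribution."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Toy input: n = 4 tokens with 2-dimensional vectors, one filter with s = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
F = [[1.0, 0.0], [0.0, 1.0]]
T = convolve(X, F, g=lambda v: v)  # identity non-linearity for readability
FEATURE = global_max_pool(T)
```
]]></preformat>
        <p>The score sequence has n - s + 1 = 3 entries, and global max pooling reduces it to a single feature per filter. In a full model, the features of all filters would be concatenated before the fully connected and softmax layers.</p>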
      </sec>
      <sec id="sec-2-6">
        <title>Hyperparameters Setup</title>
        <p>Tables 1 and 2 show the selected hyperparameters for the NER and RE
models, respectively. In both tables, the top section lists the configuration of
the input handling, the middle section covers the rest of the
network, and the bottom section lists the training hyperparameters.</p>
        <p>The hyperparameter tuning process was carried out manually, taking as a
starting point settings that have shown a positive impact in past works
involving similar architectures. The provided development collection was used
as the validation dataset. The number of epochs was selected according to the
behaviour of the training curves.</p>
      </sec>
      <sec id="sec-2-7">
        <title>Training</title>
        <p>The systems were implemented in Python
with the Keras framework (v2.2.4) and TensorFlow (v1.13.1) as backend. The
NER model uses the keras-contrib (v0.0.2) implementation of the CRF
layer. Tokenization and POS tags were obtained with the es_core_news_md model
of the Python library spaCy (v3.0.6).</p>
        <p>The training collection provided for the challenge was the only data used to
train both models. Training was carried out on a machine with a 4-core AMD
A10-8700P CPU at 1.80 GHz and 16 GB of memory. Training took close to
8 hours for the NER model and a little
more than 2 hours for the RE model.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
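      <p>In the second evaluation scenario, concerning entity recognition (Task A), our system shows its least
promising results of all scenarios, ranking fifth with an F1 score of 0.527, as shown
in Table 4. In contrast, for the relation extraction task the system achieves an F1 score of 0.318 and
reaches the third position; these results are
presented in Table 5.</p>
      <table-wrap>
        <label>Table 5</label>
        <caption>
          <p>F1 scores for the relation extraction task (Task B).</p>
        </caption>
        <table>
          <thead>
            <tr><th>Team</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>IXA</td><td>0.430</td></tr>
            <tr><td>Vicomtech</td><td>0.372</td></tr>
            <tr><td>uhKD4</td><td>0.318</td></tr>
            <tr><td>PUCRJ-PUCPR-UFMG</td><td>0.263</td></tr>
            <tr><td>UH-MMM</td><td>0.054</td></tr>
            <tr><td>Codestrange</td><td>0.033</td></tr>
            <tr><td>baseline</td><td>0.033</td></tr>
            <tr><td>JAD</td><td>0.007</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We would like to highlight the relevance of the features used in both models. In
particular, the NER model using only the pre-trained word embeddings showed
poor results, while adding the POS-tag and character information
provided a significant boost in performance.</p>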
      <p>
        Regarding the RE task, a major issue to overcome is data scarcity:
the number of entity pairs without a relation is usually much larger than the number of pairs that hold
a relation, which leads to a highly unbalanced dataset and has a negative
impact on the performance of the models. To mitigate this problem, we enriched the
input representation with BILUOV tags and entity type information, in order to
capture patterns in the way entities appear in a sentence that may help
to discriminate between positive and negative instances. Adding
tag system information has been explored before in an architecture
similar to ours, with good results [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Our experiments showed that
incorporating those features had a strong influence on performance, as we
expected.
      </p>
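      <p>The source of the imbalance can be seen in a small sketch that enumerates every ordered pair of entities in a sentence and assigns the none class to pairs without an annotated relation. The function name, the entity identifiers and the relation name are hypothetical placeholders, not data from the challenge corpus.</p>
      <preformat><![CDATA[
```python
from itertools import permutations


def candidate_pairs(entities, gold_relations):
    """Label every ordered entity pair; unannotated pairs get "none"."""
    return [(a, b, gold_relations.get((a, b), "none"))
            for a, b in permutations(entities, 2)]


# Hypothetical sentence with four entities and one annotated relation.
ENTITIES = ["e1", "e2", "e3", "e4"]
GOLD = {("e1", "e2"): "some-relation"}
INSTANCES = candidate_pairs(ENTITIES, GOLD)
NUM_NONE = sum(1 for _, _, rel in INSTANCES if rel == "none")
```
]]></preformat>
      <p>With four entities there are 4 x 3 = 12 ordered pairs, of which 11 fall into the none class in this toy case, so negative instances dominate even in a short sentence.</p>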
      <p>Regarding the architecture of the RE model, it is also worth mentioning that
we experimented with both regular max pooling layers and global max pooling layers, and better results
were achieved with the latter.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>This paper described the system proposed by team uhKD4 at the IberLEF
eHealth Knowledge Discovery Challenge 2021. Two independent
deep-learning models were defined, one for each task of the competition. Task A is solved
as a sequence labelling problem by a model that uses pre-trained word2vec
embeddings along with syntactic features as the input representation, which is
then processed by LSTM and CRF layers. Task B is approached as a
multi-class classification problem. In this case, besides the pre-trained word embeddings
and syntactic features, the model also uses information from the BILUOV tags and the
relative distance to the highlighted entities. A CNN with filters of multiple
window sizes and a logistic regression layer at the end then performs the classification.</p>
      <p>The system obtained the fourth position in the main evaluation scenario of
the competition. In the individual tasks the NER model showed average results
while the RE model reached the third position.</p>
      <p>
        As future work, we propose to consider the use of domain-specific
features and external sources of knowledge. We also propose to explore the use
of contextual embeddings, such as Bidirectional Encoder Representations from
Transformers (BERT) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Consuegra-Ayala</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palomar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>UH-MatCom at eHealth-KD Challenge 2020: Deep-Learning and Ensemble Models for Knowledge Discovery in Spanish Documents</article-title>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding (</article-title>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>F.C.N.</given-names>
          </string-name>
          :
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In: Proceedings of the Eighteenth International Conference on Machine Learning</source>
          . pp.
          <fpage>282</fpage>
          –
          <lpage>289</lpage>
          . ICML '01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A survey on deep learning for named entity recognition (</article-title>
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality (</article-title>
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grishman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Relation extraction: Perspective from convolutional neural networks</article-title>
          . pp.
          <fpage>39</fpage>
          –
          <lpage>48</lpage>
          (
          <month>01</month>
          <year>2015</year>
          ). https://doi.org/10.3115/v1/W15-1506
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Piad-Morffis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Estevez-Velarde</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Almeida-Cruz</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muñoz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montoyo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2021</article-title>
          .
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (
          <issue>0</issue>
          ) (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Exploiting entity bio tag embeddings and multi-task learning for relation extraction with imbalanced data (</article-title>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>