=Paper=
{{Paper
|id=Vol-2943/ehealth_paper4
|storemode=property
|title=UH-MMM at eHealth-KD Challenge 2021
|pdfUrl=https://ceur-ws.org/Vol-2943/ehealth_paper4.pdf
|volume=Vol-2943
|authors=Loraine Monteagudo-García,Amanda Marrero-Santos,Manuel Santiago Fernández-Arias,Hian Cañizares-Díaz
|dblpUrl=https://dblp.org/rec/conf/sepln/Monteagudo-Garcia21
}}
==UH-MMM at eHealth-KD Challenge 2021==
UH-MMM at eHealth-KD Challenge 2021

Loraine Monteagudo-García, Amanda Marrero-Santos, Manuel Santiago Fernández-Arias, and Hian Cañizares-Díaz

Faculty of Math and Computer Science, University of Havana, Havana, Cuba

Abstract. This paper describes the solution presented by the UH-MMM team to the eHealth-KD challenge at IberLEF 2021. Two main subtasks for knowledge discovery were defined: entity recognition and relation extraction. The evaluation of the task is divided into three scenarios: one corresponding to the detection of entities, one corresponding to the detection of relations between pairs of such entities, and a third one corresponding to the extraction of both entities and relations. For both subtasks, our proposal uses BiLSTMs as contextual encoders and Dense layers as the tag-decoder architecture of the model. In the challenge, the system ranked fifth in the main scenario, fourth in the scenario evaluating the first task, and fifth in the last scenario. The score obtained in the relation extraction task shows that the proposed approach needs to be explored further.

Keywords: eHealth · Knowledge Discovery · Natural Language Processing · Machine Learning · Deep Learning · Named Entity Recognition · Relation Extraction

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

This paper explains the solution presented by the UH-MMM team in the eHealth-KD challenge at IberLEF 2021 [4]. The challenge proposes models of human language that make electronic health documents machine-readable from a semantic point of view. It is divided into two tasks: Task A for entity recognition and Task B for the extraction of semantic relations between pairs of such entities. The evaluation was also divided into three scenarios: key-phrase identification and classification to evaluate Task A, relation extraction to evaluate Task B, and full knowledge extraction to evaluate both tasks. The 2021 edition of eHealth-KD includes one significant addition with respect to previous editions: a small selection of sentences from different domains and languages (e.g., English) to encourage cross-domain and multilingual approaches.

Our solution for both tasks is based on Recurrent Neural Networks (RNNs) or, more precisely, Bidirectional Long Short-Term Memory (BiLSTM) networks as contextual encoders and Dense layers as the tag-decoder architecture of the model. This architecture was chosen because of the sequential structure of the input and because it is widely used in the literature for addressing the Named Entity Recognition (NER) problem. The system makes use of POS-tag (Part-of-Speech tag) information, dependency relations, and character-level representations, as well as contextual embeddings. The Relation Extraction (RE) task is addressed in a pairwise-query fashion, encoding the information about the sentence and the given pair of entities using syntactic structures derived from the dependency parse tree. In addition, a special relation type was used to encode the relationship between non-related pairs of entities.

The rest of the paper is organized as follows. Section 2 explains the proposed model in detail. The results of the model in the several scenarios evaluated during the eHealth-KD 2021 event are presented in Section 3. Section 4 discusses some insights derived from the performance of each of our runs. Finally, the conclusions and some recommendations for future work are given in Section 5.
2 System Description

The proposed solution solves both tasks separately and sequentially. Thus, independent models with different architectures and features were trained to solve the NER and RE problems. The main distinction between the two architectures arises from the type of problem they solve. The first task is posed as a tag-prediction problem that takes the raw text of a sentence as input and outputs two independent tag sequences: one in the BILOUV tag scheme for entity prediction and another with the tags corresponding to each entity type (Concept, Action, Reference, Predicate). The BILOUV tags are: Begin, for the start of an entity; Inner, for tokens in the middle; Last, for the ending token; Unit, to represent single-token entities; Other, to represent tokens that do not belong to any entity; and the oVerlapping tag, used to deal with tokens that belong to multiple entities. On the other hand, the second task is addressed as a series of pairwise queries among the entities present in the target sentence, oriented towards identifying the relevant relations between the previously extracted entities.

Taking into account the multilingual characteristics of the task, the extraction of syntactic features is handled in two phases. In the first one, the input sentence is classified by its language using a FastText pre-trained model for language identification [2, 3]. Afterward, in the second phase, one of two spaCy (https://spacy.io/) models is used depending on the sentence's language (es_core_news_sm for Spanish and en_core_web_sm for English). These models were used to extract features such as the POS tag, the dependency parse tree, and the dependency tag.

2.1 The Entity Recognition Model

The goal of the entity recognition task is to identify and classify key phrases in biomedical texts. Key phrases are considered to be all entities (single-word or multi-word) that represent semantically relevant elements in a sentence. There are four potential classes corresponding to the entity types:

– Concept: any element of the sentence that has a semantic meaning of its own.
– Action: a concept that describes a transformation or modification of the state of one or more concepts present in the sentence.
– Predicate: a concept that describes the subset of elements of a domain that meets a certain condition.
– Reference: a concept that exists (in the corpus) but is not defined in the context (in the sentence).

The NER model receives the sentence as a sequence of words. For each word, the features described in the next subsection are extracted and vectorized. The output of the model consists of two independent tag sequences: one in the BILOUV tag scheme for entity prediction and another with the tags corresponding to entity types for classification purposes.

Input handling. Given the input sentence as raw text, some preprocessing is done to obtain a useful structure. Since the model makes use of word-piece information, the target sentence is tokenized first. To obtain a representation of the sentence, the model makes use of the following features for each word (a sketch of the extraction process follows the list):

– Dependency tag: the dependency relationship between the head token and its child token.
– POS tag: the Part-of-Speech tag of the token.
– Lemma: the base form of the token, with no inflectional suffixes.
– Character representation: encodes each character of the token, assigning an integer value according to its index in a vocabulary obtained from the training set. Padding is added at the end to ensure all words have the same number of characters.
– Word embedding of the token: we considered three alternative word-embedding models:
  • BERT [1]: contextual embeddings with no further fine-tuning; the BERT multilingual base model (cased) was used. BERT provides its own tokenizer, but its incompatibility with the rest of the implemented system made several modifications necessary. We decided not to use its tokenization algorithm and instead fed the tokens produced by spaCy to the encoder provided by BERT.
  • FastText Spanish Medical Embeddings [6]: these embeddings were generated from Spanish corpora that include (a) the full texts in Spanish available on SciELO.org (until December 2018), (b) all articles from the Wikipedia categories Pharmacology, Pharmacy, Medicine, and Biology (during December 2018), and (c) the concatenation of the two previous corpora. Furthermore, for each of these datasets, two different models were trained using CBOW (continuous bag-of-words) and SkipGram representations, and each of these architectures was built with cased and uncased words, for a total of 12 pre-trained embeddings. All these models were tested for this task, and SciELO SkipGram Uncased gave the best results.
  • Character embeddings trained on the training set, using the character representation feature as input.
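As a rough illustration, the following is a minimal sketch of the two-phase feature extraction described above: FastText language identification followed by a language-specific spaCy pipeline that yields the per-token dependency tag, POS tag, and lemma. It assumes the FastText language-identification model has been downloaded locally as lid.176.bin and that both spaCy models are installed; the helper name is illustrative, not our exact implementation.

```python
# Minimal sketch of the two-phase syntactic feature extraction: FastText
# language identification first, then a language-specific spaCy pipeline.
# Assumes "lid.176.bin" (FastText LID model) and both spaCy models are
# available locally; the helper name is an illustrative assumption.
import fasttext
import spacy

lang_id = fasttext.load_model("lid.176.bin")
pipelines = {"es": spacy.load("es_core_news_sm"),
             "en": spacy.load("en_core_web_sm")}

def token_features(sentence: str):
    # Phase 1: predict the language label, e.g. "__label__es".
    labels, _ = lang_id.predict(sentence)
    lang = labels[0].replace("__label__", "")
    # Phase 2: run the matching spaCy model (defaulting to Spanish).
    doc = pipelines.get(lang, pipelines["es"])(sentence)
    # Per-token features used by the NER model: dependency tag, POS tag, lemma.
    return [(tok.text, tok.dep_, tok.pos_, tok.lemma_) for tok in doc]

print(token_features("El asma es una enfermedad crónica."))
```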
Architecture. In the first instance, we use character-level information to capture morphological dependencies of the token. With this information, a vector can be formed for every single word, even out-of-vocabulary ones. This component takes as input a character representation consisting of a sequence of characters encoded as numbers. The character representation is passed to an Embedding layer whose output is processed by an LSTM layer.

Then the syntactic features (the dependency tag, the POS tag, and the lemma) are vectorized. For each token in the input sequence, these features are concatenated with the previously computed character-level representation and, optionally, one of the pre-trained word embeddings. These vectors are processed by two sequential Bi-LSTM layers to produce a sequence of vectors that encode the tokens in the input sentence. The output of the last Bi-LSTM layer is passed as input to two Dense layers. The first Dense layer produces a sequence in the BILOUV tag scheme. The second Dense layer generates a tag for each entity type: Concept, Action, Predicate, Reference.

The model is trained for 10 epochs on the training dataset, which contains 1,500 sentences. The final model had a total of 1,073,767 trainable parameters. Adam optimization is used with the default learning rate of 0.001. The loss function is categorical cross-entropy, as in most multi-class classification problems. The first LSTM, which processes the character input, has 20 units and a recurrent dropout of 0.5. The two Bi-LSTMs have 32 units and recurrent dropouts of 0.1 and 0.2, respectively. Both Dense layers use a Softmax activation function. A summary of the NER model architecture is provided in Figure 1.

[Fig. 1. NER model architecture]
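The description above translates fairly directly into Keras. The following is a minimal sketch, not our exact implementation: layer sizes, dropouts, and the two softmax heads follow the figures reported above, while the vocabulary sizes, sequence lengths, and the dimension of the vectorized syntactic features are illustrative assumptions.

```python
# Minimal Keras sketch of the NER architecture: char-level LSTM, feature
# concatenation, two stacked Bi-LSTMs, and two softmax tag heads. Vocabulary
# sizes, sequence lengths, and feature dimensions are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

MAX_TOKENS, MAX_CHARS, CHAR_VOCAB = 50, 20, 100
SYNTAX_DIM = 64                      # vectorized dep tag + POS tag + lemma
N_BILOUV, N_TYPES = 7, 5             # BILOUV (+padding) and entity-type tags

chars = layers.Input((MAX_TOKENS, MAX_CHARS), dtype="int32", name="chars")
syntax = layers.Input((MAX_TOKENS, SYNTAX_DIM), name="syntax")

# Character-level representation: embedding followed by a 20-unit LSTM
# applied independently to each token's character sequence.
char_emb = layers.Embedding(CHAR_VOCAB, 16)(chars)
char_repr = layers.TimeDistributed(
    layers.LSTM(20, recurrent_dropout=0.5))(char_emb)

# Concatenate per-token features and encode with two stacked Bi-LSTMs.
x = layers.Concatenate()([char_repr, syntax])
x = layers.Bidirectional(
    layers.LSTM(32, recurrent_dropout=0.1, return_sequences=True))(x)
x = layers.Bidirectional(
    layers.LSTM(32, recurrent_dropout=0.2, return_sequences=True))(x)

# Two independent softmax heads: BILOUV tags and entity types.
bilouv = layers.Dense(N_BILOUV, activation="softmax", name="bilouv")(x)
etypes = layers.Dense(N_TYPES, activation="softmax", name="entity_type")(x)

model = tf.keras.Model([chars, syntax], [bilouv, etypes])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="categorical_crossentropy")
model.summary()
```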
Output handling. The sequence of BILOUV tags and entity types produced by the two Dense layers is processed to obtain the list of entities expected as output for Task A. There is an important challenge in this process: the tokens belonging to an entity are not necessarily contiguous in the sentence. Taking this into account, decoding is handled in two phases, based on the methodology described by UH-MAJA-KD in the previous edition of the challenge [5]. First, two classes of discontinuous entities are extracted: one corresponding to entities that share their initial tokens and the other referring to those that share their final tokens. Entities of the former class are extracted with the regular expression (VO*)+((I|O)*L)+ over the tag sequence, while the expression (B(I|O)*)+(O*V)+ matches the latter. Afterward, the second phase starts, assuming all the remaining entities appear as contiguous sequences of tokens. To extract contiguous entities, an iterative process is carried out over the tag sequence produced by the model, assuming that the maximum overlapping depth is 2.
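For concreteness, here is a minimal sketch of the first decoding phase under simplified assumptions: each token carries exactly one BILOUV tag, and we only recover the token indices covered by each regular-expression match (the distribution of shared tokens among the overlapping entities is omitted).

```python
# Sketch of the first decoding phase: locating the two classes of
# discontinuous entities over the BILOUV tag string with the regular
# expressions given above. Entity assembly details are simplified.
import re

SHARED_START = re.compile(r"(VO*)+((I|O)*L)+")  # entities sharing initial tokens
SHARED_END = re.compile(r"(B(I|O)*)+(O*V)+")    # entities sharing final tokens

def discontinuous_spans(tags):
    # tags: per-token BILOUV labels, e.g. ["V", "O", "I", "L"].
    text = "".join(tags)
    spans = []
    for pattern in (SHARED_START, SHARED_END):
        for m in pattern.finditer(text):
            # Indices of the non-Other tokens covered by the match; these
            # tokens would then be split among the overlapping entities.
            spans.append([i for i in range(m.start(), m.end())
                          if text[i] != "O"])
    return spans

print(discontinuous_spans(["V", "O", "I", "L", "I", "L", "O"]))
```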
2.2 The Relation Extraction Model

The goal of this subtask is to discover semantic relations between the entities detected and labeled in each sentence. Every semantic relation has a source and a target entity; the relation is therefore directed, that is, the involved entities must match the correct direction. To solve this task, all pairs of entities occurring in the same sentence are presented to the model. The absence of a relation between a pair of entities is modeled with an additional relation type. We therefore used a multi-class approach that enabled us to predict whether a candidate pair belongs to any of the available relation classes. One problem with this approach is that the negative instances (the absence-of-relation type) substantially exceed the positive ones, leading to a skewed class distribution. To mitigate the imbalance of the obtained dataset, we optionally employed a class-oriented weighting scheme and reduced the negative sampling during the training phase. This way, the model gets to "pay more attention" to samples from under-represented classes.

Input handling. For the RE classifier, the following features were used for both the source and target entities presented to the model:

– Entity type: the entity type of the key phrase, according to the label assigned in the previous entity recognition task.
– Dependency tag: the dependency relationship between the head token and its child token.
– POS tag: the Part-of-Speech tag of the token.
– Word embedding of the token: the first two alternatives of the previous model, consisting of pre-trained word-embedding models, were tested:
  • BERT [1]: the BERT multilingual base model (cased) with no further fine-tuning.
  • FastText Spanish Medical Embeddings [6]: as in the previous task, all the different models were tested.

For multi-word entities, the Lowest Common Ancestor (LCA) of the entity's tokens in the dependency parse tree was used as the representative token of the entity, and only its syntactic features were processed. In addition, to determine a possible relation between two entities, the system uses as input structures derived from the dependency parse tree of the target sentence, to obtain information from both the sentence and the entity pair (a sketch follows the list):

– Length of the path: the distance between the source and target entities in the dependency parse tree.
– Dependency path representation: the path between the entities in the dependency parse tree is computed, and every dependency label along it is assigned an integer value. To ensure that all paths have the same number of nodes, padding is added at the end.
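A minimal sketch of how these dependency-tree features can be computed with spaCy follows. The function names are illustrative assumptions, and our actual implementation may differ; only doc.get_lca_matrix(), token.head, token.i, and token.dep_ are real spaCy API.

```python
# Sketch of the dependency-tree features for a candidate entity pair: the
# representative (LCA) token of a multi-word entity, and the labeled path
# between the two representative tokens. Helper names are illustrative.
import spacy

nlp = spacy.load("es_core_news_sm")

def representative_token(doc, indices):
    # Fold spaCy's pairwise LCA matrix over the entity's token indices.
    lca = doc.get_lca_matrix()
    i = indices[0]
    for j in indices[1:]:
        i = lca[i, j]
    return doc[int(i)]

def path_to_root(token):
    path = [token]
    while token.head is not token:   # in spaCy, the root is its own head
        token = token.head
        path.append(token)
    return path

def dependency_path(source, target):
    up, down = path_to_root(source), path_to_root(target)
    pos = {t.i: k for k, t in enumerate(up)}
    # The first token on the target's root path that also lies on the
    # source's path is the lowest common ancestor of the pair.
    j = next(k for k, t in enumerate(down) if t.i in pos)
    tokens = up[:pos[down[j].i] + 1] + list(reversed(down[:j]))
    return [t.dep_ for t in tokens]  # "length of the path" = len of this list

doc = nlp("El asma bronquial afecta las vías respiratorias.")
ent = representative_token(doc, [1, 2])        # "asma bronquial"
print(ent.text, dependency_path(ent, doc[5]))  # path to "vías"
```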
Architecture. The syntactic features (dependency tag, POS tag, length of the dependency path) and the entity type are vectorized. These features, together with one of the word embeddings for each entity of the pair and the dependency path representation, are concatenated. The vectors are then processed by a Bi-LSTM layer to encode the tokens and produce intermediate representations that capture dependencies between pairs of entities. The resulting vector of the Bi-LSTM layer is processed by a final Dense layer, which outputs the most probable type of relation between the involved entities. A summary of the RE model architecture is shown in Figure 2.

[Fig. 2. RE model architecture]

As with the NER model, training is done for 10 epochs on the training dataset, which contains 1,500 sentences. The final model had a total of 1,889,934 trainable parameters. Adam optimization is used with the default learning rate of 0.001. The loss function is categorical cross-entropy. The Bi-LSTM layer has 32 units and a recurrent dropout of 0.1. The Dense layer uses Softmax as its activation function.

2.3 System Training

The training collection provided in the challenge was used to train the models. The development collection was used as an offline test set to evaluate our models and for fine-tuning.

Both models were implemented in the Python programming language, with TensorFlow (v2.3.0) as the deep learning library. BERT contextual embeddings were obtained from the bert-base-multilingual-cased pre-trained model using the torch (v1.8.1) and transformers (v4.5.0) libraries. The FastText (v0.9.2) Python library was used to load the language-identification pre-trained model and the FastText Spanish Medical Embeddings pre-trained models. Tokenization of sentences and extraction of syntactic and semantic features were done using spaCy (v3.0.5).

Training was done on a personal computer with the following specifications: an Intel(R) Core(TM) i5-8250U CPU (8 logical cores) at 1.60 GHz, 8.00 GB of memory, and no GPU available for CUDA. Training the entity model took about 5 minutes, while training the relation model took close to 30 minutes.

3 Results

The evaluation of both tasks was carried out using the annotated corpus proposed in the challenge. The results were measured with the standard F1 measure, as described in the challenge overview. Table 1 presents the official results of the competition, given by the evaluation of Scenario 1. As can be seen, with an overall F1 score of 0.338 our system ranked fifth.

Team               F1       Precision  Recall
Vicomtech          0.53106  0.54075    0.53464
PUCRJ-PUCPR-UFMG   0.52835  0.56849    0.50276
IXA                0.49886  0.46457    0.53863
uhKD4              0.42264  0.48529    0.37431
UH-MMM             0.33865  0.29163    0.40374
Codestrange        0.23201  0.33703    0.17689
baseline           0.23201  0.33703    0.17689
JAD                0.10949  0.23441    0.07143

Table 1: Results of the main scenario, evaluating Tasks A and B.

Tables 2 and 3 show the results of Scenarios 2 and 3, where Tasks A and B were evaluated independently. Our system reached fourth place on the Task A evaluation scenario and, although it achieved fifth place on Scenario 3, its results were considerably lower than those of the fourth place.

Team               F1       Precision  Recall
PUCRJ-PUCPR-UFMG   0.70601  0.71491    0.69733
Vicomtech          0.68413  0.69987    0.74706
IXA                0.65333  0.61372    0.6984
UH-MMM             0.60769  0.54604    0.68503
uhKD4              0.52728  0.51751    0.53743
Yunnan-Deep        0.33406  0.52036    0.24599
baseline           0.30602  0.35034    0.27166
JAD                0.2625   0.31579    0.2246
Yunnan-1           0.17322  0.27107    0.12727
Codestrange        0.08019  0.415      0.04439

Table 2: Results of Scenario 2, evaluating Task A.

Team               F1       Precision  Recall
IXA                0.4304   0.45357    0.40948
Vicomtech          0.37191  0.54186    0.28311
uhKD4              0.31771  0.55623    0.22236
PUCRJ-PUCPR-UFMG   0.26324  0.36659    0.20535
UH-MMM             0.05384  0.07727    0.04131
Codestrange        0.03275  0.4375     0.01701
baseline           0.03275  0.4375     0.01701
JAD                0.00722  0.375      0.00365

Table 3: Results of Scenario 3, evaluating Task B.

4 Discussion

Several models were trained on the training collection and tested on the development collection. For each task, different pre-trained word-embedding models were used: the BERT multilingual model and the FastText Medical Word Embeddings for Spanish. The multilingual nature of the challenge made language-specific embeddings inefficient; thus, no increase in overall F1 was observed when using the FastText embeddings in the first task. The use of BERT did not improve the obtained performance either. As a result, for this task we only used the computed character-level information, and the final submission of our system did not use any of the pre-trained models considered. The results obtained when testing these embeddings on the development set can be seen in Table 4.

Embeddings                     Recall  Precision  F1
SciELO SkipGram Cased          0.7239  0.4671     0.5678
SciELO SkipGram Uncased        0.7086  0.5055     0.5903
SciELO CBOW Cased              0.6932  0.4149     0.5191
SciELO CBOW Uncased            0.6861  0.3981     0.5038
SciELO+Wiki SkipGram Cased     0.6987  0.4387     0.5390
SciELO+Wiki SkipGram Uncased   0.702   0.4292     0.5327
SciELO+Wiki CBOW Uncased       0.6937  0.4013     0.5084
Wiki SkipGram Cased            0.6932  0.4224     0.5249
Wiki SkipGram Uncased          0.6937  0.4661     0.5576
Wiki CBOW Cased                0.7014  0.4062     0.5145
Wiki CBOW Uncased              0.7091  0.3956     0.5079
BERT                           0.6114  0.4736     0.5338
No Embedding                   0.6806  0.5268     0.5939

Table 4: Results of the NER model using different embeddings on the development collection.

In the second task, the FastText and BERT embeddings were also tested. However, incorporating the BERT model made it impossible to complete the Scenario 3 run in time. With one of the FastText embeddings we obtained a slight improvement in this task despite the language constraints; therefore, this was the embedding used in the final submission. The results of testing these embeddings are shown in Table 5.

Embeddings                     Recall   Precision  F1
SciELO SkipGram Cased          0.0376   0.04161    0.03951
SciELO SkipGram Uncased        0.02703  0.05011    0.03511
SciELO CBOW Cased              0.02233  0.06859    0.03369
SciELO CBOW Uncased            0.03173  0.04584    0.03750
SciELO+Wiki SkipGram Cased     0.02115  0.0602     0.0313
SciELO+Wiki SkipGram Uncased   0.04465  0.07211    0.05515
SciELO+Wiki CBOW Uncased       0.0188   0.04051    0.02568
Wiki SkipGram Cased            0.03055  0.0552     0.03933
Wiki SkipGram Uncased          0.05875  0.04822    0.05297
Wiki CBOW Cased                0.0329   0.07254    0.04526
Wiki CBOW Uncased              0.02585  0.06111    0.03633
No Embedding                   0.05288  0.05325    0.05307

Table 5: Results of the RE model using different embeddings on the development collection.

To tackle the class imbalance problem encountered in Task B, we tested two techniques: class weighting and reduced negative sampling. These two techniques improved the system's performance; however, we believe the poor results obtained in this task show that the problem was not completely solved (a sketch of the weighting scheme follows).
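As a rough illustration of the class-oriented weighting scheme mentioned above, the following sketch computes inverse-frequency class weights of the kind that can be passed to Keras training. The relation labels, their counts, and the "NONE" label (the extra class for unrelated pairs) are illustrative assumptions, not figures from our dataset.

```python
# Sketch of inverse-frequency class weighting used to mitigate the skewed
# class distribution in Task B. Labels and counts are illustrative.
import numpy as np

def class_weights(labels):
    # Weight each class inversely proportional to its frequency, so the
    # model "pays more attention" to under-represented relation types.
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y_train = ["NONE"] * 900 + ["is-a"] * 60 + ["causes"] * 40  # toy distribution
print(class_weights(y_train))
# Keras expects integer class indices as keys when passing the result via
# model.fit(..., class_weight=...), so labels would be index-encoded first.
```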
In addition, another possible reason for these results is the lack of richer contextual features.

Finally, regarding the training process, it is worth noting that the training time of the RE model is significantly longer than that of the NER model. This is somewhat expected, since our approach for Task B uses more training examples, defining each pair of entities in a sentence as a training instance. In addition, the computation of the LCA and of the path between each pair of entities is very time-consuming. The long training time of this model was one of the reasons why deeper and more complex architectures were not tested.

5 Conclusions

In this paper, we have described the main characteristics of the model developed for the UH-MMM team's submission to IberLEF's 2021 eHealth Knowledge Discovery shared task, where two main NLP tasks were defined: entity recognition and relation extraction. Three evaluation scenarios involving the combination of these tasks were also defined. Our proposal follows a deep learning approach for both tasks. It is focused on a BiLSTM+Dense neural network where different word embeddings are combined as input to the architecture. For Task A, this network was trained on the annotated dataset provided by the organizers, which was tokenized and tagged using the BILOUV scheme; syntactic and character-based features were used. Task B was addressed in a pairwise-query fashion, encoding information about the involved pair of entities using linguistic and syntactic features derived from the dependency parse tree, and employing a BiLSTM+Dense model.

The system obtained competitive performance on Scenario 2, where it placed fourth. However, our proposal revealed a weakness in the relation extraction task, obtaining fifth place, far behind the fourth-place system. We need to analyze in detail whether the problem lies in the class imbalance or in the lack of richer contextual features. As future work, we propose studying the performance of the model using more contextual and semantic features as input to the neural network, as well as other types of word embeddings. Furthermore, we will try to improve the relation extraction task by implementing another neural network that better captures the relationships between concepts.

References

1. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805
2. Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., Mikolov, T.: FastText.zip: Compressing text classification models (2016)
3. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification (2016)
4. Piad-Morffis, A., Gutiérrez, Y., Estevez-Velarde, S., Almeida-Cruz, Y., Muñoz, R., Montoyo, A.: Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2021. Procesamiento del Lenguaje Natural 67(0) (2021)
5. Rodríguez-Pérez, A., Quevedo-Caballero, E., Mederos-Alvarado, J., Cruz-Linares, R., Consuegra-Ayala, J.P.: UH-MAJA-KD at eHealth-KD Challenge 2020: Deep learning models for knowledge discovery in Spanish eHealth documents (2020)
6. Soares, F., Villegas, M., Gonzalez-Agirre, A., Krallinger, M., Armengol-Estapé, J.: Medical word embeddings for Spanish: Development and evaluation. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 124–133. Association for Computational Linguistics, Minneapolis, Minnesota, USA (Jun 2019). https://doi.org/10.18653/v1/W19-1916, https://www.aclweb.org/anthology/W19-1916