JAD at eHealth-KD Challenge 2021: Simple Neural Network
with BERT for Joint Classification of Key-Phrases and Relations

José Gabriel Navarro Comabella1[0000-0002-6278-2744], Jorge Daniel Valle Diaz1[0000-0003-4781-2930]
and Alberto Helguera Fleitas1[0000-0002-3043-4534]
     1 School of Math and Computer Science, University of Havana, 10200 Havana, Cuba

            tigrejg98@gmail.com;{jorge.valle, alberto.helguera}@estudiantes.matcom.uh.cu


        Abstract. This article presents the design choices and training strategy behind
        the model submitted by the JAD team to the eHealth-KD Challenge 2021. The
        model identifies key-phrases and the relations among them following a predefined
        annotation scheme. It is a simple model that summarizes some parts of a general
        approach to NLP problems. The system is easy to train and test using cloud
        services like Google Colab. It did not perform well in the competition. The
        paper includes possible improvements.

        Keywords: eHealth, Knowledge Discovery, Natural Language Processing,
        Machine Learning, Entity Recognition, Relation Extraction, NLP, Simple, BERT,
        Deep Learning


1       Introduction

This article describes the design choices and training strategy behind the model
presented by the JAD team for the eHealth-KD Challenge 2021 [1]. An annotation scheme
for key-phrases and relations was given, with labelled examples for training and
evaluation of the model used. The challenge (Main Task) was formed by several subtasks:

• Subtask A (Entity recognition): Given a list of eHealth documents written in Span-
  ish, the goal of this subtask is to identify all the entities per document and their
  types. These entities are all the relevant terms (single word or multiple words) that
  represent semantically important elements in a sentence.
• Subtask B (Relation extraction): Subtask B continues from the output of Subtask
  A, by linking the entities detected and labelled in the input document. The purpose
  of this subtask is to recognize all relevant semantic relationships between the enti-
  ties recognized.
Our team presented a system with a fairly simple architecture: a pre-trained
multilingual BERT [2] whose output representation is fed to several dense layers. Our
model is intended to summarize a general approach to Natural Language Processing (NLP)
problems. It is easy to train and test using cloud services like Google Colab.
    The rest of the paper is organized as follows. Section 2 explains the proposed
system in detail. The results of the model in the scenarios evaluated during the
eHealth-KD 2021 event are presented in Section 3. Section 4 briefly analyses matters
of interest related to the development and performance of the model. Finally, the
conclusions of the work are presented in Section 5.

______________________
   IberLEF 2021, September 2021, Málaga, Spain.
   Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons
   License Attribution 4.0 International (CC BY 4.0).


2      System Description

Figure 1 shows a general overview of the proposed system. Our system is based on a
pre-trained multilingual BERT [2] layer, which processes the input sentence and
feeds the generated embeddings to the deep learning model. This model consists of
two dense layers followed by a dropout layer and a dense output layer. It produces a
binary output that can be translated to the desired output format. The loss function
used is binary cross entropy and the metric is binary accuracy.


                                Fig. 1. Model Diagram
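As a rough illustration of the loss and metric named above, the following sketch computes binary cross entropy and binary accuracy for a short vector of predicted probabilities. It is plain Python rather than the Keras implementation used for training; the function names, the 0.5 threshold and the clipping constant are assumptions made for the example.

```python
import math

def binary_cross_entropy(targets, probs, eps=1e-7):
    """Mean binary cross entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumed value
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)

def binary_accuracy(targets, probs, threshold=0.5):
    """Fraction of positions where the thresholded prediction equals the target."""
    hits = sum(1 for t, p in zip(targets, probs) if (p > threshold) == bool(t))
    return hits / len(targets)

# Toy values for illustration only, not model outputs.
targets = [0, 0, 1, 0]
probs = [0.1, 0.2, 0.7, 0.6]
loss = binary_cross_entropy(targets, probs)
acc = binary_accuracy(targets, probs)
```

In this toy case three of the four positions are classified correctly, so the accuracy is 0.75, while the loss also penalizes how confident each prediction is.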


2.1    Output format

The output consists of a simple array/tensor of binary values with a predefined length.
This array may be partitioned into two sub-arrays, the first holding the key-phrase
annotations and the second holding the relation annotations.
 Figure 2 shows the key-phrases array, which can be reinterpreted as a 2D array in
which, for each valid index pair i,j:

• i: Represents the position of the key-phrase label in a fixed-length array of
  custom labels
• j: Represents the position of the word in the sentence, up to a maximum of 100
  words

If the element at (i, j) is true, the word at position j is a key-phrase with the label
at position i.


                                 Fig. 2. Key-phrases array
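The mapping between the flat key-phrases array and the pair (i, j) can be sketched as follows. The label list, the row-major layout and the example sentence are assumptions made for illustration; the text only fixes the limit of 100 word positions and the use of a fixed-length array of custom labels.

```python
MAX_WORDS = 100  # maximum number of word positions per sentence (from the text)
# Assumed label list; the actual custom labels are not listed in the paper.
KEYPHRASE_LABELS = ["Concept", "Action", "Predicate", "Reference"]

def keyphrase_index(label_pos, word_pos):
    """Flat position of the pair (i, j), assuming a row-major layout."""
    return label_pos * MAX_WORDS + word_pos

def decode_keyphrases(flat, words):
    """Return (word, label) pairs for every true element that falls on a real word."""
    found = []
    for pos, value in enumerate(flat):
        i, j = divmod(pos, MAX_WORDS)
        if value and j < len(words):
            found.append((words[j], KEYPHRASE_LABELS[i]))
    return found

# Hypothetical sentence and annotations.
words = ["el", "asma", "afecta", "las", "vías", "respiratorias"]
flat = [0] * (len(KEYPHRASE_LABELS) * MAX_WORDS)
flat[keyphrase_index(0, 1)] = 1  # "asma" marked with label 0
flat[keyphrase_index(1, 2)] = 1  # "afecta" marked with label 1
pairs = decode_keyphrases(flat, words)
```

True elements that point beyond the end of the sentence are skipped here, which is one simple way of tolerating inconsistent outputs.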


Figure 3 shows the relations array, which can be reinterpreted as a 3D array in which,
for each valid index triple i,j,k:

• i: Represents the position of the relation label in a fixed-length array of custom
  labels
• j: Represents the position of the origin word in the sentence, up to a maximum of
  100 words
• k: Represents the position of the destination word in the sentence, up to a maximum
  of 100 words

If the element at (i, j, k) is true, the relation with the label at position i holds from
the origin word at position j to the destination word at position k.

                                Fig. 3. Relations array
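A minimal sketch of how a triple (i, j, k) could be located inside the full output vector, after the key-phrases sub-array, is given below. The numbers of key-phrase and relation labels and the row-major layout are assumptions; only the limit of 100 word positions comes from the text.

```python
MAX_WORDS = 100          # from the text
N_KEYPHRASE_LABELS = 4   # assumed
N_RELATION_LABELS = 14   # assumed, including the samebox relation

KEYPHRASE_SIZE = N_KEYPHRASE_LABELS * MAX_WORDS
RELATION_SIZE = N_RELATION_LABELS * MAX_WORDS * MAX_WORDS
OUTPUT_SIZE = KEYPHRASE_SIZE + RELATION_SIZE

def relation_index(label_pos, origin_pos, dest_pos):
    """Flat position of (i, j, k) in the full output, after the key-phrase part."""
    return KEYPHRASE_SIZE + (label_pos * MAX_WORDS + origin_pos) * MAX_WORDS + dest_pos

def decode_relation(flat_pos):
    """Inverse of relation_index: recover (i, j, k) from a flat output position."""
    rest, dest_pos = divmod(flat_pos - KEYPHRASE_SIZE, MAX_WORDS)
    label_pos, origin_pos = divmod(rest, MAX_WORDS)
    return label_pos, origin_pos, dest_pos

pos = relation_index(2, 5, 7)
triple = decode_relation(pos)
```

Under these assumed sizes the relations part alone has 140,000 elements, which is why the relation sub-array dominates the length of the output.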


Output processing. This output format must be converted to the competition output
format and vice versa. Two problems had to be addressed:

• The model output only considers individual words as key-phrases, whereas the
  competition output may require key-phrases of several words. This was addressed by
  including a new type of relation called samebox, which is reflexive and links every
  word that should be in the same key-phrase. In addition, the relations of a multi-word
  key-phrase were copied to each of its single-word key-phrases.
• Inconsistencies in the model output. This was addressed by a permissive conversion
  to the competition output, which resolved many inconsistencies by itself.
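One possible way to rebuild multi-word key-phrases from samebox links is to treat the links as edges and take connected components, as sketched below with a small union-find. This is an illustration under assumptions, not the conversion code used in the submission: the function name, the representation of links as pairs of word positions, and the simplification that every word is a key-phrase word are all hypothetical.

```python
def group_samebox(n_words, samebox_links):
    """Group word positions into key-phrases via connected components of samebox links."""
    parent = list(range(n_words))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in samebox_links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # keep the smallest position as root

    groups = {}
    for pos in range(n_words):
        groups.setdefault(find(pos), []).append(pos)
    return sorted(groups.values())

words = ["vías", "respiratorias", "inflamadas"]
# Hypothetical links: each word links to itself, and words 0 and 1 share a key-phrase.
links = [(0, 0), (1, 1), (2, 2), (0, 1)]
phrases = [" ".join(words[p] for p in group) for group in group_samebox(len(words), links)]
```

Taking connected components is permissive in the sense that a link predicted in only one direction is still enough to merge two words.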


2.2   Training

Our system was trained in a Google Colab notebook with a GPU, using Python with
Keras and TensorFlow and a binary cross entropy loss function. The training task
with all its hyperparameters is available in Google Colab for reference at
https://colab.research.google.com/drive/1L0AG1fD9dHzVlv8i-cOc1OruCiedggKG?usp=sharing
  Figure 4 shows the accuracy achieved by our model after each training epoch.
Changes in accuracy are hard to see in the plot. Because the target output consists
overwhelmingly of false values, accuracy is already very high after a few epochs, and
later improvements amount to less than 1%, which makes them difficult to interpret.


                               Fig. 4. Model binary accuracy


                                    Fig. 5. Model loss

Figure 5 shows the loss achieved by our model after each training epoch. The loss is
somewhat easier to read in these plots. It decreases over time, much faster on the
training set than on the evaluation set. Both losses approach 0 after many epochs, but
after around 130-145 epochs there is no further improvement on the evaluation set.
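A plateau like the one described above can be located from a recorded loss history with a few lines of code. The sketch below is only an illustration and is not part of the reported training procedure; the function name, the `min_delta` tolerance and the loss values are made up for the example.

```python
def last_improvement_epoch(val_losses, min_delta=0.0):
    """Return the 1-based epoch of the last improvement over the best loss so far."""
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            best_epoch = epoch
    return best_epoch

# Made-up evaluation losses, not the values behind Figure 5.
history = [0.050, 0.031, 0.024, 0.021, 0.021, 0.022, 0.021]
plateau_start = last_improvement_epoch(history, min_delta=1e-4)
```

Epochs after the returned one bring no improvement larger than `min_delta`, so training beyond that point mainly fits the training set.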


3      Results

Tables 1, 2, and 3 summarize the metrics of each team's system in the different tasks
of the competition and rank the teams according to F1.


                                 Table 1. (Main Task)

       Team                                F1            Precision       Recall
       Vicomtech                           0.53106       0.54075         0.53464
       PUCRJ-PUCPR-UFMG                    0.52835       0.56849         0.50276
       IXA                                 0.49886       0.46457         0.53863
       uhKD4                               0.42264       0.48529         0.37431
       UH-MMM                              0.33865       0.29163         0.40374
       CodestrangeTeam                     0.23201       0.33703         0.17689
       baseline                            0.23201       0.33703         0.17689
       JAD                                 0.10949       0.23441         0.07143


                                   Table 2. (Task A)

       Team                                F1            Precision       Recall
       PUCRJ-PUCPR-UFMG                    0.70601       0.71491         0.69733
       Vicomtech                           0.68413       0.69987         0.74706
       IXA                                 0.65333       0.61372         0.6984
       UH-MMM                              0.60769       0.54604         0.68503
       uhKD4                               0.52728       0.51751         0.53743
       Yunnan-Deep                         0.33406       0.52036         0.24599
       baseline                            0.30602       0.35034         0.27166
       JAD                                 0.2625        0.31579         0.2246
       Yunnan-1                            0.17322       0.27107         0.12727
       CodestrangeTeam                     0.08019       0.415           0.04439


                                    Table 3. (Task B)

        Team                                 F1            Precision        Recall
        IXA                                  0.4304        0.45357          0.40948
        Vicomtech                            0.37191       0.54186          0.28311
        uhKD4                                0.31771       0.55623          0.22236
        PUCRJ-PUCPR-UFMG                     0.26324       0.36659          0.20535
        UH-MMM                               0.05384       0.07727          0.04131
        CodestrangeTeam                      0.03275       0.4375           0.01701
        baseline                             0.03275       0.4375           0.01701
        JAD                                  0.00722       0.375            0.00365

The evaluation in both tasks was carried out using the annotated corpus proposed in
the challenge. The results were measured with a standard F1 measure, as described in
detail in the challenge overview [1]. Precision and recall were also recorded and are
presented.
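For reference, the final step of the F1 measure is the harmonic mean of precision and recall. The short sketch below applies it to the precision and recall reported for JAD in Table 2; the function name is ours, and the official scorer applies its own matching rules, described in [1], before this step.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall reported for JAD in Table 2 (Task A).
jad_task_a = f1_score(0.31579, 0.2246)
```

The result agrees with the F1 listed for JAD in Table 2 up to rounding.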
  From Tables 1 to 3, our team always ranked 8th according to F1. Our results in Task A
were better than those in Task B. None of our results outperformed the baseline.


4      Discussion

The system achieved poor results in the challenge. It performed very poorly at finding
relations and slightly better at labelling key-phrases. In both cases it was worse
than the baseline. These results could be due to several reasons.
  One of the reasons could be the simplicity of the model. If we add recurrent or
convolutional layers, the results may improve. The continued improvement on the
evaluation set over many epochs could also indicate that the evaluation set was too
similar to the training set. The way the outputs were modelled might be improved to
include fewer negative values. Dividing the model into two separate models could also
improve the results.
  The accuracy and loss achieved in training were not bad, yet the model performed
poorly on the final dataset. These metrics were much better on the training set than
on the evaluation set, and the evaluation set is the more relevant one because it
contains data that our model was not trained on. The evaluation metrics may look good,
but in reality they are not. Binary accuracy reached 0.997, but the output vector is
very large and consists mostly of false values, so this figure does not imply good
performance.
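The effect of a large, mostly false output on binary accuracy can be made concrete with a small calculation. The output size and the number of true elements below are assumptions chosen for illustration, not figures taken from the corpus or from the trained model.

```python
def all_false_accuracy(n_outputs, n_true):
    """Binary accuracy of a predictor that outputs false everywhere."""
    return (n_outputs - n_true) / n_outputs

# Assumed sizes: 4 key-phrase labels and 14 relation labels over 100 word positions.
n_outputs = 4 * 100 + 14 * 100 * 100
# Hypothetical sentence with 8 true key-phrase elements and 12 true relation elements.
baseline_acc = all_false_accuracy(n_outputs, n_true=8 + 12)
```

Under these assumptions a predictor that never outputs a true value would already exceed a binary accuracy of 0.999, so accuracy alone says little about how many key-phrases and relations are actually recovered.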


5      Conclusions

This paper describes the system presented by team JAD in the eHealth-KD Challenge
2021. A deep-learning model was assembled and trained to automatically extract
relevant entities and relations from plain text documents. The results achieved by the
system in the challenge were not outstanding: it ranked last in the main task and,
although it was better at classifying entities, it was still worse than the baseline.
  The main goal of our team was not to win the competition but to build knowledge
and a general, simple model that is easy to understand, implement, train and run. This
goal was mostly achieved. The power of a simple model was overestimated. It would
be interesting to add some recurrent layers, re-train BERT on a more specific dataset,
and reduce the output size or change the output format.


6      Acknowledgements

We thank all the organizers, staff and reviewers of the eHealth-KD Challenge 2021.
Special thanks to the professors of the AI Department of MATCOM-UH.


References
 1. A. Piad-Morffis, Y. Gutiérrez, S. Estevez-Velarde, Y. Almeida-Cruz, R. Muñoz, A.
    Montoyo, Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2021,
    Procesamiento del Lenguaje Natural 67 (2021).
 2. J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
    transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).