TALP-UPC at eHealth-KD Challenge 2019:
A Joint Model with Contextual Embeddings for Clinical Information Extraction

Salvador Medina and Jordi Turmo
Universitat Politècnica de Catalunya, Spain
Campus Nord, Carrer de Jordi Girona 1-3, 08034 Barcelona, Spain
{smedina,turmo}@cs.upc.edu

Abstract. Most eHealth entity recognition and relation extraction models tackle the identification of entities and relations with independent, specialized models. In this article, we show how a single combined model can exploit the correlation between these two tasks to improve the evaluation score of both, while reducing training and execution time. Our model uses both traditional part-of-speech tagging and dependency parsing of the documents and state-of-the-art pre-trained contextual embeddings as input features. Furthermore, Long Short-Term Memory units are used to model close relationships between words, while convolution filters are applied for farther dependencies. Our model obtained the highest score in all three tasks of IberLEF2019's eHealth-KD competition [7]. This advantage was especially promising in the relation extraction tasks, in which it outperformed the second-best model by a margin of 9.3% in F1 score.

Keywords: NERC · Relation Extraction · eHealth NLP · Contextual Embeddings

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 September 2019, Bilbao, Spain.

1 Introduction

This article describes the model presented by the TALP team for IberLEF2019's eHealth-KD [7] shared task, which includes the identification of relevant key-phrases and of the relations among them in electronic health (i.e., eHealth) documents written in Spanish. The task was divided into three scenarios: key-phrase identification and classification, relation extraction, and full knowledge extraction. Our model outperformed the rest of the competing models in all three scenarios.

IberLEF2019's eHealth-KD shared task supersedes and extends the previous year's Taller de Análisis Semántico en la SEPLN 2018 eHealth-KD (TASS-2018's eHealth-KD) [4] shared task. There are, however, substantial differences between both tasks' classes¹ and evaluation metrics. Likewise, the task is inspired by previous competitions such as SemEval-2017 Task 10: ScienceIE [1].

¹ Two additional key-phrase classes (Predicate and Reference) were added, together with their related relation classes (in-time, in-place, in-context, domain and arg). Moreover, one relation class was removed (property-of) and others were added (same-as, has-property, causes and entails).

The models presented for the aforementioned related tasks incorporate combinations of several techniques, such as Convolutional or Recurrent Neural Networks, Support Vector Machines, Conditional Random Fields and even rule-based systems. Our team took part in the key-phrase classification and relation extraction sub-tasks of TASS-2018's eHealth-KD with a joint CNN-based model [5], which ranked first in the relation extraction sub-task. That model did not support key-phrase recognition, though, as it received pairs of key-phrases as input. Our newly presented model overcomes this limitation by identifying key-phrases and all their related key-phrases at once.

Given the similarity between the two tasks, we decided to first train a model on the TASS-2018's eHealth-KD data-set and then use its weights as a starting point. This idea of transferring some of the weights of a model trained for a different task (transfer learning) has been extensively used in low-resource machine learning tasks such as image classification, text analysis, question answering and more [6,2].

2 Model

The model takes a document and a token index and computes the boundaries and class of the shortest key-phrase that token belongs to, together with the relations of every other entity's tokens to it. Hence, the model has to be run once for each token of the input document. This approach is inspired by attention-based translation models such as the Transformer [8], in which the output is successively generated by running the model for one particular input token at a time.

The joint identification model's structure is depicted in Figure 1. It is composed of a set of shared layers and two independent output layers. Both output layers share the same structure, a fully connected layer followed by a CRF, and respectively predict the target token's smallest entity sequence and every other token's relation to it.

The core of the shared layers contains a recurrent layer composed of multiple bidirectional memory units (either Gated Recurrent Units or Long Short-Term Memory units) followed by a convolution layer. The RNN and CNN outputs are then fed to a fully connected layer with output dropout. The recurrent and convolution layers allow the model to look at both the local and the global context of each input token. The local context is captured by the RNN layer's output and the non-pooled convolution layer's output, which are concatenated for each time-step. The global context is captured by the max-pooled convolution layer's output. The global context information and the target token's local context information are added to all time-steps before being fed to the fully connected shared layer.

The final outputs are then generated by a Conditional Random Field (CRF) layer. Output CRF layers have proven to improve the capabilities of GRU and LSTM networks in low-resource sequence tagging tasks [3].

Fig. 1. Schematic architecture of the identification artificial neural network.

Fig. 2. Visual representation of how relations are encoded by the network when afecta is the input token (example sentence: "El asma afecta las vías respiratorias").
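To make the shared-layer description concrete, the following is a minimal PyTorch sketch of the architecture as we read it from the text and Figure 1. It is not the authors' implementation: the layer sizes, the way the global and target contexts are combined, and the replacement of the output CRFs by plain linear emission layers are simplifying assumptions made for illustration only.

# Minimal sketch (not the authors' code) of the joint architecture described above:
# a shared BiLSTM + convolution block mixing local and global context, followed by
# two output heads (key-phrase tags and relation labels). The CRF layers are
# replaced here by plain linear emission scores to keep the example self-contained.
import torch
import torch.nn as nn


class JointRCNNSketch(nn.Module):
    def __init__(self, feat_dim, n_tag, n_rel, hidden=128, conv_ch=64):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.conv = nn.Conv1d(2 * hidden, conv_ch, kernel_size=3, padding=1)
        self.shared = nn.Linear(2 * hidden + conv_ch, 2 * hidden + conv_ch)
        self.drop = nn.Dropout(0.5)
        self.tag_head = nn.Linear(2 * hidden + conv_ch, n_tag)  # BIOUE-style tag scores
        self.rel_head = nn.Linear(2 * hidden + conv_ch, n_rel)  # relation label scores

    def forward(self, x, target_idx):
        # x: (batch, seq_len, feat_dim); target_idx: index of the token being decoded
        h, _ = self.rnn(x)                                   # (B, T, 2*hidden)
        c = self.conv(h.transpose(1, 2)).transpose(1, 2)     # (B, T, conv_ch)
        local = torch.cat([h, c], dim=-1)                    # per-token local context
        # global context via max-pooling over time (the paper pools the convolution
        # output; here we pool the concatenated features for dimensional simplicity)
        global_ctx = local.max(dim=1, keepdim=True).values
        target_ctx = local[:, target_idx:target_idx + 1, :]  # target token's local context
        mixed = local + global_ctx + target_ctx              # added to every time-step
        shared = self.drop(torch.relu(self.shared(mixed)))
        return self.tag_head(shared), self.rel_head(shared)  # per-token scores for both tasks

In the full model, the per-token scores produced by the two heads are passed to CRF layers for structured decoding rather than used directly.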
2.1 Input encoding and decoding

As described in Section 2, the model receives the sequence of tokens of a document together with a token index, and outputs the bounds of the innermost key-phrase to which that token belongs. These bounds are encoded and decoded by assigning a Begin, Inside, Unitary or End tag to each token included in that key-phrase and an Out tag to every other token (BIOUE tagging). Note that just one key-phrase is decoded for each token index. Consequently, in order to identify all key-phrases, the model has to be evaluated for every token.
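As a concrete illustration of this encoding, the short sketch below tags the innermost key-phrase containing a given token. The span representation and function name are our own and purely illustrative, not taken from the system.

# Hedged sketch (not the authors' code): BIOUE-tag the innermost key-phrase that
# contains a given token index. Spans are (start, end) pairs with an inclusive end.
def bioue_encode(n_tokens, key_phrases, token_idx):
    # keep only key-phrases containing the token, pick the shortest (innermost) one
    covering = [(s, e) for s, e in key_phrases if s <= token_idx <= e]
    tags = ["O"] * n_tokens
    if not covering:
        return tags
    s, e = min(covering, key=lambda span: span[1] - span[0])
    if s == e:
        tags[s] = "U"                      # unitary (single-token) key-phrase
    else:
        tags[s], tags[e] = "B", "E"        # begin / end of the span
        for i in range(s + 1, e):
            tags[i] = "I"                  # inside tokens
    return tags


# Example: tokens 1-2 form a key-phrase nested inside a longer one spanning 0-3.
print(bioue_encode(6, [(0, 3), (1, 2)], token_idx=2))
# -> ['O', 'B', 'E', 'O', 'O', 'O']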
We approach relation extraction at the token level. Given a token, the model predicts the list of relation probabilities between the innermost entity to which the token belongs and each of the tokens in the document. Note that for the source token we only consider the innermost entity, whereas for the target tokens we consider all parent entities. This restriction is imposed so that the encoded sequence is not ambiguous. Other alternatives, such as source and target encoding, were also considered but ultimately discarded, as the increased decoding complexity did not yield improved results. A visual representation of the relation probability predictions is shown in Figure 2.

Relations to a target key-phrase are predicted if the aggregated score inside the key-phrase's span surpasses a threshold. Only the key-phrase with the highest score is selected if multiple key-phrases overlap. The pseudo-code of the relation decoder is listed in Algorithm 1.

Algorithm 1. Relation decoding algorithm

procedure decode(i_s, p^c, E, R)                   ▷ relations from token i_s with probabilities p^c of class c
    e_s ← innermost entity at index i_s from E
    while not done do
        {e_t, p_t} ← {∅, 0}
        for e ∈ E do
            p ← aggregate p^c_i ∀ i ∈ bounds of e   ▷ pre-defined probability aggregation function
            if p > p_t ∧ p > p_th then              ▷ above the current best and above the threshold
                {e_t, p_t} ← {e, p}
        if e_t ≠ ∅ then
            R_c ← R_c ∪ {(e_s, e_t, c)}             ▷ add relation to the relation set
            p^c_i ← 0 ∀ i ∈ bounds of e_t           ▷ probabilities in the selected span are set to 0
        else
            return R
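For concreteness, a possible Python rendering of Algorithm 1 is given below. The aggregation function, the threshold value and all names are assumptions, since the paper only specifies them abstractly.

# Hedged Python rendering of Algorithm 1 (not the authors' implementation).
# probs[c][i] is the predicted probability that token i is the target of a relation
# of class c from the source entity; entities are (start, end) spans, end inclusive.
def decode_relations(source_idx, probs, entities, threshold=0.5, aggregate=max):
    relations = set()
    # innermost (shortest) entity containing the source token
    source = min((e for e in entities if e[0] <= source_idx <= e[1]),
                 key=lambda e: e[1] - e[0], default=None)
    if source is None:
        return relations
    for cls, scores in probs.items():
        scores = list(scores)                        # local copy we can zero out
        while True:
            best, best_p = None, 0.0
            for start, end in entities:
                p = aggregate(scores[start:end + 1])  # aggregate over the span
                if p > best_p and p > threshold:
                    best, best_p = (start, end), p
            if best is None:
                break                                # no span left above the threshold
            relations.add((source, best, cls))       # add relation source -> target
            for i in range(best[0], best[1] + 1):
                scores[i] = 0.0                      # suppress the selected span
    return relations


# Toy usage: tokens 3-5 form the only span scoring above the threshold for "is-a".
ents = [(0, 0), (3, 5)]
probs = {"is-a": [0.1, 0.0, 0.0, 0.2, 0.9, 0.3]}
print(decode_relations(0, probs, ents))   # {((0, 0), (3, 5), 'is-a')}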
2.2 Input features

As described in Section 2, the model looks at sentences at the token level. The sentences are first tokenized, tagged and dependency-parsed by FreeLing. We represent each token by a vector that results from the concatenation of the features listed below:

– One-hot encoding of the category and type fields of the token's part-of-speech tag, from FreeLing's tag-set.
– Normalized vector encoding the dependencies found in the path between the token and the target token (the one that is being decoded). It is computed by adding the one-hot encoding of the dependency class for each hop in the dependency path, ignoring its direction, and normalizing the resulting vector. For instance, the representation of the token "I" in "I eat fish" when the target token is "fish" would be a vector with 1/√2 in the positions corresponding to "subj" (subject) and "cd" (direct complement), whereas for "eat" it would be a vector with a single 1 in the "cd" position (see the sketch after this list).
– One-hot encoding of the distance between the token and the target token.
– Contextualized word embedding of the token, computed by extracting the output of the last layer of a multi-language, general-purpose, pre-trained² Bidirectional Encoder Representations from Transformers (BERT) model [2]. No fine-tuning of the BERT model is done.

² We used the BERT-Base, Multilingual Cased model (104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters) from the authors' repository (https://github.com/google-research/bert).
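To make the second feature concrete, here is a small, self-contained sketch of the normalized dependency-path encoding. The toy parse, the label inventory and the function names are our own assumptions for illustration, not FreeLing output.

# Hedged sketch of the normalized dependency-path feature (not the authors' code).
# The dependency tree is given as head[i] = index of token i's head (-1 for the root)
# and label[i] = label of the arc from token i to its head. The path is walked
# ignoring arc direction; the per-label one-hot counts are then L2-normalized.
import math

LABELS = ["subj", "cd", "root"]          # toy label inventory (assumption)

def path_to_root(idx, head):
    path = [idx]
    while head[path[-1]] != -1:
        path.append(head[path[-1]])
    return path

def dep_path_feature(token, target, head, label):
    # arcs on the undirected path = arcs climbed from both tokens up to their
    # lowest common ancestor (nodes shared by both root paths are excluded)
    pt, pg = path_to_root(token, head), path_to_root(target, head)
    common = set(pt) & set(pg)
    arcs = [label[i] for i in pt if i not in common]
    arcs += [label[i] for i in pg if i not in common]
    vec = [float(arcs.count(lab)) for lab in LABELS]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# "I eat fish": "eat" is the root, "I" is its subject, "fish" its direct complement.
head = [1, -1, 1]
label = ["subj", "root", "cd"]
print(dep_path_feature(0, 2, head, label))   # "I" -> "fish": approx. [0.71, 0.71, 0.0]
print(dep_path_feature(1, 2, head, label))   # "eat" -> "fish": [0.0, 1.0, 0.0]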
2.3 Pre-Training

IberLEF2019's eHealth-KD training data-set is arguably small considering the number of classes and the variability of the examples (3818 concepts and 3503 relations). In order to prevent over-fitting and, consequently, to increase recall, we opted for using the previous year's TASS-2018's eHealth-KD [4] training and testing data-sets in a pre-training, transfer-learning step. Transfer learning in this case is straightforward, as the two tasks are very similar and our model already shows positive results on TASS-2018's eHealth-KD's tasks, as shown in Section 3. However, since the output classes for both concept recognition and relation extraction do not match those of IberLEF2019's eHealth-KD, the output layers' weights are discarded after the pre-training phase.

3 Evaluation and results

The hyper-parameters of the presented model were tuned using the provided development data-set. For this, we used the evaluation metrics and scripts provided by the challenge's organizers³. For more information about the data-set, please refer to the task's overview paper [7].

In addition to this evaluation, we also evaluate the model learned in the pre-training step using TASS-2018's eHealth-KD's testing data-set. For the sake of comparison, we use the original evaluation metrics and scripts and contrast the results with those of the challenge's contestants⁴.

³ IberLEF2019's eHealth-KD data-set and evaluation scripts were downloaded from https://github.com/knowledge-learning/ehealthkd-2019.
⁴ TASS-2018's eHealth-KD data-set and evaluation scripts were downloaded from https://github.com/TASS18-Task3/data.

Table 1 shows the relevant evaluation metrics of TASS-2018's eHealth-KD's best-performing models for the three scenarios, compared to the presented model (Joint-RCNN). The joint identification model clearly outperforms our previous model (talp) in all metrics, also beating the rest of the participants in sub-tasks B and C. In line with the original task's results, rriveraz's model shows impressive results and surpasses our model in black-boxed sub-task A by a margin of 4.1% in F1 score, leaving our model in second position.

             Scenario 1              Scenario 2              Scenario 3
Model        F1(A)  F1(C)  F1(ABC)   Acc(B)  F1(C)  F1(BC)   F1(C)
Joint-RCNN   82.3   34.5   72.8      94.3    53.0   76.8     60.1
talp         N/A    N/A    N/A       93.1    45.8   72.2     44.8
rriveraz     87.2   00.0   75.7      95.9    10.9   62.2     03.6
upf upc      80.5   09.3   66.1      95.4    00.0   64.8     00.0

Table 1. Evaluation results comparing TASS-2018's eHealth-KD final results against the presented model for the 3 training scenarios and related test corpora.

Table 2 shows the shared task's final evaluation results. Our model ranked first in all three evaluation scenarios. The largest advantage resides in the relation extraction task, where it outscores the second-best system by a margin of 9.6% in F1 score. After the competition was closed, we found that our model was outputting invalid combinations of key-phrase classes and relation classes; fix-relations shows the evaluation of the fixed model.

                        Scenario 1          Scenario 2          Scenario 3
Model                   P     R     F1      P     R     F1      P     R     F1
Joint-RCNN              65.1  62.9  63.9    80.7  83.4  82.0    66.7  59.2  62.7
fix-relations           65.4  62.9  64.1    80.7  83.4  82.0    67.3  59.2  63.0
no-transfer             65.1  61.6  63.3    79.9  83.9  81.9    65.4  55.5  60.0
LASTUS-TALN (abravo)    77.4  46.6  58.2    80.0  83.4  81.7    17.1  35.2  23.0
IxaMed (iakesg)         69.0  37.6  48.7    65.7  71.1  68.3    52.0  37.5  43.6
NLP UNED (lsi uned)     65.6  47.0  54.7    80.7  70.8  75.4    62.4  46.7  53.4
coin flipper (ncatala)  74.5  53.3  62.2    79.9  77.6  78.7    71.3  37.7  49.3
UH-MAJA-KD              56.4  48.0  51.9    80.0  83.2  81.6    43.1  43.7  43.4
VSP                     45.5  40.6  42.9    51.3  58.5  54.7    58.9  42.4  49.3

Table 2. Final evaluation results of IberLEF2019's eHealth-KD. We also include the results when invalid relations are removed (fix-relations) and when no transfer-learning is used (no-transfer).

4 Conclusion

In this article we have presented a joint concept and relation identification and classification model that exploits the mutual information between entities and their relations by using a single network that looks at both local and global textual features. This newly presented model significantly outperforms all other competing models in both the TASS-2018's eHealth-KD and IberLEF2019's eHealth-KD shared tasks.

We hypothesize that sharing both structure and weights allows the full model to capture the synergy between the two tasks more accurately, and hence to provide better precision and recall than traditional step-by-step models. In spite of this, the task is still not fully solved, and we argue that further experimentation should be done along this line of research. We identify three noteworthy challenges:

– Entities and relations are formally very different: the first is usually encoded as a sequence and the second as a set of one-to-one labels. These two representations are difficult to combine, so more appropriate encodings are required to take full advantage of joint models.
– Optimization functions are also hard to define, as they have to balance entity recognition and relation extraction while taking into account the different amounts and difficulty of instances of both tasks.
– Depending on the model's structure, the optimization of the hidden layers' parameters for the different outputs may be mutually opposed. These hidden layers should be designed to promote reciprocal collaboration between both objective functions.

5 Acknowledgements

Funding: This contribution has been partially funded by the Spanish Ministry of Economy (MINECO) and the European Union (TIN2016-77820-C3-3-R and AEI/FEDER, UE).

References

1. Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval 2017 Task 10: ScienceIE - Extracting keyphrases and relations from scientific publications. arXiv preprint arXiv:1704.02853 (2017)
2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
3. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
4. Martínez Cámara, E., Almeida Cruz, Y., Díaz Galiano, M.C., Estévez-Velarde, S., García Cumbreras, M.Á., García Vega, M., Gutiérrez, Y., Montejo Ráez, A., Montoyo, A., Muñoz, R., et al.: Overview of TASS 2018: Opinions, health and emotions (2018)
5. Medina, S., Turmo, J.: Joint classification of key-phrases and relations in electronic health documents. Proceedings of TASS 2172 (2018)
6. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10), 1345–1359 (2009)
7. Piad-Morffis, A., Gutiérrez, Y., Consuegra-Ayala, J.P., Estevez-Velarde, S., Almeida-Cruz, Y., Muñoz, R., Montoyo, A.: Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2019 (2019)
8. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)