<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TALP at eHealth-KD Challenge 2020: Multi-Level Recurrent and Convolutional Neural Networks for Joint Classification of Key-Phrases and Relations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvador Medina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jordi Turmo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Politècnica de Catalunya</institution>
          ,
          <addr-line>Campus Nord, Carrer de Jordi Girona, 1, 3, 08034 Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>85</fpage>
      <lpage>93</lpage>
      <abstract>
<p>This article describes the model presented by the TALP Team at IberLEF’s eHealth Knowledge Discovery 2020 shared task [1]. The model iterates on the idea of using a single model to simultaneously identify key-phrases and their relationships. Taking into account the new transfer-learning sub-task introduced in the 2020 edition of eHealthKD, our model does not rely on any domain-specific knowledge or handcrafted features. Our model was competitive in all four sub-tasks, ranking in 2nd, 3rd, 4th and 1st position respectively.</p>
      </abstract>
      <kwd-group>
<kwd>NERC</kwd>
        <kwd>Relation Extraction</kwd>
        <kwd>eHealth NLP</kwd>
        <kwd>Contextual Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. System Description</title>
<p>Our model expects a document and a source token index as input and generates a sequence of labels for each key-phrase and relation class. Input documents are parsed using FreeLing’s dependency parser, and each one of their tokens is encoded using either a BERT, a Word2Vec or a FastText pre-trained word-embedding model. The model then applies convolution filters to the encoded tokens of the input documents, combines the word-level filters’ outputs of each input token and the specified source token with sentence-level embeddings of the documents, and outputs the boundaries of each key-phrase containing the source token as well as the likelihoods that every other token is the target of a relation having the specified source token’s key-phrase as a source.</p>
<p>In order to generate all possible relations, the model must be run for every input token and the raw likelihoods combined across all of these runs. This approach of looking at a single input token at a time is inspired by attention-based translation models such as the Transformer, in which the model produces the most likely output token one at a time, conditioned on the previously generated tokens and the whole untranslated document.</p>
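        <p>The union-style combination of the raw likelihoods can be sketched as follows. This is a minimal illustration under our own assumptions: per-run likelihoods are keyed by (source, target, relation) triples and merged by keeping the maximum; the helper name and the toy data are ours, not the system’s.</p>
        <preformat>
```python
def combine_relation_scores(per_token_runs):
    """Union-style merge: keep the highest likelihood observed for each
    (source, target, relation) triple across all per-source-token runs."""
    merged = {}
    for run in per_token_runs:
        for triple, p in run.items():
            merged[triple] = max(p, merged.get(triple, 0.0))
    return merged

# Two hypothetical runs of the model for two different source tokens.
runs = [
    {("asma", "enfermedad", "is-a"): 0.9},
    {("asma", "enfermedad", "is-a"): 0.7, ("asma", "pulmones", "target"): 0.8},
]
print(combine_relation_scores(runs))
```
        </preformat>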
      <sec id="sec-2-1">
        <title>2.1. Internal structure of the model</title>
        <p>A visual representation of the model’s structure is shown in Figure 1. The network is composed
of a set of shared intermediate layers and two independent output layers. The intermediate
layers include a Bidirectional Gated Recurrent Unit layer followed by a set of convolution
filters. The recurrent units’ and convolution filters’ outputs are finally concatenated and fed to a fully
connected layer. The output layers consist of a fully connected layer followed by a Conditional
Random Field layer.</p>
        <p>This structure lets the model look at both the local and global contexts of each of the input
tokens. Particularly, the local context is captured by the recurrent units’ output and the
non-pooled convolution layer’s output, while the global context is captured by the max-pooled
convolution layer’s output. Additional global context information is added when the BERT-based
model is used by concatenating the encoding of the auxiliary CLS token.</p>
        <p>The global context information and the target token’s local context information are added
to all time-steps before being fed to the fully connected shared layer. The final outputs are
then generated by a Conditional Random Field (CRF) layer. Output CRF layers have proven to
improve the capabilities of GRU and LSTM networks in low-resource sequence tagging tasks[4].</p>
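        <p>At the tensor level, the combination of local and global context described above can be sketched with NumPy. The shapes, the random stand-in arrays and the use of max-pooling over the convolution output are illustrative assumptions; the real model produces these tensors with its recurrent and convolution layers.</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                          # toy sequence length and feature width
gru_out = rng.normal(size=(T, d))    # stand-in for the recurrent layer's output
conv_out = rng.normal(size=(T, d))   # stand-in for the non-pooled convolution output
global_ctx = conv_out.max(axis=0)    # max-pooled convolution: one global vector

# Broadcast the global context and the source token's local context to
# every time step before the shared fully connected layer.
src = 2                              # index of the source token
combined = np.concatenate(
    [gru_out,
     conv_out,
     np.tile(global_ctx, (T, 1)),
     np.tile(gru_out[src], (T, 1))],
    axis=1,
)
print(combined.shape)   # (6, 32)
```
        </preformat>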
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Output generation and decoding</title>
        <p>As described in Section 2, our system receives the sequence of tokens of a document and a
token’s index and outputs the bounds of the innermost key-phrase to which the token belongs.
These bounds are encoded and decoded by assigning a Begin, Inside, Unitary and End tag to each
token included in that key-phrase and Out to every other token (BIOUE-tag). One limitation of
this approach is the fact that just one key-phrase is decoded for each token index, but this is
not an issue in our case, as key-phrases may subsume but not overlap other key-phrases.</p>
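        <p>The BIOUE encoding of a single key-phrase span and its decoding can be illustrated with a short sketch (the helper names are ours):</p>
        <preformat>
```python
def encode_biuoe(n_tokens, span):
    """Encode the innermost key-phrase span (start, end inclusive) with
    Begin/Inside/Unitary/End tags; every other token is tagged Out."""
    start, end = span
    tags = ["O"] * n_tokens
    if start == end:
        tags[start] = "U"
    else:
        tags[start] = "B"
        tags[end] = "E"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

def decode_biuoe(tags):
    """Recover the single key-phrase span from a BIOUE tag sequence."""
    if "U" in tags:
        i = tags.index("U")
        return (i, i)
    if "B" in tags:
        return (tags.index("B"), tags.index("E"))
    return None

tags = encode_biuoe(6, (1, 3))
print(tags)   # ['O', 'B', 'I', 'E', 'O', 'O']
```
        </preformat>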
        <p>For each input token, our model predicts the probability of each relation class holding between the innermost entity to which the token belongs and each one of the tokens in the document. Note that for the source token we only consider the innermost entity, whereas for the target tokens we consider all parent entities. Consequently, our method does not allow for overlapping relations from the same source token. This restriction is imposed so that the encoded sequence is not ambiguous. A visual representation of relations’ probability predictions is shown in Figure 2. Relations are predicted from the target key-phrase if the aggregated score inside a key-phrase span surpasses a threshold. Only the key-phrase with the highest score is selected if multiple key-phrases overlap.</p>
        <p>[Figure 1: Internal structure of the model. Token representations feed a recurrent (LSTM) layer and convolution filters; their outputs are concatenated and passed through fully-connected layers to two output CRF layers, one for concepts (Yc) and one for relations (Yr).]</p>
        <p>[Figure 2: Example of relation probability predictions, with token-level scores (e.g. 0.45, 0.10, 0.05) aggregated over key-phrase spans.]</p>
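        <p>The decoding of relation targets described above can be sketched as follows. This is a minimal sketch under our own assumptions: the aggregation over a span is taken to be the mean and the threshold is set to an illustrative 0.4; neither value is specified by the description above.</p>
        <preformat>
```python
def overlaps(a, b):
    # Inclusive (start, end) spans overlap iff neither ends before the other starts.
    return a[1] >= b[0] and b[1] >= a[0]

def pick_targets(token_scores, spans, threshold=0.4):
    """Average each candidate span's token-level target likelihoods, keep
    spans whose aggregated score surpasses the threshold, and resolve
    overlaps in favour of the highest-scoring span."""
    scored = sorted(
        ((sum(token_scores[s:e + 1]) / (e - s + 1), (s, e)) for s, e in spans),
        reverse=True,
    )
    kept = []
    for score, span in scored:
        if score > threshold and not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept)

# Token likelihoods 0.45, 0.10, 0.05 as in Figure 2, plus a strong target.
print(pick_targets([0.05, 0.45, 0.10, 0.05, 0.9], [(1, 2), (1, 3), (4, 4)]))  # [(4, 4)]
```
        </preformat>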
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Input features</title>
        <p>As previously mentioned, our model processes the documents at the token level. We represent each token by a vector, which results from the concatenation of the features listed below:
• One-hot encoding of the category and type fields of the token’s Part-of-Speech tag from FreeLing’s tag-set.
• Normalized vector encoding the dependencies found in the path between the token and the target token (the one that is being decoded). It is computed by adding the one-hot encoding representation of the dependency class for each hop in the dependency path, not considering its direction, and normalizing the resulting vector. For instance, the representation of the token "I" in "I eat fish" when the target token is "fish" would be a vector with 1/√2 in the positions corresponding to "subj" (subject) and "cd" (direct complement); whereas for "eat" it would be a vector with a single 1 in the "cd" position.
• One-hot encoding of the distance between the token and the target token.
• Word-embedding of the token. We consider 4 alternative pre-trained word embedding models:
– Concatenation of the last output layers of a multi-language general-purpose BERT [5] model<sup>1</sup> with no fine-tuning.
– Word2Vec and FastText Medical Word Embeddings for Spanish models from the Barcelona Supercomputing Center<sup>2</sup> [3].
– FastText Spanish Unannotated Corpora from SUC<sup>3</sup> [6].</p>
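        <p>As an illustration of the normalized dependency-path feature, the following sketch uses a reduced dependency tag-set (FreeLing’s real tag-set is larger) and helper names of our own:</p>
        <preformat>
```python
import math

DEP_CLASSES = ["subj", "cd", "obj", "mod"]   # reduced, illustrative tag-set

def dep_path_vector(path):
    """Sum one-hot encodings of the dependency class of each hop on the
    undirected path to the target token, then L2-normalize the result."""
    vec = [0.0] * len(DEP_CLASSES)
    for dep in path:
        vec[DEP_CLASSES.index(dep)] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

# "I" to "fish" in "I eat fish": hops "subj" and "cd" give 1/sqrt(2) each.
print(dep_path_vector(["subj", "cd"]))
# "eat" to "fish": a single "cd" hop gives a plain one-hot vector.
print(dep_path_vector(["cd"]))
```
        </preformat>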
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Pre-training with the ensemble corpus</title>
<p>Due to the comparatively large number of parameters in our model with respect to the size of the training dataset, overfitting can be an issue. We prevent this by using the relatively larger but less accurate ensemble corpus in a pre-training phase. In order not to let our model’s variables fall into local minima that would make our model mimic previous years’ models, we randomly add documents from IberLEF 2020’s training corpus. Furthermore, we increase dropout and gradually decrease the learning rate for the training and fine-tuning steps.</p>
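        <p>A minimal sketch of such a staged schedule is shown below; the phase lengths, decay factor and rates are illustrative assumptions, not the values used in our experiments:</p>
        <preformat>
```python
def staged_lr(epoch, pretrain_epochs=48, train_epochs=48, base_lr=0.001):
    """Illustrative three-phase schedule: a flat rate while pre-training
    on the ensemble corpus, a gradually halved rate during training, and
    a strongly reduced rate for fine-tuning. All values are assumptions."""
    if epoch > pretrain_epochs + train_epochs:   # fine-tuning phase
        return base_lr * 0.01
    if epoch > pretrain_epochs:                  # training phase: gradual decay
        return base_lr * 0.5 ** ((epoch - pretrain_epochs) / 16.0)
    return base_lr                               # pre-training phase

print([staged_lr(e) for e in (10, 64, 100)])
```
        </preformat>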
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Single-scenario training and fine-tuning</title>
<p>In the general evaluation scenario, the loss function has to balance accuracy for both the key-phrase recognition and relation extraction tasks. This may be problematic, as the parameter updates made by the optimizer to improve one task might be detrimental to the other task. However, in evaluation scenarios 2 and 3, that is, the independent key-phrase recognition and relation extraction tasks, the model does not have to generate both outputs. Consequently, on the one hand, we can use an uncompromising loss function. On the other hand, this means not being able to exploit the correlation between tasks, so it might also lead to worse performance.</p>
<p>To study this effect, we suggest three different single-scenario training strategies: using the general model with no alteration whatsoever, fine-tuning the general model’s outputs with an independent loss function for a few epochs, or training the specific model from scratch. Note that in the case of scenario 3, we decode the key-phrases using the gold standard rather than the model’s output for all three strategies; and concatenate a one-hot encoding of the key-phrase labels to the input for the from-scratch strategy. Table 2 shows the results for all three single-scenario training strategies.</p>
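        <p>For the from-scratch strategy in scenario 3, concatenating a one-hot encoding of the gold key-phrase labels to the inputs can be sketched as follows. The label list (the eHealth-KD key-phrase classes plus an "O" tag) and the helper names are illustrative, not the system’s own:</p>
        <preformat>
```python
# eHealth-KD key-phrase classes plus an "O" (no key-phrase) tag.
LABELS = ["O", "Concept", "Action", "Predicate", "Reference"]

def one_hot(label):
    vec = [0.0] * len(LABELS)
    vec[LABELS.index(label)] = 1.0
    return vec

def with_gold_labels(token_features, gold_labels):
    """Scenario-3 from-scratch input: concatenate a one-hot encoding of
    the gold key-phrase label to each token's feature vector."""
    return [feats + one_hot(lab) for feats, lab in zip(token_features, gold_labels)]

rows = with_gold_labels([[0.1, 0.2], [0.3, 0.4]], ["Concept", "O"])
print(rows[0])   # [0.1, 0.2, 0.0, 1.0, 0.0, 0.0, 0.0]
```
        </preformat>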
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Trainable parameters and computational resources</title>
        <p>All models were trained using the TensorFlow® 1.15 framework for Python® 3.6 on an 8-core Intel® Xeon® E5-2620 v4 CPU at 2.10 GHz with 16 GB of DDR4 RAM, a GeForce® GTX 1070 GPU and a 7200 rpm 1 TB Seagate® HDD.</p>
        <p><sup>1</sup>We used the BERT-Base, Multilingual Cased model (104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters) from the authors’ repository (https://github.com/google-research/bert).</p>
        <p><sup>2</sup>We used the April 15, 2020 SciELO + Wikipedia, 300-dimension version of Medical Word Embeddings for Spanish, which can be downloaded from https://zenodo.org/record/3744326.</p>
        <p><sup>3</sup>We used the 300-dimension sub-word binary model from https://github.com/dccuchile/spanish-word-embeddings/blob/master/emb-from-suc.md.</p>
        <p>[Table: compared models — Vicomtech, UH-MAJA-KD, Talp-UPC (submission), Talp-UPC (BERT), Talp-UPC (BERT FT), Talp-UPC (W2V Health), Talp-UPC (FastText Health), Talp-UPC (FastText General); the associated scores are not recoverable from the extracted text.]</p>
        <p>BERT-based and Word2Vec/FastText-based models were trained for a total of 128 and 96
epochs respectively, divided among the pre-training, training and fine-tuning steps. Training
epochs were evenly distributed between pre-training and training steps for models with no
finetuning. When fine-tuning was applied (transfer-learning or single-task scenarios), pre-training
was shortened by 16 epochs.</p>
        <p>For each word representation model, independent models were trained with 8, 32 and 64
convolution filters of sizes 3 and 5; and 8, 32 and 64 single-layer recurrent units.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <table-wrap id="tab-results">
        <caption><p>Scores (P) obtained by each participant model.</p></caption>
        <table>
          <thead>
            <tr><th>Model</th><th>P</th></tr>
          </thead>
          <tbody>
            <tr><td>SINAI</td><td>0.844633</td></tr>
            <tr><td>Vicomtech</td><td>0.821622</td></tr>
            <tr><td>IXA-NER-RE</td><td>0.726733</td></tr>
            <tr><td>UH-MAJA-KD</td><td>0.820255</td></tr>
            <tr><td>Talp-UPC (fine-tuned)</td><td>0.807218</td></tr>
            <tr><td>Talp-UPC (general)</td><td>0.841727</td></tr>
            <tr><td>Talp-UPC (from-scratch)</td><td>0.821942</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>The joint key-phrase classification and relation extraction model presented by our team for
the previous edition of IberLEF’s eHealth Knowledge Discovery shared task outperformed
every other participant model by a wide margin. This confirmed our belief that a joint model
has the potential to exploit the mutual information between the two tasks and provide better
evaluation results than traditional step-by-step architectures. The improvement was, however,
less appreciable for the key-phrase classification task.</p>
<p>After comparing our model to the rest of the participants’ submissions, we hypothesised that one of the main shortcomings of ours was the absolute lack of context-specific knowledge. For this year’s edition, we decided to explore different alternatives to tackle this. But since a new transfer-learning scenario was added, whose evaluation score would probably be compromised if the source model relied too heavily upon context-specific features, we opted for adding this context-specific information in a way that would not significantly alter the model’s structure nor make it less general with handcrafted rules. In particular, we opted for swapping the general-purpose word representation model for a health-specific one.</p>
      <p>Unfortunately, the results show that the use of context-specific word embeddings does not
substantially improve upon general-purpose embeddings and even leads to worse results in
the transfer-learning scenario. Not only that, but we have also shown that contextual word
embeddings such as BERT and XLNet significantly outperform predictive word embedding
models such as Word2Vec and FastText. Moreover, the concatenation of this second word
representation does not seem to provide any additional information over the original, whilst it
makes the model more complex in terms of the number of trainable parameters.</p>
      <p>Several hypotheses may explain these unsatisfactory results. First, we argue that although
the documents’ language register is formal, the use of technical terms is limited. Similarly,
relation classes and especially key-phrase categories are arguably general, as pointed out by the
results obtained in Scenario 4. Secondly, predictive word embedding models may not be able to
capture the medical terms’ semantic information to a degree that can be used by our model, but
rather more explicit features may be preferable.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this article, we have described the main characteristics of the model that we have developed
for TALP team’s submission to IberLEF’s 2020 eHealth Knowledge Discovery shared task.
Our model follows the trend started by our team’s 2018’s model, which consists of using a
single network with shared weights that jointly performs the key-phrase recognition and
relation extraction tasks to leverage the mutual information between the two. It has proven to
be competitive against the other participants’ models, especially in the general and transfer-learning scenarios, ranking in second and first position respectively. The transfer-learning
scenario particularly highlights the adaptability and context-independence of our model.</p>
<p>Three main improvements were made over the previous year’s model: an adaptive learning rate for pre-training, single-scenario fine-tuning and context-specific word vector representations. The last of these has proven rather underwhelming, though, and we conclude that adding context-specific information to our model is still an unresolved issue.</p>
<p>Besides the aforementioned limitation, we see other shortcomings to our model that still need to be tackled to more accurately capture the mutual information between the two knowledge discovery tasks. Among these improvements, we would like to point out two that we believe are the most promising:
• Use a trainable combination function for the outputs generated by the model for different source tokens in a document. Our current model, on the other hand, uses a simple union operation to join the predictions for the different tokens of a single key-phrase.
• Use a fine-tuned context-specific contextual word embedding model. The use of context-specific predictive word embeddings has proven unsuccessful for our model, but general-purpose contextual word embeddings can be fine-tuned with context-specific unlabelled corpora.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This contribution has been partially funded by the Spanish Ministry of Economy (MINECO) and the European Union (TIN2016-77820-C3-3-R and AEI/FEDER, UE).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[3] F. Soares, M. Villegas, A. Gonzalez-Agirre, M. Krallinger, J. Armengol-Estapé, Medical word embeddings for Spanish: Development and evaluation, in: Proceedings of the 2nd Clinical Natural Language Processing Workshop, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 124–133. URL: https://www.aclweb.org/anthology/W19-1916. doi:10.18653/v1/W19-1916.</p>
      <p>[4] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).</p>
      <p>[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
      <p>[6] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Piad-Morffis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gutiérrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cañizares-Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Estevez-Velarde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Almeida-Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Muñoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Montoyo</surname>
          </string-name>
          ,
          <article-title>Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2020, in: Proceedings of the Iberian Languages Evaluation Forum co-located with 36th Conference of the Spanish Society for Natural Language Processing</article-title>
          ,
          <source>IberLEF@SEPLN 2020</source>
          , Spain, September
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Medina Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Turmo</given-names>
            <surname>Borras</surname>
          </string-name>
          ,
          <article-title>Talp-upc at ehealth-kd challenge 2019: A joint model with contextual embeddings for clinical information extraction</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019): co-located with 35th Conference of the Spanish Society for Natural Language Processing (SEPLN 2019), Bilbao, Spain, September 24th, 2019, CEUR-WS.org, pp. 78–84</source>
          <year>2019</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>