<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TALP at eHealth-KD Challenge 2020: Multi-Level Recurrent and Convolutional Neural Networks for Joint Classification of Key-Phrases and Relations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvador Medina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jordi Turmo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat Politècnica de Catalunya</institution>
          ,
          <addr-line>Campus Nord, Carrer de Jordi Girona, 1, 3, 08034 Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>85</fpage>
      <lpage>93</lpage>
      <abstract>
<p>This article describes the model presented by the TALP Team at IberLEF’s eHealth Knowledge Discovery 2020 shared task [1]. The model iterates on the idea of using a single model to simultaneously identify key-phrases and their relationships. Taking into account the new transfer-learning sub-task introduced in the 2020 edition of eHealthKD, our model does not rely on any domain-specific knowledge or handcrafted features. Our model was competitive in all four sub-tasks, ranking in 2nd, 3rd, 4th and 1st position respectively.</p>
      </abstract>
      <kwd-group>
<kwd>NERC</kwd>
        <kwd>Relation Extraction</kwd>
        <kwd>eHealth NLP</kwd>
        <kwd>Contextual Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. System Description</title>
<p>Our model expects a document and a source token index as input and generates a sequence of labels for each key-phrase and relation class. Input documents are parsed using FreeLing’s dependency parser, and each one of their tokens is encoded using either a BERT, a Word2Vec or a FastText pre-trained word-embedding model. The model then applies convolution filters to the encoded tokens of the input documents, combines the word-level filters’ outputs of each input token and the specified source token with sentence-level embeddings of the documents, and outputs the boundaries of each key-phrase containing the source token as well as the likelihoods that every other token is the target of a relation having the specified source token’s key-phrase as a source.</p>
<p>In order to generate all possible relations, the model must be run for every input token and the raw likelihoods combined across all of these runs. This approach of looking at a single input token at a time is inspired by attention-based translation models such as the Transformer, in which the model produces the most likely output token one at a time, conditioned on the previously generated tokens and the whole untranslated document.</p>
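        <p>The union-style combination of the raw likelihoods can be sketched as follows. This is a minimal illustration under our own assumptions: per-run likelihoods are keyed by (source, target, relation) triples and merged by keeping the maximum; the helper name and the toy data are ours, not the system’s.</p>
        <preformat>
```python
def combine_relation_scores(per_token_runs):
    """Union-style merge: keep the highest likelihood observed for each
    (source, target, relation) triple across all per-source-token runs."""
    merged = {}
    for run in per_token_runs:
        for triple, p in run.items():
            merged[triple] = max(p, merged.get(triple, 0.0))
    return merged

# Two hypothetical runs of the model for two different source tokens.
runs = [
    {("asma", "enfermedad", "is-a"): 0.9},
    {("asma", "enfermedad", "is-a"): 0.7, ("asma", "pulmones", "target"): 0.8},
]
print(combine_relation_scores(runs))
```
        </preformat>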
      <sec id="sec-2-1">
        <title>2.1. Internal structure of the model</title>
        <p>A visual representation of the model’s structure is shown in Figure 1. The network is composed
of a set of shared intermediate layers and two independent output layers. The intermediate
layers include a Bidirectional Gated Recurrent Unit layer followed by a set of convolution
filters. The recurrent units’ and convolution filters’ outputs are finally concatenated and fed to a fully
connected layer. The output layers consist of a fully connected layer followed by a Conditional
Random Field layer.</p>
        <p>This structure lets the model look at both the local and global contexts of each of the input
tokens. Particularly, the local context is captured by the recurrent units’ output and the
non-pooled convolution layer’s output, while the global context is captured by the max-pooled
convolution layer’s output. Additional global context information is added when the BERT-based
model is used by concatenating the encoding of the auxiliary CLS token.</p>
        <p>The global context information and the target token’s local context information are added
to all time-steps before being fed to the fully connected shared layer. The final outputs are
then generated by a Conditional Random Field (CRF) layer. Output CRF layers have proven to
improve the capabilities of GRU and LSTM networks in low-resource sequence tagging tasks[4].</p>
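        <p>At the tensor level, the combination of local and global context described above can be sketched with NumPy. The shapes, the random stand-in arrays and the use of max-pooling over the convolution output are illustrative assumptions; the real model produces these tensors with its recurrent and convolution layers.</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                          # toy sequence length and feature width
gru_out = rng.normal(size=(T, d))    # stand-in for the recurrent layer's output
conv_out = rng.normal(size=(T, d))   # stand-in for the non-pooled convolution output
global_ctx = conv_out.max(axis=0)    # max-pooled convolution: one global vector

# Broadcast the global context and the source token's local context to
# every time step before the shared fully connected layer.
src = 2                              # index of the source token
combined = np.concatenate(
    [gru_out,
     conv_out,
     np.tile(global_ctx, (T, 1)),
     np.tile(gru_out[src], (T, 1))],
    axis=1,
)
print(combined.shape)   # (6, 32)
```
        </preformat>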
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Output generation and decoding</title>
        <p>As described in Section 2, our system receives the sequence of tokens of a document and a
token’s index and outputs the bounds of the innermost key-phrase to which the token belongs.
These bounds are encoded and decoded by assigning a Begin, Inside, Unitary and End tag to each
token included in that key-phrase and Out to every other token (BIOUE-tag). One limitation of
this approach is the fact that just one key-phrase is decoded for each token index, but this is
not an issue in our case, as key-phrases may subsume but not overlap other key-phrases.</p>
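        <p>The BIOUE encoding of a single key-phrase span and its decoding can be illustrated with a short sketch (the helper names are ours):</p>
        <preformat>
```python
def encode_biuoe(n_tokens, span):
    """Encode the innermost key-phrase span (start, end inclusive) with
    Begin/Inside/Unitary/End tags; every other token is tagged Out."""
    start, end = span
    tags = ["O"] * n_tokens
    if start == end:
        tags[start] = "U"
    else:
        tags[start] = "B"
        tags[end] = "E"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

def decode_biuoe(tags):
    """Recover the single key-phrase span from a BIOUE tag sequence."""
    if "U" in tags:
        i = tags.index("U")
        return (i, i)
    if "B" in tags:
        return (tags.index("B"), tags.index("E"))
    return None

tags = encode_biuoe(6, (1, 3))
print(tags)   # ['O', 'B', 'I', 'E', 'O', 'O']
```
        </preformat>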
        <p>For each input token, our model predicts the probability of each relation class holding between the innermost entity to which the token belongs and each one of the tokens in the document. Note that for the source token we only consider the innermost entity, whereas for the target tokens we consider all parent entities. Consequently, our method does not allow for overlapping relations from the same source token. This restriction is imposed so that the encoded sequence is not ambiguous. A visual representation of relations’ probability predictions is shown in Figure 2. Relations are predicted from the target key-phrase if the aggregated score inside a key-phrase span surpasses a threshold. Only the key-phrase with the highest score is selected if multiple key-phrases overlap.</p>
        <p>[Figure 1: Internal structure of the model. Token representations feed a recurrent (LSTM) layer and convolution filters; their outputs are concatenated and passed through fully-connected layers to two output CRF layers, one for concepts (Yc) and one for relations (Yr).]</p>
        <p>[Figure 2: Example of relation probability predictions, with token-level scores (e.g. 0.45, 0.10, 0.05) aggregated over key-phrase spans.]</p>
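        <p>The decoding of relation targets described above can be sketched as follows. This is a minimal sketch under our own assumptions: the aggregation over a span is taken to be the mean and the threshold is set to an illustrative 0.4; neither value is specified by the description above.</p>
        <preformat>
```python
def overlaps(a, b):
    # Inclusive (start, end) spans overlap iff neither ends before the other starts.
    return a[1] >= b[0] and b[1] >= a[0]

def pick_targets(token_scores, spans, threshold=0.4):
    """Average each candidate span's token-level target likelihoods, keep
    spans whose aggregated score surpasses the threshold, and resolve
    overlaps in favour of the highest-scoring span."""
    scored = sorted(
        ((sum(token_scores[s:e + 1]) / (e - s + 1), (s, e)) for s, e in spans),
        reverse=True,
    )
    kept = []
    for score, span in scored:
        if score > threshold and not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept)

# Token likelihoods 0.45, 0.10, 0.05 as in Figure 2, plus a strong target.
print(pick_targets([0.05, 0.45, 0.10, 0.05, 0.9], [(1, 2), (1, 3), (4, 4)]))  # [(4, 4)]
```
        </preformat>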
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Input features</title>
        <p>As previously mentioned, our model processes the documents at the token level. We represent each token by a vector, which results from the concatenation of the features listed below:
• One-hot encoding of the category and type fields of the token’s Part-of-Speech tag from FreeLing’s tag-set.
• Normalized vector encoding the dependencies found in the path between the token and the target token (the one that is being decoded). It is computed by adding the one-hot encoding representation of the dependency class for each hop in the dependency path, not considering its direction, and normalizing the resulting vector. For instance, the representation of the token "I" in "I eat fish" when the target token is "fish" would be a vector with 1/√2 in the positions corresponding to "subj" (subject) and "cd" (direct complement); whereas for "eat" it would be a vector with a single 1 in the "cd" position.
• One-hot encoding of the distance between the token and the target token.
• Word-embedding of the token. We consider 4 alternative pre-trained word embedding models:
– Concatenation of the last output layers of a multi-language general-purpose BERT [5] model<sup>1</sup> with no fine-tuning.
– Word2Vec and FastText Medical Word Embeddings for Spanish models from the Barcelona Supercomputing Center<sup>2</sup> [3].
– FastText Spanish Unannotated Corpora from SUC<sup>3</sup> [6].</p>
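        <p>As an illustration of the normalized dependency-path feature, the following sketch uses a reduced dependency tag-set (FreeLing’s real tag-set is larger) and helper names of our own:</p>
        <preformat>
```python
import math

DEP_CLASSES = ["subj", "cd", "obj", "mod"]   # reduced, illustrative tag-set

def dep_path_vector(path):
    """Sum one-hot encodings of the dependency class of each hop on the
    undirected path to the target token, then L2-normalize the result."""
    vec = [0.0] * len(DEP_CLASSES)
    for dep in path:
        vec[DEP_CLASSES.index(dep)] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

# "I" to "fish" in "I eat fish": hops "subj" and "cd" give 1/sqrt(2) each.
print(dep_path_vector(["subj", "cd"]))
# "eat" to "fish": a single "cd" hop gives a plain one-hot vector.
print(dep_path_vector(["cd"]))
```
        </preformat>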
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Pre-training with the ensemble corpus</title>
<p>Due to the comparatively large number of parameters in our model with respect to the size of the training dataset, overfitting can be an issue. We prevent this by using the relatively larger but less accurate ensemble corpus in a pre-training phase. In order not to let our model’s variables fall into local minima that would make our model mimic previous years’ models, we randomly add documents from IberLEF 2020’s training corpus. Furthermore, we increase dropout and gradually decrease the learning rate for the training and fine-tuning steps.</p>
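        <p>A minimal sketch of such a staged schedule is shown below; the phase lengths, decay factor and rates are illustrative assumptions, not the values used in our experiments:</p>
        <preformat>
```python
def staged_lr(epoch, pretrain_epochs=48, train_epochs=48, base_lr=0.001):
    """Illustrative three-phase schedule: a flat rate while pre-training
    on the ensemble corpus, a gradually halved rate during training, and
    a strongly reduced rate for fine-tuning. All values are assumptions."""
    if epoch > pretrain_epochs + train_epochs:   # fine-tuning phase
        return base_lr * 0.01
    if epoch > pretrain_epochs:                  # training phase: gradual decay
        return base_lr * 0.5 ** ((epoch - pretrain_epochs) / 16.0)
    return base_lr                               # pre-training phase

print([staged_lr(e) for e in (10, 64, 100)])
```
        </preformat>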
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Single-scenario training and fine-tuning</title>
<p>In the general evaluation scenario, the loss function has to balance accuracy for both the key-phrase recognition and relation extraction tasks. This may be problematic, as the parameter updates made by the optimizer to improve one task might be detrimental to the other task. However, in evaluation scenarios 2 and 3, that is, the independent key-phrase recognition and relation extraction tasks, the model does not have to generate both outputs. Consequently, on the one hand, we can use an uncompromising loss function. On the other hand, this means not being able to exploit the correlation between tasks, so it might also lead to worse performance.</p>
<p>To study this effect, we suggest three different single-scenario training strategies: using the general model with no alteration whatsoever, fine-tuning the general model’s outputs with an independent loss function for a few epochs, or training the specific model from scratch. Note that in the case of scenario 3, we decode the key-phrases using the gold standard rather than the model’s output for all three strategies; and concatenate a one-hot encoding of the key-phrase labels to the input for the from-scratch strategy. Table 2 shows the results for all three single-scenario training strategies.</p>
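        <p>For the from-scratch strategy in scenario 3, concatenating a one-hot encoding of the gold key-phrase labels to the inputs can be sketched as follows. The label list (the eHealth-KD key-phrase classes plus an "O" tag) and the helper names are illustrative, not the system’s own:</p>
        <preformat>
```python
# eHealth-KD key-phrase classes plus an "O" (no key-phrase) tag.
LABELS = ["O", "Concept", "Action", "Predicate", "Reference"]

def one_hot(label):
    vec = [0.0] * len(LABELS)
    vec[LABELS.index(label)] = 1.0
    return vec

def with_gold_labels(token_features, gold_labels):
    """Scenario-3 from-scratch input: concatenate a one-hot encoding of
    the gold key-phrase label to each token's feature vector."""
    return [feats + one_hot(lab) for feats, lab in zip(token_features, gold_labels)]

rows = with_gold_labels([[0.1, 0.2], [0.3, 0.4]], ["Concept", "O"])
print(rows[0])   # [0.1, 0.2, 0.0, 1.0, 0.0, 0.0, 0.0]
```
        </preformat>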
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Trainable parameters and computational resources</title>
        <p>All models were trained using the TensorFlow® 1.15 framework for Python® 3.6 on an 8-core Intel® Xeon® E5-2620 v4 CPU at 2.10 GHz with 16 GB of DDR4 RAM, a GeForce® GTX 1070 GPU and a 7200 rpm 1 TB Seagate® HDD.</p>
        <p><sup>1</sup>We used the BERT-Base, Multilingual Cased model (104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters) from the authors’ repository (https://github.com/google-research/bert).</p>
        <p><sup>2</sup>We used the April 15, 2020 SciELO + Wikipedia, 300-dimension version of Medical Word Embeddings for Spanish, which can be downloaded from https://zenodo.org/record/3744326.</p>
        <p><sup>3</sup>We used the 300-dimension sub-word binary model from https://github.com/dccuchile/spanish-word-embeddings/blob/master/emb-from-suc.md.</p>
        <p>[Table: compared models — Vicomtech, UH-MAJA-KD, Talp-UPC (submission), Talp-UPC (BERT), Talp-UPC (BERT FT), Talp-UPC (W2V Health), Talp-UPC (FastText Health), Talp-UPC (FastText General); the associated scores are not recoverable from the extracted text.]</p>
        <p>BERT-based and Word2Vec/FastText-based models were trained for a total of 128 and 96
epochs respectively, divided among the pre-training, training and fine-tuning steps. Training
epochs were evenly distributed between pre-training and training steps for models with no
finetuning. When fine-tuning was applied (transfer-learning or single-task scenarios), pre-training
was shortened by 16 epochs.</p>
        <p>For each word representation model, independent models were trained with 8, 32 and 64
convolution filters of sizes 3 and 5; and 8, 32 and 64 single-layer recurrent units.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <table-wrap id="tab-results">
        <caption><p>Scores (P) obtained by each participant model.</p></caption>
        <table>
          <thead>
            <tr><th>Model</th><th>P</th></tr>
          </thead>
          <tbody>
            <tr><td>SINAI</td><td>0.844633</td></tr>
            <tr><td>Vicomtech</td><td>0.821622</td></tr>
            <tr><td>IXA-NER-RE</td><td>0.726733</td></tr>
            <tr><td>UH-MAJA-KD</td><td>0.820255</td></tr>
            <tr><td>Talp-UPC (fine-tuned)</td><td>0.807218</td></tr>
            <tr><td>Talp-UPC (general)</td><td>0.841727</td></tr>
            <tr><td>Talp-UPC (from-scratch)</td><td>0.821942</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>The joint key-phrase classification and relation extraction model presented by our team for
the previous edition of IberLEF’s eHealth Knowledge Discovery shared task outperformed
every other participant model by a wide margin. This confirmed our belief that a joint model
has the potential to exploit the mutual information between the two tasks and provide better
evaluation results than traditional step-by-step architectures. The improvement was, however,
less appreciable for the key-phrase classification task.</p>
<p>After comparing our model to the rest of the participants’ submissions, we hypothesised that one of the main shortcomings of ours was the absolute lack of context-specific knowledge. For this year’s edition, we decided to explore different alternatives to tackle this. But since a new transfer-learning scenario was added, whose evaluation score would probably be compromised if the source model relied too heavily upon context-specific features, we opted for adding this context-specific information in a way that would not significantly alter the model’s structure nor make it less general with handcrafted rules. In particular, we opted for swapping the general-purpose word representation model for a health-specific one.</p>
      <p>Unfortunately, the results show that the use of context-specific word embeddings does not
substantially improve upon general-purpose embeddings and even leads to worse results in
the transfer-learning scenario. Not only that, but we have also shown that contextual word
embeddings such as BERT and XLNet significantly outperform predictive word embedding
models such as Word2Vec and FastText. Moreover, the concatenation of this second word
representation does not seem to provide any additional information over the original, whilst it
makes the model more complex in terms of the number of trainable parameters.</p>
      <p>Several hypotheses may explain these unsatisfactory results. First, we argue that although
the documents’ language register is formal, the use of technical terms is limited. Similarly,
relation classes and especially key-phrase categories are arguably general, as pointed out by the
results obtained in Scenario 4. Secondly, predictive word embedding models may not be able to
capture the medical terms’ semantic information to a degree that can be used by our model, but
rather more explicit features may be preferable.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this article, we have described the main characteristics of the model that we have developed
for TALP team’s submission to IberLEF’s 2020 eHealth Knowledge Discovery shared task.
Our model follows the trend started by our team’s 2018’s model, which consists of using a
single network with shared weights that jointly performs the key-phrase recognition and
relation extraction tasks to leverage the mutual information between the two. It has proven to
be competitive against the other participants’ models, especially in the general and transfer-learning scenarios, ranking in second and first position respectively. The transfer-learning
scenario particularly highlights the adaptability and context-independence of our model.</p>
<p>Three main improvements were made over the previous year’s model: an adaptive learning rate for pre-training, single-scenario fine-tuning and context-specific word vector representations. The last of these has proven rather underwhelming, though, and we conclude that adding context-specific information to our model is still an unresolved issue.</p>
<p>Besides the aforementioned limitation, we see other shortcomings to our model that still need to be tackled to more accurately capture the mutual information between the two knowledge discovery tasks. Among these improvements, we would like to point out two that we believe are the most promising:
• Use a trainable combination function for the outputs generated by the model for different source tokens in a document. Our current model, on the other hand, uses a simple union operation to join the predictions for the different tokens of a single key-phrase.
• Use a fine-tuned context-specific contextual word embedding model. The use of context-specific predictive word embeddings has proven unsuccessful for our model, but general-purpose contextual word embeddings can be fine-tuned with context-specific unlabelled corpora.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This contribution has been partially funded by the Spanish Ministry of Economy (MINECO) and the European Union (TIN2016-77820-C3-3-R and AEI/FEDER, UE).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[3] F. Soares, M. Villegas, A. Gonzalez-Agirre, M. Krallinger, J. Armengol-Estapé, Medical word embeddings for Spanish: Development and evaluation, in: Proceedings of the 2nd Clinical Natural Language Processing Workshop, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 124–133. URL: https://www.aclweb.org/anthology/W19-1916. doi:10.18653/v1/W19-1916.</p>
      <p>[4] Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991 (2015).</p>
      <p>[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
      <p>[6] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Piad-Morffis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gutiérrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cañizares-Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Estevez-Velarde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Almeida-Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Muñoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Montoyo</surname>
          </string-name>
          ,
          <article-title>Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2020, in: Proceedings of the Iberian Languages Evaluation Forum co-located with 36th Conference of the Spanish Society for Natural Language Processing</article-title>
          ,
          <source>IberLEF@SEPLN 2020</source>
          , Spain, September
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Medina Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Turmo</given-names>
            <surname>Borras</surname>
          </string-name>
          ,
          <article-title>Talp-upc at ehealth-kd challenge 2019: A joint model with contextual embeddings for clinical information extraction</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019): co-located with 35th Conference of the Spanish Society for Natural Language Processing (SEPLN 2019), Bilbao, Spain, September 24th, 2019, CEUR-WS.org, pp. 78–84</source>
          <year>2019</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>