=Paper=
{{Paper
|id=Vol-2943/ehealth_paper5
|storemode=property
|title=uhKD4 at eHealth-KD Challenge 2021: Deep Learning Approaches for Knowledge Discovery from Spanish Biomedical Documents
|pdfUrl=https://ceur-ws.org/Vol-2943/ehealth_paper5.pdf
|volume=Vol-2943
|authors=Dayany Alfaro-González,Dalianys Pérez-Perera,Gilberto González-Rodríguez,Antonio Jesús Otaño-Barrera,Rocío Cruz-Linares
|dblpUrl=https://dblp.org/rec/conf/sepln/Alfaro-Gonzalez21
}}
==uhKD4 at eHealth-KD Challenge 2021: Deep Learning Approaches for Knowledge Discovery from Spanish Biomedical Documents==
Dayany Alfaro-González, Dalianys Pérez-Perera, Gilberto González-Rodríguez, Antonio Jesús Otaño-Barrera, and Rocío Cruz-Linares[0000-0002-0069-8950]

Faculty of Math and Computer Science, University of Havana, La Habana, Cuba

Abstract. This paper describes the system presented by team uhKD4 in the IberLEF eHealth Knowledge Discovery Challenge 2021. The challenge proposes two tasks devoted to extracting the semantic meaning of mainly health-related sentences in Spanish: Task A (entity recognition) and Task B (relation extraction). The sequential attainment of both tasks represents the main evaluation scenario of the challenge. The system is built upon two independent deep-learning-based architectures, one for each task of the challenge. Task A is addressed as a sequence labelling problem with a model that uses Long Short-Term Memory layers to encode context information and a linear-chain Conditional Random Field as tag decoder. Task B is approached as a multi-class classification problem using a Convolutional Neural Network that consists mainly of convolutional layers to recognize n-grams, pooling layers to determine the most relevant features, and a logistic regression layer at the end to perform classification. The system obtained the fourth position in the main evaluation scenario of the competition. In the individual evaluation of the tasks, the model for Task A showed average results, while the Task B model reached the third position.

Keywords: eHealth · Knowledge Discovery · Natural Language Processing · Information Extraction · Named Entity Recognition · Relation Extraction · Deep Learning.

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

This paper presents a description of the solution submitted by team uhKD4 to the IberLEF eHealth Knowledge Discovery Challenge 2021. The challenge proposes two tasks devoted to extracting the semantic meaning of mainly health-related sentences in Spanish: Task A (entity recognition) aims to identify all the entities in a document and their types, and Task B (relation extraction) seeks to recognize all relevant semantic relationships between the recognized entities. The sequential attainment of both tasks represents the main evaluation scenario of the challenge [7].

The proposed system consists of two independent components, one for each task. To solve the named entity recognition (NER) problem associated with Task A, we present a model that uses Long Short-Term Memory (LSTM) layers to encode context information, motivated by the fact that they have demonstrated remarkable achievements in modeling sequential data [4]. On top of that, a dense layer and a Conditional Random Field (CRF) [3] layer are added; the CRF has been widely used as a tag decoder, taking the context-dependent representations and producing a sequence of tags corresponding to the input sequence [4]. The relation extraction (RE) problem framed in Task B is approached using a Convolutional Neural Network (CNN) that consists mainly of convolutional layers to recognize n-grams, pooling layers to determine the most relevant features, and a fully connected neural network with a softmax at the end to perform classification [6].

The rest of this paper is organized as follows.
Section 2 describes in detail the architectures used by the system. The official results achieved in each scenario of the challenge are shown in Section 3. Section 4 shares some insights derived from experimentation. Finally, Section 5 states the conclusions and recommendations for future work.

2 System Description

Our system is built upon two independent deep-learning-based architectures. Accordingly, two different models are defined and each task is carried out separately.

Task A is approached as a sequence labelling problem in which each token from an input sequence is assigned a label that represents the combination of the BILUOV entity tagging scheme with each one of the possible entity types. The BILUOV tags correspond to: Begin, to represent the start of an entity; Inner, to represent its continuation; Last, to represent its end; Unit, to represent single-word entities; Other, to represent words that are not part of any entity; and oVerlapping, to represent words that belong to multiple entities [1]. For example, in the sentence "El cáncer de la cavidad nasal y de los senos paranasales no es común" each word should be labeled as stated between parentheses: El (O) cáncer (V-Concept) de (I-Concept) la (I-Concept) cavidad (I-Concept) nasal (L-Concept) y (O) de (I-Concept) los (I-Concept) senos (I-Concept) paranasales (L-Concept) no (O) es (O) común (U-Concept). Thus, the output of the model considers 21 different labels: the O label and the combinations of the remaining tags (BILUV) with the entity types (Concept, Action, Predicate and Reference).

The proposed approach to Task B is to solve a multi-class classification problem in which, given a sentence and a highlighted pair of entities, one of the predefined relations is assigned to occur from the first entity toward the second one. A new artificial relation class, none, is defined to represent the non-occurrence of any relation between a pair of entities.

2.1 Preprocessing

The initial step to extract useful information from the raw text input is the tokenization of each sentence, since both tasks require the analysis of the sequence of words in the sentence. A fixed length for the sentences is defined as a parameter for the models, and each sequence of tokens is trimmed or padded accordingly to fit the designated length. The particular features that were considered to obtain the input representation for each model are described below; an illustrative sketch of how some of them can be computed follows the list.

Common
– Word embedding: Pre-trained word2vec [5] word embeddings with dimensionality 300, trained on the Spanish Billion Words Corpus using the skip-gram variant with negative sampling. The weights are kept unchanged during the training phase.
– POS-tag embedding: Embedding to encode the information expressed by the part-of-speech tag of the token.

Task A
– Character representation: Every token is trimmed or padded to ensure that they all have the same predefined number of characters. By means of an embedding layer, each character of a word is translated to a vector that represents one of the ASCII letters, digits and punctuation symbols; these vectors are then fed into an RNN-based model that uses a Bidirectional Long Short-Term Memory (BiLSTM) to obtain a character-level representation of the token.

Task B
– BILUOV and Entity Type embedding: Embedding intended to encode the information given by the corresponding label of each word according to the combination of the BILUOV tag system and the possible entity types.
– Position embeddings: Embeddings to encode the relative distance between each word and the two target entities in the sentence. In the case of a multi-word entity, the distance to the first word of that entity is considered.
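As an illustration of how these features can be materialized, the following minimal sketch (not taken from the authors' implementation; all function and variable names are hypothetical) builds the 21-label vocabulary described above and computes clipped, shifted relative distances that could index a position embedding layer.

```python
# Illustrative sketch only: label vocabulary for Task A and relative
# position features for Task B. Names are hypothetical placeholders.

ENTITY_TYPES = ["Concept", "Action", "Predicate", "Reference"]
TAGS = ["B", "I", "L", "U", "V"]

# O plus 5 tags x 4 entity types = 21 labels in total.
LABELS = ["O"] + [f"{tag}-{etype}" for tag in TAGS for etype in ENTITY_TYPES]
label2id = {label: i for i, label in enumerate(LABELS)}


def relative_positions(num_tokens, entity_start, max_len=80):
    """Signed distance from every token to the first word of an entity,
    clipped and shifted to be non-negative so it can feed an embedding layer."""
    positions = []
    for i in range(num_tokens):
        dist = i - entity_start            # negative before the entity, positive after
        dist = max(-max_len, min(max_len, dist))
        positions.append(dist + max_len)   # shift into [0, 2 * max_len]
    return positions


if __name__ == "__main__":
    print(len(LABELS))                         # 21
    print(relative_positions(6, 2, max_len=80))
```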
2.2 Named Entity Recognition Model

Figure 1 shows the architecture of the defined model. As stated in the previous subsection, the input of the model is a sequence of tokens, each one represented as the concatenation of the vectors from the word and POS-tag embeddings and the character-level features. After the input is handled, the sequence of word vectors is processed in both directions by a BiLSTM layer and the features extracted from the forward and backward passes are concatenated together. The resulting sequence is intended to increase the amount of information available to the network, improving the context available to the algorithm (e.g. knowing which words immediately follow and precede a word in a sentence). Afterward, the sequence is processed by a simple LSTM layer to extract the most important features. Finally, a dense layer with a linear activation function followed by a linear-chain CRF is used to output the most probable sequence of labels corresponding to the tokens. The CRF layer uses sentence-level tag information to add constraints to the final predicted labels to ensure they are valid. These constraints can be learned automatically from the dataset during the training process.

Fig. 1. Task A model architecture.

Since the goal is to classify into only four types of entities, a subsequent phase of decoding the output of the CRF layer is needed. The required transformation is performed in a way that is similar to the process described by team UH-MatCom at the previous edition of the challenge [1]. The process is accomplished in two steps. First, rules are used to discover the possible entities that use overlapped words and are not formed by continuous words. Afterwards, the remaining entities are assumed to be a continuous sequence of tokens and are detected in an iterative manner.
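For concreteness, the Task A network described above can be assembled roughly as in the following sketch, assuming the Keras 2.2.4 / keras-contrib stack mentioned in Section 2.5 and the sizes listed later in Table 1. This is a minimal sketch, not the authors' code; placeholders such as `embedding_matrix` and the vocabulary sizes are assumptions.

```python
# Minimal sketch of the Task A architecture (BiLSTM + LSTM + Dense + CRF).
# Assumes Keras 2.2.4 with keras-contrib; all sizes/names marked below are placeholders.
import numpy as np
from keras.models import Model
from keras.layers import (Input, Embedding, Bidirectional, LSTM,
                          TimeDistributed, Dense, concatenate)
from keras.optimizers import RMSprop
from keras_contrib.layers import CRF

MAX_SENT, MAX_WORD, N_LABELS = 40, 20, 21
VOCAB, N_POS, N_CHARS = 30000, 20, 100            # illustrative vocabulary sizes
embedding_matrix = np.zeros((VOCAB, 300))         # placeholder for the word2vec weights

word_in = Input(shape=(MAX_SENT,), name="words")
pos_in = Input(shape=(MAX_SENT,), name="pos_tags")
char_in = Input(shape=(MAX_SENT, MAX_WORD), name="chars")

# Frozen pre-trained word embeddings; trainable POS and character embeddings.
word_emb = Embedding(VOCAB, 300, weights=[embedding_matrix], trainable=False)(word_in)
pos_emb = Embedding(N_POS, 50)(pos_in)
char_emb = TimeDistributed(Embedding(N_CHARS, 50))(char_in)
# Character-level representation: a BiLSTM run over the characters of each word.
char_repr = TimeDistributed(Bidirectional(LSTM(150)))(char_emb)

x = concatenate([word_emb, pos_emb, char_repr])
x = Bidirectional(LSTM(300, return_sequences=True, recurrent_dropout=0.5))(x)
x = LSTM(300, return_sequences=True, recurrent_dropout=0.5)(x)
x = TimeDistributed(Dense(N_LABELS, activation="linear"))(x)

crf = CRF(N_LABELS, sparse_target=True)           # linear-chain CRF tag decoder
out = crf(x)

model = Model([word_in, pos_in, char_in], out)
model.compile(optimizer=RMSprop(lr=0.001), loss=crf.loss_function,
              metrics=[crf.accuracy])
model.summary()
```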
2.3 Relation Extraction Model

The architecture defined for Task B is shown in Figure 2. The relation extraction system is provided only with raw sentences marked with the positions of the two entities of interest and the corresponding type of each one. Thus, exploiting the elements that can be derived from that input, each relation mention is represented by a matrix X = [w1, w2, ..., wn], where n is the defined length for the sentences and wi is the result of concatenating, for the i-th token, the embeddings described before.

Fig. 2. Task B model architecture.

The matrix X is processed by the convolutional layer in order to extract high-level features. A filter with window size s can be denoted as F = [f1, f2, ..., fs]. Applying the convolution operation on the two matrices X and F yields a score sequence T = [t1, t2, ..., t_{n-s+1}]:

t_i = g\Big( \sum_{j=0}^{s-1} f_{j+1}^{\top} w_{j+i} + b \Big)   (1)

where g is some non-linear function and b is a bias term. This process is replicated for various filters with distinct window sizes to explore the contribution of different n-grams. Then, a pooling layer is applied to aggregate the scores for each filter, ensuring invariance to absolute positions while retaining the relative positions among the n-grams and the entities. Specifically, a global max pooling layer is used to aggressively summarize the most important or relevant features from each score sequence. A dropout is applied to the resulting feature vector for regularization, and the result is then fed into a fully connected layer followed by a softmax layer in order to carry out classification [6].
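The following sketch shows how such a multi-window CNN with global max pooling could be wired up in the same Keras stack, using the sizes listed later in Table 2. It is an illustrative sketch, not the authors' implementation: the input is assumed to be the precomputed matrix X of concatenated token embeddings, and the feature dimensionality and number of relation classes are placeholders.

```python
# Minimal sketch of the Task B CNN: parallel convolutions over several window
# sizes, global max pooling, dropout and a softmax classifier.
# FEAT_DIM and N_RELATIONS are assumed/illustrative values.
from keras.models import Model
from keras.layers import (Input, Conv1D, GlobalMaxPooling1D, Dropout,
                          Dense, concatenate)
from keras.optimizers import Adam

MAX_SENT = 80          # fixed sentence length for Task B
FEAT_DIM = 500         # illustrative size of the concatenated per-token embeddings
N_RELATIONS = 14       # assumed: predefined relation types plus the artificial `none` class

x_in = Input(shape=(MAX_SENT, FEAT_DIM), name="token_features")

# One convolution per window size; each produces a score sequence T (Equation 1)
# that is summarized by global max pooling.
pooled = []
for window in (2, 3, 4, 5, 6):
    t = Conv1D(filters=150, kernel_size=window, activation="tanh")(x_in)
    pooled.append(GlobalMaxPooling1D()(t))

features = concatenate(pooled)
features = Dropout(0.5)(features)
out = Dense(N_RELATIONS, activation="softmax")(features)

model = Model(x_in, out)
model.compile(optimizer=Adam(lr=0.001), loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```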
2.4 Hyperparameters Setup

Tables 1 and 2 show the selected sets of hyperparameters for the NER and RE models, respectively. In both tables, the configurations concerning input handling appear at the top, the middle section covers the rest of the network, and the hyperparameters for training are located at the bottom. The hyperparameter tuning process was carried out manually, taking as a starting point settings that have shown a positive impact in past works involving similar architectures. The provided development collection was used as the validation dataset. The number of epochs was selected according to the performance shown in the training curves.

2.5 Training

The systems were implemented in the Python programming language using the Keras (v2.2.4) framework with TensorFlow (v1.13.1) as backend. In the NER model, the keras-contrib (v0.0.2) implementation of the CRF layer was used. Tokenization and POS tags were obtained using the es_core_news_md model of the Python library spaCy (v3.0.6). The training collection provided for the challenge was the only data used to train both models. The process was carried out on a machine with a 4-core AMD A10-8700P CPU at 1.80 GHz and 16 GB of installed memory. The training time was close to 8 hours for the NER model and a little more than 2 hours for the RE model.

Table 1. Hyperparameters setup for the NER model

Hyperparameter                    Value
POS-tag embedding                 50
Character embedding               50
Character BiLSTM recurrent units  150
Max sentence length               40
Max word length                   20
BiLSTM recurrent units            300
BiLSTM recurrent dropout          0.5
LSTM recurrent units              300
LSTM recurrent dropout            0.5
Optimizer                         RMSprop
Learning rate                     0.001
Loss function                     Negative Log-Likelihood
Mini-batch size                   64
Epochs                            9

Table 2. Hyperparameters setup for the RE model

Hyperparameter                    Value
POS-tag embedding                 50
BILUOV and Entity Type embedding  50
Position Entity 1 embedding       50
Position Entity 2 embedding       50
Max sentence length               80
Convolution number of filters     150
Convolution window sizes          (2,3,4,5,6)
Non-linear function g             tanh
Dropout rate                      0.5
Optimizer                         Adam
Learning rate                     0.001
Loss function                     Categorical Crossentropy
Mini-batch size                   32
Epochs                            15

3 Results

Table 3 presents the official results for the main evaluation scenario of the challenge, where our team obtained the fourth position with an F1 score of 0.423. There is a significant difference between the F1 scores of our system and the ones ranked higher. Thus, although we achieve competitive results, there is still room for improvement.

Table 3. Scenario 1 (Main Evaluation)

Team              F1     Precision  Recall
Vicomtech         0.531  0.541      0.535
PUCRJ-PUCPR-UFMG  0.528  0.568      0.503
IXA               0.499  0.465      0.539
uhKD4             0.423  0.485      0.374
UH-MMM            0.339  0.292      0.404
Codestrange       0.232  0.337      0.177
baseline          0.232  0.337      0.177
JAD               0.109  0.234      0.071

In the second evaluation scenario, regarding entity extraction, our system shows the least promising results of all scenarios, ranking fifth with an F1 score of 0.527, as shown in Table 4. In contrast, for the relation extraction task an F1 score of 0.318 is achieved and the third position is reached; these results are presented in Table 5.

Table 4. Scenario 2 (Task A)

Team              F1     Precision  Recall
PUCRJ-PUCPR-UFMG  0.706  0.715      0.697
Vicomtech         0.684  0.700      0.747
IXA               0.653  0.614      0.698
UH-MMM            0.608  0.546      0.685
uhKD4             0.527  0.518      0.537
Yunnan-Deep       0.334  0.520      0.246
baseline          0.306  0.350      0.225
JAD               0.263  0.316      0.071
Yunnan-1          0.173  0.271      0.127
Codestrange       0.080  0.415      0.044

Table 5. Scenario 3 (Task B)

Team              F1     Precision  Recall
IXA               0.430  0.454      0.409
Vicomtech         0.372  0.542      0.283
uhKD4             0.318  0.556      0.222
PUCRJ-PUCPR-UFMG  0.263  0.367      0.205
UH-MMM            0.054  0.077      0.041
Codestrange       0.033  0.438      0.017
baseline          0.033  0.438      0.017
JAD               0.007  0.375      0.004

4 Discussion

We would like to remark on the relevance of the features used in both models. In particular, the NER model using only the pre-trained word embedding showed poor results, while the addition of POS-tag and character information provided a significant boost in performance. Regarding the RE task, a major issue to overcome is the data scarcity problem: the number of non-relation entity pairs is often far larger than the number of pairs that represent a relation, which leads to a widely unbalanced dataset and has a negative impact on the performance of the models. To mitigate this problem, we enriched the input representation with BILUOV tags and entity type information, in order to capture patterns in which the entities appear in a sentence that may be helpful to discriminate between positive and negative instances. The technique of adding the tag system information has been explored before in an architecture similar to ours, achieving good results [8]. Experimentation showed that the incorporation of those features was highly influential on performance, as we expected. Also, regarding the architecture of the RE model, it is worth mentioning that we experimented with both ordinary max pooling layers and global ones, and better results were achieved with the latter.

5 Conclusions

This paper described the system proposed by team uhKD4 at the IberLEF eHealth Knowledge Discovery Challenge 2021. Two independent deep-learning-based models were defined to solve each task of the competition. Task A is solved as a sequence labelling problem by a model that uses a word2vec pre-trained embedding along with syntactic features as the input representation, which is afterwards processed by LSTM and CRF layers. Task B is approached as a multi-class classification problem. In this case, besides the pre-trained word embedding and syntactic features, information from the BILUOV tags and the relative distance to the highlighted entities is also used. Then a CNN with filters of multiple window sizes and a logistic regression layer at the end performs classification. The system obtained the fourth position in the main evaluation scenario of the competition. In the individual tasks, the NER model showed average results while the RE model reached the third position. As future work, we propose considering the use of domain-specific features and external sources of knowledge, as well as exploring the use of contextual embeddings, such as Bidirectional Encoder Representations from Transformers (BERT) [2].

References

1. Consuegra-Ayala, J.P., Palomar, M.: UH-MatCom at eHealth-KD Challenge 2020: Deep-Learning and Ensemble Models for Knowledge Discovery in Spanish Documents (2020)
2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2019)
3. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning. pp. 282–289. ICML '01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2001)
4. Li, J., Sun, A., Han, J., Li, C.: A survey on deep learning for named entity recognition (2020)
5. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality (2013)
6. Nguyen, T., Grishman, R.: Relation extraction: Perspective from convolutional neural networks. pp. 39–48 (2015). https://doi.org/10.3115/v1/W15-1506
7. Piad-Morffis, A., Gutiérrez, Y., Estevez-Velarde, S., Almeida-Cruz, Y., Muñoz, R., Montoyo, A.: Overview of the eHealth Knowledge Discovery Challenge at IberLEF 2021. Procesamiento del Lenguaje Natural 67(0) (2021)
8. Ye, W., Li, B., Xie, R., Sheng, Z., Chen, L., Zhang, S.: Exploiting entity BIO tag embeddings and multi-task learning for relation extraction with imbalanced data (2019)