
1613-0073

LABDA at TASS-2018 Task 3: Convolutional Neural Networks for Relation Classification in Spanish eHealth documents

Víctor Suárez-Paniagua

Isabel Segura-Bedmar

Paloma Martínez

Computer Science Department, Carlos III University of Madrid, Leganés 28911, Madrid, Spain

2018

71-76

This work presents the participation of the LABDA team in the subtask on classifying relationships between two identified entities in electronic health (eHealth) documents written in Spanish. We used a Convolutional Neural Network (CNN) with the word embedding and the position embedding of each word to classify the type of relation between two entities in a sentence. This machine learning method has previously shown good performance in capturing the relevant features of electronic health documents that describe relationships. Our architecture obtained an F1 of 44.44 % in scenario 3 of the shared task, named Setting semantic relationships. Only five teams submitted results for this subtask. Our system achieved the second highest F1, very similar to the top score (micro F1 = 44.8 %) and higher than the remaining teams. One of the main advantages of our approach is that it does not require any external knowledge resource as features.

Nowadays, the number of scientific articles published every year is growing rapidly, which demonstrates that we are living in an emerging knowledge era. This explosion of information makes it nearly impossible for doctors and biomedical researchers to keep up to date with the literature in their fields.

The development of automatic systems to extract and analyse information from electronic health (eHealth) documents can significantly reduce the workload of doctors.

The TASS workshop proposes shared tasks on sentiment analysis in Spanish each year. Concretely, the goal of TASS-2018 Task 3 (Martínez-Cámara et al., 2018) is to create a competition where Natural Language Processing (NLP) experts can train their systems for extracting the relevant information in Spanish eHealth documents and evaluate them in an objective and fair way.

Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

Recently, Deep Learning has had a big impact on NLP tasks, becoming the state-of-the-art technique. The Convolutional Neural Network (CNN) is a Deep Learning architecture which has shown good performance in Computer Vision tasks such as image classification (Krizhevsky, Sutskever, and Hinton, 2012) and face recognition (Lawrence et al., 1997).

The system described in (Kim, 2014) was among the first works to use a CNN for an NLP task. It created a vector representation for each sentence by extracting the relevant information with different filters in order to classify sentences into predefined categories, obtaining good results. In addition, CNNs performed well for relation classification between nominals in the work of (Zeng et al., 2014). Furthermore, this architecture has also been used in the biomedical domain for the extraction of drug-drug interactions in (Suárez-Paniagua, Segura-Bedmar, and Martínez, 2017a). This system did not require any external biomedical knowledge to provide results very close to those obtained using many hand-crafted features. We also employed the same approach as (Suárez-Paniagua, Segura-Bedmar, and Martínez, 2017b), which was used for extracting relationships between keyphrases in SemEval-2017 Task 10: ScienceIE (Augenstein et al., 2017), which proposed subtasks very similar to those defined in TASS-2018 Task 3.

In this work, we describe the participation of the LABDA team in subtask C, the classification of relationships between two identified entities in Spanish documents about health.

In this subtask, the test dataset includes the text together with the boundaries and the types of its entities, from which the predictions are generated.

2 Dataset

The task provides an annotated corpus of MedlinePlus documents, divided into a training set for the learning step, a development set for validation and a test set for the evaluation of the systems.

The relationships between entities defined as concepts are: is-a, part-of, property-of and same-as. There are also relationships defined as roles: subject and target. The training set contains 559 sentences with 3,276 entities, 1,012 relations and 1,385 roles; the development set contains another 285 sentences. A detailed description of the method used to collect and process the documents can be found in (Martínez-Cámara et al., 2018).

Unlike the two previous subtasks, the documents include annotated entities with boundaries and types. In this way, it is possible to measure and compare the different approaches focusing only on the goal of subtask C.

2.1 Pre-processing phase

As some of the relationship types are asymmetrical, we generate two instances for each pair of entities marked in the sentence. Thus, a sentence with $n$ entities will have $(n - 1) \times n$ instances. Each instance is labelled with one of the six classes is-a, part-of, property-of, same-as, subject and target. In addition, a None class is considered for pairs of entities that hold no relationship.
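The instance generation described above can be illustrated with a minimal sketch. The function name `candidate_pairs` and the entity list are assumptions made for this example and do not come from the original system.

```python
from itertools import permutations


def candidate_pairs(entities):
    """Return every ordered pair of distinct entities.

    Both directions (a, b) and (b, a) are kept because some
    relationship types are asymmetrical.
    """
    return list(permutations(entities, 2))


# Hypothetical entity list for one sentence (n = 4 entities).
entities = ["ataque de asma", "produce", "síntomas", "empeoran"]
pairs = candidate_pairs(entities)

# A sentence with n entities yields (n - 1) * n instances.
print(len(pairs))
```

Each ordered pair would then receive one of the six relation labels or the None class.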

Because some entities overlap, we consider each sentence as a graph where the vertices are the entities and the edges connect each entity to the entities that do not overlap with it. We recursively obtain all the possible paths without overlapping, so that each overlapped entity yields its own set of instances. Table 2 shows the resulting number of instances for each class on the train, validation and test sets.

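Table 2: Number of instances for each class (is-a, part-of, property-of, same-as, subject, target and None) on the train, validation and test sets. The counts are not recoverable from the source.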
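One possible reading of the path enumeration over overlapped entities is sketched below. This is an interpretation, not the authors' implementation: entities are assumed to be `(text, start, end)` token spans with an exclusive end, and the helper names are hypothetical.

```python
def non_overlapping_paths(entities):
    """Enumerate maximal left-to-right sequences of non-overlapping entities.

    entities: list of (text, start, end) token spans, end exclusive.
    """
    ents = sorted(entities, key=lambda e: (e[1], e[2]))

    def follows(a, b):
        # True when b starts at or after the end of a (no overlap).
        return b[1] >= a[2]

    def successors(a):
        later = [b for b in ents if follows(a, b)]
        # Keep only immediate successors: no other entity fits in between.
        return [b for b in later
                if not any(follows(c, b) for c in later if c is not b)]

    paths = []

    def extend(path):
        nxt = successors(path[-1])
        if not nxt:
            paths.append(path)
            return
        for b in nxt:
            extend(path + [b])

    # A path starts at any entity that has no entity entirely before it.
    for start in (b for b in ents if not any(follows(c, b) for c in ents)):
        extend([start])
    return paths


# Token spans for 'Un ataque de asma se produce cuando los síntomas empeoran .'
entities = [
    ("ataque de asma", 1, 4),
    ("asma", 3, 4),
    ("produce", 5, 6),
    ("síntomas", 8, 9),
    ("empeoran", 9, 10),
]
paths = [[e[0] for e in p] for p in non_overlapping_paths(entities)]
for p in paths:
    print(p)
```

For the example sentence this yields two paths, one through "ataque de asma" and one through "asma", which matches the two groups of instances in Table 1.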

After that, we tokenize and clean the sentences following an approach similar to that described in (Kim, 2014): converting numbers to a common name, converting words to lower-case, replacing special Spanish accented characters by their plain equivalents (e.g. ñ to n), and separating special characters with white spaces by regular expressions.

| Relationship between entities | Instances after entity blinding | Label |
|---|---|---|
| (ataque de asma → produce) | 'un entity1 se entity2 cuando los entity0 entity0 .' | None |
| (ataque de asma ← produce) | 'un entity2 se entity1 cuando los entity0 entity0 .' | target |
| (ataque de asma → síntomas) | 'un entity1 se entity0 cuando los entity2 entity0 .' | None |
| (ataque de asma ← síntomas) | 'un entity2 se entity0 cuando los entity1 entity0 .' | None |
| (ataque de asma → empeoran) | 'un entity1 se entity0 cuando los entity0 entity2 .' | None |
| (ataque de asma ← empeoran) | 'un entity2 se entity0 cuando los entity0 entity1 .' | None |
| (produce → síntomas) | 'un entity0 se entity1 cuando los entity2 entity0 .' | None |
| (produce ← síntomas) | 'un entity0 se entity2 cuando los entity1 entity0 .' | None |
| (produce → empeoran) | 'un entity0 se entity1 cuando los entity0 entity2 .' | subject |
| (produce ← empeoran) | 'un entity0 se entity2 cuando los entity0 entity1 .' | None |
| (síntomas → empeoran) | 'un entity0 se entity0 cuando los entity1 entity2 .' | None |
| (síntomas ← empeoran) | 'un entity0 se entity0 cuando los entity2 entity1 .' | target |
| (asma → produce) | 'un ataque de entity1 se entity2 cuando los entity0 entity0 .' | None |
| (asma ← produce) | 'un ataque de entity2 se entity1 cuando los entity0 entity0 .' | None |
| (asma → síntomas) | 'un ataque de entity1 se entity0 cuando los entity2 entity0 .' | None |
| (asma ← síntomas) | 'un ataque de entity2 se entity0 cuando los entity1 entity0 .' | None |
| (asma → empeoran) | 'un ataque de entity1 se entity0 cuando los entity0 entity2 .' | None |
| (asma ← empeoran) | 'un ataque de entity2 se entity0 cuando los entity0 entity1 .' | None |
| (produce → síntomas) | 'un ataque de entity0 se entity1 cuando los entity2 entity0 .' | None |
| (produce ← síntomas) | 'un ataque de entity0 se entity2 cuando los entity1 entity0 .' | None |
| (produce → empeoran) | 'un ataque de entity0 se entity1 cuando los entity0 entity2 .' | subject |
| (produce ← empeoran) | 'un ataque de entity0 se entity2 cuando los entity0 entity1 .' | None |
| (síntomas → empeoran) | 'un ataque de entity0 se entity0 cuando los entity1 entity2 .' | None |
| (síntomas ← empeoran) | 'un ataque de entity0 se entity0 cuando los entity2 entity1 .' | target |

Table 1: Instances for each pair of distinct entities after the pre-processing phase with entity blinding of the sentence 'Un ataque de asma se produce cuando los síntomas empeoran.'.
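The tokenization and cleaning steps could be sketched as follows. The exact regular expressions and the placeholder token `num` are assumptions chosen for illustration; the paper does not specify them.

```python
import re
import unicodedata


def clean_sentence(text, number_token="num"):
    """Lower-case, strip accents, normalise numbers and split off punctuation."""
    text = text.lower()
    # Decompose accented characters and drop the combining marks (ñ -> n, í -> i).
    text = "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )
    # Replace every number (with optional decimal part) by a common name.
    text = re.sub(r"\d+(?:[.,]\d+)*", number_token, text)
    # Surround special characters with white space so they become tokens.
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    return text.split()


tokens = clean_sentence("Un ataque de asma se produce cuando los síntomas empeoran.")
print(tokens)
```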

Furthermore, the two target entities of each instance are replaced by the labels "entity1" and "entity2", and the remaining entities by "entity0". This method is known as entity blinding, and supports the generalization of the model. For instance, the sentence in Figure 1, 'Un ataque de asma se produce cuando los síntomas empeoran.', with the entities ataque de asma, asma, produce, síntomas and empeoran, is transformed into the relation instances shown in Table 1.
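A minimal sketch of entity blinding is given below, assuming the sentence is already tokenized and the entities of one non-overlapping path are given as `(start, end)` token spans with an exclusive end. The function name and argument layout are hypothetical.

```python
def blind_entities(tokens, spans, first, second):
    """Replace entity spans by entity1, entity2 or entity0.

    spans:  non-overlapping (start, end) token offsets, end exclusive.
    first:  index in spans of the entity labelled entity1.
    second: index in spans of the entity labelled entity2.
    """
    by_start = {}
    for idx, (start, end) in enumerate(spans):
        if idx == first:
            label = "entity1"
        elif idx == second:
            label = "entity2"
        else:
            label = "entity0"
        by_start[start] = (end, label)

    out, i = [], 0
    while i < len(tokens):
        if i in by_start:
            end, label = by_start[i]
            out.append(label)
            i = end  # skip the remaining tokens of the entity
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


tokens = "un ataque de asma se produce cuando los sintomas empeoran .".split()
spans = [(1, 4), (5, 6), (8, 9), (9, 10)]  # ataque de asma, produce, sintomas, empeoran

forward = blind_entities(tokens, spans, first=0, second=1)
backward = blind_entities(tokens, spans, first=1, second=0)
print(forward)
print(backward)
```

The two printed strings correspond to the first two rows of Table 1.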

Figure 1: Relationships and entities in the sentence 'Un ataque de asma se produce cuando los síntomas empeoran.'.

We observed that some instances involve relationships between an entity and an entity that overlaps it. We removed them from the dataset because the entity blinding process cannot handle these relations. Moreover, some relationships have more than one label; in this case, we kept just one label because our system is not able to cope with a multi-label problem.

3 CNN model

In this section, we present the CNN architecture used for the task of relation extraction in electronic health documents. Figure 2 shows the entire process of the CNN, from a sentence with marked entities to the returned prediction.

3.1 Word table layer

After the pre-processing phase, we created an input matrix suitable for the CNN architecture. The input matrix should represent all training instances for the CNN model; therefore, all instances should have the same length. We determined the maximum length of the sentence in all the instances (denoted by $n$), and then extended those sentences with lengths shorter than $n$ by padding with an auxiliary token "0".

Moreover, each word has to be represented by a vector. To do this, we randomly initialized a vector for each different word, which allows us to replace each word by its word embedding vector from $W_e \in \mathbb{R}^{|V| \times m_e}$, where $|V|$ is the vocabulary size and $m_e$ is the word embedding dimension. Finally, we obtained a vector $x = [x_1, x_2, \ldots, x_n]$ for each instance, where each word of the sentence is represented by its corresponding word vector from the word embedding matrix. We denote $p_1$ and $p_2$ as the positions in the sentence of the two entities to be classified.

Figure 2: CNN model for the Setting semantic relationships subtask of TASS-2018 Task 3. The example input 'Un <e1>ataque de asma</e1> se <e2>produce</e2> cuando los <e0>síntomas</e0> <e0>empeoran</e0>.' is blinded to 'Un entity1 se entity2 cuando los entity0 entity0.' and padded with "0" tokens.

The following step involves calculating the relative position of each word to the two candidate entities as $i - p_1$ and $i - p_2$, where $i$ is the word position in the sentence (padded words included), in the same way as (Zeng et al., 2014). In order to avoid negative values, we transformed the range $(-n + 1, n - 1)$ to the range $(1, 2n - 1)$. Then, we mapped these distances into real-valued vectors using two position embeddings $W_{d1} \in \mathbb{R}^{(2n-1) \times m_d}$ and $W_{d2} \in \mathbb{R}^{(2n-1) \times m_d}$. Finally, we created an input matrix $X \in \mathbb{R}^{n \times (m_e + 2m_d)}$ which is represented by the concatenation of the word embeddings and the two position embeddings for each word in the instance.
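The construction of the input matrix can be sketched in plain Python. The helper names, the zero-based word positions and the tiny dimensions ($m_e = 4$, $m_d = 2$) are assumptions for readability; they are not the values used in the experiments.

```python
import random


def position_indices(n, p):
    """Shift the relative distances i - p from (-n+1, n-1) to (1, 2n-1)."""
    return [(i - p) + n for i in range(n)]


def rand_matrix(rows, cols, rng):
    """Randomly initialised embedding table (one row per index)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def build_input_matrix(token_ids, p1, p2, W_e, W_d1, W_d2):
    """Concatenate word and position embeddings for every word."""
    n = len(token_ids)
    idx1 = position_indices(n, p1)
    idx2 = position_indices(n, p2)
    # Position indices start at 1, so subtract 1 for the list lookup.
    return [W_e[t] + W_d1[a - 1] + W_d2[b - 1]
            for t, a, b in zip(token_ids, idx1, idx2)]


rng = random.Random(0)
tokens = "un entity1 se entity2 cuando los entity0 entity0 .".split()
vocab = {w: k for k, w in enumerate(sorted(set(tokens)))}
n, m_e, m_d = len(tokens), 4, 2

W_e = rand_matrix(len(vocab), m_e, rng)
W_d1 = rand_matrix(2 * n - 1, m_d, rng)
W_d2 = rand_matrix(2 * n - 1, m_d, rng)

p1, p2 = tokens.index("entity1"), tokens.index("entity2")
X = build_input_matrix([vocab[w] for w in tokens], p1, p2, W_e, W_d1, W_d2)
print(len(X), len(X[0]))  # n rows, m_e + 2 * m_d columns
```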

3.2 Convolutional layer

Once we obtained the input matrix, we applied a filter matrix $f = [f_1, f_2, \ldots, f_w] \in \mathbb{R}^{w \times (m_e + 2m_d)}$ to a context window of size $w$ in the convolutional layer to create higher level features. For each filter, we obtained a score sequence $s = [s_1, s_2, \ldots, s_{n-w+1}] \in \mathbb{R}^{(n-w+1) \times 1}$ for the whole sentence as

$$s_i = g\left(\sum_{j=1}^{w} f_j x_{i+j-1}^T + b\right)$$

where $b$ is a bias term and $g$ is a non-linear function (such as hyperbolic tangent or sigmoid). Note that in Figure 2, we represent the total number of filters, denoted by $m$, with the same size $w$ in a matrix $S \in \mathbb{R}^{(n-w+1) \times m}$. However, the same process can be applied to filters with different sizes by creating additional matrices that would be concatenated in the following layer.
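The score sequence of a single filter can be computed with a short sketch. It uses zero-based indices, so the term $f_j x_{i+j-1}^T$ becomes `f[j]` dotted with `X[i + j]`; the toy matrices and the choice of ReLU as default are illustrative assumptions.

```python
def relu(v):
    return max(0.0, v)


def convolve(X, f, b=0.0, g=relu):
    """Slide a filter f (w rows) over the rows of X and return n - w + 1 scores."""
    n, w = len(X), len(f)
    scores = []
    for i in range(n - w + 1):
        total = sum(
            sum(fa * xa for fa, xa in zip(f[j], X[i + j]))  # f_j . x_{i+j}
            for j in range(w)
        )
        scores.append(g(total + b))
    return scores


# Toy input with n = 4 words and 2 features per word, filter window w = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
f = [[1.0, 1.0], [1.0, -1.0]]
s = convolve(X, f)
print(s)
```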

3.3 Pooling layer

In this layer, the goal is to extract the most relevant features of each filter using an aggregating function. We used the max function, which produces a single value for each filter as $z_f = \max\{s\} = \max\{s_1, s_2, \ldots, s_{n-w+1}\}$. Thus, we created a vector $z = [z_1, z_2, \ldots, z_m]$, whose dimension is the total number of filters $m$, representing the relation instance. If there are filters with different sizes, their output values should be concatenated in this layer.
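Max pooling over the score sequences of several filters can be sketched as below. The score values are made up for illustration; note that sequences from filters with different window sizes may have different lengths, which the pooling step absorbs.

```python
def max_pool(score_sequences):
    """Keep the maximum score of each filter, giving one value per filter."""
    return [max(s) for s in score_sequences]


# Hypothetical score sequences from three filters with different window sizes.
score_sequences = [
    [0.0, 1.0, 5.0],
    [2.0, 0.5],
    [0.0, 0.0, 0.3, 0.1],
]
z = max_pool(score_sequences)
print(z)
```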

3.4 Softmax layer

Prior to performing the classification, we applied dropout to prevent overfitting. We obtained a reduced vector $z_d$ by randomly setting the elements of $z$ to zero with a probability $p$ following a Bernoulli distribution. After that, we fed this vector into a fully connected softmax layer with weights $W_s \in \mathbb{R}^{m \times k}$ to compute the output prediction values for the classification as

$$o = z_d W_s + d$$

where $d$ is a bias term; we have $k = 6$ classes in the dataset and the "None" class. At test time, the vector $z$ of a new instance is directly classified by the softmax layer without dropout. For the training phase, we need to learn the CNN parameter set $\theta = (W_e, W_{d1}, W_{d2}, W_s, d, F_m, b)$, where $F_m$ are all of the $m$ filters $f$. For this purpose, we used the conditional probability of a relation $r$ obtained by the softmax operation as

$$p(r|x, \theta) = \frac{\exp(o_r)}{\sum_{l=1}^{k} \exp(o_l)}$$

to minimize the cross entropy function for all instances $(x_i, y_i)$ in the training set $T$ as follows

$$J(\theta) = -\sum_{i=1}^{T} \log p(y_i|x_i, \theta)$$

In addition, we minimize the objective function by using stochastic gradient descent over shuffled mini-batches and the Adam update rule (Kingma and Ba, 2014) to learn the parameters.
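The dropout, softmax and cross entropy steps can be illustrated with a small sketch. It follows the description literally (elements zeroed with probability $p$, no rescaling), which is an assumption about the implementation; the toy weights, the two-class output and all function names are hypothetical.

```python
import math
import random


def dropout(z, p, rng):
    """Zero each element of z independently with probability p."""
    return [0.0 if rng.random() < p else v for v in z]


def output_scores(z, W_s, d):
    """o = z W_s + d, with W_s of shape m x k and d of length k."""
    return [sum(z[a] * W_s[a][c] for a in range(len(z))) + d[c]
            for c in range(len(d))]


def softmax(o):
    """p(r | x) = exp(o_r) / sum_l exp(o_l), shifted by max(o) for stability."""
    top = max(o)
    exps = [math.exp(v - top) for v in o]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, gold):
    """J = - sum_i log p(y_i | x_i) over the given instances."""
    return -sum(math.log(p[y]) for p, y in zip(probabilities, gold))


rng = random.Random(0)
z = [5.0, 2.0, 0.3]                           # pooled features, m = 3
W_s = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.5]]   # toy weights, k = 2 classes
d = [0.0, 0.1]

z_d = dropout(z, p=0.5, rng=rng)              # training time only
probs = softmax(output_scores(z_d, W_s, d))
loss = cross_entropy([probs], gold=[1])
print(probs, loss)
```

At test time the `dropout` call would simply be skipped and `z` passed to `output_scores` directly.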

4 Results and Discussion

The CNN model was trained on the training set, and we obtained the best value of each parameter by fine-tuning them on the validation set (see Table 4).

The results were measured with precision (P), recall (R) and F1, defined as:

$$P = \frac{C}{C + S} \qquad R = \frac{C}{C + M} \qquad F1 = 2 \times \frac{P \times R}{P + R}$$

where Correct (C) are the relations that appear in both the test set and the prediction, Missing (M) are the relations that are in the test set but not in the prediction, and Spurious (S) are the relations that are in the prediction but not in the test set.
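As an illustration of these definitions, the sketch below derives C, M and S from sets of relation triples and computes the three measures. The triples are invented for the example and are not taken from the task data; the function names are hypothetical.

```python
def count_matches(gold, predicted):
    """Return (correct, missing, spurious) for two sets of relations."""
    correct = len(gold & predicted)
    missing = len(gold - predicted)
    spurious = len(predicted - gold)
    return correct, missing, spurious


def precision_recall_f1(correct, missing, spurious):
    """P = C / (C + S), R = C / (C + M), F1 = 2 P R / (P + R)."""
    p = correct / (correct + spurious) if correct + spurious else 0.0
    r = correct / (correct + missing) if correct + missing else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Invented gold and predicted relations: (label, source, destination).
gold = {("target", "produce", "ataque de asma"),
        ("subject", "produce", "empeoran"),
        ("target", "empeoran", "síntomas"),
        ("is-a", "asma", "enfermedad")}
predicted = {("target", "produce", "ataque de asma"),
             ("subject", "produce", "empeoran"),
             ("target", "empeoran", "síntomas"),
             ("is-a", "síntomas", "asma"),
             ("part-of", "asma", "ataque de asma")}

c, m, s = count_matches(gold, predicted)
p, r, f1 = precision_recall_f1(c, m, s)
print(c, m, s, p, r, f1)
```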

| Parameter | Value |
|---|---|
| Maximal length in the dataset, $n$ | 38 |
| Word embeddings dimension, $m_e$ | 300 |
| Position embeddings dimension, $m_d$ | 10 |
| Filter window sizes, $w$ | 3, 4, 5 |
| Filters for each window size, $m$ | 200 |
| Dropout rate, $p$ | 0.5 |
| Non-linear function, $g$ | ReLU |
| Mini-batch size | 50 |
| Learning rate | 0.001 |

Table 4: The CNN model parameters and their values used for the results.

Table 3 shows the results of the CNN configuration with position embeddings. We observe that the number of Missing relations is very high. This may be because the dataset is very unbalanced and these instances are classified as None by the system. In fact, the classes with more instances obtain better Recall. To solve this problem, we propose to use sampling techniques to increase the number of instances of the less represented classes.

Only five teams submitted results for this subtask. Our system achieved the second highest F1, very similar to the top score (micro F1 = 44.8 %) and much higher than the other teams, which are below 11 % F1. One of the main advantages of our approach is that it does not require any external knowledge resource.

5 Conclusions and Future work

In this paper, we propose a CNN model for subtask C (Setting semantic relationships) of TASS-2018 Task 3. The official results for this model show that the CNN is a very promising system because neither expert domain knowledge nor external features are needed. The configuration of the architecture is very simple, with a basic pre-processing adapted for Spanish documents.

The results show that the system produces a large number of false negatives. We think that this may be due to the unbalanced nature of the dataset. To solve this problem, we propose to use oversampling techniques to increase the number of instances of the less represented classes. Our system also seems to have difficulties in distinguishing the directionality of the relationships. For this reason, we will use more complex settings of the architecture to tackle the directionality problem.

Moreover, we plan to use external features as part of the embeddings, such as the entity labels given by the second subtask, the Part-of-Speech (PoS) tags and the dependency types of each word for the Spanish documents, in order to increase the information of each sentence. We want to explore in detail the contribution of each feature and fine-tune all the parameters. Furthermore, we will use some rules to distinguish the relations and the roles with the entity labels and train two different classifiers, which should be more accurate. In addition, we will use other neural network architectures such as the Recurrent Neural Network and possible combinations with the CNN.

Funding

This work was supported by the Research Program of the Ministry of Economy and Competitiveness - Government of Spain, (DeepEMR project TIN2017-87548-C2-1-R).

References

Augenstein, I., M. Das, S. Riedel, L. Vikraman, and A. McCallum. 2017. SemEval 2017 Task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 546-555, Vancouver, Canada, August. Association for Computational Linguistics.

Kim, Y. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746-1751.

Kingma, D. P. and J. Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Krizhevsky, A., I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc.

Lawrence, S., C. L. Giles, A. C. Tsoi, and A. D. Back. 1997. Face recognition: a convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1):98-113, Jan.

Martínez-Cámara, E., Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez, A. Montejo-Ráez, A. Montoyo, R. Muñoz, A. Piad-Morffis, and J. Villena-Román. 2018. Overview of TASS 2018: Opinions, health and emotions. In E. Martínez-Cámara, Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Suárez-Paniagua, V., I. Segura-Bedmar, and P. Martínez. 2017a. Exploring convolutional neural networks for drug-drug interaction extraction. Database, 2017:bax019.

Suárez-Paniagua, V., I. Segura-Bedmar, and P. Martínez. 2017b. LABDA at SemEval-2017 Task 10: Relation classification between keyphrases via convolutional neural network. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval@ACL 2017, Vancouver, Canada, August 3-4, 2017, pages 969-972.

Zeng, D., K. Liu, S. Lai, G. Zhou, and J. Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), Technical Papers, pages 2335-2344, Dublin, Ireland, August. Dublin City University and Association for Computational Linguistics.