Deep Learning Approach for Negation Cue Detection in Spanish

Hermenegildo Fabregat¹, Juan Martinez-Romo¹,², Lourdes Araujo¹,²
¹ Universidad Nacional de Educación a Distancia (UNED)
² IMIENS: Instituto Mixto de Investigación
{gildo.fabregat, lurdes, juaner}@lsi.uned.es

Abstract: This paper describes the negation cue detection model presented by the UNED group for Task 2 (Negation cues detection) of the NEGES workshop, co-located with the SEPLN conference (Seville, 2018). The task deals with the detection of negation cues in Spanish reviews from domains such as cars, music and books. In order to capture semantic and syntactic patterns as well as contextual patterns, we propose a model based on the combination of several dense neural networks and a bidirectional Long Short-Term Memory network (Bi-LSTM). The evaluation is divided by domain, and in terms of the inter-domain average the results obtained are acceptable.

Keywords: Negation detection, negation cues, Deep Learning, Bi-LSTM

1 Introduction

To understand the meaning of a sentence through natural language processing techniques, it is necessary to take into account that a sentence can express a negated fact. In some languages, such as English, the detection and processing of negation is a recurrent research area. It is a particularly interesting field of study considering the influence of negation on tasks such as sentiment analysis and relation extraction (Reitan et al., 2015; Chowdhury and Lavelli, 2013). NegEx (Chapman et al., 2001) is one of the most popular algorithms for negation detection in English. Its adaptation to other languages has been addressed by several recent works, such as Chapman et al. (2013) (French, German and Swedish), Skeppstedt (2011) (Swedish) and Cotik et al. (2016) (Spanish); the latter also explores other syntactic approaches for negation detection in Spanish, based on rules derived from PoS tags and dependency-tree patterns.
Task 2 of the NEGES workshop (Jiménez-Zafra et al., 2018a) focuses on the detection of negation cues in Spanish. For this purpose, the organizers provide the SFU ReviewSP-NEG corpus (Jiménez-Zafra et al., 2018b), which consists of 400 reviews from 8 different domains (cars, hotels, washing machines, books, cell phones, music, computers and movies), 221866 words and 9455 sentences, out of which 3022 sentences contain at least one negation structure. The organizers released the corpus divided into three sets: training, development and test. As can be seen in Figure 1, the corpus is distributed in the CoNLL format (Hajič et al., 2009).

hoteles 21 1  Y          y          cc      coordinating  -  - -
hoteles 21 2  no         no         rn      negative      no - -
hoteles 21 3  hay        haber      vmip3s0 main          -  - -
hoteles 21 4  en         en         sps00   preposition   -  - -
hoteles 21 5  la         el         da0fs0  article       -  - -
hoteles 21 6  habitación habitación ncfs000 common        -  - -
hoteles 21 7  ni         ni         rn      negative      ni - -
hoteles 21 8  una        uno        di0fs0  indefinite    -  - -
hoteles 21 9  triste     triste     aq0cs0  qualificative -  - -
hoteles 21 10 hoja       hoja       ncfs000 common        -  - -

Figure 1: Corpus SFU ReviewSP-NEG, annotation format.

Each line corresponds to a token, an empty line marks the end of a sentence, and each column holds an annotation of a specific term (for instance, column one contains the name of the domain file, and columns three and four contain the word and its lemma). Column eight onwards shows the annotations related to negation. If the sentence has no negations, column eight has the value "***" and there are no further columns. Otherwise, each negation is annotated in three columns: the first contains the word that belongs to the negation cue, and the second and third contain "-".

This work is organized as follows: Section 2 describes the proposed model and the features and resources used. In Section 3 we report and discuss the results obtained during the evaluation stage. Finally, in Section 4, conclusions and future work are presented.

2 Proposed model

Inspired by the model presented by Fancellu, Lopez, and Webber (2016), we address the problem as a sequence labeling task.

The proposed model has been implemented using Python's Keras library (Chollet et al., 2015) with a TensorFlow backend. It is a supervised approach that uses the following embedded features: words, lemmas, PoS tags and casing information. Words and lemmas are encoded using pre-trained Spanish word embeddings (Cardellino, 2016), while the PoS-tag and casing embeddings are implemented as two Keras embedding layers¹ initialized with a random uniform distribution. In order to avoid cascading errors, we use the lemmas and PoS tags provided in the corpus.

¹ https://keras.io/layers/embeddings/

The casing embedding matrix is a one-hot encoding matrix of size 8, computed for each input token using the following encoder dictionary: { 0: input token is numeric; 1: -; 2: -; 3: initial character is upper case; 4: input token is mainly numeric; 5: contains at least one digit; 6: other case }.
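As an illustration, the following is a minimal sketch of such a casing encoder in Python. The descriptions of codes 1 and 2 are not given above; all-lowercase and all-uppercase, as well as the order of the checks, are assumptions based on common casing-feature schemes for sequence labeling.

import numpy as np

def casing_code(token):
    digits = sum(ch.isdigit() for ch in token)
    if token.isdigit():                    # 0: token is numeric
        return 0
    if digits / max(len(token), 1) > 0.5:  # 4: token is mainly numeric
        return 4
    if token.islower():                    # 1: all lower case (assumed)
        return 1
    if token.isupper():                    # 2: all upper case (assumed)
        return 2
    if token[:1].isupper():                # 3: initial character is upper case
        return 3
    if digits > 0:                         # 5: contains at least one digit
        return 5
    return 6                               # 6: other case

def casing_one_hot(token):
    # one-hot vector of size 8; the eighth slot is assumed to be reserved
    # for padding positions
    vec = np.zeros(8, dtype=np.float32)
    vec[casing_code(token)] = 1.0
    return vec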
To ensure that corpus words joined by an underscore, such as "ya_que", are not left out of the embedding, we carry out a preprocessing step that splits these expressions according to the number of underscores they contain. To standardize the sentences to a common length, after splitting expressions with more than one term, padding of up to 200 positions is applied. The targets are labeled following the standard IOB scheme (Ramshaw and Marcus, 1999): the first cue of a negation phrase is denoted by B (Begin) and the remaining cues, if any, by I (Inside), while O (Out) indicates that the word does not correspond to any kind of entity considered. For example:

Del (O) buffet (O) del (O) desayuno (O) no (B) puedo (O) opinar (O) ya que (B) no (I) lo (O) incluia (O) nuestro (O) regimen (O) . (O)

Figure 2 shows the proposed model architecture. The first layer is a densely connected hidden layer (dense neural network) whose activation function is the hyperbolic tangent (tanh); it takes as input the concatenation of the different embeddings. The output of this first layer is connected to an LSTM (Long Short-Term Memory) enveloped in a bidirectional wrapper (forward and backward processing networks). For each direction, this second layer uses a hidden state to process the current step while taking into account information from previous steps. Next, and connected to the output layer, another dense hidden layer is used to reduce the dimensionality of the bidirectional LSTM output. To avoid possible over-fitting, we apply a dropout factor of 0.25 to the output of this dense layer. Finally, another dense hidden layer, using the softmax activation function, computes the probabilities of all tags for each word in a sentence. The most probable label is selected as the final tag.

Figure 2: Architecture of the proposed model, where XL and XW (L: lemma, W: raw word) are the encoded word inputs and XP and XC are the encoded inputs representing the PoS-tagging and casing information. The Bi-LSTM inputs (Yx) are the concatenated embedded features of each word. In the output layer, Tx represents the assigned tag.

The parameters of the pre-trained resources and the model hyper-parameters are the following:

– Pre-trained Spanish word embedding dimension: 300
– Embedding dimensions (casing / PoS-tagging): 8 / 50
– Hidden dense units (output dimension / activation function): 200 / tanh
– LSTM output dimension: 300
– Dropout (for each dense unit): 0.25
– Batch size / model optimizer: 32 / AdaGrad (Duchi, Hazan, and Singer, 2011)

The model has been trained with data from all the categories, and training was limited to 25 epochs in order to avoid possible over-fitting. During the training phase we evaluated the generated model after each epoch, using the script provided by the organizers (Morante and Blanco, 2012) on the development set, and observed that for most of the domains 20 epochs are enough to reach the best results (Figure 3).

Figure 3: Training phase, temporal evaluation for each domain using the development set.

Once the model configuration was fixed and showed a stable and similar performance for all categories, the model was re-trained with the data of the development set.
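Putting the architecture description and the hyper-parameter list together, the following is a minimal sketch of the network in Keras. It is not the authors' released code: the vocabulary sizes and the tag-set size are placeholders, and the pre-trained Spanish embedding matrices are assumed to be loaded into the word and lemma embedding layers.

from keras.models import Model
from keras.layers import (Input, Embedding, Concatenate, Dense, Dropout,
                          LSTM, Bidirectional, TimeDistributed)

MAX_LEN = 200   # padding length used in preprocessing
N_TAGS = 3      # B, I, O

# token-index inputs: words, lemmas, PoS tags and casing codes
x_w = Input(shape=(MAX_LEN,), name="words")
x_l = Input(shape=(MAX_LEN,), name="lemmas")
x_p = Input(shape=(MAX_LEN,), name="pos")
x_c = Input(shape=(MAX_LEN,), name="casing")

# vocabulary sizes (100000 words/lemmas, 60 PoS tags) are placeholders;
# the default Keras embedding initializer is random uniform
e_w = Embedding(100000, 300)(x_w)   # pre-trained Spanish embeddings, dim 300
e_l = Embedding(100000, 300)(x_l)   # pre-trained Spanish embeddings, dim 300
e_p = Embedding(60, 50)(x_p)        # PoS-tag embedding, dim 50
e_c = Embedding(8, 8)(x_c)          # casing embedding, dim 8

y = Concatenate()([e_w, e_l, e_p, e_c])
y = TimeDistributed(Dense(200, activation="tanh"))(y)    # first dense layer
y = Bidirectional(LSTM(300, return_sequences=True))(y)   # Bi-LSTM
y = TimeDistributed(Dense(200, activation="tanh"))(y)    # reduce Bi-LSTM output
y = Dropout(0.25)(y)
out = TimeDistributed(Dense(N_TAGS, activation="softmax"))(y)

model = Model(inputs=[x_w, x_l, x_p, x_c], outputs=out)
model.compile(optimizer="adagrad", loss="categorical_crossentropy")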
3 Evaluation

In this section we describe the results obtained, taking into account the following evaluation criteria proposed by the organizers:

– Punctuation tokens are ignored.
– True positives are counted when the system produces negation elements exactly as they appear in the gold standard.
– Partial matches are not counted as FP, only as FN.
– False negatives are counted when the system either fails to identify negation elements present in the gold standard or identifies them only partially.
– False positives are counted when the system produces a negation element not present in the gold standard.
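To make these criteria concrete, the following is a minimal per-sentence counting sketch. It assumes, purely for illustration, that cues are represented as sets of token spans; the official scores are computed by the organizers' script on the CoNLL files.

def count_matches(gold_cues, pred_cues):
    # gold_cues / pred_cues: sets of cue spans for one sentence, each span a
    # frozenset of (token_index, token) pairs with punctuation tokens removed
    tp = len(gold_cues & pred_cues)    # exact matches only
    fn = len(gold_cues - pred_cues)    # missed or partially matched gold cues
    # partial matches are not FP: only predictions that overlap no gold span
    # at all are counted as false positives
    fp = sum(1 for p in pred_cues - gold_cues
             if not any(p & g for g in gold_cues))
    return tp, fp, fn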
In order to study the performance of the presented system, it has been compared with a baseline based on a lookup of a filtered list of terms extracted from the training set. To take into account the scope of the negation, the sentences are divided according to the following delimiters: ".", "," and ";". The list of terms was tuned in order to improve the results obtained by this baseline.
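The following is a minimal sketch of such a lookup baseline. The way B and I tags are assigned within each delimiter-bounded segment, and the name cue_terms for the filtered term list, are assumptions made for illustration.

DELIMITERS = {".", ",", ";"}

def baseline_tag(tokens, cue_terms):
    # Tag the first term found in each delimiter-bounded segment as B and
    # further matches in the same segment as I; everything else is O.
    # Multi-word cues are assumed to have been split during preprocessing.
    tags, cue_seen = [], False
    for tok in tokens:
        if tok in DELIMITERS:
            cue_seen = False           # a delimiter starts a new segment
            tags.append("O")
        elif tok.lower() in cue_terms:
            tags.append("I" if cue_seen else "B")
            cue_seen = True
        else:
            tags.append("O")
    return tags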
Table 1 shows the results obtained using the baseline (evaluated on the development set) and Table 2 shows the results obtained using the proposed approach.

Domain            Precision  Recall   F-measure
Cars              44.74 %    72.34 %  55.29 %
Hotels            51.32 %    63.93 %  56.94 %
Washing machines  55.36 %    68.89 %  61.39 %
Books             53.11 %    65.28 %  58.57 %
Phones            54.62 %    65.14 %  59.42 %
Music             43.59 %    65.38 %  52.31 %
Computers         38.57 %    51.92 %  44.26 %
Films             50.00 %    59.09 %  54.17 %

Table 1: Baseline, evaluation per domain on the development set.

Domain            Precision          Recall             F-measure
Cars              94.23 % (88.37 %)  72.06 % (80.85 %)  81.67 % (84.44 %)
Hotels            97.67 % (90.62 %)  71.19 % (47.54 %)  82.35 % (62.36 %)
Washing machines  92.00 % (96.88 %)  66.67 % (68.89 %)  77.31 % (80.52 %)
Books             79.52 % (91.00 %)  66.27 % (63.19 %)  72.29 % (74.59 %)
Phones            93.33 % (94.20 %)  73.68 % (59.63 %)  82.35 % (73.03 %)
Music             92.59 % (85.19 %)  57.47 % (88.46 %)  70.92 % (86.79 %)
Computers         – (84.62 %)        – (63.46 %)        – (72.53 %)
Films             86.26 % (93.33 %)  69.33 % (63.64 %)  76.87 % (75.68 %)

Table 2: Evaluation per domain on the test set (development-set scores in parentheses).

As can be seen, Table 2 presents two scores for each evaluation metric (precision, recall and F-measure): the evaluation of the system on the development set during the training phase, and the evaluation carried out by the organizers on the unannotated test set. Due to an error when submitting the system output, there are no test results for the computers category. On the one hand, the results obtained in a preliminary analysis (development set) show that the proposed system significantly improves on the baseline. On the other hand, as shown in Table 2, the difference between precision and recall on the test set is remarkable. Taking into account that, given the requirements of the presented system, we did not build a specific model for each domain, the differences between precision and recall observed in the test-set evaluation may indicate, among other things, that the system suffers from some over-fitting and is adjusting to very recurrent patterns, or that some expressions were not processed correctly (for example, there may be expressions that are not correctly covered by the word embeddings used). The drop in recall for the music domain between development and test is also notable.

Because the gold standard has not been published, we have not been able to perform an exhaustive analysis of the recognition mistakes made on the test set. However, some of the errors detected during the training phase that affect recall correspond to situations in which the model fails to recognize multi-word negation expressions such as "a no ser que" and "no hay mas que".

4 Concluding Remarks

The detection of negation cues is an important task in natural language processing. We present a deep learning model for the detection of negation cues, inspired by named entity recognition architectures and negation scope detection models. The model achieves good performance without any sophisticated feature extraction process and, although it has some weaknesses in terms of coverage, the results are acceptable and comparable with those obtained by the UPC-TALP team (average results: 91.47 % precision, 82.17 % recall and 86.44 % F-measure).

As future work, given the low recall obtained, we will explore other regularization methods, such as the use of a regularization function (Cogswell et al., 2015), as well as model modifications such as adding a semantic vector representation of the whole sentence and replacing the current dense output layer with a CRF-based layer. Finally, the study of the patterns generated by the current model can lead to a rule-based auxiliary model for re-labeling negation-beginning cues (label B). Taking into account that the model was trained without handcrafted features, the results obtained indicate that the system is capable of reaching more competitive levels of precision and recall.

Acknowledgments

This work has been partially supported by the projects EXTRECM (TIN2013-46616-C2-2-R), PROSA-MED (TIN2016-77820-C3-2-R), and EXTRAE (IMIENS 2017).

References

Cardellino, C. 2016. Spanish billion words corpus and embeddings.

Chapman, W. W., W. Bridewell, P. Hanbury, G. F. Cooper, and B. G. Buchanan. 2001. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34(5):301–310.

Chapman, W. W., D. Hilert, S. Velupillai, M. Kvist, M. Skeppstedt, B. E. Chapman, M. Conway, M. Tharp, D. L. Mowery, and L. Deleger. 2013. Extending the NegEx lexicon for multiple languages. Studies in Health Technology and Informatics, 192:677.

Chollet, F. et al. 2015. Keras. https://github.com/fchollet/keras.

Chowdhury, M. F. M. and A. Lavelli. 2013. Exploiting the scope of negations and heterogeneous features for relation extraction: A case study for drug-drug interaction extraction. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 765–771.

Cogswell, M., F. Ahmed, R. B. Girshick, L. Zitnick, and D. Batra. 2015. Reducing overfitting in deep networks by decorrelating representations. CoRR, abs/1511.06068.

Cotik, V., V. Stricker, J. Vivaldi, and H. Rodríguez Hontoria. 2016. Syntactic methods for negation detection in radiology reports in Spanish. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, BioNLP 2016, pages 156–165. Association for Computational Linguistics.

Duchi, J., E. Hazan, and Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Fancellu, F., A. Lopez, and B. Webber. 2016. Neural networks for negation scope detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 495–504.

Hajič, J., M. Ciaramita, R. Johansson, D. Kawahara, M. A. Martí, L. Màrquez, A. Meyers, J. Nivre, S. Padó, J. Štěpánek, et al. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–18. Association for Computational Linguistics.

Jiménez-Zafra, S. M., N. P. Cruz-Díaz, R. Morante, and M. T. Martín-Valdivia. 2018a. Resumen de la Tarea 2 del Taller NEGES 2018: Detección de Claves de Negación. In Proceedings of NEGES 2018: Workshop on Negation in Spanish, volume 2174, pages 35–41.

Jiménez-Zafra, S. M., M. Taulé, M. T. Martín-Valdivia, L. A. Ureña-López, and M. A. Martí. 2018b. SFU ReviewSP-NEG: a Spanish corpus annotated with negation for sentiment analysis. A typology of negation patterns. Language Resources and Evaluation, 52(2):533–569.

Morante, R. and E. Blanco. 2012. *SEM 2012 shared task: Resolving the scope and focus of negation. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), pages 265–274. Association for Computational Linguistics.

Ramshaw, L. A. and M. P. Marcus. 1999. Text chunking using transformation-based learning. In Natural Language Processing Using Very Large Corpora, pages 157–176. Springer.

Reitan, J., J. Faret, B. Gambäck, and L. Bungum. 2015. Negation scope detection for Twitter sentiment analysis. In Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 99–108.

Skeppstedt, M. 2011. Negation detection in Swedish clinical text: An adaption of NegEx to Swedish. Journal of Biomedical Semantics, 2(Suppl 3):S3. BioMed Central.