TASS 2018: Workshop on Semantic Analysis at SEPLN, September 2018, pp. 13-27

Overview of TASS 2018: Opinions, Health and Emotions

Resumen de TASS 2018: Opiniones, Salud y Emociones

Eugenio Martínez-Cámara (1), Yudivián Almeida-Cruz (2), Manuel Carlos Díaz-Galiano (3), Suilan Estévez-Velarde (2), Miguel Á. García-Cumbreras (3), Manuel García-Vega (3), Yoan Gutiérrez (4), Arturo Montejo-Ráez (3), Andrés Montoyo (4), Rafael Muñoz (4), Alejandro Piad-Morffis (2), Julio Villena-Román (5)

(1) Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), Universidad de Granada, Spain
(2) Universidad de La Habana, Cuba
(3) Centro de Estudios Avanzados en Tecnologías de la Información y de la Comunicación (CEATIC), Universidad de Jaén
(4) Universidad de Alicante, Spain
(5) MeaningCloud

Abstract: This is an overview of the Workshop on Semantic Analysis at the SEPLN congress held in Sevilla, Spain, in September 2018. This forum proposes to participants four different semantic tasks on texts written in Spanish. Task 1 focuses on polarity classification; Task 2 encourages the development of aspect-based polarity classification systems; Task 3 provides a scenario for discovering knowledge from eHealth documents; finally, Task 4 is about automatic classification of news articles according to safety. The latter two tasks are novel in this edition of TASS. We detail the approaches and the results of the systems submitted by the different groups in each task.

Keywords: Sentiment Analysis, Opinion Mining, Affect Computing, eHealth, Social Media

Summary: This article offers a summary of the Workshop on Semantic Analysis at SEPLN (TASS), held in Sevilla, Spain, in September 2018. This forum proposes to participants four different semantic analysis tasks on texts in Spanish.
Task 1 focuses on polarity classification; Task 2 encourages the development of aspect-oriented polarity systems; Task 3 consists of discovering knowledge in health documents; finally, Task 4 proposes the automatic classification of news articles according to a safety level. The last two tasks are new in this edition. A synthesis of the systems and results contributed by the participating teams is offered, together with a discussion of them.

Keywords (Palabras clave): Sentiment Analysis, Opinion Mining, Affective Computing, e-Health, Social Media

ISSN 1613-0073. Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

1 Introduction

The Workshop on Semantic Analysis at the SEPLN [1] (in Spanish, Taller de Análisis Semántico en la SEPLN, TASS) is the evolution of the Workshop on Sentiment Analysis at the SEPLN, which has been held since 2012. The aim of the workshop is to further research on semantic tasks for texts written in Spanish, that is, on Spanish-language data. The 2018 edition proposed two new challenges (Tasks 3 and 4) and provided several linguistic resources.

[1] http://www.sepln.org/workshops/tass

The processing of health data is attracting the attention of the Natural Language Processing (NLP) research community (Denecke, 2015). In this line, Task 3 proposes modelling human language in a scenario in which Spanish electronic health documents could be machine readable from a semantic point of view. Task 3 consists of detecting and classifying concepts and relating them semantically. Task 4 is related to the brand safety concept, which is crucial for the reputation of a brand or of the company that owns it. Task 4 proposes classifying how safe a news item is for placing a brand's advertisement, according to the headline of that news item.

Tasks 3 and 4 provided specific datasets for the proposed challenges, which are described in Sections 2.3.1 and 2.4.1 respectively. Task 1 provided an extension of the InterTASS corpus, which was presented in the 2017 edition (Martínez-Cámara et al., 2017). The main novelty of the new version of InterTASS is the incorporation of tweets written in the Spanish spoken in Spain and in several countries of America. Since the difficulty of Task 2 is high, the organisation kept the same task setting as in previous editions.

The paper is organised as follows: Section 2 describes all the tasks proposed in the 2018 edition. The specific details of each task are in Sections 2.1, 2.2, 2.3 and 2.4 respectively. Section 3 presents the conclusions of the paper.

2 Spanish Semantic Analysis Tasks

As mentioned before, TASS is a relevant workshop for semantic analysis tasks, particularly for Spanish. In 2018, new resources and challenges were introduced to evolve Sentiment Analysis systems to a semantic level. In the last editions, several research groups from different countries, like Uruguay or Costa Rica, presented their systems, and it was mandatory to make an effort to build adequate resources for their varieties of Spanish.

In addition, society and companies are interested in new specific challenges, and for this reason new tasks arise, while the main task (global polarity) is maintained.

In this Section, we describe the four tasks of the 2018 edition: Section 2.1 presents the details of Task 1; Section 2.2 describes the corpus and the systems submitted to Task 2; Section 2.3 focuses on Task 3; and Section 2.4 describes all details of Task 4.

2.1 Task 1

This task focused on the evaluation of polarity classification systems at tweet level, for tweets written in Spanish.

The submitted systems had to face, as usual, the lack of context due to the short length of tweets, which are written in informal language with misspellings, emojis and even onomatopoeias. But this edition brought new challenges to this task:

• Multilinguality: the training, test and development corpora contain tweets written in Spanish from Spain, Peru and Costa Rica.

• Generalisation: several corpora have been used. One of them is the development set, so it follows a distribution similar to that of the training data. The second corpus is the test set of the General Corpus of TASS, which was compiled some years ago, so it may be lexically and semantically different from the training and development data. Furthermore, the systems are evaluated with test sets of tweets written in the Spanish spoken in different American countries.

The General Corpus of TASS has been provided in the same way as in previous editions. Further details can be found in (Martínez-Cámara et al., 2017).

The International TASS Corpus (InterTASS), in turn, is a corpus released in 2017 that has been updated for this edition with new texts. It is composed of tweets written in different varieties of Spanish (from Spain, Peru and Costa Rica), so it exhibits a large number of lexical and even structural differences between variants. The main purpose of compiling and using an inter-varietal corpus of Spanish for the evaluation tasks is to challenge participating systems to cope with the many faces of this language worldwide.

Datasets were annotated with 4 different polarity labels (positive, negative, neutral and none), and systems had to identify the orientation of the opinion expressed in each tweet as one of those 4 polarity levels.

The Spanish variety part was released in 2017 and its description can be found in (Martínez-Cámara et al., 2017). Table 1 shows the tweet distribution for the training, development (dev.) and test corpora.
          Training    Dev.     Test
P              317     156      642
NEU            133      69      216
N              416     219      767
NONE           138      62      274
Total        1,008     506    1,899

Table 1: Tweets distribution in InterTASS-ES

The Peru and Costa Rica varieties have been released for this edition. Their tweet distributions are shown in Tables 2 and 3 respectively.

          Training    Dev.     Test
P              231      95      430
NEU            166      61      367
N              242     106      472
NONE           361     238      159
Total        1,000     500    1,428

Table 2: Tweets distribution in InterTASS-PE

          Training    Dev.     Test
P              230      93      354
NEU             94      39      164
N              311     110      491
NONE           165      58      224
Total          800     300    1,233

Table 3: Tweets distribution in InterTASS-CR

Four sub-tasks were proposed, working with the datasets of the different countries:

Subtask-1: Monolingual ES. Training and test were the InterTASS ES datasets.

Subtask-2: Monolingual PE. Training and test were the InterTASS PE datasets.

Subtask-3: Monolingual CR. Training and test were the InterTASS CR datasets.

Subtask-4: Cross-lingual. The training could be done with any dataset, but a different one had to be used for the evaluation, in order to test the dependency of systems on a language.

Results were submitted in a plain text file with the following format:

tweet id \t polarity

Accuracy and the macro-averaged versions of Precision, Recall and F1 were used as evaluation measures. Systems were ranked by the Macro-F1 and Accuracy measures.

2.1.1 Analysis of the Results

For Task 1, five systems were presented. Most of them make use of deep learning algorithms, combining different ways of obtaining the word embeddings.

INGEOTEC. Moctezuma et al. (2018) present a polarity classification system based on the combination of different labelling systems. The main component is the EvoMSA system, based on genetic algorithms, which combines the outputs of the other systems. EvoMSA is based on the B4MSA system for the adjustment of the different parameters (how the text is normalised, how the tokens are calculated or how the tokens are weighted) and on the EvoDAG program, which carries out the classification. As for the input systems, various systems based on lexicons of affectivity or aggressiveness are used. It also uses the word embedding algorithm FastText, trained on the Spanish Wikipedia. Vectors are generated for each document and an SVM is used for training. Their approach performs better when it is trained with tweets from Spain and tested with other Spanish varieties.

RETUYT-InCo. Chiruzzo and Rosá (2018) submitted three approaches: an SVM using word embedding centroids and manually crafted features, a CNN using word embeddings as input, and a Long Short Term Memory (LSTM) network using word embeddings, trained with a focus on improving the recognition of neutral tweets. In all cases, embeddings improve the results, and the LSTM has the best behaviour for neutral tweets. The use of a mixed-balanced training method for the LSTM resulted in a significant improvement in the detection of neutral tweets.

ITAINNOVA. Montanés, Aznar, and del Hoyo (2018) analyse the use of convolutional network models (CNN), LSTM, Bidirectional LSTM (BiLSTM) and a hybrid approach between CNN and LSTM. They choose the CNN-LSTM combination because it integrates the benefits of both models.
ELiRF-UPV. González, Hurtado, and Pla (2018b) explore different approaches based on Deep Learning. Specifically, they study the behaviour of the CNN, Attention Bidirectional Long Short Term Memory (Att-BLSTM) and Deep Averaging Networks (DAN). In order to study the behaviour of the different models, they carry out a tuning process. They get the best results in InterTASS-ES. However, linguistic variability affects the choice of architecture and its hyperparameters, so applying the same system to the InterTASS-CR and InterTASS-PE tasks without any adjustment did not yield results as competitive as those in InterTASS-ES.

ATALAYA. Luque and Pérez (2018) presented a system that uses a weighted scheme to average the subword-aware embeddings obtained from preprocessed tweets that have been enriched with data obtained from machine translation. This novel solution involves translating tweets into another language and back into the source language, in order to augment them lexically and grammatically.

Tables 4, 5 and 6 show the results obtained in the monolingual subtasks (Spain, Costa Rica and Peru variants).

Run                                 M. F1    Acc.
elirf-es-run-1                      0.503    0.612
retuyt-lstm-es-1                    0.499    0.549
retuyt-lstm-es-2                    0.498    0.514
retuyt-combined-es                  0.491    0.602
elirf-es-run-2                      0.489    0.593
atalaya-ubav3-100-3-syn             0.476    0.544
retuyt-svm-es-2                     0.473    0.584
atalaya-lr-50-2-bis                 0.468    0.599
atalaya-lr-50-2                     0.461    0.598
atalaya-ubav3-50-3                  0.460    0.583
retuyt-cnn-es-1                     0.458    0.592
atalaya-lr-50-2-roc                 0.455    0.595
ingeotec-run1                       0.445    0.530
retuyt-cnn-es-2                     0.445    0.574
atalaya-svm-50-2                    0.431    0.583
itainnova-cl-base                   0.383    0.433
itainnova-cl-proc1                  0.320    0.395
retuyt-cnn-es-1                     0.097    0.096

Table 4: Task 1: InterTASS Monolingual ES

Run                                 M. F1    Acc.
retuyt-lstm-cr-2                    0.504    0.537
retuyt-svm-cr-2                     0.499    0.577
retuyt-svm-cr-1                     0.493    0.567
elirf-cr-run-2                      0.482    0.561
retuyt-cnn-cr-1                     0.477    0.569
atalaya-cr-lr-50-2                  0.475    0.582
ingeotec-run1                       0.474    0.522
retuyt-lstm-cr-1                    0.473    0.530
retuyt-cnn-cr-2                     0.469    0.563
elirf-intertass-cr-run-1            0.463    0.544
atalaya-mlp-300-sentiment           0.439    0.520
atalaya-mlp-ubav3-50-3              0.436    0.560
ingeotec-run1                       0.384    0.398
elirf-cr-run-1                      0.317    0.288

Table 5: Task 1: InterTASS Monolingual CR

Run                                 M. F1    Acc.
retuyt-cnn-pe-1                     0.472    0.494
atalaya-pe-lr-50-2                  0.462    0.451
retuyt-lstm-pe-2                    0.443    0.488
retuyt-svm-pe-2                     0.441    0.471
ingeotec-run1                       0.439    0.447
elirf-intertass-pe-run-2            0.438    0.461
atalaya-mlp-sentiment-ubav3-50-3    0.437    0.520
retuyt-svm-pe-1                     0.437    0.474
elirf-intertass-pe-run-1            0.435    0.440
atalaya-mlp-300-sentiment           0.429    0.395
atalaya-mlp-50-sentiment            0.427    0.501
retuyt-svm-pe-2                     0.425    0.477
retuyt-cnn-pe-2                     0.425    0.477
retuyt-lstm-pe-1                    0.419    0.420
elirf-intertass-pe-run-1            0.225    0.210

Table 6: Task 1: InterTASS Monolingual PE

For the cross-lingual runs, the participants selected an InterTASS dataset to train their systems and a different one to test them, in order to test the dependency of systems on a language. Tables 7, 8 and 9 show the results obtained in these cross-lingual subtasks.

Run                                 M. F1    Acc.
retuyt-svm-cross-es-2               0.471    0.555
retuyt-lstm-cross-es-2              0.470    0.466
retuyt-svm-cross-es-1               0.464    0.572
retuyt-cnn-cross-es-1               0.450    0.524
retuyt-cnn-cross-es-2               0.448    0.563
ingeotec-run1                       0.445    0.530
atalaya-mlp-300-sentiment           0.441    0.485
retuyt-lstm-cross-es-1              0.438    0.498

Table 7: Task 1: InterTASS Cross-lingual with ES as test

Run                                 M. F1    Acc.
ingeotec-run1                       0.447    0.506
retuyt-svm-cross-pe-2               0.445    0.514
retuyt-svm-cross-pe-1               0.444    0.505
retuyt-lstm-cross-pe-2              0.444    0.465
atalaya-mlp-300-sentiment           0.438    0.523
retuyt-lstm-cross-pe-1              0.425    0.472
retuyt-cnn-cross-pe-1               0.409    0.481
retuyt-cnn-cross-pe-2               0.391    0.438
itainnova-cl-base-cross-PE          0.367    0.382

Table 8: Task 1: InterTASS Cross-lingual with PE as test

Run                                 M. F1    Acc.
retuyt-svm-cross-cr-1               0.476    0.569
retuyt-svm-cross-cr-2               0.474    0.542
retuyt-lstm-cross-cr-1              0.473    0.530
retuyt-cnn-cross-cr-2               0.462    0.551
ingeotec-run2                       0.454    0.538
retuyt-lstm-cross-cr-2              0.444    0.468
retuyt-cnn-cross-cr-1               0.421    0.423
itainnova-cl-base-cross-CR          0.409    0.440
ingeotec-run1                       0.384    0.398

Table 9: Task 1: InterTASS Cross-lingual with CR as test

The overall results, in terms of F1, obtained with the monolingual and multilingual systems for the Spanish and Costa Rica collections are quite comparable, but those for the Peru collection fall by around 10%.

2.2 Task 2

Task 2, Aspect-based Sentiment Analysis, proposes the development of aspect-based polarity classification systems. As in previous editions (Martínez-Cámara et al., 2017), two datasets were used to evaluate the different approaches: Social-TV and STOMPOL. Both datasets were annotated with aspect-related metadata: the main category of the aspect, and the polarity of the opinion about the aspect. Systems had to classify the opinion about the given aspect into one of 3 polarity labels (positive, negative, neutral).

Participants were expected to submit up to 3 experiments for each provided collection, each in a plain text file with tweet identification, aspect and polarity.

For evaluation, exact match with a single label combining "aspect-polarity" was used. As in Task 1, the macro-averaged versions of Precision, Recall and F1, together with Accuracy, were considered, and Macro-F1 was used for the final ranking of the proposed systems.

2.2.1 Collections

The Social-TV corpus was collected during the 2014 Final of the "Copa del Rey" championship in Spain. After filtering out useless information, a subset of 2,773 tweets was obtained. The details of the corpus are described in (Villena-Román et al., 2015; García-Cumbreras et al., 2016; Martínez-Cámara et al., 2017).

STOMPOL (corpus of Spanish Tweets for Opinion Mining at aspect level about POLitics) is a corpus for the task of Aspect Based Sentiment Analysis. The corpus is composed of 1,284 tweets manually annotated by two annotators, and a third one in case of disagreement. The details of the corpus are described in (Villena-Román et al., 2015; García-Cumbreras et al., 2016; Martínez-Cámara et al., 2017).

2.2.2 Results

Only the research group ELiRF (González, Hurtado, and Pla, 2018c) participated in this edition. They explored different approaches based on Deep Learning. Specifically, they studied the behaviour of the CNN, Attention Bidirectional Long Short Term Memory (Att-BLSTM) and Deep Averaging Networks (DAN), similar to the proposal of the team for Task 1. In order to study the performance of the different models, they carried out a tuning process. Tables 10 and 11 show the results obtained in their experiments.

Run                                 M. F1    Acc.
ELiRF-UPV-run1                      0.485    0.627
ELiRF-UPV-run3                      0.483    0.628
ELiRF-UPV-run2                      0.476    0.625

Table 10: Task 2 Social-TV corpus results

Run                                 M. F1    Acc.
ELiRF-UPV-run2                      0.526    0.633
ELiRF-UPV-run1                      0.490    0.613
ELiRF-UPV-run3                      0.447    0.576

Table 11: Task 2 STOMPOL corpus results

2.3 Task 3

NLP methods are increasingly being used to mine knowledge from unstructured content in health (Liu et al., 2013; Doing-Harris and Zeng-Treitler, 2011; Gonzalez-Hernandez et al., 2017) and other domains (Estevez-Velarde et al., 2018). Over the years, many eHealth challenges have taken place, such as SemEval [2], CLEF [3] campaigns and others (Augenstein et al., 2017). These tasks have mainly dealt with identification, classification, extraction and linking of knowledge. Task 3, eHealth Knowledge Discovery (eHealth-KD), proposes modelling human language in a scenario in which Spanish electronic health documents could be machine readable from a semantic point of view. This task is designed to encourage the development of software technologies to automatically extract a large variety of knowledge from eHealth documents written in the Spanish language.

[2] International Workshop on Semantic Evaluation
[3] Conference and Labs of the Evaluation Forum

In order to capture the semantics of a broad range of health related text, eHealth-KD proposes the identification of two types of elements: Concepts and Actions. Concepts are key phrases that represent actors or entities which are relevant in a domain, while Actions represent how these Concepts interact with each other. Actions and Concepts can be linked by two types of relations, subject and target, which describe the main roles that a Concept can perform. Also, four specific semantic relations between Concepts are defined: is-a, part-of, property-of and same-as. Figure 1 provides an example.

Figure 1: Example annotation of a small set of documents.

                  Train     Dev.     Test
Files                 6        1        3
Sentences           559      285      300
Annotations        5976     3573     3310
Entities           3280     1958     1805
 - Concepts        2431     1524     1305
 - Actions          849      434      500
Roles              1684      843      988
 - subject          693      339      401
 - target           991      504      587
Relations          1012      772      517
 - is-a             434      370      235
 - part-of          149      145       96
 - property-of      399      244      178
 - same-as           30       13        8

Table 12: Statistics of the eHealth-KD v1.0 corpus.

To simplify and normalise the extraction process, the overall task is divided into three subtasks:

• Subtask A is concerned with the extraction of the relevant key phrases.

• Subtask B is concerned with the classification of the key phrases identified in Subtask A as either Concept or Action.

• Subtask C is concerned with the discovery of the semantic relations between pairs of entities.

To compute the evaluation metrics for each subtask, we define the following sets for comparing the expected output (gold standard) with the actual output in each subtask:

Correct matches (C): in all subtasks, when one gold and one given annotation exactly match.

Partial matches (P): in subtask A, when two key phrases have a non-empty intersection.

Missing matches (M): in subtasks A and C, when an annotation in the gold output is not provided by the system.

Spurious matches (S): in subtasks A and C, when an annotation given by the system does not appear in the gold output.

Incorrect matches (I): in subtask B, when one assigned label is incorrect.

To measure the results of the individual subtasks as well as the overall results, the eHealth-KD challenge proposes three evaluation scenarios.

Scenario 1. The first scenario requires all subtasks (i.e. A, B and C) to be performed sequentially. The input in this scenario consists of plain text (100 sentences), and participants must submit the three output files corresponding to subtasks A, B and C. In this scenario the overall quality of the participant systems is evaluated. So, a combined micro F1 metric was defined, taking into account the results of the three subtasks:

$$ F1_{ABC} = \frac{2 \cdot P_{ABC} \cdot R_{ABC}}{P_{ABC} + R_{ABC}} \tag{1} $$

$$ P_{ABC} = \frac{T_{ABC} + \frac{1}{2} P_A}{T_{ABC} + P_A + M_A + I_B + M_C} \tag{2} $$

$$ R_{ABC} = \frac{T_{ABC} + \frac{1}{2} P_A}{T_{ABC} + P_A + S_A + I_B + S_C} \tag{3} $$

$$ T_{ABC} = C_A + C_B + C_C \tag{4} $$

Scenario 2. In the second scenario only subtasks B and C are performed. Hence, participants receive plain text inputs and the corresponding outputs for subtask A (a different subset of 100 sentences). This scenario allows participants to focus on the classification of key phrases, without being affected by errors related to the extraction of key phrases. Like Scenario 1, a combined micro F1 is defined which takes into account the results for subtasks B and C:

$$ F1_{BC} = \frac{2 \cdot P_{BC} \cdot R_{BC}}{P_{BC} + R_{BC}} \tag{5} $$

$$ P_{BC} = \frac{T_{BC}}{T_{BC} + I_B + M_C} \tag{6} $$

$$ R_{BC} = \frac{T_{BC}}{T_{BC} + I_B + S_C} \tag{7} $$

$$ T_{BC} = C_B + C_C \tag{8} $$
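Equations (1)-(8) can be sketched in a few lines of code. The following is a minimal illustration, not the organisers' evaluation script: the `Counts` container and the function names are hypothetical, and the denominators are taken exactly as printed in equations (2), (3), (6) and (7). Since F1 is symmetric in its two arguments, exchanging the missing and spurious terms between the two denominators would not change the resulting score.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Match counts per subtask; field names mirror the symbols in the text."""
    c_a: int = 0  # correct matches, subtask A
    p_a: int = 0  # partial matches, subtask A
    m_a: int = 0  # missing matches, subtask A
    s_a: int = 0  # spurious matches, subtask A
    c_b: int = 0  # correct labels, subtask B
    i_b: int = 0  # incorrect labels, subtask B
    c_c: int = 0  # correct matches, subtask C
    m_c: int = 0  # missing matches, subtask C
    s_c: int = 0  # spurious matches, subtask C


def ratio(num: float, den: float) -> float:
    """Safe division: an empty denominator yields 0 (an assumption of this sketch)."""
    return num / den if den else 0.0


def harmonic(p: float, r: float) -> float:
    """F1 as the harmonic mean of two values, equations (1) and (5)."""
    return ratio(2 * p * r, p + r)


def f1_scenario1(n: Counts) -> float:
    t = n.c_a + n.c_b + n.c_c                              # equation (4)
    num = t + 0.5 * n.p_a                                  # partial matches count half
    p = ratio(num, t + n.p_a + n.m_a + n.i_b + n.m_c)      # equation (2), as printed
    r = ratio(num, t + n.p_a + n.s_a + n.i_b + n.s_c)      # equation (3), as printed
    return harmonic(p, r)


def f1_scenario2(n: Counts) -> float:
    t = n.c_b + n.c_c                                      # equation (8)
    p = ratio(t, t + n.i_b + n.m_c)                        # equation (6), as printed
    r = ratio(t, t + n.i_b + n.s_c)                        # equation (7), as printed
    return harmonic(p, r)


if __name__ == "__main__":
    # Made-up counts, only to exercise the formulas.
    example = Counts(c_a=8, p_a=2, m_a=1, s_a=3, c_b=7, i_b=1, c_c=4, m_c=2, s_c=1)
    print(round(f1_scenario1(example), 3))  # 0.784
    print(round(f1_scenario2(example), 3))  # 0.815
```

Keeping the counts in one record makes it explicit that both scenarios reuse the same subtask B and C counts, and that Scenario 2 simply drops the subtask A terms.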
Scenario 3. Finally, the third scenario evaluates only subtask C. Participants are provided with plain text inputs and the corresponding output of subtasks A and B (a final subset of another 100 sentences). In this scenario, the following metric is defined for evaluation:

$$ F1_{C} = 2 \cdot \frac{P_{C} \cdot R_{C}}{P_{C} + R_{C}} \tag{9} $$

$$ P_{C} = \frac{C_C}{C_C + S_C} \tag{10} $$

$$ R_{C} = \frac{C_C}{C_C + M_C} \tag{11} $$

For competition purposes, the best system is defined as the submission that maximises the macro-average F1 across all three scenarios:

$$ F1 = \frac{F1_{ABC} + F1_{BC} + F1_{C}}{3} \tag{12} $$

2.3.1 Corpora

For evaluation purposes, a corpus of health-related sentences in Spanish was manually built and tagged. The corpus consists of a selection of articles collected from the MedlinePlus [4] website. These files contain several entries related to health and medicine topics, and environmental topics strongly related to health care. Spanish language items were converted to a plain text document, processed, and manually tagged using the Brat annotation tool [5] by 15 human annotators divided into seven groups. The final 1,173 tagged sentences were organised in three collections: training, development and test. Table 12 summarises the main statistics of the corpus.

[4] https://medlineplus.gov/xml.html
[5] http://brat.nlplab.org/

2.3.2 Analysis of the Results

The eHealth-KD challenge attracted the attention of a total of 31 registered teams, six of which successfully concluded their participation. Their results are summarised in Table 13. The following tag labels are designed to provide an overview of the main characteristics of each participant system:

S: Uses shallow supervised models such as CRF, logistic regression, SVM, decision trees, etc.

D: Uses deep learning models, such as LSTM or convolutional networks.

E: Uses word embeddings or other embedding models trained with external corpora.

K: Uses external knowledge bases, either explicitly or implicitly (i.e., through third-party tools).

R: Uses hand crafted rules based on domain expertise.

N: Uses natural language processing techniques or features, i.e., POS-tagging, dependency parsing, etc.

              UC3M†    SINAI†   UPF-UPC†   TALP†    LaBDA†   UH       Baseline
Tags          SDEN     KRN      SKN        DEN      D        RN
Subtask A     0.872*   0.798    0.805      -        0.323    0.172    0.597
Subtask B     0.959*   0.921    0.954      0.931    0.594    0.639    0.774
Subtask C     -        -        0.036      0.448*   0.420    0.018    0.107
Average       0.610*   0.573    0.598      0.460    0.446    0.276    0.493
Scenario 1    0.744*   0.710    0.681      -        0.297    0.181    0.566
Scenario 2    0.648    0.674    0.622      0.722*   0.275    0.255    0.577
Scenario 3    -        -        0.036      0.448*   0.420    0.018    0.107
Average       0.464*   0.461    0.446      0.390    0.331    0.151    0.417

Table 13: Summary of systems and results for the TASS 2018 Task 3 event. The best scores are marked with an asterisk (*).

[7] The UH system extracts lexical and syntactic features for each token. Afterwards, it applies a set of handcrafted heuristics for each subtask.

Baseline description: A baseline, trained on the training corpus, was defined. This strategy consists of a dummy approach based solely on the text of key phrases. This technique collects all training data and stores three maps: (1) key phrases associated with their most common class (either Concept or Action); (2) pairs of concepts associated with their most common relation; and (3) tuples of [...] associated with their most common role. At prediction time, these maps are used to select a key phrase, decide its class, and predict relations and roles.

Once the shared task ended, the official results were published. However, some participants noticed that their systems provided duplicated outputs on some occasions. These duplicated outputs, even if correct, were being counted as spurious after the first match. To account for this duplication, the evaluation script was modified to remove duplicated outputs from the participants' submissions prior to calculating the evaluation metrics. Table 14 shows this second version of the metrics, where variations in scores are highlighted. This proved not to be a significant problem, since only two participants were affected, and even though their metrics improved marginally, neither the overall results nor the main conclusions of the shared task changed.

The results of this task, eHealth-KD, show that a variety of approaches, on the whole, deal effectively with the health knowledge discovery problem. However, issues still need to be resolved to obtain highly competitive systems. The best performing submissions include classic supervised learning, deep learning and knowledge-based techniques. In subtask A, the best approach (UC3M) is based on a CRF model with pre-trained embeddings as features. This approach obtains similar scores in subtask B. In general, subtask B appears to be easier than the rest, which is understandable given that there are only two classes and there is a large correlation between word lemmas and their classes (as shown by the relatively high performance of the baseline).

Subtask C, in concordance with Scenario 3, does not exceed 45% in F-score. This reinforces the belief that this task is difficult to deal with, even after having applied novel approaches (i.e. TALP and LaBDA) based on convolutional neural networks.

The best-performing systems in each scenario largely coincide with the best systems in the three subtasks. For Scenario 1, the top performing strategy belongs to UC3M, which achieves the best scores in subtasks A and B, and the overall best result in the shared task (averaged across all three scenarios), closely followed by SINAI. Likewise, the best strategy in Scenario 3 corresponds to TALP, which achieves the best score for subtask C. However, for the overall results, other participants such as SINAI and UPF-UPC achieve higher average scores, even though their performance in subtask C and Scenario 3 is practically negligible. In contrast, these teams obtain relatively high scores in subtasks A and B.

The diverse nature and complexity of the three subtasks make it difficult to design a single fair evaluation metric. For this reason, we consider that each submission obtains its most accurate results on the specific sub-problems that it tackles. Although generalisation across the three subtasks is a desirable characteristic, advances in any particular subtask are also very valuable.

In general, the most competitive approaches in individual subtasks are dominated by state-of-the-art machine learning. In the particular case of subtask C, modern deep learning approaches seem to outperform classic techniques. In addition, incorporating domain-specific knowledge provides a significant boost to the performance. Most participants use NLP features, either explicitly, or implicitly captured in word embeddings and other representations. An interesting phenomenon is that the best systems in subtask A do not correlate with the best systems in subtask C. This suggests that the optimal approach for each subtask is different, giving rise to an interesting research line that would explore integrated approaches to solving these three subtasks simultaneously. The overall results show that general purpose knowledge discovery in domain-specific documents is potentially a prolific research area, particularly for the Spanish language. We expect similar future initiatives to provide fruitful evaluation scenarios where researchers can deploy techniques from several domains, and compete in friendly contests to improve the state-of-the-art.
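As a cross-check of the Scenario 3 metric and of the ranking score in equation (12), the sketch below recomputes the scenario averages reported in Table 13. The function names and the `TABLE_13` dictionary are illustrative only. One assumption is needed: a scenario without a submission (shown as "-" in the table) is counted as 0. The text does not state this rule, but it is the value that reproduces every reported average.

```python
from typing import Dict, Optional, Sequence, Tuple


def f1_scenario3(correct: int, spurious: int, missing: int) -> float:
    """Equations (9)-(11): precision, recall and F1 for subtask C."""
    p = correct / (correct + spurious) if correct + spurious else 0.0
    r = correct / (correct + missing) if correct + missing else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def overall_f1(scenario_scores: Sequence[Optional[float]]) -> float:
    """Equation (12): mean of the F1 scores of Scenarios 1, 2 and 3.

    Assumption of this sketch: a missing scenario (None) counts as 0.
    """
    if len(scenario_scores) != 3:
        raise ValueError("expected one score per scenario (three in total)")
    return sum(score or 0.0 for score in scenario_scores) / 3


# (Scenario 1, Scenario 2, Scenario 3) scores and the reported average,
# copied from Table 13.
TABLE_13: Dict[str, Tuple[Tuple[Optional[float], ...], float]] = {
    "UC3M": ((0.744, 0.648, None), 0.464),
    "SINAI": ((0.710, 0.674, None), 0.461),
    "UPF-UPC": ((0.681, 0.622, 0.036), 0.446),
    "TALP": ((None, 0.722, 0.448), 0.390),
    "LaBDA": ((0.297, 0.275, 0.420), 0.331),
    "UH": ((0.181, 0.255, 0.018), 0.151),
    "Baseline": ((0.566, 0.577, 0.107), 0.417),
}

if __name__ == "__main__":
    for team, (scores, reported) in TABLE_13.items():
        print(f"{team:9s} computed={overall_f1(scores):.3f} reported={reported:.3f}")
```

Under this reading, a team that skips a scenario is penalised in the ranking, which is consistent with the discussion above: TALP has the best Scenario 2 and Scenario 3 scores but a lower average than UC3M and SINAI.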
More details on the systems in Table 13 can be found in UC3M (Zavala, Martínez, and Segura-Bedmar, 2018), SINAI (López-Ubeda et al., 2018), UPF-UPC (Palatresi and Hontoria, 2018), TALP (Medina and Turmo, 2018), LaBDA (Suarez-Paniagua, Segura-Bedmar, and Martínez, 2018) and UH [7]. In that table, the symbol † means that the group submitted a system description paper.

              UPF-UPC   LaBDA
Tags          SKN       DN
Subtask A     0.805     0.323
Subtask B     0.954     0.594
Subtask C     0.036     0.444*
Average       0.598     0.454*
Scenario 1    0.681     0.310*
Scenario 2    0.626*    0.294*
Scenario 3    0.036     0.444*
Average       0.448*    0.349*

Table 14: Summary of results of submissions that changed once duplicated entries were removed. Variations in score with respect to Table 13 are marked with an asterisk (*).

2.4 Task 4

When news items are about natural disasters, readers usually feel negative emotions (sadness, for instance), whereas when they are about the last championship won by the reader's favourite team, readers feel positive emotions like happiness. Moreover, it is commonly assumed in marketing that emotions aroused in the reader by news articles have an impact on the perception of the advertisements displayed along with those articles. Thus, from that marketing perspective, if a company wants to promote its brand, its ads should preferably be associated with (i.e., shown with) news that arouses positive emotions. The objective of Task 4 is to encourage the development of systems that can classify a news article as safe (positive emotions, so safe for ads) or unsafe (negative emotions, so ads are better avoided). This task could be considered a kind of stance classification on the positioning of readers of news content. The task is a strong challenge because it has to deal with the polarity of feeling (safe vs. unsafe) and to work in combination with a (pseudo) thematic classification in order to determine the meaning of the news. For example, a reduction in traffic accidents carries a negative feeling because of the accidents, but the context of reducing the number of accidents turns that bad news into good news, hence safe news.

2.4.1 Corpora

The Spanish brANd Safe Emotion (SANSE) corpus was specifically built for this task. RSS feeds of different online newspapers written in different varieties of Spanish (Argentina, Chile, Colombia, Cuba, Spain, USA, Mexico, Peru and Venezuela) were collected for over a month. In the end, 15,152 articles were captured, each containing the URL, the publication date and the headline. News summaries were also collected for several sources, but they were finally discarded to keep the dataset consistent and homogeneous.

Then 2,000 articles (L1 subset) were randomly selected and manually annotated as SAFE or UNSAFE, from the point of view of the general public of each corresponding country. The other 13,152 articles (L2 subset) were not annotated.

As the datasets were annotated with two levels of safety, SAFE and UNSAFE, the task can be considered a binary classification task.

Subset          Size
Training        1250
Development      250
Test             500

Table 15: Statistics of the SANSE corpus

The annotation was carried out by two human annotators (the two organisers of the task) and, for those cases with no agreement between the two annotators, a third annotator broke the tie. A safe headline was defined as an utterance that arouses a positive or neutral emotion in the reader and is not related to controversial topics: religion, extreme-wing political topics, or topics that arouse strong positive emotions in some readers but strong negative emotions in others. An unsafe headline was defined as an utterance that arouses negative emotions in the reader.

Some examples in Spanish:

Así será el nuevo pan integral en España, según una nueva ley en marcha. → SAFE
This is what the new wholemeal bread will be like in Spain, according to a new law underway.

Casi 300 municipios de Colombia en riesgo electoral. → UNSAFE
Almost 300 municipalities in Colombia at electoral risk.

The agreement of the annotation was 0.58 according to π (Scott, 1955) and κ (Cohen, 1960), which may be considered moderate according to Landis and Koch (1977). Although the agreement is moderate, it is close to being considered substantial, and we also have to take into account that this is a new classification task that works with strongly subjective content. We will work on making the annotation guidelines more precise in order to improve the agreement between annotators. Besides, we hope that the participants will give us insights with the aim of improving the annotation of the data.

Subset              Size
Training (Spain)     300
Dev. (Spain)          48
Test (Mexico)        144
Test (Cuba)          194
Test (Chile)         194
Test (Colombia)      195
Test (Argentina)     198
Test (Venezuela)     233
Test (Peru)          234
Test (USA)           260

Table 16: Caption

[...] the Spanish language, it is thus a monolingual evaluation. In this task, datasets are composed of headlines of news written in different versions of the Spanish language, but the country of the text is not relevant for this task.

Participants were provided with the training and development subsets of the L1 SANSE corpus for building the systems, and two test sets for the evaluation: the test subset of the L1 SANSE corpus and the L2 SANSE corpus.

The systems presented were evaluated using the measures of Macro-Precision (M. P.), Macro-Recall (M. R.), Macro-F1 (M. F1) and Accuracy (Acc.).

Subtask 2 (S2) is similar to S1, but in this case the aim is to evaluate the generalisation capacity of the submitted systems. For training their systems, participants were provided with SANSE subsets with headlines written only in the Spanish language spoken in Spain.
The test set was composed of headlines writ- The L1 subset was then again divided in ten in the Spanish language spoken in differ- three subsets, specifically: training, develop- ent countries of America. The statistics of ment and test. The statistics of the three SENSE corpus for S2 are shown in the Table subsets are in Table 15. 16. 2.4.2 Tasks 2.4.3 Results Two subtasks were proposed. Subtask 1 (S1) Task 4 attracted the attention of seven teams, consists of the classification of headlines into and most of them participated in both lev- safe or unsafe for incorporating an ad of els of evaluation of the S1 and in S2. Table a brand. The evaluation of the systems does 2.4.3 shows the participation of the teams in not take into account the cultural varieties of each Subtask. Five groups of the seven ones 22 Overview of TASS 2018: Opinions, Health and Emotions submitted a system description paper, whose rbnUGR. Rodrı́guez Barroso, Martı́nez- main features will be detailed as what follows. Cámara, and Herrera (2018) submitted three INGEOTEC. Moctezuma et al. (2018) systems grounded in deep learning. Although propose an ensemble classification system the three systems are based on Long Short- (EvoMSA), which is composed of several and Term Memory (LSTM) Recurrent Neural heterogeneous base systems and a genetic Network (RNN), they have several differ- programming system (EvoDAG, (Graff et al., ences: 2017)) that optimises the contribution of each Run 1. It uses a LSTM layer as encoding base system in the final classification. The layer, and its output is the last vector authors combined supervised and unsuper- state of the LSTM layer. vised system as base classification systems. Run 2. It uses a BiLSTM8 layer as encoding The supervised ones are based on the use layer, and its output is the concatenation of the algorithm SVM with different repre- of the last vector state of the two LSTM sentations of the input text, namely TF-IDF layers. and pre-trained word vectors. 
The system Run 3. It uses a LSTM layer as encoding reached the best results in the monolingual layer, and its output is the concatenation and the multilingual evaluations, however the of the corresponding output state vector performance of the system dropped a bit in of each input token. S1 L2. Since the annotation test set of S1 L2 was conducted by a voting system of the The results show that the systems based all the submitted systems, the lower perfor- on one single LSTM layer perform better mance in S1 L2 may be caused by a different than the one based on BiLSTM. Regarding error distribution between INGEOTEC and the different results in S1 and S2 indicate the systems submitted by the other groups. that the use the entire output of the encod- ELiRF UPV. González, Hurtado, and ing layer allow to improve the generalisation Pla (2018a) propose a deep neural network, capacity of the model, because the multilin- specifically the model Deep Averaging Net- gual evaluation requires a higher generalisa- works (DAN) (Iyyer et al., 2015). The au- tion capacity. thors used a set of pre-trained word embed- MeaningCloud. Herrera-Planells and dings for representing the news headlines. Villena-Román (2018) propose three su- The set of pre-trained word embeddings were pervised systems, two of them are lineal prepared by the authors and built upon a cor- classification systems and the other one a pus of tweets (Hurtado, Pla, and González, non-lineal classification system. The linear 2017). The high performance reached by classification systems use XGBoost (Chen a set of pre-trained word embeddings built and Guestrin, 2016) as classification system. upon tweets with news headlines stands out, They differ in the set of features used to because the genre of news headlines and represent the news headlines, which are tweets are different. However, it may mean mainly built using the public APIs of the that the use of language in tweets and news text analytic platform of MeaningCloud. 
headlines is similar. The non lineal classification system is a neural network based on the use of a CNN layer. The proposal that reached higher Team S1 L1 S1 L2 S2 results was the one grounded in a CNN INGEOTEC† X X X (Run 3). ELiRF-UPV† X X X SINAI. Plaza del Arco et al. (2018) pro- rbnUGR† X X X pose to represent the news headlines as a vec- MeaningCloud† X X X tor of unigrams weighted with TF-IDF, and SINAI† X X - the number of positive and negative words ac- lone wolf X X - cording to three list of opinion bearing words. TNT-UA-WFU X X - The authors used SVM as classification algo- rithm. Table 17: Participation of each team on each The evaluation measures in the two Sub- Subtask. The symbol † means that the group tasks were accuracy and the macro-average submitted a system description paper 8 A BiLSTM is an elaboration of two LSTM layers. 23 E. Martínez-Cámara, Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. Á. García-Cumbreras, M. García-Vega, Y. Gutiérrez et al. S1 L1 S1 L2 S2 System M. P. M. R. M. F1 Acc. M. P. M. R. M. F1 Acc. M. P. M. R. M. F1 Acc. INGEOTEC run1 0.794 0.795 0.7951 0.802 0.853 0.880 0.8664 0.871 0.722 0.715 0.7191 0.737 ELiRF UPV run2 0.787 0.794 0.7902 0.794 0.850 0.884 0.8673 0.865 0.747 0.657 0.6992 0.722 ELiRF UPV run1 0.795 0.784 0.7903 0.800 0.878 0.889 0.8831 0.893 0.736 0.649 0.6903 0.715 rbnUGR run1 0.784 0.764 0.7744 0.786 0.880 0.867 0.8732 0.888 0.683 0.661 0.6726 0.700 MEANING- 0.767 0.767 0.7675 0.776 0.781 0.804 0.7937 0.801 0.647 0.654 0.6517 0.658 CLOUD run3 rbnUGR run3 0.763 0.765 0.7646 0.772 0.838 0.870 0.8536 0.853 0.687 0.678 0.6834 0.631 rbnUGR run2 0.774 0.752 0.7637 0.776 0.868 0.857 0.8635 0.878 0.679 0.672 0.6765 0.698 SINAI 0.733 0.722 0.7288 0.742 0.769 0.777 0.7738 0.793 - - - - MEANING- 0.723 0.727 0.7259 0.732 - - - - - - - - CLOUD run2 MEANING- 0.713 0.722 0.71710 0.714 - - - - - - - - CLOUD run1 Table 18: Macro precision (M. P.), macro recall (M. R.), macro f1 (M. F1) and accuracy (Acc.) 
reached by each submitted system to each Subtask of the groups that submitted a system description paper of precision, recall and F1, and the systems 89517-P) from the Spanish Government, and were ranked according to the value of macro- “Plataforma Inteligente para Recuperación, F1. Table 18 show the results reached by each Análisis y Representación de la Infor- group that submitted the description of their mación Generada por Usuarios en Internet” systems in S1 L1, S1 L2 and S2 respectively. (GRE16-01) from University of Alicante. Eugenio Martı́nez Cámara was supported by 3 Conclusions the Spanish Government Programme Juan The edition of TASS 2018 has attracted the de la Cierva Formación (FJCI-2016-28353). participation of 16 systems, and the submis- sion of 15 system description papers. More- References over, we have proposed two new challenges to Augenstein, I., M. Das, S. Riedel, L. Vikra- the international reserach community, which man, and A. McCallum. 2017. Semeval are in line to the requirements of the Indus- 2017 task 10: Scienceie-extracting try. keyphrases and relations from sci- The submitted systems are in the line of entific publications. arXiv preprint the state-of-the-art in other similar work- arXiv:1704.02853. shops, and most of them are grounded in Deep Learning and the use of hand-crafted Chen, T. and C. Guestrin. 2016. Xgboost: linguistic features. Therefore, TASS may be A scalable tree boosting system. In Pro- considered as a reference forum for setting ceedings of the 22Nd ACM SIGKDD In- up the state-of-the-art in semantic analysis ternational Conference on Knowledge Dis- in Spanish. covery and Data Mining, KDD ’16, pages As future work, we plan to enlarge the cov- 785–794, New York, NY, USA. ACM. erage of the Spanish language of the corpus Chiruzzo, L. and A. Rosá. 2018. Retuyt- InterTASS, as well as consolidating the new inco at tass 2018: Sentiment analy- challenges (Task 3 and Task 4). 
Moreover, sis in spanish variants using neural we will keep working in the development of networks and svm. In E. Martı́nez- new corpora and linguistic resources for the Cámara, Y. Almeida Cruz, M. C. research community. Dı́az-Galiano, S. Estévez Velarde, M. A. Garcı́a-Cumbreras, M. Garcı́a-Vega, Acknowledgments Y. Gutiérrez Vázquez, A. Montejo Ráez, This work has been partially supported by a A. Montoyo Guijarro, R. Muñoz Guillena, grant from the Fondo Europeo de Desarrollo A. Piad Morffis, and J. Villena-Román, Regional (FEDER), the projects REDES editors, Proceedings of TASS 2018: Work- (TIN2015-65136-C2-1-R, TIN2015-65136- shop on Semantic Analysis at SEPLN C2-2-R) and SMART-DASCI (TIN2017- (TASS 2018), volume 2172 of CEUR 24 Overview of TASS 2018: Opinions, Health and Emotions Workshop Proceedings, Sevilla, Spain, Gonzalez-Hernandez, G., A. Sarker, September. CEUR-WS. K. O’Connor, and G. Savova. 2017. Cap- Cohen, J. 1960. A coefficient of agreement turing the patient’s perspective: a review for nominal scales. Educational and Psy- of advances in natural language process- chological Measurement, 20(1):37–46. ing of health-related text. Yearbook of medical informatics, 26(01):214–227. Denecke, K. 2015. Health Web Science. Springer International Publishing. Graff, M., E. S. Tellez, H. Jair Escalante, and S. Miranda-Jiménez, 2017. Semantic Ge- Doing-Harris, K. M. and Q. Zeng-Treitler. netic Programming for Sentiment Analy- 2011. Computer-assisted update of a con- sis, pages 43–65. Springer International sumer health vocabulary through mining Publishing, Cham. of social network data. Journal of medical Internet research, 13(2). Herrera-Planells, J. and J. Villena-Román. 2018. MeaningCloud at TASS 2018: News Estevez-Velarde, S., Y. Gutiérrez, A. Mon- headlines categorization for brand safety toyo, A. Piad-Morffis, R. M. noz, and assessment. In Proceedings of TASS 2018: Y. Almeida-Cruz. 2018. 
Gathering Workshop on Semantic Analysis at SE- object interactions as semantic knowl- PLN (TASS 2018), volume 2172, Septem- edge. In Proceedings of the 2018 Inter- ber. national Conference on Artificial Intelli- gence (ICAI’18). Hurtado, L.-F., F. Pla, and J.-A. González. 2017. Elirf-upv en tass 2017: Análisis de Garcı́a-Cumbreras, M. A., J. Villena-Román, sentimientos en twitter basado en apren- E. Martı́nez-Cámara, M. C. Dı́az-Galiano, dizaje profundo. In Proceedings of TASS M. T. Martı́n-Valdivia, and L. A. Ureña 2017: Workshop on Sentiment Analysis at López. 2016. Overview of tass 2016. SEPLN co-located with 33nd SEPLN Con- In TASS 2016: Workshop on Sentiment ference (SEPLN 2017). Analysis at SEPLN, pages 13–21. Iyyer, M., V. Manjunatha, J. Boyd-Graber, González, J.-A., L.-F. Hurtado, and F. Pla. and H. Daumé III. 2015. Deep unordered 2018a. ELiRF-UPV en TASS 2018: Cat- composition rivals syntactic methods for egorizació emocional de noticias. In Pro- text classification. In Proceedings of the ceedings of TASS 2018: Workshop on Se- 53rd Annual Meeting of the Association mantic Analysis at SEPLN (TASS 2018), for Computational Linguistics and the 7th volume 2172, September. International Joint Conference on Natu- González, J.-A., L.-F. Hurtado, and F. Pla. ral Language Processing (Volume 1: Long 2018b. Elirf-upv en tass 2018: Análisis Papers), pages 1681–1691. Association for de sentimientos en twitter basado en Computational Linguistics. aprendizaje profundo. In E. Martı́nez- Landis, J. R. and G. G. Koch. 1977. The Cámara, Y. Almeida Cruz, M. C. measurement of observer agreement for Dı́az-Galiano, S. Estévez Velarde, M. A. categorical data. biometrics, pages 159– Garcı́a-Cumbreras, M. Garcı́a-Vega, 174. Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, Liu, H., S. J. Bielinski, S. Sohn, S. Murphy, A. Piad Morffis, and J. Villena-Román, K. B. Wagholikar, S. R. Jonnalagadda, editors, Proceedings of TASS 2018: Work- K. Ravikumar, S. T. 
Wu, I. J. Kullo, shop on Semantic Analysis at SEPLN and C. G. Chute. 2013. An informa- (TASS 2018), volume 2172 of CEUR tion extraction framework for cohort iden- Workshop Proceedings, Sevilla, Spain, tification using electronic health records. September. CEUR-WS. AMIA Summits on Translational Science Proceedings, 2013:149. González, J.-A., L.-F. Hurtado, and F. Pla. 2018c. Elirf-upv en tass 2018: Análisis de López-Ubeda, P., M. C. Dı́az-Galiano, M. T. sentimientos en twitter basado en apren- Martı́n-Valdivia, and L. A. Urena-Lopez. dizaje profundo. In Proceedings of TASS 2018. Sinai en tass 2018 task 3. clasifi- 2018: Workshop on Semantic Analysis at cando acciones y conceptos con umls en SEPLN (TASS 2018). medline. In Proceedings of TASS 2018: 25 E. Martínez-Cámara, Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. Á. García-Cumbreras, M. García-Vega, Y. Gutiérrez et al. Workshop on Semantic Analysis at SE- Workshop Proceedings, Sevilla, Spain, PLN (TASS 2018). September. CEUR-WS. Luque, F. M. and J. M. Pérez. 2018. Montanés, R., R. Aznar, and R. del Hoyo. Atalaya at tass 2018: Sentiment 2018. Aplicación de un modelo hı́brido de analysis with tweet embeddings and aprendizaje profundo para el análisis de data augmentation. In E. Martı́nez- sentimiento en twitter. In E. Martı́nez- Cámara, Y. Almeida Cruz, M. C. Cámara, Y. Almeida Cruz, M. C. Dı́az-Galiano, S. Estévez Velarde, M. A. Dı́az-Galiano, S. Estévez Velarde, M. A. Garcı́a-Cumbreras, M. Garcı́a-Vega, Garcı́a-Cumbreras, M. Garcı́a-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román, A. Piad Morffis, and J. 
Villena-Román, editors, Proceedings of TASS 2018: Work- editors, Proceedings of TASS 2018: Work- shop on Semantic Analysis at SEPLN shop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, Workshop Proceedings, Sevilla, Spain, September. CEUR-WS. September. CEUR-WS. Martı́nez-Cámara, E., M. C. Dı́az-Galiano, Palatresi, J. V. and H. R. Hontoria. 2018. M. A. Garcı́a-Cumbreras, M. Garcı́a- Tass2018: Medical knowledge discovery Vega, and J. Villena-Román. 2017. by combining terminology extraction tech- Overview of TASS 2017. In E. Martı́nez- niques with machine learning classifica- Cámara, M. C. Dı́az-Galiano, M. A. tion. In Proceedings of TASS 2018: Work- Garcı́a-Cumbreras, M. Garcı́a-Vega, and shop on Semantic Analysis at SEPLN J. Villena-Román, editors, Proceedings of (TASS 2018). TASS 2017: Workshop on Semantic Anal- Plaza del Arco, F. M., E. Martı́nez-Cámara, ysis at SEPLN (TASS 2017), volume 1896 M. T. Martı́n Valdivia, and A. Ureña of CEUR Workshop Proceedings, Murcia, López. 2018. SINAI en TASS 2018: Spain, September. CEUR-WS. Inserción de conocimiento emocional ex- Medina, S. and J. Turmo. 2018. Joint clas- terno a un clasificador lineal de emociones. sification of key-phrases and relations in In Proceedings of TASS 2018: Workshop electronic health documents. In Proceed- on Semantic Analysis at SEPLN (TASS ings of TASS 2018: Workshop on Seman- 2018), volume 2172, September. tic Analysis at SEPLN (TASS 2018). Rodrı́guez Barroso, N., E. Martı́nez-Cámara, Moctezuma, D., J. Ortiz-Bejar, E. S. Tellez, and F. Herrera. 2018. SCI2 S at TASS S. Miranda-Jiménez, and M. Graff. 2018. 2018: Emotion classification with recur- Ingeotec solution for task 4 in tass’18 com- rent neural networks. In Proceedings petition. 
In Proceedings of TASS 2018: of TASS 2018: Workshop on Semantic Workshop on Semantic Analysis at SE- Analysis at SEPLN (TASS 2018), volume PLN (TASS 2018), volume 2172, Septem- 2172, September. ber. Scott, W. A. 1955. Reliability of content Moctezuma1, D., J. Ortiz-Bejar, E. S. Téllez, analysis: The case of nominal scale cod- S. Miranda-Jiménez, and M. Graff. ing. Public opinion quarterly, pages 321– 2018. Ingeotec solution for task 1 in 325. tass’18 competition. In E. Martı́nez- Suarez-Paniagua, V., I. Segura-Bedmar, and Cámara, Y. Almeida Cruz, M. C. P. Martı́nez. 2018. Labda at tass-2018 Dı́az-Galiano, S. Estévez Velarde, M. A. task 3: Convolutional neural networks for Garcı́a-Cumbreras, M. Garcı́a-Vega, relation classification in spanish ehealth Y. Gutiérrez Vázquez, A. Montejo Ráez, documents. In Proceedings of TASS 2018: A. Montoyo Guijarro, R. Muñoz Guillena, Workshop on Semantic Analysis at SE- A. Piad Morffis, and J. Villena-Román, PLN (TASS 2018). editors, Proceedings of TASS 2018: Work- shop on Semantic Analysis at SEPLN Villena-Román, J., J. Garcı́a-Morera, M. A. (TASS 2018), volume 2172 of CEUR Garcı́a-Cumbreras, E. Martı́nez-Cámara, 26 Overview of TASS 2018: Opinions, Health and Emotions M. T. Martı́n-Valdivia, and L. A. Ureña López. 2015. Overview of tass 2015. In TASS 2015: Workshop on Sentiment Analysis at SEPLN, pages 13–21. Zavala, R. M. R., P. Martı́nez, and I. Segura- Bedmar. 2018. A hybrid bi-lstm-crf model for knowledge recognition from ehealth documents. In Proceedings of TASS 2018: Workshop on Semantic Anal- ysis at SEPLN (TASS 2018). 27
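As an illustration of the evaluation measures used to rank the Task 4 systems (Section 2.4.3), the following minimal sketch computes macro-precision, macro-recall, macro-F1 and accuracy for the binary SAFE/UNSAFE setting. It is not the organisers' evaluation script: the function name `macro_scores` and the toy labels are hypothetical. The sketch assumes that macro-F1 is the harmonic mean of macro-precision and macro-recall, which is close to the figures reported in Table 18; averaging the per-class F1 values is another common definition and can give slightly different numbers.

```python
LABELS = ("SAFE", "UNSAFE")


def macro_scores(gold, predicted, labels=LABELS):
    """Return macro-precision, macro-recall, macro-F1 and accuracy.

    Assumption: macro-F1 is the harmonic mean of macro-precision and
    macro-recall (not the mean of the per-class F1 scores).
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")

    precisions, recalls = [], []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        # A class that is never predicted (or never occurs) scores 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)

    macro_p = sum(precisions) / len(labels)
    macro_r = sum(recalls) / len(labels)
    denominator = macro_p + macro_r
    macro_f1 = 2 * macro_p * macro_r / denominator if denominator else 0.0
    accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
    return {
        "macro_precision": macro_p,
        "macro_recall": macro_r,
        "macro_f1": macro_f1,
        "accuracy": accuracy,
    }


# Toy example with six headlines (hypothetical labels, not SANSE data).
gold = ["SAFE", "SAFE", "SAFE", "SAFE", "UNSAFE", "UNSAFE"]
predicted = ["SAFE", "SAFE", "SAFE", "SAFE", "SAFE", "UNSAFE"]
scores = macro_scores(gold, predicted)
print(scores)
```

In the toy example one UNSAFE headline is misclassified as SAFE, so accuracy is 5/6 while macro-recall drops to 0.75: the macro-average weights both classes equally regardless of how many headlines each contains, which is presumably why macro-F1 rather than accuracy was used for ranking.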