
1613-0073

Overview of TASS 2018: Opinions, Health and Emotions

Eugenio Martínez-Cámara

Yudivian Almeida-Cruz

Manuel Carlos Díaz-Galiano

Suilan Estévez-Velarde

Miguel Á. García-Cumbreras

Manuel García-Vega

Yoan Gutiérrez

Arturo Montejo-Ráez

Andrés Montoyo

Rafael Muñoz

Alejandro Piad-Morffis

Julio Villena-Román

0 Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), Universidad de Granada, España
1 Centro de Estudios Avanzados en Tecnologías de la Información y de la Comunicación (CEATIC), Universidad de Jaén
2 MeaningCloud
3 Universidad de Alicante, España
4 Universidad de La Habana, Cuba

2018

13 27

This is an overview of the Workshop on Semantic Analysis at the SEPLN congress held in Sevilla, Spain, in September 2018. This forum proposes to participants four different semantic tasks on texts written in Spanish. Task 1 focuses on polarity classification; Task 2 encourages the development of aspect-based polarity classification systems; Task 3 provides a scenario for discovering knowledge from eHealth documents; finally, Task 4 is about automatic classification of news articles according to safety. The latter two tasks are novel in this edition of TASS. We detail the approaches and the results of the systems submitted by the different groups in each task.

The Workshop on Semantic Analysis at the SEPLN [1] (in Spanish, Taller de Análisis Semántico en la SEPLN, TASS) is the evolution of the Workshop on Sentiment Analysis at the SEPLN, which has been held since 2012. The aim of the workshop is to further research in semantic tasks on texts written in Spanish. The 2018 edition proposed two new challenges (Tasks 3 and 4) and provided several linguistic resources.

[1] http://www.sepln.org/workshops/tass

The processing of health data is attracting the attention of the Natural Language Processing (NLP) research community (Denecke, 2015). In this line, Task 3 proposes modelling the human language in a scenario in which Spanish electronic health documents could be machine readable from a semantic point of view. Task 3 consists of detecting and classifying concepts in order to relate them semantically. Task 4 is related to the concept of brand safety, which is crucial for the reputation of a brand or of the company that owns it. Task 4 proposes classifying how safe a news item is for placing an advertisement of a brand next to it, according to the headline of that news item.

Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.

Tasks 3 and 4 provided specific datasets for the proposed challenges, which are described in Sections 2.3.1 and 2.4.1 respectively. Task 1 provided an extension of the InterTASS corpus, which was presented in the 2017 edition (Martínez-Cámara et al., 2017). The main novelty of the new version of InterTASS is the incorporation of tweets written in the varieties of Spanish spoken in Spain and in several countries of America. Since the difficulty of Task 2 is high, the organisation kept the same setting of the task as in previous editions.

The paper is organised as follows: Section 2 describes all the tasks proposed in the 2018 edition. The specific details of each task are in Sections 2.1, 2.2, 2.3 and 2.4 respectively. Section 3 presents the conclusions of the paper.

2 Spanish Semantic Analysis Tasks

As mentioned before, TASS is a relevant workshop for semantic analysis tasks, particularly for Spanish. In 2018, new resources and challenges were introduced to evolve Sentiment Analysis systems to a semantic level. In the last editions, several research groups from different countries, like Uruguay or Costa Rica, presented their systems, so it was necessary to make an effort to build adequate resources for their language varieties.

In addition, society and companies are interested in new specific challenges, and for this reason new tasks arise, while maintaining the main task (global polarity).

In this section, we describe the four tasks of the 2018 edition: Section 2.1 presents the details of Task 1; Section 2.2 describes the corpus and the systems submitted to Task 2; Section 2.3 focuses on Task 3; and Section 2.4 describes all the details of Task 4.

2.1 Task 1

This task focused on the evaluation of polarity classification systems at tweet level for tweets written in Spanish.

The submitted systems had to face, as usual, the lack of context due to the short length of tweets, which are written in informal language with misspellings, emojis and even onomatopoeias.

But this edition brought new challenges to this task:

Multilinguality: the training, development and test corpora contain tweets written in Spanish from Spain, Peru and Costa Rica.

Generalization: several corpora have been used. One of them is the development set, which follows a distribution similar to that of the training set. The second corpus is the test set of the General Corpus of TASS, which was compiled some years ago, so it may be lexically and semantically different from the training and development data. Furthermore, the systems will be evaluated with test sets of tweets written in the varieties of Spanish spoken in different American countries.

The General Corpus of TASS has been provided in the same way as in previous editions. Further details can be found in (Martínez-Cámara et al., 2017).

The International TASS Corpus (InterTASS), in contrast, is a corpus released in 2017 that has been updated for this edition with new texts. It is composed of tweets written in different varieties of Spanish (from Spain, Peru and Costa Rica), so it exhibits a large number of lexical and even structural differences between variants. The main purpose of compiling and using an inter-varietal corpus of Spanish for the evaluation tasks is to challenge participating systems to cope with the many faces of this language worldwide.

The datasets were annotated with 4 different polarity labels (positive, negative, neutral and none), and systems had to assign the orientation of the opinion expressed in each tweet to one of those 4 polarity levels.

The Spain variety part was released in 2017 and its description can be found in (Martínez-Cámara et al., 2017). Table 1 shows the tweet distribution for the training, development (dev.) and test corpora.

[Table 1. Tweet distribution of the InterTASS Spain corpus. Columns: Training, Dev.]

The Peru and Costa Rica varieties have been released for this edition. Their tweet distributions are shown in Tables 2 and 3 respectively.

[Tables 2 and 3. Tweet distribution of the InterTASS Peru and Costa Rica corpora. Columns: Training, Dev.]

Four subtasks were proposed, working with the datasets of the different countries:

Subtask-1: Monolingual ES. Training and test used the InterTASS ES datasets.

Subtask-2: Monolingual PE. Training and test used the InterTASS PE datasets.

Subtask-3: Monolingual CR. Training and test used the InterTASS CR datasets.

Subtask-4: Cross-lingual. Training could be done with any dataset, but a different one had to be used for the evaluation, in order to test the dependency of systems on a language variety.

Results were submitted in a plain text file with the following format: tweetid \t polarity
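As an illustration of this submission format, the following minimal sketch writes one tab-separated identifier and polarity pair per line. The short label codes (`P`, `N`, `NEU`, `NONE`), the function name `format_run` and the tweet identifiers are assumptions made for the example; they are not taken from the task guidelines.

```python
import io

# Hypothetical label codes: the task defines four polarity levels
# (positive, negative, neutral, none); the short codes are assumed here.
VALID_LABELS = {"P", "N", "NEU", "NONE"}


def format_run(predictions):
    """Return a run as text, one 'tweetid<TAB>polarity' line per tweet."""
    buffer = io.StringIO()
    for tweet_id, polarity in predictions:
        if polarity not in VALID_LABELS:
            raise ValueError(f"unknown polarity label: {polarity!r}")
        buffer.write(f"{tweet_id}\t{polarity}\n")
    return buffer.getvalue()


# Made-up identifiers, used only to show the layout of the file.
example_run = format_run([("1001", "P"), ("1002", "NONE")])
```

The returned string can be written to disk as is; validating the labels before writing avoids submitting a run that an evaluation script would reject.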

Accuracy and the macro-averaged versions of Precision, Recall and F1 were used as evaluation measures. Systems were ranked by the Macro-F1 and Accuracy measures.
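The measures above can be sketched as follows. This is a minimal illustration, not the official evaluation script: it assumes that Macro-F1 is the unweighted mean of the per-class F1 values, whereas an evaluation script could instead combine macro-precision and macro-recall. The function name `macro_scores` is hypothetical.

```python
def macro_scores(gold, predicted):
    """Accuracy and macro-averaged precision, recall and F1 for two label lists."""
    labels = sorted(set(gold) | set(predicted))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(g == p for g, p in zip(gold, predicted)) / len(gold),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        # Assumption: mean of per-class F1 values.
        "macro_f1": sum(f1s) / n,
    }


scores = macro_scores(["P", "P", "N", "NEU"], ["P", "N", "N", "NEU"])
```

Macro-averaging gives every class the same weight, so a system that ignores a minority class such as neutral is penalised more than it would be under accuracy alone.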

2.1.1 Analysis of the Results

Five systems were presented for Task 1. Most of them make use of deep learning algorithms, combining different ways of obtaining the word embeddings.

INGEOTEC. Moctezuma et al. (2018) present a polarity classification system based on the combination of different labelling systems. The main component is the EvoMSA system, based on genetic algorithms, which combines the outputs of the other systems. EvoMSA relies on the B4MSA system for the adjustment of the different parameters (how the text is normalised, how the tokens are computed and how the tokens are weighted) and on the EvoDAG program, which carries out the classification. As input systems, several systems based on lexicons of affectivity or aggressiveness are used. It also uses the word embedding algorithm FastText, trained on the Spanish Wikipedia. Vectors are generated for each document and an SVM is trained on them. Their approach performs better when it is trained with tweets from Spain and tested on other Spanish varieties.

RETUYT-InCo. Chiruzzo and Rosa (2018) submitted three approaches: an SVM using word embedding centroids and manually crafted features, a CNN using word embeddings as input, and a Long Short Term Memory (LSTM) network using word embeddings, trained with a focus on improving the recognition of neutral tweets. In all cases, embeddings improve the results, and the LSTM has the best behaviour for neutral tweets. The use of a mixed-balanced training method for the LSTM resulted in a significant improvement in the detection of neutral tweets.

ITAINNOVA. Montañés, Aznar, and del Hoyo (2018) analyse the use of convolutional neural network (CNN) models, LSTM, Bidirectional LSTM (BiLSTM) and a hybrid approach between CNN and LSTM. They choose the CNN-LSTM combination because it integrates the benefits of both models.

ELiRF-UPV. González, Hurtado, and Pla (2018b) explore different approaches based on Deep Learning. Specifically, they study the behaviour of CNN, Attention Bidirectional Long Short Term Memory (Att-BLSTM) and Deep Averaging Networks (DAN). In order to study the behaviour of the different models, they carry out an adjustment process. They get the best results in InterTASS-ES. However, linguistic variability affects the choice of architecture and its hyperparameters, so applying the same system to the InterTASS-CR and InterTASS-PE tasks without any adjustment did not yield results as competitive as in InterTASS-ES.

ATALAYA. Luque and Pérez (2018) presented a system that uses a weighting scheme to average the subword-aware embeddings obtained from preprocessed tweets, which have been enriched with data obtained from machine translation. This novel solution involves translating tweets into another language and back into the source language in order to augment them lexically and grammatically.

Tables 4, 5 and 6 show the results obtained in the monolingual subtasks (Spain, Costa Rica and Peru variants).

For the cross-lingual runs, the participants selected one InterTASS dataset to train their systems and a different one to test them, in order to test the dependency of systems on a language variety. Tables 7, 8 and 9 show the results obtained in these cross-lingual subtasks.

The overall results, in terms of F1, obtained with the monolingual and multilingual systems for the Spain and Costa Rica collections are quite comparable, but those for the Peru collection fall by around 10%.

[Table 5 (InterTASS-CR). Runs listed: retuyt-lstm-cr-2, retuyt-svm-cr-2, retuyt-svm-cr-1, elirf-cr-run-2, retuyt-cnn-cr-1, atalaya-cr-lr-50-2, ingeotec-run1, retuyt-lstm-cr-1, retuyt-cnn-cr-2, elirf-intertass-cr-run-1, atalaya-mlp-300-sentiment, atalaya-mlp-ubav3-50-3, ingeotec-run1, elirf-cr-run-1.]

[Table 6 (InterTASS-PE). Runs listed: retuyt-cnn-pe-1, atalaya-pe-lr-50-2, retuyt-lstm-pe-2, retuyt-svm-pe-2, ingeotec-run1, elirf-intertass-pe-run-2, atalaya-mlp-sentimentubav3-50-3, retuyt-svm-pe-1, elirf-intertass-pe-run-1, atalaya-mlp-300-sentiment, atalaya-mlp-50-sentiment, retuyt-svm-pe-2, retuyt-cnn-pe-2, retuyt-lstm-pe-1, elirf-intertass-pe-run-1. Column: M. F1.]

2.2 Task 2

Task 2, Aspect-based Sentiment Analysis, proposes the development of aspect-based polarity classification systems. Similar to previous editions (Martínez-Cámara et al., 2017), two datasets were used to evaluate the different approaches: Social-TV and STOMPOL. Both datasets were annotated with aspect-related metadata: the main category of the aspect, and the polarity of the opinion about the aspect. Systems had to classify the opinion about the given aspect into 3 different polarity labels (positive, negative, neutral).

Participants were expected to submit up to 3 experiments for each provided collection, each in a plain text file with tweet identification, aspect and polarity.

For evaluation, exact match with a single label combining "aspect-polarity" was used. Similarly to Task 1, the macro-averaged versions of Precision, Recall and F1, as well as Accuracy, were considered, and Macro-F1 was used for the final ranking of the proposed systems.

2.2.1 Collections

The Social-TV corpus was collected during the 2014 Final of the "Copa del Rey" championship in Spain. After filtering out useless information, a subset of 2,773 tweets was obtained. The details of the corpus are described in (Villena-Román et al., 2015; García-Cumbreras et al., 2016; Martínez-Cámara et al., 2017).

STOMPOL (corpus of Spanish Tweets for Opinion Mining at aspect level about POLitics) is a corpus for the task of Aspect Based Sentiment Analysis. The corpus is composed of 1,284 tweets manually annotated by two annotators, with a third one in case of disagreement. The details of the corpus are described in (Villena-Román et al., 2015; García-Cumbreras et al., 2016; Martínez-Cámara et al., 2017).

2.2.2 Results

Only the research group ELiRF (González, Hurtado, and Pla, 2018c) participated in this edition. They explored different approaches based on Deep Learning. Specifically, they studied the behaviour of CNN, Attention Bidirectional Long Short Term Memory (Att-BLSTM) and Deep Averaging Networks (DAN), similar to the proposal of the team for Task 1. In order to study the performance of the different models, they carried out an adjustment process. Tables 10 and 11 show the results obtained in their experiments.

2.3 Task 3

NLP methods are increasingly being used to mine knowledge from unstructured content in health (Liu et al., 2013; Doing-Harris and Zeng-Treitler, 2011; Gonzalez-Hernandez et al., 2017) and other domains (Estévez-Velarde et al., 2018). Over the years, many eHealth challenges have taken place, such as SemEval [2], CLEF [3] campaigns and others (Augenstein et al., 2017). These tasks have mainly dealt with identification, classification, extraction and linking of knowledge. Task 3, eHealth Knowledge Discovery (eHealth-KD), proposes modelling the human language in a scenario in which Spanish electronic health documents could be machine readable from a semantic point of view. This task is designed to encourage the development of software technologies to automatically extract a large variety of knowledge from eHealth documents written in the Spanish language.

In order to capture the semantics of a broad range of health-related text, eHealth-KD proposes the identification of two types of elements: Concepts and Actions. Concepts are key phrases that represent actors or entities which are relevant in a domain, while Actions represent how these Concepts interact with each other. Actions and Concepts can be linked by two types of relations: subject and target, which describe the main roles that a Concept can perform. Also, four specific semantic relations between Concepts are defined: is-a, part-of, property-of and same-as. Figure 1 provides an example.

To simplify and normalise the extraction process, the overall task is divided into three subtasks:

Subtask A is concerned with the extraction of the relevant key phrases.

Subtask B is concerned with the classification of the key phrases identified in Subtask A as either Concept or Action.

Subtask C is concerned with the discovery of the semantic relations between pairs of entities.

[2] International Workshop on Semantic Evaluation
[3] Conference and Labs of the Evaluation Forum

To compute the evaluation metrics for each subtask, we define the following sets for comparing the annotations of the expected output (gold standard) with those of the actual output in each subtask:

Correct matches (C): in all subtasks, when one gold and one given annotation exactly match.

Partial matches (P): in subtask A, when two key phrases have a non-empty intersection.

Missing matches (M): in subtasks A and C, when an annotation in the gold output is not provided by the system.

Spurious matches (S): in subtasks A and C, when an annotation given by the system does not appear in the gold output.

Incorrect matches (I): in subtask B, when one assigned label is incorrect.

To measure the individual subtasks results as well as overall results, the eHealth-KD challenge proposes three evaluation scenarios.

Scenario 1. The first scenario requires all subtasks (i.e. A, B and C) to be performed sequentially. The input in this scenario consists of plain text (100 sentences), and participants must submit the three output files corresponding to subtasks A, B and C. In this scenario the overall quality of the participant systems is evaluated. To this end, a combined micro F1 metric was defined, taking into account the results of the three subtasks:

\[ F1_{ABC} = \frac{2 \, P_{ABC} \, R_{ABC}}{P_{ABC} + R_{ABC}} \tag{1} \]

\[ P_{ABC} = \frac{T_{ABC} + \frac{1}{2} P_A}{T_{ABC} + P_A + S_A + I_B + S_C} \tag{2} \]

\[ R_{ABC} = \frac{T_{ABC} + \frac{1}{2} P_A}{T_{ABC} + P_A + M_A + I_B + M_C} \tag{3} \]

\[ T_{ABC} = C_A + C_B + C_C \tag{4} \]

Scenario 2. In the second scenario only subtasks B and C are performed. Hence, participants receive plain text inputs and the corresponding outputs for subtask A (a different subset of 100 sentences). This scenario allows participants to focus on the classification of key phrases, without being affected by errors related to the extraction of key phrases. Like Scenario 1, a combined micro F1 is defined which takes into account the results for subtasks B and C:

\[ F1_{BC} = \frac{2 \, P_{BC} \, R_{BC}}{P_{BC} + R_{BC}} \]

\[ P_{BC} = \frac{T_{BC}}{T_{BC} + I_B + S_C} \]

\[ R_{BC} = \frac{T_{BC}}{T_{BC} + I_B + M_C} \]

\[ T_{BC} = C_B + C_C \]

Scenario 3. Finally, the third scenario evaluates only subtask C. Participants are provided with plain text inputs and the corresponding output of subtasks A and B (a final subset of another 100 sentences). In this scenario, the following metric is defined for evaluation:

\[ F1_C = \frac{2 \, P_C \, R_C}{P_C + R_C} \]

\[ P_C = \frac{C_C}{C_C + S_C} \]

\[ R_C = \frac{C_C}{C_C + M_C} \]

For competition purposes, the best system is defined as the submission that maximises the macro-average F1 across all three scenarios:

\[ F1 = \frac{F1_{ABC} + F1_{BC} + F1_C}{3} \]
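The scenario metrics can be sketched in a few lines of code. This is a minimal illustration under stated assumptions rather than the official evaluation script: precision is taken to penalise spurious annotations and recall missing ones, following the usual convention (F1 is symmetric in both, so the final score does not depend on this choice), and all function names and example counts are hypothetical.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def _f1(precision, recall):
    return _ratio(2 * precision * recall, precision + recall)


def scenario1_f1(*, cA, cB, cC, pA, mA, sA, iB, mC, sC):
    """Combined micro F1 over subtasks A, B and C."""
    total = cA + cB + cC
    numerator = total + 0.5 * pA  # partial matches count half
    precision = _ratio(numerator, total + pA + sA + iB + sC)
    recall = _ratio(numerator, total + pA + mA + iB + mC)
    return _f1(precision, recall)


def scenario2_f1(*, cB, cC, iB, mC, sC):
    """Combined micro F1 over subtasks B and C."""
    total = cB + cC
    precision = _ratio(total, total + iB + sC)
    recall = _ratio(total, total + iB + mC)
    return _f1(precision, recall)


def scenario3_f1(*, cC, mC, sC):
    """F1 for subtask C alone."""
    return _f1(_ratio(cC, cC + sC), _ratio(cC, cC + mC))


def overall_f1(f1_abc, f1_bc, f1_c):
    """Macro-average of the three scenario scores."""
    return (f1_abc + f1_bc + f1_c) / 3


# Made-up counts, only to exercise the formulas.
s1 = scenario1_f1(cA=8, cB=8, cC=2, pA=2, mA=1, sA=1, iB=2, mC=3, sC=1)
s2 = scenario2_f1(cB=8, cC=2, iB=2, mC=3, sC=1)
s3 = scenario3_f1(cC=6, mC=2, sC=4)
final = overall_f1(s1, s2, s3)
```

Because the three scenario scores are averaged with equal weight, a system that is strong in subtasks A and B can reach a high overall score even when its Scenario 3 score is low.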

2.3.1 Corpora

For evaluation purposes, a corpus of health-related sentences in Spanish was manually built and tagged. The corpus consists of a selection of articles collected from the MedlinePlus website. These files contain several entries related to health and medicine topics, and environmental topics strongly related to health care. Spanish language items were converted to a plain text document, processed, and manually tagged using the Brat annotation tool by 15 human annotators divided into seven groups. The final 1,173 tagged sentences were organised in three collections: training, development and test. Table 12 summarises the main statistics of the corpus.

[Table 12. Corpus statistics. Columns: Train, Dev. Rows: Files; Sentences; Annotations; Entities (Concepts, Actions); Roles (subject, target); Relations (is-a, part-of, property-of, same-as).]

2.3.2 Analysis of the Results

The eHealth-KD challenge attracted a total of 31 registered teams, six of which successfully concluded their participation. Their results are summarised in Table 13. The following tags are designed to provide an overview of the main characteristics of each participant system:

S: Uses shallow supervised models such as CRF, logistic regression, SVM, decision trees, etc.

D: Uses deep learning models, such as LSTM or convolutional networks.

E: Uses word embeddings or other embedding models trained with external corpora.

K: Uses external knowledge bases, either explicitly or implicitly (i.e., through third-party tools).

R: Uses handcrafted rules based on domain expertise.

N: Uses natural language processing techniques or features, i.e., POS-tagging, dependency parsing, etc.

Baseline description: A baseline, trained on the training corpus, was defined. This strategy consists of a dummy approach based solely on the text of key phrases. This technique collects all training data and stores three maps: (1) key phrases associated with their most common class (either Concept or Action); (2) pairs of concepts associated with their most common relation; and (3) tuples of <Action,Concept> associated with their most common role. At prediction time, these maps are used to select a key phrase, decide its class, and predict relations and roles.
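The three maps of the baseline can be sketched as follows. This is a simplified illustration, not the organisers' implementation: the input format (plain tuples), the class name `DummyBaseline`, the lowercase substring matching used to find key phrases, and the example sentence are all assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def most_common_map(pairs):
    """Map each key to the label it was most often paired with."""
    counts = defaultdict(Counter)
    for key, label in pairs:
        counts[key][label] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}


class DummyBaseline:
    def fit(self, keyphrases, relations, roles):
        # (1) key phrase text -> most common class (Concept or Action)
        self.classes = most_common_map(keyphrases)
        # (2) (concept, concept) -> most common relation
        self.relations = most_common_map(
            ((a, b), label) for a, b, label in relations)
        # (3) (action, concept) -> most common role
        self.roles = most_common_map(
            ((action, concept), role) for action, concept, role in roles)
        return self

    def predict(self, sentence):
        # Assumption: a key phrase is "found" if its text occurs in the sentence.
        text = sentence.lower()
        found = [phrase for phrase in self.classes if phrase in text]
        classes = {phrase: self.classes[phrase] for phrase in found}
        links = set()
        for a in found:
            for b in found:
                if (a, b) in self.relations:
                    links.add((a, b, self.relations[(a, b)]))
                if (a, b) in self.roles:
                    links.add((a, b, self.roles[(a, b)]))
        return classes, links


# Invented training annotations for a single toy sentence.
model = DummyBaseline().fit(
    keyphrases=[("asma", "Concept"), ("afecta", "Action"),
                ("vías respiratorias", "Concept"), ("asma", "Concept")],
    relations=[("asma", "enfermedad", "is-a")],
    roles=[("afecta", "asma", "subject"),
           ("afecta", "vías respiratorias", "target")],
)
classes, links = model.predict("El asma afecta las vías respiratorias.")
```

Since the maps only memorise surface forms, such a baseline cannot label key phrases or pairs that it has not seen in training, which is why it mainly serves as a lower bound.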

Once the shared task ended, the official results were published. However, some participants noticed that their systems provided duplicated outputs on some occasions. These duplicated outputs, even if correct, were being counted as spurious after the first match. To account for this duplication, the evaluation script was modified to remove duplicated outputs from the participants' submissions prior to calculating the evaluation metrics. Table 14 shows this second version of the metrics, where variations in scores are highlighted in bold text. This proved not to be a significant problem, since only two participants were affected, and even though their metrics improved marginally, neither the overall results nor the main conclusions of the shared task changed.
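The effect of duplicated outputs on the counts can be illustrated with a small sketch. It assumes that annotations are represented as hashable tuples and that each gold annotation can be matched only once; the function names and the example spans are hypothetical and do not reproduce the actual evaluation script.

```python
def count_matches(gold, system):
    """Count correct and spurious outputs; each gold item matches only once."""
    remaining = list(gold)
    correct = spurious = 0
    for annotation in system:
        if annotation in remaining:
            remaining.remove(annotation)
            correct += 1
        else:
            spurious += 1
    return correct, spurious


def drop_duplicates(annotations):
    """Remove repeated annotations while keeping the original order."""
    seen = set()
    unique = []
    for annotation in annotations:
        if annotation not in seen:
            seen.add(annotation)
            unique.append(annotation)
    return unique


# Made-up key phrase spans: (text, start, end).
gold = [("asma", 3, 7), ("afecta", 8, 14)]
system = [("asma", 3, 7), ("asma", 3, 7), ("afecta", 8, 14)]

before = count_matches(gold, system)
after = count_matches(gold, drop_duplicates(system))
```

In this toy case the repeated output is counted as spurious before deduplication and disappears afterwards, which raises precision without changing the number of correct matches.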

The results of this task, eHealth-KD, show that a variety of approaches, on the whole, deal effectively with the health knowledge discovery problem. However, issues still need to be resolved to obtain highly competitive systems. The best performing submissions include classic supervised learning, deep learning and knowledge-based techniques. In subtask A, the best approach (UC3M) is based on a CRF model with pre-trained embeddings as features. This approach obtains similar scores in subtask B. In general, subtask B appears to be easier than the rest, which is understandable given that there are only two classes and there is a large correlation between word lemmas and their classes (as shown by the relatively high performance of the baseline).

Subtask C, in concordance with Scenario 3, does not exceed 45% in F-score. This reinforces the belief that this task is difficult to deal with, even after having applied novel approaches (i.e. TALP and LaBDA) based on convolutional neural networks.

[7] This extracts lexical and syntactic features for each token. Afterwards, it applies a set of handcrafted heuristics for each subtask.

The best-performing systems in each scenario largely coincide with the best systems in the three subtasks. For Scenario 1, the top performing strategy belongs to UC3M, which achieves the best scores in subtasks A and B, and the overall best result in the shared task (averaged across all three scenarios), closely followed by SINAI. Likewise, the best strategy in Scenario 3 corresponds to TALP, which achieves the best score for subtask C. However, for the overall results, other participants such as SINAI and UPF-UPC achieve higher average scores than TALP, even though their performance in subtask C and Scenario 3 is practically negligible. In contrast, these teams obtain relatively high scores in subtasks A and B.

The diverse nature and complexity of the three subtasks make it difficult to design a single fair evaluation metric. For this reason, we consider that each system submission gets more accurate results related to the specific sub-problems that it tackles. Although generalisation across the three subtasks is a desirable characteristic, advances in any particular subtask are also very valuable.

In general, the most competitive approaches in individual subtasks are dominated by state-of-the-art machine learning. In the particular case of subtask C, modern deep learning approaches seem to outperform classic techniques. In addition, incorporating domain-specific knowledge provides a significant boost to the performance. Most participants use NLP features, either explicitly, or implicitly captured in word embeddings and other representations. An interesting phenomenon is that the best systems in subtask A do not correlate with the best systems in subtask C. This suggests that the optimal approach for each subtask is different, giving rise to an interesting research line that would explore integrated approaches to solving the three subtasks simultaneously. The overall results show that general purpose knowledge discovery in domain-specific documents is potentially a prolific research area, particularly for the Spanish language. We expect similar future initiatives to provide fruitful evaluation scenarios where researchers can deploy techniques from several domains, and compete in friendly contests to improve the state of the art.

[Tables 13 and 14. Columns: Tags, Subtask A, Subtask B, Subtask C, Average, Scenario 1, Scenario 2, Scenario 3, Average. First row: UC3M†, tags SDEN.]

2.4 Task 4

When news is about natural disasters, readers usually feel negative emotions (sadness, for instance), whereas when the news is about the last championship won by their favourite team, readers feel positive emotions like happiness. Moreover, it is commonly assumed in marketing that the emotions aroused in the reader by news articles have an impact on the perception of the advertisements displayed along with those articles. Thus, from that marketing perspective, if a company wants to promote its brand, its ads should preferably be associated with (i.e., shown with) news that arouses positive emotions.

The objective of Task 4 is to encourage the development of systems that can classify a news article as safe (positive emotions, so safe for ads) or unsafe (negative emotions, so ads should be avoided). This task could be considered a kind of stance classification, on the positioning of readers towards news content. The task is a strong challenge because it has to deal with the polarity of feeling (safe vs. unsafe) and to work in combination with a (pseudo) thematic classification in order to determine the meaning of the news. For example, the reduction of traffic accidents carries a negative feeling because of the accidents, but the context of reducing the number of accidents turns that bad news into good news, hence safe news.

2.4.1 Corpora

The Spanish brANd Safe Emotion (SANSE) corpus was specifically built for this task. RSS feeds of different online newspapers written in different varieties of Spanish (Argentina, Chile, Colombia, Cuba, Spain, USA, Mexico, Peru and Venezuela) were collected for over a month. In the end, 15,152 articles were captured, each containing the URL, the publication date and the headline. News summaries were also collected for several sources, but they were finally discarded to make the dataset consistent and homogeneous.

Then 2,000 articles (L1 subset) were randomly selected and were manually annotated into an emotional categorisation of SAFE or UNSAFE, from the point of view of the general public of each corresponding country. The other 13,152 articles (L2 subset) were not annotated.

[Table 15. Rows: Training, Development, Test.]

[Table 16. Rows: Training (Spain), Dev. (Spain), Test (Mexico), Test (Cuba), Test (Chile), Test (Colombia), Test (Argentina), Test (Venezuela), Test (Peru), Test (USA).]

As the datasets were annotated with two levels of safety, SAFE and UNSAFE, the task can be considered a binary classification task.

The annotation was carried out by two human annotators (the two organisers of the task) and, for those cases with no agreement between the two annotators, a third annotator broke the tie. A safe headline was defined as an utterance that arouses a positive or neutral emotion in the reader and is not related to controversial topics: religion, extreme political positions, or other controversial topics (those that arouse strong positive emotions in some readers but strong negative emotions in others). An unsafe headline was defined as an utterance that arouses negative emotions in the reader.

Some examples in Spanish:

Así será el nuevo pan integral en España, según una nueva ley en marcha. → SAFE
(This is what the new wholemeal bread will be like in Spain, according to a new law underway.)

Casi 300 municipios de Colombia en riesgo electoral. → UNSAFE
(Almost 300 municipalities in Colombia at electoral risk.)

The agreement of the annotation was 0.58 according to both π (Scott, 1955) and κ (Cohen, 1960), which may be considered moderate according to Landis and Koch (1977). Although the agreement is moderate, it is close to being considered substantial, and it must also be taken into account that this is a new classification task that works with strongly subjective content. We will work on making the annotation guidelines more precise in order to improve the agreement of the annotators. Besides, we hope that the participants will give us insights with the aim of improving the annotation of the data.
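For reference, the two agreement coefficients mentioned above can be computed from two parallel lists of labels as in the following sketch. The function names and the toy annotations are hypothetical; the interpretation bands follow the scale of Landis and Koch (1977), in which values between 0.41 and 0.60 are moderate and values between 0.61 and 0.80 are substantial.

```python
from collections import Counter


def observed_agreement(a, b):
    """Proportion of items on which both annotators agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: chance agreement from each annotator's own distribution."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    expected = sum((count_a[k] / n) * (count_b[k] / n)
                   for k in set(a) | set(b))
    return (observed_agreement(a, b) - expected) / (1 - expected)


def scott_pi(a, b):
    """Scott's pi: chance agreement from the pooled label distribution."""
    n = len(a)
    pooled = Counter(a) + Counter(b)
    expected = sum((pooled[k] / (2 * n)) ** 2 for k in pooled)
    return (observed_agreement(a, b) - expected) / (1 - expected)


def landis_koch(value):
    """Verbal interpretation of an agreement coefficient."""
    if value < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                        (0.80, "substantial")]:
        if value <= upper:
            return name
    return "almost perfect"


# Toy annotations by two annotators (invented for this example).
ann1 = ["SAFE"] * 6 + ["UNSAFE"] * 4
ann2 = ["SAFE"] * 5 + ["UNSAFE"] * 5
kappa = cohen_kappa(ann1, ann2)
pi = scott_pi(ann1, ann2)
```

The two coefficients differ only in how chance agreement is estimated, so they tend to be close when both annotators use the labels with similar frequencies.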

The L1 subset was then divided into three subsets: training, development and test. The statistics of the three subsets are in Table 15.

2.4.2 Tasks

Two subtasks were proposed. Subtask 1 (S1) consists of the classification of headlines into safe or unsafe for incorporating an ad of a brand. The evaluation of the systems does not take into account the cultural varieties of the Spanish language; it is thus a monolingual evaluation. In this task, the datasets are composed of news headlines written in different varieties of Spanish, but the country of origin of the text is not relevant.

Participants were provided with the training and development subsets of L1 SANSE corpus for building the systems, and two test sets for the evaluation: the test subset of L1 SANSE corpus and the L2 SANSE corpus.

The systems presented were evaluated using the measures of Macro-Precision (M. P.), Macro-Recall (M. R.), Macro-F1 (M. F1) and Accuracy (Acc.).

Subtask 2 (S2) is similar to S1, but in this case the aim is to evaluate the generalisation capacity of the submitted systems. For training their systems, participants were provided with SANSE subsets with headlines written only in the variety of Spanish spoken in Spain. The test set was composed of headlines written in the varieties of Spanish spoken in different countries of America. The statistics of the SANSE corpus for S2 are shown in Table 16.

Task 4 attracted the attention of seven teams, and most of them participated in both evaluation levels of S1 and in S2. Table 17 shows the participation of the teams in each subtask. Five of the seven groups submitted a system description paper, whose main features are detailed below.

INGEOTEC. Moctezuma et al. (2018) propose an ensemble classification system (EvoMSA), which is composed of several heterogeneous base systems and a genetic programming system (EvoDAG; Graff et al., 2017) that optimises the contribution of each base system to the final classification. The authors combined supervised and unsupervised systems as base classification systems. The supervised ones are based on SVM with different representations of the input text, namely TF-IDF and pre-trained word vectors. The system reached the best results in the monolingual and the multilingual evaluations, although its performance dropped slightly in S1 L2. Since the annotation of the S1 L2 test set was produced by a voting scheme over all the submitted systems, the lower performance in S1 L2 may be caused by a different error distribution between INGEOTEC and the systems submitted by the other groups.

ELiRF UPV. González, Hurtado, and Pla (2018a) propose a deep neural network, specifically the Deep Averaging Network (DAN) model (Iyyer et al., 2015). The authors used a set of pre-trained word embeddings for representing the news headlines. The set of pre-trained word embeddings was prepared by the authors and built upon a corpus of tweets (Hurtado, Pla, and González, 2017). The high performance reached on news headlines by a set of pre-trained word embeddings built upon tweets stands out, because news headlines and tweets belong to different genres. However, it may mean that the use of language in tweets and news headlines is similar.
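The voting-based annotation of the S1 L2 test set mentioned in the INGEOTEC description can be illustrated with a simple majority vote. The text does not specify the exact voting scheme, so the plain majority rule, the tie-breaking behaviour, the function name and the system names below are assumptions made for this sketch.

```python
from collections import Counter


def majority_vote(system_predictions):
    """Combine per-system label lists (same headline order) by majority.

    Ties are resolved in favour of the label that appears first among the
    tied ones, which is an arbitrary choice made for this sketch.
    """
    columns = zip(*system_predictions.values())
    return [Counter(labels).most_common(1)[0][0] for labels in columns]


# Hypothetical predictions of three systems for three headlines.
silver_labels = majority_vote({
    "system_a": ["SAFE", "UNSAFE", "SAFE"],
    "system_b": ["SAFE", "SAFE", "UNSAFE"],
    "system_c": ["UNSAFE", "UNSAFE", "UNSAFE"],
})
```

Labels obtained this way reward agreement with the other systems, so a system whose errors differ from those of the majority can score lower on such a test set even if it is more accurate against human annotations.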

[Table 17. Participation of the teams in each subtask. Teams: INGEOTEC†, ELiRF-UPV†, rbnUGR†, MeaningCloud†, SINAI†, lone wolf, TNT-UA-WFU. S1 L1: X for all seven teams. S1 L2: X for all seven teams. S2: four X marks.]

rbnUGR. Rodríguez Barroso, Martínez-Cámara, and Herrera (2018) submitted three systems grounded in deep learning. Although the three systems are based on a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN), they have several differences:

Run 1. It uses an LSTM layer as encoding layer, and its output is the last state vector of the LSTM layer.

Run 2. It uses a BiLSTM⁸ layer as encoding layer, and its output is the concatenation of the last state vectors of the two LSTM layers.

Run 3. It uses an LSTM layer as encoding layer, and its output is the concatenation of the output state vectors of all the input tokens.

The results show that the systems based on a single LSTM layer perform better than the one based on a BiLSTM. The different results in S1 and S2 indicate that using the entire output of the encoding layer improves the generalisation capacity of the model, since the multilingual evaluation requires a higher generalisation capacity.
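The three runs differ only in which encoder states are handed to the classifier, and that difference can be sketched without a deep learning library. In the following illustration a toy element-wise tanh recurrence stands in for the LSTM (an assumption made to keep the example self-contained, not the authors' model); the function names `encode`, `run1_last_state`, `run2_bidirectional` and `run3_all_states` are hypothetical.

```python
import math


def encode(seq, reverse=False):
    """Toy recurrent encoder standing in for an LSTM: one state per token."""
    order = list(reversed(seq)) if reverse else list(seq)
    state = [0.0] * len(seq[0])
    states = []
    for x in order:
        # Each new state mixes the current input with the previous state.
        state = [math.tanh(xi + 0.5 * si) for xi, si in zip(x, state)]
        states.append(state)
    return states


def run1_last_state(seq):
    """Run 1: keep only the last state of a left-to-right encoder."""
    return encode(seq)[-1]


def run2_bidirectional(seq):
    """Run 2: concatenate the last states of the forward and backward passes."""
    return encode(seq)[-1] + encode(seq, reverse=True)[-1]


def run3_all_states(seq):
    """Run 3: concatenate the state produced at every input token."""
    return [value for state in encode(seq) for value in state]


# A headline of three tokens, each represented by a 2-dimensional vector.
headline = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
r1 = run1_last_state(headline)
r2 = run2_bidirectional(headline)
r3 = run3_all_states(headline)
```

With three tokens and two dimensions, Run 1 yields 2 values, Run 2 yields 4 and Run 3 yields 6, so Run 3 exposes the classifier to information from every position of the headline rather than only from its end.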

MeaningCloud. Herrera-Planells and Villena-Román (2018) propose three supervised systems: two linear classification systems and one non-linear classification system. The linear classification systems use XGBoost (Chen and Guestrin, 2016) as classifier, and they differ in the set of features used to represent the news headlines, which are mainly built using the public APIs of the MeaningCloud text analytics platform. The non-linear classification system is a neural network based on a CNN layer. The proposal that reached the highest results was the one grounded in a CNN (Run 3).

SINAI. Plaza del Arco et al. (2018) represent the news headlines as a vector of unigrams weighted with TF-IDF, together with the number of positive and negative words according to three lists of opinion-bearing words. The authors used SVM as classification algorithm.
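A feature vector of this kind can be sketched as follows. The smoothed IDF formula, the function names `idf_table` and `featurize`, the toy headlines and the two small word sets are assumptions for illustration; the original system used three opinion lexicons whose contents are not reproduced here, and the resulting vector would then be passed to an SVM.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Smoothed inverse document frequency for every unigram in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def featurize(doc, idf, vocab, positive, negative):
    """TF-IDF weights over the vocabulary plus two lexicon counts."""
    term_freq = Counter(doc)
    weights = [term_freq[tok] * idf[tok] for tok in vocab]
    n_pos = sum(1 for tok in doc if tok in positive)
    n_neg = sum(1 for tok in doc if tok in negative)
    return weights + [n_pos, n_neg]


# Toy tokenised headlines and hypothetical lexicon entries.
corpus = [
    ["gran", "victoria", "del", "equipo"],
    ["grave", "accidente", "en", "la", "autopista"],
]
positive = {"gran", "victoria"}
negative = {"grave", "accidente"}

idf = idf_table(corpus)
vocab = sorted(idf)
features = featurize(corpus[0], idf, vocab, positive, negative)
```

The last two positions of `features` hold the lexicon counts, so the classifier receives both the lexical content of the headline and a coarse signal of its emotional charge.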

The evaluation measures in the two Subtasks were accuracy and the macro-averages of precision, recall and F1, and the systems were ranked according to the value of macro-F1. Table 18 shows the results reached in S1 L1, S1 L2 and S2 by each group that submitted the description of their systems.

⁸ A BiLSTM is an elaboration of two LSTM layers.

Table 18 (only one value per system survives; the column headers are not recoverable):

System               Value
INGEOTEC run1        0.794
ELiRF UPV run2       0.787
ELiRF UPV run1       0.795
rbnUGR run1          0.784
MEANINGCLOUD run3    0.767
rbnUGR run3          0.763
rbnUGR run2          0.774
SINAI                0.733
MEANINGCLOUD run2    0.723
MEANINGCLOUD run1    -
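The evaluation measures can be made concrete with a short sketch. It assumes that macro-F1 is the unweighted mean of the per-class F1 values, which the text does not state explicitly (a common alternative derives it from macro-precision and macro-recall); the function name `macro_scores` and the labels SAFE and UNSAFE are hypothetical.

```python
def macro_scores(gold, pred):
    """Accuracy and macro-averaged precision, recall and F1."""
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n_labels = len(labels)
    return {
        "accuracy": sum(g == p for g, p in zip(gold, pred)) / len(gold),
        "macro_precision": sum(precisions) / n_labels,
        "macro_recall": sum(recalls) / n_labels,
        "macro_f1": sum(f1s) / n_labels,  # assumed: mean of per-class F1
    }


gold = ["SAFE", "SAFE", "UNSAFE", "UNSAFE"]
pred = ["SAFE", "UNSAFE", "UNSAFE", "UNSAFE"]
scores = macro_scores(gold, pred)
```

Macro-averaging gives every class the same weight regardless of its frequency, so a system cannot rank well by favouring the majority class alone.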

3 Conclusions

TASS 2018 has attracted the participation of 16 systems and the submission of 15 system description papers. Moreover, we have proposed two new challenges to the international research community, which are in line with the requirements of industry.

The submitted systems are in line with the state of the art in other similar workshops, and most of them are grounded in Deep Learning and the use of hand-crafted linguistic features. Therefore, TASS may be considered a reference forum for setting the state of the art in semantic analysis in Spanish.

As future work, we plan to enlarge the coverage of the Spanish language in the InterTASS corpus and to consolidate the new challenges (Task 3 and Task 4). Moreover, we will keep working on the development of new corpora and linguistic resources for the research community.

Acknowledgments

This work has been partially supported by a grant from the Fondo Europeo de Desarrollo Regional (FEDER), the projects REDES (TIN2015-65136-C2-1-R, TIN2015-65136-C2-2-R) and SMART-DASCI (TIN2017

Workshop on Semantic Analysis at SEPLN (TASS 2018).

Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Augenstein, I., M. Das, S. Riedel, L. Vikraman, and A. McCallum. 2017. SemEval 2017 Task 10: ScienceIE - extracting keyphrases and relations from scientific publications. arXiv preprint arXiv:1704.02853.

Chen, T. and C. Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA. ACM.

Chiruzzo, L. and A. Rosa. 2018. RETUYT-InCo at TASS 2018: Sentiment analysis in Spanish variants using neural networks and SVM. In E. Martínez-Cámara, Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Denecke, K. 2015. Health Web Science. Springer International Publishing.

Doing-Harris, K. M. and Q. Zeng-Treitler. 2011. Computer-assisted update of a consumer health vocabulary through mining of social network data. Journal of Medical Internet Research, 13(2).

Estévez-Velarde, S., Y. Gutiérrez, A. Montoyo, A. Piad-Morffis, R. Muñoz, and Y. Almeida-Cruz. 2018. Gathering object interactions as semantic knowledge. In Proceedings of the 2018 International Conference on Artificial Intelligence (ICAI'18).

García-Cumbreras, M. A., J. Villena-Román, E. Martínez-Cámara, M. C. Díaz-Galiano, M. T. Martín-Valdivia, and L. A. Ureña López. 2016. Overview of TASS 2016. In TASS 2016: Workshop on Sentiment Analysis at SEPLN, pages 13–21.

González, J.-A., L.-F. Hurtado, and F. Pla. 2018a. ELiRF-UPV en TASS 2018: Categorización emocional de noticias. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172, September.

González, J.-A., L.-F. Hurtado, and F. Pla. 2018b. ELiRF-UPV en TASS 2018: Análisis de sentimientos en Twitter basado en aprendizaje profundo. In E. Martínez-Cámara, Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

González, J.-A., L.-F. Hurtado, and F. Pla. 2018c. ELiRF-UPV en TASS 2018: Análisis de sentimientos en Twitter basado en aprendizaje profundo. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018).

Gonzalez-Hernandez, G., A. Sarker, K. O'Connor, and G. Savova. 2017. Capturing the patient's perspective: a review of advances in natural language processing of health-related text. Yearbook of Medical Informatics, 26(01):214–227.

Graff, M., E. S. Tellez, H. Jair Escalante, and S. Miranda-Jiménez. 2017. Semantic Genetic Programming for Sentiment Analysis, pages 43–65. Springer International Publishing, Cham.

Herrera-Planells, J. and J. Villena-Román. 2018. MeaningCloud at TASS 2018: News headlines categorization for brand safety assessment. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172, September.

Hurtado, L.-F., F. Pla, and J.-A. González. 2017. ELiRF-UPV en TASS 2017: Análisis de sentimientos en Twitter basado en aprendizaje profundo. In Proceedings of TASS 2017: Workshop on Sentiment Analysis at SEPLN co-located with 33rd SEPLN Conference (SEPLN 2017).

Iyyer, M., V. Manjunatha, J. Boyd-Graber, and H. Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691. Association for Computational Linguistics.

Landis, J. R. and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, pages 159–174.

Liu, H., S. J. Bielinski, S. Sohn, S. Murphy, K. B. Wagholikar, S. R. Jonnalagadda, K. Ravikumar, S. T. Wu, I. J. Kullo, and C. G. Chute. 2013. An information extraction framework for cohort identification using electronic health records. AMIA Summits on Translational Science Proceedings, 2013:149.

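López-Úbeda, P., M. C. Díaz-Galiano, M. T. Martín-Valdivia, and L. A. Ureña-López. 2018. SINAI en TASS 2018 Task 3. Clasificando acciones y conceptos con UMLS en MedLine. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018).

Luque, F. M. and J. M. Pérez. 2018. Atalaya at TASS 2018: Sentiment analysis with tweet embeddings and data augmentation. In E. Martínez-Cámara, Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.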

Martínez-Cámara, E., M. C. Díaz-Galiano, M. A. García-Cumbreras, M. García-Vega, and J. Villena-Román. 2017. Overview of TASS 2017. In E. Martínez-Cámara, M. C. Díaz-Galiano, M. A. García-Cumbreras, M. García-Vega, and J. Villena-Román, editors, Proceedings of TASS 2017: Workshop on Semantic Analysis at SEPLN (TASS 2017), volume 1896 of CEUR Workshop Proceedings, Murcia, Spain, September. CEUR-WS.

Medina, S. and J. Turmo. 2018. Joint classification of key-phrases and relations in electronic health documents. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018).

Moctezuma, D., J. Ortiz-Bejar, E. S. Tellez, S. Miranda-Jiménez, and M. Graff. 2018. INGEOTEC solution for Task 4 in TASS'18 competition. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172, September.

Moctezuma, D., J. Ortiz-Bejar, E. S. Tellez, S. Miranda-Jiménez, and M. Graff. 2018. INGEOTEC solution for Task 1 in TASS'18 competition. In E. Martínez-Cámara, Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Montañés, R., R. Aznar, and R. del Hoyo. 2018. Aplicación de un modelo híbrido de aprendizaje profundo para el análisis de sentimiento en Twitter. In E. Martínez-Cámara, Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.

Palatresi, J. V. and H. R. Hontoria. 2018. TASS2018: Medical knowledge discovery by combining terminology extraction techniques with machine learning classification. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018).

Plaza del Arco, F. M., E. Martínez-Cámara, M. T. Martín Valdivia, and L. A. Ureña López. 2018. SINAI en TASS 2018: Inserción de conocimiento emocional externo a un clasificador lineal de emociones. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172, September.

Rodríguez Barroso, N., E. Martínez-Cámara, and F. Herrera. 2018. SCI2S at TASS 2018: Emotion classification with recurrent neural networks. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172, September.

Scott, W. A. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, pages 321–325.

Suárez-Paniagua, V., I. Segura-Bedmar, and P. Martínez. 2018. LABDA at TASS-2018 Task 3: Convolutional neural networks for relation classification in Spanish eHealth documents. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018).

Villena-Román, J., J. García-Morera, M. A. García-Cumbreras, E. Martínez-Cámara, M. T. Martín-Valdivia, and L. A. Ureña López. 2015. Overview of TASS 2015. In TASS 2015: Workshop on Sentiment Analysis at SEPLN, pages 13–21.

Zavala, R. M. R., P. Martínez, and I. Segura-Bedmar. 2018. A hybrid Bi-LSTM-CRF model for knowledge recognition from eHealth documents. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018).