TASS 2018: Workshop on Semantic Analysis at SEPLN, septiembre 2018, págs. 13-27


               Overview of TASS 2018: Opinions, Health and
                               Emotions
              Resumen de TASS 2018: Opiniones, Salud y Emociones
  Eugenio Martínez-Cámara1, Yudivián Almeida-Cruz2, Manuel Carlos Díaz-Galiano3,
  Suilan Estévez-Velarde2, Miguel Á. García-Cumbreras3, Manuel García-Vega3,
  Yoan Gutiérrez4, Arturo Montejo-Ráez3, Andrés Montoyo4, Rafael Muñoz4,
  Alejandro Piad-Morffis2, Julio Villena-Román5
  1 Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), Universidad de Granada, España
  2 Universidad de La Habana, Cuba
  3 Centro de Estudios Avanzados en Tecnologías de la Información y de la Comunicación (CEATIC), Universidad de Jaén
  4 Universidad de Alicante, España
  5 MeaningCloud

               Abstract: This is an overview of the Workshop on Semantic Analysis at the SEPLN
               congress held in Sevilla, Spain, in September 2018. This forum proposed four
               semantic tasks on texts written in Spanish. Task 1 focuses on polarity
               classification; Task 2 encourages the development of aspect-based polarity
               classification systems; Task 3 provides a scenario for discovering knowledge
               from eHealth documents; finally, Task 4 is about the automatic classification of
               news articles according to safety. The latter two tasks are new in this edition
               of TASS. We detail the approaches and the results of the systems submitted by the
               different groups in each task.
               Keywords: Sentiment Analysis, Opinion Mining, Affective Computing, eHealth,
               Social Media
               Resumen: Este artículo ofrece un resumen sobre el Taller de Análisis Semántico
               en la SEPLN (TASS) celebrado en Sevilla, España, en septiembre de 2018. Este
               foro propone a los participantes cuatro tareas diferentes de análisis semántico
               sobre textos en español. La Tarea 1 se centra en la clasificación de la polaridad;
               la Tarea 2 anima al desarrollo de sistemas de polaridad orientados a aspectos; la
               Tarea 3 consiste en descubrir conocimiento en documentos sobre salud; finalmente,
               la Tarea 4 propone la clasificación automática de noticias periodísticas según un
               nivel de seguridad. Las dos últimas tareas son nuevas en esta edición. Se ofrece
               una síntesis de los sistemas y los resultados aportados por los distintos equipos
               participantes, así como una discusión sobre los mismos.
               Palabras clave: Análisis de Sentimientos, Minería de Opiniones, Informática
               Afectiva, e-Salud, Medios Sociales


1 Introduction

The Workshop on Semantic Analysis at the SEPLN1 (in Spanish, Taller de Análisis Semántico
en la SEPLN, TASS) is the evolution of the Workshop on Sentiment Analysis at the SEPLN,
which has been held since 2012. The aim of the workshop is to further research on semantic
tasks on texts written in Spanish. The 2018 edition proposed two new challenges (Tasks 3
and 4) and provided several linguistic resources.

    1 http://www.sepln.org/workshops/tass

    The processing of health data is attracting the attention of the Natural Language
Processing (NLP) research community (Denecke,
    ISSN 1613-0073                         Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes.
E. Martínez-Cámara, Y. Almeida-Cruz, M. C. Díaz-Galiano, S. Estévez-Velarde, M. Á. García-Cumbreras, M. García-Vega, Y. Gutiérrez et al.


2015). In this line, Task 3 proposes modelling human language in a scenario in which
Spanish electronic health documents could be machine readable from a semantic point of
view. Task 3 consists of detecting and classifying concepts and relating them
semantically. Task 4 is related to the concept of brand safety, which is crucial for the
reputation of a brand or of the company that owns it. Task 4 proposes classifying how safe
a news article is for placing a brand's advertisement, based on the headline of the
article.
    Tasks 3 and 4 provided specific datasets for the proposed challenges, which are
described in Sections 2.3.1 and 2.4.1 respectively. Task 1 provided an extension of the
InterTASS corpus, which was presented in the 2017 edition (Martínez-Cámara et al., 2017).
The main novelty of the new version of InterTASS is the incorporation of tweets written in
the Spanish spoken in Spain and in several countries of America. Since the difficulty of
Task 2 is high, the organisation kept the same setting of the task as in previous
editions.
    The paper is organised as follows: Section 2 describes all the tasks proposed in the
2018 edition. The specific details of each task are in Sections 2.1, 2.2, 2.3 and 2.4
respectively. Section 3 presents the conclusions of the paper.

2 Spanish Semantic Analysis Tasks

As mentioned before, TASS is a relevant workshop for semantic analysis tasks, particularly
for Spanish. In 2018, new resources and challenges were introduced to evolve Sentiment
Analysis systems to a semantic level. In the last editions, several research groups from
different countries, like Uruguay or Costa Rica, presented their systems, and it became
necessary to build adequate resources for their language varieties.
    In addition, society and companies are interested in new specific challenges, and for
this reason new tasks arise, while the main task (global polarity) is maintained.
    In this Section, we describe the four tasks of the 2018 edition: Section 2.1 presents
the details of Task 1; Section 2.2 describes the corpus and the systems submitted to
Task 2; Section 2.3 focuses on Task 3; and Section 2.4 describes all the details of Task 4.

2.1 Task 1

This task focused on the evaluation of tweet-level polarity classification systems for
tweets written in Spanish.
    The submitted systems had to face, as usual, the lack of context due to the short
length of tweets, which are written in informal language with misspellings, emojis and
even onomatopoeias. But this edition brought new challenges to this task:

    • Multilinguality: the training, test and development corpora contain tweets written
      in Spanish from Spain, Peru and Costa Rica.

    • Generalization: Several corpora have been used. One of them is the development set,
      so it follows a similar distribution. The second corpus is the test set of the
      General Corpus of TASS, which was compiled some years ago, so it may be lexically
      and semantically different from the training and development data. Furthermore, the
      systems are evaluated with test sets of tweets written in the Spanish spoken in
      different American countries.

    The General Corpus of TASS has been provided in the same way as in previous editions.
Further details can be found in Martínez-Cámara et al. (2017).
    The International TASS Corpus (InterTASS) is a corpus released in 2017 that has been
updated for this edition with new texts. It is composed of tweets written in different
varieties of Spanish (from Spain, Peru and Costa Rica), so it exhibits a large amount of
lexical and even structural differences across variants. The main purpose of compiling and
using an inter-varietal corpus of Spanish for the evaluation tasks is to challenge
participating systems to cope with the many faces of this language worldwide.
    Datasets were annotated with 4 different polarity labels (positive, negative, neutral
and none), and systems had to identify the orientation of the opinion expressed in each
tweet as one of those 4 polarity levels.
    The Spain variety part was released in 2017 and its description can be found in
Martínez-Cámara et al. (2017). Table 1 shows the tweet distribution for the training,
development (dev.) and test corpora.
                            Overview of TASS 2018: Opinions, Health and Emotions


             Training        Dev.          Test
P                  317         156           642
NEU                133          69           216
N                  416         219           767
NONE               138          62           274
Total            1,008         506         1,899

Table 1: Tweet distribution in InterTASS-ES

   The Peru and Costa Rica varieties have been released for this edition. Their tweet
distributions are shown in Tables 2 and 3 respectively.

             Training        Dev.          Test
P                  231          95           430
NEU                166          61           367
N                  242         106           472
NONE               361         238           159
Total            1,000         500         1,428

Table 2: Tweet distribution in InterTASS-PE

             Training        Dev.          Test
P                  230          93           354
NEU                 94          39           164
N                  311         110           491
NONE               165          58           224
Total              800         300         1,233

Table 3: Tweet distribution in InterTASS-CR

   Four sub-tasks were proposed, working with the datasets of the different countries:
   Subtask-1: Monolingual ES. Training and test were the InterTASS ES datasets.
   Subtask-2: Monolingual PE. Training and test were the InterTASS PE datasets.
   Subtask-3: Monolingual CR. Training and test were the InterTASS CR datasets.
   Subtask-4: Cross-lingual. Training could be done with any dataset, but a different one
had to be used for the evaluation, in order to test the dependency of systems on a
language variety.
   Results were submitted in a plain text file with the following format:

tweet id \t polarity

   Accuracy and the macro-averaged versions of Precision, Recall and F1 were used as
evaluation measures. Systems were ranked by the Macro-F1 and Accuracy measures.

2.1.1 Analysis of the Results

For Task 1, five systems were presented. Most of them make use of deep learning
algorithms, combining different ways of obtaining the word embeddings.

INGEOTEC. Moctezuma et al. (2018) present a polarity classification system based on the
combination of different labelling systems. The main component is the EvoMSA system,
based on genetic algorithms, which combines the outputs of the other systems. EvoMSA
relies on the B4MSA system for the adjustment of the different parameters (how the text is
normalised, how the tokens are calculated and how the tokens are weighted) and on the
EvoDAG program, which carries out the classification. As for the input systems, several
systems based on lexicons of affectivity or aggressiveness are used. It also uses the
FastText word embedding algorithm, trained on the Spanish Wikipedia. Vectors are generated
for each document and an SVM is used for training. Their approach performs better when it
is trained with tweets from Spain and tested with other Spanish varieties.

RETUYT-InCo. Chiruzzo and Rosá (2018) submitted three approaches: an SVM using word
embedding centroids and manually crafted features, a CNN using word embeddings as input,
and a Long Short Term Memory (LSTM) network using word embeddings, trained with a focus on
improving the recognition of neutral tweets. In all cases, embeddings improve the results,
and the LSTM has the best behaviour for neutral tweets. The use of a mixed-balanced
training method for the LSTM resulted in a significant improvement in the detection of
neutral tweets.

ITAINNOVA. Montanés, Aznar, and del Hoyo (2018) analyse the use of convolutional network
models (CNN), LSTM, Bidirectional LSTM (BiLSTM) and a hybrid approach between CNN and
LSTM. They choose the CNN-LSTM combination because it integrates the benefits of both
models.
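As an illustration of the submission format and the ranking measures described above for Task 1, the following is a minimal sketch and not the official TASS scorer. The function names, the label strings (taken from the row headers of Tables 1 to 3) and the choice of Macro-F1 as the mean of per-class F1 scores are assumptions; the official evaluation may instead derive Macro-F1 from macro-averaged Precision and Recall.

```python
from collections import Counter

# Assumed label strings, borrowed from the row headers of Tables 1-3.
LABELS = ("P", "NEU", "N", "NONE")


def parse_submission(text):
    """Parse lines of the form 'tweet_id<TAB>polarity' into a dict."""
    predictions = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        tweet_id, polarity = line.strip().split("\t", 1)
        predictions[tweet_id] = polarity
    return predictions


def evaluate(gold, predicted, labels=LABELS):
    """Return (accuracy, macro_f1) for dicts mapping tweet id -> label.

    Macro-F1 is computed here as the unweighted mean of per-class F1
    (an assumption; other definitions exist).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for tweet_id, gold_label in gold.items():
        pred_label = predicted.get(tweet_id)
        if pred_label == gold_label:
            tp[gold_label] += 1
        else:
            fn[gold_label] += 1
            if pred_label is not None:
                fp[pred_label] += 1

    f1_scores = []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1_scores.append(2 * precision * recall / pr_sum if pr_sum else 0.0)

    accuracy = sum(tp.values()) / len(gold) if gold else 0.0
    macro_f1 = sum(f1_scores) / len(labels)
    return accuracy, macro_f1
```

Because every class contributes equally to the macro average, a system that ignores a minority class such as NEU is penalised more heavily under Macro-F1 than under Accuracy.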


 ELiRF-UPV. González, Hurtado, and Pla (2018b) explore different approaches based on Deep
 Learning. Specifically, they study the behaviour of CNN, Attention Bidirectional Long
 Short Term Memory (Att-BLSTM) and Deep Averaging Networks (DAN). In order to study the
 behaviour of the different models, they carry out an adjustment process. They get the
 best results in InterTASS-ES. However, linguistic variability affects the choice of
 architecture and its hyperparameters, so applying the same system to the InterTASS-CR and
 InterTASS-PE tasks without any adjustment did not yield results as competitive as in
 InterTASS-ES.

 ATALAYA. Luque and Pérez (2018) presented a system that uses a weighted scheme to average
 the subword-aware embeddings obtained from preprocessed tweets that have been enriched
 with data obtained from machine translation. This novel solution involves translating
 tweets into another language and back into the source language, in order to augment them
 lexically and grammatically.

    Tables 4, 5 and 6 show the results obtained in the monolingual subtasks (Spain, Costa
 Rica and Peru variants).

 Run                                       M. F1        Acc.
 elirf-es-run-1                              0.503      0.612
 retuyt-lstm-es-1                            0.499      0.549
 retuyt-lstm-es-2                            0.498      0.514
 retuyt-combined-es                          0.491      0.602
 elirf-es-run-2                              0.489      0.593
 atalaya-ubav3-100-3-syn                     0.476      0.544
 retuyt-svm-es-2                             0.473      0.584
 atalaya-lr-50-2-bis                         0.468      0.599
 atalaya-lr-50-2                             0.461      0.598
 atalaya-ubav3-50-3                          0.460      0.583
 retuyt-cnn-es-1                             0.458      0.592
 atalaya-lr-50-2-roc                         0.455      0.595
 ingeotec-run1                               0.445      0.530
 retuyt-cnn-es-2                             0.445      0.574
 atalaya-svm-50-2                            0.431      0.583
 itainnova-cl-base                           0.383      0.433
 itainnova-cl-proc1                          0.320      0.395
 retuyt-cnn-es-1                             0.097      0.096

 Table 4: Task 1: InterTASS Monolingual ES

 Run                                       M. F1        Acc.
 retuyt-lstm-cr-2                            0.504      0.537
 retuyt-svm-cr-2                             0.499      0.577
 retuyt-svm-cr-1                             0.493      0.567
 elirf-cr-run-2                              0.482      0.561
 retuyt-cnn-cr-1                             0.477      0.569
 atalaya-cr-lr-50-2                          0.475      0.582
 ingeotec-run1                               0.474      0.522
 retuyt-lstm-cr-1                            0.473      0.530
 retuyt-cnn-cr-2                             0.469      0.563
 elirf-intertass-cr-run-1                    0.463      0.544
 atalaya-mlp-300-sentiment                   0.439      0.520
 atalaya-mlp-ubav3-50-3                      0.436      0.560
 ingeotec-run1                               0.384      0.398
 elirf-cr-run-1                              0.317      0.288

 Table 5: Task 1: InterTASS Monolingual CR

 Run                                       M. F1        Acc.
 retuyt-cnn-pe-1                             0.472      0.494
 atalaya-pe-lr-50-2                          0.462      0.451
 retuyt-lstm-pe-2                            0.443      0.488
 retuyt-svm-pe-2                             0.441      0.471
 ingeotec-run1                               0.439      0.447
 elirf-intertass-pe-run-2                    0.438      0.461
 atalaya-mlp-sentiment-ubav3-50-3            0.437      0.520
 retuyt-svm-pe-1                             0.437      0.474
 elirf-intertass-pe-run-1                    0.435      0.440
 atalaya-mlp-300-sentiment                   0.429      0.395
 atalaya-mlp-50-sentiment                    0.427      0.501
 retuyt-svm-pe-2                             0.425      0.477
 retuyt-cnn-pe-2                             0.425      0.477
 retuyt-lstm-pe-1                            0.419      0.420
 elirf-intertass-pe-run-1                    0.225      0.210

 Table 6: Task 1: InterTASS Monolingual PE

    For the cross-lingual runs, the participants selected one InterTASS dataset to train
 their systems and a different one to test them, in order to measure the dependency of
 systems on a language variety. Tables 7, 8 and 9 show the results obtained in these
 cross-lingual subtasks.
    The overall results, in terms of F1, obtained with the monolingual and multilingual
 systems for the Spain and Costa Rica collections are quite comparable, but those for the
 Peru collection fall by around 10%.

 2.2 Task 2

 Task 2, Aspect-based Sentiment Analysis, proposes the development of aspect-based
 polarity classification systems. Similar to previous editions (Martínez-Cámara et al.,
 2017), two datasets were used to evaluate the different approaches: Social-TV and
 STOMPOL. Both datasets were annotated with


aspect-related metadata: the main category of the aspect and the polarity of the opinion
about the aspect. Systems had to classify the opinion about the given aspect into 3
different polarity labels (positive, negative, neutral).

Run                             M. F1        Acc.
retuyt-svm-cross-es-2             0.471      0.555
retuyt-lstm-cross-es-2            0.470      0.466
retuyt-svm-cross-es-1             0.464      0.572
retuyt-cnn-cross-es-1             0.450      0.524
retuyt-cnn-cross-es-2             0.448      0.563
ingeotec-run1                     0.445      0.530
atalaya-mlp-300-sentiment         0.441      0.485
retuyt-lstm-cross-es-1            0.438      0.498

Table 7: Task 1: InterTASS Cross-lingual with ES as test

Run                             M. F1        Acc.
ingeotec-run1                     0.447      0.506
retuyt-svm-cross-pe-2             0.445      0.514
retuyt-svm-cross-pe-1             0.444      0.505
retuyt-lstm-cross-pe-2            0.444      0.465
atalaya-mlp-300-sentiment         0.438      0.523
retuyt-lstm-cross-pe-1            0.425      0.472
retuyt-cnn-cross-pe-1             0.409      0.481
retuyt-cnn-cross-pe-2             0.391      0.438
itainnova-cl-base-cross-PE        0.367      0.382

Table 8: Task 1: InterTASS Cross-lingual with PE as test

Run                             M. F1        Acc.
retuyt-svm-cross-cr-1             0.476      0.569
retuyt-svm-cross-cr-2             0.474      0.542
retuyt-lstm-cross-cr-1            0.473      0.530
retuyt-cnn-cross-cr-2             0.462      0.551
ingeotec-run2                     0.454      0.538
retuyt-lstm-cross-cr-2            0.444      0.468
retuyt-cnn-cross-cr-1             0.421      0.423
itainnova-cl-base-cross-CR        0.409      0.440
ingeotec-run1                     0.384      0.398

Table 9: Task 1: InterTASS Cross-lingual with CR as test

   Participants were expected to submit up to 3 experiments for each provided collection,
each in a plain text file with tweet identification, aspect and polarity.
   For evaluation, exact match with a single label combining “aspect-polarity” was used.
Similarly to Task 1, the macro-averaged versions of Precision, Recall and F1, as well as
Accuracy, were considered, and Macro-F1 was used for the final ranking of the proposed
systems.

2.2.1 Collections

The Social-TV corpus was collected during the 2014 final of the “Copa del Rey”
championship in Spain. After filtering out useless information, a subset of 2,773 tweets
was obtained. The details of the corpus are described in (Villena-Román et al., 2015;
García-Cumbreras et al., 2016; Martínez-Cámara et al., 2017).
    STOMPOL (corpus of Spanish Tweets for Opinion Mining at aspect level about POLitics)
is a corpus for the task of Aspect Based Sentiment Analysis. The corpus is composed of
1,284 tweets manually annotated by two annotators, with a third one in case of
disagreement. The details of the corpus are described in (Villena-Román et al., 2015;
García-Cumbreras et al., 2016; Martínez-Cámara et al., 2017).

2.2.2 Results

Only the research group ELiRF (González, Hurtado, and Pla, 2018c) participated in this
edition. They explored different approaches based on Deep Learning. Specifically, they
studied the behaviour of CNN, Attention Bidirectional Long Short Term Memory (Att-BLSTM)
and Deep Averaging Networks (DAN), similar to the proposal of the team for Task 1. In
order to study the performance of the different models, they carried out an adjustment
process. Tables 10 and 11 show the results obtained in their experiments.

Run                           M. F1     Acc.
ELiRF-UPV-run1                  0.485   0.627
ELiRF-UPV-run3                  0.483   0.628
ELiRF-UPV-run2                  0.476   0.625

Table 10: Task 2 Social-TV corpus results

Run                           M. F1     Acc.
ELiRF-UPV-run2                  0.526   0.633
ELiRF-UPV-run1                  0.490   0.613
ELiRF-UPV-run3                  0.447   0.576

Table 11: Task 2 STOMPOL corpus results

2.3 Task 3

NLP methods are increasingly being used to mine knowledge from unstructured content of


 health (Liu et al., 2013; Doing-Harris and                                   Subtask A as either Concept or Ac-
 Zeng-Treitler, 2011; Gonzalez-Hernandez et                                   tion.
 al., 2017) and other domains (Estevez-
 Velarde et al., 2018). Over the years, many                               • Subtask C is concerned with the dis-
 eHealth challenges have taken place, such                                   covery of the semantic relations between
 as SemEval2 , CLEF3 campaigns and oth-                                      pairs of entities.
 ers (Augenstein et al., 2017). These tasks
 have mainly dealt with identification, clas-                              To compute the evaluation metrics for
 sification, extraction and linking of knowl-                           each subtask, we define the following sets for
 edge. The Task 3: eHealth Knowledge Dis-                               comparing the annotations between both the
covery (eHealth-KD) proposes modelling the human language in a scenario in which Spanish electronic health documents could be machine readable from a semantic point of view. This task is designed to encourage the development of software technologies to automatically extract a large variety of knowledge from eHealth documents written in the Spanish language.

In order to capture the semantics of a broad range of health-related text, eHealth-KD proposes the identification of two types of elements: Concepts and Actions. Concepts are key phrases that represent actors or entities which are relevant in a domain, while Actions represent how these Concepts interact with each other. Actions and Concepts can be linked by two types of relations: subject and target, which describe the main roles that a Concept can perform. Also, four specific semantic relations between Concepts are defined: is-a, part-of, property-of and same-as. Figure 1 provides an example.

Figure 1: Example annotation of a small set of documents.

To simplify and normalise the extraction process, the overall task is divided into three subtasks:

   • Subtask A is concerned with the extraction of the relevant key phrases.
   • Subtask B is concerned with the classification of the key phrases identified in

   [2] International Workshop on Semantic Evaluation
   [3] Conference and Labs of the Evaluation Forum

expected output (gold standard) and the actual output in each subtask:

Correct matches (C): in all subtasks, when one gold and one given annotation exactly match.

Partial matches (P): in subtask A, when two key phrases have a non-empty intersection.

Missing matches (M): in subtasks A and C, when an annotation in the gold output is not provided by the system.

Spurious matches (S): in subtasks A and C, when an annotation given by the system does not appear in the gold output.

Incorrect matches (I): in subtask B, when one assigned label is incorrect.

To measure the individual subtask results as well as the overall results, the eHealth-KD challenge proposes three evaluation scenarios.

Scenario 1. The first scenario requires all subtasks (i.e. A, B and C) to be performed sequentially. The input in this scenario consists of plain text (100 sentences), and participants must submit the three output files corresponding to subtasks A, B and C. In this scenario the overall quality of the participant systems is evaluated. So, a combined micro F1 metric was defined, taking into account the results of the three subtasks:

\begin{align}
F1_{ABC} &= \frac{2 \cdot P_{ABC} \cdot R_{ABC}}{P_{ABC} + R_{ABC}} \tag{1} \\
P_{ABC} &= \frac{T_{ABC} + \frac{1}{2} P_A}{T_{ABC} + P_A + S_A + I_B + S_C} \tag{2} \\
R_{ABC} &= \frac{T_{ABC} + \frac{1}{2} P_A}{T_{ABC} + P_A + M_A + I_B + M_C} \tag{3} \\
T_{ABC} &= C_A + C_B + C_C \tag{4}
\end{align}
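As a minimal sketch of Equations (1) to (4), the function below computes the combined precision, recall and micro F1 from the match counts defined above. The class name, the function name and the counts in the example are hypothetical. The code also assumes that spurious annotations count against precision and missing annotations against recall, as in Equations (10) and (11).

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Match counts for one submission (all values are hypothetical inputs)."""
    correct_a: int    # C_A
    correct_b: int    # C_B
    correct_c: int    # C_C
    partial_a: int    # P_A
    missing_a: int    # M_A
    spurious_a: int   # S_A
    incorrect_b: int  # I_B
    missing_c: int    # M_C
    spurious_c: int   # S_C


def scenario1_scores(n: Counts) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the combined A+B+C evaluation."""
    total_correct = n.correct_a + n.correct_b + n.correct_c   # T_ABC
    credit = total_correct + 0.5 * n.partial_a                # partial matches count half
    shared = total_correct + n.partial_a + n.incorrect_b
    # Assumption: spurious annotations lower precision, missing ones lower recall.
    precision_den = shared + n.spurious_a + n.spurious_c
    recall_den = shared + n.missing_a + n.missing_c
    precision = credit / precision_den if precision_den else 0.0
    recall = credit / recall_den if recall_den else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


example = Counts(correct_a=80, correct_b=70, correct_c=30, partial_a=10,
                 missing_a=10, spurious_a=5, incorrect_b=10,
                 missing_c=20, spurious_c=15)
precision, recall, f1 = scenario1_scores(example)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

With these made-up counts the numerator is 185 and the two denominators are 220 and 230, so a system with many partial matches is rewarded less than one whose matches are exact.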


                 Train   Dev.   Test
Files                6      1      3
Sentences          559    285    300
Annotations       5976   3573   3310
Entities          3280   1958   1805
- Concepts        2431   1524   1305
- Actions          849    434    500
Roles             1684    843    988
- subject          693    339    401
- target           991    504    587
Relations         1012    772    517
- is-a             434    370    235
- part-of          149    145     96
- property-of      399    244    178
- same-as           30     13      8

Table 12: Statistics of the eHealth-KD v1.0 corpus.

Scenario 2. In the second scenario only subtasks B and C are performed. Hence, participants receive plain text inputs and the corresponding outputs for subtask A (a different subset of 100 sentences). This scenario allows participants to focus on key phrase classification, without being affected by errors related to the extraction of key phrases. As in Scenario 1, a combined micro F1 is defined which takes into account the results for subtasks B and C:

\begin{align}
F1_{BC} &= \frac{2 \cdot P_{BC} \cdot R_{BC}}{P_{BC} + R_{BC}} \tag{5} \\
P_{BC} &= \frac{T_{BC}}{T_{BC} + I_B + S_C} \tag{6} \\
R_{BC} &= \frac{T_{BC}}{T_{BC} + I_B + M_C} \tag{7} \\
T_{BC} &= C_B + C_C \tag{8}
\end{align}
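A worked instance of Equations (5) to (8) may help to see how the counts interact. The counts are hypothetical ($C_B = 80$, $I_B = 20$, $C_C = 20$, $S_C = 10$, $M_C = 30$) and, as an assumption, spurious annotations are charged to precision and missing ones to recall:

```latex
\begin{align*}
T_{BC} &= C_B + C_C = 80 + 20 = 100 \\
P_{BC} &= \frac{100}{100 + 20 + 10} = \frac{100}{130} \approx 0.769 \\
R_{BC} &= \frac{100}{100 + 20 + 30} = \frac{100}{150} \approx 0.667 \\
F1_{BC} &= \frac{2 \cdot P_{BC} \cdot R_{BC}}{P_{BC} + R_{BC}} = \frac{200}{280} \approx 0.714
\end{align*}
```

Incorrect labels from subtask B appear in both denominators, so in this sketch they lower precision and recall alike, whereas spurious and missing relations each affect only one of the two.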
Scenario 3. Finally, the third scenario evaluates only subtask C. Participants are provided with plain text inputs and the corresponding output of subtasks A and B (a final subset of another 100 sentences). In this scenario, the following metric is defined for evaluation:

\begin{align}
F1_C &= \frac{2 \cdot P_C \cdot R_C}{P_C + R_C} \tag{9} \\
P_C &= \frac{C_C}{C_C + S_C} \tag{10} \\
R_C &= \frac{C_C}{C_C + M_C} \tag{11}
\end{align}

For competition purposes, the best system is defined as the submission that maximises the macro-average F1 across all three scenarios:

\begin{equation}
F1 = \frac{F1_{ABC} + F1_{BC} + F1_C}{3} \tag{12}
\end{equation}

2.3.1 Corpora
For evaluation purposes, a corpus of health-related sentences in Spanish was manually built and tagged. The corpus consists of a selection of articles collected from the MedlinePlus [4] website. These files contain several entries related to health and medicine topics, and environmental topics strongly related to health care. Spanish language items were converted to a plain text document, processed, and manually tagged using the Brat annotation tool [5] by 15 human annotators divided into seven groups. The final 1,173 tagged sentences were organised in three collections: training, development and test. Table 12 summarises the main statistics of the corpus.

2.3.2 Analysis of the Results
The eHealth-KD challenge attracted a total of 31 registered teams, six of which successfully completed their participation. Their results are summarised in Table 13. The following tags provide an overview of the main characteristics of each participant system:

S: Uses shallow supervised models such as CRF, logistic regression, SVM, decision trees, etc.
D: Uses deep learning models, such as LSTM or convolutional networks.
E: Uses word embeddings or other embedding models trained with external corpora.
K: Uses external knowledge bases, either explicitly or implicitly (i.e., through third-party tools).
R: Uses hand-crafted rules based on domain expertise.
N: Uses natural language processing techniques or features, such as POS tagging and dependency parsing.

   [4] https://medlineplus.gov/xml.html
   [5] http://brat.nlplab.org/


Baseline description: A baseline, trained on the training corpus, was defined. This strategy consists of a dummy approach based solely on the text of key phrases. This technique collects all training data and stores three maps: (1) key phrases associated with their most common class (either Concept or Action); (2) pairs of concepts associated with their most common relation; and (3) tuples of <Action,Concept> associated with their most common role. At prediction time, these maps are used to select a key phrase, decide its class, and predict relations and roles.

Once the shared subtask ended, the official results were published. However, some participants noticed that their systems provided duplicated outputs on some occasions. These duplicated outputs, even if correct, were being counted as spurious after the first match. To account for this duplication, the evaluation script was modified to remove duplicated outputs from the participants' submissions prior to calculating the evaluation metrics. Table 14 shows this second version of the metrics, where variations in scores are highlighted in bold text. This proved not to be a significant problem, since only two participants were affected, and even though their metrics improved marginally, neither the overall results nor the main conclusions of the shared subtask changed.

The results of this task, eHealth-KD, show that a variety of approaches, on the whole, deal effectively with the health knowledge discovery problem. However, issues still need to be resolved to obtain highly competitive systems. The best performing submissions include classic supervised learning, deep learning and knowledge-based techniques. In subtask A, the best approach (UC3M) is based on a CRF model with pre-trained embeddings as features. This approach obtains similar scores in subtask B. In general, subtask B appears to be easier than the rest, which is understandable given that there are only two classes and there is a large correlation between word lemmas and their classes (as shown by the relatively high performance of the baseline).

Subtask C, in concordance with Scenario 3, does not exceed 45% in F-score. This reinforces the belief that this task is difficult to deal with, even after having applied novel approaches (i.e. TALP and LaBDA) based on convolutional neural networks.

   [7] This extracts lexical and syntactic features for each token. Afterwards, it applies a set of handcrafted heuristics for each subtask.

The best-performing systems in each scenario largely coincide with the best-performing systems in the three subtasks. For Scenario 1, the top performing strategy belongs to UC3M, which achieves the best scores in subtasks A and B, and the best overall result in the shared subtask (averaged across all three scenarios), closely followed by SINAI. Likewise, the best strategy in Scenario 3 corresponds to TALP, which achieves the best score for subtask C. However, for the overall results, other participants such as SINAI and UPF-UPC achieve higher average scores, even though their performance in subtask C and Scenario 3 is practically negligible. In contrast, these teams obtain relatively high scores in subtasks A and B.

The diverse nature and complexity of the three subtasks make it difficult to design a single fair evaluation metric. For this reason, we consider that each system submission gets more accurate results related to the specific sub-problems that it tackles. Although generalisation across the three subtasks is a desirable characteristic, advances in any particular subtask are also very valuable.

In general, the most competitive approaches in individual subtasks are dominated by state-of-the-art machine learning. This holds in particular for subtask C, where modern deep learning approaches seem to outperform classic techniques. In addition, incorporating domain-specific knowledge provides a significant boost to the performance. Most participants use NLP features, either explicitly, or implicitly captured in word embeddings and other representations. An interesting phenomenon is that the best systems in subtask A do not correlate with the best systems in subtask C. This suggests that the optimal approach for each subtask is different, giving rise to an interesting research line that would explore integrated approaches to solving these three subtasks simultaneously. The overall results show that general purpose knowledge discovery in domain-specific documents is potentially a prolific research area, particularly for the Spanish language. We expect similar future initiatives to provide fruitful evaluation scenarios where researchers can deploy techniques from several domains, and compete in friendly contests to improve the state-of-the-


             UC3M†   SINAI†   UPF-UPC†   TALP†   LaBDA†   UH      Baseline
Tags         SDEN    KRN      SKN        DEN     D        RN
Subtask A    0.872   0.798    0.805      -       0.323    0.172   0.597
Subtask B    0.959   0.921    0.954      0.931   0.594    0.639   0.774
Subtask C    -       -        0.036      0.448   0.420    0.018   0.107
Average      0.610   0.573    0.598      0.460   0.446    0.276   0.493
Scenario 1   0.744   0.710    0.681      -       0.297    0.181   0.566
Scenario 2   0.648   0.674    0.622      0.722   0.275    0.255   0.577
Scenario 3   -       -        0.036      0.448   0.420    0.018   0.107
Average      0.464   0.461    0.446      0.390   0.331    0.151   0.417

Table 13: Summary of systems and results for the TASS 2018 Task 3 event. The best scores are in bold text. More details in UC3M (Zavala, Martínez, and Segura-Bedmar, 2018), SINAI (López-Ubeda et al., 2018), UPF-UPC (Palatresi and Hontoria, 2018), TALP (Medina and Turmo, 2018), LaBDA (Suarez-Paniagua, Segura-Bedmar, and Martínez, 2018) and UH [7]. The symbol † means that the group submitted a system description paper.
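The second Average row of Table 13 appears to be consistent with Equation (12) if a scenario without a submission is counted as zero. The sketch below checks this reading; the function name is hypothetical, the scores are copied from the table, and treating "-" as zero is an assumption rather than a documented rule.

```python
from typing import Optional, Sequence


def macro_f1(scores: Sequence[Optional[float]]) -> float:
    """Average F1 over scenarios; a scenario with no submission (None) counts as 0."""
    return sum(s if s is not None else 0.0 for s in scores) / len(scores)


# Scenario 1, 2 and 3 scores copied from Table 13 ("-" entries become None).
table13 = {
    "UC3M": (0.744, 0.648, None),
    "SINAI": (0.710, 0.674, None),
    "UPF-UPC": (0.681, 0.622, 0.036),
    "TALP": (None, 0.722, 0.448),
    "LaBDA": (0.297, 0.275, 0.420),
    "UH": (0.181, 0.255, 0.018),
    "Baseline": (0.566, 0.577, 0.107),
}

for team, scores in table13.items():
    print(f"{team:9s} {macro_f1(scores):.3f}")
```

Under this assumption a team that skips a scenario is penalised in the ranking, which is why TALP, despite the best Scenario 2 and Scenario 3 scores, has a lower average than UC3M.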

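The Baseline column of Table 13 comes from the dictionary-based strategy given in the baseline description. A minimal sketch of that idea follows. The toy training data, the function names and the use of plain substring matching to select key phrases are all assumptions made for illustration, since the exact matching procedure is not specified.

```python
from collections import Counter, defaultdict


def most_common_map(pairs):
    """Map each key to the label it was most often seen with."""
    counts = defaultdict(Counter)
    for key, label in pairs:
        counts[key][label] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


# Hypothetical toy training data; the real corpus format is not reproduced here.
phrase_labels = [("asma", "Concept"), ("afecta", "Action"), ("asma", "Concept"),
                 ("vías respiratorias", "Concept")]
concept_relations = [(("asma", "enfermedad"), "is-a")]
action_roles = [(("afecta", "asma"), "subject"),
                (("afecta", "vías respiratorias"), "target")]

phrase_class = most_common_map(phrase_labels)       # map (1): phrase -> class
pair_relation = most_common_map(concept_relations)  # map (2): concept pair -> relation
pair_role = most_common_map(action_roles)           # map (3): <Action,Concept> -> role


def predict(sentence):
    """Find known key phrases in a sentence, then look up classes and links."""
    text = sentence.lower()
    # Assumption: a key phrase is selected when it occurs as a substring.
    found = [p for p in phrase_class if p in text]
    classes = {p: phrase_class[p] for p in found}
    links = {}
    for a in found:
        for b in found:
            if (a, b) in pair_role:
                links[(a, b)] = pair_role[(a, b)]
            elif (a, b) in pair_relation:
                links[(a, b)] = pair_relation[(a, b)]
    return classes, links


print(predict("El asma afecta las vías respiratorias."))
```

Because every decision is a lookup keyed on surface text, such a baseline cannot label unseen phrases, which fits its weak score in subtask C compared with its score in subtask B.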
art.

             UPF-UPC   LaBDA
Tags         SKN       DN
Subtask A    0.805     0.323
Subtask B    0.954     0.594
Subtask C    0.036     0.444
Average      0.598     0.454
Scenario 1   0.681     0.310
Scenario 2   0.626     0.294
Scenario 3   0.036     0.444
Average      0.448     0.349

Table 14: Summary of results of submissions that changed once duplicated entries were removed. Variations in score are highlighted in bold text.

2.4    Task 4
When news is about natural disasters, readers usually feel negative emotions (sadness, for instance), whereas when it is about the latest championship won by their favourite team, readers feel positive emotions such as happiness. Moreover, it is commonly assumed in marketing that the emotions aroused in the reader by news articles have an impact on the perception of the advertisements displayed along with those articles. Thus, from that marketing perspective, if a company wants to promote its brand, its ads should preferably be associated with (i.e., shown with) news that arouses positive emotions.

The objective of Task 4 is to encourage the development of systems that can classify a news article as safe (positive emotions, so safe for ads) or unsafe (negative emotions, so ads are better avoided). This task could be considered a kind of stance classification, on the positioning of readers of news contents. The task is a strong challenge because it has to deal with the polarity of feeling (safe vs. unsafe) and to work in combination with a (pseudo) thematic classification to be able to determine the meaning of the news. For example, the reduction of traffic accidents has a negative feeling because of the accidents, but the context of reducing the number of accidents turns that bad news into good news, hence safe news.

2.4.1 Corpora
The Spanish brANd Safe Emotion (SANSE) corpus was specifically built for this task. RSS feeds of different online newspapers written in different varieties of Spanish (Argentina, Chile, Colombia, Cuba, Spain, USA, Mexico, Peru and Venezuela) were collected for over a month. In total, 15,152 articles were captured, each containing the URL, the publication date and the headline. News summaries were also collected for several sources, but they were finally discarded to make the dataset consistent and homogeneous.

Then 2,000 articles (L1 subset) were randomly selected and manually annotated with an emotional categorisation of SAFE or UNSAFE, from the point of view of the general public of each corresponding country. The other 13,152 articles (L2 subset) were not annotated.


As the datasets were annotated with two levels of safety, SAFE and UNSAFE, the task can be considered a binary classification task.

The annotation was carried out by two human annotators (the two organisers of the task), and, for those cases with no agreement between the two annotators, a third annotator broke the tie. A safe headline was defined as an utterance that arouses a positive or neutral emotion in the reader and is not related to a controversial topic: religion, extreme-wing political topics, or controversial topics (those that arouse strong positive emotions in some readers but strong negative emotions in others). An unsafe headline was defined as an utterance that arouses negative emotions in the reader.

Some examples in Spanish:

     Así será el nuevo pan integral en España, según una nueva ley en marcha. → SAFE
     This will be the new wholemeal bread in Spain, according to a new law underway.

     Casi 300 municipios de Colombia en riesgo electoral. → UNSAFE
     Almost 300 municipalities in Colombia at electoral risk.

The agreement of the annotation was 0.58 according to π (Scott, 1955) and κ (Cohen, 1960), which may be considered moderate according to Landis and Koch (1977). Although the agreement is moderate, it is close to being considered substantial, and we also have to take into account that it is a new classification task that works with strongly subjective content. We will work on making the annotation guidelines more precise in order to improve the agreement of the annotators. Besides, we hope that the participants will give us insights with the aim of improving the annotation of the data.

The L1 subset was then divided into three subsets, specifically: training, development and test. The statistics of the three subsets are in Table 15.

Subset         Size
Training       1250
Development     250
Test            500

Table 15: Statistics of the SANSE corpus

2.4.2 Tasks
Two subtasks were proposed. Subtask 1 (S1) consists of the classification of headlines into safe or unsafe for incorporating an ad of a brand. The evaluation of the systems does not take into account the cultural varieties of the Spanish language; it is thus a monolingual evaluation. In this task, the datasets are composed of news headlines written in different varieties of the Spanish language, but the country of the text is not relevant for this task.

Participants were provided with the training and development subsets of the L1 SANSE corpus for building the systems, and two test sets for the evaluation: the test subset of the L1 SANSE corpus and the L2 SANSE corpus.

The systems presented were evaluated using the measures of Macro-Precision (M. P.), Macro-Recall (M. R.), Macro-F1 (M. F1) and Accuracy (Acc.).

Subtask 2 (S2) is similar to S1, but in this case the aim is to evaluate the generalisation capacity of the submitted systems. For training their systems, participants were provided with SANSE subsets with headlines written only in the Spanish language spoken in Spain. The test set was composed of headlines written in the Spanish language spoken in different countries of America. The statistics of the SANSE corpus for S2 are shown in Table 16.

Subset               Size
Training (Spain)      300
Dev. (Spain)           48
Test (Mexico)         144
Test (Cuba)           194
Test (Chile)          194
Test (Colombia)       195
Test (Argentina)      198
Test (Venezuela)      233
Test (Peru)           234
Test (USA)            260

Table 16: Statistics of the SANSE corpus for Subtask 2

2.4.3 Results
Task 4 attracted the attention of seven teams, and most of them participated in both levels of evaluation of S1 and in S2. Table 17 shows the participation of the teams in each Subtask. Five of the seven groups


submitted a system description paper, whose main features are detailed as follows.

Team             S1 L1   S1 L2   S2
INGEOTEC†        X       X       X
ELiRF-UPV†       X       X       X
rbnUGR†          X       X       X
MeaningCloud†    X       X       X
SINAI†           X       X       -
lone wolf        X       X       -
TNT-UA-WFU       X       X       -

Table 17: Participation of each team on each Subtask. The symbol † means that the group submitted a system description paper.

INGEOTEC. Moctezuma et al. (2018) propose an ensemble classification system (EvoMSA), which is composed of several heterogeneous base systems and a genetic programming system (EvoDAG (Graff et al., 2017)) that optimises the contribution of each base system to the final classification. The authors combined supervised and unsupervised systems as base classification systems. The supervised ones are based on the SVM algorithm with different representations of the input text, namely TF-IDF and pre-trained word vectors. The system reached the best results in the monolingual and the multilingual evaluations; however, its performance dropped slightly in S1 L2. Since the annotation of the S1 L2 test set was conducted by a voting system over all the submitted systems, the lower performance in S1 L2 may be caused by a different error distribution between INGEOTEC and the systems submitted by the other groups.

ELiRF UPV. González, Hurtado, and Pla (2018a) propose a deep neural network, specifically the Deep Averaging Network (DAN) model (Iyyer et al., 2015). The authors used a set of pre-trained word embeddings for representing the news headlines. The pre-trained word embeddings were prepared by the authors and built upon a corpus of tweets (Hurtado, Pla, and González, 2017). The high performance reached on news headlines by word embeddings built upon tweets stands out, because the genres of news headlines and tweets are different. However, it may mean that the use of language in tweets and news headlines is similar.

rbnUGR. Rodríguez Barroso, Martínez-Cámara, and Herrera (2018) submitted three systems grounded in deep learning. Although the three systems are based on Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNN), they have several differences:

Run 1. It uses an LSTM layer as the encoding layer, and its output is the last state vector of the LSTM layer.
Run 2. It uses a BiLSTM [8] layer as the encoding layer, and its output is the concatenation of the last state vectors of the two LSTM layers.
Run 3. It uses an LSTM layer as the encoding layer, and its output is the concatenation of the output state vectors of each input token.

The results show that the systems based on a single LSTM layer perform better than the one based on a BiLSTM. The different results in S1 and S2 indicate that using the entire output of the encoding layer improves the generalisation capacity of the model, because the multilingual evaluation requires a higher generalisation capacity.

MeaningCloud. Herrera-Planells and Villena-Román (2018) propose three supervised systems: two of them are linear classification systems and the other one is a non-linear classification system. The linear classification systems use XGBoost (Chen and Guestrin, 2016) as the classifier. They differ in the set of features used to represent the news headlines, which are mainly built using the public APIs of the text analytics platform of MeaningCloud. The non-linear classification system is a neural network based on a CNN layer. The proposal that reached the best results was the one grounded in a CNN (Run 3).

SINAI. Plaza del Arco et al. (2018) propose to represent the news headlines as a vector of unigrams weighted with TF-IDF, together with the number of positive and negative words according to three lists of opinion-bearing words. The authors used SVM as the classification algorithm.

   [8] A BiLSTM combines two LSTM layers that read the input in opposite directions.

The evaluation measures in the two Subtasks were accuracy and the macro-average


                    S1 L1                             S1 L2                             S2
System              M. P.  M. R.  M. F1       Acc.    M. P.  M. R.  M. F1       Acc.    M. P.  M. R.  M. F1       Acc.
INGEOTEC run1       0.794  0.795  0.795 (1)   0.802   0.853  0.880  0.866 (4)   0.871   0.722  0.715  0.719 (1)   0.737
ELiRF UPV run2      0.787  0.794  0.790 (2)   0.794   0.850  0.884  0.867 (3)   0.865   0.747  0.657  0.699 (2)   0.722
ELiRF UPV run1      0.795  0.784  0.790 (3)   0.800   0.878  0.889  0.883 (1)   0.893   0.736  0.649  0.690 (3)   0.715
rbnUGR run1         0.784  0.764  0.774 (4)   0.786   0.880  0.867  0.873 (2)   0.888   0.683  0.661  0.672 (6)   0.700
MEANINGCLOUD run3   0.767  0.767  0.767 (5)   0.776   0.781  0.804  0.793 (7)   0.801   0.647  0.654  0.651 (7)   0.658
rbnUGR run3         0.763  0.765  0.764 (6)   0.772   0.838  0.870  0.853 (6)   0.853   0.687  0.678  0.683 (4)   0.631
rbnUGR run2         0.774  0.752  0.763 (7)   0.776   0.868  0.857  0.863 (5)   0.878   0.679  0.672  0.676 (5)   0.698
SINAI               0.733  0.722  0.728 (8)   0.742   0.769  0.777  0.773 (8)   0.793   -      -      -           -
MEANINGCLOUD run2   0.723  0.727  0.725 (9)   0.732   -      -      -           -       -      -      -           -
MEANINGCLOUD run1   0.713  0.722  0.717 (10)  0.714   -      -      -           -       -      -      -           -

Table 18: Macro precision (M. P.), macro recall (M. R.), macro F1 (M. F1) and accuracy (Acc.)
reached in each Subtask by each submitted system of the groups that submitted a system
description paper. The number in parentheses after each M. F1 value is the rank of the
system in that Subtask
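
As an illustration of the measures reported in Table 18, the following minimal Python
sketch computes accuracy and macro-averaged precision, recall and F1 from gold and
predicted labels. It assumes that macro-F1 is the harmonic mean of macro-precision and
macro-recall. This is not stated explicitly here, although it appears consistent with
the table (for instance, M. P. = 0.784 and M. R. = 0.764 give approximately 0.774, the
M. F1 reported for rbnUGR run1 in S1 L1). The function name macro_scores and the labels
SAFE and UNSAFE are hypothetical and serve only the example.

```python
from collections import Counter


def macro_scores(gold, predicted):
    """Return (macro_precision, macro_recall, macro_f1, accuracy).

    Assumption: macro-F1 is the harmonic mean of macro-precision and
    macro-recall, not the average of the per-class F1 values.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")

    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was missed

    # Per-class precision and recall; a class with no support counts as 0.0.
    precisions = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in labels]
    recalls = [tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0 for c in labels]

    macro_p = sum(precisions) / len(labels)
    macro_r = sum(recalls) / len(labels)
    denom = macro_p + macro_r
    macro_f1 = 2 * macro_p * macro_r / denom if denom else 0.0
    accuracy = sum(tp.values()) / len(gold)
    return macro_p, macro_r, macro_f1, accuracy


if __name__ == "__main__":
    # Hypothetical toy labels, not taken from the TASS corpora.
    gold = ["SAFE", "SAFE", "UNSAFE", "UNSAFE", "SAFE"]
    predicted = ["SAFE", "UNSAFE", "UNSAFE", "UNSAFE", "SAFE"]
    print(macro_scores(gold, predicted))
```

Averaging over classes rather than over instances gives each class the same weight
regardless of its frequency, which is why a macro measure is a reasonable ranking
criterion when the classes are unbalanced.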

of precision, recall and F1, and the systems were ranked according to the value of
macro-F1. Table 18 shows the results reached in S1 L1, S1 L2 and S2 by each group that
submitted a description of their systems.

3     Conclusions
The 2018 edition of TASS attracted the participation of 16 systems and the submission
of 15 system description papers. Moreover, we have proposed two new challenges to the
international research community, which are in line with the requirements of industry.
   The submitted systems are in line with the state of the art in other similar
workshops, and most of them are grounded in Deep Learning and the use of hand-crafted
linguistic features. Therefore, TASS may be considered a reference forum for setting
the state of the art in semantic analysis in Spanish.
   As future work, we plan to enlarge the coverage of the Spanish language in the
InterTASS corpus, as well as to consolidate the new challenges (Task 3 and Task 4).
Moreover, we will keep working on the development of new corpora and linguistic
resources for the research community.

Acknowledgments
This work has been partially supported by a grant from the Fondo Europeo de Desarrollo
Regional (FEDER), the projects REDES (TIN2015-65136-C2-1-R, TIN2015-65136-C2-2-R) and
SMART-DASCI (TIN2017-89517-P) from the Spanish Government, and “Plataforma Inteligente
para Recuperación, Análisis y Representación de la Información Generada por Usuarios en
Internet” (GRE16-01) from University of Alicante. Eugenio Martínez Cámara was supported
by the Spanish Government Programme Juan de la Cierva Formación (FJCI-2016-28353).

References
Augenstein, I., M. Das, S. Riedel, L. Vikraman, and A. McCallum. 2017. SemEval 2017
  Task 10: ScienceIE - extracting keyphrases and relations from scientific
  publications. arXiv preprint arXiv:1704.02853.
Chen, T. and C. Guestrin. 2016. XGBoost: A scalable tree boosting system. In
  Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
  and Data Mining, KDD ’16, pages 785–794, New York, NY, USA. ACM.
Chiruzzo, L. and A. Rosá. 2018. RETUYT-InCo at TASS 2018: Sentiment analysis in
  Spanish variants using neural networks and SVM. In E. Martínez-Cámara, Y. Almeida
  Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega,
  Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena,
  A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop
  on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR
  Workshop Proceedings, Sevilla, Spain, September. CEUR-WS.
Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and
  Psychological Measurement, 20(1):37–46.
Denecke, K. 2015. Health Web Science. Springer International Publishing.
Doing-Harris, K. M. and Q. Zeng-Treitler. 2011. Computer-assisted update of a consumer
  health vocabulary through mining of social network data. Journal of Medical Internet
  Research, 13(2).
Estevez-Velarde, S., Y. Gutiérrez, A. Montoyo, A. Piad-Morffis, R. Muñoz, and
  Y. Almeida-Cruz. 2018. Gathering object interactions as semantic knowledge. In
  Proceedings of the 2018 International Conference on Artificial Intelligence
  (ICAI’18).
García-Cumbreras, M. A., J. Villena-Román, E. Martínez-Cámara, M. C. Díaz-Galiano,
  M. T. Martín-Valdivia, and L. A. Ureña López. 2016. Overview of TASS 2016. In TASS
  2016: Workshop on Sentiment Analysis at SEPLN, pages 13–21.
González, J.-A., L.-F. Hurtado, and F. Pla. 2018a. ELiRF-UPV en TASS 2018:
  Categorización emocional de noticias. In Proceedings of TASS 2018: Workshop on
  Semantic Analysis at SEPLN (TASS 2018), volume 2172, September.
González, J.-A., L.-F. Hurtado, and F. Pla. 2018b. ELiRF-UPV en TASS 2018: Análisis de
  sentimientos en Twitter basado en aprendizaje profundo. In E. Martínez-Cámara,
  Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras,
  M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz
  Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018:
  Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop
  Proceedings, Sevilla, Spain, September. CEUR-WS.
González, J.-A., L.-F. Hurtado, and F. Pla. 2018c. ELiRF-UPV en TASS 2018: Análisis de
  sentimientos en Twitter basado en aprendizaje profundo. In Proceedings of TASS 2018:
  Workshop on Semantic Analysis at SEPLN (TASS 2018).
Gonzalez-Hernandez, G., A. Sarker, K. O’Connor, and G. Savova. 2017. Capturing the
  patient’s perspective: a review of advances in natural language processing of
  health-related text. Yearbook of Medical Informatics, 26(01):214–227.
Graff, M., E. S. Tellez, H. Jair Escalante, and S. Miranda-Jiménez. 2017. Semantic
  Genetic Programming for Sentiment Analysis, pages 43–65. Springer International
  Publishing, Cham.
Herrera-Planells, J. and J. Villena-Román. 2018. MeaningCloud at TASS 2018: News
  headlines categorization for brand safety assessment. In Proceedings of TASS 2018:
  Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172, September.
Hurtado, L.-F., F. Pla, and J.-A. González. 2017. ELiRF-UPV en TASS 2017: Análisis de
  sentimientos en Twitter basado en aprendizaje profundo. In Proceedings of TASS 2017:
  Workshop on Sentiment Analysis at SEPLN co-located with 33rd SEPLN Conference
  (SEPLN 2017).
Iyyer, M., V. Manjunatha, J. Boyd-Graber, and H. Daumé III. 2015. Deep unordered
  composition rivals syntactic methods for text classification. In Proceedings of the
  53rd Annual Meeting of the Association for Computational Linguistics and the 7th
  International Joint Conference on Natural Language Processing (Volume 1: Long
  Papers), pages 1681–1691. Association for Computational Linguistics.
Landis, J. R. and G. G. Koch. 1977. The measurement of observer agreement for
  categorical data. Biometrics, pages 159–174.
Liu, H., S. J. Bielinski, S. Sohn, S. Murphy, K. B. Wagholikar, S. R. Jonnalagadda,
  K. Ravikumar, S. T. Wu, I. J. Kullo, and C. G. Chute. 2013. An information
  extraction framework for cohort identification using electronic health records.
  AMIA Summits on Translational Science Proceedings, 2013:149.
López-Ubeda, P., M. C. Díaz-Galiano, M. T. Martín-Valdivia, and L. A. Urena-Lopez.
  2018. SINAI en TASS 2018 Task 3. Clasificando acciones y conceptos con UMLS en
  MEDLINE. In Proceedings of TASS 2018:
  Workshop on Semantic Analysis at SEPLN (TASS 2018).
Luque, F. M. and J. M. Pérez. 2018. Atalaya at TASS 2018: Sentiment analysis with
  tweet embeddings and data augmentation. In E. Martínez-Cámara, Y. Almeida Cruz,
  M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras, M. García-Vega,
  Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz Guillena,
  A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018: Workshop
  on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop Proceedings,
  Sevilla, Spain, September. CEUR-WS.
Martínez-Cámara, E., M. C. Díaz-Galiano, M. A. García-Cumbreras, M. García-Vega, and
  J. Villena-Román. 2017. Overview of TASS 2017. In E. Martínez-Cámara, M. C.
  Díaz-Galiano, M. A. García-Cumbreras, M. García-Vega, and J. Villena-Román, editors,
  Proceedings of TASS 2017: Workshop on Semantic Analysis at SEPLN (TASS 2017), volume
  1896 of CEUR Workshop Proceedings, Murcia, Spain, September. CEUR-WS.
Medina, S. and J. Turmo. 2018. Joint classification of key-phrases and relations in
  electronic health documents. In Proceedings of TASS 2018: Workshop on Semantic
  Analysis at SEPLN (TASS 2018).
Moctezuma, D., J. Ortiz-Bejar, E. S. Tellez, S. Miranda-Jiménez, and M. Graff. 2018.
  INGEOTEC solution for Task 4 in TASS’18 competition. In Proceedings of TASS 2018:
  Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172, September.
Moctezuma, D., J. Ortiz-Bejar, E. S. Téllez, S. Miranda-Jiménez, and M. Graff. 2018.
  INGEOTEC solution for Task 1 in TASS’18 competition. In E. Martínez-Cámara,
  Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras,
  M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz
  Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018:
  Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop
  Proceedings, Sevilla, Spain, September. CEUR-WS.
Montanés, R., R. Aznar, and R. del Hoyo. 2018. Aplicación de un modelo híbrido de
  aprendizaje profundo para el análisis de sentimiento en Twitter. In E. Martínez-Cámara,
  Y. Almeida Cruz, M. C. Díaz-Galiano, S. Estévez Velarde, M. A. García-Cumbreras,
  M. García-Vega, Y. Gutiérrez Vázquez, A. Montejo Ráez, A. Montoyo Guijarro, R. Muñoz
  Guillena, A. Piad Morffis, and J. Villena-Román, editors, Proceedings of TASS 2018:
  Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172 of CEUR Workshop
  Proceedings, Sevilla, Spain, September. CEUR-WS.
Palatresi, J. V. and H. R. Hontoria. 2018. TASS2018: Medical knowledge discovery by
  combining terminology extraction techniques with machine learning classification.
  In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN (TASS 2018).
Plaza del Arco, F. M., E. Martínez-Cámara, M. T. Martín Valdivia, and A. Ureña López.
  2018. SINAI en TASS 2018: Inserción de conocimiento emocional externo a un
  clasificador lineal de emociones. In Proceedings of TASS 2018: Workshop on Semantic
  Analysis at SEPLN (TASS 2018), volume 2172, September.
Rodríguez Barroso, N., E. Martínez-Cámara, and F. Herrera. 2018. SCI2S at TASS 2018:
  Emotion classification with recurrent neural networks. In Proceedings of TASS 2018:
  Workshop on Semantic Analysis at SEPLN (TASS 2018), volume 2172, September.
Scott, W. A. 1955. Reliability of content analysis: The case of nominal scale coding.
  Public Opinion Quarterly, pages 321–325.
Suarez-Paniagua, V., I. Segura-Bedmar, and P. Martínez. 2018. LABDA at TASS-2018
  Task 3: Convolutional neural networks for relation classification in Spanish eHealth
  documents. In Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN
  (TASS 2018).
Villena-Román, J., J. García-Morera, M. A. García-Cumbreras, E. Martínez-Cámara,
  M. T. Martín-Valdivia, and L. A. Ureña López. 2015. Overview of TASS 2015. In TASS
  2015: Workshop on Sentiment Analysis at SEPLN, pages 13–21.
Zavala, R. M. R., P. Martínez, and I. Segura-Bedmar. 2018. A hybrid Bi-LSTM-CRF model
  for knowledge recognition from eHealth documents. In Proceedings of TASS 2018:
  Workshop on Semantic Analysis at SEPLN (TASS 2018).

