Overview of the EVALITA 2018 Aspect-based Sentiment Analysis task (ABSITA)

Pierpaolo Basile, University of Bari Aldo Moro, pierpaolo.basile@uniba.it
Valerio Basile, University of Turin, basile@di.uniroma1.it
Danilo Croce, University of Rome "Tor Vergata", croce@info.uniroma2.it
Marco Polignano, University of Bari Aldo Moro, marco.polignano@uniba.it

ABSITA is the aspect-based sentiment analysis evaluation task of EVALITA 2018 (Caselli et al., 2018). The task aims to promote research in sentiment analysis for the Italian language: participants were asked to identify the relevant aspects of the given target entities and the sentiment expressed for each aspect. Two subtasks are defined, namely Aspect Category Detection (ACD) and Aspect Category Polarity (ACP). In total, 20 runs were submitted by 7 teams comprising 11 individual participants. The best system achieved a micro F1-score of 0.810 for ACD and 0.767 for ACP.

1 Introduction

In recent years, many websites started offering a high level of interaction with users, who are no longer a passive audience but can actively produce new [...]

1 http://www.amazon.com
2 http://www.tripadvisor.com

In its most basic form, an SA system takes as input a text written in natural language and assigns it a label indicating whether the text expresses a positive or negative sentiment, or neither (neutral, or objective, text). However, reviews are often quite detailed in expressing the reviewer's opinion on several aspects of the target entity. Aspect-based Sentiment Analysis (ABSA) is an evolution of Sentiment Analysis that aims at capturing the aspect-level opinions expressed in natural language texts (Liu, 2007).

At the international level, ABSA was introduced as a shared task at SemEval, the most prominent evaluation campaign in the Natural Language Processing field, in 2014 (SE-ABSA14), providing a benchmark dataset of reviews in English (Pontiki et al., 2014). Datasets of laptop and restaurant reviews were annotated with aspect terms (both fine-grained, e.g. "hard disk", "pizza", and coarse-grained, e.g. "food") and their polarity (positive or negative). The task was repeated in SemEval 2015 (SE-ABSA15) and 2016 (SE-ABSA16), aiming to facilitate more in-depth research by providing a new ABSA framework to investigate the relations between the identified constituents of the expressed opinions, and expanding to include languages other than English and different domains (Pontiki et al., 2015; Pontiki et al., 2016).

ABSITA (Aspect-based Sentiment Analysis on Italian) aims at providing a similar evaluation with respect to texts in Italian. In a nutshell, participants are asked to detect within sentences (expressing opinions about accommodation services) some of the aspects considered by the writer. These aspects belong to a closed set ranging from the cleanliness of the room to the price of the accommodation. Moreover, for each detected aspect, participants are asked to detect a specific polarity class, expressing appreciation or criticism towards it.

During the organization of the task, we collected a dataset composed of more than 9,000 sentences and we annotated them with aspects and polarity labels. During the task, 20 runs were submitted by 7 teams comprising 11 individual participants.

In the rest of the paper, Section 2 provides a detailed definition of the task. Section 3 describes the dataset made available in the evaluation campaign, while Section 4 reports the official evaluation measures. In Sections 5 and 6, the results obtained by the participants are reported and discussed, respectively. Finally, Section 7 draws the conclusions.

2 Definition of the task

In ABSITA, Aspect-based Sentiment Analysis is decomposed as a cascade of two subtasks: Aspect Category Detection (ACD) and Aspect Category Polarity (ACP). For example, let us consider the following sentence describing a hotel:

I servizi igienici sono puliti e il personale cordiale e disponibile. (The toilets are clean and the staff is friendly and helpful.)

In the ACD task, one or more "aspect categories" evoked in a sentence are identified, e.g. the pulizia (cleanliness) and staff categories in the sentence above. In the Aspect Category Polarity (ACP) task, the polarity of each expressed category is recognized, e.g. in the sentence above a positive polarity is expressed for both the pulizia and the staff categories.

In our evaluation framework, the set of aspect categories is known and given to the participants, so the ACD task can be seen as a multi-class, non-exclusive classification task where each input text has to be classified as evoking or not each aspect category. The participant systems are asked to return a binary vector where each dimension corresponds to an aspect category and the values 0 (false) and 1 (true) indicate whether each aspect has been detected in the text. Table 1 shows examples of annotation for the ACD task.

For the ACP task, the input is the review text paired with the set of aspects identified in the text within the ACD subtask, and the goal is to assign polarity labels to each aspect category. Two binary polarity labels are expected for each aspect: POS and NEG, indicating a positive and a negative sentiment expressed towards a specific aspect, respectively. Note that the two labels are not mutually exclusive: in addition to the annotation of positive aspects (POS:true, NEG:false) and negative aspects (POS:false, NEG:true), there can be aspects with no polarity, or neutral polarity (POS:false, NEG:false). This is also the default polarity annotation for the aspects that are not detected in a text. Finally, the polarity of an aspect can be mixed (POS:true, NEG:true), in cases where both sentiments are expressed towards a certain aspect in a text. Table 2 summarizes the possible annotations with examples.

The participants could choose to submit only the results of the ACD subtask, or of both subtasks. In the latter case, the output of the ACD task is used as input for the ACP task. As a constraint on the results submitted for the ACP task, the polarity of an aspect for a given sentence can differ from (POS:false, NEG:false) only if the aspect is detected in the ACD step.

Sentence                                                              CLEANLINESS  STAFF  COMFORT  LOCATION
I servizi igienici sono puliti e il personale cordiale e disponibile  1            1      0        0
La posizione è molto comoda per il treno e la metro.                  0            0      0        1
Ottima la disponibilitá del personale, e la struttura della stanza    0            1      1        0

Table 1: Examples of category detection (ACD).

Sentence                                             Aspect       POS  NEG
Il bagno andrebbe ristrutturato                      CLEANLINESS  0    0
Camera pulita e spaziosa.                            CLEANLINESS  1    0
Pulizia della camera non eccelsa.                    CLEANLINESS  0    1
Il bagno era pulito ma lasciava un po' a desiderare  CLEANLINESS  1    1

Table 2: Examples of polarity annotations with respect to the cleanliness aspect.

3 Dataset

The data source chosen for creating the ABSITA datasets is the popular website booking.com (https://www.booking.com). The platform allows users to share their opinions about the hotels they visited through a positive/negative textual review and a fine-grained rating system that can be used for assigning a score to each different aspect: cleanliness, comfort, facilities, staff, value for money, free/paid WiFi, location. Therefore, the website provides a large number of reviews in many languages.

We extracted the textual reviews in Italian, labeled on the website with one of the eight considered aspects. The dataset contains reviews left by users for hotels situated in several main Italian cities such as Rome, Milan, Naples, Turin, Bari, and more. We split the reviews into groups of sentences which describe the positive and the negative characteristics of the selected hotel. The reviews were collected between the 16th and the 17th of April 2018 using Scrapy (https://scrapy.org), a Python web crawler. We collected in total 4,121 distinct reviews in Italian.

The reviews have been manually checked to verify the annotation of the aspects provided by booking.com, and to add missing links between sentences and aspects. We started by annotating a small portion of the whole dataset split by sentences (250 randomly chosen sentences) using four annotators (the task organizers) in order to check the agreement of the annotation. For the ACD task, we asked the annotators to answer straightforward questions in the form of "Is aspect X mentioned in the sentence Y?" (Tab. 1). The set of Italian aspects is the direct translation of those of booking.com: PULIZIA (cleanliness), COMFORT, SERVIZI (amenities), STAFF, QUALITA-PREZZO (value), WIFI (wireless Internet connection) and POSIZIONE (location). Similarly, for the ACP subtask, the annotation is performed at sentence level, but with the set of aspects already provided by the ACD annotation, and check-boxes to indicate the positive and negative polarity of each aspect (Tab. 2). The result of the pilot annotation has been used to compute an inter-annotator agreement measure, in order to understand whether annotators could work independently of each other on different sets of sentences. We found agreement ranging from 82.8% to 100%, with an average value of 94.4%, obtained by counting the number of sentences annotated with the same label by all the annotators.

In order to complete the annotation, we assigned a different set of 1,000 reviews to each annotator (about 2,500 sentences on average). We split the dataset among the annotators so that each of them received a uniformly balanced distribution of positive and negative aspects, based on the scores provided by the original review platform. Incomplete, irrelevant, and incomprehensible sentences have been discarded from the dataset during the annotation. At the end of the annotation process, we obtained the gold standard dataset with the associations among sentence, sentiment and aspect. The entire annotation process took a few weeks to complete.

The positive and negative polarities are annotated independently, thus for each aspect the four sentiment combinations discussed in Section 2 are possible: positive, negative, neutral and mixed. The resulting classes are: cleanliness_positive, cleanliness_negative, comfort_positive, comfort_negative, amenities_positive, amenities_negative, staff_positive, staff_negative, value_positive, value_negative, wifi_positive, wifi_negative, location_positive, location_negative, other_positive, other_negative. For each aspect, the sentiment is encoded in two classes:

• negative = (*_positive = 0, *_negative = 1)
• positive = (*_positive = 1, *_negative = 0)
• neutral = (*_positive = 0, *_negative = 0)
• mixed = (*_positive = 1, *_negative = 1)

Please note that the special topic OTHER has been added for completeness, to annotate sentences with opinions on aspects not among the seven considered by the task. The aspect OTHER is provided additionally and is not part of the evaluation of the results provided for the task.

We released the data in Comma-separated Value (CSV) format with UTF-8 encoding and semicolon as separator. The first attribute is the id of the review. Note that in booking.com the order of positive and negative sentences is strictly defined, and this can make the task too easy. To overcome this issue, we randomly assign each sentence a new position in the review. As a consequence, the final positional id shown in the data file does not reflect the real order of the sentences in the review. The text of the sentence is provided at the end of the line and delimited by double quotes ("). It is preceded by three binary values for each aspect indicating respectively: the presence in the sentence (aspectX_presence: 0/1), the positive polarity for that aspect (aspectX_pos: 0/1) and finally the negative polarity (aspectX_neg: 0/1). Fig. 1 shows an example of the annotated dataset in the proposed format.

    sentence_id; aspect1_presence; aspect1_pos; aspect1_neg; ...; sentence
    201606240;0;0;0;0;0;0;0;0;0;0;0;0;1;1;0;0;0;0;1;1;0;"Considerato il prezzo e per una sola notte,va ..."
    201606241;1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;"Almeno i servizi igienici andrebbero rivisti e ..."
    201606242;0;0;0;1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;"La struttura purtroppo è vecchia e ci vorrebbero ..."

Figure 1: Sample of the annotated dataset in CSV format.

The list of the datasets released for the task is provided in Tab. 3 and the distribution of the sentences among aspects and polarities is provided in Tab. 4. The trial, training and test data account for 0.34%, 69.75% and 29.91% of the sentences, respectively. The datasets can be freely downloaded from http://sag.art.uniroma2.it/absita/ and reused in non-commercial projects and research. After the submission deadline, we also distributed the gold standard test set and the evaluation script.

Dataset       Description                                                                                                                    #Sentences
Trial set     Trial dataset containing a small set of features used for checking the format of the file.                                    30 (0.34% of total)
Training set  The dataset contains sentences provided for training. They have been selected using a random stratification of the whole dataset.  6,337 (69.75% of total)
Test set      The dataset contains sentences provided for testing. They contain sentences without the annotations of aspects.                2,718 (29.91% of total)

Table 3: List of datasets released for the ABSITA task at EVALITA 2018.

Dataset       clean_pos  comf_pos  amen_pos  staff_pos  value_pos  wifi_pos  loca_pos
Trial set     2          8         6         3          1          1         5
Training set  504        978       948       937        169        43        1,184
Test set      193        474       388       411        94         18        526

Dataset       clean_neg  comf_neg  amen_neg  staff_neg  value_neg  wifi_neg  loca_neg
Trial set     1          2         3         1          1          0         1
Training set  383        1,433     920       283        251        86        163
Test set      196        666       426       131        126        52        103

Table 4: Distribution of the sentences in the datasets among the aspects and polarities.

4 Evaluation measures and baselines

We evaluate the ACD and ACP subtasks separately by comparing the classifications provided by the participant systems to the gold standard annotations of the test set. For the ACD task, we compute Precision, Recall and F1-score, defined as $F1_a = \frac{2 P_a R_a}{P_a + R_a}$, where Precision ($P_a$) and Recall ($R_a$) are defined as $P_a = \frac{|S_a \cap G_a|}{|S_a|}$ and $R_a = \frac{|S_a \cap G_a|}{|G_a|}$. Here $S_a$ is the set of aspect category annotations that a system returned for all the test sentences, and $G_a$ is the set of the gold (correct) aspect category annotations. For instance, if a review is labeled in the gold standard with the two aspects $G_a$ = {CLEANLINESS, STAFF}, and the system predicts the two aspects $S_a$ = {CLEANLINESS, COMFORT}, we have that $|S_a \cap G_a| = 1$, $|G_a| = 2$ and $|S_a| = 2$, so that $P_a = \frac{1}{2}$, $R_a = \frac{1}{2}$ and $F1_a = \frac{1}{2}$. For the ACD task, the baseline is computed by considering a system which assigns the most frequent aspect category (estimated over the training set) to each sentence.

For the ACP task we evaluate the entire chain, thus considering the aspect categories detected in the sentences together with their corresponding polarity, in the form of (aspect, polarity) pairs. We again compute Precision, Recall and F1-score, now defined as $F1_p = \frac{2 P_p R_p}{P_p + R_p}$. Precision ($P_p$) and Recall ($R_p$) are defined as $P_p = \frac{|S_p \cap G_p|}{|S_p|}$ and $R_p = \frac{|S_p \cap G_p|}{|G_p|}$, where $S_p$ is the set of (aspect, polarity) pairs that a system returned for all the test sentences, and $G_p$ is the set of the gold (correct) pair annotations. For instance, if a review is labeled in the gold standard with the pairs $G_p$ = {(CLEANLINESS, POS), (STAFF, POS)}, and the system predicts the three pairs $S_p$ = {(CLEANLINESS, POS), (CLEANLINESS, NEG), (COMFORT, POS)}, we have that $|S_p \cap G_p| = 1$, $|G_p| = 2$ and $|S_p| = 3$, so that $P_p = \frac{1}{3}$, $R_p = \frac{1}{2}$ and $F1_p = 0.40$. For the ACP task, the baseline is computed by considering a system which assigns the most frequent (aspect, polarity) pair (estimated over the training set) to each sentence.

We produced separate rankings for the tasks, based on the F1 scores. Participants who submitted only the result of the ACD task appear in the first ranking only.

5 Results

We received submissions from several teams that participated in past editions of EVALITA, in particular in SENTIPOLC (Sentiment Polarity Classification (Barbieri et al., 2016)) and NEEL-it (Named Entity Recognition (Basile et al., 2016)), but also from some new entries in the community. In total, 20 runs were submitted by 7 teams comprising 11 individual participants. Each team was allowed to send up to 2 submissions. In particular, 12 runs were submitted to the ACD task and 8 runs to the ACP task.

We also provide the result of a baseline system that assigns to each instance the most frequent class in each task, i.e., the aspect (COMFORT) and polarity (positive) for that aspect, according to the frequency of classes in the training set. The results of the submissions for the two tasks, and of the baseline (namely mfc_baseline), are reported in Tab. 5 and Tab. 6. Of the seven teams who participated in the ACD task, five also participated in the ACP task.

System        Micro-P  Micro-R  Micro-F1
ItaliaNLP_1   0.8397   0.7837   0.8108
gw2017_1      0.8713   0.7504   0.8063
gw2017_2      0.8697   0.7481   0.8043
X2Check_gs    0.8626   0.7519   0.8035
UNIPV         0.8819   0.7378   0.8035
X2Check_w     0.8980   0.6937   0.7827
ItaliaNLP_2   0.8658   0.6970   0.7723
SeleneBianco  0.7902   0.7181   0.7524
VENSES_1      0.6232   0.6093   0.6162
VENSES_2      0.6164   0.6134   0.6149
ilc_2         0.5443   0.5418   0.5431
ilc_1         0.6213   0.4330   0.5104
mfc_baseline  0.4111   0.2866   0.3377

Table 5: Results of the submissions for the ACD subtask.

System        Micro-P  Micro-R  Micro-F1
ItaliaNLP_1   0.8264   0.7161   0.7673
UNIPV         0.8612   0.6562   0.7449
gw2017_2      0.7472   0.7186   0.7326
gw2017_1      0.7387   0.7206   0.7295
ItaliaNLP_2   0.8735   0.5649   0.6861
SeleneBianco  0.6869   0.5409   0.6052
ilc_2         0.4123   0.3125   0.3555
ilc_1         0.5452   0.2511   0.3439
mfc_baseline  0.2451   0.1681   0.1994

Table 6: Results of the submissions for the ACP subtask.

The results obtained by the teams largely outperform the baseline, demonstrating the efficacy of the proposed solutions and the feasibility of both tasks. The results obtained for the ACD task (Tab. 5) show a small range of variability, at least in the first part of the ranking (the top results are concentrated around an F1 score of 0.80). On the contrary, the values of precision and recall show higher variability, indicating significant differences among the proposed approaches.

6 Discussion

The teams of the ABSITA challenge have been invited to describe their solution in a technical report and to fill in a questionnaire, in order to gain an insight into their approaches and to support their replicability. Five systems (ItaliaNLP, gw2017, X2Check, UNIPV, SeleneBianco) are based on supervised machine learning; these are all the systems for which we have access to the implementation details, with the exception of VENSES, which is a rule-based unsupervised system. Among the systems that use supervised approaches, three (ItaliaNLP, gw2017, UNIPV) employ deep learning (in particular LSTM networks, often in their bi-directional variant).

All submitted runs can be considered "constrained runs", that is, the systems were trained on the provided data set only.

Beyond the training data, some systems employ different kinds of external resources. Among these, pre-trained word embeddings are used as word representations by UNIPV (Fasttext) and gw2017 (word embeddings provided by the SpaCy framework). The system of ItaliaNLP employs word embeddings created from the ItWaC corpus (Baroni et al., 2009) and from a corpus extracted from Booking.com.

Some of the systems are ABSA extensions built on top of custom or pre-existing NLP pipelines. This is the case for ItaliaNLP, VENSES and X2Check. Other systems make use of off-the-shelf NLP tools for preprocessing the data, such as SpaCy (gw2017, UNIPV) and Freeling (SeleneBianco).

Finally, additional resources used by the systems often include domain-specific or affective lexicons. ItaliaNLP employed the MPQA affective lexicon (Wilson et al., 2005), and further developed an affective lexicon from a large corpus of tweets by distant supervision. The UNIPV system makes use of the affective lexicon for Italian devel- [...]

[...]cussion. In particular, the ABSA (Aspect-based Sentiment Analysis) task concerns the association of a polarity (positive, negative, neutral/objective) with the piece of the sentence that refers to an aspect of interest. In ABSITA, we propose to automatically extract users' opinions about aspects in hotel reviews. The complexity of the task has been successfully addressed by the solutions submitted to the task. Systems that used supervised machine learning approaches, based on semantic and morphosyntactic feature representations of textual contents, demonstrate encouraging performance in the task. Good results have also been obtained using rule-based systems, even though they suffer from generalization issues and need to be tailored to the set of sentences to classify. The decision to use additional resources such as lexicons in conjunction with semantic word embeddings has proven successful.
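
The release format described in Section 3 (a sentence id, three binary values per aspect, then the quoted sentence, separated by semicolons) can be decoded with a few lines of Python. The following is a minimal sketch and not the organizers' tooling: the names ASPECTS, POLARITY and parse_row are hypothetical, and the aspect order is an assumption based on the order in which the classes are listed in Section 3. The sample rows are those of Figure 1.

```python
import csv
import io

# Assumption: aspects appear in the file in the order in which the paper
# lists the classes. The released files may name or order them differently.
ASPECTS = ["cleanliness", "comfort", "amenities", "staff",
           "value", "wifi", "location"]

# (pos, neg) flag pair -> polarity name, following the encoding of Section 3.
POLARITY = {(1, 0): "positive", (0, 1): "negative",
            (0, 0): "neutral", (1, 1): "mixed"}


def parse_row(fields):
    """Decode one data row into (sentence_id, {aspect: polarity}, text)."""
    sentence_id, flags, text = fields[0], fields[1:-1], fields[-1]
    if len(flags) != 3 * len(ASPECTS):
        raise ValueError(f"expected {3 * len(ASPECTS)} flags, got {len(flags)}")
    detected = {}
    for i, aspect in enumerate(ASPECTS):
        presence, pos, neg = (int(v) for v in flags[3 * i:3 * i + 3])
        if presence:  # aspects that are not detected carry no polarity
            detected[aspect] = POLARITY[(pos, neg)]
    return sentence_id, detected, text


# The three sample rows of Figure 1 (sentences are truncated in the paper).
SAMPLE = (
    '201606240;0;0;0;0;0;0;0;0;0;0;0;0;1;1;0;0;0;0;1;1;0;'
    '"Considerato il prezzo e per una sola notte,va ..."\n'
    '201606241;1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;'
    '"Almeno i servizi igienici andrebbero rivisti e ..."\n'
    '201606242;0;0;0;1;0;1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;'
    '"La struttura purtroppo è vecchia e ci vorrebbero ..."\n'
)

reader = csv.reader(io.StringIO(SAMPLE), delimiter=";", quotechar='"')
rows = [parse_row(fields) for fields in reader]
for sentence_id, detected, text in rows:
    print(sentence_id, detected, text)
```

Under the assumed aspect order, the first sample row decodes to a positive polarity for both value and location, which is at least consistent with a sentence that opens by mentioning the price.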
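
The measures of Section 4 reduce to set operations over the annotations returned by a system and those in the gold standard. The sketch below is illustrative only and is not the official evaluation script; the function name precision_recall_f1 and the review id "r1" are hypothetical. Annotations are represented as tuples, so the same function covers ACD, with (review, aspect) items, and ACP, with (review, aspect, polarity) items.

```python
def precision_recall_f1(system, gold):
    """Micro-averaged P, R and F1 between two sets of annotation tuples."""
    overlap = len(system & gold)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# ACD example from Section 4: one of two predicted aspects is correct.
gold_acd = {("r1", "CLEANLINESS"), ("r1", "STAFF")}
system_acd = {("r1", "CLEANLINESS"), ("r1", "COMFORT")}
acd = precision_recall_f1(system_acd, gold_acd)

# ACP example from Section 4: one of three predicted pairs is correct.
gold_acp = {("r1", "CLEANLINESS", "POS"), ("r1", "STAFF", "POS")}
system_acp = {("r1", "CLEANLINESS", "POS"),
              ("r1", "CLEANLINESS", "NEG"),
              ("r1", "COMFORT", "POS")}
acp = precision_recall_f1(system_acp, gold_acp)

print("ACD: P=%.2f R=%.2f F1=%.2f" % acd)
print("ACP: P=%.2f R=%.2f F1=%.2f" % acp)
```

Because the sets are pooled over all test sentences before the counts are taken, the scores are micro-averaged. For the ACP example the function yields P = 1/3, R = 1/2 and F1 = 0.4.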