Overview of the Evalita 2016 SENTIment POLarity Classification Task

Francesco Barbieri (Pompeu Fabra University, Spain) francesco.barbieri@upf.edu
Valerio Basile (Université Côte d'Azur, Inria, CNRS, I3S, France) valerio.basile@inria.fr
Danilo Croce (University of Rome "Tor Vergata", Italy) croce@info.uniroma2.it
Malvina Nissim (University of Groningen, The Netherlands) m.nissim@rug.nl
Nicole Novielli (University of Bari "A. Moro", Italy) nicole.novielli@uniba.it
Viviana Patti (University of Torino, Italy) patti@di.unito.it

Abstract

English. The SENTIment POLarity Classification Task 2016 (SENTIPOLC) is a rerun of the shared task on sentiment classification at the message level on Italian tweets, proposed for the first time in 2014 for the Evalita evaluation campaign. It includes three subtasks: subjectivity classification, polarity classification, and irony detection. In 2016 SENTIPOLC has again been the most participated EVALITA task, with a total of 57 submitted runs from 13 different teams. We present the datasets, which include an enriched annotation scheme for dealing with the impact of figurative language use on polarity, and the evaluation methodology, and we discuss results and participating systems.

Italiano. Descriviamo modalità e risultati della seconda edizione della campagna di valutazione di sistemi di sentiment analysis (SENTIment POLarity Classification Task), proposta nel contesto di "EVALITA 2016: Evaluation of NLP and Speech Tools for Italian". In SENTIPOLC è stata valutata la capacità dei sistemi di riconoscere diversi aspetti del sentiment espresso nei messaggi Twitter in lingua italiana, con un'articolazione in tre sottotask: subjectivity classification, polarity classification e irony detection. La campagna ha suscitato nuovamente grande interesse, con un totale di 57 run inviati da 13 gruppi di partecipanti.

1 Introduction

Sentiment classification on Twitter, namely detecting whether a tweet is polarised towards a positive or negative sentiment, is by now an established task. Such solid and growing interest is reflected in the fact that the Sentiment Analysis tasks at SemEval (where they now constitute a whole track) have attracted the highest number of participants in recent years (Rosenthal et al., 2014; Rosenthal et al., 2015; Nakov et al., 2016), and so it has been for the latest Evalita campaign, where a sentiment classification task (SENTIPOLC 2014) was introduced for the first time (Basile et al., 2014).

In addition to detecting the polarity of a tweet, it is also deemed important to detect whether a tweet is subjective or is merely reporting some fact, and whether some form of figurative mechanism, chiefly irony, is also present. Subjectivity, polarity, and irony detection form the three tasks of the SENTIPOLC 2016 campaign, which is a rerun of SENTIPOLC 2014.

Innovations with respect to SENTIPOLC 2014. While the three tasks are the same as those organised within SENTIPOLC 2014, we want to highlight the innovations included in this year's edition. First, we have introduced two new annotation fields which express literal polarity, to provide insights into the mechanisms behind polarity shifts in the presence of figurative usage. Second, the test data is still drawn from Twitter, but it is composed of a portion of random tweets and a portion of tweets selected via keywords which do not exactly match the selection procedure that led to the creation of the training set. This was done intentionally, in order to observe the portability of supervised systems, in line with what was observed in (Basile et al., 2015).
Third, a portion of the data was annotated via CrowdFlower rather than by experts. This has led to several observations on the quality of the data, and on the theoretical description of the task itself. Fourth, a portion of the test data overlaps with the test data from three other tasks at Evalita 2016, namely PoSTWITA (Bosco et al., 2016), NEEL-IT (Basile et al., 2016a), and FactA (Minard et al., 2016). This was meant to produce a layered annotated dataset on which end-to-end systems that address a variety of tasks can be fully developed and tested.

2 Task description

As in SENTIPOLC 2014, we have three tasks.

Task 1: Subjectivity Classification: a system must decide whether a given message is subjective or objective (Bruce and Wiebe, 1999; Pang and Lee, 2008).

Task 2: Polarity Classification: a system must decide whether a given message is of positive, negative, neutral or mixed sentiment. Differently from most SA tasks (chiefly the SemEval tasks) and in accordance with (Basile et al., 2014), in our data positive and negative polarities are not mutually exclusive and each is annotated as a binary category. A tweet can thus be at the same time positive and negative, yielding a mixed polarity, or neither positive nor negative, meaning it is a subjective statement with neutral polarity.[1] Section 3 provides further explanation and examples.

[1] In accordance with (Wiebe et al., 2005).

Task 3: Irony Detection: a system must decide whether a given message is ironic or not. Twitter communications include a high percentage of ironic messages (Davidov et al., 2010; Hao and Veale, 2010; González-Ibáñez et al., 2011; Reyes et al., 2013; Reyes and Rosso, 2014), and platforms monitoring the sentiment in Twitter messages have experienced the phenomenon of wrong polarity classification of ironic messages (Bosco et al., 2013; Ghosh et al., 2015). Indeed, ironic devices in a text can work as unexpected "polarity reversers" (one says something "good" to mean something "bad"), thus undermining systems' accuracy. In this sense, though not including a specific task on its detection, we have added an annotation layer of literal polarity (see Section 3.2), which could potentially be used by systems and which also allows us to observe patterns of irony.

The three tasks are meant to be independent. For example, a team could take part in the polarity classification task without tackling Task 1.

3 Development and Test Data

Data released for the shared task comes from different datasets. We re-used the whole SENTIPOLC 2014 dataset, and also added new tweets derived from different datasets previously developed for Italian. The dataset composition has been designed in cooperation with other Evalita 2016 tasks, in particular the Named Entity rEcognition and Linking in Italian Tweets shared task (NEEL-IT, Basile et al. (2016a)). The multiple layers of annotation are intended as a first step towards the long-term goal of enabling participants to develop end-to-end systems from entity linking to entity-based sentiment analysis (Basile et al., 2015). A portion of the data overlaps with data from NEEL-IT (Basile et al., 2016a), PoSTWITA (Bosco et al., 2016) and FactA (Minard et al., 2016). See (Basile et al., 2016b) for details.
3.1 Corpora Description

Both training and test data developed for the 2014 edition of the shared task were included as training data in the 2016 release. Summarising, the data used for this shared task is a collection of tweets partially derived from two existing corpora, namely SENTIPOLC 2014 (TW-SENTIPOLC14, 6421 tweets) (Basile et al., 2014) and TWitterBuonaScuola (TW-BS) (Stranisci et al., 2016), from which we selected 1500 tweets. Furthermore, two new sets have been annotated from scratch following the SENTIPOLC 2016 annotation scheme: the first one consists of 1500 tweets selected from the TWITA 2015 collection (TW-TWITA15, Basile and Nissim (2013)); the second one consists of 1000 tweets (reduced to 989 after eliminating malformed tweets) collected in the context of the NEEL-IT shared task (TW-NEELIT, Basile et al. (2016a)). The subsets of data extracted from existing corpora (TW-SENTIPOLC14 and TW-BS) have been revised according to the new annotation guidelines specifically devised for this task (see Section 3.3 for details).

Tweets in the datasets are marked with a "topic" tag. The training data includes both a political collection of tweets and a generic collection of tweets. The former has been extracted exploiting specific keywords and hashtags marking political topics (topic = 1 in the dataset), while the latter is composed of random tweets on any topic (topic = 0). The test material includes tweets from the TW-BS corpus, which were extracted with a specific socio-political topic (via hashtags and keywords related to #labuonascuola, different from the ones used to collect the training material). To mark the fact that such tweets focus on a different topic, they have been marked with topic = 2. While SENTIPOLC does not include any task which takes the "topic" information into account, we release it in case participants want to make use of it.

3.2 Annotation Scheme

The six fields containing values related to manual annotation are: subj, opos, oneg, iro, lpos, lneg. The annotation scheme applied in SENTIPOLC 2014 has been enriched with two new fields, lpos and lneg, which encode the literal positive and negative polarity of tweets, respectively. Even if SENTIPOLC does not include any task which involves the actual classification of literal polarity, this information is provided to enable participants to reason about the possible polarity inversion due to the use of figurative language in ironic tweets. Indeed, in the presence of a figurative reading, the literal polarity of a tweet might differ from the intended overall polarity of the text (expressed by opos and oneg).
Please note the following points about our annotation scheme:

• An objective tweet will not have any polarity nor irony: if subj = 0, then opos = 0, oneg = 0, iro = 0, lpos = 0, and lneg = 0.

• A subjective, non-ironic tweet can exhibit at the same time overall positive and negative polarity (mixed polarity), thus opos = 1 and oneg = 1 can co-exist. Mixed literal polarity might also be observed, so that lpos = 1 and lneg = 1 can co-exist; this is true for both non-ironic and ironic tweets.

• A subjective, non-ironic tweet can exhibit no specific polarity and be neutral but with a subjective flavour, thus subj = 1 with opos = 0 and oneg = 0. Neutral literal polarity might also be observed, so that lpos = 0 and lneg = 0 is a possible combination; this is true for both non-ironic and ironic tweets.

• An ironic tweet is always subjective and must have one defined overall polarity, so that iro = 1 cannot be combined with opos and oneg having the same value. However, mixed or neutral literal polarity can be observed for ironic tweets: iro = 1, lpos = 0, and lneg = 0 can co-exist, as well as iro = 1, lpos = 1, and lneg = 1.

• For subjective tweets without irony (iro = 0), the overall (opos and oneg) and the literal (lpos and lneg) polarities are always annotated consistently, i.e. opos = lpos and oneg = lneg. In such cases the literal polarity is derived automatically from the overall polarity and not annotated manually. The manual annotation of literal polarity only concerns tweets with iro = 1.

Table 1 summarises the allowed combinations.
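To make the constraints above concrete, here is a minimal sketch, written by us for illustration and not part of any official SENTIPOLC tooling; the function name and the six-value tuple layout are our own choices.

```python
# Illustrative check of the SENTIPOLC 2016 annotation constraints listed above.

def is_allowed(subj, opos, oneg, iro, lpos, lneg):
    """Return True if a (subj, opos, oneg, iro, lpos, lneg) tuple respects the scheme."""
    if subj == 0:
        # Objective tweets carry no polarity and no irony.
        return opos == oneg == iro == lpos == lneg == 0
    if iro == 1:
        # Ironic tweets must have exactly one overall polarity;
        # the literal polarity is unconstrained (it may be mixed or neutral).
        return opos != oneg
    # Subjective, non-ironic tweets: any overall polarity is allowed
    # (including mixed and neutral), but literal polarity mirrors it.
    return lpos == opos and lneg == oneg

# The thirteen combinations of Table 1 all pass this check, e.g.:
assert is_allowed(1, 1, 0, 1, 0, 1)      # ironic, positive overall, negative literal
assert not is_allowed(0, 1, 0, 0, 1, 0)  # a polarised tweet marked as objective
```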
Table 1: Combinations of values allowed by our annotation scheme. Each row gives the values of subj opos oneg iro lpos lneg, the description, and an explanatory tweet in Italian.

0 0 0 0 0 0 | objective | l'articolo di Roberto Ciccarelli dal manifesto di oggi http://fb.me/1BQVy5WAk
1 0 0 0 0 0 | subjective with neutral polarity and no irony | Primo passaggio alla #strabrollo ma secondo me non era un iscritto
1 1 0 0 1 0 | subjective with positive polarity and no irony | splendida foto di Fabrizio, pluri cliccata nei siti internazionali di Photo Natura http://t.co/GWoZqbxAuS
1 0 1 0 0 1 | subjective with negative polarity and no irony | Monti, ripensaci: l'inutile Torino-Lione inguaia l'Italia: Tav, appello a Mario Monti da Mercalli, Cicconi, Pont... http://t.co/3CazKS7Y
1 1 1 0 1 1 | subjective with both positive and negative polarity (mixed polarity) and no irony | Dati negativi da Confindustria che spera nel nuovo governo Monti. Castiglione: "Avanti con le riforme" http://t.co/kIKnbFY7
1 1 0 1 1 0 | subjective with positive polarity and an ironic twist | Questo governo Monti dei paschi di Siena sta cominciando a carburare; speriamo bene...
1 1 0 1 0 1 | subjective with positive polarity, an ironic twist, and negative literal polarity | Non riesco a trovare nani e ballerine nel governo Monti. Ci deve essere un errore! :)
1 0 1 1 0 1 | subjective with negative polarity and an ironic twist | Calderoli: Governo Monti? Banda Bassotti ..infatti loro erano quelli della Magliana.. #FullMonti #fuoritutti #piazzapulita
1 0 1 1 1 0 | subjective with negative polarity, an ironic twist, and positive literal polarity | Ho molta fiducia nel nuovo Governo Monti. Più o meno la stessa che ripongo in mia madre che tenta di inviare un'email.
1 1 0 1 0 0 | subjective with positive polarity, an ironic twist, and neutral literal polarity | Il vecchio governo paragonato al governo #monti sembra il cast di un film di lino banfi e Renzo montagnani rispetto ad uno di scorsese
1 0 1 1 0 0 | subjective with negative polarity, an ironic twist, and neutral literal polarity | arriva Mario #Monti: pronti a mettere tutti il grembiulino?
1 1 0 1 1 1 | subjective with positive polarity, an ironic twist, and mixed literal polarity | Non aspettare che il Governo Monti prenda anche i tuoi regali di Natale... Corri da noi, e potrai trovare IDEE REGALO a partire da 10e...
1 0 1 1 1 1 | subjective with negative polarity, an ironic twist, and mixed literal polarity | applauso freddissimo al Senato per Mario Monti. Ottimo.

3.3 Annotation procedure

Annotations for data from the existing corpora (TW-BS and TW-SENTIPOLC14) have been revised and completed by a group of six expert annotators, in order to make them compliant with the SENTIPOLC 2016 annotation scheme. Data from NEEL-IT and TWITA15 was annotated from scratch using CrowdFlower. Both training and test data therefore include a mixture of data annotated by experts and by the crowd. In particular, the whole TW-SENTIPOLC14 set has been included in the development data release, while TW-BS was included in the test data release. Moreover, a set of 500 tweets from the crowdsourced data was included in the test set, after a manual check and re-assessment (see below: Crowdsourced data: consolidation of annotations). This set contains the 300 tweets used as test data in the PoSTWITA, NEEL-IT and FactA EVALITA 2016 shared tasks.

TW-SENTIPOLC14. Data from the previous evaluation campaign did not include any distinction between literal and overall polarity. Therefore, the old tags pos and neg were automatically mapped onto the new labels opos and oneg, respectively, which indicate overall polarity. We then had to extend the annotation to provide labels for positive and negative literal polarity. For tweets without irony, literal polarity values were implied from the overall polarity. For ironic tweets, i.e. iro = 1 (806 tweets), we resorted to manual annotation: for each tweet, two independent annotations were provided for the literal polarity dimension. The inter-annotator agreement at this stage was κ = 0.538. In a second round, a third independent annotation was provided to resolve the disagreement. The final label was assigned by majority vote on each field independently. With three annotators, this procedure ensures an unambiguous result for every tweet.
TW-BS. The TW-BS section of the dataset had been previously annotated for polarity and irony.[2] The original TW-BS annotation scheme, however, did not provide any separate annotation for overall and literal polarity. The tags POS, NEG, MIXED, NONE, HUMPOS and HUMNEG in TW-BS were automatically mapped onto the following values for the SENTIPOLC subj, opos, oneg, iro, lpos and lneg annotation fields: POS ⇒ 110010; NEG ⇒ 101001; MIXED ⇒ 111011; NONE ⇒ 000000[3]; HUMPOS ⇒ 1101??; HUMNEG ⇒ 1011??. For the last two cases, i.e. where iro = 1, the same manual annotation procedure described above was applied to obtain the literal polarity values: two independent annotations were provided (inter-annotator agreement κ = 0.605), and a third annotation was added in a second round in cases of disagreement. Just as with the TW-SENTIPOLC14 set, the final label assignment was done by majority vote on each field.

[2] For the annotation process and inter-annotator agreement see (Stranisci et al., 2016).
[3] Two independent annotators reconsidered the set of tweets tagged as NONE in order to distinguish the few cases of subjective, neutral, non-ironic tweets, i.e. 100000, as the original TW-BS scheme did not allow such a finer distinction. The inter-annotator agreement on this task was measured as κ = 0.841, and a third independent annotation was used to solve the few cases of disagreement.
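For illustration only, the TW-BS conversion above can be written as a small lookup table; the tuple layout and the use of None for the "??" literal values (which were filled by the manual procedure just described) are our own choices, not part of the original conversion scripts.

```python
# Mapping of the original TW-BS labels onto the six SENTIPOLC fields
# (subj, opos, oneg, iro, lpos, lneg), as listed in the text.

TWBS_TO_SENTIPOLC = {
    "POS":    (1, 1, 0, 0, 1, 0),
    "NEG":    (1, 0, 1, 0, 0, 1),
    "MIXED":  (1, 1, 1, 0, 1, 1),
    "NONE":   (0, 0, 0, 0, 0, 0),        # later re-checked for the 100000 cases
    "HUMPOS": (1, 1, 0, 1, None, None),  # literal polarity added manually
    "HUMNEG": (1, 0, 1, 1, None, None),  # literal polarity added manually
}
```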
TW-TWITA15 and TW-NEEL-IT. For these new datasets, all fields were annotated from scratch using CrowdFlower (CF)[4], a crowdsourcing platform which has also recently been used for a similar annotation task (Nakov et al., 2016). CF enables quality control of the annotations across a number of dimensions, also by employing test questions to find and exclude unreliable annotators. We gave the users a series of guidelines in Italian, including a list of examples of tweets and their annotation according to the SENTIPOLC scheme. The guidelines also contained an explanation of the rules we followed for the annotation of the rest of the dataset, although in practice these constraints were not enforced in the CF interface. As requested by the platform, we provided a restricted set of "correct" answers to test the reliability of the users. This step proved to be challenging, since in many cases the annotation of at least one dimension is not clear cut. We required at least three independent judgments for each tweet. The total cost of the crowdsourcing was 55 USD, and we collected 9517 judgments in total from 65 workers. We adopted the default CF settings for assigning the majority label (relative majority). The CF reported average confidence (i.e., inter-rater agreement) is 0.79 for subjectivity, 0.89 for positive polarity (0.90 for literal positivity), 0.91 for negative polarity (0.93 for literal negativity) and 0.92 for irony. While such scores appear high, they are skewed towards the over-assignment of the "0" label for basically all classes (see below for further comments on this). Percentage agreement on the assignment of "1" is much lower (ranging from 0.70 to 0.77).[5] On the basis of such observations and of a first analysis of the resulting combinations, we operated a few revisions on the crowd-collected data.

[4] http://www.crowdflower.com/
[5] This would be taken into account if using Kappa, which is however an unsuitable measure in this context due to the varying number of annotators per instance.

Crowdsourced data: consolidation of annotations. Despite having provided the workers with guidelines, we identified a few cases of value combinations that were not allowed in our annotation scheme, e.g., ironic or polarised tweets (positive, negative or mixed) which were not marked as subjective. We automatically fixed the annotation for such cases, in order to release datasets containing only tweets annotated with labels consistent with the SENTIPOLC annotation scheme.[6]

Moreover, we applied a further manual check of the crowdsourced data, prompted by the following observations. When comparing the distributions of values (0,1) for each label in the training data and in the crowdsourced test data, we observed that, while the assignment of 1s constituted from 28% to 40% of all assignments for the opos/pos and oneg/neg labels, and about 68% for the subjectivity label, figures were much lower for the crowdsourced data, with percentages as low as 6 (neg), 9 (pos), 11 (oneg), and 17 (opos), and under 50% for subj.[7] This could be an indication of a more conservative interpretation of sentiment on the part of the crowd (note that 0 is also the default value), possibly also due to too few examples in the guidelines, and in any case to the intrinsic subjectivity of the task. On this basis, we decided to add two more expert annotations to the crowd-annotated test set, and to take the majority vote from crowd, expert1, and expert2. This does not erase the contribution of the crowd, but hopefully maximises consistency with the guidelines, in order to provide a solid evaluation benchmark for this task.

[6] In particular, for CF data we applied two automatic transformations to restore consistency of the annotated values in cases where we observed a violation of the scheme: when at least one value 1 is present in the fields opos, oneg, iro, lpos, or lneg, we set the field subj accordingly (subj=0 ⇒ subj=1); when iro=0, the literal polarity value is overwritten by the overall polarity value.
[7] The annotation of the presence of irony shows less distance, with 12% in the training set and 8% in the crowd-annotated test set.
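As an illustration of the two automatic repairs described in footnote [6], the following sketch shows the intended transformations. It assumes each annotation is a plain dict with the six integer fields; the function and field names are ours, not those of the actual consolidation scripts.

```python
# Sketch of the two automatic repairs applied to crowd-collected labels
# that violated the annotation scheme (see footnote [6]).

def repair(ann):
    # 1) Any polarity or irony value implies subjectivity.
    if any(ann[f] == 1 for f in ("opos", "oneg", "iro", "lpos", "lneg")):
        ann["subj"] = 1
    # 2) Without irony, literal polarity is defined to equal overall polarity.
    if ann["iro"] == 0:
        ann["lpos"] = ann["opos"]
        ann["lneg"] = ann["oneg"]
    return ann

fixed = repair({"subj": 0, "opos": 1, "oneg": 0, "iro": 0, "lpos": 0, "lneg": 0})
# fixed == {'subj': 1, 'opos': 1, 'oneg': 0, 'iro': 0, 'lpos': 1, 'lneg': 0}
```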
3.4 Format and Distribution

We provided participants with a single development set, consisting of a collection of 7,410 tweets with IDs and annotations concerning all three SENTIPOLC subtasks: subjectivity classification (subj), polarity classification (opos, oneg) and irony detection (iro). Including the two additional fields with respect to SENTIPOLC 2014, namely lpos and lneg, the final data format of the distribution is as follows: "id", "subj", "opos", "oneg", "iro", "lpos", "lneg", "top", "text".

The development data includes, for each tweet, the manual annotation for the subj, opos, oneg, iro, lpos and lneg fields, according to the format explained above. The blind version of the test data, which consists of 2,000 tweets, only contains values for the idtwitter and text fields. In other words, the development data contains the six manually annotated columns, while the test data contains values only in the first column (idtwitter) and the last two columns (top and text). The literal polarity might be predicted and used by participants to provide the final classification of the items in the test set; however, this should be specified in the submission phase. The distribution of combinations in both development and test data is given in Table 2.

Table 2: Distribution of value combinations (subj opos oneg iro lpos lneg) in the development and test data.
combination | dev | test
0 0 0 0 0 0 | 2,312 | 695
1 0 0 0 0 0 | 504 | 219
1 0 1 0 0 1 | 1,798 | 520
1 0 1 1 0 0 | 210 | 73
1 0 1 1 0 1 | 225 | 53
1 0 1 1 1 0 | 239 | 66
1 0 1 1 1 1 | 71 | 22
1 1 0 0 1 0 | 1,488 | 295
1 1 0 1 0 0 | 29 | 3
1 1 0 1 0 1 | 22 | 4
1 1 0 1 1 0 | 62 | 8
1 1 0 1 1 1 | 10 | 6
1 1 1 0 1 1 | 440 | 36
total | 7,410 | 2,000

4 Evaluation

Task 1: subjectivity classification. Systems are evaluated on the assignment of a 0 or 1 value to the subjectivity field. A response is considered plainly correct or wrong when compared to the gold standard annotation. We compute precision (p), recall (r) and F-score (F) for each class (subj, obj):

p_class = #correct_class / #assigned_class
r_class = #correct_class / #total_class
F_class = 2 * (p_class * r_class) / (p_class + r_class)

The overall F-score is the average of the F-scores for the subjective and objective classes.

Task 2: polarity classification. Our coding system allows for four combinations of opos and oneg values: 10 (positive polarity), 01 (negative polarity), 11 (mixed polarity), and 00 (no polarity). Accordingly, we evaluate positive and negative polarity independently, computing precision, recall and F-score for both classes (0 and 1) of the opos field and of the oneg field with the same formulas used for Task 1. The F-score for each polarity field is then the average of the F-scores of the respective class pair:

F_pos = (F_pos_0 + F_pos_1) / 2
F_neg = (F_neg_0 + F_neg_1) / 2

Finally, the overall F-score for Task 2 is given by the average of the F-scores of the two polarities.

Task 3: irony detection. Systems are evaluated on their assignment of a 0 or 1 value to the irony field. A response is considered fully correct or wrong when compared to the gold standard annotation. We measure precision, recall and F-score for each class (ironic, non-ironic), similarly to Task 1 but with different target classes. The overall F-score is the average of the F-scores for the ironic and non-ironic classes.

Informal evaluation of literal polarity classification. Our coding system allows for four combinations of positive (lpos) and negative (lneg) values for literal polarity, namely 10 (positive literal polarity), 01 (negative literal polarity), 11 (mixed literal polarity), and 00 (no polarity). SENTIPOLC does not include any task that explicitly evaluates literal polarity classification. However, participants could find it useful in developing their systems, and might learn to predict it. They could therefore choose to also submit this information, to receive an informal evaluation of the performance on these two fields, following the same evaluation criteria adopted for Task 2. The performance on literal polarity classification does not affect in any way the final ranks for the three SENTIPOLC tasks.
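The following minimal sketch re-implements the scoring described above, purely for illustration; it is not the official evaluation script, and the function names are our own.

```python
# Per-class precision/recall/F1 and the task-level F-scores described above.

def f1_per_class(gold, pred, cls):
    assigned = sum(1 for p in pred if p == cls)
    total = sum(1 for g in gold if g == cls)
    correct = sum(1 for g, p in zip(gold, pred) if g == p == cls)
    prec = correct / assigned if assigned else 0.0
    rec = correct / total if total else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def field_fscore(gold, pred):
    # Average of the F1 scores of classes 0 and 1, as used for Task 1,
    # Task 3, and each polarity field of Task 2.
    return (f1_per_class(gold, pred, 0) + f1_per_class(gold, pred, 1)) / 2

def task2_fscore(gold_pos, pred_pos, gold_neg, pred_neg):
    # Task 2 overall score: average over the opos and oneg field scores.
    return (field_fscore(gold_pos, pred_pos) + field_fscore(gold_neg, pred_neg)) / 2
```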
5 Participants and Results

A total of 13 teams from 6 different countries participated in at least one of the three tasks of SENTIPOLC. Table 3 provides an overview of the teams, their affiliation, their country (C) and the tasks they took part in.

Table 3: Teams participating in SENTIPOLC 2016.
team | institution | C | tasks
ADAPT | Adapt Centre | IE | T1,T2,T3
CoLingLab | CoLingLab, University of Pisa | IT | T2
CoMoDI | FICLIT, University of Bologna | IT | T3
INGEOTEC | CentroGEO/INFOTEC CONACyT | MX | T1,T2
IntIntUniba | University of Bari | IT | T2
IRADABE | Univer. Pol. de Valencia; Université de Paris | ES,FR | T1,T2,T3
ItaliaNLP | ItaliaNLP Lab, ILC (CNR) | IT | T1,T2,T3
samskara | LARI Lab, ILC CNR | IT | T1,T2
SwissCheese | Zurich University of Applied Sciences | CH | T1,T2,T3
tweet2check | Finsa s.p.a. | IT | T1,T2,T3
UniBO | University of Bologna | IT | T1,T2
UniPI | University of Pisa | IT | T1,T2
Unitor | University of Roma Tor Vergata | IT | T1,T2,T3

Almost all teams participated in both the subjectivity and polarity classification subtasks. Each team had to submit at least one constrained run. Furthermore, teams were allowed to submit up to four runs (two constrained and two unconstrained) in case they implemented different systems. Overall we received 19, 26 and 12 submitted runs for the subjectivity, polarity, and irony detection tasks, respectively. In particular, three teams (UniPI, Unitor and tweet2check) participated with both constrained and unconstrained runs in both the subjectivity and polarity subtasks. Unconstrained runs were submitted to the polarity subtask only by IntIntUniba.SentiPy and INGEOTEC.B4MSA. Differently from SENTIPOLC 2014, unconstrained systems performed better than constrained ones, with the only exception of UniPI, whose constrained system ranked first in the polarity classification subtask.

We produced a single ranking table for each subtask, where unconstrained runs are properly marked. Notice that we only use the final F-score for global scoring and ranking. However, systems that are ranked midway might have excelled in precision for a given class or scored very badly in recall for another.[8]

For each task, we ran a majority class baseline to set a lower bound for performance. In the tables it is always reported as Baseline.

[8] Detailed scores for all classes and tasks are available at http://www.di.unito.it/~tutreeb/sentipolc-evalita16/index.html

5.1 Task 1: subjectivity classification

Table 4 shows results for the subjectivity classification task, which attracted 19 total submissions from 10 different teams. The highest F-score is achieved by Unitor at 0.7444, which is also the best unconstrained performance. Among the constrained systems, the best F-score is achieved by samskara with F = 0.7184. All participating systems show an improvement over the baseline.

Table 4: Task 1: F-scores for constrained ".c" and unconstrained ".u" runs. After the deadline, two teams reported a conversion error from their internal format to the official one; the resubmitted amended runs are marked with *.
System | Obj | Subj | F
Unitor.1.u | 0.6784 | 0.8105 | 0.7444
Unitor.2.u | 0.6723 | 0.7979 | 0.7351
samskara.1.c | 0.6555 | 0.7814 | 0.7184
ItaliaNLP.2.c | 0.6733 | 0.7535 | 0.7134
IRADABE.2.c | 0.6671 | 0.7539 | 0.7105
INGEOTEC.1.c | 0.6623 | 0.7550 | 0.7086
Unitor.c | 0.6499 | 0.7590 | 0.7044
UniPI.1/2.c | 0.6741 | 0.7133 | 0.6937
UniPI.1/2.u | 0.6741 | 0.7133 | 0.6937
ItaliaNLP.1.c | 0.6178 | 0.7350 | 0.6764
ADAPT.c | 0.5646 | 0.7343 | 0.6495
IRADABE.1.c | 0.6345 | 0.6139 | 0.6242
tweet2check16.c | 0.4915 | 0.7557 | 0.6236
tweet2check14.c | 0.3854 | 0.7832 | 0.5843
tweet2check14.u | 0.3653 | 0.7940 | 0.5797
UniBO.1.c | 0.5997 | 0.5296 | 0.5647
UniBO.2.c | 0.5904 | 0.5201 | 0.5552
Baseline | 0.0000 | 0.7897 | 0.3949
*SwissCheese.c late | 0.6536 | 0.7748 | 0.7142
*tweet2check16.u late | 0.4814 | 0.7820 | 0.6317
5.2 Task 2: polarity classification

Table 5 shows results for polarity classification, the most popular subtask with 26 submissions from 12 teams. The highest F-score is achieved by UniPI at 0.6638, which is also the best score among the constrained runs. As for unconstrained runs, the best performance is achieved by Unitor with F = 0.6620. All participating systems show an improvement over the baseline.[9]

[9] After the deadline, SwissCheese and tweet2check reported a conversion error from their internal format to the official one. The resubmitted amended runs are shown in the tables (marked by the * symbol), but the official ranking was not revised.

Table 5: Task 2: F-scores for constrained ".c" and unconstrained ".u" runs. Amended runs are marked with *.
System | Pos | Neg | F
UniPI.2.c | 0.6850 | 0.6426 | 0.6638
Unitor.1.u | 0.6354 | 0.6885 | 0.6620
Unitor.2.u | 0.6312 | 0.6838 | 0.6575
ItaliaNLP.1.c | 0.6265 | 0.6743 | 0.6504
IRADABE.2.c | 0.6426 | 0.6480 | 0.6453
ItaliaNLP.2.c | 0.6395 | 0.6469 | 0.6432
UniPI.1.u | 0.6699 | 0.6146 | 0.6422
UniPI.1.c | 0.6766 | 0.6002 | 0.6384
Unitor.c | 0.6279 | 0.6486 | 0.6382
UniBO.1.c | 0.6708 | 0.6026 | 0.6367
IntIntUniba.c | 0.6189 | 0.6372 | 0.6281
IntIntUniba.u | 0.6141 | 0.6348 | 0.6245
UniBO.2.c | 0.6589 | 0.5892 | 0.6241
UniPI.2.u | 0.6586 | 0.5654 | 0.6120
CoLingLab.c | 0.5619 | 0.6579 | 0.6099
IRADABE.1.c | 0.6081 | 0.6111 | 0.6096
INGEOTEC.1.u | 0.5944 | 0.6205 | 0.6075
INGEOTEC.2.c | 0.6414 | 0.5694 | 0.6054
ADAPT.c | 0.5632 | 0.6461 | 0.6046
IntIntUniba.c | 0.5779 | 0.6296 | 0.6037
tweet2check16.c | 0.6153 | 0.5878 | 0.6016
tweet2check14.u | 0.5585 | 0.6300 | 0.5943
tweet2check14.c | 0.5660 | 0.6034 | 0.5847
samskara.1.c | 0.5198 | 0.6168 | 0.5683
Baseline | 0.4518 | 0.3808 | 0.4163
*SwissCheese.c late | 0.6529 | 0.7128 | 0.6828
*tweet2check16.u late | 0.6528 | 0.6373 | 0.6450

5.3 Task 3: irony detection

Table 6 shows results for the irony detection task, which attracted 12 submissions from 7 teams. The highest F-score was achieved by tweet2check at 0.5412 (constrained run). The only unconstrained runs were submitted by Unitor, with a best F-score of 0.4810. While all participating systems show an improvement over the baseline (F = 0.4688), many systems score very close to it, highlighting the complexity of the task.

Table 6: Task 3: F-scores for constrained ".c" and unconstrained ".u" runs. Amended runs are marked with *.
System | Non-Iro | Iro | F
tweet2check16.c | 0.9115 | 0.1710 | 0.5412
CoMoDI.c | 0.8993 | 0.1509 | 0.5251
tweet2check14.c | 0.9166 | 0.1159 | 0.5162
IRADABE.2.c | 0.9241 | 0.1026 | 0.5133
ItaliaNLP.1.c | 0.9359 | 0.0625 | 0.4992
ADAPT.c | 0.8042 | 0.1879 | 0.4961
IRADABE.1.c | 0.9259 | 0.0484 | 0.4872
Unitor.2.u | 0.9372 | 0.0248 | 0.4810
Unitor.c | 0.9358 | 0.0163 | 0.4761
Unitor.1.u | 0.9373 | 0.0084 | 0.4728
ItaliaNLP.2.c | 0.9367 | 0.0083 | 0.4725
Baseline | 0.9376 | 0.0000 | 0.4688
*SwissCheese.c late | 0.9355 | 0.1367 | 0.5361

6 Discussion

We compare the participating systems according to the following main dimensions: classification framework (approaches, algorithms, features), tweet representation strategy, exploitation of further annotated Twitter data for training, exploitation of available resources (e.g. sentiment lexicons, NLP tools), and issues concerning the interdependency of tasks for systems participating in several subtasks. Since we did not receive details about the systems adopted by some participants, i.e., tweet2check, ADAPT and UniBO, we do not include them in the following discussion. We do, however, consider tweet2check's results in the discussion regarding irony detection.

Approaches based on Convolutional Neural Networks (CNNs) have been investigated at SENTIPOLC this year for the first time, by a few teams. Most of the other teams adopted learning methods already investigated in SENTIPOLC 2014; in particular, Support Vector Machines (SVMs) are the most adopted learning algorithm. The SVM is generally based on specific linguistic/semantic feature engineering, as discussed for example by ItaliaNLP, IRADABE, INGEOTEC or CoLingLab. Other methods have also been used, such as a Bayesian approach by samskara (achieving good results in polarity recognition) combined with linguistically motivated feature modelling. CoMoDI is the only participant that adopted a rule-based approach, in combination with a rich set of linguistic cues dedicated to irony detection.
Tweet representation schemas. Almost all teams adopted (i) traditional manual feature engineering or (ii) distributional models (i.e. word embeddings) to represent tweets. The teams adopting strategy (i) make use of traditional feature modelling, as presented in SENTIPOLC 2014, using specific features that encode word-based, syntactic and semantic (mostly lexicon-based) information. In addition, micro-blogging-specific features such as emoticons and hashtags are also adopted, for example by CoLingLab, INGEOTEC or CoMoDI. The deep learning methods adopted by some teams, such as UniPI and SwissCheese, require modelling individual tweets through a geometrical representation, i.e. vectors. Words from individual tweets are represented through word embeddings, mostly derived using the Word2Vec tool or similar approaches. Unitor extends this representation with additional features derived from Distributional Polarity Lexicons. In addition, some teams (e.g. CoLingLab) adopted topic models to represent tweets. samskara also used feature modelling with a communicative and pragmatic value. CoMoDI is one of the few systems that investigated irony-specific features.

Exploitation of additional data for training. Some teams submitted unconstrained results, as they used additional annotated Twitter data for training their systems. In particular, UniPI used a silver standard corpus of more than 1M tweets to pre-train their CNN; this corpus is annotated using a polarity lexicon and specific polarised words. Unitor also used external tweets to pre-train their CNN: this corpus is made of the contexts of the tweets populating the training material, automatically annotated in a semi-supervised fashion using the classifier trained only over the training material. Moreover, Unitor used distant supervision to label a set of tweets used for the acquisition of their so-called Distributional Polarity Lexicon. Distant supervision is also adopted by INGEOTEC to extend the training material for their SVM classifier.

External resources. The majority of teams used external resources, such as lexicons specific to Sentiment Analysis tasks. Some teams used already existing lexicons, such as samskara, ItaliaNLP, CoLingLab and CoMoDI, while others created their own task-specific resources, such as Unitor, IRADABE and CoLingLab.

Issues about the interdependency of tasks. Among the systems participating in more than one task, SwissCheese and Unitor designed systems that exploit the interdependency of specific subtasks. In particular, SwissCheese trained one CNN for all the tasks simultaneously, by joining the labels. The results of their experiments indicate that the multi-task CNN outperforms the single-task CNN. Unitor made the training step dependent on the subtask, e.g. considering only subjective tweets when training the polarity classifier. However, it is difficult to assess the contribution of cross-task information based only on the experimental results obtained by the single teams.

Irony detection. As also observed at SENTIPOLC 2014, irony detection appears truly challenging, as even the best performing system, submitted by tweet2check (F = 0.5412), shows a low recall of 0.1710. We also observe that the performances of the supervised system developed by tweet2check and of CoMoDI's rule-based approach, specifically tailored for irony detection, are very similar (Table 6).

While the results seem to suggest that irony detection is the most difficult task, its complexity does not depend (only) on the inner structure of irony, but also on the unbalanced data distribution (1 out of 7 examples is ironic in the training set). The classifiers are thus biased towards the non-irony class, and tend to retrieve all the non-ironic examples (high recall on the non-irony class) instead of actually modelling irony. If we measure the number of correctly predicted examples instead of the average of the two classes, the systems perform well (the micro-averaged F1 of the best system is 0.82).
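To illustrate the macro- versus micro-averaging point made above, the following toy example (invented labels, not real system output) reproduces the effect on a class distribution where roughly one tweet in seven is ironic.

```python
# Toy illustration: macro-averaged F1 stays low for a classifier biased
# towards non-irony, while micro-F1 (plain accuracy for single-label
# binary classification) looks much better on a skewed distribution.

gold = [1] * 10 + [0] * 60           # ~14% ironic, roughly as in the training data
pred = [1] * 1 + [0] * 9 + [0] * 60  # classifier that almost never predicts irony

def prf(cls):
    tp = sum(g == p == cls for g, p in zip(gold, pred))
    prec = tp / max(sum(p == cls for p in pred), 1)
    rec = tp / max(sum(g == cls for g in gold), 1)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

macro_f1 = (prf(0) + prf(1)) / 2                                # ~0.56, pulled down by the rare ironic class
micro_f1 = sum(g == p for g, p in zip(gold, pred)) / len(gold)  # ~0.87, dominated by non-ironic tweets
print(round(macro_f1, 2), round(micro_f1, 2))
```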
Moreover, performance for irony detection drops significantly compared to SENTIPOLC 2014. An explanation for this could be that, unlike in SENTIPOLC 2014, at this edition the topics in the training and test sets are different, and it has been shown that systems might be modelling topic rather than irony (Barbieri et al., 2015). This evidence suggests that the examples are probably not sufficient to generalise over the structure of ironic tweets. We plan to run further experiments on this issue, including a larger and more balanced dataset of ironic tweets in future campaigns.

7 Closing Remarks

All systems, except CoMoDI, exploited machine learning techniques in a supervised setting. Two main strategies emerged. One involves using linguistically principled approaches to represent tweets and provide the learning framework with valuable information to converge to good results. The other exploits state-of-the-art learning frameworks in combination with word embedding methods over large-scale corpora of tweets. On balance, the latter approach achieved better results in the final ranks. However, with F-scores of 0.7444 (unconstrained) and 0.7184 (constrained) in subjectivity recognition, and 0.6638 (constrained) and 0.6620 (unconstrained) in polarity recognition, we are still far from having solved sentiment analysis on Twitter. For the future, we envisage the definition of novel approaches, for example combining neural network-based learning with a linguistically aware choice of features.

Besides modelling choices, data also matters. At this campaign we intentionally designed a test set with a sampling procedure that was close but not identical to that adopted for the training set (focusing again on political debates, but on a different topic), so as to have a means to test the generalisation power of the systems (Basile et al., 2015). A couple of teams indeed reported substantial drops from the development to the official test set (e.g. IRADABE), and we plan to further investigate this aspect in future work. Overall, the results confirm that sentiment analysis of micro-blogging text is challenging, mostly due to the subjective nature of the phenomenon, which is also reflected in the inter-annotator agreement (Section 3.3). Crowdsourced data for this task also proved to be not entirely reliable, but this requires a finer-grained analysis of the collected data, and further experiments including a stricter implementation of the guidelines.

Although evaluated over different data, we see that this year's best systems show better, albeit comparable, performance for subjectivity with respect to 2014's systems, and outperform them for polarity (if we consider late submissions). For a proper evaluation across the various editions, we propose the use of a progress set for the next edition, as already done in the SemEval campaign.
References

Francesco Barbieri, Francesco Ronzano, and Horacio Saggion. 2015. How Topic Biases Your Results? A Case Study of Sentiment Analysis and Irony Detection in Italian. In RANLP, Recent Advances in Natural Language Processing.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proc. of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107, Atlanta, Georgia, June.

Valerio Basile, Andrea Bolioli, Malvina Nissim, Viviana Patti, and Paolo Rosso. 2014. Overview of the Evalita 2014 SENTIment POLarity Classification Task. In Proc. of EVALITA 2014, pages 50–57, Pisa, Italy. Pisa University Press.

Pierpaolo Basile, Valerio Basile, Malvina Nissim, and Nicole Novielli. 2015. Deep tweets: from entity linking to sentiment analysis. In Proc. of CLiC-it 2015.

Pierpaolo Basile, Annalina Caputo, Anna Lisa Gentile, and Giuseppe Rizzo. 2016a. Overview of the EVALITA 2016 Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Pierpaolo Basile, Franco Cutugno, Malvina Nissim, Viviana Patti, and Rachele Sprugnoli. 2016b. EVALITA 2016: Overview of the 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Cristina Bosco, Viviana Patti, and Andrea Bolioli. 2013. Developing Corpora for Sentiment Analysis: The Case of Irony and Senti-TUT. IEEE Intelligent Systems, Special Issue on Knowledge-based Approaches to Content-level Sentiment Analysis, 28(2):55–63.

Cristina Bosco, Fabio Tamburini, Andrea Bolioli, and Alessandro Mazzei. 2016. Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Rebecca F. Bruce and Janyce M. Wiebe. 1999. Recognizing Subjectivity: A Case Study in Manual Tagging. Nat. Lang. Eng., 5(2):187–205, June.

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In Proc. of CoNLL '10, pages 107–116.

Aniruddha Ghosh, Guofu Li, Tony Veale, Paolo Rosso, Ekaterina Shutova, Antonio Reyes, and John Barnden. 2015. SemEval-2015 task 11: Sentiment analysis of figurative language in Twitter. In Proc. of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 470–475.

Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. 2011. Identifying sarcasm in Twitter: A closer look. In Proc. of ACL–HLT '11, pages 581–586, Portland, Oregon.

Yanfen Hao and Tony Veale. 2010. An ironic fist in a velvet glove: Creative mis-representation in the construction of ironic similes. Minds Mach., 20(4):635–650.

Anne-Lyse Minard, Manuela Speranza, and Tommaso Caselli. 2016. The EVALITA 2016 Event Factuality Annotation Task (FactA). In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 task 4: Sentiment analysis in Twitter. In Proc. of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1–18.

Bo Pang and Lillian Lee. 2008. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, January.

Antonio Reyes and Paolo Rosso. 2014. On the difficulty of automatically detecting irony: Beyond a simple case of negation. Knowl. Inf. Syst., 40(3):595–614.

Antonio Reyes, Paolo Rosso, and Tony Veale. 2013. A multidimensional approach for detecting irony in Twitter. Lang. Resour. Eval., 47(1):239–268, March.

Sara Rosenthal, Alan Ritter, Preslav Nakov, and Veselin Stoyanov. 2014. SemEval-2014 Task 9: Sentiment Analysis in Twitter. In Proc. of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 73–80, Dublin, Ireland, August.

Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif M. Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 Task 10: Sentiment Analysis in Twitter. In Proc. of the 9th International Workshop on Semantic Evaluation (SemEval 2015).

Marco Stranisci, Cristina Bosco, Delia Irazú Hernández Farías, and Viviana Patti. 2016. Annotating sentiment and irony in the online Italian political debate on #labuonascuola. In Proc. of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 2892–2899. ELRA.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 1(2).