Overview of the Evalita 2016 SENTIment POLarity Classification Task

Francesco Barbieri (Pompeu Fabra University, Spain) francesco.barbieri@upf.edu
Valerio Basile (Université Côte d'Azur, Inria, CNRS, I3S, France) valerio.basile@inria.fr
Danilo Croce (University of Rome "Tor Vergata", Italy) croce@info.uniroma2.it
Malvina Nissim (University of Groningen, The Netherlands) m.nissim@rug.nl
Nicole Novielli (University of Bari "A. Moro", Italy) nicole.novielli@uniba.it
Viviana Patti (University of Torino, Italy) patti@di.unito.it

Abstract

English. The SENTIment POLarity Classification Task 2016 (SENTIPOLC) is a rerun of the shared task on sentiment classification at the message level on Italian tweets, proposed for the first time in 2014 for the Evalita evaluation campaign. It includes three subtasks: subjectivity classification, polarity classification, and irony detection. In 2016 SENTIPOLC has again been the most participated EVALITA task, with a total of 57 submitted runs from 13 different teams. We present the datasets, which include an enriched annotation scheme for dealing with the impact of figurative language use on polarity, and the evaluation methodology, and we discuss results and participating systems.

Italiano. Descriviamo modalità e risultati della seconda edizione della campagna di valutazione di sistemi di sentiment analysis (SENTIment POLarity Classification Task), proposta nel contesto di "EVALITA 2016: Evaluation of NLP and Speech Tools for Italian". In SENTIPOLC è stata valutata la capacità dei sistemi di riconoscere diversi aspetti del sentiment espresso nei messaggi Twitter in lingua italiana, con un'articolazione in tre sottotask: subjectivity classification, polarity classification e irony detection. La campagna ha suscitato nuovamente grande interesse, con un totale di 57 run inviati da 13 gruppi di partecipanti.

1 Introduction

Sentiment classification on Twitter, namely detecting whether a tweet is polarised towards a positive or negative sentiment, is by now an established task. Such solid and growing interest is reflected in the fact that the Sentiment Analysis tasks at SemEval (where they now constitute a whole track) have attracted the highest number of participants in recent years (Rosenthal et al., 2014; Rosenthal et al., 2015; Nakov et al., 2016), and so it has been for the latest Evalita campaign, where a sentiment classification task (SENTIPOLC 2014) was introduced for the first time (Basile et al., 2014).

In addition to detecting the polarity of a tweet, it is also deemed important to detect whether a tweet is subjective or is merely reporting some fact, and whether some form of figurative mechanism, chiefly irony, is also present. Subjectivity, polarity, and irony detection form the three tasks of the SENTIPOLC 2016 campaign, which is a rerun of SENTIPOLC 2014.

Innovations with respect to SENTIPOLC 2014. While the three tasks are the same as those organised within SENTIPOLC 2014, we want to highlight the innovations included in this year's edition. First, we have introduced two new annotation fields which express literal polarity, to provide insights into the mechanisms behind polarity shifts in the presence of figurative usage. Second, the test data is still drawn from Twitter, but it is composed of a portion of random tweets and a portion of tweets selected via keywords which do not exactly match the selection procedure that led to the creation of the training set. This was done intentionally, in order to observe the portability of supervised systems, in line with what was observed in (Basile et al., 2015).
Third, a portion of the data was annotated via CrowdFlower rather than by experts. This has led to several observations on the quality of the data, and on the theoretical description of the task itself. Fourth, a portion of the test data overlaps with the test data from three other tasks at Evalita 2016, namely PoSTWITA (Bosco et al., 2016), NEEL-IT (Basile et al., 2016a), and FactA (Minard et al., 2016). This was meant to produce a layered annotated dataset on which end-to-end systems that address a variety of tasks can be fully developed and tested.

2 Task description

As in SENTIPOLC 2014, we have three tasks.

Task 1: Subjectivity Classification: a system must decide whether a given message is subjective or objective (Bruce and Wiebe, 1999; Pang and Lee, 2008).

Task 2: Polarity Classification: a system must decide whether a given message is of positive, negative, neutral or mixed sentiment. Differently from most SA tasks (chiefly the SemEval tasks) and in accordance with (Basile et al., 2014), in our data positive and negative polarities are not mutually exclusive and each is annotated as a binary category. A tweet can thus be at the same time positive and negative, yielding a mixed polarity, or neither positive nor negative, meaning it is a subjective statement with neutral polarity.[1] Section 3 provides further explanation and examples.

[1] In accordance with (Wiebe et al., 2005).

Task 3: Irony Detection: a system must decide whether a given message is ironic or not. Twitter communications include a high percentage of ironic messages (Davidov et al., 2010; Hao and Veale, 2010; González-Ibáñez et al., 2011; Reyes et al., 2013; Reyes and Rosso, 2014), and platforms monitoring the sentiment in Twitter messages have experienced the phenomenon of wrong polarity classification of ironic messages (Bosco et al., 2013; Ghosh et al., 2015). Indeed, ironic devices in a text can work as unexpected "polarity reversers" (one says something "good" to mean something "bad"), thus undermining systems' accuracy. In this sense, though not including a specific task on its detection, we have added an annotation layer of literal polarity (see Section 3.2), which could potentially be used by systems and which also allows us to observe patterns of irony.

The three tasks are meant to be independent. For example, a team could take part in the polarity classification task without tackling Task 1.

3 Development and Test Data

Data released for the shared task comes from different datasets. We re-used the whole SENTIPOLC 2014 dataset, and also added new tweets derived from different datasets previously developed for Italian. The dataset composition has been designed in cooperation with other Evalita 2016 tasks, in particular the Named Entity rEcognition and Linking in Italian Tweets shared task (NEEL-IT, Basile et al. (2016a)). The multiple layers of annotation are intended as a first step towards the long-term goal of enabling participants to develop end-to-end systems from entity linking to entity-based sentiment analysis (Basile et al., 2015). A portion of the data overlaps with data from NEEL-IT (Basile et al., 2016a), PoSTWITA (Bosco et al., 2016) and FactA (Minard et al., 2016). See (Basile et al., 2016b) for details.
3.1 Corpora Description

Both training and test data developed for the 2014 edition of the shared task were included as training data in the 2016 release. Summarising, the data used for this shared task is a collection of tweets partially derived from two existing corpora, namely SENTIPOLC 2014 (TW-SENTIPOLC14, 6421 tweets) (Basile et al., 2014) and TWitterBuonaScuola (TW-BS) (Stranisci et al., 2016), from which we selected 1500 tweets. Furthermore, two new sets have been annotated from scratch following the SENTIPOLC 2016 annotation scheme: the first one consists of 1500 tweets selected from the TWITA 2015 collection (TW-TWITA15, Basile and Nissim (2013)); the second one consists of 1000 tweets (reduced to 989 after eliminating malformed tweets) collected in the context of the NEEL-IT shared task (TW-NEELIT, Basile et al. (2016a)). The subsets of data extracted from existing corpora (TW-SENTIPOLC14 and TW-BS) have been revised according to the new annotation guidelines specifically devised for this task (see Section 3.3 for details).

Tweets in the datasets are marked with a "topic" tag. The training data includes both a political collection of tweets and a generic collection of tweets. The former has been extracted exploiting specific keywords and hashtags marking political topics (topic = 1 in the dataset), while the latter is composed of random tweets on any topic (topic = 0). The test material includes tweets from the TW-BS corpus, which were extracted with a specific socio-political topic (via hashtags and keywords related to #labuonascuola, different from the ones used to collect the training material). To mark the fact that such tweets focus on a different topic, they have been marked with topic = 2. While SENTIPOLC does not include any task which takes the "topic" information into account, we release it in case participants want to make use of it.

3.2 Annotation Scheme

The six fields containing values related to manual annotation are: subj, opos, oneg, iro, lpos, lneg. The annotation scheme applied in SENTIPOLC 2014 has been enriched with two new fields, lpos and lneg, which encode the literal positive and negative polarity of tweets, respectively. Even if SENTIPOLC does not include any task which involves the actual classification of literal polarity, this information is provided to enable participants to reason about the possible polarity inversion due to the use of figurative language in ironic tweets. Indeed, in the presence of a figurative reading, the literal polarity of a tweet might differ from the intended overall polarity of the text (expressed by opos and oneg).
Please note the following points about our annotation scheme:

• An objective tweet will not have any polarity nor irony: if subj = 0, then opos = 0, oneg = 0, iro = 0, lpos = 0, and lneg = 0.

• A subjective, non-ironic tweet can exhibit at the same time overall positive and negative polarity (mixed polarity), thus opos = 1 and oneg = 1 can co-exist. Mixed literal polarity might also be observed, so that lpos = 1 and lneg = 1 can co-exist; this is true for both non-ironic and ironic tweets.

• A subjective, non-ironic tweet can exhibit no specific polarity and be neutral but with a subjective flavour, thus subj = 1 with opos = 0 and oneg = 0. Neutral literal polarity might also be observed, so that lpos = 0 and lneg = 0 is a possible combination; this is true for both non-ironic and ironic tweets.

• An ironic tweet is always subjective and must have one defined overall polarity, so that iro = 1 cannot be combined with opos and oneg having the same value. However, mixed or neutral literal polarity can be observed for ironic tweets: iro = 1, lpos = 0, and lneg = 0 can co-exist, as well as iro = 1, lpos = 1, and lneg = 1.

• For subjective tweets without irony (iro = 0), the overall (opos and oneg) and the literal (lpos and lneg) polarities are always annotated consistently, i.e. opos = lpos and oneg = lneg. In such cases the literal polarity is derived automatically from the overall polarity and not annotated manually. The manual annotation of literal polarity only concerns tweets with iro = 1.

Table 1 summarises the allowed combinations.
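To make the constraints above concrete, here is a minimal sketch, written by us for illustration and not part of any official SENTIPOLC tooling; the function name and the six-value tuple layout are our own choices.

```python
# Illustrative check of the SENTIPOLC 2016 annotation constraints listed above.

def is_allowed(subj, opos, oneg, iro, lpos, lneg):
    """Return True if a (subj, opos, oneg, iro, lpos, lneg) tuple respects the scheme."""
    if subj == 0:
        # Objective tweets carry no polarity and no irony.
        return opos == oneg == iro == lpos == lneg == 0
    if iro == 1:
        # Ironic tweets must have exactly one overall polarity;
        # the literal polarity is unconstrained (it may be mixed or neutral).
        return opos != oneg
    # Subjective, non-ironic tweets: any overall polarity is allowed
    # (including mixed and neutral), but literal polarity mirrors it.
    return lpos == opos and lneg == oneg

# The thirteen combinations of Table 1 all pass this check, e.g.:
assert is_allowed(1, 1, 0, 1, 0, 1)      # ironic, positive overall, negative literal
assert not is_allowed(0, 1, 0, 0, 1, 0)  # a polarised tweet marked as objective
```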
Table 1: Combinations of values allowed by our annotation scheme. Each row gives the values of subj opos oneg iro lpos lneg, the description, and an explanatory tweet in Italian.

0 0 0 0 0 0 | objective | l'articolo di Roberto Ciccarelli dal manifesto di oggi http://fb.me/1BQVy5WAk
1 0 0 0 0 0 | subjective with neutral polarity and no irony | Primo passaggio alla #strabrollo ma secondo me non era un iscritto
1 1 0 0 1 0 | subjective with positive polarity and no irony | splendida foto di Fabrizio, pluri cliccata nei siti internazionali di Photo Natura http://t.co/GWoZqbxAuS
1 0 1 0 0 1 | subjective with negative polarity and no irony | Monti, ripensaci: l'inutile Torino-Lione inguaia l'Italia: Tav, appello a Mario Monti da Mercalli, Cicconi, Pont... http://t.co/3CazKS7Y
1 1 1 0 1 1 | subjective with both positive and negative polarity (mixed polarity) and no irony | Dati negativi da Confindustria che spera nel nuovo governo Monti. Castiglione: "Avanti con le riforme" http://t.co/kIKnbFY7
1 1 0 1 1 0 | subjective with positive polarity and an ironic twist | Questo governo Monti dei paschi di Siena sta cominciando a carburare; speriamo bene...
1 1 0 1 0 1 | subjective with positive polarity, an ironic twist, and negative literal polarity | Non riesco a trovare nani e ballerine nel governo Monti. Ci deve essere un errore! :)
1 0 1 1 0 1 | subjective with negative polarity and an ironic twist | Calderoli: Governo Monti? Banda Bassotti ..infatti loro erano quelli della Magliana.. #FullMonti #fuoritutti #piazzapulita
1 0 1 1 1 0 | subjective with negative polarity, an ironic twist, and positive literal polarity | Ho molta fiducia nel nuovo Governo Monti. Più o meno la stessa che ripongo in mia madre che tenta di inviare un'email.
1 1 0 1 0 0 | subjective with positive polarity, an ironic twist, and neutral literal polarity | Il vecchio governo paragonato al governo #monti sembra il cast di un film di lino banfi e Renzo montagnani rispetto ad uno di scorsese
1 0 1 1 0 0 | subjective with negative polarity, an ironic twist, and neutral literal polarity | arriva Mario #Monti: pronti a mettere tutti il grembiulino?
1 1 0 1 1 1 | subjective with positive polarity, an ironic twist, and mixed literal polarity | Non aspettare che il Governo Monti prenda anche i tuoi regali di Natale... Corri da noi, e potrai trovare IDEE REGALO a partire da 10e...
1 0 1 1 1 1 | subjective with negative polarity, an ironic twist, and mixed literal polarity | applauso freddissimo al Senato per Mario Monti. Ottimo.

3.3 Annotation procedure

Annotations for data from the existing corpora (TW-BS and TW-SENTIPOLC14) have been revised and completed by a group of six expert annotators, in order to make them compliant with the SENTIPOLC 2016 annotation scheme. Data from NEEL-IT and TWITA15 was annotated from scratch using CrowdFlower. Both training and test data therefore include a mixture of data annotated by experts and by the crowd. In particular, the whole TW-SENTIPOLC14 set has been included in the development data release, while TW-BS was included in the test data release. Moreover, a set of 500 tweets from the crowdsourced data was included in the test set, after a manual check and re-assessment (see below: Crowdsourced data: consolidation of annotations). This set contains the 300 tweets used as test data in the PoSTWITA, NEEL-IT and FactA EVALITA 2016 shared tasks.

TW-SENTIPOLC14. Data from the previous evaluation campaign did not include any distinction between literal and overall polarity. Therefore, the old tags pos and neg were automatically mapped onto the new labels opos and oneg, respectively, which indicate overall polarity. We then had to extend the annotation to provide labels for positive and negative literal polarity. For tweets without irony, literal polarity values were implied from the overall polarity. For ironic tweets, i.e. iro = 1 (806 tweets), we resorted to manual annotation: for each tweet, two independent annotations were provided for the literal polarity dimension. The inter-annotator agreement at this stage was κ = 0.538. In a second round, a third independent annotation was provided to resolve the disagreement. The final label was assigned by majority vote on each field independently. With three annotators, this procedure ensures an unambiguous result for every tweet.
TW-BS. The TW-BS section of the dataset had been previously annotated for polarity and irony.[2] The original TW-BS annotation scheme, however, did not provide any separate annotation for overall and literal polarity. The tags POS, NEG, MIXED, NONE, HUMPOS and HUMNEG in TW-BS were automatically mapped onto the following values for the SENTIPOLC subj, opos, oneg, iro, lpos and lneg annotation fields: POS ⇒ 110010; NEG ⇒ 101001; MIXED ⇒ 111011; NONE ⇒ 000000[3]; HUMPOS ⇒ 1101??; HUMNEG ⇒ 1011??. For the last two cases, i.e. where iro = 1, the same manual annotation procedure described above was applied to obtain the literal polarity values: two independent annotations were provided (inter-annotator agreement κ = 0.605), and a third annotation was added in a second round in cases of disagreement. Just as with the TW-SENTIPOLC14 set, the final label assignment was done by majority vote on each field.

[2] For the annotation process and inter-annotator agreement see (Stranisci et al., 2016).
[3] Two independent annotators reconsidered the set of tweets tagged as NONE in order to distinguish the few cases of subjective, neutral, non-ironic tweets, i.e. 100000, as the original TW-BS scheme did not allow such a finer distinction. The inter-annotator agreement on this task was measured as κ = 0.841, and a third independent annotation was used to solve the few cases of disagreement.
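For illustration only, the TW-BS conversion above can be written as a small lookup table; the tuple layout and the use of None for the "??" literal values (which were filled by the manual procedure just described) are our own choices, not part of the original conversion scripts.

```python
# Mapping of the original TW-BS labels onto the six SENTIPOLC fields
# (subj, opos, oneg, iro, lpos, lneg), as listed in the text.

TWBS_TO_SENTIPOLC = {
    "POS":    (1, 1, 0, 0, 1, 0),
    "NEG":    (1, 0, 1, 0, 0, 1),
    "MIXED":  (1, 1, 1, 0, 1, 1),
    "NONE":   (0, 0, 0, 0, 0, 0),        # later re-checked for the 100000 cases
    "HUMPOS": (1, 1, 0, 1, None, None),  # literal polarity added manually
    "HUMNEG": (1, 0, 1, 1, None, None),  # literal polarity added manually
}
```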
TW-TWITA15 and TW-NEEL-IT. For these new datasets, all fields were annotated from scratch using CrowdFlower (CF)[4], a crowdsourcing platform which has also recently been used for a similar annotation task (Nakov et al., 2016). CF enables quality control of the annotations across a number of dimensions, also by employing test questions to find and exclude unreliable annotators. We gave the users a series of guidelines in Italian, including a list of examples of tweets and their annotation according to the SENTIPOLC scheme. The guidelines also contained an explanation of the rules we followed for the annotation of the rest of the dataset, although in practice these constraints were not enforced in the CF interface. As requested by the platform, we provided a restricted set of "correct" answers to test the reliability of the users. This step proved to be challenging, since in many cases the annotation of at least one dimension is not clear cut. We required at least three independent judgments for each tweet. The total cost of the crowdsourcing was 55 USD, and we collected 9517 judgments in total from 65 workers. We adopted the default CF settings for assigning the majority label (relative majority). The CF reported average confidence (i.e., inter-rater agreement) is 0.79 for subjectivity, 0.89 for positive polarity (0.90 for literal positivity), 0.91 for negative polarity (0.93 for literal negativity) and 0.92 for irony. While such scores appear high, they are skewed towards the over-assignment of the "0" label for basically all classes (see below for further comments on this). Percentage agreement on the assignment of "1" is much lower (ranging from 0.70 to 0.77).[5] On the basis of such observations and of a first analysis of the resulting combinations, we operated a few revisions on the crowd-collected data.

[4] http://www.crowdflower.com/
[5] This would be taken into account if using Kappa, which is however an unsuitable measure in this context due to the varying number of annotators per instance.

Crowdsourced data: consolidation of annotations. Despite having provided the workers with guidelines, we identified a few cases of value combinations that were not allowed in our annotation scheme, e.g., ironic or polarised tweets (positive, negative or mixed) which were not marked as subjective. We automatically fixed the annotation for such cases, in order to release datasets containing only tweets annotated with labels consistent with the SENTIPOLC annotation scheme.[6]

Moreover, we applied a further manual check of the crowdsourced data, prompted by the following observations. When comparing the distributions of values (0,1) for each label in the training data and in the crowdsourced test data, we observed that, while the assignment of 1s constituted from 28% to 40% of all assignments for the opos/pos and oneg/neg labels, and about 68% for the subjectivity label, figures were much lower for the crowdsourced data, with percentages as low as 6 (neg), 9 (pos), 11 (oneg), and 17 (opos), and under 50% for subj.[7] This could be an indication of a more conservative interpretation of sentiment on the part of the crowd (note that 0 is also the default value), possibly also due to too few examples in the guidelines, and in any case to the intrinsic subjectivity of the task. On this basis, we decided to add two more expert annotations to the crowd-annotated test set, and to take the majority vote from crowd, expert1, and expert2. This does not erase the contribution of the crowd, but hopefully maximises consistency with the guidelines, in order to provide a solid evaluation benchmark for this task.

[6] In particular, for CF data we applied two automatic transformations to restore consistency of the annotated values in cases where we observed a violation of the scheme: when at least one value 1 is present in the fields opos, oneg, iro, lpos, or lneg, we set the field subj accordingly (subj=0 ⇒ subj=1); when iro=0, the literal polarity value is overwritten by the overall polarity value.
[7] The annotation of the presence of irony shows less distance, with 12% in the training set and 8% in the crowd-annotated test set.
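As an illustration of the two automatic repairs described in footnote [6], the following sketch shows the intended transformations. It assumes each annotation is a plain dict with the six integer fields; the function and field names are ours, not those of the actual consolidation scripts.

```python
# Sketch of the two automatic repairs applied to crowd-collected labels
# that violated the annotation scheme (see footnote [6]).

def repair(ann):
    # 1) Any polarity or irony value implies subjectivity.
    if any(ann[f] == 1 for f in ("opos", "oneg", "iro", "lpos", "lneg")):
        ann["subj"] = 1
    # 2) Without irony, literal polarity is defined to equal overall polarity.
    if ann["iro"] == 0:
        ann["lpos"] = ann["opos"]
        ann["lneg"] = ann["oneg"]
    return ann

fixed = repair({"subj": 0, "opos": 1, "oneg": 0, "iro": 0, "lpos": 0, "lneg": 0})
# fixed == {'subj': 1, 'opos': 1, 'oneg': 0, 'iro': 0, 'lpos': 1, 'lneg': 0}
```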
3.4 Format and Distribution

We provided participants with a single development set, consisting of a collection of 7,410 tweets with IDs and annotations concerning all three SENTIPOLC subtasks: subjectivity classification (subj), polarity classification (opos, oneg) and irony detection (iro). Including the two additional fields with respect to SENTIPOLC 2014, namely lpos and lneg, the final data format of the distribution is as follows: "id", "subj", "opos", "oneg", "iro", "lpos", "lneg", "top", "text".

The development data includes, for each tweet, the manual annotation for the subj, opos, oneg, iro, lpos and lneg fields, according to the format explained above. The blind version of the test data, which consists of 2,000 tweets, only contains values for the idtwitter and text fields. In other words, the development data contains the six manually annotated columns, while the test data contains values only in the first column (idtwitter) and the last two columns (top and text). The literal polarity might be predicted and used by participants to provide the final classification of the items in the test set; however, this should be specified in the submission phase. The distribution of combinations in both development and test data is given in Table 2.

Table 2: Distribution of value combinations (subj opos oneg iro lpos lneg) in the development and test data.
combination | dev | test
0 0 0 0 0 0 | 2,312 | 695
1 0 0 0 0 0 | 504 | 219
1 0 1 0 0 1 | 1,798 | 520
1 0 1 1 0 0 | 210 | 73
1 0 1 1 0 1 | 225 | 53
1 0 1 1 1 0 | 239 | 66
1 0 1 1 1 1 | 71 | 22
1 1 0 0 1 0 | 1,488 | 295
1 1 0 1 0 0 | 29 | 3
1 1 0 1 0 1 | 22 | 4
1 1 0 1 1 0 | 62 | 8
1 1 0 1 1 1 | 10 | 6
1 1 1 0 1 1 | 440 | 36
total | 7,410 | 2,000

4 Evaluation

Task 1: subjectivity classification. Systems are evaluated on the assignment of a 0 or 1 value to the subjectivity field. A response is considered plainly correct or wrong when compared to the gold standard annotation. We compute precision (p), recall (r) and F-score (F) for each class (subj, obj):

p_class = #correct_class / #assigned_class
r_class = #correct_class / #total_class
F_class = 2 * (p_class * r_class) / (p_class + r_class)

The overall F-score is the average of the F-scores for the subjective and objective classes.

Task 2: polarity classification. Our coding system allows for four combinations of opos and oneg values: 10 (positive polarity), 01 (negative polarity), 11 (mixed polarity), and 00 (no polarity). Accordingly, we evaluate positive and negative polarity independently, computing precision, recall and F-score for both classes (0 and 1) of the opos field and of the oneg field with the same formulas used for Task 1. The F-score for each polarity field is then the average of the F-scores of the respective class pair:

F_pos = (F_pos_0 + F_pos_1) / 2
F_neg = (F_neg_0 + F_neg_1) / 2

Finally, the overall F-score for Task 2 is given by the average of the F-scores of the two polarities.

Task 3: irony detection. Systems are evaluated on their assignment of a 0 or 1 value to the irony field. A response is considered fully correct or wrong when compared to the gold standard annotation. We measure precision, recall and F-score for each class (ironic, non-ironic), similarly to Task 1 but with different target classes. The overall F-score is the average of the F-scores for the ironic and non-ironic classes.

Informal evaluation of literal polarity classification. Our coding system allows for four combinations of positive (lpos) and negative (lneg) values for literal polarity, namely 10 (positive literal polarity), 01 (negative literal polarity), 11 (mixed literal polarity), and 00 (no polarity). SENTIPOLC does not include any task that explicitly evaluates literal polarity classification. However, participants could find it useful in developing their systems, and might learn to predict it. They could therefore choose to also submit this information, to receive an informal evaluation of the performance on these two fields, following the same evaluation criteria adopted for Task 2. The performance on literal polarity classification does not affect in any way the final ranks for the three SENTIPOLC tasks.
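The following minimal sketch re-implements the scoring described above, purely for illustration; it is not the official evaluation script, and the function names are our own.

```python
# Per-class precision/recall/F1 and the task-level F-scores described above.

def f1_per_class(gold, pred, cls):
    assigned = sum(1 for p in pred if p == cls)
    total = sum(1 for g in gold if g == cls)
    correct = sum(1 for g, p in zip(gold, pred) if g == p == cls)
    prec = correct / assigned if assigned else 0.0
    rec = correct / total if total else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def field_fscore(gold, pred):
    # Average of the F1 scores of classes 0 and 1, as used for Task 1,
    # Task 3, and each polarity field of Task 2.
    return (f1_per_class(gold, pred, 0) + f1_per_class(gold, pred, 1)) / 2

def task2_fscore(gold_pos, pred_pos, gold_neg, pred_neg):
    # Task 2 overall score: average over the opos and oneg field scores.
    return (field_fscore(gold_pos, pred_pos) + field_fscore(gold_neg, pred_neg)) / 2
```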
5 Participants and Results

A total of 13 teams from 6 different countries participated in at least one of the three tasks of SENTIPOLC. Table 3 provides an overview of the teams, their affiliation, their country (C) and the tasks they took part in.

Table 3: Teams participating in SENTIPOLC 2016.
team | institution | C | tasks
ADAPT | Adapt Centre | IE | T1,T2,T3
CoLingLab | CoLingLab, University of Pisa | IT | T2
CoMoDI | FICLIT, University of Bologna | IT | T3
INGEOTEC | CentroGEO/INFOTEC CONACyT | MX | T1,T2
IntIntUniba | University of Bari | IT | T2
IRADABE | Univer. Pol. de Valencia; Université de Paris | ES,FR | T1,T2,T3
ItaliaNLP | ItaliaNLP Lab, ILC (CNR) | IT | T1,T2,T3
samskara | LARI Lab, ILC CNR | IT | T1,T2
SwissCheese | Zurich University of Applied Sciences | CH | T1,T2,T3
tweet2check | Finsa s.p.a. | IT | T1,T2,T3
UniBO | University of Bologna | IT | T1,T2
UniPI | University of Pisa | IT | T1,T2
Unitor | University of Roma Tor Vergata | IT | T1,T2,T3

Almost all teams participated in both the subjectivity and polarity classification subtasks. Each team had to submit at least one constrained run. Furthermore, teams were allowed to submit up to four runs (two constrained and two unconstrained) in case they implemented different systems. Overall we received 19, 26 and 12 submitted runs for the subjectivity, polarity, and irony detection tasks, respectively. In particular, three teams (UniPI, Unitor and tweet2check) participated with both constrained and unconstrained runs in both the subjectivity and polarity subtasks. Unconstrained runs were submitted to the polarity subtask only by IntIntUniba.SentiPy and INGEOTEC.B4MSA. Differently from SENTIPOLC 2014, unconstrained systems performed better than constrained ones, with the only exception of UniPI, whose constrained system ranked first in the polarity classification subtask.

We produced a single ranking table for each subtask, where unconstrained runs are properly marked. Notice that we only use the final F-score for global scoring and ranking. However, systems that are ranked midway might have excelled in precision for a given class or scored very badly in recall for another.[8]

For each task, we ran a majority class baseline to set a lower bound for performance. In the tables it is always reported as Baseline.

[8] Detailed scores for all classes and tasks are available at http://www.di.unito.it/~tutreeb/sentipolc-evalita16/index.html

5.1 Task 1: subjectivity classification

Table 4 shows results for the subjectivity classification task, which attracted 19 total submissions from 10 different teams. The highest F-score is achieved by Unitor at 0.7444, which is also the best unconstrained performance. Among the constrained systems, the best F-score is achieved by samskara with F = 0.7184. All participating systems show an improvement over the baseline.

Table 4: Task 1: F-scores for constrained ".c" and unconstrained ".u" runs. After the deadline, two teams reported a conversion error from their internal format to the official one; the resubmitted amended runs are marked with *.
System | Obj | Subj | F
Unitor.1.u | 0.6784 | 0.8105 | 0.7444
Unitor.2.u | 0.6723 | 0.7979 | 0.7351
samskara.1.c | 0.6555 | 0.7814 | 0.7184
ItaliaNLP.2.c | 0.6733 | 0.7535 | 0.7134
IRADABE.2.c | 0.6671 | 0.7539 | 0.7105
INGEOTEC.1.c | 0.6623 | 0.7550 | 0.7086
Unitor.c | 0.6499 | 0.7590 | 0.7044
UniPI.1/2.c | 0.6741 | 0.7133 | 0.6937
UniPI.1/2.u | 0.6741 | 0.7133 | 0.6937
ItaliaNLP.1.c | 0.6178 | 0.7350 | 0.6764
ADAPT.c | 0.5646 | 0.7343 | 0.6495
IRADABE.1.c | 0.6345 | 0.6139 | 0.6242
tweet2check16.c | 0.4915 | 0.7557 | 0.6236
tweet2check14.c | 0.3854 | 0.7832 | 0.5843
tweet2check14.u | 0.3653 | 0.7940 | 0.5797
UniBO.1.c | 0.5997 | 0.5296 | 0.5647
UniBO.2.c | 0.5904 | 0.5201 | 0.5552
Baseline | 0.0000 | 0.7897 | 0.3949
*SwissCheese.c late | 0.6536 | 0.7748 | 0.7142
*tweet2check16.u late | 0.4814 | 0.7820 | 0.6317
5.2 Task 2: polarity classification

Table 5 shows results for polarity classification, the most popular subtask with 26 submissions from 12 teams. The highest F-score is achieved by UniPI at 0.6638, which is also the best score among the constrained runs. As for unconstrained runs, the best performance is achieved by Unitor with F = 0.6620. All participating systems show an improvement over the baseline.[9]

[9] After the deadline, SwissCheese and tweet2check reported a conversion error from their internal format to the official one. The resubmitted amended runs are shown in the tables (marked by the * symbol), but the official ranking was not revised.

Table 5: Task 2: F-scores for constrained ".c" and unconstrained ".u" runs. Amended runs are marked with *.
System | Pos | Neg | F
UniPI.2.c | 0.6850 | 0.6426 | 0.6638
Unitor.1.u | 0.6354 | 0.6885 | 0.6620
Unitor.2.u | 0.6312 | 0.6838 | 0.6575
ItaliaNLP.1.c | 0.6265 | 0.6743 | 0.6504
IRADABE.2.c | 0.6426 | 0.6480 | 0.6453
ItaliaNLP.2.c | 0.6395 | 0.6469 | 0.6432
UniPI.1.u | 0.6699 | 0.6146 | 0.6422
UniPI.1.c | 0.6766 | 0.6002 | 0.6384
Unitor.c | 0.6279 | 0.6486 | 0.6382
UniBO.1.c | 0.6708 | 0.6026 | 0.6367
IntIntUniba.c | 0.6189 | 0.6372 | 0.6281
IntIntUniba.u | 0.6141 | 0.6348 | 0.6245
UniBO.2.c | 0.6589 | 0.5892 | 0.6241
UniPI.2.u | 0.6586 | 0.5654 | 0.6120
CoLingLab.c | 0.5619 | 0.6579 | 0.6099
IRADABE.1.c | 0.6081 | 0.6111 | 0.6096
INGEOTEC.1.u | 0.5944 | 0.6205 | 0.6075
INGEOTEC.2.c | 0.6414 | 0.5694 | 0.6054
ADAPT.c | 0.5632 | 0.6461 | 0.6046
IntIntUniba.c | 0.5779 | 0.6296 | 0.6037
tweet2check16.c | 0.6153 | 0.5878 | 0.6016
tweet2check14.u | 0.5585 | 0.6300 | 0.5943
tweet2check14.c | 0.5660 | 0.6034 | 0.5847
samskara.1.c | 0.5198 | 0.6168 | 0.5683
Baseline | 0.4518 | 0.3808 | 0.4163
*SwissCheese.c late | 0.6529 | 0.7128 | 0.6828
*tweet2check16.u late | 0.6528 | 0.6373 | 0.6450

5.3 Task 3: irony detection

Table 6 shows results for the irony detection task, which attracted 12 submissions from 7 teams. The highest F-score was achieved by tweet2check at 0.5412 (constrained run). The only unconstrained runs were submitted by Unitor, with a best F-score of 0.4810. While all participating systems show an improvement over the baseline (F = 0.4688), many systems score very close to it, highlighting the complexity of the task.

Table 6: Task 3: F-scores for constrained ".c" and unconstrained ".u" runs. Amended runs are marked with *.
System | Non-Iro | Iro | F
tweet2check16.c | 0.9115 | 0.1710 | 0.5412
CoMoDI.c | 0.8993 | 0.1509 | 0.5251
tweet2check14.c | 0.9166 | 0.1159 | 0.5162
IRADABE.2.c | 0.9241 | 0.1026 | 0.5133
ItaliaNLP.1.c | 0.9359 | 0.0625 | 0.4992
ADAPT.c | 0.8042 | 0.1879 | 0.4961
IRADABE.1.c | 0.9259 | 0.0484 | 0.4872
Unitor.2.u | 0.9372 | 0.0248 | 0.4810
Unitor.c | 0.9358 | 0.0163 | 0.4761
Unitor.1.u | 0.9373 | 0.0084 | 0.4728
ItaliaNLP.2.c | 0.9367 | 0.0083 | 0.4725
Baseline | 0.9376 | 0.0000 | 0.4688
*SwissCheese.c late | 0.9355 | 0.1367 | 0.5361

6 Discussion

We compare the participating systems according to the following main dimensions: classification framework (approaches, algorithms, features), tweet representation strategy, exploitation of further annotated Twitter data for training, exploitation of available resources (e.g. sentiment lexicons, NLP tools), and issues concerning the interdependency of tasks for systems participating in several subtasks. Since we did not receive details about the systems adopted by some participants, i.e., tweet2check, ADAPT and UniBO, we do not include them in the following discussion. We do, however, consider tweet2check's results in the discussion regarding irony detection.

Approaches based on Convolutional Neural Networks (CNNs) have been investigated at SENTIPOLC this year for the first time, by a few teams. Most of the other teams adopted learning methods already investigated in SENTIPOLC 2014; in particular, Support Vector Machines (SVMs) are the most adopted learning algorithm. The SVM is generally based on specific linguistic/semantic feature engineering, as discussed for example by ItaliaNLP, IRADABE, INGEOTEC or CoLingLab. Other methods have also been used, such as a Bayesian approach by samskara (achieving good results in polarity recognition) combined with linguistically motivated feature modelling. CoMoDI is the only participant that adopted a rule-based approach, in combination with a rich set of linguistic cues dedicated to irony detection.
Tweet representation schemas. Almost all teams adopted (i) traditional manual feature engineering or (ii) distributional models (i.e. word embeddings) to represent tweets. The teams adopting strategy (i) make use of traditional feature modelling, as presented in SENTIPOLC 2014, using specific features that encode word-based, syntactic and semantic (mostly lexicon-based) information. In addition, micro-blogging-specific features such as emoticons and hashtags are also adopted, for example by CoLingLab, INGEOTEC or CoMoDI. The deep learning methods adopted by some teams, such as UniPI and SwissCheese, require modelling individual tweets through a geometrical representation, i.e. vectors. Words from individual tweets are represented through word embeddings, mostly derived using the Word2Vec tool or similar approaches. Unitor extends this representation with additional features derived from Distributional Polarity Lexicons. In addition, some teams (e.g. CoLingLab) adopted topic models to represent tweets. samskara also used feature modelling with a communicative and pragmatic value. CoMoDI is one of the few systems that investigated irony-specific features.

Exploitation of additional data for training. Some teams submitted unconstrained results, as they used additional annotated Twitter data for training their systems. In particular, UniPI used a silver standard corpus of more than 1M tweets to pre-train their CNN; this corpus is annotated using a polarity lexicon and specific polarised words. Unitor also used external tweets to pre-train their CNN: this corpus is made of the contexts of the tweets populating the training material, automatically annotated in a semi-supervised fashion using the classifier trained only over the training material. Moreover, Unitor used distant supervision to label a set of tweets used for the acquisition of their so-called Distributional Polarity Lexicon. Distant supervision is also adopted by INGEOTEC to extend the training material for their SVM classifier.

External resources. The majority of teams used external resources, such as lexicons specific to Sentiment Analysis tasks. Some teams used already existing lexicons, such as samskara, ItaliaNLP, CoLingLab and CoMoDI, while others created their own task-specific resources, such as Unitor, IRADABE and CoLingLab.

Issues about the interdependency of tasks. Among the systems participating in more than one task, SwissCheese and Unitor designed systems that exploit the interdependency of specific subtasks. In particular, SwissCheese trained one CNN for all the tasks simultaneously, by joining the labels. The results of their experiments indicate that the multi-task CNN outperforms the single-task CNN. Unitor made the training step dependent on the subtask, e.g. considering only subjective tweets when training the polarity classifier. However, it is difficult to assess the contribution of cross-task information based only on the experimental results obtained by the single teams.

Irony detection. As also observed at SENTIPOLC 2014, irony detection appears truly challenging, as even the best performing system, submitted by tweet2check (F = 0.5412), shows a low recall of 0.1710. We also observe that the performances of the supervised system developed by tweet2check and of CoMoDI's rule-based approach, specifically tailored for irony detection, are very similar (Table 6).

While the results seem to suggest that irony detection is the most difficult task, its complexity does not depend (only) on the inner structure of irony, but also on the unbalanced data distribution (1 out of 7 examples is ironic in the training set). The classifiers are thus biased towards the non-irony class, and tend to retrieve all the non-ironic examples (high recall on the non-irony class) instead of actually modelling irony. If we measure the number of correctly predicted examples instead of the average of the two classes, the systems perform well (the micro-averaged F1 of the best system is 0.82).
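To illustrate the macro- versus micro-averaging point made above, the following toy example (invented labels, not real system output) reproduces the effect on a class distribution where roughly one tweet in seven is ironic.

```python
# Toy illustration: macro-averaged F1 stays low for a classifier biased
# towards non-irony, while micro-F1 (plain accuracy for single-label
# binary classification) looks much better on a skewed distribution.

gold = [1] * 10 + [0] * 60           # ~14% ironic, roughly as in the training data
pred = [1] * 1 + [0] * 9 + [0] * 60  # classifier that almost never predicts irony

def prf(cls):
    tp = sum(g == p == cls for g, p in zip(gold, pred))
    prec = tp / max(sum(p == cls for p in pred), 1)
    rec = tp / max(sum(g == cls for g in gold), 1)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

macro_f1 = (prf(0) + prf(1)) / 2                                # ~0.56, pulled down by the rare ironic class
micro_f1 = sum(g == p for g, p in zip(gold, pred)) / len(gold)  # ~0.87, dominated by non-ironic tweets
print(round(macro_f1, 2), round(micro_f1, 2))
```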
Moreover, performance for irony detection drops significantly compared to SENTIPOLC 2014. An explanation for this could be that, unlike in SENTIPOLC 2014, at this edition the topics in the training and test sets are different, and it has been shown that systems might be modelling topic rather than irony (Barbieri et al., 2015). This evidence suggests that the examples are probably not sufficient to generalise over the structure of ironic tweets. We plan to run further experiments on this issue, including a larger and more balanced dataset of ironic tweets in future campaigns.

7 Closing Remarks

All systems, except CoMoDI, exploited machine learning techniques in a supervised setting. Two main strategies emerged. One involves using linguistically principled approaches to represent tweets and provide the learning framework with valuable information to converge to good results. The other exploits state-of-the-art learning frameworks in combination with word embedding methods over large-scale corpora of tweets. On balance, the latter approach achieved better results in the final ranks. However, with F-scores of 0.7444 (unconstrained) and 0.7184 (constrained) in subjectivity recognition, and 0.6638 (constrained) and 0.6620 (unconstrained) in polarity recognition, we are still far from having solved sentiment analysis on Twitter. For the future, we envisage the definition of novel approaches, for example combining neural network-based learning with a linguistically aware choice of features.

Besides modelling choices, data also matters. At this campaign we intentionally designed a test set with a sampling procedure that was close but not identical to that adopted for the training set (focusing again on political debates, but on a different topic), so as to have a means to test the generalisation power of the systems (Basile et al., 2015). A couple of teams indeed reported substantial drops from the development to the official test set (e.g. IRADABE), and we plan to further investigate this aspect in future work. Overall, the results confirm that sentiment analysis of micro-blogging text is challenging, mostly due to the subjective nature of the phenomenon, which is also reflected in the inter-annotator agreement (Section 3.3). Crowdsourced data for this task also proved to be not entirely reliable, but this requires a finer-grained analysis of the collected data, and further experiments including a stricter implementation of the guidelines.

Although evaluated over different data, we see that this year's best systems show better, albeit comparable, performance for subjectivity with respect to 2014's systems, and outperform them for polarity (if we consider late submissions). For a proper evaluation across the various editions, we propose the use of a progress set for the next edition, as already done in the SemEval campaign.
References

Francesco Barbieri, Francesco Ronzano, and Horacio Saggion. 2015. How Topic Biases Your Results? A Case Study of Sentiment Analysis and Irony Detection in Italian. In RANLP, Recent Advances in Natural Language Processing.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proc. of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107, Atlanta, Georgia, June.

Valerio Basile, Andrea Bolioli, Malvina Nissim, Viviana Patti, and Paolo Rosso. 2014. Overview of the Evalita 2014 SENTIment POLarity Classification Task. In Proc. of EVALITA 2014, pages 50–57, Pisa, Italy. Pisa University Press.

Pierpaolo Basile, Valerio Basile, Malvina Nissim, and Nicole Novielli. 2015. Deep tweets: from entity linking to sentiment analysis. In Proc. of CLiC-it 2015.

Pierpaolo Basile, Annalina Caputo, Anna Lisa Gentile, and Giuseppe Rizzo. 2016a. Overview of the EVALITA 2016 Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Pierpaolo Basile, Franco Cutugno, Malvina Nissim, Viviana Patti, and Rachele Sprugnoli. 2016b. EVALITA 2016: Overview of the 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Cristina Bosco, Viviana Patti, and Andrea Bolioli. 2013. Developing Corpora for Sentiment Analysis: The Case of Irony and Senti-TUT. IEEE Intelligent Systems, Special Issue on Knowledge-based Approaches to Content-level Sentiment Analysis, 28(2):55–63.

Cristina Bosco, Fabio Tamburini, Andrea Bolioli, and Alessandro Mazzei. 2016. Overview of the EVALITA 2016 Part Of Speech on TWitter for ITAlian Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Rebecca F. Bruce and Janyce M. Wiebe. 1999. Recognizing Subjectivity: A Case Study in Manual Tagging. Nat. Lang. Eng., 5(2):187–205, June.

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In Proc. of CoNLL '10, pages 107–116.

Aniruddha Ghosh, Guofu Li, Tony Veale, Paolo Rosso, Ekaterina Shutova, Antonio Reyes, and John Barnden. 2015. SemEval-2015 task 11: Sentiment analysis of figurative language in Twitter. In Proc. of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 470–475.

Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. 2011. Identifying sarcasm in Twitter: A closer look. In Proc. of ACL–HLT '11, pages 581–586, Portland, Oregon.

Yanfen Hao and Tony Veale. 2010. An ironic fist in a velvet glove: Creative mis-representation in the construction of ironic similes. Minds Mach., 20(4):635–650.

Anne-Lyse Minard, Manuela Speranza, and Tommaso Caselli. 2016. The EVALITA 2016 Event Factuality Annotation Task (FactA). In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 task 4: Sentiment analysis in Twitter. In Proc. of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1–18.

Bo Pang and Lillian Lee. 2008. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, January.

Antonio Reyes and Paolo Rosso. 2014. On the difficulty of automatically detecting irony: Beyond a simple case of negation. Knowl. Inf. Syst., 40(3):595–614.

Antonio Reyes, Paolo Rosso, and Tony Veale. 2013. A multidimensional approach for detecting irony in Twitter. Lang. Resour. Eval., 47(1):239–268, March.

Sara Rosenthal, Alan Ritter, Preslav Nakov, and Veselin Stoyanov. 2014. SemEval-2014 Task 9: Sentiment Analysis in Twitter. In Proc. of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 73–80, Dublin, Ireland, August.

Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif M. Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 Task 10: Sentiment Analysis in Twitter. In Proc. of the 9th International Workshop on Semantic Evaluation (SemEval 2015).

Marco Stranisci, Cristina Bosco, Delia Irazú Hernández Farías, and Viviana Patti. 2016. Annotating sentiment and irony in the online Italian political debate on #labuonascuola. In Proc. of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 2892–2899. ELRA.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 1(2).