Overview of the EVALITA 2018 Task on Irony Detection in Italian Tweets
                            (IronITA)
           Alessandra Teresa Cignarella            Valerio Basile, Cristina Bosco
                   Simona Frenda                             Viviana Patti
            Dipartimento di Informatica              Dipartimento di Informatica
       Università degli Studi di Torino, Italy   Università degli Studi di Torino, Italy
              PRHLT Research Center              {basile,bosco,patti}@di.unito.it
      Universitat Politècnica de València, Spain
             {cigna,frenda}@di.unito.it

                                            Paolo Rosso
                                      PRHLT Research Center
                              Universitat Politècnica de València, Spain
                                           prosso@dsic.upv.es


                     Abstract                           intention, it hinders correct sentiment analysis of
                                                        texts and, therefore, correct opinion mining. In-
    English. IronITA is a new shared task in            deed, the presence of ironic devices in a text can
    the EVALITA 2018 evaluation campaign,               work as an unexpected “polarity reverser” (one
    focused on the automatic classification of          says something “good” to mean something “bad”),
    irony in Italian texts from Twitter. It in-         thus undermining systems’ accuracy.
    cludes two tasks: 1) irony detection and 2)
                                                           Considering the majority of state-of-the-art
    detection of different types of irony, with
                                                        studies in computational linguistics, irony is of-
    a special focus on sarcasm identification.
                                                        ten used as an umbrella-term which includes
    We received 17 submissions for the first
                                                        satire, sarcasm and parody due to fuzzy bound-
    task and 7 submissions for the second task
                                                        aries among them (Marchetti et al., 2007). How-
    from 7 teams.
                                                        ever, some linguistic studies focused on sarcasm,
    Italiano. IronITA è un nuovo esercizio              a particular type of verbal irony defined in Gibbs
    di valutazione della campagna di val-               (2000) as “a sharp or cutting ironic expression
    utazione EVALITA 2018, specificamente               with the intent to convey scorn or insult”. Other
    dedicato alla classificazione automatica            scholars concentrated on cognitive aspects related
    dell’ironia presente in testi estratti da           on how such figurative expressions are processed
    Twitter. Comprende due task: 1) ri-                 in the brain, focusing on key aspects influencing
    conoscimento dell’ironia e 2) riconosci-            processing (see for instance the “defaultness” hy-
    mento di diversi tipi di ironia, con partico-       pothesis presented in Giora et al. (2018)).
    lare attenzione all’identificazione del sar-           The importance to detect irony and sarcasm is
    casmo. Abbiamo ricevuto 17 sottomissioni            also very relevant for reaching better predictions
    per il primo task e 7 per il secondo, da            in Sentiment Analysis, for instance, what are the
    parte di 7 gruppi partecipanti.                     real opinion and orientation of users about a spe-
                                                        cific subject (product, service, topic, issue, person,
1   Introduction
                                                        organization, or event).
Irony is a figurative language device that conveys         IronITA is organized in continuity with previ-
the opposite of literal meaning, profiling intention-   ous shared tasks of the past years within the con-
ally a secondary or extended meaning. Users on          text of the EVALITA evaluation campaign (see
the web usually tend to use irony like a creative       for instance the irony detection subtask proposed
device to express their thoughts in short-texts like    at SENTIPOLC in the 2014 and 2016 editions
tweets, reviews, posts or commentaries. But irony,      (Basile et al., 2014; Barbieri et al., 2016)). It is
as well as other figurative language devices, for       also inspired by the recent experience within the
example metaphors, is very difficult to deal with       SemEval2018-Task3 Irony detection in English
automatically. For its traits of recalling another      tweets (Van Hee et al., 2018). The shared task
meaning or obfuscating the real communicative           we propose for Italian is specifically dedicated to
irony detection taking into account both the classi-     ganizers. On the other hand, the participant teams
cal binary classification task (irony vs not irony),     are encouraged to train their systems on additional
and a related subtask, which gives to participants       annotated data and submit the resulting uncon-
the possibility to reason on different types of irony.   strained runs.
Differently from SemEval2018-Task3, we indeed               We implemented two straightforward baseline
ask the participants to distinguish sarcasm as a         systems for the task. baseline-mfc (Most Fre-
specific type of irony. This is motivated by the         quent Class) assigns to each instance the majority
growing interest for detecting sarcasm, which is         class of the respective task, namely not-ironic
characterized by sharp tones and aggressive inten-       for task A and not-sarcastic for task B.
tion (Gibbs, 2000; Joshi et al., 2017; Sulis et al.,     baseline-random assigns uniformly random values
2016) often present in interesting domains such as       to the instances. Note that for task A, a class is as-
politics and hate speech (Sanguinetti et al., 2018).     signed randomly to every instance, while for task
                                                         B the classes are assigned randomly only to eligi-
2   Task Description                                     ble tweets who are marked ironic.
The task consists in automatically annotating mes-
                                                         3     Training and Test Data
sages from Twitter for irony and sarcasm. It is or-
ganized in a main task (Task A) centered on irony,       3.1    Composition of the datasets
and a second task (Task B) centered on sarcasm,          The data released for the shared task come from
whose results will be separately evaluated. Partic-      different source datasets, namely: Hate Speech
ipation was allowed to both the tasks (Task A and        Corpus (HSC) (Sanguinetti et al., 2018) and the
Task B) or to Task A only.                               TWITTIRÒ corpus (Cignarella et al., 2018), com-
Task A: Irony Detection. Task A consists in a            posed of tweets from LaBuonaScuola corpus (TW-
two-class (or binary) classification where systems       BS ) (Stranisci et al., 2016), Sentipolc corpus ( TW-
have to predict whether a tweet is ironic or not.        SENTIPOLC ), Spinoza corpus ( TW- SPINO ) (Barbi-
                                                         eri et al., 2016).
Task B: Different types of irony with special fo-
                                                            In the test data we have the same sources, and
cus on sarcasm identification. Sarcasm has been
                                                         in addition some tweets from the TWITA collec-
recognized in Bowes and Katz (2011) with a spe-
                                                         tion, that were annotated by the organizers of the
cific target to attack (Attardo, 2007; Dynel, 2014),
                                                         SENTIPOLC 2016 shared task, but were not ex-
more offensive and delivered with a cutting tone
                                                         ploited during the 2016 campaign (Barbieri et al.,
(rarely ambiguous). According to Lee and Katz
                                                         2016).
(1998) hearers perceive aggressiveness as the fea-
ture that distinguishes sarcasm. Provided a defini-      3.2    Annotation of the datasets
tion of sarcasm as a specific type of irony, Task B
                                                         The annotation process involved four Italian na-
consists in a multi-class classification where sys-
                                                         tive speakers and focused only on the finer-grained
tems have to predict one out of the three following
                                                         annotation of sarcasm in the ironic tweets, since
labels: i) sarcasm, ii) irony not categorized as
                                                         the presence of irony was already annotated in the
sarcasm (i.e. other kinds of verbal irony or de-
                                                         source datasets. It began by splitting in two halves
scriptions of situational irony which do not show
                                                         the dataset and assigning the annotation task for
the characteristics of sarcasm), and iii) not-irony.
                                                         each portion to a different couple of annotators. In
The proposed tasks encourage the investigation           the following step, the final inter-annotator agree-
of this linguistic devices. Moreover, providing a        ment (IAA) has been calculated on all the dataset.
dataset from social media (Twitter), we focus on         Then, in order to achieve an agreement on a larger
texts especially hard to be dealt with, because of       portion of data, the effort of the annotators has
their shortness and because they will be analyzed        been focused only on the detected cases of dis-
out of the context where they were generated.            agreement. In particular, the couple previously in-
   The participants are allowed to submit either         volved in the annotation of the first half of the cor-
“constrained” or “unconstrained” runs (or both,          pus produced a new annotation for the tweets in
within the submission limits). The constrained           disagreement of the second portion of the dataset,
runs have to be produced by systems whose only           while the couple involved in the annotation of the
training data is the dataset provided by the task or-    second half of the corpus did the same on the first
                                   TRAINING SET                                  TEST SET
                     IRONIC      NOT- IRO SARC     NOT- SARC       IRONIC   NOT- IRO  SARC         NOT- SARC    TOTAL
   TW- BS                467         646     173             294      111        161     51                60
   TW- SPINO             342           0     126             216       73          0     32                41    2,886
   TW- SENTIPOLC         461         625     143             318        0          0      0                 0
   HSC                   753         683     471             282      185        119    106                79    1,740
   TWITA                   0           0       0               0       67        156     28                39      223
   TOTAL                               3,977                                        872                          4,849

                               Table 1: Distribution of tweets according to the topic


portion of the dataset. After that, the cases where            Speech Detection (Bosco et al., 2018). In the train-
the disagreement persists have been discarded as               ing set we count 781 overlapping tweets, while in
too ambiguous to be classified (131 tweets).                   the test set we count an overlap of just 96 tweets.
   The final IAA calculated with Fleiss’ kappa is
κ = 0.56 for the tweets belonging to the TWIT-                 3.3    Data Release
TIRÒ corpus and κ = 0.52 for the data from the                 The data were released in the following format3 :
HSC corpus and it is considered moderate1 and sat-
                                                                     idtwitter text irony sarcasm topic
isfying for the purpose of the shared task.
   In this process the annotators relied on a spe-             where idtwitter is the Twitter ID of the mes-
cific definition of “sarcasm”, and followed de-                sage, text is the content of the message, irony
tailed guidelines2 . In particular we defined sar-             is 1 or 0 (respectively for ironic and not ironic
casm as a kind of sharp, explicit and sometimes                tweets), sarcasm is 1 or 0 (respectively for sar-
aggressive irony, aimed at hitting a specific target           castic and not sarcastic tweets), and topic refers
to hurt or criticize without excluding the possibil-           to the source corpus from where the tweet has been
ity of having fun (Du Marsais et al., 1981; Gibbs,             extracted.
2000). The factors we have taken into account for                 The training set includes for each tweet the an-
the annotation are, the presence of:                           notation for the irony and sarcasm fields, ac-
   1. a clear target,                                          cording to the format explained above. Instead, the
   2. an obvious intention to hurt or criticize,               test set only containes values for the idtwitter,
                                                               text and topic fields.
   3. negativity (weak or strong).
We have also tried to differentiate our concept of             4     Evaluation Measures
“sarcasm” from that of “satire”, often present in
tweets. For us, satire aims to ridicule the target             Task A: Irony detection. Systems have been
as well as criticize it. Differently from sarcasm,             evaluated against the gold standard test set on
satire is solely focused on a more negative type               their assignment of a 0 or 1 value to the irony
of criticism and moved by a personal and angry                 field. We measured the precision, recall and F1-
emotional charge.                                              score of the prediction for both the ironic and
                                                               not-ironic classes:
A single training set has been provided for both
tasks A and B, which includes 3,977 tweets. Fol-                                             #correct_class
                                                                         precisionclass =
                                                                                             #assigned_class
lowing, a single test set has been distributed for
both tasks A and B, which includes 872 tweets,                                              #correct_class
                                                                            recallclass =
hence creating an 82% − 18% balance between                                                  #total_class
training and test data. Table 1 shows the distribu-
                                                                                       precisionclass recallclass
tion of ironic and sarcastic tweets among the dif-                     F 1class = 2
                                                                                      precisionclass + recallclass
ferent source/topic datasets cited in Section 3.1.
   Additionally the IronITA datasets overlap with
                                                               The overall F1-score is the average of the F1-
the data released for HaSpeeDe, the task of Hate
                                                               scores for the ironic and not-ironic classes
   1
    According to the parameters proposed by Fleiss (1971).     (i.e. macro F1-score).
   2
    For more details on this regard, please refer to
the guidelines: https://github.com/AleT-Cig/
                                                                 3
IronITA-2018/blob/master/Definition%20of%                          Link to the datasets: http://www.di.unito.it/
20Sarcasm.pdf                                                  ~tutreeb/ironita-evalita18/data.html
    topic       irony    sarcasm     text
    TWITTIRÒ    0        0           @SteGiannini @sdisponibile Semmai l’anno DELLA buona scuola. De la, in
                                     italiano, non esiste
    TWITTIRÒ    1        1           #labuonascuola Fornitura illimitata di rotoli di carta igienica e poi, piano pi-
                                     ano, tutti gli altri aspetti meno importanti.
    HSC         1        0           Di fronte a queste forme di terrorismo siamo tutti sulla stessa barca. A parte
                                     Briatore. Briatore ha la sua.
    HSC         1        1           Anche oggi sono in arrivo 2000migranti dalla Libia avanti in italia ce posto per
                                     tutti vero @lauraboldrini ? Li puoi accogliere a casa tua

                                 Table 2: Examples for each combinations


Task B: Different types of irony. Systems have             5.1     Task A: Irony Detection
been evaluated against the gold standard test set on       Table 4 shows the results for the irony detection
their assignment of a 0 or 1 value to the sarcasm          task, which attracted 17 total submissions from
field, assuming that the irony field is also pro-          7 different teams. The best scores are achieved
vided as part of the results.                              by the ItaliaNLP team (Cimino et al., 2018) that,
   We have measured the precision, recall and F1-          with a constrained run, obtained the best score for
score for each of the three classes:                       both the ironic and not-ironic class, thus
    • not-ironic                                           obtaining the highest averaged F1-score of 0.731.
      irony = 0, sarcasm = 0                                  Among the unconstrained systems, the best F1-
    • ironic-not-sarcastic                                 score for the not-ironic class is achieved by
      irony = 1, sarcasm = 0                               the X2Check team (Di Rosa and Durante, 2018)
    • sarcastic                                            with F = 0.708, and the best F1-score for the
      irony = 1, sarcasm = 1                               ironic class is obtained by the UNITOR team
  The evaluation metric is the macro F1-score              (Santilli et al., 2018) with F = 0.733.
computed over the three classes. Note that for the            All participating systems show an improvement
purpose of the evaluation of task B, the following         over the baselines, with the exception of the only
combination is always considered wrong:                    unsupervised system (venses-itgetarun, see de-
                                                           tails in Section 6).
    • irony = 0, sarcasm = 1
                                                                 team name          id              F1-score
Our scheme imposes that a tweet can be annotated                                          not-iro       iro macro
as sarcastic only if it is also annotated as ironic,             ItaliaNLP          1      0.707     0.754   0.731
which correspond to interpreting sarcasm as a spe-               ItaliaNLP          2      0.693     0.733   0.713
                                                                 UNIBA              1      0.689     0.730   0.710
cific type of irony, as reported in Table 2.                     UNIBA              2      0.689     0.730   0.710
                                                                 X2Check            1      0.708     0.700   0.704
5    Participants and Results                                    UNITOR             1      0.662     0.739   0.700
                                                                 UNITOR             2      0.668     0.733   0.700
A total of 7 teams, both from academia and indus-                X2Check            2      0.700     0.689   0.695
                                                                 Aspie96            1      0.668     0.722   0.695
try sector participated to at least one of the two               X2Check            2      0.679     0.708   0.693
tasks of IronITA. Table 3 provides an overview of                X2Check            1      0.674     0.693   0.683
the teams, their affiliation, and the tasks they took            UO_IRO             2      0.603     0.700   0.651
                                                                 UO_IRO             1      0.626     0.665   0.646
part in.                                                         UO_IRO             2      0.579     0.678   0.629
   Four teams participated to both tasks A and B.                UO_IRO             1      0.652     0.577   0.614
Teams were allowed to submit up to four runs (2                  baseline-random           0.503     0.506   0.505
                                                                 venses-itgetarun   1      0.651     0.289   0.470
constrained and 2 unconstrained) in case they im-                venses-itgetarun   2      0.645     0.195   0.420
plemented different systems. Furthermore, each                   baseline-mfc              0.668     0.000   0.334
team had to submit at least a constrained run. Par-
                                                           Table 4: Results Task A. Unconstrained runs are
ticipants have been invited to submit multiple runs
                                                           marked by grey background.
to experiment with different models and architec-
tures. However, they have been discouraged from
submitting slight variations of the same model.            5.2     Task B: Different types of irony
Overall we have 17 runs for Task A and 7 runs              Table 5 shows the results for the different types
for Task B.                                                of irony task, which attracted 7 total submis-
            team name            institution                                                               tasks
            ItaliaNLP            ItaliaNLP group ILC-CNR                                                   A,B
            UNIBA                University of Bari                                                        A
            X2Check              App2Check srl                                                             A
            UNITOR               University of Roma “Tor Vergata”                                          A,B
            Aspie96              University of Torino                                                      A,B
            UO_IRO               CERPAMID, Santiago de Cuba / University of Informatics Sciences, Havana   A
            venses-itgetarun     Ca’ Foscari University of Venice                                          A,B

                                                    Table 3: Participants


sions from 4 different teams. The best scores are                 ports out of 7 participating teams) and on the an-
achieved by the UNITOR team that with an uncon-                   swers to a questionnaire sent by the organizers to
strained run obtained the highest macro F1-score                  the participants.
of 0.520.
   Among the constrained systems, the best F1-                    System architecture. Most submitted runs to
score for the not-ironic class is achieved by                     IronITA are produced by supervised machine
the ItaliaNLP team with F1-score = 0.707, and the                 learning systems. In fact, all but one systems are
best F1-score for the ironic class is obtained by                 supervised, although the nature and complexity
the Aspie96 team (Giudice, 2018) with F1-score                    of their architectures varies significantly. UNIBA
= 0.438. The best score for the sarcastic class                   (Basile and Semeraro, 2018) and UNITOR use
is obtained by a constrained run of the UNITOR                    Support Vector Machine (SVM) classifiers, with
team with F1-score = 0.459. The best performing                   different parameter settings. UNITOR, in partic-
UNITOR team is also the only team that partici-                   ular, employs a multiple kernel-based approach to
pated to Task B with an unconstrained run.                        create two SVM classifiers that work on the two
                                                                  tasks. X2Check uses several models based on
    team name          id                  F1-score               Multinomial Naive Bayes and SVM in a voting en-
                               not-iro      iro   sarc   macro    semble. Three systems implemented deep learn-
    UNITOR             2        0.668    0.447 0.446      0.520
    UNITOR             1        0.662    0.432 0.459      0.518
                                                                  ing neural networks for the classification of irony
    ItaliaNLP          1        0.707    0.432 0.409      0.516   and sarcasm. Sequence-learning networks were a
    ItaliaNLP          2        0.693    0.423 0.392      0.503   popular choice, in the form of Bidirectional Long
    Aspie96            1        0.668    0.438 0.289      0.465
    baseline-random             0.503    0.266 0.242      0.337
                                                                  Short-term Memory Networks (used by ItaliaNLP
    venses-itgetarun   1        0.431    0.260 0.018      0.236   and UO_IRO (Ortega-Bueno and Medina Pagola,
    baseline-mfc                0.668    0.000 0.000      0.223   2018)) and Gated Recurrent Units (Aspie96). The
    venses-itgetarun   2        0.413    0.183 0.000      0.199
                                                                  venses-itgetarun team proposed the only unsu-
Table 5: Results Task B. Unconstrained runs are                   pervised system submitted to IronITA. The system
marked by grey background.                                        is based on an extension of the ITGETARUN rule-
                                                                  based fully symbolic semantic parser (Delmonte,
All participating systems show an improvement                     2014). The performance of the venses-itgetarun
over the baselines, with the exception of the only                system is penalized mainly by its low recall (see
unsupervised system (venses-itgetarun, see de-                    the detailed results on the task website).
tails in Section 6).                                              Features. In addition to explore a broad spec-
                                                                  trum of supervised and unsupervised architec-
6      Discussion                                                 tures, the submitted systems leverage different
We compare the participating systems according                    kinds of linguistic and semantic information ex-
to the following main dimensions: classification                  tracted from the tweets. Word n-grams of vary-
framework (approaches, algorithms, features), text                ing size are used by ItaliaNLP, UNIBA, and
representation strategy, use of additional anno-                  X2Check. Word embeddings were used as fea-
tated data for training, external resources (e.g. sen-            tures by three systems, namely ItaliaNLP (built
timent lexica, NLP tools, etc.), and interdepen-                  with word2vec on a concatenation of ItWaC4 and
dency between the two tasks. This discussion is                   a custom tweet corpus), UNITOR (built with
based on the information contained in the reports                   4
                                                                      https://www.sketchengine.eu/
submitted by the participants (we received 6 re-                  itwac-italian-corpus/
word2vec on a custom Twitter corpus) and UNIBA         notated for hate speech from the HaSpeeDe task
(built with Random Indexing (Sahlgren, 2005)) on       at EVALITA 2018 (Bosco et al., 2018). We do
a subset of TWITA (Basile et al., 2018). Affective     not consider their runs unconstrained, because the
lexicons were also employed to extract polarity-       phenomena annotated in the data they employed
related features from the words in the tweets, by      are different from irony.
UNIBA, ItaliaNLP and UNITOR and UO_IRO
                                                       Interdependency of tasks. Since the tasks A
(see the “Lexical Resources” section for details
                                                       and B are inherently linked (a tweet can be sarcas-
on the lexica). UNIBA and UO_IRO also com-
                                                       tic only if it is also ironic), some of the participat-
puted sentiment variation and contrast in order
                                                       ing teams leveraged this information in their clas-
to extract the ironic content from the text. Fea-
                                                       sification systems. ItaliaNLP employed a Multi-
tures derived from sentiment analysis are also
                                                       task learning approach, thus solving the two tasks
employed by the unsupervised system venses-
                                                       simultaneously. UNITOR adopted a cascade ar-
itgetarun. Aspie96 performs its classification
                                                       chitecture where only tweets that were classified
based on the single characters of the tweet. Fi-
                                                       as ironic were passed through to the sarcasm clas-
nally, a great number of other features is employed
                                                       sifier. In the system by venses-itgetarun, the de-
by the systems, including stylistic and structural
                                                       cision on whether to assign a tweet to sarcasm
features (UO_IRO), special tokens and emoticons
                                                       or irony is based on the contemporary presence
(X2Check). See the details in the EVALITA pro-
                                                       of features common to the two tasks.
ceedings (Caselli et al., 2018).
                                                       7   Concluding remarks
Lexical Resources. Several systems employed
affective resources, mainly as a tool to com-          Differently from the previous sub-tasks on irony
pute the sentiment polarity of words and each          detection in Italian language proposed as part of
tweet. ItaliaNLP used two affective lexica gen-        the previous SENTIPOLC shared tasks, having
erated automatically by means of distant supervi-      Sentiment Analysis as reference framework, the
sion and automatic translation. UNIBA used an          IronITA tasks specifically focus on the irony and
automatic translation of SentiWordNet (Esuli and       sarcasm identification.
Sebastiani, 2006). UNITOR used the Distributed            Comparing the results for irony detection ob-
Polarity Lexicon by Castellucci et al. (2016).         tained within the SENTIPOLC sub-task (the best
UO_IRO used the affective lexicon derived from         performing system in the 2016 edition reached
the OpeNER project (Russo et al., 2016) and a          F = 0.5412 and in 2014 F = 0.575) with the
polarity lexicon of emojis by Kralj Novak et al.       ones obtained in IronITA, it is worth to notice that
(2015). venses-itgetarun used several lexica, in-      a dedicated task on irony detection leaded to a
cluding some specifically built for ITGETARUNS         remarkable improvement of the scores, with the
and a translation of SentiWordNet (Esuli and Se-       highest value here being F = 0.731.
bastiani, 2006).                                          Surprisingly, scores for Italian are in line with
                                                       those obtained at SemEval2018-Task3 on irony
Additional training data. Three teams took the         detection in English tweets, even if a lower amount
opportunity to send unconstrained runs along with      of linguistic resources is available for Italian than
constrained runs. X2Check included in the un-          for English, especially in term of affective lexica,
constrained training set a balanced version of the     a type of resource that is frequently exploited in
SENTIPOLC 2016 dataset, Italian tweets anno-           this kind of task. Actually, some teams used re-
tated with irony (Barbieri et al., 2016). UNITOR       sources provided by the Italian NLP community
used for their unconstrained runs a dataset of 6,000   also in the framework of previous EVALITA’s edi-
tweets obtained by distant supervision (searching      tion (e.g. additional information from annotated
for the hashtag #ironia — #irony). UO_IRO em-          corpora as SENTIPOLC, HaSpeeDe and POST-
ployed tweets annotated with fine-grained irony        WITA).
from TWITTIRÒ (Cignarella et al., 2018).                  The good results obtained in this edition can
   The team ItaliaNLP did not send unconstrained       be read also as a confirmation that linguistic
runs, although they used the information about po-     resources for Italian language are increasing in
larity of Italian tweets from the SENTIPOLC 2016       quantity and quality, and they are helpful also for
dataset (Barbieri et al., 2016) and the data an-       a very challenging task as irony detection.
   Another interesting factor in this edition is the       2016. Overview of the Evalita 2016 sentiment polar-
use of the innovative deep learning techniques,            ity classification task. In Proceedings of 3rd Italian
                                                           Conference on Computational Linguistics (CLiC-
mirroring the growing interest in deep learning by
                                                           it 2016) & 5th Evaluation Campaign of Natural
the NLP community at large. Indeed, the best per-          Language Processing and Speech Tools for Italian,
forming system is based on a deep learning ap-             Naples, Italy. CEUR.org.
proach revealing its usefulness also for irony de-
tection. The high performance of deep learning           Pierpaolo Basile and Giovanni Semeraro.           2018.
                                                            UNIBA - Integrating distributional semantics fea-
methods is an indication that irony and sarcasm             tures in a supervised approach for detecting irony
are phenomena involving more complex features               in Italian tweets. In Proceedings of the 6th evalua-
than n-grams and lexical polarity.                          tion campaign of Natural Language Processing and
   The number of participants in task B was lower.          Speech tools for Italian (EVALITA’18), Turin, Italy.
                                                            CEUR.org.
Even though we wanted to encourage the inves-
tigation in the identification of sarcasm, we are        Valerio Basile, Andrea Bolioli, Malvina Nissim, Vi-
aware that addressing the finer-grained task to dis-       viana Patti, and Paolo Rosso. 2014. Overview
criminate between irony and sarcasm is still really        of the Evalita 2014 sentiment polarity classification
difficult.                                                 task. In Proceedings of the 4th evaluation campaign
                                                           of Natural Language Processing and Speech tools
   In hindsight, the organization of such a shared         for Italian (EVALITA’14), Pisa, Italy. Pisa Univer-
task, specifically dedicated to irony detection in         sity Press.
Italian tweets, and also focused on diverse types of
irony has been a hazard. It was intended to foster       Valerio Basile, Mirko Lai, and Manuela Sanguinetti.
                                                           2018. Long-term Social Media Data Collection
research teams in the exploitation of lexical and af-      at the University of Turin. In Proceedings of the
fective resources in Italian, developed in our NLP         5th Italian Conference on Computational Linguis-
community and to encourage the investigation es-           tics (CLiC-it 2018), Turin, Italy. CEUR.org.
pecially on data about politics and immigration.
   Our proposal for this shared task arose from the      Cristina Bosco, Felice Dell’Orletta, Fabio Poletto,
                                                           Manuela Sanguinetti, and Maurizio Tesconi. 2018.
intuition that a better recognition of figurative lan-     Overview of the Evalita 2018 Hate Speech De-
guage like irony in social media data could also           tection Task. In Proceedings of the 6th evalua-
lead to a better resolution of other Sentiment Anal-       tion campaign of Natural Language Processing and
ysis tasks such as Hate Speech Detection (Bosco            Speech tools for Italian (EVALITA’18), Turin, Italy.
                                                           CEUR.org.
et al., 2018), Stance Detection (Mohammad et
al., 2017), and Misogyny Detection (Fersini et al.,      Andrea Bowes and Albert Katz. 2011. When sarcasm
2018). IronITA wanted to be a first try-out and a          stings. Discourse Processes: A Multidisciplinary
first stimulus in this challenging field.                  Journal, 48(4):215–236.

                                                         Tommaso Caselli, Nicole Novielli, Viviana Patti, and
Acknowledgments                                            Paolo Rosso. 2018. EVALITA 2018: Overview of
                                                           the 6th Evaluation Campaign of Natural Language
V. Basile, C. Bosco and V. Patti were partially            Processing and Speech Tools for Italian. In Proceed-
supported by Progetto di Ateneo/CSP 2016 (Im-              ings of 6th Evaluation Campaign of Natural Lan-
migrants, Hate and Prejudice in Social Media-              guage Processing and Speech Tools for Italian. Final
IhatePrejudice, S1618_L2_BOSC_01). The work                Workshop (EVALITA 2018), Turin, Italy. CEUR.org.
of S.Frenda and P. Rosso was partially funded
                                                         Giuseppe Castellucci, Danilo Croce, and Roberto
by the Spanish research project SomEMBED                   Basili. 2016. A language independent method for
TIN2015-71147-C2-1-P (MINECO/FEDER).                       generating large scale polarity lexicons. In Proceed-
                                                           ings of the 10th International Conference on Lan-
                                                           guage Resources and Evaluation (LREC 2016), Por-
References                                                 torož, Slovenia. ELRA.

Salvatore Attardo. 2007. Irony as relevant inappro-      Alessandra Teresa Cignarella, Cristina Bosco, Viviana
  priateness. In H. Colston and R. Gibbs, editors,         Patti, and Mirko Lai. 2018. Application and Anal-
  Irony in language and thought: A cognitive science       ysis of a Multi-layered Scheme for Irony on the
  reader, pages 135–172. Lawrence Erlbaum.                 Italian Twitter Corpus TWITTIRÒ. In Proceedings
                                                           of the 11th International Conference on Language
Francesco Barbieri, Valerio Basile, Danilo Croce,          Resources and Evaluation (LREC 2018), Miyazaki,
  Malvina Nissim, Nicole Novielli, and Viviana Patti.      Japan. ELRA.
Andrea Cimino, Lorenzo De Mattei, and Felice               Christopher J. Lee and Albert N. Katz. 1998. The
  Dell’Orletta. 2018. Multi-task Learning in Deep            differential role of ridicule in sarcasm and irony.
  Neural Networks at EVALITA 2018. In Proceed-               Metaphor and Symbol, 13(1):1–15.
  ings of the 6th evaluation campaign of Natural
  Language Processing and Speech tools for Italian         A. Marchetti, D. Massaro, and A. Valle. 2007. Non
  (EVALITA’18), Turin, Italy. CEUR.org.                      dicevo sul serio. Riflessioni su ironia e psicologia.
                                                             Collana di psicologia. Franco Angeli.
Rodolfo Delmonte. 2014. A linguistic rule-based sys-
  tem for pragmatic text processing. In Proceedings        Saif M Mohammad, Parinaz Sobhani, and Svetlana
  of Fourth International Workshop EVALITA 2014,             Kiritchenko. 2017. Stance and sentiment in tweets.
  Pisa. Edizioni PLUS, Pisa University Press.                ACM Transactions on Internet Technology (TOIT),
                                                             17(3):26.
Emanuele Di Rosa and Alberto Durante. 2018. Irony
  detection in tweets: X2Check at Ironita 2018. In         Reynier Ortega-Bueno and José E. Medina Pagola.
  Proceedings of the 6th evaluation campaign of Nat-         2018. UO_IRO: Linguistic informed deep-learning
  ural Language Processing and Speech tools for Ital-        model for irony detection. In Proceedings of the
  ian (EVALITA’18), Turin, Italy. CEUR.org.                  6th evaluation campaign of Natural Language Pro-
                                                             cessing and Speech tools for Italian (EVALITA’18),
César Chesneau Du Marsais, Jean Paulhan, and Claude          Turin, Italy. CEUR.org.
  Mouchard. 1981. Traité des tropes. Le Nouveau
  Commerce.                                                Irene Russo, Francesca Frontini, and Valeria Quochi.
                                                              2016. OpeNER sentiment lexicon italian - LMF.
Marta Dynel. 2014. Linguistic approaches to (non)             ILC-CNR for CLARIN-IT repository hosted at In-
 humorous irony. Humor - International Journal of             stitute for Computational Linguistics “A. Zampolli”,
 Humor Research, 27(6):537–550.                               National Research Council, in Pisa.

Andrea Esuli and Fabrizio Sebastiani. 2006. Senti-         Magnus Sahlgren. 2005. An introduction to random
  wordnet: A publicly available lexical resource for        indexing. In In Methods and Applications of Seman-
  opinion mining. In Proceedings of the 5th Interna-        tic Indexing Workshop at the 7th International Con-
  tional Conference on Language Resources and Eval-         ference on Terminology and Knowledge Engineer-
  uation (LREC 2006), Genova, Italy.                        ing, TKE 2005.

Elisabetta Fersini, Maria Anzovino, and Paolo Rosso.       Manuela Sanguinetti, Fabio Poletto, Cristina Bosco,
   2018. Overview of the Task on Automatic Misog-           Viviana Patti, and Marco Stranisci. 2018. An Ital-
   yny Identification at IberEval. In Proceedings of 3rd    ian Twitter Corpus of Hate Speech against Immi-
   Workshop on Evaluation of Human Language Tech-           grants. In Proceedings of the 11th International
   nologies for Iberian Languages (IberEval 2018).          Conference on Language Resources and Evaluation
   CEUR-WS.org.                                             (LREC 2018), Miyazaki, Japan.

Joseph L. Fleiss. 1971. Measuring nominal scale            Andrea Santilli, Danilo Croce, and Roberto Basili.
   agreement among many raters. Psychological bul-           2018. A Kernel-based Approach for Irony and Sar-
   letin.                                                    casm Detection in Italian. In Proceedings of the
                                                             6th evaluation campaign of Natural Language Pro-
Raymond W. Gibbs. 2000. Irony in talk among                  cessing and Speech tools for Italian (EVALITA’18),
  friends. Metaphor and symbol, 15(1-2):5–27.                Turin, Italy. CEUR.org.
Rachel Giora, Adi Cholev, Ofer Fein, and Orna Pe-          Marco Stranisci, Cristina Bosco, Delia Irazú Hernán-
  leg. 2018. On the superiority of defaultness: Hemi-       dez Farías, and Viviana Patti. 2016. Annotat-
  spheric perspectives of processing negative and affir-    ing Sentiment and Irony in the Online Italian Po-
  mative sarcasm. Metaphor and Symbol, 33(3):163–           litical Debate on #labuonascuola. In Proceedings
  174.                                                      of the 10th International Conference on Language
                                                            Resources and Evaluation (LREC 2016), Portorož,
Valentino Giudice.     2018.    Aspie96 at IronITA          Slovenia. ELRA.
  (EVALITA 2018): Irony Detection in Italian Tweets
  with Character-Level Convolutional RNN. In Pro-          Emilio Sulis, D. Irazú Hernández Farías, Paolo Rosso,
  ceedings of the 6th evaluation campaign of Natural         Viviana Patti, and Giancarlo Ruffo. 2016. Figura-
  Language Processing and Speech tools for Italian           tive messages and affect in Twitter: Differences be-
  (EVALITA’18), Turin, Italy. CEUR.org.                      tween #irony, #sarcasm and #not. Knowledge-Based
                                                             Systems, 108:132 – 143. New Avenues in Knowl-
Aditya Joshi, Pushpak Bhattacharyya, and Mark James          edge Bases for Natural Language Processing.
  Carman. 2017. Automatic sarcasm detection: A
  survey. ACM Comput. Surv., 50(5):73:1–73:22.             Cynthia Van Hee, Els Lefever, and Véronique Hoste.
                                                             2018. Semeval-2018 task 3: Irony detection in En-
Petra Kralj Novak, Jasmina Smailović, Borut Sluban,         glish tweets. In Proceedings of The 12th Interna-
  and Igor Mozetič. 2015. Sentiment of emojis.              tional Workshop on Semantic Evaluation.
  PLOS ONE, 10(12):1–22, 12.
Appendix: Detailed results per class for all tasks

                                                        precision      recall         F1-score       precision   recall     F1-score   average
 ranking     team name           run type      run id
                                                        (non-ironic)   (non-ironic)   (non-ironic)   (ironic)    (ironic)   (ironic)   F1-score
 1           ItaliaNLP            c            1        0.785          0.643          0.707          0.696       0.823      0.754      0.731
 2           ItaliaNLP            c            2        0.751          0.643          0.693          0.687       0.786      0.733      0.713
 3           UNIBA                c            1        0.748          0.638          0.689          0.683       0.784      0.730      0.710
 4           UNIBA                c            2        0.748          0.638          0.689          0.683       0.784      0.730      0.710
 5           X2Check              u            1        0.700          0.716          0.708          0.708       0.692      0.700      0.704
 6           UNITOR               c            1        0.778          0.577          0.662          0.662       0.834      0.739      0.700
 7           UNITOR               u            2        0.764          0.593          0.668          0.666       0.816      0.733      0.700
 8           X2Check              u            2        0.690          0.712          0.700          0.701       0.678      0.689      0.695
 9           Aspie96              c            1        0.742          0.606          0.668          0.666       0.789      0.722      0.695
 10          X2Check              c            2        0.716          0.645          0.679          0.676       0.743      0.708      0.693
 11          X2Check              c            1        0.697          0.652          0.674          0.672       0.715      0.693      0.683
 12          UO_IRO               u            2        0.722          0.517          0.603          0.623       0.800      0.700      0.651
 13          UO_IRO               u            1        0.667          0.590          0.626          0.631       0.703      0.665      0.646
 14          UO_IRO               c            2        0.687          0.501          0.579          0.606       0.770      0.678      0.629
 15          UO_IRO               c            1        0.600          0.714          0.652          0.645       0.522      0.577      0.614
 16          baseline-random c                 1        0.506          0.501          0.503          0.503       0.508      0.506      0.505
 17          venses-itgetarun     c            1        0.520          0.872          0.651          0.597       0.191      0.289      0.470
 18          venses-itgetarun     c            2        0.505          0.892          0.645          0.525       0.120      0.195      0.420
 19          baseline-mfc         c            1        0.501          1.000          0.668          0.000       0.000      0.000      0.334
Detailed results of Task A (Irony Detection)


                                                        precision      recall         F1-score       precision   recall     F1-score   precision     recall        F1-score      average
 ranking     team name           run type      run id
                                                        (non-ironic)   (non-ironic)   (non-ironic)   (ironic)    (ironic)   (ironic)   (sarcastic)   (sarcastic)   (sarcastic)   F1-score
 1           UNITOR               u          2          0.764          0.593          0.668          0.362       0.584      0.447      0.492         0.407         0.446         0.520
 2           UNITOR               c          1          0.778          0.577          0.662          0.355       0.553      0.432      0.469         0.449         0.459         0.518
 3           ItaliaNLP            c          1          0.785          0.643          0.707          0.343       0.584      0.432      0.518         0.338         0.409         0.516
 4           ItaliaNLP            c          2          0.751          0.643          0.693          0.340       0.562      0.423      0.507         0.319         0.392         0.503
 5           Aspie96              c          1          0.742          0.606          0.668          0.353       0.575      0.438      0.342         0.250         0.289         0.465
 6           baseline-random c               1          0.506          0.501          0.503          0.267       0.265      0.266      0.239         0.245         0.242         0.337
 7           venses-itgetarun     c          1          0.606          0.334          0.431          0.341       0.210      0.260      0.500         0.009         0.018         0.236
 8           baseline-mfc         c          1          0.501          1.000          0.668          0.000       0.000      0.000      0.000         0.000         0.000         0.223
 9           venses-itgetarun     c          2          0.559          0.327          0.413          0.296       0.132      0.183      0.000         0.000         0.000         0.199
Detailed results of Task B (Sarcasm Detection)