Overview of the EVALITA 2018 Task on Irony Detection in Italian Tweets (IronITA)

Alessandra Teresa Cignarella (Dipartimento di Informatica, Università degli Studi di Torino, Italy / PRHLT Research Center, Universitat Politècnica de València, Spain)
Simona Frenda (Dipartimento di Informatica, Università degli Studi di Torino, Italy / PRHLT Research Center, Universitat Politècnica de València, Spain)
Valerio Basile, Cristina Bosco, Viviana Patti (Dipartimento di Informatica, Università degli Studi di Torino, Italy)
Paolo Rosso (PRHLT Research Center, Universitat Politècnica de València, Spain)
{cigna,frenda}@di.unito.it, {basile,bosco,patti}@di.unito.it, prosso@dsic.upv.es

Abstract

IronITA is a new shared task in the EVALITA 2018 evaluation campaign, focused on the automatic classification of irony in Italian texts from Twitter. It includes two tasks: 1) irony detection and 2) detection of different types of irony, with a special focus on sarcasm identification. We received 17 submissions for the first task and 7 submissions for the second task, from 7 teams.

1 Introduction

Irony is a figurative language device that conveys the opposite of the literal meaning, intentionally profiling a secondary or extended meaning. Users on the web tend to use irony as a creative device to express their thoughts in short texts like tweets, reviews, posts or commentaries. But irony, like other figurative language devices such as metaphor, is very difficult to deal with automatically. Because it recalls another meaning or obfuscates the real communicative intention, it hinders correct sentiment analysis of texts and, therefore, correct opinion mining. Indeed, the presence of ironic devices in a text can work as an unexpected "polarity reverser" (one says something "good" to mean something "bad"), thus undermining systems' accuracy.

In the majority of state-of-the-art studies in computational linguistics, irony is used as an umbrella term which includes satire, sarcasm and parody, due to the fuzzy boundaries among them (Marchetti et al., 2007). However, some linguistic studies focused on sarcasm, a particular type of verbal irony, defined in Gibbs (2000) as "a sharp or cutting ironic expression with the intent to convey scorn or insult". Other scholars concentrated on cognitive aspects related to how such figurative expressions are processed in the brain, focusing on key aspects influencing processing (see for instance the "defaultness" hypothesis presented in Giora et al. (2018)).

Detecting irony and sarcasm is also very relevant for reaching better predictions in Sentiment Analysis, for instance when determining the real opinion and orientation of users about a specific subject (product, service, topic, issue, person, organization, or event).

IronITA is organized in continuity with previous shared tasks within the context of the EVALITA evaluation campaign (see for instance the irony detection subtask proposed at SENTIPOLC in the 2014 and 2016 editions (Basile et al., 2014; Barbieri et al., 2016)). It is also inspired by the recent experience of SemEval2018-Task3, Irony detection in English tweets (Van Hee et al., 2018).
The shared task we propose for Italian is specifically dedicated to irony detection, taking into account both the classical binary classification task (irony vs not irony) and a related subtask which gives participants the possibility to reason on different types of irony. Differently from SemEval2018-Task3, we ask the participants to distinguish sarcasm as a specific type of irony. This is motivated by the growing interest in detecting sarcasm, which is characterized by sharp tones and an aggressive intention (Gibbs, 2000; Joshi et al., 2017; Sulis et al., 2016), often present in interesting domains such as politics and hate speech (Sanguinetti et al., 2018).

2 Task Description

The task consists in automatically annotating messages from Twitter for irony and sarcasm. It is organized in a main task (Task A) centered on irony and a second task (Task B) centered on sarcasm, whose results are evaluated separately. Teams could participate in both tasks (Task A and Task B) or in Task A only.

Task A: Irony detection. Task A consists in a two-class (binary) classification where systems have to predict whether a tweet is ironic or not.

Task B: Different types of irony, with a special focus on sarcasm identification. Sarcasm has been recognized in Bowes and Katz (2011) as a form of irony with a specific target to attack (Attardo, 2007; Dynel, 2014), more offensive and delivered with a cutting tone (rarely ambiguous). According to Lee and Katz (1998), hearers perceive aggressiveness as the feature that distinguishes sarcasm. Given this definition of sarcasm as a specific type of irony, Task B consists in a multi-class classification where systems have to predict one of the three following labels: i) sarcasm, ii) irony not categorized as sarcasm (i.e. other kinds of verbal irony, or descriptions of situational irony which do not show the characteristics of sarcasm), and iii) not-irony.

The proposed tasks encourage the investigation of these linguistic devices. Moreover, by providing a dataset from social media (Twitter), we focus on texts that are especially hard to deal with, because of their shortness and because they are analyzed out of the context where they were generated.

The participants are allowed to submit either "constrained" or "unconstrained" runs (or both, within the submission limits). The constrained runs have to be produced by systems whose only training data is the dataset provided by the task organizers; the participant teams are instead encouraged to train their systems on additional annotated data and submit the resulting unconstrained runs.

We implemented two straightforward baseline systems for the task. baseline-mfc (Most Frequent Class) assigns to each instance the majority class of the respective task, namely not-ironic for Task A and not-sarcastic for Task B. baseline-random assigns uniformly random values to the instances. Note that for Task A a class is assigned randomly to every instance, while for Task B the classes are assigned randomly only to eligible tweets, i.e. those marked as ironic.
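As an illustration, the two baselines can be reproduced in a few lines of Python. This is our sketch, not the organizers' code: the function names are hypothetical, and the 0/1 labels follow the data format described in Section 3.3.

    import random

    # baseline-mfc: the majority class is not-ironic for Task A and
    # not-sarcastic for Task B, i.e. label 0 in both cases.
    def baseline_mfc(n_tweets):
        return [0] * n_tweets

    # baseline-random for Task A: a uniformly random label per tweet.
    def baseline_random_task_a(n_tweets):
        return [random.randint(0, 1) for _ in range(n_tweets)]

    # baseline-random for Task B: only tweets marked as ironic are
    # eligible for a random sarcasm label; a non-ironic tweet can
    # never be sarcastic under the annotation scheme.
    def baseline_random_task_b(irony_labels):
        return [random.randint(0, 1) if irony == 1 else 0
                for irony in irony_labels]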
3 Training and Test Data

3.1 Composition of the datasets

The data released for the shared task come from different source datasets, namely the Hate Speech Corpus (HSC) (Sanguinetti et al., 2018) and the TWITTIRÒ corpus (Cignarella et al., 2018), which is composed of tweets from the LaBuonaScuola corpus (TW-BS) (Stranisci et al., 2016), the Sentipolc corpus (TW-SENTIPOLC) and the Spinoza corpus (TW-SPINO) (Barbieri et al., 2016).

The test data come from the same sources and, in addition, include some tweets from the TWITA collection that were annotated by the organizers of the SENTIPOLC 2016 shared task but were not exploited during the 2016 campaign (Barbieri et al., 2016).

3.2 Annotation of the datasets

The annotation process involved four Italian native speakers and focused only on the finer-grained annotation of sarcasm in the ironic tweets, since the presence of irony was already annotated in the source datasets. It began by splitting the dataset in two halves and assigning the annotation of each portion to a different pair of annotators. In the following step, the inter-annotator agreement (IAA) was calculated on the whole dataset. Then, in order to reach agreement on a larger portion of data, the annotators' effort was focused on the detected cases of disagreement: the pair previously involved in the annotation of the first half of the corpus produced a new annotation for the tweets in disagreement in the second portion, while the pair involved in the annotation of the second half did the same on the first portion. After that, the cases where the disagreement persisted were discarded as too ambiguous to be classified (131 tweets).

The final IAA, calculated with Fleiss' kappa, is κ = 0.56 for the tweets belonging to the TWITTIRÒ corpus and κ = 0.52 for the data from the HSC corpus; according to the parameters proposed by Fleiss (1971) this is considered moderate, and it is satisfying for the purpose of the shared task.

In this process the annotators relied on a specific definition of "sarcasm" and followed detailed guidelines (https://github.com/AleT-Cig/IronITA-2018/blob/master/Definition%20of%20Sarcasm.pdf). In particular, we defined sarcasm as a kind of sharp, explicit and sometimes aggressive irony, aimed at hitting a specific target in order to hurt or criticize, without excluding the possibility of having fun (Du Marsais et al., 1981; Gibbs, 2000). The factors we have taken into account for the annotation are the presence of:

1. a clear target,
2. an obvious intention to hurt or criticize,
3. negativity (weak or strong).

We have also tried to differentiate our concept of "sarcasm" from that of "satire", often present in tweets. For us, satire aims to ridicule the target as well as criticize it but, differently from sarcasm, it is not focused on a more negative type of criticism moved by a personal and angry emotional charge.
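For reference, the agreement figures reported above follow the standard Fleiss' kappa formula. Below is a minimal sketch of ours (not the organizers' tooling), assuming every tweet is rated by the same number of annotators:

    def fleiss_kappa(ratings):
        """ratings: one row per item, each row holding the number of
        annotators who chose each category, e.g. [not-sarc, sarc].
        Every item must be rated by the same number of annotators."""
        n_items = len(ratings)
        n_raters = sum(ratings[0])
        n_cats = len(ratings[0])

        # Overall proportion of assignments per category.
        p = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_cats)]

        # Per-item agreement: fraction of agreeing rater pairs.
        P_i = [(sum(c * c for c in row) - n_raters)
               / (n_raters * (n_raters - 1)) for row in ratings]

        P_bar = sum(P_i) / n_items      # observed agreement
        P_e = sum(x * x for x in p)     # agreement expected by chance
        return (P_bar - P_e) / (1 - P_e)

    # Toy example: 3 tweets, 2 annotators each -> kappa = 0.33
    print(fleiss_kappa([[2, 0], [0, 2], [1, 1]]))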
A single training set was provided for both Tasks A and B, including 3,977 tweets; afterwards, a single test set was distributed for both tasks, including 872 tweets, which yields an 82%-18% split between training and test data. Table 1 shows the distribution of ironic and sarcastic tweets among the different source/topic datasets cited in Section 3.1 (the sarcastic/not-sarcastic counts partition the ironic tweets; the TOTAL for TW-BS, TW-SPINO and TW-SENTIPOLC is given jointly, since all three come from the TWITTIRÒ corpus).

                        TRAINING SET                        TEST SET
                IRONIC NOT-IRO  SARC NOT-SARC   IRONIC NOT-IRO  SARC NOT-SARC   TOTAL
TW-BS              467     646   173      294      111     161    51       60
TW-SPINO           342       0   126      216       73       0    32       41   2,886
TW-SENTIPOLC       461     625   143      318        0       0     0        0
HSC                753     683   471      282      185     119   106       79   1,740
TWITA                0       0     0        0       67     156    28       39     223
TOTAL                        3,977                           872                4,849

Table 1: Distribution of tweets according to the topic.

Additionally, the IronITA datasets overlap with the data released for HaSpeeDe, the EVALITA 2018 task on Hate Speech Detection (Bosco et al., 2018): we count 781 overlapping tweets in the training set and an overlap of just 96 tweets in the test set.

3.3 Data Release

The data (available at http://www.di.unito.it/~tutreeb/ironita-evalita18/data.html) were released in the following format:

    idtwitter    text    irony    sarcasm    topic

where idtwitter is the Twitter ID of the message, text is the content of the message, irony is 1 or 0 (respectively for ironic and not ironic tweets), sarcasm is 1 or 0 (respectively for sarcastic and not sarcastic tweets), and topic refers to the source corpus from which the tweet was extracted.

The training set includes, for each tweet, the annotation of the irony and sarcasm fields, according to the format explained above. The test set, instead, only contains values for the idtwitter, text and topic fields.
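A minimal loading sketch follows. The file name is hypothetical, and we assume tab-separated columns with a header row, which may differ from the actual release:

    import csv

    def load_ironita(path, has_labels=True):
        """Read an IronITA file into a list of dicts with the fields
        idtwitter, text, irony, sarcasm, topic (labels are absent in
        the unannotated test set)."""
        tweets = []
        with open(path, encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                if has_labels:
                    row["irony"] = int(row["irony"])
                    row["sarcasm"] = int(row["sarcasm"])
                    # The scheme only allows sarcasm inside irony.
                    assert not (row["irony"] == 0 and row["sarcasm"] == 1)
                tweets.append(row)
        return tweets

    train = load_ironita("ironita2018_training.tsv")  # hypothetical name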
4 Evaluation Measures

Task A: Irony detection. Systems were evaluated against the gold standard test set on their assignment of a 0 or 1 value to the irony field. We measured the precision, recall and F1-score of the prediction for both the ironic and not-ironic classes:

    precision_class = #correct_class / #assigned_class

    recall_class = #correct_class / #total_class

    F1_class = 2 * (precision_class * recall_class) / (precision_class + recall_class)

The overall F1-score is the average of the F1-scores of the ironic and not-ironic classes (i.e. the macro F1-score).

Task B: Different types of irony. Systems were evaluated against the gold standard test set on their assignment of a 0 or 1 value to the sarcasm field, assuming that the irony field is also provided as part of the results. We measured the precision, recall and F1-score for each of the three classes:

• not-ironic: irony = 0, sarcasm = 0
• ironic-not-sarcastic: irony = 1, sarcasm = 0
• sarcastic: irony = 1, sarcasm = 1

The evaluation metric is the macro F1-score computed over the three classes. Note that, for the purpose of the evaluation of Task B, the following combination is always considered wrong:

• irony = 0, sarcasm = 1

Our scheme imposes that a tweet can be annotated as sarcastic only if it is also annotated as ironic, which corresponds to interpreting sarcasm as a specific type of irony, as illustrated by the examples in Table 2.

topic      irony  sarcasm  text
TWITTIRÒ     0       0     @SteGiannini @sdisponibile Semmai l'anno DELLA buona scuola. De la, in italiano, non esiste
TWITTIRÒ     1       1     #labuonascuola Fornitura illimitata di rotoli di carta igienica e poi, piano piano, tutti gli altri aspetti meno importanti.
HSC          1       0     Di fronte a queste forme di terrorismo siamo tutti sulla stessa barca. A parte Briatore. Briatore ha la sua.
HSC          1       1     Anche oggi sono in arrivo 2000migranti dalla Libia avanti in italia ce posto per tutti vero @lauraboldrini ? Li puoi accogliere a casa tua

Table 2: Examples for each combination of labels.
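The scoring described above is straightforward to implement; the following is our sketch, not the official scorer. For Task B, mapping each (irony, sarcasm) pair to one of the three class labels automatically scores the disallowed pair (0, 1) as wrong, since it matches no gold class:

    def f1(tp, fp, fn):
        # precision = tp/(tp+fp), recall = tp/(tp+fn)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def macro_f1(gold, pred, classes):
        scores = []
        for c in classes:
            tp = sum(g == c and p == c for g, p in zip(gold, pred))
            fp = sum(g != c and p == c for g, p in zip(gold, pred))
            fn = sum(g == c and p != c for g, p in zip(gold, pred))
            scores.append(f1(tp, fp, fn))
        return sum(scores) / len(scores)

    def task_b_class(irony, sarcasm):
        return {(0, 0): "not-ironic",
                (1, 0): "ironic-not-sarcastic",
                (1, 1): "sarcastic"}.get((irony, sarcasm), "invalid")

    # Task A: macro F1 over the two classes (0 = not-ironic, 1 = ironic).
    print(macro_f1([1, 0, 1, 1], [1, 0, 0, 1], classes=(0, 1)))  # ~0.733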
5 Participants and Results

A total of 7 teams, from both academia and industry, participated in at least one of the two tasks of IronITA. Table 3 provides an overview of the teams, their affiliations, and the tasks they took part in.

team name          institution                                                               tasks
ItaliaNLP          ItaliaNLP group, ILC-CNR                                                  A, B
UNIBA              University of Bari                                                        A
X2Check            App2Check srl                                                             A
UNITOR             University of Roma "Tor Vergata"                                          A, B
Aspie96            University of Torino                                                      A, B
UO_IRO             CERPAMID, Santiago de Cuba / University of Informatics Sciences, Havana   A
venses-itgetarun   Ca' Foscari University of Venice                                          A, B

Table 3: Participants.

Four teams participated in both Tasks A and B. Teams were allowed to submit up to four runs (2 constrained and 2 unconstrained) in case they implemented different systems, and each team had to submit at least one constrained run. Participants were invited to submit multiple runs to experiment with different models and architectures, but they were discouraged from submitting slight variations of the same model. Overall, we received 17 runs for Task A and 7 runs for Task B.

5.1 Task A: Irony Detection

Table 4 shows the results for the irony detection task, which attracted 17 total submissions from 7 different teams. The best scores are achieved by the ItaliaNLP team (Cimino et al., 2018) which, with a constrained run, obtained the best constrained F1-score for both the ironic and not-ironic classes and the highest averaged F1-score overall, 0.731. Among the unconstrained systems, the best F1-score for the not-ironic class is achieved by the X2Check team (Di Rosa and Durante, 2018) with F1 = 0.708, and the best F1-score for the ironic class is obtained by the UNITOR team (Santilli et al., 2018) with F1 = 0.733.

All participating systems show an improvement over the baselines, with the exception of the only unsupervised system (venses-itgetarun, see details in Section 6).

team name          run   F1 not-iro   F1 iro   macro F1
ItaliaNLP          c1      0.707      0.754     0.731
ItaliaNLP          c2      0.693      0.733     0.713
UNIBA              c1      0.689      0.730     0.710
UNIBA              c2      0.689      0.730     0.710
X2Check            u1      0.708      0.700     0.704
UNITOR             c1      0.662      0.739     0.700
UNITOR             u2      0.668      0.733     0.700
X2Check            u2      0.700      0.689     0.695
Aspie96            c1      0.668      0.722     0.695
X2Check            c2      0.679      0.708     0.693
X2Check            c1      0.674      0.693     0.683
UO_IRO             u2      0.603      0.700     0.651
UO_IRO             u1      0.626      0.665     0.646
UO_IRO             c2      0.579      0.678     0.629
UO_IRO             c1      0.652      0.577     0.614
baseline-random    c1      0.503      0.506     0.505
venses-itgetarun   c1      0.651      0.289     0.470
venses-itgetarun   c2      0.645      0.195     0.420
baseline-mfc       c1      0.668      0.000     0.334

Table 4: Results of Task A (c = constrained run, u = unconstrained run).

5.2 Task B: Different types of irony

Table 5 shows the results for the different types of irony task, which attracted 7 total submissions from 4 different teams. The best scores are achieved by the UNITOR team, which with an unconstrained run obtained the highest macro F1-score, 0.520; UNITOR is also the only team that participated in Task B with an unconstrained run. Among the constrained systems, the best F1-score for the not-ironic class is achieved by the ItaliaNLP team with F1 = 0.707, and the best F1-score for the ironic class is obtained by the Aspie96 team (Giudice, 2018) with F1 = 0.438. The best score for the sarcastic class is obtained by a constrained run of the UNITOR team with F1 = 0.459.

team name          run   F1 not-iro   F1 iro   F1 sarc   macro F1
UNITOR             u2      0.668      0.447     0.446      0.520
UNITOR             c1      0.662      0.432     0.459      0.518
ItaliaNLP          c1      0.707      0.432     0.409      0.516
ItaliaNLP          c2      0.693      0.423     0.392      0.503
Aspie96            c1      0.668      0.438     0.289      0.465
baseline-random    c1      0.503      0.266     0.242      0.337
venses-itgetarun   c1      0.431      0.260     0.018      0.236
baseline-mfc       c1      0.668      0.000     0.000      0.223
venses-itgetarun   c2      0.413      0.183     0.000      0.199

Table 5: Results of Task B (c = constrained run, u = unconstrained run).

All participating systems show an improvement over the baselines, with the exception of the only unsupervised system (venses-itgetarun, see details in Section 6).
Fi- as ironic were passed through to the sarcasm clas- nally, a great number of other features is employed sifier. In the system by venses-itgetarun, the de- by the systems, including stylistic and structural cision on whether to assign a tweet to sarcasm features (UO_IRO), special tokens and emoticons or irony is based on the contemporary presence (X2Check). See the details in the EVALITA pro- of features common to the two tasks. ceedings (Caselli et al., 2018). 7 Concluding remarks Lexical Resources. Several systems employed affective resources, mainly as a tool to com- Differently from the previous sub-tasks on irony pute the sentiment polarity of words and each detection in Italian language proposed as part of tweet. ItaliaNLP used two affective lexica gen- the previous SENTIPOLC shared tasks, having erated automatically by means of distant supervi- Sentiment Analysis as reference framework, the sion and automatic translation. UNIBA used an IronITA tasks specifically focus on the irony and automatic translation of SentiWordNet (Esuli and sarcasm identification. Sebastiani, 2006). UNITOR used the Distributed Comparing the results for irony detection ob- Polarity Lexicon by Castellucci et al. (2016). tained within the SENTIPOLC sub-task (the best UO_IRO used the affective lexicon derived from performing system in the 2016 edition reached the OpeNER project (Russo et al., 2016) and a F = 0.5412 and in 2014 F = 0.575) with the polarity lexicon of emojis by Kralj Novak et al. ones obtained in IronITA, it is worth to notice that (2015). venses-itgetarun used several lexica, in- a dedicated task on irony detection leaded to a cluding some specifically built for ITGETARUNS remarkable improvement of the scores, with the and a translation of SentiWordNet (Esuli and Se- highest value here being F = 0.731. bastiani, 2006). Surprisingly, scores for Italian are in line with those obtained at SemEval2018-Task3 on irony Additional training data. Three teams took the detection in English tweets, even if a lower amount opportunity to send unconstrained runs along with of linguistic resources is available for Italian than constrained runs. X2Check included in the un- for English, especially in term of affective lexica, constrained training set a balanced version of the a type of resource that is frequently exploited in SENTIPOLC 2016 dataset, Italian tweets anno- this kind of task. Actually, some teams used re- tated with irony (Barbieri et al., 2016). UNITOR sources provided by the Italian NLP community used for their unconstrained runs a dataset of 6,000 also in the framework of previous EVALITA’s edi- tweets obtained by distant supervision (searching tion (e.g. additional information from annotated for the hashtag #ironia — #irony). UO_IRO em- corpora as SENTIPOLC, HaSpeeDe and POST- ployed tweets annotated with fine-grained irony WITA). from TWITTIRÒ (Cignarella et al., 2018). The good results obtained in this edition can The team ItaliaNLP did not send unconstrained be read also as a confirmation that linguistic runs, although they used the information about po- resources for Italian language are increasing in larity of Italian tweets from the SENTIPOLC 2016 quantity and quality, and they are helpful also for dataset (Barbieri et al., 2016) and the data an- a very challenging task as irony detection. Another interesting factor in this edition is the 2016. Overview of the Evalita 2016 sentiment polar- use of the innovative deep learning techniques, ity classification task. 
7 Concluding remarks

Differently from the irony detection subtasks previously proposed for Italian within the SENTIPOLC shared tasks, which had Sentiment Analysis as their reference framework, the IronITA tasks focus specifically on irony and sarcasm identification.

Comparing the results for irony detection obtained within the SENTIPOLC subtask (the best performing system reached F1 = 0.5412 in the 2016 edition and F1 = 0.575 in 2014) with the ones obtained in IronITA, it is worth noticing that a dedicated task on irony detection led to a remarkable improvement of the scores, the highest value here being F1 = 0.731.

Surprisingly, the scores for Italian are in line with those obtained at SemEval2018-Task3 on irony detection in English tweets, even though a smaller amount of linguistic resources is available for Italian than for English, especially in terms of affective lexica, a type of resource frequently exploited in this kind of task. Indeed, some teams used resources provided by the Italian NLP community in the framework of previous EVALITA editions (e.g. additional information from annotated corpora such as SENTIPOLC, HaSpeeDe and POSTWITA).

The good results obtained in this edition can also be read as a confirmation that linguistic resources for the Italian language are increasing in quantity and quality, and that they are helpful for a very challenging task such as irony detection.

Another interesting factor in this edition is the use of innovative deep learning techniques, mirroring the growing interest in deep learning in the NLP community at large. Indeed, the best performing system is based on a deep learning approach, revealing its usefulness also for irony detection. The high performance of deep learning methods is an indication that irony and sarcasm are phenomena involving more complex features than n-grams and lexical polarity.

The number of participants in Task B was lower. Even though we wanted to encourage the investigation of sarcasm identification, we are aware that the finer-grained task of discriminating between irony and sarcasm is still really difficult.

In hindsight, the organization of such a shared task, specifically dedicated to irony detection in Italian tweets and also focused on diverse types of irony, was a gamble. It was intended to foster the exploitation by research teams of the lexical and affective resources for Italian developed in our NLP community, and to encourage investigation especially on data about politics and immigration. Our proposal for this shared task arose from the intuition that a better recognition of figurative language like irony in social media data could also lead to a better resolution of other Sentiment Analysis tasks such as Hate Speech Detection (Bosco et al., 2018), Stance Detection (Mohammad et al., 2017), and Misogyny Detection (Fersini et al., 2018). IronITA wanted to be a first try-out and a first stimulus in this challenging field.

Acknowledgments

V. Basile, C. Bosco and V. Patti were partially supported by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media - IhatePrejudice, S1618_L2_BOSC_01). The work of S. Frenda and P. Rosso was partially funded by the Spanish research project SomEMBED TIN2015-71147-C2-1-P (MINECO/FEDER).

References

Salvatore Attardo. 2007. Irony as relevant inappropriateness. In H. Colston and R. Gibbs, editors, Irony in Language and Thought: A Cognitive Science Reader, pages 135-172. Lawrence Erlbaum.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 sentiment polarity classification task. In Proceedings of the 3rd Italian Conference on Computational Linguistics (CLiC-it 2016) & 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Naples, Italy. CEUR.org.

Pierpaolo Basile and Giovanni Semeraro. 2018. UNIBA - Integrating distributional semantics features in a supervised approach for detecting irony in Italian tweets. In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Valerio Basile, Andrea Bolioli, Malvina Nissim, Viviana Patti, and Paolo Rosso. 2014. Overview of the Evalita 2014 sentiment polarity classification task. In Proceedings of the 4th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'14), Pisa, Italy. Pisa University Press.

Valerio Basile, Mirko Lai, and Manuela Sanguinetti. 2018. Long-term social media data collection at the University of Turin. In Proceedings of the 5th Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy. CEUR.org.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the Evalita 2018 Hate Speech Detection task. In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Andrea Bowes and Albert Katz. 2011. When sarcasm stings. Discourse Processes: A Multidisciplinary Journal, 48(4):215-236.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2016. A language independent method for generating large scale polarity lexicons. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. ELRA.

Alessandra Teresa Cignarella, Cristina Bosco, Viviana Patti, and Mirko Lai. 2018. Application and analysis of a multi-layered scheme for irony on the Italian Twitter corpus TWITTIRÒ. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. ELRA.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural networks at EVALITA 2018. In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Rodolfo Delmonte. 2014. A linguistic rule-based system for pragmatic text processing. In Proceedings of the Fourth International Workshop EVALITA 2014, Pisa, Italy. Edizioni PLUS, Pisa University Press.

Emanuele Di Rosa and Alberto Durante. 2018. Irony detection in tweets: X2Check at Ironita 2018. In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

César Chesneau Du Marsais, Jean Paulhan, and Claude Mouchard. 1981. Traité des tropes. Le Nouveau Commerce.

Marta Dynel. 2014. Linguistic approaches to (non) humorous irony. Humor - International Journal of Humor Research, 27(6):537-550.

Andrea Esuli and Fabrizio Sebastiani. 2006. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genova, Italy.

Elisabetta Fersini, Maria Anzovino, and Paolo Rosso. 2018. Overview of the task on Automatic Misogyny Identification at IberEval. In Proceedings of the 3rd Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018). CEUR-WS.org.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin.

Raymond W. Gibbs. 2000. Irony in talk among friends. Metaphor and Symbol, 15(1-2):5-27.

Rachel Giora, Adi Cholev, Ofer Fein, and Orna Peleg. 2018. On the superiority of defaultness: Hemispheric perspectives of processing negative and affirmative sarcasm. Metaphor and Symbol, 33(3):163-174.

Valentino Giudice. 2018. Aspie96 at IronITA (EVALITA 2018): Irony detection in Italian tweets with character-level convolutional RNN. In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Aditya Joshi, Pushpak Bhattacharyya, and Mark James Carman. 2017. Automatic sarcasm detection: A survey. ACM Computing Surveys, 50(5):73:1-73:22.

Petra Kralj Novak, Jasmina Smailović, Borut Sluban, and Igor Mozetič. 2015. Sentiment of emojis. PLOS ONE, 10(12):1-22.

Christopher J. Lee and Albert N. Katz. 1998. The differential role of ridicule in sarcasm and irony. Metaphor and Symbol, 13(1):1-15.

A. Marchetti, D. Massaro, and A. Valle. 2007. Non dicevo sul serio. Riflessioni su ironia e psicologia. Collana di psicologia. Franco Angeli.

Saif M. Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. 2017. Stance and sentiment in tweets. ACM Transactions on Internet Technology (TOIT), 17(3):26.

Reynier Ortega-Bueno and José E. Medina Pagola. 2018. UO_IRO: Linguistic informed deep-learning model for irony detection. In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Irene Russo, Francesca Frontini, and Valeria Quochi. 2016. OpeNER sentiment lexicon Italian - LMF. ILC-CNR for CLARIN-IT repository hosted at the Institute for Computational Linguistics "A. Zampolli", National Research Council, Pisa.

Magnus Sahlgren. 2005. An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE 2005).

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter corpus of hate speech against immigrants. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.

Andrea Santilli, Danilo Croce, and Roberto Basili. 2018. A kernel-based approach for irony and sarcasm detection in Italian. In Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Marco Stranisci, Cristina Bosco, Delia Irazú Hernández Farías, and Viviana Patti. 2016. Annotating sentiment and irony in the online Italian political debate on #labuonascuola. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. ELRA.

Emilio Sulis, D. Irazú Hernández Farías, Paolo Rosso, Viviana Patti, and Giancarlo Ruffo. 2016. Figurative messages and affect in Twitter: Differences between #irony, #sarcasm and #not. Knowledge-Based Systems, 108:132-143.

Cynthia Van Hee, Els Lefever, and Véronique Hoste. 2018. SemEval-2018 Task 3: Irony detection in English tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation.
Appendix: Detailed results per class for all tasks

Detailed results of Task A (Irony Detection); run type: c = constrained, u = unconstrained.

rank  team name         type  run  P(non-iro)  R(non-iro)  F1(non-iro)  P(iro)  R(iro)  F1(iro)  avg F1
 1    ItaliaNLP          c     1     0.785       0.643       0.707      0.696   0.823   0.754    0.731
 2    ItaliaNLP          c     2     0.751       0.643       0.693      0.687   0.786   0.733    0.713
 3    UNIBA              c     1     0.748       0.638       0.689      0.683   0.784   0.730    0.710
 4    UNIBA              c     2     0.748       0.638       0.689      0.683   0.784   0.730    0.710
 5    X2Check            u     1     0.700       0.716       0.708      0.708   0.692   0.700    0.704
 6    UNITOR             c     1     0.778       0.577       0.662      0.662   0.834   0.739    0.700
 7    UNITOR             u     2     0.764       0.593       0.668      0.666   0.816   0.733    0.700
 8    X2Check            u     2     0.690       0.712       0.700      0.701   0.678   0.689    0.695
 9    Aspie96            c     1     0.742       0.606       0.668      0.666   0.789   0.722    0.695
10    X2Check            c     2     0.716       0.645       0.679      0.676   0.743   0.708    0.693
11    X2Check            c     1     0.697       0.652       0.674      0.672   0.715   0.693    0.683
12    UO_IRO             u     2     0.722       0.517       0.603      0.623   0.800   0.700    0.651
13    UO_IRO             u     1     0.667       0.590       0.626      0.631   0.703   0.665    0.646
14    UO_IRO             c     2     0.687       0.501       0.579      0.606   0.770   0.678    0.629
15    UO_IRO             c     1     0.600       0.714       0.652      0.645   0.522   0.577    0.614
16    baseline-random    c     1     0.506       0.501       0.503      0.503   0.508   0.506    0.505
17    venses-itgetarun   c     1     0.520       0.872       0.651      0.597   0.191   0.289    0.470
18    venses-itgetarun   c     2     0.505       0.892       0.645      0.525   0.120   0.195    0.420
19    baseline-mfc       c     1     0.501       1.000       0.668      0.000   0.000   0.000    0.334

Detailed results of Task B (Sarcasm Detection); run type: c = constrained, u = unconstrained.

rank  team name         type  run  P(non-iro)  R(non-iro)  F1(non-iro)  P(iro)  R(iro)  F1(iro)  P(sarc)  R(sarc)  F1(sarc)  avg F1
 1    UNITOR             u     2     0.764       0.593       0.668      0.362   0.584   0.447    0.492    0.407    0.446     0.520
 2    UNITOR             c     1     0.778       0.577       0.662      0.355   0.553   0.432    0.469    0.449    0.459     0.518
 3    ItaliaNLP          c     1     0.785       0.643       0.707      0.343   0.584   0.432    0.518    0.338    0.409     0.516
 4    ItaliaNLP          c     2     0.751       0.643       0.693      0.340   0.562   0.423    0.507    0.319    0.392     0.503
 5    Aspie96            c     1     0.742       0.606       0.668      0.353   0.575   0.438    0.342    0.250    0.289     0.465
 6    baseline-random    c     1     0.506       0.501       0.503      0.267   0.265   0.266    0.239    0.245    0.242     0.337
 7    venses-itgetarun   c     1     0.606       0.334       0.431      0.341   0.210   0.260    0.500    0.009    0.018     0.236
 8    baseline-mfc       c     1     0.501       1.000       0.668      0.000   0.000   0.000    0.000    0.000    0.000     0.223
 9    venses-itgetarun   c     2     0.559       0.327       0.413      0.296   0.132   0.183    0.000    0.000    0.000     0.199