Irony detection in tweets: X2Check at Ironita 2018 (Short Paper)

Emanuele Di Rosa, Chief Technology Officer, App2Check s.r.l., emanuele.dirosa@app2check.com
Alberto Durante, Research Scientist, App2Check s.r.l., alberto.durante@app2check.com

Abstract

English. In this paper we describe and show the results of the two systems that we have specifically developed to participate in the irony detection task at Ironita 2018. We scored as the third team in the official ranking of the competition, thanks to the X2C-B system, at a distance of just 0.027 of F1 score from the best system.

Italiano. In this report we describe the two systems that we developed ad hoc to participate in Ironita 2018, specifically in the irony detection task. Our team ranked third in the official ranking of the competition, thanks to our system X2C-B, which obtained an F1 score only 0.027 lower than the first-ranked system.

1 Introduction

In social media the use of irony in tweets and Facebook posts is widespread, and it makes it very difficult for sentiment analysis tools to automatically classify people's opinions correctly (Hernández and Rosso, 2016). The ability to detect irony with high accuracy would bring an important contribution to opinion mining systems and lead to many industrial applications. For this reason, irony detection has been widely studied in recent research papers such as (Farías et al., 2011), (Barbieri et al., 2014), (Farías et al., 2016) and (Freitas et al., 2014).

In this paper we describe and show the results of the two systems that we have specifically developed to participate in the irony detection task at Ironita 2018 (Cignarella et al., 2018). We scored as the third team in the official ranking of the competition, thanks to the X2C-B system, at a distance of just 0.027 of F1 score from the best system.

This paper is structured as follows: after the introduction we present the descriptions of our two systems submitted for the irony detection task; then we show and discuss the results on the official test set of the competition; finally, we provide our conclusions.

2 Systems description

The dataset provided by the Ironita organizers has been split into a training set (80% of the documents) and a development set (the remaining 20%). We randomly sampled the examples for each category, thus obtaining different training and development sets while keeping the distribution of ironic and non-ironic samples across the two sets. We submitted two runs, produced by the two different systems we developed, called X2C-A and X2C-B. The former has been developed on top of the Scikit-learn library (Pedregosa et al., 2011) in Python, the latter on top of the WEKA library (Frank et al., 2016) in Java.

In both cases the input text has been cleaned with a typical NLP pipeline involving the removal of punctuation (with the exclusion of question and exclamation marks), numbers and stopwords. Moreover, since detecting irony in a text is still hard, very often also for humans, we tried to take advantage of features that may help signal the presence of irony. For instance, question and exclamation marks, text strings representing laughs, emoticons and mixed sentiment in the same sentence are some of the text features that we extracted and represented with a specific explicit marker highlighting their presence.

Both the X2C-A and X2C-B unconstrained runs were trained using the SENTIPOLC 2016 irony training and test sets (Barbieri et al., 2016) as an external source, in addition to the Ironita training set.
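A minimal sketch of such a cleaning-and-marking step could look like the following. The marker tokens, the regular expressions and the tiny stopword list are our own illustrative choices, not the actual X2C implementation; mixed-sentiment detection is omitted for brevity.

```python
import re

# Illustrative irony-trigger patterns (assumptions made for this sketch).
LAUGH_RE = re.compile(r"\b(?:ah|ha|eh|he){2,}\b", re.IGNORECASE)
EMOTICON_RE = re.compile(r"[:;=]-?[)(DPp]")
STOPWORDS = {"il", "la", "di", "e", "che", "un", "una", "per"}  # tiny example list

def preprocess(text: str) -> str:
    # 1. Extract trigger features before stripping punctuation.
    features = []
    if "?" in text:
        features.append("_QUESTION_")
    if "!" in text:
        features.append("_EXCLAMATION_")
    if LAUGH_RE.search(text):
        features.append("_LAUGH_")
    if EMOTICON_RE.search(text):
        features.append("_EMOTICON_")
    # 2. Remove punctuation and numbers, lowercase, drop stopwords.
    cleaned = re.sub(r"[^\w\s]|\d", " ", text)
    tokens = [t for t in cleaned.lower().split() if t not in STOPWORDS]
    # 3. Append the explicit markers to the cleaned text.
    return " ".join(tokens + features)

print(preprocess("Ma che bella giornata!!! ahahah :)"))
# → "ma bella giornata ahahah _EXCLAMATION_ _LAUGH_ _EMOTICON_"
```

The markers survive downstream tokenization as ordinary tokens, so any bag-of-words vectorizer can pick them up as features.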
2.1 X2C-A

The X2C-A system has been created by applying an NLP pipeline that vectorizes the collection of tweets into a matrix of bi-gram token counts; the count matrix is then transformed into a normalized tf-idf representation (term frequency times inverse document frequency). For the training, we created an ensemble model, more specifically a voting ensemble, that takes into account three different algorithms: LinearSVC (an implementation of Support Vector Machines), Multinomial Naive Bayes and the SGD classifier, all of which have an implementation available in the Scikit-learn library. The ensemble has been the best model in our model selection activity. In order to properly select the best hyper-parameters, we applied a grid search approach for each of the models in the voting ensemble. The resulting ensemble model showed a macro F1 score of 70.98 on our development set, a value very close to the final result on the competition test set (shown in Table 3).

            Acc    F1 ironic  Macro F1
LinearSVM   0.706  0.699      0.706
NB          0.706  0.699      0.706
SGD         0.697  0.728      0.693
Ensemble    0.710  0.709      0.710

Table 1: Results on the development set for X2C-A constrained.

2.2 X2C-B

In the model selection process, the two best algorithms have been Naive Bayes Multinomial and SMO, both using unigram features. We took into account the F1 score on the positive label and the Macro F1 in order to select the best algorithm. As shown in Table 2, Naive Bayes Multinomial reached a Macro F1 score 2.38% higher than SMO on the constrained run and 14.2% higher on the unconstrained run, thus both the constrained and the unconstrained submitted runs were produced using this algorithm.

            F1 non-iro  F1 iro  Macro F1
NB-const    0.715       0.696   0.707
NB-uncon    0.729       0.750   0.740
SMO-const   0.678       0.689   0.683
SMO-uncon   0.704       0.492   0.598

Table 2: Results on the development set for X2C-B.

Comparing the results in Table 2 with the ones in Table 1, we can notice that X2C-B unconstrained reached the highest performance on the development set, while X2C-B constrained obtained the lowest score.

3 Results and discussion

In Table 3 we show the results of our runs on the official test set of the competition. In accordance with what we noticed before when comparing Table 1 and Table 2, our best run is X2C-B unconstrained, which reached the best F1 on non-ironic documents overall; it also ranks fifth in the overall F1 score, at a distance of 0.027 from the best system. The performance of the two X2C-A runs is very similar, with the unconstrained run obtaining an F1 score only 0.002 higher than the constrained one. The difference between the two X2C-B runs is larger, but still only 0.021. We can also see that our X2C-B-u shows the best F1 score on the non-ironic tweets compared to all of the systems.

We also added to this ranking the model that reached the first position in the irony task at SENTIPOLC 2016 (Di Rosa and Durante, 2016). On this test set that model, called X2C2016 in the table, reached an F1 score of just 0.432, which is lower than this year's baseline. This surprising result may indicate either that irony detection systems have greatly improved in the past two years, or that the performance of irony detectors depends heavily on the topics treated in the training set, i.e. they are still not good at generalizing.

4 Conclusions

In this paper we described the two systems that we built and submitted to the Ironita 2018 irony detection task. The results show that our system X2C-B scored as the third team, at a distance of just 0.027 of F1 score from the best system.
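As a concrete illustration of the X2C-A pipeline of Section 2.1 (bi-gram counts, tf-idf weighting, and a voting ensemble of LinearSVC, Multinomial Naive Bayes and SGD tuned by grid search), a minimal Scikit-learn sketch could look as follows. The toy data, the hyper-parameter grid and the (1, 2) n-gram range are assumptions made for the example, not the configuration of the submitted run.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hard voting: LinearSVC exposes no predict_proba, so soft voting is not applicable.
ensemble = VotingClassifier(
    estimators=[
        ("svc", LinearSVC()),
        ("nb", MultinomialNB()),
        ("sgd", SGDClassifier(random_state=0)),
    ],
    voting="hard",
)

pipeline = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 2))),  # token counts up to bi-grams
    ("tfidf", TfidfTransformer()),                    # normalized tf-idf weighting
    ("clf", ensemble),
])

# Grid search over hyper-parameters of the ensemble members
# (an illustrative grid, not the one used for the submitted run).
grid = GridSearchCV(
    pipeline,
    param_grid={"clf__svc__C": [0.1, 1.0, 10.0], "clf__nb__alpha": [0.5, 1.0]},
    cv=3,
    scoring="f1_macro",
)

# Toy data standing in for the Ironita training set (label 1 = ironic).
texts = [
    "che bella giornata", "ottimo servizio clienti", "film davvero bello",
    "sì certo come no", "fantastico un altro ritardo", "complimenti per la puntualità",
]
labels = [0, 0, 0, 1, 1, 1]
grid.fit(texts, labels)
print(grid.best_params_)
```

Nested parameters of the ensemble members are addressed with Scikit-learn's double-underscore syntax (e.g. `clf__svc__C`), which lets a single grid search tune every member of the voting ensemble.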
rank  team        F1 non-iro  F1 iro  Macro F1
 1    team 1      0.707       0.754   0.731
 2    team 1      0.693       0.733   0.713
 3    team 2      0.689       0.730   0.710
 4    team 2      0.689       0.730   0.710
 5    X2C-B-u     0.708       0.700   0.704
 6    team 4      0.662       0.739   0.700
 7    team 4      0.668       0.733   0.700
 8    X2C-A-u     0.700       0.689   0.695
 9    team 5      0.668       0.722   0.695
10    X2C-A-c     0.679       0.708   0.693
11    X2C-B-c     0.674       0.693   0.683
12    team 6      0.603       0.700   0.651
13    team 6      0.626       0.665   0.646
14    team 6      0.579       0.678   0.629
15    team 6      0.652       0.577   0.614
16    baseline-1  0.503       0.506   0.505
17    team 7      0.651       0.289   0.470
18    X2C2016     0.665       0.198   0.432
19    team 7      0.645       0.195   0.420
20    baseline-2  0.668       0.000   0.334

Table 3: Ironita 2018 official ranking.

References

Francesco Barbieri and Horacio Saggion. 2014. Modelling Irony in Twitter: Feature Analysis and Evaluation. In Proceedings of LREC 2014.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti and Paolo Rosso. 2018. Overview of the Evalita 2018 Task on Irony Detection in Italian Tweets (IronITA). In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18).

Emanuele Di Rosa and Alberto Durante. 2016. Tweet2Check evaluation at Evalita Sentipolc 2016. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Delia Irazú Hernández Farías et al. 2011. Irony Detection in Twitter: The Role of Affective Content. ACM Transactions on Internet Technology 16: 19:1-19:24.

Delia Irazú Hernández Farías, Viviana Patti and Paolo Rosso. 2016. Irony Detection in Twitter: The Role of Affective Content. ACM Transactions on Internet Technology 16, 3, Article 19 (July 2016), pp. 1-24. DOI: https://doi.org/10.1145/2930663

Eibe Frank, Mark A. Hall and Ian H. Witten. 2016. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann, Fourth Edition.

Larissa Freitas, Aline Vanin, Denise Hogetop, Marco N. Bochernitsan and Renata Vieira. 2014. Pathways for irony detection in tweets. In Proceedings of the ACM Symposium on Applied Computing. DOI: 10.1145/2554850.2555048.

I. Hernández and P. Rosso. 2016. Irony, Sarcasm, and Sentiment Analysis. Chapter 7 in Sentiment Analysis in Social Networks, F.A. Pozzi, E. Fersini, E. Messina and B. Liu (Eds.), Elsevier Science and Technology, pp. 113-128.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825-2830.

E. Sulis, I. Hernández Farías, P. Rosso, V. Patti and G. Ruffo. 2016. Figurative Messages and Affect in Twitter: Differences Between #irony, #sarcasm and #not. Knowledge-Based Systems, vol. 108, pp. 132-143.