Irony Detection: from the Twittersphere to the News Space Alessandra Cervone, Evgeny A. Stepanov, Fabio Celli, Giuseppe Riccardi Signals and Interactive Systems Lab Department of Information Engineering and Computer Science University of Trento, Trento, Italy {alessandra.cervone,evgeny.stepanov}@unitn.it {fabio.celli,giuseppe.riccardi}@unitn.it Abstract dati e feature sparseness. Per ovviare alla feature sparseness proponiamo esper- English. Automatic detection of irony is imenti con diverse rappresentazioni del one of the hot topics for sentiment analy- testo – bag-of-words, stile di scrittura e sis, as it changes the polarity of text. Most word embeddings; per ovviare alla man- of the work has been focused on the detec- canza di bilanciamento nei dati utilizzi- tion of figurative language in Twitter data amo invece tecniche di bilanciamento. due to relative ease of obtaining annotated data, thanks to the use of hashtags to sig- nal irony. However, irony is present gener- 1 Introduction ally in natural language conversations and in particular in online public fora. In this The detection of irony in user generated content paper, we present a comparative evalua- is one of the major issues in sentiment analysis tion of irony detection from Italian news and opinion mining (Ravi and Ravi, 2015). The fora and Twitter posts. Since irony is not problem is that irony can flip the polarity of ap- a very frequent phenomenon, its automatic parently positive sentences, negatively affecting detection suffers from data imbalance and the performance of sentiment polarity classifica- feature sparseness problems. We experi- tion (Poria et al., 2016). Detecting irony from text ment with different representations of text is extremely difficult because it is deeply related to – bag-of-words, writing style, and word many out-of-text factors such as context, intona- embeddings to address the feature sparse- tion, speakers’ intentions, background knowledge ness; and balancing techniques to address and so on. This also affects interpretation and an- the data imbalance. notation of irony by humans, often leading to low inter-annotator agreements. Italiano. Il rilevamento automatico di iro- Twitter posts are frequently used for the irony nia è uno degli argomenti più interessanti detection research, since users often signal irony in sentiment analysis, poiché modifica la in their posts utilizing hashtags such as #irony, polarità del testo. La maggior parte degli #justjoking, etc. Despite the relative ease of col- studi si sono concentrati sulla rilevazione lecting the data, Twitter is a very particular kind del linguaggio figurativo nei dati di Twit- of text. In this paper we experiment with dif- ter per la relativa facilità nell’ottenere ferent representations of text to evaluate the util- dati annotati con gli hashtags per seg- ity of Twitter data for the detection of irony in nalare l’ironia. Tuttavia, l’ironia è un text coming from other sources such as news fora. fenomeno che si trova nelle conversazioni The representations of text – bag-of-words, writ- umane in generale e in particolare nei fo- ing style, and word embeddings – are chosen such rum online. In questo lavoro presentiamo that they are not dependent on the resources avail- una valutazione comparativa sul rileva- able for the language. Due to the fact that irony is mento dell’ironia in blogs giornalistici e less frequent than literal meaning, the data is usu- conversazioni su Twitter. Poiché l’ironia ally imbalanced. We experiment with balancing non è un fenomeno molto frequente, il suo techniques such as random undersampling, ran- rilevamento automatico risente di prob- dom oversampling and cost-sensitive training to lemi di mancanza di bilanciamento nei observe its effects on a supervised irony detection. The paper is structured as follows. In Section Sulis et al. (2016) investigated a new set of fea- 2 we introduce related work on irony. In Sec- tures for irony detection in Twitter with particular tion 3 we describe the corpora used throughout regard to affective features; and studied the differ- experiments. In Sections 4 and 5 we describe the ence between irony and sarcasm. Barbieri et al. methodology and the result of the experiments. In (2014) were the first ones to propose an approach Section 6 we provide concluding remarks. for irony detection in Italian. Irony detection is a popular topic for shared 2 Related Works tasks and evaluation campaigns. Among others, SemEval-2015 (Ghosh et al., 2015) task on sen- The detection of irony in text has been widely timent analysis of figurative language in Twitter, addressed. Carvalho et al. (2009) showed that and SENTIPOLC 2014 (Basile et al., 2014) and in Portuguese news blogs, pragmatic and gestu- 2016 (Barbieri et al., 2016) tasks on irony and ral text features such as emoticons, onomatopoeic sentiment classification in Twitter. SemEval con- expressions and heavy punctuation marks work sidered three broad classes of figurative language: better than deeper linguistic information such as irony, sarcasm and metaphor. The task was cast n-grams, words or syntax. Reyes et al. (2013) as a regression as participants had to predict a nu- addressed irony detection in Twitter, using com- meric score (crowd-annotated). The best perform- plex features like temporal expressions, counter- ing systems made use of manual and automatic factuality markers, pleasantness or imageability lexica, term-frequencies, part-of-speech tags, and of words, and pair-wise semantic relatedness of emoticons. terms in adjacent sentences. This rich feature set The SENTIPOLC campaigns on Italian tweets, enabled the same authors to detect 30% of the on the other hand, included three tasks: subjec- irony in movie and book reviews in (Reyes and tivity detection, sentiment polarity classification Rosso, 2014). and irony detection (binary classification). The Ravi and Ravi (2016), on the other hand, ex- best performing systems utilized broad sets of fea- ploited resources such as LIWC (Tausczik and tures ranging from the established Twitter-based Pennebaker, 2010) to analyze irony in two differ- features, such as URL links, mentions, and hash- ent domains: satirical news and Amazon reviews; tags, to emoticons, punctuation, and vector space and found out that LIWC’s words related to sex or models to spot out-of-context words (Castellucci death are good indicators of irony. et al., 2014). Specifically, in SENTIPOLC 2016, Charalampakis et al. (2016) addressed irony de- the best performing system exploited lexica, hand- tection in Greek political tweets comparing semi- crafted rules, topic models and Named Entities supervised and supervised approaches, with the (Di Rosa and Durante, 2016). In this paper, on the aim to analyze whether irony predicts election re- other hand, we address irony detection from fea- sults or not. In order to detect irony, they use tures not dependent on language resources such as as features: spoken style words, word frequency, manually crafted lexica and source-dependent fea- number of WordNet SynSets as a measure of am- tures such as hashtags and emoticons. biguity, punctuation, repeated patterns and emoti- cons. They found that supervised methods work 3 Data Set better than semi-supervised in the prediction of irony (Charalampakis et al., 2016). The experiments reported in this paper make use Poria et al. (2016) developed models based on of two data sets: SENTIPOLC 2016 (Barbieri et pre-trained convolutional neural networks (CNNs) al., 2016) and CorEA (Celli et al., 2014). While to exploit sentiment, emotion and personality fea- SENTIPOLC is a corpus of tweets, CorEA is a tures for a sarcasm detection task. They trained data set of news articles and related reader com- and tested their models on balanced and unbal- ments collected from the Italian news website cor- anced sets of tweets retrieved searching the hash- riere.it. The two corpora consist of inherently dif- tag #sarcasm. They found that CNNs with pre- ferent types of text. While tweets have a limit on trained models perform very well and that, al- the length of the post, news articles comments are though sentiment features are good also when used not constrained. The length limitation does not alone, emotion and personality features help in the only impact the number of tokens per post, but task (Poria et al., 2016). also the style of writing, since in Tweets authors SENTIPOLC 2016 CorEA @gadlernertweet Se #Grillo fosse al governo, dopo due mesi bravo, escludi l’universitá .... restare ignoranti non fa male lo Stato smetterebbe di pagare stipendi e pensioni. E lui a nessuno, solo a sé stessi. questi sono i nostri.... geni. non capeggerebbe la rivolta mi meraviglierei se votasse grillo #Grillo,fa i comizi sulle cassette della frutta,mentre alcune beh dipende da come la guardi..A campagna elettorale del #Pdl li fanno senza,cassetta...solo sulle banane. #ballaró all’inverso: rispettano ció che avevano promesso @Italialand @MissAllyBlue Non mi fido della compagnia.. meglio far Saranno solo 4 milioni (comunque dimentichi i 42 mil di finta di stare sveglio.. sveglissimo O o rimborsi) peró pochi o tanti li hanno restituiti. Gli altri in- vece , probabilmente politici a te “simpatici” continuano a gozzovigliare con i soldi tuoi . Sveglia volpone Table 1: Examples of ironic posts from SENTIPOLC 2016 and CorEA. naturally try to squeeze as much content as possi- Non-Ironic Ironic Total ble within the limits. SENTIPOLC 2016 Training 6,542 (88%) 868 (12%) 7,410 This difference can be seen also in the type Test 1,765 (88%) 235 (12%) 2,000 of irony used across the two corpora, as shown CorEA 2,299 (80%) 576 (20%) 2,875 in the examples reported in Table 1. While in Table 2: Counts and percentages of ironic and Tweets we observe much more the presence of non-ironic posts in SENTIPOLC 2016 training external ‘sources’ (such as URL links, mentions, and test set and CorEA corpus. hashtags and emoticons) to signal the irony and make it interpretable (for example by disambiguat- ing entities using hashtags); news fora users tend 159K for SENTIPOLC and 164K for CorEA. to use style much more similar to natural language, Consequently, there are drastic differences in the where entities are not specifically signaled and average number of tokens per post: 21 for SEN- there are no emojis to mark the non-literal mean- TIPOLC and 57 for CorEA. As shown in Table 2, ing of a sentence. Thus, CorEA presents a more we also observe a major difference in the percent- difficult, but also a more interesting, dataset for ages of ironic posts between the corpora: 12% for automatic irony detection, given the closer simi- SENTIPOLC and 20% for CorEA. larity to the language used in other genres. Both corpora have been annotated following 4 Methodology a version of the scheme of SENTIPOLC 2014 In this paper we address irony detection in Ital- (Basile et al., 2014). According to the scheme, the ian making use of source independent and ‘easily’ annotator is asked to decide whether the given text obtainable representations of text such as lexical is subjective or not, and in case it is considered (bag-of-words), stylometric, and word embedding subjective, to annotate the polarity of the text and vectors. The models are trained and tested using irony as binary values. The CorEA corpus (Celli et Support Vector Machines (SVM) (Vapnik, 1995) al., 2014) was annotated for irony by three anno- with linear kernel and defaults parameters, imple- tators specifically for this paper, and has an inter- mented in the scikit-learn (Pedregosa et al., 2011) annotator agreement of κ = 0.57. python library. Since SENTIPOLC 2016 is composed of differ- To obtain the desired representations of text, the ent data sets, which used various agreement met- data is pre- For the bag-of-word representation, the rics (Barbieri et al., 2016), it is not possible to data is lowercased, and all source-specific entities, directly compare the inter-annotator agreements such as emoji, URL, Twitter hashtags, and men- between the corpora. The two component data tions are mapped to a single entity (e.g. hHi for sets of SENTIPOLC 2016 for which a comparable hashtags); as the objective is to use Twitter mod- metric is reported have an inter-annotator agree- els to detect irony in news fora and other kinds ment of κ = 0.538 (TW-SENTIPOLC14) and of textual data, where presence of such entities is κ = 0.492 (TW-BS) (Stranisci et al., 2016). less likely. We also apply a cut-off frequency and Despite the differences in the number of posts remove all the tokens that appear in a single docu- (9,410 for SENTIPOLC and 2,875 for CorEA; see ment only. Table 2); due to the length constraint of the former, For the style representation, we use the lexical the corpora have comparable numbers of tokens: richness metrics based on type and token frequen- cies such as type-token ratio, entropy, Guiraud’s Model NI I Mic-F1 Mac-F1 R, Honores H, etc. (Tweedie and Baayen, 1998) SENTIPOLC: Training BL: Chance 0.8783 0.1183 0.7862 0.4983 (22 features); and character-type ratios, (includ- BL: Majority 0.9378 0.0000 0.8829 0.4689 ing specific punctuation marks) (46 features) that BoW 0.8979 0.2112 0.8207 0.5546 previously were successfully applied to tasks such Style 0.8817 0.0892 0.7612 0.4605 WE 0.9361 0.0044 0.8799 0.4702 as agreement-disagreement classification (Celli et CorEA al., 2016) and mood detection (Alam et al., 2016). BL: Chance 0.7952 0.1895 0.6733 0.4923 To extract the word embedding representation BL: Majority 0.8886 0.0000 0.7996 0.4443 BoW 0.8414 0.2951 0.7411 0.5682 (Mikolov et al., 2013), we use skip-gram vec- Style 0.7116 0.1688 0.6186 0.4402 tors (size: 300, window: 10) pre-trained on Ital- WE 0.8811 0.1447 0.7912 0.5129 ian Wikipedia, and a document is represented as a term-frequency weighted average of per-word Table 3: Average per-class, micro and macro- vectors. F1 scores for stratified 10-fold cross-validation on SENTIPOLC 2016 training set and CorEA for dif- Since our goal is to analyze utility of Twit- ferent document representations: bag-of-words ter data for irony detection in Italian news fora, (BoW), stylometric features (Style) and word em- we first experiment with the text representations beddings (WE). BL: Chance and BL: Majority are and chose models that behave above chance-level chance-level and majority baselines. NI and I are baseline on per-class F1 scores and Micro-F1 non-ironic and ironic classes, respectively. score using a 10-fold stratified cross-validation setting. Even though on imbalanced data the fre- quently used evaluation metric is Macro-F1 score, document representations behave similarly across e.g. (Barbieri et al., 2016), which we report for corpora, and the only representation that achieves comparison purposes; it is misleading as it does above chance-level per-class and micro-F1 scores not reflect the amount of correctly classified in- is the bag-of-words. At the same time, it achieves stances. The majority baseline, on the other hand, the highest macro-F1 score. However, none of is very strong for highly imbalanced data sets, and the representations is able to surpass the majority is provided for reference purposes only. baseline in terms of micro-F1 . As data imbalance has been observed to ad- The performance of the bag-of-words represen- versely affect irony detection performance (Poria tation on data balancing techniques is presented et al., 2016; Ptacek et al., 2014), we experiment in Table 4. The training with natural distribu- with simple balancing techniques such as random tion (BoW: ND) yields the best performance across under- and oversampling and cost sensitive train- the corpora. For SENTIPOLC data, it is the only ing. While undersampling balances the data set model that produces above chance-level (Table by removing majority class instances, oversam- 3: BL: Chance) performances for per-class and pling achieves that by replicating (copying) mi- micro-F1 scores. nority class instances. Undersampling is often re- ported as a better option, as oversampling may Cost-sensitive training (BoW: CS) and random lead to overfitting problems (Chawla et al., 2002). oversampling (BoW: RO) perform very close. For In cost-sensitive training, on the other hand, the CorEA corpus, all balancing techniques except performance on minority class is improved by random undersampling (BoW: RU) yield above higher misclassification costs for it. In the paper, chance-level performances. Random undersam- the selected representations are analyzed in terms pling, however, yields the highest F1 score for of balancing effects and cross-source performance the irony class, which unfortunately comes at the (Twitter - news fora). expense of the overall performance. This ver- ifies previous observations in the literature that 5 Results and Discussion undersampling leads to negative effect on novel imbalanced data (Stepanov and Riccardi, 2011). The results of experiments comparing different Since cost-sensitive training achieves the best per- document representations – bag-of-words, writ- formance in terms of macro-F1 score, which was ing style, and word embeddings – are presented used as official evaluation metrics in SENTIPOLC in Table 3 for stratified 10-fold cross-validation 2016 (Barbieri et al., 2016), it is retained for SEN- on both corpora (SENTIPOLC and CorEA). The TIPOLC training-test and cross-corpora (SEN- Model NI I Mic-F1 Mac-F1 Model NI I Mic-F1 Mac-F1 SENTIPOLC: Training SENTIPOLC: Training - Test Split BoW: ND 0.8979 0.2112 0.8207 0.5546 BL: Chance 0.8826 0.1155 0.7927 0.4990 BoW: CS 0.8732 0.2493 0.7861 0.5612 BL: Majority 0.9376 0.0000 0.8825 0.4688 BoW: RO 0.8737 0.2375 0.7857 0.5555 SoA 0.9115 0.1710 – 0.5412 BoW: RU 0.7270 0.2679 0.6115 0.4974 BoW: ND 0.9330 0.1678 0.8760 0.5504 CorEA BoW: CS 0.9245 0.2023 0.8620 0.5634 BoW: ND 0.8414 0.2951 0.7411 0.5682 SENTIPOLC - CorEA: 10-fold testing BoW: CS 0.8331 0.3202 0.7321 0.5766 BL: Chance 0.8393 0.1213 0.7286 0.4803 BoW: RO 0.8302 0.3138 0.7279 0.5720 BL: Majority 0.8886 0.0000 0.7996 0.4443 BoW: RU 0.6882 0.3599 0.5810 0.5241 BoW: ND 0.8164 0.1755 0.7001 0.4959 BoW: CS 0.8109 0.2020 0.6945 0.5065 Table 4: Average per-class, micro and macro- F1 scores for stratified 10-fold cross-validation Table 5: Average per-class, micro and macro-F1 on SENTIPOLC 2016 training set and CorEA scores for SENTIPOLC Training-Test split and for balancing techniques: cost-sensitive training 10-fold testing of SENTIPOLC models on CorEA (CS), random oversampling (RO) and random un- for bag-of-words representation with imbalanced dersampling (RU). ND is training with natural dis- (ND) and cost-sensitive (CS) training. SoA are tribution of classes (BoW in Table 3). NI and I are the state-of-the-art results for SENTIPOLC 2016: non-ironic and ironic classes, respectively. the system of (Di Rosa and Durante, 2016). BL: Chance and BL: Majority are chance-level and majority baselines. NI and I are non-ironic and TIPOLC - CorEA) evaluation along with the mod- ironic classes, respectively. els trained on natural imbalanced distribution with equal costs. The final models make use of bag-of-words rep- words, writing style as stylometric features, and resentation and are trained on SENTIPOLC train- word embeddings. The objective is to evaluate ing set in cost-sensitive and insensitive settings. the suitability of Twitter data for detecting irony The evaluation of models is performed on SEN- in news fora. The models were compared for bal- TIPOLC 2016 test set and CorEA’s 10-folds. This anced and imbalanced training, as well as cross- setting allows us to compare our results to the state corpora performance. We have observed that of the art on SENTIPOLC data and CorEA’s cross- the bag-of-words representation with imbalanced validation setting. From the results in Table 5, cost-insensitive training produces the best results we observe that on the SENTIPOLC test set both (micro-F1 ) across settings, closely followed by models outperform the state of the art in terms of cost-sensitive training. macro-F1 score. The model with cost-sensitive The models outperform the results on irony de- training additionally outperforms it in terms of tection in Italian tweets (Di Rosa and Durante, irony class F1 score. However, both models fall 2016) in terms of macro-F1 scores reported for slightly short of outperforming the majority base- SENTIPOLC 2016 (Barbieri et al., 2016). How- line in terms of micro-F1 . ever, micro-F1 is the most informative metric for In the cross-corpora setting the behavior of the downstream application of irony detection, as models is similar – cost-sensitive training favors it considers the total amount of true positives. minority class F1 and macro-F1 scores. While Given that the highest micro-F1 is attained by the both models perform worse than the chance-level majority baselines for both corpora (0.8829 for baseline generated using the label distribution of SENTIPOLC and 0.7996 for CorEA), the task of SENTIPOLC data in terms of micro-F1 , they both irony detection is far from being solved. outperform it in terms of irony class F1 score. However, only the model with cost-sensitive train- Acknowledgments ing yields statistically significant difference using The research leading to these results has re- paired two-tail t-test with p = 0.05. ceived funding from the European Union – Sev- enth Framework Programme (FP7/2007-2013) un- 6 Conclusion der grant agreement No. 610916 – SENSEI. We have presented experiments on irony detec- We would like to thank Paolo Rosso and Mirko tion in Italian Twitter and news fora data compar- Lai for their help in annotating CorEA. ing different document representations – bag-of- References E. Duchesnay. 2011. Scikit-learn: Machine learn- ing in Python. Journal of Machine Learning Re- F. Alam, F. Celli, E.A. Stepanov, A. Ghosh, and G. Ric- search. cardi. 2016. The social mood of news: Self- reported annotations to design automatic mood de- S. Poria, E. Cambria, D. Hazarika, and P. Vij. 2016. A tection systems. In PEOPLES @COLING. deeper look into sarcastic tweets using deep convo- lutional neural networks. arXiv:1610.08815. F. Barbieri, F. Ronzano, and H. Saggion. 2014. Italian irony detection in twitter: a first approach. In CLiC- T. Ptacek, I. Habernal, and J. Hong. 2014. Sarcasm it 2014 & EVALITA. detection on czech and english twitter. In COLING. F. Barbieri, V. Basile, D. Croce, M. Nissim, K. Ravi and V. Ravi. 2015. A survey on opinion min- N. Novielli, and V. Patti. 2016. Overview of the ing and sentiment analysis: tasks, approaches and evalita 2016 sentiment polarity classification task. applications. Knowledge-Based Systems. In CLiC-it - EVALITA. K. Ravi and V. Ravi. 2016. A novel automatic satire V. Basile, A. Bolioli, M. Nissim, V. Patti, and P. Rosso. and irony detection using ensembled feature selec- 2014. Overview of the evalita 2014 sentiment polar- tion and data mining. Knowledge-Based Systems. ity classification task. In EVALITA. A. Reyes and P. Rosso. 2014. On the difficulty of au- P. Carvalho, L. Sarmento, M.J. Silva, and tomatically detecting irony: beyond a simple case of E. De Oliveira. 2009. Clues for detecting irony in negation. Knowledge and Information Systems. user-generated contents: oh...!! it’s so easy;-. In Topic-sentiment analysis for mass opinion. A. Reyes, P. Rosso, and T. Veale. 2013. A multidimen- sional approach for detecting irony in twitter. Lan- G. Castellucci, D. Croce, and R. Basili. 2014. Context- guage resources and evaluation, 47(1):239–268. aware convolutional neural networks for twitter sen- timent analysis in italian. In EVALITA. E.A. Stepanov and G. Riccardi. 2011. Detecting gen- eral opinions from customer surveys. In SENTIRE F. Celli, G. Riccardi, and A. Ghosh. 2014. CorEA: @ICDM. Italian news corpus with emotions and agreement. In CLIC-it. M. Stranisci, C. Bosco, D.I. Hernández Farı́as, and V. Patti. 2016. Annotating sentiment and irony in F. Celli, E.A. Stepanov, and G. Riccardi. 2016. Tell the online Italian political debate on #labuonascuola. me who you are, I’ll tell whether you agree or In LREC. disagree: Prediction of agreement/disagreement in news blogs. In NLPJ @IJCAI. E. Sulis, D.I. Hernández Farı́as, P. Rosso, V. Patti, and G. Ruffo. 2016. Figurative messages and affect in B. Charalampakis, D. Spathis, E. Kouslis, and K. Ker- Twitter: Differences between# irony,# sarcasm and# manidis. 2016. A comparison between semi- not. Knowledge-Based Systems, 108:132–143. supervised and supervised text mining techniques on detecting irony in greek political tweets. Engineer- Y.R. Tausczik and J.W. Pennebaker. 2010. The psy- ing Applications of Artificial Intelligence, 51:50–57. chological meaning of words: LIWC and comput- erized text analysis methods. Journal of Language N.V. Chawla, K.W. Bowyer, L.O. Hall, and W.P. and Social Psychology. Kegelmeyer. 2002. Smote: Synthetic minority over- sampling technique. J. Artif. Int. Res., 16(1):321– F.J. Tweedie and R.H. Baayen. 1998. How variable 357. may a constant be? Measures of lexical richness in perspective. Computers and the Humanities. E. Di Rosa and A. Durante. 2016. Tweet2check V.N. Vapnik. 1995. The Nature of Statistical Learning evaluation at evalita sentipolc 2016. In CLiC-it - Theory. Springer. EVALITA. A. Ghosh, G. Li, T. Veale, P. Rosso, E. Shutova, J. Barnden, and A. Reyes. 2015. Semeval-2015 task 11: Sentiment analysis of figurative language in twitter. In SemEval. T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vec- tor space. arXiv preprint arXiv:1301.3781. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pretten- hofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas- sos, D. Cournapeau, M. Brucher, M. Perrot, and