Irony Detection: from the Twittersphere to the News Space
     Alessandra Cervone, Evgeny A. Stepanov, Fabio Celli, Giuseppe Riccardi
                       Signals and Interactive Systems Lab
           Department of Information Engineering and Computer Science
                        University of Trento, Trento, Italy
          {alessandra.cervone,evgeny.stepanov}@unitn.it
             {fabio.celli,giuseppe.riccardi}@unitn.it


                 Abstract                           dati e feature sparseness. Per ovviare
                                                    alla feature sparseness proponiamo esper-
English. Automatic detection of irony is            imenti con diverse rappresentazioni del
one of the hot topics for sentiment analy-          testo – bag-of-words, stile di scrittura e
sis, as it changes the polarity of text. Most       word embeddings; per ovviare alla man-
of the work has been focused on the detec-          canza di bilanciamento nei dati utilizzi-
tion of figurative language in Twitter data         amo invece tecniche di bilanciamento.
due to relative ease of obtaining annotated
data, thanks to the use of hashtags to sig-
nal irony. However, irony is present gener-     1   Introduction
ally in natural language conversations and
in particular in online public fora. In this    The detection of irony in user generated content
paper, we present a comparative evalua-         is one of the major issues in sentiment analysis
tion of irony detection from Italian news       and opinion mining (Ravi and Ravi, 2015). The
fora and Twitter posts. Since irony is not      problem is that irony can flip the polarity of ap-
a very frequent phenomenon, its automatic       parently positive sentences, negatively affecting
detection suffers from data imbalance and       the performance of sentiment polarity classifica-
feature sparseness problems. We experi-         tion (Poria et al., 2016). Detecting irony from text
ment with different representations of text     is extremely difficult because it is deeply related to
– bag-of-words, writing style, and word         many out-of-text factors such as context, intona-
embeddings to address the feature sparse-       tion, speakers’ intentions, background knowledge
ness; and balancing techniques to address       and so on. This also affects interpretation and an-
the data imbalance.                             notation of irony by humans, often leading to low
                                                inter-annotator agreements.
Italiano. Il rilevamento automatico di iro-        Twitter posts are frequently used for the irony
nia è uno degli argomenti più interessanti    detection research, since users often signal irony
in sentiment analysis, poiché modifica la      in their posts utilizing hashtags such as #irony,
polarità del testo. La maggior parte degli     #justjoking, etc. Despite the relative ease of col-
studi si sono concentrati sulla rilevazione     lecting the data, Twitter is a very particular kind
del linguaggio figurativo nei dati di Twit-     of text. In this paper we experiment with dif-
ter per la relativa facilità nell’ottenere     ferent representations of text to evaluate the util-
dati annotati con gli hashtags per seg-         ity of Twitter data for the detection of irony in
nalare l’ironia. Tuttavia, l’ironia è un       text coming from other sources such as news fora.
fenomeno che si trova nelle conversazioni       The representations of text – bag-of-words, writ-
umane in generale e in particolare nei fo-      ing style, and word embeddings – are chosen such
rum online. In questo lavoro presentiamo        that they are not dependent on the resources avail-
una valutazione comparativa sul rileva-         able for the language. Due to the fact that irony is
mento dell’ironia in blogs giornalistici e      less frequent than literal meaning, the data is usu-
conversazioni su Twitter. Poiché l’ironia      ally imbalanced. We experiment with balancing
non è un fenomeno molto frequente, il suo      techniques such as random undersampling, ran-
rilevamento automatico risente di prob-         dom oversampling and cost-sensitive training to
lemi di mancanza di bilanciamento nei           observe its effects on a supervised irony detection.
   The paper is structured as follows. In Section        Sulis et al. (2016) investigated a new set of fea-
2 we introduce related work on irony. In Sec-         tures for irony detection in Twitter with particular
tion 3 we describe the corpora used throughout        regard to affective features; and studied the differ-
experiments. In Sections 4 and 5 we describe the      ence between irony and sarcasm. Barbieri et al.
methodology and the result of the experiments. In     (2014) were the first ones to propose an approach
Section 6 we provide concluding remarks.              for irony detection in Italian.
                                                         Irony detection is a popular topic for shared
2   Related Works                                     tasks and evaluation campaigns. Among others,
                                                      SemEval-2015 (Ghosh et al., 2015) task on sen-
The detection of irony in text has been widely        timent analysis of figurative language in Twitter,
addressed. Carvalho et al. (2009) showed that         and SENTIPOLC 2014 (Basile et al., 2014) and
in Portuguese news blogs, pragmatic and gestu-        2016 (Barbieri et al., 2016) tasks on irony and
ral text features such as emoticons, onomatopoeic     sentiment classification in Twitter. SemEval con-
expressions and heavy punctuation marks work          sidered three broad classes of figurative language:
better than deeper linguistic information such as     irony, sarcasm and metaphor. The task was cast
n-grams, words or syntax. Reyes et al. (2013)         as a regression as participants had to predict a nu-
addressed irony detection in Twitter, using com-      meric score (crowd-annotated). The best perform-
plex features like temporal expressions, counter-     ing systems made use of manual and automatic
factuality markers, pleasantness or imageability      lexica, term-frequencies, part-of-speech tags, and
of words, and pair-wise semantic relatedness of       emoticons.
terms in adjacent sentences. This rich feature set
                                                         The SENTIPOLC campaigns on Italian tweets,
enabled the same authors to detect 30% of the
                                                      on the other hand, included three tasks: subjec-
irony in movie and book reviews in (Reyes and
                                                      tivity detection, sentiment polarity classification
Rosso, 2014).
                                                      and irony detection (binary classification). The
   Ravi and Ravi (2016), on the other hand, ex-
                                                      best performing systems utilized broad sets of fea-
ploited resources such as LIWC (Tausczik and
                                                      tures ranging from the established Twitter-based
Pennebaker, 2010) to analyze irony in two differ-
                                                      features, such as URL links, mentions, and hash-
ent domains: satirical news and Amazon reviews;
                                                      tags, to emoticons, punctuation, and vector space
and found out that LIWC’s words related to sex or
                                                      models to spot out-of-context words (Castellucci
death are good indicators of irony.
                                                      et al., 2014). Specifically, in SENTIPOLC 2016,
   Charalampakis et al. (2016) addressed irony de-    the best performing system exploited lexica, hand-
tection in Greek political tweets comparing semi-     crafted rules, topic models and Named Entities
supervised and supervised approaches, with the        (Di Rosa and Durante, 2016). In this paper, on the
aim to analyze whether irony predicts election re-    other hand, we address irony detection from fea-
sults or not. In order to detect irony, they use      tures not dependent on language resources such as
as features: spoken style words, word frequency,      manually crafted lexica and source-dependent fea-
number of WordNet SynSets as a measure of am-         tures such as hashtags and emoticons.
biguity, punctuation, repeated patterns and emoti-
cons. They found that supervised methods work         3   Data Set
better than semi-supervised in the prediction of
irony (Charalampakis et al., 2016).                   The experiments reported in this paper make use
   Poria et al. (2016) developed models based on      of two data sets: SENTIPOLC 2016 (Barbieri et
pre-trained convolutional neural networks (CNNs)      al., 2016) and CorEA (Celli et al., 2014). While
to exploit sentiment, emotion and personality fea-    SENTIPOLC is a corpus of tweets, CorEA is a
tures for a sarcasm detection task. They trained      data set of news articles and related reader com-
and tested their models on balanced and unbal-        ments collected from the Italian news website cor-
anced sets of tweets retrieved searching the hash-    riere.it. The two corpora consist of inherently dif-
tag #sarcasm. They found that CNNs with pre-          ferent types of text. While tweets have a limit on
trained models perform very well and that, al-        the length of the post, news articles comments are
though sentiment features are good also when used     not constrained. The length limitation does not
alone, emotion and personality features help in the   only impact the number of tokens per post, but
task (Poria et al., 2016).                            also the style of writing, since in Tweets authors
 SENTIPOLC 2016                                                    CorEA
 @gadlernertweet Se #Grillo fosse al governo, dopo due mesi        bravo, escludi l’universitá .... restare ignoranti non fa male
 lo Stato smetterebbe di pagare stipendi e pensioni. E lui         a nessuno, solo a sé stessi. questi sono i nostri.... geni. non
 capeggerebbe la rivolta                                           mi meraviglierei se votasse grillo
 #Grillo,fa i comizi sulle cassette della frutta,mentre alcune     beh dipende da come la guardi..A campagna elettorale
 del #Pdl li fanno senza,cassetta...solo sulle banane. #ballaró   all’inverso: rispettano ció che avevano promesso
 @Italialand
 @MissAllyBlue Non mi fido della compagnia.. meglio far            Saranno solo 4 milioni (comunque dimentichi i 42 mil di
 finta di stare sveglio.. sveglissimo O o                          rimborsi) peró pochi o tanti li hanno restituiti. Gli altri in-
                                                                   vece , probabilmente politici a te “simpatici” continuano a
                                                                   gozzovigliare con i soldi tuoi . Sveglia volpone

                    Table 1: Examples of ironic posts from SENTIPOLC 2016 and CorEA.


naturally try to squeeze as much content as possi-                                       Non-Ironic      Ironic              Total
ble within the limits.                                                                    SENTIPOLC 2016
                                                                       Training        6,542 (88%) 868 (12%)                7,410
   This difference can be seen also in the type                        Test            1,765 (88%) 235 (12%)                2,000
of irony used across the two corpora, as shown                         CorEA           2,299 (80%) 576 (20%)                2,875
in the examples reported in Table 1. While in
                                                                   Table 2: Counts and percentages of ironic and
Tweets we observe much more the presence of
                                                                   non-ironic posts in SENTIPOLC 2016 training
external ‘sources’ (such as URL links, mentions,
                                                                   and test set and CorEA corpus.
hashtags and emoticons) to signal the irony and
make it interpretable (for example by disambiguat-
ing entities using hashtags); news fora users tend                 159K for SENTIPOLC and 164K for CorEA.
to use style much more similar to natural language,                Consequently, there are drastic differences in the
where entities are not specifically signaled and                   average number of tokens per post: 21 for SEN-
there are no emojis to mark the non-literal mean-                  TIPOLC and 57 for CorEA. As shown in Table 2,
ing of a sentence. Thus, CorEA presents a more                     we also observe a major difference in the percent-
difficult, but also a more interesting, dataset for                ages of ironic posts between the corpora: 12% for
automatic irony detection, given the closer simi-                  SENTIPOLC and 20% for CorEA.
larity to the language used in other genres.
   Both corpora have been annotated following                      4      Methodology
a version of the scheme of SENTIPOLC 2014                          In this paper we address irony detection in Ital-
(Basile et al., 2014). According to the scheme, the                ian making use of source independent and ‘easily’
annotator is asked to decide whether the given text                obtainable representations of text such as lexical
is subjective or not, and in case it is considered                 (bag-of-words), stylometric, and word embedding
subjective, to annotate the polarity of the text and               vectors. The models are trained and tested using
irony as binary values. The CorEA corpus (Celli et                 Support Vector Machines (SVM) (Vapnik, 1995)
al., 2014) was annotated for irony by three anno-                  with linear kernel and defaults parameters, imple-
tators specifically for this paper, and has an inter-              mented in the scikit-learn (Pedregosa et al., 2011)
annotator agreement of κ = 0.57.                                   python library.
   Since SENTIPOLC 2016 is composed of differ-                        To obtain the desired representations of text, the
ent data sets, which used various agreement met-                   data is pre- For the bag-of-word representation, the
rics (Barbieri et al., 2016), it is not possible to                data is lowercased, and all source-specific entities,
directly compare the inter-annotator agreements                    such as emoji, URL, Twitter hashtags, and men-
between the corpora. The two component data                        tions are mapped to a single entity (e.g. hHi for
sets of SENTIPOLC 2016 for which a comparable                      hashtags); as the objective is to use Twitter mod-
metric is reported have an inter-annotator agree-                  els to detect irony in news fora and other kinds
ment of κ = 0.538 (TW-SENTIPOLC14) and                             of textual data, where presence of such entities is
κ = 0.492 (TW-BS) (Stranisci et al., 2016).                        less likely. We also apply a cut-off frequency and
   Despite the differences in the number of posts                  remove all the tokens that appear in a single docu-
(9,410 for SENTIPOLC and 2,875 for CorEA; see                      ment only.
Table 2); due to the length constraint of the former,                 For the style representation, we use the lexical
the corpora have comparable numbers of tokens:                     richness metrics based on type and token frequen-
cies such as type-token ratio, entropy, Guiraud’s       Model            NI      I      Mic-F1     Mac-F1
R, Honores H, etc. (Tweedie and Baayen, 1998)                          SENTIPOLC: Training
                                                        BL: Chance     0.8783 0.1183     0.7862     0.4983
(22 features); and character-type ratios, (includ-      BL: Majority   0.9378 0.0000     0.8829     0.4689
ing specific punctuation marks) (46 features) that      BoW            0.8979 0.2112     0.8207     0.5546
previously were successfully applied to tasks such      Style          0.8817 0.0892     0.7612     0.4605
                                                        WE             0.9361 0.0044     0.8799     0.4702
as agreement-disagreement classification (Celli et                           CorEA
al., 2016) and mood detection (Alam et al., 2016).      BL: Chance     0.7952 0.1895     0.6733     0.4923
   To extract the word embedding representation         BL: Majority   0.8886 0.0000     0.7996     0.4443
                                                        BoW            0.8414 0.2951     0.7411     0.5682
(Mikolov et al., 2013), we use skip-gram vec-           Style          0.7116 0.1688     0.6186     0.4402
tors (size: 300, window: 10) pre-trained on Ital-       WE             0.8811 0.1447     0.7912     0.5129
ian Wikipedia, and a document is represented as
a term-frequency weighted average of per-word          Table 3: Average per-class, micro and macro-
vectors.                                               F1 scores for stratified 10-fold cross-validation on
                                                       SENTIPOLC 2016 training set and CorEA for dif-
   Since our goal is to analyze utility of Twit-
                                                       ferent document representations: bag-of-words
ter data for irony detection in Italian news fora,
                                                       (BoW), stylometric features (Style) and word em-
we first experiment with the text representations
                                                       beddings (WE). BL: Chance and BL: Majority are
and chose models that behave above chance-level
                                                       chance-level and majority baselines. NI and I are
baseline on per-class F1 scores and Micro-F1
                                                       non-ironic and ironic classes, respectively.
score using a 10-fold stratified cross-validation
setting. Even though on imbalanced data the fre-
quently used evaluation metric is Macro-F1 score,      document representations behave similarly across
e.g. (Barbieri et al., 2016), which we report for      corpora, and the only representation that achieves
comparison purposes; it is misleading as it does       above chance-level per-class and micro-F1 scores
not reflect the amount of correctly classified in-     is the bag-of-words. At the same time, it achieves
stances. The majority baseline, on the other hand,     the highest macro-F1 score. However, none of
is very strong for highly imbalanced data sets, and    the representations is able to surpass the majority
is provided for reference purposes only.               baseline in terms of micro-F1 .
   As data imbalance has been observed to ad-
                                                          The performance of the bag-of-words represen-
versely affect irony detection performance (Poria
                                                       tation on data balancing techniques is presented
et al., 2016; Ptacek et al., 2014), we experiment
                                                       in Table 4. The training with natural distribu-
with simple balancing techniques such as random
                                                       tion (BoW: ND) yields the best performance across
under- and oversampling and cost sensitive train-
                                                       the corpora. For SENTIPOLC data, it is the only
ing. While undersampling balances the data set
                                                       model that produces above chance-level (Table
by removing majority class instances, oversam-
                                                       3: BL: Chance) performances for per-class and
pling achieves that by replicating (copying) mi-
                                                       micro-F1 scores.
nority class instances. Undersampling is often re-
ported as a better option, as oversampling may            Cost-sensitive training (BoW: CS) and random
lead to overfitting problems (Chawla et al., 2002).    oversampling (BoW: RO) perform very close. For
In cost-sensitive training, on the other hand, the     CorEA corpus, all balancing techniques except
performance on minority class is improved by           random undersampling (BoW: RU) yield above
higher misclassification costs for it. In the paper,   chance-level performances. Random undersam-
the selected representations are analyzed in terms     pling, however, yields the highest F1 score for
of balancing effects and cross-source performance      the irony class, which unfortunately comes at the
(Twitter - news fora).                                 expense of the overall performance. This ver-
                                                       ifies previous observations in the literature that
5   Results and Discussion                             undersampling leads to negative effect on novel
                                                       imbalanced data (Stepanov and Riccardi, 2011).
The results of experiments comparing different         Since cost-sensitive training achieves the best per-
document representations – bag-of-words, writ-         formance in terms of macro-F1 score, which was
ing style, and word embeddings – are presented         used as official evaluation metrics in SENTIPOLC
in Table 3 for stratified 10-fold cross-validation     2016 (Barbieri et al., 2016), it is retained for SEN-
on both corpora (SENTIPOLC and CorEA). The             TIPOLC training-test and cross-corpora (SEN-
    Model         NI      I      Mic-F1     Mac-F1       Model           NI       I       Mic-F1 Mac-F1
                SENTIPOLC: Training                                SENTIPOLC: Training - Test Split
    BoW: ND     0.8979 0.2112     0.8207     0.5546      BL: Chance    0.8826 0.1155       0.7927   0.4990
    BoW: CS     0.8732 0.2493     0.7861     0.5612      BL: Majority  0.9376 0.0000       0.8825   0.4688
    BoW: RO     0.8737 0.2375     0.7857     0.5555      SoA           0.9115 0.1710             –  0.5412
    BoW: RU     0.7270 0.2679     0.6115     0.4974      BoW: ND       0.9330 0.1678       0.8760   0.5504
                      CorEA                              BoW: CS       0.9245 0.2023       0.8620   0.5634
    BoW: ND     0.8414 0.2951     0.7411     0.5682              SENTIPOLC - CorEA: 10-fold testing
    BoW: CS     0.8331 0.3202     0.7321     0.5766      BL: Chance    0.8393 0.1213       0.7286   0.4803
    BoW: RO     0.8302 0.3138     0.7279     0.5720      BL: Majority  0.8886 0.0000       0.7996   0.4443
    BoW: RU     0.6882 0.3599     0.5810     0.5241      BoW: ND       0.8164 0.1755       0.7001   0.4959
                                                         BoW: CS       0.8109 0.2020       0.6945   0.5065
Table 4: Average per-class, micro and macro-
F1 scores for stratified 10-fold cross-validation       Table 5: Average per-class, micro and macro-F1
on SENTIPOLC 2016 training set and CorEA                scores for SENTIPOLC Training-Test split and
for balancing techniques: cost-sensitive training       10-fold testing of SENTIPOLC models on CorEA
(CS), random oversampling (RO) and random un-           for bag-of-words representation with imbalanced
dersampling (RU). ND is training with natural dis-      (ND) and cost-sensitive (CS) training. SoA are
tribution of classes (BoW in Table 3). NI and I are     the state-of-the-art results for SENTIPOLC 2016:
non-ironic and ironic classes, respectively.            the system of (Di Rosa and Durante, 2016). BL:
                                                        Chance and BL: Majority are chance-level and
                                                        majority baselines. NI and I are non-ironic and
TIPOLC - CorEA) evaluation along with the mod-
                                                        ironic classes, respectively.
els trained on natural imbalanced distribution with
equal costs.
   The final models make use of bag-of-words rep-       words, writing style as stylometric features, and
resentation and are trained on SENTIPOLC train-         word embeddings. The objective is to evaluate
ing set in cost-sensitive and insensitive settings.     the suitability of Twitter data for detecting irony
The evaluation of models is performed on SEN-           in news fora. The models were compared for bal-
TIPOLC 2016 test set and CorEA’s 10-folds. This         anced and imbalanced training, as well as cross-
setting allows us to compare our results to the state   corpora performance. We have observed that
of the art on SENTIPOLC data and CorEA’s cross-         the bag-of-words representation with imbalanced
validation setting. From the results in Table 5,        cost-insensitive training produces the best results
we observe that on the SENTIPOLC test set both          (micro-F1 ) across settings, closely followed by
models outperform the state of the art in terms of      cost-sensitive training.
macro-F1 score. The model with cost-sensitive              The models outperform the results on irony de-
training additionally outperforms it in terms of        tection in Italian tweets (Di Rosa and Durante,
irony class F1 score. However, both models fall         2016) in terms of macro-F1 scores reported for
slightly short of outperforming the majority base-      SENTIPOLC 2016 (Barbieri et al., 2016). How-
line in terms of micro-F1 .                             ever, micro-F1 is the most informative metric for
   In the cross-corpora setting the behavior of         the downstream application of irony detection, as
models is similar – cost-sensitive training favors      it considers the total amount of true positives.
minority class F1 and macro-F1 scores. While            Given that the highest micro-F1 is attained by the
both models perform worse than the chance-level         majority baselines for both corpora (0.8829 for
baseline generated using the label distribution of      SENTIPOLC and 0.7996 for CorEA), the task of
SENTIPOLC data in terms of micro-F1 , they both         irony detection is far from being solved.
outperform it in terms of irony class F1 score.
However, only the model with cost-sensitive train-      Acknowledgments
ing yields statistically significant difference using   The research leading to these results has re-
paired two-tail t-test with p = 0.05.                   ceived funding from the European Union – Sev-
                                                        enth Framework Programme (FP7/2007-2013) un-
6     Conclusion                                        der grant agreement No. 610916 – SENSEI.
We have presented experiments on irony detec-             We would like to thank Paolo Rosso and Mirko
tion in Italian Twitter and news fora data compar-      Lai for their help in annotating CorEA.
ing different document representations – bag-of-
References                                                    E. Duchesnay. 2011. Scikit-learn: Machine learn-
                                                              ing in Python. Journal of Machine Learning Re-
F. Alam, F. Celli, E.A. Stepanov, A. Ghosh, and G. Ric-       search.
   cardi. 2016. The social mood of news: Self-
   reported annotations to design automatic mood de-        S. Poria, E. Cambria, D. Hazarika, and P. Vij. 2016. A
   tection systems. In PEOPLES @COLING.                        deeper look into sarcastic tweets using deep convo-
                                                               lutional neural networks. arXiv:1610.08815.
F. Barbieri, F. Ronzano, and H. Saggion. 2014. Italian
   irony detection in twitter: a first approach. In CLiC-   T. Ptacek, I. Habernal, and J. Hong. 2014. Sarcasm
   it 2014 & EVALITA.                                          detection on czech and english twitter. In COLING.
F. Barbieri, V. Basile, D. Croce, M. Nissim,                K. Ravi and V. Ravi. 2015. A survey on opinion min-
  N. Novielli, and V. Patti. 2016. Overview of the            ing and sentiment analysis: tasks, approaches and
  evalita 2016 sentiment polarity classification task.        applications. Knowledge-Based Systems.
  In CLiC-it - EVALITA.
                                                            K. Ravi and V. Ravi. 2016. A novel automatic satire
V. Basile, A. Bolioli, M. Nissim, V. Patti, and P. Rosso.     and irony detection using ensembled feature selec-
   2014. Overview of the evalita 2014 sentiment polar-        tion and data mining. Knowledge-Based Systems.
   ity classification task. In EVALITA.
                                                            A. Reyes and P. Rosso. 2014. On the difficulty of au-
P.    Carvalho, L. Sarmento, M.J. Silva, and                  tomatically detecting irony: beyond a simple case of
     E. De Oliveira. 2009. Clues for detecting irony in       negation. Knowledge and Information Systems.
     user-generated contents: oh...!! it’s so easy;-. In
     Topic-sentiment analysis for mass opinion.             A. Reyes, P. Rosso, and T. Veale. 2013. A multidimen-
                                                               sional approach for detecting irony in twitter. Lan-
G. Castellucci, D. Croce, and R. Basili. 2014. Context-        guage resources and evaluation, 47(1):239–268.
   aware convolutional neural networks for twitter sen-
   timent analysis in italian. In EVALITA.                  E.A. Stepanov and G. Riccardi. 2011. Detecting gen-
                                                              eral opinions from customer surveys. In SENTIRE
F. Celli, G. Riccardi, and A. Ghosh. 2014. CorEA:             @ICDM.
   Italian news corpus with emotions and agreement.
   In CLIC-it.                                              M. Stranisci, C. Bosco, D.I. Hernández Farı́as, and
                                                              V. Patti. 2016. Annotating sentiment and irony in
F. Celli, E.A. Stepanov, and G. Riccardi. 2016. Tell          the online Italian political debate on #labuonascuola.
   me who you are, I’ll tell whether you agree or             In LREC.
   disagree: Prediction of agreement/disagreement in
   news blogs. In NLPJ @IJCAI.                              E. Sulis, D.I. Hernández Farı́as, P. Rosso, V. Patti, and
                                                               G. Ruffo. 2016. Figurative messages and affect in
B. Charalampakis, D. Spathis, E. Kouslis, and K. Ker-          Twitter: Differences between# irony,# sarcasm and#
   manidis. 2016. A comparison between semi-                   not. Knowledge-Based Systems, 108:132–143.
   supervised and supervised text mining techniques on
   detecting irony in greek political tweets. Engineer-     Y.R. Tausczik and J.W. Pennebaker. 2010. The psy-
   ing Applications of Artificial Intelligence, 51:50–57.     chological meaning of words: LIWC and comput-
                                                              erized text analysis methods. Journal of Language
N.V. Chawla, K.W. Bowyer, L.O. Hall, and W.P.                 and Social Psychology.
  Kegelmeyer. 2002. Smote: Synthetic minority over-
  sampling technique. J. Artif. Int. Res., 16(1):321–       F.J. Tweedie and R.H. Baayen. 1998. How variable
  357.                                                         may a constant be? Measures of lexical richness in
                                                               perspective. Computers and the Humanities.
E. Di Rosa and A. Durante. 2016. Tweet2check
                                                            V.N. Vapnik. 1995. The Nature of Statistical Learning
  evaluation at evalita sentipolc 2016. In CLiC-it -
                                                              Theory. Springer.
  EVALITA.

A. Ghosh, G. Li, T. Veale, P. Rosso, E. Shutova,
  J. Barnden, and A. Reyes. 2015. Semeval-2015
  task 11: Sentiment analysis of figurative language
  in twitter. In SemEval.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013.
   Efficient estimation of word representations in vec-
   tor space. arXiv preprint arXiv:1301.3781.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
   B. Thirion, O. Grisel, M. Blondel, P. Pretten-
   hofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas-
   sos, D. Cournapeau, M. Brucher, M. Perrot, and