Irony detection in tweets: X2Check at Ironita 2018 (Short Paper)

Emanuele Di Rosa, Chief Technology Officer, App2Check s.r.l., emanuele.dirosa@app2check.com
Alberto Durante, Research Scientist, App2Check s.r.l., alberto.durante@app2check.com

Abstract

English. In this paper we describe and show the results of the two systems that we have specifically developed to participate in the irony detection task at Ironita 2018. We scored as the third team in the official ranking of the competition, thanks to the X2C-B system, at a distance of just 0.027 of F1 score from the best system.

Italiano. In this report we describe the two systems that we developed ad hoc to participate in Ironita 2018, specifically in the irony detection task. Our team ranked third in the official ranking of the competition, thanks to our system X2C-B, which obtained an F1 score only 0.027 lower than the first-ranked system.

1 Introduction

In social media the use of irony in tweets and Facebook posts is widespread, and it makes it very difficult for sentiment analysis tools to automatically classify people's opinions correctly (Hernández and Rosso, 2016). The ability to detect irony with high accuracy would bring an important contribution to opinion mining systems and lead to many industrial applications. For this reason, irony detection has been widely studied in recent research papers such as (Farías et al., 2011), (Barbieri et al., 2014), (Farías et al., 2016) and (Freitas et al., 2014).

In this paper we describe and show the results of the two systems that we have specifically developed to participate in the irony detection task at Ironita 2018 (Cignarella et al., 2018). We scored as the third team in the official ranking of the competition, thanks to the X2C-B system, at a distance of just 0.027 of F1 score from the best system.

This paper is structured as follows: after the introduction we present the descriptions of our two systems submitted for the irony detection task; then we show and discuss the results on the official test set of the competition; finally, we provide our conclusions.

2 Systems description

The dataset provided by the Ironita organizers has been split into a training set (80% of the documents) and a development set (the remaining 20%). We randomly sampled the examples for each category, thus obtaining different training and development sets while keeping the distribution of ironic and non-ironic samples across the two sets. We submitted two runs, produced by the two different systems we developed, called X2C-A and X2C-B. The former has been developed on top of the Scikit-learn library (Pedregosa et al., 2011) in Python, the latter on top of the WEKA library (Frank et al., 2016) in Java.

In both cases the input text has been cleaned with a typical NLP pipeline involving the removal of punctuation (with the exclusion of question and exclamation marks), numbers and stopwords. Moreover, since detecting irony in a text is still hard, very often also for humans, we tried to take advantage of features that may help signal the presence of irony. For instance, question and exclamation marks, text strings representing laughs, emoticons and mixed sentiment in the same sentence are some of the text features that we extracted and represented with a specific explicit marker highlighting their presence.

Both the X2C-A and X2C-B unconstrained runs were trained using the SENTIPOLC 2016 irony training and test sets (Barbieri et al., 2016) as an external source, in addition to the Ironita training set.
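A minimal sketch of such a cleaning-and-marking step could look like the following. The marker tokens, the regular expressions and the tiny stopword list are our own illustrative choices, not the actual X2C implementation; mixed-sentiment detection is omitted for brevity.

```python
import re

# Illustrative irony-trigger patterns (assumptions made for this sketch).
LAUGH_RE = re.compile(r"\b(?:ah|ha|eh|he){2,}\b", re.IGNORECASE)
EMOTICON_RE = re.compile(r"[:;=]-?[)(DPp]")
STOPWORDS = {"il", "la", "di", "e", "che", "un", "una", "per"}  # tiny example list

def preprocess(text: str) -> str:
    # 1. Extract trigger features before stripping punctuation.
    features = []
    if "?" in text:
        features.append("_QUESTION_")
    if "!" in text:
        features.append("_EXCLAMATION_")
    if LAUGH_RE.search(text):
        features.append("_LAUGH_")
    if EMOTICON_RE.search(text):
        features.append("_EMOTICON_")
    # 2. Remove punctuation and numbers, lowercase, drop stopwords.
    cleaned = re.sub(r"[^\w\s]|\d", " ", text)
    tokens = [t for t in cleaned.lower().split() if t not in STOPWORDS]
    # 3. Append the explicit markers to the cleaned text.
    return " ".join(tokens + features)

print(preprocess("Ma che bella giornata!!! ahahah :)"))
# → "ma bella giornata ahahah _EXCLAMATION_ _LAUGH_ _EMOTICON_"
```

The markers survive downstream tokenization as ordinary tokens, so any bag-of-words vectorizer can pick them up as features.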
2.1 X2C-A

The X2C-A system has been created by applying an NLP pipeline that vectorizes the collection of tweets into a matrix of bi-gram token counts; the count matrix is then transformed into a normalized tf-idf representation (term frequency times inverse document frequency). For the training, we created an ensemble model, more specifically a voting ensemble, that takes into account three different algorithms: LinearSVC (an implementation of Support Vector Machines), Multinomial Naive Bayes and the SGD classifier, all of which have an implementation available in the Scikit-learn library. The ensemble has been the best model in our model selection activity. In order to properly select the best hyper-parameters, we applied a grid search approach for each of the models in the voting ensemble. The resulting ensemble model showed a macro F1 score of 70.98 on our development set, a value very close to the final result on the competition test set (shown in Table 3).

            Acc    F1 ironic  Macro F1
LinearSVM   0.706  0.699      0.706
NB          0.706  0.699      0.706
SGD         0.697  0.728      0.693
Ensemble    0.710  0.709      0.710

Table 1: Results on the development set for X2C-A constrained.

2.2 X2C-B

In the model selection process, the two best algorithms have been Naive Bayes Multinomial and SMO, both using unigram features. We took into account the F1 score on the positive label and the Macro F1 in order to select the best algorithm. As shown in Table 2, Naive Bayes Multinomial reached a Macro F1 score 2.38% higher than SMO on the constrained run and 14.2% higher on the unconstrained run, thus both the constrained and the unconstrained submitted runs were produced using this algorithm.

            F1 non-iro  F1 iro  Macro F1
NB-const    0.715       0.696   0.707
NB-uncon    0.729       0.750   0.740
SMO-const   0.678       0.689   0.683
SMO-uncon   0.704       0.492   0.598

Table 2: Results on the development set for X2C-B.

Comparing the results in Table 2 with the ones in Table 1, we can notice that X2C-B unconstrained reached the highest performance on the development set, while X2C-B constrained obtained the lowest score.

3 Results and discussion

In Table 3 we show the results of our runs on the official test set of the competition. In accordance with what we noticed before when comparing Table 1 and Table 2, our best run is X2C-B unconstrained, which reached the best F1 on non-ironic documents overall; it also ranks fifth in the overall F1 score, at a distance of 0.027 from the best system. The performance of the two X2C-A runs is very similar, with the unconstrained run obtaining an F1 score only 0.002 higher than the constrained one. The difference between the two X2C-B runs is larger, but still only 0.021. We can also see that our X2C-B-u shows the best F1 score on the non-ironic tweets compared to all of the systems.

We also added to this ranking the model that reached the first position in the irony task at SENTIPOLC 2016 (Di Rosa and Durante, 2016). On this test set that model, called X2C2016 in the table, reached an F1 score of just 0.432, which is lower than this year's baseline. This surprising result may indicate either that irony detection systems have greatly improved in the past two years, or that the performance of irony detectors depends heavily on the topics treated in the training set, i.e. they are still not good at generalizing.

4 Conclusions

In this paper we described the two systems that we built and submitted to the Ironita 2018 irony detection task. The results show that our system X2C-B scored as the third team, at a distance of just 0.027 of F1 score from the best system.
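As a concrete illustration of the X2C-A pipeline of Section 2.1 (bi-gram counts, tf-idf weighting, and a voting ensemble of LinearSVC, Multinomial Naive Bayes and SGD tuned by grid search), a minimal Scikit-learn sketch could look as follows. The toy data, the hyper-parameter grid and the (1, 2) n-gram range are assumptions made for the example, not the configuration of the submitted run.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hard voting: LinearSVC exposes no predict_proba, so soft voting is not applicable.
ensemble = VotingClassifier(
    estimators=[
        ("svc", LinearSVC()),
        ("nb", MultinomialNB()),
        ("sgd", SGDClassifier(random_state=0)),
    ],
    voting="hard",
)

pipeline = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 2))),  # token counts up to bi-grams
    ("tfidf", TfidfTransformer()),                    # normalized tf-idf weighting
    ("clf", ensemble),
])

# Grid search over hyper-parameters of the ensemble members
# (an illustrative grid, not the one used for the submitted run).
grid = GridSearchCV(
    pipeline,
    param_grid={"clf__svc__C": [0.1, 1.0, 10.0], "clf__nb__alpha": [0.5, 1.0]},
    cv=3,
    scoring="f1_macro",
)

# Toy data standing in for the Ironita training set (label 1 = ironic).
texts = [
    "che bella giornata", "ottimo servizio clienti", "film davvero bello",
    "sì certo come no", "fantastico un altro ritardo", "complimenti per la puntualità",
]
labels = [0, 0, 0, 1, 1, 1]
grid.fit(texts, labels)
print(grid.best_params_)
```

Nested parameters of the ensemble members are addressed with Scikit-learn's double-underscore syntax (e.g. `clf__svc__C`), which lets a single grid search tune every member of the voting ensemble.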
rank  team        F1 non-iro  F1 iro  Macro F1
 1    team 1      0.707       0.754   0.731
 2    team 1      0.693       0.733   0.713
 3    team 2      0.689       0.730   0.710
 4    team 2      0.689       0.730   0.710
 5    X2C-B-u     0.708       0.700   0.704
 6    team 4      0.662       0.739   0.700
 7    team 4      0.668       0.733   0.700
 8    X2C-A-u     0.700       0.689   0.695
 9    team 5      0.668       0.722   0.695
10    X2C-A-c     0.679       0.708   0.693
11    X2C-B-c     0.674       0.693   0.683
12    team 6      0.603       0.700   0.651
13    team 6      0.626       0.665   0.646
14    team 6      0.579       0.678   0.629
15    team 6      0.652       0.577   0.614
16    baseline-1  0.503       0.506   0.505
17    team 7      0.651       0.289   0.470
18    X2C2016     0.665       0.198   0.432
19    team 7      0.645       0.195   0.420
20    baseline-2  0.668       0.000   0.334

Table 3: Ironita 2018 official ranking.

References

Francesco Barbieri and Horacio Saggion. 2014. Modelling Irony in Twitter: Feature Analysis and Evaluation. In Proceedings of LREC 2014.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti and Paolo Rosso. 2018. Overview of the Evalita 2018 Task on Irony Detection in Italian Tweets (IronITA). In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18).

Emanuele Di Rosa and Alberto Durante. 2016. Tweet2Check evaluation at Evalita Sentipolc 2016. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Delia Irazú Hernández Farías et al. 2011. Irony Detection in Twitter: The Role of Affective Content. ACM Transactions on Internet Technology 16: 19:1-19:24.

Delia Irazú Hernández Farías, Viviana Patti and Paolo Rosso. 2016. Irony Detection in Twitter: The Role of Affective Content. ACM Transactions on Internet Technology 16, 3, Article 19 (July 2016), pp. 1-24. DOI: https://doi.org/10.1145/2930663

Eibe Frank, Mark A. Hall and Ian H. Witten. 2016. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann, Fourth Edition.

Larissa Freitas, Aline Vanin, Denise Hogetop, Marco N. Bochernitsan and Renata Vieira. 2014. Pathways for irony detection in tweets. In Proceedings of the ACM Symposium on Applied Computing. DOI: 10.1145/2554850.2555048.

I. Hernández and P. Rosso. 2016. Irony, Sarcasm, and Sentiment Analysis. Chapter 7 in Sentiment Analysis in Social Networks, F.A. Pozzi, E. Fersini, E. Messina and B. Liu (Eds.), Elsevier Science and Technology, pp. 113-128.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825-2830.

E. Sulis, I. Hernández Farías, P. Rosso, V. Patti and G. Ruffo. 2016. Figurative Messages and Affect in Twitter: Differences Between #irony, #sarcasm and #not. Knowledge-Based Systems, vol. 108, pp. 132-143.