UNIBA - Integrating distributional semantics features in a supervised approach for detecting irony in Italian tweets

Pierpaolo Basile and Giovanni Semeraro
Department of Computer Science, University of Bari Aldo Moro
Via E. Orabona, 4 - 70125 Bari (Italy)
{pierpaolo.basile,giovanni.semeraro}@uniba.it

Abstract

English. This paper describes the UNIBA team participation in the IronITA 2018 task at EVALITA 2018. We propose a supervised approach based on LIBLINEAR that relies on keyword, polarity and micro-blogging features, and on the representation of tweets in a distributional semantic model. Our system ranked 3rd and 4th in the irony detection subtask. We participated only in the constrained run, exploiting the training data provided by the task organizers.

Italiano. This paper describes the participation of the UNIBA team in the IronITA 2018 task organized at EVALITA 2018. We propose a supervised approach based on LIBLINEAR that exploits keywords, polarity, attributes typical of micro-blogs, and the representation of tweets in a distributional semantic space. Our system ranked third and fourth in the irony detection subtask. We participated only in the constrained run, using the training data provided by the task organizers.

1 Introduction

Irony is defined as "the use of words that say the opposite of what you really mean, often as a joke and with a tone of voice that shows this" (Oxford Learner's Dictionary). This suggests that, when we analyze written text to detect irony, we should focus our attention on those words that are used in an unconventional context.
For example, given the tweet "S&P ha declassato Mario Monti da Premier a Badante #declassaggi" (in English: "S&P has downgraded Mario Monti from Premier to Caregiver"), we can observe that the word "badante" (caregiver) is used in an unconventional context, since "caregiver" usually does not co-occur with words such as "Premier" or "Mario Monti".

Following this idea, in our work we introduce a feature able to detect words used out of their usual context. Moreover, we integrate further features based on keywords, bigrams, trigrams, polarity and micro-blogging attributes, as reported in (Basile and Novielli, 2014). Our idea is supported by the best systems participating in SemEval-2018 Task 3 - Irony detection in English tweets (Van Hee et al., 2018), where the best systems not based on deep learning exploit features based on polarity contrast information and context incongruity.

We evaluate our approach in the context of the IronITA task at EVALITA 2018 (Cignarella et al., 2018). The goal of the task is to predict irony in Italian tweets. The task is organized in two subtasks: 1) irony detection and 2) different types of irony. In the second subtask, participants must identify whether the irony is sarcasm or not. In this paper, we propose an approach that detects the presence of irony without taking into account the different types of irony. We evaluate the approach in a constrained setting, using only the data provided by the task organizers. The only external resources exploited in our approach are a polarity lexicon and a collection of about 40M tweets randomly extracted from TWITA (Basile and Nissim, 2013), a collection of about 800M Italian tweets.

The paper is structured as follows: Section 2 describes our system, while evaluation and results are reported in Section 3. Final remarks are provided in Section 4.

2 System Description

Our approach adopts a supervised classifier based on LIBLINEAR (Fan et al., 2008); in particular, we use the L2-regularized L2-loss linear SVM. Each tweet is represented using several sets of features:

keyword-based: keyword-based features exploit the tokens occurring in the tweets. Unigrams, bigrams and trigrams are considered. During tokenization we replace user mentions and URLs with two metatokens: "USER" and "URL";

microblogging: microblogging features take into account some attributes of the tweets that are peculiar to the context of microblogging. We exploit the following features: the presence of emoticons, character repetitions (which usually play the same role as intensifiers in informal writing), informal expressions of laughter (i.e., sequences of "ah"), and the presence of exclamation and question marks. All microblogging features are binary;

polarity: this block contains features extracted from the SentiWordNet lexicon (Esuli and Sebastiani, 2006). We translate SentiWordNet into Italian through MultiWordNet (Pianta et al., 2002). It is important to underline that SentiWordNet is a synset-based lexicon, while our Italian translation is a word-based lexicon. In order to automatically derive our Italian sentiment lexicon from SentiWordNet, we perform three steps. First, we map the synset offsets in SentiWordNet from version 3.0 to 1.6 (since MultiWordNet is based on WordNet 1.6) using an automatically generated mapping file. Then, we transfer the prior polarities of SentiWordNet to the Italian lemmata. Finally, we expand the lexicon using Morph-it! (Zanchetta and Baroni, 2005), a lexicon of inflected forms with their lemma and morphological features: we extend the polarity scores of each lemma to its inflected forms. Details about the creation of the sentiment lexicon are reported in (Basile and Novielli, 2014). The obtained Italian translation of SentiWordNet is used to compute three features based on the prior polarity of the words in the tweet: 1) the maximum positive polarity; 2) the maximum negative polarity; 3) the polarity variation: each token occurring in the tweet is assigned a tag, according to the highest polarity score of the token in the Italian lexicon, with values in the set {OBJ, POS, NEG}; the polarity variation counts how many switches from POS to NEG, or vice versa, occur in the tweet (a sketch of this feature is given after this list);

distributional semantics: we compute two kinds of distributional semantics features: 1) given a set of unlabelled downloaded tweets, we build a geometric space in which each word is represented as a mathematical point, and the similarity between words is computed as their closeness in the space; to represent a tweet in the geometric space, we adopt the superposition operator (Smolensky, 1990), that is, the vector sum of all the vectors of the words occurring in the tweet, and we use the tweet vector t as a semantic feature in training our classifiers; 2) we extract three features that take into account the usage of words in an unconventional context: for each word w_i we compute a score ac_i that measures how far the word is from its conventional context, and we use the average, the maximum and the minimum of all the ac_i scores as features. More details about the computation of the ac_i score are reported in Subsection 2.1.
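The polarity-variation feature can be made concrete with a short sketch. The following Python fragment only illustrates the counting described above, not the authors' implementation: the lexicon format, the tie-breaking rule, and the decision to skip OBJ tokens when counting switches are our assumptions.

```python
# Hedged sketch of the polarity-variation feature. `lexicon` is assumed to
# map a token to a (positive, negative) prior-polarity pair; tokens missing
# from the lexicon are treated as objective (OBJ). Both assumptions go
# beyond what the paper specifies.

def polarity_tag(token, lexicon):
    """Assign OBJ/POS/NEG according to the highest polarity score."""
    pos, neg = lexicon.get(token, (0.0, 0.0))
    if pos == 0.0 and neg == 0.0:
        return "OBJ"
    return "POS" if pos >= neg else "NEG"  # tie-breaking is an assumption

def polarity_variation(tokens, lexicon):
    """Count switches from POS to NEG, or vice versa, within the tweet."""
    tags = [polarity_tag(t, lexicon) for t in tokens]
    polar = [t for t in tags if t != "OBJ"]  # OBJ handling is an assumption
    return sum(1 for a, b in zip(polar, polar[1:]) if a != b)

# Toy example: one POS -> NEG switch.
toy_lexicon = {"buono": (0.8, 0.0), "terribile": (0.0, 0.9)}
print(polarity_variation(["che", "buono", "ma", "terribile"], toy_lexicon))  # 1
```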
2.1 Distributional Semantics Features

The distributional semantics model is built on a collection of tweets. We randomly extract 40M tweets from TWITA and build a semantic space based on the Random Indexing (RI) technique (Sahlgren, 2005), using a context window equal to 2. Moreover, we consider only words occurring more than ten times; we call this set of words the vocabulary. The context window is dynamic: it does not take into account words that are not in the vocabulary. Our vocabulary contains 105,543 terms.

The mathematical insight behind RI is the projection of a high-dimensional space onto a lower-dimensional one using a random matrix; this kind of projection does not compromise distance metrics (Dasgupta and Gupta, 1999). Formally, given an n × m matrix A and an m × k matrix R, which contains random vectors, we define a new n × k matrix B as:

    A_{n,m} · R_{m,k} = B_{n,k},  with k << m    (1)

The new matrix B has the property of preserving the distances between points: if the distance between any two points in A is d, then the distance d_r between the corresponding points in B satisfies d_r ≈ c × d. A proof is given by the Johnson-Lindenstrauss lemma (Dasgupta and Gupta, 1999).

Specifically, RI creates the WordSpace in two steps:

1. A context vector is assigned to each word. This vector is sparse, high-dimensional and ternary, which means that its elements can take values in {-1, 0, 1}. A context vector contains a small number of randomly distributed non-zero elements, and the structure of this vector follows the hypothesis behind the concept of Random Projection;

2. Context vectors are accumulated by analyzing co-occurring words. In particular, the semantic vector of any word is computed as the sum of the context vectors of the words that co-occur with it.

Formally, given a corpus C of n documents and a vocabulary V of m words extracted from C, we perform two steps: 1) assign a context vector c_i to each word in V; 2) compute a semantic vector sv_i for each word w_i as the sum of all the context vectors assigned to the words co-occurring with w_i. The context is the set of words that precede and follow w_i within the window.

For example, consider the following tweet: "siete il buono della scuola fatelo capire". In the first step, we assign a random context vector to each term, as follows:

c_siete  = (−1, 0, 0, −1, 0, 0, 0, 0, 0, 0)
c_buono  = (0, 0, 0, −1, 0, 0, 0, 1, 0, 0)
c_scuola = (0, 0, 0, 0, −1, 0, 0, 0, 1, 0)
c_fatelo = (0, 1, 0, 0, 0, −1, 0, 0, 0, 0)
c_capire = (−1, 0, 0, 0, 0, 0, 0, 0, 0, 1)

In the second step, we build a semantic vector for each term by accumulating the random vectors of its co-occurring words. For example, fixing the window size to 2, the semantic vector of the word scuola is the sum of the random vectors of siete, buono, fatelo and capire (the words il and della are skipped, since they are not in the vocabulary). Summing these vectors, the semantic vector for scuola results in (−2, 1, 0, −2, 0, −1, 0, 1, 0, 1). This operation is repeated for all the sentences in the corpus and for all the words in V. In this example we used very small vectors, but in a real scenario the vector dimension ranges from hundreds to thousands of dimensions. In particular, in our experiments we use a vector dimension equal to 200 with 10 non-zero elements.

In order to compute the ac_i score of a word w_i in a tweet, we build a context vector c_{w_i} as the sum of the random vectors assigned to the words that co-occur with w_i in the tweet. Then we compute the cosine similarity between c_{w_i} and the semantic vector sv_i assigned to w_i. The idea is to measure how dissimilar the semantic vector is from the context vector: if the word w_i has never appeared in the context under analysis, its semantic vector does not contain the random vectors of the words in that context, which results in a low cosine similarity. Finally, the divergence from the context is computed as:

    ac_i = 1 − cosSim(c_{w_i}, sv_i)
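The construction above translates almost directly into code. The following Python sketch illustrates, under simplified assumptions, how the WordSpace is built and how the ac_i divergence scores of a tweet are computed; it is not the authors' implementation. In particular, treating all other in-vocabulary tokens of the tweet as co-occurring with w_i when building c_{w_i} is our assumption, since the paper leaves the exact window for this step implicit.

```python
import numpy as np

# Illustrative sketch of Random Indexing and of the ac_i divergence score.
# The paper uses 200-dimensional vectors with 10 non-zero elements and a
# dynamic context window of size 2; other details are our assumptions.
DIM, NONZERO, WINDOW = 200, 10, 2
rng = np.random.default_rng(42)

def random_context_vector():
    """Sparse ternary vector: NONZERO randomly placed values in {-1, +1}."""
    v = np.zeros(DIM)
    idx = rng.choice(DIM, size=NONZERO, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

def build_wordspace(corpus, vocabulary):
    """Step 1: assign context vectors; step 2: accumulate semantic vectors."""
    context = {w: random_context_vector() for w in vocabulary}
    semantic = {w: np.zeros(DIM) for w in vocabulary}
    for sentence in corpus:
        # Dynamic window: out-of-vocabulary tokens are skipped entirely.
        tokens = [t for t in sentence if t in vocabulary]
        for i, w in enumerate(tokens):
            neighbours = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
            for n in neighbours:
                semantic[w] += context[n]
    return context, semantic

def cos_sim(a, b):
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0

def divergence_scores(tweet, context, semantic):
    """ac_i = 1 - cosSim(c_wi, sv_i) for each in-vocabulary word of a tweet."""
    tokens = [t for t in tweet if t in semantic]
    scores = []
    for i, w in enumerate(tokens):
        # Assumption: every other in-vocabulary token of the tweet co-occurs
        # with w when building its in-tweet context vector c_wi.
        c_wi = sum((context[n] for j, n in enumerate(tokens) if j != i),
                   np.zeros(DIM))
        scores.append(1.0 - cos_sim(c_wi, semantic[w]))
    return scores  # the classifier uses the average, maximum and minimum
```

The tweet vector used as a feature in UNIBA1 is obtained analogously, by superposition: the sum of the semantic vectors of the words occurring in the tweet.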
3 Evaluation

We perform the evaluation using the data provided by the task organizers. The training set contains 3,977 tweets, while the test set consists of 872 tweets. The only parameter to set in LIBLINEAR is C (the cost): after a 5-fold cross-validation on the training set, we set C=1 (a sketch of this model-selection step is shown below).
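As an illustration of the model-selection step, the following is a minimal sketch that tunes C by 5-fold cross-validation. It uses scikit-learn's LinearSVC, which wraps LIBLINEAR and defaults to the L2-regularized, L2-loss (squared hinge) linear SVM, as a stand-in for the authors' setup; the feature matrix, the labels, and the grid of candidate C values are placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Hedged sketch: select the cost parameter C via 5-fold cross-validation.
# X (tweets x features) and y (1 = ironic, 0 = non-ironic) stand in for the
# feature sets of Section 2; here they are random placeholders.
X = np.random.rand(3977, 1000)
y = np.random.randint(0, 2, size=3977)

svm = LinearSVC(penalty="l2", loss="squared_hinge")  # LIBLINEAR solver
search = GridSearchCV(svm, {"C": [0.01, 0.1, 1, 10, 100]},
                      scoring="f1_macro", cv=5)
search.fit(X, y)
print(search.best_params_)  # the paper reports C=1 as the selected value
```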
We submit two runs: UNIBA1 includes the semantic vector representing the tweet as a feature, while UNIBA2 does not include this vector. Nevertheless, the divergence features are included in both runs.

Official results are reported in Table 1. Our runs rank third and fourth in the final ranking; our team is classified as second, since the first two runs in the ranking belong to team1. We can notice that the runs are very close in the ranking. The last run is ranked below the random baseline, while no system is ranked below the baseline-mfc baseline, which assigns the most frequent class (non-ironic).

team          prec.        rec.         F1           prec.     rec.      F1        avg
              (non-ironic) (non-ironic) (non-ironic) (ironic)  (ironic)  (ironic)  F1
team1         0.785        0.643        0.707        0.696     0.823     0.754     0.731
team1         0.751        0.643        0.693        0.687     0.786     0.733     0.713
UNIBA1        0.748        0.638        0.689        0.683     0.784     0.730     0.710
UNIBA2        0.748        0.638        0.689        0.683     0.784     0.730     0.710
team3         0.700        0.716        0.708        0.708     0.692     0.700     0.704
team6         0.600        0.714        0.652        0.645     0.522     0.577     0.614
random        0.506        0.501        0.503        0.503     0.508     0.506     0.505
team7         0.505        0.892        0.645        0.525     0.120     0.195     0.420
baseline-mfc  0.501        1.000        0.668        0.000     0.000     0.000     0.334

Table 1: Task results.

Results show that our system is not able to improve its performance by exploiting the distributional representation of the tweets, since the two runs report the same average F1-score. We performed further experiments in order to understand the contribution of each feature; some relevant outcomes are reported in Table 2.

run   features                               no-iro-F  iro-F   avg-F
run1  all                                    0.6888    0.7301  0.7095
run2  no DSM                                 0.6888    0.7301  0.7095
1     keyword                                0.6738    0.6969  0.6853
2     keyword, bigrams                       0.6916    0.7219  0.7067
3     keyword, bigrams, trigrams             0.6992    0.7343  0.7168
4     keyword, bigrams, trigrams, blog       0.7000    0.7337  0.7168
5     keyword, bigrams, trigrams, polarity   0.6906    0.7329  0.7117
6     keyword, bigrams, trigrams, context    0.6937    0.7325  0.7131
7     only DSM                               0.6166    0.6830  0.6406
8     only context                           0.4993    0.5587  0.5290

Table 2: Task results obtained by combining different types of features.

In particular:

• keyword-based features achieve the best performance; in particular, bigrams and trigrams contribute to improving the performance (runs 1-3);

• DSM features introduce some kind of noise when combined with the other features; in fact, runs 4, 5 and 6 achieve good performance without DSM;

• DSM alone, without any other kind of feature, achieves remarkable results; it is important to notice that in this run (run 7) only the tweet vector is used as a feature;

• blog, polarity and context features do not contribute to the overall system performance; however, we can observe that using only the context features (only three features for each tweet, run 8) we overcome both the baselines.

Analyzing the results, we can conclude that a more effective way of combining distributional with non-distributional features is needed. As future work, we plan to investigate the combination of two different kernels for the distributional and the keyword-based features.

4 Conclusions

We propose a supervised system for detecting irony in Italian tweets. The proposed system exploits different kinds of features: keyword-based features, microblogging features, polarity features, distributional semantics features, and a score that measures how far a word is from its conventional context. The divergence of a word from its conventional context is computed by exploiting the distributional semantics model built through Random Indexing.

Results prove that our system is able to achieve good performance, ranking third in the official ranking. However, a deeper study of different combinations of features shows that keyword-based features alone are able to achieve the best result, while distributional features introduce noise during training. This outcome suggests the need for a different strategy for combining distributional and non-distributional features.

References

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proc. of WASSA 2013, pages 100-107.

Pierpaolo Basile and Nicole Novielli. 2014. UNIBA at EVALITA 2014 SENTIPOLC task: Predicting tweet sentiment polarity combining micro-blogging, lexicon and semantic features. In Proc. of EVALITA 2014, pages 58-63, Pisa, Italy.

Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2018. Overview of the EVALITA 2018 task on irony detection in Italian tweets (IronITA). In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Sanjoy Dasgupta and Anupam Gupta. 1999. An elementary proof of the Johnson-Lindenstrauss lemma. Technical Report TR-99-006, International Computer Science Institute, Berkeley, California, USA.

Andrea Esuli and Fabrizio Sebastiani. 2006. SentiWordNet: A publicly available lexical resource for opinion mining. In Proc. of LREC, pages 417-422.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871-1874.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. MultiWordNet: Developing an aligned multilingual database. In Proc. of the 1st Intl. Conf. on Global WordNet, pages 293-302.

Magnus Sahlgren. 2005. An introduction to Random Indexing. In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE), volume 5.

Paul Smolensky. 1990. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2):159-216, November.

Cynthia Van Hee, Els Lefever, and Véronique Hoste. 2018. SemEval-2018 Task 3: Irony detection in English tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 39-50.
Eros Zanchetta and Marco Baroni. 2005. Morph-it!: A free corpus-based morphological resource for the Italian language. In Proc. of the Corpus Linguistics Conf. 2005.