IRADABE2: Lexicon Merging and Positional Features for Sentiment Analysis in Italian

Davide Buscaldi
LIPN, Université Paris 13, Villetaneuse, France
buscaldi@lipn.univ-paris13.fr

Delia Irazú Hernandez-Farias
Dipartimento di Informatica, Università degli studi di Torino, Turin, Italy
PRHLT group, Universitat Politècnica de València, Valencia, Spain
dhernandez1@dsic.upv.es

Abstract

This paper presents the participation of the IRADABE team in the SENTIPOLC 2016 task. This year we investigated the use of positional features together with the fusion of sentiment analysis resources, with the aim of classifying Italian tweets according to subjectivity, polarity and irony. Our approach takes as its starting point our participation in the SENTIPOLC 2014 edition. For classification we adopted a supervised approach that takes advantage of support vector machines and neural networks.

1 Introduction

Sentiment analysis (SA) related tasks have attracted the attention of many researchers during the last decade. Several approaches have been proposed to address SA, most of which combine machine learning with natural language processing techniques. Despite all those efforts, many challenges remain, such as multilingual sentiment analysis, i.e., performing SA in languages other than English (Mohammad, 2016). This year, for the second time, a sentiment analysis task on Italian tweets has been organized at EVALITA: the Sentiment Polarity Classification (SENTIPOLC) task (Barbieri et al., 2016).

In this paper we study the effect of positional features on the sentiment, irony and polarity classification tasks in the context of the SENTIPOLC 2016 task. We propose a revised version of our IRADABE system (Hernandez-Farias et al., 2014), which participated with fairly good results in 2014. The novelties of this participation lie not only in the positional features, but also in a new sentiment lexicon that was built by combining and expanding the lexicons we used in 2014.

The rest of the paper is structured as follows: in Section 2 we describe the steps we took to build an enhanced sentiment dictionary in Italian from existing English resources; in Section 3 we describe the new positional features of the IRADABE system.

2 Building a unified dictionary

In sentiment analysis related tasks, several factors can be considered in order to determine the polarity of a given piece of text. Above all, the presence of positive or negative words is used as a strong indicator of sentiment. Nowadays there are many sentiment analysis resources that can be exploited to infer polarity from text. Recently, this kind of lexicon has been proven effective for detecting irony on Twitter (Hernández Farías et al., 2016). Unfortunately, the majority of available resources are in English. A common practice to deal with the lack of resources in other languages is to translate them automatically from English.
However, the language barrier is not the only drawback of these resources. Another issue is the limited coverage of certain resources. For instance, AFINN (Nielsen, 2011) includes only 2477 words in its English version, and the Hu-Liu lexicon (Hu and Liu, 2004) contains about 6800 words. We verified on the SENTIPOLC14 training set that the Hu-Liu lexicon provided a score for 63.1% of the training sentences, while the coverage of AFINN was 70.7%, indicating that the number of items in a lexicon is not proportional to its expected coverage; in other words, although AFINN is smaller, the words it includes are more frequently used than those listed in the Hu-Liu lexicon. The coverage provided by a hypothetical lexicon obtained by combining the two resources would be 79.5%.

We also observed that in some cases these lexicons provide a score for a word but not for one of its synonyms: in the Hu-Liu lexicon, for instance, the word 'repel' is listed as negative, but 'resist', which is listed as one of its synonyms in Roget's thesaurus (http://www.thesaurus.com/Roget-Alpha-Index.html), is not. SentiWordNet (Baccianella et al., 2010) compensates for some of these issues; its coverage is considerably higher than that of the previously named lexicons: 90.6% on the SENTIPOLC14 training set. Moreover, its scores are assigned to synsets rather than to individual words. However, it is not complete: we measured that a combination of SentiWordNet with AFINN and Hu-Liu would attain a coverage of 94.4% on the SENTIPOLC14 training set. A further problem of working with synsets is that it requires word sense disambiguation, which is a difficult task, particularly for short texts like tweets. For this reason, our translation of SentiWordNet into Italian (Hernandez-Farias et al., 2014) resulted in a word-based lexicon rather than a synset-based one.

Therefore, we built a sentiment lexicon aimed at providing the highest possible coverage by merging existing resources and extending the scores to synonyms or quasi-synonyms. The lexicon was built following a three-step process:

1. Create a unique set of opinion words from the AFINN, Hu-Liu and SentiWordNet lexicons, and merge the scores if multiple scores are available for the same word; the original English resources had previously been translated into Italian for our participation in SENTIPOLC 2014;

2. Extend the lexicon with the WordNet synonyms of the words obtained in step 1;

3. Extend the lexicon with pseudo-synonyms of the words obtained in steps 1 and 2, using word2vec for similarity. We call them "pseudo-synonyms" because similarity according to word2vec does not necessarily mean that two words are synonyms, only that they tend to share the same contexts.

The scores at each step were calculated as follows (see the sketch below). In step 1, the weight of a word is the average of its non-zero scores in the three lexicons. In step 2, the weight of a synonym is the same as that of the originating word; if the synonym is already in the lexicon, we keep the most polarizing weight (if the scores have the same sign) or the sum of the weights (if the scores have opposite signs). For step 3 we first built semantic vectors using word2vec (Mikolov et al., 2013) on the ItWaC corpus (http://wacky.sslmit.unibo.it) (Baroni et al., 2009). Then, for each word in the lexicon obtained at step 2, we selected the 10 most similar pseudo-synonyms having a similarity score ≥ 0.6. If a pseudo-synonym already exists in the lexicon, its score is kept; otherwise it is added to the lexicon with a polarity equal to the score of the original word multiplied by the similarity score of the pseudo-synonym. We named the resulting resource the 'Unified Italian Semantic Lexicon', shortened to UnISeLex. It contains 31,601 words. At step 1, the dictionary size was 12,102 words; at step 2, after adding the synonyms, it contained 15,412 words.
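The following minimal sketch illustrates the merge-and-expand procedure; it is not the authors' implementation. Only the combination rules described above come from the paper; the lexicon dictionaries, the `synonyms_of` helper and the gensim word2vec model are assumptions.

```python
from gensim.models import Word2Vec

def merge_scores(lexicons):
    """Step 1: average the non-zero scores a word receives across lexicons
    (e.g. the translated AFINN, Hu-Liu and SentiWordNet dictionaries)."""
    merged = {}
    for lex in lexicons:
        for word, score in lex.items():
            if score != 0.0:
                merged.setdefault(word, []).append(score)
    return {w: sum(s) / len(s) for w, s in merged.items()}

def combine(old, new):
    """Conflict rule: most polarizing weight if same sign, sum otherwise."""
    if old * new > 0:
        return max(old, new, key=abs)
    return old + new

def add_synonyms(lexicon, synonyms_of):
    """Step 2: propagate each word's score to its WordNet synonyms.
    `synonyms_of` is a hypothetical helper returning WordNet synonyms."""
    extended = dict(lexicon)
    for word, score in lexicon.items():
        for syn in synonyms_of(word):
            extended[syn] = combine(extended[syn], score) if syn in extended else score
    return extended

def add_pseudo_synonyms(lexicon, model, topn=10, threshold=0.6):
    """Step 3: add word2vec neighbours, scaling the score by similarity.
    `model` is a gensim Word2Vec model, e.g. trained on ItWaC."""
    extended = dict(lexicon)
    for word, score in lexicon.items():
        if word not in model.wv:
            continue
        for neighbour, sim in model.wv.most_similar(word, topn=topn):
            # existing entries keep their score; new ones get score * similarity
            if sim >= threshold and neighbour not in extended:
                extended[neighbour] = score * sim
    return extended
```

Chained in order, these three steps would reproduce the growth reported above, from 12,102 entries (step 1) to 15,412 (step 2) and finally 31,601 (step 3).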
In addition to this new resource, we exploited the labMT-English word list (Dodds et al., 2011), composed of 10,000 words manually annotated with a happiness measure ranging from 0 to 9. These words were collected from different sources such as Twitter, Google Books, music lyrics, and the New York Times (1987 to 2007).

3 Positional Features

It is well known that in the context of opinion mining and summarization the position of opinion words is an important feature (Pang and Lee, 2008; Taboada and Grieve, 2004). In reviews, users tend to summarize their judgment in the final sentence, after a comprehensive analysis of the various features of the item being reviewed (for instance, in a movie review, they would review the photography, the screenplay, the acting, and finally provide an overall judgment of the movie). Since SENTIPOLC focuses on tweets, whose length is limited to 140 characters, there is less room for a complex analysis, and it is therefore not clear whether the position of sentiment words is important or not.

In fact, we analyzed the training set and noticed that some words tend to appear in certain positions when the sentence is labelled with one class rather than the other. For example, in the subjectivity sub-task, 'non' (not), 'io' (I) and auxiliary verbs like 'potere' (can) and 'dovere' (must) tend to occur mostly at the beginning of the sentence if the sentence is subjective. In the positive polarity sub-task, words like 'bello' (beautiful), 'piacere' (like) and 'amare' (love) are more often observed at the beginning of the sentence if the tweet is positive.

We therefore introduced a positional Bag-of-Words (BOW) weighting, where the weight of a word t is calculated as:

w(t) = 1 + pos(t)/len(s)

where pos(t) is the last observed position of the word in the sentence, and len(s) is the length of the sentence. For instance, in the sentence "I love apples in fall.", w(love) = 1 + 1/5 = 1.2, since the word love is at position 1 in a sentence of 5 words.

The Bag of Words was obtained by taking all the lemmatized forms w that appeared in the training corpus with a frequency greater than 5 and I(w) > 0.001, where I(w) is the informativeness of word w, calculated as:

I(w) = p(w|c+) * [log p(w|c+) − log p(w|c−)]

where p(w|c+) and p(w|c−) are the probabilities of the word appearing in tweets tagged with the positive or negative class, respectively. This selection resulted in 943 words for the subj subtask, 831 for pos, 991 for neg and 1197 for iro.
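Read literally, the two formulas above can be implemented as follows. This is a sketch under our assumptions (0-based word positions, as in the worked example; class probabilities estimated as the fraction of tweets of each class containing the word; a small epsilon guarding the logarithms), not the authors' code.

```python
import math
from collections import Counter

def positional_weights(tokens):
    """Positional BOW: w(t) = 1 + pos(t)/len(s), where pos(t) is the last
    observed (0-based) position of t; for 'I love apples in fall',
    w('love') = 1 + 1/5 = 1.2."""
    n = len(tokens)
    weights = {}
    for pos, tok in enumerate(tokens):
        weights[tok] = 1.0 + pos / n  # later occurrences overwrite earlier ones
    return weights

def informativeness(word, pos_tweets, neg_tweets, eps=1e-9):
    """I(w) = p(w|c+) * (log p(w|c+) - log p(w|c-)); each tweet is a list
    of lemmatized tokens."""
    p_pos = sum(word in t for t in pos_tweets) / len(pos_tweets)
    p_neg = sum(word in t for t in neg_tweets) / len(neg_tweets)
    return p_pos * (math.log(p_pos + eps) - math.log(p_neg + eps))

def select_vocabulary(pos_tweets, neg_tweets, min_freq=5, min_info=0.001):
    """Keep lemmas with corpus frequency > 5 and I(w) > 0.001."""
    freq = Counter(tok for tweet in pos_tweets + neg_tweets for tok in tweet)
    return {w for w, f in freq.items()
            if f > min_freq
            and informativeness(w, pos_tweets, neg_tweets) > min_info}
```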
Table 1: F-measures for positional and standard BOW models trained on the train part of the dev set; results are calculated on the test part of the dev set.

          Subj   Pol(+)  Pol(-)  Iro
pos. BOW  0.528  0.852   0.848   0.900
std. BOW  0.542  0.849   0.842   0.894

The results in Table 1 show a marginal improvement for the polarity and irony classes, while on subjectivity the system lost 2% in F-measure. This is probably because important words that tend to appear in the first part of the sentence may be repeated later, yielding a wrong score for the feature.

With respect to the 2014 version of IRADABE, we introduced 3 more position-dependent features. Each tweet was divided into 3 sections: head, centre and tail. For each section, we take the sum of the sentiment scores of the words it contains as a separate feature. We therefore have three features, named headS, centreS and tailS in Table 2 (see the sketch below).

Figure 1: Example of lexicon positional scores for the sentence "My phone is shattered as well my hopes and dreams".
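A possible implementation of these three section features, assuming an equal-thirds split of the token sequence (the cut points are not specified in the paper) and a dictionary mapping lemmas to UnISeLex scores:

```python
def section_sentiment_features(tokens, lexicon):
    """headS, centreS, tailS: sum of lexicon scores in each third of the tweet.
    `lexicon` maps lemmas to sentiment scores (e.g. UnISeLex); words missing
    from the lexicon contribute 0. The equal-thirds split is an assumption."""
    n = len(tokens)
    cut1, cut2 = n // 3, (2 * n) // 3
    def score(part):
        return sum(lexicon.get(tok, 0.0) for tok in part)
    return {"headS": score(tokens[:cut1]),
            "centreS": score(tokens[cut1:cut2]),
            "tailS": score(tokens[cut2:])}
```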
3.1 Other features

We renewed most of the features used for SENTIPOLC 2014, the main difference being that we now use a single sentiment lexicon instead of 3. In IRADABE 2014 we grouped the features into two categories, Surface Features and Lexicon-based Features. We recall here the ones appearing in Table 2, directing the reader to (Hernandez-Farias et al., 2014) for a more detailed description. The first group comprises features such as the presence of a URL (http), the length of the tweet (length), a list of swear words (taboo), and the ratio of uppercase characters (shout). Among the features extracted from dictionaries, we used the sum of polarity scores (polSum), the sums of only negative or only positive scores (sum(−) and sum(+)) and the number of negative scores (count(−)) on UnISeLex, and the average and standard deviation of scores on labMT (avg_labMT and std_labMT, respectively). Furthermore, to determine both polarity and irony, a subjectivity indicator feature (subj) was used; it is obtained by first identifying whether a tweet is subjective or not. Finally, the mixed feature indicates whether the tweet has mixed polarity or not.

Table 2: The 10 best features for each subtask in the training set.

Subj        Pol(+)      Pol(-)      Iro
http        subj        subj        subj
shout       avg_labMT   sum(−)      http
sum(−)      'grazie'    count(−)    'governo'
count(−)    smileys     avg_labMT   mixed
headS       polSum      length      shout
1 pers      http        polSum      'Mario'
'!'         '?'         http        'che'
avg_labMT   sum(+)      centreS     '#Grillo'
'mi'        'bello'     taboo       length
taboo       'amare'     std_labMT   sum(−)

4 Results and Discussion

We evaluated our approach on the dataset provided by the organizers of SENTIPOLC 2016. This dataset is composed of up to 10,000 tweets, split into a training set and a test set. Both datasets contain tweets related to political and socio-political domains, as well as some generic tweets; further details on the datasets can be found in the task overview (Barbieri et al., 2016).

We experimented with different configurations for assessing subjectivity, polarity and irony, and sent two runs for evaluation in SENTIPOLC 2016 (an illustrative sketch of the run 1 cascade is given at the end of this section):

• run 1. For assessing the subjectivity label, a TensorFlow (http://www.tensorflow.org) implementation of a Deep Neural Network (DNN) was applied, with 2 hidden layers of 1024 and 512 states, respectively. The polarity and irony labels were then determined by exploiting an SVM; as in the IRADABE 2014 version, the subjectivity label influences the determination of both the polarity values and the presence of irony.

• run 2. In this run, the bag-of-words was revised to remove words that may have a different polarity depending on the context. Classification was carried out using an SVM (radial basis function kernel) for all subtasks, including subj.

Table 3: Official results of our model on the test set.

run 1       Subj    Pol(+)  Pol(-)  Iro
Precision   0.9328  0.6755  0.5161  0.1296
Recall      0.4575  0.3325  0.2273  0.0298
F-Measure   0.6139  0.4456  0.3156  0.0484

run 2       Subj    Pol(+)  Pol(-)  Iro
Precision   0.8714  0.6493  0.4602  0.2078
Recall      0.6644  0.4377  0.3466  0.0681
F-Measure   0.7539  0.5229  0.3955  0.1026

From the results, we can observe that the DNN obtained an excellent precision (more than 93%) on subj, but its recall was very low. This may indicate a problem due to class imbalance, or overfitting of the DNN, which is plausible given the number of features. This may also be why the SVM performs better, since SVMs are less afflicted by the "curse of dimensionality".
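As an illustration only, and not the authors' code, the run 1 cascade can be approximated with scikit-learn: an MLP with 1024- and 512-unit hidden layers stands in for the TensorFlow DNN, RBF-kernel SVMs handle polarity and irony, and the predicted subjectivity label is appended to the feature vectors of the downstream tasks, mirroring the subj indicator feature described above. All names and hyperparameters other than the layer sizes and the kernel are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Stand-in for the TensorFlow DNN of run 1: two hidden layers, 1024 and 512 units.
subj_clf = MLPClassifier(hidden_layer_sizes=(1024, 512), max_iter=200)
pol_clf = SVC(kernel="rbf")  # run 2 used RBF-kernel SVMs for all subtasks
iro_clf = SVC(kernel="rbf")

def fit_cascade(X, y_subj, y_pol, y_iro):
    """Train subjectivity first, then feed its prediction to the other tasks,
    mirroring the subj indicator feature. X is a dense numpy feature matrix."""
    subj_clf.fit(X, y_subj)
    subj_feat = subj_clf.predict(X).reshape(-1, 1)
    X_ext = np.hstack([X, subj_feat])
    pol_clf.fit(X_ext, y_pol)
    iro_clf.fit(X_ext, y_iro)

def predict_cascade(X):
    """Predict subjectivity, then polarity and irony on the extended features."""
    subj = subj_clf.predict(X).reshape(-1, 1)
    X_ext = np.hstack([X, subj])
    return subj.ravel(), pol_clf.predict(X_ext), iro_clf.predict(X_ext)
```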
5 Conclusions

As future work, it could be interesting to exploit the labels for exact polarity provided by the organizers; this kind of information could help identify the use of figurative language. Furthermore, we plan to enrich IRADABE with other kinds of features that cover more subtle aspects of sentiment, such as emotions. The introduction of the "happiness score" provided by labMT was particularly useful, with the related features proving critical in the subjectivity and polarity subtasks. This motivates us to look for dictionaries that express feelings beyond the overall polarity of a word. We will also need to verify the effectiveness of the resource we produced automatically against other hand-crafted dictionaries for the Italian language, such as Sentix (Basile and Nissim, 2013).

We also plan to use a more refined weighting scheme for the positional features, such as the locally-weighted bag-of-words (LOWBOW) (Lebanon et al., 2007), although it would increase the feature space at least 3-fold (if we keep the head, centre, tail cuts), probably further compromising the use of a DNN for classification.

Regarding the utility of positional features, the current results are inconclusive, so we need to investigate further how the positional scoring affects the results. On the other hand, the results show that the merged dictionary was a useful resource, with dictionary-based features representing 25% of the most discriminating features.

Acknowledgments

This research work has been supported by the "Investissements d'Avenir" program ANR-10-LABX-0083 (Labex EFL). The National Council for Science and Technology (CONACyT Mexico) has funded the research work of Delia Irazú Hernández Farías (Grant No. 218109/313683, CVU 369616).

References

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), pages 2200–2204, Valletta, Malta, May. European Language Resources Association (ELRA).

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTIment POLarity Classification Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of WASSA 2013, Atlanta, United States, June.

Peter Sheridan Dodds, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, and Christopher M. Danforth. 2011. Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter. PLoS ONE, 6(12).

Irazú Hernandez-Farias, Davide Buscaldi, and Belém Priego-Sánchez. 2014. IRADABE: Adapting English lexicons to the Italian sentiment polarity classification task. In Proceedings of the First Italian Conference on Computational Linguistics (CLiC-it 2014) and the Fourth International Workshop EVALITA 2014, pages 75–81, Pisa, Italy, December.

Delia Irazú Hernández Farías, Viviana Patti, and Paolo Rosso. 2016. Irony detection in Twitter: The role of affective content. ACM Transactions on Internet Technology, 16(3):19:1–19:24, July.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168–177, Seattle, WA, USA. ACM.

Guy Lebanon, Yi Mao, and Joshua Dillon. 2007. The locally weighted bag of words framework for document representation. Journal of Machine Learning Research, 8(Oct):2405–2441.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Saif M. Mohammad. 2016. Challenges in sentiment analysis. In A Practical Guide to Sentiment Analysis. Springer.

Finn Årup Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big Things Come in Small Packages, volume 718 of CEUR Workshop Proceedings, pages 93–98, Heraklion, Crete, Greece. CEUR-WS.org.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Maite Taboada and Jack Grieve. 2004. Analyzing appraisal automatically. In Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications, pages 158–161, Stanford, US.