IRADABE2: Lexicon Merging and Positional Features for Sentiment Analysis in Italian

Davide Buscaldi
LIPN, Université Paris 13, Villetaneuse, France
buscaldi@lipn.univ-paris13.fr

Delia Irazú Hernandez-Farias
Dipartimento di Informatica, Università degli studi di Torino, Turin, Italy
PRHLT group, Universitat Politècnica de València, Valencia, Spain
dhernandez1@dsic.upv.es

Abstract

This paper presents the participation of the IRADABE team in the SENTIPOLC 2016 task. This year we investigated the use of positional features together with the fusion of sentiment analysis resources, with the aim of classifying Italian tweets according to subjectivity, polarity and irony. Our approach takes as its starting point our participation in the SENTIPOLC 2014 edition. For classification we adopted a supervised approach that takes advantage of support vector machines and neural networks.

1 Introduction

Sentiment analysis (SA) related tasks have attracted the attention of many researchers during the last decade. Several approaches have been proposed to address SA, most of which combine machine learning with natural language processing techniques. Despite all those efforts, many challenges remain, such as multilingual sentiment analysis, i.e., performing SA in languages other than English (Mohammad, 2016). This year, for the second time, a sentiment analysis task on Italian tweets has been organized at EVALITA: the Sentiment Polarity Classification (SENTIPOLC) task (Barbieri et al., 2016).

In this paper we study the effect of positional features on the sentiment, irony and polarity classification tasks in the context of the SENTIPOLC 2016 task. We propose a revised version of our IRADABE system (Hernandez-Farias et al., 2014), which participated with fairly good results in 2014. The novelties of this participation lie not only in the positional features, but also in a new sentiment lexicon that was built by combining and expanding the lexicons we used in 2014.

The rest of the paper is structured as follows: in Section 2 we describe the steps we took to build an enhanced sentiment dictionary in Italian from existing English resources; in Section 3 we describe the new positional features of the IRADABE system.

2 Building a unified dictionary

In sentiment analysis related tasks, several factors can be considered in order to determine the polarity of a given piece of text. Above all, the presence of positive or negative words is used as a strong indicator of sentiment. Nowadays there are many sentiment analysis resources that can be exploited to infer polarity from text. Recently, this kind of lexicon has been proven effective for detecting irony on Twitter (Hernández Farías et al., 2016). Unfortunately, the majority of available resources are in English. A common practice to deal with the lack of resources in other languages is to translate them automatically from English.
However, the language barrier is not the only drawback of these resources. Another issue is the limited coverage of certain resources. For instance, AFINN (Nielsen, 2011) includes only 2477 words in its English version, and the Hu-Liu lexicon (Hu and Liu, 2004) contains about 6800 words. We verified on the SENTIPOLC14 training set that the Hu-Liu lexicon provided a score for 63.1% of the training sentences, while the coverage of AFINN was 70.7%, indicating that the number of items in a lexicon is not proportional to its expected coverage; in other words, although AFINN is smaller, the words it includes are more frequently used than those listed in the Hu-Liu lexicon. The coverage provided by a hypothetical lexicon obtained by combining the two resources would be 79.5%.

We also observed that in some cases these lexicons provide a score for a word but not for one of its synonyms: in the Hu-Liu lexicon, for instance, the word 'repel' is listed as negative, but 'resist', which is listed as one of its synonyms in Roget's thesaurus (http://www.thesaurus.com/Roget-Alpha-Index.html), is not. SentiWordNet (Baccianella et al., 2010) compensates for some of these issues; its coverage is considerably higher than that of the previously named lexicons: 90.6% on the SENTIPOLC14 training set. Moreover, its scores are assigned to synsets rather than to individual words. However, it is not complete: we measured that a combination of SentiWordNet with AFINN and Hu-Liu would attain a coverage of 94.4% on the SENTIPOLC14 training set. A further problem of working with synsets is that it requires word sense disambiguation, which is a difficult task, particularly for short texts like tweets. For this reason, our translation of SentiWordNet into Italian (Hernandez-Farias et al., 2014) resulted in a word-based lexicon rather than a synset-based one.

Therefore, we built a sentiment lexicon aimed at providing the highest possible coverage by merging existing resources and extending the scores to synonyms or quasi-synonyms. The lexicon was built following a three-step process:

1. Create a unique set of opinion words from the AFINN, Hu-Liu and SentiWordNet lexicons, and merge the scores if multiple scores are available for the same word; the original English resources had previously been translated into Italian for our participation in SENTIPOLC 2014;

2. Extend the lexicon with the WordNet synonyms of the words obtained in step 1;

3. Extend the lexicon with pseudo-synonyms of the words obtained in steps 1 and 2, using word2vec for similarity. We call them "pseudo-synonyms" because similarity according to word2vec does not necessarily mean that two words are synonyms, only that they tend to share the same contexts.

The scores at each step were calculated as follows (see the sketch below). In step 1, the weight of a word is the average of its non-zero scores in the three lexicons. In step 2, the weight of a synonym is the same as that of the originating word; if the synonym is already in the lexicon, we keep the most polarizing weight (if the scores have the same sign) or the sum of the weights (if the scores have opposite signs). For step 3 we first built semantic vectors using word2vec (Mikolov et al., 2013) on the ItWaC corpus (http://wacky.sslmit.unibo.it) (Baroni et al., 2009). Then, for each word in the lexicon obtained at step 2, we selected the 10 most similar pseudo-synonyms having a similarity score ≥ 0.6. If a pseudo-synonym already exists in the lexicon, its score is kept; otherwise it is added to the lexicon with a polarity equal to the score of the original word multiplied by the similarity score of the pseudo-synonym. We named the resulting resource the 'Unified Italian Semantic Lexicon', shortened to UnISeLex. It contains 31,601 words. At step 1, the dictionary size was 12,102 words; at step 2, after adding the synonyms, it contained 15,412 words.
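The following minimal sketch illustrates the merge-and-expand procedure; it is not the authors' implementation. Only the combination rules described above come from the paper; the lexicon dictionaries, the `synonyms_of` helper and the gensim word2vec model are assumptions.

```python
from gensim.models import Word2Vec

def merge_scores(lexicons):
    """Step 1: average the non-zero scores a word receives across lexicons
    (e.g. the translated AFINN, Hu-Liu and SentiWordNet dictionaries)."""
    merged = {}
    for lex in lexicons:
        for word, score in lex.items():
            if score != 0.0:
                merged.setdefault(word, []).append(score)
    return {w: sum(s) / len(s) for w, s in merged.items()}

def combine(old, new):
    """Conflict rule: most polarizing weight if same sign, sum otherwise."""
    if old * new > 0:
        return max(old, new, key=abs)
    return old + new

def add_synonyms(lexicon, synonyms_of):
    """Step 2: propagate each word's score to its WordNet synonyms.
    `synonyms_of` is a hypothetical helper returning WordNet synonyms."""
    extended = dict(lexicon)
    for word, score in lexicon.items():
        for syn in synonyms_of(word):
            extended[syn] = combine(extended[syn], score) if syn in extended else score
    return extended

def add_pseudo_synonyms(lexicon, model, topn=10, threshold=0.6):
    """Step 3: add word2vec neighbours, scaling the score by similarity.
    `model` is a gensim Word2Vec model, e.g. trained on ItWaC."""
    extended = dict(lexicon)
    for word, score in lexicon.items():
        if word not in model.wv:
            continue
        for neighbour, sim in model.wv.most_similar(word, topn=topn):
            # existing entries keep their score; new ones get score * similarity
            if sim >= threshold and neighbour not in extended:
                extended[neighbour] = score * sim
    return extended
```

Chained in order, these three steps would reproduce the growth reported above, from 12,102 entries (step 1) to 15,412 (step 2) and finally 31,601 (step 3).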
In addition to this new resource, we exploited the labMT-English word list (Dodds et al., 2011), composed of 10,000 words manually annotated with a happiness measure ranging from 0 to 9. These words were collected from different sources such as Twitter, Google Books, music lyrics, and the New York Times (1987 to 2007).

3 Positional Features

It is well known that in the context of opinion mining and summarization the position of opinion words is an important feature (Pang and Lee, 2008; Taboada and Grieve, 2004). In reviews, users tend to summarize their judgment in the final sentence, after a comprehensive analysis of the various features of the item being reviewed (for instance, in a movie review, they would review the photography, the screenplay, the acting, and finally provide an overall judgment of the movie). Since SENTIPOLC focuses on tweets, whose length is limited to 140 characters, there is less room for a complex analysis, and it is therefore not clear whether the position of sentiment words is important or not.

In fact, we analyzed the training set and noticed that some words tend to appear in certain positions when the sentence is labelled with one class rather than the other. For example, in the subjectivity sub-task, 'non' (not), 'io' (I) and auxiliary verbs like 'potere' (can) and 'dovere' (must) tend to occur mostly at the beginning of the sentence if the sentence is subjective. In the positive polarity sub-task, words like 'bello' (beautiful), 'piacere' (like) and 'amare' (love) are more often observed at the beginning of the sentence if the tweet is positive.

We therefore introduced a positional Bag-of-Words (BOW) weighting, where the weight of a word t is calculated as:

w(t) = 1 + pos(t)/len(s)

where pos(t) is the last observed position of the word in the sentence, and len(s) is the length of the sentence. For instance, in the sentence "I love apples in fall.", w(love) = 1 + 1/5 = 1.2, since the word love is at position 1 in a sentence of 5 words.

The Bag of Words was obtained by taking all the lemmatized forms w that appeared in the training corpus with a frequency greater than 5 and I(w) > 0.001, where I(w) is the informativeness of word w, calculated as:

I(w) = p(w|c+) * [log p(w|c+) − log p(w|c−)]

where p(w|c+) and p(w|c−) are the probabilities of the word appearing in tweets tagged with the positive or negative class, respectively. This selection resulted in 943 words for the subj subtask, 831 for pos, 991 for neg and 1197 for iro.
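Read literally, the two formulas above can be implemented as follows. This is a sketch under our assumptions (0-based word positions, as in the worked example; class probabilities estimated as the fraction of tweets of each class containing the word; a small epsilon guarding the logarithms), not the authors' code.

```python
import math
from collections import Counter

def positional_weights(tokens):
    """Positional BOW: w(t) = 1 + pos(t)/len(s), where pos(t) is the last
    observed (0-based) position of t; for 'I love apples in fall',
    w('love') = 1 + 1/5 = 1.2."""
    n = len(tokens)
    weights = {}
    for pos, tok in enumerate(tokens):
        weights[tok] = 1.0 + pos / n  # later occurrences overwrite earlier ones
    return weights

def informativeness(word, pos_tweets, neg_tweets, eps=1e-9):
    """I(w) = p(w|c+) * (log p(w|c+) - log p(w|c-)); each tweet is a list
    of lemmatized tokens."""
    p_pos = sum(word in t for t in pos_tweets) / len(pos_tweets)
    p_neg = sum(word in t for t in neg_tweets) / len(neg_tweets)
    return p_pos * (math.log(p_pos + eps) - math.log(p_neg + eps))

def select_vocabulary(pos_tweets, neg_tweets, min_freq=5, min_info=0.001):
    """Keep lemmas with corpus frequency > 5 and I(w) > 0.001."""
    freq = Counter(tok for tweet in pos_tweets + neg_tweets for tok in tweet)
    return {w for w, f in freq.items()
            if f > min_freq
            and informativeness(w, pos_tweets, neg_tweets) > min_info}
```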
Table 1: F-measures for positional and standard BOW models trained on the train part of the dev set; results are calculated on the test part of the dev set.

          Subj   Pol(+)  Pol(-)  Iro
pos. BOW  0.528  0.852   0.848   0.900
std. BOW  0.542  0.849   0.842   0.894

The results in Table 1 show a marginal improvement for the polarity and irony classes, while on subjectivity the system lost 2% in F-measure. This is probably because important words that tend to appear in the first part of the sentence may be repeated later, yielding a wrong score for the feature.

With respect to the 2014 version of IRADABE, we introduced 3 more position-dependent features. Each tweet was divided into 3 sections: head, centre and tail. For each section, we take the sum of the sentiment scores of the words it contains as a separate feature. We therefore have three features, named headS, centreS and tailS in Table 2 (see the sketch below).

Figure 1: Example of lexicon positional scores for the sentence "My phone is shattered as well my hopes and dreams".
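A possible implementation of these three section features, assuming an equal-thirds split of the token sequence (the cut points are not specified in the paper) and a dictionary mapping lemmas to UnISeLex scores:

```python
def section_sentiment_features(tokens, lexicon):
    """headS, centreS, tailS: sum of lexicon scores in each third of the tweet.
    `lexicon` maps lemmas to sentiment scores (e.g. UnISeLex); words missing
    from the lexicon contribute 0. The equal-thirds split is an assumption."""
    n = len(tokens)
    cut1, cut2 = n // 3, (2 * n) // 3
    def score(part):
        return sum(lexicon.get(tok, 0.0) for tok in part)
    return {"headS": score(tokens[:cut1]),
            "centreS": score(tokens[cut1:cut2]),
            "tailS": score(tokens[cut2:])}
```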
3.1 Other features

We renewed most of the features used for SENTIPOLC 2014, the main difference being that we now use a single sentiment lexicon instead of 3. In IRADABE 2014 we grouped the features into two categories, Surface Features and Lexicon-based Features. We recall here the ones appearing in Table 2, directing the reader to (Hernandez-Farias et al., 2014) for a more detailed description. The first group comprises features such as the presence of a URL (http), the length of the tweet (length), a list of swear words (taboo), and the ratio of uppercase characters (shout). Among the features extracted from dictionaries, we used the sum of polarity scores (polSum), the sums of only negative or only positive scores (sum(−) and sum(+)) and the number of negative scores (count(−)) on UnISeLex, and the average and standard deviation of scores on labMT (avg_labMT and std_labMT, respectively). Furthermore, to determine both polarity and irony, a subjectivity indicator feature (subj) was used; it is obtained by first identifying whether a tweet is subjective or not. Finally, the mixed feature indicates whether the tweet has mixed polarity or not.

Table 2: The 10 best features for each subtask in the training set.

Subj        Pol(+)      Pol(-)      Iro
http        subj        subj        subj
shout       avg_labMT   sum(−)      http
sum(−)      'grazie'    count(−)    'governo'
count(−)    smileys     avg_labMT   mixed
headS       polSum      length      shout
1 pers      http        polSum      'Mario'
'!'         '?'         http        'che'
avg_labMT   sum(+)      centreS     '#Grillo'
'mi'        'bello'     taboo       length
taboo       'amare'     std_labMT   sum(−)

4 Results and Discussion

We evaluated our approach on the dataset provided by the organizers of SENTIPOLC 2016. This dataset is composed of up to 10,000 tweets, split into a training set and a test set. Both datasets contain tweets related to political and socio-political domains, as well as some generic tweets; further details on the datasets can be found in the task overview (Barbieri et al., 2016).

We experimented with different configurations for assessing subjectivity, polarity and irony, and sent two runs for evaluation in SENTIPOLC 2016 (an illustrative sketch of the run 1 cascade is given at the end of this section):

• run 1. For assessing the subjectivity label, a TensorFlow (http://www.tensorflow.org) implementation of a Deep Neural Network (DNN) was applied, with 2 hidden layers of 1024 and 512 states, respectively. The polarity and irony labels were then determined by exploiting an SVM; as in the IRADABE 2014 version, the subjectivity label influences the determination of both the polarity values and the presence of irony.

• run 2. In this run, the bag-of-words was revised to remove words that may have a different polarity depending on the context. Classification was carried out using an SVM (radial basis function kernel) for all subtasks, including subj.

Table 3: Official results of our model on the test set.

run 1       Subj    Pol(+)  Pol(-)  Iro
Precision   0.9328  0.6755  0.5161  0.1296
Recall      0.4575  0.3325  0.2273  0.0298
F-Measure   0.6139  0.4456  0.3156  0.0484

run 2       Subj    Pol(+)  Pol(-)  Iro
Precision   0.8714  0.6493  0.4602  0.2078
Recall      0.6644  0.4377  0.3466  0.0681
F-Measure   0.7539  0.5229  0.3955  0.1026

From the results, we can observe that the DNN obtained an excellent precision (more than 93%) on subj, but its recall was very low. This may indicate a problem due to class imbalance, or overfitting of the DNN, which is plausible given the number of features. This may also be why the SVM performs better, since SVMs are less afflicted by the "curse of dimensionality".
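As an illustration only, and not the authors' code, the run 1 cascade can be approximated with scikit-learn: an MLP with 1024- and 512-unit hidden layers stands in for the TensorFlow DNN, RBF-kernel SVMs handle polarity and irony, and the predicted subjectivity label is appended to the feature vectors of the downstream tasks, mirroring the subj indicator feature described above. All names and hyperparameters other than the layer sizes and the kernel are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Stand-in for the TensorFlow DNN of run 1: two hidden layers, 1024 and 512 units.
subj_clf = MLPClassifier(hidden_layer_sizes=(1024, 512), max_iter=200)
pol_clf = SVC(kernel="rbf")  # run 2 used RBF-kernel SVMs for all subtasks
iro_clf = SVC(kernel="rbf")

def fit_cascade(X, y_subj, y_pol, y_iro):
    """Train subjectivity first, then feed its prediction to the other tasks,
    mirroring the subj indicator feature. X is a dense numpy feature matrix."""
    subj_clf.fit(X, y_subj)
    subj_feat = subj_clf.predict(X).reshape(-1, 1)
    X_ext = np.hstack([X, subj_feat])
    pol_clf.fit(X_ext, y_pol)
    iro_clf.fit(X_ext, y_iro)

def predict_cascade(X):
    """Predict subjectivity, then polarity and irony on the extended features."""
    subj = subj_clf.predict(X).reshape(-1, 1)
    X_ext = np.hstack([X, subj])
    return subj.ravel(), pol_clf.predict(X_ext), iro_clf.predict(X_ext)
```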
5 Conclusions

As future work, it could be interesting to exploit the labels for exact polarity provided by the organizers; this kind of information could help identify the use of figurative language. Furthermore, we plan to enrich IRADABE with other kinds of features that cover more subtle aspects of sentiment, such as emotions. The introduction of the "happiness score" provided by labMT was particularly useful, with the related features proving critical in the subjectivity and polarity subtasks. This motivates us to look for dictionaries that express feelings beyond the overall polarity of a word. We will also need to verify the effectiveness of the resource we produced automatically against other hand-crafted dictionaries for the Italian language, such as Sentix (Basile and Nissim, 2013).

We also plan to use a more refined weighting scheme for the positional features, such as the locally-weighted bag-of-words (LOWBOW) (Lebanon et al., 2007), although it would increase the feature space at least 3-fold (if we keep the head, centre, tail cuts), probably further compromising the use of a DNN for classification.

Regarding the utility of positional features, the current results are inconclusive, so we need to investigate further how the positional scoring affects the results. On the other hand, the results show that the merged dictionary was a useful resource, with dictionary-based features representing 25% of the most discriminating features.

Acknowledgments

This research work has been supported by the "Investissements d'Avenir" program ANR-10-LABX-0083 (Labex EFL). The National Council for Science and Technology (CONACyT Mexico) has funded the research work of Delia Irazú Hernández Farías (Grant No. 218109/313683, CVU 369616).

References

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), pages 2200–2204, Valletta, Malta, May. European Language Resources Association (ELRA).

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTIment POLarity Classification Task. In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of WASSA 2013, Atlanta, United States, June.

Peter Sheridan Dodds, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, and Christopher M. Danforth. 2011. Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter. PLoS ONE, 6(12).

Irazú Hernandez-Farias, Davide Buscaldi, and Belém Priego-Sánchez. 2014. IRADABE: Adapting English lexicons to the Italian sentiment polarity classification task. In Proceedings of the First Italian Conference on Computational Linguistics (CLiC-it 2014) and the Fourth International Workshop EVALITA 2014, pages 75–81, Pisa, Italy, December.

Delia Irazú Hernández Farías, Viviana Patti, and Paolo Rosso. 2016. Irony detection in Twitter: The role of affective content. ACM Transactions on Internet Technology, 16(3):19:1–19:24, July.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168–177, Seattle, WA, USA. ACM.

Guy Lebanon, Yi Mao, and Joshua Dillon. 2007. The locally weighted bag of words framework for document representation. Journal of Machine Learning Research, 8(Oct):2405–2441.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Saif M. Mohammad. 2016. Challenges in sentiment analysis. In A Practical Guide to Sentiment Analysis. Springer.

Finn Årup Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big Things Come in Small Packages, volume 718 of CEUR Workshop Proceedings, pages 93–98, Heraklion, Crete, Greece. CEUR-WS.org.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Maite Taboada and Jack Grieve. 2004. Analyzing appraisal automatically. In Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications, pages 158–161, Stanford, US.