Towards an Italian Lexicon for Polarity Classification (polarITA): a Comparative Analysis of Lexical Resources for Sentiment Analysis

Delia Irazú Hernández Farías
PRHLT Research Center, Universitat Politècnica de València
dhernandez1@dsic.upv.es

Irene Laganà
Dipartimento di Studi Umanistici, Università di Pavia
irene.lagana01@universitadipavia.it

Viviana Patti, Cristina Bosco
Dipartimento di Informatica, Università di Torino
{patti,bosco}@di.unito.it

Abstract

English. The paper describes a preliminary study for the development of a novel lexicon for Italian sentiment analysis, i.e. a lexicon where words are associated with polarity values. Given the influence of sentiment lexica on the performance of sentiment analysis systems, we propose a methodology based on the detection and classification of errors in existing lexical resources, and we apply an extrinsic evaluation of the impact of such errors. The final aim is to build a novel resource by filtering the existing lexical resources and integrating them with missing lexical entries and more reliable associations of polarity with entries.

Italiano. L'articolo descrive uno studio preliminare per lo sviluppo di una nuova risorsa lessicale per la sentiment analysis in italiano, i.e. dove alle parole sono associati valori di polarità. Data l'influenza dei lessici di sentiment sulle performance dei sistemi di sentiment analysis, viene proposta una metodologia basata sulla rilevazione e classificazione degli errori presenti nei lessici attualmente disponibili ed una valutazione estrinseca dell'impatto di tali errori sui sistemi. L'obiettivo finale è ottenere un nuovo lessico grazie ad un filtraggio applicato alle risorse lessicali disponibili, e a un'integrazione con le voci lessicali mancanti, ottenendo una maggiore affidabilità nell'associazione delle polarità alle voci.

1 Introduction

Sentiment Analysis (SA), described as the task of automatically determining the polarity of a given piece of text (Mohammad, 2016), is currently among the most widely investigated topics within NLP. Overall, the approaches to this task are mainly based on techniques ranging from traditional machine learning to novel deep learning ones, as can also be seen in the shared tasks on sentiment polarity classification in Twitter recently proposed for English (Nakov et al., 2016) and Italian (Barbieri et al., 2016) within the SemEval and Evalita periodical evaluation campaigns. Moreover, the detection of specific words associated with polarity values or emotions has been considered a powerful information source for identifying the sentiment behind a text. Among the resources most commonly exploited by SA systems there are therefore sentiment lexica, i.e., lists of words with associated polarity values or emotions.

Several techniques have been applied for the development of lexical resources for SA: they can be built from scratch, manually or automatically, or extracted from corpora (Nissim and Patti, 2017). Nevertheless, the vast majority of these resources are written in English, and several other languages currently lack such resources. One of the most commonly applied alternatives for obtaining resources in languages other than English is to automatically translate an available English lexicon via tools such as Google Translate [1]. This kind of process involves many constraints, such as handling synonyms, polysemous words and multi-word expressions, and dealing with cultural differences between source and target language. Apart from this, possible variations of polarity across different contexts and languages should be carefully taken into account, whereas such approaches rely to some extent on the assumption that affective norms related to sentiment are stable across languages.

[1] https://translate.google.com/

In this paper we are interested in evaluating the reliability of the lexical resources currently available for Italian SA and, given that most of them are obtained by translation, we mainly focus on the reliability of automatically translating English resources into Italian. To do so, we applied a methodology involving different facets. Our final aim is to develop a new SA resource for Italian, which comprises pre-existing translated lexical entries enriched with the manual correction of the assigned polarity, as resulting from our analysis, and which also includes entries that carry a polarity but are missing in the available lexica.

The paper is organized as follows. In the next section, we describe our methodology, which mainly consists of three steps: the selection of a sample of tweets from an Italian sentiment corpus exploited as part of the gold standard in the Sentipolc@Evalita2016 shared task (Stranisci et al., 2016; Barbieri et al., 2016); the automatic extraction of the lexical entries polarized according to a set of benchmark sentiment lexica for Italian; the analysis of these entries and the comparison with those expected by a human judge. Section 3 presents an extrinsic evaluation of the impact of the detected errors on the results of SA systems. Some hints about future developments of this research are given in the conclusion.

2 Our Methodology

Given the relevance of affective lexica in SA and related tasks, our major aims in the current research are to detect the limits of the currently available lexical resources for Italian and to explore the possibility of developing a novel resource by correcting and extending them. In this paper we focus in particular on the detection of the deficiencies of existing resources and on their causes. Our methodology therefore consists in: (i) selecting a sample of tweets from an Italian sentiment corpus featuring political content (Stranisci et al., 2016) and exploited as part of the gold standard in the Sentipolc@Evalita2016 shared task (Barbieri et al., 2016), with sentiment polarity annotation at the tweet level; (ii) automatically extracting the lexical entries polarized according to a set of benchmark sentiment lexica for Italian; and (iii) manually checking the results for each expected lexical entry in the context of the whole tweet (i.e. whether the polarity of the entry is the one expected by a human annotator, and whether there are other entries in the tweet that should appear as polarized but are not in the lexica).

We take as our starting point the SA lexica exploited by Hernández Farías et al. (2014) in the IRADABE system at Evalita2014's SENTIPOLC (Basile et al., 2014). The same resources were also used in the upgraded system that participated in the same task at Evalita2016 (Buscaldi and Hernández Farías, 2016).

In those works the lexicon AFINN (Nielsen, 2011), the one developed by Hu and Liu (henceforth HaL) (Hu and Liu, 2004), and SentiWordNet (SWN) (Baccianella et al., 2010) were automatically translated into Italian in order to exploit the obtained information as features in a supervised system, but no specific evaluation or refinement of them was performed. In the present paper we extend our selection by considering, beyond these three, a further resource, i.e. Sentix (Basile and Nissim, 2013), which has been developed following a semantics-oriented strategy (see Sec. 2.1). Henceforth, we will use the expression benchmark lexica to refer to the four resources. As reference corpus we considered TwBuonaScuola (Stranisci et al., 2016), an Italian dataset manually annotated for sentiment polarity and irony, focused on the on-line debate regarding a controversial Italian political reform, which is part of the gold standard provided for the Sentipolc shared task (Barbieri et al., 2016) at Evalita 2016 (Basile et al., 2017).

Our methodology, whose results are shown in Sec. 2.2, includes the steps described below. Given a random selection of 500 tweets from TwBuonaScuola (henceforth ItalianTweets) including 2,706 different words, we manually evaluated the coverage of the benchmark lexica for the words included in these tweets. In particular, for each tweet we automatically extracted all the words which are included in each of the benchmark lexica, together with their associated polarity. Then, for each tweet belonging to ItalianTweets, we manually checked the obtained lists of words, considered in the context of the tweet, with a twofold objective:

(i) To deduce which words in the benchmark lexica have a wrong polarity associated;

(ii) To identify those words that express a certain polarity in the corpus but are not included in the benchmark lexica.

2.1 Sentiment Analysis Resources

In this section we describe the benchmark lexica.

AFINN (Nielsen, 2011) is an English lexicon composed of 2,477 words and 15 multi-word expressions. Each entry is associated with a score which varies from -5 to +5, where negative and positive scores express negative and positive polarity respectively. The starting point for the development of this resource is a list of obscene words and some positive words; the lexicon has then been extended with words from a corpus of tweets and other lists of words from Urban Dictionary [2] in order to represent entries typical of Internet language (e.g. "WTF" and "LOL"). After the manual annotation of the entries, the lexicon has been evaluated on a corpus of tweets manually annotated for SA.

HaL (Hu and Liu, 2004) has been built within a project for developing methods to deal with opinions expressed in reviews about various kinds of goods. A group of 30 manually annotated adjectives with a single and stable polarity has been expanded by including the words which in WordNet's synsets are synonyms or antonyms of these seeds, assuming that synonyms have the same polarity and antonyms the opposite one. The lexicon currently includes 6,800 entries classified as positive or negative.

SentiWordNet 3.0 (Baccianella et al., 2010) is among the largest and most widely used resources exploited for SA. The main goal of the SentiWordNet project is the fully automated annotation of the polarity of WordNet's synsets, assigning to each of the three basic polarity values (positive, negative, neutral) a score that varies from 0.0 to 1.0, so that the three scores sum to 1. In contrast with the other resources, SentiWordNet takes into account the different possible senses of each word.

As far as Italian is concerned, only a few resources exist, such as Sentix (Basile and Nissim, 2013) and SABRINA (Borzì et al., 2015). Sentix is the result of the alignment of four semantic databases, namely WordNet (Fellbaum, 1998), SentiWordNet, MultiWordNet (Pianta et al., 2002) and BabelNet (Navigli and Ponzetto, 2012). The methodology consists in transferring to the Italian section of WordNet the information about polarity encoded in the English SentiWordNet's synsets, thus aligning Italian and English synsets.

The development of SABRINA is instead based on the application of a prior polarity method to two sets of Italian words, the first composed of 277,000 entries with associated inflexion. However, the lexicon is not publicly available.

Finally, let us mention ItEM (Passaro et al., 2015), an Italian emotive lexicon which aims at offering information about affect expressed in text at finer levels of granularity, i.e. referring not simply to positive or negative sentiment polarity but to emotional categories. In ItEM each word is tagged with an emotional label from the eight basic emotions of Plutchik's psychological model (Plutchik, 1980).

Several scholars are devoting their efforts to the development of resources for other languages, by applying translation or other methodologies. Let us cite e.g. FEEL (Abdaoui et al., 2017), a French lexicon where words are associated with polarity and emotions obtained by applying translation tools to NRC-EmoLex [3] and manually validating the results.

[2] http://www.urbandictionary.com
[3] http://www.saifmohammad.com/WebPages/lexicons.html

2.2 Qualitative Analysis of Benchmark Lexica

In order to assess the coverage and correctness of each benchmark lexicon, we selected from our reference sample corpus the list of words that, according to a human judge, carry some affective value in the context of the tweet where they appear. Then, for each entry of this list and for each benchmark lexicon, we observed whether the word is represented in the resource and has the same polarity.

Given the preliminary nature of this investigation, only a couple of researchers have been involved in the task. Moreover, a further limit of our current research approach depends on the reference to a given context (the one determined by our sample corpus); issues related to the context will be accounted for in future investigations.

We observed different coverages of the benchmark lexica on our Twitter corpus, first of all in terms of the number of affective words occurring in the tweets for each lexicon. The full vocabulary of the tweets is composed of 2,706 different words. Only some of these words carry some affective value, and focusing on them only we observed the following occurrences: 160 words in AFINN, 190 words in HaL, 302 words in SWN and 551 in Sentix. These word sets partially overlap, since 69 words are included in all the lexica.

                     Error
Resource    (i)    (ii)   (iii)   (iv)
AFINN       1.2    2.5    16.8     8.7
HaL         1.5    1.0    12.6    12.6
SWN         5.9    1.6    15.5    13.2
Sentix      5.9    2.1    15.2    16.6

Table 1: Distribution of different errors in the benchmark lexica (percentage with respect to the coverage of the lexicon).

The total number of words that are missing or have an erroneous polarity in the benchmark lexica is 388. As far as erroneous polarization is concerned, as summarized in Table 1, these words show four different kinds of errors: (i) a positive word is annotated as negative; (ii) a negative word is annotated as positive; (iii) a neutral [4] word is annotated as positive; and (iv) a neutral word is annotated as negative. The values are expressed as percentages with respect to the coverage of the lexica. For all lexica the errors fall predominantly in the last two classes, i.e. (iii) and (iv), which supports the hypothesis that in the automatic transition between English and Italian several non (clearly) polarized Italian words were instead polarized.

[4] We considered neutral a word whose polarity may vary across contexts, indicated by None in Table 2.

Nevertheless, observing Table 1, we can also see that all the lexica show very similar amounts of errors, regardless of the methodology applied for their development (i.e. translation or extraction from semantic databases). Several errors, in particular those concerning the polarity associated with specific words, can be generated during translation, and a portion of them is therefore explained by the application of translation tools, mainly because they do not consider the context where each word occurs. But observing the results extracted from Sentix, which is not obtained simply by translation, and taking into account the larger coverage of this resource, we can see that errors occur in a percentage comparable to that of the other resources. In this case the problem probably depends on the misalignment of synsets for different languages. For example, the Italian word "istituto", whose meaning can be "school" or "institution", is aligned with "prison" and "house/prison", with a negative polarity which is not appropriate for the Italian word.

Several errors could probably be avoided in the transition between languages by applying a pre-processing step that includes Part of Speech tagging and considers the grammatical category of the source and target terms. See for instance the word tagliando (cutting), which occurs in the corpus as a Verb and in the benchmark lexica is instead aligned with the corresponding noun with the meaning of voucher/coupon. This motivates our decision to attribute PoS tags to the words in the first nucleus of a novel resource obtained by extending and correcting the existing ones. The overall impression is that a manual check, even if it is a very time-consuming task, is always necessary and unavoidable, both when the new lexicon is obtained by translation and when it is obtained relying on synset alignment.

3 Lost in Translation: Impact of the Errors

The methodology, even if applied to a small set of tweets and based on a manual check of the benchmark lexica, confirms the hypothesis that many directions can be followed to improve the quality of existing lexical resources. The first result of this preliminary analysis is the collection of a list of words with associated polarity which will be the nucleus of the novel resource, i.e. polarITA. Each of the words in polarITA has been annotated with an overall polarity value (i.e., positive, negative, or none) and its corresponding Part-Of-Speech (POS) label. Table 2 summarizes the distribution of the words in polarITA in terms of polarity and POS labels.

Experiments on a larger corpus and a quantitative analysis based on a more formal classification of errors are needed for the development of a fully reliable lexical resource, together with an in-depth investigation of the relevance of context in the attribution of polarity, which is a very important issue. A comparison of the results that a given SA engine exploiting features extracted from sentiment lexica, for instance IRADABE (Hernández Farías et al., 2014; Buscaldi and Hernández Farías, 2016), obtains using each of the benchmark lexica and using polarITA is planned as future work for the evaluation of the novel lexicon. Such a comparison is not currently feasible because of the limited size of our reference corpus and the consequent partial coverage of errors.

Considering the current preliminary stage of development of polarITA, we tried an extrinsic evaluation for detecting the impact on the performance of SA systems of the errors currently affecting the benchmark lexica and corrected in the novel lexicon. We compared the words which are missing or assigned an erroneous polarity in the benchmark lexica with the Italian words most commonly used and understood by native speakers, whose collection is available in the Vocabolario di base della lingua italiana (vocItalian) [5], recently released in a new edition. Like the first version of this resource, published in 1980 (De Mauro, 1980), it includes three word classes: 2,999 High Usage words (HU), 2,231 High Availability words (HA) and 1,979 Foundational words (FO).

[5] https://www.internazionale.it/opinione/tullio-de-mauro/2016/12/23/il-nuovo-vocabolario-di-base-della-lingua-italiana

In polarITA we have collected so far 284 words of the vocItalian, whose distribution across the three classes is shown in Table 2. Among the words in the FO category we found "bene" (good), "mentire" (lie), and "giustizia" (justice), while words like "assassino" (killer), "preoccupato" (worried), and "entusiasta" (enthusiastic) are part of the HU category. Finally, in the HA category it is possible to find words such as "dannoso" (harmful) and "emozionante" (exciting).

Total words             388

Polarity
Positive                225
Negative                140
None                     23

Part-of-speech labels
Adjective                84
Adjective/Noun            1
Adjective/Pronoun         2
Adverb                   16
Interjection              3
Noun                    187
Noun/Adverb               1
Preposition               1
Pronoun                   1
Verb                     92

vocItalian
FO                      187
HU                       86
HA                       11

Table 2: Distribution of the words in polarITA in terms of polarity, POS labels, and vocItalian.

This analysis suggests some hints for further investigation, showing that the failures of the lexica currently available for Italian SA affect words very commonly used in communication, and therefore that the improvement of these resources may hopefully result in an advancement for SA and related tasks.

4 Conclusions and Future Work

In this paper we present a preliminary investigation of a methodology for the development of a novel lexical resource for Italian SA, namely polarITA, which takes advantage of the analysis and filtering of errors occurring in the available lexical resources. We carried out a manual analysis of a set of tweets to determine the reliability of sentiment-related lexica, showing that, even if the transfer of lexical information between two different languages is a common practice to address the lack of resources, information related to sentiment is lost in the process. The identified errors are then exploited as a starting point for developing the novel resource.

As future work, we are planning to extend the resource in several directions: by investigating multi-word expressions, extending the coverage to a larger corpus, and exploring the impact of figurative language devices such as irony and sarcasm on the use of certain polarized words (Hernández Farías et al., 2016). Moreover, our future effort will be oriented to the automatization of a larger part of the methodology and its application to other currently under-resourced languages.

Acknowledgements

C. Bosco and V. Patti were partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618 L2 BOSC 01) and by Fondazione CRT (Hate Speech and Social Media, 2016.0688).

References

Amine Abdaoui, Jérôme Azé, Sandra Bringay, and Pascal Poncelet. 2017. FEEL: a French Expanded Emotion Lexicon. Language Resources and Evaluation, 51:833–855, September.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), pages 2200–2204, Valletta, Malta. European Language Resources Association (ELRA).

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTiment POLarity Classification Task. In Basile, Cutugno, Nissim, Patti, and Sprugnoli, editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). CEUR Workshop Proceedings.

Valerio Basile and Malvina Nissim. 2013. Sentiment Analysis on Italian Tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107, Atlanta, USA. Association for Computational Linguistics.

Valerio Basile, Andrea Bolioli, Malvina Nissim, Viviana Patti, and Paolo Rosso. 2014. Overview of the Evalita 2014 SENTIment POLarity Classification Task. In Proceedings of the 4th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2014), Pisa, Italy.

Pierpaolo Basile, Francesco Cutugno, Malvina Nissim, Viviana Patti, and Rachele Sprugnoli. 2017. Evalita goes social: Tasks, data, and community at the 2016 edition. IJCoL - Italian Journal of Computational Linguistics, 3(1):93–127.

Valeria Borzì, Simone Faro, Arianna Pavone, and Sabrina Sansone. 2015. Prior Polarity Lexical Resources for the Italian Language. CoRR, abs/1507.00133.

Davide Buscaldi and Delia Irazú Hernández Farías. 2016. IRADABE2: Lexicon Merging and Positional Features for Sentiment Analysis in Italian. In Proceedings of the 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2016). aAcademia University Press.

Tullio De Mauro. 1980. Guida all'uso delle parole. Num. 3 dei Libri di base. Editori Riuniti, Roma.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Delia Irazú Hernández Farías, Davide Buscaldi, and Belém Priego-Sánchez. 2014. IRADABE: Adapting English Lexicons to the Italian Sentiment Polarity Classification task. In First Italian Conference on Computational Linguistics (CLiC-it 2014) and the fourth International Workshop EVALITA 2014, pages 75–81.

Delia Irazú Hernández Farías, Viviana Patti, and Paolo Rosso. 2016. Irony Detection in Twitter: The Role of Affective Content. ACM Trans. Internet Technol., 16(3):19:1–19:24.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 168–177, New York, NY, USA. ACM.

Saif M. Mohammad. 2016. Sentiment Analysis: Detecting Valence, Emotions, and Other Affectual States from Text. In Herb Meiselman, editor, Emotion Measurement. Elsevier.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment Analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1–18, San Diego, California.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network. Artificial Intelligence, 193:217–250.

Finn Årup Nielsen. 2011. A new ANEW: evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages, volume 718 of CEUR Workshop Proceedings, pages 93–98, Heraklion, Crete, Greece. CEUR-WS.org.

Malvina Nissim and Viviana Patti. 2017. Semantic aspects in sentiment analysis. In Federico Alberto Pozzi, Elisabetta Fersini, Enza Messina, and Bing Liu, editors, Sentiment Analysis in Social Networks, pages 31–48. Morgan Kaufmann, Boston.

Lucia Passaro, Laura Pollacci, and Alessandro Lenci. 2015. ItEM: A Vector Space Model to Bootstrap an Italian Emotive Lexicon. volume II.

E. Pianta, L. Bentivogli, and C. Girardi. 2002. MultiWordNet: Developing an Aligned Multilingual Database. In Proceedings of International Conference on Global WordNet.

Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. In R. Plutchik and H. Kellerman, editors, Emotion: Theory, research, and experience: Vol. 1. Theories of emotion, pages 3–33. Academic Press, New York.

Marco Stranisci, Cristina Bosco, Delia Irazú Hernández Farías, and Viviana Patti. 2016. Annotating Sentiment and Irony in the Online Italian Political Debate on #labuonascuola. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA).
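
Illustrative sketch. The coverage and error analysis described in Section 2.2 was carried out manually; the short Python sketch below only illustrates how the bookkeeping could be organized once the human judgements are available. It is a minimal sketch under stated assumptions, not the procedure used for the paper: the function name analyse_lexicon, the dictionary-based lexicon format, the treatment of unjudged words as neutral, and all toy polarity values are hypothetical and chosen for illustration (only the negative polarity of "istituto" echoes the example discussed in Section 2.2).

```python
from collections import Counter

# Error classes of Section 2.2, keyed by (human polarity, lexicon polarity).
ERROR_TYPES = {
    ("positive", "negative"): "i",
    ("negative", "positive"): "ii",
    ("neutral", "positive"): "iii",
    ("neutral", "negative"): "iv",
}


def analyse_lexicon(tweet_words, lexicon, human_polarity):
    """Compare one lexicon against human judgements on the corpus vocabulary.

    tweet_words: iterable of words occurring in the tweets.
    lexicon: dict mapping a word to "positive" or "negative".
    human_polarity: dict mapping a word to "positive", "negative" or
        "neutral"; words not listed are treated as neutral (an assumption
        made for this sketch).
    """
    vocabulary = set(tweet_words)
    # Coverage: vocabulary words that the lexicon treats as polarized.
    covered = sorted(w for w in vocabulary if w in lexicon)

    errors = Counter()
    for word in covered:
        expected = human_polarity.get(word, "neutral")
        kind = ERROR_TYPES.get((expected, lexicon[word]))
        if kind is not None:
            errors[kind] += 1

    # Words judged polarized by the human annotator but absent from the lexicon.
    missing = sorted(
        w for w in vocabulary
        if human_polarity.get(w, "neutral") != "neutral" and w not in lexicon
    )

    # Error rates as a percentage of the lexicon's coverage, as in Table 1.
    rates = {
        kind: (100.0 * errors[kind] / len(covered)) if covered else 0.0
        for kind in ("i", "ii", "iii", "iv")
    }
    return {"coverage": len(covered), "error_rates": rates, "missing": missing}


# Toy data, invented for illustration only.
toy_words = ["istituto", "bene", "mentire", "dannoso", "riforma"]
toy_lexicon = {
    "istituto": "negative",
    "bene": "positive",
    "mentire": "positive",
    "riforma": "positive",
}
toy_human = {"bene": "positive", "mentire": "negative", "dannoso": "negative"}

result = analyse_lexicon(toy_words, toy_lexicon, toy_human)
print(result)
```

Running the same function once per benchmark lexicon would yield one row of error percentages per resource, in the spirit of Table 1, while the union of the missing and erroneous words would correspond to the candidate entries for the new resource.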