Emo2Val: Inferring Valence Scores from fine-grained Emotion Values

Alessandro Bondielli, Lucia C. Passaro and Alessandro Lenci
CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica
University of Pisa (Italy)
alessandro.bondielli@gmail.com lucia.passaro@for.unipi.it alessandro.lenci@unipi.it

Abstract

English. This paper studies the relationship between valence, one of the psycholinguistic variables in the Italian version of ANEW (Montefinese et al., 2014), and emotive scores calculated by exploiting distributional methods (Passaro et al., 2015). We show two methods to infer valence from fine-grained emotions and discuss their evaluation.

Italiano. Questo lavoro studia la relazione tra la valenza, una delle variabili psicolinguistiche presenti nella versione italiana di ANEW (Montefinese et al., 2014) e degli score emotivi calcolati distribuzionalmente (Passaro et al., 2015). Mostriamo due metodi per inferire la valenza a partire da tali valori e ne discutiamo la valutazione.

1 Introduction

Recent years have seen a surge in studies concerning emotional ratings, both in psycholinguistics and in affective computing. Traditionally, the three main behavioral dimensions used to measure the emotional value of a word are valence, arousal and dominance. Warriner et al. (2013) define valence as the “pleasantness of the stimulus”, usually ranging from 1 (very unpleasant) to 9 (very pleasant). The word dead has a low valence rating, whereas holiday has a higher one. Arousal is the intensity of the feeling evoked, on a scale from “stimulated” to “unaroused”. A highly stimulating word is passion. On the contrary, sleep is not arousing. Finally, dominance is identified with the degree to which the stimulus makes the reader feel “in control” (Louwerse and Recchia, 2014). Victory is a word with high dominance.

In the domain of Affective Computing, the goal moves from the identification of such variables to the annotation of texts with the emotions they express and, for Sentiment Analysis, with their degree of positivity and/or negativity.

The aim of this work is to study the relationship between the most important psycholinguistic variables and emotive scores calculated by exploiting distributional methods. In particular, we will focus on valence ratings, assuming that, among these three dimensions, valence is the one most closely related to a positive, negative or neutral emotional content. In fact, it can be defined as “the polarity of emotional activation” (Lang et al., 1999).

A possible approach to infer the valence of words from co-occurrence statistics is the one adopted by Louwerse and Recchia (2014), who followed a bootstrapping method to extend the ANEW lexicon (Bradley and Lang, 1999). Another approach would be to exploit a resource such as SenticNet (Cambria et al., 2016) to infer valence based on polarity values for words or conceptual primitives. An alternative strategy is to infer the valence from an emotive lexicon such as ItEM (Passaro et al., 2015; Passaro and Lenci, 2016), a distributional lexicon for Italian in which words are associated with an emotive score for 8 different emotions. In our opinion, this solution has several advantages: first of all, ItEM has been proven to be quite robust, and guarantees high coverage over Italian words; secondly, it is not only a static resource, but it can be easily expanded with new words, allowing for a quick adaptation to different contexts. Finally, associating words with fine-grained emotional values allows for a wide range of analyses, such as hate and violence detection in texts.

Experimental results showed, in an indirect way, that distributional emotive ratings can be very useful in the implementation of systems for
polarity classification (Passaro and Lenci, 2016; Bondielli, 2016). However, what is the real relation between emotive scores and valence? Our hypothesis is that emotions can be seen as a representation of valence on a more granular scale. Plutchik’s emotion taxonomy (Plutchik, 1994; Plutchik, 2001) is partitioned into positive and negative emotions. However, borderline emotions such as SURPRISE are harder to assign to a positive or negative class, and therefore to associate with a direct valence rating. Words like party and gun will have widely differing valence ratings, but both strongly elicit the emotion of SURPRISE. Hence it is interesting to ask the following question: given ItEM, are we able to predict the valence (i.e., positivity and/or negativity) of its words? In order to address this point, we fitted a simple regression model to predict the valence ratings of words in ANEW (Montefinese et al., 2014) given the respective emotive values in ItEM (Passaro et al., 2015; Passaro and Lenci, 2016).

This paper is organized as follows: in Section 2 we describe the resources used for the creation of the model. Section 3 shows our method and the results obtained. Finally, in Section 4 we evaluate the results and discuss our findings.

2 Resources

The main resources we used for our experiments are the Italian version of the Affective Norms for English Words (Montefinese et al., 2014) and the Italian EMotive lexicon (Passaro et al., 2015).

2.1 Italian ANEW

ANEW (Affective Norms for English Words) (Bradley and Lang, 1999) is a database created from a rating of 1034 English words with values for valence, arousal and dominance. Montefinese et al. (2014) provided an Italian version of ANEW, developed by translating the English ANEW words and by adding words taken from the Italian semantic norms (Montefinese et al., 2012), for a total of 1121 words. Ratings were obtained via an experiment in which participants had to rate words for the target variables. The reported ratings are the average of the ratings over all participants.

2.2 ItEM

ItEM (Passaro et al., 2015; Passaro and Lenci, 2016) is an emotive lexicon for Italian, in which each target term is associated with a score quantifying its association with each emotion in Plutchik’s taxonomy (Plutchik, 1994): JOY, SADNESS, ANGER, FEAR, TRUST, DISGUST, SURPRISE and ANTICIPATION. The resource was created as follows: in a first phase, feature elicitation was used to create a small set of seed lemmas highly associated with one or more of the emotions in the taxonomy. Then, these lemmas were distributionally expanded with the most frequent words in two Italian corpora (Baroni et al., 2004; Baroni et al., 2009). Finally, the emotive scores for each word were calculated by measuring the cosine similarity between the lemma and eight emotive centroids built from the collected seeds.

3 From fine-grained Emotion Values to Polarity

We used two main regression models to predict the valence from the distributional emotive scores. The first experiment, described in Section 3.1, uses a polynomial regression model, and the second one (Section 3.2) uses a logistic model in which the valence scores in ANEW have been discretized into two classes representing the positiveness and negativeness of the word.

A simple preprocessing phase was applied to align the two resources. ANEW has 1121 words, but 65 of them have multiple POS (e.g. aereo (plane) can be both a noun and an adjective). We duplicated each of these words, extending the dataset to 1186 elements, and extracted distinct emotive scores for each word-POS pair. In addition, we replaced word forms like “scorie” (waste) with their most frequent word type (scoria) in ItWaC (Baroni et al., 2009) and La Repubblica (Baroni et al., 2004). In the end, 57 ANEW words were left out of the analysis because they were not in ItEM. Overall, the resulting size of the aligned dataset is 1129 elements. Finally, to cope with the different distribution of data among the various emotions in ItEM, we normalized the scores with their z-score.

3.1 Polynomial regression

Due to the bimodal distribution of the data in ANEW, we decided to use a polynomial regression model to predict the valence of the words in ANEW by exploiting their normalized emotive scores in ItEM. Preliminary tests had in fact shown that a simple multiple linear regression model was not able to properly fit the data. The histogram in Figure 1 shows this data distribution, in which most of the ANEW words have a valence score in the ranges 2-3 and 6-8, with a slight bias towards higher values.

Figure 1: Valence ratings distribution

To select the best-performing degree (Deg) of the polynomial function, we performed 10-fold cross validation for degrees in the range {1...5}. The results, presented in Table 1, clearly show overfitting for degrees equal to or higher than 3. This is due to the fact that, given the number of parameters (#P), the estimated minimum number of observations (Min. Obs.), computed as #P × 15, must be at most around the total number of observations. This is true only for polynomials of degree 1 and 2. This finding is in line with Schmidt (1971) and Harrell (2001), who demonstrated that, to guarantee the reliability of the prediction, each parameter in the regression model should have a minimum number of observations between 10 and 20.

Deg   #P     Min. Obs.   R²       MSE
1     9      ∼135        0.46     2.24
2     45     ∼675        0.53     1.82
3     165    ∼2475       0.31     1.50
4     495    ∼7425       −81.29   0.96
5     1287   ∼19305      −11 B    0.00

Table 1: Experiments performed to define the best-performing Deg for the polynomial

Given this result, we performed a polynomial interpolation over our parameters with a polynomial of degree 2. Then, we applied a simple multiple linear regression over the new data to predict the valence. Figure 2 shows the result of the regression fitting. For this model, we obtained an R-squared (R²) of 0.58, a mean absolute error (MeanAE) of 1.08, a mean squared error (MSE) of 1.81, and a median absolute error (MedianAE) of 0.95.

Figure 2: Fitting of predictions

For this experiment, we also provide two additional evaluations (the corresponding results are shown in Table 2):

A) the results of prediction by means of a 10-fold cross validation;

B) the results of prediction by means of a split of the data between training (66%) and test (33%).

Method   R²     MeanAE   MSE    MedianAE
A        0.53   1.13     1.99   0.98
B        0.54   1.13     2.00   0.93

Table 2: Results of the evaluations

We note that our prediction performs better for words with very high arousal. In fact, emotionally arousing words were more likely to be produced as emotive prototypical words in the elicitation phase of ItEM. As a consequence, since ItEM’s emotive centroids have been constructed using the vectors of these words (namely the seeds), their nearest neighbors (i.e., the most emotive words) are also assumed to have a high level of arousal. Moreover, the distribution of the data in Figure 3 clearly shows how, in ANEW, high arousal corresponds to very high (or very low) valence ratings, suggesting that highly arousing words tend to be very positive or very negative (i.e. polarized). Building on this evidence, we performed an additional experiment in which we used for prediction only the portion of the data (573 words) with an arousal rating higher than its median (5.64). In this model, R² reaches ∼0.64.

Figure 3: Valence-Arousal distribution

Given the distribution of the data shown in Figure 2, it is clear that a polynomial regression might not be a perfect fit for valence ratings. Nevertheless, it is important to focus on the MeanAE and MSE values. These errors are relatively low with respect to the scale of the human-rated valences. This means that, on average, the difference between human-rated valence and predicted valence is between 1 and 2. To support this point, we also compared the obtained scores with the original human annotations, by exploiting the standard deviation for each valence rating. We found that 73.5% of our predictions fall into the correct range around the average valence. If we consider a word having (in ANEW) a valence score of around 8 (e.g. pace (peace)), the system will predict a score between 6 and 9, leaving the word in the same (positive) area of the distribution. The same (and opposite) goes for low-valenced words, such as drogato (drug addicted) and feccia (scum). Problems arise in the case of words with a medium valence. Examples are corridoio (corridor) and insipido (bland). In this case, the word will have the same chance of being assigned a high valence score (5-6) or a low one (3-4). If we discretize valence ratings into two classes, a positive and a negative one, with a cut on the median, predictions will fall in the right class for most of the high (or low) valenced words, and (possibly) in the wrong one for the words of medium valence. In fact, by constructing a shallow mapping of the valence into a positive (with valence >= 5.5) and a negative class, we found a correlation of 0.73 between predicted and actual data.

3.2 Logistic regression

Building on the last experiment, and assuming a discretization of the valence into a positive and a negative class, we also used a logistic regression model to predict this binary valence. The results of this experiment are very promising. We performed 10-fold cross validation to evaluate the effectiveness of the logistic regression over the transformed valence ratings, and obtained a mean accuracy of 0.80. Detailed results for this evaluation are shown in Table 3.

           Precision   Recall   F1
MicroAVG   0.806       0.803    0.802
MacroAVG   0.803       0.803    0.803

Table 3: Logistic regression (Cross Validation)

4 Results and discussion

The results of the previous experiments showed both pros and cons of this approach. The main advantage of exploiting distributional emotive scores to predict a word’s valence is that such scores can be easily obtained in an unsupervised way by means of co-occurrence statistics. Moreover, predicted data showed rather good accuracy with respect to the actual distribution, especially in the logistic regression experiment. In fact, our models reach peak performance when the analysis focuses on the sign of the valence with logistic regression instead of working with continuous values.

On the other hand, the main drawback of our approach derives from the size of the ANEW dataset, and in particular from the lack of examples around the medium valence ratings. It is clear that the ratings distribution in this resource prevented us from obtaining reliable results for continuous values. This might also provide an explanation for the errors in the logistic regression experiment. We are confident that having access to a new resource covering the full spectrum of the valence more evenly would have a positive impact on our model.

5 Conclusions and ongoing research

In this work we studied the relationship between valence and distributional emotive scores. We modeled our data with regression in order to predict both a continuous score for valence and its corresponding binarized version (i.e., polarity).

Despite the difficulties of modeling an accurate representation of a continuous valence rating from a small and unbalanced dataset like the Italian ANEW, we can identify a clear relationship between distributional emotional scores and a discrete valence obtained by categorizing the ratings into a positive and a negative class.

In the near future, we plan to improve our regression models, with the aim of reducing the impact of the distribution of the data in ANEW, possibly implementing new strategies able to cope with non-linear data. ANEW is a highly renowned psycholinguistic dataset, but we plan to extend the present work to predict sentiment polarity scores taken from SentiWordNet (Esuli and Sebastiani, 2006a; Esuli and Sebastiani, 2006b), thereby exploiting the larger coverage of this resource.

Moreover, we plan to follow the approach employed in ItEM to create a polarity lexicon for Italian, using ANEW words as seeds to build positive and negative polarity centroids. This would also be beneficial for comparing the performance of an emotion-based approach and a polarity-based one.

Finally, we aim at testing the effectiveness of our system for Sentiment Polarity Classification.

References

Marco Baroni, Silvia Bernardini, Federica Comastri, Lorenzo Piccioni, Alessandra Volpi, Guy Aston, and Marco Mazzoleni. 2004. Introducing the La Repubblica corpus: A large, annotated, TEI(XML)-compliant corpus of newspaper Italian. issues, 2:5–163.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Alessandro Bondielli. 2016. Da facebook a twitter: Creazione e utilizzo di una risorsa lessicale emotiva per la sentiment analysis di tweet. Master’s thesis, University of Pisa, Italy.

Margaret M. Bradley and Peter J. Lang. 1999. Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical Report C-1, The Center for Research in Psychophysiology, University of Florida.

Erik Cambria, Soujanya Poria, Rajiv Bajpai, and Björn W. Schuller. 2016. SenticNet 4: A semantic resource for sentiment analysis based on conceptual primitives. In COLING, pages 2666–2677.

A. Esuli and F. Sebastiani. 2006a. Determining term subjectivity and term orientation for opinion mining. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL06), Trento (Italy). Association for Computational Linguistics.

A. Esuli and F. Sebastiani. 2006b. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of the 5th International Conference on Language Resources and Evaluation, pages 417–422, Genoa (Italy). European Language Resource Association (ELRA).

F.E. Harrell. 2001. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Graduate Texts in Mathematics. Springer.

Peter J. Lang, Margaret M. Bradley, and Bruce N. Cuthbert. 1999. International affective picture system (IAPS): Technical manual and affective ratings. Gainesville, FL: The Center for Research in Psychophysiology, University of Florida, 2.

M.M. Louwerse and G. Recchia. 2014. Reproducing affective norms with lexical co-occurrence statistics: Predicting valence, arousal, and dominance. The Quarterly Journal of Experimental Psychology, 68(12):1–15.

Maria Montefinese, Ettore Ambrosini, Beth Fairfield, and Nicola Mammarella. 2012. Semantic memory: A feature-based analysis and new norms for Italian. Behavior Research Methods, pages 1–22, oct.

Maria Montefinese, Ettore Ambrosini, Beth Fairfield, and Nicola Mammarella. 2014. The adaptation of the affective norms for English words (ANEW) for Italian. Behavior Research Methods, 46(3):887–903.

Lucia C. Passaro and Alessandro Lenci. 2016. Evaluating context selection strategies to build emotive vector space models. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož (Slovenia).

Lucia C. Passaro, Laura Pollacci, and Alessandro Lenci. 2015. ItEM: A vector space model to bootstrap an Italian emotive lexicon. CLiC it, 60(15):215.

Robert Plutchik. 1994. The psychology and biology of emotion. HarperCollins College Publishers.

R. Plutchik. 2001. The nature of emotions. American Scientist, 89:344–350.

Frank L. Schmidt. 1971. The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 31(3):699–714.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207.
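
As a supplementary note on the degree selection in Section 3.1, the #P and Min. Obs. columns of Table 1 can be reproduced with a short sketch. It assumes, since the paper does not state the formula, that the polynomial expansion contains every monomial of the 8 emotive scores up to the chosen degree plus a bias term, which gives C(8 + d, d) parameters for degree d. The function and constant names below are hypothetical and chosen only for illustration.

```python
from math import comb

N_EMOTIONS = 8         # one ItEM score per Plutchik emotion
OBS_PER_PARAM = 15     # multiplier used in Section 3.1 (#P x 15)
N_OBSERVATIONS = 1129  # size of the aligned ANEW/ItEM dataset


def n_parameters(degree, n_features=N_EMOTIONS):
    """Number of terms in a full polynomial expansion, bias included.

    Assumption: all monomials up to `degree` are kept, so the count is
    the binomial coefficient C(n_features + degree, degree).
    """
    return comb(n_features + degree, degree)


def min_observations(degree, per_param=OBS_PER_PARAM):
    """Estimated minimum number of observations, computed as #P x 15."""
    return n_parameters(degree) * per_param


# One row per degree: (Deg, #P, Min. Obs.)
table = [(d, n_parameters(d), min_observations(d)) for d in range(1, 6)]

# Degrees whose estimated requirement fits within the available data
supported = [d for d, _, needed in table if needed <= N_OBSERVATIONS]

for deg, n_params, needed in table:
    print(deg, n_params, needed)
print("degrees within the observation budget:", supported)
```

Under this assumption the sketch yields 9, 45, 165, 495 and 1287 parameters for degrees 1 to 5, matching Table 1, and only degrees 1 and 2 stay within the 1129 available observations.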