Feature Selection for Emotion Classification∗

Alberto Purpura (purpuraa@dei.unipd.it), University of Padua, Padua, Italy
Chiara Masiero (chiara.masiero@statwolf.com), Statwolf Data Science, Padua, Italy
Gianmaria Silvello (silvello@dei.unipd.it), University of Padua, Padua, Italy
Gian Antonio Susto (sustogia@dei.unipd.it), University of Padua, Padua, Italy

∗ Extended abstract of the original paper published in [8]. This work was supported by the CDC-STARS project and co-funded by UNIPD. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IIR 2019, September 16–18, 2019, Padova, Italy.

ABSTRACT
In this paper, we describe a novel supervised approach to extract a set of features for document representation in the context of Emotion Classification (EC). Our approach employs the coefficients of a logistic regression model to extract the most discriminative word unigrams and bigrams to perform EC. In particular, we employ this set of features to represent the documents, while we perform the classification using a Support Vector Machine. The proposed method is evaluated on two publicly available and widely-used collections. We also evaluate the robustness of the extracted set of features on different domains, using the first collection to perform feature extraction and the second one to perform EC. We compare the obtained results to similar supervised approaches for document classification (i.e. FastText), EC (i.e. #Emotional Tweets, SNBC and UMM) and to a Word2Vec-based pipeline.

CCS CONCEPTS
• Information systems → Content analysis and feature selection; Sentiment analysis; • Computing methodologies → Supervised learning by classification.

KEYWORDS
Supervised Learning, Feature Selection, Emotion Classification, Document Classification

1 INTRODUCTION
The goal of Emotion Classification (EC) is to detect and categorize the emotion(s) expressed by a human. We can find numerous examples in the literature presenting ways to perform EC on different types of data sources such as audio [10] or microblogs [8]. Emotions have a large influence on our decision making. For this reason, being able to identify them can be useful not only to improve the interaction between humans and machines (e.g. with chatbots or robots), but also to extract useful insights for marketing goals [7]. Indeed, EC is employed in a wide variety of contexts which include – but are not limited to – social media [8] and online stores – where it is closely related to Sentiment Analysis [9] – with the goal of interpreting emerging trends or better understanding the opinions of customers.

In this work, we focus on EC approaches which can be applied to textual data. The task is most frequently tackled as a multi-class classification problem: given a document d and a set of candidate emotion labels, the goal is to assign one label to d – sometimes more than one label can be assigned, changing the task to multi-label classification. The most widely used set of emotions in computer science is the set of the six Ekman emotions [3] (i.e. anger, fear, disgust, joy, sadness, surprise). Traditionally, EC has been performed using dictionary-based approaches, i.e. lists of terms which are known to be related to certain emotions, as in ANEW [2]. However, there are two main issues which limit their application on a large scale: (i) they cannot adapt to the context or domain where a word is used; (ii) they cannot infer an emotion label for portions of text which do not contain any of the terms available in the dictionary. A possible alternative to dictionary-based approaches are machine learning and deep learning models based on an embedded representation of words, such as Word2Vec [5] or FastText [4]. These approaches, however, need large amounts of data to train an accurate model and cannot easily adapt to low-resource domains. For this reason, we present a novel approach for feature selection and a pipeline for emotion classification which outperform state-of-the-art approaches without requiring large amounts of data. Additionally, we show that the proposed approach generalizes well to different domains. We evaluate our approach on two popular and publicly available data sets – i.e. the Twitter Emotion Corpus (TEC) [6] and the SemEval 2007 Affective Text Corpus (1,250 Headlines) [12] – and compare it to state-of-the-art approaches for document representation – such as Word2Vec and FastText – and classification – i.e. #Emotional Tweets [6], SNBC [11] and UMM [1].
2 PROPOSED APPROACH
The proposed approach exploits the coefficients of a multinomial logistic regression model to extract an emotion lexicon from a collection of short textual documents. First, we extract all word unigrams and bigrams in the target collection after performing stopword removal.¹ Second, we represent the documents using the vector space model with TF-IDF weights. Then, we train a logistic regression model with elastic-net regularization to perform EC. This model is characterized by the following loss function:

\ell(\{\beta_{0k}, \beta_k\}_1^K) = -\frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{k=1}^{K} y_{ik} \, (\beta_{0k} + x_i^T \beta_k) - \log\left( \sum_{k=1}^{K} e^{\beta_{0k} + x_i^T \beta_k} \right) \right] + \lambda \left[ (1-\alpha) \, \|\beta\|_F^2 / 2 + \alpha \sum_{j=1}^{p} \|\beta_j\|_1 \right]    (1)

where β is a (p+1)×K matrix of coefficients and βk refers to the k-th column (for outcome category k). The last penalty term, the sum of the ||βj||1 norms, is a lasso penalty on the coefficients, which induces a sparse solution. To solve this optimization problem we use the partial Newton algorithm, making a partial quadratic approximation of the log-likelihood and allowing only (β0k, βk) to vary for a single class at a time. For each value of λ, we first cycle over all classes indexed by k, computing each time a partial quadratic approximation about the parameters of the current class.² Finally, we examine the β-coefficients for each class of the trained model and keep the features (i.e. word unigrams and bigrams) associated with non-zero weights in any of the classes.

To evaluate the quality of the extracted features, we perform EC using a Support Vector Machine (SVM). We consider a vector representation of documents based on the set of features extracted as described above, weighting them according to their TF-IDF score.

¹ We employ a list of 170 English terms, see nltk v.3.2.5 (https://www.nltk.org).
² A Python implementation which optimizes the parameters of the model is available at https://github.com/bbalasub1/glmnet_python/blob/master/docs/glmnet_vignette.ipynb.
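To make the pipeline concrete, the sketch below reproduces the steps above in Python with scikit-learn. This is a minimal illustration under our own assumptions rather than the authors' implementation: the paper relies on the glmnet_python library (see footnote 2), so here LogisticRegression with the saga solver stands in for the elastic-net multinomial regressor solved with the partial Newton algorithm, and the helper names and all hyperparameter values (l1_ratio, max_iter, the SVM defaults) are illustrative placeholders.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    def extract_features(docs, labels, stopwords):
        # Steps 1-2: word unigrams and bigrams after stopword removal
        # (the paper uses nltk's English stopword list), weighted with
        # TF-IDF (vector space model).
        vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words=stopwords)
        X = vectorizer.fit_transform(docs)
        # Step 3: multinomial logistic regression with elastic-net
        # regularization (Eq. 1); the lasso component zeroes out most weights.
        logreg = LogisticRegression(penalty="elasticnet", solver="saga",
                                    l1_ratio=0.5, max_iter=5000)
        logreg.fit(X, labels)
        # Step 4: keep every n-gram with a non-zero coefficient in any class.
        mask = np.abs(logreg.coef_).max(axis=0) > 0
        return vectorizer, mask

    def train_ec_classifier(docs, labels, vectorizer, mask):
        # EC step: an SVM over the TF-IDF vectors restricted to the
        # selected unigrams and bigrams.
        X = vectorizer.transform(docs)[:, mask]
        return LinearSVC().fit(X, labels)

An unseen document is then classified by transforming it with the same vectorizer, applying the same feature mask, and calling predict on the trained SVM.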
3 RESULTS
For the evaluation of the proposed approach we consider the TEC and 1,250 Headlines collections. TEC is composed of 21,051 tweets which were labeled automatically – according to the set of six Ekman emotions – using the hashtags they contained, which were removed afterwards. We split the collection into a training and a test set of equal size to train the logistic regression model for feature selection. Then, we perform a 5-fold cross-validation to train an SVM for EC using the previously extracted features and report in Table 1 the average of the results over all six classes, obtained over the five folds. We also report in Table 1 the performance of FastText – which we computed in the same way – and that of SNBC as described in [11].

Method               Mean Precision   Mean Recall   Mean F1 Score
Proposed Approach    0.509            0.477         0.490
#Emotional Tweets    0.474            0.360         0.406
FastText             0.504            0.453         0.461
SNBC                 0.488            0.499         0.476

Table 1: Comparison with #Emotional Tweets, FastText and SNBC on the TEC data set.

From the results in Table 1, we observe that the proposed classification pipeline outperforms almost all of the selected baselines on the TEC data set. The only exception is SNBC, compared to which we achieve a slightly lower Recall (-0.022).

The 1,250 Headlines data set is a collection of 1,250 newspaper headlines divided into a training (1,000 headlines) and a test (250 headlines) set. We employ this data set to evaluate the robustness of the features that we extracted from a randomly sampled subset of tweets equal to 70% of the total size of the TEC data set.³ The results of this experiment are reported in Table 2. We report the performance of (i) a FastText model trained on the training subset of 1,000 headlines, (ii) an EC classification pipeline based on Word2Vec and a Gaussian Naive Bayes classifier (GNB) trained on the same training subset of 1,000 headlines, (iii) #Emotional Tweets, described in [6], and (iv) UMM, reported in [1].

Method                      Mean Precision   Mean Recall   Mean F1 Score
Proposed Approach           0.377            0.790         0.479
FastText                    0.442            0.509         0.378
Word2Vec + GNB              0.309            0.423         0.346
#Emotional Tweets           0.444            0.353         0.393
UMM (ngrams + POS + CF)     -                -             0.410

Table 2: Comparison with #Emotional Tweets, UMM (best pipeline on this data set), FastText and Word2Vec+GNB on the 1,250 Headlines data set.

From the results reported in Table 2, we see that our approach again outperforms all the selected baselines in almost all of the evaluation measures. The approach presented in [6] is the only one with a slightly higher precision than our method (+0.002).

³ We restricted the training set for the multinomial logistic regressor because of the limitations of the glmnet library we used for its implementation.
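For completeness, the snippet below sketches the cross-validation protocol behind Table 1, again with scikit-learn and under the same assumptions as the sketch in Section 2; the macro-averaged scorers correspond to averaging precision, recall and F1 over the six emotion classes and the five folds. X_selected denotes the TF-IDF matrix restricted to the extracted features, and the settings are illustrative.

    from sklearn.model_selection import cross_validate
    from sklearn.svm import LinearSVC

    def evaluate_ec(X_selected, labels):
        # 5-fold cross-validation of the SVM on the selected features;
        # macro averaging mirrors the per-class means reported in Table 1.
        metrics = ("precision_macro", "recall_macro", "f1_macro")
        scores = cross_validate(LinearSVC(), X_selected, labels,
                                cv=5, scoring=metrics)
        return {m: scores["test_" + m].mean() for m in metrics}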
4 DISCUSSION AND FUTURE WORK
We presented and evaluated a supervised approach to perform feature selection for Emotion Classification (EC). Our pipeline relies on a multinomial logistic regression model to perform feature selection, and on a Support Vector Machine (SVM) to perform EC. We evaluated it on two publicly available and widely-used experimental collections, i.e. the Twitter Emotion Corpus (TEC) [6] and SemEval 2007 (1,250 Headlines) [12]. We also compared it to similar techniques such as the one described in #Emotional Tweets [6], FastText [4], SNBC [11], UMM [1] and a Word2Vec-based [5] classification pipeline. We first evaluated our pipeline for EC on documents from the same domain from which the features were extracted (i.e. the TEC data set). Then, we employed it to perform EC on the 1,250 Headlines data set using the features extracted from TEC. In both experiments, our approach outperformed the selected baselines in almost all the performance measures. More information to reproduce our experiments is provided in [8], and we also make our code publicly available.⁴ We highlight that our approach might be applied to other document classification tasks, such as topic labeling or sentiment analysis. Indeed, ours is a general approach adaptable to any task or applicative domain in the document classification field.

⁴ https://bitbucket.org/albpurpura/supervisedlexiconextractionforec/src/master/

REFERENCES
[1] A. Bandhakavi, N. Wiratunga, D. Padmanabhan, and S. Massie. 2017. Lexicon based feature extraction for emotion text classification. Pattern Recognition Letters 93 (2017), 133–142.
[2] M. M. Bradley and P. J. Lang. 1999. Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical Report. Citeseer.
[3] P. Ekman. 1993. Facial expression and emotion. American Psychologist 48, 4 (1993), 384.
[4] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. 2016. Bag of Tricks for Efficient Text Classification. (2016). arXiv:1607.01759 http://arxiv.org/abs/1607.01759
[5] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS 2013. 3111–3119.
[6] S. M. Mohammad. 2012. #Emotional tweets. In Proc. of the First Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, 246–255.
[7] B. Pang and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2, 1–2 (2008), 1–135.
[8] A. Purpura, C. Masiero, G. Silvello, and G. A. Susto. 2019. Supervised Lexicon Extraction for Emotion Classification. In Companion Proc. of WWW 2019. ACM, 1071–1078.
[9] A. Purpura, C. Masiero, and G. A. Susto. 2018. WS4ABSA: An NMF-Based Weakly-Supervised Approach for Aspect-Based Sentiment Analysis with Application to Online Reviews. In Discovery Science (Lecture Notes in Computer Science), Vol. 11198. Springer International Publishing, Cham, 386–401.
[10] F. H. Rachman, R. Sarno, and C. Fatichah. 2018. Music emotion classification based on lyrics-audio using corpus based emotion. International Journal of Electrical and Computer Engineering 8, 3 (2018), 1720.
[11] A. G. Shahraki and O. R. Zaiane. 2017. Lexical and learning-based emotion mining from text. In Proc. of CICLing 2017.
[12] C. Strapparava and R. Mihalcea. 2007. SemEval-2007 Task 14: Affective Text. ACL, 70–74.