Feature Selection for Emotion Classification∗

Alberto Purpura (purpuraa@dei.unipd.it), University of Padua, Padua, Italy
Chiara Masiero (chiara.masiero@statwolf.com), Statwolf Data Science, Padua, Italy
Gianmaria Silvello (silvello@dei.unipd.it), University of Padua, Padua, Italy
Gian Antonio Susto (sustogia@dei.unipd.it), University of Padua, Padua, Italy

∗ Extended abstract of the original paper published in [8]. This work was supported by the CDC-STARS project and co-funded by UNIPD. Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IIR 2019, September 16–18, 2019, Padova, Italy.

ABSTRACT
In this paper, we describe a novel supervised approach to extract a set of features for document representation in the context of Emotion Classification (EC). Our approach employs the coefficients of a logistic regression model to extract the most discriminative word unigrams and bigrams to perform EC. In particular, we employ this set of features to represent the documents, while we perform the classification using a Support Vector Machine. The proposed method is evaluated on two publicly available and widely-used collections. We also evaluate the robustness of the extracted set of features on different domains, using the first collection to perform feature extraction and the second one to perform EC. We compare the obtained results to similar supervised approaches for document classification (i.e. FastText), EC (i.e. #Emotional Tweets, SNBC and UMM) and to a Word2Vec-based pipeline.

CCS CONCEPTS
• Information systems → Content analysis and feature selection; Sentiment analysis; • Computing methodologies → Supervised learning by classification.

KEYWORDS
Supervised Learning, Feature Selection, Emotion Classification, Document Classification

1 INTRODUCTION
The goal of Emotion Classification (EC) is to detect and categorize the emotion(s) expressed by a human. We can find numerous examples in the literature presenting ways to perform EC on different types of data sources such as audio [10] or microblogs [8]. Emotions have a large influence on our decision making. For this reason, being able to identify them can be useful not only to improve the interaction between humans and machines (e.g. with chatbots or robots), but also to extract useful insights for marketing goals [7]. Indeed, EC is employed in a wide variety of contexts which include – but are not limited to – social media [8] and online stores – where it is closely related to Sentiment Analysis [9] – with the goal of interpreting emerging trends or better understanding the opinions of customers.

In this work, we focus on EC approaches which can be applied to textual data. The task is most frequently tackled as a multi-class classification problem: given a document d and a set of candidate emotion labels, the goal is to assign one label to d – sometimes more than one label can be assigned, changing the task to multi-label classification. The most widely used set of emotions in computer science is the set of the six Ekman emotions [3] (i.e. anger, fear, disgust, joy, sadness, surprise). Traditionally, EC has been performed using dictionary-based approaches, i.e. lists of terms which are known to be related to certain emotions, as in ANEW [2]. However, there are two main issues which limit their application on a large scale: (i) they cannot adapt to the context or domain where a word is used; (ii) they cannot infer an emotion label for portions of text which do not contain any of the terms available in the dictionary. A possible alternative to dictionary-based approaches are machine learning and deep learning models based on an embedded representation of words, such as Word2Vec [5] or FastText [4]. These approaches, however, need large amounts of data to train an accurate model and cannot easily adapt to low-resource domains. For this reason, we present a novel approach for feature selection and a pipeline for emotion classification which outperform state-of-the-art approaches without requiring large amounts of data. Additionally, we show that the proposed approach generalizes well to different domains. We evaluate our approach on two popular and publicly available data sets – i.e. the Twitter Emotion Corpus (TEC) [6] and the SemEval 2007 Affective Text Corpus (1,250 Headlines) [12] – and compare it to state-of-the-art approaches for document representation – such as Word2Vec and FastText – and classification – i.e. #Emotional Tweets [6], SNBC [11] and UMM [1].
2 PROPOSED APPROACH
The proposed approach exploits the coefficients of a multinomial logistic regression model to extract an emotion lexicon from a collection of short textual documents. First, we extract all word unigrams and bigrams in the target collection after performing stopword removal.¹ Second, we represent the documents using the vector space model with TF-IDF weights. Then, we train a logistic regression model with elastic-net regularization to perform EC. This model is characterized by the following loss function:

\ell(\{\beta_{0k}, \beta_k\}_1^K) = -\frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{k=1}^{K} y_{ik} \, (\beta_{0k} + x_i^T \beta_k) - \log\left( \sum_{k=1}^{K} e^{\beta_{0k} + x_i^T \beta_k} \right) \right] + \lambda \left[ (1-\alpha) \, \|\beta\|_F^2 / 2 + \alpha \sum_{j=1}^{p} \|\beta_j\|_1 \right]    (1)

where β is a (p+1)×K matrix of coefficients and βk refers to the k-th column (for outcome category k). The last penalty term, the sum of the ||βj||1 norms, is a lasso penalty on the coefficients, which induces a sparse solution. To solve this optimization problem we use the partial Newton algorithm, making a partial quadratic approximation of the log-likelihood and allowing only (β0k, βk) to vary for a single class at a time. For each value of λ, we first cycle over all classes indexed by k, computing each time a partial quadratic approximation about the parameters of the current class.² Finally, we examine the β-coefficients for each class of the trained model and keep the features (i.e. word unigrams and bigrams) associated with non-zero weights in any of the classes.

To evaluate the quality of the extracted features, we perform EC using a Support Vector Machine (SVM). We consider a vector representation of documents based on the set of features extracted as described above, weighting them according to their TF-IDF score.

¹ We employ a list of 170 English terms, see nltk v.3.2.5 (https://www.nltk.org).
² A Python implementation which optimizes the parameters of the model is available at https://github.com/bbalasub1/glmnet_python/blob/master/docs/glmnet_vignette.ipynb.
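To make the pipeline concrete, the sketch below reproduces the steps above in Python with scikit-learn. This is a minimal illustration under our own assumptions rather than the authors' implementation: the paper relies on the glmnet_python library (see footnote 2), so here LogisticRegression with the saga solver stands in for the elastic-net multinomial regressor solved with the partial Newton algorithm, and the helper names and all hyperparameter values (l1_ratio, max_iter, the SVM defaults) are illustrative placeholders.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    def extract_features(docs, labels, stopwords):
        # Steps 1-2: word unigrams and bigrams after stopword removal
        # (the paper uses nltk's English stopword list), weighted with
        # TF-IDF (vector space model).
        vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words=stopwords)
        X = vectorizer.fit_transform(docs)
        # Step 3: multinomial logistic regression with elastic-net
        # regularization (Eq. 1); the lasso component zeroes out most weights.
        logreg = LogisticRegression(penalty="elasticnet", solver="saga",
                                    l1_ratio=0.5, max_iter=5000)
        logreg.fit(X, labels)
        # Step 4: keep every n-gram with a non-zero coefficient in any class.
        mask = np.abs(logreg.coef_).max(axis=0) > 0
        return vectorizer, mask

    def train_ec_classifier(docs, labels, vectorizer, mask):
        # EC step: an SVM over the TF-IDF vectors restricted to the
        # selected unigrams and bigrams.
        X = vectorizer.transform(docs)[:, mask]
        return LinearSVC().fit(X, labels)

An unseen document is then classified by transforming it with the same vectorizer, applying the same feature mask, and calling predict on the trained SVM.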
3 RESULTS
For the evaluation of the proposed approach we consider the TEC and 1,250 Headlines collections. TEC is composed of 21,051 tweets which were labeled automatically – according to the set of six Ekman emotions – using the hashtags they contained, which were removed afterwards. We split the collection into a training and a test set of equal size to train the logistic regression model for feature selection. Then, we perform a 5-fold cross-validation to train an SVM for EC using the previously extracted features and report in Table 1 the average of the results over all six classes, obtained over the five folds. We also report in Table 1 the performance of FastText – which we computed in the same way – and that of SNBC as described in [11].

Method               Mean Precision   Mean Recall   Mean F1 Score
Proposed Approach    0.509            0.477         0.490
#Emotional Tweets    0.474            0.360         0.406
FastText             0.504            0.453         0.461
SNBC                 0.488            0.499         0.476

Table 1: Comparison with #Emotional Tweets, FastText and SNBC on the TEC data set.

From the results in Table 1, we observe that the proposed classification pipeline outperforms almost all of the selected baselines on the TEC data set. The only exception is SNBC, compared to which we achieve a slightly lower Recall (-0.022).

The 1,250 Headlines data set is a collection of 1,250 newspaper headlines divided into a training (1,000 headlines) and a test (250 headlines) set. We employ this data set to evaluate the robustness of the features that we extracted from a randomly sampled subset of tweets equal to 70% of the total size of the TEC data set.³ The results of this experiment are reported in Table 2. We report the performance of (i) a FastText model trained on the training subset of 1,000 headlines, (ii) an EC classification pipeline based on Word2Vec and a Gaussian Naive Bayes classifier (GNB) trained on the same training subset of 1,000 headlines, (iii) #Emotional Tweets, described in [6], and (iv) UMM, reported in [1].

Method                      Mean Precision   Mean Recall   Mean F1 Score
Proposed Approach           0.377            0.790         0.479
FastText                    0.442            0.509         0.378
Word2Vec + GNB              0.309            0.423         0.346
#Emotional Tweets           0.444            0.353         0.393
UMM (ngrams + POS + CF)     -                -             0.410

Table 2: Comparison with #Emotional Tweets, UMM (best pipeline on this data set), FastText and Word2Vec+GNB on the 1,250 Headlines data set.

From the results reported in Table 2, we see that our approach again outperforms all the selected baselines in almost all of the evaluation measures. The approach presented in [6] is the only one with a slightly higher precision than our method (+0.002).

³ We restricted the training set for the multinomial logistic regressor because of the limitations of the glmnet library we used for its implementation.
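For completeness, the snippet below sketches the cross-validation protocol behind Table 1, again with scikit-learn and under the same assumptions as the sketch in Section 2; the macro-averaged scorers correspond to averaging precision, recall and F1 over the six emotion classes and the five folds. X_selected denotes the TF-IDF matrix restricted to the extracted features, and the settings are illustrative.

    from sklearn.model_selection import cross_validate
    from sklearn.svm import LinearSVC

    def evaluate_ec(X_selected, labels):
        # 5-fold cross-validation of the SVM on the selected features;
        # macro averaging mirrors the per-class means reported in Table 1.
        metrics = ("precision_macro", "recall_macro", "f1_macro")
        scores = cross_validate(LinearSVC(), X_selected, labels,
                                cv=5, scoring=metrics)
        return {m: scores["test_" + m].mean() for m in metrics}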
4 DISCUSSION AND FUTURE WORK
We presented and evaluated a supervised approach to perform feature selection for Emotion Classification (EC). Our pipeline relies on a multinomial logistic regression model to perform feature selection, and on a Support Vector Machine (SVM) to perform EC. We evaluated it on two publicly available and widely-used experimental collections, i.e. the Twitter Emotion Corpus (TEC) [6] and SemEval 2007 (1,250 Headlines) [12]. We also compared it to similar techniques such as the one described in #Emotional Tweets [6], FastText [4], SNBC [11], UMM [1] and a Word2Vec-based [5] classification pipeline. We first evaluated our pipeline for EC on documents from the same domain from which the features were extracted (i.e. the TEC data set). Then, we employed it to perform EC on the 1,250 Headlines data set using the features extracted from TEC. In both experiments, our approach outperformed the selected baselines in almost all the performance measures. More information to reproduce our experiments is provided in [8], and we also make our code publicly available.⁴ We highlight that our approach might be applied to other document classification tasks, such as topic labeling or sentiment analysis. Indeed, ours is a general approach adaptable to any task or applicative domain in the document classification field.

⁴ https://bitbucket.org/albpurpura/supervisedlexiconextractionforec/src/master/

REFERENCES
[1] A. Bandhakavi, N. Wiratunga, D. Padmanabhan, and S. Massie. 2017. Lexicon based feature extraction for emotion text classification. Pattern Recognition Letters 93 (2017), 133–142.
[2] M. M. Bradley and P. J. Lang. 1999. Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical Report. Citeseer.
[3] P. Ekman. 1993. Facial expression and emotion. American Psychologist 48, 4 (1993), 384.
[4] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. 2016. Bag of Tricks for Efficient Text Classification. (2016). arXiv:1607.01759 http://arxiv.org/abs/1607.01759
[5] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS 2013. 3111–3119.
[6] S. M. Mohammad. 2012. #Emotional tweets. In Proc. of the First Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, 246–255.
[7] B. Pang and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2, 1–2 (2008), 1–135.
[8] A. Purpura, C. Masiero, G. Silvello, and G. A. Susto. 2019. Supervised Lexicon Extraction for Emotion Classification. In Companion Proc. of WWW 2019. ACM, 1071–1078.
[9] A. Purpura, C. Masiero, and G. A. Susto. 2018. WS4ABSA: An NMF-Based Weakly-Supervised Approach for Aspect-Based Sentiment Analysis with Application to Online Reviews. In Discovery Science (Lecture Notes in Computer Science), Vol. 11198. Springer International Publishing, Cham, 386–401.
[10] F. H. Rachman, R. Sarno, and C. Fatichah. 2018. Music emotion classification based on lyrics-audio using corpus based emotion. International Journal of Electrical and Computer Engineering 8, 3 (2018), 1720.
[11] A. G. Shahraki and O. R. Zaiane. 2017. Lexical and learning-based emotion mining from text. In Proc. of CICLing 2017.
[12] C. Strapparava and R. Mihalcea. 2007. SemEval-2007 Task 14: Affective Text. ACL, 70–74.