A Study of Synthetic Oversampling for Twitter Imbalanced Sentiment Analysis Julien Ah-Pine and Edmundo Pavel Soriano Morales University of Lyon, ERIC Lab 5 avenue Pierre Mendès France 69676 Bron Cedex, France {julien.ah-pine,edmundo.soriano-morales}@univ-lyon2.fr Abstract. The majority of Twitter sentiment analysis systems implic- itly assume that the class distribution is balanced while in practice it is usually skewed. We argue that Twitter opinion mining using learning methods should be addressed in the framework of imbalanced learning. In this work, we present a study of synthetic oversampling techniques for tweet-polarity classification. The experiments we conducted on three publicly available datasets show that these methods can improve the recognition of the minority class as well as the geometric mean criterion. Key words: Synthetic sampling, Sentiment analysis, Social media. 1 Introduction Micro-blogging services are communication tools that are massively used by people to instantaneously share their opinions about any kinds of topics. These opinions are of interest for companies or individuals, like politicians, as they allow them to monitor their online reputation. Twitter has been the most popular micro-blogging service with more than 500 million tweets per day in 20131 . Thus, sentiment analysis of tweets2 has received a lot of attention both from academia and industry during the last years. In this paper, we focus on tweets polarity classification using supervised learning methods. This task is challenging in several respects. Firstly, tweets are limited to 140 characters and they contain irregular lexical units and syntactic patterns. Hence, these data are noisy, sparse and high-dimensional which makes the learning process difficult. Moreover, tweets expressing an opinion about a given topic usually present a skewed polarity distribution. In this case, any clas- sifier would be biased towards the majority class. In order to cope with these challenges, we propose to use synthetic oversam- pling techniques. These procedures are designed to deal with the class imbalance issue. We show that not only they enable reducing the bias towards the majority 1 http://www.internetlivestats.com/twitter-statistics/ 2 Short informal messages in a more general perspective. In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Riva del Garda, Italy, 2016. Copyright c by the paper’s authors. Copying only for private and academic purposes. 18 J. Ah-Pine and E. P. Soriano Morales class, but they also alleviate the data sparsity burden commonly encountered in text mining. The rest of the paper is organized as follows. In section 2, we discuss some related works in order to position and motivate our proposal. In section 3, we present our approach based on three synthetic oversampling methods and two supervised learning methods. Then, in section 4, we detail the experiments we conducted on three datasets including two di↵erent languages and we discuss the obtained results as well. We conclude the paper in section 5. 2 Related Works 2.1 Twitter Sentiment Analysis Twitter sentiment analysis has received a growing interest starting from 2009 [5, 19]. In this work, we focus on polarity detection which aims at predicting the opinion of a tweet as positive or negative. Supervised learning techniques are the mainstream approaches in this case. Due to the characteristics of Twitter data, systems usually used for sentiment analysis (see [14] for a survey of this field) do not perform well. In order to improve classifiers’ performance for tweets opinion mining, most of research works have proposed to extract features/lexicons which are specific to this type of data and/or leverage external resources [5, 19, 11, 22, 10, 20, 15]. In contrast, we apply a corpus-based approach with no particular feature engineering. 2.2 Imbalanced Sentiment Analysis The class imbalance problem in binary classification occurs when the sizes of the classes di↵er greatly. In this case, any classifier is biased toward the majority class (see [9] for a survey of the domain). For example, in the datasets we examined, near 70% of the tweets of the datasets we experimented with are negative. If a naı̈ve classifier always assigns the negative polarity to any tweet, it will give an overall accuracy of 70% but without recovering any positive tweet, which is not satisfying. Imbalanced learning for sentiment analysis has been studied by several re- searchers in di↵erent learning settings [12, 13, 17, 25]. However, we found very few papers that directly address imbalanced sentiment analysis for Twitter data [16, 6]. The methods that are proposed in the two latter works are similar to cost-sensitive approaches. In our case, we rather use sampling techniques. 3 The Proposed Approach 3.1 Vector Space Representation and Neighborhood Tweets contain slang words and irregular expressions. Thus, linguistic analy- ses by conventional NLP tools often give poor performances on such texts. To Synthetic Oversampling for Twitter Imbalanced Sentiment Analysis 19 circumvent these difficulties, and also to deal with di↵erent languages, we rely on a vectorial representation of tweets based on a bag-of-words approach. We denote by F the resulting feature space, x 2 F is a vector representing a tweet and its coordinates are its words’ frequency. In what follows, we use P and N to designate the subsets of tweets with the minority and the majority class labels respectively (|P| < |N|). In order to compare tweets, we use the cosine similarity function. Note that all pairwise proximity measures lie between 0 and 1 since the coordinates of vectors are non-negative. Let x be any tweet in P then its neighborhood is denoted NN(x) and it consists of the k nearest neighbors. 3.2 Synthetic Oversampling To face the skewed class distribution problem, one straightforward approach is to balance the training set so that |P| = |N|. Undersampling the majority class or oversampling the minority class are two possible strategies. Since the data are very sparse, undersampling the majority class is sub-optimal as we may lose meaningful examples in the learning process. Therefore, oversampling the minority class seems a better solution. In this case, synthetic oversampling creates new examples in P by taking convex combinations of existing points. We recall three popular synthetic oversampling methods: SMOTE [2], Borderline- SMOTE [7] and ADASYN [8]. Their general procedure can be cast as follows: 1. Select an original tweet x according to a probability distribution over P. 2. Determine NN(x). 3. Select a neighbor x0 according to a probability distribution over NN(x). 4. Create a synthetic example y as follows: y = x + ↵(x0 x) (1) where ↵ is a random value in [0, 1]. 5. Repeat 1-4 until the desired number of new examples is reached. 6. Append the set of synthetic points to P. Note that y lies in the line segment joining x and x0 . It is important to notice that y belongs to the subspace spanned by the union of the underlying subspaces of x and x0 . Therefore, synthetic examples are less sparse than original ones. The main di↵erences between the three oversampling methods concern the random selection of x 2 P in step 1. SMOTE assumes a uniform distribution over P whereas Borderline-SMOTE assumes a uniform distribution over B, a subset of P. B consists of tweets in P whose neighborhoods contain a majority of points in N. These items lie in subspaces where the decision boundary is prone to errors. Thereby, it is expected that oversampling in these parts of the space improves the classifier performances. Regarding ADASYN, it assumes a non uniform distribution over P. It can be seen as a smoothed version of Borderline- SMOTE: the noisier the neighborhood of x, the more synthetic points around x. In other words, the probability to select x in step 1 is proportional to the number of points of N contained in NN(x). 20 J. Ah-Pine and E. P. Soriano Morales 4 Experiments 4.1 Datasets and Data Representation We assess the approach introduced previously on three publicly available Twit- ter datasets. The first two are OMD “Obama-McCain Debate” [21] and HCR “Health Care Reform” [23]. The third one is IW “Imagiweb” and concerns tweets in French, posted during the 2012 french presidential election [24]. We chose po- litical tweets because they present a particularly skewed class label distribution. Concerning the vectorial representation of tweets, we used unigrams of words and we only removed the hapax. We give the descriptive statistics3 of these datasets below: – OMD: 1906 tweets (710 positive, 1196 negative) and 1569 features; – HCR: 1922 tweets (541 positive, 1381 negative) and 2066 features; – IW: 4519 tweets (1092 positive, 3427 negative) and 3918 features. 4.2 Supervised Learning Methods We experimented with two di↵erent learning models: decision trees and the l1 penalized logistic regression. Decision trees are well-known symbolic learning techniques and o↵er the advantages of coping with high-dimensional data as well as providing human- readable outputs. In this work, we used CART [1], which builds a binary clas- sification tree based on the Gini index splitting criterion. The R package rpart was used and the default parameters values specified in rpart.control were applied. The l1 penalized logistic regression [18] is also an appropriate supervised learning for high-dimensional data since it implicitly performs feature selection. Moreover, this method has proven to provide competitive results in text clas- sification [4]. We used the glmnet R package [3] and in particular the function cv.glmnet which allows us to select the mixing parameter based on the error observed during training phase. 4.3 Assessment Measures We use several performance criteria: overall accuracy (OA), F1-measures of the positive and negative classes (F-P and F-N respectively). OA evaluates the overall performance of a classifier but it does not properly account for the performances on P as compared to N because of the skewed distribution of class labels. Hence, we also use a popular criterion for imbalanced learning: the geometric mean (GM) of both class accuracy rates. Unlike OA, GM is independent of the class distribution (see [9, Chapter 8] for an overview of this topic). Thus we argue that GM should also be a default evaluation criterion in Twitter sentiment analysis tasks. 3 We removed tweets that were labeled as neutral since we are only concerned with polarity detection. Synthetic Oversampling for Twitter Imbalanced Sentiment Analysis 21 4.4 Experiments Setting and Results It is important to note that we are not interested in comparing the results of decision trees against l1 penalized logistic regression. Our purpose is rather to illustrate that synthetic oversampling can improve the performances of learning methods on Twitter imbalanced-polarity detection tasks. We tested the two learning models on the three collections with di↵erent relatively balanced training sets. In what follows, ⌧ is a variable taking its values in {0, 1/4, 1/2, 3/4, 1} which measures how much the training set is balanced with respect to the initial distribution. In fact, ⌧ = 0 is when no oversampling was carried out and we used the initial imbalanced training set (this is our baseline); ⌧ = 1/4 means we generated b(|N| |P|)/4c positive synthetic examples; . . . ; and ⌧ = 1 means we exactly sampled |N| |P| new positive items in order to have a perfectly balanced training set. The neighborhood was set to k = 20 nearest neighbors4 . The results we obtained using a 5 fold cross-validation are plotted in Figure 1 for decision trees and in Figure 2 for l1 penalized logistic regression. 0.61 0.795 0.67 0.72 0.58 0.64 0.775 0.70 0.55 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0.84 0.75 0.48 0.60 0.81 0.40 0.52 0.72 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0.78 0.24 0.38 0.86 0.76 0.18 0.32 0.74 0.84 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 Fig. 1: Results for decision trees (CART). Solid line with circles refers to SMOTE, dashed line with triangles refers to Borderline-SMOTE and dotted line with plus signs refers to ADASYN. From left to right: plots of OA, F-P, F-N and GM measures. From top to bottom: plots for OMD, HCR and IW benchmarks. The x-axis refers to ⌧ going from initial imbalanced (⌧ = 0) to fully balanced (⌧ = 1) training sets. Our main findings are the following: – For both decision tree and l1 penalized logistic regression, we note quite the same trends: oversampling generally improves the results. Indeed, All 4 We also tested with k = 10, 30 but the trends were similar and the results compara- ble. 22 J. Ah-Pine and E. P. Soriano Morales 0.78 0.76 0.70 0.82 0.75 0.67 0.73 0.79 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0.55 0.84 0.76 0.65 0.78 0.45 0.55 0.70 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0.86 0.76 0.35 0.50 0.82 0.72 0.35 0.20 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 Fig. 2: Results for l1 penalized logistic regression. Same legend as in Figure 1. three sampling methods globally improve the GM measure5 . Thereby, our ap- proach allows alleviating the class imbalance problem e↵ectively. For OMD, when ⌧ = 1, the most important gains for GM measures are given by ADASYN (1st row, 4th column in the figures). Regarding HCR and IW, Borderline-SMOTE performs the best but SMOTE often provides compara- ble results (2nd and 3rd rows respectively and 4th column in the figures). – All three oversampling strategies generally boosts F-P values6 . The minority class is thus better recognized. However, this is at the expense of a reduction of F-N values. Nonetheless, since the increasing rate of F-P is generally much larger than the decreasing rate of F-N, we note the overall increase of GM values as highlighted previously. – For all three sampling techniques, the OA measure tends to diminish as the training set is more and more balanced. In fact, since the class distribution in the test set is skewed towards N, the errors on true negative tweets have more impact on OA than the correct detection of true positive tweets. This illustrates again the fact that OA is not a criterion that properly accounts for imbalanced data. – We cannot conclude on which of the three oversampling strategies is the best. However, we can make the following remarks: • SMOTE and Borderline-SMOTE have quite the same behaviours for the HCR and IW collections. F-P measures are greater than ADASYN whereas F-N values are lower. Both methods allows a much better recog- nition of the minority class but in doing so they make more mistakes when detecting the majority class. • In contrast, ADASYN presents peculiar properties. The increase of GM values are lower than for the two other methods but this oversampling 5 The only exception is observed for OMD when using a fully balanced training sets (⌧ = 1) generated by Borderline-SMOTE with CART as shown in Figure 1. 6 Except the same particular case mentioned previously. Synthetic Oversampling for Twitter Imbalanced Sentiment Analysis 23 technique shows more stable OA values and even better ones in some cases. For the OMD dataset specifically, this approach not only provides among the best performances for the GM criterion but it also allows improving the OA measures unlike the other methods. 5 Conclusion Twitter sentiment analysis is confronted with the class imbalance problem and it is important to take this aspect into account when designing opinion mining systems based on machine learning. A way to address this challenge is to use synthetic oversampling which aims at balancing the training set in a meaningful way. Three state-of-the-art methods have been examined in that regard. We conducted experiments on political- tweets polarity classification using three datasets and in two di↵erent languages. The obtained results show that our proposal makes it possible to deal with the skewed class distribution issue by providing better recognition of the minority class as well as obtaining large increases of the overall geometric mean criterion. In future work, we intend to extend our study to multiclass sentiment analysis and also to examine the use of synthetic oversampling methods in other NLP tasks as a general approach to cope with the sparsity problem. Acknowledgment This work was partly supported by the french national projects Imagiweb ANR-2012-CORD-002-01 and Request PIA/FSN. References 1. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and regression trees. CRC press (1984) 2. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: Synthetic minority over-sampling technique. J. Artif. Int. Res. 16(1) (2002) 3. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1) (2010) 4. Genkin, A., Lewis, D.D., Madigan, D.: Large-scale bayesian logistic regression for text categorization. Technometrics 49 (2007) 5. Go, A., Bhayani, R., Huang, L.: Twitter Sentiment Classification using Distant Supervision. Tech. rep., Stanford University (2009), https://sites.google.com/ site/twittersentimenthelp/home 6. Hamdan, H., Bellot, P., Bechet, F.: Lsislif: Feature extraction and label weighting for sentiment analysis in twitter. In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015) (2015) 7. Han, H., Wang, W.Y., Mao, B.H.: Borderline-smote: A new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing, Lecture Notes in Computer Science, vol. 3644 (2005) 8. He, H., Bai, Y., Garcia, E., Li, S.: Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In: Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence) (2008) 24 J. Ah-Pine and E. P. Soriano Morales 9. He, H., Ma, Y.: Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley-IEEE Press (2013) 10. Kiritchenko, S., Zhu, X., Mohammad, S.M.: Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research (JAIR) 50 (2014) 11. Kouloumpis, E., Wilson, T., Moore, J.D.: Twitter sentiment analysis: The good the bad and the omg! In: Proceedings of the Fifth International Conference on Weblogs and Social Media, Barcelona, Catalonia, Spain, July 17-21, 2011 (2011) 12. Li, S., Wang, Z., Zhou, G., Lee, S.Y.M.: Semi-supervised learning for imbalanced sentiment classification. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Volume Three. IJCAI’11 (2011) 13. Li, S., Zhou, G., Wang, Z., Lee, S.Y.M., Wang, R.: Imbalanced sentiment classifi- cation. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. CIKM ’11 (2011) 14. Liu, B., Zhang, L.: A survey of opinion mining and sentiment analysis. In: Aggar- wal, C.C., Zhai, C. (eds.) Mining Text Data. Springer US (2012) 15. Martı́nez-Cámara, E., Martı́n-Valdivia, M.T., Urena-López, L.A., Montejo-Ráez, A.R.: Sentiment analysis in twitter. Natural Language Engineering 20(01) (2014) 16. Miura, Y., Sakaki, S., Hattori, K., Ohkuma, T.: Teamx: A sentiment analyzer with enhanced lexicon mapping and weighting scheme for unbalanced data. In: Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014) (2014) 17. Mountassir, A., Benbrahim, H., Berrada, I.: An empirical study to address the problem of unbalanced data sets in sentiment classification. In: Systems, Man, and Cybernetics (SMC), IEEE International Conference on (2012) 18. Ng, A.Y.: Feature selection, l1 vs. l2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning. ICML ’04 (2004) 19. Pak, A., Paroubek, P.: Twitter as a corpus for sentiment analysis and opinion mining. In: LREC. vol. 10 (2010) 20. Saif, H., He, Y., Fernandez, M., Alani, H.: Semantic patterns for sentiment analysis of twitter. In: Proceedings of the 13th International Semantic Web Conference - Part II. ISWC ’14 (2014) 21. Shamma, D.A., Kennedy, L., Churchill, E.F.: Tweet the debates: Understanding community annotation of uncollected sources. In: Proceedings of the First SIGMM Workshop on Social Media. WSM ’09 (2009) 22. Sidorov, G., Miranda-Jiménez, S., Viveros-Jiménez, F., Gelbukh, A., Castro- Sánchez, N., Velásquez, F., Dı́az-Rangel, I., Suárez-Guerra, S., Treviño, A., Gor- don, J.: Empirical study of machine learning based approach for opinion mining in tweets. In: Advances in Artificial Intelligence (2012) 23. Speriosu, M., Sudan, N., Upadhyay, S., Baldridge, J.: Twitter polarity classification with label propagation over lexical links and the follower graph. In: Proceedings of the First Workshop on Unsupervised Learning in NLP. EMNLP’11 (2011) 24. Velcin, J., Kim, Y., Brun, C., Dormagen, J., SanJuan, E., Khouas, L., Peradotto, A., Bonnevay, S., Roux, C., Boyadjian, J., Molina, A., Neihouser, M.: Investigating the image of entities in social media: Dataset design and first results. In: Proceed- ings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014) (2014) 25. Xu, R., Chen, T., Xia, Y., Lu, Q., Liu, B., Wang, X.: Word embedding composition for data imbalances in sentiment and emotion classification. Cognitive Computa- tion 7(2) (2015)