Alleviating Data Sparsity for Twitter Sentiment Analysis

Hassan Saif, Yulan He, Harith Alani
Knowledge Media Institute, The Open University, Milton Keynes, MK7 6AA, UK
h.saif@open.ac.uk, y.he@open.ac.uk, h.alani@open.ac.uk

ABSTRACT
Twitter has attracted much attention recently as a hot research topic in the domain of sentiment analysis. Training sentiment classifiers from tweets data often faces the data sparsity problem, partly due to the large variety of short and irregular forms introduced to tweets because of the 140-character limit. In this work we propose using two different sets of features to alleviate the data sparseness problem. One is the semantic feature set, where we extract semantically hidden concepts from tweets and then incorporate them into classifier training through interpolation. The other is the sentiment-topic feature set, where we extract latent topics and the associated topic sentiment from tweets and then augment the original feature space with these sentiment-topics. Experimental results on the Stanford Twitter Sentiment Dataset show that both feature sets outperform the baseline model using unigrams only. Moreover, using semantic features rivals the previously reported best result, while using sentiment-topic features achieves 86.3% sentiment classification accuracy, which outperforms existing approaches.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing—Text Analysis

General Terms
Algorithms, Experimentation

Keywords
Microblogs, Sentiment Analysis, Opinion Mining, Twitter, Semantic Smoothing, Data Sparsity

Copyright © 2012 held by author(s)/owner(s). Published as part of the #MSM2012 Workshop proceedings, available online as CEUR Vol-838 at http://ceur-ws.org/Vol-838. #MSM2012, April 16, 2012, Lyon, France.

1. INTRODUCTION
A few years after the explosion of Web 2.0, microblogs and social networks are now considered among the most popular forms of communication. Through platforms like Twitter and Facebook, vast amounts of information reflecting people's opinions and attitudes are published and shared among users every day. Monitoring and analysing opinions from social media provides enormous opportunities for both the public and private sectors. For the private sector, it has been observed [21, 22] that the reputation of a certain product or company is highly affected by rumours and negative opinions published and shared among users on social networks. Understanding this, companies realize that monitoring and detecting public opinions from microblogs leads to better relationships with their customers, a better understanding of their customers' needs, and a better response to changes in the market. For the public sector, recent studies [3, 9] show that there is a strong correlation between activities on social networks and the outcomes of certain political issues. For example, Twitter and Facebook were used to organise demonstrations and build solidarity during the Arab Spring civil uprisings in Egypt, Tunisia and, currently, Syria. One week before the Egyptian president's resignation, the total rate of tweets about political change in Egypt increased ten-fold, and in Syria the amount of online content produced by opposition groups on Facebook increased dramatically.

Twitter, now considered one of the most popular microblogging services, has attracted much attention recently as a hot research topic in sentiment analysis. Previous work on Twitter sentiment analysis [5, 13, 2] relies on noisy labels or distant supervision, for example by taking emoticons as the indication of tweet sentiment, to train supervised classifiers. Other work explores feature engineering in combination with machine learning methods to improve sentiment classification accuracy on tweets [1, 10]. None of this work explicitly addresses the data sparsity problem, which is one of the major challenges when dealing with tweets data.
Figure 1: Word frequency statistics.

Figure 1 compares the word frequency statistics of the tweets data we used in our experiments with those of the movie review data (http://www.cs.cornell.edu/People/pabo/movie-review-data/). The x-axis shows the word frequency interval, e.g., words occurring up to 10 times (1-10), more than 10 times but up to 20 times (10-20), etc. The y-axis shows the percentage of words falling within each frequency interval. It can be observed that the tweets data are sparser than the movie review data since the former contain more infrequent words, with 93% of the words in the tweets data occurring less than 10 times (cf. 78% in the movie review data).

One possible way to alleviate data sparseness is through word clustering, such that words contributing similarly to sentiment classification are grouped together. In this paper, we propose two approaches to realise word clustering: one is through semantic smoothing [17], the other is through automatic sentiment-topic extraction. Semantic smoothing extracts semantically hidden concepts from tweets and then incorporates them into supervised classifier training by interpolation. A motivating example for semantic smoothing is shown in Figure 2, where the left box lists entities appearing in the training set together with their occurrence probabilities in positive and negative tweets. For example, the entities "iPad", "iPod" and "Mac Book Pro" appear more often in tweets of positive polarity, and they are all mapped to the semantic concept "Product/Apple". As a result, the tweet from the test set "Finally, I got my iPhone. What a product!" is more likely to have a positive polarity because it contains the entity "iPhone", which is also mapped to the concept "Product/Apple".

Figure 2: Incorporating semantic concepts for sentiment classification.
We propose a semantic interpolation method to incorporate semantic concepts into sentiment classifier training, where we interpolate the original unigram language model in the Naïve Bayes (NB) classifier with the generative model of words given semantic concepts. We show on the Stanford Twitter Sentiment Data [5] that simply replacing words with their corresponding semantic concepts reduces the vocabulary size by nearly 20%; however, the sentiment classification accuracy drops by 4% compared to the baseline NB model trained on unigrams alone. With the interpolation method, the sentiment classification accuracy improves upon the baseline model by nearly 4%.

Our second approach to automatic word clustering is sentiment-topic extraction using the previously proposed joint sentiment-topic (JST) model [11]. The JST model extracts latent topics and the associated topic sentiment from the tweets data, which are subsequently added into the original feature space for supervised classifier training. Our experimental results show that NB learned from these features outperforms the baseline model trained on unigrams only and achieves the state-of-the-art result on the original test set of the Stanford Twitter Sentiment Data.

The rest of the paper is organised as follows. Section 2 outlines existing work on sentiment analysis with a focus on Twitter sentiment analysis. Section 3 describes the data used in our experiments. Section 4 presents our proposed semantic smoothing method. Section 5 describes how we incorporate sentiment-topics extracted from the JST model into sentiment classifier training. Experimental results are discussed in Section 6. Finally, we conclude our work and outline future directions in Section 7.

2. RELATED WORK
Much work has been done in the field of sentiment analysis, most of it following two basic approaches. The first approach assumes that the semantic orientation of a document is an averaged sum of the semantic orientations of its words and phrases. The pioneering work is the point-wise mutual information approach proposed by Turney [20]; work such as [6, 8, 19, 16] provides further examples of this lexicon-based approach. The second approach [15, 14, 4, 23, 12] addresses the problem as a text classification task, where classifiers are built using one of the machine learning methods and trained on a dataset using features such as unigrams, bigrams, part-of-speech (POS) tags, etc. The vast majority of work in sentiment analysis focuses on the domains of movie reviews, product reviews and blogs.

Twitter sentiment analysis is considered a much harder problem than sentiment analysis on conventional text such as review documents, mainly due to the short length of tweet messages, the frequent use of informal and irregular words, and the rapid evolution of language in Twitter. Annotated tweets data are impractical to obtain, so a large amount of work has been conducted on Twitter sentiment analysis using noisy labels (also called distant supervision). For example, Go et al. [5] used emoticons such as ":-)" and ":(" to label tweets as positive or negative and trained standard classifiers such as Naïve Bayes (NB), Maximum Entropy (MaxEnt) and Support Vector Machines (SVMs) to detect the sentiments of tweets. The best result of 83% was reported by MaxEnt using a combination of unigrams and bigrams. Barbosa and Feng [2] collected their training data from three different Twitter sentiment detection websites which mainly use pre-built sentiment lexicons to label each tweet as positive or negative. Using SVMs trained from these noisily labelled data, they obtained 81.3% sentiment classification accuracy.

While the aforementioned approaches did not detect neutral sentiment, Pak and Paroubek [13] additionally collected neutral tweets from Twitter accounts of various newspapers and magazines and trained a three-class NB classifier which is able to detect neutral tweets in addition to positive and negative ones. Their NB was trained with a combination of n-grams and POS features.
Speriosu et al. [18] argued that using noisy sentiment labels may hinder the performance of sentiment classifiers. They proposed exploiting the Twitter follower graph to improve sentiment classification and constructed a graph that has users, tweets, word unigrams, word bigrams, hashtags and emoticons as its nodes, connected based on the links existing among them (e.g., users are connected to tweets they created; tweets are connected to the word unigrams that they contain, etc.). They then applied a label propagation method where sentiment labels were propagated from a small set of nodes seeded with some initial label information throughout the graph. They claimed that their label propagation method outperforms MaxEnt trained from noisy labels and obtained an accuracy of 84.7% on a subset of the Twitter sentiment test set from [5].

There has also been some work exploring feature engineering to improve the performance of sentiment classification on tweets. Agarwal et al. [1] studied a feature-based model and a tree-kernel-based model for sentiment classification. They explored a total of 50 different feature types and showed that both models perform similarly and that both outperform the unigram baseline.

Kouloumpis et al. [10] compared various features including n-gram features, lexicon features based on the presence of polarity words from the MPQA subjectivity lexicon (http://www.cs.pitt.edu/mpqa/), POS features, and microblogging features capturing the presence of emoticons, abbreviations and intensifiers (e.g., all-caps and character repetitions). They found that microblogging features are the most useful in sentiment classification.

3. TWITTER SENTIMENT CORPUS
In the work conducted in this paper, we used the Stanford Twitter Sentiment Data (http://twittersentiment.appspot.com/), which was collected between the 6th of April and the 25th of June 2009 [5]. The training set consists of 1.6 million tweets with the same number of positive and negative tweets, labelled using emoticons.
For example, a tweet is labelled as positive if it contains :), :-), : ), :D, or =) and is labelled as negative if it has :(, :-(, or : (, etc. The original test set consists of 177 negative and 182 positive manually annotated tweets. In contrast to the training set, which was collected based on specific emoticons, the test set was collected by searching the Twitter API with specific queries including product names, companies and people.

We built our training set by randomly selecting 60,000 balanced tweets from the original training set in the Stanford Twitter Sentiment Data. Since the original test set only contains a total of 359 tweets, which is relatively small, we enlarged this set by manually annotating more tweets. To simplify and speed up the annotation effort, we built Tweenator (http://atkmi.com/tweenator/), a web-based sentiment annotation tool that allows users to easily assign a sentiment label to tweet messages, i.e. assign a negative, positive or neutral label to a certain tweet with regard to its contextual polarity. Using Tweenator, 12 different users annotated an additional 641 tweets from the original remaining training data. Our final test set contains 1,000 tweet messages, with 527 negative and 473 positive.

It is worth mentioning that users who participated in the annotation process reported that, using the annotation interface of Tweenator shown in Figure 3-a, they were able to annotate 10 tweet messages in approximately 2 to 3 minutes.

Recently, we have added two new modules to Tweenator implementing the work that will be described in Section 4. The first module (see Figure 3-b) provides free-form sentiment detection, which allows users to detect the polarity of their own textual entries. The second module is the opinionated tweet message retrieval tool (see Figure 3-c), which allows users to retrieve negative/positive tweets towards a specific search term. For example, a user can retrieve opinionated tweet messages about the search term "Nike".

Figure 3: Tweenator: web-based sentiment annotation tool for Twitter. (a) Sentiment Annotation Interface. (b) Free-Form Sentiment Detector Interface. (c) Opinionated Tweet Message Retrieval Interface.

4. SEMANTIC FEATURES
Twitter is an open social environment where there are no restrictions on what users can tweet about. Therefore, a huge number of infrequent named entities, such as people, organizations, products, etc., can be found in tweet messages. These infrequent entities make the data very sparse and hence hinder sentiment classification performance. Nevertheless, many of these named entities are semantically related. For example, the entities "iPad" and "iPhone" can be mapped to the same semantic concept "Product/Apple". Inspired by this observation, we propose using semantic features to alleviate the sparsity problem in tweets data. We first extract named entities from tweets and map them to their corresponding semantic concepts. We then incorporate these semantic concepts into NB classifier training.

4.1 Semantic Concept Extraction
We investigated three third-party services for extracting entities from tweets data: Zemanta (http://www.zemanta.com/), OpenCalais (http://www.opencalais.com/) and AlchemyAPI (http://www.alchemyapi.com/). A quick manual comparison of 100 randomly selected tweet messages with the extracted entities and their corresponding semantic concepts showed that AlchemyAPI performs better than the others in terms of the quality and the quantity of the extracted entities. Hence, we used AlchemyAPI for the extraction of semantic concepts in this paper.

Using AlchemyAPI, we extracted a total of 15,139 entities from the training set, which are mapped to 30 distinct concepts, and 329 entities from the test set, which are mapped to 18 distinct concepts. Table 1 shows the top five extracted concepts from the training data with the number of entities associated with them.

Concept        Number of Entities
Person         4954
Company        2815
City           1575
Country        961
Organisation   614

Table 1: Top 5 concepts with the number of their associated entities.

4.2 Incorporating Semantic Concepts into NB Training
The extracted semantic concepts can be incorporated into sentiment classifier training in a naive way, where entities are simply replaced by their mapped semantic concepts in the tweets data. For example, entities such as "iPhone", "iPad" and "iPod" are all replaced by the semantic concept "Product/Apple". A more principled way to incorporate semantic concepts is through interpolation. Here, we propose interpolating the unigram language model with the generative model of words given semantic concepts in NB training.
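As a concrete illustration of the two naive incorporation strategies (replacement, and the augmentation variant evaluated in Section 6), the sketch below shows how entities could be swapped for, or supplemented with, their concepts. The entity-to-concept dictionary and helper names are hypothetical stand-ins for the AlchemyAPI mapping, not the authors' code.

# Sketch of semantic replacement and semantic augmentation over a tokenised
# tweet. ENTITY_TO_CONCEPT is a made-up stand-in for the entity-to-concept
# mapping obtained from AlchemyAPI.
ENTITY_TO_CONCEPT = {
    "iphone": "Product/Apple",
    "ipad": "Product/Apple",
    "ipod": "Product/Apple",
}

def semantic_replacement(tokens):
    # Replace every recognised entity with its mapped semantic concept.
    return [ENTITY_TO_CONCEPT.get(t.lower(), t) for t in tokens]

def semantic_augmentation(tokens):
    # Keep the original tokens and append the concept of every entity found.
    concepts = [ENTITY_TO_CONCEPT[t.lower()] for t in tokens if t.lower() in ENTITY_TO_CONCEPT]
    return tokens + concepts

tweet = ["Finally", "I", "got", "my", "iPhone", "What", "a", "product"]
print(semantic_replacement(tweet))   # "iPhone" becomes "Product/Apple"
print(semantic_augmentation(tweet))  # original tokens plus "Product/Apple"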
In NB, the assignment of a sentiment class c to a given tweet w can be computed as:

ĉ = arg max_{c∈C} P(c|w) = arg max_{c∈C} P(c) ∏_{1≤i≤N_w} P(w_i|c)     (1)

where N_w is the total number of words in tweet w, P(c) is the prior probability of a tweet appearing in class c, and P(w_i|c) is the conditional probability of word w_i occurring in a tweet of class c.

In multinomial NB, P(c) can be estimated by P(c) = N_c/N, where N_c is the number of tweets in class c and N is the total number of tweets. P(w|c) can be estimated using maximum likelihood with Laplace smoothing:

P(w|c) = (N(w,c) + 1) / (Σ_{w'∈V} N(w',c) + |V|)     (2)

where N(w,c) is the occurrence frequency of word w in all training tweets of class c and |V| is the number of words in the vocabulary. Although Laplace smoothing helps to prevent zero probabilities for "unseen" words, it assigns equal prior probabilities to all of these words.

We propose a new smoothing method where we interpolate the unigram language model in NB with the generative model of words given semantic concepts. Thus, the new class model with semantic smoothing has the following form:

P_s(w|c) = (1 − α) P_u(w|c) + α Σ_j P(w|s_j) P(s_j|c)     (3)

where P_s(w|c) is the unigram class model with semantic smoothing, P_u(w|c) is the unigram class model with the maximum likelihood estimate, s_j is the j-th concept of the word w, P(s_j|c) is the distribution of semantic concepts in the training data of a given class, and P(w|s_j) is the distribution of words in the training data given a concept; both can be computed via maximum likelihood estimation. Finally, the coefficient α is used to control the influence of the semantic mapping in the new class model. Setting α to 0 reduces the class model to a unigram language model without any semantic interpolation, while setting α to 1 reduces it to a pure semantic mapping model. In this work, α was empirically set to 0.5.
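To make Equation (3) concrete, the sketch below computes the semantically smoothed class model from simple count tables. The toy counts, variable names and vocabulary size are illustrative assumptions, not values from the paper.

# Sketch of the semantically smoothed class model P_s(w|c) of Eq. (3).
# The toy counts below are illustrative only; in practice they would be
# collected from the labelled training tweets and the entity-concept mapping.
from collections import Counter

word_class_counts = {"pos": Counter({"iphone": 30, "love": 50}),
                     "neg": Counter({"iphone": 5, "hate": 40})}
concept_class_counts = {"pos": Counter({"Product/Apple": 60}),
                        "neg": Counter({"Product/Apple": 10})}
word_concept_counts = {"Product/Apple": Counter({"iphone": 35, "ipad": 20, "ipod": 15})}
word_concepts = {"iphone": ["Product/Apple"]}
VOCAB_SIZE = 1000

def smoothed_class_model(word, cls, alpha=0.5):
    # Unigram class model with Laplace smoothing, Eq. (2).
    class_total = sum(word_class_counts[cls].values())
    p_u = (word_class_counts[cls][word] + 1) / (class_total + VOCAB_SIZE)
    # Semantic mapping component: sum over the concepts s_j of the word.
    p_sem = 0.0
    for concept in word_concepts.get(word, []):
        p_concept_given_class = concept_class_counts[cls][concept] / sum(concept_class_counts[cls].values())
        p_word_given_concept = word_concept_counts[concept][word] / sum(word_concept_counts[concept].values())
        p_sem += p_word_given_concept * p_concept_given_class
    # Interpolation of the two models, Eq. (3).
    return (1 - alpha) * p_u + alpha * p_sem

print(smoothed_class_model("iphone", "pos"))

With alpha = 0 the function falls back to the Laplace-smoothed unigram model, and with alpha = 1 it relies entirely on the concept mapping, mirroring the discussion of the coefficient above.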
5. SENTIMENT-TOPIC FEATURES
The joint sentiment-topic (JST) model [11] is a four-layer generative model which allows the detection of both sentiment and topic simultaneously from text. The generative procedure under JST boils down to three stages. First, one chooses a sentiment label l from the per-document sentiment distribution π_d. Following that, one chooses a topic z from the topic distribution θ_{d,l}, where θ_{d,l} is conditioned on the sampled sentiment label l. Finally, one draws a word w_i from the per-corpus word distribution φ_{l,z} conditioned on both topic z and sentiment label l. The JST model does not require labelled documents for training; the only supervision is word prior polarity information, which can be obtained from publicly available sentiment lexicons such as the MPQA subjectivity lexicon.

We train JST on the training set with the tweet sentiment labels discarded. The resulting model assigns each word in the tweets a sentiment label and a topic label. Hence, JST essentially clusters different words sharing similar sentiment and topic. We list some of the topic words extracted by JST in Table 2. Words in each cell are grouped under one topic; the upper half of the table shows topic words bearing positive sentiment, while the lower half shows topic words bearing negative polarity. It can be observed that the word groups under different sentiments and topics are quite informative and coherent. For example, Topic 3 under positive sentiment is related to a good music album, while Topic 1 under negative sentiment is about a complaint of feeling sick, possibly due to cold and headache.

          Topic 1    Topic 2   Topic 3   Topic 4    Topic 5
Positive  dream      bought    song      eat        movi
          sweet      short     listen    food       show
          train      hair      love      coffe      award
          angel      love      music     dinner     live
          love       wear      play      drink      night
          goodnight  shirt     album     yummi      mtv
          free       dress     band      chicken    concert
          club       photo     guitar    tea        vote
Negative  feel       miss      rain      exam       job
          today      sad       bike      school     hard
          hate       cry       car       week       find
          sick       girl      stop      tomorrow   hate
          cold       gonna     ride      luck       interview
          suck       talk      hit       suck       lost
          weather    bore      drive     final      kick
          headache   feel      run       studi      problem

Table 2: Extracted polarity words by JST.

Inspired by the above observations, grouping words under the same topic and bearing similar sentiment could potentially reduce data sparseness in Twitter sentiment classification. Hence, we extract sentiment-topics from the tweets data and augment them as additional features into the original feature space for NB training. Algorithm 1 shows how to perform NB training with sentiment-topics extracted from JST. The training set consists of labelled tweets, D^train = {(w_n; c_n) ∈ W × C : 1 ≤ n ≤ N^train}, where W is the input space and C is a finite set of class labels. The test set contains tweets without labels, D^test = {w_n ∈ W : 1 ≤ n ≤ N^test}. A JST model is first learned from the training set and is then used to infer sentiment-topics for each tweet in the test set. The original tweets are augmented with those sentiment-topics as shown in Step 4 of Algorithm 1, where l_i_z_i denotes the combination of sentiment label l_i and topic z_i for word w_i. Finally, an optional feature selection step can be performed according to the information gain criterion, and a classifier is then trained on the training set with the new feature representation.

Algorithm 1: NB training with sentiment-topics extracted from JST.
Input: the training set D^train and the test set D^test
Output: NB sentiment classifier
1: Train a JST model on D^train with the document labels discarded
2: Infer sentiment-topics for D^test
3: for each tweet w_n = (w_1, w_2, ..., w_m) ∈ {D^train, D^test} do
4:   Augment the tweet with the sentiment-topics generated from JST: w'_n = (w_1, w_2, ..., w_m, l_1_z_1, l_2_z_2, ..., l_m_z_m)
5: end for
6: Create a new training set D'^train = {(w'_n; c_n) : 1 ≤ n ≤ N^train}
7: Create a new test set D'^test = {w'_n : 1 ≤ n ≤ N^test}
8: Perform feature selection using IG on D'^train
9: Return NB trained on D'^train
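The core of Algorithm 1 is the augmentation in Step 4, sketched below for a single tokenised tweet. The per-word sentiment/topic assignments are assumed to come from an already trained JST model; the example dictionary is made up for illustration and is not part of the original implementation.

# Sketch of Step 4 of Algorithm 1: append one l_i_z_i token per word.
# The (sentiment, topic) assignments would come from a trained JST model;
# the dictionary below is a made-up example.
jst_assignments = {
    "love": ("pos", 3),
    "song": ("pos", 3),
    "sick": ("neg", 1),
    "headache": ("neg", 1),
}

def augment_with_sentiment_topics(tokens):
    # Keep the original bag-of-words features and add a sentiment-topic
    # feature for every word the JST model has labelled.
    extra = []
    for t in tokens:
        if t.lower() in jst_assignments:
            label, topic = jst_assignments[t.lower()]
            extra.append(f"{label}_{topic}")
    return tokens + extra

print(augment_with_sentiment_topics(["I", "love", "this", "song"]))
# ['I', 'love', 'this', 'song', 'pos_3', 'pos_3']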
6. EXPERIMENTAL RESULTS
In this section, we present the results obtained on the Twitter sentiment data using both semantic features and sentiment-topic features and compare them with existing approaches.

6.1 Pre-processing
The raw tweets data are very noisy, containing a large number of irregular words and non-English characters. Tweets data also have some unique characteristics which can be used to reduce the feature space through the following pre-processing steps (a code sketch of these rules follows Table 3):

• All Twitter usernames, which start with the @ symbol, are replaced with the term "USER".
• All URL links in the corpus are replaced with the term "URL".
• Letters that are repeated more than twice in a word are reduced to two occurrences; for example, the word "loooooveeee" becomes "loovee" after reduction.
• All Twitter hashtags (which start with the # symbol), all single characters and digits, and all non-alphanumeric characters are removed.

Table 3 shows the effect of pre-processing on reducing the original feature space. After all the pre-processing, the vocabulary size is reduced by 62%.

Pre-processing     Vocabulary Size   % of Reduction
None               95,130            0%
Username           70,804            25.58%
Hashtag            94,200            0.8%
URLs               92,363            2.91%
Repeated Letters   91,824            3.48%
Digits             92,785            2.47%
Symbols            37,054            29.47%
All                37,054            61.05%

Table 3: The effect of pre-processing.
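A minimal sketch of how the pre-processing rules above could be implemented with regular expressions follows; the exact patterns are our reconstruction of the described rules rather than the authors' implementation.

# Sketch of the Section 6.1 pre-processing rules using standard regular
# expressions; the patterns are a reconstruction, not the original code.
import re

def preprocess(tweet):
    tweet = re.sub(r"@\w+", "USER", tweet)          # usernames -> USER
    tweet = re.sub(r"https?://\S+", "URL", tweet)   # links -> URL
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)    # squeeze letters repeated more than twice
    tweet = re.sub(r"#\w+", "", tweet)              # remove hashtags
    tweet = re.sub(r"[^A-Za-z\s]", " ", tweet)      # remove digits and other symbols
    tweet = re.sub(r"\b\w\b", "", tweet)            # remove single characters
    return re.sub(r"\s+", " ", tweet).strip()

print(preprocess("@john I loooooveeee my new #iPhone!! http://t.co/xyz :)"))
# -> "USER loovee my new URL"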
6.2 Semantic Features
We tested both the NB classifier from WEKA (http://www.cs.waikato.ac.nz/ml/weka/) and the maximum entropy (MaxEnt) model from MALLET (http://mallet.cs.umass.edu/). Our results show that NB consistently outperforms MaxEnt; hence, we use NB as our baseline model. Table 4 shows that NB trained from unigrams only obtains a sentiment classification accuracy of 80.7%.

We extracted semantic concepts from the tweets data using AlchemyAPI and then incorporated them into NB training in the following two simple ways. One is to replace all entities in the tweets corpus with their corresponding semantic concepts (semantic replacement). The other is to augment the original feature space with semantic concepts as additional features for NB training (semantic augmentation). With semantic replacement, the feature space shrank substantially, by nearly 20%; however, sentiment classification accuracy drops by 4% compared to the baseline, as shown in Table 4. The performance degradation can be explained by the fact that merely replacing entities with semantic concepts leads to information loss and subsequently hurts NB performance. Augmenting the original feature space with semantic concepts performs slightly better than semantic replacement, though it still performs worse than the baseline.

With semantic interpolation, semantic concepts were incorporated into NB training taking into account the generative probability of words given concepts. The method improves upon the baseline model and gives a sentiment classification accuracy of 84%.

Method                    Accuracy
Unigrams                  80.7%
Semantic replacement      76.3%
Semantic augmentation     77.6%
Semantic interpolation    84.0%
Sentiment-topic features  82.3%

Table 4: Sentiment classification results on the 1000-tweet test set.

6.3 Sentiment-Topic Features
To run JST on the tweets data, the only parameter we need to set is the number of topics T. It is worth noting that the total number of sentiment-topics that will be extracted is 3 × T. For example, when T is set to 50, there are 50 topics under each of the positive, negative and neutral sentiment labels, and hence the total number of sentiment-topic features is 150. We augment the original bag-of-words representation of the tweet messages with the extracted sentiment-topics. Figure 4 shows the classification accuracy of NB trained from the augmented features when varying the number of topics from 1 to 65. The initial sentiment classification accuracy is 81.1% with topic number 1. Increasing the number of topics leads to an increase in classification accuracy, with the peak value of 82.3% reached at topic number 50. Further increasing the number of topics degrades the classifier performance.

Figure 4: Classification accuracy vs. number of topics.

6.4 Comparison with Existing Approaches
In order to compare our proposed methods with existing approaches, we also conducted experiments on the original Stanford Twitter Sentiment test set, which consists of 177 negative and 182 positive tweets. The results are shown in Table 5. The sentiment classification accuracy of 83% reported in [5] was obtained using MaxEnt trained on a combination of unigrams and bigrams. It should be noted that while Go et al. used 1.6 million tweets for training, we only used a subset of 60,000 tweets as our training set. Speriosu et al. [18] tested on a subset of the Stanford Twitter Sentiment test set with 75 negative and 108 positive tweets. They reported a best accuracy of 84.7% using label propagation on a rather complicated graph that has users, tweets, word unigrams, word bigrams, hashtags and emoticons as its nodes.

It can be seen from Table 5 that semantic replacement performs worse than the baseline. Semantic augmentation does not result in a significant decrease in classification accuracy, though it does not lead to improved performance either. Our semantic interpolation method rivals the best result reported on the Stanford Twitter Sentiment test set. Using the sentiment-topic features, we achieve 86.3% sentiment classification accuracy, which outperforms the existing approaches.

Method                    Accuracy
Unigrams                  81.0%
Semantic replacement      77.3%
Semantic augmentation     80.45%
Semantic interpolation    84.1%
Sentiment-topic features  86.3%
(Go et al., 2009)         83%
(Speriosu et al., 2011)   84.7%

Table 5: Sentiment classification results on the original Stanford Twitter Sentiment test set.

6.5 Discussion
We have explored incorporating semantic features and sentiment-topic features for Twitter sentiment classification. While simple semantic replacement or augmentation does not improve sentiment classification performance, semantic interpolation improves upon the baseline NB model trained on unigrams only by about 3%. Augmenting the feature space with sentiment-topics generated from JST also increases sentiment classification accuracy compared to the baseline. On the original Stanford Twitter Sentiment test set, NB classifiers learned from sentiment-topic features outperform the existing approaches.

We have a somewhat contradictory observation here: using sentiment-topic features performs worse than using semantic features on the test set comprising 1,000 tweets, but the reverse is observed on the original Stanford Twitter Sentiment test set with 359 tweets. We therefore conducted further experiments to compare the two approaches.

We performed feature selection using information gain (IG) on the training set. We calculated the IG value for each feature and sorted the features in descending order of IG. Using each distinct IG value as a threshold, we ended up with different sets of features with which to train a classifier. Figure 5 shows the sentiment classification accuracy on the 1000-tweet test set versus the number of features. It can be observed that there is an abrupt change on the x-axis from around 5,600 features jumping to over 30,000 features. Using sentiment-topic features consistently performs better than using semantic features. With as few as 500 features, augmenting the original feature space with sentiment-topics already achieves 80.2% accuracy. Although with all the features included NB trained with semantic features performs better than NB trained with sentiment-topic features, we can still conclude that sentiment-topic features should be preferred over semantic features for the sentiment classification task, since they give much better results with far fewer features.

Figure 5: Classification accuracy vs. number of features selected by information gain.
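For reference, information gain over binary presence/absence features can be computed as sketched below; the representation and helper names are our own simplification, not the paper's implementation.

# Sketch of information-gain ranking over binary presence/absence features.
# tweets is a list of token lists and labels holds the class of each tweet.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_present, labels):
    # IG of a binary feature: H(C) - H(C | feature).
    h_c = entropy(labels)
    with_f = [c for has, c in zip(feature_present, labels) if has]
    without_f = [c for has, c in zip(feature_present, labels) if not has]
    h_conditional = sum(len(s) / len(labels) * entropy(s) for s in (with_f, without_f) if s)
    return h_c - h_conditional

def rank_features_by_ig(tweets, labels):
    vocab = {t for tokens in tweets for t in tokens}
    scores = {f: information_gain([f in tokens for tokens in tweets], labels) for f in vocab}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_features_by_ig([["love", "song"], ["hate", "rain"]], ["pos", "neg"]))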
7. CONCLUSIONS AND FUTURE WORK
Twitter is an open social environment where users can tweet about different topics within the 140-character limit. This poses a significant challenge to Twitter sentiment analysis, since tweets data are often noisy and contain a large number of irregular words and non-English symbols and characters. Pre-processing by filtering out some of the non-standard English words leads to a significant reduction of the original feature space, by nearly 61% on the Twitter sentiment data. Nevertheless, the pre-processed tweets data still contain a large number of rare words.

In this paper, we have proposed two sets of features to alleviate the data sparsity problem in Twitter sentiment classification: semantic features and sentiment-topic features. Our experimental results on the Twitter sentiment data show that while both methods improve upon the baseline Naïve Bayes model trained from unigram features only, using sentiment-topic features gives much better results than using semantic features, with fewer features.

Compared to the existing approaches to Twitter sentiment analysis, which either rely on sophisticated feature engineering or complicated learning procedures, our approaches are much simpler and more straightforward, yet attain comparable performance.

There are a few possible directions we would like to explore as future work. First, in the semantic method all entities were simply replaced by their associated semantic concepts. It is worth performing a selective statistical replacement, determined by the contribution of each concept towards making a better classification decision. Second, the sentiment-topics generated by the JST model were simply augmented into the original feature space of the tweets data. It could lead to better performance to attach a weight to each extracted sentiment-topic feature in order to control the impact of the newly added features. Finally, the performance of the NB classifiers learned from semantic features depends on the quality of the entity extraction process and the entity-concept mapping method. It is worth investigating a filtering method which can automatically filter out low-confidence semantic concepts.

Acknowledgement. This work is partially funded by the EU project ROBUST (grant number 257859).
8. REFERENCES
[1] Agarwal, A., Xie, B., Vovsha, I., Rambow, O., and Passonneau, R. Sentiment analysis of Twitter data. In Proceedings of the ACL 2011 Workshop on Languages in Social Media (2011), pp. 30–38.
[2] Barbosa, L., and Feng, J. Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of COLING (2010), pp. 36–44.
[3] Bhuiyan, S. Social media and its effectiveness in the political reform movement in Egypt. Middle East Media Educator 1, 1 (2011), 14–20.
[4] Boiy, E., Hens, P., Deschacht, K., and Moens, M. Automatic sentiment analysis in on-line text. In Proceedings of the 11th International Conference on Electronic Publishing (2007), pp. 349–360.
[5] Go, A., Bhayani, R., and Huang, L. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford (2009).
[6] Hatzivassiloglou, V., and Wiebe, J. Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of the 18th Conference on Computational Linguistics, Volume 1 (2000), Association for Computational Linguistics, pp. 299–305.
[7] He, Y., and Saif, H. Quantising opinions for political tweets analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC) - in submission (2012).
[8] Hu, M., and Liu, B. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004), ACM, pp. 168–177.
[9] Hussain, M., and Howard, P. The role of digital media. Journal of Democracy 22, 3 (2011), 35–48.
[10] Kouloumpis, E., Wilson, T., and Moore, J. Twitter sentiment analysis: The good the bad and the OMG! In Proceedings of ICWSM (2011).
[11] Lin, C., and He, Y. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (2009), ACM, pp. 375–384.
[12] Narayanan, R., Liu, B., and Choudhary, A. Sentiment analysis of conditional sentences. In EMNLP (2009), pp. 180–189.
[13] Pak, A., and Paroubek, P. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of LREC 2010 (2010).
[14] Pang, B., and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (2004), Association for Computational Linguistics, p. 271.
[15] Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10 (2002), Association for Computational Linguistics, pp. 79–86.
[16] Read, J., and Carroll, J. Weakly supervised techniques for domain-independent sentiment classification. In Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion (2009), pp. 45–52.
[17] Saif, H., He, Y., and Alani, H. Semantic smoothing for Twitter sentiment analysis. In Proceedings of the 10th International Semantic Web Conference (ISWC) (2011).
[18] Speriosu, M., Sudan, N., Upadhyay, S., and Baldridge, J. Twitter polarity classification with label propagation over lexical links and the follower graph. In Proceedings of the EMNLP First Workshop on Unsupervised Learning in NLP (2011), pp. 53–63.
[19] Taboada, M., and Grieve, J. Analyzing appraisal automatically. In Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text (AAAI Technical Report SS-04-07) (2004), pp. 158–161.
[20] Turney, P. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02) (2002).
[21] Ward, J., and Ostrom, A. The Internet as information minefield: An analysis of the source and content of brand information yielded by net searches. Journal of Business Research 56, 11 (2003), 907–914.
[22] Yoon, E., Guffey, H., and Kijewski, V. The effects of information and company reputation on intentions to buy a business service. Journal of Business Research 27, 3 (1993), 215–228.
[23] Zhao, J., Liu, K., and Wang, G. Adding redundant features for CRFs-based sentence sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (2008), pp. 117–126.