Alleviating Data Sparsity for Twitter Sentiment Analysis

Hassan Saif, Yulan He, Harith Alani
Knowledge Media Institute, The Open University, Milton Keynes, MK7 6AA, UK
h.saif@open.ac.uk, y.he@open.ac.uk, h.alani@open.ac.uk

ABSTRACT
Twitter has attracted much attention recently as a hot research topic in the domain of sentiment analysis. Training sentiment classifiers from tweets data often faces the data sparsity problem, partly due to the large variety of short and irregular forms introduced to tweets because of the 140-character limit. In this work we propose using two different sets of features to alleviate the data sparseness problem. One is the semantic feature set, where we extract semantically hidden concepts from tweets and then incorporate them into classifier training through interpolation. The other is the sentiment-topic feature set, where we extract latent topics and the associated topic sentiment from tweets and then augment the original feature space with these sentiment-topics. Experimental results on the Stanford Twitter Sentiment Dataset show that both feature sets outperform the baseline model using unigrams only. Moreover, using semantic features rivals the previously reported best result, while using sentiment-topic features achieves 86.3% sentiment classification accuracy, which outperforms existing approaches.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing—Text Analysis

General Terms
Algorithms, Experimentation

Keywords
Microblogs, Sentiment Analysis, Opinion Mining, Twitter, Semantic Smoothing, Data Sparsity

Copyright © 2012 held by author(s)/owner(s). Published as part of the #MSM2012 Workshop proceedings, available online as CEUR Vol-838 at http://ceur-ws.org/Vol-838. #MSM2012, April 16, 2012, Lyon, France.

1. INTRODUCTION
A few years after the explosion of Web 2.0, microblogs and social networks are now considered among the most popular forms of communication. Through platforms like Twitter and Facebook, vast amounts of information reflecting people's opinions and attitudes are published and shared among users every day. Monitoring and analysing opinions from social media provides enormous opportunities for both the public and private sectors. For the private sector, it has been observed [21, 22] that the reputation of a certain product or company is highly affected by rumours and negative opinions published and shared among users on social networks. Understanding this, companies realize that monitoring and detecting public opinions from microblogs leads to better relationships with their customers, a better understanding of their customers' needs, and a better response to changes in the market. For the public sector, recent studies [3, 9] show that there is a strong correlation between activities on social networks and the outcomes of certain political issues. For example, Twitter and Facebook were used to organise demonstrations and build solidarity during the Arab Spring civil uprisings in Egypt, Tunisia and, currently, Syria. One week before the Egyptian president's resignation, the total rate of tweets about political change in Egypt increased ten-fold, and in Syria the amount of online content produced by opposition groups on Facebook increased dramatically.

Twitter, now considered one of the most popular microblogging services, has attracted much attention recently as a hot research topic in sentiment analysis. Previous work on Twitter sentiment analysis [5, 13, 2] relies on noisy labels or distant supervision, for example by taking emoticons as the indication of tweet sentiment, to train supervised classifiers. Other work explores feature engineering in combination with machine learning methods to improve sentiment classification accuracy on tweets [1, 10]. None of this work explicitly addresses the data sparsity problem, which is one of the major challenges when dealing with tweets data.
Figure 1: Word frequency statistics.

Figure 1 compares the word frequency statistics of the tweets data we used in our experiments with those of the movie review data (http://www.cs.cornell.edu/People/pabo/movie-review-data/). The x-axis shows the word frequency interval, e.g., words occurring up to 10 times (1-10), more than 10 times but up to 20 times (10-20), etc. The y-axis shows the percentage of words falling within each frequency interval. It can be observed that the tweets data are sparser than the movie review data since the former contain more infrequent words, with 93% of the words in the tweets data occurring less than 10 times (cf. 78% in the movie review data).

One possible way to alleviate data sparseness is through word clustering, such that words contributing similarly to sentiment classification are grouped together. In this paper, we propose two approaches to realise word clustering: one is through semantic smoothing [17], the other is through automatic sentiment-topic extraction. Semantic smoothing extracts semantically hidden concepts from tweets and then incorporates them into supervised classifier training by interpolation. A motivating example for semantic smoothing is shown in Figure 2, where the left box lists entities appearing in the training set together with their occurrence probabilities in positive and negative tweets. For example, the entities "iPad", "iPod" and "Mac Book Pro" appear more often in tweets of positive polarity, and they are all mapped to the semantic concept "Product/Apple". As a result, the tweet from the test set "Finally, I got my iPhone. What a product!" is more likely to have a positive polarity because it contains the entity "iPhone", which is also mapped to the concept "Product/Apple".

Figure 2: Incorporating semantic concepts for sentiment classification.
We propose a semantic interpolation method to incorporate semantic concepts into sentiment classifier training, where we interpolate the original unigram language model in the Naïve Bayes (NB) classifier with the generative model of words given semantic concepts. We show on the Stanford Twitter Sentiment Data [5] that simply replacing words with their corresponding semantic concepts reduces the vocabulary size by nearly 20%; however, the sentiment classification accuracy drops by 4% compared to the baseline NB model trained on unigrams alone. With the interpolation method, the sentiment classification accuracy improves upon the baseline model by nearly 4%.

Our second approach to automatic word clustering is sentiment-topic extraction using the previously proposed joint sentiment-topic (JST) model [11]. The JST model extracts latent topics and the associated topic sentiment from the tweets data, which are subsequently added into the original feature space for supervised classifier training. Our experimental results show that NB learned from these features outperforms the baseline model trained on unigrams only and achieves the state-of-the-art result on the original test set of the Stanford Twitter Sentiment Data.

The rest of the paper is organised as follows. Section 2 outlines existing work on sentiment analysis with a focus on Twitter sentiment analysis. Section 3 describes the data used in our experiments. Section 4 presents our proposed semantic smoothing method. Section 5 describes how we incorporate sentiment-topics extracted from the JST model into sentiment classifier training. Experimental results are discussed in Section 6. Finally, we conclude our work and outline future directions in Section 7.

2. RELATED WORK
Much work has been done in the field of sentiment analysis, most of it following two basic approaches. The first approach assumes that the semantic orientation of a document is an averaged sum of the semantic orientations of its words and phrases. The pioneering work is the point-wise mutual information approach proposed by Turney [20]; work such as [6, 8, 19, 16] provides further examples of this lexicon-based approach. The second approach [15, 14, 4, 23, 12] addresses the problem as a text classification task, where classifiers are built using one of the machine learning methods and trained on a dataset using features such as unigrams, bigrams, part-of-speech (POS) tags, etc. The vast majority of work in sentiment analysis focuses on the domains of movie reviews, product reviews and blogs.

Twitter sentiment analysis is considered a much harder problem than sentiment analysis on conventional text such as review documents, mainly due to the short length of tweet messages, the frequent use of informal and irregular words, and the rapid evolution of language in Twitter. Annotated tweets data are impractical to obtain, so a large amount of work has been conducted on Twitter sentiment analysis using noisy labels (also called distant supervision). For example, Go et al. [5] used emoticons such as ":-)" and ":(" to label tweets as positive or negative and trained standard classifiers such as Naïve Bayes (NB), Maximum Entropy (MaxEnt) and Support Vector Machines (SVMs) to detect the sentiments of tweets. The best result of 83% was reported by MaxEnt using a combination of unigrams and bigrams. Barbosa and Feng [2] collected their training data from three different Twitter sentiment detection websites which mainly use pre-built sentiment lexicons to label each tweet as positive or negative. Using SVMs trained from these noisily labelled data, they obtained 81.3% sentiment classification accuracy.

While the aforementioned approaches did not detect neutral sentiment, Pak and Paroubek [13] additionally collected neutral tweets from Twitter accounts of various newspapers and magazines and trained a three-class NB classifier which is able to detect neutral tweets in addition to positive and negative ones. Their NB was trained with a combination of n-grams and POS features.
Speriosu et al. [18] argued that using noisy sentiment labels may hinder the performance of sentiment classifiers. They proposed exploiting the Twitter follower graph to improve sentiment classification and constructed a graph that has users, tweets, word unigrams, word bigrams, hashtags and emoticons as its nodes, connected based on the links existing among them (e.g., users are connected to tweets they created; tweets are connected to the word unigrams that they contain, etc.). They then applied a label propagation method where sentiment labels were propagated from a small set of nodes seeded with some initial label information throughout the graph. They claimed that their label propagation method outperforms MaxEnt trained from noisy labels and obtained an accuracy of 84.7% on a subset of the Twitter sentiment test set from [5].

There has also been some work exploring feature engineering to improve the performance of sentiment classification on tweets. Agarwal et al. [1] studied a feature-based model and a tree-kernel-based model for sentiment classification. They explored a total of 50 different feature types and showed that both models perform similarly and that both outperform the unigram baseline.

Kouloumpis et al. [10] compared various features including n-gram features, lexicon features based on the presence of polarity words from the MPQA subjectivity lexicon (http://www.cs.pitt.edu/mpqa/), POS features, and microblogging features capturing the presence of emoticons, abbreviations and intensifiers (e.g., all-caps and character repetitions). They found that microblogging features are the most useful in sentiment classification.

3. TWITTER SENTIMENT CORPUS
In the work conducted in this paper, we used the Stanford Twitter Sentiment Data (http://twittersentiment.appspot.com/), which was collected between the 6th of April and the 25th of June 2009 [5]. The training set consists of 1.6 million tweets with the same number of positive and negative tweets, labelled using emoticons.
For example, a tweet is labelled as positive if it contains :), :-), : ), :D, or =) and is labelled as negative if it has :(, :-(, or : (, etc. The original test set consists of 177 negative and 182 positive manually annotated tweets. In contrast to the training set, which was collected based on specific emoticons, the test set was collected by searching the Twitter API with specific queries including product names, companies and people.

We built our training set by randomly selecting 60,000 balanced tweets from the original training set in the Stanford Twitter Sentiment Data. Since the original test set only contains a total of 359 tweets, which is relatively small, we enlarged this set by manually annotating more tweets. To simplify and speed up the annotation effort, we built Tweenator (http://atkmi.com/tweenator/), a web-based sentiment annotation tool that allows users to easily assign a sentiment label to tweet messages, i.e. assign a negative, positive or neutral label to a certain tweet with regard to its contextual polarity. Using Tweenator, 12 different users annotated an additional 641 tweets from the original remaining training data. Our final test set contains 1,000 tweet messages, with 527 negative and 473 positive.

It is worth mentioning that users who participated in the annotation process reported that, using the annotation interface of Tweenator shown in Figure 3-a, they were able to annotate 10 tweet messages in approximately 2 to 3 minutes.

Recently, we have added two new modules to Tweenator implementing the work that will be described in Section 4. The first module (see Figure 3-b) provides free-form sentiment detection, which allows users to detect the polarity of their own textual entries. The second module is the opinionated tweet message retrieval tool (see Figure 3-c), which allows users to retrieve negative/positive tweets towards a specific search term. For example, a user can retrieve opinionated tweet messages about the search term "Nike".

Figure 3: Tweenator: web-based sentiment annotation tool for Twitter. (a) Sentiment Annotation Interface. (b) Free-Form Sentiment Detector Interface. (c) Opinionated Tweet Message Retrieval Interface.

4. SEMANTIC FEATURES
Twitter is an open social environment where there are no restrictions on what users can tweet about. Therefore, a huge number of infrequent named entities, such as people, organizations, products, etc., can be found in tweet messages. These infrequent entities make the data very sparse and hence hinder sentiment classification performance. Nevertheless, many of these named entities are semantically related. For example, the entities "iPad" and "iPhone" can be mapped to the same semantic concept "Product/Apple". Inspired by this observation, we propose using semantic features to alleviate the sparsity problem in tweets data. We first extract named entities from tweets and map them to their corresponding semantic concepts. We then incorporate these semantic concepts into NB classifier training.

4.1 Semantic Concept Extraction
We investigated three third-party services for extracting entities from tweets data: Zemanta (http://www.zemanta.com/), OpenCalais (http://www.opencalais.com/) and AlchemyAPI (http://www.alchemyapi.com/). A quick manual comparison of 100 randomly selected tweet messages with the extracted entities and their corresponding semantic concepts showed that AlchemyAPI performs better than the others in terms of the quality and the quantity of the extracted entities. Hence, we used AlchemyAPI for the extraction of semantic concepts in this paper.

Using AlchemyAPI, we extracted a total of 15,139 entities from the training set, which are mapped to 30 distinct concepts, and 329 entities from the test set, which are mapped to 18 distinct concepts. Table 1 shows the top five extracted concepts from the training data with the number of entities associated with them.

Concept        Number of Entities
Person         4954
Company        2815
City           1575
Country        961
Organisation   614

Table 1: Top 5 concepts with the number of their associated entities.

4.2 Incorporating Semantic Concepts into NB Training
The extracted semantic concepts can be incorporated into sentiment classifier training in a naive way, where entities are simply replaced by their mapped semantic concepts in the tweets data. For example, entities such as "iPhone", "iPad" and "iPod" are all replaced by the semantic concept "Product/Apple". A more principled way to incorporate semantic concepts is through interpolation. Here, we propose interpolating the unigram language model with the generative model of words given semantic concepts in NB training.
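As a concrete illustration of the two naive incorporation strategies (replacement, and the augmentation variant evaluated in Section 6), the sketch below shows how entities could be swapped for, or supplemented with, their concepts. The entity-to-concept dictionary and helper names are hypothetical stand-ins for the AlchemyAPI mapping, not the authors' code.

# Sketch of semantic replacement and semantic augmentation over a tokenised
# tweet. ENTITY_TO_CONCEPT is a made-up stand-in for the entity-to-concept
# mapping obtained from AlchemyAPI.
ENTITY_TO_CONCEPT = {
    "iphone": "Product/Apple",
    "ipad": "Product/Apple",
    "ipod": "Product/Apple",
}

def semantic_replacement(tokens):
    # Replace every recognised entity with its mapped semantic concept.
    return [ENTITY_TO_CONCEPT.get(t.lower(), t) for t in tokens]

def semantic_augmentation(tokens):
    # Keep the original tokens and append the concept of every entity found.
    concepts = [ENTITY_TO_CONCEPT[t.lower()] for t in tokens if t.lower() in ENTITY_TO_CONCEPT]
    return tokens + concepts

tweet = ["Finally", "I", "got", "my", "iPhone", "What", "a", "product"]
print(semantic_replacement(tweet))   # "iPhone" becomes "Product/Apple"
print(semantic_augmentation(tweet))  # original tokens plus "Product/Apple"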
In NB, the assignment of a sentiment class c to a given tweet w can be computed as:

ĉ = arg max_{c∈C} P(c|w) = arg max_{c∈C} P(c) ∏_{1≤i≤N_w} P(w_i|c)     (1)

where N_w is the total number of words in tweet w, P(c) is the prior probability of a tweet appearing in class c, and P(w_i|c) is the conditional probability of word w_i occurring in a tweet of class c.

In multinomial NB, P(c) can be estimated by P(c) = N_c/N, where N_c is the number of tweets in class c and N is the total number of tweets. P(w|c) can be estimated using maximum likelihood with Laplace smoothing:

P(w|c) = (N(w,c) + 1) / (Σ_{w'∈V} N(w',c) + |V|)     (2)

where N(w,c) is the occurrence frequency of word w in all training tweets of class c and |V| is the number of words in the vocabulary. Although Laplace smoothing helps to prevent zero probabilities for "unseen" words, it assigns equal prior probabilities to all of these words.

We propose a new smoothing method where we interpolate the unigram language model in NB with the generative model of words given semantic concepts. Thus, the new class model with semantic smoothing has the following form:

P_s(w|c) = (1 − α) P_u(w|c) + α Σ_j P(w|s_j) P(s_j|c)     (3)

where P_s(w|c) is the unigram class model with semantic smoothing, P_u(w|c) is the unigram class model with the maximum likelihood estimate, s_j is the j-th concept of the word w, P(s_j|c) is the distribution of semantic concepts in the training data of a given class, and P(w|s_j) is the distribution of words in the training data given a concept; both can be computed via maximum likelihood estimation. Finally, the coefficient α is used to control the influence of the semantic mapping in the new class model. Setting α to 0 reduces the class model to a unigram language model without any semantic interpolation, while setting α to 1 reduces it to a pure semantic mapping model. In this work, α was empirically set to 0.5.
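To make Equation (3) concrete, the sketch below computes the semantically smoothed class model from simple count tables. The toy counts, variable names and vocabulary size are illustrative assumptions, not values from the paper.

# Sketch of the semantically smoothed class model P_s(w|c) of Eq. (3).
# The toy counts below are illustrative only; in practice they would be
# collected from the labelled training tweets and the entity-concept mapping.
from collections import Counter

word_class_counts = {"pos": Counter({"iphone": 30, "love": 50}),
                     "neg": Counter({"iphone": 5, "hate": 40})}
concept_class_counts = {"pos": Counter({"Product/Apple": 60}),
                        "neg": Counter({"Product/Apple": 10})}
word_concept_counts = {"Product/Apple": Counter({"iphone": 35, "ipad": 20, "ipod": 15})}
word_concepts = {"iphone": ["Product/Apple"]}
VOCAB_SIZE = 1000

def smoothed_class_model(word, cls, alpha=0.5):
    # Unigram class model with Laplace smoothing, Eq. (2).
    class_total = sum(word_class_counts[cls].values())
    p_u = (word_class_counts[cls][word] + 1) / (class_total + VOCAB_SIZE)
    # Semantic mapping component: sum over the concepts s_j of the word.
    p_sem = 0.0
    for concept in word_concepts.get(word, []):
        p_concept_given_class = concept_class_counts[cls][concept] / sum(concept_class_counts[cls].values())
        p_word_given_concept = word_concept_counts[concept][word] / sum(word_concept_counts[concept].values())
        p_sem += p_word_given_concept * p_concept_given_class
    # Interpolation of the two models, Eq. (3).
    return (1 - alpha) * p_u + alpha * p_sem

print(smoothed_class_model("iphone", "pos"))

With alpha = 0 the function falls back to the Laplace-smoothed unigram model, and with alpha = 1 it relies entirely on the concept mapping, mirroring the discussion of the coefficient above.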
5. SENTIMENT-TOPIC FEATURES
The joint sentiment-topic (JST) model [11] is a four-layer generative model which allows the detection of both sentiment and topic simultaneously from text. The generative procedure under JST boils down to three stages. First, one chooses a sentiment label l from the per-document sentiment distribution π_d. Following that, one chooses a topic z from the topic distribution θ_{d,l}, where θ_{d,l} is conditioned on the sampled sentiment label l. Finally, one draws a word w_i from the per-corpus word distribution φ_{l,z} conditioned on both topic z and sentiment label l. The JST model does not require labelled documents for training; the only supervision is word prior polarity information, which can be obtained from publicly available sentiment lexicons such as the MPQA subjectivity lexicon.

We train JST on the training set with the tweet sentiment labels discarded. The resulting model assigns each word in the tweets a sentiment label and a topic label. Hence, JST essentially clusters different words sharing similar sentiment and topic. We list some of the topic words extracted by JST in Table 2. Words in each cell are grouped under one topic; the upper half of the table shows topic words bearing positive sentiment, while the lower half shows topic words bearing negative polarity. It can be observed that the word groups under different sentiments and topics are quite informative and coherent. For example, Topic 3 under positive sentiment is related to a good music album, while Topic 1 under negative sentiment is about a complaint of feeling sick, possibly due to cold and headache.

          Topic 1    Topic 2   Topic 3   Topic 4    Topic 5
Positive  dream      bought    song      eat        movi
          sweet      short     listen    food       show
          train      hair      love      coffe      award
          angel      love      music     dinner     live
          love       wear      play      drink      night
          goodnight  shirt     album     yummi      mtv
          free       dress     band      chicken    concert
          club       photo     guitar    tea        vote
Negative  feel       miss      rain      exam       job
          today      sad       bike      school     hard
          hate       cry       car       week       find
          sick       girl      stop      tomorrow   hate
          cold       gonna     ride      luck       interview
          suck       talk      hit       suck       lost
          weather    bore      drive     final      kick
          headache   feel      run       studi      problem

Table 2: Extracted polarity words by JST.

Inspired by the above observations, grouping words under the same topic and bearing similar sentiment could potentially reduce data sparseness in Twitter sentiment classification. Hence, we extract sentiment-topics from the tweets data and augment them as additional features into the original feature space for NB training. Algorithm 1 shows how to perform NB training with sentiment-topics extracted from JST. The training set consists of labelled tweets, D^train = {(w_n; c_n) ∈ W × C : 1 ≤ n ≤ N^train}, where W is the input space and C is a finite set of class labels. The test set contains tweets without labels, D^test = {w_n ∈ W : 1 ≤ n ≤ N^test}. A JST model is first learned from the training set and is then used to infer sentiment-topics for each tweet in the test set. The original tweets are augmented with those sentiment-topics as shown in Step 4 of Algorithm 1, where l_i_z_i denotes the combination of sentiment label l_i and topic z_i for word w_i. Finally, an optional feature selection step can be performed according to the information gain criterion, and a classifier is then trained on the training set with the new feature representation.

Algorithm 1: NB training with sentiment-topics extracted from JST.
Input: the training set D^train and the test set D^test
Output: NB sentiment classifier
1: Train a JST model on D^train with the document labels discarded
2: Infer sentiment-topics for D^test
3: for each tweet w_n = (w_1, w_2, ..., w_m) ∈ {D^train, D^test} do
4:   Augment the tweet with the sentiment-topics generated from JST: w'_n = (w_1, w_2, ..., w_m, l_1_z_1, l_2_z_2, ..., l_m_z_m)
5: end for
6: Create a new training set D'^train = {(w'_n; c_n) : 1 ≤ n ≤ N^train}
7: Create a new test set D'^test = {w'_n : 1 ≤ n ≤ N^test}
8: Perform feature selection using IG on D'^train
9: Return NB trained on D'^train
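The core of Algorithm 1 is the augmentation in Step 4, sketched below for a single tokenised tweet. The per-word sentiment/topic assignments are assumed to come from an already trained JST model; the example dictionary is made up for illustration and is not part of the original implementation.

# Sketch of Step 4 of Algorithm 1: append one l_i_z_i token per word.
# The (sentiment, topic) assignments would come from a trained JST model;
# the dictionary below is a made-up example.
jst_assignments = {
    "love": ("pos", 3),
    "song": ("pos", 3),
    "sick": ("neg", 1),
    "headache": ("neg", 1),
}

def augment_with_sentiment_topics(tokens):
    # Keep the original bag-of-words features and add a sentiment-topic
    # feature for every word the JST model has labelled.
    extra = []
    for t in tokens:
        if t.lower() in jst_assignments:
            label, topic = jst_assignments[t.lower()]
            extra.append(f"{label}_{topic}")
    return tokens + extra

print(augment_with_sentiment_topics(["I", "love", "this", "song"]))
# ['I', 'love', 'this', 'song', 'pos_3', 'pos_3']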
6. EXPERIMENTAL RESULTS
In this section, we present the results obtained on the Twitter sentiment data using both semantic features and sentiment-topic features and compare them with existing approaches.

6.1 Pre-processing
The raw tweets data are very noisy, containing a large number of irregular words and non-English characters. Tweets data also have some unique characteristics which can be used to reduce the feature space through the following pre-processing steps (a code sketch of these rules follows Table 3):

• All Twitter usernames, which start with the @ symbol, are replaced with the term "USER".
• All URL links in the corpus are replaced with the term "URL".
• Letters that are repeated more than twice in a word are reduced to two occurrences; for example, the word "loooooveeee" becomes "loovee" after reduction.
• All Twitter hashtags (which start with the # symbol), all single characters and digits, and all non-alphanumeric characters are removed.

Table 3 shows the effect of pre-processing on reducing the original feature space. After all the pre-processing, the vocabulary size is reduced by 62%.

Pre-processing     Vocabulary Size   % of Reduction
None               95,130            0%
Username           70,804            25.58%
Hashtag            94,200            0.8%
URLs               92,363            2.91%
Repeated Letters   91,824            3.48%
Digits             92,785            2.47%
Symbols            37,054            29.47%
All                37,054            61.05%

Table 3: The effect of pre-processing.
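A minimal sketch of how the pre-processing rules above could be implemented with regular expressions follows; the exact patterns are our reconstruction of the described rules rather than the authors' implementation.

# Sketch of the Section 6.1 pre-processing rules using standard regular
# expressions; the patterns are a reconstruction, not the original code.
import re

def preprocess(tweet):
    tweet = re.sub(r"@\w+", "USER", tweet)          # usernames -> USER
    tweet = re.sub(r"https?://\S+", "URL", tweet)   # links -> URL
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)    # squeeze letters repeated more than twice
    tweet = re.sub(r"#\w+", "", tweet)              # remove hashtags
    tweet = re.sub(r"[^A-Za-z\s]", " ", tweet)      # remove digits and other symbols
    tweet = re.sub(r"\b\w\b", "", tweet)            # remove single characters
    return re.sub(r"\s+", " ", tweet).strip()

print(preprocess("@john I loooooveeee my new #iPhone!! http://t.co/xyz :)"))
# -> "USER loovee my new URL"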
6.2 Semantic Features
We tested both the NB classifier from WEKA (http://www.cs.waikato.ac.nz/ml/weka/) and the maximum entropy (MaxEnt) model from MALLET (http://mallet.cs.umass.edu/). Our results show that NB consistently outperforms MaxEnt; hence, we use NB as our baseline model. Table 4 shows that NB trained from unigrams only obtains a sentiment classification accuracy of 80.7%.

We extracted semantic concepts from the tweets data using AlchemyAPI and then incorporated them into NB training in the following two simple ways. One is to replace all entities in the tweets corpus with their corresponding semantic concepts (semantic replacement). The other is to augment the original feature space with semantic concepts as additional features for NB training (semantic augmentation). With semantic replacement, the feature space shrank substantially, by nearly 20%; however, sentiment classification accuracy drops by 4% compared to the baseline, as shown in Table 4. The performance degradation can be explained by the fact that merely replacing entities with semantic concepts leads to information loss and subsequently hurts NB performance. Augmenting the original feature space with semantic concepts performs slightly better than semantic replacement, though it still performs worse than the baseline.

With semantic interpolation, semantic concepts were incorporated into NB training taking into account the generative probability of words given concepts. The method improves upon the baseline model and gives a sentiment classification accuracy of 84%.

Method                    Accuracy
Unigrams                  80.7%
Semantic replacement      76.3%
Semantic augmentation     77.6%
Semantic interpolation    84.0%
Sentiment-topic features  82.3%

Table 4: Sentiment classification results on the 1000-tweet test set.

6.3 Sentiment-Topic Features
To run JST on the tweets data, the only parameter we need to set is the number of topics T. It is worth noting that the total number of sentiment-topics that will be extracted is 3 × T. For example, when T is set to 50, there are 50 topics under each of the positive, negative and neutral sentiment labels, and hence the total number of sentiment-topic features is 150. We augment the original bag-of-words representation of the tweet messages with the extracted sentiment-topics. Figure 4 shows the classification accuracy of NB trained from the augmented features when varying the number of topics from 1 to 65. The initial sentiment classification accuracy is 81.1% with topic number 1. Increasing the number of topics leads to an increase in classification accuracy, with the peak value of 82.3% reached at topic number 50. Further increasing the number of topics degrades the classifier performance.

Figure 4: Classification accuracy vs. number of topics.

6.4 Comparison with Existing Approaches
In order to compare our proposed methods with existing approaches, we also conducted experiments on the original Stanford Twitter Sentiment test set, which consists of 177 negative and 182 positive tweets. The results are shown in Table 5. The sentiment classification accuracy of 83% reported in [5] was obtained using MaxEnt trained on a combination of unigrams and bigrams. It should be noted that while Go et al. used 1.6 million tweets for training, we only used a subset of 60,000 tweets as our training set. Speriosu et al. [18] tested on a subset of the Stanford Twitter Sentiment test set with 75 negative and 108 positive tweets. They reported a best accuracy of 84.7% using label propagation on a rather complicated graph that has users, tweets, word unigrams, word bigrams, hashtags and emoticons as its nodes.

It can be seen from Table 5 that semantic replacement performs worse than the baseline. Semantic augmentation does not result in a significant decrease in classification accuracy, though it does not lead to improved performance either. Our semantic interpolation method rivals the best result reported on the Stanford Twitter Sentiment test set. Using the sentiment-topic features, we achieve 86.3% sentiment classification accuracy, which outperforms the existing approaches.

Method                    Accuracy
Unigrams                  81.0%
Semantic replacement      77.3%
Semantic augmentation     80.45%
Semantic interpolation    84.1%
Sentiment-topic features  86.3%
(Go et al., 2009)         83%
(Speriosu et al., 2011)   84.7%

Table 5: Sentiment classification results on the original Stanford Twitter Sentiment test set.

6.5 Discussion
We have explored incorporating semantic features and sentiment-topic features for Twitter sentiment classification. While simple semantic replacement or augmentation does not improve sentiment classification performance, semantic interpolation improves upon the baseline NB model trained on unigrams only by about 3%. Augmenting the feature space with sentiment-topics generated from JST also increases sentiment classification accuracy compared to the baseline. On the original Stanford Twitter Sentiment test set, NB classifiers learned from sentiment-topic features outperform the existing approaches.

We have a somewhat contradictory observation here: using sentiment-topic features performs worse than using semantic features on the test set comprising 1,000 tweets, but the reverse is observed on the original Stanford Twitter Sentiment test set with 359 tweets. We therefore conducted further experiments to compare the two approaches.

We performed feature selection using information gain (IG) on the training set. We calculated the IG value for each feature and sorted the features in descending order of IG. Using each distinct IG value as a threshold, we ended up with different sets of features with which to train a classifier. Figure 5 shows the sentiment classification accuracy on the 1000-tweet test set versus the number of features. It can be observed that there is an abrupt change on the x-axis from around 5,600 features jumping to over 30,000 features. Using sentiment-topic features consistently performs better than using semantic features. With as few as 500 features, augmenting the original feature space with sentiment-topics already achieves 80.2% accuracy. Although with all the features included NB trained with semantic features performs better than NB trained with sentiment-topic features, we can still conclude that sentiment-topic features should be preferred over semantic features for the sentiment classification task, since they give much better results with far fewer features.

Figure 5: Classification accuracy vs. number of features selected by information gain.
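For reference, information gain over binary presence/absence features can be computed as sketched below; the representation and helper names are our own simplification, not the paper's implementation.

# Sketch of information-gain ranking over binary presence/absence features.
# tweets is a list of token lists and labels holds the class of each tweet.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_present, labels):
    # IG of a binary feature: H(C) - H(C | feature).
    h_c = entropy(labels)
    with_f = [c for has, c in zip(feature_present, labels) if has]
    without_f = [c for has, c in zip(feature_present, labels) if not has]
    h_conditional = sum(len(s) / len(labels) * entropy(s) for s in (with_f, without_f) if s)
    return h_c - h_conditional

def rank_features_by_ig(tweets, labels):
    vocab = {t for tokens in tweets for t in tokens}
    scores = {f: information_gain([f in tokens for tokens in tweets], labels) for f in vocab}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_features_by_ig([["love", "song"], ["hate", "rain"]], ["pos", "neg"]))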
7. CONCLUSIONS AND FUTURE WORK
Twitter is an open social environment where users can tweet about different topics within the 140-character limit. This poses a significant challenge to Twitter sentiment analysis, since tweets data are often noisy and contain a large number of irregular words and non-English symbols and characters. Pre-processing by filtering out some of the non-standard English words leads to a significant reduction of the original feature space, by nearly 61% on the Twitter sentiment data. Nevertheless, the pre-processed tweets data still contain a large number of rare words.

In this paper, we have proposed two sets of features to alleviate the data sparsity problem in Twitter sentiment classification: semantic features and sentiment-topic features. Our experimental results on the Twitter sentiment data show that while both methods improve upon the baseline Naïve Bayes model trained from unigram features only, using sentiment-topic features gives much better results than using semantic features, with fewer features.

Compared to the existing approaches to Twitter sentiment analysis, which either rely on sophisticated feature engineering or complicated learning procedures, our approaches are much simpler and more straightforward, yet attain comparable performance.

There are a few possible directions we would like to explore as future work. First, in the semantic method all entities were simply replaced by their associated semantic concepts. It is worth performing a selective statistical replacement, determined by the contribution of each concept towards making a better classification decision. Second, the sentiment-topics generated by the JST model were simply augmented into the original feature space of the tweets data. It could lead to better performance to attach a weight to each extracted sentiment-topic feature in order to control the impact of the newly added features. Finally, the performance of the NB classifiers learned from semantic features depends on the quality of the entity extraction process and the entity-concept mapping method. It is worth investigating a filtering method which can automatically filter out low-confidence semantic concepts.

Acknowledgement. This work is partially funded by the EU project ROBUST (grant number 257859).
8. REFERENCES
[1] Agarwal, A., Xie, B., Vovsha, I., Rambow, O., and Passonneau, R. Sentiment analysis of Twitter data. In Proceedings of the ACL 2011 Workshop on Languages in Social Media (2011), pp. 30–38.
[2] Barbosa, L., and Feng, J. Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of COLING (2010), pp. 36–44.
[3] Bhuiyan, S. Social media and its effectiveness in the political reform movement in Egypt. Middle East Media Educator 1, 1 (2011), 14–20.
[4] Boiy, E., Hens, P., Deschacht, K., and Moens, M. Automatic sentiment analysis in on-line text. In Proceedings of the 11th International Conference on Electronic Publishing (2007), pp. 349–360.
[5] Go, A., Bhayani, R., and Huang, L. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford (2009).
[6] Hatzivassiloglou, V., and Wiebe, J. Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of the 18th Conference on Computational Linguistics, Volume 1 (2000), Association for Computational Linguistics, pp. 299–305.
[7] He, Y., and Saif, H. Quantising opinions for political tweets analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC) - in submission (2012).
[8] Hu, M., and Liu, B. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004), ACM, pp. 168–177.
[9] Hussain, M., and Howard, P. The role of digital media. Journal of Democracy 22, 3 (2011), 35–48.
[10] Kouloumpis, E., Wilson, T., and Moore, J. Twitter sentiment analysis: The good the bad and the OMG! In Proceedings of ICWSM (2011).
[11] Lin, C., and He, Y. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (2009), ACM, pp. 375–384.
[12] Narayanan, R., Liu, B., and Choudhary, A. Sentiment analysis of conditional sentences. In EMNLP (2009), pp. 180–189.
[13] Pak, A., and Paroubek, P. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of LREC 2010 (2010).
[14] Pang, B., and Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (2004), Association for Computational Linguistics, p. 271.
[15] Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10 (2002), Association for Computational Linguistics, pp. 79–86.
[16] Read, J., and Carroll, J. Weakly supervised techniques for domain-independent sentiment classification. In Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion (2009), pp. 45–52.
[17] Saif, H., He, Y., and Alani, H. Semantic smoothing for Twitter sentiment analysis. In Proceedings of the 10th International Semantic Web Conference (ISWC) (2011).
[18] Speriosu, M., Sudan, N., Upadhyay, S., and Baldridge, J. Twitter polarity classification with label propagation over lexical links and the follower graph. In Proceedings of the EMNLP First Workshop on Unsupervised Learning in NLP (2011), pp. 53–63.
[19] Taboada, M., and Grieve, J. Analyzing appraisal automatically. In Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text (AAAI Technical Report SS-04-07) (2004), pp. 158–161.
[20] Turney, P. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02) (2002).
[21] Ward, J., and Ostrom, A. The Internet as information minefield: An analysis of the source and content of brand information yielded by net searches. Journal of Business Research 56, 11 (2003), 907–914.
[22] Yoon, E., Guffey, H., and Kijewski, V. The effects of information and company reputation on intentions to buy a business service. Journal of Business Research 27, 3 (1993), 215–228.
[23] Zhao, J., Liu, K., and Wang, G. Adding redundant features for CRFs-based sentence sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (2008), pp. 117–126.