Automatic Stopword Generation using Contextual Semantics for Sentiment Analysis of Twitter

Hassan Saif, Miriam Fernandez and Harith Alani
Knowledge Media Institute, The Open University, United Kingdom
{h.saif,m.fernandez,h.alani}@open.ac.uk

Abstract. In this paper we propose a semantic approach to automatically identify and remove stopwords from Twitter data. Unlike most existing approaches, which rely on outdated and context-insensitive stopword lists, our approach considers the contextual semantics and sentiment of words in order to measure their discrimination power. Evaluation results on six Twitter datasets show that removing our semantically identified stopwords from tweets increases binary sentiment classification performance over the classic pre-compiled stopword list by 0.42% in accuracy and 0.94% in F-measure. Our approach also reduces the sentiment classifier's feature space by 48.34% and the dataset sparsity by 1.17%, on average, compared to the classic method.

Keywords: Sentiment Analysis, Contextual Semantics, Stopwords, Twitter

1 Introduction

The excessive presence of abbreviations and irregular words in tweets makes them very noisy, sparse and hard to extract sentiment from [7, 8]. Aiming to address this problem, existing works on Twitter sentiment analysis remove stopwords from tweets as a pre-processing step [5]. To this end, these works usually use pre-compiled lists of stopwords, such as the Van stoplist [3]. Although widely used, these stoplists have previously been criticised for (i) being outdated [2] and (ii) not accounting for the specificities of the context under analysis [1]. Words with low informative value in one context or corpus may have discrimination power in a different context. For example, the word "like", generally considered a stopword, has an important sentiment discrimination power in the sentence "I like you".

In this paper, we propose an unsupervised approach for automatically generating context-aware stoplists for the sentiment analysis task on Twitter. Our approach captures the contextual semantics and sentiment of words in tweets in order to calculate their informative value. Words with low informative value are then selected as stopwords. Contextual semantics (aka statistical semantics) are based on the proposition that meaning can be extracted from word co-occurrences [9].

We evaluate our approach against the Van stoplist (the so-called classic method) using six Twitter datasets. In particular, we study how removing stopwords generated by our approach affects: (i) the level of data sparsity of the used datasets and (ii) the Maximum Entropy (MaxEnt) classifier in terms of (a) the size of its feature space and (b) its classification performance. Our preliminary results show that our approach outperforms the classic stopword removal method in accuracy and F1-measure by 0.42% and 0.94% respectively. Moreover, removing our semantically identified stopwords reduces the feature space by 48.34% and the dataset sparsity by 1.17% on average, compared to the classic method.

2 Stopwords Generation using Contextual Semantics

The main principle behind our approach is that the informativeness of words in sentiment analysis relies on their semantics and sentiment within the contexts in which they occur.
Stopwords correspond to those words with weak contextual semantics and sentiment. Therefore, our approach functions by first capturing the contextual semantics and sentiment of words and then calculating their informative values accordingly.

2.1 Capturing Contextual Semantics and Sentiment

To capture the contextual semantics and sentiment of words, we use our previously proposed semantic representation model, SentiCircles [6]. In summary, the SentiCircle model extracts the contextual semantics of a word from its co-occurrences with other words in a given tweet corpus. These co-occurrences are then represented as a geometric circle, which is subsequently used to compute the contextual sentiment of the word by applying simple trigonometric identities. In particular, for each unique term m in a tweet collection, we build a two-dimensional geometric circle, where the term m is situated at the centre of the circle and each point around it represents a context term c_i (i.e., a term that occurs with m in the same context). The position of c_i, as illustrated in Figure 1, is defined by its Cartesian coordinates (x_i, y_i) as:

$x_i = r_i \cos(\theta_i \cdot \pi) \qquad y_i = r_i \sin(\theta_i \cdot \pi)$

where $\theta_i$ is the polar angle of the context term c_i, whose value equals the prior sentiment of c_i in a sentiment lexicon before adaptation, and $r_i$ is the radius of c_i, whose value represents the degree of correlation (tdoc) between c_i and m, computed as:

$r_i = \mathrm{tdoc}(m, c_i) = f(c_i, m) \times \log(N / N_{c_i})$

where $f(c_i, m)$ is the number of times c_i occurs with m in tweets, N is the total number of terms, and $N_{c_i}$ is the total number of terms that occur with c_i. Note that all terms' radii in the SentiCircle are normalised. Also, all angle values are in radians.

Fig. 1: SentiCircle of a term m. The stopword region is shaded in gray.

The trigonometric properties of the SentiCircle allow us to encode the contextual semantics of a term as sentiment orientation and sentiment strength. The Y-axis defines the sentiment of the term, i.e., a positive y value denotes a positive sentiment and vice versa. The X-axis defines the sentiment strength of the term: the smaller the x value, the stronger the sentiment.^1 This, in turn, divides the circle into four sentiment quadrants. Terms in the two upper quadrants have a positive sentiment (sin θ > 0), with the upper left quadrant representing stronger positive sentiment, since it contains larger angle values than the upper right quadrant. Similarly, terms in the two lower quadrants have negative sentiment values (sin θ < 0). Moreover, a small region called the "Neutral Region" can be defined. This region is located very close to the X-axis in the "Positive" and "Negative" quadrants only; terms lying in this region have very weak sentiment (i.e., |θ| ≈ 0).

^1 This is because cos θ < 0 for large angles.

The overall Contextual Semantics and Sentiment. An effective way to compute the overall sentiment of m is to calculate the geometric median of all the points in its SentiCircle. Formally, for a given set of n points (p_1, p_2, ..., p_n) in a SentiCircle Ω, the 2D geometric median g is defined as:

$g = \underset{g \in \mathbb{R}^2}{\arg\min} \sum_{i=1}^{n} \lVert p_i - g \rVert_2$

We call the geometric median g the SentiMedian, as its position in the SentiCircle determines the total contextual sentiment orientation and strength of m.
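As an illustration of the construction just described, the following Python sketch builds the SentiCircle points of a term from co-occurrence counts and a prior sentiment lexicon, and approximates its SentiMedian. The helper names (build_senticircle, sentimedian), the assumed co-occurrence data structure, the normalisation by the maximum radius, and the use of Weiszfeld's algorithm for the geometric median are our own assumptions for exposition; the paper does not prescribe a particular implementation or solver.

```python
import math

def build_senticircle(term, cooccurrence, prior_sentiment, total_terms):
    """Return the SentiCircle of `term` as a list of (x, y) points.

    Assumed inputs (not specified in the paper):
      cooccurrence[c][m]  -> number of times context term c occurs with m, f(c, m)
      prior_sentiment[c]  -> prior sentiment of c in [-1, 1], used as theta_c
      total_terms         -> N, the total number of terms in the corpus
    """
    radii = {}
    for c, counts in cooccurrence.items():
        f_cm = counts.get(term, 0)
        if f_cm == 0 or c == term:
            continue
        n_c = len(counts)                               # N_c: terms occurring with c
        radii[c] = f_cm * math.log(total_terms / n_c)   # tdoc(m, c)

    max_r = max(radii.values(), default=1.0) or 1.0     # normalise radii (assumption)
    points = []
    for c, r in radii.items():
        theta = prior_sentiment.get(c, 0.0) * math.pi   # theta_c * pi (radians)
        r_norm = r / max_r
        points.append((r_norm * math.cos(theta), r_norm * math.sin(theta)))
    return points

def sentimedian(points, iters=100, eps=1e-9):
    """Approximate the 2D geometric median with Weiszfeld's algorithm."""
    x = sum(p[0] for p in points) / len(points)          # start from the centroid
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iters):
        num_x = num_y = denom = 0.0
        for px, py in points:
            d = math.hypot(px - x, py - y) or eps        # avoid division by zero
            num_x += px / d
            num_y += py / d
            denom += 1.0 / d
        x, y = num_x / denom, num_y / denom
    return x, y
```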
2.2 Detecting Stopwords with SentiCircles

Stopwords in sentiment analysis are those words that have weak semantics and sentiment within the contexts in which they occur. Hence, stopwords in our approach are those whose SentiMedians are located within a very small region of the SentiCircle close to the origin, as shown in Figure 1. This is because points in this region have (i) very weak sentiment (i.e., |θ| ≈ 0) and (ii) low importance, or low degree of correlation (i.e., r ≈ 0). We call this region the stopword region. Therefore, to detect stopwords, we first build a SentiCircle for each word in the tweet corpus, calculate its overall contextual semantics and sentiment by means of its SentiMedian, and check whether the word's SentiMedian lies within the stopword region or not.

We assume the same stopword region boundary for all SentiCircles emerging from the same Twitter corpus, or context. To compute these boundaries, we first build the SentiCircle of the complete corpus by merging the SentiCircles of all individual terms, and then plot the density distribution of terms within the constructed SentiCircle. The boundaries of the stopword region are delimited by an increase/decrease in the density of terms along the X- and Y-axes. Table 1 shows the X and Y boundaries of the stopword region for all Twitter datasets used in this work.

Dataset      OMD      HCR       STS-Gold   SemEval    WAB      GASP
X-boundary   0.0001   0.0015    0.0015     0.002      0.0006   0.0005
Y-boundary   0.0001   0.00001   0.001      0.00001    0.0001   0.001
Table 1: Stopword region boundaries for all datasets.

3 Evaluation and Results

To evaluate our approach, we perform binary sentiment classification (positive/negative classification of tweets) using a MaxEnt classifier and observe, after removing stopwords, the fluctuations (increases and decreases) in: the classification performance, measured in terms of accuracy and F-measure; the size of the classifier's feature space; and the level of data sparsity. To this end, we use six Twitter datasets: OMD, HCR, STS-Gold, SemEval, WAB and GASP [4]. Our baseline for comparison is the classic method, which removes stopwords obtained from the pre-compiled Van stoplist [3].

Figure 2 depicts the classification performance in accuracy and F1-measure, as well as the reduction in the classifier's feature space, obtained by applying our SentiCircle stopword removal method on all datasets. As noted, our method outperforms the classic stopword list by 0.42% and 0.94% in accuracy and F1-measure on average, respectively. Moreover, we observe that our method shrinks the feature space substantially, by 48.34%, while the classic method achieves a reduction rate of only 5.5%. Figure 3 shows the average impact of the SentiCircle and classic methods on the sparsity degree of our datasets. We notice that our SentiCircle method consistently lowers the sparsity degree of all datasets, by 1.17% on average, compared to the classic method.

Fig. 2: Average accuracy, F-measure and reduction rate of MaxEnt using different stoplists.
Fig. 3: Impact of the classic and SentiCircle methods on the sparsity degree of all datasets.
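To make the detection step of Section 2.2 concrete, the sketch below applies the stopword-region check once SentiMedians have been computed (e.g., with the sentimedian helper sketched earlier). It assumes the region is the axis-aligned box |x| ≤ X-boundary, |y| ≤ Y-boundary around the origin, with the per-dataset boundary values taken from Table 1; the function names and this box interpretation are illustrative assumptions, and the density-based boundary estimation itself is not reproduced here.

```python
# Illustrative sketch: flag terms whose SentiMedian falls inside the
# stopword region close to the origin (assumed here to be the box
# |x| <= X-boundary and |y| <= Y-boundary, with values from Table 1).

STOPWORD_REGION = {            # dataset -> (X-boundary, Y-boundary)
    "OMD":      (0.0001, 0.0001),
    "HCR":      (0.0015, 0.00001),
    "STS-Gold": (0.0015, 0.001),
    "SemEval":  (0.002,  0.00001),
    "WAB":      (0.0006, 0.0001),
    "GASP":     (0.0005, 0.001),
}

def is_stopword(sentimedian_xy, dataset):
    """Return True if a term's SentiMedian lies within the stopword region."""
    x, y = sentimedian_xy
    x_bound, y_bound = STOPWORD_REGION[dataset]
    return abs(x) <= x_bound and abs(y) <= y_bound

def generate_stoplist(sentimedians, dataset):
    """sentimedians: dict mapping each corpus term to its (x, y) SentiMedian."""
    return {t for t, xy in sentimedians.items() if is_stopword(xy, dataset)}
```

The resulting stoplist would then replace a pre-compiled list such as the Van stoplist in the tweet pre-processing step before training the sentiment classifier.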
4 Conclusions

In this paper we proposed a novel approach for generating context-aware stopword lists for sentiment analysis on Twitter. Our approach exploits the contextual semantics of words in order to capture their context and calculates their discrimination power accordingly. We evaluated our approach for binary sentiment classification using six Twitter datasets. Results show that our stopword removal approach outperforms the classic method in terms of sentiment classification performance and in the reduction of both the classifier's feature space and the dataset sparsity.

Acknowledgment
This work was supported by the EU-FP7 project SENSE4US (grant no. 611242).

References
1. Ayral, H., Yavuz, S.: An automated domain specific stop word generation method for natural language text classification. In: International Symposium on Innovations in Intelligent Systems and Applications (INISTA) (2011)
2. Lo, R.T.W., He, B., Ounis, I.: Automatically building a stopword list for an information retrieval system. In: Journal on Digital Information Management: Special Issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR) (2005)
3. Rijsbergen, C.J.V.: Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd edn. (1979)
4. Saif, H., Fernandez, M., He, Y., Alani, H.: Evaluation datasets for Twitter sentiment analysis: A survey and a new dataset, the STS-Gold. In: Proceedings of the 1st ESSEM Workshop. Turin, Italy (2013)
5. Saif, H., Fernandez, M., He, Y., Alani, H.: On stopwords, filtering and data sparsity for sentiment analysis of Twitter. In: Proceedings of the 9th Language Resources and Evaluation Conference (LREC). Reykjavik, Iceland (2014)
6. Saif, H., Fernandez, M., He, Y., Alani, H.: SentiCircles for contextual and conceptual semantic sentiment analysis of Twitter. In: Proceedings of the 11th Extended Semantic Web Conference (ESWC). Crete, Greece (2014)
7. Saif, H., He, Y., Alani, H.: Alleviating data sparsity for Twitter sentiment analysis. In: Proceedings of the 2nd Workshop on Making Sense of Microposts (#MSM2012). Lyon, France (2012)
8. Saif, H., He, Y., Alani, H.: Semantic sentiment analysis of Twitter. In: Proceedings of the 11th International Semantic Web Conference (ISWC). Boston, MA (2012)
9. Turney, P.D., Pantel, P.: From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37(1), 141–188 (2010)