Twitter Sentiment Polarity Classification using Barrier Features

Anita Alicante, Anna Corazza, Antonio Pironti
Department of Electrical Engineering and Information Technologies (DIETI)
Università di Napoli Federico II
via Claudio 21, 80125 Napoli, Italy
anita.alicante@unina.it, anna.corazza@unina.it, antonio.pironti@gmail.com

Abstract

English. A crucial point for the applicability of sentiment analysis over Twitter is the degree of manual intervention necessary to adapt the approach to the considered domain. In this work we propose a new sentiment polarity classifier exploiting barrier features, originally introduced for the classification of textual data. Empirical tests on the SemEval2014 competition data sets show that this approach outperforms the baseline systems in nearly all cases.

Italiano. A crucial point for the applicability of sentiment analysis over Twitter is the degree of manual intervention necessary to adapt the approach to the considered domain. In this work we propose a new sentiment polarity classifier exploiting barrier features, originally introduced for the classification of relations extracted from texts. Empirical tests on the data sets used in the SemEval2014 competition show that the proposed approach outperforms the baseline systems in the majority of cases.

1 Introduction

Sentiment analysis (SA) (Pang and Lee, 2008), or opinion mining, is mainly about finding out the feelings of people from data such as product reviews and news articles.

Most methods adopt a two-step strategy for SA (Pang and Lee, 2008): in the subjectivity classification step, the target is classified as subjective or neutral (objective), while in the polarity classification step the subjective targets are further classified as positive or negative. Two classifiers are therefore trained for the whole SA process: the subjectivity classifier and the polarity classifier. Polarity is an aspect of sentiment analysis that can be cast as a three-way classification problem, in that it aims to associate a positive, negative or neutral polarity with each tweet.

Expressions in tweets are often ambiguous because tweets are very informal messages, no longer than 140 characters, containing many misspelled words, slang, modal particles and acronyms. The characteristics of this language are very different from those of more formal documents, and we expect statistical methods trained on tweets to perform well thanks to an automatic adaptation to such specificities.

As evidenced by the tasks included in recent competitions (Rosenthal et al., 2015; Nakov et al., 2016), Twitter sentiment analysis is a relevant topic for scientific research. To the best of our knowledge, (Ravi and Ravi, 2015; Kolchyna et al., 2015; Silva et al., 2016) present comprehensive state-of-the-art (SoA) reviews of the research work done in the various aspects of SA. Furthermore, some approaches, as described in (Gonçalves et al., 2016), are based on the combination of several existing SoA "off-the-shelf" methods for sentence-level sentiment analysis.[1]

[1] A point of strength of this kind of system is that combining several classification methods in an ensemble approach turns out to be very robust with respect to the input vocabulary size and to the amount of available training data.

(Saif et al., 2016) proposes an approach based on the notion that the sentiment of a term depends on its contextual semantics, together with some trigonometric properties of SentiCircles, i.e., 2D geometric circles. These properties are applied to amend an initial sentiment score of terms, according to the context in which they are used. Sentiment identification at either entity or tweet level is then performed by leveraging trigonometric identities on SentiCircles.

The approach we are proposing has been experimentally assessed by comparing its performance with two baseline systems. In addition, its capability to adapt to slightly different domains has been tested by comparing, on a web-blog data set, the performance of two systems in which the Barrier Feature (BF) dictionary has been built on a collection of tweets and on Wikipedia webpages, respectively. Finally, the contribution of BFs has been evaluated in isolation.
2 Proposed approach

Some automatic machine learning approaches recently applied to Twitter sentiment polarity classification try new ways to run the analysis, such as performing sentiment label propagation on Twitter follower graphs and employing social relations for user-level sentiment analysis (Speriosu et al., 2011). Others, not unlike the one we are proposing here, investigate new sets of features to train the model for sentiment identification, such as microblogging features including hashtags, emoticons, etc. (Barbosa and Feng, 2010; Kouloumpis et al., 2011). Indeed, we propose to add Barrier Features (BFs) (Alicante and Corazza, 2011) to unigrams, bigrams and the input parse tree, and to provide them as input to a Support Vector Machine (SVM) classifier.

Introduced in the context of another text mining application, namely relation classification, BFs are inspired by (Karlsson et al., 1995) for Part-of-Speech (PoS) tagging, but they have been completely redesigned as features rather than rules. BFs have also been exploited in (Alicante et al., 2016) for Italian in an unsupervised entity and relation extraction system, proving the language portability of these features. BFs describe a linguistic binding between the entities involved in each relation.

BFs require PoS tagging of the considered texts, which can be performed automatically with very high accuracy (Giménez and Màrquez, 2004). In fact, they consist of sets of PoS tags occurring between a predefined PoS pair, namely (endpoint, trigger). Similarly to unigrams and bigrams of words, these features are Boolean: for each tweet, their value is true if the feature occurs in the tweet, and false otherwise.

Given a set of (endpoint, trigger) pairs P and a sentence (or tweet, in our case) s, the BF extraction algorithm loops over the PoS tags in s and, for each trigger tag t, looks backward in the sentence for the closest occurrence of a PoS tag e such that (e, t) ∈ P. If such an endpoint is found, the algorithm extracts the barrier feature (e, t, PT_{e,t}), where PT_{e,t} is the set of PoS tags occurring between e and t. Otherwise, it extracts as many barrier features as there are elements of P having t as trigger tag; for each of them, the related tag set is the set of PoS tags of all the words preceding the trigger in the sentence.
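The following minimal Python sketch makes the extraction procedure concrete; the function name and the input representation (a plain list of PoS tags) are our own illustrative choices, not part of the original system. It reproduces the behaviour shown in Table 2.

```python
def extract_barrier_features(pos_tags, pairs):
    """Extract barrier features from one PoS-tagged sentence.

    pos_tags: list of PoS tags of the sentence, in order.
    pairs:    set of (endpoint, trigger) PoS pairs P.
    Returns a set of features (endpoint, trigger, frozenset_of_tags).
    """
    triggers = {t for (_, t) in pairs}
    features = set()
    for i, tag in enumerate(pos_tags):
        if tag not in triggers:
            continue
        # Look backward for the closest endpoint e with (e, tag) in P.
        endpoint_pos = None
        for j in range(i - 1, -1, -1):
            if (pos_tags[j], tag) in pairs:
                endpoint_pos = j
                break
        if endpoint_pos is not None:
            # Barrier = set of tags strictly between endpoint and trigger.
            barrier = frozenset(pos_tags[endpoint_pos + 1:i])
            features.add((pos_tags[endpoint_pos], tag, barrier))
        else:
            # No endpoint found: one feature per pair of P with this trigger,
            # whose tag set covers the whole prefix before the trigger.
            prefix = frozenset(pos_tags[:i])
            for (e, t) in pairs:
                if t == tag:
                    features.add((e, t, prefix))
    return features
```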
While in the preceding work (Alicante and Corazza, 2011) the (endpoint, trigger) pairs were predefined, in this work we apply an innovative approach: we choose such pairs in a completely automatic and unsupervised way, starting from an unannotated data set, not necessarily in the same domain as the final task. In fact, BFs are unlexicalized, as they only depend on PoS tags: for any text collection, we can base this analysis on a different collection, which has to be similar in the kind of language but not necessarily in the domain. For instance, we expect the pairs that are most effective for the language adopted in tweets to differ, in general, from those adopted for standard texts.

In choosing the (endpoint, trigger) pairs, our purpose is two-fold: we aim to obtain a high variability of the identified sets of tags while only considering statistically significant patterns, that is, patterns with a rather large number of occurrences. In addition, we do not want to penalize longer patterns, although they usually correspond to larger and therefore more infrequent sets.

For each possible trigger, we therefore choose the endpoint ep which maximizes the expected information per tag of the corresponding BF set, that is:

    sc(ep) = -\sum_{BF} \frac{1}{len(BF)} \Pr(BF) \log \Pr(BF)    (1)

where Pr(BF) is estimated by the corresponding relative frequency. In order to cut off insignificant cases, a threshold is put on the minimum number of occurrences of the considered BF candidates. The normalization by the set size len(BF) has been introduced to avoid penalizing larger sets. A sketch of this selection criterion is given below.

Table 1 reports the pairs resulting from this new approach and adopted for the experiments described in Section 3; Table 2 shows an example of BF extraction based on those pairs.

Table 1: Endpoints and triggers of the BFs employed for the tweet and text message tasks.

Endpoint   Trigger
DT         JJR or NNPS
NNP        NNP or VBZ
IN         NNS
NN         NN or VBG or VBN
RB         RBR
PRP        VBD or VBP
TO         VB

Table 2: BFs built considering the (endpoint, trigger) pairs listed in Table 1 and the following text: Now/RB I/PRP can/MD see/VB why/WRB Dave/NNP Winer/NNP screams/NNS about/IN lack/NN of/IN Twitter/NNP API/NNP ,/, its/PRP limitations/NNS and/CC access/NN throttles/NNS !/.

Barrier Feature                                   Combined Text
(TO, VB, {MD, PRP, RB})                           Now I can see
(NNP, NNP, {MD, PRP, RB, VB, WRB})                Now I can see why Dave
(NNP, NNP, {})                                    Dave Winer
(IN, NNS, {MD, NNP, PRP, RB, VB, WRB})            Now I can see why Dave Winer screams
(NN, NN, {IN, MD, NNP, NNS, PRP, RB, VB, WRB})    Now I can see why Dave Winer screams about lack
(NNP, NNP, {IN, NN, NNS})                         Winer screams about lack of Twitter
(NNP, NNP, {})                                    Twitter API
(IN, NNS, {,, NNP, PRP})                          of Twitter API, its limitations
(NN, NN, {,, CC, IN, NNP, NNS, PRP})              lack of Twitter API, its limitations and access
(IN, NNS, {,, CC, NN, NNP, NNS, PRP})             of Twitter API, its limitations and access throttles

Figure 1: Architecture of the system for Twitter sentiment polarity classification.
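The sketch below illustrates how equation (1) can be computed from BF counts collected on an unannotated corpus. The grouping of counts per candidate (endpoint, trigger) pair, the min_count parameter and the function name are our assumptions; in particular, the paper does not specify how Pr(BF) is normalized, so relative frequency within each candidate pair is used here.

```python
import math

def select_endpoints(bf_counts, min_count):
    """For each trigger, choose the endpoint maximizing sc(ep) as in Eq. (1).

    bf_counts: mapping (endpoint, trigger, frozenset_of_tags) -> number of
               occurrences, collected over an unannotated corpus.
    min_count: minimum number of occurrences for a BF candidate to be kept.
    Returns a dict trigger -> best endpoint.
    """
    # Group the surviving BF counts by candidate (endpoint, trigger) pair.
    by_pair = {}
    for (e, t, tags), n in bf_counts.items():
        if n >= min_count:
            by_pair.setdefault((e, t), []).append((tags, n))

    best = {}  # trigger -> (endpoint, score)
    for (e, t), bfs in by_pair.items():
        total = sum(n for _, n in bfs)
        # sc(ep) = -sum_BF (1/len(BF)) Pr(BF) log Pr(BF), with Pr(BF)
        # estimated by relative frequency; len() of an empty tag set is
        # guarded to avoid division by zero.
        score = -sum((n / total) * math.log(n / total) / max(len(tags), 1)
                     for tags, n in bfs)
        if t not in best or score > best[t][1]:
            best[t] = (e, score)
    return {t: e for t, (e, _) in best.items()}
```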
While in the system presented in (Alicante and Corazza, 2011) BFs were collected using only the training set, in this work we introduce an additional feature reduction step: we only take into account the BFs contained in a BF dictionary, which is built by keeping only the BFs whose number of occurrences within an unannotated data set is greater than or equal to a threshold value. The data set employed in the dictionary construction step is not necessarily constrained to the training set.[2]

[2] Specifically, we used the same data set employed for the identification of the PoS (endpoint, trigger) pairs. This aspect is not as trivial as it might seem: this strategy allows the approach to be applied also to tasks where the size of the annotated training set is limited.

Being unlexicalized, BFs improve the portability of our approach not only towards new languages but also towards new kinds of applications. In particular, the unsupervised pair selection and the dictionary construction steps are decisive both for the automation of the process and for its performance.
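Concretely, the dictionary construction step reduces to counting and thresholding; here is a minimal sketch reusing the hypothetical extract_barrier_features helper from above. The default threshold of 10 is the value selected by cross-validation in Section 3.

```python
from collections import Counter

def build_bf_dictionary(tagged_corpus, pairs, threshold=10):
    """Keep only the BFs occurring at least `threshold` times in an
    unannotated corpus, given as a list of PoS-tag sequences."""
    counts = Counter()
    for pos_tags in tagged_corpus:
        counts.update(extract_barrier_features(pos_tags, pairs))
    return {bf for bf, n in counts.items() if n >= threshold}
```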
3 Experimental Assessment

In order to evaluate our system's performance, we implemented a solution for the Message Polarity Classification subtask of SemEval-2014 Task 9 (Sentiment Analysis in Twitter)[3] (Rosenthal et al., 2014). For each input tweet, our classification system decides whether it expresses a positive, negative, or neutral sentiment. According to the competition rules, the only training data we used are those provided by the task organisers. We used a training set of about 8,000 tweets, a subset of the training and development data released by the organisers.[4]

[3] The SemEval-2014 Task 9 competition website: http://alt.qcri.org/semeval2014/task9/
[4] The only way to collect the data is by using the downloader script available to the participants in the competition, and some of the tweets were no longer available on Twitter at the time we ran the script.

After training the classifier on this training set, the performance of the resulting system has been evaluated against the test datasets provided for the competition: Twitter2013 (T2013), tweets provided for testing the task in 2013; Twitter2014 (T2014), a new test set delivered in 2014; Twitter2014Sarcasm (T2014Sa), a dataset of sarcastic tweets; LiveJournal2014 (LJ2014), a set of sentences extracted from the LiveJournal blog; SMS2013, text messages provided for testing the same task in 2013. The statistics for each test dataset are shown in Table 3.

Table 3: Dataset statistics of SemEval2014 task B, Message Polarity Classification.

Data Set             Positive   Negative   Neutral   Total
LiveJournal2014      427        304        411       1142
SMS2013              492        394        1207      2093
Twitter2013          1281       542        1426      3249
Twitter2014          633        125        453       1211
Twitter2014Sarcasm   33         40         13        86

The Barrier Features System (BFS) implements the approach we are proposing and follows the schema depicted in Figure 1. Input is tagged using SVMTool[5] (Giménez and Màrquez, 2004), an SVM-based tagger able to achieve very competitive accuracy on English texts. Although accuracy is likely to be lower on tweets, classification performance does not appear to be affected; this is probably due to the robustness of the statistical learner against this kind of error. In order to reduce syntactic irregularities, we remove hashtags from tweets before providing them to the PoS tagger component.

[5] The software can be freely downloaded from http://www.lsi.upc.edu/~nlp/SVMTool/

In the BFS system, we use the STS data set[6] to build both the (endpoint, trigger) PoS pairs and the BF dictionary. For the dictionary construction step we considered a threshold value of 10, chosen by 5-fold cross-validation on the SemEval2014 training set. This resulted in 44,536 different BFs. Once BFs are extracted from the SemEval2014 datasets, a vector of binary features encoding all the related unigrams, bigrams and BFs is associated with every tweet (a sketch of this encoding is given at the end of this section).

[6] The Stanford Twitter Sentiment (STS) data set can be freely downloaded from http://help.sentiment140.com/for-students/

We use the Stanford Parser[7] (Klein and Manning, 2003a; Klein and Manning, 2003b) to extract the parse trees of the sentences contained in the datasets. Since a tweet can be composed of several sentences, we use Tsurgeon[8] (Levy and Andrew, 2006) to build a single parse tree for each dataset item (tweet, text message, etc.).

[7] The parser can be freely downloaded from http://nlp.stanford.edu/software/lex-parser.shtml
[8] Tsurgeon can be freely downloaded from http://nlp.stanford.edu/software/tregex.shtml

The classification module, based on Support Vector Machines, has been implemented using the SVMLight-TK[9] (Moschitti, 2006) package. This module takes as input both the feature vectors and the parse trees. Moreover, by applying SVMs with a combination of two different kernel functions, we can handle structured and non-structured information at the same time. Indeed, as in (Alicante et al., 2014; Alicante and Corazza, 2011), we applied tree kernels to the parse trees and a linear kernel to the vector of binary features described above.

[9] The software package can be freely downloaded from http://disi.unitn.it/moschitti/Tree-Kernel.htm
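As an illustration of the binary feature encoding only (the tree kernel side is handled by SVMLight-TK itself), a tweet's vector can be built as follows. The sparse index representation and the feature_index mapping are illustrative assumptions, not the authors' code.

```python
def binary_feature_vector(tokens, pos_tags, feature_index, bf_dictionary, pairs):
    """Boolean encoding of unigrams, bigrams and BFs for one tweet,
    returned as the sorted list of active feature indices.

    feature_index: dict mapping every known unigram, bigram or BF to a
                   unique integer index, built once on the training data.
    """
    unigrams = set(tokens)
    bigrams = set(zip(tokens, tokens[1:]))
    # Only BFs surviving the dictionary-based reduction step are kept.
    bfs = extract_barrier_features(pos_tags, pairs) & bf_dictionary
    active = {feature_index[f] for f in unigrams | bigrams | bfs
              if f in feature_index}
    return sorted(active)
```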
We build three binary classifiers, one for each sentiment class (positive, negative, neutral). Moreover, for each classifier, training is performed by taking the gold examples of the considered class as positive examples, while the negative examples are represented by all the other messages. In this way, the number of negative examples is much larger than the number of positive ones. SVMLight allows balancing the number of positive and negative examples through a cost factor given by the ratio between the number of negative and positive training examples.
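A sketch of this one-vs-rest setup follows, under the naming assumptions above; the cost factor computed here is the ratio the paper describes, to be passed to SVMLight when training each binary classifier.

```python
def one_vs_rest_problems(labels, classes=("positive", "negative", "neutral")):
    """Build one binary problem per sentiment class, together with the
    cost factor used to rebalance positive and negative examples.

    labels: gold class of each training message, aligned with the
            corresponding feature vectors and parse trees.
    """
    for cls in classes:
        binary = [+1 if y == cls else -1 for y in labels]
        n_pos = sum(1 for b in binary if b > 0)
        n_neg = len(binary) - n_pos
        # Cost factor = #negative / #positive, handed to SVMLight so that
        # errors on the minority (positive) class weigh more.
        yield cls, binary, n_neg / n_pos
```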
In order to assess our classification system's performance, we consider two baseline systems (BLS), namely the two systems that won the SemEval2014 competition (Rosenthal et al., 2014). The former, BLS1 (Zhu et al., 2014), is based on an SVM classifier and a feature set composed of lexical and syntactic features, while the latter, BLS2 (Miura et al., 2014), exploits a logistic regression model trained with features based on lexical knowledge.

4 Results

Performance is assessed by adopting the same evaluation metrics as in the SemEval2014 competition (Rosenthal et al., 2014). As usual, they are based on the F1-measure, which is computed separately for each class (positive, negative and neutral). Table 4 compares the classification performance of our system, namely BFS, with the baseline systems, namely BLS1 (Zhu et al., 2014) and BLS2 (Miura et al., 2014), by adopting the same evaluation protocol used in the SemEval2014 competition (Rosenthal et al., 2014).

Table 4: Experimental results comparing the performance of our system with the two baseline systems. The best performance per data set is marked with *, while the symbol ‡ indicates that the comparison is statistically significant.

System   LJ2014   SMS2013   T2013    T2014    T2014Sa
BLS1     74.84*   70.28     70.75    69.85    58.16
BLS2     69.44    57.36     72.12    70.96    56.50
BFS      68.91    72.01*‡   72.88*‡  72.10*‡  58.79*‡

Our system performs significantly better on all data sets except LiveJournal2014. However, additional experiments, whose results are omitted here due to space constraints, showed that our approach performs better than BLS2 on this data set when the BF dictionary is built on Wikipedia.

We think that the explanation for this behaviour lies in the capability of the approach to adapt to the employed data set. In fact, our strategy is based on the unsupervised mining of text to maximize the adaptation to the specific type of language. This also explains why BFS performs worse than the other systems on LiveJournal2014: the syntactic structure of the well-formed sentences contained in a weblog is quite different from that of tweets. It is worth highlighting, though, that this difference in performance is not statistically significant.

The main innovation of our system is the introduction of BFs and the way in which it learns them from data. We assess the contribution of BFs to the overall classification performance by comparing the Barrier Features System (BFS) with the same system deprived of barrier features (WOBFS), both described in Section 3, and report the results in Table 5. Note that this table is more detailed than Table 4 because in this case we can run both systems and collect all the different metrics.

Table 5: Experimental results evaluating the BF contribution. The best performance is marked with *. The symbol † indicates that the improvement of BFS is statistically significant, verified with approximate randomisation.

Data set             System   P        R        F1
LiveJournal2014      WOBFS    68.94    62.34    65.32
                     BFS      74.69*†  63.97*†  68.91*†
SMS2013              WOBFS    64.25    72.92*   67.14
                     BFS      74.80*†  69.43    72.01*†
Twitter2013          WOBFS    74.30    67.32    70.62
                     BFS      77.96*†  68.45*†  72.88*†
Twitter2014          WOBFS    76.09    65.81    70.57
                     BFS      78.19*†  66.98*†  72.10*†
Twitter2014Sarcasm   WOBFS    61.12    52.92    55.28
                     BFS      64.75*†  54.59*†  58.79*†

Barrier features almost always improve performance in terms of both precision and recall, and thus also in terms of F1. In a few cases, the introduction of BFs improves precision while decreasing recall; however, in all these cases the F1 of BFS improves with respect to WOBFS.

In conclusion, the introduction of BFs always comes with an improvement in terms of F1, and this improvement is nearly always statistically significant. We can therefore conclude that BFs provide a crucial contribution to sentiment polarity classification.

5 Conclusions and future work

We explored the effectiveness of BFs for sentiment polarity classification in Twitter posts, and we showed on the SemEval2014 data sets that they can be very effective. Our approach requires very little manual intervention. Indeed, the BF dictionary can be built from any collection of tweets, even one that does not belong to the same domain as the considered task. This is quite interesting because it suggests that BFs are able to capture hints about the polarity of expressions in a domain-independent way.
References

Anita Alicante and Anna Corazza. 2011. Barrier features for classification of semantic relations. In Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, and Nicolas Nicolov, editors, RANLP, pages 509–514. RANLP 2011 Organising Committee.

Anita Alicante, Massimo Benerecetti, Anna Corazza, and Stefano Silvestri. 2014. A distributed information extraction system integrating ontological knowledge and probabilistic classifiers. In Proceedings of the 9th International 3PGCIC-2014 Conference, Guangzhou, China. In press.

Anita Alicante, Anna Corazza, Francesco Isgrò, and Stefano Silvestri. 2016. Unsupervised entity and relation extraction from clinical records in Italian. Computers in Biology and Medicine.

Luciano Barbosa and Junlan Feng. 2010. Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, pages 36–44, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jesús Giménez and Lluís Màrquez. 2004. SVMTool: A general POS tagger generator based on Support Vector Machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), pages 43–46, Lisbon, Portugal.

Pollyanna Gonçalves, Daniel Hasan Dalip, Helen Costa, Marcos André Gonçalves, and Fabrício Benevenuto. 2016. On the combination of off-the-shelf sentiment analysis methods. In Proceedings of the 31st Annual ACM Symposium on Applied Computing, pages 1158–1165. ACM.

Fred Karlsson, Atro Voutilainen, Juha Heikkila, and Arto Anttila, editors. 1995. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter.

Dan Klein and Christopher D. Manning. 2003a. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 423–430, Morristown, NJ, USA. Association for Computational Linguistics.

Dan Klein and Christopher D. Manning. 2003b. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems (NIPS), pages 3–10. MIT Press.

Olga Kolchyna, Thársis T. P. Souza, Philip Treleaven, and Tomaso Aste. 2015. Twitter sentiment analysis: Lexicon method, machine learning method and their combination. arXiv preprint arXiv:1507.00955.

E. Kouloumpis, T. Wilson, and J. Moore. 2011. Twitter sentiment analysis: The good the bad and the OMG! In Fifth International AAAI Conference on Weblogs and Social Media.

Roger Levy and Galen Andrew. 2006. Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, pages 2231–2234.

Yasuhide Miura, Shigeyuki Sakaki, Keigo Hattori, and Tomoko Ohkuma. 2014. TeamX: A sentiment analyzer with enhanced lexicon mapping and weighting scheme for unbalanced data. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 628–632, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.

Alessandro Moschitti. 2006. Making tree kernels practical for natural language learning. In EACL.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 task 4: Sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), San Diego, US (forthcoming).

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, January.

Kumar Ravi and Vadlamani Ravi. 2015. A survey on opinion mining and sentiment analysis: Tasks, approaches and applications. Knowledge-Based Systems, 89:14–46.

Sara Rosenthal, Alan Ritter, Preslav Nakov, and Veselin Stoyanov. 2014. SemEval-2014 task 9: Sentiment analysis in Twitter. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 73–80, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.

Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif M. Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 task 10: Sentiment analysis in Twitter. In Proceedings of SemEval-2015.

Hassan Saif, Yulan He, Miriam Fernandez, and Harith Alani. 2016. Contextual semantics for sentiment analysis of Twitter. Information Processing & Management, 52(1):5–19.

Nadia Felix F. Da Silva, Luiz F. S. Coletta, and Eduardo R. Hruschka. 2016. A survey and comparative study of tweet sentiment analysis via semi-supervised learning. ACM Computing Surveys (CSUR), 49(1):15.

Michael Speriosu, Nikita Sudan, Sid Upadhyay, and Jason Baldridge. 2011. Twitter polarity classification with label propagation over lexical links and the follower graph. In Proceedings of the First Workshop on Unsupervised Learning in NLP, EMNLP '11, pages 53–63, Stroudsburg, PA, USA. Association for Computational Linguistics.

Xiaodan Zhu, Svetlana Kiritchenko, and Saif Mohammad. 2014. NRC-Canada-2014: Recent improvements in the sentiment analysis of tweets. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 443–447, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.