TASS 2016: Workshop on Sentiment Analysis at SEPLN, September 2016, pp. 29-33. ISSN 1613-0073

LABDA at the 2016 TASS challenge task: using word embeddings for the sentiment analysis task

LABDA en la competición TASS 2016: utilizando vectores de palabras para la tarea de análisis de sentimiento

Antonio Quirós1,2, Isabel Segura-Bedmar1, and Paloma Martínez1

1 Departamento de Informática, Universidad Carlos III de Madrid
Avd. de la Universidad, 30, 28911, Leganés, Madrid, España
100342879@alumnos.uc3m.es, isegura,pmf@inf.uc3m.es

2 Sngular Data&Analytics
Av. Llano Castellano 13, Planta 5, 28034 Madrid, España
antonio.quiros@sngular.team

Abstract: This paper describes the participation of the LABDA group in Task 1 (Sentiment Analysis at global level) of the TASS 2016 competition. Our approach represents tweets with word embeddings and classifies them with machine learning algorithms such as SVM and logistic regression.

Keywords: Sentiment Analysis, Word embeddings

1 Introduction

Knowing the opinion of customers or users has become a priority for companies and organizations in order to improve the quality of their services and products. The ongoing explosion of social media affords a significant opportunity to poll the opinion of many Internet users by processing their comments. However, it should be noted that sentiment analysis, which can be defined as the automatic analysis of opinion in texts (Pang and Lee, 2008), is a challenging task because it is not unusual for different people to assign different polarities to a given text. On Twitter, the task is even more difficult, because the texts are short (only 140 characters) and are characterized by their informal style, many grammatical errors and spelling mistakes, slang and vulgar vocabulary, and abbreviations.

Since their introduction in 2013, the TASS shared task editions have had as their main goal to promote the development of methods and resources for sentiment analysis of tweets in Spanish. This paper describes the participation of the LABDA group in Task 1 (Sentiment Analysis at global level). In this task, the participating systems have to determine the global polarity of each tweet in the test dataset. There are two different evaluations: one based on 6 different polarity labels (P+, P, NEU, N, N+, NONE) and another based on just 4 labels (P, N, NEU, NONE). A detailed description of the task can be found in the overview paper of TASS 2016 (García-Cumbreras et al., 2016). Our approach exploits word embedding representations for tweets and machine learning algorithms such as SVM and logistic regression. The word embedding model can yield a significant dimensionality reduction compared to the classical Bag-of-Words (BoW) model. The dimensionality reduction can have several positive effects on our algorithms, such as faster training, avoiding overfitting and better performance.

The paper is organized as follows. Section 2 describes our approach. The experimental results are presented and discussed in Section 3. We conclude in Section 4 with a summary of our findings and some directions for future work.
2 System

In this paper, we study the use of word embeddings (also known as word vectors) to represent tweets and then examine several machine learning algorithms to classify them. Word embeddings have shown promising results in NLP tasks such as named entity recognition (Segura-Bedmar, Suárez-Paniagua, and Martínez, 2015), relation extraction (Alam et al., 2016), sentiment analysis (Socher et al., 2013b) and parsing (Socher et al., 2013a). A word embedding is a function that maps words to low-dimensional vectors, which are learned from a large collection of texts. At present, neural networks are among the most widely used learning techniques for generating word embeddings (Mikolov and Dean, 2013). The essential assumption of this model is that semantically close words will have similar vectors (in terms of cosine similarity). Word embeddings can help to capture semantic and syntactic relationships between the corresponding words.

While the well-known Bag-of-Words (BoW) model involves a very large number of features (as many as the number of non-stopword words with at least a minimum number of occurrences in the training data), the word embedding representation allows a significant reduction in the feature set size (in our case, from on the order of a million features to just 300). The dimensionality reduction is a desirable goal, because it helps to avoid overfitting and reduces the training and classification times, without any performance loss.

As a preprocessing step, tweets must be cleaned. First, we remove all links and URLs. We then remove usernames, which can be easily recognized because their first character is the symbol @. We then transform hashtags into words by removing their first character (that is, the symbol #). Using regular expressions, the emoticons are detected and classified in order to count the number of positive and negative emoticons in each tweet, and then we remove them from the text. Table 1 shows the list of positive and negative emoticons, which were taken from the Wikipedia page https://en.wikipedia.org/wiki/List_of_emoticons. We convert the tweets to lowercase and replace misspelled accented letters with the correct one (for instance “à” with “á”). We also treat elongations (that is, the repetition of a character) by removing the repetition of a character after its second occurrence (for example, “hoooolaaaa” would be translated to “hola”). We then decided to take laughs into account (for instance “jajaja”), which turned out to be challenging because of the diverse ways they are expressed (e.g. expressions like “jajajaja” or “jejeje” and even misspelled ones like “jajjajaaj”). We addressed this by using regular expressions to standardize the different forms (e.g. “jajjjaaj” to “jajaja”) and then replacing them with the word “risas” (“laughs”). Finally, we remove all non-letter characters and all stopwords present in the tweets (we use the Snowball Spanish stopword list, http://snowball.tartarus.org/algorithms/spanish/stop.txt).

Orientation   Emoticons
Positive      :-), :), :D, :o), :], D:3, :c), :>, =], 8), =), :}, :^), :-D, 8-D, 8D, x-D, xD, X-D, XD, =-D, =D, =-3, =3, B^D, :'), :'), :*, :-*, :^*, ;-), ;), *-), *), ;-], ;], ;D, ;^), >:P, :-P, :P, X-P, x-p, xp, XP, :-p, :p, =p, :-b, :b
Negative      >:[, :-(, :(, :-c, :-<, :<, :-[, :[, :{, ;(, :-||, >:(, :'-(, :'(, D:<, D=, v.v

Table 1: List of positive and negative emoticons

Once the tweets are preprocessed, they are tokenized using the NLTK toolkit (a Python package for NLP); we also experimented with lemmatizing each tweet using the MeaningCloud text analytics software (https://www.meaningcloud.com/) in order to compare both approaches. Then, for each token, we look up its vector in the word embedding model. We use a pretrained model (Cardellino, 2016), which was generated using the word2vec algorithm (Mikolov and Dean, 2013) from a collection of Spanish texts with approximately 1.5 billion words. The dimension of the word embedding is 300. It should be noted that these texts were taken from different resources such as Spanish Wikipedia, WikiSource and Wikibooks, but none of them contains tweets. Therefore, it is possible that the main characteristics of social media texts (such as informal and noisy language, many grammatical errors and spelling mistakes, slang and vulgar vocabulary, abbreviations, etc.) are not correctly represented in this model. One of the main problems is that a significant number of words (almost 13 % of the vocabulary, representing 6 % of word occurrences) are not found in the model. We reviewed a small sample of these words and found that most of them were hashtags.
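The cleaning steps described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: the emoticon lists are small subsets of Table 1, the stopword set is a tiny stand-in for the Snowball list, and the laugh pattern and the function name `preprocess` are assumptions made for the example. Elongations are reduced to at most two repeated letters, following the rule as stated in the text (so “buenoooo” becomes “buenoo”); any further normalization is left out.

```python
import re

# Illustrative subsets of Table 1 (assumption: the real system uses the full lists).
POSITIVE_EMOTICONS = [":-)", ":)", ":D", "=)", ";)", "xD", "XD", ":P"]
NEGATIVE_EMOTICONS = [":-(", ":(", ":'(", ">:(", "D:<"]

# Tiny stand-in for the Snowball Spanish stopword list.
STOPWORDS = {"de", "la", "el", "que", "y", "a", "en", "es", "muy"}


def _emoticon_regex(emoticons):
    # Longest first, so that ":-)" is tried before ":)".
    ordered = sorted(emoticons, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))


POS_RE = _emoticon_regex(POSITIVE_EMOTICONS)
NEG_RE = _emoticon_regex(NEGATIVE_EMOTICONS)
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
GRAVE_TO_ACUTE = str.maketrans("àèìòù", "áéíóú")
ELONGATION_RE = re.compile(r"([a-záéíóúüñ])\1{2,}")
# Hypothetical laugh pattern: at least two "j + vowel" groups, e.g. "jajaja", "jajjaja".
LAUGH_RE = re.compile(r"\b(?:j+[aeiou]+){2,}j*\b")
NON_LETTER_RE = re.compile(r"[^a-záéíóúüñ\s]")


def preprocess(tweet, stopwords=STOPWORDS):
    """Return (tokens, positive emoticons, negative emoticons, laughs)."""
    text = URL_RE.sub(" ", tweet)            # remove links
    text = USER_RE.sub(" ", text)            # remove @usernames
    text = text.replace("#", "")             # hashtags become plain words
    pos_emo = len(POS_RE.findall(text))      # count emoticons before removing them
    neg_emo = len(NEG_RE.findall(text))
    text = NEG_RE.sub(" ", POS_RE.sub(" ", text))
    text = text.lower().translate(GRAVE_TO_ACUTE)
    text = ELONGATION_RE.sub(r"\1\1", text)  # keep at most two repeated letters
    text, laughs = LAUGH_RE.subn("risas", text)
    text = NON_LETTER_RE.sub(" ", text)
    tokens = [t for t in text.split() if t not in stopwords]
    return tokens, pos_emo, neg_emo, laughs


if __name__ == "__main__":
    print(preprocess("@user Qué buenoooo el concierto :) :D jajjaja #Madrid http://t.co/abc"))
```

Emoticons are counted before lowercasing because several of them (for example “:D” or “XD”) are case-sensitive; the rest of the order follows the description above.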
In our approach, a tweet of n tokens (T = w1, w2, ..., wn) is represented as the centroid of the word vectors $\vec{w}_i$ of its tokens, as shown in the following equation:

$$\vec{T} = \frac{1}{n}\sum_{i=1}^{n} \vec{w}_i = \frac{\sum_{j=1}^{N} \vec{w}_j \cdot TF(w_j, t)}{\sum_{j=1}^{N} TF(w_j, t)} \tag{1}$$

where N is the vocabulary size, that is, the total number of distinct words, while $TF(w_j, t)$ refers to the number of occurrences of the j-th vocabulary word in the tweet.

We also explore the effect of including the inverse document frequencies (IDF) to represent tweets (see Equation 2). This helps to increase the weight of words that occur often, but only in a few documents, while it reduces the relevance of words that occur very frequently in a larger number of texts.

$$\vec{T} = \frac{\sum_{j=1}^{N} \vec{w}_j \cdot TF(w_j, t) \cdot IDF(w_j)}{\sum_{j=1}^{N} TF(w_j, t) \cdot IDF(w_j)} \tag{2}$$

with $IDF(w_j) = \log \frac{|D|}{|\{tw \in D : w_j \in tw\}|}$, where |D| refers to the number of tweets.

In addition to using the centroid, we assess the impact of complementing the tweet model with the following additional features:

- posWords: number of positive words present in the tweet.
- negWords: number of negative words present in the tweet.
- posEmo: number of positive emoticons present in the tweet.
- negEmo: number of negative emoticons present in the tweet.

For the posWords and negWords features we used the iSOL lexicon (Molina-González et al., 2013), a list composed of 2,509 positive words and 5,626 negative words. As described before, for the emoticons we used those listed in Table 1, but we also added the number of laughs detected to the positive ones; in addition, we included the number of recommendations present in the form of a “Follow Friday” hashtag (#FF), due to its ease of detection and its positive bias.

Classification is performed using scikit-learn, a Python module for machine learning. This package provides many algorithms such as Random Forest, Support Vector Machine (SVM) and so on. One of its main advantages is that it is supported by extensive documentation. Moreover, it is robust, fast and easy to use.

As stated before, we have two main training models: the averaged centroids and the averaged centroids including the inverse document frequency, for both the lemmatized and non-lemmatized texts. We performed experiments using three different classifiers: Random Forests, Support Vector Machines and Logistic Regression, because these classifiers have often achieved the best results for text classification and sentiment analysis.

We also evaluated the impact of applying a set of emoticon rules as a pre-classification stage, similar to Chikersal et al. (2015), in which we determine a first-stage polarity for each tweet as follows:

- If posEmo is greater than zero and negEmo is equal to zero, the tweet is marked as “P”.
- If negEmo is greater than zero and posEmo is equal to zero, the tweet is marked as “N”.
- If both posEmo and negEmo are greater than zero, the tweet is marked as “NEU”.
- If both posEmo and negEmo are equal to zero, the tweet is marked as “NONE”.
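The tweet representation of Equations 1 and 2 and the emoticon rules listed above can be sketched in plain Python as follows. This is a hedged illustration rather than the system's implementation: the function names, the toy two-dimensional embeddings and the decision to skip tokens without a vector are assumptions made for the example (the paper uses a 300-dimensional pretrained word2vec model and does not state how out-of-vocabulary tokens enter the average).

```python
import math
from collections import Counter


def centroid(tokens, embeddings):
    """Equation 1: mean of the vectors of the tokens found in the model."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no token of the tweet has a vector
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def idf_table(tokenized_tweets):
    """IDF(w) = log(|D| / number of tweets containing w)."""
    n_docs = len(tokenized_tweets)
    doc_freq = Counter(w for tweet in tokenized_tweets for w in set(tweet))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def idf_weighted_centroid(tokens, embeddings, idf):
    """Equation 2: centroid weighted by TF * IDF."""
    counts = Counter(t for t in tokens if t in embeddings and t in idf)
    total = sum(tf * idf[w] for w, tf in counts.items())
    if total == 0:
        return None  # every usable word has zero weight
    dim = len(embeddings[next(iter(counts))])
    return [
        sum(embeddings[w][k] * tf * idf[w] for w, tf in counts.items()) / total
        for k in range(dim)
    ]


def first_stage_polarity(pos_emo, neg_emo):
    """Emoticon rules applied before (or alongside) the classifier."""
    if pos_emo > 0 and neg_emo == 0:
        return "P"
    if neg_emo > 0 and pos_emo == 0:
        return "N"
    if pos_emo > 0 and neg_emo > 0:
        return "NEU"
    return "NONE"


if __name__ == "__main__":
    # Toy 2-dimensional vectors, purely illustrative.
    toy = {"bueno": [1.0, 0.0], "malo": [-1.0, 0.0], "concierto": [0.0, 1.0]}
    corpus = [["bueno", "concierto"], ["malo", "concierto"]]
    idf = idf_table(corpus)
    print(centroid(corpus[0], toy))
    print(idf_weighted_centroid(corpus[0], toy, idf))
    print(first_stage_polarity(2, 0))
```

In the toy corpus, “concierto” appears in every tweet, so its IDF is zero and it drops out of the weighted centroid; this is the down-weighting of very frequent words that Equation 2 is meant to achieve.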
Then, after the classification takes place, we ran three tests: i) applying no rule; ii) honoring the polarity defined by the rule, which means that we keep the predefined polarity if the tweet was marked as “P” or “N”, and otherwise take the value estimated by the classifier; and iii) a mixed approach in which we give each polarity a value (N+: -2; N: -1; NEU, NONE: 0; P: 1; P+: 2) and compute the arithmetic sum of the predefined and the estimated polarity if and only if they are not equal. For instance, if the classifier marked a tweet as “N” and the rules marked it as “P”, the tweet will be classified as “NEU”.

3 Results

In order to choose the best-performing classifiers, we use 10-fold cross-validation because there is no development dataset and this strategy has become the standard method in practical terms. Our experiments showed that, although the results were similar (non-lemmatized text performed better in all settings, hence the best settings reported here use the non-lemmatized model), the best settings for the 5-levels task are:

- RUN-1: Support Vector Machine, over the averaged centroids without applying any rules for pre-defining polarities.
- RUN-2: Support Vector Machine, over the averaged centroids and applying the mixed rules approach.
- RUN-3: Logistic Regression, over the centroids with inverse document frequency and applying the mixed rules approach.

and for the 3-levels task are:

- RUN-1: Support Vector Machine, over the averaged centroids and applying the mixed rules approach.
- RUN-2: Logistic Regression, over the centroids with inverse document frequency and applying the mixed rules approach.
- RUN-3: Logistic Regression, over the averaged centroids and applying the mixed rules approach.

Tables 2 and 3 show the results for these settings provided by the TASS submission system. For each run, accuracy is provided as well as the macro-averaged precision, recall and F1-measure. As expected, the results for 3 levels are higher than for 5 levels because the training dataset is larger.

Run     P      R      F1     Acc
RUN-1   0.411  0.449  0.429  0.527
RUN-2   0.412  0.448  0.429  0.527
RUN-3   0.402  0.436  0.418  0.549

Table 2: Results for Sentiment Analysis at global level (5 levels, Full test corpus)

Run     P      R      F1     Acc
RUN-1   0.506  0.510  0.508  0.652
RUN-2   0.508  0.508  0.508  0.652
RUN-3   0.512  0.511  0.511  0.653

Table 3: Results for Sentiment Analysis at global level (3 levels, Full test corpus)

With the settings mentioned above, the obtained results are extremely similar, but we can state that, in terms of accuracy, Logistic Regression reports the best results. Although it was not measured in this work, it is worth mentioning that Logistic Regression was also observably faster.

4 Conclusions and future work

This paper explores the use of word embeddings for the task of sentiment analysis. Instead of using the bag-of-words model to represent tweets, these are represented as word vectors taken from a pre-trained model of word embeddings. An important advantage of the word embedding model compared to the bag-of-words representation is that it achieves a significant dimensionality reduction of the feature set needed to represent tweets and therefore leads to a reduction of the training and testing time of the algorithms.

In order to use word embedding models properly, a preprocessing stage had to be completed before training a classifier. Due to the unstructured nature of tweets, this preprocessing proved to be a very important step in order to standardize the input data to some degree. The experimentation showed that the three tested classifiers obtained very similar results, with Random Forest performing slightly worse and Logistic Regression being slightly better and much faster.
One of the main drawbacks of our approach is that many words do not have a word vector in the word embedding model used for our experiments. An analysis showed that many of these words come from hashtags, which are usually short phrases. Therefore, we should apply a more sophisticated method in order to extract the words forming a hashtag.

As future work, we also plan to use a word embedding model trained on a collection of texts from Spanish social media. We think that this will have a positive effect on the performance of our system in identifying the polarity of tweets, because this model will be generated from documents characterized by the main features that describe social media texts (for example, informal style, many grammatical errors and spelling mistakes, slang and vulgar vocabulary).

Acknowledgments

This work was supported by the eGovernAbility-Access project (TIN2014-52665-C2-2-R).

References

Alam, F., A. Corazza, A. Lavelli, and R. Zanoli. 2016. A knowledge-poor approach to chemical-disease relation extraction. Database, 2016:baw071.

Cardellino, C. 2016. Spanish Billion Words Corpus and Embeddings, March.

Chikersal, P., S. Poria, E. Cambria, A. Gelbukh, and C. E. Siong. 2015. Modelling public sentiment in Twitter: using linguistic patterns to enhance supervised learning. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 49–65. Springer.

García-Cumbreras, M. A., J. Villena-Román, E. Martínez-Cámara, M. C. Díaz-Galiano, M. T. Martín-Valdivia, and L. A. Ureña López. 2016. Overview of TASS 2016. In Proceedings of TASS 2016: Workshop on Sentiment Analysis at SEPLN co-located with the 32nd SEPLN Conference (SEPLN 2016), Salamanca, Spain, September.

Mikolov, T. and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems.

Molina-González, M. D., E. Martínez-Cámara, M.-T. Martín-Valdivia, and J. M. Perea-Ortega. 2013. Semantic orientation for polarity classification in Spanish reviews. Expert Systems with Applications, 40(18):7250–7257.

Pang, B. and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1-2):1–135.

Segura-Bedmar, I., V. Suárez-Paniagua, and P. Martínez. 2015. Exploring word embedding for drug name recognition. In Sixth International Workshop on Health Text Mining and Information Analysis (LOUHI), page 64.

Socher, R., J. Bauer, C. D. Manning, and A. Y. Ng. 2013a. Parsing with compositional vector grammars. In ACL (1), pages 455–465.

Socher, R., A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), volume 1631, page 1642. Citeseer.