Word Embeddings in Sentiment Analysis

Ruggero Petrolito (Università di Pisa)
ruggero.petrolito@gmail.com

Felice Dell'Orletta (Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC–CNR), ItaliaNLP Lab - www.italianlp.it)
felice.dellorletta@ilc.cnr.it

Abstract

English. In recent years sentiment analysis and its applications have gained growing popularity. In this field of research, machine learning and the word representation learning methods derived from distributional semantics (i.e. word embeddings) have lately proven very successful in performing sentiment analysis tasks. In this paper we describe a set of experiments aimed at evaluating the impact of word embedding-based features in sentiment analysis tasks.

Italiano. Recently, sentiment analysis and its applications have gained ever-increasing popularity. In this research area, machine learning and the word representation methods derived from distributional semantics (specifically, word embeddings) have in recent years proven very effective in carrying out the various tasks connected with sentiment analysis. In this paper we describe a series of experiments conducted with the goal of evaluating the impact of using word embedding-based features in the various sentiment analysis tasks.

1 Introduction

In recent years sentiment analysis has gained great popularity among NLP tasks. As reported by Mäntylä et al. (2016), the number of papers on this subject has increased significantly in the first two decades of the 21st century, as has the range of its applications. A wide variety of technologies has been used to address sentiment analysis tasks during this period. Lately, machine learning techniques have proved to be very effective; in particular, in recent years systems based on deep learning techniques represent the state of the art. In this field, word embeddings have been widely used as a way of representing words in sentiment analysis tasks, and have proved to be very effective.

A relevant mirror of the state of the art in sentiment analysis can be found in the SemEval workshops. In the 2015 edition (Rosenthal et al., 2015), most participants used machine learning techniques; in many of the subtasks, the top ranking systems used deep learning methods and word embeddings, like the system submitted by Severyn and Moschitti (2015), which ranked 1st in subtask A and 2nd in subtask B. In the 2016 edition (Nakov et al., 2016), deep learning techniques, such as convolutional neural networks and recurrent neural networks, were the most popular approach. In the 2017 edition (Rosenthal et al., 2017), machine learning methods were very popular, especially support vector machines and deep neural networks such as convolutional neural networks and long short-term memory networks.

Concerning the Italian language, the EVALITA campaign well represents the state of the art in natural language processing. In the 2016 edition (Barbieri et al., 2016), the top ranking systems used machine learning and deep learning techniques (Castellucci et al. (2016), Attardi et al. (2016), Di Rosa and Durante (2016)).

The purpose of this study is to explore ways of using word embeddings to build meaningful representations of documents in sentiment analysis tasks performed on Italian tweets.

2 Our Contribution

In this paper we aim to evaluate the effect of exploiting word embeddings in sentiment analysis tasks. In particular, we explore the effect of five factors on the performance of a sentiment analysis classification system, to answer five research questions:

1. What is the effect of the size of the corpus used to train the embeddings?

2. Which text domain allows us to train better embeddings (in-domain vs out-of-domain data)?

3. Which type of learning method produces better embeddings (word-based vs character-based word embeddings)?

4. Which method of combining the word vectors produces a better document vector representation?

5. What are the most important words (in terms of part of speech) for producing a better document vector representation?

To answer these questions, we performed several classification experiments, testing our system on the three sentiment analysis tasks proposed in the 2016 EVALITA SENTIPOLC campaign (Barbieri et al., 2016): Subjectivity Classification, Polarity Classification and Irony Detection. In the first of these tasks, the highest accuracy was achieved by the system of Castellucci et al. (2016). Concerning the second task, the most accurate system was the one submitted by Attardi et al. (2016). Regarding the third task, the highest accuracy was reached by the system of Di Rosa and Durante (2016). Among these systems, Castellucci et al. (2016) and Attardi et al. (2016) use deep learning techniques (convolutional neural networks), while Di Rosa and Durante (2016) use an ensemble of many supervised learning classifiers.

3 Datasets

We tested our system on the three sentiment analysis tasks proposed in the 2016 EVALITA SENTIPOLC campaign. These tasks and the related datasets have been described by Barbieri et al. (2016). We conducted our experiments on the training set provided by the organizers of the evaluation campaign, which is composed of 7921 tweets.

We train our word embeddings on two corpora: one in-domain and one out-of-domain. The in-domain dataset is a collection of tweets that we collected for this work, named Tweets. It is composed of almost 80 million tweets, amounting to around 1.2 billion tokens. The out-of-domain dataset is the Paisà corpus, a collection of Italian web texts described by Lyding et al. (2013).

4 Experimental Setup

For our experiments, we used an SVM classifier based on the LIBLINEAR machine learning library (Fan et al., 2008). As features, the classifier uses only information extracted by combining the word embeddings of the words of the analyzed tweet.

In all the experiments described in this paper, our system addresses the classification tasks by performing 5-fold cross-validation on the training set provided for the SENTIPOLC 2016 evaluation campaign; the final score is the average over the folds. We evaluate each fold using the Average F-score described by Barbieri et al. (2016).
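The following is a minimal sketch of this setup, not the authors' implementation: it assumes the tweet vectors and gold labels have already been built, uses scikit-learn's LinearSVC (which is backed by LIBLINEAR), and uses a macro-averaged F1 over the folds as a stand-in for the official SENTIPOLC Average F-score.

```python
# Minimal sketch of the classification setup (not the authors' code):
# a LIBLINEAR-backed linear SVM evaluated with 5-fold cross-validation,
# averaging the per-fold scores. `doc_vectors` is an (n_tweets, dim)
# array of document vectors, `labels` the gold classes, both assumed
# to be already built.
import numpy as np
from sklearn.svm import LinearSVC                 # LinearSVC wraps LIBLINEAR
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def average_f_score_cv(doc_vectors, labels, n_folds=5):
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(doc_vectors, labels):
        clf = LinearSVC().fit(doc_vectors[train_idx], labels[train_idx])
        predicted = clf.predict(doc_vectors[test_idx])
        # Macro F1 stands in for the official SENTIPOLC Average F-score.
        scores.append(f1_score(labels[test_idx], predicted, average="macro"))
    return np.mean(scores)   # the final score is the average over the folds
```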
For what concerns the word embeddings, we trained two types of word embedding representations: i) the first using the word2vec¹ toolkit (Mikolov et al., 2013); this tool learns low-dimensional word embeddings represented by a set of latent (hidden) variables, with each word associated to a multidimensional vector that represents a specific instantiation of these variables; ii) the second using fastText (Bojanowski et al., 2016), a library for the efficient learning of word representations and sentence classification. This library makes it possible to overcome the out-of-vocabulary problem that affects word2vec. Generating embeddings for out-of-vocabulary words is a typical issue for morphologically rich languages with large vocabularies and many rare words. FastText overcomes this limitation by representing each word as a bag of character n-grams: a vector representation is associated to each character n-gram, and the word is represented as the sum of these character n-gram representations.

In both cases, each word is represented by a 100-dimensional vector, computed using the CBOW algorithm – which learns to predict the word in the middle of a symmetric window based on the sum of the vector representations of the words in the window – with a context window of 5 words.

¹http://code.google.com/p/word2vec/
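As an illustration of these settings, the sketch below trains both models via gensim (version 4.0 or later is assumed) rather than the original command-line tools; `tweets` stands for an iterable of tokenized tweets, which is an assumption of this sketch.

```python
# Illustrative sketch of the embedding training settings reported above
# (CBOW, 100 dimensions, context window of 5), here via gensim rather
# than the original tools. `tweets` is assumed to be an iterable of
# tokenized tweets, e.g. [["oggi", "piove"], ...].
from gensim.models import Word2Vec, FastText

def train_embedding_models(tweets):
    # sg=0 selects the CBOW algorithm in both models.
    w2v = Word2Vec(sentences=tweets, vector_size=100, window=5, sg=0)
    # fastText additionally learns vectors for character n-grams, so it can
    # compose a vector for out-of-vocabulary words from their subword units.
    ft = FastText(sentences=tweets, vector_size=100, window=5, sg=0)
    return w2v, ft
```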
5 Experiments and Results

To answer the questions listed in Section 2, we conducted a large number of experiments, testing many ways of representing the tweets by exploiting the word embeddings of their words in different manners.

To evaluate the impact of each studied parameter on classification accuracy, we report, for each value of that parameter, the average accuracy across all the classification experiments conducted by varying all the other parameters (in a 5-fold cross-validation scenario).

In all the experiments, we used only features based on word embeddings.

5.1 Size of the Embeddings Training Corpus

To answer research question 1, we trained several word embedding models on partitions of the Tweets corpus of increasing size, using both word2vec and fastText. Ten smaller partitions were obtained starting with ten million tokens (for the smallest one) and adding another ten million for each new partition, up to 100 million tokens. We created four further, bigger partitions, containing respectively 240, 480, 720 and 960 million tokens; the size of the smallest of these four partitions is comparable to the size of Paisà.
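A hypothetical sketch of how such cumulative partitions could be carved out of the corpus is shown below; `corpus_tweets` is an assumed generator function standing in for the actual tweet stream, and is not part of the paper.

```python
# Hypothetical sketch of building cumulative partitions of increasing
# token counts. `corpus_tweets` stands in for a generator function
# yielding one tokenized tweet at a time, always from the corpus start.
MILLION = 1_000_000
PARTITION_SIZES = [10 * MILLION * i for i in range(1, 11)] \
                + [240 * MILLION, 480 * MILLION, 720 * MILLION, 960 * MILLION]

def take_partition(corpus_tweets, n_tokens):
    """Collect tweets from the start of the corpus up to ~n_tokens tokens."""
    partition, count = [], 0
    for tweet in corpus_tweets():
        partition.append(tweet)
        count += len(tweet)
        if count >= n_tokens:
            break
    return partition

# partitions = {size: take_partition(corpus_tweets, size)
#               for size in PARTITION_SIZES}
```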
Figure 1 reports the results. When we use embeddings trained with word2vec on increasing amounts of data, the average F-score grows for all three subtasks. The amount of this growth is similar for Subjectivity Classification (0.016) and Polarity Classification (0.019), while it is smaller for Irony Detection, which is the most challenging of the three. In all cases the increase is significantly faster over the first 80 to 100 million tokens, particularly for the Irony Detection task: in this case, the average F-score essentially stops growing after around 80 million tokens.

When we use embeddings trained with fastText, the outcome is the opposite: the average F-score decreases as bigger amounts of data are used to train the embeddings. The decrease is faster over the first hundreds of millions of tokens.

[Figure 1: Average F-scores obtained by using embeddings trained on increasing amounts of tokens, using word2vec (circles) and fastText (crosses). Blue marks Subjectivity Classification, red Polarity Classification and green Irony Detection. x-axis: millions of tokens (0 to 1,200); y-axis: average F-score (0.47 to 0.59).]

Lesson learned: these results suggest that, for word-based word embeddings, accuracy rises as the training corpus grows, but stabilizes quickly. On the other hand, increasing the size of the training corpus apparently does not improve accuracy when the embeddings are produced with fastText (it may even lower it).

5.2 Domain of the Embeddings Training Corpus

To answer research question 2, we ran a set of experiments using the four models obtained by running word2vec and fastText on the Paisà and Tweets corpora. Table 1 reports the results. As we can see, the embeddings trained with word2vec on the in-domain dataset (Tweets) provide features that achieve a higher average accuracy than the features extracted from the out-of-domain corpus. By contrast, there is hardly any variation in accuracy when the embeddings are trained with fastText.

        Subj.            Pol.             Iro.
        w2v     ft       w2v     ft       w2v     ft
tw      0.5901  0.5198   0.592   0.5384   0.4837  0.4776
pa      0.572   0.5206   0.5693  0.5312   0.4793  0.4759

Table 1: Average F-scores obtained by using word embeddings trained on the Twitter (tw) and Paisà (pa) corpora.

Lesson learned: in-domain word embeddings are very important in a semantic classification scenario. Apparently, this does not hold when character-based word embeddings are used.

5.3 Type of Embeddings Learning Model

As regards research question 3, the type of embeddings learning model (words vs character n-grams) considerably influences the performance of the classifier. Using embeddings trained with word2vec leads to F-score values that are significantly higher than those obtained using embeddings trained with fastText (see Table 1).

Lesson learned: this outcome suggests that embeddings learned by methods that treat words as atomic entities provide features that are more useful in a semantic task such as sentiment classification than character-based embeddings.

5.4 Methods to Combine Word Embeddings

To answer research question 4, we tested several methods of combining the embeddings of the words of each document into a document-level vector representation. We experimented with five combining methods: Sum, Mean, Maximum-pooling, Minimum-pooling and Product. Each of these methods returns a single vector t = (t_1, ..., t_d), where each component t_j is obtained by combining the j-th components w_{1j}, w_{2j}, ..., w_{nj} of the embeddings of the n words of the tweet. Figure 2 shows a graphical representation of this process.

[Figure 2: Embeddings combination process. The n word embeddings of a tweet form an n × d matrix with rows w_1, ..., w_n; each combination method collapses this matrix column-wise into a single document vector t = (t_1, ..., t_d).]

We tested these methods separately, and all of them jointly as well. When using all methods jointly, the document representation is obtained by concatenating the vectors returned by each method.
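The sketch below gives a compact rendering of the five combiners and of the joint concatenation; `embeddings` is assumed to be the n × d matrix of a tweet's word vectors, and the function names are illustrative rather than taken from the paper.

```python
# Compact sketch of the five combination strategies: each collapses the
# (n_words, d) matrix of a tweet's word embeddings into one d-dimensional
# vector; "all" concatenates the outputs of the five combiners (5 * d).
import numpy as np

COMBINERS = {
    "sum":  lambda m: m.sum(axis=0),
    "mean": lambda m: m.mean(axis=0),
    "max":  lambda m: m.max(axis=0),    # component-wise maximum pooling
    "min":  lambda m: m.min(axis=0),    # component-wise minimum pooling
    "prod": lambda m: m.prod(axis=0),
}

def document_vector(embeddings, method="sum"):
    if method == "all":
        return np.concatenate([f(embeddings) for f in COMBINERS.values()])
    return COMBINERS[method](embeddings)
```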
        Subj.            Pol.             Iro.
        w2v     ft       w2v     ft       w2v     ft
Sum     0.6054  0.534    0.6085  0.5532   0.4887  0.5033
Mean    0.6017  0.5951   0.5954  0.5916   0.4709  0.4811
Max     0.5957  0.5012   0.5964  0.507    0.4736  0.4698
Min     0.593   0.5012   0.5951  0.5011   0.4754  0.4707
Prod    0.4415  0.4759   0.4384  0.5012   0.4693  0.4628
All     0.6236  0.4846   0.6246  0.51     0.5202  0.4715

Table 2: Average F-scores obtained by using different strategies for combining word embeddings. Bold black values are the best F-scores overall; blue bold values are the best F-scores obtained by a single combination method in the word-based word embeddings scenario (w2v); red bold values are the best F-scores in the character-based word embeddings scenario (ft).

As we can see in Table 2, Sum proved to be the best method for all tasks when using embeddings obtained with word2vec; the best results overall are obtained by concatenating the vectors returned by all the methods (row All in the table). When using embeddings trained with fastText, the best results are obtained with Mean for Subjectivity and Polarity Classification, and with Sum for Irony Detection; in this case, the concatenation of all vectors leads to poor results. Meanwhile, the worst approach is the Product combination. Interestingly, while the concatenation of all the combined vectors is clearly the best approach for producing the document-level representation with word-based word embeddings, this is not true for the character-based ones.

Lesson learned: these outcomes suggest that the best combination method is Sum for vectors obtained with word-based word embeddings and Mean for character-based ones.

5.5 Selection of Morpho-syntactic Categories of Combined Word Embeddings

To answer research question 5, we ran a set of experiments using only a subset of the word embeddings of each document to produce the document vector representation. The word selection is guided by the morpho-syntactic categories of the words. We tested four categories: noun, verb, adjective and adverb. The embeddings of the words belonging to each of these categories were combined into a POS-based document vector. In addition, we tested the document representation obtained by concatenating the different POS-based vectors (N, V, Adj, Adv), with and without the all-word document vector (All words), which is the only one taking emoticons and hashtags into account.
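By way of illustration, the sketch below selects and sums the vectors of one morpho-syntactic category and concatenates the four POS-based vectors; the tag names, the `lookup` function and the zero-filling of empty categories are assumptions of this sketch, not details taken from the paper.

```python
# Illustrative sketch of the POS-based selection; the tag set, the
# `lookup` function (word -> embedding, or None if unknown) and the
# zero-filling of empty categories are assumptions, not the paper's code.
import numpy as np

def pos_based_vector(tagged_tweet, lookup, pos):
    """Sum the embeddings of the words tagged with the given POS.
    tagged_tweet: list of (token, pos_tag) pairs from a POS tagger."""
    vectors = [lookup(tok) for tok, tag in tagged_tweet if tag == pos]
    vectors = [v for v in vectors if v is not None]
    return np.sum(vectors, axis=0) if vectors else None

def concatenated_pos_vector(tagged_tweet, lookup, dim=100):
    """Concatenate the four POS-based vectors (N, V, Adj, Adv)."""
    parts = []
    for pos in ("NOUN", "VERB", "ADJ", "ADV"):
        v = pos_based_vector(tagged_tweet, lookup, pos)
        parts.append(v if v is not None else np.zeros(dim))
    return np.concatenate(parts)
```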
Table 3 reports the results of the experiments. In the word-based word embedding scenario, among the single morpho-syntactic categories the noun class shows the highest performance. Overall, the highest score is yielded by the combination of all the selected categories concatenated with the combined vector of all the word embeddings (row All words, N, V, Adj, Adv in the table). As regards the character-based word embeddings, the noun is the individually best performing category only for the Subjectivity Classification task, while the adjective and the verb are the best performing categories for the other two tasks (Polarity Classification and Irony Detection, respectively).

                           Subj.            Pol.             Iro.
                           w2v     ft       w2v     ft       w2v     ft
N                          0.553   0.5171   0.5417  0.5091   0.4725  0.4749
V                          0.4755  0.4778   0.5091  0.5136   0.469   0.4897
Adj                        0.4406  0.4534   0.5184  0.5335   0.4705  0.4826
Adv                        0.4397  0.4504   0.4971  0.5033   0.4702  0.485
N, V, Adj, Adv             0.6266  0.5578   0.6141  0.5667   0.4948  0.5041
All words                  0.6251  0.5363   0.5941  0.515    0.4773  0.4521
All words, N               0.6287  0.5221   0.6032  0.5343   0.4887  0.4646
All words, V               0.6326  0.5276   0.6035  0.5339   0.4841  0.4634
All words, Adj             0.6374  0.5328   0.6185  0.5184   0.4867  0.4693
All words, Adv             0.6337  0.5243   0.6087  0.5187   0.4856  0.4674
All words, N, V, Adj, Adv  0.6521  0.5691   0.6319  0.5546   0.5139  0.4886

Table 3: Average F-scores obtained using the embeddings of words belonging to different morpho-syntactic classes. Bold black values are the best F-scores overall; blue bold values are the best F-scores obtained using a single grammatical class in the word-based word embeddings scenario (w2v); red bold values are the best F-scores obtained using a single grammatical class in the character-based word embeddings scenario (ft).

Lesson learned: these results show that the noun class is the most important grammatical category only in the word-based word embedding scenario; in both scenarios, the concatenation of all the POS-based vectors with the All words vector yields the best accuracy.

6 Conclusions

In this work we studied the impact of word embedding-based features in sentiment analysis tasks. We performed several classification experiments to investigate the effects on classification performance of five dimensions related to the word embeddings. We tested several different ways of selecting and combining the embeddings and studied how the performance of a sentiment classifier changes accordingly.

Despite the lessons learned from this work, several aspects remain to be investigated, such as the tuning of the parameters used to train the embeddings and new vector combination strategies.

References

Giuseppe Attardi, Daniele Sartiano, Chiara Alzetta and Federica Semplici. 2016. Convolutional Neural Networks for Sentiment Analysis on Italian Tweets. CLiC-it/EVALITA.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli and Viviana Patti. 2016. Overview of the EVALITA 2016 SENTIment POLarity Classification Task. Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).

Yoshua Bengio, Réjean Ducharme, Pascal Vincent and Christian Jauvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155.

Piotr Bojanowski, Edouard Grave, Armand Joulin and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR abs/1607.04606.

Giuseppe Castellucci, Danilo Croce and Roberto Basili. 2016. Context-aware Convolutional Neural Networks for Twitter Sentiment Analysis in Italian. CLiC-it/EVALITA.

Emanuele Di Rosa and Alberto Durante. 2016. Tweet2Check Evaluation at EVALITA SENTIPOLC 2016. CLiC-it/EVALITA.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874.

Verena Lyding, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell'Orletta, Henrik Dittmann, Alessandro Lenci and Vito Pirrelli. 2013. PAISÀ Corpus of Italian Web Text. Institute for Applied Linguistics, Eurac Research.

Mika V. Mäntylä, Daniel Graziotin and Miikka Kuutila. 2016. The Evolution of Sentiment Analysis – A Review of Research Topics, Venues, and Top Cited Papers. Computer Science Review, 27:16–32.

Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment Analysis in Twitter. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016).

Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif Mohammad, Alan Ritter and Veselin Stoyanov. 2015. SemEval-2015 Task 10: Sentiment Analysis in Twitter. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 451–463.

Sara Rosenthal, Noura Farra and Preslav Nakov. 2017. SemEval-2017 Task 4: Sentiment Analysis in Twitter. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 502–518.

Aliaksei Severyn and Alessandro Moschitti. 2015. UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 464–469.