TASS 2015, September 2015, pp. 71-74. Received 20-07-15, revised 24-07-15, accepted 29-07-15.

BittenPotato: Tweet sentiment analysis by combining multiple classifiers

Iosu Mendizabal Borda
(IIIA) Artificial Intelligence Research Institute
(CSIC) Spanish Council for Scientific Research
iosu@iiia.csic.es

Jeroni Carandell Saladich
(UPC) Universitat Politecnica de Catalunya
(URV) Universitat Rovira i Virgili
(UB) Universitat de Barcelona
jeroni.carandell@gmail.com

Resumen: In this article, we use a bag of words over n-grams to build a dictionary of the most frequent features in a dataset. We then apply four different classifiers and combine their results through several techniques, aiming to recover the true polarity of each sentence by extracting the sentiment it contains.
Palabras clave: Sentiment analysis, natural language processing.

Abstract: In this paper, we use a bag-of-words of n-grams to build a dictionary containing the most used "words", which we use as features. We then classify with four different classifiers and combine their results by applying a simple vote, a weighted vote, and a further classifier to obtain the real polarity of a phrase.
Keywords: Tweet sentiment analysis, natural language processing.

Published at http://ceur-ws.org/Vol-1397/. CEUR-WS.org is a serial publication with recognized ISSN 1613-0073.

1 Introduction and objectives

Sentiment analysis is the branch of natural language processing used to determine the subjective polarity of a text. It has many applications, ranging from gauging the popularity of a product to measuring the general opinion about an event or a politician, among many others. In the particular case of Twitter texts, these have the misfortune, or great advantage, of consisting of at most 140 characters. The disadvantage is that short texts are not described very accurately by the bag-of-words model we use; on the other hand, the limit also forces the author of a tweet to state an opinion concisely, so noise and irrelevant statements are usually left out.

In this workshop on sentiment analysis in Spanish, a data set of tweets tagged according to their sentiment is provided, along with a description of the evaluation measures and of the different tasks (Villena-Román et al., 2015).

The rest of the article is laid out as follows: Section 2 introduces the architecture and components of the system, namely the pre-processing, the extraction of features, the algorithms used, and the process applied to their results to obtain our final tag. Section 3 analyses the results obtained in the workshop. Finally, in Section 4 we draw some conclusions and propose future work.

2 Architecture and components of the system

Our system contains four main phases: data pre-processing; feature extraction (vectorization); the use of classifiers, from which we extract a new set of features; and finally a combined classifier which uses the latter to predict the polarity of the text.

2.1 Pre-processing

This step, crucial to any natural language processing task, consists of removing noise from the text. Many of the steps, such as the removal of URLs, emails, punctuation, emoticons, spaced-out words, etc., are general and we will not dwell on them; others are more particular to the language, such as the removal of letters repeated more than twice in Spanish.
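A minimal sketch of such a cleaning step is shown below; the concrete patterns (URL, mention, and repeated-letter rules) are illustrative assumptions, not the exact rules used in our system:

```python
import re

def preprocess(tweet: str) -> str:
    """Illustrative tweet cleaning; the exact rules are assumptions."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)       # remove URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)       # remove email addresses
    text = re.sub(r"[@#]\w+", " ", text)            # remove mentions/hashtags
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)     # "goooool" -> "gool"
    text = re.sub(r"[^\w\s]", " ", text)            # strip punctuation/emoticons
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

print(preprocess("Me gustaaaa!!! http://t.co/xyz @amigo"))  # -> "me gustaa"
```

Collapsing a repeated letter to two copies, rather than one, preserves legitimate Spanish double letters such as "ll" and "rr".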
2.2 Vectorization: Bag of words

In order to apply a classifier, we need to turn each tweet into a vector with the same features. One of the most common approaches is the bag-of-words model, which, given a corpus of documents, finds the N most relevant words (or n-grams, in our case). Each feature therefore represents the occurrence of a different relevant "word". Although the relevance of a word can be defined as the number of times it appears in the text, this has the disadvantage of favouring words that appear throughout the whole corpus yet lack semantic relevance. To counter this effect, the more sophisticated tf-idf (term frequency - inverse document frequency) weighting is used. In our project we used the Scikit-Learn TfidfVectorizer (Pedregosa et al., 2011) to convert each tweet into a feature vector of length N.

2.3 Algorithms: Classifiers

Once we have a way of converting sentences into a representation with the same features, we can use any classification algorithm. For all of the following algorithms we used the implementations in the Scikit-Learn Python package (Pedregosa et al., 2011).

2.3.1 SVM

The first simple method we use is a support vector machine with a linear kernel. It is generally the most promising in terms of all the measures used, both with the complete and the reduced number of classes.

2.3.2 AdaBoost (ADA)

AdaBoost is also simple and easy to train, since the only relevant parameter is the number of rounds, and it has a strong theoretical basis assuring that the training error will be reduced. However, this is only guaranteed with enough data (Freund and Schapire, 1999); given the large number of features (5000) compared to the number of training instances (around 4000, because of the cross-validation on the training data that we use for testing), it is the worst performing method, as can be seen in Tables 1 and 2.

2.3.3 Random Forest (RF)

We decided to use this ensemble method as well because it has shown very positive results, with accuracies that at times surpass AdaBoost's, thanks to its robustness to noise and outliers (Breiman, 2001).

2.3.4 Linear Regression (LR)

Since the degrees of sentiment polarity are ordered, we considered it appropriate to also treat the problem as a discrete regression problem. Although a very straightforward approach, it gives the second best results in general, at times surpassing the SVM (Tables 1 and 2).

2.4 Result: Combining classifiers

After computing the confusion matrices of the classifiers used, we concluded that certain algorithms were better at capturing some classes than others. These confusion matrices can be observed in Section 3. For this reason, we decided to combine the results of the different classifiers to obtain more accurate predictions. In other words, we use the results of the single classifiers as an encoding of the tweet in a lower dimension. We can interpret each single classifier as an expert that gives its diagnosis, or opinion, about the sentiment of a tweet. Since these experts can be mistaken and can disagree, we have to find the best result by combining them.

We tried three different combining methods. The first is a simple vote over the classifiers' results, in which the most repeated label wins; in case of a draw, one of the tied labels is chosen at random. The second proposal is a more sophisticated vote with a weight for each classifier's result; these weights are the normalized accuracies of each classifier on a training set. Finally, the third method applies another classification algorithm, this time to the results themselves. Treating each previous classifier as an expert that gives its own diagnosis of a tweet, and given that we have the real tweet labels, we trained a Radial Basis Function (RBF) classifier on all of the training dataset, and afterwards used it to classify the final test set; these were the results we uploaded to the workshop. All three methods improved our results by a few, yet significant, points. This can be thought of as a supervised technique for dimensionality reduction, since we convert a dataset of 5000 features into only 4.

3 Empirical analysis

We now analyse the results obtained in the workshop with the given test corpus of tweets. This section is separated into two subsections: first we introduce the results obtained with the four single classifiers explained in Section 2.3; then we focus on the three combining methods introduced in Section 2.4.

3.1 Single classifiers

First of all, we discuss the results obtained with the simple use of the four single classifiers explained in Section 2.3. The analysis is done with two different data sets: on the one hand a set separated into four classes, and on the other hand a data set separated into six classes.

Figure 1: Confusion matrix for a Random Forest with 6 classes.

As depicted in Tables 1 and 2, the SVM and Linear Regression classifiers perform best in terms of the F1-measure, the harmonic mean of precision and recall.
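The single-classifier setup can be sketched with Scikit-Learn as follows. The 5000-feature cap, the bigram/trigram bag of words, and the 3-fold cross-validated macro F1 follow the paper; the toy corpus and the choice of LinearSVC as the linear-kernel SVM are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-in for the TASS corpus: three polarity classes, three tweets each.
tweets = [
    "me encanta este producto", "que maravilla de dia", "muy feliz con la compra",
    "odio esperar tanto", "fatal no lo recomiendo", "que servicio tan malo",
    "no esta mal", "bastante normal todo", "llego en el plazo previsto",
]
labels = np.array(["P", "P", "P", "N", "N", "N", "NEU", "NEU", "NEU"])

# Bag of words over bi- and trigrams, tf-idf weighted, capped at 5000 features.
vectorizer = TfidfVectorizer(ngram_range=(2, 3), max_features=5000)
X = vectorizer.fit_transform(tweets)

# 3-fold cross-validated macro F1, the kind of measure reported in
# Tables 1 and 2 (here on toy data, so the value itself is meaningless).
scores = cross_val_score(LinearSVC(), X, labels, cv=3, scoring="f1_macro")
print(round(scores.mean(), 3))
```

Swapping LinearSVC for AdaBoostClassifier, RandomForestClassifier, or a regression model reproduces the other three rows of the tables under the same protocol.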
              Acc      Precision  Recall   F1
  SVM         57.6667  0.4842     0.4759   0.4707
  ADA         49.3333  0.4193     0.4142   0.4072
  RF          54.0000  0.5122     0.4105   0.3968
  LR          59.3333  0.4542     0.4667   0.4516

Table 1: Average measures for the 3-fold cross-validated classifiers with 4 classes.

              Acc      Precision  Recall   F1
  SVM         40.3333  0.3587     0.3634   0.3579
  ADA         35.0000  0.3037     0.3070   0.2886
  RF          39.3333  0.3370     0.3267   0.2886
  LR          42.3333  0.3828     0.3621   0.3393

Table 2: Average measures for the 3-fold cross-validated classifiers with 6 classes.

Observing the confusion matrices of the previously mentioned techniques, Random Forest and Linear Regression, we can perhaps learn more about the data itself. For instance, the number of neutral tweets is so low that tweets are rarely classified as such, as seen in the NEU columns of the confusion matrices in Figures 1 and 2. Another curious fact is that the P+ labels are very separable for our classifiers. This could be because extreme tweets might contain most of the key words that determine a positive review, as opposed to the more subtle class P.

Figure 2: Confusion matrix for Linear Regression with 4 classes.

3.2 Combining classifiers

After applying the 4 previous single classifiers to each tweet, we obtain a data matrix where each feature corresponds to the label assigned by one classifier. We can interpret this as a kind of dimensionality reduction, where each tweet is now transformed into an element of 4 attributes, each corresponding to one classifier's result.

In Tables 3 and 4 we can see the official results of the three combined classifiers.

We have to keep in mind that, when comparing the combined classifiers with the single classifiers, we are using two different test sets. For the single classifiers, we use 3-fold cross-validation exclusively on the train data to obtain average measures for each classifier. For the combined classifiers, we trained on the Train set and evaluated on the final Test set.

Notice that the weighted voting outperforms the normal voting. This seems intuitive, because the weighted voting gives more importance to the most reliable classifiers. The RBF's results are not as promising as those of the two voting methods, but it still outperforms all of the single classifiers.

                    Acc   Precision  Recall  F1
  Voting            59.3  0.500      0.469   0.484
  Weighted Voting   59.3  0.508      0.465   0.486
  RBF               60.2  0.474      0.471   0.472

Table 3: Official results for the combined classifiers with 4 classes.

                    Acc   Precision  Recall  F1
  Voting            53.5  0.396      0.421   0.408
  Weighted Voting   53.4  0.402      0.430   0.415
  RBF               51.4  0.377      0.393   0.385

Table 4: Official results for the combined classifiers with 6 classes.

In general we can see that these methods outperform the single classifiers, with the exception of the SVM in terms of the F1-measure.

4 Conclusions and future work

In this paper we have described our approach to the SEPLN TASS 2015 task at the global level, with relatively good results considering the number of classes and the general difficulty of the problem. We began by describing the initial pre-processing and the extraction of features using a bag of words over bigrams and trigrams. We then described and compared four different classifiers, which we later used as a way of translating the data from 5000 dimensions into merely 4. We can conclude that multiple classifiers are good at capturing different phenomena, and that by combining them we tend to obtain a better global result, as we did in most of the TASS 2015 Global-level results.

In general we are satisfied with the results obtained in the TASS 2015 challenge. As future work, we propose exploring different classifiers that might capture different phenomena, so that the combined classifier has more diverse information. Different combined classifiers should also be trained.

References

Breiman, L. 2001. Random forests. Machine Learning, 45(1):5–32.

Freund, Y. and R. E. Schapire. 1999. A short introduction to boosting.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Villena-Román, J., J. García-Morera, M. A. García-Cumbreras, E. Martínez-Cámara, M. T. Martín-Valdivia, and L. A. Ureña López. 2015. Overview of TASS 2015.
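To make the two voting schemes of Section 2.4 concrete, here is a minimal sketch. The four expert predictions are invented for illustration; the weights are the rounded accuracies from Table 1, normalized to sum to one:

```python
import random
from collections import Counter

def simple_vote(predictions):
    """Most repeated label wins; draws are broken at random."""
    counts = Counter(predictions)
    top = max(counts.values())
    return random.choice([label for label, c in counts.items() if c == top])

def weighted_vote(predictions, weights):
    """Each classifier votes with its normalized training accuracy."""
    scores = Counter()
    for label, w in zip(predictions, weights):
        scores[label] += w
    return scores.most_common(1)[0][0]

# Four experts (SVM, ADA, RF, LR) predicting one tweet's polarity:
preds = ["P", "P", "N", "NEU"]
accs = [0.58, 0.49, 0.54, 0.59]          # rounded accuracies from Table 1
weights = [a / sum(accs) for a in accs]  # normalized to sum to 1
print(simple_vote(preds))                # -> "P" (two votes out of four)
print(weighted_vote(preds, weights))     # -> "P"
```

The third combining method replaces these hand-written rules with a classifier trained on the four predicted labels.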