<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>BittenPotato: Tweet sentiment analysis by combining multiple classifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iosu Mendizabal Borda</string-name>
          <email>iosu@iiia.csic.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeroni Carandell Saladich</string-name>
          <email>jeroni.carandell@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>(IIIA) Artificial Intelligence Research Institute, (CSIC) Spanish Council for Scientific Research</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>(UPC) Universitat Politecnica de Catalunya, (URV) Universitat Rovira i Virgili, (UB) Universitat de Barcelona</institution>
        </aff>
      </contrib-group>
      <fpage>71</fpage>
      <lpage>74</lpage>
      <abstract>
        <p>In this paper, we use a bag-of-words model over n-grams to build a dictionary containing the most used "words", which we use as features. We then classify using four different classifiers and combine their results by applying a simple voting, a weighted voting, and a further classifier to obtain the real polarity of a phrase.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Sentiment analysis is the branch of natural language processing used to determine the subjective polarity of a text. It has many applications, ranging from measuring the popularity of a certain product to gauging the general opinion about an event or a politician, among many others.</p>
      <p>In the particular case of Twitter texts, these have the misfortune, or great advantage, of consisting of at most 140 characters. The disadvantage is that short texts are not described very accurately by the bag-of-words model we will use; on the other hand, the limit also forces the author of the tweet to be concise in their opinion, so noise and irrelevant statements are usually left out.</p>
      <p>
        In this workshop on sentiment analysis focused on Spanish, a data set of tweets tagged according to their sentiment is provided, along with a description of the evaluation measures and of the different tasks
        <xref ref-type="bibr" rid="ref4">(Villena-Roman et al., 2015)</xref>
        .
      </p>
      <p>The rest of the article is laid out as follows: Section 2 introduces the architecture and components of the system, namely the pre-processing, the extraction of features, the algorithms used, and the process applied to their results to obtain our final tag. Section 3 analyses the results obtained in this workshop. Finally, Section 4 draws some conclusions and proposes some future work.</p>
    </sec>
    <sec id="sec-arch">
      <title>Architecture and components of the system</title>
      <p>Our system contains four main phases: data pre-processing, feature extraction (vectorization), the use of classifiers from which we extract a new set of features, and finally a combined classifier which uses the latter to predict the polarity of the text.</p>
    </sec>
    <sec id="sec-2">
      <title>Pre-processing</title>
      <p>This step, crucial to any natural language processing task, consists of removing noise from the text. Many of the steps, such as the removal of URLs, emails, punctuation, emoticons, and spaced-out words, are general and we will not dwell on them; yet some are particular to the language at hand, such as the removal of letters repeated more than twice in Spanish.</p>
      <p>Published at http://ceur-ws.org/Vol-1397/. CEUR-WS.org is a serial publication with a recognized ISSN.</p>
    </sec>
    <sec id="sec-3">
      <title>Vectorization: Bag of words</title>
      <p>
        In order to be able to apply a classi er, we
need to turn each tweet into a vector with
the same features. To do this, one of the
most common approach is to use the
Bagof-Words model with which given a corpus
of documents, it nds the N most relevant
words (or n-grams in our case). Each
feature, therefore represents the appearance of
a di erent relevant "word". Although the
relevance of a word can be de ned as the
number of times it appears in the text, this has
the disadvantage of considering words that
appear largely throughout the whole
document and lack semantic relevance. In order
to counter this e ect a more sophisticated
approach called tf-idf (term frequency -
inverse term frequency) is used. In our project
we used the Scikit-Learn T dfVectorizer
        <xref ref-type="bibr" rid="ref3">(Pedregosa et al., 2011)</xref>
        to convert each tweet to
a length N feature vector.
2.3
      </p>
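      <p>Conceptually, the tf-idf weighting can be sketched as follows; this is a bare-bones version of what TfidfVectorizer computes (scikit-learn additionally smooths the idf term and L2-normalizes each vector):</p>

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each document into a vector of tf * idf weights,
    one feature per vocabulary word (unsmoothed, unnormalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # document frequency: in how many documents each word appears
    df = {w: sum(w in toks for toks in tokenized) for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # term frequency within this document
        vectors.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vectors
```

Words that appear in every document get idf = log(1) = 0, so ubiquitous but semantically empty terms stop dominating the representation.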
    </sec>
    <sec id="sec-4">
      <title>Algorithms: Classifiers</title>
      <p>
        Once we have a way of converting sentences into a representation with the same features, we can use any classification algorithm. For all of the following algorithms we used the implementations in the Scikit-Learn Python package
        <xref ref-type="bibr" rid="ref3">(Pedregosa et al., 2011)</xref>
        .
      </p>
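      <p>A sketch of how the four single classifiers could be instantiated with scikit-learn; the specific parameters shown are illustrative assumptions, not the paper's tuned settings:</p>

```python
# Illustrative setup of the four single classifiers; parameter values
# are assumptions, not the paper's exact configuration.
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LinearRegression

classifiers = {
    "SVM": LinearSVC(),                              # SVM with a linear kernel
    "ADA": AdaBoostClassifier(n_estimators=100),     # rounds = only key parameter
    "RF":  RandomForestClassifier(n_estimators=100),
    "LR":  LinearRegression(),  # regression output rounded to the nearest class
}
```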
      <sec id="sec-4-0">
        <title>SVM</title>
        <p>The first, simple method we use to classify is a support vector machine with a linear kernel. This is generally the most promising in terms of all the measures used, both with the complete and with the reduced number of classes.</p>
      </sec>
      <sec id="sec-4-1">
        <title>AdaBoost (ADA)</title>
        <p>
          AdaBoost is also simple and easy to train, since the only relevant parameter is the number of rounds, and it has a strong theoretical basis assuring that the training error will be reduced. However, this only holds with enough data
          <xref ref-type="bibr" rid="ref2">(Freund and Schapire, 1999)</xref>
          ; given the large number of features (5000) compared to the number of training instances (around 4000, because of the cross-validation on the training data that we use for testing), this is the worst performing method, as can be seen in Tables 1 and 2.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Random Forest (RF)</title>
        <p>
          We decided to use this ensemble method as well because it has had very positive effects, with accuracies that at times surpass AdaBoost's, thanks to its robustness to noise and outliers
          <xref ref-type="bibr" rid="ref1">(Breiman, 2001)</xref>
          .
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Linear Regression (LR)</title>
        <p>Since the degrees of sentiment polarity are ordered, we decided that it would also be appropriate to consider the problem as a discrete regression problem. Although a very straightforward approach, it gives the second best results in general, at times surpassing the SVM (Tables 1 and 2).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Combining classifiers</title>
      <p>After computing the confusion matrices of the classifiers used, we reached the conclusion that certain algorithms were better at capturing some classes than others. These confusion matrices can be observed in Section 3. For this reason we decided to combine the results of different classifiers to obtain more accurate results. In other words, we use the results of the single classifiers as an encoding of the tweet into a lower dimension. We can interpret each single classifier as an expert that gives its diagnosis or opinion about the sentiment of a tweet. Since these different experts can be mistaken and disagree, we have to find the best result by combining them.</p>
      <p>We tried three different combining methods. The first is a simple voting over the different classifiers' results, where the most repeated label wins; in case of a draw, one of the drawing labels wins at random. The second proposal is a more sophisticated voting with a weight for each classifier's result; these weights are computed on a train set and are the normalized accuracies of the classification of this set.</p>
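      <p>The two voting schemes can be sketched as follows; this is our illustration of the rules described above, with the label names assumed:</p>

```python
import random
from collections import Counter

def majority_vote(labels, rng=random.Random(0)):
    """Simple voting: the most repeated label wins;
    draws are broken by picking one of the tied labels at random."""
    counts = Counter(labels)
    best = max(counts.values())
    tied = [lab for lab, c in counts.items() if c == best]
    return rng.choice(tied)

def weighted_vote(labels, weights):
    """Weighted voting: each classifier's vote counts in proportion to
    its normalized accuracy, measured on a held-out training split."""
    scores = Counter()
    for lab, w in zip(labels, weights):
        scores[lab] += w
    return max(scores, key=scores.get)
```

With weights, a minority label backed by the most reliable classifiers can outvote a majority of weak ones.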
      <p>Finally, the third method consists of another classification algorithm, this time applied to the results. The idea is that we treat each previous classifier as an expert that gives its own diagnosis of the tweet; given that we have the real labels, we decided to train a Radial Basis Function (RBF) classifier on all of the training dataset and afterwards use the RBF to classify the final test results, which were the results we uploaded to the workshop. All three of these methods enhanced our results by a few yet significant points. This can be thought of as a supervised technique for dimensionality reduction, since we convert a dataset of 5000 features into only 4.</p>
      <sec id="sec-5-1">
        <title>Empirical analysis</title>
        <p>We now analyse the results obtained in the workshop with the given testing tweet corpus. This section is separated in two subsections: firstly, we introduce the results obtained with the use of the four classifiers explained in Section 2.3; secondly, we focus on the usage of the three combining methods introduced in Section 2.4.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Single classifiers</title>
      <p>First of all, we discuss the results obtained with the simple use of the four single classifiers explained in Section 2.3. The analysis is done with two different data sets: on the one hand a set separated into four classes, and on the other hand a data set separated into six classes.</p>
      <p>As depicted in Tables 1 and 2, the SVM and the Linear Regression classifiers are the best performing ones in terms of the F1-measure, which is the harmonic mean of recall and precision.</p>
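      <p>For reference, the harmonic mean penalizes imbalance between precision and recall, so a classifier cannot score well by maximizing only one of the two:</p>

```python
def f1_score(precision, recall):
    """F1-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```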
      <table-wrap id="tab1">
        <table>
          <thead>
            <tr><th/><th>Acc</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>SVM</td><td>57.6667</td><td>0.4842</td></tr>
            <tr><td>AB</td><td>49.3333</td><td>0.4193</td></tr>
            <tr><td>RF</td><td>54.0000</td><td>0.5122</td></tr>
            <tr><td>LR</td><td>59.3333</td><td>0.4542</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Observing the confusion matrices of the previously mentioned techniques, Random Forest and Linear Regression, we can perhaps learn more about the data itself. For instance, the number of Neutral tweets is so low that tweets are rarely classified as such, as seen in the NEU columns of the confusion matrices in Figures 1 and 2. Another curious fact is that P+ labels are very separable for our classifiers. This could be because extremes might contain most of the key words that determine a positive review, as opposed to the more subtle class P.</p>
      <p>After applying the 4 previous single classifiers to each tweet, we obtain a data matrix where each feature corresponds to the label set by each classifier. We can interpret this as a sort of dimensionality reduction technique, where we now have each tweet transformed into an element of 4 attributes, each corresponding to a classifier's result.</p>
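      <p>This encoding step can be sketched as follows; here classifiers is assumed to be a list of fitted estimators exposing a scikit-learn-style predict method (a hypothetical interface for illustration):</p>

```python
def encode_with_experts(X, classifiers):
    """Map each tweet vector (5000 tf-idf features) to the list of labels
    predicted by the single classifiers: one column per expert. The result
    is a 4-feature dataset on which the combined classifier is trained."""
    return [[clf.predict([x])[0] for clf in classifiers] for x in X]
```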
      <p>In Tables 3 and 4 we can see the official results of the three combined classifiers on the Train data.</p>
      <p>We have to keep in mind that, when comparing the combined classifiers with the single classifiers, we are using two different test sets. For the single classifiers, we use 3-fold cross-validation exclusively on the train data to obtain average measures for each classifier. For the combined classifiers, we trained on the Train set and evaluated on the final Test set.</p>
      <p>Notice that the weighted voting outperforms the normal voting. This seems intuitive, because the weighted voting gives more importance to the most reliable classifiers. The RBF's results are not as promising as those of the previous two methods, but it still outperforms all of the single classifiers.</p>
      <table-wrap id="tab3">
        <table>
          <thead>
            <tr><th/><th>Acc</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>Voting</td><td>59.3</td><td>0.500</td></tr>
            <tr><td>Weighted Voting</td><td>59.3</td><td>0.508</td></tr>
            <tr><td>RBF</td><td>60.2</td><td>0.474</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>In general we can see that these methods, with the exception of the SVM in terms of F1-measure, outperform the rest.</p>
      <sec id="sec-6-1">
        <title>Conclusions and future work</title>
        <p>In this paper we have described our approach to the TASS 2015 workshop at SEPLN for the global level, with relatively good results considering the number of classes and the general difficulty of the problem.</p>
        <p>We started by describing the initial preprocessing and the extraction of features using a bag of words over bigrams and trigrams. Then we described and compared four different classifiers that we later used as a way of translating the data into merely 4 dimensions, down from 5000.</p>
        <p>We can conclude that multiple classifiers are good at capturing different phenomena, and that by combining them we tend to obtain a better global result, as we have seen in most of the TASS 2015 results at the global level.</p>
        <p>In general we are satisfied with the results obtained in the TASS 2015 challenge. As future work, we propose to explore different classifiers that might capture different phenomena, so that the combined classifier has more diverse information. Different combined classifiers should also be trained.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>Random forests</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>45</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Freund</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Schapire</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>A short introduction to boosting</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Villena-Román</surname>, <given-names>J.</given-names></string-name>, <string-name><given-names>J.</given-names> <surname>García-Morera</surname></string-name>, <string-name><given-names>M. A.</given-names> <surname>García-Cumbreras</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Martínez-Cámara</surname></string-name>, <string-name><given-names>M. T.</given-names> <surname>Martín-Valdivia</surname></string-name>, and <string-name><given-names>L. A.</given-names> <surname>Ureña-López</surname></string-name>.
          <year>2015</year>
          .
          <article-title>Overview of TASS 2015</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>