TASS 2015, September 2015, pp. 71-74. Received 20-07-15, revised 24-07-15, accepted 29-07-15.

BittenPotato: Tweet sentiment analysis by combining multiple classifiers

Iosu Mendizabal Borda
(IIIA) Artificial Intelligence Research Institute
(CSIC) Spanish Council for Scientific Research
iosu@iiia.csic.es

Jeroni Carandell Saladich
(UPC) Universitat Politecnica de Catalunya
(URV) Universitat Rovira i Virgili
(UB) Universitat de Barcelona
jeroni.carandell@gmail.com

Resumen: In this article, we use a bag of words over n-grams to build a dictionary of the most frequent features in a dataset. We then apply four different classifiers and combine their results through several techniques, aiming to recover the true polarity of each sentence by extracting the sentiment it contains.
Palabras clave: Sentiment analysis, natural language processing.

Abstract: In this paper, we use a bag-of-words of n-grams to build a dictionary containing the most used "words", which we use as features. We then classify with four different classifiers and combine their results by applying a simple vote, a weighted vote, and a further classifier to obtain the real polarity of a phrase.
Keywords: Tweet sentiment analysis, natural language processing.

Published at http://ceur-ws.org/Vol-1397/. CEUR-WS.org is a serial publication with recognized ISSN 1613-0073.

1 Introduction and objectives

Sentiment analysis is the branch of natural language processing used to determine the subjective polarity of a text. It has many applications, ranging from gauging the popularity of a product to measuring the general opinion about an event or a politician, among many others. In the particular case of Twitter texts, these have the misfortune, or great advantage, of consisting of at most 140 characters. The disadvantage is that short texts are not described very accurately by the bag-of-words model we use; on the other hand, the limit also forces the author of a tweet to state an opinion concisely, so noise and irrelevant statements are usually left out.

In this workshop on sentiment analysis in Spanish, a data set of tweets tagged according to their sentiment is provided, along with a description of the evaluation measures and of the different tasks (Villena-Román et al., 2015).

The rest of the article is laid out as follows: Section 2 introduces the architecture and components of the system, namely the pre-processing, the extraction of features, the algorithms used, and the process applied to their results to obtain our final tag. Section 3 analyses the results obtained in the workshop. Finally, in Section 4 we draw some conclusions and propose future work.

2 Architecture and components of the system

Our system contains four main phases: data pre-processing; feature extraction (vectorization); the use of classifiers, from which we extract a new set of features; and finally a combined classifier which uses the latter to predict the polarity of the text.

2.1 Pre-processing

This step, crucial to any natural language processing task, consists of removing noise from the text. Many of the steps, such as the removal of URLs, emails, punctuation, emoticons, spaced-out words, etc., are general and we will not dwell on them; others are more particular to the language, such as the removal of letters repeated more than twice in Spanish.
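A minimal sketch of such a cleaning step is shown below; the concrete patterns (URL, mention, and repeated-letter rules) are illustrative assumptions, not the exact rules used in our system:

```python
import re

def preprocess(tweet: str) -> str:
    """Illustrative tweet cleaning; the exact rules are assumptions."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)       # remove URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)       # remove email addresses
    text = re.sub(r"[@#]\w+", " ", text)            # remove mentions/hashtags
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)     # "goooool" -> "gool"
    text = re.sub(r"[^\w\s]", " ", text)            # strip punctuation/emoticons
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

print(preprocess("Me gustaaaa!!! http://t.co/xyz @amigo"))  # -> "me gustaa"
```

Collapsing a repeated letter to two copies, rather than one, preserves legitimate Spanish double letters such as "ll" and "rr".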
2.2 Vectorization: Bag of words

In order to apply a classifier, we need to turn each tweet into a vector with the same features. One of the most common approaches is the bag-of-words model, which, given a corpus of documents, finds the N most relevant words (or n-grams, in our case). Each feature therefore represents the occurrence of a different relevant "word". Although the relevance of a word can be defined as the number of times it appears in the text, this has the disadvantage of favouring words that appear throughout the whole corpus yet lack semantic relevance. To counter this effect, the more sophisticated tf-idf (term frequency - inverse document frequency) weighting is used. In our project we used the Scikit-Learn TfidfVectorizer (Pedregosa et al., 2011) to convert each tweet into a feature vector of length N.

2.3 Algorithms: Classifiers

Once we have a way of converting sentences into a representation with the same features, we can use any classification algorithm. For all of the following algorithms we used the implementations in the Scikit-Learn Python package (Pedregosa et al., 2011).

2.3.1 SVM

The first simple method we use is a support vector machine with a linear kernel. It is generally the most promising in terms of all the measures used, both with the complete and the reduced number of classes.

2.3.2 AdaBoost (ADA)

AdaBoost is also simple and easy to train, since the only relevant parameter is the number of rounds, and it has a strong theoretical basis assuring that the training error will be reduced. However, this is only guaranteed with enough data (Freund and Schapire, 1999); given the large number of features (5000) compared to the number of training instances (around 4000, because of the cross-validation on the training data that we use for testing), it is the worst performing method, as can be seen in Tables 1 and 2.

2.3.3 Random Forest (RF)

We decided to use this ensemble method as well because it has shown very positive results, with accuracies that at times surpass AdaBoost's, thanks to its robustness to noise and outliers (Breiman, 2001).

2.3.4 Linear Regression (LR)

Since the degrees of sentiment polarity are ordered, we considered it appropriate to also treat the problem as a discrete regression problem. Although a very straightforward approach, it gives the second best results in general, at times surpassing the SVM (Tables 1 and 2).

2.4 Result: Combining classifiers

After computing the confusion matrices of the classifiers used, we concluded that certain algorithms were better at capturing some classes than others. These confusion matrices can be observed in Section 3. For this reason, we decided to combine the results of the different classifiers to obtain more accurate predictions. In other words, we use the results of the single classifiers as an encoding of the tweet in a lower dimension. We can interpret each single classifier as an expert that gives its diagnosis, or opinion, about the sentiment of a tweet. Since these experts can be mistaken and can disagree, we have to find the best result by combining them.

We tried three different combining methods. The first is a simple vote over the classifiers' results, in which the most repeated label wins; in case of a draw, one of the tied labels is chosen at random. The second proposal is a more sophisticated vote with a weight for each classifier's result; these weights are the normalized accuracies of each classifier on a training set. Finally, the third method applies another classification algorithm, this time to the results themselves. Treating each previous classifier as an expert that gives its own diagnosis of a tweet, and given that we have the real tweet labels, we trained a Radial Basis Function (RBF) classifier on all of the training dataset, and afterwards used it to classify the final test set; these were the results we uploaded to the workshop. All three methods improved our results by a few, yet significant, points. This can be thought of as a supervised technique for dimensionality reduction, since we convert a dataset of 5000 features into only 4.

3 Empirical analysis

We now analyse the results obtained in the workshop with the given test corpus of tweets. This section is separated into two subsections: first we introduce the results obtained with the four single classifiers explained in Section 2.3; then we focus on the three combining methods introduced in Section 2.4.

3.1 Single classifiers

First of all, we discuss the results obtained with the simple use of the four single classifiers explained in Section 2.3. The analysis is done with two different data sets: on the one hand a set separated into four classes, and on the other hand a data set separated into six classes.

Figure 1: Confusion matrix for a Random Forest with 6 classes.

As depicted in Tables 1 and 2, the SVM and Linear Regression classifiers perform best in terms of the F1-measure, the harmonic mean of precision and recall.
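The single-classifier setup can be sketched with Scikit-Learn as follows. The 5000-feature cap, the bigram/trigram bag of words, and the 3-fold cross-validated macro F1 follow the paper; the toy corpus and the choice of LinearSVC as the linear-kernel SVM are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-in for the TASS corpus: three polarity classes, three tweets each.
tweets = [
    "me encanta este producto", "que maravilla de dia", "muy feliz con la compra",
    "odio esperar tanto", "fatal no lo recomiendo", "que servicio tan malo",
    "no esta mal", "bastante normal todo", "llego en el plazo previsto",
]
labels = np.array(["P", "P", "P", "N", "N", "N", "NEU", "NEU", "NEU"])

# Bag of words over bi- and trigrams, tf-idf weighted, capped at 5000 features.
vectorizer = TfidfVectorizer(ngram_range=(2, 3), max_features=5000)
X = vectorizer.fit_transform(tweets)

# 3-fold cross-validated macro F1, the kind of measure reported in
# Tables 1 and 2 (here on toy data, so the value itself is meaningless).
scores = cross_val_score(LinearSVC(), X, labels, cv=3, scoring="f1_macro")
print(round(scores.mean(), 3))
```

Swapping LinearSVC for AdaBoostClassifier, RandomForestClassifier, or a regression model reproduces the other three rows of the tables under the same protocol.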
              Acc      Precision  Recall   F1
  SVM         57.6667  0.4842     0.4759   0.4707
  ADA         49.3333  0.4193     0.4142   0.4072
  RF          54.0000  0.5122     0.4105   0.3968
  LR          59.3333  0.4542     0.4667   0.4516

Table 1: Average measures for the 3-fold cross-validated classifiers with 4 classes.

              Acc      Precision  Recall   F1
  SVM         40.3333  0.3587     0.3634   0.3579
  ADA         35.0000  0.3037     0.3070   0.2886
  RF          39.3333  0.3370     0.3267   0.2886
  LR          42.3333  0.3828     0.3621   0.3393

Table 2: Average measures for the 3-fold cross-validated classifiers with 6 classes.

Observing the confusion matrices of the previously mentioned techniques, Random Forest and Linear Regression, we can perhaps learn more about the data itself. For instance, the number of neutral tweets is so low that tweets are rarely classified as such, as seen in the NEU columns of the confusion matrices in Figures 1 and 2. Another curious fact is that the P+ labels are very separable for our classifiers. This could be because extreme tweets might contain most of the key words that determine a positive review, as opposed to the more subtle class P.

Figure 2: Confusion matrix for Linear Regression with 4 classes.

3.2 Combining classifiers

After applying the 4 previous single classifiers to each tweet, we obtain a data matrix where each feature corresponds to the label assigned by one classifier. We can interpret this as a kind of dimensionality reduction, where each tweet is now transformed into an element of 4 attributes, each corresponding to one classifier's result.

In Tables 3 and 4 we can see the official results of the three combined classifiers.

We have to keep in mind that, when comparing the combined classifiers with the single classifiers, we are using two different test sets. For the single classifiers, we use 3-fold cross-validation exclusively on the train data to obtain average measures for each classifier. For the combined classifiers, we trained on the Train set and evaluated on the final Test set.

Notice that the weighted voting outperforms the normal voting. This seems intuitive, because the weighted voting gives more importance to the most reliable classifiers. The RBF's results are not as promising as those of the two voting methods, but it still outperforms all of the single classifiers.

                    Acc   Precision  Recall  F1
  Voting            59.3  0.500      0.469   0.484
  Weighted Voting   59.3  0.508      0.465   0.486
  RBF               60.2  0.474      0.471   0.472

Table 3: Official results for the combined classifiers with 4 classes.

                    Acc   Precision  Recall  F1
  Voting            53.5  0.396      0.421   0.408
  Weighted Voting   53.4  0.402      0.430   0.415
  RBF               51.4  0.377      0.393   0.385

Table 4: Official results for the combined classifiers with 6 classes.

In general we can see that these methods outperform the single classifiers, with the exception of the SVM in terms of the F1-measure.

4 Conclusions and future work

In this paper we have described our approach to the SEPLN TASS 2015 task at the global level, with relatively good results considering the number of classes and the general difficulty of the problem. We began by describing the initial pre-processing and the extraction of features using a bag of words over bigrams and trigrams. We then described and compared four different classifiers, which we later used as a way of translating the data from 5000 dimensions into merely 4. We can conclude that multiple classifiers are good at capturing different phenomena, and that by combining them we tend to obtain a better global result, as we did in most of the TASS 2015 Global-level results.

In general we are satisfied with the results obtained in the TASS 2015 challenge. As future work, we propose exploring different classifiers that might capture different phenomena, so that the combined classifier has more diverse information. Different combined classifiers should also be trained.

References

Breiman, L. 2001. Random forests. Machine Learning, 45(1):5–32.

Freund, Y. and R. E. Schapire. 1999. A short introduction to boosting.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Villena-Román, J., J. García-Morera, M. A. García-Cumbreras, E. Martínez-Cámara, M. T. Martín-Valdivia, and L. A. Ureña López. 2015. Overview of TASS 2015.
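To make the two voting schemes of Section 2.4 concrete, here is a minimal sketch. The four expert predictions are invented for illustration; the weights are the rounded accuracies from Table 1, normalized to sum to one:

```python
import random
from collections import Counter

def simple_vote(predictions):
    """Most repeated label wins; draws are broken at random."""
    counts = Counter(predictions)
    top = max(counts.values())
    return random.choice([label for label, c in counts.items() if c == top])

def weighted_vote(predictions, weights):
    """Each classifier votes with its normalized training accuracy."""
    scores = Counter()
    for label, w in zip(predictions, weights):
        scores[label] += w
    return scores.most_common(1)[0][0]

# Four experts (SVM, ADA, RF, LR) predicting one tweet's polarity:
preds = ["P", "P", "N", "NEU"]
accs = [0.58, 0.49, 0.54, 0.59]          # rounded accuracies from Table 1
weights = [a / sum(accs) for a in accs]  # normalized to sum to 1
print(simple_vote(preds))                # -> "P" (two votes out of four)
print(weighted_vote(preds, weights))     # -> "P"
```

The third combining method replaces these hand-written rules with a classifier trained on the four predicted labels.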