TASS 2016: Workshop on Sentiment Analysis at SEPLN, September 2016, pp. 29-33


        LABDA at the 2016 TASS challenge task: using word
           embeddings for the sentiment analysis task∗

             Antonio Quirós1,2, Isabel Segura-Bedmar1, and Paloma Martínez1
              1 Departamento de Informática, Universidad Carlos III de Madrid
                Avd. de la Universidad, 30, 28911, Leganés, Madrid, España
                100342879@alumnos.uc3m.es, isegura,pmf@inf.uc3m.es
              2 Sngular Data&Analytics
                Av. LLano Castellano 13, Planta 5, 28034 Madrid, España
                antonio.quiros@sngular.team

      Abstract: This paper describes the participation of the LABDA group in Task 1 (Sentiment Analysis at global level) of the TASS 2016 competition. Our approach represents tweets by means of word embeddings and classifies them with machine learning algorithms such as SVM and logistic regression.
      Keywords: Sentiment Analysis, Word embeddings

∗ This work was supported by eGovernAbility-Access project (TIN2014-52665-C2-2-R).
ISSN 1613-0073

1   Introduction
Knowing the opinion of customers or users has become a priority for companies and organizations that want to improve the quality of their services and products. The ongoing explosion of social media affords a significant opportunity to poll the opinion of many Internet users by processing their comments. However, sentiment analysis, which can be defined as the automatic analysis of opinion in texts (Pang and Lee, 2008), is a challenging task because different people often assign different polarities to a given text. On Twitter, the task is even more difficult, because the texts are short (only 140 characters) and are characterized by informal language, many grammatical errors and spelling mistakes, slang and vulgar vocabulary, and abbreviations.

Since their introduction in 2013, the TASS shared task editions have had as their main goal to promote the development of methods and resources for sentiment analysis of tweets in Spanish. This paper describes the participation of the LABDA group in Task 1 (Sentiment Analysis at global level). In this task, the participating systems have to determine the global polarity of each tweet in the test dataset. There are two different evaluations: one based on 6 different polarity labels (P+, P, NEU, N, N+, NONE) and another based on just 4 labels (P, N, NEU, NONE). A detailed description of the task can be found in the overview paper of TASS 2016 (García-Cumbreras et al., 2016). Our approach exploits word embedding representations for tweets and machine learning algorithms such as SVM and logistic regression. The word embedding model can yield a significant dimensionality reduction compared to the classical Bag-of-Words (BoW) model. This dimensionality reduction can have several positive effects on our algorithms, such as faster training, less overfitting and better performance.

The paper is organized as follows. Section 2 describes our approach. The experimental results are presented and discussed in Section
3. We conclude in Section 4 with a summary of our findings and some directions for future work.

2   System
In this paper, we study the use of word embeddings (also known as word vectors) to represent tweets and then examine several machine learning algorithms to classify them. Word embeddings have shown promising results in NLP tasks such as named entity recognition (Segura-Bedmar, Suárez-Paniagua, and Martínez, 2015), relation extraction (Alam et al., 2016), sentiment analysis (Socher et al., 2013b) or parsing (Socher et al., 2013a). A word embedding is a function that maps words to low-dimensional vectors, which are learned from a large collection of texts. At present, neural networks are among the most widely used techniques for generating word embeddings (Mikolov and Dean, 2013). The essential assumption of this model is that semantically close words will have similar vectors (in terms of cosine similarity). Word embeddings can help to capture semantic and syntactic relationships of the corresponding words.

While the well-known Bag-of-Words (BoW) model involves a very large number of features (as many as the number of non-stopword words with at least a minimum number of occurrences in the training data), the word embedding representation allows a significant reduction in the feature set size (in our case, from millions to just 300). The dimensionality reduction is a desirable goal, because it helps to avoid overfitting and reduces the training and classification times, without any performance loss.

As a preprocessing step, tweets must be cleaned. First, we remove all links and URLs. We then remove usernames, which can be easily recognized because their first character is the symbol @. We then transform the hashtags into words by removing their first character (that is, the symbol #). Using regular expressions, the emoticons are detected and classified in order to count the number of positive and negative emoticons in each tweet, and then we remove them from the text. Table 1 shows the list of positive and negative emoticons, which were taken from the Wikipedia page https://en.wikipedia.org/wiki/List_of_emoticons. We convert the tweets to lowercase and replace misspelled accented letters with the correct one (for instance “à” with “á”). We also treat elongations (that is, the repetition of a character) by removing the repetition of a character after its second occurrence (for example, “hoooolaaaa” would be translated to “hola”). We then decided to take laughs into account (for instance “jajaja”), which turned out to be challenging because of the diverse ways in which they are expressed (e.g., expressions like “jajajaja” or “jejeje” and even misspelled ones like “jajjajaaj”). We addressed this by using regular expressions to standardize the different forms (e.g., “jajjjaaj” to “jajaja”) and then replacing them with the word “risas”. Finally, we remove all non-letter characters and all stopwords present in the tweets [1].

    [1] http://snowball.tartarus.org/algorithms/spanish/stop.txt

    Orientation   Emoticons
    Positive      :-), :), :D, :o), :], D:3, :c), :>, =], 8), =), :}, :^), :-D, 8-D, 8D,
                  x-D, xD, X-D, XD, =-D, =D, =-3, =3, B^D, :'), :'), :*, :-*, :^*, ;-),
                  ;), *-), *), ;-], ;], ;D, ;^), >:P, :-P, :P, X-P, x-p, xp, XP, :-p, :p,
                  =p, :-b, :b
    Negative      >:[, :-(, :(, :-c, :-<, :<, :-[, :[, :{, ;(, :-||, >:(, :'-(, :'(, D:<,
                  D=, v.v

    Table 1: List of positive and negative emoticons

Once the tweets are preprocessed, they are tokenized using the NLTK toolkit (a Python package for NLP); we also experimented with lemmatizing each tweet using the MeaningCloud [2] Text Analytics software in order to compare both approaches. Then, for each token, we look up its vector in the word embedding model. We use a pretrained model (Cardellino, 2016), which was generated by using the word2vec algorithm (Mikolov and Dean, 2013) from a collection of Spanish texts with approximately 1.5 billion words. The dimension of the word embedding is 300.

    [2] https://www.meaningcloud.com/
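The cleaning steps described in this section can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code used for the experiments: the emoticon lists and the stopword set are small illustrative subsets, the function name `preprocess` and all regular expressions are hypothetical, and elongations are handled by collapsing runs of three or more identical letters into one, which is one way to reproduce the “hoooolaaaa” to “hola” example.

```python
import re

# Illustrative subsets only (assumption): the paper uses the full emoticon
# lists of Table 1 and the Snowball Spanish stopword list.
POSITIVE_EMOTICONS = [":-)", ":)", ":D", "xD", "XD", ";)", ":P"]
NEGATIVE_EMOTICONS = [":-(", ":(", ":'(", "D=", "v.v"]
STOPWORDS = {"de", "la", "el", "que", "y", "en", "a", "los", "un", "una"}


def emoticon_regex(emoticons):
    # Longest alternatives first, so that ":-)" wins over ":)".
    ordered = sorted(emoticons, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))


POS_RE = emoticon_regex(POSITIVE_EMOTICONS)
NEG_RE = emoticon_regex(NEGATIVE_EMOTICONS)
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
GRAVE_TO_ACUTE = str.maketrans("àèìòù", "áéíóú")
# Assumption: runs of three or more identical letters collapse to one letter,
# which keeps legitimate double letters such as "ll" or "rr".
ELONGATION_RE = re.compile(r"([a-záéíóúüñ])\1{2,}")
# Laughs: tokens built only from "j" and one vowel that contain e.g. "jaja".
LAUGH_RE = re.compile(
    "|".join(rf"\b[j{v}]*(?:j{v}){{2,}}[j{v}]*\b" for v in "aei")
)
NON_LETTER_RE = re.compile(r"[^a-záéíóúüñ\s]")


def preprocess(tweet):
    """Return (tokens, positive emoticon count, negative emoticon count)."""
    text = URL_RE.sub(" ", tweet)            # remove links
    text = USER_RE.sub(" ", text)            # remove @usernames
    text = HASHTAG_RE.sub(r"\1", text)       # "#tag" -> "tag"
    pos_emo = len(POS_RE.findall(text))      # count emoticons ...
    neg_emo = len(NEG_RE.findall(text))
    text = NEG_RE.sub(" ", POS_RE.sub(" ", text))  # ... then drop them
    text = text.lower().translate(GRAVE_TO_ACUTE)
    text = ELONGATION_RE.sub(r"\1", text)    # treat elongations
    text = LAUGH_RE.sub("risas", text)       # standardize laughs
    text = NON_LETTER_RE.sub(" ", text)      # keep letters only
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return tokens, pos_emo, neg_emo


print(preprocess("@ana Hoooolaaaa!!! jajjjaaj :) #FelizLunes http://t.co/abc :("))
```

Emoticons are counted before lowercasing because several of them (for example “XD” and “xD”) differ only in case. Whitespace splitting stands in for the NLTK tokenizer here so that the sketch has no third-party dependencies.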


Note that the texts used to train this embedding model were taken from different resources such as Spanish Wikipedia, WikiSource and Wikibooks, but none of them contains tweets. Therefore, it is possible that the main characteristics of social media texts (such as informal and noisy language, plenty of grammatical errors and spelling mistakes, slang and vulgar vocabulary, abbreviations, etc.) are not correctly represented in this model. One of the main problems is that a significant number of words (almost 13% of the vocabulary, representing 6% of word occurrences) are not found in the model. We reviewed a small sample of these words, which showed that most of them were hashtags.

In our approach, a tweet of n tokens (T = w_1, w_2, ..., w_n) is represented as the centroid of the word vectors \vec{w}_i of its tokens, as shown in the following equation:

\[
\vec{T} = \frac{1}{n}\sum_{i=1}^{n}\vec{w}_i = \frac{\sum_{j=1}^{N}\vec{w}_j \cdot TF(w_j, T)}{\sum_{j=1}^{N} TF(w_j, T)} \qquad (1)
\]

where N is the vocabulary size, that is, the total number of distinct words, while TF(w_j, T) refers to the number of occurrences of the j-th vocabulary word in the tweet T.

We also explore the effect of including the inverse document frequency (IDF) to represent tweets (see Equation 2). This helps to increase the weight of words that occur often, but only in a few documents, while it reduces the relevance of words that occur very frequently in a larger number of texts.

\[
\vec{T} = \frac{1}{n}\sum_{i=1}^{n}\vec{w}_i = \frac{\sum_{j=1}^{N}\vec{w}_j \cdot TF(w_j, T) \cdot IDF(w_j)}{\sum_{j=1}^{N} TF(w_j, T) \cdot IDF(w_j)} \qquad (2)
\]

with IDF(w_j) = \frac{\log |D|}{|\{tw \in D : w_j \in tw\}|}, where |D| refers to the number of tweets.

In addition to using the centroid, we assess the impact of complementing the tweet model with the following additional features:

     posWords: number of positive words present in the tweet.
     negWords: number of negative words present in the tweet.
     posEmo: number of positive emoticons present in the tweet.
     negEmo: number of negative emoticons present in the tweet.

For the posWords and negWords features we used the iSOL lexicon (Molina-González et al., 2013), a list composed of 2,509 positive words and 5,626 negative words. As described before, for the emoticons we used those listed in Table 1, but we also added to the positive ones the number of laughs detected, as well as the number of recommendations present in the form of a “Follow Friday” hashtag (#FF), due to its ease of detection and its positive bias.

Classification is performed using scikit-learn, a Python module for machine learning. This package provides many algorithms such as Random Forest, Support Vector Machine (SVM) and so on. One of its main advantages is that it is supported by extensive documentation. Moreover, it is robust, fast and easy to use.

As stated before, we have two main training models: averaged centroids and averaged centroids including the inverse document frequency, for both the lemmatized and the non-lemmatized texts. We performed experiments using three different classifiers: Random Forests, Support Vector Machines and Logistic Regression, because these classifiers have often achieved the best results for text classification and sentiment analysis.

We also evaluated the impact of applying a set of emoticon rules as a pre-classification stage, similar to Chikersal et al. (2015), in which we determine a first-stage polarity for each tweet as follows:

     If posEmo is greater than zero and negEmo is equal to zero, the tweet is marked as “P”.
     If negEmo is greater than zero and posEmo is equal to zero, the tweet is marked as “N”.
     If both posEmo and negEmo are greater than zero, the tweet is marked as “NEU”.
     If both posEmo and negEmo are equal to zero, the tweet is marked as “NONE”.

After classification, we ran three tests: (i) applying no rule; (ii) honoring the polarity defined by the rule; and (iii) a mixed approach.
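As a concrete reading of Equations 1 and 2, the following sketch computes the tweet vector as a weighted average of word vectors. It is a minimal illustration under assumptions, not the authors' code: the function name `tweet_centroid` and the toy two-dimensional vectors are hypothetical, the IDF values are taken as a precomputed dictionary supplied by the caller, and tokens missing from the embedding model are simply skipped.

```python
from collections import Counter


def tweet_centroid(tokens, vectors, idf=None):
    """Weighted centroid of the word vectors of one tweet.

    tokens:  preprocessed tokens of the tweet
    vectors: dict mapping a word to its embedding (list of floats)
    idf:     optional dict of precomputed IDF weights; None gives the
             plain term-frequency centroid of Equation 1
    Returns None when no token has a usable vector (assumption of this sketch).
    """
    # TF(w_j, T): occurrences of each in-vocabulary word in the tweet.
    tf = Counter(t for t in tokens if t in vectors)
    weights = {
        word: count * (1.0 if idf is None else idf.get(word, 0.0))
        for word, count in tf.items()
    }
    norm = sum(weights.values())  # denominator of Equations 1 and 2
    if norm == 0:
        return None
    dim = len(vectors[next(iter(weights))])
    centroid = [0.0] * dim
    for word, weight in weights.items():
        for k, value in enumerate(vectors[word]):
            centroid[k] += weight * value / norm
    return centroid


# Toy 2-dimensional embeddings (the model in the paper has 300 dimensions).
toy_vectors = {"buen": [1.0, 0.0], "día": [0.0, 1.0]}
print(tweet_centroid(["buen", "día", "buen"], toy_vectors))
print(tweet_centroid(["buen", "día", "buen"], toy_vectors,
                     idf={"buen": 1.0, "día": 4.0}))
```

In the toy example the plain centroid leans towards the repeated word “buen”, while the IDF-weighted version leans towards “día”, the word with the larger IDF weight.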
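The first-stage emoticon rules amount to a small decision table, which can be written down directly. The sketch below is illustrative only: the function name `first_stage_polarity` is hypothetical, and the positive count is assumed to already include the detected laughs and #FF hashtags, as described for the posEmo feature.

```python
def first_stage_polarity(pos_emo, neg_emo):
    """Return the first-stage label from the emoticon counts of a tweet.

    pos_emo is assumed to already include laughs and #FF hashtags.
    """
    if pos_emo > 0 and neg_emo == 0:
        return "P"
    if neg_emo > 0 and pos_emo == 0:
        return "N"
    if pos_emo > 0 and neg_emo > 0:
        return "NEU"
    return "NONE"  # no emoticon evidence at all


for counts in [(2, 0), (0, 1), (1, 1), (0, 0)]:
    print(counts, first_stage_polarity(*counts))
```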


In the second test, we keep the predefined polarity if the tweet was marked as “P” or “N”; otherwise we take the value estimated by the classifier. The third test is a mixed approach in which we give each polarity a value (N+: -2; N: -1; NEU, NONE: 0; P: 1; P+: 2) and compute the arithmetic sum of the predefined and the estimated polarity if and only if they are not equal. For instance, if the classifier marked a tweet as “N” and the rules marked it as “P”, the tweet is classified as “NEU”.

3   Results
In order to choose the best-performing classifiers, we use 10-fold cross-validation, because there is no development dataset and this strategy has become the standard method in practical terms. Our experiments showed that, although the results were similar [3], the best settings for the 5-levels task are:

        RUN-1: Support Vector Machine, over the averaged centroids without applying any rules for pre-defining polarities.
        RUN-2: Support Vector Machine, over the averaged centroids and applying the mixed rules approach.
        RUN-3: Logistic Regression, over the centroids with inverse document frequency and applying the mixed rules approach.

and for the 3-levels task are:

        RUN-1: Support Vector Machine, over the averaged centroids and applying the mixed rules approach.
        RUN-2: Logistic Regression, over the centroids with inverse document frequency and applying the mixed rules approach.
        RUN-3: Logistic Regression, over the averaged centroids and applying the mixed rules approach.

Tables 2 and 3 show the results for these settings provided by the TASS submission system. For each run, accuracy is provided as well as the macro-averaged precision, recall and F1-measure. As expected, the results for 3 levels are higher than for 5 levels because the training dataset is larger.

    [3] Experiments showed that non-lemmatized text performed better in all settings; hence the best settings reported here use the non-lemmatized model.

    Run      P      R      F1     Acc
    RUN-1    0.411  0.449  0.429  0.527
    RUN-2    0.412  0.448  0.429  0.527
    RUN-3    0.402  0.436  0.418  0.549

    Table 2: Results for Sentiment Analysis at global level (5 levels, Full test corpus)

    Run      P      R      F1     Acc
    RUN-1    0.506  0.510  0.508  0.652
    RUN-2    0.508  0.508  0.508  0.652
    RUN-3    0.512  0.511  0.511  0.653

    Table 3: Results for Sentiment Analysis at global level (3 levels, Full test corpus)

With the settings mentioned above, the results obtained are extremely similar, but we can state that, in terms of accuracy, Logistic Regression reports the best results. Although it was not measured in this work, it is worth mentioning that Logistic Regression was also noticeably faster.

4   Conclusions and future work
This paper explores the use of word embeddings for the task of sentiment analysis. Instead of using the bag-of-words model to represent tweets, these are represented as word vectors taken from a pre-trained model of word embeddings. An important advantage of the word embedding model compared to the bag-of-words representation is that it achieves a significant dimensionality reduction of the feature set needed to represent tweets and leads, therefore, to a reduction of the training and testing time of the algorithms.

In order to use word embedding models properly, a preprocessing stage had to be completed before training a classifier. Due to the unstructured nature of tweets, this preprocessing proved to be a very important step in order to standardize the input data to some degree. The experimentation showed that the three tested classifiers obtained very similar results, with Random Forest performing slightly worse and Logistic Regression being slightly better and much faster.

One of the main drawbacks of our approach is that many words do not have a word vector in the word embedding model used for our experiments. An analysis showed that many
of these words come from hashtags, which are usually short phrases. Therefore, we should apply a more sophisticated method in order to extract the words that form a hashtag.

As future work, we also plan to use a word embedding model trained on a collection of texts from Spanish social media. We think that this will have a positive effect on the performance of our system in identifying the polarity of tweets, because this model will be generated from documents characterized by the main features that describe social media texts (for example, informal language, plenty of grammatical errors and spelling mistakes, slang and vulgar vocabulary).

Acknowledgments
This work was supported by eGovernAbility-Access project (TIN2014-52665-C2-2-R).

References
Alam, F., A. Corazza, A. Lavelli, and R. Zanoli. 2016. A knowledge-poor approach to chemical-disease relation extraction. Database, 2016:baw071.

Cardellino, C. 2016. Spanish Billion Words Corpus and Embeddings, March.

Chikersal, P., S. Poria, E. Cambria, A. Gelbukh, and C. E. Siong. 2015. Modelling public sentiment in twitter: using linguistic patterns to enhance supervised learning. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 49–65. Springer.

García-Cumbreras, M. A., J. Villena-Román, E. Martínez-Cámara, M. C. Díaz-Galiano, M. T. Martín-Valdivia, and L. A. Ureña López. 2016. Overview of TASS 2016. In Proceedings of TASS 2016: Workshop on Sentiment Analysis at SEPLN co-located with the 32nd SEPLN Conference (SEPLN 2016), Salamanca, Spain, September.

Mikolov, T. and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems.

Molina-González, M. D., E. Martínez-Cámara, M.-T. Martín-Valdivia, and J. M. Perea-Ortega. 2013. Semantic orientation for polarity classification in Spanish reviews. Expert Systems with Applications, 40(18):7250–7257.

Pang, B. and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2(1-2):1–135.

Segura-Bedmar, I., V. Suárez-Paniagua, and P. Martínez. 2015. Exploring word embedding for drug name recognition. In Sixth International Workshop on Health Text Mining and Information Analysis (LOUHI), page 64.

Socher, R., J. Bauer, C. D. Manning, and A. Y. Ng. 2013a. Parsing with compositional vector grammars. In ACL (1), pages 455–465.

Socher, R., A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), volume 1631, page 1642. Citeseer.