<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>LABDA at the 2016 TASS challenge task: using word embeddings for the sentiment analysis task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Quiros</string-name>
          <email>antonio.quiros@sngular.team</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabel Segura-Bedmar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paloma Martínez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Informática, Universidad Carlos III de Madrid, Avd. de la Universidad</institution>
          ,
          <addr-line>30, 28911, Leganés, Madrid, España</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sngular Data&amp;Analytics Av. LLano Castellano 13</institution>
          ,
          <addr-line>Planta 5, 28034 Madrid, España</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>2</lpage>
      <abstract>
        <p>This paper describes the participation of the LABDA group in Task 1 (Sentiment Analysis at global level). Our approach exploits word embedding representations for tweets and machine learning algorithms such as SVM and logistic regression.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Knowing the opinion of customers or users has become a priority for companies and organizations seeking to improve the quality of their services and products. The ongoing explosion of social media affords a significant opportunity to poll the opinion of many Internet users by processing their comments. However, it should be noted that sentiment analysis, which can be defined as the automatic analysis of opinion in texts
        <xref ref-type="bibr" rid="ref7">(Pang and Lee, 2008)</xref>
        , is a challenging task because it is not unusual for different people to assign different polarities to the same text. On Twitter, the task is even more difficult, because the texts are short (only 140 characters) and are characterized by an informal style, many grammatical errors and spelling mistakes, slang and vulgar vocabulary, and abbreviations.
      </p>
      <p>
        Since their introduction in 2013, the TASS shared task editions have had as their main goal to promote the development of methods and resources for sentiment analysis of tweets in Spanish. This paper describes the participation of the LABDA group in Task 1 (Sentiment Analysis at global level). In this task, the participating systems have to determine the global polarity of each tweet in the test dataset. There are two different evaluations: one based on 6 different polarity labels (P+, P, NEU, N, N+, NONE) and another based on just 4 labels (P, N, NEU, NONE). A detailed description of the task can be found in the overview paper of TASS 2016
        <xref ref-type="bibr" rid="ref4">(García-Cumbreras et al., 2016)</xref>
        . Our approach exploits word embedding representations for tweets and machine learning algorithms such as SVM and logistic regression. The word embedding model can yield a significant dimensionality reduction compared to the classical Bag-of-Words (BoW) model. The dimensionality reduction can have several positive effects on our algorithms, such as faster training, avoiding overfitting, and better performance.
      </p>
      <p>The paper is organized as follows. Section 2 describes our approach. The experimental results are presented and discussed in Section 3. We conclude in Section 4 with a summary of our findings and some directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>System</title>
      <p>
        In this paper, we study the use of word embeddings (also known as word vectors) to represent tweets, and then examine several machine learning algorithms to classify them. Word embeddings have shown promising results in NLP tasks such as named entity recognition (Segura-Bedmar, Suárez-Paniagua, and Martínez, 2015), relation extraction
        <xref ref-type="bibr" rid="ref1">(Alam et al., 2016)</xref>
        , sentiment analysis
        <xref ref-type="bibr" rid="ref8 ref9">(Socher et al., 2013b)</xref>
        or parsing
        <xref ref-type="bibr" rid="ref8 ref9">(Socher et al., 2013a)</xref>
        . A word embedding is a function that maps words to low-dimensional vectors, which are learned from a large collection of texts. At present, neural networks are among the most widely used learning techniques for generating word embeddings
        <xref ref-type="bibr" rid="ref5">(Mikolov and Dean, 2013)</xref>
        . The essential assumption of this model is that semantically close words have similar vectors (in terms of cosine similarity). Word embeddings can help to capture semantic and syntactic relationships of the corresponding words.
      </p>
      <p>While the well-known Bag-of-Words (BoW) model involves a very large number of features (as many as the number of non-stopword words with at least a minimum number of occurrences in the training data), the word embedding representation allows a significant reduction in the feature set size (in our case, from millions to just 300). Dimensionality reduction is a desirable goal, because it helps to avoid overfitting and reduces training and classification times, without any performance loss.</p>
      <p>As a preprocessing step, tweets must be cleaned. First, we remove all links and URLs. We then remove usernames, which can be easily recognized because their first character is the symbol @. We then transform the hashtags into words by removing their first character (that is, the symbol #). Taking advantage of regular expressions, the emoticons are detected and classified in order to count the number of positive and negative emoticons in each tweet, and then we remove them from the text. Table 1 shows the list of positive and negative emoticons, which were taken from the Wikipedia page https://en.wikipedia.org/wiki/List_of_emoticons. We convert the tweets to lowercase and replace misspelled accented letters with the correct ones (for instance, "a" with "á"). We also treat elongations (that is, the repetition of a character) by removing the repetitions of a character after its second occurrence (for example, "hoooolaaaa" would be translated to "hola"). We then decided to take laughs into account (for instance, "jajaja"), which turned out to be challenging because of the diverse ways they are expressed (i.e. expressions like "jajajaja" or "jejeje" and even misspelled ones like "jajjajaaj"). We addressed this by using regular expressions to standardize the different forms (i.e. "jajjjaaj" to "jajaja") and then replacing them with the word "risas". Finally, we remove all non-letter characters and all stopwords present in tweets (using the Spanish stopword list from http://snowball.tartarus.org/algorithms/spanish/stop.txt).</p>
      <sec id="sec-2-1">
        <title>Emoticons</title>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>List of positive and negative emoticons.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Orientation</th><th>Emoticons</th></tr>
            </thead>
            <tbody>
              <tr><td>Positive</td><td>:-), :), :D, :o), :], D:3, :c), :&gt;, =], 8), =), :g, :^), :-D, 8-D, 8D, x-D, xD, X-D, XD, =-D, =D, =-3, =3, B^D, :'), :'), :*, :-*, :^*, ;-), ;), *-), *), ;], ;], ;D, ;^), &gt;:P, :-P, :P, X-P, x-p, xp, XP, :-p, :p, =p, :-b, :b</td></tr>
              <tr><td>Negative</td><td>&gt;:[, :-(, :(, :-c, :-&lt;, :&lt;, :-[, :[, :f, ;(, :jj, &gt;:(, :'-(, :'(, D:&lt;, D=, v.v</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-2-2">
        <title>Tweet representation and classification</title>
        <p>
          Once the tweets are preprocessed, they are tokenized using the NLTK toolkit (a Python package for NLP); we also experimented with lemmatizing each tweet using the MeaningCloud (https://www.meaningcloud.com/) text analytics software in order to compare both approaches. Then, for each token, we look up its vector in the word embedding model. We use a pretrained model
          <xref ref-type="bibr" rid="ref2">(Cardellino, 2016)</xref>
          , which was generated by using the word2vec algorithm
          <xref ref-type="bibr" rid="ref5">(Mikolov and Dean, 2013)</xref>
          from a collection of Spanish texts with approximately 1.5 billion words. The dimension of the word embeddings is 300. It should be noted that these texts were taken from different resources such as the Spanish Wikipedia, WikiSource and Wikibooks, but none of them contains tweets. Therefore, it is possible that the main characteristics of social media texts (such as informal style, noise, plenty of grammatical errors and spelling mistakes, slang and vulgar vocabulary, abbreviations, etc.) are not correctly represented in this model. One of the main problems is that a significant number of words (almost 13% of the vocabulary, representing 6% of word occurrences) are not found in the model. We reviewed a small sample of these words, finding that most of them were hashtags.
        </p>
        <p>In our approach, a tweet of n tokens (T = w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) is represented as the centroid of the word vectors of its tokens, as shown in the following equation:</p>
        <disp-formula id="eq1">
          <tex-math><![CDATA[\vec{T} = \frac{1}{n}\sum_{i=1}^{n}\vec{w}_i = \frac{\sum_{j=1}^{N}\vec{w}_j \cdot TF(w_j,t)}{\sum_{j=1}^{N} TF(w_j,t)} \qquad (1)]]></tex-math>
        </disp-formula>
        <p>where N is the vocabulary size, that is, the total number of distinct words, while TF(w<sub>j</sub>, t) refers to the number of occurrences of the j-th vocabulary word in the tweet T.</p>
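A minimal sketch of Equation 1, assuming the embedding model behaves as a mapping from word to 300-dimensional NumPy vector (the function name is ours; out-of-vocabulary tokens are simply skipped):

```python
import numpy as np

def tweet_centroid(tokens, model, dim=300):
    """Average the word vectors of a tweet's tokens (Equation 1).

    `model` is assumed to map each known word to a NumPy vector,
    e.g. loaded from the pretrained Spanish word2vec model.
    """
    vecs = [model[t] for t in tokens if t in model]
    if not vecs:
        return np.zeros(dim)  # no known token: fall back to the zero vector
    return np.mean(vecs, axis=0)
```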
        <p>We also explore the effect of including inverse document frequencies (IDF) in the tweet representation (see Equation 2). This helps to increase the weight of words that occur often, but only in a few documents, while it reduces the relevance of words that occur very frequently across a larger number of texts.</p>
        <disp-formula id="eq2">
          <tex-math><![CDATA[\vec{T} = \frac{\sum_{j=1}^{N}\vec{w}_j \cdot TF(w_j,t) \cdot IDF(w_j)}{\sum_{j=1}^{N} TF(w_j,t) \cdot IDF(w_j)} \qquad (2)]]></tex-math>
        </disp-formula>
        <p>where IDF(w<sub>j</sub>) = log(|D| / |{tw ∈ D : w<sub>j</sub> ∈ tw}|) and |D| refers to the number of tweets.</p>
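Equation 2 can be sketched in the same style; here `idf` is assumed to be a precomputed mapping from word to log(|D|/df(word)), and the function name is ours:

```python
import numpy as np
from collections import Counter

def tweet_centroid_idf(tokens, model, idf, dim=300):
    """TF-IDF-weighted centroid of a tweet's word vectors (Equation 2)."""
    tf = Counter(tokens)                    # term frequencies within the tweet
    num, den = np.zeros(dim), 0.0
    for word, freq in tf.items():
        if word in model and word in idf:   # skip OOV words
            num += model[word] * (freq * idf[word])
            den += freq * idf[word]
    return num / den if den > 0 else num
```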
        <p>In addition to using the centroid, we assess the impact of complementing the tweet model with the following additional features:</p>
        <list list-type="bullet">
          <list-item><p>posWords: number of positive words present in the tweet.</p></list-item>
          <list-item><p>negWords: number of negative words present in the tweet.</p></list-item>
          <list-item><p>posEmo: number of positive emoticons present in the tweet.</p></list-item>
          <list-item><p>negEmo: number of negative emoticons present in the tweet.</p></list-item>
        </list>
        <p>
          For the posWords and negWords features we used the iSOL lexicon
          <xref ref-type="bibr" rid="ref6">(Molina-Gonzalez et al., 2013)</xref>
          , a list composed of 2,509 positive words and 5,626 negative words. As described before, for the emoticons we used the ones listed in Table 1, but we also added the number of laughs detected to the positive count; in addition, we included the number of recommendations present in the form of a "Follow Friday" hashtag (#FF), due to its ease of detection and its positive bias.
        </p>
        <p>Classification is performed using scikit-learn, a Python module for machine learning. This package provides many algorithms, such as Random Forest and Support Vector Machines (SVM). One of its main advantages is that it is supported by extensive documentation. Moreover, it is robust, fast and easy to use.</p>
        <p>As stated before, we have two main training models: averaged centroids, and averaged centroids including the inverse document frequency, for both the lemmatized and non-lemmatized texts. We performed experiments using three different classifiers: Random Forests, Support Vector Machines and Logistic Regression, because these classifiers often achieve the best results for text classification and sentiment analysis.</p>
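A minimal sketch of this model comparison with scikit-learn; random data stands in for the TASS centroid matrix, and the hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# X would be the matrix of tweet centroids (one 300-d row per tweet)
# and y the polarity labels; random data stands in for the real corpus.
rng = np.random.RandomState(0)
X, y = rng.randn(100, 10), rng.randint(0, 2, 100)

classifiers = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
# 10-fold cross-validation, as used to select the submitted runs.
scores = {name: cross_val_score(clf, X, y, cv=10).mean()
          for name, clf in classifiers.items()}
```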
        <p>
          We also evaluated the impact of applying a set of emoticon rules as a pre-classification stage, similar to
          <xref ref-type="bibr" rid="ref3">(Chikersal et al., 2015)</xref>
          , in which we determine a first-stage polarity for each tweet as follows:
        </p>
        <list list-type="bullet">
          <list-item><p>If posEmo is greater than zero and negEmo is equal to zero, the tweet is marked as "P".</p></list-item>
          <list-item><p>If negEmo is greater than zero and posEmo is equal to zero, the tweet is marked as "N".</p></list-item>
          <list-item><p>If both posEmo and negEmo are greater than zero, the tweet is marked as "NEU".</p></list-item>
          <list-item><p>If both posEmo and negEmo are equal to zero, the tweet is marked as "NONE".</p></list-item>
        </list>
        <p>Then, after the classification takes place, we made three tests: i) applying no rule; ii) honoring the polarity defined by the rule, which means we keep the predefined polarity if the tweet was marked as "P" or "N", and otherwise take the value estimated by the classifier; and iii) a mixed approach where we give each polarity a value (N+: -2; N: -1; NEU, NONE: 0; P: 1; P+: 2) and perform an arithmetic sum of both the predefined and the estimated polarity if and only if they are not equal; with this, for instance, if the classifier marked a tweet as "N" and the rules marked it as "P", the tweet will be classified as "NEU".</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>In order to choose the best-performing classifiers, we use 10-fold cross-validation, because there is no development dataset and this strategy has become the standard method in practical terms. Our experiments showed that, although the results were similar, the best settings for the 5-level task are:</p>
      <list list-type="bullet">
        <list-item><p>RUN-1: Support Vector Machine, over the averaged centroids, without applying any rules for pre-defining polarities.</p></list-item>
        <list-item><p>RUN-2: Support Vector Machine, over the averaged centroids, applying the mixed rules approach.</p></list-item>
        <list-item><p>RUN-3: Logistic Regression, over the centroids with inverse document frequency, applying the mixed rules approach.</p></list-item>
      </list>
      <p>and for the 3-level task are:</p>
      <list list-type="bullet">
        <list-item><p>RUN-1: Support Vector Machine, over the averaged centroids, applying the mixed rules approach.</p></list-item>
        <list-item><p>RUN-2: Logistic Regression, over the centroids with inverse document frequency, applying the mixed rules approach.</p></list-item>
        <list-item><p>RUN-3: Logistic Regression, over the averaged centroids, applying the mixed rules approach.</p></list-item>
      </list>
      <p>With the settings mentioned above, the obtained results are extremely similar, but we can state that, in terms of accuracy, Logistic Regression reports the best results; and, although it was not measured in this work, it is worth mentioning that Logistic Regression was observably faster.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and future work</title>
      <p>This paper explores the use of word embeddings for the task of sentiment analysis. Instead of using the bag-of-words model to represent tweets, these are represented through word vectors taken from a pretrained word embedding model. An important advantage of the word embedding model compared to the bag-of-words representation is that it achieves a significant dimensionality reduction of the feature set needed to represent tweets and therefore leads to a reduction of the training and testing time of the algorithms.</p>
      <p>In order to use word embedding models properly, a preprocessing stage had to be completed before training a classifier. Due to the unstructured nature of the tweets, this preprocessing proved to be a very important step in order to standardize the input data to some degree. The experimentation showed that the three tested classifiers obtained very similar results, with Random Forest performing slightly worse and Logistic Regression being slightly better and much faster.</p>
      <p>One of the main drawbacks of our approach is that many words do not have a word vector in the word embedding model used for our experiments. An analysis showed that many of these words come from hashtags, which are usually short phrases. Therefore, we should apply a more sophisticated method in order to extract the words forming hashtags.</p>
      <p>As future work, we also plan to use a word embedding model trained on a collection of texts from Spanish social media. We think that this will have a positive effect on the performance of our system in identifying the polarity of tweets, because this model will be generated from documents characterized by the main features that describe social media texts (for example, informal style, plenty of grammatical errors and spelling mistakes, slang and vulgar vocabulary).</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported by the eGovernAbility-Access project (TIN2014-52665-C2-2-R).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Alam</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Corazza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavelli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanoli</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A knowledge-poor approach to chemical-disease relation extraction</article-title>
          .
          <source>Database</source>
          ,
          <year>2016</year>
          :
          <fpage>baw071</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>Cardellino, C. <year>2016</year>. <article-title>Spanish Billion Words Corpus and Embeddings</article-title>, March.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>Chikersal, P., S. Poria, E. Cambria, A. Gelbukh, and C. E. Siong. <year>2015</year>. <article-title>Modelling public sentiment in twitter: using linguistic patterns to enhance supervised learning</article-title>. <source>In International Conference on Intelligent Text Processing and Computational Linguistics</source>, pages 49-65. Springer.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>García-Cumbreras, M. A., J. Villena-Román, E. Martínez-Cámara, M. C. Díaz-Galiano, M. T. Martín-Valdivia, and L. A. Ureña-López. <year>2016</year>. <article-title>Overview of TASS 2016</article-title>. <source>In Proceedings of TASS 2016: Workshop on Sentiment Analysis at SEPLN co-located with the 32nd SEPLN Conference (SEPLN 2016)</source>, Salamanca, Spain, September.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>Advances in neural information processing systems.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>Molina-González, M. D., E. Martínez-Cámara, M. T. Martín-Valdivia, and J. M. Perea-Ortega. <year>2013</year>. <article-title>Semantic orientation for polarity classification in spanish reviews</article-title>. <source>Expert Systems with Applications</source>, <volume>40</volume>(<issue>18</issue>):7250-7257.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>Pang, B. and L. Lee. <year>2008</year>. <article-title>Opinion mining and sentiment analysis</article-title>. <source>Foundations and trends in information retrieval</source>, <volume>2</volume>(<issue>1-2</issue>):1-135.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>Socher, R., J. Bauer, C. D. Manning, and A. Y. Ng. 2013a. <article-title>Parsing with compositional vector grammars</article-title>. <source>In ACL (1)</source>, pages 455-465.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perelygin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          . 2013b.
          <article-title>Recursive deep models for semantic compositionality over a sentiment treebank</article-title>
          .
          <source>In Proceedings of the conference on empirical methods in natural language processing (EMNLP)</source>
          , volume
          <volume>1631</volume>
          , page 1642. Citeseer.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>Segura-Bedmar, I., V. Suárez-Paniagua, and P. Martínez. <year>2015</year>. <article-title>Exploring word embedding for drug name recognition</article-title>. <source>In Sixth International Workshop on Health Text Mining and Information Analysis (LOUHI)</source>, page 64.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>