=Paper=
{{Paper
|id=Vol-1702/tass2016_proceedings_v28
|storemode=property
|title=LABDA at the 2016 TASS Challenge Task: Using Word Embeddings for the Sentiment Analysis Task
|pdfUrl=https://ceur-ws.org/Vol-1702/tass2016_proceedings_v28.pdf
|volume=Vol-1702
|authors=Antonio Quirós,Isabel Segura-Bedmar,Paloma Martínez
|dblpUrl=https://dblp.org/rec/conf/sepln/QuirosSM16
}}
==LABDA at the 2016 TASS Challenge Task: Using Word Embeddings for the Sentiment Analysis Task==
TASS 2016: Workshop on Sentiment Analysis at SEPLN, September 2016, pp. 29-33
LABDA at the 2016 TASS challenge task: using word embeddings for the sentiment analysis task∗
LABDA en la competición TASS 2016: utilizando vectores de palabras para la tarea de análisis de sentimiento
Antonio Quirós 1,2, Isabel Segura-Bedmar 1, and Paloma Martínez 1
1 Departamento de Informática, Universidad Carlos III de Madrid, Avda. de la Universidad, 30, 28911 Leganés, Madrid, España
100342879@alumnos.uc3m.es, {isegura,pmf}@inf.uc3m.es
2 Sngular Data&Analytics, Av. Llano Castellano 13, Planta 5, 28034 Madrid, España
antonio.quiros@sngular.team
Resumen: This paper describes the participation of the LABDA group in Task 1 (Sentiment Analysis at global level) of the TASS 2016 competition. In our approach, tweets are represented by means of word vectors and are classified using algorithms such as SVM and logistic regression.
Palabras clave: Sentiment Analysis, Word vectors
Abstract: This paper describes the participation of the LABDA group in Task 1 (Sentiment Analysis at global level) of the TASS 2016 challenge. Our approach exploits word embedding representations for tweets and machine learning algorithms such as SVM and logistic regression.
Keywords: Sentiment Analysis, Word embeddings
1 Introduction

Knowing the opinion of customers or users has become a priority for companies and organizations that want to improve the quality of their services and products. The ongoing explosion of social media affords a significant opportunity to poll the opinion of many Internet users by processing their comments. However, it should be noted that sentiment analysis, which can be defined as the automatic analysis of opinion in texts (Pang and Lee, 2008), is a challenging task, because it is not unusual for different people to assign different polarities to the same text. On Twitter, the task is even more difficult, because the texts are short (only 140 characters) and are characterized by their informal language, many grammatical errors and spelling mistakes, slang and vulgar vocabulary, and abbreviations.

Since their introduction in 2013, the TASS shared task editions have had as their main goal to promote the development of methods and resources for the sentiment analysis of tweets in Spanish. This paper describes the participation of the LABDA group in Task 1 (Sentiment Analysis at global level). In this task, the participating systems have to determine the global polarity of each tweet in the test dataset. There are two different evaluations: one based on 6 different polarity labels (P+, P, NEU, N, N+, NONE) and another based on just 4 labels (P, N, NEU, NONE). A detailed description of the task can be found in the overview paper of TASS 2016 (García-Cumbreras et al., 2016). Our approach exploits word embedding representations for tweets and machine learning algorithms such as SVM and logistic regression. The word embedding model can yield a significant dimensionality reduction compared to the classical Bag-of-Words (BoW) model. This dimensionality reduction can have several positive effects on our algorithms, such as faster training, reduced overfitting and better performance.

The paper is organized as follows. Section 2 describes our approach. The experimental results are presented and discussed in Section 3. We conclude in Section 4 with a summary of our findings and some directions for future work.

∗ This work was supported by the eGovernAbility-Access project (TIN2014-52665-C2-2-R).
ISSN 1613-0073
2 System

In this paper, we study the use of word embeddings (also known as word vectors) in order to represent tweets, and then examine several machine learning algorithms to classify them. Word embeddings have shown promising results in NLP tasks such as named entity recognition (Segura-Bedmar, Suárez-Paniagua, and Martínez, 2015), relation extraction (Alam et al., 2016), sentiment analysis (Socher et al., 2013b) and parsing (Socher et al., 2013a). A word embedding is a function that maps words to low-dimensional vectors, which are learned from a large collection of texts. At present, neural networks are among the most widely used learning techniques for generating word embeddings (Mikolov and Dean, 2013). The essential assumption of this model is that semantically close words will have similar vectors (in terms of cosine similarity). Word embeddings can help to capture semantic and syntactic relationships of the corresponding words.

While the well-known Bag-of-Words (BoW) model involves a very large number of features (as many as the number of non-stopword words with at least a minimum number of occurrences in the training data), the word embedding representation allows a significant reduction in the feature set size (in our case, from millions to just 300). This dimensionality reduction is a desirable goal, because it helps to avoid overfitting and leads to a reduction of the training and classification times, without any performance loss.

As a preprocessing step, tweets must be cleaned. First, we remove all links and URLs. We then remove usernames, which can be easily recognized because their first character is the symbol @. We then transform hashtags into words by removing their first character (that is, the symbol #). Taking advantage of regular expressions, the emoticons are detected and classified in order to count the number of positive and negative emoticons in each tweet, and they are then removed from the text. Table 1 shows the list of positive and negative emoticons, which were taken from the Wikipedia page https://en.wikipedia.org/wiki/List_of_emoticons. We convert the tweets to lowercase and replace misspelled accented letters with the correct ones (for instance "à" with "á"). We also treat elongations (that is, the repetition of a character) by removing the repetitions of a character after its second occurrence (for example, "hoooolaaaa" would be translated to "hola"). We also decided to take laughs into account (for instance "jajaja"), which turned out to be challenging because of the diverse ways in which they are expressed (i.e. expressions like "jajajaja" or "jejeje", and even misspelled ones like "jajjajaaj"). We addressed this by using regular expressions to standardize the different forms (i.e. "jajjjaaj" to "jajaja") and then replacing them with the word "risas" ("laughs"). Finally, we remove all non-letter characters and all stopwords present in the tweets¹.

Positive: :-), :), :D, :o), :], :3, :c), :>, =], 8), =), :}, :^), :-D, 8-D, 8D, x-D, xD, X-D, XD, =-D, =D, =-3, =3, B^D, :'-), :'), :*, :-*, :^*, ;-), ;), *-), *), ;-], ;], ;D, ;^), >:P, :-P, :P, X-P, x-p, xp, XP, :-p, :p, =p, :-b, :b
Negative: >:[, :-(, :(, :-c, :-<, :<, :-[, :[, :{, ;(, :-||, >:(, :'-(, :'(, D:<, D=, v.v

Table 1: List of positive and negative emoticons

Once the tweets are preprocessed, they are tokenized using the NLTK toolkit (a Python package for NLP); we also experimented with lemmatizing each tweet using the MeaningCloud² Text Analytics software, in order to compare both approaches. Then, for each token, we look up its vector in the word embedding model. We use a pretrained model (Cardellino, 2016), which was generated by applying the word2vec algorithm (Mikolov and Dean, 2013) to a collection of Spanish texts with approximately 1.5 billion words. The dimension of the word embeddings is 300.

¹ http://snowball.tartarus.org/algorithms/spanish/stop.txt
² https://www.meaningcloud.com/
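The cleaning steps above can be sketched in Python as follows; this is a minimal sketch in which the regular expressions and the emoticon subsets are our own illustrative approximations, not the authors' exact patterns:

```python
import re

# Small illustrative subsets of the emoticons in Table 1 (not the full lists).
POSITIVE_EMOTICONS = {":-)", ":)", ":D", ";)", "=)", ":P"}
NEGATIVE_EMOTICONS = {":-(", ":(", ":'(", "D:<", ":["}

def preprocess(tweet):
    """Clean a tweet and count its positive/negative emoticons."""
    # Count emoticons first, then strip them from the text.
    pos = sum(tweet.count(e) for e in POSITIVE_EMOTICONS)
    neg = sum(tweet.count(e) for e in NEGATIVE_EMOTICONS)
    for e in POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS:
        tweet = tweet.replace(e, " ")
    tweet = re.sub(r"https?://\S+", " ", tweet)   # remove links and URLs
    tweet = re.sub(r"@\w+", " ", tweet)           # remove @usernames
    tweet = tweet.replace("#", "")                # hashtag -> word
    tweet = tweet.lower()
    # Elongations: as stated in the text, drop repetitions of a character
    # after its second occurrence.
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)
    # Laughs ("jajaja", "jejeje", misspelled variants) -> the word "risas".
    tweet = re.sub(r"\b(?:j+[aei]+){2,}j*\b", "risas", tweet)
    tweet = re.sub(r"[^a-záéíóúüñ ]", " ", tweet) # drop non-letter characters
    return re.sub(r"\s+", " ", tweet).strip(), pos, neg
```

Stopword removal (footnote 1) would be applied afterwards on the token list.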
It should be noted that the texts used to train this embedding model were taken from different resources, such as the Spanish Wikipedia, Wikisource and Wikibooks, but none of them contains tweets. Therefore, it is possible that the main characteristics of social media texts (such as informal language, noise, plenty of grammatical errors and spelling mistakes, slang and vulgar vocabulary, abbreviations, etc.) are not correctly represented in this model. One of the main problems is that a significant number of words (almost 13% of the vocabulary, representing 6% of word occurrences) are not found in the model. We performed a review of a small sample of these words, showing that most of them were mainly hashtags.

In our approach, a tweet of n tokens (T = w1, w2, ..., wn) is represented as the centroid of the word vectors \vec{w}_i of its tokens, as shown in the following equation:

\vec{T} = \frac{1}{n}\sum_{i=1}^{n}\vec{w}_i = \frac{\sum_{j=1}^{N}\vec{w}_j \cdot TF(w_j,t)}{\sum_{j=1}^{N} TF(w_j,t)}    (1)

where N is the vocabulary size, that is, the total number of distinct words, while TF(w_j, t) refers to the number of occurrences of the j-th vocabulary word in the tweet T.

We also explore the effect of including the inverse document frequencies (IDF) to represent tweets (see Equation 2). This helps to increase the weight of words that occur often, but only in a few documents, while it reduces the relevance of words that occur very frequently in a larger number of texts.

\vec{T} = \frac{1}{n}\sum_{i=1}^{n}\vec{w}_i = \frac{\sum_{j=1}^{N}\vec{w}_j \cdot TF(w_j,t) \cdot IDF(w_j)}{\sum_{j=1}^{N} TF(w_j,t) \cdot IDF(w_j)}    (2)

having IDF(w_j) = \log\frac{|D|}{|\{t_w \in D : w_j \in t_w\}|}, where |D| refers to the number of tweets.

In addition to using the centroid, we assess the impact of complementing the tweet model with the following additional features:

posWords: number of positive words present in the tweet.
negWords: number of negative words present in the tweet.
posEmo: number of positive emoticons present in the tweet.
negEmo: number of negative emoticons present in the tweet.

For the posWords and negWords features we used the iSOL lexicon (Molina-González et al., 2013), a list composed of 2,509 positive words and 5,626 negative words. As described before, for the emoticons we used those listed in Table 1, but we also added the number of laughs detected to the positive count; in addition, we included the number of recommendations present in the form of a "Follow Friday" hashtag (#FF), due to its ease of detection and its positive bias.

Classification is performed using scikit-learn, a Python module for machine learning. This package provides many algorithms, such as Random Forest, Support Vector Machines (SVM) and so on. One of its main advantages is that it is supported by extensive documentation. Moreover, it is robust, fast and easy to use.

As stated before, we have two main training models: the averaged centroids, and the averaged centroids including the inverted document frequency, for both the lemmatized and non-lemmatized texts. We performed experiments using three different classifiers: Random Forests, Support Vector Machines and Logistic Regression, because these classifiers often achieve the best results for text classification and sentiment analysis.

We also evaluated the impact of applying a set of emoticon rules as a pre-classification stage, similar to (Chikersal et al., 2015), in which we determine a first-stage polarity for each tweet as follows:

If posEmo is greater than zero and negEmo is equal to zero, the tweet is marked as "P".
If negEmo is greater than zero and posEmo is equal to zero, the tweet is marked as "N".
If both posEmo and negEmo are greater than zero, the tweet is marked as "NEU".
If both posEmo and negEmo are equal to zero, the tweet is marked as "NONE".
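The centroid representations of Equations 1 and 2 can be sketched with NumPy as follows; this is a minimal sketch in which the function names and the toy 3-dimensional embeddings are ours, standing in for the 300-dimensional pretrained model:

```python
import numpy as np

def tweet_centroid(tokens, embeddings, idf=None):
    """Average the word vectors of a tweet's tokens (Eq. 1), optionally
    weighting each vector by its token's IDF (Eq. 2). Tokens missing from
    the embedding model are skipped."""
    vecs, weights = [], []
    for tok in tokens:
        if tok in embeddings:
            vecs.append(embeddings[tok])
            weights.append(1.0 if idf is None else idf.get(tok, 0.0))
    if not vecs or sum(weights) == 0:
        # No known tokens: fall back to the zero vector.
        return np.zeros(next(iter(embeddings.values())).shape)
    # Weighted average = sum(w_j * weight_j) / sum(weight_j); repeated
    # tokens contribute once per occurrence, which reproduces the TF terms.
    return np.average(np.array(vecs), axis=0, weights=weights)

# Toy 3-dimensional "model" (the real one has 300 dimensions).
emb = {"hola": np.array([1.0, 0.0, 0.0]), "mundo": np.array([0.0, 1.0, 0.0])}
centroid = tweet_centroid(["hola", "mundo", "hola"], emb)
```

Because each occurrence of a token contributes its vector once, the plain average over occurrences equals the TF-weighted sum over the vocabulary in Equation 1.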
Then, after the classification takes place, we made three tests: i) applying no rule; ii) honoring the polarity defined by the rule, which means that we keep the predefined polarity if the tweet was marked as "P" or "N" and otherwise take the value estimated by the classifier; and iii) a mixed approach where we give each polarity a value (N+: -2; N: -1; NEU, NONE: 0; P: 1; P+: 2) and perform an arithmetic sum of the predefined and the estimated polarity if and only if they are not equal. With that, for instance, if the classifier marked a tweet as "N" and the rules marked it as "P", the tweet will be classified as "NEU".

3 Results

In order to choose the best-performing classifiers, we use 10-fold cross-validation, because there is no development dataset and this strategy has become the standard method in practical terms. Our experiments showed that, although the results were similar³, the best settings for the 5-levels task are:

RUN-1: Support Vector Machine over the averaged centroids, without applying any rules for pre-defining polarities.
RUN-2: Support Vector Machine over the averaged centroids, applying the mixed rules approach.
RUN-3: Logistic Regression over the centroids with inverted document frequency, applying the mixed rules approach.

and for the 3-levels task are:

RUN-1: Support Vector Machine over the averaged centroids, applying the mixed rules approach.
RUN-2: Logistic Regression over the centroids with inverted document frequency, applying the mixed rules approach.
RUN-3: Logistic Regression over the averaged centroids, applying the mixed rules approach.

³ Experiments showed that non-lemmatized text performed better in all settings, hence the best settings reported here use the non-lemmatized model.

Tables 2 and 3 show the results for these settings provided by the TASS submission system. For each run, accuracy is provided as well as the macro-averaged precision, recall and F1-measure. As expected, the results for 3 levels are higher than for 5 levels, because the training dataset is larger.

Run | P | R | F1 | Acc
RUN-1 | 0.411 | 0.449 | 0.429 | 0.527
RUN-2 | 0.412 | 0.448 | 0.429 | 0.527
RUN-3 | 0.402 | 0.436 | 0.418 | 0.549

Table 2: Results for Sentiment Analysis at global level (5 levels, full test corpus)

Run | P | R | F1 | Acc
RUN-1 | 0.506 | 0.510 | 0.508 | 0.652
RUN-2 | 0.508 | 0.508 | 0.508 | 0.652
RUN-3 | 0.512 | 0.511 | 0.511 | 0.653

Table 3: Results for Sentiment Analysis at global level (3 levels, full test corpus)

With the settings mentioned above, the obtained results are extremely similar, but we can state that, in terms of accuracy, Logistic Regression reports the best results; and, even though it is not measured in this work, it is worth mentioning that Logistic Regression was observably faster.

4 Conclusions and future work

This paper explores the use of word embeddings for the task of sentiment analysis. Instead of using the bag-of-words model to represent tweets, they are represented as word vectors taken from a pre-trained model of word embeddings. An important advantage of the word embedding model compared to the bag-of-words representation is that it achieves a significant dimensionality reduction of the feature set needed to represent tweets and therefore leads to a reduction of the training and testing time of the algorithms.

In order to use word embedding models properly, a preprocessing stage had to be completed before training a classifier. Due to the unstructured nature of tweets, this preprocessing proved to be a very important step in order to standardize the input data to some degree. The experimentation showed that the three tested classifiers obtained very similar results, with Random Forest performing slightly worse and Logistic Regression being slightly better and much faster.

One of the main drawbacks of our approach is that many words do not have a word vector in the word embedding model used for our experiments.
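The model-selection procedure of Section 3 (10-fold cross-validation over Random Forest, SVM and Logistic Regression) can be sketched with scikit-learn as follows; the toy feature matrix stands in for the tweet centroids of Equations 1 and 2, and the specific estimator classes are our own reasonable choices, not necessarily the authors' exact configurations:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-ins: 200 "tweet centroids" (300-dimensional in the paper,
# 10-dimensional here) with random binary polarity labels. In the real
# system, X would be built from Eq. 1 or Eq. 2 plus the extra features.
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = rng.randint(0, 2, size=200)

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 10-fold cross-validation, used in place of a development set.
scores = {name: cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
          for name, clf in classifiers.items()}
```

With real centroid features, the configuration with the highest mean cross-validation accuracy would be the one submitted as a run.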
An analysis showed that many of these words come from hashtags, which are usually short phrases. Therefore, we should apply a more sophisticated method in order to extract the words that form a hashtag.

As future work, we also plan to use a word embedding model trained on a collection of texts from Spanish social media. We think that this will have a positive effect on the performance of our system in identifying the polarity of tweets, because such a model will be generated from documents characterized by the main features that describe social media texts (for example, informal language, plenty of grammatical errors and spelling mistakes, slang and vulgar vocabulary).

Acknowledgments

This work was supported by the eGovernAbility-Access project (TIN2014-52665-C2-2-R).

References

Alam, F., A. Corazza, A. Lavelli, and R. Zanoli. 2016. A knowledge-poor approach to chemical-disease relation extraction. Database, 2016:baw071.

Cardellino, C. 2016. Spanish Billion Words Corpus and Embeddings, March.

Chikersal, P., S. Poria, E. Cambria, A. Gelbukh, and C. E. Siong. 2015. Modelling public sentiment in Twitter: using linguistic patterns to enhance supervised learning. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 49-65. Springer.

García-Cumbreras, M. Á., J. Villena-Román, E. Martínez-Cámara, M. C. Díaz-Galiano, M. T. Martín-Valdivia, and L. A. Ureña López. 2016. Overview of TASS 2016. In Proceedings of TASS 2016: Workshop on Sentiment Analysis at SEPLN co-located with the 32nd SEPLN Conference (SEPLN 2016), Salamanca, Spain, September.

Mikolov, T. and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems.

Molina-González, M. D., E. Martínez-Cámara, M.-T. Martín-Valdivia, and J. M. Perea-Ortega. 2013. Semantic orientation for polarity classification in Spanish reviews. Expert Systems with Applications, 40(18):7250-7257.

Pang, B. and L. Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1-135.

Segura-Bedmar, I., V. Suárez-Paniagua, and P. Martínez. 2015. Exploring word embedding for drug name recognition. In Sixth International Workshop on Health Text Mining and Information Analysis (LOUHI), page 64.

Socher, R., J. Bauer, C. D. Manning, and A. Y. Ng. 2013a. Parsing with compositional vector grammars. In ACL (1), pages 455-465.

Socher, R., A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1631, page 1642. Citeseer.