=Paper=
{{Paper
|id=Vol-2421/TASS_paper_3
|storemode=property
|title=GTH-UPM at TASS 2019: Sentiment Analysis of Tweets for Spanish Variants
|pdfUrl=https://ceur-ws.org/Vol-2421/TASS_paper_3.pdf
|volume=Vol-2421
|authors=Ignacio González Godino,Luis Fernando D'Haro
|dblpUrl=https://dblp.org/rec/conf/sepln/GodinoD19
}}
==GTH-UPM at TASS 2019: Sentiment Analysis of Tweets for Spanish Variants==
GTH-UPM at TASS 2019: Sentiment Analysis of
Tweets for Spanish Variants
Ignacio González Godino and Luis Fernando D’Haro
Grupo de Tecnología del Habla, ETSI de Telecomunicación
Universidad Politécnica de Madrid
Avenida Complutense 30, 28040, Madrid, Spain
ignacio.ggodino@alumnos.upm.es, luisfernando.dharo@upm.es
Abstract. This article describes the system developed by the Grupo de Tecnología del Habla at Universidad Politécnica de Madrid, Spain (GTH-UPM) for the TASS 2019 competition on sentiment analysis of tweets. The developed system consisted of three classifiers: a) a system based on feature vectors extracted from the tweets, b) a neural-based classifier using FastText, and c) a deep neural network classifier using contextual vector embeddings created with BERT. Finally, the averaged probabilities of the three classifiers were calculated to obtain the final score. The final system obtained an averaged F1 of 48.0% and 48.4% on the dev set for the mono and cross tasks respectively, and 46.0% and 45.0% for the mono and cross tasks on the test set.
Keywords: TASS · Multiclassifiers · Natural Language Processing NLP
· Twitter · Sentiment Analysis
1 Introduction
Sentiment Analysis (a.k.a. opinion mining) is a branch of the Natural Language Processing field whose goal is to automatically determine whether a piece of text can be considered positive, negative, neutral or none, deriving in this way the opinion or attitude of the person writing the text [13]. Sentiment analysis has recently attracted a lot of attention since it can be used by companies to understand customers' feelings towards their products [18], by politicians to gauge the reception of their statements and actions (even to predict the results of an election [2]), or to monitor and analyze social phenomena and the general mood.
For the TASS 2019 competition, the organizers proposed research on sentiment analysis with a special interest in evaluating polarity classification of tweets written in Spanish variants (i.e. the Spanish language spoken in Costa Rica, Spain, Peru, Uruguay and Mexico). The main challenges the systems had to face were the lack of context (tweets are short, up to 240 characters), the presence of informal language such as misspellings, onomatopoeias, emojis, hashtags, usernames,
etc., the similarities between variants, class imbalance, but especially the restrictions imposed by the organizers on the data used for training [5].
The proposed challenge consisted of two sub-tasks: a) Monolingual, where participants had to train and test their systems using only the dataset for the corresponding variant, and b) Cross-lingual, where participants had to train their systems on a selection of the complementary datasets while using the corresponding variant for testing; the goal here was to test the dependency of systems on learning specific characteristics of the text for a given variant. For both tasks, the challenge organizers asked participants that, in case of submitting a supervised or semi-supervised system, it must only be trained with the provided training data, the use of other training sets being totally forbidden. However, linguistic resources like lexicons, vectors of word embeddings or knowledge bases could be used as long as they were clearly indicated. The goal here was to have a fair comparison between systems, but also to foster creativity by restricting systems to the same set of training data.
The paper is organized as follows. In Section 2 we provide detailed information about the datasets given by the organizers; afterwards, in Section 3, we describe in detail the classifiers and features used in our system; then, in Section 4, we present our results on the monolingual and cross-lingual settings. Finally, in Section 5 we present our conclusions and future work.
2 Corpus description
The organizers provided participants with a corpus including five sets of data, one for each of five different countries where Spanish is spoken: Costa Rica (CR), Spain (ES), Mexico (MX), Peru (PE), and Uruguay (UY). For each variant, training, development and test sets were provided. The data consisted of several tweets, their ID, user, date, variant and, only in the training and development sets, the sentiment class, which could be 'P' (positive), 'N' (negative), 'NEU' (neutral) or 'NONE' (no sentiment).
The label distribution for each variant for the training and development sets is shown in Table 1. Table 2 shows the label distribution for the test set. As we can see, the distribution of labels differs among variants, showing that systems must deal with class imbalance; however, the label distribution between the training, dev and test sets for the same variant is quite similar (except for Peru, where there are more NEG, NEU and NONE labels in the test set than in training and dev). This posed the challenge of creating a robust system, and it also explains the difference in performance for this variant, as shown in Section 4.
The task was divided into two sub-tasks, monolingual and cross-lingual analysis. In the first sub-task, the systems used tweets from the same variant for both training and testing. In the cross-lingual setting, in order to test the dependency of systems on a variant, they could be trained on a selection of any variants except the one used for testing. In our case, we simply combined the other variants into a single file and evaluated on the corresponding dev or test set for the given variant.
Table 1. Data and label distribution for the training and development sets
Label NEG NEU NONE POS Total
Variant Train Dev Train Dev Train Dev Train Dev Train Dev
CR 310 143 91 55 155 72 221 120 777 390
ES 474 266 140 83 157 64 354 168 1125 581
MX 505 252 79 51 93 48 312 159 989 510
PE 228 107 170 56 352 230 216 105 966 498
UY 367 192 192 90 94 51 290 153 943 486
Table 2. Data and label distribution for the test set
Variant NEG NEU NONE POS TOTAL
CR 459 151 220 336 1166
ES 663 195 254 594 1706
MX 745 119 111 525 1500
PE 485 368 176 435 1464
UY 587 290 82 469 1428
After reading the data, each tweet was pre-processed as follows:
– Leading and trailing spaces were removed.
– Words starting with the symbol "#" were kept after removing the "#". If camel case was found, the word was split; for instance, "#thisBeautifulDay" was replaced by "this Beautiful Day".
– URL references ('http://...') were replaced by the word 'http'.
– User references ('@username') were replaced by the word USER NAME.
– Sequences of three or more equal characters were replaced by a single occurrence of that character. For instance, "siiii" was replaced by "si".
– References to human laughter, such as "jajja" or "jajajajaj", were replaced by "jajaja".
– Numbers were removed.
– Other punctuation symbols were removed.
Then, we performed lemmatization and tokenization using the large Spanish model included in SpaCy (https://spacy.io/). Finally, tweets were converted to lowercase.
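The pre-processing above can be approximated with a few regular expressions plus SpaCy; the following is a minimal sketch rather than our exact pipeline (the rules and their order may differ slightly, and the model name es_core_news_lg is an assumed SpaCy model identifier):

```python
import re
import spacy

nlp = spacy.load("es_core_news_lg")  # large Spanish model (assumed name)

def split_camel_case(word: str) -> str:
    # "thisBeautifulDay" -> "this Beautiful Day"
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word)

def preprocess(tweet: str) -> str:
    text = tweet.strip()
    # Hashtags: drop '#' and split camel case
    text = re.sub(r"#(\w+)", lambda m: split_camel_case(m.group(1)), text)
    # URLs and user mentions
    text = re.sub(r"https?://\S+", "http", text)
    text = re.sub(r"@\w+", "USER NAME", text)
    # Collapse 3+ repeated characters to a single one, e.g. "siiii" -> "si"
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Normalize laughter, e.g. "jajja", "jajajajaj" -> "jajaja"
    text = re.sub(r"\bj+a(j+a*)+\b", "jajaja", text, flags=re.IGNORECASE)
    # Remove numbers and remaining punctuation
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    # Lemmatize, tokenize and lowercase
    doc = nlp(text)
    return " ".join(tok.lemma_.lower() for tok in doc if not tok.is_space)
```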
During this pre-processing phase, we analyzed the tweets and discovered a remarkably high percentage of Out-Of-Vocabulary words (OOVs) when comparing the vocabularies of the training and development sets, with values ranging between 53-55%. We managed to reduce it to 48-51% by using lemmatization and character vector embeddings (see Section 3.2), but it was still a surprisingly high value, which reduced the performance of our final system; finding additional solutions to this problem is left as future work.
The resulting pre-processed tweets were then used as input to the three classifiers described in the following section.
3 Classifiers
The final system was based on three different and independent classifiers followed by an ensembling method, all of which are explained below.
3.1 Feature-based classifier
The first classifier was based on features extracted from the training tweets (some features were extracted before applying the pre-processing methods). These features were concatenated and a classifier was trained for each variant. The extracted features were:
– Number of words in the tweet.
– Number of words with all characters in upper case.
– Number of "hashtags" found in the tweet (i.e. words starting with the symbol "#").
– Whether the tweet has an exclamation mark.
– Whether the tweet has a question mark.
– Presence or absence of words with one character repeated more than two times, such as "holaaa".
These features were selected based on features commonly used in sentiment analysis [1, 3] and on our own intuition. For instance, we can intuitively expect that longer tweets tend to be more negative, since users elaborate on the situation that upset them, or that upper-case words usually carry more importance and tend to be used in highly emotional tweets. On the other hand, when analyzing the corpus, we noticed that tweets containing exclamation marks, "hashtags" or words like "holaaa" are more likely to be positive, and that tweets containing question marks tend to be non-emotional; therefore, many of the proposed features were extracted based on this initial analysis and intuition.
In addition, negative and positive vocabularies were automatically created from the training data by extracting the 25 most discriminating words between the classes 'P' and 'N' using the algorithm proposed in [11]. Four features were extracted from this vocabulary (checking whether these words appeared in the tweet or not) and normalized by the number of words in the tweet (the full feature extraction is sketched after the list):
– Number of negative words.
– Number of positive words.
– Number of positive words minus the number of negative words.
– Total count of both negative and positive words.
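A minimal sketch of how the ten features could be computed per tweet is shown below; the positive and negative word sets (pos_words, neg_words) are assumed to have been built beforehand with the keyword-discovery algorithm of [11], which is not reproduced here:

```python
import re

def extract_features(tweet: str, pos_words: set, neg_words: set) -> list:
    tokens = tweet.split()
    n_words = max(len(tokens), 1)  # avoid division by zero on empty tweets
    n_pos = sum(1 for t in tokens if t.lower() in pos_words)
    n_neg = sum(1 for t in tokens if t.lower() in neg_words)
    return [
        n_words,                                      # number of words
        sum(1 for t in tokens if t.isupper()),        # all-upper-case words
        sum(1 for t in tokens if t.startswith("#")),  # hashtags
        int("!" in tweet),                            # exclamation mark present
        int("?" in tweet),                            # question mark present
        int(bool(re.search(r"(.)\1{2,}", tweet))),    # repeated chars, e.g. "holaaa"
        n_neg / n_words,                              # negative words (normalized)
        n_pos / n_words,                              # positive words (normalized)
        (n_pos - n_neg) / n_words,                    # positive minus negative
        (n_pos + n_neg) / n_words,                    # total sentiment words
    ]
```

Note that several of these features (hashtags, upper case, punctuation) only make sense on the raw tweet, which is why they were extracted before pre-processing.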
This set of ten features was used for training eight different classifiers: Logistic Regression, Multinomial Naïve Bayes, Decision Tree, Support Vector Machines, Random Forest, Extra Trees, AdaBoost and Gradient Boosting. Each one was trained following three different strategies: Normal, One-Vs-Rest and One-Vs-One. All these classifiers were implemented with the scikit-learn (https://scikit-learn.org) tools for Python, and we finally kept the one that obtained the best performance for the corresponding variant on the development set.
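The combination of eight classifiers and three multi-class strategies can be enumerated with scikit-learn as in the following simplified selection loop; X_train, y_train, X_dev and y_dev are assumed to be the feature matrices and labels built as described above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.metrics import f1_score

base_models = [LogisticRegression(max_iter=1000), MultinomialNB(),
               DecisionTreeClassifier(), SVC(probability=True),
               RandomForestClassifier(), ExtraTreesClassifier(),
               AdaBoostClassifier(), GradientBoostingClassifier()]

best_clf, best_f1 = None, -1.0
for model in base_models:
    # "Normal" strategy plus the two explicit multi-class wrappers
    for wrap in (lambda m: m, OneVsRestClassifier, OneVsOneClassifier):
        try:
            clf = wrap(model)
            clf.fit(X_train, y_train)
        except ValueError:
            continue  # e.g. MultinomialNB rejects negative feature values
        score = f1_score(y_dev, clf.predict(X_dev), average="macro")
        if score > best_f1:
            best_clf, best_f1 = clf, score
```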
3.2 FastText classifier
FastText [10] is an efficient library created by Facebook's AI Research (FAIR) lab (https://fasttext.cc/) that allows learning n-gram word and sub-word representations (i.e. vector embeddings) using supervised or unsupervised learning algorithms on a standard multi-core CPU. The library also allows training a multi-class sentence classifier using a simple linear model (multinomial logistic regression) with a rank constraint.
In more detail, for the sentence classifier the library implements a shallow neural network that uses as input features the averaged vector embeddings of the input sentence and a softmax layer to obtain a probability distribution over the pre-defined classes. Several tricks are implemented, such as hierarchical softmax [7] and a Huffman coding tree [12], to reduce the computational complexity when the number of labels is large; a bag of n-grams and the hashing trick [19] are also used to maintain a fast and memory-efficient mapping of the learned n-grams. Finally, FastText also deals with the problem of OOVs (Out-of-Vocabulary words) by training bags of character n-grams; this capability was one of the main motivations for using FastText, as we had discovered a huge proportion of OOVs between the training and dev data during our data analysis (see Section 2).
For our classifier, we first created a set of vector embeddings with dimension 100 using a supervised method trained with the labeled data from the previous TASS challenge [6]. The idea was to use those pre-trained vector embeddings as a linguistic resource for the following steps. Initially, we instead tried using the available pre-trained vectors [8] released by FAIR for Spanish (https://fasttext.cc/docs/en/crawl-vectors.html), but our results were worse, probably due to differences in pre-processing, the nature of the text (tweets vs. formal text), and the reduced amount of training data available to correctly adapt the 300-dimensional pre-trained vector embeddings. The hyper-parameters we used for this pre-training phase were: learning rate: 1.0, epochs: 5, wordNgrams: 2, dimension: 100. The rest of the parameters were the default ones provided by FastText.
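A minimal sketch of this pre-training step with the FastText Python bindings is given below; the file names are placeholders, and the input file is assumed to contain one pre-processed tweet per line with a __label__ prefix, as FastText expects:

```python
import fasttext

# Supervised training on the labeled data from the previous TASS challenge,
# with the hyper-parameters stated above (other parameters left at defaults).
model = fasttext.train_supervised(
    input="tass_previous.txt",  # placeholder path, one "__label__X text" line per tweet
    lr=1.0, epoch=5, wordNgrams=2, dim=100)

# Export the learned word vectors to a .vec file so they can be reused as
# pre-trained embeddings when training the per-variant classifiers.
words = model.get_words()
with open("tass_vectors.vec", "w", encoding="utf-8") as f:
    f.write(f"{len(words)} {model.get_dimension()}\n")
    for w in words:
        vec = " ".join(f"{v:.5f}" for v in model.get_word_vector(w))
        f.write(f"{w} {vec}\n")
```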
Next, we trained 5 independent supervised models, one for each variant in the competition, using the pre-trained vector embeddings as input and the corresponding training data for that variant, complying with the rules established by the organizers. Finally, we fine-tuned the model hyper-parameters (learning rate, number of iterations, and size of word n-grams) using the corresponding development set. In general, the only parameter we fine-tuned was the number of epochs, ranging from 5 to 10 depending on the amount of training data for each variant.
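The per-variant training can be sketched as follows, again with the FastText Python bindings; the per-variant file names are placeholders, and only the number of epochs is swept here, matching the tuning described above:

```python
import fasttext

variants = ["CR", "ES", "MX", "PE", "UY"]
models = {}
for v in variants:
    best_model, best_acc = None, -1.0
    for epochs in range(5, 11):  # tune only the number of epochs on the dev set
        m = fasttext.train_supervised(
            input=f"train_{v}.txt",                # placeholder per-variant training file
            pretrainedVectors="tass_vectors.vec",  # vectors from the pre-training step
            dim=100, lr=1.0, wordNgrams=2, epoch=epochs)
        _, acc, _ = m.test(f"dev_{v}.txt")         # precision@1 on the dev set
        if acc > best_acc:
            best_model, best_acc = m, acc
    models[v] = best_model

# At prediction time, the probabilities over all four classes feed the ensemble:
labels, probs = models["ES"].predict("me encanta este día", k=4)
```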
3.3 BERT classifier
BERT stands for Bidirectional Encoder Representations from Transformers. Proposed in [4], it is a method of pre-training contextual vector representations which obtains state-of-the-art results on several Natural Language Processing (NLP) tasks such as text classification, question answering, sequence tagging and language model prediction.
One of the main advantages of BERT is that its creators have publicly released pre-trained English and Multilingual models, which have been trained on massive corpora of unlabeled data with a new pre-training objective: the "masked language model" (MLM), inspired by the Cloze task [17]. This change allowed the authors to use bidirectional networks instead of the left-to-right networks used in the earlier OpenAI GPT model [15]. Finally, the advantage of using BERT is that the pre-trained models are ready to be fine-tuned for downstream tasks with a limited amount of data by using transfer learning approaches [16]. This is done by fine-tuning BERT's final layers while taking advantage of the rich representations of language learned during pre-training.
For our classifier, we used the BERT-Base pre-trained Multilingual Cased model, which was trained on 104 languages and consists of 12 layers (Transformer blocks), 768 hidden units and 12 attention heads, adding up to 110M parameters. Then, we created 5 different models, one per variant, by fine-tuning the pre-trained model using only the training data for the corresponding variant and checking the progress over up to 10 iterations on the development set with a batch size of 32 samples.
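The paper does not prescribe a specific toolkit for this step; the fine-tuning setup can be approximated, for instance, with the Hugging Face Transformers library. The sketch below only illustrates the configuration described above; train_tweets and train_labels are assumed lists of raw tweets and their class labels, and the learning rate is an assumed value:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification

LABELS = ["N", "NEU", "NONE", "P"]
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS))

def encode_tweets(tweets, labels):
    # Tokenize the tweets and pack them with their numeric labels.
    enc = tokenizer(tweets, padding=True, truncation=True, max_length=128,
                    return_tensors="pt")
    y = torch.tensor([LABELS.index(l) for l in labels])
    return TensorDataset(enc["input_ids"], enc["attention_mask"], y)

train_loader = DataLoader(encode_tweets(train_tweets, train_labels),
                          batch_size=32, shuffle=True)  # batch size used in the paper
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed learning rate

model.train()
for epoch in range(10):  # up to 10 iterations, monitored on the dev set
    for input_ids, attention_mask, y in train_loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
```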
3.4 Averaging Ensemble
We used the three former classifiers to obtain a distribution of probabilities over the classes (i.e. multi-class classification) for each tweet. Then, we implemented a soft-voting ensemble by averaging the three probability distributions and classifying the tweet with the most likely class. As the feature-based component was trained with several different classifiers, we had 24 different results for each variant. Therefore, we selected the classifier that obtained the best performance on the dev set, as shown in Tables 3 and 4.
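The soft-voting step itself is a straightforward average of the three per-class probability vectors; a minimal sketch, assuming each component returns its probabilities in the same class order, is:

```python
import numpy as np

LABELS = ["N", "NEU", "NONE", "P"]

def soft_vote(p_features, p_fasttext, p_bert):
    """Average the class probabilities of the three classifiers for one tweet."""
    probs = np.mean([p_features, p_fasttext, p_bert], axis=0)
    return LABELS[int(np.argmax(probs))]

# Example: the three classifiers disagree, the average decides.
print(soft_vote([0.1, 0.2, 0.1, 0.6],
                [0.5, 0.2, 0.1, 0.2],
                [0.2, 0.1, 0.1, 0.6]))  # -> "P"
```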
Table 3. Selected classifiers per variant for the Cross-lingual setting
Variant Selected Classifier
CR OneVsRest Logistic Regression
ES OneVsOne Logistic Regression
MX Normal Ada Boost
PE OneVsRest Gradient Boost
UY Normal Naïve Bayes
4 Results
From the ensemble explained in the previous section, we submitted for each variant the configuration that achieved the best performance on the development set, and followed the same approach for the test set.
Table 4. Selected classifiers per variant for the Mono-lingual setting
Variant Selected Classifier
CR Normal Ada Boost
ES Normal Naïve Bayes
MX OneVsRest Ada Boost
PE Normal Ada Boost
UY OneVsRest Ada Boost
Our results for each classifier are shown in Tables 5 (feature-based), 6 (FastText), and 7 (BERT); results for the final ensemble are presented in Table 8.
It is important to mention that, when evaluating on the test set, we used the best hyper-parameters found on the development set and then combined the training and dev sets to train the final classification models; we then evaluated the resulting models on the test set. For the cross-lingual setting we took care, when combining the training and development data, to exclude the development set of the variant to be tested.
As can be seen in the results, the ensemble outperformed the individual classifiers on the test set most of the time, for both the cross- and mono-lingual settings. We also found that the feature-based classifier performed the worst in most cases (the exception being the Peruvian variant, where BERT performed worse); however, we think it provides complementary information, as we observed when performing the optimizations on the development set.
Table 5. Macro F-1 results for the Feature-based classifier only
Cross-F1-Score Mono-F1-Score
Dev Test Dev Test
CR 0.2979 0.2968 0.3698 0.3823
ES 0.333 0.3670 0.2889 0.2873
MX 0.361 0.3369 0.3697 0.3990
PE 0.317 0.2741 0.349 0.2925
UY 0.296 0.3305 0.3879 0.3846
Table 6. Results for the FastText classifier
Cross-F1-Score Mono-F1-Score
Dev Test Dev Test
CR 0.5431 0.4518 0.4872 0.4595
ES 0.4415 0.4048 0.442 0.4206
MX 0.4455 0.4542 0.4265 0.4557
PE 0.4598 0.4621 0.4839 0.4226
UY 0.448 0.4520 0.489 0.4793
Table 7. Results for the BERT classifier
Cross-F1-Score Mono-F1-Score
Dev Test Dev Test
CR 0.471 0.4608 0.4362 0.4713
ES 0.4417 0.4680 0.4616 0.4604
MX 0.4374 0.4471 0.296 0.4856
PE 0.4286 0.4641 0.4134 0.0536
UY 0.4826 0.4735 0.4755 0.4555
Table 8. Ensembling classifier results
Cross-F1-Score Mono-F1-Score
Dev Test Dev Test
CR 0.5156 0.4639 0.4923 0.4678
ES 0.4686 0.3772 0.4849 0.4552
MX 0.4732 0.4706 0.4143 0.4867
PE 0.4555 0.4565 0.4868 0.3987
UY 0.5059 0.4811 0.5193 0.4921
5 Conclusions and Future Work
In this paper we have described the first participation of GTH-UPM in the "Sentiment Analysis at SEPLN" (TASS 2019) task at tweet level. Our final system consisted of an ensemble, using average voting, of three different multi-class text classifiers: a) a feature-based classifier using scikit-learn, b) a shallow neural network using FastText, and c) a transfer-learning approach using BERT. Our system obtained an averaged F1 score of 45.0% and 46.0% on the test set for the cross- and mono-lingual settings, respectively. Our system performed very well when compared with the other submitted systems across the two settings, showing that our proposal was robust enough.
As future work we plan to perform a more exhaustive analysis of the results given by the three classifiers, to fine-tune the models' hyper-parameters, and to test new features such as the ones proposed in [3] and new pre-trained vector embeddings such as ELMo [14] or ULMFiT [9].
Acknowledgments
The work leading to these results has been supported by the following projects:
AMIC (MINECO, TIN2017-85854-C4-4-R) and CAVIAR (MINECO, TEC2017-
84593-C2-1-R). We gratefully acknowledge the support of NVIDIA Corporation
with the donation of the Titan X Pascal GPU used for this research.
References
1. Agarwal, A., Xie, B., Vovsha, I., Rambow, O., Passonneau, R.: Sentiment analysis of Twitter data. In: Proceedings of the Workshop on Language in Social Media (LSM 2011), pp. 30–38 (2011)
2. Bermingham, A., Smeaton, A.: On using Twitter to monitor political sentiment and predict election results. In: Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP 2011), pp. 2–10 (April 2011)
3. Chiruzzo, L., Rosá, A.: RETUYT-InCo at TASS 2018: Sentiment analysis in Spanish variants using neural networks and SVM. In: Proceedings of TASS 2018, vol. 2172 (2018)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
5. Díaz-Galiano, M.C., et al.: Overview of TASS 2019. CEUR-WS, Bilbao, Spain (2019)
6. García Cumbreras, M.Á., Martínez Cámara, E., Villena Román, J., García Morera, J.: TASS 2015 – the evolution of the Spanish opinion mining systems (2016)
7. Goodman, J.: Classes for fast maximum entropy training. arXiv preprint cs/0108006 (2001)
8. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018) (2018)
9. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018)
10. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Association for Computational Linguistics (April 2017)
11. King, G., Lam, P., Roberts, M.E.: Computer-assisted keyword and document set discovery from unstructured text. American Journal of Political Science 61(4), 971–988 (2017)
12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
13. Pang, B., Lee, L., et al.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1–2), 1–135 (2008)
14. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
15. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding with unsupervised learning. Technical report, OpenAI (2018)
16. Ruder, S.: Neural Transfer Learning for Natural Language Processing. Ph.D. thesis, National University of Ireland, Galway (2019)
17. Taylor, W.L.: "Cloze procedure": A new tool for measuring readability. Journalism Bulletin 30(4), 415–433 (1953)
18. Vinodhini, G., Chandrasekaran, R.M.: Sentiment analysis and opinion mining: a survey. International Journal 2(6), 282–292 (2012)
19. Weinberger, K., Dasgupta, A., Attenberg, J., Langford, J., Smola, A.: Feature hashing for large scale multitask learning. arXiv preprint arXiv:0902.2206 (2009)