<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Sentiment Analysis for Spanish Tweets based on Continual Pre-training and Data Augmentation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yingwen</forename><surname>Fu</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Information Science and Technology</orgName>
								<orgName type="institution">Guangdong University of Foreign Studies</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ziyu</forename><surname>Yang</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Information Science and Technology</orgName>
								<orgName type="institution">Guangdong University of Foreign Studies</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nankai</forename><surname>Lin</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Information Science and Technology</orgName>
								<orgName type="institution">Guangdong University of Foreign Studies</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Lianxi</forename><surname>Wang</surname></persName>
							<email>wanglianxi@gdufs.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">School of Information Science and Technology</orgName>
								<orgName type="institution">Guangdong University of Foreign Studies</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory">Guangzhou Key Laboratory of Multilingual Intelligent Processing</orgName>
								<orgName type="institution">Guangdong University of Foreign Studies</orgName>
								<address>
									<settlement>Guangzhou</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Feng</forename><surname>Chen</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Information Science and Technology</orgName>
								<orgName type="institution">Guangdong University of Foreign Studies</orgName>
								<address>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Sentiment Analysis for Spanish Tweets based on Continual Pre-training and Data Augmentation</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">170B0C406844BF5DFE3C7B00982ECE23</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T00:22+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Sentiment Analysis</term>
					<term>BERT</term>
					<term>Continual Pre-training</term>
					<term>Back Translation</term>
					<term>Mix up</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we report the solution of the team BERT4EVER for the sentiment analysis task for Spanish tweets in EmoEvalEs@IberLEF 2021, which aims to classify Spanish tweets into one of the following emotional categories: Anger, Disgust, Fear, Joy, Sadness, Surprise or Others. We adopt the monolingual Spanish BERT model to tackle the problem. In addition, we leverage two augmented strategies to enhance the classic fine-tuned model, namely continual pre-training and data augmentation to improve the generalization capability. Experimental results demonstrate the effectiveness of the BERT model and two augmented strategies.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Sentiment analysis is an important task in the field of natural language processing (NLP). It is often used to determine which type of emotion a text expresses <ref type="bibr" target="#b0">[1]</ref>. However, due to the lack of voice modulation and facial expressions, understanding the emotions expressed by users on social media such as Twitter is a difficult task <ref type="bibr" target="#b1">[2]</ref>.</p><p>Researchers are constantly pursuing efficient algorithms to achieve better classification results <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>. Therefore, in EmoEvalEs@IberLEF 2021 <ref type="bibr" target="#b13">[14]</ref>, a sentiment analysis task was proposed <ref type="bibr" target="#b14">[15]</ref>, requiring participants to perform sentiment analysis and evaluation of tweets in Spanish and to classify them into one of the following emotional categories: Anger, Disgust, Fear, Joy, Sadness, Surprise or Others. This track provides Spanish tweets and the corresponding categories for participants to conduct sentiment classification experiments. However, the task poses two main challenges:</p><p>1) The dataset size is relatively small, far from the amount of data required to train commonly used classification models such as BERT <ref type="bibr" target="#b4">[5]</ref> and Bi-LSTM <ref type="bibr" target="#b5">[6]</ref>.</p><p>2) The class proportions are extremely imbalanced: in the provided dataset, the proportions of Fear and Disgust are much smaller than those of Others and Joy.</p><p>To tackle the issues above, we, the BERT4EVER team, leverage two strategies to boost classification performance: Continual Pre-training and Data Augmentation. 
These two strategies effectively compensate for the small data size and the imbalanced class proportions, so that the trained model yields better performance.</p><p>The remainder of the article is organized as follows. Section 2 describes the task and the dataset provided by the organizers in detail. Section 3 presents our specific implementation. The experimental results and conclusions are given in Sections 4 and 5, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Task Description</head><p>The aim of the task is to classify the sentiment conveyed in a Spanish tweet. The task is difficult because tweets lack facial expressions and intonation. The sentiment is divided into the following classes: Anger, Disgust, Fear, Joy, Sadness, Surprise or Others (the sentiment conveyed in the tweet is 'neutral' or there is no sentiment).</p><p>The datasets <ref type="bibr" target="#b6">[7]</ref> involved in this task were provided by the organizers on Codalab. The training set contains about 18,000 instances. In addition to the tweet text, the labels also indicate whether the tweet is offensive and which event the tweet is about. Some statistics about the training set are shown in Table <ref type="table" target="#tab_0">1</ref>. In our experiments, in order to fairly explore the effectiveness of the different strategies, we used 5-fold cross-validation, dividing the data into 5 parts to obtain an ensemble model with better generalization performance: 4 parts are used for training and the remaining part for validation. We then take the average results of the 5 cross-validation models as an estimate of the effectiveness of each strategy. BERT (Bidirectional Encoder Representations from Transformers) <ref type="bibr" target="#b4">[5]</ref> is a pre-trained language model (PLM) that shows excellent performance on multiple downstream NLP tasks. The model architecture is shown in Fig. <ref type="figure" target="#fig_0">1</ref>. It reads the input sequence at once and learns via two objectives, i.e., masked language modeling (MLM) and next sentence prediction (NSP). MLM randomly masks 15 percent of the input tokens, replacing them with other tokens, and then predicts the masked words. NSP predicts whether two input sentences are consecutive in the text, in order to model the relationship between sentences. 
In this paper, we leverage BETO <ref type="bibr" target="#b12">[13]</ref> as our base model. BETO is a BERT model trained on a large Spanish corpus hosted on Zenodo. BETO is similar in size to BERT-Base and was trained with the whole-word masking technique. It uses a vocabulary of about 31k BPE <ref type="bibr" target="#b7">[8]</ref> subwords constructed with SentencePiece and was trained for 2M steps.</p><p>However, since our dataset consists of Spanish tweets, a general pre-trained model applied directly to it may be limited by insufficient domain knowledge. At the same time, class imbalance (as discussed in the Introduction) is also a problem we need to solve. We therefore propose two strategies, continual pre-training and data augmentation, to alleviate these problems.</p></div>
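As a hedged illustration of the MLM objective described above, the following minimal Python sketch (a toy re-implementation for exposition, not code from this work) selects about 15 percent of the tokens as prediction targets and applies the 80/10/10 mask/random/keep replacement scheme from the BERT paper:

```python
import random

def mlm_mask(tokens, mask_token="[MASK]", vocab=None, mask_prob=0.15, seed=0):
    """Select ~mask_prob of positions as MLM targets; of those, 80% become
    [MASK], 10% a random vocabulary token, and 10% stay unchanged."""
    rng = random.Random(seed)
    vocab = vocab or tokens  # fall back to the sentence itself as a toy vocabulary
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                out[i] = mask_token
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return out, targets
```

During continual pre-training, only the positions recorded in `targets` contribute to the MLM loss; all other positions are left untouched.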
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Continual Pre-training</head><p>Inspired by <ref type="bibr" target="#b10">[11]</ref>, our continual pre-training approach to domain adaptation is straightforward: we continue pre-training BETO on a large corpus of unlabeled domain-specific text. Specifically, we try two domain corpora: (1) the training set of EmoEvalEs@IberLEF 2021, where we ignore the labels and only use the raw text for continual pre-training; and (2) a general Spanish tweet corpus plus the training set of EmoEvalEs@IberLEF 2021, where, in addition to the unlabeled training data of this track, we also leverage a large general Spanish tweet corpus <ref type="bibr" target="#b11">[12]</ref> for domain-adaptive pre-training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Data Augmentation</head><p>Data augmentation tackles over-fitting at the data level and improves the generalization of the model. By increasing the diversity of the training samples, the model can learn more essential features of the data and become more robust to subtle variations in the samples. Back Translation. In order to generate more training data, we use back translation to construct a paraphrase x′_u of a sentence x_u. The paraphrase x′_u, generated by translating x_u into an intermediate language and then translating it back, describes the same content as x_u and should be semantically close to it. In terms of labels, x_u and its back-translated sample x′_u share the same label. We use English as the intermediate language for back translation.</p><p>By observing the Spanish dataset, we find that three categories, Disgust, Fear, and Surprise, account for the lowest proportions. Therefore, we only perform back translation on these three categories. This increases the proportion of the low-frequency categories, which not only enriches the training data but also reduces the model's misjudgment rate on these three low-proportion labels.</p><p>Mix Up. Mix up <ref type="bibr" target="#b8">[9]</ref> is a simple and fast data augmentation method. It randomly draws two samples from the training set and computes a simple weighted sum of them. 
The labels of the two samples are combined with the same weights; the loss is then computed against this weighted label, and the parameters are updated through backpropagation.</p><formula xml:id="formula_0">x̃ = λx_i + (1 − λ)x_j ,  ỹ = λy_i + (1 − λ)y_j<label>(1)</label></formula><p>where x_i, x_j are raw input vectors and y_i, y_j are one-hot label encodings. In this task, we simply set λ to 0.5, which yields more stable predictions.</p></div>
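Eq. (1) can be sketched in a few lines of plain Python (a toy illustration with list-valued inputs; in practice the interpolation is applied to model inputs and one-hot label vectors):

```python
def mixup(x_i, x_j, y_i, y_j, lam=0.5):
    """Interpolate two samples and their one-hot labels with weight lam (Eq. 1)."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y

# With lam = 0.5, as used in this paper, the mixed sample is simply
# the average of the pair, and the mixed label splits mass evenly.
```

For example, `mixup([1.0, 0.0], [0.0, 2.0], [1, 0, 0], [0, 1, 0])` returns `([0.5, 1.0], [0.5, 0.5, 0.0])`.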
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiment</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experiment Settings</head><p>We use the Transformers library with PyTorch as the backend to construct the BERT-based models, and scikit-learn to construct the machine learning models. The hyperparameters are shown in Table <ref type="table" target="#tab_1">2</ref>. For evaluation, we use the weighted macro-averaged F1 score as our metric. </p></div>
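The evaluation metric can be sketched as follows; this is a hedged pure-Python re-implementation of the standard weighted-average F1 definition (per-class F1 weighted by class support), not the organizers' official scoring script:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1, weighted by each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (n / total) * f1  # weight per-class F1 by its support
    return score
```

Unlike the plain macro average, the support weighting keeps rare classes such as Disgust and Fear from dominating the score.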
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Experiment Results</head><p>We first report the offline results of several machine learning methods, such as Support Vector Machine (SVM), Logistic Regression (LR) and Random Forest (RF), and of recent neural methods, such as fine-tuned XLM <ref type="bibr" target="#b9">[10]</ref> and fine-tuned BETO, as well as the augmented strategies, including continual pre-training and back translation. The results are shown in Table <ref type="table" target="#tab_2">3</ref> and Table <ref type="table" target="#tab_3">4</ref>. Based on the offline results, we use the models (soft voting over the 5 cross-validation models) of ID 9, ID 10 and the combination of ID 9 and ID 10 (in Table <ref type="table" target="#tab_2">3</ref>) as our final submissions. The online results are shown in Table <ref type="table" target="#tab_4">5</ref>. We achieved second place in the competition. As Table <ref type="table" target="#tab_4">5</ref> shows, Fine-tuned BETO + Training set pre-training + Low proportion data back translation achieves the best accuracy of 0.7222. It is worth noting that the offline performance of Fine-tuned BETO + Training set pre-training + Mix up is excellent, but its online performance is not as good. This is also why the combination of the two models performs worse than the single model. We believe this model over-fits during training, resulting in poor generalization and thus degraded performance on the test set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>For the sentiment analysis task for Spanish tweets in EmoEvalEs@IberLEF 2021, we adopt a monolingual pre-trained Spanish BERT model as our base model and fine-tune it on the labeled tweets. In addition, to address the two problems of small data size and class imbalance in the original training set, we leverage two augmented strategies to enhance the classic fine-tuned model, namely continual pre-training and data augmentation. Specifically, we try two data augmentation methods: back translation and mix up. Experimental results demonstrate the effectiveness of the two augmented strategies. In the future, we will explore more data augmentation methods to achieve better results on the sentiment analysis task for Spanish tweets.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. BERT Model.</figDesc><graphic coords="3,213.90,194.40,178.89,183.07" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 . Statistics of the dataset. Class Num. of Training Instances</head><label>1</label><figDesc></figDesc><table><row><cell>Happy</cell><cell>4908</cell></row><row><cell>Fear</cell><cell>260</cell></row><row><cell>Anger</cell><cell>2356</cell></row><row><cell>Surprise</cell><cell>952</cell></row><row><cell>Sad</cell><cell>2772</cell></row><row><cell>Disgust</cell><cell>89</cell></row><row><cell>Others</cell><cell>2356</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Hyperparameters.</figDesc><table><row><cell>Parameter</cell><cell>Value</cell></row><row><cell>Learning Rate</cell><cell>1e-5</cell></row><row><cell>Batch Size</cell><cell>16</cell></row><row><cell>Epoch</cell><cell>15</cell></row><row><cell>Optimizer</cell><cell>Adam</cell></row><row><cell>Device</cell><cell>Nvidia 1080i</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Correspondence between model and ID.</figDesc><table><row><cell>ID</cell><cell>Model</cell></row><row><cell>1</cell><cell>LR</cell></row><row><cell>2</cell><cell>SVM</cell></row><row><cell>3</cell><cell>RF</cell></row><row><cell>4</cell><cell>Fine-tuned BETO</cell></row><row><cell>5</cell><cell>Fine-tuned XLM</cell></row><row><cell>6</cell><cell>ID 4 + Training set pre-training</cell></row><row><cell>7</cell><cell>ID 4 + General corpus pre-training</cell></row><row><cell>8</cell><cell>ID 6 + Whole data back translation</cell></row><row><cell>9</cell><cell>ID 6 + Low proportion data back translation</cell></row><row><cell>10</cell><cell>ID 6 + Mix up</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 .</head><label>4</label><figDesc>Offline Performance. From the table above, we can see that among the machine learning methods SVM works best, outperforming LR and RF by 0.0407 and 0.0245, respectively. In addition, the neural methods are far superior to the machine learning methods, indicating the superiority of neural, and especially BERT-based, methods. Among the BERT-based methods, the monolingual BETO achieves better performance than the multilingual XLM, with an improvement of almost 0.1, demonstrating the effectiveness of monolingual BETO for this task. Besides, the two augmented strategies leveraged in this paper both improve the base model, among which Mix up augmentation achieves the best effect, reaching an average accuracy of 0.7266. In addition, continual pre-training with the training set and low-proportion back translation outperform continual pre-training with the general corpus and whole-data back translation, respectively.</figDesc><table><row><cell>ID</cell><cell></cell><cell></cell><cell cols="2">Accuracy</cell><cell></cell><cell></cell></row><row><cell></cell><cell>Fold 1</cell><cell>Fold 2</cell><cell>Fold 3</cell><cell>Fold 4</cell><cell>Fold 5</cell><cell>Average</cell></row><row><cell>1</cell><cell>0.5163</cell><cell>0.5113</cell><cell>0.5305</cell><cell>0.5236</cell><cell>0.5236</cell><cell>0.5205</cell></row><row><cell>2</cell><cell>0.5351</cell><cell>0.5598</cell><cell>0.5704</cell><cell>0.5612</cell><cell>0.5797</cell><cell>0.5612</cell></row><row><cell>3</cell><cell>0.5346</cell><cell>0.5461</cell><cell>0.531</cell><cell>0.5461</cell><cell>0.5216</cell><cell>0.5367</cell></row><row><cell>4</cell><cell>0.708</cell><cell>0.7005</cell><cell>0.7133</cell><cell>0.7019</cell><cell>0.7044</cell><cell>0.7056</cell></row><row><cell>5</cell><cell>0.5969</cell><cell>0.6132</cell><cell>0.6158</cell><cell>0.6088</cell><cell>0.6108</cell><cell>0.6091</cell></row><row><cell>6</cell><cell>0.7036</cell><cell>0.7126</cell><cell>0.7197</cell><cell>0.7119</cell><cell>0.7126</cell><cell>0.7121</cell></row><row><cell>7</cell><cell>0.7121</cell><cell>0.7068</cell><cell>0.7112</cell><cell>0.7106</cell><cell>0.7042</cell><cell>0.709</cell></row><row><cell>8</cell><cell>0.7161</cell><cell>0.7098</cell><cell>0.7167</cell><cell>0.7172</cell><cell>0.7162</cell><cell>0.7172</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 .</head><label>5</label><figDesc>Online Performance.</figDesc><table><row><cell>Model</cell><cell cols="2">Accuracy Precision</cell><cell>Recall</cell><cell>F1-Score</cell></row><row><cell>ID 9</cell><cell>0.7222</cell><cell>0.7047</cell><cell>0.7222</cell><cell>0.7114</cell></row><row><cell>ID 10</cell><cell>0.7047</cell><cell>0.6927</cell><cell>0.7047</cell><cell>0.6942</cell></row><row><cell>Combination of ID 9 and ID 10</cell><cell>0.7204</cell><cell>0.7082</cell><cell>0.7204</cell><cell>0.7098</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This work was supported by the National Social Science Foundation of China (No. 17CTQ045), the Soft Science Research Project of Guangdong Province (No.2019A101002108), the Science and Technology Program of Guangzhou (No.202002030227), the National Natural Science Foundation of China (No. 61572145) and the Key Field Project for Universities of Guangdong Province (No. 2019KZDZX1016). The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Sentiment analysis and opinion mining</title>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Synthesis Lectures on Human Language Technologies</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="167" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">SemEval-2017 task 4: Sentiment analysis in twitter</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rosenthal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Farra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</title>
				<meeting>the 11th International Workshop on Semantic Evaluation (SemEval-2017)<address><addrLine>Vancouver, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="502" to="518" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">BB twtr at SemEval-2017 task 4: Twitter sentiment analysis with CNNs and LSTMs</title>
		<author>
			<persName><forename type="first">M</forename><surname>Cliche</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</title>
				<meeting>the 11th International Workshop on Semantic Evaluation (SemEval-2017)<address><addrLine>Vancouver, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="573" to="580" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">How Will Your Tweet Be Received? Predicting the Sentiment Polarity of Tweet Replies</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Arasteh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Monajem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Christlein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Heinrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Evert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 15th International Conference on Semantic Computing (ICSC)</title>
				<imprint>
			<date type="published" when="2021">2021. 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of NAACL-HLT</title>
				<meeting>NAACL-HLT</meeting>
		<imprint>
			<date type="published" when="2019">2019. 2019</date>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Long short-term memory</title>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schmidhuber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural computation</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1735" to="1780" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">EmoEvent: A Multilingual Emotion Corpus based on different Events</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Plaza-Del-Arco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Strapparava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Urena Lopez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Martin</forename><surname>Valdivia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th Language Resources and Evaluation Conference</title>
				<meeting>the 12th Language Resources and Evaluation Conference<address><addrLine>Marseille, France</addrLine></address></meeting>
		<imprint>
			<publisher>European Language Resources Association</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1492" to="1498" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Google&apos;s neural machine translation system: Bridging the gap between human and machine translation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Norouzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Macherey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krikun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Macherey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CoRR</title>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">mixup: Beyond Empirical Risk Minimization</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cisse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lopez-Paz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICLR</title>
				<imprint>
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Unsupervised cross-lingual representation learning at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACL 2020</title>
				<meeting>ACL 2020</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="8440" to="8451" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Don&apos;t Stop Pretraining: Adapt Language Models to Domains and Tasks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gururangan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Marasović</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Swayamdipta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Beltagy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Downey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACL</title>
				<meeting>ACL</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="8342" to="8360" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">TWilBert: Pre-trained Deep Bidirectional Transformers for Spanish Twitter</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Á</forename><surname>González</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Hurtado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pla</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">426</biblScope>
			<biblScope unit="page" from="58" to="69" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Spanish Pre-Trained BERT Model and Evaluation Data</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cañete</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chaperon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Fuentes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pérez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICLR</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<author>
			<persName><forename type="first">M</forename><surname>Montes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Iberian Languages Evaluation Forum</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the Iberian Languages Evaluation Forum<address><addrLine>IberLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Overview of the EmoEvalEs task on emotion detection for Spanish at IberLEF 2021</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M</forename><surname>Plaza-Del-Arco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Jiménez Zafra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Montejo Ráez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Molina González</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Ureña López</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Martín-Valdivia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Procesamiento del Lenguaje Natural</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="issue">0</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
