=Paper=
{{Paper
|id=Vol-2943/restmex_paper9
|storemode=property
|title=Cascade of Biased Two-class Classifiers for Multi-class Sentiment Analysis
|pdfUrl=https://ceur-ws.org/Vol-2943/restmex_paper9.pdf
|volume=Vol-2943
|authors=José Abreu,Pedro Mirabal,Adrián Ballester-Espinosa
|dblpUrl=https://dblp.org/rec/conf/sepln/AbreuMB21
}}
==Cascade of Biased Two-class Classifiers for Multi-class Sentiment Analysis==
José Abreu¹ [0000−0002−4637−4206], Pedro Mirabal² [0000−0001−7345−6007], and Adrián Ballester-Espinosa³ [0000−0003−2506−1785]

¹ U.I. for Computer Research, University of Alicante, Spain. ji.abreu@ua.es
² Departamento de Ingeniería Informática, Universidad Católica de Temuco, Chile. pedro.sanchez@uct.cl
³ Department of Software and Computing Systems, University of Alicante, Spain. adrian.ballester@ua.es

Abstract. In this paper, we describe our participation in the Rest-Mex 2021 Sentiment Analysis Task. Our approach is based on an ensemble of BERT|BETO-based classifiers arranged in a cascade of binary models, trained with a bias towards specific classes with the aim of lowering the Mean Absolute Error. The resulting models were ranked in 2nd and 3rd place according to the Mean Absolute Error evaluation criterion.

Keywords: Sentiment Analysis · Deep Learning · Transformer Models

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Sentiment Analysis is a branch of Natural Language Processing that helps us analyze people's opinions about entities such as services and products, classifying them into categories. It is possible to consider positive, negative, or neutral classes, or other more fine-grained scales. This task has received notable attention since stakeholders can leverage data from social media or specialized websites such as Tripadvisor to make data-driven decisions. However, there are challenges, for example the uneven development of resources for different languages [1].

To promote Sentiment Analysis, several shared tasks have been created on this subject, such as SemEval (since its 2007 edition), IberLEF, and lately Rest-Mex. Recently, Sentiment Analysis has been enhanced by Deep Learning; this topic is covered in detail in the survey "Deep learning for sentiment analysis: A survey" [2]. Using similar strategies, other teams have participated in competitions in this field, obtaining good results [3,4,5].

In this paper, we describe our participation in the Rest-Mex 2021 Sentiment Analysis Subtask [6]. This subtask is a classification task whose objective is for systems to predict the polarity of an opinion published by a tourist about places in Guanajuato, Mexico. The collection provided was obtained from tourists who shared their opinions on TripAdvisor between 2002 and 2020. Our approach is based on Deep Learning Transformer models, specifically BERT and BETO, applying a particular architecture and training strategies that we describe in the next section. The resulting models were ranked in 2nd and 3rd place according to the Mean Absolute Error evaluation criterion.

2 Task and Data Description

In this section, we describe the data provided by the organizers for this subtask and its characterization.

[Fig. 1. Training data statistics: (a) class distribution; (b) frequency of opinion length (after the BERT tokenizer).]

The corpus consists of 7,632 opinions, of which 5,784 come from national tourists (from Mexico) and 1,848 from Iberoamerican tourists. Each opinion is labeled with an integer in [1, 5], where 1 represents the most negative polarity and 5 the most positive. For each opinion, the organizers also provided information about the nationality and gender of the author. The organizers split the corpus approximately 70%-30%: 70% of the data (5,194 opinions) was delivered to the participants with complete information about each opinion, while 30% was reserved for the final testing of the competing models.

Analyzing the representation of each class, we detected a high level of imbalance. Class 5 is the majority class, with 2,688 instances representing 51.75% of the total, in great contrast with class 1, for which only 80 instances were provided (1.54%). The remaining classes are distributed as follows: 1,595 instances for class 4, 686 for class 3, and 155 for class 2, representing 30.71%, 13.21%, and 2.98% respectively. This information can be seen in Fig. 1a.

Our work takes as primary data only the textual information of the opinion, without taking other features into consideration. One aspect to take into account, given the architecture used, is the length of each opinion, since our model is limited to 512 tokens. Fig. 1b shows a histogram of opinion lengths, where it can be seen that the processed opinions meet this condition.
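As a concrete illustration of this length check, the following is a minimal sketch assuming the HuggingFace transformers library; the `opinions` list is a hypothetical stand-in for the corpus texts, not the authors' code.

```python
from transformers import AutoTokenizer

# Hypothetical stand-in for the opinion texts of the training corpus.
opinions = ["Excelente lugar, muy recomendado.", "Regular, el servicio fue lento."]

# Uncased BETO checkpoint, the same one referenced by the models in Section 3.
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

# Token count per opinion, including the [CLS] and [SEP] special tokens.
lengths = [len(tokenizer(text)["input_ids"]) for text in opinions]

# The condition summarized by the histogram in Fig. 1b.
print(max(lengths), "tokens in the longest opinion")
print(sum(length > 512 for length in lengths), "opinions exceed the 512-token limit")
```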
3 System Architecture

In this section we describe the two architectures we explored for the sentiment analysis task, shown in Fig. 2.

3.1 BERT|BETO-based multi-class classifiers

This model is a multi-class classifier that learns the five categories simultaneously. It is based on BERT [9] as a feature extractor, fine-tuned for the text classification downstream task. We evaluated two versions of the architecture depicted in Fig. 2a. In both cases, we leveraged transfer learning from pre-trained embeddings. The first is the uncased version of BETO⁴ [8], a BERT model trained on a Spanish-language corpus. The other pre-trained embedding we used is the multilingual uncased version of BERT⁵. Our aim is to compare a Spanish-specific model to a multilingual one.

⁴ https://github.com/dccuchile/beto (available through the HuggingFace library, model id: 'dccuchile/bert-base-spanish-wwm-uncased')
⁵ https://github.com/google-research/bert/blob/master/multilingual.md (available through the HuggingFace library, model id: 'bert-base-multilingual-uncased')

The classifier comprises a dense layer with 768 hidden units and ReLU activation, dropout with a rate of 0.2, and a dense layer with 5 units and linear activation. For both BETO and BERT we use the base version, i.e., token embeddings of size 768 and a maximum length of 512. As Fig. 2a shows, the embedding of the [CLS] token is used as the representation of the whole opinion. To address the class imbalance problem, class weights were set proportional to the number of instances in each category.

This architecture was evaluated by the authors of BERT [9] on the sentiment analysis task over the Stanford Sentiment Treebank dataset [10], achieving state-of-the-art results at the time. This makes the architecture attractive as a benchmark for the sentiment analysis task in Spanish. A sketch of the classification head described above is given below.
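This is a minimal PyTorch sketch of our reading of the Fig. 2a architecture, assuming the torch and transformers libraries; the class name and the `num_classes` parameter are assumptions for illustration, not the authors' published code.

```python
import torch.nn as nn
from transformers import AutoModel

class SentimentClassifier(nn.Module):
    """BERT|BETO feature extractor with the classification head of Fig. 2a."""

    def __init__(self, pretrained="dccuchile/bert-base-spanish-wwm-uncased", num_classes=5):
        super().__init__()
        # Base model: token embeddings of size 768, maximum length 512.
        self.encoder = AutoModel.from_pretrained(pretrained)
        self.head = nn.Sequential(
            nn.Linear(768, 768),         # dense layer with 768 hidden units
            nn.ReLU(),                   # ReLU activation
            nn.Dropout(0.2),             # dropout with a rate of 0.2
            nn.Linear(768, num_classes), # dense output layer with linear activation
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0]  # embedding of the [CLS] token
        return self.head(cls_embedding)                  # class logits
```

Under this reading, the class weighting mentioned above would be supplied through the loss, e.g. `nn.CrossEntropyLoss(weight=class_weights)`, with the weights derived from the per-category instance counts of Section 2; with `num_classes=2` the same module could serve as a stage classifier for the cascade of Section 3.2.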
3.2 BERT|BETO-based two-class cascade classifiers

The other model we studied is an ensemble of binary classifiers arranged in a cascade, as shown in Fig. 2b.

[Fig. 2. Architectures of the two models studied: (a) BERT-based multi-class classifier; (b) cascade classifier.]

Cascading classifiers is the strategy leveraged by widely known frameworks such as Viola-Jones [11]. In Sentiment Analysis, it has been used by [7] to enrich the feature set of a classifier.

We evaluated two different ways of using this architecture to solve the five-category classification problem. For the sake of conciseness, let us denote the set of target categories at stage i as C_i and its complement as C_i^c.

The first setup teaches each classifier to tell apart instances of one class from the rest. For the classifier at stage 1, C_1 = {1} while C_1^c = {2, 3, 4, 5}. The model at stage 2 learns to classify C_2 = {2} versus C_2^c = {1, 3, 4, 5}, and similarly for stage 3. For the last stage, we set C_4 = {5} and C_4^c = {1, 2, 3, 4}.

The other setup biases the classifiers so that they tend to classify instances misclassified by upstream classifiers as the stage target category. In this case, for stage 1 we have C_1 = {1} and C_1^c = {2, 3, 4, 5}. For stage 2, C_2 = {1, 2} and C_2^c = {3, 4, 5}. Note that in this case stage 2 also considers category 1 as part of the target class. We proceeded analogously for the other stages, except for the last one, which is configured as C_4 = {5} and C_4^c = {1, 2, 3, 4}.

To classify an instance, we present it to the classifier at stage 1. If it is classified as the stage target category, we are done. Otherwise, i.e., if it is classified as C_1^c, the instance is presented to the classifier at the next stage. This process is repeated at each stage until the end. Each classifier is a binary version of the model described in Section 3.1, and all of them are trained separately. For each of these models, we evaluated multilingual BERT [9] and BETO [8], yielding four different approaches. A sketch of the cascade logic follows.
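The following sketch summarizes the two label setups and the stage-by-stage decision rule. It assumes four already fine-tuned binary stage models exposing a hypothetical `predict_is_target(text)` method; the category emitted at each biased stage and the C_3 set of the biased setup follow our reading of the "analogously" remark above.

```python
# Target category sets C_i per stage, as defined in Section 3.2.
UNBIASED_TARGETS = [{1}, {2}, {3}, {5}]         # first setup: one class vs. the rest
BIASED_TARGETS = [{1}, {1, 2}, {1, 2, 3}, {5}]  # second setup: stages absorb upstream classes

# Category assigned when an instance stops at a given stage.
STAGE_LABELS = [1, 2, 3, 5]

def stage_label(category, stage, biased=True):
    """Binary training label for one stage: 1 if the category is in C_i, else 0 (C_i^c)."""
    targets = BIASED_TARGETS if biased else UNBIASED_TARGETS
    return int(category in targets[stage])

def classify(text, stage_models):
    """Cascade inference (Fig. 2b): stop at the first stage that claims the instance."""
    for label, model in zip(STAGE_LABELS, stage_models):
        if model.predict_is_target(text):  # classified as the stage target C_i
            return label
        # otherwise (C_i^c): pass the instance on to the next stage classifier
    return 4  # rejected by stage 4 (C_4 = {5}), so the remaining category is 4
```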
4 Results

After the data was processed, it was divided 90%-10% into training and validation sets. During the fine-tuning process, special attention was paid to different evaluation criteria such as MAE and balanced accuracy. As a result of the experimentation, four new models were obtained. Table 1 shows six models in total, because we also include benchmark models for BERT and BETO respectively. We describe each of these six models below.

Table 1 lists the models ordered by their training MAE, from lowest to highest. At the bottom is the BERT Multi model, a BERT model trained on a multilingual corpus, which we considered our benchmark. The rest of the model labels can be interpreted with the following nomenclature: Multi indicates that the model was trained with a multilingual corpus; Biased and Unbiased refer to the two cascade setups described in Subsection 3.2.

The BETO Biased model obtained the best result, with a validation MAE of 0.51, so it was selected as our primary submission. Along with it, the BETO Multi model was sent as the secondary submission, with a validation MAE of 0.53. By selecting the BETO Multi model as the secondary submission, we wanted to validate the two cascade variants proposed in this paper. As the final step before submitting, the two selected models were retrained on all the available data.

Table 1. Experiment results.

Model                  MAE           RMSE          Acc.          F1            Rec.          Prec.
                       train  val    train  val    train  val    train  val    train  val    train  val
BETO Biased (sub1)     0.03   0.51   0.06   0.65   0.95   0.44   0.92   0.40   0.95   0.44   0.90   0.40
BETO Unbiased          0.04   0.53   0.08   0.68   0.95   0.48   0.90   0.44   0.95   0.48   0.87   0.49
BERT Multi Biased      0.08   0.53   0.13   0.68   0.94   0.45   0.87   0.43   0.94   0.45   0.84   0.48
BERT Multi Unbiased    0.26   0.67   0.71   1.23   0.80   0.47   0.70   0.36   0.80   0.47   0.74   0.38
BETO Multi (sub2)      0.39   0.53   0.50   0.73   0.74   0.53   0.65   0.48   0.74   0.53   0.62   0.48
BERT Multi             0.62   0.70   0.91   1.05   0.59   0.51   0.45   0.40   0.59   0.51   0.41   0.37

In this subtask, 8 teams competed, with 14 submissions in total. In the team ranking we were second, and in the submission ranking we placed second and third; our best-evaluated submission turned out to be the secondary one, with a MAE of 0.5451. It should be noted that our best submission achieved the best result in F-measure and Precision among all participants. A summary of the competition is shown in Table 2.

Table 2. Competition ranking.

Team Rank  Sub. Rank  Team            MAE      RMSE     Accuracy  F-measure  Recall  Precision
1st        1          Minería UNAM 1  0.4752   0.7549   56.7238   0.4280     0.4992  0.4035
2nd        2          UCT-UA 2        0.5451   0.8540   53.2491   0.4512     0.4662  0.4933
           3          UCT-UA 1        0.5614   0.9023   53.8357   0.4035     0.3984  0.4626
3rd        4          DCI-UG 1        0.56273  0.8843   53.3394   0.2870     0.3405  0.2827
           5          Minería UNAM 2  0.5826   0.9498   54.7834   0.2428     0.2732  0.2549
           6          DCI-UG 2        0.6060   0.97046  53.7004   0.2539     0.3004  0.2772
           -          BASELINE        0.7238   1.1620   51.3538   0.1357     0.1027  0.200
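For reference, the validation criteria mentioned above and the competition metrics can be computed with scikit-learn. This is a minimal sketch with hypothetical `y_true` and `y_pred` arrays; the macro averaging shown is an assumption, not necessarily the official evaluation setup.

```python
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             mean_absolute_error, mean_squared_error,
                             precision_score, recall_score)

y_true = [5, 4, 5, 1, 3, 2]  # hypothetical gold polarity labels (1-5 scale)
y_pred = [5, 4, 4, 2, 3, 2]  # hypothetical model predictions

print("MAE:          ", mean_absolute_error(y_true, y_pred))
print("RMSE:         ", mean_squared_error(y_true, y_pred) ** 0.5)
print("Balanced acc.:", balanced_accuracy_score(y_true, y_pred))
print("F-measure:    ", f1_score(y_true, y_pred, average="macro"))
print("Recall:       ", recall_score(y_true, y_pred, average="macro"))
print("Precision:    ", precision_score(y_true, y_pred, average="macro"))
```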
5 Conclusion and Future Work

In this paper, we have described the models proposed by UCT-UA for the Sentiment Analysis subtask at Rest-Mex 2021. We presented two models. The results of our secondary submission were obtained from the model described in the Results section as BETO Multi; this model achieved the second-best result in the subtask, with a MAE of 0.5451. The results of our primary submission were obtained from the model described in the Results section as BETO Biased; this model achieved the third-best result in the subtask, with a MAE of 0.5614.

Compared to the models using multilingual BERT, the results suggest that the monolingual embedding is a better representation. This is consistent with the results reported by the BERT team⁶, where for high-resource languages the multilingual model may achieve worse results than the single-language model. Moreover, this effect can be aggravated here because fine-tuning was done using Spanish only, thus degrading the multilingual representation spaces learned by the transformer.

As future work, we are interested in evaluating whether the multilingual models can benefit from Tripadvisor reviews in different languages or on different topics. We would also like to study multi-modal approaches that leverage information from the title or metadata of the review to boost the results.

⁶ https://github.com/google-research/bert/blob/master/multilingual.md

6 Acknowledgments

This research work has been partially funded by the Generalitat Valenciana (Conselleria d'Educació, Investigació, Cultura i Esport) and the Spanish Government through the projects SIIA (PROMETEO/2018/089, PROMETEU/2018/089) and LIVING-LANG (RTI2018-094653-B-C22), and by the Vice Chancellor for Research and Postgraduate Studies Office of the Universidad Católica de Temuco through VIPUCT Project No. 2020EM-PS-08 and FEQUIP 2019-INRN-03 of the Universidad Católica de Temuco.

References

1. Agüero-Torales, M., Abreu-Salas, J., López-Herrera, A.: Deep learning and multilingual sentiment analysis on social media data: An overview. Applied Soft Computing, vol. 107 (2021)
2. Zhang, L., Wang, S., Liu, B.: Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 4 (2018)
3. González, J.A., Hurtado, L.F., Pla, F.: ELiRF-UPV at TASS 2019: Transformer Encoders for Twitter Sentiment Analysis in Spanish. In: Proc. of IberLEF@SEPLN (2019)
4. Pastorini, M., Pereira, M., Zeballos, N., Chiruzzo, L., Rosá, A., Etcheverry, M.: RETUYT-InCo at TASS 2019: Sentiment Analysis in Spanish Tweets. In: Proc. of IberLEF@SEPLN (2019)
5. González, J., Pla, F., Hurtado, L.: ELiRF-UPV at SemEval-2017 Task 4: Sentiment analysis using deep learning. In: Proc. of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (2017)
6. Álvarez-Carmona, M.Á., Aranda, R., Arce-Cárdenas, S., Fajardo-Delgado, D., Guerrero-Rodríguez, R., López-Monroy, A.P., Martínez-Miranda, J., Pérez-Espinosa, H., Rodríguez-González, A.: Overview of Rest-Mex at IberLEF 2021: Recommendation System for Text Mexican Tourism. Procesamiento del Lenguaje Natural, vol. 67 (2021)
7. Calvo, H., Gambino, O.: Cascading classifiers for Twitter sentiment analysis with emotion lexicons. In: Proc. Int. Conf. on Intelligent Text Processing and Computational Linguistics, pp. 270-280 (2016)
8. Cañete, J., Chaperon, G., Fuentes, R., Pérez, J.: Spanish pre-trained BERT model and evaluation data. In: Proc. of PML4DC at ICLR (2020)
9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2018). arXiv:1810.04805
10. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631-1642 (2013)
11. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proc. of the 2001 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (2001)