Advantages of XLM-R Model for Urdu Sentiment Multi- Classification Mingcan Guo, Zhongyuan Han*, Leilei Kong, Zhijie Zhang, Zengyao Li, Haoyang Chen and Haoliang Qi Foshan University, Foshan, China Abstract Sentiment Multi-Classification detection has gained much attention in recent years. The multi-label sentiment detection task refers to additional comments based on social media or shopping platforms, which usually contain different personal solid emotions. How to classify these reviews into multiple sentiment types using efficient methods such as machine learning is the main content of this type of task. We describe our XLM-R based method for tracking emotion detection task at FIRE 2022 in this paper. The system uses the XLM-R pretrained model to extract semantic features from Urdu text. After using dynamic learning rate-based tuning, we found that the model is more stable in performance and has a higher score on the test set. In the final result, our system achieved a Micro F1 score of 0.759 and a Macro F1 score of 0.687 in this task and won the first rank in the FIRE2022 track emotion detection task. Keywords 1 sentiment, Multi-Classification, XLM-R, social media 1. Introduction With the continuous iteration of Internet technology, more and more people choose to express their emotional remarks and opinions on social media such as Twitter and Facebook [1]. People hope to openly express their views on social economy, culture, politics, etc. from multiple perspectives. How to classify these emotional speeches and give them different emotional labels will have a positive effect on managing and regulating the community and improving the user experience. It can also help companies and governments to collect rich emotional information. At the same time, it has important implications both in the research and industrial fields of artificial intelligence. Therefore, some method needs to be used to identify and divide these speeches, which is the task of multi-label sentiment detection [2]. With the advent of the Internet age, the world is connected as a whole, people from different countries can communicate freely on the Internet, and thousands of speeches are posted daily, providing rich language resources for the emotion detection system in past research. Urdu is the official language of Pakistan. It is the 10th-most widely spoken language in the world, with 230 million total speakers1. It is widely used in many countries, such as India and Nepal. Because of the large number of speakers and the wide range of application areas of Urdu, it is crucial to understand the native speakers of Urdu by analyzing and studying the sentiment classification of Urdu by language processing systems. In the emotions and threat detection shared tasks at FIRE 2022 task [3, 4], our team uses the Urdu script dataset [5] and a pre-trained XLM-R based language processing system for multi-label classification. To achieve faster model convergence and better model performance, we found that the 1https://en.wikipedia.org/wiki/Urdu FIRE 2022: Forum for Information Retrieval Evaluation, December 9–13, 2022, India EMAIL: gmc9812@163.com(M. Guo); hanzhongyuan@gmail.com (Z. Han)(*corresponding author); kongleilei@fosu.edu.cn (L. Kong); zhangzhijie5454@gmail.com(Z. Zhang); lzy1512192979@gmail.com(Z. Li) hoyo.chen.i@gmail.com(H. Chen); haoliang.qi@gmail.com(H. Qi) ORCID: 0000-0002-4977-2138 (M. Guo); 0000-0001-8960-9872 (Z. Han); 0002-4636-3507(L. Kong); 0000-0002-4854-0618 (Z. Zhang); 0000-0001-8472-4150 (Z. Li); 0000-0003-3223-9086 (H. Chen); 0000-0003-1321-5820 (H. Qi) ©️ 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) XLM-R model has a strong normalization ability for non-mainstream languages like Urdu, and can well meet the needs of the task after learning rate tuning, so we use model tuning and adaptive learning rate tuning to verify the effectiveness of sentiment classification in this dataset. The overall result distribution of this paper is as follows. In the second part, we will introduce the related work and give an overview of the related historical research. We will present our experimental method in the third part and submit the practical steps and associated details in the fourth part. Finally, we will analyze the results of our experiments on the official dataset and add a summary to the whole paper. 2. Related Work People have been working on multi-label sentiment classification for years [6]. People usually divide sentiments according to rules and dictionaries in traditional sentiment classification. For example, J. Blitzer et al. [7] extended the correspondence learning (SCL) algorithm [8] to sentiment classification to detect Amazon's products of different product types. Regarding the sentiment of comments, K. Deneck studied the multi-domain sentiment analysis problem based on SentiWordNet as a lexical resource [9]. G. Xu et al. proposed an extended dictionary-based Chinese sentiment analysis method on the rule-based analysis method [10]. Their experiments build a vast sentiment lexicon covering five domains: hotel, number, fruit, clothing, and shampoo. However, traditional methods do not work well on social media platforms that generate many comments of different types daily. In recent years, with the continuous development of neural network and convolutional network technology and the emergence of transformer models with excellent self-attention mechanisms [11], rule-based analysis methods have gradually been replaced by feature-supervised learning methods based on pre-trained models, S. Mahata et al. [12] proposed a model based on bidirectional LSTM and language tagging using FastText embedding to generate word vectors to train the model, trying to solve the sentiment analysis problem of English-Tamil code mixed data, B. R. Chakravarthi et al. [13] propose Long Short Term Memory (LSTM) networks and language-specific pre-processing, they involved applying an attention layer on contextualized word embeddings and fine-tuning a model pretrained on the training data of the previous version, DravidianCodeMix-2020, to recognize Tamil and Malayalam in social media comments Emotions in mixed languages. In addition to traditional rule-based sentiment classification and popular neural network pre-trained model-based sentiment classification, Y. Wu et al. [14] proposed a multimodal sentiment classification method based on cross-modal prediction centered on text modality, two types of information are mined from speech modalities and image modalities to assist text modalities, and a text-centric multimodal feature fusion mechanism is designed to perform feature fusion on multimodal features. In previous sentiment analysis tasks, transformers-based models often achieve good results in different tracks [15]. For example, Y. Bai et al. [16] combined the fine-tuning method of XLM- RoBERTa and CNN through downstream tasks and obtained first place in the mission of Sentiment Analysis of Dravidian Languages in Code-Mixed Text. In shared task, study by L. Khan et al. [17] shows that the combination of word n-gram features with LR outperformed other classifiers for sentiment analysis task, obtaining the highest F1 score of 82.05% using combination of features. In related supervised classification problems, the content-based word unigram method used by I. Ameer et al. [18] outperforms other content-based feature-based methods. L. Khan et al. [19] used four text representations: word n-grams, char n-grams, pretrained fastText, and BERT word embeddings to train the classifier. Their proposed mBERT model with BERT-pretrained word embeddings outperforms deep learning, machine learning and rule-based classifiers and achieves an F1 score of 81.49%. 3. Methodology 3.1. Model Description In this task, our method uses a pre-trained model based on XLM-RoBERTa [20], referred to as XLM-R. It inherits the method of XLM and draws on the ideas of RoBERTa [21]. Compared with XLM, XLM-R expands not only the language but also the training data. Therefore, similar to other transformer structures, the architecture of XLM-R can be and is better suited for text classification tasks. It takes as input a sequence of no more than 512 tokens and outputs a representation of that sequence. The first token of the sequence is always [CLS], which contains the particular categorical embedding. As shown in Figure 1, Urdu tweets are passed into token classification, a linear classification layer that takes the token sequence and the final hidden state of each Urdu text as input and assigns it to each token. The cards generate the label output, and the top softmax classifier is used to predict the probability of label C, which is finally classified into different sentiment labels. Figure 1: Urdu text classification architecture with XLM-R [22] 3.2. Neural Network Tuning Among many optimization methods, learning rate-based tuning is usually given priority. M. D. Zeiler [23] uses SGD, Momentum, ADAGRAD, and ADADELTA in a supervised manner. The neural network is trained to minimize the cross-entropy between the network output predictions and the target labels, and the results show that the model is sensitive to the parameters of the learning rate, and an accurate learning rate can quickly converge the model error around the optimal performance that occurs with momentum. Common learning rate change strategies include preset rule learning rate change methods, including fixed, step, exp, inv, multistep, poly, sigmoid, etc. Compared with non-adaptive learning rate transformation methods, the model’s absolute value is reduced, and performance impact is. In this task, we use ReduceLROnPlateau [24] in Keras to adjust the learning rate strategy, based on the Adam [25] adaptive learning rate algorithm, we detect the change of the Loss index at each epoch. The learning rate adjustment will be triggered when the Loss no longer decreases within a certain period. The adjustment strategy is shown in formula 1, where λ is the attenuation multiplication factor. 𝑛𝑒𝑤_𝑙𝑟 = 𝜆 × 𝑜𝑙𝑑_𝑙𝑟 (1) The learning rate decay in our experiments is shown in Figure 2. The value on y-axis in the figure is magnified by 105 times. According to the algorithm, after it is detected that the Loss no longer decreases for a certain period, the learning rate will be automatically lowered in the next epoch to make the model converge to the best performance continuously. Figure 2: Learning rate varies with the epoch 4. Experiments 4.1. Data and Pre-Processing The dataset provided by the task is obtained through Twitter [2], tweets are collected in a CSV file through the Twitter open API, other kinds of languages are excluded, and only the purest Urdu tweets are kept. These comments can be divided into seven emotions: anger, disgust, fear, sadness, surprise, happiness, and neutral. A total of over 10,000 tweets collected are divided into 7,800 training sets and 1,950 test sets. Table 1 shows statistical data according to different categories. All Urdu texts are normalized with diacritics removed and spaces added after numbers, punctuation, and stop words. Table 1 The division of train and test sets for different emotion types Datasets Labels Training Test neutral 3014 753 happiness 1046 261 surprise 1550 388 sadness 2190 548 Urdu Tweets fear 609 152 disgust 761 190 anger 811 203 Total 7800 1950 Figure 3: Experimental process We use the XLM-R model to train the dataset and add the adaptive learning rate change strategy tuning in this work. The hyperparameters of the XLM-R model are set to lr=2e-5, batch_size=32, max_len=128, hidden_size=768, epochs=15, learning rate decay parameters are set to factor=0.6, cooldown=0, min_lr=0, eps=1e-08. After loading the model, we choose adamW as the optimizer of the model and cross entropy loss as the loss function. During training, we set the model to be verified every 100 steps and added the gradient reset to save the optimal model when confirming the set. The learning rate gradient decays, and finally, we use the activation function in the output layer to output the predicted label of the model. Figure 3 shows the overall architecture of the experimental process. Before the official release of the test set with labels, we divided the training set into 9:1 for model training and model validation. 4.2. Results We use a variety of evaluation metrics to verify the performance of our model, including Accuracy, Precision, Recall, Macro F1, Micro F1, and Loss. And we choose Accuracy and Loss among these metrics to validate the model in train set. 𝑇𝑃+𝑇𝑁 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = 𝑇𝑃+𝑇𝑁+𝐹𝑃+𝐹𝑁 (2) 𝐿𝑜𝑠𝑠 = −𝑤 ∗ [𝑝 ∗ 𝑙𝑜𝑔(𝑞) + (1 − 𝑝) ∗ 𝑙𝑜𝑔(1 − 𝑞)] (3) In training the model, we added the methods without learning rate optimization for comparison. After each epoch, we call the index evaluation method in the sklearn method library to output the evaluation of the model on the validation set. As shown in Figure 4 and 5, on the two evaluation indicators of Accuracy and Loss, the model tuned by the adaptive learning rate can converge to near the best performance to obtain the highest evaluation score. With the adjustment of the learning rate, the model’s score is gradually stabilized at the optimum on the premise that the Loss function does not increase, and the performance can be kept stable in the process of continuous iteration. The performance indicators of the trained model on the validation data set can reach the best, of which the Accuracy reaches 0.676, and the Loss reaches 0.178. Figure 4: Adjust the learning rate on Accuracy Figure 5: Adjust the learning rate on Loss As shown in Table 2, the final evaluation of our model on the test set is 0.636 for Accuracy, 0.759 for Weighted F1, 0.759 for Micro F1, 0.687 for Macro F1, and 0.088 for Loss, ranking first among all eight participating teams, which effectively verifies the reliability of our model. Table 2 Final top six results Rank Team Accuracy Weighted F1 Micro F1 Macro F1 Loss 1 FOSUNlpTeam 0.636 0.759 0.759 0.687 0.088 2 Team2 0.616 0.743 0.749 0.669 0.088 3 Team3 0.612 0.709 0.742 0.615 0.092 4 Team4 0.582 0.696 0.692 0.603 0.113 5 Team5 0.593 0.699 0.720 0.599 0.092 6 Team6 0.385 0.611 0.477 0.466 0.340 5. Conclusion This paper mainly introduces our work results on Emotion and Threat detection in Urdu task. Our work combines a pre-trained XLM-R model with adaptive learning rate optimization to solve the multi- label classification problem of Urdu text. The final ranking effectively validated our method. However, combined with the final labeled dataset, we noticed that our method still needs to be optimized, such as paying more attention to the implementation of downstream tasks and the adjustment of model parameters, ignoring the possible impact of preprocessing such as text filtering on the score. Our work on sentence type classification and sample equalization processing are still relatively lacking. The next step will be to preprocess sentence weighting and classification and combine profound learning aspects with building a more robust model processing system. Our code is available on GitHub2. 6. Acknowledgments This work is supported by the National Social Science Fund of China (No.18BYY125). 7. References [1] H. Slim, M. Hafedh, Social media impact on language learning for specific purposes: A study in english for business administration, Teaching english with technology 19 (2019) 56–71. [2] N. Ashraf, L. Khan, S. Butt, H.-T. Chang, G. Sidorov, A. Gelbukh, Multi-label emotion classification of urdu tweets, PeerJ Computer Science 8 (2022) e896. [3] S. Butt, M. Amjad, F. Balouchzahi, N. Ashraf, R. Sharma, G. Sidorov, A. Gelbukh, Overview of EmoThreat: Emotions and Threat Detection in Urdu at FIRE 2022, in: CEUR Workshop Proceedings, 2022. [4] S. Butt, M. Amjad, F. Balouchzahi, N. Ashraf, R. Sharma, G. Sidorov, A. Gelbukh, EmoTh- reat@FIRE2022: Shared Track on Emotions and Threat Detection in Urdu, in: Forum forInformation Retrieval Evaluation, FIRE 2022, Association for Computing Machinery, New York, NY, USA, 2022. [5] M. Amjad, N. Ashraf, A. Zhila, G. Sidorov, A. Zubiaga, A. Gelbukh, Threatening language detection and target identification in urdu tweets, IEEE Access 9 (2021) 128302–128313. [6] P. Zhao, L. Hou, O. Wu, Modeling sentiment dependencies with graph convolutional networks for aspect-level sentiment classification, Knowledge-Based Systems 193 (2020) 105443. [7] J. Blitzer, M. Dredze, F. Pereira, Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification, in: Proceedings of the 45th annual meeting of the association of computational linguistics, 2007, pp. 440–447. [8] J. Blitzer, R. McDonald, F. Pereira, Domain adaptation with structural correspondence learning, in: Proceedings of the 2006 conference on empirical methods in natural language processing, 2006, pp. 120–128. 2https://github.com/xiguagaizi/multi_label_classification-main.git [9] K. Denecke, Are sentiwordnet scores suited for multi-domain sentiment classification? in: 2009 Fourth International Conference on Digital Information Management, IEEE, 2009, pp. 1–6. [10] G. Xu, Z. Yu, H. Yao, F. Li, Y. Meng, X. Wu, Chinese text sentiment analysis based on extended sentiment dictionary, IEEE Access 7 (2019) 43749–43762. [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polo-sukhin, Attention is all you need, Advances in neural information processing systems 30 (2017). [12] S. Mahata, D. Das, S. Bandyopadhyay, Sentiment classification of code-mixed tweets using bi- directional rnn and language tags, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, 2021, pp. 28–35. [13] B. R. Chakravarthi, R. Priyadharshini, S. Thavareesan, D. Chinnappa, D. Thenmozhi, E. Sherly, J. P. McCrae, A. Hande, R. Ponnusamy, S. Banerjee, et al. Findings of the sentiment analysis of dravidian languages in code-mixed text, arXiv preprint arXiv:2111.09811 (2021). [14] Y. Wu, Z. Lin, Y. Zhao, B. Qin, L.-N. Zhu, A text-centered shared-private framework via cross- modal prediction for multimodal sentiment analysis, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 4730–4738. [15] Y. P. Babu, R. Eswari, Sentiment analysis on dravidian code-mixed youtube comments using paraphrase xlm-roberta model, Working Notes of FIRE (2021). [16] Y. Bai, B. Zhang, Y. Gu, T. Guan, Q. Shi, Automatic detecting the sentiment of code-mixed text by pre-training model, Working Notes of FIRE (2021). [17] L. Khan, A. Amjad, N. Ashraf, H.-T. Chang, A. Gelbukh, Urdu sentiment analysis with deep learning methods, IEEE Access 9 (2021) 97803–97812. [18] I. Ameer, N. Ashraf, G. Sidorov, H. Gómez Adorno, Multi-label emotion classification using content-based features in twitter, Computación y Sistemas 24 (2020) 1159–1164. [19] L. Khan, A. Amjad, N. Ashraf, H.-T. Chang, Multi-class sentiment analysis of urdu text using multilingual bert, Scientific Reports 12 (2022) 1–17. [20] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019). [21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019). [22] T. Ranasinghe, M. Zampieri, Multilingual offensive language identification with cross-lingual embeddings, arXiv preprint arXiv:2010.05324 (2020). [23] M. D. Zeiler, Adadelta: an adaptive learning rate method, arXiv preprint arXiv:1212.5701 (2012). [24] A. Gulli, S. Pal, Deep learning with Keras, Packt Publishing Ltd, 2017. [25] S. J. Reddi, S. Kale, S. Kumar, On the convergence of adam and beyond, arXiv preprint arXiv:1904.09237 (2019).