Automatic Detecting the Sentiment of Code-Mixed
Text by Pre-training Model
Yang Bai, Bangyuan Zhang, Wanli Chen, Yongjie Gu, Tongfeng Guan and
Qisong Shi
Data Intelligence Center, ZhongYuan Bank, Henan, P.R.China


Abstract
With the development of the Internet, more and more people express their opinions on social
media. These opinions may be positive or negative, and they can affect society and individuals
accordingly. Detecting sentiment automatically is therefore important, especially in code-mixed
text. In this paper, we describe our system submitted to Dravidian-CodeMix-FIRE 2021, which
consists of three sub-tasks. We participate in the Malayalam-English and Tamil-English tasks.
Our system uses XLM-RoBERTa as the base model, fine-tuned on the Malayalam-English and
Tamil-English data from 2020 and 2021. It achieves an F1-score of 80.40% in the
Malayalam-English task and 67.60% in the Tamil-English task.

Keywords
code-mixed, sentiment, XLM-RoBERTa


1. Introduction
With the rapid development of social networks such as Twitter and Facebook, people can
express their views and opinions on the Internet anytime and anywhere. On social media, users
freely express views that may be negative or positive [1]. These views have an impact on
individuals and society [2], so an automated method is needed to recognize the sentiment they
carry. This is the task of sentiment analysis.
   With globalization, people from different countries communicate with each other on social
media, so a large amount of code-mixed text appears on the Internet [3]. For example, there is
much Malayalam-English [4], Tamil-English [5] and Kannada-English [6] code-mixed text.
Malayalam is one of the Dravidian languages of southern India. Tamil is a language with a
history of more than 2000 years; it belongs to the Dravidian language family and is widely
spoken in southern India and northeast Sri Lanka. Kannada is the official language of
Karnataka in India and also belongs to the Dravidian language family. In this paper, we present
the methods used and the results obtained by our team, ZYBank-AI, in
Dravidian-CodeMix-FIRE 2021 [7]. We participate in the Malayalam-English and Tamil-English
tasks.


Forum for Information Retrieval Evaluation, December 13-17, 2021, India
Email: baiyang07@zybank.com.cn (Y. Bai); zhangbangyuan@zybank.com.cn (B. Zhang);
chenwanli@zybank.com.cn (W. Chen); guyongjie01@zybank.com.cn (Y. Gu);
guantongfeng01@zybank.com.cn (T. Guan); shiqisong01@zybank.com.cn (Q. Shi)
ORCID: 0000-0002-7175-2387 (Y. Bai)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. Related Work
In recent years, more and more researchers have focused on the sentiment analysis of
code-mixed texts [8]. Traditional approaches to sentiment analysis rely on dictionaries or on
machine learning, but these methods are not effective on social media data. More recently,
with the development of transfer learning, pre-trained models such as word2vec [9], BERT [10]
and XLM-RoBERTa [11] have emerged and have greatly advanced multilingual sentiment
analysis. In 2020, Chakravarthi et al. [12] organized Dravidian-CodeMix-FIRE 2020, the first
shared task on sentiment analysis of Malayalam-English and Tamil-English text. The SRJ
team [13] combined XLM-RoBERTa with a CNN, fine-tuned on the downstream tasks, and ranked
first in both subtasks.


3. Methodology
3.1. Dataset
For the Malayalam and Tamil tasks, all the data from Dravidian-CodeMix-FIRE 2020 (last year's
data) and Dravidian-CodeMix-FIRE 2021 (this year's data) are used to train the model.
   For the Malayalam task: (1) For Dravidian-CodeMix-FIRE 2020, we combine the training,
validation and test datasets, shuffle the combined data, and divide it into a new training
dataset and a new validation dataset at a ratio of 5:1. The new training set contains 5777
sentences: 2410 for the 'Positive' class, 632 for 'Negative', 1631 for 'unknown_state', 347 for
'Mixed_feelings', and 757 for 'not-malayalam'. The new validation set contains 963 sentences:
401 for the 'Positive' class, 109 for 'Negative', 272 for 'unknown_state', 57 for
'Mixed_feelings', and 127 for 'not-malayalam'. (2) For Dravidian-CodeMix-FIRE 2021, we combine
the training and validation datasets and divide the combined data with stratified 5-fold
cross-validation.
   For the Tamil task: (1) For Dravidian-CodeMix-FIRE 2020, we combine the training,
validation and test datasets, shuffle the combined data, and divide it into a new training
dataset and a new validation dataset at a ratio of 5:1. The new training set contains 13494
sentences: 9050 for the 'Positive' class, 1746 for 'Negative', 729 for 'unknown_state', 1543 for
'Mixed_feelings', and 426 for 'not-Tamil'. The new validation set contains 2250 sentences:
1509 for the 'Positive' class, 291 for 'Negative', 121 for 'unknown_state', 258 for
'Mixed_feelings', and 71 for 'not-Tamil'. (2) For Dravidian-CodeMix-FIRE 2021, we combine the
training and validation datasets and divide the combined data with stratified 5-fold
cross-validation.
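   The two data-preparation steps described in this section (a shuffled train/validation split and stratified 5-fold cross-validation) could be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the submitted implementation: the function names `shuffle_split` and `stratified_kfold`, the fixed seed, and the toy labels are hypothetical, and the split ratio is left as a parameter.

```python
import random
from collections import defaultdict


def shuffle_split(samples, train_parts=5, valid_parts=1, seed=42):
    """Shuffle samples and split them into train/validation by a ratio."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + valid_parts)
    return shuffled[:cut], shuffled[cut:]


def stratified_kfold(labels, k=5, seed=42):
    """Yield (train_idx, valid_idx) pairs whose folds keep class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        valid = sorted(folds[i])
        train = sorted(idx for j in range(k) if j != i for idx in folds[j])
        yield train, valid


# Toy data only (assumption); the real class labels are listed in the text.
toy_labels = ["Positive"] * 10 + ["Negative"] * 5
train_set, valid_set = shuffle_split(range(12))
splits = list(stratified_kfold(toy_labels, k=5))
```

Dealing each class round-robin across the folds is one simple way to keep the class proportions of an imbalanced dataset roughly equal in every fold; library implementations of stratified cross-validation may distribute the remainders differently.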
Figure 1: XLM-RoBERTa with Attention.


3.2. Method Description
In this task, our main work builds on a pre-trained language model, and we choose
XLM-RoBERTa as the base model. XLM-RoBERTa was proposed by the Facebook AI team and can
be understood as a combination of XLM and RoBERTa. It is trained on 2.5 TB of text in 100
languages, drawn from the CommonCrawl corpus, and shows excellent performance on many
multilingual tasks. Recent research has pointed out that the 12 hidden layers of the BERT
model contain different linguistic information. To obtain more semantic information, we
apply a self-attention mechanism to the 12 hidden layers of XLM-RoBERTa. Figure 1 shows our
model, and the implementation flow of the attention mechanism is shown in Figure 2.
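   One common way to apply attention across hidden layers is to score one vector per layer, normalize the scores with a softmax, and take the weighted sum. The sketch below illustrates that idea in plain Python. It is an assumption about the general technique, not a reproduction of Figure 2: the function `attention_pool`, the dot-product scoring, the `query` vector (which would be a learned parameter in practice) and the toy 3-dimensional vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(layer_vectors, query):
    """Weight per-layer vectors by softmax(query . vector) and sum them."""
    scores = [sum(q * h for q, h in zip(query, vec)) for vec in layer_vectors]
    weights = softmax(scores)
    dim = len(layer_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
        for d in range(dim)
    ]
    return pooled, weights


# Toy stand-in for the 12 hidden-layer outputs (assumption); the real
# vectors of xlm-roberta-base have 768 dimensions.
layers = [[0.1 * i, 1.0 - 0.05 * i, 0.5] for i in range(12)]
query = [1.0, 0.0, 0.0]
pooled, weights = attention_pool(layers, query)
```

The pooled vector has the same dimension as a single layer output, so it can be passed to a classification head in place of the last-layer representation.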
   The size of the training data strongly affects the quality of the final result, and the
small size of the official dataset limits the performance of the model. Therefore, in the first
stage we fine-tune the model on the Dravidian-CodeMix-FIRE 2020 Malayalam and Tamil data and
select the checkpoint with the best result on the validation set as the first-stage fine-tuned
model. In the second stage, this model is trained further on this year's Malayalam and Tamil
data, and the checkpoint with the best result on the validation set is selected as the
second-stage fine-tuned model. The method is shown in Figure 3.
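   The two-stage procedure, with the best validation checkpoint carried from one stage to the next, could be organized as in the sketch below. It is a schematic illustration only: `fine_tune`, `make_stage`, and the one-number "model state" are hypothetical stand-ins that let the example run without a real model, and the 15 epochs per stage follow Table 1.

```python
import copy


def fine_tune(model_state, train_one_epoch, evaluate, epochs):
    """Train for `epochs` epochs and return the best-scoring checkpoint."""
    best_state, best_score = copy.deepcopy(model_state), float("-inf")
    state = model_state
    for _ in range(epochs):
        state = train_one_epoch(state)
        score = evaluate(state)
        if score > best_score:
            # Keep a copy of the checkpoint with the best validation score.
            best_state, best_score = copy.deepcopy(state), score
    return best_state, best_score


def make_stage(target, step):
    """Hypothetical stage: the validation score peaks when w hits target."""
    def train_one_epoch(state):
        return {"w": state["w"] + step}

    def evaluate(state):
        return -abs(state["w"] - target)

    return train_one_epoch, evaluate


train_2020, eval_2020 = make_stage(target=3.0, step=1.0)
train_2021, eval_2021 = make_stage(target=4.0, step=0.5)

pretrained = {"w": 0.0}
# Stage 1: last year's data; stage 2 starts from the stage-1 checkpoint.
stage1, stage1_score = fine_tune(pretrained, train_2020, eval_2020, epochs=15)
stage2, stage2_score = fine_tune(stage1, train_2021, eval_2021, epochs=15)
```

Returning the best checkpoint rather than the last one means that later epochs that overfit do not affect the model handed to the next stage.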
Figure 2: Attention mechanism.


Figure 3: Our method.


3.3. R-Drop
Deep neural networks (DNNs) have recently achieved remarkable success in various fields. When
training these large-scale models, regularization techniques such as L2 regularization,
batch normalization and dropout are indispensable. These techniques prevent the model from
over-fitting and improve its generalization ability. Among them, dropout has become the most
widely used regularization technique because it only needs to discard a part of the neurons
during training.
   Liang et al. [14] proposed a further regularization method based on dropout: Regularized
Dropout (R-Drop). In each mini-batch, each data sample passes twice through the same model
with dropout, and R-Drop uses the KL divergence to constrain the two outputs to be consistent.
Their results show that R-Drop improves performance.
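   For a single training example, the R-Drop objective adds a symmetric KL term between the two dropout passes to the cross-entropy of both passes. The sketch below computes it in plain Python from two logit vectors. It is a minimal illustration, not the submitted code: the function names and the toy logits are assumptions, and the default `alpha=4.0` follows the value of α reported in Table 1.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw logits, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_divergence(log_p, log_q):
    """KL(p || q) for two log-probability vectors."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def r_drop_loss(logits_a, logits_b, label, alpha=4.0):
    """Cross-entropy on both passes plus alpha times the symmetric KL term."""
    log_p, log_q = log_softmax(logits_a), log_softmax(logits_b)
    cross_entropy = -(log_p[label] + log_q[label])
    symmetric_kl = 0.5 * (
        kl_divergence(log_p, log_q) + kl_divergence(log_q, log_p)
    )
    return cross_entropy + alpha * symmetric_kl


# Toy logits for five classes from two forward passes of the same input;
# in training they differ only because dropout masks differ (assumption).
first_pass = [2.0, 0.5, 0.1, -1.0, 0.0]
second_pass = [1.5, 0.7, 0.0, -0.8, 0.2]
loss = r_drop_loss(first_pass, second_pass, label=0)
```

When the two passes produce identical outputs the KL term vanishes and the loss reduces to twice the ordinary cross-entropy, so the extra term only penalizes disagreement between the two dropout sub-models.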


4. Experiments
4.1. Experiments Setup
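Both subtasks are five-class classification tasks. The official evaluation metric is the weighted
average F1-score, in which the F1-score of each class is weighted by the number of samples in
that class. Precision, recall and the F1-score are defined as follows, where TP, FP and FN are
the numbers of true positives, false positives and false negatives:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{1} \]
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2} \]
\[ F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3} \]

Table 1
Hyperparameters of the first and second training stages (α is a parameter in R-Drop).
   Hyperparameter               First-stage training   Second-stage training
   batch size                   4                      4
   gradient accumulation steps  4                      4
   learning rate                3e-5                   6e-5
   dropout                      0.3                    0.3
   epochs                       15                     15
   α                            4                      4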
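As a worked illustration of the metric, the weighted average F1-score can be computed from per-class precision and recall as in the following sketch. The function name `weighted_f1` and the toy labels are assumptions made for the example; it is not the official scoring script.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += count * f1
    return total / len(y_true)


# Toy example with two of the five classes (assumption).
y_true = ["Positive", "Positive", "Negative", "Negative"]
y_pred = ["Positive", "Negative", "Negative", "Negative"]
score = weighted_f1(y_true, y_pred)
```

In this toy case the 'Positive' class has precision 1 and recall 1/2 (F1 = 2/3), the 'Negative' class has precision 2/3 and recall 1 (F1 = 0.8), and both classes have two samples, so the weighted score is their plain average.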

4.2. Results
In this work, our method is implemented in PyTorch. We use xlm-roberta-base as the
pre-trained language model, AdamW as the optimizer, and R-Drop for regularization. No
preprocessing is applied to the data. The hyperparameters are the same in the two subtasks for
the first stage of training, and likewise for the second stage. Table 1 shows the
hyperparameters of our experiments. Our team achieves an F1-score of 80.40% in the Malayalam
task and 67.60% in the Tamil task.


5. Conclusion
This paper introduces the overall idea and specific scheme of the ZYBank-AI team in
Dravidian-CodeMix-FIRE 2021. We participate in the Malayalam-English and Tamil-English tasks.
To address the limited data size, we also use the dataset of Dravidian-CodeMix-FIRE 2020. To
obtain more semantic information, we combine XLM-RoBERTa with a self-attention mechanism.
Our code is available at https://github.com/byew/codemix-2021-code.


References
 [1] T. Mandl, S. Modha, A. Kumar M, B. R. Chakravarthi, Overview of the hasoc track at
     fire 2020: Hate speech and offensive language identification in Tamil, Malayalam, Hindi,
     English and German, in: Forum for Information Retrieval Evaluation, 2020, pp. 29–32.
 [2] B. R. Chakravarthi, P. K. Kumaresan, R. Sakuntharaj, A. K. Madasamy, S. Thavareesan,
     P. B, S. Chinnaudayar Navaneethakrishnan, J. P. McCrae, T. Mandl, Overview of the
     HASOC-Dravidian-CodeMix Shared Task on Offensive Language Detection in Tamil and
     Malayalam, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation,
     CEUR, 2021.
 [3] B. R. Chakravarthi, R. Priyadharshini, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly,
     J. P. McCrae, A. Hande, R. Ponnusamy, S. Banerjee, C. Vasantharajan, Findings of the
     Sentiment Analysis of Dravidian Languages in Code-Mixed Text 2021, in: Working Notes
     of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
 [4] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis
     dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on
     Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration
     and Computing for Under-Resourced Languages (CCURL), European Language Resources
     Association, Marseille, France, 2020, pp. 177–184.
     URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
 [5] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for
     sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint
     Workshop on Spoken Language Technologies for Under-resourced languages (SLTU)
     and Collaboration and Computing for Under-Resourced Languages (CCURL), European
     Language Resources Association, Marseille, France, 2020, pp. 202–210.
     URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
 [6] A. Hande, R. Priyadharshini, B. R. Chakravarthi, KanCMD: Kannada CodeMixed dataset
     for sentiment analysis and offensive language detection, in: Proceedings of the Third
     Workshop on Computational Modeling of People’s Opinions, Personality, and Emotion’s
     in Social Media, Association for Computational Linguistics, Barcelona, Spain (Online),
     2020, pp. 54–63. URL: https://www.aclweb.org/anthology/2020.peoples-1.6.
 [7] R. Priyadharshini, B. R. Chakravarthi, S. Thavareesan, D. Chinnappa, T. Durairaj, E. Sherly,
     Overview of the Dravidiancodemix 2021 shared task on sentiment detection in Tamil,
     Malayalam, and Kannada, in: Forum for Information Retrieval Evaluation, FIRE 2021,
     Association for Computing Machinery, 2021.
 [8] B. R. Chakravarthi, R. Priyadharshini, N. Jose, A. Kumar M, T. Mandl, P. K. Kumaresan,
     R. Ponnusamy, H. R L, J. P. McCrae, E. Sherly, Findings of the shared task on offensive
     language identification in Tamil, Malayalam, and Kannada, in: Proceedings of the First
     Workshop on Speech and Language Technologies for Dravidian Languages, Association
     for Computational Linguistics, Kyiv, 2021, pp. 133–145.
     URL: https://aclanthology.org/2021.dravidianlangtech-1.17.
 [9] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in
     vector space, arXiv preprint arXiv:1301.3781 (2013).
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[11] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
     M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at
     scale, arXiv preprint arXiv:1911.02116 (2019).
[12] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly,
     J. P. McCrae, Overview of the track on sentiment analysis for Dravidian languages in
     code-mixed text, in: Forum for Information Retrieval Evaluation, 2020, pp. 21–24.
[13] R. Sun, X. Zhou, Srj@ Dravidian-codemix-fire2020: Automatic classification and identifi-
     cation sentiment in code-mixed text., in: FIRE (Working Notes), 2020, pp. 548–553.
[14] X. Liang, L. Wu, J. Li, Y. Wang, Q. Meng, T. Qin, W. Chen, M. Zhang, T.-Y. Liu, R-drop:
     Regularized dropout for neural networks, arXiv preprint arXiv:2106.14448 (2021).