<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), Hyderabad, India</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>SRJ @ Dravidian-CodeMix-FIRE2020: Automatic Classification and Identification of Sentiment in Code-Mixed Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruijie Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author" corresp="yes">
          <string-name>Xiaobing Zhou</string-name>
          <email>zhouxb@ynu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Science and Engineering, Yunnan University</institution>
          ,
          <addr-line>Yunnan</addr-line>
          ,
          <country country="CN">P.R.China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>Sentiment analysis of Code-Mixed text has received increasing research attention. To facilitate research on Code-Mixed text, the Sentiment Analysis of Dravidian Languages in Code-Mixed Text track is organized at FIRE 2020. This paper introduces the system submitted by the SRJ team. We participate in the Malayalam-English and Tamil-English tasks. We use XLM-Roberta as our model, and abundant semantic information is obtained by extracting XLM-Roberta's hidden states. Our approach achieves the best results in both tasks, with weighted F-scores of 0.74 and 0.65, respectively. Our code is available on GitHub (https://github.com/lonelyjie323/HASOC).</p>
      </abstract>
      <kwd-group>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Abundant Semantic Information</kwd>
        <kwd>Code-Mixed Text</kwd>
        <kwd>XLM-Roberta</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The Dravidian-CodeMix-FIRE2020 track provides comments written in Malayalam-English,
Tamil-English, and English codes. Our goal is to categorize the comments obtained from YouTube
into positive, negative, neutral, mixed emotions, or not in the intended language [4].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In recent years, sentiment analysis has attracted the attention of a large number of industrial
and academic researchers. With the acceleration of information dissemination, code-mixing has
gradually become a common phenomenon in multilingual communities.</p>
      <p>Chakravarthi [5] proposed improving machine translation of languages with insufficient
resources, such as Dravidian languages, by leveraging orthographic information. A word in a
language may have more than one meaning; for this reason, R. Padmamala [6] also proposed a
word-level translation method based on knowledge engineering for Tamil-English. There are two
traditional approaches to sentiment analysis: lexicon-based and machine learning approaches [7].
In general, it is difficult to understand and analyze texts written in multiple languages.
Veena P V et al. [8] developed a word-level language identification system for Code-Mixed social
media text, which was applied to Tamil-English and Malayalam-English Code-Mixed Facebook
comments. Shashi Shekhar et al. [9] proposed a novel architecture combining a multichannel
neural network (MNN) and quantum Bi-directional Long Short-Term Memory (QBLSTM).</p>
      <p>For this task, there is very little prior work on sentiment analysis for Malayalam-English and
Tamil-English [10], because machine translation between Indian languages and English presents
certain difficulties [11], so the data is not easy to obtain, let alone to carry out sentiment
analysis on.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>In the Malayalam-English task, the official organizers provide a training set (4,851 comments /
posts) and a validation set (540 comments / posts). In our experiment, we combine the training set
and the validation set, and the label distribution is { positive: 2,246, negative: 600, neutral: 1,505,
mixed emotions: 333, not-Malayalam: 707 }, which is an imbalanced dataset.</p>
        <p>In the Tamil-English task, the official organizers provide a training set (11,335 comments /
posts) and a validation set (1,260 comments / posts). In our experiment, we combine the training set
and the validation set, and the label distribution is { positive: 8,124, negative: 1,613, neutral: 677,
mixed emotions: 1,424, not-Tamil: 397 }, which is an imbalanced dataset.</p>
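        <p>The merged data and its class imbalance can be checked with a short script such as the sketch below; the file paths and the "category" column name are hypothetical assumptions, not the organizers' actual distribution format.</p>
        <preformat>
# Sketch: merge the official training and validation sets and inspect the
# label distribution (Section 3.1). File paths and column names are assumed.
from collections import Counter

import pandas as pd

train = pd.read_csv("tamil_train.tsv", sep="\t")   # hypothetical path
dev = pd.read_csv("tamil_dev.tsv", sep="\t")       # hypothetical path
data = pd.concat([train, dev], ignore_index=True)

print(Counter(data["category"]))
# Expected counts from Section 3.1 for Tamil-English: positive 8,124,
# negative 1,613, neutral 677, mixed emotions 1,424, not-Tamil 397.
</preformat>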
      </sec>
      <sec id="sec-3-2">
        <title>3.2. XLM-Roberta with hidden state</title>
        <p>Early work in the field of cross-lingual understanding has proved the effectiveness of the
multilingual masked language model (MLM) in cross-lingual understanding, but models such as
XLM [12] and Multilingual BERT [13] (pre-trained on Wikipedia) are still limited in learning
useful representations of low-resource languages. XLM-Roberta [14] shows that the
performance of cross-lingual transfer tasks can be significantly improved by using a large-scale
multi-language pre-training model. It can be understood as a combination of XLM and Roberta,
and it is trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages. The
model in this task must make full use of the whole sentence content to extract useful semantic
features, which may help to deepen the understanding of the sentence and reduce the impact
of noise in the data. Therefore, we use XLM-Roberta in this work.</p>
        <p>In the classification task, the original output of XLM-Roberta is obtained through the last
hidden state of the model. However, this output usually does not summarize the semantic
content of the input. Recent studies have shown that abundant semantic information features
are learned by the top hidden layers of BERT [15], which we call the semantic layers. In our
opinion, the same is true of XLM-Roberta. Therefore, in order to make the model obtain more
abundant semantic information features, we propose the following model, as shown in Figure 1.
Firstly, we get the pooler output P_O. Secondly, we extract the hidden states of the last three
layers of XLM-Roberta and input them into a CNN to get L12, L11, and L10. Finally, we
concatenate P_O, L10, L11, and L12 and feed the result into the classifier.</p>
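        <p>A minimal sketch of this architecture is given below, assuming the Hugging Face transformers and PyTorch APIs; the class and variable names are ours, and the kernel sizes and filter count are taken from Section 4.1, so this is our reading of Figure 1 rather than the authors' released code.</p>
        <preformat>
# Sketch of the model in Figure 1: XLM-Roberta with hidden states exposed,
# a TextCNN branch over each of the last three hidden layers (L10, L11, L12),
# and a classifier over the concatenation of P_O and the three CNN features.
import torch
import torch.nn as nn
from transformers import XLMRobertaModel


class XLMRWithHiddenStates(nn.Module):
    def __init__(self, num_labels=5, num_filters=256, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained(
            "xlm-roberta-base", output_hidden_states=True)
        hidden = self.encoder.config.hidden_size  # 768 for the base model
        # One Conv2d per kernel height, applied to a (batch, 1, seq, hidden) map.
        self.convs = nn.ModuleList(
            nn.Conv2d(1, num_filters, (k, hidden)) for k in kernel_sizes)
        self.dropout = nn.Dropout(0.2)
        # P_O plus one pooled CNN feature vector per semantic layer.
        feat_dim = hidden + 3 * num_filters * len(kernel_sizes)
        self.classifier = nn.Linear(feat_dim, num_labels)

    def cnn_features(self, layer):
        # layer: (batch, seq, hidden) -> (batch, num_filters * len(kernel_sizes))
        x = layer.unsqueeze(1)  # add a channel dimension for Conv2d
        pooled = []
        for conv in self.convs:
            c = torch.relu(conv(x)).squeeze(3)          # (batch, filters, steps)
            pooled.append(torch.max(c, dim=2).values)   # max pooling over time
        return torch.cat(pooled, dim=1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        p_o = out.pooler_output   # P_O
        hs = out.hidden_states    # embeddings plus one entry per layer
        feats = [self.cnn_features(hs[i]) for i in (-3, -2, -1)]  # L10, L11, L12
        x = torch.cat([p_o] + feats, dim=1)
        return self.classifier(self.dropout(x))
</preformat>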
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <sec id="sec-4-1">
        <title>4.1. Preprocessing and Experiments Setup</title>
        <p>In the experiment, we try to clean the text, but it does not achieve the desired results, so the
text is not cleaned. In the Malayalam-English and Tamil-English tasks, the hyper-parameters
are set to the same values, and the best weights of the model are saved during training. In this
work, the official organizers provide training sets and validation sets. We combine the official
training set and the validation set to get a new dataset, which is split into a new training set
and a new validation set using Stratified 5-Fold Cross-validation1. Due to the imbalance of the
datasets, Stratified 5-Fold Cross-validation ensures that the proportion of samples in each
category in each fold remains unchanged.</p>
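        <p>A minimal sketch of this split, assuming scikit-learn's StratifiedKFold (footnote 1); the placeholder data stands in for the merged official training and validation sets.</p>
        <preformat>
# Sketch: Stratified 5-Fold split that preserves per-class proportions in
# every fold, which matters for the imbalanced label distributions here.
from sklearn.model_selection import StratifiedKFold

# Placeholder data so the snippet runs on its own; in practice these are
# the merged official training + validation comments and their labels.
texts = ["comment %d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(texts, labels)):
    print("fold", fold, "train size", len(train_idx), "val size", len(val_idx))
</preformat>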
        <p>For XLM-Roberta, we use the XLM-Roberta-base2 pre-trained model, which contains 12
layers. We use the Adam optimizer with a learning rate of 5e-5. The batch size is set to 32 and
the max sequence length is set to 60. We extract the hidden layer states of XLM-Roberta by
setting output_hidden_states to true. The model is trained for 10 epochs with a dropout rate of 0.2.</p>
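        <p>This setup might look like the following sketch; it reuses the hypothetical XLMRWithHiddenStates module from the Section 3.2 sketch, and the example comment is invented.</p>
        <preformat>
# Sketch of the training configuration: xlm-roberta-base tokenizer,
# max sequence length 60, Adam with learning rate 5e-5 (Section 4.1).
import torch
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
batch = tokenizer(["example code-mixed comment"], padding="max_length",
                  truncation=True, max_length=60, return_tensors="pt")

model = XLMRWithHiddenStates(num_labels=5)  # sketch from Section 3.2
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
logits = model(batch["input_ids"], batch["attention_mask"])
</preformat>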
        <p>For the convolution layer, we use 2D convolution (nn.Conv2d3). The sizes of the convolution
kernels are set to (3, 4, 5) and the number of convolution kernels is set to 256.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results and Analysis</title>
        <p>
          In this work, we find the limitations of P_O for sentiment analysis of Code-Mixed text in
Dravidian languages. In the classification task, the original output of BERT is P_O. Chakravarthi
et al. [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ] pointed out that Multilingual BERT fails to identify the “Mixed feeling” class in the
Malayalam-English task and fails to identify the “Negative”, “Neutral”, and “Mixed feeling”
classes in the Tamil-English task. In the same way, we just put P_O as the output of
XLM-Roberta. The results are shown in Table 1. We can see that the results are not good when
P_O is used as the output of BERT and XLM-Roberta. We think that just using P_O as the output
loses some effective semantic information, so deep and abundant semantic features are
effective for this work. We extract the hidden states of XLM-Roberta and also discover that the
performance of the model improves as more semantic layers are added. Table 2 shows the
performance of our model with different semantic layers.
        </p>
        <p>Table 3 shows our results on the test set. For the two tasks, we only use the official training set
and validation set and do not use any external data. The hyper-parameters of the model are set
to the same values. In the Malayalam-English and Tamil-English tasks, our models all achieve the
best performance.</p>
        <p>1. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
2. https://huggingface.co/xlm-roberta-base
3. https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we provide a baseline for sentiment analysis of Code-Mixed text in Dravidian
languages (Malayalam-English and Tamil-English). We find the limitation of only using the pooler
output as the output of BERT. To obtain deeper and more abundant semantic features, we extract
the hidden layer states of XLM-Roberta, which are input into convolution and max pooling layers.
The results show that obtaining more abundant semantic information features by extracting the
hidden states of XLM-Roberta helps to improve the performance of XLM-Roberta.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Haridas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Nair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gutjahr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nedungadi</surname>
          </string-name>
          ,
          <article-title>Spelling errors by normal and poor readers in a bilingual malayalam-english dyslexia screening test</article-title>
          ,
          <source>in: 2018 IEEE 18th International Conference on Advanced Learning Technologies (ICALT)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>340</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          , N. Jose,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryawanshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sherly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>A sentiment analysis dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association</article-title>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.25
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Chakravarthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muralidaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priyadharshini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <article-title>Corpus creation for sentiment analysis in code-mixed Tamil-English text</article-title>
          ,
          <source>in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association</source>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
          . URL: https://www.aclweb.org/anthology/2020.sltu-1.28
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings, CEUR-WS.org, Hyderabad, India, 2020.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] B. R. Chakravarthi, Leveraging orthographic information to improve machine translation of under-resourced languages, Ph.D. thesis, NUI Galway, 2020.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] R. Padmamala, Word level translation (Tamil-English) with word sense disambiguation in Tamil using ontnet, in: 2015 International Conference on Computing and Communications Technologies (ICCCT), IEEE, 2015, pp. 191-198.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] O. Habimana, Y. Li, R. Li, X. Gu, G. Yu, Sentiment analysis using deep learning approaches: an overview, Science China Information Sciences 63 (2020) 1-36.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] P. Veena, M. A. Kumar, K. Soman, An effective way of word-level language identification for code-mixed facebook comments using word-embedding via character-embedding, in: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2017, pp. 1552-1556.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] S. Shekhar, D. K. Sharma, M. S. Beg, Language identification framework in code-mixed social media text based on quantum LSTM: the word belongs to which language?, Modern Physics Letters B 34 (2020) 2050086.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Proceedings of the 12th Forum for Information Retrieval Evaluation, FIRE '20, 2020.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] M. A. Kumar, B. Premjith, S. Singh, S. Rajendran, K. Soman, An overview of the shared task on machine translation in Indian languages (MTIL)-2017, Journal of Intelligent Systems 28 (2019) 455-464.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] G. Lample, A. Conneau, Cross-lingual language model pretraining, arXiv preprint arXiv:1901.07291 (2019).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>