NUIG-Shubhanker@Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Code-Mixed Dravidian text using XLNet

Shubhanker Banerjee^a, Arun Jayapal^b and Sajeetha Thavareesan^c
a National University of Ireland Galway, Ireland
b Trinity College Dublin, Ireland
c Eastern University, Sri Lanka

Abstract
Social media has penetrated multilingual societies, where English is often the preferred language of communication. It is therefore natural for users to mix their native language with English in conversation, producing an abundance of multilingual data, commonly called code-mixed data. Downstream NLP tasks on such data are challenging because its semantics are spread across multiple languages. One such task is sentiment analysis; we use an auto-regressive XLNet model to perform sentiment analysis on code-mixed Tamil-English and Malayalam-English datasets.

Keywords
code-mixed, XLNet, auto-regressive, attention

1. Introduction

Social media content results in large data feeds from wide geographies. Since multiple geographies are involved, the data is multilingual in nature and code mixing^1 occurs often. Sentiment analysis on code-mixed text makes it possible to gain insights into the trends prevalent in different geographies, but it is challenging because inferring the semantics of such data is non-trivial. In this paper, we address these challenges using the XLNet [1] framework.^2 We fine-tuned the pre-trained XLNet model on the available data without any additional pre-processing.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the dataset and the task. Section 4 presents our approach and briefly explains the XLNet architecture. Results are discussed in Section 5.

FIRE 2020: Forum for Information Retrieval Evaluation, December 16-20, 2020, Hyderabad, India
email: S.Banerjee3@nuigalway.ie (S. Banerjee); jayapala@tcd.ie (A. Jayapal); sajeethas@esn.ac.lk (S. Thavareesan)
orcid: 0000-0002-6252-5393 (S. Thavareesan)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1 Code mixing refers to linguistic units from different languages being used together by multilingual users.
2 XLNet is an auto-regressive model [2] built on the transformer architecture [3] with a two-stream attention mechanism [4]. Two-stream attention ensures that the trained language model can predict missing words on the basis of bidirectional context, which XLNet achieves through permutation language modelling.

2. Related Work

Multilingual users tend to mix linguistic units on social media, so code-mixed data is easily available. The phenomenon of code-mixing is explained in [5, 6, 7, 8, 9, 10], which also analyse its possible causes; a first step in such analysis is identifying the languages involved in the code-mixed data, which is largely unavoidable. Several approaches and experiments aimed at detecting the languages in code-mixed data have been reported [11, 12, 13]. A review of research on code-mixing is given in [14].
2.1. Code-mixed data

Since code-mixed data is mostly sourced from social media platforms, it is highly unstructured in its raw form, and organizing it into datasets for further analysis poses a challenge. For Indian languages, [15] compiled a Tamil-English code-mixed dataset, the first annotated Tanglish^3 dataset. Similarly, [16] published a dataset of Malayalam-English code-mixed data, in which the authors also provided references to other available code-mixed datasets such as Chinese-English and Spanish-English. Beyond these, little work has been done on corpus creation for code-mixed Indian languages; these languages are considered under-resourced, and there has consequently been less interest in performing NLP tasks on them.

2.2. Sentiment analysis

Sentiment analysis is a well-known NLP task that infers positive, negative or neutral sentiment from the statement in question. However, there are few works on sentiment analysis of code-mixed data. [17, 18] provide an overview of work on sentiment analysis of Dravidian code-mixed text. [19] compares the performance of different transformer architectures on sentiment analysis of code-mixed data. [20] employed a lexicon-based approach to assign sentiment to Hindi-English code-mixed text. [21] presents a method to detect hate speech in a code-mixed Hinglish dataset, using the FIRE 2013 and FIRE 2014 datasets. [22] used an LSTM-based [23] approach to improve on the state-of-the-art performance of [20] on Hinglish datasets by 18 percent. [24] used shared parameters in a Siamese network [25] to project code-mixed sentences and sentences in standard languages into a common sentiment space, where the similarity of the projected sentences indicates how similar their sentiments are. Ensemble techniques have also been applied: [26] proposed an ensemble of a character-trigram based LSTM and a word-ngram Naive Bayes model to detect sentiment in Hindi-English code-mixed data, [27] used a multilayer perceptron for sentiment analysis of code-mixed data extracted from social media platforms, and [28] used an ensemble of a convolutional neural network and a self-attention based LSTM for sentiment analysis of Spanglish and Hinglish text.

3 Tanglish refers to code switching between Tamil and English, a term predominantly used in the Tamil community.

Table 1
Dataset size and splits

Dataset             Training   Validation   Testing
Tamil-English       11,335     1,260        3,149
Malayalam-English   4,716      674          1,348

3. Dataset for sentiment analysis

In spoken and written conversation, English lexicon, connectives and phrases are frequently used in combination with other languages; this can clearly be seen in social media text and in spoken conversation across geographies, especially in India. Sentiment analysis on social media has drawn attention in recent years. However, Tamil-English (Tanglish) and Malayalam-English code-mixed datasets for sentiment analysis have not been readily available for research.
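As a concrete starting point, the shared-task splits summarized in Table 1 can be loaded and their label distributions inspected as follows. This is a minimal sketch: the file names and column names are placeholders, not the actual names used in the DravidianCodeMix release.

```python
import pandas as pd

# Placeholder paths; substitute the actual shared-task split files.
splits = {
    "train": "tamil_train.tsv",
    "validation": "tamil_dev.tsv",
    "test": "tamil_test.tsv",
}

for name, path in splits.items():
    # Assumed layout: one tab-separated sentence/label pair per line.
    df = pd.read_csv(path, sep="\t", names=["text", "label"])
    print(f"{name}: {len(df)} sentences")
    print(df["label"].value_counts(normalize=True).round(3))
```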
The authors of [15] and [16] collected 184,573 Tamil and 116,711 Malayalam sentences from YouTube comments on trailers of movies released in 2019 to build the Tamil-English and Malayalam-English datasets, removing non-code-mixed sentences from the collection. Further, emoticons were removed and sentence-length filters were applied, yielding final datasets of 15,744 Tanglish and 6,738 Malayalam-English sentences. To prepare the data for sentiment analysis, [15] and [16] carried out a manual annotation exercise in which three annotators labelled each sentence. Inter-annotator agreement, measured with Krippendorff's alpha (α), is 0.6585 for the Tamil and 0.890 for the Malayalam code-mixed dataset. The comments were grouped into five categories: positive, negative, neutral, mixed feelings, or not in the intended language. Both datasets were released in the DravidianCodeMix FIRE 2020 shared task, where the data was provided in three parts: training, validation and testing. The number of sentences in each split is given in Table 1.

4. Methodology

Language models have been integral to the recent advances in NLP due to their ability to predict the next token in a sequence. Traditionally this is achieved by factorizing the joint distribution of the tokens in a sequence into the conditional distribution of each token given the preceding tokens. XLNet [1] takes a different approach: it uses permutation language modelling, which trains an auto-regressive model over all possible permutations of the words in a sentence (see Eq. 1). When predicting a word in a sequence it therefore takes bidirectional context into account, predicting a masked token on the basis of the tokens both to its right and to its left. Trained on large datasets, such models achieve state-of-the-art performance on downstream NLP tasks. XLNet is based on the transformer architecture [3], which uses attention [4] to learn long-range token dependencies. Another important aspect of XLNet is two-stream attention: two attention streams work in parallel, one encoding the content of the tokens and the other incorporating positional information. We exploit these properties to perform sentiment analysis on code-mixed data.

The following equation formally describes the XLNet language-modelling objective. For a given text sequence x of length T, let Z_T be the set of all permutations of the index sequence [1, 2, ..., T] and z ∈ Z_T:

\max_{\theta} \; \mathbb{E}_{z \sim Z_T} \left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid x_{z_{<t}}\right) \right]    (1)

For our experiments, we fine-tuned the pre-trained XLNet model on the given labelled datasets and performed sentiment analysis; training and testing followed the splits in Table 1. A sketch of such a fine-tuning setup is shown below.
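The paper does not specify an implementation, so the following is a minimal sketch of the fine-tuning setup, assuming the Hugging Face transformers xlnet-base-cased checkpoint with a five-way classification head; the toy inputs and label ids are placeholders, not the authors' code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import XLNetTokenizer, XLNetForSequenceClassification

# Assumption: the public xlnet-base-cased checkpoint with a freshly
# initialized 5-way classification head (one output per label).
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForSequenceClassification.from_pretrained(
    "xlnet-base-cased", num_labels=5
)

# Placeholder examples; in the actual experiments these would be the
# code-mixed sentences and label ids from the splits in Table 1.
texts = ["Padam vera level", "Trailer oru disappointment"]
labels = torch.tensor([0, 1])

enc = tokenizer(
    texts, padding=True, truncation=True, max_length=128, return_tensors="pt"
)
loader = DataLoader(
    TensorDataset(enc["input_ids"], enc["attention_mask"], labels), batch_size=2
)

# Fine-tune for 4 epochs with a peak learning rate of 0.005, matching
# the hyper-parameters reported in the next section.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3)
model.train()
for epoch in range(4):
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```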
Two experiments were conducted, one per dataset. In each, the XLNet model was fine-tuned for 4 epochs with a maximum learning rate of 0.005 to perform sentiment analysis on the given dataset. The results of these experiments are presented in the next section.

5. Results and Discussion

The results of the experiments outlined in the previous section are given in Table 2. We achieved accuracies of 0.49 and 0.35, and weighted-average F1 scores of 0.52 and 0.32, on the Malayalam-English and Tamil-English datasets respectively.

Table 2
Precision, Recall and F1-score measures on the test set

Malayalam-English (weighted average F1: 0.52, accuracy: 0.49)
Class            Precision   Recall   F1 Score
Mixed feelings   0.03        0.22     0.05
Negative         0.12        0.40     0.19
Positive         0.72        0.50     0.59
not-malayalam    0.36        0.58     0.44
unknown state    0.42        0.46     0.44

Tamil-English (weighted average F1: 0.32, accuracy: 0.35)
Class            Precision   Recall   F1 Score
Mixed feelings   0.23        0.13     0.17
Negative         0.50        0.16     0.24
Positive         0.39        0.73     0.51
not-Tamil        0.10        0.40     0.16
unknown state    0.03        0.31     0.05

The results are biased towards the Positive class because of the class imbalance in the training set. Further, the model performs better on the Malayalam-English dataset despite the Tanglish dataset having more samples; we attribute this to the Tamil-English data being noisier, and hence the relatively poor performance on it. Our results do not improve on the baseline results described in [15] and [16]. We hypothesize that they could be improved by training the model for more epochs and adding a pre-processing step, combined with oversampling the minority classes and undersampling the majority class, as sketched below.
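To make the rebalancing hypothesis concrete, the sketch below oversamples each minority class up to the size of the largest class using scikit-learn's resample utility. This is illustrative only; the dataframe layout and column name are assumptions rather than part of our pipeline.

```python
import pandas as pd
from sklearn.utils import resample

def oversample_minority(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Oversample every minority class up to the size of the largest class."""
    target = df[label_col].value_counts().max()
    balanced = [
        # Sampling with replacement grows each class to `target` rows.
        resample(group, replace=True, n_samples=target, random_state=42)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so classes are interleaved rather than grouped by label.
    return pd.concat(balanced).sample(frac=1, random_state=42).reset_index(drop=True)
```

Undersampling the majority class is the mirror image of this step: sample without replacement (replace=False) down to the size of the smallest class.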
References

[1] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, 2019. arXiv:1906.08237.
[2] K. Gregor, I. Danihelka, A. Mnih, C. Blundell, D. Wierstra, Deep autoregressive networks, in: International Conference on Machine Learning, PMLR, 2014, pp. 1242–1250.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, 2017. arXiv:1706.03762.
[4] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014. arXiv:1409.0473.
[5] E. Kim, Reasons and motivations for code-mixing and code-switching, Issues in EFL 4 (2006) 43–61.
[6] B. R. Chakravarthi, M. Arcan, J. P. McCrae, WordNet gloss translation for under-resourced languages using multilingual neural machine translation, in: Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation, European Association for Machine Translation, Dublin, Ireland, 2019, pp. 1–7. URL: https://www.aclweb.org/anthology/W19-7101.
[7] B. R. Chakravarthi, R. Priyadharshini, B. Stearns, A. Jayapal, S. S, M. Arcan, M. Zarrouk, J. P. McCrae, Multilingual multimodal machine translation for Dravidian languages utilizing phonetic transcription, in: Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, European Association for Machine Translation, Dublin, Ireland, 2019, pp. 56–63. URL: https://www.aclweb.org/anthology/W19-6809.
[8] B. R. Chakravarthi, M. Arcan, J. P. McCrae, Comparison of different orthographies for machine translation of under-resourced Dravidian languages, in: 2nd Conference on Language, Data and Knowledge (LDK 2019), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
[9] B. R. Chakravarthi, P. Rani, M. Arcan, J. P. McCrae, A survey of orthographic information in machine translation, arXiv preprint arXiv:2008.01391 (2020).
[10] B. R. Chakravarthi, Leveraging orthographic information to improve machine translation of under-resourced languages, Ph.D. thesis, NUI Galway, 2020.
[11] S. Mandal, S. Banerjee, S. Naskar, P. Rosso, S. Bandyopadhyay, Adaptive voting in multiple classifier systems for word level language identification, 2015. doi:10.13140/RG.2.1.3976.0246.
[12] K. Bali, J. Sharma, M. Choudhury, Y. Vyas, "I am borrowing ya mixing?" An analysis of English-Hindi code mixing in Facebook, 2014, pp. 116–126. doi:10.3115/v1/W14-3914.
[13] T. Solorio, M. Sherman, Y. Liu, L. M. Bedore, E. D. Peña, A. Iglesias, Analyzing language samples of Spanish-English bilingual children for the automated prediction of language dominance, Nat. Lang. Eng. 17 (2011) 367–395. URL: https://doi.org/10.1017/S1351324910000252. doi:10.1017/S1351324910000252.
[14] N. Jose, B. R. Chakravarthi, S. Suryawanshi, E. Sherly, J. P. McCrae, A survey of current datasets for code-switching research, in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), IEEE, 2020, pp. 136–141.
[15] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association, Marseille, France, 2020, pp. 202–210. URL: https://www.aclweb.org/anthology/2020.sltu-1.28.
[16] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources Association, Marseille, France, 2020, pp. 177–184. URL: https://www.aclweb.org/anthology/2020.sltu-1.25.
[17] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020), CEUR Workshop Proceedings, CEUR-WS.org, Hyderabad, India, 2020.
[18] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text, in: Proceedings of the 12th Forum for Information Retrieval Evaluation, FIRE '20, 2020.
[19] S. Banerjee, B. R. Chakravarthi, J. P. McCrae, Comparison of pretrained embeddings to identify hate speech in Indian code-mixed text, in: 2nd IEEE International Conference on Advances in Computing, Communication Control and Networking (ICACCCN 2020), 2020.
[20] S. Sharma, P. Srinivas, R. C. Balabantaray, Text normalization of code mix and sentiment analysis, in: 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2015, pp. 1468–1473.
[21] P. Rani, S. Suryawanshi, K. Goswami, B. R. Chakravarthi, T. Fransen, J. P. McCrae, A comparative study of different state-of-the-art hate speech detection methods in Hindi-English code-mixed data, in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, European Language Resources Association (ELRA), Marseille, France, 2020, pp. 42–48. URL: https://www.aclweb.org/anthology/2020.trac-1.7.
[22] A. Prabhu, A. Joshi, M. Shrivastava, V. Varma, Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text, 2016.
[23] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[24] N. Choudhary, R. Singh, I. Bindlish, M. Shrivastava, Sentiment analysis of code-mixed languages leveraging resource rich languages, 2018. arXiv:1804.00806.
[25] S. K. Roy, M. Harandi, R. Nock, R. Hartley, Siamese networks: The tale of two manifolds, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3046–3055.
[26] M. G. Jhanwar, A. Das, An ensemble model for sentiment analysis of Hindi-English code-mixed data, arXiv preprint arXiv:1806.04450 (2018).
[27] S. Ghosh, S. Ghosh, D. Das, Sentiment identification in code-mixed social media text, arXiv preprint arXiv:1707.01184 (2017).
[28] A. Kumar, H. Agarwal, K. Bansal, A. Modi, BAKSA at SemEval-2020 Task 9: Bolstering CNN with self-attention for sentiment analysis of code mixed text, 2020. arXiv:2007.10819.