IndicBERT based approach for Sentiment Analysis on Code-Mixed Tamil Tweets

R. Ramesh Kannan, Ratnavel Rajalakshmi, Lokesh Kumar

School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, Tamil Nadu, India

Forum for Information Retrieval Evaluation, December 13-17, 2021

Social media networks have had a huge impact on everyday life. Many people prefer to express their opinions on various topics on social media platforms such as Facebook and Twitter. Although English is predominantly used across the world to express views, technological advancements have made it possible for people to post opinions in their native language as well. Because many social media users are bilingual, mixing English with a native language has become common. Sentiment analysis, the task of identifying the correct opinion from these Code-Mixed social media posts, is challenging because existing architectures and algorithms are designed to handle monolingual posts. The diversity and rich linguistic nature of Indian languages demand highly sophisticated systems to address these issues. In this work, we conducted an experimental study of the challenges in Code-Mixed Tamil tweets and propose a transformer-based Indic-BERT approach. Our experimental results show that an F1 score of 61.73% can be achieved, which is a significant improvement over traditional methods. This work was submitted to the Dravidian-CodeMixFIRE 2021 shared task [1].

Keywords: Code-Mixed, Sentiment Analysis, Dravidian Language, Tanglish, Tamil

1. Introduction

Kannada, Malayalam and Telugu are Dravidian languages spoken by people from Karnataka, Kerala and Andhra Pradesh. Tamil is a Dravidian language spoken by people from India and Sri Lanka and by the Tamil diaspora around the world. Tamil is an official language in Singapore and Sri Lanka. These languages are used for various purposes such as administration, education, business and media. However, people often write their native language in Roman script because it is easier to type. Hence, the majority of content in under-resourced languages on social media is Code-Mixed in nature.

Regional languages are used to share people's opinions on social media. Many resources have been developed in Arabic [4], English and other regional languages. Today, people have easy access to the internet and share Code-Mixed texts online. These texts need to be understood at the linguistic level, and the lack of Code-Mixed data to train models is the challenging part of the analysis. A system trained on monolingual data might not be suitable for Code-Mixed data, since the linguistic structure of Code-Mixed data is different.

The shared task [3] released a Tanglish (Tamil+English) data set of social media comments for sentiment analysis on Code-Mixed data. Our proposed system studies how sentiment is expressed in Code-Mixed scenarios on social media by applying the transformer-based Indic-BERT [5] approach, and we obtained an F1 score of 61.73%. This paper is organized as follows: Section 2 reviews related work in the same domain, Section 3 describes the data set, Section 4 discusses the proposed methodology for the shared task, Section 5 presents the results obtained with the proposed methodology, and Section 6 concludes the study.

2. Related Works

Sentiment polarity analysis of online media such as YouTube comments is an important problem in analyzing people's opinions on public matters, products, sports, movies and so on, and it is a challenging task. Several authors [6, 7] have carried out research on under-resourced languages. Sentiment analysis of online media content [8, 9] and social media content [10] has also been studied. Sentiment analysis of movie reviews was studied in [8, 9]: [9] combines a Convolutional Neural Network with Bidirectional Long Short-Term Memory as a hybrid approach to identify opinions in online movie reviews, while [8] applies feature weighting methods to online movie reviews. A New Relevance Factor (NRF) weighting method [11] was proposed for text classification using a Naive Bayes classifier. [12] proposed a universal dictionary method for classifying Uniform Resource Locators (URLs) using a linear SVM. For text classification on legal documents, [13] presents a context-aware solution that uses cosine similarity and Term Frequency - Inverse Document Frequency (TF-IDF) to measure the similarity between documents. An attention mechanism was combined with a Recurrent Convolutional Neural Network (RCNN) [14] for effective learning of text features from URLs. A Convolutional Neural Network (CNN) was combined with a Bidirectional Gated Recurrent Unit (BGRU) [15] to extract features for web page classification. Movie review sentiment was analysed with a self-attention-based Long Short-Term Memory (LSTM) network with word embeddings to extract the polarity of the reviews [16]. Tweets were analysed in [10] with a supervised Maximum Entropy approach, obtaining a cross-validation accuracy of 74%. A detailed survey on sentiment analysis is presented in [17].

Sentiment polarity identification on Code-Mixed data is challenging, and several recent works report results on Code-Mixed data sets. The authors in [18] proposed an ensemble-based machine learning approach that uses n-gram features to classify Hindi-English and Bengali-English data sets, obtaining F1 scores of 58% and 69% respectively. An ensemble classifier using chi-square feature selection and a Random Forest classifier [19] was proposed for Code-Mixed Hindi-German data. Rajalakshmi et al. [20] proposed a BERT-based approach for offensive language identification on a Code-Mixed data set by capturing linguistic features, obtaining a validation F1 score of 65% and a testing F1 score of 64%. Hate speech in Code-Mixed Marathi and Hindi data was analysed using an ensemble approach [21] with Extreme Gradient Boosting. Code-Mixed Hindi-English data was analysed for hate and offensive content detection using IndicBERT [22] with a majority voting approach for HASOC 2021. Code-Mixing and Code-Borrowing have recently been studied for processing multilingual queries [23, 24, 25]. A relevance-metric-based approach [26] was proposed for ranking Hindi-English tweets by borrowing likeliness. [5] proposed a new multilingual ALBERT-based model for several Indian languages. Indic-BERT can be applied to various downstream tasks in Natural Language Processing. In this study, we apply Indic-BERT to the Dravidian Code-Mixed data set for sentiment polarity identification.

3. Data Set Description

The Dravidian Code-Mixed data set is a collection of YouTube video comments. It contains code-mixed sentences of three types: inter-sentential switching, intra-sentential switching and tag switching [27]. Almost all the comments were written with Tamil grammar and English lexicon, or English grammar and Tamil lexicon, in native script and Roman script. A few of the comments were in Tamil script with English expressions. The data set contains an ID, text and label for each comment. The ID is a unique number identifying a particular row, the text contains the YouTube comment, and the label gives the category of the text, which is one of five: Positive, Negative, not-Tamil, unknown_state and Mixed_feelings.

Examples from the data set:

Original Text: Yarayellam FDFS paga ippove ready agitinga
Meaning: Who are all now ready for FDFS (First Day First Show)? (Positive category)

Original Text: Ennada viswasam mersal sarkar madhri time la likes and views create pannalayae
Meaning: Why are likes and views not created for films like Viswasam, Mersal, Sarkar? (Negative category)

The objective of the task is to identify the sentiment polarity of Code-Mixed Tamil+English comments or posts collected from social media, each of which carries one of five category labels: Positive (Po), Negative (Ne), Mixed_feelings (Mf), not-Tamil (Nt) and unknown_state (Us). The data distribution is tabulated in Table 1. 56% of the comments are Positive, and the remaining 44% fall into the other four categories: Ne with 12%, Nt with 5%, Mf with 11% and Us with 16%. For the sentiment analysis task, training and validation sets were released with 35656 and 3962 labelled social media comments respectively. Both sets follow the same distribution.
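To make the skew concrete, the short sketch below turns the reported percentages into approximate training counts and derives inverse-frequency class weights. The counts are estimates computed from the rounded percentages, not the values of Table 1, and the weighting heuristic is an illustrative assumption rather than something the reported system used.

```python
# Reported share of each label in the training set (rounded percentages from the text).
LABEL_SHARE = {
    "Positive": 0.56,
    "Negative": 0.12,
    "Mixed_feelings": 0.11,
    "not-Tamil": 0.05,
    "unknown_state": 0.16,
}
TRAIN_SIZE = 35656  # number of labelled training comments


def approximate_counts(shares, total):
    """Estimate per-class counts from rounded percentage shares."""
    return {label: round(share * total) for label, share in shares.items()}


def inverse_frequency_weights(shares):
    """Weight each class by 1 / (num_classes * share), so rarer classes weigh more.

    Assumption: one common imbalance heuristic, not the paper's method.
    """
    num_classes = len(shares)
    return {label: 1.0 / (num_classes * share) for label, share in shares.items()}


if __name__ == "__main__":
    counts = approximate_counts(LABEL_SHARE, TRAIN_SIZE)
    weights = inverse_frequency_weights(LABEL_SHARE)
    for label in LABEL_SHARE:
        print(f"{label:15s} ~{counts[label]:6d} comments, weight {weights[label]:.2f}")
```

Under this heuristic the not-Tamil class receives the largest weight and the Positive class the smallest, which mirrors the imbalance described above.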

4. Proposed Methodology

The Code-Mixed comments contain Tamil, English and other-language words and phrases. Instead of converting the text into a single common language, we use the multilingual pretrained model Indic-BERT [5]. Indic-BERT is based on ALBERT (A Lite BERT for Self-Supervised Learning of Language Representations), a recent derivative of BERT (Bidirectional Encoder Representations from Transformers), and is pretrained on 12 languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu. Indic-BERT has fewer parameters than other public models such as mBERT and XLM-R, while achieving state-of-the-art performance on several tasks.
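As a rough illustration of how such a pretrained checkpoint might be loaded for five-way classification, the sketch below uses the Hugging Face `transformers` Auto classes. The hub identifier `ai4bharat/indic-bert`, the label order and the helper name `load_indic_bert` are assumptions made for this example; the paper does not state how the model was loaded.

```python
# Label set of the shared task, in an assumed fixed order.
LABELS = ["Positive", "Negative", "Mixed_feelings", "not-Tamil", "unknown_state"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}

# Assumed Hugging Face hub identifier for the public Indic-BERT checkpoint.
MODEL_NAME = "ai4bharat/indic-bert"


def load_indic_bert(model_name=MODEL_NAME):
    """Load the tokenizer and a sequence-classification model with a 5-way head.

    The import is local so this module can be read and imported without
    `transformers` installed; calling the function downloads the weights.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```

Calling `load_indic_bert()` requires network access and the tokenizer's dependencies, so it is left uncalled here.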

Since Indic-BERT is a pretrained multilingual transformer model, the data needs to be converted into the corresponding embeddings for classification. As a preprocessing step, AutoTokenizer splits every sentence into tokens. A classification token [CLS] is added at the beginning of each sentence and a separator token [SEP] at the end. Each sentence is then padded with the padding token [PAD] up to the maximum sentence length, and each token is assigned a unique ID for further processing. An attention mask is also generated for each input sentence; it tells the model which tokens to attend to and which to ignore (the padding) during training. These are the inputs fed into the transformer-based Indic-BERT model.
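The preprocessing steps can be sketched with a toy encoder. This is only an illustration: it splits on whitespace and builds its own small vocabulary, whereas the real AutoTokenizer uses a learned subword vocabulary. The function names and the `[UNK]` token are assumptions of this example.

```python
SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[SEP]", "[UNK]"]


def build_vocab(sentences):
    """Map every special token and whitespace-separated word to a unique id."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab, max_length):
    """Return (input_ids, attention_mask), both of length max_length."""
    words = sentence.lower().split()[: max_length - 2]  # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    attention_mask = [1] * len(input_ids)  # 1 = real token, attend to it
    pad = max_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad
    attention_mask += [0] * pad  # 0 = padding, ignore it
    return input_ids, attention_mask


if __name__ == "__main__":
    comments = [
        "Yarayellam FDFS paga ippove ready agitinga",
        "Ennada viswasam mersal sarkar madhri time la likes and views create pannalayae",
    ]
    vocab = build_vocab(comments)
    ids, mask = encode(comments[0], vocab, max_length=10)
    print(ids)
    print(mask)
```

The mask has a 1 for every real token, including [CLS] and [SEP], and a 0 for every [PAD] position.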

To determine the sentiment expressed in the Code-Mixed YouTube comments and posts, we propose the Indic-BERT model [5] with fine-tuned parameters. Indic-BERT is a multilingual representation model that extracts context from input representations in different languages in both directions, which lets it capture the semantic and linguistic features of a multilingual sentence. YouTube comments and posts may contain more than one sentence; Indic-BERT can pack these multilingual sentences into a single input sequence. The Indic-BERT input embedding combines the token, segment and positional embeddings. The pretrained model can be fine-tuned for a downstream task by adding a classification layer on top of the final encoder output. Indic-BERT is therefore well suited to studying how sentiment is expressed in Code-Mixed scenarios on social media.
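A minimal sketch of the two ideas in this paragraph, the summed input embedding and the classification layer, is given below in plain Python. The tables are random stand-ins for learned parameters, the hidden size is a toy value, and the transformer encoder itself is omitted, so this shows only the shape of the computation and not the actual model.

```python
import math
import random

HIDDEN = 4       # toy hidden size (real models use hundreds of dimensions)
NUM_LABELS = 5   # Positive, Negative, Mixed_feelings, not-Tamil, unknown_state


def make_table(rows, cols, rng):
    """Random table standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def embed(input_ids, segment_ids, token_emb, segment_emb, position_emb):
    """Input representation = token + segment + position embedding, per position."""
    output = []
    for pos, (tok, seg) in enumerate(zip(input_ids, segment_ids)):
        output.append([
            t + s + p
            for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[pos])
        ])
    return output


def classify(cls_vector, weights, bias):
    """Linear layer followed by softmax over the label set."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    token_emb = make_table(10, HIDDEN, rng)
    segment_emb = make_table(2, HIDDEN, rng)
    position_emb = make_table(8, HIDDEN, rng)

    input_ids = [1, 4, 5, 2]     # e.g. [CLS] w1 w2 [SEP]
    segment_ids = [0, 0, 0, 0]   # single-sequence input
    hidden = embed(input_ids, segment_ids, token_emb, segment_emb, position_emb)

    # The encoder layers are omitted; in a real model the head receives the
    # encoder output at position 0 ([CLS]), not the raw embedding used here.
    head_w = make_table(NUM_LABELS, HIDDEN, rng)
    head_b = [0.0] * NUM_LABELS
    probs = classify(hidden[0], head_w, head_b)
    print([round(p, 3) for p in probs])
```

During fine-tuning, both the classification layer and the pretrained weights would be updated against the labelled comments.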

5. Results and Discussion

To study the sentiment polarity performance of the system, we conducted experiments with the Indic-BERT approach on the Code-Mixed data set. The experiments were run on a workstation with an Intel Xeon quad-core processor, 32 GB RAM and an NVIDIA Quadro P4000 GPU with 8 GB of memory. To improve the performance of the model, we tuned the hyperparameters and settled on a learning rate of 3e-5, a batch size of 64 and 5 epochs. Figure 1 shows the accuracy on the training and validation data; accuracy reached its maximum at the 5th epoch. Figure 2 plots the loss on the training and validation data. We obtained a training accuracy of 69.91% with a loss of 0.6939, and a validation accuracy of 63.29% with a loss of 0.9681. The classifier is able to assign comments to all the categories even though the data set is not balanced; even the not-Tamil category, which has very few samples, is classified correctly. As shown in Table 3, our proposed model achieves a weighted average F1 score of 61.73%.
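Because the headline metric is a weighted average F1 score, a small worked example may help. The sketch below computes per-class F1 and weights each class by its support (its number of true samples); the labels and predictions are made up for illustration and are not outputs of the reported system.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 for one label, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        per_class_f1(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


if __name__ == "__main__":
    # Tiny made-up example; these are not predictions from the actual system.
    y_true = ["Po", "Po", "Po", "Ne", "Us", "Mf"]
    y_pred = ["Po", "Po", "Ne", "Ne", "Us", "Po"]
    print(f"weighted F1 = {weighted_f1(y_true, y_pred):.4f}")
```

With this weighting, a frequent class such as Positive dominates the score, so a model can reach a reasonable weighted F1 while still doing poorly on rare classes.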

6. Conclusion

Social media content has increased in recent years. The goal of Dravidian-CodeMixFIRE 2021 is to identify the subjective opinions or emotional responses in social media comments. In this work, we presented the challenges involved in extracting the key terms that identify the opinion in Code-Mixed tweets. We performed a detailed experimental study using different architectures and found that the sentiment in social media content is better captured by the Indic-BERT language model. We obtained a weighted F1 score of 61.73% with the proposed model. We observed that the data set is skewed and that the lack of enough samples for every category affected the performance of the classifier. In future work, we plan to address the class imbalance problem for Code-Mixed sentiment analysis.

Acknowledgments

The authors would like to thank the management of Vellore Institute of Technology, Chennai for providing the support to carry out this work. We would like to thank the Department of Science and Engineering Research Board (SERB), Government of India for their financial grant (Award Number: ECR/2016/00484) for this research work.

References

[1] B. R. Chakravarthi, Priyadharshini, Thavareesan, Chinnappa, Durairaj, Sherly, J. P. McCrae, Hande, Ponnusamy, Banerjee, Vasantharajan, Findings of the Sentiment Analysis of Dravidian Languages in Code-Mixed Text 2021, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
[2] B. R. Chakravarthi, P. K. Kumaresan, Sakuntharaj, A. K. Madasamy, Thavareesan, P. B, S. Chinnaudayar Navaneethakrishnan, J. P. McCrae, Mandl, Overview of the HASOC-Dravidian CodeMix Shared Task on Offensive Language Detection in Tamil and Malayalam, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021.
[3] Priyadharshini, B. R. Chakravarthi, Thavareesan, Chinnappa, Durairaj, E. Sherly, Overview of the Dravidiancodemix 2021 shared task on sentiment detection in tamil, malayalam, and kannada, in: Forum for Information Retrieval Evaluation, FIRE 2021, Association for Computing Machinery, 2021.
[4] H. Mubarak, K. Darwish, W. Magdy, Abusive language detection on Arabic social media, in: Proceedings of the First Workshop on Abusive Language Online, Association for Computational Linguistics, Vancouver, BC, Canada, 2017, pp. 52–56. URL: https://aclanthology.org/W17-3008.
[5] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, P. Kumar, IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 4948–4961. URL: https://aclanthology.org/2020.findings-emnlp.445.
[6] S. Thavareesan, S. Mahesan, Sentiment lexicon expansion using word2vec and fasttext for sentiment prediction in tamil texts, 2020. doi:10.1109/MERCon50084.2020.9185369.
[7] S. Thavareesan, S. Mahesan, Sentiment analysis in tamil texts: A study on machine learning techniques and feature representation, 2019 14th Conference on Industrial and Information Systems (ICIIS) (2019) 320–325.
[8] S. Sivakumar, R. Rajalakshmi, Comparative evaluation of various feature weighting methods on movie reviews, in: H. S. Behera, J. Nayak, B. Naik, A. Abraham (Eds.), Computational Intelligence in Data Mining, Springer Singapore, Singapore, 2019, pp. 721–730.
[9] S. Soubraylu, R. Rajalakshmi, Hybrid convolutional bidirectional recurrent neural network based sentiment analysis on movie reviews, Computational Intelligence 37 (2021) 735–757. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/coin.12400. doi:10.1111/coin.12400.
[10] A. Samuels, J. Mcgonical, Sentiment analysis on social media content, CoRR abs/2007.02144 (2020). URL: https://arxiv.org/abs/2007.02144. arXiv:2007.02144.
[11] R. R., Supervised term weighting methods for url classification, Journal of Computer Science 10 (2014). doi:10.3844/jcssp.2014.1969.1976.
[12] R. R., C. Aravindan, An effective and discriminative feature learning for url based web page classification, in: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2018, pp. 1374–1379. doi:10.1109/SMC.2018.00240.
[13] R. R. Kannan, R. Rajalakshmi, Dlrg@aila 2019: Context - aware legal assistance system, in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019, volume 2517 of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 58–63. URL: http://ceur-ws.org/Vol-2517/T1-10.pdf.
[14] R. R., H. Tiwari, J. Patel, R. R., K. Ramamurthy, Bidirectional GRU-Based Attention Model for Kid-Specific URL Classification, 2020, pp. 78–90. doi:10.4018/978-1-7998-1192-3.ch005.
[15] R. Rajalakshmi, H. Tiwari, J. Patel, A. Kumar, R. Karthik, Design of kids-specific url classifier using recurrent convolutional neural network, Procedia Computer Science 167 (2020) 2124–2131. URL: https://www.sciencedirect.com/science/article/pii/S1877050920307262. doi:10.1016/j.procs.2020.03.260. International Conference on Computational Intelligence and Data Science.
[16] S. Soubraylu, R. Rajalakshmi, Analysis of sentiment on movie reviews using word embedding self-attentive lstm, International Journal of Ambient Computing and Intelligence 12 (2021) 33–52. doi:10.4018/IJACI.2021040103.
[17] V. Ganganwar, R. Rajalakshmi, Implicit aspect extraction for sentiment analysis: A survey of recent approaches, Procedia Computer Science 165 (2019) 485–491.
[18] P. Mishra, P. Danda, P. Dhakras, Code-mixed sentiment analysis using machine learning and neural network approaches, CoRR abs/1808.03299 (2018). URL: http://arxiv.org/abs/1808.03299. arXiv:1808.03299.
[19] R. Rajalakshmi, B. Y. Reddy, Dlrg@hasoc 2019: An enhanced ensemble classifier for hate and offensive content identification, in: P. Mehta, P. Rosso, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019, volume 2517 of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 370–379. URL: http://ceur-ws.org/Vol-2517/T3-26.pdf.
[20] R. Rajalakshmi, Y. Reddy, L. Kumar, DLRG@DravidianLangTech-EACL2021: Transformer based approach for offensive language identification on code-mixed Tamil, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Association for Computational Linguistics, Kyiv, 2021, pp. 357–362. URL: https://aclanthology.org/2021.dravidianlangtech-1.53.
[21] R. Rajalakshmi, S. Srivarshan, M. L. P. R. Faerie, M. Faerie, K. E, S. Prithvi, K. M. Anand, Conversational hate-offensive detection in code-mixed hindi-english tweets, Association for Computing Machinery, 2021.
[22] R. Rajalakshmi, L. P. Reddy, M. Faerie, S. Srivarshan, K. M. Anand, Hate speech and offensive content identification in hindi and marathi languages using ensemble techniques, Association for Computing Machinery, 2021.
[23] B. R. Chakravarthi, M. Arcan, J. P. McCrae, Improving wordnets for under-resourced languages using machine translation, in: Proceedings of the 9th Global Wordnet Conference, Global Wordnet Association, Nanyang Technological University (NTU), Singapore, 2018, pp. 77–86. URL: https://aclanthology.org/2018.gwc-1.10.
[24] WordNet Gloss Translation for Under-resourced Languages using Multilingual Neural Machine Translation, Zenodo, 2019. URL: https://doi.org/10.18653/v1/w19-7101.
[25] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, E. Sherly, J. P. McCrae, Overview of the track on sentiment analysis for Dravidian languages in code-mixed text, in: Forum for Information Retrieval Evaluation, FIRE 2020, Association for Computing Machinery, New York, NY, USA, 2020, pp. 21–24. URL: https://doi.org/10.1145/3441501.3441515.
[26] R. Rajalakshmi, R. Agrawal, Borrowing likeliness ranking based on relevance factor, in: Proceedings of the Fourth ACM IKDD Conferences on Data Sciences, CODS '17, Association for Computing Machinery, New York, NY, USA, 2017. URL: https://doi.org/10.1145/3041823.3067694. doi:10.1145/3041823.3067694.
[27] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association, Marseille, France, 2020, pp. 202–210. URL: https://www.aclweb.org/anthology/2020.sltu-1.28.