<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Model for Sentiment Classification on Code-Mixed Data in Dravidian Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>S R Mithun Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nihal Reddy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aruna Malapati</string-name>
          <email>arunam@hyderabad.bits-pilani.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lov Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>BITS Pilani</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Uber Research and Development India</institution>
          ,
          <addr-line>Bangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Dravidian languages Tamil, Kannada, Malayalam and Telugu are spoken by over 220 million people but are vastly under-resourced for natural language processing tasks. Code-switching and code-mixing have been on the rise, with multilingual speakers expressing their opinions in their mother tongue along with English, in written text as well as in speech. Sentiment analysis of code-switched Dravidian languages is challenging because corpora are under-resourced and the languages are interspersed unpredictably. This paper applies an ensemble sentiment classification strategy based on majority voting over 13 different classification models on the Dravidian code-mixed language dataset provided in FIRE 2021. The key conclusion from our experiments is that an ensemble of multiple classifiers outperforms the individual classifiers for sentiment classification. Our approach achieves weighted F1-scores of 0.59, 0.65 and 0.60 on Kannada, Malayalam and Tamil code-switched data, respectively, using traditional machine learning algorithms through an ensemble of multiple classifiers.</p>
      </abstract>
      <kwd-group>
        <kwd>Dravidian Languages</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Code-mixed usage has grown from social media sites to day-to-day written communication. Negative sentiments
are more often expressed in the mother tongue, while positive sentiments are generally
expressed in English, making it necessary to model code-switched languages.</p>
      <p>While monolingual NLP tasks form the basis and are no different from code-mixed languages in
most aspects, code-mixed data poses significant challenges in language identification,
data collection and preparation strategy, optimal use of existing resources, and the
user-centric design of code-mixed NLP systems. These challenges are amplified even more when one of the
languages is under-resourced.</p>
      <p>Dravidian languages are vastly under-resourced, and when code-mixed with English they pose an even
harder NLP task. Sentiment analysis of code-switched Dravidian languages is still an area of ongoing
research; it will help analyse the emotion and attitude of users who express themselves in
code-switched languages, whose usage is rising on social media platforms like TikTok, YouTube, WhatsApp,
etc.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Computational approaches to code-switching, the related workshop series and the ACL anthology1 have seen
an increase in research papers over the last three to four years.</p>
      <p>
        Graph Convolutional Networks with multi-headed attention were explored by
        <xref ref-type="bibr" rid="ref7">Dowlagar et al. 2021</xref>
        , yielding a weighted F1-score of 0.75 on Malayalam-English code-mixed data
from the FIRE 2020 dataset published by Chakravarthi, Jose, et al. (2020).
      </p>
      <p>An ensemble of a character-trigram-based Long Short-Term Memory (LSTM) model and a word
n-gram-based Multinomial Naive Bayes (MNB) model was proposed by Jhanwar et al. (2018)
for the Hindi-English code-mixed language pair (Prabhu et al. 2016). This model combines the
strengths of LSTM and probabilistic models: the LSTM performed better on longer
sentences due to its ability to capture sequential information, whereas MNB generalised
better on rare words.</p>
      <p>All the prior research highlighted above focuses on deep learning techniques, which perform
significantly well on longer sentences. For instance, Jhanwar et al. (2018) experimented
with datasets that have an average of fifty words per sentence. However, most social media content, such as
YouTube comments, tends to be shorter. For instance, the Kannada code-mixed dataset of FIRE 2021
(Hande et al. 2020) has an average comment length of fewer than seven words. We argue that
probabilistic and deterministic classifiers, and an ensemble of traditional classifiers, will yield
the same or better results on datasets with shorter sentences.</p>
      <p>Our approach was to build a pipeline with traditional classifiers, evaluate the performance
metrics for sentiment classification, and then iterate with ensemble techniques that could serve
as a baseline for any short-length code-mixed text. This was done as part of the shared task on
sentiment detection, along the lines described by Priyadharshini et al. (2021).</p>
      <p>1https://aclanthology.org/search/?q=code+mixing</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
      <sec id="sec-3-1">
        <title>3.1. Data Description</title>
        <p>This section presents the detailed description of the dataset and its distribution, along with the
research framework used.</p>
        <p>
          The dataset used for the task comes from the official datasets released in Dravidian-CodeMix - FIRE
2021, which comprise labelled sentiment data of YouTube video comments for language pairs:
Kannada-English (Hande et al. 2020), Malayalam-English
          <xref ref-type="bibr" rid="ref1 ref2">(Chakravarthi, Jose, et al. 2020)</xref>
          and Tamil-English
          <xref ref-type="bibr" rid="ref1 ref2">(Chakravarthi, Muralidaran, et al. 2020)</xref>
          . The data consists of code-switched
language pairs, mostly in Roman script for both English and the Dravidian language, wherein
the latter has been transliterated from the source script to Roman script. However, a
good portion of the data remains in Dravidian script.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Distribution</title>
        <p>The data distribution is shown in Table 1. The dataset contains code-mixed sentences
labelled into five categories: Positive, Negative, Mixed Feelings, Unknown State and Not in the
Intended Language. The dataset contains inter-sentential and intra-sentential code-mixed sentences in Tamil,
Malayalam and Kannada with English. As seen in Figure 1, the data is imbalanced, with most of
the labels being available for the positive sentiment.</p>
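        <p>As a minimal illustration of inspecting this imbalance (the counts below are made up, not the dataset's actual distribution), the label frequencies can be tallied directly:</p>

```python
# Tally label frequencies to inspect class imbalance across the five
# FIRE 2021 categories (illustrative counts, not the real distribution).
from collections import Counter

labels = (["Positive"] * 6 + ["Negative"] * 2 + ["Mixed Feelings"]
          + ["Unknown State"] + ["not-intended-language"])
counts = Counter(labels)
majority_label = counts.most_common(1)[0][0]  # the dominant class
```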
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Data Preprocessing</title>
        <p>The data has been preprocessed to remove stop-words, punctuation and emoticons. The NLTK2
library has been used for stemming, lemmatisation and removing stop-words. We have used
the spaCy3 library for named entity recognition.</p>
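        <p>A minimal sketch of this preprocessing step, using only the standard library (the paper's actual pipeline uses NLTK and spaCy, whose exact calls are not reproduced here):</p>

```python
import string

# Tiny stand-in stop-word list; in practice NLTK's stopwords corpus is used.
STOPWORDS = {"the", "is", "a", "and", "of"}

def preprocess(comment):
    # Remove punctuation (this also strips most emoticons built
    # from punctuation marks, e.g. ":)").
    comment = comment.translate(str.maketrans("", "", string.punctuation))
    # Lowercase and tokenise on whitespace.
    tokens = comment.lower().split()
    # Drop stop-words.
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The movie is a super hit!! :)")  # → ["movie", "super", "hit"]
```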
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiment Setting</title>
        <p>The pipeline was set up to train the data, both on traditional as well as on ensemble techniques,
as represented in Figure 2.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Traditional classifiers</title>
          <p>In the first approach, the data was run through multiple traditional machine learning
algorithms for classification. The following feature-extraction parameters are common to all: a
CountVectorizer ('vect') with min_df=3, max_df=0.2, analyzer='word' and ngram_range=(1, 3), followed by a TfidfTransformer().</p>
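          <p>The stated parameters correspond to the following scikit-learn pipeline; the final classifier shown here is one plausible choice and would be swapped out for each model in the comparison:</p>

```python
# Feature-extraction pipeline with the parameters reported above:
# word-level 1-3 grams, min_df=3, max_df=0.2, followed by TF-IDF weighting.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("vect", CountVectorizer(min_df=3, max_df=0.2,
                             analyzer="word", ngram_range=(1, 3))),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),  # placeholder classifier
])
```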
          <p>This data was trained on traditional classifiers, including Logistic Regression (LR), Multinomial
Naive Bayes (MNB), Linear SVM (L-SVM), RBF SVM (R-SVM), Poly SVM (P-SVM), Random
Forest (RF), K-Nearest Neighbours (KNN) and Extra Tree Classifier (XTree).</p>
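          <p>In scikit-learn terms, the eight classifiers can be instantiated as below; the hyperparameters are assumptions, since the paper does not report them:</p>

```python
# The eight traditional classifiers named above, as scikit-learn estimators.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "MNB": MultinomialNB(),
    "L-SVM": LinearSVC(),
    "R-SVM": SVC(kernel="rbf"),
    "P-SVM": SVC(kernel="poly"),
    "RF": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "XTree": ExtraTreesClassifier(),
}
```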
          <p>2https://www.nltk.org/
3https://spacy.io/</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Ensemble of multiple classifiers</title>
          <p>The data was then run through ensemble classifiers with estimators as detailed in Table 2.
The ensemble methods experimented with were AdaBoost (AdaB), XGBoost (XGB), Hard
Ensemble of Voting Classifier (HEns), and Hard Ensembles of the Top 5, Top 3 and All Classifiers (HTop_5,
HTop_3, HTop_A).</p>
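          <p>A minimal sketch of the hard-voting strategy, using three of the base learners (the actual estimator subsets are those detailed in Table 2):</p>

```python
# Hard-voting ensemble: each fitted base learner casts one vote per sample,
# and the majority label wins.
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

hard_ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("mnb", MultinomialNB()),
        ("rf", RandomForestClassifier()),
    ],
    voting="hard",  # majority vote over predicted labels
)
```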
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this paper, eight different types of traditional machine learning and three different types of
ensemble methods have been used to develop sentiment prediction models for the code-mixed
language pairs Tamil-English, Malayalam-English and Kannada-English. The predictive
power of these sentiment prediction models is validated using 5-fold cross-validation and
compared using four different performance metrics: Precision, Recall, F1-Score and
Accuracy. The performance values of these models are presented in Table 3, from which we
derived the following observations:
• Ensemble classifiers generally outperformed all the single classifiers across all three
language pairs.
• The ensembles mixing both weak and strong individual learners always included a
probabilistic model such as logistic regression.
• The logistic regression model and the ensemble models performed very close to
each other.</p>
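      <p>The 5-fold validation with the weighted F1-score can be sketched as follows, on synthetic data standing in for the TF-IDF features:</p>

```python
# 5-fold cross-validation with weighted F1, as used in the evaluation
# (synthetic multi-class data stands in for the real TF-IDF features).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_classes=3,
                           n_informative=5, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="f1_weighted")
mean_f1 = scores.mean()
```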
      <sec id="sec-5-1">
        <title>5.1. Comparative Analysis: Box plot</title>
        <p>In this work, box plots of the different performance metrics, precision, recall and
F1-scores, have been used to compare the performance of the models developed using different
techniques. Figure 3 shows the box plot for each performance metric, precision, recall and
F1-score, compared across classifiers. The information in Figure 3 suggests that the
ensemble methods generally perform better than the other classifiers. It also suggests
that probabilistic models like logistic regression, in silo, perform
better than any other stand-alone classifier, and perform even better within an ensemble of
top classifiers. The performance metrics are very close to the values observed in the baseline
model using transformer-based models on the FIRE 2021 dataset published by Chakravarthi,
Priyadharshini, Muralidaran, et al. (2021), which achieved F1-scores of 0.67, 0.59 and 0.62 for the
Kannada, Malayalam and Tamil code-mixed datasets, respectively, as published by Chakravarthi,
Priyadharshini, Thavareesan, et al. (2021). Our experimentation with ensemble models shows
that best scores of 0.59, 0.65 and 0.60 can be achieved for the same set of language pairs.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Comparative Analysis: T-test</title>
        <p>In this work, the T-test has also been applied to find significant differences in
the performance of the models developed using different classifiers. The T-test is used to test
our null hypothesis, i.e., ”There is no significant difference in the performance of
the developed sentiment prediction models using different techniques”. Figure 4 shows the
results for the different techniques on each of the performance metrics, precision, recall and F1-score.
The green dots in Figure 4 indicate that the null hypothesis is accepted, i.e., the
performance of the models does not depend on the technique; the red dots indicate a
difference in the performance of the models developed using different techniques.
From Figure 4, we can observe that the predictive power of the models developed using different
techniques is significantly different, and that the ensemble
methods significantly improved the performance of the models.</p>
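        <p>The per-technique comparison can be sketched with a paired t-test on per-fold scores (the fold scores below are illustrative, not the paper's actual numbers):</p>

```python
# Paired t-test on per-fold scores of two techniques, mirroring the
# significance check of Section 5.2 (fold scores here are made up).
from scipy import stats

folds_lr = [0.58, 0.60, 0.57, 0.59, 0.61]   # e.g. logistic regression F1 per fold
folds_ens = [0.62, 0.63, 0.60, 0.61, 0.65]  # e.g. hard ensemble F1 per fold

t_stat, p_value = stats.ttest_rel(folds_lr, folds_ens)
reject_null = p_value < 0.05  # a "red dot": techniques differ significantly
```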
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we applied different traditional machine learning methods as well as ensemble
methods to the code-mixed data from FIRE 2021, which contains Tamil-English, Malayalam-English and
Kannada-English language pairs, with the objective of developing sentiment prediction models. The
performance of the developed sentiment prediction models is computed using precision,
recall and F1-score. Our experimental results show that:
• The proposed ensemble classifier performs better than any stand-alone classifier.
• The models based on the ensemble technique achieved an F1-score of 0.59 and an
accuracy of 0.62 for Kannada.
• The models based on the ensemble technique achieved an F1-score of 0.65 and an
accuracy of 0.67 for Malayalam.
• The models based on the ensemble technique achieved an F1-score of 0.60 and an
accuracy of 0.64 for Tamil.</p>
      <p>The future steps would be to improve the results through a transliteration-translation task to
augment the preprocessing, and to perform sentiment analysis on a monolingual English
corpus rather than a bilingual corpus for code-switched languages.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgement</title>
      <p>Thanks to Dr Aravind Ranganathan, Uber R&amp;D, and the anonymous reviewers for the valuable
suggestions and thorough review comments.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Appendix</title>
      <p>Our code is available on GitHub4</p>
      <p>4https://github.com/mithunkumarsr/CodeMixingDravidianLanguage</p>
      <p>Hande, Adeep, Ruba Priyadharshini, and Bharathi Raja Chakravarthi (Dec. 2020). “KanCMD:
Kannada CodeMixed Dataset for Sentiment Analysis and Offensive Language Detection”. In:
Proceedings of the Third Workshop on Computational Modeling of People’s Opinions, Personality,
and Emotions in Social Media. Barcelona, Spain (Online): Association for Computational
Linguistics, pp. 54–63. url: https://www.aclweb.org/anthology/2020.peoples-1.6.</p>
      <p>Jhanwar, Madan Gopal and Arpita Das (2018). “An Ensemble Model for Sentiment Analysis
of Hindi-English Code-Mixed Data”. In: CoRR abs/1806.04450. arXiv: 1806.04450. url:
http://arxiv.org/abs/1806.04450.</p>
      <p>Prabhu, Ameya, Aditya Joshi, Manish Shrivastava, and Vasudeva Varma (2016). Towards
Sub-Word Level Compositions for Sentiment Analysis of Hindi-English Code Mixed Text. arXiv:
1611.00472 [cs.CL].</p>
      <p>Priyadharshini, Ruba, Bharathi Raja Chakravarthi, Sajeetha Thavareesan, Dhivya Chinnappa,
Durairaj Thenmozhi, and Rahul Ponnusamy (2021). “Overview of the DravidianCodeMix
2021 Shared Task on Sentiment Detection in Tamil, Malayalam, and Kannada”. In: Forum for
Information Retrieval Evaluation. FIRE 2021. Online: Association for Computing Machinery.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Chakravarthi</surname>
            , Bharathi Raja, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly,
            <given-names>and John Philip McCrae</given-names>
          </string-name>
          (May
          <year>2020</year>
          ).
          <article-title>“A Sentiment Analysis Dataset for Code-Mixed Malayalam-English”</article-title>
          .
          <source>English. In: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU)</source>
          and
          <article-title>Collaboration and Computing for Under-Resourced Languages (CCURL)</article-title>
          . Marseille, France: European Language Resources association, pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . isbn: 979-10-95546-35-1. url: https://www.aclweb.org/anthology/2020.sltu-1.25.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Chakravarthi</surname>
            , Bharathi Raja, Vigneshwaran Muralidaran, Ruba Priyadharshini,
            <given-names>and John Philip McCrae</given-names>
          </string-name>
          (May
          <year>2020</year>
          ).
          <article-title>“Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text”</article-title>
          .
          <source>English. In: Proceedings of the 1st Joint Workshop on Spoken Language Technologies</source>
          for
          <article-title>Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)</article-title>
          . Marseille, France: European Language Resources association, pp.
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
          . isbn: 979-10-95546-35-1. url: https://www.aclweb.org/anthology/2020.sltu-1.28.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Chakravarthi</surname>
            , Bharathi Raja, Ruba Priyadharshini, Vigneshwaran Muralidaran, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly,
            <given-names>and John P. McCrae</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text</article-title>
          . arXiv: 2106.09460 [cs.CL].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Chakravarthi</surname>
            , Bharathi Raja, Ruba Priyadharshini,
            <given-names>Sajeetha</given-names>
          </string-name>
          <string-name>
            <surname>Thavareesan</surname>
          </string-name>
          , et al. (
          <year>2021</year>
          ).
          <article-title>“Findings of the Sentiment Analysis of Dravidian Languages in Code-Mixed Text”</article-title>
          . In: Working Notes of FIRE 2021 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          . Online: CEUR.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Chan</surname>
            ,
            <given-names>Ka Long Roy</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>“Trilingual Code-switching in Hong Kong”</article-title>
          .
          <source>In: ALRJournal 3</source>
          .4, pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          . doi: 10.14744/alrj.2019.22932.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Choudhury</surname>
            , Monojit,
            <given-names>Anirudh</given-names>
          </string-name>
          <string-name>
            <surname>Srinivasan</surname>
          </string-name>
          , and Sandipan Dandapat (Nov.
          <year>2019</year>
          ).
          <article-title>“Processing and Understanding Mixed Language Data”</article-title>
          .
          <source>In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts. Hong Kong</source>
          ,
          <article-title>China: Association for Computational Linguistics</article-title>
          . url: https://aclanthology.org/D19-2002.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Dowlagar</surname>
          </string-name>
          , Suman and Radhika
          <string-name>
            <surname>Mamidi</surname>
          </string-name>
          (Apr.
          <year>2021</year>
          ).
          <article-title>“Graph Convolutional Networks with Multi-headed Attention for Code-Mixed Sentiment Analysis”</article-title>
          .
          <source>In: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages. Kyiv: Association for Computational Linguistics</source>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . url: https://aclanthology.org/2021.dravidianlangtech-1.8.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>