An ensemble-based model for sentiment analysis of Dravidian code-mixed social media posts Abhinav Kumar1 , Sunil Saumya2 and Jyoti Prakash Singh3 1 Department of Computer Science & Engineering, Siksha ’O’ Anusandhan Deemed to be University, Bhubaneswar, India 2 Department: Computer Science & Engineering, Indian Institute of Information Technology Dharwad, India 3 Department: Computer Science & Engineering, National Institute of Technology Patna, India Abstract Sentiment analysis is highly important in social media monitoring since it helps us to see how the general population feels about a certain issue. Several studies have been published in recent years that attempt to extract sentiment from social media messages. However, the majority of the work is verified using just English language datasets. Machine learning algorithms do not perform equally well when social media posts are written in multilingual and code-mixed script. This paper presents an ensemble-based model to classify Kannada-English, Malayalam-English, and Tamil-English social media postings into five different sentiment classes using character-level TF-IDF features as input. The proposed ensemble-based model achieved the weighted 𝐹1 -scores of 0.62, 0.73, and 0.62 for Kannada- English, Malayalam-English, and Tamil-English datasets, respectively. The code for the proposed models is available at: https://github.com/Abhinavkmr/Dravidian-Sentiment-Analysis-.git Keywords Sentiment analysis, Code-mixed, Kannada-English, Tamil-English, Malayalam-English, YouTube, Ma- chine learning, Deep learning 1. Introduction Sentiment analysis helps in the recognition of opinions or responses on a given topic. Due to its enormous influence on companies such as e-commerce, recommendation systems, hate speech detection [1, 2], and disaster management [3, 4], and social media monitoring, it is one of the most explored subjects in natural language processing. English is the most popular and widely accepted language on the world, and it is widely used over Internet. However, in a nation like India, where over 400 million people use the internet, people utilise more than one language to express themselves, resulting in a new code-mixed language [5]. Dravidian languages such as Malayalam and Kannada are spoken in the Indian states of Kerala and Karnataka. Tamil, which is spoken by Tamil people in India, Singapore, and Sri Lanka, is another well-known Dravidian language in India’s southern area. People on social media commonly use Roman script to write these Dravidian languages since it is easy to do so with the keyboards accessible on their devices. The majority of existing models trained to extract sentiment from a single language fail to grasp the semantics of a code-mixed language. Due to its multilingual character, extracting feelings from code mixed user-generated texts becomes more challenging [6, 7]. FIRE 2021: Forum for Information Retrieval Evaluation, December 13-17, 2021, India Envelope-Open abhinavanand05@gmail.com (A. Kumar); sunil.saumya@iiitdwd.ac.in (S. Saumya); jps@nitp.ac.in (J. P. Singh) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) The sentiment analysis of code-mixed language has recently caught the interest of the research community [8, 9]. Kumar et al. [9] proposed a hybrid CNN and Bi-LSTM Network to classify social media posts into different sentiment classes. Mahata et al. [10] proposed bi-directional LSTM with language tagging to classify Tamil-English and Malayalam-English code-mixed social media posts into different sentiment classes. Sharma and Mandalam [11], on the other hand, employed sub-word level representation to capture text sentiment and implemented an LSTM network to classify Tamil-English and Malayalam-English social media posts into the different polarity classes. Patra et al. [12] presented a model for Bengali-English code mixed data using a support vector machine with character n-grams features. To extract emotions from Hinglish and Spanglish (Spanish + English) data, Advani et al. [13] utilised logistic regression using handcrafted lexical and semantic features. Similarly, On Hinglish data, Goswami et al. [14] presented a morphological attention model for sentiment analysis. This paper presents an ensemble-based model that uses character-level TF-IDF features to classify Kannada-English, Malayalam-English, and Tamil-English social media posts into five different sentiment classes. The proposed model is validated on the dataset provided in the DravidianCodeMix FIRE 2021 [15, 16] track. The dataset includes five distinct sentiment classes, including ”positive,” ”negative,” ”mixed feelings,” ”unknown state,” and ”if the post is not in the mentioned Dravidian languages.” The rest of the sections are organized as follows: the proposed methodology is explained in Section 2. The experimental findings are listed in Section 3 and Section 4 concludes the paper by highlighting the main findings of this study. 2. Methodology The systematic diagram of the proposed ensemble-based model for the Kannada-English lan- guage can be seen in Figure 1 whereas, the proposed model for Malayalam-English and Tamil- English can be seen in Figure 2. The proposed model is validated with the datasets given in the DravidianCodeMix FIRE 2021 competition [16]. The overall data statistic for Kannada-English [17], Malayalam-English [18], and Tamil-English [19] can be seen in Table 1. Table 1 Overall data statistic for Kannada, Malayalam, and Tamil dataset Class Kannada-English Malayalam-English Tamil-English Train Validation Test Train Validation Test Train Validation Test Mixed-feelings 574 52 65 926 102 134 4,020 438 470 Negative 1,188 139 157 2,105 237 258 4,271 480 477 Positive 2,823 321 374 6,421 706 780 20,070 2,257 2,546 Unknown state 711 69 62 5,279 580 643 5,628 611 665 Not-Kannada 916 110 110 - - - - - - Not-Malayalam - - - 1,157 141 147 - - - Not-Tamil - - - - - - 1,667 176 244 Total 6,212 691 768 15,888 1,766 1,962 35,156 3,962 4,402 Extensive experiments were carried out with a variety of popular machine learning classifiers using various combinations of one-to-six gram word-level and character-level TF-IDF features. Support Vector Machine Kannada-English Social Media 1/3(Σ PPostive) Positive (1-6)-Gram Character TF-IDF Features 1/3(Σ PNegative) Negative Posts Logistic 1/3(Σ PMixed-feelings) Mixed-feelings Regression 1/3(Σ PUnknown state) Unknown state 1/3(Σ PNot-kannada) Not-kannada Random Forest Figure 1: Proposed model for the Kannada-English language Malayalam-English/Tamil-English Positive 1/3(Σ PPostive) Support Vector (1-6)-Gram Character Social Media Posts Machine Negative TF-IDF Features 1/3(Σ PNegative) 1/3(Σ PMixed-feelings) Mixed-feelings 1/3(Σ PUnknown state) Logistic Unknown state Regression 1/3(Σ PNot-malayalam/tamil) Not- Malayalam/Tamil Figure 2: Proposed model for the Malayalam-English and Tamil-English languages We found that the ensemble of Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF) performed best on the Kannada-English dataset, while the ensemble of SVM and LR performed best on the Malayalam-English and Tamil-English datasets. The proposed models are described in detail in the following sections. • Kannada-English: An ensemble-based model is proposed containing SVM, LR, and RF in parallel (see Figure 1). This ensemble-based model uses one to six-gram character TF-IDF features to predict the probability for each of the classes. To choose the suitable character n-gram range, extensive experimentation was performed with one-gram to six-gram character-level TF-IDF features. We found first 50,000 one to six-gram character-level TF-IDF features were performed better than the other combination of character-level n-gram TF-IDF features. The probabilities of all the three classifiers are then averaged class-wise to get the final class probability. The final class for the post is assigned that has the highest average class probability. Table 2 Performance of the proposed model for Kannada, Malayalam, and Tamil social media posts Class Kannada-English Malayalam-English Tamil-English Precision Recall 𝐹1 -score Precision Recall 𝐹1 -score Precision Recall 𝐹1 -score Mixed-feelings 0.50 0.05 0.08 0.55 0.30 0.39 0.37 0.15 0.21 Negative 0.70 0.60 0.65 0.69 0.57 0.63 0.47 0.33 0.39 Positive 0.67 0.86 0.75 0.76 0.84 0.80 0.70 0.90 0.79 Unknown state 0.39 0.29 0.33 0.71 0.76 0.74 0.50 0.34 0.41 Not-Kannada 0.64 0.61 0.62 - - - - - - Not-Malayalam - - - 0.83 0.74 0.78 - - - Not-Tamil - - - - - - 0.73 0.53 0.61 Weighted Avg. 0.64 0.65 0.62 0.73 0.73 0.73 0.61 0.65 0.62 Confusion matrix Mixed feelings 0.05 0.22 0.54 0.05 0.15 300 250 Negative 0.00 0.60 0.36 0.00 0.04 200 True label Positive 0.01 0.06 0.86 0.06 0.01 150 not-Kannada 0.00 0.00 0.33 0.61 0.06 100 50 unknown state 0.02 0.05 0.47 0.18 0.29 0 ive Ne ngs da e e tiv tat na sit li ga ns ee an Po ow f t-K ed kn no Mix un Predicted label Figure 3: Confusion matrix for the Kannada-English dataset • Malayalam-English & Tamil-English: The proposed ensemble-based model con- tains support vector machine and logistic regression in parallel (see Figure 2). For the Malayalam-English language, first, 30,000 one to six-gram character-level TF-IDF fea- tures performed best in comparison to other combinations of n-gram features. For the Tamil-English language, the first 15,000 one to six-gram character-level TF-IDF features performed best in comparison to other combinations of n-gram features. Similar to the previous model (Figure 1) class-wise averaged probabilities were calculated and the final class label is assigned that has the highest average class probability. 3. Results and Analysis Precision, recall, and the 𝐹 1-score are utilised to assess the suggested ensemble-based model’s performance. The confusion matrix and AUC-ROC curve are also presented to highlight the Receiver operating characteristic curve 1.0 0.8 True Positive Rate 0.6 micro-average ROC curve (area = 0.90) 0.4 macro-average ROC curve (area = 0.86) Mixed feelings (AUC = 0.78) Negative (AUC = 0.91) 0.2 Positive (AUC = 0.84) not-Kannada (AUC = 0.93) unknown state (AUC = 0.84) 0.0 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate Figure 4: ROC curve for the Kannada-English dataset Confusion matrix Mixed_feelings 0.30 0.16 0.31 0.01 0.23 600 500 Negative 0.05 0.57 0.18 0.00 0.19 400 True label Positive 0.01 0.02 0.84 0.01 0.12 300 not-malayalam 0.01 0.01 0.10 0.74 0.14 200 100 unknown_state 0.02 0.04 0.16 0.02 0.76 ive Ne ngs e ow m e tiv tat un yala sit eli ga n_s Po _fe ala ed t-m kn Mix no Predicted label Figure 5: Confusion matrix for the Malayalam-English dataset model’s performance in addition to these measures. Table 2 shows the outcomes of the suggested model for the Kannada-English, Malayalam-English, and Tamil-English languages. The suggested ensemble-based model has a weighted precision of 0.64, recall of 0.65, and 𝐹 1-score of 0.62 for the Kannada-English dataset. Figures 3 and 4 show the ROC curve and confusion matrix for the Kannada-English dataset, respectively. The suggested ensemble-based model had a weighted precision, recall, and 𝐹 1-score of 0.73 for the Malayalam-English dataset. Figures 5 and 6 show the confusion matrix and ROC curve for the Malayalam-English dataset, respectively. The suggested ensemble-based model achieved a weighted precision of 0.61, recall Receiver operating characteristic curve 1.0 0.8 True Positive Rate 0.6 micro-average ROC curve (area = 0.94) 0.4 macro-average ROC curve (area = 0.92) Mixed_feelings (AUC = 0.88) Negative (AUC = 0.93) 0.2 Positive (AUC = 0.92) not-malayalam (AUC = 0.98) unknown_state (AUC = 0.90) 0.0 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate Figure 6: ROC curve for the Malayalam-English dataset Confusion matrix Mixed_feelings 0.15 0.12 0.63 0.01 0.09 2000 Negative 0.06 0.33 0.49 0.01 0.10 1500 True label Positive 0.02 0.03 0.90 0.01 0.04 1000 not-Tamil 0.01 0.02 0.32 0.53 0.11 500 unknown_state 0.06 0.05 0.53 0.02 0.34 il ive Ne ngs e e am tiv tat sit eli ga t-T n_s Po _fe no ow ed kn Mix un Predicted label Figure 7: Confusion matrix for the Tamil-English dataset of 0.65, and 𝐹 1-score of 0.62, respectively, on the Tamil-English dataset. Figures 7 and 8 show the confusion matrix and ROC curve for the Tamil-English dataset, respectively. 4. Conclusion Sentiment analysis of social media messages is an essential task in natural language process- ing, which analyses social discussions and feedback to discover the deeper context as they Receiver operating characteristic curve 1.0 0.8 True Positive Rate 0.6 micro-average ROC curve (area = 0.90) 0.4 macro-average ROC curve (area = 0.84) Mixed_feelings (AUC = 0.76) Negative (AUC = 0.84) 0.2 Positive (AUC = 0.83) not-Tamil (AUC = 0.94) unknown_state (AUC = 0.82) 0.0 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate Figure 8: ROC curve for the Tamil-English dataset pertain to a topic, brand, or theme. This work proposes an ensemble-based model to classify Kannada-English, Malayalam-English, and Tamil-English social media postings into five dif- ferent sentiment classes. The use of one to six-gram character-level feature performed best with the other combinations of n-gram character-level features. For the Kannada-English, Malayalam-English, and Tamil-English datasets, the suggested ensemble-based model achieved weighted 𝐹 1-scores of 0.62, 0.73, and 0.62, respectively. To improve performance, a robust deep ensemble-based model can be developed in the future by integrating character-level and word-level features. References [1] A. K. Mishra, S. Saumya, A. Kumar, IIIT_DWD@ HASOC 2020: Identifying offensive content in indo-european languages (2020). [2] S. Saumya, A. Kumar, J. P. Singh, Offensive language identification in Dravidian code mixed social media text, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, 2021, pp. 36–45. [3] A. Kumar, J. P. Singh, Y. K. Dwivedi, N. P. Rana, A deep multi-modal neural network for informative twitter content classification during emergencies, Annals of Operations Research (2020) 1–32. [4] A. Kumar, J. P. Singh, Location reference identification from tweets during emergencies: A deep learning approach, International journal of disaster risk reduction 33 (2019) 365–375. [5] N. Jose, B. R. Chakravarthi, S. Suryawanshi, E. Sherly, J. P. McCrae, A survey of current datasets for code-switching research, in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), 2020. [6] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, Dravidiancodemix: Sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text, arXiv preprint arXiv:2106.09460 (2021). [7] A. Hande, S. U. Hegde, R. Priyadharshini, R. Ponnusamy, P. K. Kumaresan, S. Thavareesan, B. R. Chakravarthi, Benchmarking multi-task learning for sentiment analysis and of- fensive language identification in under-resourced Dravidian languages, arXiv preprint arXiv:2108.03867 (2021). [8] B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, S. Suryawanshi, N. Jose, J. P. Sherly, Elizabeth McCrae, Overview of the track on Sentiment Analysis for Davidian Languages in Code-Mixed Text, in: Working Notes of the FIRE 2020. CEUR Workshop Proceedings., 2020. [9] A. Kumar, S. Saumya, J. P. Singh, NITP-AI-NLP@ Dravidian-CodeMix-FIRE2020: A hybrid CNN and Bi-LSTM network for sentiment analysis of Dravidian code-mixed social media posts., in: FIRE (Working Notes), 2020, pp. 582–590. [10] S. Mahata, D. Das, S. Bandyopadhyay, Sentiment classification of code-mixed tweets using bi-directional rnn and language tags, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, 2021, pp. 28–35. [11] Y. Sharma, A. V. Mandalam, Bits2020@ Dravidian-CodeMix-FIRE2020: Sub-word level sentiment analysis of Dravidian code mixed data., in: FIRE (Working Notes), 2020, pp. 503–509. [12] B. G. Patra, D. Das, A. Das, Sentiment analysis of code-mixed indian languages: An overview of SAIL_code-mixed shared task@ ICON-2017, arXiv preprint arXiv:1803.06745 (2018). [13] L. Advani, C. Lu, S. Maharjan, C1 at SemEval-2020 Task 9: SentiMix: Sentiment analysis for code-mixed social media text using feature engineering, arXiv preprint arXiv:2008.13549 (2020). [14] K. Goswami, P. Rani, B. R. Chakravarthi, T. Fransen, J. P. McCrae, ULD@ NUIG at SemEval- 2020 Task 9: Generative morphemes with an attention model for sentiment analysis in code-mixed text, arXiv preprint arXiv:2008.01545 (2020). [15] B. R. Chakravarthi, R. Priyadharshini, S. Thavareesan, D. Chinnappa, D. Thenmozhi, E. Sherly, J. P. McCrae, A. Hande, R. Ponnusamy, S. Banerjee, C. Vasantharajan, Findings of the Sentiment Analysis of Dravidian Languages in Code-Mixed Text, in: Working Notes of FIRE 2021 - Forum for Information Retrieval Evaluation, CEUR, 2021. [16] R. Priyadharshini, B. R. Chakravarthi, S. Thavareesan, D. Chinnappa, D. Thenmozi, E. Sherly, Overview of the DravidianCodeMix 2021 shared task on sentiment detec- tion in tamil, malayalam, and kannada, in: Forum for Information Retrieval Evaluation, FIRE 2021, Association for Computing Machinery, 2021. [17] A. Hande, R. Priyadharshini, B. R. Chakravarthi, KanCMD: Kannada CodeMixed dataset for sentiment analysis and offensive language detection, in: Proceedings of the Third Workshop on Computational Modeling of People’s Opinions, Personality, and Emotion’s in Social Media, Association for Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 54–63. URL: https://www.aclweb.org/anthology/2020.peoples-1.6. [18] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association, Marseille, France, 2020, pp. 177–184. URL: https://www.aclweb.org/anthology/ 2020.sltu-1.25. [19] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association, Marseille, France, 2020, pp. 202–210. URL: https://www. aclweb.org/anthology/2020.sltu-1.28.