Sarcasm Identification in Codemix Dravidian Languages

Prabhu Ram. N∗,†, Meera Devi. T†, Kanisha. V∗,†, Meharnath. S† and Manoji. B†
Department of Electronics and Communication Engineering, Kongu Engineering College, Erode, Tamil Nadu

Abstract

Social media enables communication through text, audio and video. Sarcastic text can affect an individual's well-being, so it is crucial to determine whether content is sarcastic or not. Social media posts often contain a mix of languages and address issues related to real-life situations. Identifying sarcasm in the languages spoken in India, which has 22 scheduled languages, is especially challenging due to extensive borrowing of vocabulary. The dataset used for this study includes code-mixed Tamil-English and Malayalam-English text. The training, validation, and test datasets are proportionally divided, with labels indicating whether each text is sarcastic. We performed experiments using transfer-learning models and observed that the BERT model gave the best result.

Keywords: BERT, Codemix, Dravidian Language, NLP, Sarcasm, Transformer

1. Introduction

In the 21st century, social media has become a platform with over 4.9 billion users [1]. It enables people to easily share their thoughts and opinions, making communication faster and more convenient. However, social media content can have negative impacts such as anxiety, depression and even suicidal thoughts among individuals. Unfortunately, some individuals use it to criticize or insult others, employing sarcasm that can be difficult to detect because of its implicitness. People now use irony frequently, driven by the increasing availability of internet connections and numerous new applications. Consequently, many countries have implemented social media surveillance measures to monitor citizens [2].
However, social media posts often involve a mix of languages, and Dravidian languages like Tamil and Malayalam add complexity due to their nuances. People use Natural Language Processing (NLP), a branch of artificial intelligence (AI) that analyses human language patterns, to identify sarcasm [3]. This paper explores the challenges and opportunities in sarcasm identification in code-mixed Dravidian languages. In the subsequent sections, we delve into the historical context of methods used in solving related problems, the methodology involving data pre-processing and modelling, the experimental setup for fine-tuning the pre-trained model, and the statistical analysis of our results with discussion.

FIRE 2023: 15th meeting of the Forum for Information Retrieval Evaluation, December 15 – 18, 2023, India
∗ Corresponding author. † These authors contributed equally.
Email: prabhuramnphd@gmail.com (P. Ram. N); tmeeradevi@gmail.com (M. Devi. T); kanishav.20ece@kongu.edu (Kanisha. V); meharnaths.20ece@kongu.edu (Meharnath. S); manojib.20ece@kongu.edu (Manoji. B)
ORCID: 0000-0003-2769-9790 (P. Ram. N); 0000-0003-4989-4028 (M. Devi. T)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Related work

Natural language processing researchers use various techniques to tackle the challenging task of sarcasm detection. Although hate speech and sarcasm are only indirectly connected, the pre-trained multilingual BERT model's contextual understanding transfers well between the two tasks because of its training on multilingual Wikipedia sources [4, 5]. Hate speech detection for English, German and Hindi using the multilingual BERT model has been approached with both Machine Learning (ML) and Deep Learning (DL) methods.
Prior work has employed various network-based classification architectures, including subword-level LSTM, hierarchical LSTM, BERT, XLM-RoBERTa, LSTM, GRU, and XLNet models [6, 7, 8], as well as machine learning classification models such as Support Vector Machine (SVM), Logistic Regression (LR), Random Forest Classifier (RFC) [8, 9, 10, 11, 12] and K-Nearest Neighbour (KNN) [13]. Among the machine learning models applied to the code-mixed Tamil classification task, the SVM model showed the strongest performance. RNN and MLP deep learning models have also been employed to enhance classification. TF-IDF (Term Frequency-Inverse Document Frequency) [14] serves as the text preprocessor, and SVM (Support Vector Machines) [15] functions as the classifier. A two-phase approach for sarcasm detection using machine learning algorithms involves feature extraction, feature selection, and classification using support vector machines [16]. Sarcasm detection has also been addressed with a hierarchical attention network that incorporates both word- and sentence-level attention mechanisms, with a graph convolutional network that combines syntactic and semantic information, and with a bootstrapping approach that iteratively learns new sarcastic patterns from labelled data; these methods achieved state-of-the-art performance on several benchmark datasets. In the context of emotion recognition, a study [7] focused on enhancing the effectiveness of BERT word embeddings through knowledge-based fine-tuning techniques. This research underscores the ongoing sentiment analysis and emotion recognition efforts within natural language processing. The model is connected to a fully connected network with a softmax activation function, which classifies the emotions of the given sentence.
A hybrid model of a bidirectional LSTM with a softmax attention layer and a convolutional neural network for real-time sarcasm detection in code-switched tweets achieved a superior classification accuracy of 92.71% and an F1-measure of 89.05% [17]. Models trained on larger datasets achieved higher efficiency, and the inclusion of Dravidian languages in the dataset will be helpful for this task [7, 13, 18].

3. Methodology

The step-by-step process for conducting the experiment is outlined in Figure 1, and the following sections describe in detail how the experiment was carried out. The process commences with the selection and initialization of a pre-trained model. Subsequently, the dataset is collected and pre-processed. Rigorous evaluation is performed on designated validation and test datasets, and fine-tuning strategies, including transfer learning, are employed to tailor the model to the specific task. Furthermore, hyperparameter optimization techniques are applied to enhance the model's performance. The ensuing sections present the results and discussion, highlighting the efficacy of fine-tuning and the implications of hyperparameter adjustments.

Figure 1: Block diagram

3.1. Dataset collection

The Tamil-English and Malayalam-English datasets included in this work are collections of comments on YouTube videos. The dataset covers all three types of code-mixed sentences: inter-sentential switching, intra-sentential switching and tag switching. Most comments were written in native script or Roman script, with either Tamil/Malayalam grammar and an English lexicon or English grammar and a Tamil/Malayalam lexicon. Some comments were written in Tamil or Malayalam, with English translations between them.
Table 1: Dataset distribution (training dataset counts)

Label           English-Tamil dataset   English-Malayalam dataset
Sarcastic       7170                    2259
Non-Sarcastic   19866                   9798
Total length    27036                   12057

3.2. Data preprocessing

The dataset consists of text samples categorized as either 'Sarcastic' or 'Non-Sarcastic'; the labels are encoded as 0 and 1 for 'Sarcastic' and 'Non-Sarcastic', respectively. The maximum sequence length is fixed at 128 tokens [19, 20].

3.3. Model training

The model is trained with the Simple Transformers library. The bert-base-uncased architecture is used because of its ability to capture information from the text data effectively, and it is well suited to binary classification. The model is trained to classify each text as either 'Sarcastic' or 'Non-Sarcastic'.

3.4. Model evaluation and testing

The performance of the trained model is validated and assessed on the test dataset. Based on the model's predictions, a classification report is generated that includes accuracy, F1 score, precision and recall. Because of the class imbalance shown in Table 1, the model may be biased towards a particular class. Therefore, the tokens that occur much more frequently in 'Sarcastic' utterances than in 'Non-Sarcastic' ones are extracted from the training dataset. These high-frequency tokens are then removed from the test dataset, and a second classification report is generated to check whether the model is influenced by them.

4. Experimental setup

Three datasets (training, validation and test) are utilized for each of the English-Tamil and English-Malayalam sarcasm identification tasks. The training dataset contains far more 'Non-Sarcastic' labels than 'Sarcastic' labels, resulting in class imbalance, as shown in Table 1. This large difference in label frequency makes the model more biased towards the 'Non-Sarcastic' texts.
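The label encoding and truncation described in Section 3.2 can be sketched as follows. This is a minimal illustration: the helper names are our own, and whitespace tokenisation stands in for the BERT subword tokenizer that Simple Transformers applies internally.

```python
# Minimal sketch of the preprocessing step: map labels to integer ids and cap
# inputs at 128 tokens. Whitespace splitting is a simplification; the real
# pipeline uses BERT's subword tokenizer via the Simple Transformers library.

LABEL2ID = {"Sarcastic": 0, "Non-Sarcastic": 1}  # encoding used in this work
MAX_LEN = 128  # maximum sequence length in tokens

def encode_example(text, label):
    """Return a token list capped at MAX_LEN and the integer label id."""
    tokens = text.split()[:MAX_LEN]
    return tokens, LABEL2ID[label]

tokens, label_id = encode_example("Enna oru twist... semma padam bro!", "Sarcastic")
```

The same mapping is applied to the training, validation and test splits before fine-tuning.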
5. Results and discussion

Precision, recall, and F1 scores for the bert-base-uncased sarcasm detection model were competitive. However, there was a clear discrepancy between the performance on the majority class (Non-Sarcastic) and the minority class (Sarcastic). Like many other machine learning algorithms, the bert-base-uncased model tends to favour the majority class. As a result, the minority (Sarcastic) class had reasonable precision but low recall. A high F1 score indicates that the model can accurately detect sarcasm while minimising false positives (non-sarcastic text misclassified as sarcastic) and false negatives (sarcastic text misclassified as non-sarcastic). As shown in Table 2, the Tamil-English and Malayalam-English datasets have accuracies of 0.79 and 0.81, respectively.

Table 2: Classification report before removing the high-frequency tokens

                     English-Tamil dataset          English-Malayalam dataset
                     Precision  Recall  F1-score    Precision  Recall  F1-score
Sarcastic class      0.63       0.49    0.55        0.46       0.10    0.17
Non-sarcastic class  0.83       0.89    0.86        0.83       0.97    0.90
Accuracy                                0.79                           0.81
Macro average        0.73       0.69    0.70        0.64       0.54    0.53
Weighted average     0.77       0.79    0.78        0.76       0.81    0.76

The removal of the words that are highly frequent in the Sarcastic class relative to the Non-Sarcastic class did not substantially influence the model, as shown in Table 3. The differences in accuracy between the English-Tamil and English-Malayalam datasets before and after high-frequency token removal provide important information on how token removal affects model performance. For the English-Tamil dataset, the accuracy is 0.79 before the high-frequency tokens are eliminated, meaning that the model properly classifies 79% of instances. After the high-frequency tokens are removed, the accuracy decreases slightly to 0.78. A possible cause of this small decrease is the loss of useful information that the high-frequency tokens were carrying.
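The token-frequency filter described in Section 3.4 can be sketched as follows. The ratio threshold, minimum count, and helper names are our own assumptions for illustration; the paper does not specify its exact selection criterion.

```python
# Sketch of the high-frequency-token check: collect tokens that are markedly
# more frequent in 'Sarcastic' training utterances than in 'Non-Sarcastic'
# ones, then strip them from test texts. Thresholds here are illustrative.
from collections import Counter

def frequent_sarcastic_tokens(sarcastic_texts, non_sarcastic_texts,
                              ratio=2.0, min_count=2):
    sar = Counter(tok for t in sarcastic_texts for tok in t.split())
    non = Counter(tok for t in non_sarcastic_texts for tok in t.split())
    # keep tokens seen at least min_count times in the sarcastic class and
    # more than `ratio` times as often as in the non-sarcastic class
    return {tok for tok, c in sar.items()
            if c >= min_count and c > ratio * non.get(tok, 0)}

def strip_tokens(text, banned):
    return " ".join(tok for tok in text.split() if tok not in banned)

banned = frequent_sarcastic_tokens(
    ["semma comedy da", "semma joke da", "semma acting"],
    ["nalla padam", "good movie da"],
)
cleaned = strip_tokens("semma movie da", banned)
```

Here "semma" is flagged because it dominates the sarcastic examples, while "da" survives since it is common in both classes; the same filter applied to the real training data yields the token set removed before generating Table 3.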
This implies that these tokens may indeed contribute positively to the model's capacity to identify instances in the English-Tamil dataset correctly.

Figure 2: Confusion matrix on the given test dataset for the English-Tamil dataset

On the other hand, prior to the removal of high-frequency tokens, the English-Malayalam dataset shows an accuracy of 0.81, indicating a high degree of classification accuracy. Upon eliminating the high-frequency tokens, the accuracy increases slightly to 0.82. This improvement suggests that the high-frequency tokens might have contaminated the data with noise or ambiguity; eliminating them improves model accuracy by streamlining the dataset.

Table 3: Classification report after removing the high-frequency tokens

                   English-Tamil dataset                    English-Malayalam dataset
                   Precision  Recall  F1-score  Support     Precision  Recall  F1-score  Support
Sarcastic          0.63       0.48    0.54      2263        0.59       0.04    0.07      685
Non-Sarcastic      0.82       0.90    0.86      6186        0.82       0.99    0.90      3083
Accuracy                              0.78      8449                           0.82      3768
Macro average      0.73       0.69    0.70      8449        0.71       0.52    0.49      3768
Weighted average   0.77       0.78    0.77      8449        0.78       0.82    0.75      3768

Figure 3: Confusion matrix on the given test dataset for the English-Malayalam dataset

The differences in accuracy between the two datasets demonstrate that the influence of high-frequency tokens on model performance varies with context. Their elimination improved the accuracy of the English-Malayalam dataset but somewhat decreased the accuracy of the English-Tamil dataset. These results highlight the need for a careful strategy in NLP tasks when deciding which high-frequency tokens to keep or discard. Making informed decisions requires a thorough analysis of the linguistic properties of the dataset and of the noise that high-frequency tokens may introduce. Enhancing model accuracy in code-mixed Dravidian languages requires striking the right balance between information retention and data refinement, and this balance should be dictated by the unique linguistic characteristics of each dataset.
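For reference, the per-class scores reported in Tables 2 and 3 follow from confusion-matrix counts in the standard way. The counts below are hypothetical, chosen only to illustrate the arithmetic; this is not the paper's evaluation code.

```python
# Standard precision/recall/F1 computation from confusion-matrix counts
# (TP = true positives, FP = false positives, FN = false negatives).
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for a minority 'Sarcastic' class: high precision can
# coexist with low recall when many sarcastic texts are missed (large FN).
p, r, f = precision_recall_f1(tp=80, fp=20, fn=120)
```

This is exactly the pattern visible in Tables 2 and 3: the Sarcastic class keeps moderate precision while its recall, and therefore its F1, drops sharply.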
A confusion matrix is used to analyse the performance of the model, as shown in Figures 2 and 3. The task is binary classification over the 'Non-Sarcastic' and 'Sarcastic' classes. The matrix comprises four distinct categories: True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) [21]. Figures 2 and 3 reveal the model's strengths and areas for improvement in the binary classification task, offering valuable insights for refining its performance and guiding future research efforts.

6. Conclusion and future work

Sarcasm identification using the BERT model provides a major improvement in natural language processing. BERT can capture intricate language nuances through its contextual awareness and pre-trained knowledge, which makes it well suited to the complex task of sarcasm detection. Its capacity to take the larger context of a statement into account, and to comprehend how specific words or phrases fit into that context, substantially improves the model's accuracy. BERT can provide more balanced predictions, reducing the tendency to misclassify sarcastic statements as non-sarcastic. However, class imbalance remains a challenge: in our experiments, the model is biased towards the Non-Sarcastic texts, which reduces its accuracy. Further research should focus on strategies such as oversampling, undersampling, or alternative loss functions to fine-tune BERT models effectively. By addressing the class imbalance problem, we can continue to enhance the accuracy and reliability of sarcasm detection using BERT-based models, making them more valuable in real-world applications. In addressing the class imbalance, the potential integration of the Synthetic Minority Over-sampling Technique (SMOTE) algorithm holds significant promise [22].
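As a rough sketch of the rebalancing idea: true SMOTE interpolates between minority-class feature vectors, which does not apply directly to raw text, so plain random duplication of minority samples stands in here for illustration. The function name and seed are our own.

```python
# Illustrative class rebalancing by oversampling the minority class. Real
# SMOTE would synthesise new points in a feature space (e.g. embeddings);
# random duplication of minority texts is a simpler stand-in shown here.
import random

def oversample(minority, majority, seed=0):
    """Grow the minority list to the size of the majority list."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra

balanced = oversample(["sarc1", "sarc2"],
                      ["non1", "non2", "non3", "non4", "non5"])
```

After rebalancing, each training batch sees the two classes in roughly equal proportion, which counteracts the bias towards the Non-Sarcastic class discussed above.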
By integrating SMOTE into the training pipeline, we can create synthetic samples for the minority class. This helps create a balanced dataset during training and can greatly improve the accuracy of the model.

References

[1] J. Belle Wong, Top social media statistics and trends of 2023, Online, 2023. URL: https://www.forbes.com/advisor/business/social-media-statistics/.
[2] B. E. Duffy, N. K. Chan, "You never really know who's looking": Imagined surveillance across social media platforms, New Media & Society 21 (2019) 119–138.
[3] B. Harish, R. K. Rangan, A comprehensive survey on Indian regional language processing, SN Applied Sciences 2 (2020) 1204.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[5] R. Pandey, J. P. Singh, BERT-LSTM model for sarcasm detection in code-mixed social media post, Journal of Intelligent Information Systems 60 (2023) 235–254.
[6] T. Santosh, K. Aravind, Hate speech detection in Hindi-English code-mixed social media text, in: Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, 2019, pp. 310–313.
[7] S. Banerjee, A. Jayapal, S. Thavareesan, NUIG-Shubhanker@Dravidian-CodeMix-FIRE2020: Sentiment analysis of code-mixed Dravidian text using XLNet, arXiv preprint arXiv:2010.07773 (2020).
[8] B. R. Chakravarthi, R. Priyadharshini, N. Jose, T. Mandl, P. K. Kumaresan, R. Ponnusamy, R. Hariharan, J. P. McCrae, E. Sherly, et al., Findings of the shared task on offensive language identification in Tamil, Malayalam, and Kannada, in: Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, 2021, pp. 133–145.
[9] A. Muneer, S. M. Fati, A comparative analysis of machine learning techniques for cyberbullying detection on Twitter, Future Internet 12 (2020) 187.
[10] V. Pathak, M. Joshi, P. Joshi, M. Mundada, T. Joshi, KBCNMUJAL@HASOC-Dravidian-CodeMix-FIRE2020: Using machine learning for detection of hate speech and offensive code-mixed social media text, arXiv preprint arXiv:2102.09866 (2021).
[11] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis dataset for code-mixed Malayalam-English, arXiv preprint arXiv:2006.00210 (2020).
[12] P. R. Nagarajan, V. Mammen, V. Mekala, M. Megalai, A fast and energy efficient path planning algorithm for offline navigation using SVM classifier, Int. J. Sci. Technol. Res. 9 (2020) 2082–2086.
[13] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text, arXiv preprint arXiv:2006.00206 (2020).
[14] B. Subba, P. Gupta, A TfidfVectorizer and singular value decomposition based host intrusion detection system framework for detecting anomalous system processes, Computers & Security 100 (2021) 102084.
[15] V. Cherkassky, Y. Ma, Practical selection of SVM parameters and noise estimation for SVM regression, Neural Networks 17 (2004) 113–126.
[16] S. M. Sarsam, H. Al-Samarraie, A. I. Alzahrani, B. Wright, Sarcasm detection using machine learning algorithms in Twitter: A systematic review, International Journal of Market Research 62 (2020) 578–598.
[17] R. A. Potamias, G. Siolas, A.-G. Stafylopatis, A transformer-based approach to irony and sarcasm detection, Neural Computing and Applications 32 (2020) 17309–17320.
[18] P. K. Roy, S. Bhawal, C. N. Subalalitha, Hate speech and offensive language detection in Dravidian languages using deep ensemble framework, Computer Speech & Language 75 (2022) 101386.
[19] B. R. Chakravarthi, Hope speech detection in YouTube comments, Social Network Analysis and Mining 12 (2022) 75.
[20] B. R. Chakravarthi, A. Hande, R. Ponnusamy, P. K. Kumaresan, R. Priyadharshini, How can we detect homophobia and transphobia? Experiments in a multilingual code-mixed setting for social media governance, International Journal of Information Management Data Insights 2 (2022) 100119.
[21] O. Caelen, A Bayesian interpretation of the confusion matrix, Annals of Mathematics and Artificial Intelligence 81 (2017) 429–450.
[22] B. R. Chakravarthi, D. Chinnappa, R. Priyadharshini, A. K. Madasamy, S. Sivanesan, S. C. Navaneethakrishnan, S. Thavareesan, D. Vadivel, R. Ponnusamy, P. K. Kumaresan, Developing successful shared tasks on offensive language identification for Dravidian languages, arXiv preprint arXiv:2111.03375 (2021).