Sarcasm Detection in Tamil and Malayalam Dravidian Code-Mixed Text

Supriya Chanda (1), Anshika Mishra (2) and Sukomal Pal (1)
(1) Indian Institute of Technology (BHU), Varanasi, India
(2) Vellore Institute of Technology Bhopal, Madhya Pradesh, India

Abstract
Sarcasm is a form of verbal irony that involves saying the opposite of what is actually meant in a mocking or humorous manner. Social media today abounds with sarcastic comments, which are often code-mixed in nature. To gain insights from this textual data, we need a system that detects sarcasm and identifies the sentiments behind the texts. In this paper, we present a solution submitted for the shared task titled 'Sarcasm Identification of Dravidian Languages Tamil and Malayalam,' organized by Dravidian CodeMix 2023 at the Forum for Information Retrieval Evaluation (FIRE) 2023. The paper explores an approach to sarcasm detection that leverages BERT (Bidirectional Encoder Representations from Transformers) with a supplementary neural network layer for classification into two classes: sarcastic and non-sarcastic comments. Our experiments demonstrate that the model effectively detects sarcastic comments, achieving an F1 score of 0.72 on both the Tamil-English and Malayalam-English code-mixed datasets.

Keywords
Social Media, Code-Mixed, BERT, Sarcasm, Sentiment Analysis, Tamil, Malayalam

1. Introduction

In our modern digital age, the complexities of human language present intriguing challenges for natural language processing systems. Among these intricacies, sarcasm emerges as a captivating linguistic puzzle. Sarcasm involves expressing thoughts in a manner that conceals the true intentions of the speaker, often infused with a dose of mockery or humor, and serves as a linguistic tool to convey negative sentiments in a subtle manner.
While humans excel at deciphering sarcasm through tone, context, and emotional cues, teaching machines to perform this feat remains a formidable undertaking. Accurate detection of sarcasm holds significant importance, particularly in sentiment analysis, where it plays a pivotal role in understanding textual data. In our technology-dominated world, the growth of social media has been exponential: roughly 60% of the global population actively participates on these platforms, dedicating an average of 2 hours and 24 minutes daily to their online engagements (as reported by Smart Insights¹). Social media platforms provide an open canvas for individuals to express their views across a spectrum of subjects, events, personalities, and products, generating an estimated 328.77 million terabytes of data daily. A substantial portion of this textual corpus is characterized by code-mixing, a linguistic phenomenon wherein individuals seamlessly blend elements from different languages, often employing the Roman script as a common bridge. Code-mixing on social media mirrors the diverse linguistic backgrounds of users and the global reach of these digital platforms.

FIRE'23: Forum for Information Retrieval Evaluation, December 15-18, 2023, India
Email: supriyachanda.rs.cse18@itbhu.ac.in (S. Chanda); anshika.mishra2019@vitbhopal.ac.in (A. Mishra); spal.cse@itbhu.ac.in (S. Pal)
Web: https://cse-iitbhu.github.io/irlab/supriya.html (S. Chanda); https://cse-iitbhu.github.io/irlab/spal.html (S. Pal)
ORCID: 0000-0002-0877-7063 (S. Chanda); 0000-0001-8743-9830 (S. Pal)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
It highlights how people consciously opt for code-mixing, skillfully weaving together different languages to enhance their communication and switching between tongues to convey their feelings. Within the sphere of online interactions, encompassing comments, posts, and messages on social media platforms, the amalgamation of multiple languages, frequently expressed in the Roman script, is a common occurrence. The analysis of text not written in its native script adds a further layer of complexity to natural language processing. The importance of applying sentiment analysis to this data cannot be emphasized enough: it unveils valuable insights for fields such as market research, social media trend monitoring, and customer feedback analysis. It also plays a crucial role in countering the spread of hate speech on social platforms, thereby safeguarding the mental well-being of individuals. Furthermore, the capability to decode user queries infused with sarcasm paves the way for providing users with relevant information and responses. The focal point of this shared task is the precise identification of sarcasm and the determination of sentiment polarity within a code-mixed dataset of comments and posts in Tamil-English and Malayalam-English, curated from social media. The task helps explore how sarcasm is used in mixed-language conversations on social media. This is not just a linguistic and computational challenge; it also offers deeper insight into how sarcasm operates in the constantly changing world of online interactions and, in turn, helps us decipher the subtleties of digital communication.
In this paper, we applied a method that leverages mBERT to identify sarcasm and determine sentiment polarity in code-mixed comments and posts written in Tamil-English and Malayalam-English, as commonly encountered on social media platforms.

The rest of the paper is structured as follows. Section 2 provides a concise overview of prior research in this field. Section 3 describes the datasets we utilized for our investigation. Section 4 elaborates on our computational methodology, model specifications, and evaluation techniques. We then present our results and a comprehensive analysis in Section 5, and conclude in Section 6.

¹ https://www.smartinsights.com/social-media-marketing/social-media-strategy/new-global-social-media-research/

2. Related Work

Code-mixing in spoken language has been the subject of extensive research for decades. However, the analysis of code-mixed text, particularly in the context of social media, represents a relatively new frontier in Natural Language Processing (NLP). Modern NLP models have demonstrated their prowess in various tasks, including sentiment analysis [1, 2, 3], language identification, hate speech identification [4, 5, 6], information retrieval [7], and named-entity recognition, particularly for monolingual text. Nevertheless, they face notable challenges when confronted with code-mixed content, where multiple languages are interwoven.

Sarcasm detection, a significant downstream task in NLP, has garnered considerable attention from researchers, and efforts have been directed toward effectively solving this challenging problem. One notable approach models context with user embeddings for detecting sarcasm in social media text, as proposed by Amir et al. [8]; the model captures contextual information and identifies sarcasm cues. Wicana et al.
[9] undertook an extensive examination of sarcasm detection, delving into different machine-learning methods. Their review offers valuable perspectives on the latest techniques and the difficulties related to identifying sarcasm in text. A range of neural network-based classification structures has also been explored, including models such as subword-level LSTM, hierarchical LSTM, BERT, XLM-RoBERTa, LSTM, GRU, and XLNet. Notably, IndicBERT has proven proficient at capturing the nuances and language-specific characteristics of Indian languages [10]. To tackle the intricacies of sarcasm detection, researchers have employed a variety of techniques; for instance, an attention-based BiLSTM model combined with a feature-rich Convolutional Neural Network (CNN) approach [11]. It is essential to note that while sarcasm and hate speech are related, they are not the same, and they demand distinct approaches. Efforts to identify sarcasm in social media content have sparked innovative approaches, such as combining the multilingual BERT (mBERT) model with a Graph Convolution Network (GCN) [12]. Additionally, Agrawal et al. [13] ventured into utilizing emotional transitions to improve sarcasm detection, shedding light on the dynamic nature of emotional signals in recognizing sarcasm. Significantly, sarcasm detection goes beyond written content: Pandey and Vishwakarma [14] tackled multimodal sarcasm detection in videos, employing deep learning models to leverage visual, audio, and textual cues. These studies collectively represent the evolving landscape of sarcasm detection, showcasing a wide array of approaches and techniques, each offering valuable insights and advancements in the field.

3. Dataset

The dataset provided by the organizers is a valuable resource for our research, encompassing code-mixed comments and posts in Tamil-English and Malayalam-English, sourced from social media. While comments and posts may consist of multiple sentences, the dataset predominantly features an average length of one sentence. Importantly, each comment and post comes with sentiment polarity annotations, reflecting real-world scenarios and the challenges associated with class imbalance. This dataset encourages us to investigate how sarcasm manifests in code-mixed contexts on social media. It includes training, development, and test splits of YouTube video comments in Tamil-English and Malayalam-English, encompassing various code-mixing types and linguistic characteristics, and provides a rich foundation for our research into sarcasm expression in these multilingual settings. Table 1 shows the class distribution.

Table 1
Data distribution for sarcasm detection of code-mixed text in Dravidian languages

Tamil-English
Class          Training   Development   Test
Sarcastic      7170       1820          2263
Non-Sarcastic  19866      4939          6186

Malayalam-English
Class          Training   Development   Test
Sarcastic      2259       588           685
Non-Sarcastic  9798       2427          3083

4. Methodology

4.1. Preprocessing

In the data preprocessing phase, we conducted several essential steps to refine the dataset. We removed hashtags, punctuation marks, URLs, numbers, and mentions that lacked clear semantic significance. Emojis were systematically replaced with their corresponding semantic text representations. Additionally, any extra white space was stripped from the dataset to ensure a clean and consistent text corpus for subsequent analysis.

4.2. Model Architecture

In our research methodology, we utilised the bert-base-multilingual-cased (mBERT) pre-trained model as the foundation for our task. mBERT is built on the transformer architecture; while the original transformer employs self-attention mechanisms in both the encoder and the decoder, BERT uses only the encoder stack.
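The cleanup steps listed in Section 4.1 can be sketched as a small Python routine. This is an illustrative reconstruction, not the authors' code: the exact regular expressions and the inline emoji-to-text mapping are our assumptions (a production system might instead use the third-party `emoji` package for demojization).

```python
import re

# Minimal emoji-to-text mapping for illustration only; a real system would
# cover the full emoji range (e.g. via the third-party `emoji` package).
EMOJI_MAP = {
    "\U0001F602": " face_with_tears_of_joy ",  # 😂
    "\U0001F644": " face_with_rolling_eyes ",  # 🙄
}

def clean_comment(text: str) -> str:
    """Sketch of the Section 4.1 preprocessing: strip URLs, mentions,
    hashtags, numbers and punctuation; replace emojis with semantic text;
    collapse extra whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # mentions and hashtags
    for emo, name in EMOJI_MAP.items():                  # emojis -> semantic text
        text = text.replace(emo, name)
    text = re.sub(r"\d+", " ", text)                     # numbers
    text = re.sub(r"[^\w\s]", " ", text)                 # punctuation
    return re.sub(r"\s+", " ", text).strip()             # extra whitespace

print(clean_comment("Super movie!!! \U0001F602 #trailer @fan https://youtu.be/x 2023"))
# -> "Super movie face_with_tears_of_joy"
```

Because `\w` is Unicode-aware in Python 3, native-script Tamil and Malayalam characters in code-mixed comments survive the punctuation filter.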
These models are pre-trained on vast text corpora, including Wikipedia, and have a well-established track record of delivering exceptional performance when fine-tuned for various downstream tasks. For our objective of handling code-mixed language and detecting sarcasm, we opted for BERT's multilingual variant, mBERT (bert-base-multilingual-cased). Its strength lies in its ability to handle text from a wide array of languages, with roughly 179 million parameters, 12 transformer blocks, a hidden size of 768, and 12 attention heads.

Our architecture begins with the model taking a special [CLS] token as input, followed by the sequence of words. This input traverses the layers, each applying self-attention and forwarding its results to the next encoder block. The output of the final layer of the pre-trained mBERT model serves as input to a softmax feedforward neural network, which classifies each statement into one of two categories: Sarcastic or Non-Sarcastic. This network produces a probability distribution over the two classes, and during prediction the class with the highest probability is selected.

In the training phase, we tuned specific hyperparameters to guide the learning process effectively: a learning rate of 0.01, a batch size of 16, and a maximum of 10 training epochs. These hyperparameters were optimized to ensure the model's proficiency in code-mixed sarcasm detection.

5. Results and Discussion

In this section, we present a comprehensive evaluation of our model's performance on both datasets, Tamil-English and Malayalam-English, as part of the Sarcasm Identification task for Dravidian languages.
The performance of our proposed model is examined using a range of evaluation metrics, with a primary focus on accuracy, recall, macro-averaged F1-score, and weighted average F1-score. The organizers provided test data for both Dravidian languages, which served as the foundation for our model evaluation. Our methodology involved fine-tuning the model on the training and validation datasets, ensuring it was well prepared for the subsequent test data. Upon submission of our prediction file for the test data, we achieved an F1 score of 0.72 for both language pairs. This score reflects a reasonable overall performance and places us at the third position in the ranking for both Tamil-English and Malayalam-English. Tables 2 and 3 display the test results for our proposed model and the top-scoring team [15] for Tamil-English and Malayalam-English, respectively. Table 4 shows the class-wise classification report for both language pairs on the test data.

Table 2
F1-scores for Tamil-English test data and rank list

Team Name         F1 score   Rank
hatealert_Tamil   0.74       1/8
IRLabIITBHU_tam   0.72       3/8

Table 3
F1-scores for Malayalam-English test data and rank list

Team Name            F1 score   Rank
SSNCSE1_Malayalaml   0.74       1/8
IRLabIITBHU_mal      0.72       3/8

Table 4
Precision, recall, F1-scores, and support for both languages on test data

                Tamil-English                        Malayalam-English
                Precision  Recall  F1     Support   Precision  Recall  F1     Support
Non-sarcastic   0.84       0.89    0.86   6186      0.90       0.90    0.90   3083
Sarcastic       0.64       0.52    0.57   2263      0.54       0.53    0.54   685
Macro avg       0.74       0.71    0.72   8449      0.72       0.72    0.72   3768
Weighted avg    0.78       0.79    0.79   8449      0.83       0.83    0.83   3768
Accuracy                           0.79   8449                         0.83   3768

While our system demonstrated commendable accuracy, other competing teams surpassed us in both precision and recall, which ultimately influenced our F1 score and final ranking. This outcome encourages further refinement of our approach to enhance our model's precision and recall, aiming for even more competitive results in future endeavors. Figure 1 displays the confusion matrices for the two language pairs based on our submission.

Figure 1: Confusion matrices for our submissions on the corpus test set for both language pairs: (a) Tamil-English, (b) Malayalam-English.

6. Conclusion

In this research, we tackled the intricate task of identifying sarcasm and assessing sentiment polarity in code-mixed comments and posts in Tamil-English and Malayalam-English, extracted from the vibrant realm of social media. Our exploration into sentiment analysis reaffirms the growing significance of understanding user opinions, particularly in the context of enhancing business strategies. For our experimentation, we harnessed the capabilities of the pre-trained multilingual BERT model, which yielded a commendable F1 score of 0.72. This achievement reflects the effectiveness of our approach in capturing the nuances of sarcasm within code-mixed contexts. Despite our system's accuracy, we acknowledge the competitive landscape in which other teams excelled in both precision and recall, influencing our F1 score and final ranking. This outcome serves as a catalyst for refining our methodology further.
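As a quick arithmetic check on the scores discussed above: the macro-averaged F1 in Table 4 is the unweighted mean of the per-class F1 scores, while the weighted average weights each class by its support. Recomputing from the rounded Tamil-English values reproduces the reported 0.72 macro average (the weighted figure differs in the last decimal only because the inputs here are rounded):

```python
# Per-class F1 and support for Tamil-English, as reported in Table 4
f1_non_sarcastic, support_non = 0.86, 6186
f1_sarcastic, support_sar = 0.57, 2263

# Macro average: unweighted mean over classes (~0.715, reported as 0.72)
macro_f1 = (f1_non_sarcastic + f1_sarcastic) / 2

# Weighted average: mean weighted by class support (~0.782; the paper reports
# 0.79, computed from the unrounded per-class scores)
weighted_f1 = (f1_non_sarcastic * support_non + f1_sarcastic * support_sar) / (
    support_non + support_sar
)
```

The gap between the two averages (0.72 vs. 0.79) reflects the class imbalance: the large non-sarcastic class dominates the weighted figure, while the macro average exposes the weaker sarcastic-class performance.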
In our future endeavors, we will remain steadfast in our pursuit of enhancing precision and recall, with the aim of achieving even more competitive results.

References

[1] S. Chanda, S. Pal, IRLab@IITBHU@Dravidian-CodeMix-FIRE2020: Sentiment analysis for Dravidian languages in code-mixed text, in: FIRE (Working Notes), 2020, pp. 535–540.
[2] S. Chanda, R. Singh, S. Pal, Is meta embedding better than pre-trained word embedding to perform sentiment analysis for Dravidian languages in code-mixed text?, Working Notes of FIRE (2021).
[3] S. Chanda, A. Mishra, S. Pal, Sentiment analysis and homophobia detection of code-mixed Dravidian languages leveraging pre-trained model and word-level language tag, in: Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation (Hybrid), CEUR, 2022.
[4] A. Saroj, S. Chanda, S. Pal, IRLab@IITV at SemEval-2020 Task 12: Multilingual offensive language identification in social media using SVM, in: Proceedings of the Fourteenth Workshop on Semantic Evaluation, 2020, pp. 2012–2016.
[5] S. Chanda, S. Ujjwal, S. Das, S. Pal, Fine-tuning pre-trained transformer based model for hate speech and offensive content identification in English, Indo-Aryan and code-mixed (English-Hindi) languages, in: Forum for Information Retrieval Evaluation (Working Notes) (FIRE), CEUR-WS.org, 2021.
[6] S. Chanda, S. Sheth, S. Pal, Coarse and fine-grained conversational hate speech and offensive content identification in code-mixed languages using fine-tuned multilingual embedding, in: Forum for Information Retrieval Evaluation (Working Notes) (FIRE), CEUR-WS.org, 2022.
[7] S. Chanda, S. Pal, The effect of stopword removal on information retrieval for code-mixed data obtained via social media, SN Computer Science 4 (2023) 494.
[8] S. Amir, B. C. Wallace, H. Lyu, P. Carvalho, M. J. Silva, Modelling context with user embeddings for sarcasm detection in social media, in: Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, Association for Computational Linguistics, Berlin, Germany, 2016, pp. 167–177. URL: https://aclanthology.org/K16-1017. doi:10.18653/v1/K16-1017.
[9] S. G. Wicana, T. Y. Ibisoglu, U. Yavanoglu, A review on sarcasm detection from machine-learning perspective, 2017 IEEE 11th International Conference on Semantic Computing (ICSC) (2017) 469–476. URL: https://api.semanticscholar.org/CorpusID:16074739.
[10] K. Jain, A. Deshpande, K. Shridhar, F. Laumann, A. Dash, Indic-transformers: An analysis of transformer language models for Indian languages, 2020. arXiv:2011.02323.
[11] D. K. Jain, A. Kumar, G. Garg, Sarcasm detection in mash-up language using soft-attention based bi-directional LSTM and feature-rich CNN, Appl. Soft Comput. 91 (2020) 106198. URL: https://api.semanticscholar.org/CorpusID:216439240.
[12] M. Marreddy, S. R. Oota, L. S. Vakada, V. C. Chinni, R. Mamidi, Multi-task text classification using graph convolutional networks for large-scale low resource language, 2022. arXiv:2205.01204.
[13] A. Agrawal, A. An, M. Papagelis, Leveraging transitions of emotions for sarcasm detection, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020. URL: https://api.semanticscholar.org/CorpusID:220729631.
[14] A. Pandey, D. K. Vishwakarma, Multimodal sarcasm detection (MSD) in videos using deep learning models, in: 2023 International Conference in Advances in Power, Signal, and Information Technology (APSIT), 2023, pp. 811–814. doi:10.1109/APSIT58554.2023.10201731.
[15] B. R. Chakravarthi, N. Sripriya, B. Bharathi, K. Nandhini, S. Chinnaudayar Navaneethakrishnan, T. Durairaj, R. Ponnusamy, P. K. Kumaresan, K. K. Ponnusamy, C. Rajkumar, Overview of the shared task on sarcasm identification of Dravidian languages (Malayalam and Tamil) in DravidianCodeMix, in: Forum of Information Retrieval and Evaluation FIRE - 2023, 2023.