=Paper=
{{Paper
|id=Vol-3681/T5-11
|storemode=property
|title=Sarcasm Identification Of Dravidian Languages (Malayalam and Tamil)
|pdfUrl=https://ceur-ws.org/Vol-3681/T5-11.pdf
|volume=Vol-3681
|authors=V Indirakanth,Dharunkumar Udayakumar,Thenmozhi Durairaj,B. Bharathi
|dblpUrl=https://dblp.org/rec/conf/fire/IndirakanthUDB23
}}
==Sarcasm Identification Of Dravidian Languages (Malayalam and Tamil)==
V Indirakanth, Dharunkumar Udayakumar, Thenmozhi Durairaj and B. Bharathi
Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering

Abstract

The rapid growth of social media has led to an increase in sarcastic comments. Detecting sarcasm in posts written in multiple languages has become a critical aspect of language processing. Our work in the FIRE 2023 competition, specifically the shared task "Sarcasm Identification Of Dravidian Languages (Tamil and Malayalam)", centered on identifying sarcasm in texts from social media, a research area of growing importance. We employed various models, including BERT, DistilBERT, XLM-RoBERTa, SVM, and TF-IDF, to categorize text as either sarcastic or not. Our team, SSN_FeaturesAlpha, achieved notable results, with a highest F1 score of 0.68 for Tamil using DistilBERT and 0.63 for Malayalam using BERT. Our Tamil submission ranked 7th and our Malayalam submission secured 5th place, which underscores the effectiveness of our approach.

Keywords: Dravidian language, Text classification, Transfer learning, Sarcasm Detection

Forum for Information Retrieval Evaluation, December 15-18, 2023, India
indirakanth2010681@ssn.edu.in (V. Indirakanth); dharunkumar2010504@ssn.edu.in (D. Udayakumar); theni_d@ssn.edu.in (T. Durairaj); bharathib@ssn.edu.in (B. Bharathi)
https://github.com/indira1vik (V. Indirakanth); https://github.com/dharunkumar56 (D. Udayakumar)

1. Introduction

In recent years, the volume of user-generated content on social media platforms like Twitter, Facebook, YouTube, and Instagram has grown exponentially, and social media is expected to remain a major source of data in the years ahead. Sarcasm detection in particular has become a focal point of research because of its significance for moderating and comprehending online discourse. Detecting sarcasm involves the complex task of identifying sarcastic statements within text written in code-mixed form.

Tamil, an official Dravidian language spoken in Tamil Nadu (India), Sri Lanka, and Singapore, and Malayalam, another Dravidian language spoken in Kerala (India), are the languages under scrutiny. Code-mixing, where native speakers switch between multiple languages and sometimes employ the Roman script, is a common phenomenon in online social media interactions, and the analysis of code-mixed bilingual or multilingual posts is gaining prominence in contemporary research. Identifying sentiments within indirect expressions such as sarcasm and metaphor is a challenging endeavor when done manually, so the automatic detection of sentiments across multiple languages is a substantial challenge.

The paper "Overview of The Shared Task on Sarcasm Identification of Dravidian Languages (Malayalam and Tamil) in Dravidian CodeMix" [9] provides an overview of the shared task on sarcasm identification in Dravidian languages.
The FIRE 2023 competition, Sarcasm Identification of Dravidian Languages (Tamil and Malayalam), is designed to develop systems capable of discerning the polarity of sentiments in code-mixed Tamil and Malayalam posts on social media forums. The organizers of FIRE 2022 initiated the task of identifying sentiments in Tamil and Malayalam code-mixed text; that challenge required categorizing posts as positive, negative, neutral, mixed emotions, or unclassifiable emotions in the intended language. This year's FIRE 2023 competition provides datasets for two languages, Tamil and Malayalam, both mixed with English.

In the context of the FIRE 2023 competition, which revolves around detecting sarcasm in Tamil and Malayalam comments mixed with English, this article outlines our systematic approach. Our primary task was to categorize comments into positive, negative, unknown sentiment, mixed feelings, or unclassifiable in the respective language, ultimately determining whether a comment is sarcastic. To achieve this, we followed a structured approach. First, we implemented language-specific transliteration and translation techniques to better understand the code-mixed content. Second, we preprocessed both the training and test data for Tamil and Malayalam using the NLTK library. Finally, we selected and fine-tuned a range of machine learning models, including BERT, DistilBERT, SVM, TF-IDF, and XLM-RoBERTa, integrating them with both Tamil and Malayalam while also incorporating English. This article provides a comprehensive overview of our methodology and contributions to the FIRE 2023 competition, with a primary focus on sarcasm detection.

The structure of this paper is as follows: Section 2 reviews related work on sarcasm detection, Section 3 describes the data and our model methodology, Section 4 presents our experimental results and analysis, and Section 5 offers conclusions drawn from our work and discusses potential avenues for further improvement in sarcasm detection.

2. Related Work

In recent years, the field of Natural Language Processing (NLP) has witnessed a surge of interest in the detection of sarcasm, a form of figurative language that presents a formidable challenge due to its often subtle and context-dependent nature. While most research in this domain has focused on widely spoken languages such as English, there is growing recognition of the need to extend this investigation to under-represented languages like Tamil and Malayalam, which belong to the Dravidian language family. Previous work on sarcasm detection in NLP has predominantly leveraged machine learning techniques, including supervised, unsupervised, and deep learning approaches.

The authors of [3] created a multilingual dataset to recognize and encourage positivity in comments and proposed a novel custom deep network architecture that uses a concatenation of embeddings from T5-Sentence. They experimented with multiple machine learning models, including SVM, logistic regression, K-nearest neighbours, and decision trees. The paper [4] aims at developing a system that groups posts based on emotions and sentiment and finds sarcastic posts, if present.
The proposed system develops a prototype that helps infer the emotions of posts, namely anger, surprise, happiness, fear, sorrow, trust, anticipation, and disgust, with three sentic levels in each. The task [5] presents the findings of the shared task on Multimodal Sentiment Analysis and Troll Meme Classification in Dravidian languages; this task assumes the analysis of both textual and image features for making better predictions. The paper [6] investigates negative-sentiment tweets containing hyperbole for sarcasm detection; 6,600 preprocessed negative-sentiment tweets mentioning Chinesevirus, Kungflu, COVID19, Hantavirus, and Coronavirus were gathered. The tasks in paper [7] included code-mixing at the intra-token and inter-token levels; in addition to Tamil, Malayalam and Kannada were also introduced, and the quality and quantity of the submissions show that there is great interest in Dravidian languages in a code-mixed setting, while the state of the art in this domain still needs improvement. The work in [8] intends to improve offensive language identification by generating pseudo-labels on the dataset: a custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language, either Kannada, Malayalam, or Tamil, pseudo-labels are generated for the transliterated dataset, and the two datasets are combined using the generated pseudo-labels into a custom dataset called CMTRA. As Dravidian languages are under-resourced, this approach increases the amount of training data available to the language models.

3. Dataset Description and Proposed Methodology

This section describes the code-mixed Tamil and Malayalam data, including details of the dataset and how we prepared it. In our research, we explored techniques commonly used in Natural Language Processing (NLP), applying methods like Support Vector Machines (SVM), DistilBERT, XLM-RoBERTa, and transfer learning to improve our results.

3.1. Data Description

For the sarcasm identification task, the organizers provided code-mixed datasets in Tamil and Malayalam. The Malayalam dataset has 12,057 posts for training and 3,768 posts for testing, whereas the Tamil dataset has 27,036 posts for training and 8,449 posts for testing the model system. The objective of this work is to divide the posts in the Tamil and Malayalam datasets into two categories, Sarcastic and Non-sarcastic (Figures 1 and 2). The training data was taken from comments on YouTube [1] [2].

Figure 1: Tamil training data values
Figure 2: Malayalam training data values

3.2. Data Preprocessing

Data preprocessing was conducted on both the Tamil and Malayalam datasets to ensure their adaptability for sarcasm detection tasks in Dravidian languages. This preprocessing, executed using NLTK, aimed to enhance the quality and uniformity of the text data. Initially, duplicate entries were removed to mitigate their potential influence on model performance. Subsequently, text strings beginning with "@", typically representing author names or user IDs, were eliminated. Hashtags, punctuation, URLs, and numerals devoid of semantic significance were also stripped from the text, and emojis were removed to maintain textual clarity. Additionally, all uppercase English text and native-language text in the Roman script were converted to lowercase.
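As an illustration of these cleaning steps, the sketch below uses pandas, NLTK, and regular expressions. The column name "text" and the exact patterns are assumptions for illustration, not our exact pipeline.

```python
import re
import pandas as pd
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

def clean_comment(text: str) -> str:
    """Apply the cleaning steps described above to a single comment."""
    text = re.sub(r"@\w+", "", text)                    # drop @user mentions
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # drop URLs
    text = re.sub(r"#\w+", "", text)                    # drop hashtags
    text = re.sub(r"[^\w\s]", "", text)                 # drop punctuation and emojis
    text = re.sub(r"\d+", "", text)                     # drop numerals
    return text.lower().strip()                         # lowercase Roman-script text

# Hypothetical training frame with a "text" column.
df = pd.DataFrame({"text": ["@user enaku matum Tha apdi thonutho... ??? ADI THOOL"]})
df = df.drop_duplicates(subset="text")                  # remove duplicate entries
df["text"] = df["text"].map(clean_comment)
stops = set(stopwords.words("english"))                 # English stop words in the mix
df["text"] = df["text"].map(
    lambda t: " ".join(w for w in t.split() if w not in stops))
print(df["text"].iloc[0])  # -> "enaku matum tha apdi thonutho adi thool"
```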
These preprocessing measures were consistently applied to both the Tamil and Malayalam datasets, ensuring the data's readiness for sarcasm detection tasks within the Dravidian language context. We downsampled the training data, removed emojis, and eliminated common stop words to enhance data quality. We also mapped the labels for clarity, transforming "Sarcastic" and "Non-sarcastic" into 1 (sarcasm) and 0 (non-sarcasm), so that the data aligns with our binary classification task. Table 1 shows an example.

Table 1: Preprocessing results

Before preprocessing: (emojis) enaku matum Tha apdi thonutho... ??? ADI THOOL
After preprocessing:  enaku matum tha apdi thonutho adi thol

3.3. Proposed Methodology

In this section, we describe the methodology employed for sarcasm detection in Dravidian languages, breaking the process into its stages and explaining how each step contributes to our overall goal. Sarcasm, known for its subtlety and context dependency, presents a formidable challenge in NLP; our approach combines NLP techniques with a range of machine learning models.

Tokenization: For the initial data preparation, we tokenized the text into individual tokens or subwords. Depending on the model, different tokenizers were employed: BertTokenizer for BERT and DistilBERT, XLMRobertaTokenizer for XLM-RoBERTa, and TfidfVectorizer for the Support Vector Machine (SVM) approach. Tokenization is a crucial step in preparing the text data for further processing.

Model Training: Next, we trained the models. BERT, DistilBERT, and XLM-RoBERTa come pre-trained on large text corpora, but to perform well on sarcasm detection in Dravidian languages they must be fine-tuned. Fine-tuning ran for 4 epochs, each epoch being a complete pass through the training data, with input reprocessing enabled to adapt the models to the characteristics of the sarcasm detection data. The models were configured for binary classification, distinguishing sarcastic from non-sarcastic text. This fine-tuning process ensures that the models capture the subtleties of sarcasm, enhancing their performance on the task.

SVM (Support Vector Machine): Our methodology extends beyond neural network models. We also used the Support Vector Machine, a classic machine learning algorithm, as a benchmark for comparison with the neural networks. Leveraging TF-IDF vectors as features and a linear kernel for classification, the SVM offers a different perspective on sarcasm detection.

Model Evaluation: To gauge the effectiveness of our sarcasm detection models, we evaluated them with standard metrics that provide insight into each model's performance.

Classification Report: The classification report summarizes accuracy, F1-score, precision, and recall, giving a detailed assessment of each model's ability to distinguish sarcastic from non-sarcastic text.
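To make these stages concrete, the following is a minimal end-to-end sketch that tokenizes the data, fine-tunes DistilBERT for binary classification over 4 epochs, and produces the evaluation outputs described above. It assumes the Hugging Face transformers, PyTorch, and scikit-learn libraries; the file names, column names, checkpoint, and batch size are illustrative assumptions, not our exact configuration.

```python
import numpy as np
import pandas as pd
import torch
from sklearn.metrics import classification_report, confusion_matrix
from transformers import (DistilBertForSequenceClassification,
                          DistilBertTokenizerFast, Trainer, TrainingArguments)

# Hypothetical files with "text" and "label" (0 = non-sarcastic, 1 = sarcastic).
train = pd.read_csv("tamil_train.csv")
dev = pd.read_csv("tamil_dev.csv")

tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-multilingual-cased")

class CommentDataset(torch.utils.data.Dataset):
    """Wraps tokenized comments and binary sarcasm labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tok(list(texts), truncation=True, padding=True, max_length=128)
        self.labels = list(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Binary classification head on top of the pre-trained encoder.
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased", num_labels=2)

args = TrainingArguments(output_dir="out", num_train_epochs=4,  # 4 epochs, as above
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=CommentDataset(train["text"], train["label"]))
trainer.train()

# Evaluate on the development split with the metrics described above.
out = trainer.predict(CommentDataset(dev["text"], dev["label"]))
y_pred = np.argmax(out.predictions, axis=1)
print(classification_report(dev["label"], y_pred, digits=2))
print(confusion_matrix(dev["label"], y_pred))
```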
Confusion Matrix: In addition to the classification report, we construct confusion matrices that show true positives, true negatives, false positives, and false negatives, giving a clear picture of each model's strengths and areas for improvement.

Prediction and Labeling: Finally, we apply the trained models to the testing data, stored in a CSV file. To facilitate interpretation, we add a new column named "Labels" to the data, in which 0 denotes "Non-sarcastic" and 1 denotes "Sarcastic" according to the model's predictions.

In summary, our methodology combines NLP techniques with a diverse range of models. Together, tokenization, model training, and careful evaluation allow us to examine sarcasm detection in Dravidian languages, gain insight into the effectiveness of each approach, and draw meaningful conclusions about the task.

4. Results

In this section, we present the evaluation of our models and the submitted results for the Tamil and Malayalam code-mixed languages.

4.1. Experimental Results

The results below are the F1 scores for each model used. The highest score recorded for Tamil is 0.75 with the DistilBERT model, and the highest score for Malayalam is 0.72 with the BERT model. The confusion matrices are computed on the validation/development data.

Table 2: Classification report, BERT, Malayalam

Class         Precision  Recall  F1 score  Support
0             0.93       0.70    0.80      2427
1             0.38       0.77    0.51      588
accuracy                         0.72      3015
macro avg     0.66       0.74    0.66      3015
weighted avg  0.82       0.72    0.74      3015

Table 3: Classification report, DistilBERT, Tamil

Class         Precision  Recall  F1 score  Support
0             0.90       0.75    0.82      4939
1             0.53       0.77    0.63      1820
accuracy                         0.75      6759
macro avg     0.71       0.76    0.72      6759
weighted avg  0.80       0.75    0.77      6759

Table 4: F1 scores per model

Model        Tamil  Malayalam
BERT         0.70   0.72
DistilBERT   0.75   0.64
XLM-RoBERTa  0.64   0.63
SVM          0.74   0.71
TF-IDF       0.72   0.70

4.2. Submitted Results

We applied the transfer learning technique to enrich the training data for the two languages. Our approach utilized a combination of BERT, DistilBERT, XLM-RoBERTa, SVM, and TF-IDF models.

Figure 3: DistilBERT model for Tamil (development data confusion matrix)
Figure 4: BERT for Malayalam (development data confusion matrix)

The collaborative efforts of our team, SSN_FeaturesAlpha, yielded maximum F1 scores of 0.68 for Tamil and 0.63 for Malayalam across the models. In our submissions, we presented the results of DistilBERT for Tamil and BERT for Malayalam, achieving rankings of 7th for Tamil and 5th for Malayalam in the competition. Table 5 gives the Tamil ranking and Table 6 the Malayalam ranking. Detailed performance metrics, including accuracy and F1 scores, are available in Table 3 for Tamil and Table 2 for Malayalam, providing a comprehensive view of our models' effectiveness.
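As a point of reference before the official rankings, the SVM and TF-IDF baselines reported in Table 4 (TF-IDF features with a linear kernel, as described in Section 3.3) can be assembled in outline with scikit-learn. The file and column names below are hypothetical; this is a sketch, not our exact setup.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical files; "text" holds cleaned comments, "label" holds 0/1.
train = pd.read_csv("tamil_train.csv")
dev = pd.read_csv("tamil_dev.csv")

# TF-IDF features feeding a linear-kernel SVM.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), LinearSVC())
clf.fit(train["text"], train["label"])

pred = clf.predict(dev["text"])
print("macro F1:", f1_score(dev["label"], pred, average="macro"))
```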
Table 5: Tamil ranking

S.No  Team name               MF1   Rank
1     hatealert_Tamil         0.74  1
2     ABC_tamil               0.73  2
3     SSNCSE1_Tamill          0.73  2
4     IRLabIITBHU_tam         0.72  3
5     ramyasiva_tamil         0.71  4
6     ENDEAVOUR               0.70  5
7     MUCS_Tamil              0.70  5
8     Hydrangea_tamilrun3     0.69  6
9     SSN_FeaturesAlpha_tam   0.68  7
10    YenCS_tam               0.68  7
11    TechWhiz_tam            0.66  8

Furthermore, we present the prediction values in confusion matrices for both languages, enhancing the interpretability of our results. These visualizations can be found in Figure 3 for Tamil and Figure 4 for Malayalam, allowing for a deeper understanding of the models' performance.

Table 6: Malayalam ranking

S.No  Team name                 MF1   Rank
1     SSNCSE1_Malayalaml        0.74  1
2     hatealert_Malayalam       0.73  2
3     ABC_malayalam             0.72  3
4     IRLabIITBHU_mal           0.72  3
5     MUCS_mal                  0.71  4
6     SSN_FeaturesAlpha_mal     0.63  5
7     TechWhiz_mal              0.63  5
8     YenCS_mal                 0.63  5
9     Hydrangea_malayalamrun1   0.57  6
10    ENDEAVOUR_malayalam       0.53  7
11    ramyasiva_malayalam       0.52  8

5. Conclusion

This paper outlines our methodology for identifying sarcasm in text from Dravidian languages, specifically Tamil and Malayalam. Our approach focused on preprocessing techniques and the use of pre-trained models such as BERT, DistilBERT, and XLM-RoBERTa, as well as traditional methods like SVM and TF-IDF, with various input variations for the shared task across both languages. Our evaluation indicates that fine-tuning the BERT and DistilBERT architectures yields notable performance improvements, and our team achieved F1 scores above the baseline, showing the effectiveness of our approach. We also harnessed transfer learning to maximize results.

While our current research has shown promising results, there remains room for further advancement. Future work can explore different deep learning algorithms to push the boundaries of sarcasm detection in Dravidian languages, and extending this work to other languages would broaden the scope and applicability of our methodology. By comparing our methodology and outcomes with existing research, we hope to contribute to the ongoing dialogue in this field and pave the way for more accurate and robust sarcasm detection systems.

6. References

[1] Chakravarthi, B.R. Hope speech detection in YouTube comments. Social Network Analysis and Mining, 12(1), 75, 2022. Springer.
[2] Chakravarthi, B.R., Hande, A., Ponnusamy, R., Kumaresan, P.K., and Priyadharshini, R. How can we detect Homophobia and Transphobia? Experiments in a multilingual code-mixed setting for social media governance. International Journal of Information Management Data Insights, 2(2), 100119, 2022. Elsevier.
[3] Chakravarthi, B.R. Hope speech detection in YouTube comments. Social Network Analysis and Mining, 12, 75, 2022. https://doi.org/10.1007/s13278-022-00901-z
[4] Rendalkar, S. and Chandankhede, C. Sarcasm Detection of Online Comments Using Emotion Detection. 2018 International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 2018, pp. 1244-1249. doi:10.1109/ICIRCA.2018.8597368
[5] Premjith, B., Chakravarthi, B.R., Subramanian, M., et al. Findings of the Shared Task on Multimodal Sentiment Analysis and Troll Meme Classification in Dravidian Languages. Association for Computational Linguistics, January 2022. doi:10.18653/v1/2022.dravidianlangtech-1.39
[6] A Machine Learning Approach in Analyzing the Effect of Hyperboles Using Negative Sentiment Tweets for Sarcasm Detection. ScienceDirect, 22 January 2022. doi:10.1016/j.jksuci.2022.01.008
[7] Priyadharshini, R., Chakravarthi, B.R., Thavareesan, S., Chinnappa, D., Thenmozhi, D., and Ponnusamy, R. Overview of the DravidianCodeMix 2021 Shared Task on Sentiment Detection in Tamil, Malayalam, and Kannada. Forum for Information Retrieval Evaluation, December 2021. doi:10.1145/3503162.3503177
[8] Hande, A., et al. Offensive Language Identification in Low-resourced Code-mixed Dravidian Languages Using Pseudo-labeling. arXiv:2108.12177, 27 August 2021.
[9] Chakravarthi, B.R., Sripriya, N., Bharathi, B., Nandhini, K., Chinnaudayar Navaneethakrishnan, S., Durairaj, T., Ponnusamy, R., Kumaresan, P.K., Ponnusamy, K.K., and Rajkumar, C. Overview of the shared task on sarcasm identification of Dravidian languages (Malayalam and Tamil) in DravidianCodeMix. Forum for Information Retrieval Evaluation (FIRE 2023), 2023.